Expand description
Cache-friendly, parallel, sample-sort-based suffix array construction.
This crate is a Rust port of CaPS-SA (Khan et al., WABI 2023), a parallel and cache-friendly suffix-array constructor based on sample sort with LCP-enhanced comparison.
The crate is generic over the symbol type (u8, u16, …; any Ord + Copy)
and the index type (u32, u64; via the Index trait). It produces a
standard lexicographic suffix array. Callers who need a generalized suffix
array (multiple strings, sentinel-terminated) should rewrite their text with
distinct sentinels before invoking — the resulting standard SA over the
transformed text is then the generalized SA they want.
Phase 1 of the port provides the in-memory algorithm; the external-memory variant (disk-spilling buckets) is layered on top in a later phase.
Structs§
- ExtMem
Opts - Tunable options for
build_ext_mem. - LcpDispatch
- A function-pointer dispatch for byte-level LCP. The architecture- specific pointer is selected once at construction by feature detection; later calls reduce to a register-resident indirect call.
- Opts
- Tunable options for SA construction.
- Plain
Text - Default provider for non-segmented texts:
lim_at(p) = n - p. Stored as a singleusize; the#[inline(always)]lim_atfolds at monomorphization time into the samen - pthe merge used before this abstraction existed, so non-segmented callers pay zero overhead. - Segmented
Text - Provider for texts partitioned into segments at known cumulative
end positions.
lim_at(p)binary-searches the sorted ends list and returns the distance frompto the next boundary.
Traits§
- Index
- Trait implemented by integer types usable as suffix array indices.
- Limit
Provider - Per-suffix length provider. The merge and cascade-merge code use
lp.lim_at(p)instead oftext.len() - p; the LCP function itself is unchanged (the merge passes the appropriately-cappedmax_ctxto the existing SIMD path). - Symbol
- A symbol type for suffix-array construction. All stdlib unsigned
and signed integer types satisfy this, as does
[T; N]for anyT: Symbol(arrays have no padding and inheritOrdlexicographically fromT). To use a custom type as a symbol, implement this trait yourself; if the type contains padding, mark it#[repr(C, packed)]first.
Functions§
- build_
ext_ mem - Build the suffix array of
textwith bounded RAM, streaming each output position toemitin lexicographic order. - build_
ext_ mem_ for_ filter - Like
build_ext_mem_for_positionsbut takes a predicate over text positions instead of a pre-materialisedVec<u64>of kept positions. - build_
ext_ mem_ for_ filter_ with - Variant of
build_ext_mem_for_filterthat accepts aLimitProvider. Seebuild_ext_mem_withfor the semantics. - build_
ext_ mem_ for_ positions - Like
build_ext_membut sorts only the caller-suppliedpositionsby the lexicographic order of their suffixes intext. Suffix content is alwaystext[position..]; no positions are dropped from the input. To filter, the caller constructspositionswith only the indices they want. - build_
ext_ mem_ for_ positions_ with - Variant of
build_ext_mem_for_positionsthat accepts aLimitProvider. Seebuild_ext_mem_withfor the semantics. - build_
ext_ mem_ with - Variant of
build_ext_memthat accepts aLimitProvider. WithPlainTextthis matchesbuild_ext_memexactly (and monomorphizes to identical assembly); withSegmentedTextthe LCP scans stop at segment boundaries. - build_
in_ memory - Build the suffix array of
textin memory and return it. - build_
in_ memory_ for_ positions - Sort the caller-supplied
positionsby the lexicographic order of their suffixes intext. Returns the positions reordered so thattext[output[i]..]is the i-th smallest suffix among the input set. - build_
in_ memory_ for_ positions_ with - Variant of
build_in_memory_for_positionsthat accepts both aLimitProvider(for segmented LCP truncation) and tuning options. WithPlainTextthis is identical tobuild_in_memory_for_positions_with_opts. - build_
in_ memory_ for_ positions_ with_ opts - Variant of
build_in_memory_for_positionsthat accepts tuning options. - build_
in_ memory_ sample_ sort - In-memory variant of the sample-sort algorithm used by
build_ext_mem. Skips all disk I/O at the cost of holding the (pos,lcp) records in RAM throughout. Picksu32records whenn ≤ 2³², falls back tou64otherwise. The caller’semitclosure is called once per output position in lex order, just like in the ext-mem path. - build_
in_ memory_ sample_ sort_ for_ positions - Subset-positions variant of
build_in_memory_sample_sort. Same shape asbuild_ext_mem_for_positions. - build_
in_ memory_ sample_ sort_ for_ positions_ with - Variant of
build_in_memory_sample_sort_for_positionsthat accepts aLimitProvider. - build_
in_ memory_ sample_ sort_ with - Variant of
build_in_memory_sample_sortthat accepts aLimitProvider. - build_
in_ memory_ with - Variant of
build_in_memorythat accepts aLimitProvider. WithPlainTextthis is identical tobuild_in_memory; withSegmentedTextthe LCP scans stop at segment boundaries. - build_
in_ memory_ with_ opts - Variant of
build_in_memorythat accepts tuning options. - lcp
- One-off LCP. Constructs a fresh
LcpDispatchon every call — fine for tests / introspection, slower than reusing oneLcpDispatchacross an algorithm’s inner loop. - lcp_
scalar - Generic scalar LCP. Public so callers that already know they can
skip the SIMD dispatch (e.g. non-
Symbolsymbol types like arbitraryEqnewtypes) can call this directly. Still symbol-granularity; just doesn’t go through the byte view. - lcp_u8
- One-off
u8-typed LCP that auto-selects AVX-512 / AVX2 / NEON / scalar. Convenience entry point for byte texts; equivalent in cost toLcpDispatch::detect().lcp::<u8>(...)but skips the generic indirection. - suffix_
cmp - One-off suffix comparison; see
lcpfor the cost note.