Expand description
Cache-friendly, parallel, sample-sort-based suffix array construction.
This crate is a Rust port of CaPS-SA (Khan et al., WABI 2023), a parallel and cache-friendly suffix-array constructor based on sample sort with LCP-enhanced comparison.
The crate is generic over the symbol type (u8, u16, …; any Ord + Copy)
and the index type (u32, u64; via the Index trait). It produces a
standard lexicographic suffix array. Callers who need a generalized suffix
array (multiple strings, sentinel-terminated) should rewrite their text with
distinct sentinels before invoking — the resulting standard SA over the
transformed text is then the generalized SA they want.
Phase 1 of the port provides the in-memory algorithm; the external-memory variant (disk-spilling buckets) is layered on top in a later phase.
Structs§
- ExtMem
Opts - Tunable options for
build_ext_mem. - LcpDispatch
- A function-pointer dispatch for byte-text LCP. The architecture-specific pointer is selected once at construction by feature detection; later calls reduce to a register-resident indirect call.
- Opts
- Tunable options for SA construction.
Traits§
- Index
- Trait implemented by integer types usable as suffix array indices.
Functions§
- build_
ext_ mem - Build the suffix array of
textwith bounded RAM, streaming each output position toemitin lexicographic order. - build_
ext_ mem_ for_ positions - Like
build_ext_membut sorts only the caller-suppliedpositionsby the lexicographic order of their suffixes intext. Suffix content is alwaystext[position..]; no positions are dropped from the input. To filter, the caller constructspositionswith only the indices they want. - build_
in_ memory - Build the suffix array of
textin memory and return it. - build_
in_ memory_ for_ positions - Sort the caller-supplied
positionsby the lexicographic order of their suffixes intext. Returns the positions reordered so thattext[output[i]..]is the i-th smallest suffix among the input set. - build_
in_ memory_ for_ positions_ with_ opts - Variant of
build_in_memory_for_positionsthat accepts tuning options. - build_
in_ memory_ sample_ sort - In-memory variant of the sample-sort algorithm used by
build_ext_mem. Skips all disk I/O at the cost of holding the (pos,lcp) records in RAM throughout. Picksu32records whenn ≤ 2³², falls back tou64otherwise. The caller’semitclosure is called once per output position in lex order, just like in the ext-mem path. - build_
in_ memory_ sample_ sort_ for_ positions - Subset-positions variant of
build_in_memory_sample_sort. Same shape asbuild_ext_mem_for_positions. - build_
in_ memory_ with_ opts - Variant of
build_in_memorythat accepts tuning options. - lcp
- One-off LCP. Constructs a fresh
LcpDispatchon every call — fine for tests / introspection, slower than reusing oneLcpDispatchacross an algorithm’s inner loop. - lcp_
scalar - Generic scalar LCP. Public so callers that already know their symbol
type isn’t
u8can skip both dispatch andTypeIdcheck. - lcp_u8
- One-off
u8-typed LCP that auto-selects AVX2 / NEON / scalar. Skips the genericTypeIdbranch; otherwise equivalent in cost tolcpfor byte texts. - suffix_
cmp - One-off suffix comparison; see
lcpfor the cost note.