Skip to main content

Crate caps_sa

Crate caps_sa 

Source
Expand description

Cache-friendly, parallel, sample-sort-based suffix array construction.

This crate is a Rust port of CaPS-SA (Khan et al., WABI 2023), a parallel and cache-friendly suffix-array constructor based on sample sort with LCP-enhanced comparison.

The crate is generic over the symbol type (u8, u16, …; any Ord + Copy) and the index type (u32, u64; via the Index trait). It produces a standard lexicographic suffix array. Callers who need a generalized suffix array (multiple strings, sentinel-terminated) should rewrite their text with distinct sentinels before invoking — the resulting standard SA over the transformed text is then the generalized SA they want.

Phase 1 of the port provides the in-memory algorithm; the external-memory variant (disk-spilling buckets) is layered on top in a later phase.

Structs§

ExtMemOpts
Tunable options for build_ext_mem.
LcpDispatch
A function-pointer dispatch for byte-level LCP. The architecture- specific pointer is selected once at construction by feature detection; later calls reduce to a register-resident indirect call.
Opts
Tunable options for SA construction.
PlainText
Default provider for non-segmented texts: lim_at(p) = n - p. Stored as a single usize; the #[inline(always)] lim_at folds at monomorphization time into the same n - p the merge used before this abstraction existed, so non-segmented callers pay zero overhead.
SegmentedText
Provider for texts partitioned into segments at known cumulative end positions. lim_at(p) binary-searches the sorted ends list and returns the distance from p to the next boundary.

Traits§

Index
Trait implemented by integer types usable as suffix array indices.
LimitProvider
Per-suffix length provider. The merge and cascade-merge code use lp.lim_at(p) instead of text.len() - p; the LCP function itself is unchanged (the merge passes the appropriately-capped max_ctx to the existing SIMD path).
Symbol
A symbol type for suffix-array construction. All stdlib unsigned and signed integer types satisfy this, as does [T; N] for any T: Symbol (arrays have no padding and inherit Ord lexicographically from T). To use a custom type as a symbol, implement this trait yourself; if the type contains padding, mark it #[repr(C, packed)] first.

Functions§

build_ext_mem
Build the suffix array of text with bounded RAM, streaming each output position to emit in lexicographic order.
build_ext_mem_for_filter
Like build_ext_mem_for_positions but takes a predicate over text positions instead of a pre-materialised Vec<u64> of kept positions.
build_ext_mem_for_filter_with
Variant of build_ext_mem_for_filter that accepts a LimitProvider. See build_ext_mem_with for the semantics.
build_ext_mem_for_positions
Like build_ext_mem but sorts only the caller-supplied positions by the lexicographic order of their suffixes in text. Suffix content is always text[position..]; no positions are dropped from the input. To filter, the caller constructs positions with only the indices they want.
build_ext_mem_for_positions_with
Variant of build_ext_mem_for_positions that accepts a LimitProvider. See build_ext_mem_with for the semantics.
build_ext_mem_with
Variant of build_ext_mem that accepts a LimitProvider. With PlainText this matches build_ext_mem exactly (and monomorphizes to identical assembly); with SegmentedText the LCP scans stop at segment boundaries.
build_in_memory
Build the suffix array of text in memory and return it.
build_in_memory_for_positions
Sort the caller-supplied positions by the lexicographic order of their suffixes in text. Returns the positions reordered so that text[output[i]..] is the i-th smallest suffix among the input set.
build_in_memory_for_positions_with
Variant of build_in_memory_for_positions that accepts both a LimitProvider (for segmented LCP truncation) and tuning options. With PlainText this is identical to build_in_memory_for_positions_with_opts.
build_in_memory_for_positions_with_opts
Variant of build_in_memory_for_positions that accepts tuning options.
build_in_memory_sample_sort
In-memory variant of the sample-sort algorithm used by build_ext_mem. Skips all disk I/O at the cost of holding the (pos, lcp) records in RAM throughout. Picks u32 records when n ≤ 2³², falls back to u64 otherwise. The caller’s emit closure is called once per output position in lex order, just like in the ext-mem path.
build_in_memory_sample_sort_for_positions
Subset-positions variant of build_in_memory_sample_sort. Same shape as build_ext_mem_for_positions.
build_in_memory_sample_sort_for_positions_with
Variant of build_in_memory_sample_sort_for_positions that accepts a LimitProvider.
build_in_memory_sample_sort_with
Variant of build_in_memory_sample_sort that accepts a LimitProvider.
build_in_memory_with
Variant of build_in_memory that accepts a LimitProvider. With PlainText this is identical to build_in_memory; with SegmentedText the LCP scans stop at segment boundaries.
build_in_memory_with_opts
Variant of build_in_memory that accepts tuning options.
lcp
One-off LCP. Constructs a fresh LcpDispatch on every call — fine for tests / introspection, slower than reusing one LcpDispatch across an algorithm’s inner loop.
lcp_scalar
Generic scalar LCP. Public so callers that already know they can skip the SIMD dispatch (e.g. non-Symbol symbol types like arbitrary Eq newtypes) can call this directly. Still symbol-granularity; just doesn’t go through the byte view.
lcp_u8
One-off u8-typed LCP that auto-selects AVX-512 / AVX2 / NEON / scalar. Convenience entry point for byte texts; equivalent in cost to LcpDispatch::detect().lcp::<u8>(...) but skips the generic indirection.
suffix_cmp
One-off suffix comparison; see lcp for the cost note.