Skip to main content

Crate caps_sa

Crate caps_sa 

Source
Expand description

Cache-friendly, parallel, sample-sort-based suffix array construction.

This crate is a Rust port of CaPS-SA (Khan et al., WABI 2023), a parallel and cache-friendly suffix-array constructor based on sample sort with LCP-enhanced comparison.

The crate is generic over the symbol type (u8, u16, …; any Ord + Copy) and the index type (u32, u64; via the Index trait). It produces a standard lexicographic suffix array. Callers who need a generalized suffix array (multiple strings, sentinel-terminated) should rewrite their text with distinct sentinels before invoking — the resulting standard SA over the transformed text is then the generalized SA they want.

Phase 1 of the port provides the in-memory algorithm; the external-memory variant (disk-spilling buckets) is layered on top in a later phase.

Structs§

ExtMemOpts
Tunable options for build_ext_mem.
LcpDispatch
A function-pointer dispatch for byte-text LCP. The architecture-specific pointer is selected once at construction by feature detection; later calls reduce to a register-resident indirect call.
Opts
Tunable options for SA construction.

Traits§

Index
Trait implemented by integer types usable as suffix array indices.

Functions§

build_ext_mem
Build the suffix array of text with bounded RAM, streaming each output position to emit in lexicographic order.
build_ext_mem_for_positions
Like build_ext_mem but sorts only the caller-supplied positions by the lexicographic order of their suffixes in text. Suffix content is always text[position..]; no positions are dropped from the input. To filter, the caller constructs positions with only the indices they want.
build_in_memory
Build the suffix array of text in memory and return it.
build_in_memory_for_positions
Sort the caller-supplied positions by the lexicographic order of their suffixes in text. Returns the positions reordered so that text[output[i]..] is the i-th smallest suffix among the input set.
build_in_memory_for_positions_with_opts
Variant of build_in_memory_for_positions that accepts tuning options.
build_in_memory_sample_sort
In-memory variant of the sample-sort algorithm used by build_ext_mem. Skips all disk I/O at the cost of holding the (pos, lcp) records in RAM throughout. Picks u32 records when n ≤ 2³², falls back to u64 otherwise. The caller’s emit closure is called once per output position in lex order, just like in the ext-mem path.
build_in_memory_sample_sort_for_positions
Subset-positions variant of build_in_memory_sample_sort. Same shape as build_ext_mem_for_positions.
build_in_memory_with_opts
Variant of build_in_memory that accepts tuning options.
lcp
One-off LCP. Constructs a fresh LcpDispatch on every call — fine for tests / introspection, slower than reusing one LcpDispatch across an algorithm’s inner loop.
lcp_scalar
Generic scalar LCP. Public so callers that already know their symbol type isn’t u8 can skip both dispatch and TypeId check.
lcp_u8
One-off u8-typed LCP that auto-selects AVX2 / NEON / scalar. Skips the generic TypeId branch; otherwise equivalent in cost to lcp for byte texts.
suffix_cmp
One-off suffix comparison; see lcp for the cost note.