# caps-sa
A pure-Rust implementation of **CaPS-SA** (Khan et al., WABI 2023), a
cache-friendly, parallel, sample-sort-based suffix array constructor.
The crate is generic over the symbol type (`u8`, `u16`, `u32`, `u64`,
`[u8; N]`, … — anything implementing the [`Symbol`] trait) and the
index type (`u32`, `u64`, `usize`), produces a standard lexicographic
suffix array, and scales to human-genome inputs (≈ 6 × 10⁹ symbols) on
commodity hardware via an external-memory sample-sort path that
streams the SA out as positions are emitted.
## Status
Both the in-memory and external-memory paths are implemented, tested,
and benchmarked. 43 unit tests pass and the SA output is differentially
verified against a brute-force reference on small and random inputs.
On the human genome (GRCh38, 32 threads on AMD EPYC 9575F), caps-sa is
**7% faster than upstream CaPS-SA's ext-mem path** and uses **23% less
RAM**, while beating upstream's in-mem wall time by 3% at 1/10 of the
RAM. See [`bench/README.md`](bench/README.md) for the full methodology
and the optimisation ladder that got us there.
The crate ships four entry points sharing one LCP-enhanced merge
kernel:
| In-memory, parallel merge-sort | `build_in_memory` |
| In-memory, sample-sort (alternative for huge `n`) | `build_in_memory_sample_sort` |
| External-memory, disk-spilling sample-sort | `build_ext_mem` |
| Any of the three above, restricted to a subset | `*_for_positions` |
All four paths share the same SIMD LCP fast path (AVX-512BW hybrid →
AVX2 → NEON → scalar), selected once per build entry via
`LcpDispatch::detect()` and threaded into the inner loop as a function
pointer — no per-call feature-detect overhead. The same byte-level
SIMD function backs **every symbol width** via a byte-view dispatch in
`LcpDispatch::lcp<S: Symbol>`: a single AVX-512 byte-compare followed
by `byte_lcp / size_of::<S>()` recovers the symbol-LCP for `u16`,
`u32`, `[u8; 3]`, `u64`, and any other `Symbol`. Measured on a Zen 5
host this lifts the LCP function from ~200 ms scalar to 4–29 ms SIMD
across widths (7× on `u64` to 45× on `u8` for a 1 M-symbol long-LCP
microbench).
## Example
```rust
use caps_sa::build_in_memory;
let text = b"banana";
let sa: Vec<u32> = build_in_memory(text);
// `sa` is the standard lexicographic suffix array of `text`. The
// index type is generic — pick `u32`, `u64`, or `usize` for your input.
```
For large inputs, stream the SA from disk-spilling buckets so the
output is never fully materialised in RAM:
```rust
use caps_sa::{ExtMemOpts, build_ext_mem};
let opts = ExtMemOpts::default();
// The caller streams these straight to disk / a packed array.
Ok(())
})?;
```
For workflows that sort only a subset of positions (e.g. STAR-style
genome indexing where many positions are filtered out — N's, spacers),
hand only the positions you want sorted to `*_for_positions`. The
others never enter the sort:
```rust
use caps_sa::build_ext_mem_for_positions;
let positions: Vec<u64> =
(0..text.len() as u64).filter(|&p| text[p as usize] < 4).collect();
})?;
```
## Algorithm
The in-memory kernel is a parallel merge-sort whose two-way merge uses
an **LCP-enhanced comparison**: an LCP array travels alongside each
sorted run, so the merge decides the order of two candidates in `O(1)`
in two of three cases and only falls back to a symbol-by-symbol scan
when the carried LCP equals the current boundary. The three-case
analysis is in `src/sample_sort.rs::merge`.
The external-memory path wraps that kernel in a sample-sort:
1. **Sort + sample + spill.** Split positions into `p` subarrays, sort
each with the in-memory kernel in parallel, sample `~c·ln n`
suffixes uniformly, spill each sorted subarray to a disk-spilling
bucket.
2. **Select pivots.** Sort the pooled samples and pick `p − 1` evenly-
spaced pivots, defining `p` partition ranges over the global SA.
3. **Distribute.** Binary-search each sorted subarray against the
pivots and route sub-subarrays into the corresponding partition's
bucket.
4. **Per-partition merge.** Load each partition's bucket into RAM,
cascade 2-way LCP-enhanced merges over its sub-subarrays, emit the
resulting sorted positions via the caller-supplied closure.
Peak RAM is bounded at `~O(text + n/p)` per worker regardless of
input size; the SA is never fully materialised — partitions are
streamed out in lex order.
## Performance — short version
(See [`bench/README.md`](bench/README.md) for the full numbers.)
| Yeast (12 MB) | 4 | **0.99 s** | 3.94 s |
| Random DNA 100 MB | 4 | **11.39 s** | 12.17 s |
| Human genome GRCh38 (3.1 GB) | 32 | **10.47 min / 5.03 GB** | 10.93 min / 6.46 GB |
The in-memory sample-sort path (`build_in_memory_sample_sort`) is
available for hosts with enough RAM to skip disk entirely; on the
human genome it benches at 11.64 min / 55 GB — same wall, ~10× the
RAM, useful only when disk is the constraint.
## Reference
- Upstream reference C++ implementation:
<https://github.com/jamshed/CaPS-SA>
- Paper: Khan et al., *CaPS-SA: A Practical Algorithm for Parallel
Suffix Array Construction.* Workshop on Algorithms in Bioinformatics
(WABI 2023). <https://doi.org/10.4230/LIPIcs.WABI.2023.16>
## License
MIT, matching upstream CaPS-SA. See [`LICENSE`](LICENSE).