# Benchmarks — caps-sa vs upstream C++ CaPS-SA
Reproducible head-to-head of `caps-sa` (this crate) and the reference C++
implementation at [`jamshed/CaPS-SA`][upstream] (`develop` branch).
[upstream]: https://github.com/jamshed/CaPS-SA
## Running
Build both implementations, then:
```sh
# From the workspace/repo root that holds the upstream build alongside.
CAPS_SA_UPSTREAM=path/to/upstream/build/src/caps_sa \
CAPS_SA_RUST=path/to/target/release/examples/caps_sa \
bench/run.sh INPUT.txt NUM_THREADS
```
Set `UPSTREAM_LD_PATH` if the upstream binary needs a newer libstdc++
than the system default.
The harness runs four configurations on the same input:
- upstream C++ in-memory
- upstream C++ external-memory (`--ext-mem --collate-extmem-result`)
- this crate's in-memory path (auto-picks `u32` indices when `n < 2^31`)
- this crate's external-memory path (always `u64` for now)
It reports wall time and peak RSS via `/usr/bin/time`.
## Results
Machine: 64-core x86_64 Linux node, 1 socket, AVX2 enabled.
Builds: upstream `cmake -DCMAKE_BUILD_TYPE=Release`, this crate
`cargo build --release --example caps_sa` (release profile in the
workspace `Cargo.toml` enables `lto = "fat"` + `codegen-units = 1`).
All caps-sa runs include the five optimizations applied incrementally
to the Phase 2b sample-sort baseline:
| Phase 2b baseline | sample-sort + sequential cascade merge | 1.88 s |
| + Opt 1: reusable cascade buffers | one `CascadeWorkspace` reused across partitions | 1.78 s |
| + Opt 2: parallel partition merge | `rayon::par_iter_mut` over partitions in chunks | 1.16 s |
| + Opt 3: SIMD LCP (AVX2 / NEON) | runtime-dispatched `lcp_u8` for byte texts | 1.03 s |
| + Opt 4: hoist SIMD dispatch | `LcpDispatch::detect()` once, fn-pointer threaded down | 0.99 s |
| + Opt 5: AVX-512 hybrid LCP | 32-byte AVX2 head + 64-byte AVX-512BW body | **0.99 s** |
Opt 5 leaves yeast unchanged (it has no LCPs long enough to enter the
64-byte loop) but cuts ~30% off phase 1 on inputs with long repeats
(the human genome). See "Where AVX-512 helps and where it doesn't"
below.
### Yeast — *S. cerevisiae* R64-1-1, 12.16 MB raw text
| | wall s | RSS MB | ratio | wall s | RSS MB | ratio |
| upstream C++ in-mem | 7.47 | 748 | 1.00× | 2.10 | 747 | 1.00× |
| upstream C++ ext-mem | 9.21 | 461 | 1.23× | 3.94 | 462 | 1.88× |
| **caps-sa Rust in-mem** | **2.82** | **204**| **0.38×** | **0.93** | **204**| **0.44×** |
| **caps-sa Rust ext-mem** | **3.31** | **359**| **0.44×** | **0.99** | **320**| **0.47×** |
After the four optimizations, **caps-sa is ~2.1–2.8× faster than upstream
and uses ~3.5× less RAM on this input** for both the in-memory and
external-memory paths.
### Random DNA — uniform ACGT, 100 MB
| | wall s | RSS MB | ratio | wall s | RSS MB | ratio |
| upstream C++ in-mem | 38.16 | 2208 | 1.00× | **10.10** | 2208 | 1.00× |
| upstream C++ ext-mem | 39.01 | 1053 | 1.02× | 12.17 | 1121 | 1.20× |
| **caps-sa Rust in-mem** | **31.61** | 1663 | **0.83×** | 13.38 | 1663 | 1.33× |
| **caps-sa Rust ext-mem** | 36.32 | 2053 | 0.95× | **11.39** | 1913 | **1.13×** |
(Reported values are the median of three warm runs; variance ~5%.)
On uniform random DNA the per-call LCP is short (~13 bytes — the
information-theoretic minimum to disambiguate a random suffix in 100 MB
of alphabet-4 text), so the SIMD AVX2/NEON paths exit before any
32-byte chunk pays for itself. Single-thread caps-sa is ~17% faster
than upstream in-mem; multi-thread caps-sa ext-mem is ~7% faster than
upstream ext-mem. The only configuration still losing to upstream is
**rand100m / 4-thread / in-memory** — see "Where the differences still
come from" below.
### Human genome — GRCh38 primary assembly, 3.10 GB raw text
| | wall (min) | RSS (GB) | vs upstream |
| upstream C++ in-mem | 10.50 | 49.60 | 0.96× / 7.68× |
| upstream C++ ext-mem | 10.93 | 6.46 | 1.00× / 1.00× |
| **caps-sa Rust ext-mem** | **10.14** | **4.98** | **0.93× / 0.77×** |
| caps-sa Rust in-mem-ss | 11.64 | 55.00 | 1.07× / 8.52× |
`vs upstream` columns are relative to upstream ext-mem (the natural
peer-to-peer comparison). **caps-sa ext-mem is now ~7% faster than
upstream ext-mem on wall time and uses 23% less RAM** — and it beats
upstream's in-mem path by 3% on wall while using 10× less RAM.
Note: the 200 MB human-slice profile in the section below predicted a
larger phase-1 win than the full-genome run delivered (28% on the
slice vs 8% on full GRCh38). At full scale phase 1 no longer
dominates as completely — `p` rises to 8192 (vs 3200 on the slice) so
the per-partition work in phases 3 and 4 grows superlinearly with `n`
and the LCP-bound merge-sort is no longer 89% of the wall. The
hum200m profile was still the right signal for *where* to optimize
(the LCP function); it just over-estimated the *magnitude* at scale.
The in-mem-ss path runs the same sample-sort algorithm against
[`InMemBucket`][bucket] instead of disk-backed
[`ExtMemBucket`][bucket]; on this benchmark it's strictly Pareto-worse
than ext-mem (same wall, 11× the RAM), which is a useful result on
its own — it says disk I/O is essentially free in our ext-mem path at
this scale and the remaining wall-time gap is purely algorithmic.
[bucket]: ../src/ext_bucket.rs
The optimization stack that brought us here, starting from the initial
Phase 2b ext-mem implementation:
| Phase 2b baseline (`p = 128`, no streaming, no `u32` lift) | — | 39.83 GB |
| + ext-mem `Vec<u64>` streaming CLI | ~0 | →49 GB transient |
| + auto-scale `p` with text length | +1.1 min | **−31.7 GB** (→ 8.13) |
| + phase-4 `cascade_merge` consumes its workspace | ~0 | ~0 |
| + phase-3 parallelised + `p_max = 8192` | ~0 | ~0 (unlocks larger `p`) |
| + generic `SaLcp<I>` → `u32` for `n ≤ 2³²` | −0.0 | −3.03 GB (→ 5.10) |
| + filesystem caching / rerun variance | −0.8 min | −0.11 GB (→ 4.99) |
| + AVX-512 hybrid LCP (Opt 5) | **−0.85 min** (→ 10.47) | ~0 (→ 5.03) |
| + Pooled bucket files (Opt 6) | +0.3 min (→ 10.78, noise) | **−0.31 GB** (→ 4.72) |
| + Phase 4 chunk-size = 4·num_threads (Opt 7) | **−0.64 min** (→ 10.14) | ~0 (→ 4.98) |
| + `Symbol` trait + byte-view SIMD (Opt 8) | ~0 (genome stays u8) | ~0 |
| + `LimitProvider` + `SegmentedText` (Opt 9) | ~0 (PlainText path) | ~0 |
| **Net (v1 → today's tip)** | **−0.77 min** | **−34.85 GB (−88%)** |
### `LimitProvider` + `SegmentedText` — multi-segment SAs (Opt 9)
Multi-string SAs (and the SJ-indexed genome case in particular —
where every splice junction wants its own sentinel to avoid
cross-junction LCP collisions) can push the alphabet size into the
tens of thousands. Going to `u16`/`u24`/`u32` text just to encode
those sentinels multiplies the genome's resident memory by 2-4×.
Approach 3 from the WIDE_ALPHABETS analysis in `rustar-aligner`'s
`sa_build.rs`: keep the text at `u8` (just real bases) and encode
the segment-boundary structure in a parallel data structure that the
LCP function consults. This ships that infrastructure on the
caps-sa side.
The shape:
- A new `LimitProvider` trait with one method `fn lim_at(&self,
p: usize) -> usize` — the logical length of the suffix at `p`.
Plumbed through every site in `merge` / `cascade_merge` /
`suffix_cmp` that previously computed `n - p` inline.
- `PlainText { n: usize }` is the default impl. Its `lim_at` is
`#[inline(always)]`-marked and folds at monomorphization time to
the same `n - p` expression. The merge becomes a generic
`merge<S, I, L: LimitProvider>` and, on the `PlainText`
instantiation, generates **bit-identical assembly to the
non-segmented path** — confirmed by re-benching the marquee
GRCh38 / 32 t (`hum200m` / `rand100m` / yeast) wall-times after
the refactor, all within run-to-run variance.
- `SegmentedText { ends: Vec<u64> }` is the segmented impl. Stores
cumulative end positions of each segment, `lim_at(p)` is a
`partition_point` binary search. Two ergonomic constructors:
`from_lengths(text_len, &[len_0, len_1, …])` and
`from_ends(text_len, sorted_ends)`. For a 50 K-junction SA on a
6 GB text the storage cost is **400 KB** total — vs the 750 MB
a packed bitmap would need, or the 6 extra GB a `u16` text
would need.
Public API surface: every existing entry (`build_ext_mem`,
`build_in_memory`, the `_for_positions` siblings, and the in-mem-ss
ones) gets a `_with` sibling that takes `lp: &L`. The existing
entries are zero-cost wrappers that pass `&PlainText::new(text.len())`.
`LcpDispatch::suffix_cmp_with` is the analogous one-off helper.
Per-LCP-call cost in the `SegmentedText` case: O(log n_segments)
binary search per call. For 50 K segments that's ~16 comparisons,
amortisable to once per merge output by caching `lim_p` / `lim_q`
in the merge state when the pointers don't move (not done in this
patch — easy follow-up if the cost actually shows up in a profile).
`LimitProvider` also exposes a `boundary_order` method (default
`lim_a.cmp(&lim_b)`, "shorter-is-smaller") that callers can
override to flip the tie-break convention. Motivating use: STAR's
`spacer-as-largest` ordering — a `LimitProvider` impl returning
`lim_b.cmp(&lim_a).then(p_a.cmp(&p_b))` makes the merge byte-for-
byte STAR-compatible at cross-segment ties, which is what
`rustar-aligner`'s `StarSegmentedText` wrapper does to preserve
the existing oracle-equality tests. Default impl is
`#[inline]`-marked and folds to the same `lim_a.cmp(&lim_b)` the
current code computes when called from `PlainText` /
`SegmentedText`, so the override is fully opt-in.
The marquee genome bench is unaffected (the rustar-aligner path
hasn't migrated to `SegmentedText` yet); this lands the
infrastructure so the migration is a self-contained follow-up.
Differential tests against a brute-force segmented oracle cover
random small inputs (`segmented_random_validity`), specific
multi-string fixtures, and subset-positions variants.
### `Symbol` trait — SIMD LCP for arbitrary alphabets (Opt 8)
The LCP fast path was originally u8-only (TypeId-gated dispatch in
`LcpDispatch::lcp<S: Eq + 'static>`). Any other symbol type — `u16`
for >125-chromosome genome indexes, `u32` for large-alphabet integer
texts, `[u8; 3]` for 24-bit-packed data — fell through to the scalar
`lcp_scalar`. That left a clean architectural ceiling of one
microsecond per call regardless of how long the LCP ran, which is
~20× the SIMD path on a long-LCP regime.
The fix collapses every symbol width onto a single byte-level SIMD
function. Reasoning: an LCP only needs *equality* on the shared
prefix, not ordering — and **byte-equality of two `[S]` slices is
exactly value-equality** for any `S` with no padding and no invalid
bit patterns. So `LcpDispatch::lcp<S: Symbol>` becomes:
```rust
let k = std::mem::size_of::<S>();
let bytes = unsafe {
std::slice::from_raw_parts(text.as_ptr() as *const u8, size_of_val(text))
};
let byte_lcp = unsafe {
(self.lcp_bytes_fn)(bytes, p * k, q * k, max_ctx.saturating_mul(k))
};
byte_lcp / k
```
`Symbol` is the new public marker trait — `unsafe trait Symbol: Ord +
Copy + Send + Sync + 'static` — with blanket impls for every stdlib
integer (`u8` through `u128`, `i8` through `i128`, `usize`, `isize`)
and for `[T; N]` of any `T: Symbol`. Users get the SIMD path for free
on any of those; custom types opt in via `unsafe impl Symbol for
MyNewtype {}` once they've verified the no-padding / byte-faithful-
equality invariant.
The byte-view trick has a nice subtlety: the SIMD compare's "first
differing byte" might land *inside* a partially-equal symbol (e.g.
on `u16`, low byte equal, high byte differs). `byte_lcp / k` rounds
down to the symbol containing that byte, which is the correct
symbol-LCP — every preceding symbol had every byte equal, this
symbol differs at byte offset `byte_lcp mod k`. **Endianness doesn't
matter** because the SIMD compare resolves equality only; the order
of two differing symbols is recovered by `S::cmp` after the LCP via
`text[lcp].cmp(&text[lcp + 1])`.
#### Empirical impact (AMD EPYC 9575F, 1 M symbols, long-LCP regime)
```
scalar baseline Symbol-trait SIMD speedup
u8 (1 B/sym) 198.8 ms 4.4 ms 45.0×
u16 (2 B/sym) 198.8 ms 7.0 ms 28.3×
[u8; 3] (3 B/sym) 233.9 ms 10.2 ms 22.9×
u32 (4 B/sym) 198.8 ms 13.7 ms 14.5×
u64 (8 B/sym) 198.8 ms 28.6 ms 7.0×
```
The SIMD-vs-scalar ratio scales as ~`64 / size_of::<S>()` (minus per-
call constant overhead that bites harder on the wider symbols). The
u8 ratio is the AVX-512 stride (64 bytes); every other ratio comes
"for free" because the same instruction does the work.
The crate's marquee genome bench stays on `u8` (single-byte alphabet
for ACGT + N + spacer + sentinels), so this opt doesn't move the
GRCh38 wall — it's an enabling change for downstream consumers. The
practical use case in rustar-aligner: highly-fragmented assemblies
that need more than 125 spacer-sentinel values, which the previous
plan deferred to a "Phase 3 `u16` fallback" — that path is now a
one-line type swap (`Vec<u8>` → `Vec<u16>`) on the rustar-aligner
side with no caps-sa work.
### Phase 4 chunk-size — unblock rayon work-stealing (Opt 7)
Profiling the post-pool human run revealed that **phase 4's parallel
efficiency was only 52%** (CPU 2 274 s into wall 139 s on 32 cores)
while phase 1 ran at ~89%. The cause was structural:
`phase4_merge_and_emit` walks the `p` partition buckets in chunks of
`chunk_size = num_threads`, each chunk processed by a single
`par_iter_mut`. With one partition per thread per chunk, rayon has no
splittable work to redistribute — the chunk's wall is set by its
slowest partition, and the other threads idle. Sample-sort partition
sizes vary ~2× from random sampling, so this tail-straggler effect
compounds across the ~256 chunks of the GRCh38 run (`p = 8192`,
`num_threads = 32`).
Bumping the chunk to `4 × num_threads` gives rayon four partitions
per thread initially. Fast threads steal from slow ones, smoothing
the variance. Peak RAM grows by the additional in-flight merged
partitions — each holds a `Vec<I>` of ~3 MB at human-genome scale
with `u32` indices, so a chunk of 128 partitions costs ~400 MB
transient (vs the previous ~100 MB). Well within the budget already
spent on phase 1.
Empirical impact (AMD EPYC 9575F / Zen 5):
```
phase 4 wall phase 4 speedup vs 32 cores
before after before after
rand100m 32t 10.07 s 3.38 s 2.6× 7.9×
GRCh38 32t 138.89 s 98.30 s 16.6× 22.9× (52% → 72% eff.)
total wall
before after Δ
rand100m 32t 11.22 s 4.51 s −60%
hum200m 32t ~57 s ~56 s ≈0% (phase 1 dominates here)
GRCh38 32t 10.78 m 10.14 m **−6%**
```
The win is concentrated where phase 4 is a meaningful share of total
wall — the marquee human-genome run lost 38 s and the small-input
rand100m bench more than halved. Workloads dominated by phase 1
(long-LCP, repeat-rich genomic regions; the hum200m slice) see no
change because phase 1 isn't the thing we fixed.
We also tried injecting `_mm_prefetch::<_MM_HINT_T0>` 256 bytes
ahead of the AVX-512 loads in `lcp_u8_avx512`. Hum200m, rand100m,
and the full GRCh38 bench all showed indistinguishable wall vs the
bare loop (in fact slightly faster without the extra instructions).
The Zen 5 hardware prefetcher recognises the strided 64-byte access
pattern and is already issuing the same loads — software prefetch
was redundant and adds inner-loop work for nothing. Reverted; the
`lcp_u8_avx512` loop stays bare.
### Pooled bucket files — fewer fds, friendlier on NFS (Opt 6)
The sample-sort algorithm needs `2·p` bucket-shaped storage regions
(`p` for the phase-1 subarray spills, `p` for the phase-3 partition
intermediate). With `p` auto-scaling to 8192 on the human genome,
that's 16 384 backing files in the original design — one
`NamedTempFile` per bucket. Three real-world headaches:
- Open-file-handle limits. Some workstation shells default to
`RLIMIT_NOFILE = 1024`; many HPC nodes cap at 4096. 16 K files
blows through both.
- NFS metadata cost. Each `openat`/`unlink` on a networked filesystem
is a synchronous server roundtrip (5–50 ms typical). 16 K of them
is minutes of pure metadata traffic before any data is moved.
- Inode pressure on the filesystem (some tmpfs / scratch volumes are
inode-constrained).
The fix is straightforward: pool the buckets onto a small set of
anonymous tempfiles. `BucketPool::new(n_phys, work_dir)` opens
`n_phys` already-unlinked tempfiles up front; each
`PooledExtMemBucket` holds an `Arc<PhysicalFile>` to one of them and
appends its flushes at offsets handed out by a per-file
`AtomicU64::fetch_add`. Multiple threads in `pwrite` on the same fd
don't conflict at disjoint offsets — the kernel serialises by
`(fd, offset_range)` and the cursor allocates disjoint ranges,
making the write path lock-free at user level.
`load_all` reads the bucket's recorded extents back via `pread`. The
extents per bucket are `O(records / buffer_records)` — small in
absolute terms (a few hundred for a phase-4 partition at human scale),
so the extent metadata is a few MB total across all buckets.
`ExtMemOpts::physical_file_count = 0` (the default) resolves to
`rayon::current_num_threads()` — one writable inode per worker, which
empirically matches per-bucket-file wall time on rand100m, hum200m,
yeast, and the full human genome. Callers who want to override (e.g.
on a host with extreme file-descriptor or contention pressure) can
set the field directly, or override at runtime via the
`CAPS_SA_N_PHYS` env var.
#### Empirical impact (AMD EPYC 9575F / Zen 5)
```
per-bucket files pooled (N = num_threads)
rand100m 32t openat calls 9 168 76
rand100m 32t unlink calls 3 052 0 (anonymous)
rand100m 32t wall ~11.2 s ~11.1 s
rand100m 32t RSS 632 MB 610 MB
human 32t wall 10.47 min 10.78 min (≈ run variance)
human 32t RSS 5.03 GB 4.72 GB (−6%)
human 32t bucket files 16 384 64
```
The wall time is **at parity** within run-to-run variance on every
input we tested; RAM is **slightly lower** because we drop the
per-bucket `BufWriter` + `NamedTempFile` metadata. The file-count
collapse from `2·p` to `2·N` is the qualitative win — no more open-
file-handle headaches, no more NFS metadata wall.
A separate empirical guard for the `*_for_positions` rustar-aligner
integration lives at [`examples/pathology_bench.rs`](../examples/pathology_bench.rs):
it builds a synthetic spacer-padding fixture (the kind rustar-aligner
hands us when `genomeChrBinNbits` rounds a small chromosome up to a
much larger padded text) and times `build_ext_mem` (sort everything)
against `build_ext_mem_for_positions` (sort only the kept positions)
on the same fixture, showing **6–10× speedups** on padding-dominated
inputs. The pool change leaves this gap intact.
### Where AVX-512 helps and where it doesn't — the measurement
A `perf record --call-graph dwarf` run on a 200 MB human-genome slice
(`/usr/bin/perf` 4.18, AMD EPYC 9575F / Zen 5) made the picture
unambiguous. On the human slice, the entire program is concentrated in
one function:
```
97.54% caps_sa::lcp::lcp_u8_avx2
```
Phase 1 (merge-sort) was 89% of the wall and was bottlenecked
exclusively on LCP scanning — the merge bookkeeping, swaps, indexing,
even rayon work-stealing all came to under 3% combined. On the same
host, the **rand100m** profile looked completely different:
```
45.20% caps_sa::sample_sort::merge
42.23% caps_sa::lcp::lcp_u8_avx2
```
Why so different? Random DNA over a 4-symbol alphabet has an expected
maximum LCP of log₄(n) ≈ 13 bytes for n = 100 MB — every LCP call
exits inside a single 32-byte AVX2 chunk, so the **per-call** cost
(register setup, the boundary tests, the trailing-zeros mask resolve)
is what shows. The genome, with its long microsatellites, segmental
duplications and other repeats, regularly runs LCPs into the hundreds
or thousands of bytes, and there the **SIMD-stride throughput**
dominates.
This made AVX-512 the obvious next step — but only conditionally:
naïvely replacing AVX2 with a 64-byte AVX-512 stride regressed
rand100m by ~15% (one wasted 64-byte load per call instead of one
useful 32-byte load) while improving the human slice by 29%. The
**hybrid path** ships both: a 32-byte AVX2 head that resolves the
short-LCP regime without ever touching a ZMM register, falling through
to a 64-byte AVX-512BW body once the LCP has already exceeded 32
bytes. On the bench host the results were:
| rand100m 32t | 11.50 s | 13.32 s (−16%) | **11.22 s (0%)** |
| hum200m 32t | 79.67 s | 57.40 s (+28%) | **57.00 s (+28%)** |
The hybrid keeps the AVX2 short-LCP baseline intact while preserving
the full AVX-512 win on long LCPs. On Zen 5 specifically there is no
AVX-512 frequency licensing — the 512-bit data paths are native, and
upper-bit power gating is fast — so the only thing the head step
saves is a wasted 32 bytes of loaded data, which on memory-bandwidth-
bound workloads is still worth avoiding.
## SIMD LCP — AVX-512BW + AVX2 + NEON, dispatched once
The `u8`-specialized LCP path is dispatched at runtime by
`std::any::TypeId::of::<S>() == TypeId::of::<u8>()`:
- **x86_64 + AVX-512F + AVX-512BW:** 32-byte AVX2 head (resolves the
short-LCP regime without ever touching ZMM registers), then 64-byte
AVX-512 body via `_mm512_cmpeq_epi8_mask` — the byte-compare returns
a `__mmask64` directly, with no movemask round-trip. See "Where
AVX-512 helps" above for the measurement that motivated this shape.
- **x86_64 + AVX2:** 32-byte `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8`,
locate the first differing byte via `(!mask).trailing_zeros()`.
- **aarch64 + NEON:** 16-byte `vceqq_u8` + the classic shrn-by-4
movemask emulation (`vshrn_n_u16<4>` over the reinterpreted `u16`
view), then `trailing_zeros / 4` for the byte index.
- **Fallback:** portable scalar `S: Eq` byte-loop.
Feature detection runs **once** per `build_*` call (Opt 4): the chosen
function pointer is captured in an `LcpDispatch` value (`Copy + Send +
Sync`) and threaded explicitly through `merge_sort` / `merge` /
`cascade_merge` / `upper_bound_by_pivot`. The inner loops perform a
single indirect call through a register-held pointer — no atomic loads
and no per-call `is_*_feature_detected!` checks. This was particularly
valuable on inputs with short LCPs (random DNA): the per-call cost was
a meaningful fraction of work-per-LCP, so removing it gave 5–10%
across-the-board wins on the 100 MB benchmark.
## Where the differences still come from
| Sample-sort partitioning | yes (in-mem + ext-mem) | yes (ext-mem + in-mem-ss) |
| LCP-enhanced 2-way merge | yes | yes |
| Parallel partition merge | yes | **yes** (Opt 2) |
| Reusable merge buffers across cascade | yes | **yes** (Opt 1) |
| AVX2 LCP comparison | yes | **yes** (Opt 3) |
| AVX-512 LCP comparison | yes (always-64B) | **yes, hybrid 32B-head/64B-body** (Opt 5) |
| NEON LCP comparison | no (x86 only)| **yes** (Opt 3) |
| SIMD dispatch hoisted out of hot path | yes (compile-time) | **yes** (Opt 4) |
| 32-bit indices when `n < 2^31` | yes | **yes** (in-mem + ext-mem via `SaLcp<u32>`) |
| `u32` index ext-mem path | yes | **yes** (generic `SaLcp<I>`) |
| In-memory sample-sort partitioning | yes | **yes** (`build_in_memory_sample_sort`) |
## Caveats
- `/usr/bin/time -f %M` reports peak resident set, not the SA
construction working set in isolation. Both binaries pre-load the
input text (~`n` bytes) and write the SA (~`n × sizeof(idx)` bytes) —
those move with index width but are not the merge-sort scratch.
- The upstream and caps-sa SA files differ in record width (i32 vs
u32/u64) so they don't `cmp` byte-for-byte. Equivalence is verified
inside caps-sa's unit tests (`build_in_memory` and `build_ext_mem`
agree with a brute-force reference).
- Numbers vary 5–10% across runs; reported values are the median of
three warm runs.