caps-sa 0.5.0

Cache-friendly, parallel, sample-sort-based suffix array construction (Rust port of CaPS-SA)
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
# Benchmarks — caps-sa vs upstream C++ CaPS-SA

Reproducible head-to-head of `caps-sa` (this crate) and the reference C++
implementation at [`jamshed/CaPS-SA`][upstream] (`develop` branch).

[upstream]: https://github.com/jamshed/CaPS-SA

## Running

Build both implementations, then:

```sh
# From the workspace/repo root that holds the upstream build alongside.
CAPS_SA_UPSTREAM=path/to/upstream/build/src/caps_sa \
CAPS_SA_RUST=path/to/target/release/examples/caps_sa \
bench/run.sh INPUT.txt NUM_THREADS
```

Set `UPSTREAM_LD_PATH` if the upstream binary needs a newer libstdc++
than the system default.

The harness runs four configurations on the same input:

- upstream C++ in-memory
- upstream C++ external-memory (`--ext-mem --collate-extmem-result`)
- this crate's in-memory path (auto-picks `u32` indices when `n < 2^31`)
- this crate's external-memory path (always `u64` for now)

It reports wall time and peak RSS via `/usr/bin/time`.

## Results

Machine: 64-core x86_64 Linux node, 1 socket, AVX2 enabled.
Builds: upstream `cmake -DCMAKE_BUILD_TYPE=Release`, this crate
`cargo build --release --example caps_sa` (release profile in the
workspace `Cargo.toml` enables `lto = "fat"` + `codegen-units = 1`).

All caps-sa runs include the five optimizations applied incrementally
to the Phase 2b sample-sort baseline:

| Step | Description | Yeast 4-thread ext-mem |
| ---- | ----------- | ---------------------- |
| Phase 2b baseline       | sample-sort + sequential cascade merge | 1.88 s |
| + Opt 1: reusable cascade buffers | one `CascadeWorkspace` reused across partitions | 1.78 s |
| + Opt 2: parallel partition merge | `rayon::par_iter_mut` over partitions in chunks | 1.16 s |
| + Opt 3: SIMD LCP (AVX2 / NEON)   | runtime-dispatched `lcp_u8` for byte texts | 1.03 s |
| + Opt 4: hoist SIMD dispatch       | `LcpDispatch::detect()` once, fn-pointer threaded down | 0.99 s |
| + Opt 5: AVX-512 hybrid LCP        | 32-byte AVX2 head + 64-byte AVX-512BW body | **0.99 s** |

Opt 5 leaves yeast unchanged (it has no LCPs long enough to enter the
64-byte loop) but cuts ~30% off phase 1 on inputs with long repeats
(the human genome). See "Where AVX-512 helps and where it doesn't"
below.

### Yeast — *S. cerevisiae* R64-1-1, 12.16 MB raw text

| Configuration            | 1 thread  |        |       | 4 threads |        |       |
| ------------------------ | --------- | ------ | ----- | --------- | ------ | ----- |
|                          | wall s    | RSS MB | ratio | wall s    | RSS MB | ratio |
| upstream C++ in-mem      | 7.47      | 748    | 1.00× | 2.10      | 747    | 1.00× |
| upstream C++ ext-mem     | 9.21      | 461    | 1.23× | 3.94      | 462    | 1.88× |
| **caps-sa Rust in-mem**  | **2.82**  | **204**| **0.38×** | **0.93** | **204**| **0.44×** |
| **caps-sa Rust ext-mem** | **3.31**  | **359**| **0.44×** | **0.99** | **320**| **0.47×** |

After the four optimizations, **caps-sa is ~2.1–2.8× faster than upstream
and uses ~3.5× less RAM on this input** for both the in-memory and
external-memory paths.

### Random DNA — uniform ACGT, 100 MB

| Configuration            | 1 thread  |        |       | 4 threads |        |       |
| ------------------------ | --------- | ------ | ----- | --------- | ------ | ----- |
|                          | wall s    | RSS MB | ratio | wall s    | RSS MB | ratio |
| upstream C++ in-mem      | 38.16     | 2208   | 1.00× | **10.10** | 2208   | 1.00× |
| upstream C++ ext-mem     | 39.01     | 1053   | 1.02× | 12.17     | 1121   | 1.20× |
| **caps-sa Rust in-mem**  | **31.61** | 1663   | **0.83×** | 13.38   | 1663   | 1.33× |
| **caps-sa Rust ext-mem** | 36.32     | 2053   | 0.95× | **11.39** | 1913   | **1.13×** |

(Reported values are the median of three warm runs; variance ~5%.)

On uniform random DNA the per-call LCP is short (~13 bytes — the
information-theoretic minimum to disambiguate a random suffix in 100 MB
of alphabet-4 text), so the SIMD AVX2/NEON paths exit before any
32-byte chunk pays for itself. Single-thread caps-sa is ~17% faster
than upstream in-mem; multi-thread caps-sa ext-mem is ~7% faster than
upstream ext-mem. The only configuration still losing to upstream is
**rand100m / 4-thread / in-memory** — see "Where the differences still
come from" below.

### Human genome — GRCh38 primary assembly, 3.10 GB raw text

| Configuration            | 32 threads |             |             |
| ------------------------ | ---------- | ----------- | ----------- |
|                          | wall (min) | RSS (GB)    | vs upstream |
| upstream C++ in-mem      | 10.50      | 49.60       | 0.96× / 7.68× |
| upstream C++ ext-mem     | 10.93      | 6.46        | 1.00× / 1.00× |
| **caps-sa Rust ext-mem** | **10.14**  | **4.98**    | **0.93× / 0.77×** |
| caps-sa Rust in-mem-ss   | 11.64      | 55.00       | 1.07× / 8.52× |

`vs upstream` columns are relative to upstream ext-mem (the natural
peer-to-peer comparison). **caps-sa ext-mem is now ~7% faster than
upstream ext-mem on wall time and uses 23% less RAM** — and it beats
upstream's in-mem path by 3% on wall while using 10× less RAM.

Note: the 200 MB human-slice profile in the section below predicted a
larger phase-1 win than the full-genome run delivered (28% on the
slice vs 8% on full GRCh38). At full scale phase 1 no longer
dominates as completely — `p` rises to 8192 (vs 3200 on the slice) so
the per-partition work in phases 3 and 4 grows superlinearly with `n`
and the LCP-bound merge-sort is no longer 89% of the wall. The
hum200m profile was still the right signal for *where* to optimize
(the LCP function); it just over-estimated the *magnitude* at scale.

The in-mem-ss path runs the same sample-sort algorithm against
[`InMemBucket`][bucket] instead of disk-backed
[`ExtMemBucket`][bucket]; on this benchmark it's strictly Pareto-worse
than ext-mem (same wall, 11× the RAM), which is a useful result on
its own — it says disk I/O is essentially free in our ext-mem path at
this scale and the remaining wall-time gap is purely algorithmic.

[bucket]: ../src/ext_bucket.rs

The optimization stack that brought us here, starting from the initial
Phase 2b ext-mem implementation:

| Step | Δ wall | Δ RSS |
| ---- | ------ | ----- |
| Phase 2b baseline (`p = 128`, no streaming, no `u32` lift) || 39.83 GB |
| + ext-mem `Vec<u64>` streaming CLI                          | ~0  | →49 GB transient |
| + auto-scale `p` with text length                           | +1.1 min | **−31.7 GB** (→ 8.13) |
| + phase-4 `cascade_merge` consumes its workspace            | ~0  | ~0 |
| + phase-3 parallelised + `p_max = 8192`                      | ~0  | ~0 (unlocks larger `p`) |
| + generic `SaLcp<I>``u32` for `n ≤ 2³²`                  | −0.0 | −3.03 GB (→ 5.10) |
| + filesystem caching / rerun variance                        | −0.8 min | −0.11 GB (→ 4.99) |
| + AVX-512 hybrid LCP (Opt 5)                                 | **−0.85 min** (→ 10.47) | ~0 (→ 5.03) |
| + Pooled bucket files (Opt 6)                                | +0.3 min (→ 10.78, noise) | **−0.31 GB** (→ 4.72) |
| + Phase 4 chunk-size = 4·num_threads (Opt 7)                  | **−0.64 min** (→ 10.14) | ~0 (→ 4.98) |
| + `Symbol` trait + byte-view SIMD (Opt 8)                     | ~0 (genome stays u8) | ~0 |
| + `LimitProvider` + `SegmentedText` (Opt 9)                   | ~0 (PlainText path) | ~0 |
| **Net (v1 → today's tip)**                                   | **−0.77 min** | **−34.85 GB (−88%)** |

### `LimitProvider` + `SegmentedText` — multi-segment SAs (Opt 9)

Multi-string SAs (and the SJ-indexed genome case in particular —
where every splice junction wants its own sentinel to avoid
cross-junction LCP collisions) can push the alphabet size into the
tens of thousands. Going to `u16`/`u24`/`u32` text just to encode
those sentinels multiplies the genome's resident memory by 2-4×.

Approach 3 from the WIDE_ALPHABETS analysis in `rustar-aligner`'s
`sa_build.rs`: keep the text at `u8` (just real bases) and encode
the segment-boundary structure in a parallel data structure that the
LCP function consults. This ships that infrastructure on the
caps-sa side.

The shape:

- A new `LimitProvider` trait with one method `fn lim_at(&self,
  p: usize) -> usize` — the logical length of the suffix at `p`.
  Plumbed through every site in `merge` / `cascade_merge` /
  `suffix_cmp` that previously computed `n - p` inline.
- `PlainText { n: usize }` is the default impl. Its `lim_at` is
  `#[inline(always)]`-marked and folds at monomorphization time to
  the same `n - p` expression. The merge becomes a generic
  `merge<S, I, L: LimitProvider>` and, on the `PlainText`
  instantiation, generates **bit-identical assembly to the
  non-segmented path** — confirmed by re-benching the marquee
  GRCh38 / 32 t (`hum200m` / `rand100m` / yeast) wall-times after
  the refactor, all within run-to-run variance.
- `SegmentedText { ends: Vec<u64> }` is the segmented impl. Stores
  cumulative end positions of each segment, `lim_at(p)` is a
  `partition_point` binary search. Two ergonomic constructors:
  `from_lengths(text_len, &[len_0, len_1, …])` and
  `from_ends(text_len, sorted_ends)`. For a 50 K-junction SA on a
  6 GB text the storage cost is **400 KB** total — vs the 750 MB
  a packed bitmap would need, or the 6 extra GB a `u16` text
  would need.

Public API surface: every existing entry (`build_ext_mem`,
`build_in_memory`, the `_for_positions` siblings, and the in-mem-ss
ones) gets a `_with` sibling that takes `lp: &L`. The existing
entries are zero-cost wrappers that pass `&PlainText::new(text.len())`.
`LcpDispatch::suffix_cmp_with` is the analogous one-off helper.

Per-LCP-call cost in the `SegmentedText` case: O(log n_segments)
binary search per call. For 50 K segments that's ~16 comparisons,
amortisable to once per merge output by caching `lim_p` / `lim_q`
in the merge state when the pointers don't move (not done in this
patch — easy follow-up if the cost actually shows up in a profile).

`LimitProvider` also exposes a `boundary_order` method (default
`lim_a.cmp(&lim_b)`, "shorter-is-smaller") that callers can
override to flip the tie-break convention. Motivating use: STAR's
`spacer-as-largest` ordering — a `LimitProvider` impl returning
`lim_b.cmp(&lim_a).then(p_a.cmp(&p_b))` makes the merge byte-for-
byte STAR-compatible at cross-segment ties, which is what
`rustar-aligner`'s `StarSegmentedText` wrapper does to preserve
the existing oracle-equality tests. Default impl is
`#[inline]`-marked and folds to the same `lim_a.cmp(&lim_b)` the
current code computes when called from `PlainText` /
`SegmentedText`, so the override is fully opt-in.

The marquee genome bench is unaffected (the rustar-aligner path
hasn't migrated to `SegmentedText` yet); this lands the
infrastructure so the migration is a self-contained follow-up.
Differential tests against a brute-force segmented oracle cover
random small inputs (`segmented_random_validity`), specific
multi-string fixtures, and subset-positions variants.


### `Symbol` trait — SIMD LCP for arbitrary alphabets (Opt 8)

The LCP fast path was originally u8-only (TypeId-gated dispatch in
`LcpDispatch::lcp<S: Eq + 'static>`). Any other symbol type — `u16`
for >125-chromosome genome indexes, `u32` for large-alphabet integer
texts, `[u8; 3]` for 24-bit-packed data — fell through to the scalar
`lcp_scalar`. That left a clean architectural ceiling of one
microsecond per call regardless of how long the LCP ran, which is
~20× the SIMD path on a long-LCP regime.

The fix collapses every symbol width onto a single byte-level SIMD
function. Reasoning: an LCP only needs *equality* on the shared
prefix, not ordering — and **byte-equality of two `[S]` slices is
exactly value-equality** for any `S` with no padding and no invalid
bit patterns. So `LcpDispatch::lcp<S: Symbol>` becomes:

```rust
let k = std::mem::size_of::<S>();
let bytes = unsafe {
    std::slice::from_raw_parts(text.as_ptr() as *const u8, size_of_val(text))
};
let byte_lcp = unsafe {
    (self.lcp_bytes_fn)(bytes, p * k, q * k, max_ctx.saturating_mul(k))
};
byte_lcp / k
```

`Symbol` is the new public marker trait — `unsafe trait Symbol: Ord +
Copy + Send + Sync + 'static` — with blanket impls for every stdlib
integer (`u8` through `u128`, `i8` through `i128`, `usize`, `isize`)
and for `[T; N]` of any `T: Symbol`. Users get the SIMD path for free
on any of those; custom types opt in via `unsafe impl Symbol for
MyNewtype {}` once they've verified the no-padding / byte-faithful-
equality invariant.

The byte-view trick has a nice subtlety: the SIMD compare's "first
differing byte" might land *inside* a partially-equal symbol (e.g.
on `u16`, low byte equal, high byte differs). `byte_lcp / k` rounds
down to the symbol containing that byte, which is the correct
symbol-LCP — every preceding symbol had every byte equal, this
symbol differs at byte offset `byte_lcp mod k`. **Endianness doesn't
matter** because the SIMD compare resolves equality only; the order
of two differing symbols is recovered by `S::cmp` after the LCP via
`text[lcp].cmp(&text[lcp + 1])`.

#### Empirical impact (AMD EPYC 9575F, 1 M symbols, long-LCP regime)

```
                 scalar baseline    Symbol-trait SIMD    speedup
u8 (1 B/sym)       198.8 ms             4.4 ms          45.0×
u16 (2 B/sym)      198.8 ms             7.0 ms          28.3×
[u8; 3] (3 B/sym)  233.9 ms            10.2 ms          22.9×
u32 (4 B/sym)      198.8 ms            13.7 ms          14.5×
u64 (8 B/sym)      198.8 ms            28.6 ms           7.0×
```

The SIMD-vs-scalar ratio scales as ~`64 / size_of::<S>()` (minus per-
call constant overhead that bites harder on the wider symbols). The
u8 ratio is the AVX-512 stride (64 bytes); every other ratio comes
"for free" because the same instruction does the work.

The crate's marquee genome bench stays on `u8` (single-byte alphabet
for ACGT + N + spacer + sentinels), so this opt doesn't move the
GRCh38 wall — it's an enabling change for downstream consumers. The
practical use case in rustar-aligner: highly-fragmented assemblies
that need more than 125 spacer-sentinel values, which the previous
plan deferred to a "Phase 3 `u16` fallback" — that path is now a
one-line type swap (`Vec<u8>` → `Vec<u16>`) on the rustar-aligner
side with no caps-sa work.

### Phase 4 chunk-size — unblock rayon work-stealing (Opt 7)

Profiling the post-pool human run revealed that **phase 4's parallel
efficiency was only 52%** (CPU 2 274 s into wall 139 s on 32 cores)
while phase 1 ran at ~89%. The cause was structural:

`phase4_merge_and_emit` walks the `p` partition buckets in chunks of
`chunk_size = num_threads`, each chunk processed by a single
`par_iter_mut`. With one partition per thread per chunk, rayon has no
splittable work to redistribute — the chunk's wall is set by its
slowest partition, and the other threads idle. Sample-sort partition
sizes vary ~2× from random sampling, so this tail-straggler effect
compounds across the ~256 chunks of the GRCh38 run (`p = 8192`,
`num_threads = 32`).

Bumping the chunk to `4 × num_threads` gives rayon four partitions
per thread initially. Fast threads steal from slow ones, smoothing
the variance. Peak RAM grows by the additional in-flight merged
partitions — each holds a `Vec<I>` of ~3 MB at human-genome scale
with `u32` indices, so a chunk of 128 partitions costs ~400 MB
transient (vs the previous ~100 MB). Well within the budget already
spent on phase 1.

Empirical impact (AMD EPYC 9575F / Zen 5):

```
                       phase 4 wall      phase 4 speedup vs 32 cores
                       before   after    before   after
rand100m 32t           10.07 s   3.38 s   2.6×    7.9×
GRCh38 32t            138.89 s  98.30 s  16.6×   22.9×   (52% → 72% eff.)

                       total wall
                       before   after    Δ
rand100m 32t           11.22 s   4.51 s   −60%
hum200m 32t           ~57 s     ~56 s   ≈0%        (phase 1 dominates here)
GRCh38 32t             10.78 m  10.14 m  **−6%**
```

The win is concentrated where phase 4 is a meaningful share of total
wall — the marquee human-genome run lost 38 s and the small-input
rand100m bench more than halved. Workloads dominated by phase 1
(long-LCP, repeat-rich genomic regions; the hum200m slice) see no
change because phase 1 isn't the thing we fixed.

We also tried injecting `_mm_prefetch::<_MM_HINT_T0>` 256 bytes
ahead of the AVX-512 loads in `lcp_u8_avx512`. Hum200m, rand100m,
and the full GRCh38 bench all showed indistinguishable wall vs the
bare loop (in fact slightly faster without the extra instructions).
The Zen 5 hardware prefetcher recognises the strided 64-byte access
pattern and is already issuing the same loads — software prefetch
was redundant and adds inner-loop work for nothing. Reverted; the
`lcp_u8_avx512` loop stays bare.

### Pooled bucket files — fewer fds, friendlier on NFS (Opt 6)

The sample-sort algorithm needs `2·p` bucket-shaped storage regions
(`p` for the phase-1 subarray spills, `p` for the phase-3 partition
intermediate). With `p` auto-scaling to 8192 on the human genome,
that's 16 384 backing files in the original design — one
`NamedTempFile` per bucket. Three real-world headaches:

- Open-file-handle limits. Some workstation shells default to
  `RLIMIT_NOFILE = 1024`; many HPC nodes cap at 4096. 16 K files
  blows through both.
- NFS metadata cost. Each `openat`/`unlink` on a networked filesystem
  is a synchronous server roundtrip (5–50 ms typical). 16 K of them
  is minutes of pure metadata traffic before any data is moved.
- Inode pressure on the filesystem (some tmpfs / scratch volumes are
  inode-constrained).

The fix is straightforward: pool the buckets onto a small set of
anonymous tempfiles. `BucketPool::new(n_phys, work_dir)` opens
`n_phys` already-unlinked tempfiles up front; each
`PooledExtMemBucket` holds an `Arc<PhysicalFile>` to one of them and
appends its flushes at offsets handed out by a per-file
`AtomicU64::fetch_add`. Multiple threads in `pwrite` on the same fd
don't conflict at disjoint offsets — the kernel serialises by
`(fd, offset_range)` and the cursor allocates disjoint ranges,
making the write path lock-free at user level.

`load_all` reads the bucket's recorded extents back via `pread`. The
extents per bucket are `O(records / buffer_records)` — small in
absolute terms (a few hundred for a phase-4 partition at human scale),
so the extent metadata is a few MB total across all buckets.

`ExtMemOpts::physical_file_count = 0` (the default) resolves to
`rayon::current_num_threads()` — one writable inode per worker, which
empirically matches per-bucket-file wall time on rand100m, hum200m,
yeast, and the full human genome. Callers who want to override (e.g.
on a host with extreme file-descriptor or contention pressure) can
set the field directly, or override at runtime via the
`CAPS_SA_N_PHYS` env var.

#### Empirical impact (AMD EPYC 9575F / Zen 5)

```
                              per-bucket files     pooled (N = num_threads)
rand100m 32t openat calls           9 168                     76
rand100m 32t unlink calls           3 052                      0   (anonymous)
rand100m 32t wall                 ~11.2 s                  ~11.1 s
rand100m 32t RSS                   632 MB                   610 MB
human   32t wall                  10.47 min                10.78 min   (≈ run variance)
human   32t RSS                    5.03 GB                  4.72 GB   (−6%)
human   32t bucket files           16 384                       64
```

The wall time is **at parity** within run-to-run variance on every
input we tested; RAM is **slightly lower** because we drop the
per-bucket `BufWriter` + `NamedTempFile` metadata. The file-count
collapse from `2·p` to `2·N` is the qualitative win — no more open-
file-handle headaches, no more NFS metadata wall.

A separate empirical guard for the `*_for_positions` rustar-aligner
integration lives at [`examples/pathology_bench.rs`](../examples/pathology_bench.rs):
it builds a synthetic spacer-padding fixture (the kind rustar-aligner
hands us when `genomeChrBinNbits` rounds a small chromosome up to a
much larger padded text) and times `build_ext_mem` (sort everything)
against `build_ext_mem_for_positions` (sort only the kept positions)
on the same fixture, showing **6–10× speedups** on padding-dominated
inputs. The pool change leaves this gap intact.

### Where AVX-512 helps and where it doesn't — the measurement

A `perf record --call-graph dwarf` run on a 200 MB human-genome slice
(`/usr/bin/perf` 4.18, AMD EPYC 9575F / Zen 5) made the picture
unambiguous. On the human slice, the entire program is concentrated in
one function:

```
   97.54%   caps_sa::lcp::lcp_u8_avx2
```

Phase 1 (merge-sort) was 89% of the wall and was bottlenecked
exclusively on LCP scanning — the merge bookkeeping, swaps, indexing,
even rayon work-stealing all came to under 3% combined. On the same
host, the **rand100m** profile looked completely different:

```
   45.20%   caps_sa::sample_sort::merge
   42.23%   caps_sa::lcp::lcp_u8_avx2
```

Why so different? Random DNA over a 4-symbol alphabet has an expected
maximum LCP of log₄(n) ≈ 13 bytes for n = 100 MB — every LCP call
exits inside a single 32-byte AVX2 chunk, so the **per-call** cost
(register setup, the boundary tests, the trailing-zeros mask resolve)
is what shows. The genome, with its long microsatellites, segmental
duplications and other repeats, regularly runs LCPs into the hundreds
or thousands of bytes, and there the **SIMD-stride throughput**
dominates.

This made AVX-512 the obvious next step — but only conditionally:
naïvely replacing AVX2 with a 64-byte AVX-512 stride regressed
rand100m by ~15% (one wasted 64-byte load per call instead of one
useful 32-byte load) while improving the human slice by 29%. The
**hybrid path** ships both: a 32-byte AVX2 head that resolves the
short-LCP regime without ever touching a ZMM register, falling through
to a 64-byte AVX-512BW body once the LCP has already exceeded 32
bytes. On the bench host the results were:

| Input        | AVX2 only | AVX-512 (64B only) | AVX-512 hybrid |
| ------------ | --------- | ------------------ | -------------- |
| rand100m 32t | 11.50 s   | 13.32 s (−16%)     | **11.22 s (0%)** |
| hum200m 32t  | 79.67 s   | 57.40 s (+28%)     | **57.00 s (+28%)** |

The hybrid keeps the AVX2 short-LCP baseline intact while preserving
the full AVX-512 win on long LCPs. On Zen 5 specifically there is no
AVX-512 frequency licensing — the 512-bit data paths are native, and
upper-bit power gating is fast — so the only thing the head step
saves is a wasted 32 bytes of loaded data, which on memory-bandwidth-
bound workloads is still worth avoiding.

## SIMD LCP — AVX-512BW + AVX2 + NEON, dispatched once

The `u8`-specialized LCP path is dispatched at runtime by
`std::any::TypeId::of::<S>() == TypeId::of::<u8>()`:

- **x86_64 + AVX-512F + AVX-512BW:** 32-byte AVX2 head (resolves the
  short-LCP regime without ever touching ZMM registers), then 64-byte
  AVX-512 body via `_mm512_cmpeq_epi8_mask` — the byte-compare returns
  a `__mmask64` directly, with no movemask round-trip. See "Where
  AVX-512 helps" above for the measurement that motivated this shape.
- **x86_64 + AVX2:** 32-byte `_mm256_cmpeq_epi8` + `_mm256_movemask_epi8`,
  locate the first differing byte via `(!mask).trailing_zeros()`.
- **aarch64 + NEON:** 16-byte `vceqq_u8` + the classic shrn-by-4
  movemask emulation (`vshrn_n_u16<4>` over the reinterpreted `u16`
  view), then `trailing_zeros / 4` for the byte index.
- **Fallback:** portable scalar `S: Eq` byte-loop.

Feature detection runs **once** per `build_*` call (Opt 4): the chosen
function pointer is captured in an `LcpDispatch` value (`Copy + Send +
Sync`) and threaded explicitly through `merge_sort` / `merge` /
`cascade_merge` / `upper_bound_by_pivot`. The inner loops perform a
single indirect call through a register-held pointer — no atomic loads
and no per-call `is_*_feature_detected!` checks. This was particularly
valuable on inputs with short LCPs (random DNA): the per-call cost was
a meaningful fraction of work-per-LCP, so removing it gave 5–10%
across-the-board wins on the 100 MB benchmark.

## Where the differences still come from

| Optimization                             | Upstream C++ | caps-sa Rust |
| ---------------------------------------- | ------------ | ------------ |
| Sample-sort partitioning                 | yes (in-mem + ext-mem) | yes (ext-mem + in-mem-ss) |
| LCP-enhanced 2-way merge                 | yes          | yes          |
| Parallel partition merge                 | yes          | **yes** (Opt 2) |
| Reusable merge buffers across cascade    | yes          | **yes** (Opt 1) |
| AVX2 LCP comparison                      | yes          | **yes** (Opt 3) |
| AVX-512 LCP comparison                   | yes (always-64B) | **yes, hybrid 32B-head/64B-body** (Opt 5) |
| NEON LCP comparison                      | no (x86 only)| **yes** (Opt 3) |
| SIMD dispatch hoisted out of hot path    | yes (compile-time) | **yes** (Opt 4) |
| 32-bit indices when `n < 2^31`           | yes          | **yes** (in-mem + ext-mem via `SaLcp<u32>`) |
| `u32` index ext-mem path                 | yes          | **yes** (generic `SaLcp<I>`) |
| In-memory sample-sort partitioning       | yes          | **yes** (`build_in_memory_sample_sort`) |

## Caveats

- `/usr/bin/time -f %M` reports peak resident set, not the SA
  construction working set in isolation. Both binaries pre-load the
  input text (~`n` bytes) and write the SA (~`n × sizeof(idx)` bytes) —
  those move with index width but are not the merge-sort scratch.
- The upstream and caps-sa SA files differ in record width (i32 vs
  u32/u64) so they don't `cmp` byte-for-byte. Equivalence is verified
  inside caps-sa's unit tests (`build_in_memory` and `build_ext_mem`
  agree with a brute-force reference).
- Numbers vary 5–10% across runs; reported values are the median of
  three warm runs.