# Memory & Concurrency Optimizations: Google Books Importer
The Google Books n-gram importer was OOMing at ~33.79 GB peak heap and
burning ~49% of CPU in `__mprotect` syscalls during large-vocabulary
imports (English 1-5gram corpus, 5.8M unique words, ~50-100M entries per
2-gram file). This document captures the optimizations that addressed the
bottleneck.
## Bottleneck breakdown (pre-optimization)
| 49% CPU in `__mprotect` | glibc malloc's `mmap`/`munmap` per-large-allocation pattern issuing `mprotect` syscalls on every region change |
| 33.79 GB peak heap | Unbounded growth of (a) lock-free overlay entries, (b) per-tx in-memory buffers for 50-100M-entry prefix files, (c) doubled HashMap rebuilds in the vocabulary's reverse-index during checkpoints |
| 12× redundant checkpoints at end-of-import | Each worker called `save_checkpoint()` on exit; with 12 parallel workers shutting down simultaneously, this added ~6 minutes of blocking I/O after all data was already durable |
## Optimization summary
| 1 | `mimalloc` global allocator | mprotect CPU |
| 2 | Pre-sized vocabulary lock-free layer | resize-doubling spikes |
| 3 | SmallVec `[&str; 5]` for token splits | per-record heap allocs |
| 4 | Zero-alloc `parse_ngram_line_ref` + `push_ref` | per-record String/Vec allocs |
| 5 | xxh3 hashers in merge/MKN hot paths | hash CPU |
| 6 | Two-tier `ChildStore` replacing `im::Vector` | Arc COW/mprotect (libdictenstein) |
| 7 | `--cache-files` mode | download/parse decoupling |
| 8 | `--tx-chunk-size` chunked transactions | per-tx memory bound |
| 9 | Per-shard lock-free overlay flush threshold | overlay memory bound |
| 10 | Single-merge `merge_and_rotate_vocabulary_wal` | doubled reverse-index rebuilds (~1.7 GB saved at peak) |
| 11 | MKN aggregator cancellation flag | graceful shutdown during multi-minute MKN phase |
| 12 | Removed worker-exit checkpoint storm | ~6 min blocking I/O at end-of-import |
| 13 | `tokio::task::yield_now()` before finalization | SIGINT responsiveness during synchronous MKN work |
| 14 | Auto-scaled checkpoint interval (every 5 vs 10 files) | I/O cadence at high parallelism |
| 15 | `--overlay-budget-gib` overlay-heap eviction | hard bound on resident overlay RAM (the (a) root cause) |
## How they fit together
### Write path (steady-state, per-worker)
```
Google Books .gz
│
▼
┌──────────────────────┐
│ HTTP stream OR │ --cache-files (#7) downloads to disk
│ local cached file │ first, then streams from local file
└──────────┬───────────┘
▼
┌──────────────────────┐
│ GzipDecoder ▶ Lines │
└──────────┬───────────┘
▼
┌──────────────────────┐
│ parse_ngram_line_ref │ zero-alloc (#4) — borrowed &str slices
│ NgramRecordRef<'a> │ from the line buffer
└──────────┬───────────┘
▼
┌──────────────────────┐
│ YearAggregator │ push_ref (#4) — only allocates a String
│ ::push_ref │ when the ngram changes
└──────────┬───────────┘
▼
AggregatedNgram
│
▼
┌──────────────────────┐
│ tx_insert_ngram │ SmallVec<[&str; 5]> (#3) — stack-inline
│ → encode via vocab │ token splits for n-gram orders 1-5
└──────────┬───────────┘
▼
StoragePrefixTx
│
▼ chunk_count >= tx_chunk_size (#8) ─┐
┌──────────────────────┐ │
│ commit_and_renew_ │ ◀───────────────────────────┘
│ prefix_tx (chunk) │ bounds per-tx memory; SET-semantics
└──────────┬───────────┘ idempotent on crash recovery
▼
┌──────────────────────┐
│ ShardHandle's │ Per-shard AtomicU64 lockfree_entries
│ lock-free overlay │ counter (#9). Bounded by
└──────────┬───────────┘ --lockfree-flush-threshold
▼ entries > threshold (#9) ─┐
┌──────────────────────┐ │
│ flush_lockfree_over_ │ ◀──────────────────┘
│ threshold │
└──────────────────────┘
```
### Vocab-WAL durability invariant (#10)
The original bug: vocab was checkpointed independently of n-gram shards,
so a crash between vocab-checkpoint and shard-checkpoint left them at
divergent index points. On reopen, vocab restarted from a stale index,
orphaning n-grams encoded with newer indices.
The fix: every checkpoint path (`save_checkpoint`, `periodic_checkpoint`)
calls `storage.merge_and_rotate_vocabulary_wal()` FIRST, then syncs
shards. This single-method-call form replaced the prior two-step
`sync_vocabulary()` + `rotate_vocabulary_wal()` pattern — both internally
called `merge_into()` on the lock-free vocab, doubling the reverse-index
HashMap rebuild and spiking ~3.42 GB transient memory for a 5.8M-word
vocab. The combined method does ONE merge, halving the peak.
### Finalization (one-shot, end-of-import)
```
all workers exit
│
▼
┌───────────────────────┐
│ save_checkpoint flow │ (skipped on worker exit per #12)
└────────────┬──────────┘
▼
┌───────────────────────┐
│ Final checkpoint save │
└────────────┬──────────┘
▼
┌───────────────────────┐
│ yield_now().await │ (#13) lets SIGINT handler run before
└────────────┬──────────┘ synchronous MKN work blocks the runtime
▼
┌───────────────────────┐
│ MknAggregator │ with_cancellation_flag(&self.interrupted)
│ .compute_all() │ (#11) — Ctrl+C during MKN now exits
└────────────┬──────────┘ cleanly instead of running to completion
▼
┌───────────────────────┐
│ final compact │ checkpoint_vocabulary (one-shot full
│ vocabulary │ re-serialize) — only at finalize, not
└───────────────────────┘ periodically (which uses WAL rotation)
```
## Optimization details
### 1. mimalloc global allocator
`Cargo.toml` adds optional `mimalloc 0.1` dep + `mimalloc-alloc` feature
(included transitively in `google-books`). All three bins
(`grammstein`, `compare_artries`, `dump_checkpoint`) declare
`#[global_allocator]` gated on `cfg(feature = "mimalloc-alloc")`.
Why it matters: glibc malloc serves large allocations via `mmap` and frees
via `munmap`, both of which issue `mprotect` syscalls to update page
permissions. mimalloc uses thread-local segment heaps with pre-allocated
superpage regions — allocations satisfy from the existing reservation
without per-call `mprotect`. On a 12-worker English import this drops
`__mprotect` CPU share from ~49% to roughly background noise.
### 2. Pre-sized vocabulary
`open_or_create_concurrent_vocabulary_lockfree_with_capacity(path, n)`
pre-sizes the lock-free layer's DashMap term cache and the reverse-lookup
`Vec` to the estimated final size. Without pre-sizing, the structures
geometric-double during import, with each doubling temporarily holding
both old and new tables — several GB of transient overhead for a 5.8M-word
vocab.
`estimate_vocabulary_size(config)` derives the capacity from language +
`min_count` (English base 13M, scaled down by `min_count` factor — higher
thresholds prune more rare words).
### 3. SmallVec token splits
`NgramStorage::store_ngram(ngram, count)` and `tx_insert_ngram(...)` use
`SmallVec<[&str; 5]>` (matching the supported n-gram order range). All
splits fit inline on the stack with no heap allocation.
### 4. Zero-alloc parser
`parse_ngram_line_ref(line) -> NgramRecordRef<'_>` finds tab positions via
byte scanning instead of `split('\t').collect::<Vec<_>>()`. The returned
`NgramRecordRef` borrows ngram text from the input line. Combined with
`YearAggregator::push_ref` (only allocates a String when the ngram
changes), per-record heap traffic drops from O(line) to O(1) amortized.
### 5. xxh3 hashers
`merge.rs` and `mkn.rs` switched all internal `HashMap`/`HashSet` keys
(byte-encoded n-gram contexts, predecessor/successor index sets) from
SipHash defaults to `xxhash_rust::xxh3::Xxh3DefaultBuilder` via local
`XxHashMap`/`XxHashSet` type aliases. Non-adversarial data, so the speedup
(~32% faster hash on cache-line-sized keys) is free.
### 6. Two-tier ChildStore
Lives in libdictenstein (`persistent_artrie_char/nodes/persistent_node.rs`).
Replaces `im::Vector<SwizzledPtr>` with a two-tier enum:
- `Inline { count: u8, keys: [u32; 4], children: [SwizzledPtr; 4] }` —
~85% of nodes, zero heap alloc.
- `Heap { keys: Vec<u32>, children: Vec<SwizzledPtr> }` — ~15% of nodes,
flat Vec.
Eliminates `im::Vector`'s Arc COW + `Arc::make_mut` overhead (~7.22 GB
across a full English import) and the `mprotect` pressure it caused.
### 7. `--cache-files` mode
`process_prefix_file_cached` downloads the raw `.gz` to a local cache via
`download_to_cache` (atomic `.gz.downloading` → `.gz` rename, HTTP 206
Range resume, HTTP 416 recovery), then streams from the cached file via
`stream_aggregated_from_cached_file`. Decouples download reliability from
parse CPU — a failed HTTP stream no longer wastes CPU already spent
parsing.
### 8. `--tx-chunk-size`
Storage-level `commit_and_renew_prefix_tx` commits the current chunk and
begins a fresh transaction for the same prefix/shard. Caller (the
importer's inner streaming loop) invokes this when `chunk_count >=
tx_chunk_size`. Shard-level `commit_chunk` commits to the WAL +
persistent trie but does NOT update `completed_prefixes`; the final
`commit_prefix` marks the prefix complete.
Crash recovery: SET-semantics inserts make re-importing the prefix
idempotent. Uncommitted chunks are lost on crash; committed chunks
survive in the WAL.
### 9. Per-shard lock-free overlay flush threshold
`ShardHandle` carries an `AtomicU64 lockfree_entries` counter (Relaxed
ordering — approximate, used only for threshold decisions). Incremented
on `increment_lockfree`, reset on `sync` / `flush_lockfree` / `checkpoint`.
`ShardCoordinator::flush_lockfree_over_threshold(threshold)` does a fast
read-lock check on every shard and only acquires a write lock for the
ones over threshold. Workers on under-threshold shards continue
uninterrupted.
The importer auto-scales the default: 50K for ≥8 parallel workers, 100K
otherwise. `--lockfree-flush-threshold` overrides.
### 10. Single-merge vocab WAL rotation
Discussed above (durability invariant section).
### 11-14. Finalization fixes
Smaller surgical fixes:
- **#11** — `MknAggregator::with_cancellation_flag(&AtomicBool)` checks the
flag at the top of every shard iteration in `compute_all` and
`compute_continuation_counts`. The importer wires `self.interrupted`
in.
- **#12** — The `save_checkpoint()` call previously fired on every worker
exit. The original code is preserved as commented-out (per
CLAUDE.md "never disable by deleting" rule) with a `DISABLED:` rationale
block explaining the 12× checkpoint storm.
- **#13** — `tokio::task::yield_now().await` before `compute_mkn_stats`
lets the runtime drive other tasks (including SIGINT handlers) before
the multi-minute synchronous MKN work captures the thread.
- **#14** — `checkpoint_interval = if config.parallel_downloads >= 8 { 5 }
else { 10 }` at three sites in the importer. Higher parallelism →
faster overall throughput → checkpoints amortize over more work.
### 15. `--overlay-budget-gib` overlay-heap eviction (the OOM bound)
Optimizations #8/#9 *bounded* the inter-checkpoint overlay growth, but the
resident lock-free overlay itself (bottleneck (a)) was still **unbounded** — it
accumulated for the lifetime of an open shard, because `checkpoint()` serialized
the overlay snapshot to disk without reclaiming its resident RAM. After
libdictenstein added production overlay-heap eviction (the `checkpoint()` tail now
evicts the coldest resident overlay nodes down to a `resident_budget_bytes`,
**losslessly** — evicted nodes fault back from the durable image on read),
libgrammstein arms it per shard:
- **`ShardHandle.trie`** is held as `SharedARTrie<u64>` (`Arc<PersistentARTrie>`)
so the eviction coordinator can hold a weak self-reference; `arm_eviction()`
installs it after open/create. All trie writes remain `&self` (lock-free
overlay), so they deref through the `Arc` unchanged.
- **Budget policy** (`ShardConfig::overlay_eviction_config`): a global budget `G`
(CLI `--overlay-budget-gib`, default **10 GiB**, default-on; `0` disables) is
divided by the number of *simultaneously-resident* shards. Hash-based
`CpuProportional` (the default) and an unlimited `max_open_shards` keep all
`num_shards` resident, so the divisor is `num_shards`; otherwise the LRU cap
bounds residents to `max_open_shards`. So `SUM(per-shard budget)` over the
resident set ≈ `G`, granularity-invariant. A 64 MiB per-shard floor + a finite
200K-node per-checkpoint eviction cap keep the tail from thrashing or
latency-spiking; the base preset is `without_memory_monitor()` (no per-shard
`sysinfo` thread — the checkpoint tail fires purely on `resident > budget`).
- CX path-compression (also new in libdictenstein) shrinks the on-disk/evicted
node form, complementing eviction (it does not shrink the resident hot set).
This converts bottleneck (a) from "unbounded" to "bounded by `G`" — the resident
overlay is the dominant heap term during bulk ingest, so this is the lever that
makes the <16 GB target reachable.
## Verification
The fixes are guarded by ~20 unit and integration tests:
- `src/sources/google_books/sharding/shard.rs::tests` — 7 tests for
`lockfree_entry_count` + `commit_chunk` lifecycle + SET-semantics
idempotency.
- `src/sources/google_books/sharding/coordinator.rs::tests` — 4 tests
for `flush_lockfree_over_threshold` + `total_lockfree_entries` +
`commit_chunk_tx`.
- `src/sources/google_books/sharding/mkn.rs::tests` — 2 cancellation
tests (`compute_all`, `compute_continuation_counts`).
- `src/sources/google_books/aggregator.rs::tests` — 2 push_ref tests
(no-alloc same-ngram + equivalence-to-push).
- `src/sources/google_books/storage.rs::tests` — 4 tests for chunked
transactions + 1 idempotency test for `merge_and_rotate_vocabulary_wal`
+ 2 regression tests for the checkpoint-resume bug class.
- `src/sources/google_books/importer.rs::tests::cache_files` — 6
wiremock tests for `download_to_cache` (creates, skips existing, Range
resume, 416 recovery, cleanup of both files, idempotent cleanup).
- `src/sources/google_books/parser.rs::tests` — 7 tests for
`parse_ngram_line_ref` (unigram/bigram/unicode/wrong-field-count/
empty-ngram/invalid-fields/equivalence-to-owned).
- `src/sources/google_books/sharding/shard.rs::tests` — 3 overlay-eviction
tests (#15): `test_overlay_eviction_is_lossless_and_observable`
(`nodes_evicted > 0` + every evicted key faults back),
`test_overlay_eviction_bounds_resident_to_budget` (a 1 MiB budget reclaims the
bulk of a 50K-node overlay — `nodes_evicted >= 30K`), and
`test_overlay_eviction_under_concurrent_writers` (no lost writes under writers
racing the budget eviction; deterministic across repeats).
## Benchmark results
**In-process micro-benchmark** (`benches/overlay_eviction.rs` — single shard, 1M
distinct n-grams, `cargo bench --features google-books --bench overlay_eviction`):
| eviction OFF (unbounded) | ~211K ngrams/s | 448 ms | 0 |
| eviction ON (4 MiB budget) | ~135K ngrams/s | 139 ms | 1,078,092 |
The budget reclaims the **entire** resident overlay (>1M nodes) — the memory bound
works. The 4 MiB budget is a worst-case stress (maximal eviction); it costs ~36%
ingest throughput and, in exchange, makes the final checkpoint **3× faster** (the
overlay was evicted incrementally rather than serialized whole at the end). The
production default (10 GiB ÷ resident-shard count, ≥ 64 MiB/shard) evicts far less
often, so the real-world throughput cost is much smaller — the tradeoff is
"bounded heap, no OOM" vs unbounded growth. The bound is further verified by the
in-process eviction tests (fires, reclaims the bulk of the overlay, lossless,
concurrency-safe).
**Pending — end-to-end peak-RSS on a real corpus.** Run the importer with
`--overlay-budget-gib 0` (unbounded = old behavior) vs the default 10 GiB under
`/usr/bin/time -v` (off tmpfs) with CPU affinity (`taskset -c 0-11`). Acceptance:
peak heap 33.79 GB → <16 GB; `__mprotect` CPU share 49% → <5% (the latter already
addressed by mimalloc, #1). **Caveat:** mimalloc retains freed pages, so RSS lags
the live heap and understates the reduction — prefer `valgrind --tool=massif`
(live-heap profile, not RSS) for the heap-shape confirmation, or read the live
reclamation off `eviction_stats().nodes_evicted` as the micro-benchmark does. This
needs a real Google Books corpus run; results will land in MEMORY.md and here.