chunkshop-rs
Rust port of chunkshop. Same YAML config, same pgvector target table, same
ordering of chunks — vectors written by chunkshop-rs ingest are compatible
with vectors written by the Python reference (../scripts/parity_check.py),
and the same bakeoff YAML produces equivalent leaderboards in both languages
(../scripts/parity_check_bakeoff.py).
What Rust covers today
The chunkshop user journey is bring corpus → run bakeoff → take recommended.yaml → run ingest → repeat for new corpus. Rust now covers all of it except orchestrator-level parallel fan-out.
| What | Python | Rust |
|---|---|---|
| ingest (one YAML → one cell) | ✅ | ✅ |
| Pipeline (library / inline mode) | ✅ | ✅ |
| bakeoff (matrix → leaderboard → recommended.yaml) | ✅ | ✅ |
| orchestrate (N cells in parallel subprocesses) | ✅ | ❌ |

Cross-language bakeoff parity is verified per-run by scripts/parity_check_bakeoff.py: same YAML, leaderboards within ±2.5pp MRR with consistent ordering on distinct-MRR pairs. The orchestrator port is the next major piece.
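The acceptance rule quoted above can be sketched in a few lines of Rust. This is not the logic of scripts/parity_check_bakeoff.py, which is a Python script not reproduced here. The `Row` struct, the `leaderboards_agree` function, and the reading of "distinct-MRR pairs" as "pairs that are not tied on either side" are all assumptions made for illustration.

```rust
/// One leaderboard row: a cell name and its MRR in [0, 1].
/// Hypothetical shape; the real leaderboard schema is not shown in this README.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub cell: String,
    pub mrr: f64,
}

/// Returns Ok(()) when two leaderboards agree under the assumed rules:
/// 1. every Python cell appears in the Rust leaderboard with |delta MRR| <= tol;
/// 2. any two cells that are not tied on either side are ranked the same way.
pub fn leaderboards_agree(py: &[Row], rs: &[Row], tol: f64) -> Result<(), String> {
    if py.len() != rs.len() {
        return Err(format!("row count differs: {} vs {}", py.len(), rs.len()));
    }
    // Pair rows up by cell name and check the per-cell tolerance.
    let mut pairs: Vec<(&str, f64, f64)> = Vec::with_capacity(py.len());
    for p in py {
        let r = rs
            .iter()
            .find(|r| r.cell == p.cell)
            .ok_or_else(|| format!("cell {} missing from Rust leaderboard", p.cell))?;
        if (p.mrr - r.mrr).abs() > tol {
            return Err(format!("cell {}: MRR {} vs {} exceeds {}", p.cell, p.mrr, r.mrr, tol));
        }
        pairs.push((p.cell.as_str(), p.mrr, r.mrr));
    }
    // Ordering check, skipping pairs that are tied in either leaderboard.
    for (i, a) in pairs.iter().enumerate() {
        for b in &pairs[i + 1..] {
            let (d_py, d_rs) = (a.1 - b.1, a.2 - b.2);
            if d_py != 0.0 && d_rs != 0.0 && d_py.signum() != d_rs.signum() {
                return Err(format!("cells {} and {} are ranked differently", a.0, b.0));
            }
        }
    }
    Ok(())
}
```

Called with `tol = 0.025`, this corresponds to the ±2.5pp MRR window. Skipping tied pairs keeps the check from failing when small drift breaks a tie in only one language.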
Status: v0.1.x. Single-cell pipeline + bakeoff at parity with Python
(modulo two NLP-heavy extractors that need spaCy). The same canonical
YAML — docs/samples/bakeoff-ntsb/bakeoff-ntsb.yaml — runs from both
languages and produces equivalent leaderboards. That is wire-format proof
of the cross-language claim at both the cell layer (vectors interchangeable)
and the bakeoff layer (leaderboards within ±2.5pp MRR, consistent ordering
on distinct-MRR pairs).
Build
Release build takes ~25s wall on a modern laptop (plus ~5 min of crate
downloads on a cold ~/.cargo). The ONNX Runtime binary is downloaded by
ort-sys during cargo build — if that fails with HTTP 504, retry; the CDN
(cdn.pyke.io) occasionally hiccups.
Run
# Point at a pgvector-enabled Postgres
# Run the shipped sample config (from repo root)
The first run downloads the embedder model to fastembed's cache (~500 MB for
bge-base-en-v1.5). Subsequent runs are local.
What works
| Stage | Supported |
|---|---|
| source | files, json_corpus, pg_table, http, s3 (bucket + prefix + optional endpoint_url for minio / R2; via the object_store crate; standard AWS credential chain; metadata carries {bucket, key, size, etag}) — 5/5 sources ship in Rust |
| framer | identity (default 1-to-1 pass-through), heading_boundary (split markdown on a configurable heading regex with preamble + title-from-heading), regex_boundary (split on arbitrary regex; optional title-pattern capture), jsonpath (parse content as JSON, walk dotted path with * for list iteration; configurable body/title sub-paths) — all byte-identical to Python |
| chunker | sentence_aware, hierarchy, fixed_overlap, neighbor_expand, summary_embed, hierarchical_summary (all byte-identical to Python), semantic (algorithm-parity; chunks NOT byte-identical due to MB-1's ~1e-3 ORT drift); all 7 Python chunkers listed here ship in Rust. Summarizer dispatch supports passthrough + external natively; callable mode always recognizes chunkshop.summarizers.passthrough, and recognizes chunkshop.summarizers.lede behind the lede cargo feature (cargo build --features lede pulls lede from crates.io). |
| embedder | fastembed (maps model_name to fastembed-rs variant; see below) |
| extractor | none (default), composite, rake_keywords (hand-rolled RAKE + 150-word EN stopword list — algorithm-only parity), lang_detect (via whatlang crate, ISO 639-3 → 639-1 conversion — algorithm-only parity), keybert_phrases + spacy_entities (Python-only stubs that error at config-load) |
| target | pgvector table; modes overwrite / append / create_if_missing; force_overwrite; source_tag write-once on ON CONFLICT; promote_metadata jsonb-to-typed-column writes; HNSW index optional; concurrent-cell safe via schema-name advisory lock |
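The lang_detect row mentions an ISO 639-3 → 639-1 conversion. A minimal sketch of such a lookup is below. The function name `iso639_3_to_1` and the short list of languages are illustrative assumptions; the table actually used by chunkshop-rs is not shown in this README and is presumably larger.

```rust
/// Maps a three-letter ISO 639-3 code to its two-letter ISO 639-1 code.
/// Returns None for codes this (deliberately tiny, illustrative) table
/// does not know; how the real extractor handles that case is not shown here.
pub fn iso639_3_to_1(code: &str) -> Option<&'static str> {
    match code.to_ascii_lowercase().as_str() {
        "eng" => Some("en"),
        "fra" => Some("fr"),
        "deu" => Some("de"),
        "spa" => Some("es"),
        "ita" => Some("it"),
        "por" => Some("pt"),
        "nld" => Some("nl"),
        "rus" => Some("ru"),
        "jpn" => Some("ja"),
        _ => None,
    }
}
```

Returning `Option` rather than passing the input through leaves the policy for unmapped languages to the caller.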
What does NOT work yet
The two extractor stubs are deliberately Python-only. The missing orchestrator and the limited embedder-registry breadth are real parity gaps.
Meta-runner gap
- chunkshop-rs bakeoff subcommand: SHIPPED. Run the same bakeoff-ntsb-rust.yaml from either language and verify with scripts/parity_check_bakeoff.py (leaderboards within ±2.5pp MRR; consistent ordering on distinct-MRR pairs).
- chunkshop-rs orchestrate subcommand: not ported. Bulk multi-cell fan-out (Python's CLI spawns N ingest subprocesses with thread caps, log multiplexing, per-cell failure isolation). Different surface than the bakeoff; tracked separately.
Embedder registry breadth
- The Rust dispatch supports Xenova/bge-{small,base}-en-v1.5-int8 (bit-near-exact via the same ONNX file Python loads), the stock fastembed-rs variants (BAAI/bge-{small,base,large}-en-v1.5, sentence-transformers/all-MiniLM-L6-v2[-int8]), and the nomic v1.5 family (nomic-ai/nomic-embed-text-v1.5[-Q]). Adding a model that fastembed-rs already knows = one entry in src/embedder.rs::resolve_model_name. Adding a model fastembed-rs doesn't know yet = one entry in user_defined_source (for CLS-pooled variants) plus a mean-pooling branch if needed. The brief queue tracks a YAML-driven HF-pointer feature that turns "add a model" into a YAML edit instead of a Rust-source edit. See ../docs/embedders.md.
Deliberate Python-only extractors
- Extractors keybert_phrases and spacy_entities: Python-only. They error at config load, and the message directs users back to Python or to a custom Rust binary that registers their own NER / embedding-keyphrase pipeline. The other four extractor variants (none, composite, rake_keywords, lang_detect) ship.
YAML configs from the Python side are accepted (unknown fields on
runtime/framer/extractor are ignored) — but obviously the ignored stages
won't run.
Embedding parity vs Python
For the two registered Xenova int8 BGE variants, chunkshop-rs loads
the same ONNX file Python loads (Xenova/bge-base-en-v1.5/onnx/model_quantized.onnx
and the small-model equivalent) via hf-hub,
tokenizes through the tokenizers crate with the same padding/truncation
config fastembed-py uses, runs ORT with intra_threads=1, CLS-pools, and
L2-normalizes (with f64 sum-of-squares to mirror numpy). On the shipped
4-file / 14-chunk sample corpus the cross-language parity check reports:
- Top-k retrieval order: identical (Python and Rust pick the same chunks in the same order for a fixed query). This is the user-visible RAG claim.
- Chunk embedded_content: 100% byte-for-byte identical.
- Cosine distance between matched embeddings: mean ~1-2e-3, max ~5-15e-3 per chunk (was ~1e-2 mean before this work, a ~5x improvement).
Strict bitwise equality is not achievable: Python's onnxruntime wheel
and Rust's ort crate are independent ORT
C++ binary builds. They diverge by ULPs (and occasionally more on quantized
matmul paths) regardless of thread count. If your workflow needs bitwise
reproducibility (e.g. cross-implementation vector hashing), use one
implementation throughout.
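The post-processing steps named in this section (CLS pooling, then L2 normalization with an f64 sum of squares) can be sketched as follows. This is a standalone illustration, not the code in src/embedder.rs: the function names, the row-major `[seq_len x hidden]` layout, and the choice to perform the division in f64 before casting back to f32 are all assumptions.

```rust
/// CLS pooling: return the hidden state of the first token from a
/// row-major [seq_len x hidden] buffer. None if the buffer is too short.
pub fn cls_pool(token_states: &[f32], hidden: usize) -> Option<&[f32]> {
    if hidden == 0 {
        return None;
    }
    token_states.get(..hidden)
}

/// L2-normalize a vector, accumulating the sum of squares in f64 so that
/// rounding in the accumulator does not depend on f32 summation order.
/// A zero vector is returned unchanged.
pub fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let sum_sq: f64 = v.iter().map(|&x| f64::from(x) * f64::from(x)).sum();
    let norm = sum_sq.sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|&x| (f64::from(x) / norm) as f32).collect()
}
```

Even with identical post-processing, the inputs to these functions come from two different ORT builds, which is why the drift described above remains.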
For the other supported model names the embedder uses fastembed-rs's stock variants. Those do not claim parity with Python: they use Qdrant's fp32-optimized ONNX, a different file from Python's BAAI fp32 ONNX, and typically drift ~1e-3 per element.
Model-name mapping today (in src/embedder.rs):
| Python YAML model_name | Rust path | Parity vs Python |
|---|---|---|
| Xenova/bge-base-en-v1.5-int8 | hand-rolled ORT + Xenova ONNX | retrieval-identical, cos drift ≤ 1.5e-2 |
| Xenova/bge-small-en-v1.5-int8 | hand-rolled ORT + Xenova ONNX | retrieval-identical, cos drift ≤ 1.5e-2 |
| BAAI/bge-base-en-v1.5 | fastembed-rs BGEBaseENV15 | wire-format only |
| BAAI/bge-small-en-v1.5 | fastembed-rs BGESmallENV15 | wire-format only |
| BAAI/bge-large-en-v1.5 | fastembed-rs BGELargeENV15 | wire-format only |
| sentence-transformers/all-MiniLM-L6-v2 | fastembed-rs AllMiniLML6V2 | wire-format only |

Any other model_name errors at cell start.
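The mapping table above amounts to a string match with an error arm. The sketch below shows that shape only. The `Backend` enum and `resolve_backend` function are hypothetical and are not the real src/embedder.rs::resolve_model_name, whose signature is not shown here. The fastembed-rs variant names are carried as plain strings so the example needs no external crates.

```rust
/// Which embedding path a model_name resolves to (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    /// Hand-rolled ORT session over the Xenova ONNX file.
    XenovaOrt,
    /// A stock fastembed-rs variant, named as in the table above.
    Fastembed(&'static str),
}

/// Resolve a YAML model_name to a backend, or fail with a message.
/// Only the six names from the table above are covered.
pub fn resolve_backend(model_name: &str) -> Result<Backend, String> {
    match model_name {
        "Xenova/bge-base-en-v1.5-int8" | "Xenova/bge-small-en-v1.5-int8" => {
            Ok(Backend::XenovaOrt)
        }
        "BAAI/bge-base-en-v1.5" => Ok(Backend::Fastembed("BGEBaseENV15")),
        "BAAI/bge-small-en-v1.5" => Ok(Backend::Fastembed("BGESmallENV15")),
        "BAAI/bge-large-en-v1.5" => Ok(Backend::Fastembed("BGELargeENV15")),
        "sentence-transformers/all-MiniLM-L6-v2" => Ok(Backend::Fastembed("AllMiniLML6V2")),
        other => Err(format!("unsupported model_name: {other}")),
    }
}
```

Resolving the name before any model download is one way to surface a typo in the YAML at cell start, before any embedding work begins.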
Parity verification
- rust/chunkshop/tests/embedding_parity.rs: embeds 5 fixed inputs and asserts (a) median per-vector cosine distance ≤ 1e-7, (b) max abs element-wise diff ≤ 1e-2, (c) max per-vector cosine distance ≤ 5e-3 against committed Python reference vectors. Skips cleanly without network. Re-generate the reference with uv run --project python python scripts/produce_rust_parity_reference.py.
- scripts/parity_check.py: end-to-end ingest comparison. Boots Python and Rust against the same corpus into two tables, compares top-k retrieval and per-chunk cosine. Manual; needs both toolchains plus Postgres.
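The three thresholds quoted for embedding_parity.rs can be computed along these lines. This is a sketch under assumptions, not the test's actual code: `ParityStats`, `parity_stats`, and `within_thresholds` are hypothetical names, and the median of an even-sized sample is taken as the mean of the two middle values.

```rust
/// Summary statistics compared against the thresholds quoted above.
#[derive(Debug, Clone, Copy)]
pub struct ParityStats {
    pub median_cos_dist: f64,
    pub max_cos_dist: f64,
    pub max_abs_diff: f64,
}

/// Cosine distance (1 - cosine similarity), computed in f64.
/// Yields NaN for a zero vector, which later fails every threshold.
fn cosine_distance(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    1.0 - dot / (na.sqrt() * nb.sqrt())
}

/// Compare produced vectors against reference vectors, pairwise by index.
/// Returns None on empty input or any shape mismatch.
pub fn parity_stats(got: &[Vec<f32>], want: &[Vec<f32>]) -> Option<ParityStats> {
    if got.is_empty() || got.len() != want.len() {
        return None;
    }
    let mut dists = Vec::with_capacity(got.len());
    let mut max_abs_diff = 0.0f64;
    for (g, w) in got.iter().zip(want) {
        if g.len() != w.len() {
            return None;
        }
        dists.push(cosine_distance(g, w));
        for (&x, &y) in g.iter().zip(w) {
            max_abs_diff = max_abs_diff.max((f64::from(x) - f64::from(y)).abs());
        }
    }
    dists.sort_by(|a, b| a.total_cmp(b));
    let n = dists.len();
    let median = if n % 2 == 1 {
        dists[n / 2]
    } else {
        (dists[n / 2 - 1] + dists[n / 2]) / 2.0
    };
    Some(ParityStats { median_cos_dist: median, max_cos_dist: dists[n - 1], max_abs_diff })
}

/// The three limits listed above: (a) median, (b) max abs diff, (c) max cosine.
pub fn within_thresholds(s: &ParityStats) -> bool {
    s.median_cos_dist <= 1e-7 && s.max_abs_diff <= 1e-2 && s.max_cos_dist <= 5e-3
}
```

Checking a median alongside the two maxima lets a few outlier vectors drift further while still requiring the typical vector to match closely.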
Cross-language parity check
scripts/parity_check.py (at the repo root) is a manual check — not a pytest
— because it needs both toolchains installed. It runs the Python ingest and
the Rust ingest into two tables, then compares top-k retrieval for a fixed
query:
# With both `uv` (Python) and cargo (Rust) available:
Writes skill-output/rust-parity/report.md.
Integration test
Skips if CHUNKSHOP_TEST_DSN is unset. The test creates schema
chunkshop_rust_parity, ingests tests/parity-fixtures/handbook-intro.md,
and asserts row count > 0, non-empty embedded_content, and
vector(768) column dim. Leaves the schema behind for inspection; rerun
re-creates it under mode: overwrite.
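The skip-when-unset behaviour can be sketched with the standard library alone. The helper names `test_dsn_from` and `test_dsn` are hypothetical, and treating an empty or whitespace-only value as unset is an assumption; only the CHUNKSHOP_TEST_DSN variable name comes from this README.

```rust
/// Decide whether a raw environment value counts as a usable DSN.
/// Split out from the env lookup so the rule can be checked without
/// touching process-wide state.
pub fn test_dsn_from(value: Option<String>) -> Option<String> {
    value
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
}

/// Read CHUNKSHOP_TEST_DSN; None means "skip the integration test".
pub fn test_dsn() -> Option<String> {
    test_dsn_from(std::env::var("CHUNKSHOP_TEST_DSN").ok())
}
```

A test body would typically call `test_dsn()` first and return early with a message on `None`, so that a machine without Postgres still gets a green run.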
Implementation roadmap (not shipped)
| Want | Lift |
|---|---|
| chunkshop-rs bakeoff subcommand | Shipped (see Meta-runner gap above). Mirrors python/src/chunkshop/bakeoff/ (config + keys + gold + runner + score + output) so the same bakeoff-ntsb.yaml runs from either language and produces the same leaderboard ordering. This was the single biggest gap to the cross-language pitch; without it, the comparison story was Python-only. |
| chunkshop-rs orchestrate subcommand | Spawn N chunkshop-rs ingest subprocesses over N YAML configs with thread caps + per-cell failure isolation. Different surface from bakeoff; tracked separately. |
| Python-only extractors (keybert_phrases, spacy_entities) | Each needs a Rust-native NER / embedding-keyphrase implementation that avoids Python-only deps (spaCy can't cross). |
License
MIT (workspace inherits from the repo root).