chunkshop-rs 0.9.1

Standalone ingest-to-pgvector: source -> chunker -> embedder -> extractor -> table. int8 BGE by default; bakeoff matrix evaluator built in. Cross-language wire-format compatible with the Python `chunkshop` package.

chunkshop-rs (crate)

Rust crate for chunkshop. See ../README.md for the top-level Rust port overview and Python-vs-Rust feature parity table. This file documents crate-level concerns specific to the chunkshop crate's Cargo features.

Code-aware chunking

The code-aware feature family enables symbol-aware chunking for source code via tree-sitter. Each grammar is opt-in through its own feature flag to control binary size.

Feature matrix

Feature flag                        | Languages                             | Binary (MB)     | Δ vs default (MB)
------------------------------------+---------------------------------------+-----------------+------------------
default                             | (none)                                | 51.40           | (baseline)
code-aware                          | (umbrella alone)                      | 51.42           | +0.02
code-aware-python                   | Python                                | 52.14           | +0.74
code-aware-python,code-aware-java   | Python + Java (must-have set)         | 52.54           | +1.14
code-aware-go                       | Go (should-have, pending chunkshop#40)| not implemented | n/a
code-aware-typescript               | TypeScript (should-have)              | not implemented | n/a
code-aware-javascript               | JavaScript (should-have)              | not implemented | n/a
code-aware-rust                     | Rust (should-have)                    | not implemented | n/a

Sizes were captured with cargo build --release --bin chunkshop-rs on this build host. They will drift over time; re-measure with the same matrix to update the table.
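When re-measuring, the Δ column can be recomputed mechanically. The following is a minimal sketch, not part of the crate: binary_size_mb and delta_cell are hypothetical helper names, and the choice of 1 MB = 2^20 bytes is an assumption, since the unit convention behind the table is not stated here.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Assumption: 1 MB = 2^20 bytes. Switch to 1_000_000.0 if the table uses decimal MB.
const BYTES_PER_MB: f64 = 1024.0 * 1024.0;

/// Size of a built binary in MB, read from filesystem metadata.
pub fn binary_size_mb(path: &Path) -> io::Result<f64> {
    Ok(fs::metadata(path)?.len() as f64 / BYTES_PER_MB)
}

/// Formats one "Δ vs default" cell: signed, two decimals.
pub fn delta_cell(baseline_mb: f64, variant_mb: f64) -> String {
    format!("{:+.2}", variant_mb - baseline_mb)
}
```

With Cargo's default target directory, the release binary would typically be at target/release/chunkshop-rs; build once per feature set, record binary_size_mb for each, and pass the default build's size as baseline_mb.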

The umbrella code-aware feature alone adds no grammar crates; the +0.02 MB difference is build noise from a fresh recompile, not real code. It is a marker for downstream crates that want to feature-detect "any code-aware grammar is on". The real cost arrives with the per-language features, which pull in tree-sitter, tree-sitter-tags, and the language grammar crate as optional dependencies gated with Cargo's dep: syntax.
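Feature detection of this kind can be done with the standard cfg! macro. The sketch below uses the feature names from the matrix above; the function names are hypothetical and are not part of the chunkshop API. It assumes the code lives in a crate whose Cargo.toml declares (or forwards) features with these names.

```rust
/// True when the umbrella marker feature is enabled for this build.
#[allow(unexpected_cfgs)]
pub fn code_aware_enabled() -> bool {
    cfg!(feature = "code-aware")
}

/// Lists the implemented grammars that were compiled in.
/// Only Python and Java are checked, since the other extractors
/// are listed as not implemented in the matrix.
#[allow(unexpected_cfgs)]
pub fn enabled_grammars() -> Vec<&'static str> {
    let mut langs = Vec::new();
    if cfg!(feature = "code-aware-python") {
        langs.push("python");
    }
    if cfg!(feature = "code-aware-java") {
        langs.push("java");
    }
    langs
}
```

Because cfg! evaluates at compile time, a build without any of these features simply reports false and an empty list; no grammar code is referenced, so nothing extra is linked.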

Cross-port byte-equivalence

When the same source file is ingested via chunkshop-py and chunkshop-rs, the resulting chunks share identical fqn and node_id metadata. This is enforced by:

  • Rust proptest at rust/chunkshop/tests/cross_port_proptest.rs (~1500 random cases per cargo test run)
  • Python pytest at python/tests/chunkshop/test_rust_cross_port_parity.py (46 curated vectors; invokes the fqn-cli Rust binary as a subprocess)

Both gates run in CI on every PR.

Syntax-error fallback

When tree-sitter returns an error-containing parse tree (mirroring Python's ast.parse → SyntaxError check), or when the document's language can't be detected from its extension, the chunker falls back to sentence_aware and stamps metadata.strategy = "symbol_aware_fallback" with a fallback_reason for observability.
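The fallback decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the crate's code: StrategyStamp, detect_language, choose_strategy, the "symbol_aware" success value, and both fallback_reason strings are hypothetical. The tree_has_error flag stands in for whatever the parser reports (with the tree-sitter Rust bindings, that would typically be the root node's has_error()).

```rust
use std::path::Path;

/// Metadata stamped on chunks to record which strategy ran.
#[derive(Debug, PartialEq)]
pub struct StrategyStamp {
    pub strategy: &'static str,
    pub fallback_reason: Option<&'static str>,
}

/// Hypothetical extension map covering the two implemented grammars.
fn detect_language(path: &str) -> Option<&'static str> {
    match Path::new(path).extension()?.to_str()? {
        "py" => Some("python"),
        "java" => Some("java"),
        _ => None,
    }
}

/// Picks symbol-aware chunking only when the language is known
/// and the parse tree is error-free; otherwise records a fallback.
pub fn choose_strategy(path: &str, tree_has_error: bool) -> StrategyStamp {
    match detect_language(path) {
        None => StrategyStamp {
            strategy: "symbol_aware_fallback",
            fallback_reason: Some("unknown_language"), // assumed reason string
        },
        Some(_) if tree_has_error => StrategyStamp {
            strategy: "symbol_aware_fallback",
            fallback_reason: Some("syntax_error"), // assumed reason string
        },
        Some(_) => StrategyStamp {
            strategy: "symbol_aware", // assumed success value
            fallback_reason: None,
        },
    }
}
```

Keeping the decision separate from parsing makes both fallback paths easy to unit-test without building any grammar crate.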

Should-have grammars status

The code-aware-go, code-aware-typescript, code-aware-javascript, and code-aware-rust feature flags exist in Cargo.toml and pull in their respective grammar crates, but the per-language extractors are not yet implemented. They are pending chunkshop#40 (Python's tree-sitter migration for Go/TS/JS); implementing the extractors first in Python keeps the cross-port equivalence contract intact.