chunkshop-rs — Rust port of chunkshop.
Implements sources (files / HTTP / S3 / DB tables), chunkers, a fastembed embedder, and a modular sink/backend layer (PG / MariaDB / SQLite / ClickHouse). The YAML config schema and target table shape match the Python reference so vectors are interchangeable across implementations.
Cargo features
default = ["full"] — preserves backward compatibility with chunkshop = "0.3".
Library consumers who want only the chunker structs (e.g. an embedded Postgres extension) can opt into the slim build:
chunkshop = { version = "0.4", default-features = false, features = ["chunkers"] }
Available features:
- chunkers: chunker structs + their config types (no fastembed/ort/sqlx).
- embedder-core: fastembed (BYO try_new_from_user_defined) + ORT. No hf-hub, no auto-download. Caller supplies model bytes directly via embedder::FastembedEmbedder::from_user_defined_files.
- embedder-hub: adds hf-hub for runtime auto-download. Enables embedder::FastembedEmbedder::new (stock variants + Xenova int8 BGE bit-near-exact) and the chunker::SemanticChunker::new convenience.
- embedder: historical alias = embedder-core + embedder-hub. Existing consumers see no change.
- extractor: language detection + entity extractor.
- source: files / HTTP / S3 source loaders.
- sink: the full modular sink/backend layer (PG/MariaDB/SQLite/ClickHouse).
- pipeline: high-level Pipeline + run_cell glue.
- bakeoff: chunker × embedder matrix evaluator.
- full: all of the above (default).
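To make the alias relationships in the feature list concrete, here is a minimal sketch of how a requested feature set expands. It models only the relationships stated above (default enables full, embedder enables embedder-core and embedder-hub, full enables everything listed); the crate's real Cargo.toml may declare further dependencies between features. The names direct_deps and expand_features are hypothetical and are not part of chunkshop.

```rust
use std::collections::BTreeSet;

/// Features that a given feature directly enables.
/// ASSUMPTION: only the relationships spelled out in the feature list are
/// modelled; every other feature is treated as a leaf.
fn direct_deps(feature: &str) -> &'static [&'static str] {
    match feature {
        "default" => &["full"],
        "embedder" => &["embedder-core", "embedder-hub"],
        "full" => &[
            "chunkers",
            "embedder-core",
            "embedder-hub",
            "embedder",
            "extractor",
            "source",
            "sink",
            "pipeline",
            "bakeoff",
        ],
        _ => &[],
    }
}

/// Expand a requested feature list into the full set of enabled features
/// by following the edges in `direct_deps` until nothing new is added.
pub fn expand_features(requested: &[&str]) -> BTreeSet<String> {
    let mut enabled = BTreeSet::new();
    let mut stack: Vec<&str> = requested.to_vec();
    while let Some(feature) = stack.pop() {
        // `insert` returns true only the first time a feature is seen,
        // so each feature's dependencies are pushed at most once.
        if enabled.insert(feature.to_string()) {
            stack.extend(direct_deps(feature).iter().copied());
        }
    }
    enabled
}
```

Under these assumptions, expand_features(&["chunkers"]) yields just chunkers, which corresponds to the slim build, while expand_features(&["embedder"]) also pulls in embedder-core and embedder-hub.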
Re-exports
pub use backends::AnyBackend;
pub use backends::Backend;
pub use backends::BackendConn;
pub use backends::BackendDialect;
pub use backends::ClickhouseBackend;
pub use backends::ColSpec;
pub use backends::MariadbBackend;
pub use backends::PostgresBackend;
pub use backends::SQLiteBackend;
pub use bakeoff::run_bakeoff;
pub use bakeoff::run_bakeoff_with_base;
pub use bakeoff::BakeoffConfig;
pub use bakeoff::BakeoffResults;
pub use chunker::Chunk;
pub use chunker::SentenceAwareChunker;
pub use config::load_config;
pub use config::CellConfig;
pub use embedder::FastembedEmbedder;
pub use pipeline::Pipeline;
pub use runner::run_cell;
pub use runner::CellResult;
pub use sinks::AnySink;
pub use sinks::ClickhouseSink;
pub use sinks::MariadbSink;
pub use sinks::PgSink;
pub use sinks::Sink;
pub use sinks::SqliteSink;
pub use sources::Document;
pub use sources::AnySource;
pub use sources::ClickhouseTableSource;
pub use sources::MariadbTableSource;
pub use sources::PgTableSource;
pub use sources::SqliteTableSource;
pub use sources::FilesSource;
pub use sources::HttpSource;
pub use sources::JsonCorpusSource;
pub use sources::S3Source;
Modules
- backends
  Backend module: connection management + dialect helpers per DB engine.
- bakeoff
  chunkshop-rs bakeoff: matrix evaluation. Mirrors python/src/chunkshop/bakeoff/. One YAML = one factorial run over (chunkers × embedders), scored against a gold-query set with recall@k + MRR. Outputs results.json + report.md + recommended.yaml.
- chunker
  Sentence-aware chunker. Direct port of python/src/chunkshop/chunkers/sentence_aware.py + python/src/chunkshop/chunkers/_splitting.py.
- codeparse
  Code-symbol extraction primitives ported from Python chunkshop.codeparse.
- config
  YAML config parsing.
- consolidators
  RM-A consolidators: the structured-extraction seam called by ConsolidationChunker (Task 8). Mirror of Python chunkshop.consolidators. v1 ships:
- embedder
  Fastembed-backed embedder.
- extractor
  Extractor stage. Mirrors python/src/chunkshop/extractors/.
- framer
  Framer stage. Sits between source and chunker. Each framer's frame(&raw) returns 1+ framed Documents. Each framed doc carries metadata.framer and metadata.frame_seq. Mirrors python/src/chunkshop/framers/.
- memory
  RM-A: agent-memory staging API, a chunkshop-owned append-only session staging table with deterministic event_id derivation (byte-identical to Python chunkshop.memory.staging). RM-A agent-memory module.
- pipeline
  Pipeline: chunkshop as a library. The host application drives ingestion.
- raw_store
  RawStore: pluggable storage for raw source artifacts.
- runner
  Single-cell runner: wires source -> chunker -> embedder -> sink.
- sentence_split
  Sentence splitting helpers for the semantic chunker.
- sinks
  Sinks: chunkshop's per-backend data-model semantics layer.
- sources
  Sources: input document iterators per backing store.
- summarizer
  Summarizer dispatch: turns a SummarizerConfig into a (text, doc_metadata) -> Result<String> callable. Mirrors python/src/chunkshop/chunkers/_summarizer.py.
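
The bakeoff module is described as scoring each (chunker, embedder) cell against a gold-query set with recall@k + MRR. As a rough illustration of those two metrics, the sketch below uses their textbook definitions. It is not taken from chunkshop's source: the function names, the use of string ids, and the choice to return 0.0 for empty inputs are all assumptions made for this example.

```rust
use std::collections::HashSet;

/// recall@k: fraction of the relevant ids that appear among the first `k`
/// ranked ids. ASSUMPTION: a query with no relevant ids scores 0.0.
pub fn recall_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    // Collect into a set so that a duplicated id in `ranked` counts once.
    let hits = ranked
        .iter()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .collect::<HashSet<_>>()
        .len();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank: 1 / (1-based position of the first relevant id),
/// or 0.0 when no relevant id was retrieved.
pub fn reciprocal_rank(ranked: &[&str], relevant: &HashSet<&str>) -> f64 {
    ranked
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// MRR: mean of the reciprocal ranks over all gold queries.
/// Each entry pairs one query's ranked result ids with its relevant ids.
/// ASSUMPTION: an empty query set scores 0.0.
pub fn mean_reciprocal_rank(runs: &[(Vec<&str>, HashSet<&str>)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|(ranked, relevant)| reciprocal_rank(ranked, relevant))
        .sum();
    total / runs.len() as f64
}
```

For a ranking of x, a, y, b with relevant ids a and b, recall_at_k at k = 2 is 0.5 (one of two relevant ids retrieved) and the reciprocal rank is 0.5 (the first hit is at position 2). How the actual bakeoff aggregates these scores across cells is not specified here.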