Crate chunkshop

chunkshop-rs — Rust port of chunkshop.

Implements sources (files / HTTP / S3 / DB tables), chunkers, a fastembed embedder, and a modular sink/backend layer (PG / MariaDB / SQLite / ClickHouse). The YAML config schema and target table shape match the Python reference so vectors are interchangeable across implementations.

Cargo features

default = ["full"] — preserves backward compatibility with chunkshop = "0.3".

Library consumers that need only the chunker structs (for example, an embedded Postgres extension) can opt into the slim build:

chunkshop = { version = "0.4", default-features = false, features = ["chunkers"] }

Available features:

  • chunkers — chunker structs + their config types (no fastembed/ort/sqlx).
  • embedder-core — fastembed (BYO try_new_from_user_defined) + ORT. No hf-hub, no auto-download. Caller supplies model bytes directly via embedder::FastembedEmbedder::from_user_defined_files.
  • embedder-hub — adds hf-hub for runtime auto-download. Enables embedder::FastembedEmbedder::new (stock variants + Xenova int8 BGE bit-near-exact) and the chunker::SemanticChunker::new convenience.
  • embedder — historical alias = embedder-core + embedder-hub. Existing consumers see no change.
  • extractor — language detection + entity extractor.
  • source — files / HTTP / S3 source loaders.
  • sink — the full modular sink/backend layer (PG/MariaDB/SQLite/ClickHouse).
  • pipeline — high-level Pipeline + run_cell glue.
  • bakeoff — chunker × embedder matrix evaluator.
  • full — all of the above (default).
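The alias relationships in the list above (full covers every listed feature, embedder stands for embedder-core plus embedder-hub) can be modeled with a small standalone sketch. The helper name expand_features is hypothetical and not part of chunkshop; Cargo performs the real resolution, and the sketch assumes no feature implies another beyond the two documented aliases, which the crate's actual Cargo.toml may not match.

```rust
use std::collections::BTreeSet;

/// Concrete (non-alias) feature names taken from the list above.
static CONCRETE: [&str; 8] = [
    "chunkers",
    "embedder-core",
    "embedder-hub",
    "extractor",
    "source",
    "sink",
    "pipeline",
    "bakeoff",
];

/// Hypothetical helper: expand the documented aliases into concrete features.
/// Assumption: only `full` and `embedder` are aliases; any other cross-feature
/// dependencies in the real manifest are not modeled here.
pub fn expand_features(requested: &[&str], default_features: bool) -> BTreeSet<&'static str> {
    let mut out = BTreeSet::new();
    let mut stack: Vec<&str> = requested.to_vec();
    if default_features {
        // default = ["full"]
        stack.push("full");
    }
    while let Some(name) = stack.pop() {
        match name {
            "full" => stack.extend(CONCRETE),
            "embedder" => stack.extend(["embedder-core", "embedder-hub"]),
            other => {
                // Names that are not in the documented list are ignored.
                if let Some(c) = CONCRETE.iter().find(|c| **c == other) {
                    out.insert(*c);
                }
            }
        }
    }
    out
}
```

Under these assumptions, calling expand_features(&["chunkers"], false) models the slim build shown earlier and yields only chunkers, while leaving default features on yields all eight concrete features.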

Re-exports

pub use backends::AnyBackend;
pub use backends::Backend;
pub use backends::BackendConn;
pub use backends::BackendDialect;
pub use backends::ClickhouseBackend;
pub use backends::ColSpec;
pub use backends::MariadbBackend;
pub use backends::PostgresBackend;
pub use backends::SQLiteBackend;
pub use bakeoff::run_bakeoff;
pub use bakeoff::run_bakeoff_with_base;
pub use bakeoff::BakeoffConfig;
pub use bakeoff::BakeoffResults;
pub use chunker::Chunk;
pub use chunker::SentenceAwareChunker;
pub use config::load_config;
pub use config::CellConfig;
pub use embedder::FastembedEmbedder;
pub use pipeline::Pipeline;
pub use runner::run_cell;
pub use runner::CellResult;
pub use sinks::AnySink;
pub use sinks::ClickhouseSink;
pub use sinks::MariadbSink;
pub use sinks::PgSink;
pub use sinks::Sink;
pub use sinks::SqliteSink;
pub use sources::Document;
pub use sources::AnySource;
pub use sources::ClickhouseTableSource;
pub use sources::MariadbTableSource;
pub use sources::PgTableSource;
pub use sources::SqliteTableSource;
pub use sources::FilesSource;
pub use sources::HttpSource;
pub use sources::JsonCorpusSource;
pub use sources::S3Source;

Modules

backends
Backend module — connection management + dialect helpers per DB engine.
bakeoff
chunkshop-rs bakeoff — matrix evaluation. Mirrors python/src/chunkshop/bakeoff/. One YAML = one factorial run over (chunkers × embedders), scored against a gold-query set with recall@k + MRR. Outputs results.json + report.md + recommended.yaml.
chunker
Sentence-aware chunker. Direct port of python/src/chunkshop/chunkers/sentence_aware.py + python/src/chunkshop/chunkers/_splitting.py.
codeparse
Code-symbol extraction primitives ported from Python chunkshop.codeparse.
config
YAML config parsing.
consolidators
RM-A consolidators — the structured-extraction seam called by ConsolidationChunker (Task 8). Mirror of Python chunkshop.consolidators. v1 ships:
embedder
Fastembed-backed embedder.
extractor
Extractor stage. Mirrors python/src/chunkshop/extractors/.
framer
Framer stage. Sits between source and chunker. Each framer’s frame(&raw) returns one or more framed Documents, each carrying metadata.framer and metadata.frame_seq. Mirrors python/src/chunkshop/framers/.
memory
RM-A agent-memory staging API: a chunkshop-owned, append-only session staging table with deterministic event_id derivation (byte-identical to Python chunkshop.memory.staging).
pipeline
Pipeline — chunkshop as a library. The host application drives ingestion.
raw_store
RawStore: pluggable storage for raw source artifacts.
runner
Single-cell runner: wires source -> chunker -> embedder -> sink.
sentence_split
Sentence splitting helpers for the semantic chunker.
sinks
Sinks — chunkshop’s per-backend data-model semantics layer.
sources
Sources — input document iterators per backing store.
summarizer
Summarizer dispatch — turns a SummarizerConfig into a (text, doc_metadata) -> Result<String> callable. Mirrors python/src/chunkshop/chunkers/_summarizer.py.
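
The bakeoff module listed above scores each (chunker × embedder) cell with recall@k and MRR. As a rough illustration of those two metrics, here is a minimal standalone sketch using their textbook definitions. ScoredQuery and the function names are hypothetical and are not chunkshop's API; the crate's actual scoring code may differ in details such as tie handling or duplicate ids.

```rust
/// One gold query (hypothetical type): the ids retrieval returned, best first,
/// and the ids the gold set marks as relevant (assumed unique).
pub struct ScoredQuery<'a> {
    pub ranked: Vec<&'a str>,
    pub relevant: Vec<&'a str>,
}

/// recall@k: fraction of the relevant ids that appear in the top k results.
pub fn recall_at_k(q: &ScoredQuery, k: usize) -> f64 {
    if q.relevant.is_empty() {
        return 0.0;
    }
    let top = &q.ranked[..k.min(q.ranked.len())];
    let hits = q.relevant.iter().filter(|r| top.contains(*r)).count();
    hits as f64 / q.relevant.len() as f64
}

/// Reciprocal rank: 1 / (1-based position of the first relevant hit), or 0 if none.
pub fn reciprocal_rank(q: &ScoredQuery) -> f64 {
    q.ranked
        .iter()
        .position(|id| q.relevant.contains(id))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean recall@k over the gold-query set.
pub fn mean_recall_at_k(qs: &[ScoredQuery], k: usize) -> f64 {
    if qs.is_empty() {
        return 0.0;
    }
    qs.iter().map(|q| recall_at_k(q, k)).sum::<f64>() / qs.len() as f64
}

/// MRR: mean reciprocal rank over the gold-query set.
pub fn mrr(qs: &[ScoredQuery]) -> f64 {
    if qs.is_empty() {
        return 0.0;
    }
    qs.iter().map(reciprocal_rank).sum::<f64>() / qs.len() as f64
}
```

In a factorial run, one such pair of numbers would be computed per cell and the cells compared; how chunkshop aggregates them into report.md and recommended.yaml is not specified here.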