Expand description
Pure-Rust core for snipsplit. Token-aware greedy chunker for RAG.
Algorithm:
- Split into paragraphs on blank lines, then sentences via a regex
that handles the common abbreviation pitfalls (
Mr.,Dr.,e.g.,vs.,etc., version-style1.0, decimal numbers). - Greedy-pack sentences into chunks while the running BPE token count
is
<= max_tokens. - If a single sentence is too big on its own, slice it at token boundaries instead.
- Apply
overlap_tokensby re-prepending the last N tokens of each emitted chunk to the next. - Drop chunks shorter than
min_tokens.
Structs§
- Chunk
- One emitted chunk.
- Chunk
Config - Chunker configuration.
- Chunker
- Token-aware chunker.
Enums§
- Chunker
Error - All errors surfaced by
snipsplit-core.
Type Aliases§
- Result
- Crate-wide result alias.