Crate snipsplit_core


Pure-Rust core for snipsplit: a token-aware greedy chunker for retrieval-augmented generation (RAG).

Algorithm:

  1. Split the text into paragraphs on blank lines, then into sentences using a regex that handles the common pitfalls: abbreviations (Mr., Dr., e.g., vs., etc.), version-style strings such as 1.0, and decimal numbers.
  2. Greedy-pack sentences into chunks while the running BPE token count is <= max_tokens.
  3. If a single sentence is too big on its own, slice it at token boundaries instead.
  4. Apply overlap_tokens by prepending the last N tokens of each emitted chunk to the chunk that follows it.
  5. Drop chunks shorter than min_tokens.
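
The five steps above can be sketched in a few dozen lines. This is a minimal illustration, not the crate's implementation: `tokenize`, `SketchConfig`, and `chunk_sketch` are hypothetical names, whitespace-separated words stand in for BPE tokens, and the sentence splitter simply ends a sentence at any word that finishes with `.`, `!`, or `?` (so it has none of the abbreviation handling described in step 1). It also assumes that overlap is taken from the chunks as packed, before any overlap is added, which means an overlapped chunk may exceed max_tokens by up to overlap_tokens; the source does not say how the crate treats that case.

```rust
/// Hypothetical stand-in for a BPE tokenizer: one token per whitespace-separated word.
fn tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Illustrative settings; the field names echo the parameters in the list above.
struct SketchConfig {
    max_tokens: usize,
    overlap_tokens: usize,
    min_tokens: usize,
}

fn chunk_sketch(text: &str, cfg: &SketchConfig) -> Vec<String> {
    assert!(cfg.max_tokens > 0, "max_tokens must be positive");

    // Step 1: paragraphs on blank lines, then naive sentences (no abbreviation handling).
    let mut sentences: Vec<Vec<&str>> = Vec::new();
    for paragraph in text.split("\n\n") {
        let mut sentence: Vec<&str> = Vec::new();
        for token in tokenize(paragraph) {
            sentence.push(token);
            if token.ends_with(|c: char| matches!(c, '.' | '!' | '?')) {
                sentences.push(std::mem::take(&mut sentence));
            }
        }
        if !sentence.is_empty() {
            sentences.push(sentence);
        }
    }

    // Steps 2 and 3: greedy packing, slicing oversized sentences at token boundaries.
    let mut chunks: Vec<Vec<&str>> = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for sentence in sentences {
        // Flush the running chunk if this sentence would push it over the budget.
        if current.len() + sentence.len() > cfg.max_tokens && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        if sentence.len() > cfg.max_tokens {
            for piece in sentence.chunks(cfg.max_tokens) {
                chunks.push(piece.to_vec());
            }
        } else {
            current.extend(sentence);
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }

    // Step 4: prepend the last `overlap_tokens` tokens of the previous packed chunk.
    let mut overlapped: Vec<Vec<&str>> = Vec::with_capacity(chunks.len());
    for (i, chunk) in chunks.iter().enumerate() {
        let mut tokens: Vec<&str> = Vec::new();
        if i > 0 {
            let prev = &chunks[i - 1];
            let start = prev.len().saturating_sub(cfg.overlap_tokens);
            tokens.extend_from_slice(&prev[start..]);
        }
        tokens.extend_from_slice(chunk);
        overlapped.push(tokens);
    }

    // Step 5: drop chunks shorter than `min_tokens`, then join tokens back into text.
    overlapped
        .into_iter()
        .filter(|tokens| tokens.len() >= cfg.min_tokens)
        .map(|tokens| tokens.join(" "))
        .collect()
}
```

With max_tokens = 5, overlap_tokens = 1, and min_tokens = 2, the input "One two three. Four five. Six seven eight nine ten eleven." yields three chunks in this sketch: the first two sentences packed together, then the oversized third sentence sliced into two pieces, each starting with the last token of the chunk before it.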

Structs

Chunk
One emitted chunk.
ChunkConfig
Chunker configuration.
Chunker
Token-aware chunker.

Enums

ChunkerError
All errors surfaced by snipsplit-core.

Type Aliases

Result
Crate-wide result alias.