# chunkedrs

AI-native text chunking for Rust — split long documents into token-accurate pieces for embedding and retrieval.

Built on tiktoken for precise token counting. Every chunk is guaranteed to respect your token budget.
## Why chunkedrs?
RAG pipelines need text split into chunks that fit model context windows. Naive splitting (by character count or fixed size) breaks mid-word, mid-sentence, or mid-paragraph — destroying meaning and hurting retrieval quality.
chunkedrs splits at semantic boundaries (paragraphs → sentences → words) while enforcing exact token limits. No chunk ever exceeds max_tokens.
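The recursive fallback idea (try the coarsest boundary first, then fall back to finer ones) can be sketched in a few lines. This is a simplified illustration, not the crate's implementation: word count stands in for real token counts, and small pieces are not merged back together as a production splitter would do.

```rust
// Illustrative sketch of recursive boundary fallback:
// paragraphs ("\n\n") -> sentences (". ") -> words (" ").
// Word count stands in for token count to keep the sketch self-contained.
fn recursive_split(text: &str, max_words: usize) -> Vec<String> {
    fn go(piece: &str, seps: &[&str], max: usize, out: &mut Vec<String>) {
        // piece fits the budget, or no finer separator is left: emit it
        if piece.split_whitespace().count() <= max || seps.is_empty() {
            if !piece.trim().is_empty() {
                out.push(piece.trim().to_string());
            }
            return;
        }
        // otherwise split on the current separator and recurse with finer ones
        for part in piece.split(seps[0]) {
            go(part, &seps[1..], max, out);
        }
    }
    let seps = ["\n\n", ". ", " "];
    let mut out = Vec::new();
    go(text, &seps, max_words, &mut out);
    out
}
```

Splitting `"one two three four. five six seven eight."` with a budget of 4 words leaves the paragraph intact, then falls back to the sentence boundary, yielding two chunks.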
## Strategies
| Strategy | Use case | Speed |
|---|---|---|
| Recursive (default) | General text — paragraphs, sentences, words | Fastest |
| Markdown | Documents with # headers — preserves section metadata | Fast |
| Semantic | High-quality RAG — splits at meaning boundaries via embeddings | Slower (API calls) |
## Quick start
```rust
// split with defaults: recursive, 512 max tokens, no overlap
// (argument values and field names below are illustrative)
let chunks = chunk(text).split();
for chunk in &chunks {
    println!("{}", chunk.text);
}
```
## Token-accurate splitting
```rust
let chunks = chunk(text)
    .max_tokens(256)
    .overlap(32)
    .model("gpt-4o")
    .split();
// every chunk is guaranteed to have <= 256 tokens
assert!(chunks.iter().all(|c| c.token_count <= 256));
```
## Markdown-aware splitting
```rust
let markdown = "# Intro\n\nSome text.\n\n## Details\n\nMore text here.\n";
let chunks = chunk(markdown).markdown().split();
// each chunk knows which section it belongs to
assert_eq!(chunks[0].heading.as_deref(), Some("Intro"));
```
## Semantic splitting
With the semantic feature enabled, split at meaning boundaries using embeddings:
```toml
[dependencies]
chunkedrs = { version = "1", features = ["semantic"] }
```
```rust
// embedding client used to detect boundaries (constructor arguments illustrative)
let client = openai("text-embedding-3-small");
let chunks = chunk(text)
    .semantic(client)
    .threshold(0.8)
    .split_async()
    .await?;
```
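The idea behind semantic splitting boils down to embedding adjacent spans and cutting wherever similarity drops. A self-contained sketch of that idea, with no API calls (the function names and threshold semantics here are illustrative, not the crate's internals):

```rust
// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// A boundary goes between sentence i-1 and i when their embeddings
// are less similar than the threshold.
fn split_points(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    (1..embeddings.len())
        .filter(|&i| cosine(&embeddings[i - 1], &embeddings[i]) < threshold)
        .collect()
}
```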
## Chunk metadata
Every Chunk carries rich metadata:
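The exact fields are best checked against the crate's API docs; as a hypothetical sketch of what such metadata might look like (field names here are illustrative assumptions, not the crate's actual definition):

```rust
// Hypothetical shape of a Chunk's metadata (illustrative only).
pub struct Chunk {
    pub text: String,            // the chunk's text content
    pub token_count: usize,      // exact token count under the chosen encoding
    pub index: usize,            // position of this chunk in the document
    pub heading: Option<String>, // nearest Markdown heading, if any
}
```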
## Overlap
Token overlap between consecutive chunks preserves context at boundaries — critical for retrieval quality:
```rust
let chunks = chunk(text)
    .max_tokens(512)
    .overlap(64)
    .split();
```
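Conceptually, overlap is a sliding window over the token sequence: the last `overlap` tokens of each chunk reappear at the start of the next one. A minimal sketch of that windowing, operating on raw token IDs for simplicity (the crate itself works on text and respects semantic boundaries):

```rust
// Sliding-window chunking over token IDs: each chunk holds at most `max`
// tokens, and consecutive chunks share `overlap` tokens at the boundary.
fn chunk_with_overlap(tokens: &[u32], max: usize, overlap: usize) -> Vec<Vec<u32>> {
    assert!(overlap < max, "overlap must be smaller than max");
    let step = max - overlap; // how far the window advances each iteration
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + max).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With 10 tokens, `max = 4`, and `overlap = 2`, the window advances 2 tokens at a time, so every consecutive pair of chunks shares 2 tokens.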
## Tokenizer selection
```rust
// auto-detect the encoding from a model name
let chunks = chunk(text).model("gpt-4o").split();

// or specify an encoding directly
let chunks = chunk(text).encoding("cl100k_base").split();

// default: o200k_base (GPT-4o family)
```
## License
MIT