# kiru 🗡️

*Cut through text at the speed of light*
A blazingly fast text chunking library for Rust, designed for RAG applications and large-scale text processing.
## Why kiru?
When building RAG (Retrieval-Augmented Generation) systems, you need to chunk documents fast. Whether you're processing millions of documents for your vector database or streaming real-time data, kiru delivers:
- ⚡ **Lightning Fast**: Process 100MB+ files at 500+ MB/s
- 🎯 **UTF-8 Safe**: Handles multi-byte characters correctly, never breaking in the middle of a character
- 🔄 **Zero-Copy Streaming**: Process gigabyte files with constant memory usage
- 🚀 **Parallel Processing**: Chunk multiple sources concurrently with Rayon
- 🎨 **Flexible Strategies**: Chunk by bytes or characters, your choice
## Benchmarks
```
file_chunking_by_bytes/4kb_chunk
                        time:   [195.32 ms 195.88 ms 196.44 ms]
                        thrpt:  [509.11 MB/s 510.56 MB/s 512.03 MB/s]

string_chunking_by_characters/4k_chars
                        time:   [412.45 ms 413.21 ms 413.97 ms]
                        thrpt:  [241.57 MB/s 242.01 MB/s 242.46 MB/s]
```
*Benchmarked on 100MB text files with 4KB chunks and 10% overlap.*
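These are Criterion-style reports (lower bound, estimate, and upper bound); if the repository ships its benchmark suite, running `cargo bench` from a checkout should reproduce them on your hardware.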
## Quick Start

Add kiru to your `Cargo.toml`:

```toml
[dependencies]
kiru = "0.1"
```
### Basic Usage
```rust
use kiru::BytesChunker;

// Chunk a string: 1KB chunks, 128 bytes of overlap
let text = std::fs::read_to_string("data/corpus.txt")?;
let chunker = BytesChunker::new(1024, 128)?;
let chunks: Vec<_> = chunker
    .chunk_string(&text)
    .collect();

// Chunk a file (streaming, constant memory)
use kiru::Stream; // type name assumed; check the crate docs

let chunker = BytesChunker::new(1024, 128)?;
let stream = Stream::from_source("data/corpus.txt")?;
for chunk in chunker.chunk_stream(stream) {
    // process each chunk here
}
```
## Advanced: Parallel Processing
```rust
use kiru::{by_bytes, Source}; // item names assumed; shapes follow the snippets above

// Process multiple sources in parallel
let sources: Vec<Source> = vec!["docs/a.md", "docs/b.md", "https://example.com/page"]
    .into_iter()
    .map(Source::from)
    .collect();

let chunker = by_bytes(1024, 128);

// Stream chunks as they're processed in parallel
let chunks = chunker.on_sources_par_stream(sources)?;
for chunk in chunks {
    // chunks arrive as soon as any worker finishes one
}
```
## Chunking Strategies

### BytesChunker

- Chunks by byte count while always respecting UTF-8 boundaries (see the sketch below)
- Perfect when you need consistent memory usage
- Ideal for embeddings with token limits
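A minimal sketch of the UTF-8 guarantee, reusing the `BytesChunker::new(max_bytes, overlap)` shape from Quick Start and assuming chunks come back as string slices (values illustrative):

```rust
use kiru::BytesChunker;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Every character here is 3 bytes in UTF-8, so slicing at a fixed byte
    // offset would usually land in the middle of a character.
    let text = "こんにちは世界".repeat(200);

    let chunker = BytesChunker::new(64, 8)?; // 64-byte chunks, 8-byte overlap
    for chunk in chunker.chunk_string(&text) {
        // At most 64 bytes, and the boundary is pulled back to the nearest
        // character boundary rather than splitting a character.
        assert!(chunk.len() <= 64);
        assert_eq!(chunk.len() % 3, 0); // every char in this text is 3 bytes
    }
    Ok(())
}
```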
### CharactersChunker

- Chunks by character count (grapheme clusters)
- Ensures exact character counts regardless of byte representation (see the sketch below)
- Best for character-limited APIs or display purposes
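A companion sketch, assuming `CharactersChunker` mirrors the `new(max_chars, overlap)` shape (the text uses precomposed characters, so character and grapheme counts coincide here):

```rust
use kiru::CharactersChunker;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "naïve café" is 10 characters but 12 bytes: character and byte counts
    // diverge as soon as the text leaves plain ASCII.
    let text = "naïve café ".repeat(100);

    let chunker = CharactersChunker::new(100, 10)?; // 100-char chunks, 10-char overlap
    for chunk in chunker.chunk_string(&text) {
        // The character budget is exact even though byte lengths vary.
        assert!(chunk.chars().count() <= 100);
    }
    Ok(())
}
```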
## Source Types

kiru can chunk from multiple sources (one of each sketched below):

- **Files**: local filesystem paths
- **HTTP/HTTPS**: web pages and APIs
- **Strings**: in-memory text
- **Glob patterns**: multiple files matching a pattern
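Constructing one of each might look like this; the `Source` constructors shown are illustrative assumptions, not confirmed API:

```rust
use kiru::Source; // constructor names below are assumptions

let sources = vec![
    Source::from("notes/meeting.txt"),              // local file path
    Source::from("https://example.com/handbook"),   // HTTP/HTTPS URL
    Source::from_string("in-memory text to chunk"), // string
    Source::from_glob("knowledge_base/**/*.md"),    // glob pattern
];
```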
## Use Cases for RAG

### Vector Database Ingestion
```rust
// Process an entire knowledge base (constructor shapes as above)
let chunker = by_bytes(1024, 128);
let sources = vec![Source::from_glob("knowledge_base/**/*.md")];

let chunks = chunker.on_sources_par_stream(sources)?;

// Send to your vector DB
for chunk in chunks {
    // embed the chunk and upsert it into your index
}
```
### Real-time Document Processing
```rust
// Stream-process documents as they arrive
let chunker = BytesChunker::new(1024, 128)?;

// Process without loading the entire file into memory
let stream = Stream::from_source("incoming/latest.txt")?;
for chunk in chunker.chunk_stream(stream) {
    // hand each chunk to the rest of your pipeline
}
```
## Performance Tips

- Use parallel processing for multiple files: `on_sources_par_stream()`
- Tune chunk size to your embedding model's context window (see the sizing sketch after this list)
- Adjust overlap to balance context preservation against storage cost
- Use `BytesChunker` for maximum throughput
- Stream large files instead of loading them into memory
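For the sizing tip, a rough heuristic; the ~4 bytes-per-token ratio is an assumption that varies by tokenizer and language:

```rust
use kiru::BytesChunker;

// Assumption: English prose averages roughly 4 bytes per token, so a byte
// budget of 4x the token budget keeps chunks inside the model's window.
let max_tokens: usize = 512;          // embedding model's per-chunk budget
let chunk_bytes = max_tokens * 4;     // ~2048 bytes per chunk
let overlap_bytes = chunk_bytes / 10; // 10% overlap, as in the benchmarks

let chunker = BytesChunker::new(chunk_bytes, overlap_bytes)?;
```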
## Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

## License

MIT License - see LICENSE for details.
Built with ❤️ for the RAG community