# kiru ⚡🗡️
> **Cut through text at the speed of light**
The fastest text chunking library for RAG applications. Available for both Rust and Python.
[](https://crates.io/crates/kiru)
[](https://pypi.org/project/kiru/)
[](https://docs.rs/kiru)
[](https://opensource.org/licenses/MIT)
## What is kiru?
kiru is a high-performance text chunking library designed for modern RAG (Retrieval-Augmented Generation) systems. When you need to split millions of documents for vector databases or process streaming data in real-time, kiru delivers unmatched speed without sacrificing correctness.
### Key Features
- **⚡ Blazing Fast (Python)**: 1000+ MB/s throughput for bytes, 300+ MB/s for characters
- **🎯 UTF-8 Safe**: Never breaks multi-byte characters or emoji
- **💾 Memory Efficient**: Stream gigabyte files with constant memory usage
- **🚀 Parallel Processing**: Utilize all CPU cores automatically
- **🔌 Multiple Sources**: Files, URLs, strings, and glob patterns
- **🛠️ Flexible Strategies**: Chunk by bytes or characters
- **🦀 Rust Core**: Rust performance and memory safety
- **🐍 Python Bindings**: Pythonic API for ease of use
## Performance
**Benchmarked on 1MB text file, 1MB chunks, 1KB overlap:**
| **kiru (Rust)** | bytes | string | 0.23 | - | **4,370** |
| **kiru (Python)** | bytes | string | 0.71 | 2.9 | **1,408** |
| **kiru (Python)** | chars | string | 3.13 | 2.9 | **319** |
| LangChain | chars | string | 2,982 | 18.6 | 0.34 |
**kiru is 4,000x faster than LangChain for byte chunking and 940x faster for character chunking!**
Key insights:
- **Rust native performance**: Up to 4,370 MB/s for byte chunking
- **Python bindings overhead**: Still 1,400+ MB/s, beating all pure Python alternatives
- **Character-aware chunking**: 300+ MB/s while respecting grapheme boundaries
- **Memory efficient**: Uses 6x less memory than LangChain
---
## Quick Start
### Python 🐍
```bash
pip install kiru
```
```python
from kiru import Chunker
# Create a chunker
chunker = Chunker.by_bytes(
chunk_size=1024, # 1KB chunks
overlap=128 # 128 bytes overlap
)
# Chunk text
chunks = chunker.on_string("Your text here...").all()
# Chunk files in parallel
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
for chunk in chunker.on_sources_par(sources):
process(chunk)
```
### Rust 🦀
Add to your `Cargo.toml`:
```toml
[dependencies]
kiru = "0.1"
```
```rust
use kiru::{BytesChunker, Chunker};
// Create a chunker
let chunker = BytesChunker::new(1024, 128)?;
// Chunk text
let chunks: Vec<String> = chunker
.chunk_string("Your text here...".to_string())
.collect();
// Stream large files
use kiru::{Source, StreamType};
let stream = StreamType::from_source(&Source::File("huge.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
process(chunk);
}
```
---
## Use Cases
### Building RAG Systems
```python
# Perfect for vector database ingestion
chunker = Chunker.by_bytes(512, 50) # Tuned for embedding models
documents = ["glob://knowledge_base/**/*.md"]
chunks = chunker.on_sources_par(documents, channel_size=10000)
for chunk in chunks:
embedding = model.encode(chunk)
vector_db.insert(chunk, embedding)
```
### Real-time Processing
```python
# Stream processing without memory overhead
for chunk in chunker.on_file("10GB_file.txt"):
# Each chunk generated on-demand
send_to_queue(chunk)
```
### Parallel Document Processing
```rust
// Process hundreds of documents concurrently
use kiru::{ChunkerBuilder, ChunkerEnum};
let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
chunk_size: 4096,
overlap: 512,
});
let sources = vec!["glob://docs/**/*.txt"];
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
```
---
## Chunking Strategies
### Bytes Chunking
- Splits on byte boundaries while respecting UTF-8
- Fastest performance (1000+ MB/s in Rust, 1400+ MB/s in Python)
- Ideal for token-limited models and consistent memory usage
### Characters Chunking
- Splits on character (grapheme) boundaries
- Ensures exact character counts regardless of byte representation
- Perfect for character-limited APIs (300+ MB/s in Python)
---
## API Reference
### Python API
#### Creating Chunkers
```python
from kiru import Chunker
# Byte-based chunking
chunker = Chunker.by_bytes(chunk_size=1024, overlap=128)
# Character-based chunking
chunker = Chunker.by_characters(chunk_size=1000, overlap=100)
```
#### Input Sources
```python
# Single string
chunks = chunker.on_string("text...").all()
# Single file
chunks = chunker.on_file("/path/to/file.txt").all()
# HTTP/HTTPS URL
chunks = chunker.on_http("https://example.com/page").all()
# Multiple sources (serial)
sources = ["file://doc1.txt", "https://example.com/page", "glob://*.md"]
chunks = chunker.on_sources(sources).all()
# Multiple sources (parallel)
chunks = chunker.on_sources_par(sources, channel_size=1000).all()
# Or iterate lazily
for chunk in chunker.on_sources_par(sources):
process(chunk)
```
#### Source Prefixes
- `file://path/to/file.txt` - Local files
- `http://example.com` or `https://example.com` - URLs
- `text://Inline text content` - Raw text strings
- `glob://*.md` - Glob patterns
- No prefix - Treated as raw text
### Rust API
#### Creating Chunkers
```rust
use kiru::{BytesChunker, CharactersChunker, Chunker};
// Byte-based chunking
let chunker = BytesChunker::new(1024, 128)?;
// Character-based chunking
let chunker = CharactersChunker::new(1000, 100)?;
```
#### Basic Usage
```rust
use kiru::Chunker;
// Chunk a string
let chunks: Vec<String> = chunker
.chunk_string("Your text here".to_string())
.collect();
// Stream a file
use kiru::{Source, StreamType};
let stream = StreamType::from_source(&Source::File("file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
// Process chunk
}
```
#### Advanced Usage
```rust
use kiru::{ChunkerBuilder, ChunkerEnum, Source, HigherOrderSource, SourceGenerator};
// Create chunker with builder pattern
let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
chunk_size: 4096,
overlap: 512,
});
// Single source
let chunks = chunker.on_source(Source::File("doc.txt".to_string()))?;
// Multiple sources (serial)
let sources = vec![
Source::File("doc1.txt".to_string()),
Source::Http("https://example.com".to_string()),
];
let chunks = chunker.on_sources(sources)?;
// Multiple sources (parallel) - returns Vec
let chunks: Vec<String> = chunker.on_sources_par(sources)?;
// Multiple sources (parallel streaming) - returns iterator
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
for chunk in chunks {
// Process as they arrive
}
// Using glob patterns
let sources = vec![HigherOrderSource::SourceGenerator(
SourceGenerator::Glob("**/*.md".to_string())
)];
let flattened = HigherOrderSource::into_flattened_sources(sources)?;
```
---
## Architecture
```
┌─────────────────────────────────────────┐
│ Application Layer │
│ (Python or Rust Application) │
├─────────────────────────────────────────┤
│ kiru-py (PyO3 Bindings) │
│ [Python only] │
├─────────────────────────────────────────┤
│ kiru-core (Rust Library) │
│ │
│ ┌──────────┬───────────┐ │
│ │ Chunkers │ Streaming │ │
│ │ Engine │ Engine │ │
│ └──────────┴───────────┘ │
└─────────────────────────────────────────┘
```
---
## Project Structure
```
kiru/
├── README.md # This file (shared documentation)
├── kiru-core/ # Rust implementation
│ ├── src/ # Core chunking algorithms
│ │ ├── bytes_chunker.rs
│ │ ├── characters_chunker.rs
│ │ ├── chunker.rs # Builder pattern & parallel processing
│ │ └── stream.rs # File/HTTP streaming
│ ├── benches/ # Criterion benchmarks
│ └── tests/ # Property-based tests
├── kiru-py/ # Python bindings (PyO3)
│ ├── src/lib.rs # Python wrapper
│ └── python/ # Python tests & benchmarks
└── utils/ # Version management scripts
```
---
## Streaming & Memory Efficiency
**kiru's killer feature: true streaming with constant memory usage.**
Unlike traditional chunkers that load entire files into memory, kiru processes data as it arrives using an intelligent buffering system. This means you can chunk **gigabyte-sized files** with minimal RAM usage.
### How Streaming Works
```
File/HTTP Source → Read Blocks (8KB) → UTF-8 Buffer → Chunk Iterator → Your Code
↓ ↓
As needed Constant size
```
**Key advantages:**
1. **Constant Memory**: Process 10GB files with ~10MB RAM
2. **Immediate Results**: First chunks available instantly, no waiting for full file load
3. **Works Everywhere**: Local files, HTTP/HTTPS streams, any data source
4. **UTF-8 Safe**: Buffer maintains character boundaries automatically
### Python Examples
```python
from kiru import Chunker
chunker = Chunker.by_bytes(chunk_size=4096, overlap=512)
# ⚡ Stream a 10GB file - uses only ~10MB RAM
for chunk in chunker.on_file("huge_dataset.txt"):
# Process chunk immediately as it arrives
vector_db.insert(chunk)
# No waiting, no memory explosion!
# ⚡ Stream from HTTP - process as data downloads
for chunk in chunker.on_http("https://example.com/large_document.txt"):
process(chunk)
# Chunks ready while download continues
# ⚡ Stream multiple sources in parallel
sources = [
"file://10gb_file1.txt",
"https://example.com/doc.txt",
"file://10gb_file2.txt"
]
for chunk in chunker.on_sources_par(sources, channel_size=1000):
# All sources stream in parallel
# Memory stays constant regardless of file sizes
send_to_queue(chunk)
```
### Rust Examples
```rust
use kiru::{BytesChunker, Chunker, Source, StreamType};
let chunker = BytesChunker::new(4096, 512)?;
// ⚡ Stream a massive file with constant memory
let stream = StreamType::from_source(&Source::File("10gb_file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
// Process immediately, no memory buildup
vector_db.insert(chunk);
}
// ⚡ Stream from HTTP as data arrives
let stream = StreamType::from_source(&Source::Http("https://example.com/doc.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
process(chunk);
}
```
### Memory Comparison
Processing a **1GB file** with 4KB chunks:
| **kiru** | **~10 MB** | ❌ No | ✅ Yes |
| LangChain | **1000+ MB** | ✅ Yes | ❌ No |
| tiktoken | **1000+ MB** | ✅ Yes | ❌ No |
**Result**: kiru uses **100x less memory** while being **4,000x faster**!
---
## Development
### Setup
```bash
# Clone repository
git clone https://github.com/yourusername/kiru.git
cd kiru
# Run all tests
cargo test --workspace
# Run Rust benchmarks
cd kiru-core
cargo bench
# Build Python package
cd ../kiru-py
pip install maturin
maturin develop --release
# Run Python tests
pip install pytest hypothesis
pytest python/test.py
# Run Python benchmarks
python python/bench.py
```
### Running Benchmarks
```bash
# Rust benchmarks
cd kiru-core
cargo bench
# Python benchmarks
cd kiru-py
python python/bench.py
```
---
## Performance Tips
1. **Use byte chunking** for maximum throughput (1000+ MB/s)
2. **Use character chunking** when exact character counts matter (300+ MB/s)
3. **Enable parallel processing** with `on_sources_par()` for multiple files
4. **Tune chunk size** based on your embedding model's context window
5. **Adjust overlap** to balance context preservation and storage
6. **Stream large files** to maintain constant memory usage
---
## Why "kiru"?
"Kiru" (切る) is Japanese for "to cut" - reflecting the library's purpose of cutting text into chunks at lightning speed ⚡🗡️
---
## Contributing
We welcome contributions! Please check out our [Contributing Guide](CONTRIBUTING.md) for guidelines.
## License
MIT License - see [LICENSE](LICENSE) for details.
---
## Credits
Built with:
- [PyO3](https://pyo3.rs) - Rust bindings for Python
- [Rayon](https://github.com/rayon-rs/rayon) - Data parallelism for Rust
- [maturin](https://www.maturin.rs) - Build and publish Rust Python extensions
---
**Ready to cut through text at the speed of light?**
- 🐍 **Python**: `pip install kiru`
- 🦀 **Rust**: Add `kiru = "0.1"` to Cargo.toml