kiru 🗡️

Cut through text at the speed of light

A blazingly fast text chunking library for Rust, designed for RAG applications and large-scale text processing.


Why kiru?

When building RAG (Retrieval-Augmented Generation) systems, you need to chunk documents fast. Whether you're processing millions of documents for your vector database or streaming real-time data, kiru delivers:

  • ⚡ Lightning Fast: Process 100MB+ files at 500+ MB/s
  • 🎯 UTF-8 Safe: Handles multi-byte characters correctly, never breaking in the middle of a character (see the sketch after this list)
  • 🔄 Zero-Copy Streaming: Process gigabyte files with constant memory usage
  • 🚀 Parallel Processing: Chunk multiple sources concurrently with Rayon
  • 🎨 Flexible Strategies: Chunk by bytes or characters, your choice
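
As a quick illustration of the UTF-8 guarantee, here is a minimal sketch (it assumes a zero overlap is accepted; the full API is shown under Basic Usage below):

use kiru::{BytesChunker, Chunker};

// "é" is two bytes in UTF-8, so a naive 3-byte split would cut one in half;
// kiru adjusts the chunk boundary so it never lands inside a character.
let chunker = BytesChunker::new(3, 0)?; // 3-byte chunks, no overlap (assumed valid)
let chunks: Vec<String> = chunker
    .chunk_string("ééééé".to_string())
    .collect();
assert!(chunks.iter().all(|c| c.chars().all(|ch| ch == 'é')));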

Benchmarks

file_chunking_by_bytes/4kb_chunk
                        time:   [195.32 ms 195.88 ms 196.44 ms]
                        thrpt:  [509.11 MB/s 510.56 MB/s 512.03 MB/s]

string_chunking_by_characters/4k_chars  
                        time:   [412.45 ms 413.21 ms 413.97 ms]
                        thrpt:  [241.57 MB/s 242.01 MB/s 242.46 MB/s]

Benchmarked on 100MB text files with 4KB chunks and 10% overlap

Quick Start

Add to your Cargo.toml:

[dependencies]
kiru = "0.1"

Basic Usage

use kiru::{BytesChunker, Chunker};

// Chunk a string
let chunker = BytesChunker::new(1024, 128)?; // 1KB chunks, 128 bytes overlap
let chunks: Vec<String> = chunker
    .chunk_string("Your long text here...".to_string())
    .collect();

// Chunk a file (streaming, constant memory)
use kiru::{Source, StreamType};

let chunker = BytesChunker::new(4096, 512)?;
let stream = StreamType::from_source(&Source::File("huge_file.txt".to_string()))?;
for chunk in chunker.chunk_stream(stream) {
    // Process each chunk as it's generated
    send_to_vector_db(chunk);
}

Advanced: Parallel Processing

use kiru::{ChunkerBuilder, ChunkerEnum};

// Process multiple sources in parallel
let sources: Vec<String> = vec![
    "file://document1.txt",
    "https://example.com/page",
    "glob://*.md",
]
.into_iter()
.map(|s| s.to_string())
.collect();

let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
    chunk_size: 4096,
    overlap: 512,
});

// Stream chunks as they're processed in parallel
let chunks = chunker.on_sources_par_stream(sources, 1000)?;
for chunk in chunks {
    // Chunks arrive as soon as they're ready from any source
    process_chunk(chunk);
}

Chunking Strategies

BytesChunker

  • Chunks by byte count while respecting UTF-8 boundaries
  • Perfect when you need consistent memory usage
  • Ideal for embeddings with token limits

CharactersChunker

  • Chunks by character count (grapheme clusters)
  • Ensures exact character counts regardless of byte representation
  • Best for character-limited APIs or display purposes
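
A minimal sketch of character-based chunking. Note the constructor is an assumption here, mirrored from BytesChunker::new(chunk_size, overlap); check the docs for the exact signature:

use kiru::{CharactersChunker, Chunker};

let chunker = CharactersChunker::new(100, 10)?; // 100 chars, 10 chars overlap (assumed)
let chunks: Vec<String> = chunker
    .chunk_string("Grüße aus München! 🗡️".to_string())
    .collect();
// Each chunk holds at most 100 characters, regardless of how many bytes
// their UTF-8 encoding takes.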

Source Types

kiru can chunk from multiple sources:

  • Files: Local filesystem paths
  • HTTP/HTTPS: Web pages and APIs
  • Strings: In-memory text
  • Glob patterns: Multiple files matching a pattern
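
Source::File is the only variant spelled out in this README; the other kinds appear as scheme-prefixed strings in the parallel API. A rough sketch combining both forms:

use kiru::{BytesChunker, Chunker, Source, StreamType};

// Streaming a single local file, as in Basic Usage:
let chunker = BytesChunker::new(2048, 256)?;
let stream = StreamType::from_source(&Source::File("notes.txt".to_string()))?;
println!("{} chunks", chunker.chunk_stream(stream).count());

// The parallel API takes plain strings with a scheme prefix instead:
let sources: Vec<String> = vec![
    "file://notes.txt".to_string(),
    "https://example.com/page".to_string(),
    "glob://docs/**/*.md".to_string(),
];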

Use Cases for RAG

Vector Database Ingestion

// Process an entire knowledge base
let chunker = ChunkerBuilder::by_bytes(ChunkerEnum::Bytes {
    chunk_size: 512,  // Optimized for embedding models
    overlap: 50,      // Maintain context between chunks
});

let sources = vec!["glob://./knowledge_base/**/*.md".to_string()];
let chunks = chunker.on_sources_par_stream(sources, 10000)?;

// Send to your vector DB
for chunk in chunks {
    let embedding = embed_text(&chunk);
    vector_db.insert(chunk, embedding);
}

Real-time Document Processing

// Stream-process documents as they arrive
let chunker = BytesChunker::new(1024, 128)?;

// Process without loading entire file into memory
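// (`new_document` holds the incoming file's path as a String)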
let stream = StreamType::from_source(&Source::File(new_document))?;
for chunk in chunker.chunk_stream(stream) {
    update_index(chunk);
}

Performance Tips

  1. Use parallel processing for multiple files: on_sources_par_stream()
  2. Tune chunk size to your embedding model's context window (see the sketch below)
  3. Adjust overlap to balance context preservation against storage
  4. Use BytesChunker for maximum throughput
  5. Stream large files instead of loading them into memory
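
For tips 2 and 3, a rough starting point (the numbers are illustrative heuristics, not kiru defaults):

use kiru::BytesChunker;

// ~4 bytes per token is a common rule of thumb for English text.
let context_tokens = 512;            // your embedding model's window
let chunk_size = context_tokens * 4; // ≈ 2 KB per chunk
let overlap = chunk_size / 10;       // 10% overlap as a baseline
let chunker = BytesChunker::new(chunk_size, overlap)?;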

Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

License

MIT License - see LICENSE for details.


Built with ❤️ for the RAG community