Tokio Cast
High-performance, structure-aware code chunking for RAG pipelines, powered by actors.
Tokio Cast is a Rust implementation of the cAST algorithm (Zhang et al., EMNLP 2025) for splitting source code into semantically meaningful chunks. It uses Tree-sitter for language-agnostic AST parsing and tokio-actors for a concurrent three-stage pipeline with bounded-mailbox backpressure.
Why tokio-cast?
Structure-Aware Chunking
Chunks follow AST boundaries, not arbitrary line counts. Functions, structs, and impl blocks stay intact. When a node exceeds the size limit, cAST recursively splits it at child boundaries, then merges small adjacent siblings back together so each chunk stays semantically coherent.
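The split-then-merge idea can be sketched on a toy tree. This is a minimal illustration, not the crate's implementation: ToyNode and chunk_spans are hypothetical names, size is measured as span length in bytes, and any text between a parent's start and its first child is ignored for brevity.

```rust
use std::ops::Range;

/// Hypothetical, simplified AST node: a byte span plus child nodes.
#[derive(Debug, Clone)]
struct ToyNode {
    span: Range<usize>,
    children: Vec<ToyNode>,
}

/// Split `node` into byte ranges of at most `max` bytes where possible.
fn chunk_spans(node: &ToyNode, max: usize) -> Vec<Range<usize>> {
    // A node that fits, or a leaf that cannot be split further, is one chunk.
    if node.span.len() <= max || node.children.is_empty() {
        return vec![node.span.clone()];
    }
    // Otherwise recurse into the children and greedily merge adjacent pieces
    // for as long as the merged span still fits within `max`.
    let mut merged: Vec<Range<usize>> = Vec::new();
    for child in &node.children {
        for piece in chunk_spans(child, max) {
            if let Some(last) = merged.last_mut() {
                if piece.end - last.start <= max {
                    last.end = piece.end;
                    continue;
                }
            }
            merged.push(piece);
        }
    }
    merged
}
```

With a 100-byte impl block holding three functions at 0..30, 30..60, and 60..100, a limit of 70 yields two chunks (0..60 and 60..100), while a limit of 100 keeps the whole block as a single chunk.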
// Line-based chunking breaks mid-function.
// cAST keeps this entire impl block as one chunk:
Actor Pipeline
Three actors run concurrently with bounded mailboxes:
Parser --> Chunker --> Indexer
(AST)      (split)     (metadata)
Each stage applies backpressure naturally. When the Chunker falls behind, its mailbox fills up and the Parser waits to send instead of queuing more work. No unbounded queues, no OOM surprises.
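The backpressure behaviour can be illustrated with the standard library alone. The sketch below is an analogy rather than the crate's code: it uses std threads and std::sync::mpsc::sync_channel in place of tokio-actors, run_pipeline is a hypothetical name, and blank-line splitting stands in for AST-based chunking.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Run a three-stage pipeline over `sources`; every mailbox holds at most
/// `capacity` messages. Returns (chunk count, total chunk bytes) per source.
fn run_pipeline(sources: Vec<String>, capacity: usize) -> Vec<(usize, usize)> {
    // Bounded channels stand in for the actors' mailboxes.
    let (to_chunker, chunker_inbox) = sync_channel::<String>(capacity);
    let (to_indexer, indexer_inbox) = sync_channel::<Vec<String>>(capacity);

    // Stage 1 ("parser"): `send` blocks while the chunker's mailbox is full.
    let parser = thread::spawn(move || {
        for src in sources {
            if to_chunker.send(src).is_err() {
                break; // the downstream stage has gone away
            }
        }
    });

    // Stage 2 ("chunker"): blank-line splitting as a stand-in for cAST.
    let chunker = thread::spawn(move || {
        for src in chunker_inbox {
            let chunks: Vec<String> = src.split("\n\n").map(str::to_owned).collect();
            if to_indexer.send(chunks).is_err() {
                break;
            }
        }
    });

    // Stage 3 ("indexer"): runs on the calling thread and records metadata.
    let indexed: Vec<(usize, usize)> = indexer_inbox
        .iter()
        .map(|chunks| (chunks.len(), chunks.iter().map(String::len).sum()))
        .collect();

    parser.join().expect("parser thread panicked");
    chunker.join().expect("chunker thread panicked");
    indexed
}
```

Because each channel is bounded, a slow stage makes the upstream send call wait, so memory use is capped by the mailbox capacity rather than by the number of input files.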
Zero-Copy Parsing
Tree-sitter parses the source bytes in place and produces a concrete syntax tree. Because Tree-sitter's Node type is !Send, Tokio Cast converts it into a Send-safe AstNode representation before it crosses an actor boundary. Zero unnecessary allocations.
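The conversion step can be sketched without Tree-sitter. In this hypothetical example, RawNode uses Rc to make it !Send (standing in for the parser's node type), and OwnedNode plays the role of a Send-safe representation; neither type is taken from the crate.

```rust
use std::ops::Range;
use std::rc::Rc;

/// Stand-in for a parser-owned node. The `Rc` makes it `!Send`.
struct RawNode {
    kind: Rc<str>,
    span: Range<usize>,
    children: Vec<RawNode>,
}

/// Owned, `Send`-safe mirror that can cross a thread or actor boundary.
#[derive(Debug, Clone, PartialEq)]
struct OwnedNode {
    kind: String,
    byte_range: Range<usize>,
    children: Vec<OwnedNode>,
}

impl From<&RawNode> for OwnedNode {
    /// Deep-copy the tree shape into owned data.
    fn from(raw: &RawNode) -> Self {
        OwnedNode {
            kind: raw.kind.to_string(),
            byte_range: raw.span.clone(),
            children: raw.children.iter().map(OwnedNode::from).collect(),
        }
    }
}

/// Compile-time check: this only type-checks if `T` is `Send`.
fn assert_send<T: Send>() {}

/// Count the nodes in an owned tree.
fn count(node: &OwnedNode) -> usize {
    1 + node.children.iter().map(count).sum::<usize>()
}

/// Convert on the current thread, then hand the owned tree to a worker.
fn count_nodes_on_worker(raw: &RawNode) -> usize {
    assert_send::<OwnedNode>();
    let owned = OwnedNode::from(raw);
    std::thread::spawn(move || count(&owned))
        .join()
        .expect("worker panicked")
}
```

Replacing OwnedNode with RawNode in the assert_send call would fail to compile, which is the property that forces the conversion to happen before the hand-off.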
Language Agnostic
Any Tree-sitter grammar works. Plug in a grammar, get structure-aware chunks. Tokio Cast currently ships with Rust support; adding a new language takes one Language enum variant and a grammar dependency.
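As a rough picture of what "one enum variant plus a grammar dependency" could look like, here is a hypothetical registry. Lang, from_extension, and grammar_crate are illustrative names rather than the crate's API, and the Python variant is an assumed addition; only Rust support is stated above.

```rust
/// Hypothetical language registry; only `Rust` is confirmed by the README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    Rust,
    Python, // hypothetical new variant
}

impl Lang {
    /// Pick a language from a file extension, if one is registered.
    fn from_extension(ext: &str) -> Option<Lang> {
        match ext {
            "rs" => Some(Lang::Rust),
            "py" => Some(Lang::Python),
            _ => None,
        }
    }

    /// Name of the grammar crate each variant would depend on.
    fn grammar_crate(self) -> &'static str {
        match self {
            Lang::Rust => "tree-sitter-rust",
            Lang::Python => "tree-sitter-python",
        }
    }
}
```

Keeping both mappings as exhaustive matches means the compiler flags every place that needs updating when a variant is added.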
Serde-Ready Output
Chunk and IndexEntry derive Serialize/Deserialize. Pipe output straight to your embedding API, vector database, or JSON storage.
let entries = cast.chunk_file(path).await?;
let json = serde_json::to_string_pretty(&entries)?;
Quick Start
use ;
async
Persistent Pipeline
For repeated processing, keep the actor pipeline alive:
let cast = TokioCast::builder()
    .language(Language::Rust)
    .build()?;
let pipeline = cast.pipeline().await?;
let entries = pipeline.process_source(src, path).await?;
// ... process more files without respawning actors ...
pipeline.shutdown().await?;
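The spawn-once, reuse, then shut down lifecycle can be sketched with std threads. This is an analogy under stated assumptions, not the crate's pipeline: ToyPipeline and Request are hypothetical, a single worker stands in for the three actors, and blank-line splitting stands in for chunking.

```rust
use std::sync::mpsc::{channel, sync_channel, Sender, SyncSender};
use std::thread::{self, JoinHandle};

/// A request to the worker: the source text plus a channel for the reply.
struct Request {
    source: String,
    reply: Sender<Vec<String>>,
}

/// Hypothetical long-lived pipeline handle (std threads, not tokio-actors).
struct ToyPipeline {
    inbox: SyncSender<Request>,
    worker: JoinHandle<usize>,
}

impl ToyPipeline {
    /// Spawn the worker once; it serves requests until the inbox is closed.
    fn spawn(capacity: usize) -> Self {
        let (inbox, requests) = sync_channel::<Request>(capacity);
        let worker = thread::spawn(move || {
            let mut served = 0usize;
            for req in requests {
                // Blank-line splitting stands in for AST-based chunking.
                let chunks: Vec<String> =
                    req.source.split("\n\n").map(str::to_owned).collect();
                let _ = req.reply.send(chunks); // the caller may have given up
                served += 1;
            }
            served
        });
        ToyPipeline { inbox, worker }
    }

    /// Send one source through the live worker and wait for its chunks.
    fn process_source(&self, source: &str) -> Option<Vec<String>> {
        let (reply, response) = channel();
        let request = Request { source: source.to_owned(), reply };
        self.inbox.send(request).ok()?;
        response.recv().ok()
    }

    /// Close the mailbox, then wait for the worker to drain it and exit.
    /// Returns how many requests the worker served.
    fn shutdown(self) -> usize {
        drop(self.inbox);
        self.worker.join().expect("worker panicked")
    }
}
```

The worker thread is created once in spawn and reused for every process_source call, which is the cost the persistent pipeline is meant to amortize.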
API at a Glance
TokioCast Methods
| Method | Description |
|---|---|
| `TokioCast::builder()` | Start the factory builder |
| `builder.language(lang)` | Set the Tree-sitter grammar |
| `builder.max_chunk_size(n)` | Max bytes per chunk (default: 2000) |
| `builder.build()` | Construct the TokioCast instance |
| `chunk_source(src)` | Sync: chunk a source string |
| `chunk_file(path)` | Async: parse and chunk a single file |
| `chunk_dir(path)` | Async: chunk all files in a directory (actor pipeline) |
| `pipeline()` | Async: spawn a persistent actor pipeline |
Pipeline Methods
| Method | Description |
|---|---|
| `process_source(src, path)` | Send source through the live pipeline |
| `shutdown()` | Gracefully stop all pipeline actors |
Chunk Fields
| Field | Type | Description |
|---|---|---|
| `content` | `String` | The chunk text |
| `byte_range` | `Range<usize>` | Byte offset in original source |
| `node_kind` | `String` | AST node type (e.g. "function_item") |
| `depth` | `usize` | Nesting depth in AST |
IndexEntry Fields
| Field | Type | Description |
|---|---|---|
| `path` | `String` | Source file path |
| `chunks` | `Vec<Chunk>` | Ordered chunks for this file |
| `total_bytes` | `usize` | Original source size |
| `language` | `Language` | Tree-sitter grammar used |
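To make the field tables concrete, the sketch below mirrors them as plain structs and checks two properties a consumer might rely on. Both properties are assumptions, not documented guarantees: that content equals the source slice at byte_range, and that chunks are ordered and non-overlapping. The language field is simplified to a String, and chunks_match_source and describe are hypothetical helpers.

```rust
use std::ops::Range;

/// Hypothetical mirror of the `Chunk` fields listed above.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    content: String,
    byte_range: Range<usize>,
    node_kind: String,
    depth: usize,
}

/// Hypothetical mirror of `IndexEntry`; `language` is a plain string here.
#[derive(Debug, Clone, PartialEq)]
struct IndexEntry {
    path: String,
    chunks: Vec<Chunk>,
    total_bytes: usize,
    language: String,
}

/// True if every chunk's text matches the source at its byte range and the
/// chunks appear in source order without overlapping (assumed invariants).
fn chunks_match_source(entry: &IndexEntry, source: &str) -> bool {
    entry.total_bytes == source.len()
        && entry
            .chunks
            .iter()
            .all(|c| source.get(c.byte_range.clone()) == Some(c.content.as_str()))
        && entry
            .chunks
            .windows(2)
            .all(|w| w[0].byte_range.end <= w[1].byte_range.start)
}

/// One human-readable line per chunk, e.g. for logging before embedding.
fn describe(entry: &IndexEntry) -> Vec<String> {
    entry
        .chunks
        .iter()
        .map(|c| {
            format!(
                "[{}] {}:{}..{} {} (depth {})",
                entry.language, entry.path, c.byte_range.start, c.byte_range.end,
                c.node_kind, c.depth
            )
        })
        .collect()
}
```

A check like chunks_match_source is cheap to run before writing entries to a vector database, since stale byte ranges make retrieved chunks hard to map back to their files.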
Examples
| Example | Description |
|---|---|
| `chunk_file` | Chunk a single Rust source file |
| `chunk_dir` | Chunk all files in a directory |
| `pipeline` | Persistent pipeline with multiple inputs |
| `json_output` | Serialize chunks to JSON |
Run with:
cargo run --example chunk_file
License
MIT OR Apache-2.0
Structure-aware code chunking for Rust developers who need semantic RAG, not line-count heuristics.
Author
Saddam Uwejan (Sam) - Rust systems engineer specializing in concurrent systems and production infrastructure.
Building high-performance, production-ready Rust libraries for real-world problems.