
Tokio Cast


High-performance, structure-aware code chunking for RAG pipelines, powered by actors.

Tokio Cast is a Rust implementation of the cAST algorithm (Zhang et al., EMNLP 2025) for splitting source code into semantically meaningful chunks. It uses Tree-sitter for language-agnostic AST parsing and tokio-actors for a concurrent three-stage pipeline with bounded-mailbox backpressure.


Why tokio-cast?

Structure-Aware Chunking

Chunks follow AST boundaries, not arbitrary line counts. Functions, structs, and impl blocks stay intact. When a node exceeds the size limit, cAST recursively splits it at child boundaries, then merges adjacent small siblings back together, preserving semantic coherence.

// Line-based chunking breaks mid-function.
// cAST keeps this entire impl block as one chunk:
impl Actor for Worker {
    type Message = Job;
    type Response = ();

    async fn handle(&mut self, msg: Job, ctx: &mut ActorContext<Self>) -> ActorResult<()> {
        self.process(msg).await
    }
}
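The split-then-merge recursion can be sketched in plain Rust. This is an illustrative stand-in, not tokio-cast's internals: the simplified `Node { len, children }` type plays the role of a Tree-sitter node, and chunks are represented by their byte sizes only.

```rust
// Sketch of cAST's split-then-merge recursion over a toy node type.
struct Node {
    len: usize,          // size of this node's source text in bytes
    children: Vec<Node>, // child nodes in source order
}

/// Emit chunk sizes no larger than `max`: recurse into oversized nodes,
/// then greedily merge adjacent small siblings into a single chunk.
fn chunk(node: &Node, max: usize, out: &mut Vec<usize>) {
    if node.len <= max || node.children.is_empty() {
        out.push(node.len); // node fits (or is a leaf): emit as one chunk
        return;
    }
    let mut acc = 0; // running size of the chunk being assembled
    for child in &node.children {
        if child.len > max {
            // flush the accumulator, then recurse into the oversized child
            if acc > 0 { out.push(acc); acc = 0; }
            chunk(child, max, out);
        } else if acc + child.len > max {
            // merging would overflow the limit: close the current chunk
            out.push(acc);
            acc = child.len;
        } else {
            acc += child.len; // merge a small sibling into the current chunk
        }
    }
    if acc > 0 { out.push(acc); }
}

fn main() {
    let tree = Node {
        len: 120,
        children: vec![
            Node { len: 30, children: vec![] },
            Node { len: 30, children: vec![] },
            Node { len: 60, children: vec![
                Node { len: 25, children: vec![] },
                Node { len: 35, children: vec![] },
            ]},
        ],
    };
    let mut chunks = Vec::new();
    chunk(&tree, 50, &mut chunks);
    println!("{:?}", chunks); // -> [30, 30, 25, 35]
}
```

Note how the two 30-byte siblings are emitted as separate chunks (merging them would exceed the 50-byte cap), while the oversized 60-byte node is split at its own child boundaries.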

Actor Pipeline

Three actors run concurrently with bounded mailboxes:

  Parser --> Chunker --> Indexer
  (AST)     (split)    (metadata)

Each stage applies backpressure naturally. When the Chunker is busy, the Parser blocks on its mailbox. No unbounded queues, no OOM surprises.
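The backpressure behavior can be demonstrated with the standard library alone. The sketch below uses `std::sync::mpsc::sync_channel` as a stand-in for a bounded actor mailbox (the real crate uses tokio-actors); the point is only that a full downstream mailbox blocks the upstream sender instead of growing a queue.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Three-stage pipeline sketch: Parser -> Chunker -> Indexer, with
// bounded channels playing the role of actor mailboxes. When a stage
// is slow, `send` blocks its upstream stage; nothing queues unbounded.
fn main() {
    let (parser_tx, chunker_rx) = sync_channel::<String>(4);  // Parser -> Chunker
    let (chunker_tx, indexer_rx) = sync_channel::<usize>(4);  // Chunker -> Indexer

    let chunker = thread::spawn(move || {
        for src in chunker_rx {
            // "chunking": forward a size downstream; a full indexer
            // mailbox would block here, backpressuring the parser too
            chunker_tx.send(src.len()).unwrap();
        }
    });
    let indexer = thread::spawn(move || indexer_rx.iter().sum::<usize>());

    for i in 0..8 {
        parser_tx.send(format!("file-{i}")).unwrap(); // blocks when the mailbox is full
    }
    drop(parser_tx); // closing the channel lets the downstream stages finish

    chunker.join().unwrap();
    println!("total bytes indexed: {}", indexer.join().unwrap()); // -> 48
}
```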

Zero-Copy Parsing

Tree-sitter parses source bytes in-place and produces a concrete syntax tree. Tokio Cast converts the !Send tree-sitter Node types into a Send-safe AstNode representation before crossing actor boundaries. Zero unnecessary allocations.
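The conversion idea can be illustrated without tree-sitter itself. In the sketch below, `BorrowedNode<'tree>` is a hypothetical stand-in for tree-sitter's lifetime-bound `Node<'tree>`, and `AstNode` is an owned mirror; neither is tokio-cast's actual type. Because `AstNode` owns all of its data, it is automatically `Send` and can cross actor boundaries.

```rust
use std::ops::Range;

// Stand-in for tree-sitter's borrowed, lifetime-bound syntax node.
struct BorrowedNode<'tree> {
    kind: &'tree str,
    range: Range<usize>,
    children: Vec<BorrowedNode<'tree>>,
}

// Owned, Send-safe mirror that can be moved between actors.
#[derive(Debug)]
struct AstNode {
    kind: String,
    byte_range: Range<usize>,
    children: Vec<AstNode>,
}

/// Recursively copy the borrowed tree into an owned representation.
fn to_owned(node: &BorrowedNode<'_>) -> AstNode {
    AstNode {
        kind: node.kind.to_string(),
        byte_range: node.range.clone(),
        children: node.children.iter().map(to_owned).collect(),
    }
}

fn main() {
    let borrowed = BorrowedNode {
        kind: "function_item",
        range: 0..42,
        children: vec![BorrowedNode { kind: "block", range: 10..42, children: vec![] }],
    };
    let owned = to_owned(&borrowed);

    fn assert_send<T: Send>(_: &T) {}
    assert_send(&owned); // compiles only because AstNode owns its data

    println!("{} spans {:?}", owned.kind, owned.byte_range);
}
```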

Language Agnostic

Any Tree-sitter grammar works. Plug in a grammar, get structure-aware chunks. Currently ships with Rust support; adding a new language is one Language enum variant and a grammar dependency.
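A hypothetical sketch of what that extension point might look like; the method name and shape here are illustrative, not tokio-cast's real API. A new language would add one variant, its grammar crate, and a match arm.

```rust
// Illustrative Language enum: one variant per supported grammar.
#[derive(Debug, Clone, Copy)]
enum Language {
    Rust,
    // Python,  <- a new language: one variant + a grammar dependency
}

impl Language {
    /// File extensions this grammar should claim when walking a directory
    /// (hypothetical helper for illustration).
    fn extensions(&self) -> &'static [&'static str] {
        match self {
            Language::Rust => &["rs"],
        }
    }
}

fn main() {
    println!("{:?} handles {:?}", Language::Rust, Language::Rust.extensions());
}
```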

Serde-Ready Output

Chunk and IndexEntry derive Serialize/Deserialize. Pipe output straight to your embedding API, vector database, or JSON storage.

let entries = cast.chunk_file("src/lib.rs").await?;
let json = serde_json::to_string_pretty(&entries)?;

Quick Start

cargo add tokio-cast

use tokio_cast::{TokioCast, Language};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cast = TokioCast::builder()
        .language(Language::Rust)
        .max_chunk_size(2000)
        .build()?;

    // Chunk a single file
    let entries = cast.chunk_file("src/main.rs").await?;
    for entry in &entries {
        println!("{} ({} bytes, {} chunks)", entry.path, entry.total_bytes, entry.chunks.len());
    }

    // Chunk an entire directory with the actor pipeline
    let entries = cast.chunk_dir("./src").await?;
    println!("Indexed {} files", entries.len());

    Ok(())
}

Persistent Pipeline

For repeated processing, keep the actor pipeline alive:

let cast = TokioCast::builder()
    .language(Language::Rust)
    .build()?;

let pipeline = cast.pipeline().await?;

let entries = pipeline.process_source(source, "src/lib.rs").await?;
// ... process more files without respawning actors ...

pipeline.shutdown().await?;

API at a Glance

TokioCast Methods

| Method | Description |
|--------|-------------|
| `TokioCast::builder()` | Start the factory builder |
| `builder.language(lang)` | Set the Tree-sitter grammar |
| `builder.max_chunk_size(n)` | Max bytes per chunk (default: 2000) |
| `builder.build()` | Construct the `TokioCast` instance |
| `chunk_source(src)` | Sync: chunk a source string |
| `chunk_file(path)` | Async: parse and chunk a single file |
| `chunk_dir(path)` | Async: chunk all files in a directory (actor pipeline) |
| `pipeline()` | Async: spawn a persistent actor pipeline |

Pipeline Methods

| Method | Description |
|--------|-------------|
| `process_source(src, path)` | Send source through the live pipeline |
| `shutdown()` | Gracefully stop all pipeline actors |

Chunk Fields

| Field | Type | Description |
|-------|------|-------------|
| `content` | `String` | The chunk text |
| `byte_range` | `Range<usize>` | Byte offsets in the original source |
| `node_kind` | `String` | AST node type (e.g. `"function_item"`) |
| `depth` | `usize` | Nesting depth in the AST |

IndexEntry Fields

| Field | Type | Description |
|-------|------|-------------|
| `path` | `String` | Source file path |
| `chunks` | `Vec<Chunk>` | Ordered chunks for this file |
| `total_bytes` | `usize` | Original source size |
| `language` | `Language` | Tree-sitter grammar used |
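The documented field layout can be mirrored in plain Rust to show the data shape that ends up in your JSON. This is an illustrative mirror of the tables above, not the crate's source; the real types also derive `Serialize`/`Deserialize`.

```rust
use std::ops::Range;

// Illustrative stand-ins mirroring the documented field tables.
#[derive(Debug, Clone, Copy)]
enum Language { Rust }

#[derive(Debug)]
struct Chunk {
    content: String,
    byte_range: Range<usize>,
    node_kind: String,
    depth: usize,
}

#[derive(Debug)]
struct IndexEntry {
    path: String,
    chunks: Vec<Chunk>,
    total_bytes: usize,
    language: Language,
}

fn main() {
    let entry = IndexEntry {
        path: "src/lib.rs".into(),
        chunks: vec![Chunk {
            content: "fn main() {}".into(),
            byte_range: 0..12,
            node_kind: "function_item".into(),
            depth: 0,
        }],
        total_bytes: 12,
        language: Language::Rust,
    };
    // In the real crate this would be serde_json::to_string_pretty(&entry).
    println!("{} chunk(s) from {}", entry.chunks.len(), entry.path);
}
```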

Examples

| Example | Description |
|---------|-------------|
| `chunk_file` | Chunk a single Rust source file |
| `chunk_dir` | Chunk all files in a directory |
| `pipeline` | Persistent pipeline with multiple inputs |
| `json_output` | Serialize chunks to JSON |

Run with:

cargo run --example chunk_file

License

MIT OR Apache-2.0


Structure-aware code chunking for Rust developers who need semantic RAG, not line-count heuristics.


Author

Saddam Uwejan (Sam) - Rust systems engineer specializing in concurrent systems and production infrastructure.

Building high-performance, production-ready Rust libraries for real-world problems.