
Tokio Cast


High-performance, structure-aware code chunking for RAG pipelines, powered by actors.

Tokio Cast is a Rust implementation of the cAST algorithm (Zhang et al., EMNLP 2025) for splitting source code into semantically meaningful chunks. It uses Tree-sitter for language-agnostic AST parsing and tokio-actors for a concurrent three-stage pipeline with bounded-mailbox backpressure.


Why tokio-cast?

Structure-Aware Chunking

Chunks follow AST boundaries, not arbitrary line counts. Functions, structs, and impl blocks stay intact. When a node exceeds the size limit, cAST recursively splits it at child boundaries, then merges adjacent small siblings back together, preserving semantic coherence.

// Line-based chunking breaks mid-function.
// cAST keeps this entire impl block as one chunk:
impl Actor for Worker {
    type Message = Job;
    type Response = ();

    async fn handle(&mut self, msg: Job, ctx: &mut ActorContext<Self>) -> ActorResult<()> {
        self.process(msg).await
    }
}
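The split-then-merge recursion can be sketched in plain Rust. This is an illustrative stand-in, not tokio-cast's internals: the simplified `Node { len, children }` type plays the role of a Tree-sitter node, and chunks are represented by their byte sizes only.

```rust
// Sketch of cAST's split-then-merge recursion over a toy node type.
struct Node {
    len: usize,          // size of this node's source text in bytes
    children: Vec<Node>, // child nodes in source order
}

/// Emit chunk sizes no larger than `max`: recurse into oversized nodes,
/// then greedily merge adjacent small siblings into a single chunk.
fn chunk(node: &Node, max: usize, out: &mut Vec<usize>) {
    if node.len <= max || node.children.is_empty() {
        out.push(node.len); // node fits (or is a leaf): emit as one chunk
        return;
    }
    let mut acc = 0; // running size of the chunk being assembled
    for child in &node.children {
        if child.len > max {
            // flush the accumulator, then recurse into the oversized child
            if acc > 0 { out.push(acc); acc = 0; }
            chunk(child, max, out);
        } else if acc + child.len > max {
            // merging would overflow the limit: close the current chunk
            out.push(acc);
            acc = child.len;
        } else {
            acc += child.len; // merge a small sibling into the current chunk
        }
    }
    if acc > 0 { out.push(acc); }
}

fn main() {
    let tree = Node {
        len: 120,
        children: vec![
            Node { len: 30, children: vec![] },
            Node { len: 30, children: vec![] },
            Node { len: 60, children: vec![
                Node { len: 25, children: vec![] },
                Node { len: 35, children: vec![] },
            ]},
        ],
    };
    let mut chunks = Vec::new();
    chunk(&tree, 50, &mut chunks);
    println!("{:?}", chunks); // -> [30, 30, 25, 35]
}
```

Note how the two 30-byte siblings are emitted as separate chunks (merging them would exceed the 50-byte cap), while the oversized 60-byte node is split at its own child boundaries.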

Actor Pipeline

Three actors run concurrently with bounded mailboxes:

  Parser --> Chunker --> Indexer
  (AST)     (split)    (metadata)

Each stage applies backpressure naturally. When the Chunker is busy, the Parser blocks on its mailbox. No unbounded queues, no OOM surprises.
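The backpressure behavior can be demonstrated with the standard library alone. The sketch below uses `std::sync::mpsc::sync_channel` as a stand-in for a bounded actor mailbox (the real crate uses tokio-actors); the point is only that a full downstream mailbox blocks the upstream sender instead of growing a queue.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Three-stage pipeline sketch: Parser -> Chunker -> Indexer, with
// bounded channels playing the role of actor mailboxes. When a stage
// is slow, `send` blocks its upstream stage; nothing queues unbounded.
fn main() {
    let (parser_tx, chunker_rx) = sync_channel::<String>(4);  // Parser -> Chunker
    let (chunker_tx, indexer_rx) = sync_channel::<usize>(4);  // Chunker -> Indexer

    let chunker = thread::spawn(move || {
        for src in chunker_rx {
            // "chunking": forward a size downstream; a full indexer
            // mailbox would block here, backpressuring the parser too
            chunker_tx.send(src.len()).unwrap();
        }
    });
    let indexer = thread::spawn(move || indexer_rx.iter().sum::<usize>());

    for i in 0..8 {
        parser_tx.send(format!("file-{i}")).unwrap(); // blocks when the mailbox is full
    }
    drop(parser_tx); // closing the channel lets the downstream stages finish

    chunker.join().unwrap();
    println!("total bytes indexed: {}", indexer.join().unwrap()); // -> 48
}
```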

Zero-Copy Parsing

Tree-sitter parses source bytes in-place and produces a concrete syntax tree. Tokio Cast converts the !Send tree-sitter Node types into a Send-safe AstNode representation before crossing actor boundaries. Zero unnecessary allocations.
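The conversion idea can be illustrated without tree-sitter itself. In the sketch below, `BorrowedNode<'tree>` is a hypothetical stand-in for tree-sitter's lifetime-bound `Node<'tree>`, and `AstNode` is an owned mirror; neither is tokio-cast's actual type. Because `AstNode` owns all of its data, it is automatically `Send` and can cross actor boundaries.

```rust
use std::ops::Range;

// Stand-in for tree-sitter's borrowed, lifetime-bound syntax node.
struct BorrowedNode<'tree> {
    kind: &'tree str,
    range: Range<usize>,
    children: Vec<BorrowedNode<'tree>>,
}

// Owned, Send-safe mirror that can be moved between actors.
#[derive(Debug)]
struct AstNode {
    kind: String,
    byte_range: Range<usize>,
    children: Vec<AstNode>,
}

/// Recursively copy the borrowed tree into an owned representation.
fn to_owned(node: &BorrowedNode<'_>) -> AstNode {
    AstNode {
        kind: node.kind.to_string(),
        byte_range: node.range.clone(),
        children: node.children.iter().map(to_owned).collect(),
    }
}

fn main() {
    let borrowed = BorrowedNode {
        kind: "function_item",
        range: 0..42,
        children: vec![BorrowedNode { kind: "block", range: 10..42, children: vec![] }],
    };
    let owned = to_owned(&borrowed);

    fn assert_send<T: Send>(_: &T) {}
    assert_send(&owned); // compiles only because AstNode owns its data

    println!("{} spans {:?}", owned.kind, owned.byte_range);
}
```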

Language Agnostic

Any Tree-sitter grammar works. Plug in a grammar, get structure-aware chunks. Currently ships with Rust support; adding a new language is one Language enum variant and a grammar dependency.
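A hypothetical sketch of what that extension point might look like; the method name and shape here are illustrative, not tokio-cast's real API. A new language would add one variant, its grammar crate, and a match arm.

```rust
// Illustrative Language enum: one variant per supported grammar.
#[derive(Debug, Clone, Copy)]
enum Language {
    Rust,
    // Python,  <- a new language: one variant + a grammar dependency
}

impl Language {
    /// File extensions this grammar should claim when walking a directory
    /// (hypothetical helper for illustration).
    fn extensions(&self) -> &'static [&'static str] {
        match self {
            Language::Rust => &["rs"],
        }
    }
}

fn main() {
    println!("{:?} handles {:?}", Language::Rust, Language::Rust.extensions());
}
```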

Serde-Ready Output

Chunk and IndexEntry derive Serialize/Deserialize. Pipe output straight to your embedding API, vector database, or JSON storage.

let entries = cast.chunk_file("src/lib.rs").await?;
let json = serde_json::to_string_pretty(&entries)?;

Quick Start

cargo add tokio-cast

use tokio_cast::{TokioCast, Language};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cast = TokioCast::builder()
        .language(Language::Rust)
        .max_chunk_size(2000)
        .build()?;

    // Chunk a single file
    let entries = cast.chunk_file("src/main.rs").await?;
    for entry in &entries {
        println!("{} ({} bytes, {} chunks)", entry.path, entry.total_bytes, entry.chunks.len());
    }

    // Chunk an entire directory with the actor pipeline
    let entries = cast.chunk_dir("./src").await?;
    println!("Indexed {} files", entries.len());

    Ok(())
}

Persistent Pipeline

For repeated processing, keep the actor pipeline alive:

let cast = TokioCast::builder()
    .language(Language::Rust)
    .build()?;

let pipeline = cast.pipeline().await?;

let entries = pipeline.process_source(source, "src/lib.rs").await?;
// ... process more files without respawning actors ...

pipeline.shutdown().await?;

API at a Glance

TokioCast Methods

| Method | Description |
|--------|-------------|
| `TokioCast::builder()` | Start the factory builder |
| `builder.language(lang)` | Set the Tree-sitter grammar |
| `builder.max_chunk_size(n)` | Max bytes per chunk (default: 2000) |
| `builder.build()` | Construct the `TokioCast` instance |
| `chunk_source(src)` | Sync: chunk a source string |
| `chunk_file(path)` | Async: parse and chunk a single file |
| `chunk_dir(path)` | Async: chunk all files in a directory (actor pipeline) |
| `pipeline()` | Async: spawn a persistent actor pipeline |

Pipeline Methods

| Method | Description |
|--------|-------------|
| `process_source(src, path)` | Send source through the live pipeline |
| `shutdown()` | Gracefully stop all pipeline actors |

Chunk Fields

| Field | Type | Description |
|-------|------|-------------|
| `content` | `String` | The chunk text |
| `byte_range` | `Range<usize>` | Byte offsets in the original source |
| `node_kind` | `String` | AST node type (e.g. `"function_item"`) |
| `depth` | `usize` | Nesting depth in the AST |

IndexEntry Fields

| Field | Type | Description |
|-------|------|-------------|
| `path` | `String` | Source file path |
| `chunks` | `Vec<Chunk>` | Ordered chunks for this file |
| `total_bytes` | `usize` | Original source size |
| `language` | `Language` | Tree-sitter grammar used |
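The documented field layout can be mirrored in plain Rust to show the data shape that ends up in your JSON. This is an illustrative mirror of the tables above, not the crate's source; the real types also derive `Serialize`/`Deserialize`.

```rust
use std::ops::Range;

// Illustrative stand-ins mirroring the documented field tables.
#[derive(Debug, Clone, Copy)]
enum Language { Rust }

#[derive(Debug)]
struct Chunk {
    content: String,
    byte_range: Range<usize>,
    node_kind: String,
    depth: usize,
}

#[derive(Debug)]
struct IndexEntry {
    path: String,
    chunks: Vec<Chunk>,
    total_bytes: usize,
    language: Language,
}

fn main() {
    let entry = IndexEntry {
        path: "src/lib.rs".into(),
        chunks: vec![Chunk {
            content: "fn main() {}".into(),
            byte_range: 0..12,
            node_kind: "function_item".into(),
            depth: 0,
        }],
        total_bytes: 12,
        language: Language::Rust,
    };
    // In the real crate this would be serde_json::to_string_pretty(&entry).
    println!("{} chunk(s) from {}", entry.chunks.len(), entry.path);
}
```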

Examples

| Example | Description |
|---------|-------------|
| `chunk_file` | Chunk a single Rust source file |
| `chunk_dir` | Chunk all files in a directory |
| `pipeline` | Persistent pipeline with multiple inputs |
| `json_output` | Serialize chunks to JSON |

Run with:

cargo run --example chunk_file

License

MIT OR Apache-2.0


Structure-aware code chunking for Rust developers who need semantic RAG, not line-count heuristics.


Author

Saddam Uwejan (Sam) - Rust systems engineer specializing in concurrent systems and production infrastructure.

Building high-performance, production-ready Rust libraries for real-world problems.