§astchunk
AST-based code chunking, implementing the algorithm from the cAST paper.
§Quick start
A typical pipeline is: Document -> AstChunk -> TextChunk -> JsonRecord.
use astchunk::chunker::{Chunker, CastChunker, CastChunkerOptions};
use astchunk::formatter::{CanonicalFormatter, Formatter};
use astchunk::output::JsonRecord;
use astchunk::types::{Document, DocumentId, Origin};
use astchunk::lang::Language;

// Wrap the source text in a Document.
let source = "def hello():\n print('hi')\n";
let document = Document {
    document_id: DocumentId(0),
    language: Language::Python,
    source: source.into(),
    origin: Origin::default(),
};

// Split the document into AST chunks.
let chunker = CastChunker::new(CastChunkerOptions::default());
let ast_chunks = chunker.chunk(&document).unwrap();

// Render each AST chunk as text.
let formatter = CanonicalFormatter::default();
let text_chunks = formatter.format(&document, &ast_chunks).unwrap();

// Build one JSON record per formatted chunk.
let records = JsonRecord::build(&document, &ast_chunks, &text_chunks);

assert!(!ast_chunks.is_empty());
assert_eq!(text_chunks.len(), ast_chunks.len());
assert_eq!(records.len(), text_chunks.len());
§Modules
- types: data types (Document, AstChunk, TextChunk, etc.).
- chunker: Chunker trait and CastChunker implementation.
- formatter: Formatter trait with CanonicalFormatter and ContextualFormatter.
- output: output record types (JsonRecord, RepoEvalRecord, SwebenchLiteRecord).
- lang: Language enum and tree-sitter bindings.
- error: AstchunkError type.
§Feature flags
| Feature | Description |
|---|---|
| cli | Build the command-line interface |
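
As a rough sketch of how a downstream project might turn on the cli feature, a Cargo.toml dependency entry could look like the fragment below. It rests on two assumptions: that the crate is available to Cargo under the package name astchunk, and that cli is not already part of the default feature set. The version requirement is a placeholder and should be replaced with the release you actually depend on.

```toml
[dependencies]
# Assumption: the package is published as "astchunk".
# "*" is a placeholder version requirement; pin a real version in practice.
astchunk = { version = "*", features = ["cli"] }
```

Leaving the feature off keeps the command-line interface out of the build for projects that only use the library API.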
Modules§
- chunker
- Chunking traits and concrete implementations for producing AST chunks.
- error
- Error types returned by the public astchunk pipeline APIs.
- formatter
- Text formatting traits and implementations built on top of AST chunks.
- lang
- Language definitions and tree-sitter bindings used by the chunking pipeline.
- output
- Output record types for serializing formatted chunks into downstream formats.
- types
- Core data types shared across the chunking, formatting, and output pipeline.