code-splitter
- Rust Crate: code-splitter
- Python Bindings: code-splitter
Introduction
Welcome to code-splitter, a Rust crate designed to split source code into chunks. It is particularly useful in Retrieval-Augmented Generation (RAG), a technique that enhances the generation capabilities of Large Language Models (LLMs) by leveraging external knowledge sources.
In a prototypical RAG application, longer documents are first split into chunks, encoded into embeddings, and then indexed into a vector database. When handling a new user query, the system first searches the vector database and retrieves the most relevant chunks based on their embeddings. These retrieved chunks are then used as context to augment the query, which is subsequently processed by an LLM. Chunking is essential for several reasons:
- Model Input Constraints: Embedding models, such as OpenAI's text-embedding-3-small, have a fixed maximum input token limit (e.g., 8192 tokens). Longer documents need to be divided into smaller chunks to fit within these input constraints.
- Precision and Relevance: By chunking longer documents, each chunk can be treated as a separate entity for retrieval. This improves the precision of the retrieval process, as smaller, focused chunks are more likely to be relevant to a specific query compared to an entire lengthy document.
- Improved Generation Quality: LLMs generate better responses when they have a concise and relevant context. Smaller chunks help in maintaining a clear and focused context, reducing the chances of irrelevant information diluting the response.
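The retrieval step of the pipeline described above can be illustrated with a small sketch. It is not part of this crate: `embed`, `cosine` and `retrieve` are hypothetical helpers, and the bag-of-words "embedding" is only a stand-in for a real embedding model and vector database.

```rust
use std::collections::HashMap;

/// Toy "embedding": lowercase word counts (a stand-in for a real model).
fn embed(text: &str) -> HashMap<String, f64> {
    let mut counts = HashMap::new();
    for word in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(word.to_lowercase()).or_insert(0.0) += 1.0;
    }
    counts
}

/// Cosine similarity between two sparse vectors; 0.0 if either is empty.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .map(|(k, v)| v * b.get(k).copied().unwrap_or(0.0))
        .sum();
    let norm = |m: &HashMap<String, f64>| m.values().map(|v| v * v).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}

/// Indices of the `k` chunks most similar to `query`, best match first.
fn retrieve(chunks: &[&str], query: &str, k: usize) -> Vec<usize> {
    let q = embed(query);
    let mut scored: Vec<(usize, f64)> = chunks
        .iter()
        .enumerate()
        .map(|(i, c)| (i, cosine(&embed(c), &q)))
        .collect();
    // Sort by descending similarity; the sort is stable, so ties keep index order.
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

In a real application the chunks would come from a splitter, the vectors from an embedding model, and the search from a vector database; the retrieved chunks are then placed into the LLM prompt as context.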
Features
This crate uses tree-sitter to parse code into an Abstract Syntax Tree (AST) and merges sibling nodes to create the largest possible chunks without exceeding the size limit.
It supports all languages that can be parsed with tree-sitter, thanks to its extensible nature.
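The merge step can be sketched on a toy tree. This is a simplified illustration under stated assumptions, not the crate's actual implementation: `Node`, `Chunk` and `split_node` are hypothetical names, sizes are given up front instead of being measured on source text, and a node that cannot be subdivided is emitted even if it exceeds the limit.

```rust
/// Toy stand-in for a syntax-tree node (hypothetical; not the crate's type).
#[derive(Debug, Clone)]
struct Node {
    label: &'static str,
    size: usize, // size of the node's text, in the same unit as the limit
    children: Vec<Node>,
}

/// A run of adjacent nodes, recorded by label, with their combined size.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    labels: Vec<&'static str>,
    size: usize,
}

/// Split `node` into chunks no larger than `max_size` where possible.
fn split_node(node: &Node, max_size: usize) -> Vec<Chunk> {
    // A node that fits, or that has no children to descend into, is one chunk.
    if node.size <= max_size || node.children.is_empty() {
        return vec![Chunk {
            labels: vec![node.label],
            size: node.size,
        }];
    }

    // Otherwise split every child, then greedily merge neighbouring chunks
    // for as long as the combined size stays within the limit.
    let mut merged: Vec<Chunk> = Vec::new();
    for child in &node.children {
        for chunk in split_node(child, max_size) {
            let fits = merged
                .last()
                .map_or(false, |last| last.size + chunk.size <= max_size);
            if fits {
                let last = merged.last_mut().expect("checked non-empty above");
                last.labels.extend(chunk.labels);
                last.size += chunk.size;
            } else {
                merged.push(chunk);
            }
        }
    }
    merged
}
```

For a root whose children have sizes 4, 3 and 6, a limit of 7 merges the first two siblings into one chunk of size 7 and leaves the third as its own chunk. Lowering the limit forces the splitter to descend into the third child and emit its children separately.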
Get Started
Add it to your project with `cargo add code-splitter`.
By Characters
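use code_splitter::{CharCounter, Splitter};
use std::fs;

let max_chars = 1000; // Maximum number of characters in a code chunk
let lang = tree_sitter_rust::language(); // Requires `cargo add tree-sitter-rust`
let splitter = Splitter::new(lang, CharCounter)
    .expect("Failed to load tree-sitter language")
    .with_max_size(max_chars);
// "path/to/file.rs" is a placeholder; point it at the file you want to split
let code = fs::read("path/to/file.rs").expect("Failed to read file");
let chunks = splitter.split(&code).expect("Failed to split code");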
By Words
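use code_splitter::{Splitter, WordCounter};
use std::fs;

let max_words = 200; // Maximum number of words in a code chunk
let lang = tree_sitter_rust::language(); // Requires `cargo add tree-sitter-rust`
let splitter = Splitter::new(lang, WordCounter)
    .expect("Failed to load tree-sitter language")
    .with_max_size(max_words);
// "path/to/file.rs" is a placeholder; point it at the file you want to split
let code = fs::read("path/to/file.rs").expect("Failed to read file");
let chunks = splitter.split(&code).expect("Failed to split code");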
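The two size limits above measure different things, which matters when picking a value. As a rough sketch, assuming characters are counted as Unicode scalar values and words as whitespace-separated runs (the crate's own counters may differ in detail), the hypothetical helpers below show how the same snippet is measured each way.

```rust
/// Size in characters: counts Unicode scalar values, not bytes.
/// (Hypothetical helper for illustration; not the crate's counter.)
fn count_chars(text: &str) -> usize {
    text.chars().count()
}

/// Size in words: counts whitespace-separated runs, so punctuation
/// attached to an identifier stays part of the same "word".
/// (Hypothetical helper for illustration; not the crate's counter.)
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}
```

Because code is dense in punctuation and short identifiers, a word limit usually admits far more text per chunk than the same number used as a character limit.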
By Tokens with Huggingface
Requires the tokenizers feature to be activated.
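use code_splitter::Splitter;
use std::fs;
use tokenizers::Tokenizer;

let max_tokens = 500; // Maximum number of tokens in a code chunk
let lang = tree_sitter_rust::language(); // Requires `cargo add tree-sitter-rust`
// "bert-base-cased" is an example identifier; use the tokenizer that matches your model
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)
    .expect("Failed to load tokenizer");
let splitter = Splitter::new(lang, tokenizer)
    .expect("Failed to load tree-sitter language")
    .with_max_size(max_tokens);
// "path/to/file.rs" is a placeholder; point it at the file you want to split
let code = fs::read("path/to/file.rs").expect("Failed to read file");
let chunks = splitter.split(&code).expect("Failed to split code");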
By Tokens with Tiktoken
Requires the tiktoken-rs feature to be activated.
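use code_splitter::Splitter;
use std::fs;
use tiktoken_rs::cl100k_base;

let max_tokens = 500; // Maximum number of tokens in a code chunk
let lang = tree_sitter_rust::language(); // Requires `cargo add tree-sitter-rust`
let bpe = cl100k_base().expect("Failed to load tokenizer");
let splitter = Splitter::new(lang, bpe)
    .expect("Failed to load tree-sitter language")
    .with_max_size(max_tokens);
// "path/to/file.rs" is a placeholder; point it at the file you want to split
let code = fs::read("path/to/file.rs").expect("Failed to read file");
let chunks = splitter.split(&code).expect("Failed to split code");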
Inspiration
This crate was inspired by LlamaIndex's CodeSplitter which, in turn, was based on SweepAI's blog post.
The project setup was inspired by text-splitter, but the splitting algorithm was developed independently.