Open Code Search Parser
A Rust library designed to parse, extract, and chunk source code into manageable segments. This tool is ideal for processing large codebases, preparing code for LLM context windows, or extracting logical blocks (like functions, structs, and classes) from source files.
Features
- Multi-Language Support: Recognizes and filters files by extension for C, C++, Python, JavaScript, and Rust.
- Smart Segmentation: Splits code blocks automatically based on blank line delimiters (`\n\n`), keeping logically cohesive blocks (like complete function definitions or classes) intact.
- Directory Walking: Recursively scans directories to find and parse all files matching a specific target language.
- Parallel Processing: Utilizes `rayon` to process multiple files across directories concurrently, configurable by thread count.
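To make the blank-line segmentation idea concrete, here is a minimal sketch using only the standard library. The function name `split_on_blank_lines` is hypothetical and is not part of this library's API; it only illustrates splitting on `\n\n` and discarding empty blocks.

```rust
/// Split source text into segments at blank-line boundaries ("\n\n").
/// Hypothetical helper for illustration; not the library's actual function.
fn split_on_blank_lines(source: &str) -> Vec<String> {
    source
        .split("\n\n")
        // Runs of three or more newlines leave stray '\n' at block edges.
        .map(|block| block.trim_matches('\n'))
        // Drop blocks that contain only whitespace.
        .filter(|block| !block.trim().is_empty())
        .map(|block| block.to_string())
        .collect()
}
```

Note that this naive version also splits a function whose body contains a blank line, and it does not account for `\r\n` line endings; the library may handle such cases differently.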
Dependencies
This library relies on the following crates:
- `anyhow` - For idiomatic error handling.
- `rayon` - For parallel processing over directories.
- `walkdir` - For recursive directory traversal.
Usage
1. Parsing a Raw String
You can parse a raw string directly into a `Vec<CodeSegment>`.
use ;
let num_threads = 4;
let source = "fn add(a: i32, b: i32) -> i32 {\n a + b\n}\n\nfn sub(a: i32, b: i32) -> i32 {\n a - b\n}";
let segments = parse_str.unwrap;
assert_eq!;
println!; // Prints the `add` function
2. Parsing a Single File
Read and chunk a specific source file.
use ;
let num_threads = 4;
let file_path = "src/main.rs";
let segments = parse_file.unwrap;
for segment in segments
3. Parsing an Entire Directory
Recursively scan a directory, filter by a specific language, and process the files in parallel.
use ;
let dir_path = "./src";
let thread_count = 4; // Adjust based on your CPU cores
// Recursively finds all `.rs` files and parses them using 4 threads
let all_rust_segments = parse_dir.unwrap;
println!;
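The walk-filter-process flow described in this section can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: the library uses `walkdir` and `rayon`, whereas this sketch substitutes `std::fs::read_dir` and `std::thread::scope`. The names `collect_files` and `count_segments_parallel` are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Recursively collect files under `dir` whose extension is in `extensions`.
/// Hypothetical helper; a stand-in for a `walkdir`-based traversal.
fn collect_files(dir: &Path, extensions: &[&str], out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_files(&path, extensions, out)?;
        } else if path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| extensions.iter().any(|e| *e == ext))
        {
            out.push(path);
        }
    }
    Ok(())
}

/// Read each file on one of `thread_count` scoped threads and return
/// (path, number of blank-line-delimited segments). Unreadable files are skipped.
/// Hypothetical helper; a stand-in for a `rayon`-based parallel pass.
fn count_segments_parallel(files: &[PathBuf], thread_count: usize) -> Vec<(PathBuf, usize)> {
    let workers = thread_count.max(1);
    // Ceiling division so every file lands in some chunk; never zero.
    let chunk_size = ((files.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = files
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .filter_map(|path| {
                            let text = fs::read_to_string(path).ok()?;
                            let count = text
                                .split("\n\n")
                                .filter(|block| !block.trim().is_empty())
                                .count();
                            Some((path.clone(), count))
                        })
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Splitting the file list into fixed chunks is the simplest way to honor a thread count without extra crates; a work-stealing pool such as `rayon` balances load better when file sizes vary widely.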
4. Data Structures
Enum representing the supported target languages. Determines which file extensions are picked up during directory traversal.
`CodeSegment`: Struct containing the extracted code block.
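As a rough picture of how these two types could fit together, the sketch below maps each language to a set of file extensions and pairs a segment with its text. Everything here is an assumption made for illustration: the enum name `Language`, its variant names, the extension lists, the `matches` method, and the fields of `CodeSegment` are not taken from the library and may differ from its real definitions.

```rust
use std::path::Path;

/// Hypothetical enum of target languages (name and variants are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    C,
    Cpp,
    Python,
    JavaScript,
    Rust,
}

impl Language {
    /// Extensions picked up for each language (assumed lists, for illustration).
    fn extensions(self) -> &'static [&'static str] {
        match self {
            Language::C => &["c", "h"],
            Language::Cpp => &["cpp", "cc", "cxx", "hpp"],
            Language::Python => &["py"],
            Language::JavaScript => &["js"],
            Language::Rust => &["rs"],
        }
    }

    /// True if the path's extension belongs to this language.
    fn matches(self, path: &Path) -> bool {
        path.extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| self.extensions().iter().any(|e| *e == ext))
    }
}

/// Hypothetical layout of an extracted segment (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct CodeSegment {
    language: Language,
    text: String,
}
```

Keeping the extension table on the enum means directory traversal only needs a single `matches` check per file, which is one plausible reason to tie extension filtering to the language type.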