opencodesearchparser 0.1.3

Rust Library for Parsing and Segmentation of Source code

Open Code Search Parser

opencodesearchparser is a Rust library for parsing source files into top-level code segments using Tree-sitter.

Public API

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CodeLanguage {
    C,
    Cpp,
    Go,
    Java,
    Kotlin,
    Python,
    JavaScript,
    Js,
    Ts,
    Php,
    Proto,
    R,
    Rust,
}

pub fn parse_str(source: &str, lang: CodeLanguage, thread_num: u16) -> anyhow::Result<Vec<String>>;
pub fn parse_file(file_path: &str, lang: CodeLanguage, thread_num: u16) -> anyhow::Result<Vec<String>>;
pub fn parse_dir(dir_path: &str, lang: CodeLanguage, thread_num: u16) -> anyhow::Result<Vec<String>>;

pub mod recursive_character_text_splitter;

thread_num == 0 is treated as 1 thread internally.
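
The zero-thread rule can be pictured with a small helper. This is only a sketch of the documented behavior; `effective_threads` is a hypothetical name, not a function exported by the crate.

```rust
/// Hypothetical helper (not part of the crate): maps the requested
/// thread count to the count actually used, treating 0 as 1.
fn effective_threads(thread_num: u16) -> usize {
    usize::from(thread_num.max(1))
}
```

Under this assumption, `parse_str(source, lang, 0)` and `parse_str(source, lang, 1)` would run with the same degree of parallelism.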

Current Language Support

Area | Supported now
parse_str / parse_file parsing | C, Cpp, Python, JavaScript, Js, Rust
parse_dir extension filtering | C (.c), Cpp (.cpp), Python (.py), JavaScript/Js (.js), Rust (.rs)
Other CodeLanguage variants | Present in the enum, but currently return an error in parsing and/or directory mapping
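
The extension mapping above might look roughly like the following. This is a sketch under assumptions: `Lang` is a local stand-in for `CodeLanguage` so the snippet compiles on its own, and `extension_for` and its error message are hypothetical, not the crate's real internals.

```rust
// Local stand-in for the crate's CodeLanguage enum (assumption for this sketch).
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Lang { C, Cpp, Go, Java, Kotlin, Python, JavaScript, Js, Ts, Php, Proto, R, Rust }

/// Hypothetical mapping from a language to the file extension used for
/// directory filtering. Variants without support yet yield an error.
fn extension_for(lang: Lang) -> Result<&'static str, String> {
    match lang {
        Lang::C => Ok("c"),
        Lang::Cpp => Ok("cpp"),
        Lang::Python => Ok("py"),
        Lang::JavaScript | Lang::Js => Ok("js"),
        Lang::Rust => Ok("rs"),
        other => Err(format!("{other:?} is not supported yet")),
    }
}
```

Callers that accept arbitrary `CodeLanguage` values should therefore be prepared to handle an `Err` for variants such as `Go` or `Java`.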

Segmentation Behavior (Current)

  • C/C++: keeps top-level functions, declarations, struct/class/enum/union/type definitions, and preprocessor nodes (include, define, macro function define, conditional directives, and preprocessor calls like #pragma).
  • C/C++ struct/class/enum/union declarations are emitted with a trailing ; when the semicolon is a separate sibling node.
  • Python: keeps top-level function definitions, class definitions, expression statements, assignments, and global statements.
  • JavaScript/Js: keeps top-level function declarations, class declarations, lexical/variable declarations, and expression statements.
  • Rust: keeps top-level nodes whose kinds end with _item or _definition.
  • Top-level comment nodes and empty/whitespace-only segments are skipped.
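
As an illustration of the Rust rule and the skip rule, the selection step could be sketched as below. The function names are hypothetical, nodes are modeled as plain (kind, text) pairs rather than Tree-sitter nodes, and matching comments by a kind ending in `comment` is an assumption.

```rust
/// Hypothetical predicate for the Rust rule: keep node kinds that end
/// with `_item` or `_definition`.
fn keep_rust_kind(kind: &str) -> bool {
    kind.ends_with("_item") || kind.ends_with("_definition")
}

/// Selects segment text from (kind, text) pairs: comments and
/// whitespace-only text are dropped, then the Rust kind rule is applied.
fn select_segments<'a>(nodes: &[(&str, &'a str)]) -> Vec<&'a str> {
    nodes
        .iter()
        .filter(|(kind, text)| !kind.ends_with("comment") && !text.trim().is_empty())
        .filter(|(kind, _)| keep_rust_kind(kind))
        .map(|(_, text)| *text)
        .collect()
}
```

With this model, a `function_item` node is kept, while a `line_comment` node or a node whose text is only whitespace is skipped.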

Parallelism

  • parse_str uses a Rayon thread pool sized by thread_num for segment materialization.
  • parse_file reads one file, then calls parse_str with the same thread_num.
  • parse_dir walks directories with walkdir, filters by extension, then parses matching files in parallel with Rayon.
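
To make "segment materialization" concrete, here is a dependency-free sketch that copies byte ranges of the source into owned strings on several threads. It deliberately uses `std::thread::scope` instead of Rayon, and `materialize` and the (start, end) range representation are assumptions for illustration, not the crate's implementation.

```rust
use std::thread;

/// Hypothetical sketch: turn byte ranges into owned Strings using up to
/// `thread_num` scoped threads. The crate itself uses a Rayon pool.
fn materialize(source: &str, ranges: &[(usize, usize)], thread_num: u16) -> Vec<String> {
    if ranges.is_empty() {
        return Vec::new();
    }
    // 0 is treated as 1 thread, matching the documented rule.
    let threads = usize::from(thread_num.max(1));
    // Ceiling division so every range lands in exactly one batch.
    let batch = (ranges.len() + threads - 1) / threads;
    thread::scope(|scope| {
        let handles: Vec<_> = ranges
            .chunks(batch)
            .map(|part| {
                scope.spawn(move || {
                    part.iter()
                        .map(|&(start, end)| source[start..end].to_string())
                        .collect::<Vec<String>>()
                })
            })
            .collect();
        // Joining in spawn order keeps segments in source order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<String>>()
    })
}
```

Scoped threads let the workers borrow `source` directly, which is the same property a Rayon parallel iterator would rely on.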

RecursiveCharacterTextSplitter

recursive_character_text_splitter::RecursiveCharacterTextSplitter provides recursive chunking with configurable separators, chunk size, and overlap.

Key constructors and methods:

pub fn new(separators: Option<Vec<String>>, chunk_size: usize, chunk_overlap: usize) -> Self;
pub fn from_language(language: CodeLanguage) -> Self;
pub fn split_text(&self, text: &str) -> Vec<String>;
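
The general idea behind recursive character splitting can be sketched in a few lines. This is a simplified stand-in, not the crate's algorithm: `split_recursive` is a hypothetical function, sizes are measured in bytes, and overlap and merging of small pieces are intentionally left out.

```rust
/// Simplified recursive splitter: try each separator in order and recurse
/// into pieces that are still longer than `chunk_size` bytes.
/// Whitespace-only pieces are dropped.
fn split_recursive(text: &str, separators: &[&str], chunk_size: usize) -> Vec<String> {
    if text.len() <= chunk_size {
        return if text.trim().is_empty() {
            Vec::new()
        } else {
            vec![text.to_string()]
        };
    }
    match separators.split_first() {
        // No separators left: emit the oversized piece as-is.
        None => vec![text.to_string()],
        Some((sep, rest)) => text
            .split(*sep)
            .flat_map(|piece| split_recursive(piece, rest, chunk_size))
            .collect(),
    }
}
```

Coarse separators such as a blank line are tried first, so chunks tend to break at natural boundaries before falling back to finer ones.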

Usage Examples

Parse a string

use anyhow::Result;
use opencodesearchparser::{parse_str, CodeLanguage};

fn main() -> Result<()> {
    let source = r#"
static GLOBAL_VAR: i32 = 42;
fn add(a: i32, b: i32) -> i32 { a + b }
"#;

    let segments = parse_str(source, CodeLanguage::Rust, 4)?;
    println!("segments: {}", segments.len());
    Ok(())
}

Parse a file

use anyhow::Result;
use opencodesearchparser::{parse_file, CodeLanguage};

fn main() -> Result<()> {
    let segments = parse_file("tests/data/rust/example.rs", CodeLanguage::Rust, 2)?;
    println!("segments: {}", segments.len());
    Ok(())
}

Parse a directory

use anyhow::Result;
use opencodesearchparser::{parse_dir, CodeLanguage};

fn main() -> Result<()> {
    // Walks tests/data and parses only files with the .rs extension.
    let segments = parse_dir("tests/data", CodeLanguage::Rust, 8)?;
    println!("segments: {}", segments.len());
    Ok(())
}

Use the recursive splitter

use opencodesearchparser::recursive_character_text_splitter::RecursiveCharacterTextSplitter;
use opencodesearchparser::CodeLanguage;

fn main() {
    // Build a splitter preconfigured for Rust source.
    let splitter = RecursiveCharacterTextSplitter::from_language(CodeLanguage::Rust);
    let chunks = splitter.split_text("fn a() {}\n\nfn b() {}");
    assert!(!chunks.is_empty());
}