§toak-rs
A high-performance library for tokenizing git repositories, generating markdown documentation from them, and creating semantic embeddings of their code.
§Features
- Code Cleaning & Secret Redaction: Remove comments, imports, and sensitive information (API keys, tokens, passwords)
- Tokenization: Count tokens in code for LLM context window estimation
- Text Chunking: Split text into overlapping chunks optimized for embeddings and RAG applications
- Embeddings Generation: Create semantic vector embeddings for code chunks
- Markdown Generation: Convert repositories into well-structured markdown documentation
- High Performance: Built in Rust with concurrent file processing and no runtime dependencies
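The overlapping chunking mentioned in the feature list can be illustrated with a minimal sketch. It does not use toak-rs itself: `chunk_words` is a hypothetical helper, and splitting on whitespace-separated words (rather than tokens) is an assumption made to keep the example dependency-free. The crate's own `chunk_text` may work differently.

```rust
/// Hypothetical helper, not part of toak-rs: split `text` into chunks of at
/// most `size` whitespace-separated words, where each chunk repeats the last
/// `overlap` words of the previous one.
fn chunk_words(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "need size > 0 and overlap < size");
    let words: Vec<&str> = text.split_whitespace().collect();
    // Each new chunk starts `step` words after the previous one.
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the last chunk reached the end of the input
        }
        start += step;
    }
    chunks
}
```

With `size = 3` and `overlap = 1`, the input `"a b c d e"` yields the chunks `"a b c"` and `"c d e"`. The shared word `c` is what lets a retriever recover context that straddles a chunk boundary.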
§Quick Start
use toak_rs::prelude::*;

// Note: the `?` operators below assume an enclosing function that returns a compatible `Result`.

// Clean and redact code
let cleaned = clean_and_redact("let api_key = 'sk-1234567890';");
assert!(!cleaned.contains("sk-"));

// Generate embeddings
let mut generator = EmbeddingsGenerator::new()?;
let embedding = generator.generate_embedding("let x = 5;")?;

// Chunk text for RAG
let chunks = chunk_text("Hello world", &ChunkerConfig::default());

// Perform semantic search on embeddings
let mut search = SemanticSearch::new("embeddings.json")?;
let results = search.search("find rust code", 5)?;
for result in results {
    println!("{}: {:.4}", result.file_path, result.similarity);
}

Re-exports§
pub use embeddings_generator::EmbeddingsGenerator;
pub use json_database_generator::ChunkMetadata;
pub use json_database_generator::EmbeddedChunk;
pub use json_database_generator::EmbeddingsDatabase;
pub use json_database_generator::JsonDatabaseGenerator;
pub use json_database_generator::JsonDatabaseOptions;
pub use json_database_generator::JsonDatabaseResult;
pub use markdown_generator::MarkdownGenerator;
pub use markdown_generator::MarkdownGeneratorOptions;
pub use markdown_generator::MarkdownResult;
pub use semantic_search::EmbeddingChunk;
pub use semantic_search::EmbeddingsDatabaseMetadata;
pub use semantic_search::SearchResult;
pub use semantic_search::SemanticSearch;
pub use text_chunker::chunk_text;
pub use text_chunker::ChunkerConfig;
pub use text_chunker::TextChunk;
pub use token_cleaner::clean_and_redact;
pub use token_cleaner::clean_code;
pub use token_cleaner::count_tokens;
pub use token_cleaner::redact_secrets;
Modules§
- embeddings_generator - Utilities for creating semantic embeddings via the fastembed crate. This module powers the embedding generation features that back the JSON database exporter and any higher-level tooling.
- json_database_generator - Helpers that walk a git repository, chunk the code, and persist embeddings into a JSON database.
- markdown_generator - Utilities that turn a repository into a human-readable markdown file, handling ignore files and ensuring the generated artifacts are tracked in .gitignore.
- prelude - Prelude module for convenient imports
- semantic_search - Semantic search functionality for querying embeddings databases.
- text_chunker - Helpers for slicing strings into token-aware chunks for embeddings and documentation.
- token_cleaner - Utility routines for sanitizing code before chunking/embedding.
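
To make the idea behind semantic_search concrete, here is a minimal sketch of ranking stored embeddings against a query vector. It is written independently of toak-rs: `cosine_similarity` and `top_k` are hypothetical helpers, and the use of cosine similarity as the score is an assumption, since this page does not say how the `similarity` field of `SearchResult` is computed.

```rust
/// Hypothetical helper: cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical helper: score every `(name, embedding)` pair against `query`
/// and return the `k` best matches, highest similarity first.
fn top_k<'a>(
    query: &[f32],
    items: &'a [(String, Vec<f32>)],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = items
        .iter()
        .map(|(name, vec)| (name.as_str(), cosine_similarity(query, vec)))
        .collect();
    // Sort in descending order of similarity; `total_cmp` gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A linear scan like this is adequate for a JSON file holding a few thousand chunks. Larger collections usually call for an approximate nearest-neighbor index instead.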