
Crate toak_rs


§toak-rs

A high-performance library for tokenizing git repositories, generating markdown documentation from them, and creating semantic embeddings of their code.

§Features

  • Code Cleaning & Secret Redaction: Remove comments, imports, and sensitive information (API keys, tokens, passwords)
  • Tokenization: Count tokens in code for LLM context window estimation
  • Text Chunking: Split text into overlapping chunks optimized for embeddings and RAG applications
  • Embeddings Generation: Create semantic vector embeddings for code chunks
  • Markdown Generation: Convert repositories into well-structured markdown documentation
  • High Performance: Built in Rust with concurrent file processing and no runtime dependencies
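To make the "overlapping chunks" idea in the feature list concrete, here is a minimal sketch that splits text on whitespace and lets consecutive chunks share a few words. The function `chunk_words` and its parameters are hypothetical and exist only for illustration; they are not the crate's `chunk_text` API, which is described as token-aware and may behave differently.

```rust
/// Split `text` into chunks of at most `chunk_size` whitespace-separated words,
/// where consecutive chunks share `overlap` words.
/// Hypothetical helper for illustration; not the toak_rs `chunk_text` API.
fn chunk_words(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    let words: Vec<&str> = text.split_whitespace().collect();
    // Each new chunk starts `step` words after the previous one.
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the last chunk reached the end of the input
        }
        start += step;
    }
    chunks
}
```

Calling `chunk_words("a b c d e f g", 4, 2)` yields three chunks, each repeating the last two words of the previous one. The overlap keeps context that straddles a chunk boundary available to both neighbouring embeddings.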

§Quick Start

use toak_rs::prelude::*;

// The `?` operator needs a fallible context. This wrapper assumes the crate's
// error types convert into `Box<dyn std::error::Error>`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Clean and redact code
    let cleaned = clean_and_redact("let api_key = 'sk-1234567890';");
    assert!(!cleaned.contains("sk-"));

    // Generate embeddings
    let mut generator = EmbeddingsGenerator::new()?;
    let embedding = generator.generate_embedding("let x = 5;")?;

    // Chunk text for RAG
    let chunks = chunk_text("Hello world", &ChunkerConfig::default());

    // Perform semantic search over an embeddings database
    let mut search = SemanticSearch::new("embeddings.json")?;
    let results = search.search("find rust code", 5)?;
    for result in results {
        println!("{}: {:.4}", result.file_path, result.similarity);
    }

    Ok(())
}
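The `similarity` value printed above is a score per result. As a rough sketch of how such a ranking can work, the example below scores stored vectors against a query vector with cosine similarity and keeps the best matches. That the crate uses cosine similarity is an assumption, and `cosine_similarity` and `top_matches` are hypothetical names that are not part of toak_rs.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns 0.0 when either vector has zero magnitude.
/// Assumption: illustrates one common metric, not necessarily the crate's.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Score every `(label, embedding)` pair against `query`, sort best first,
/// and keep at most `top_k` results. Hypothetical helper for illustration.
fn top_matches<'a>(
    query: &[f32],
    corpus: &'a [(&'a str, Vec<f32>)],
    top_k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = corpus
        .iter()
        .map(|(label, emb)| (*label, cosine_similarity(query, emb)))
        .collect();
    // Descending order by score; `total_cmp` gives a total order on f32.
    scored.sort_by(|l, r| r.1.total_cmp(&l.1));
    scored.truncate(top_k);
    scored
}
```

A linear scan like this is adequate for a JSON file holding a few thousand chunks; larger collections usually call for an approximate nearest-neighbour index instead.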

§Re-exports

pub use embeddings_generator::EmbeddingsGenerator;
pub use json_database_generator::ChunkMetadata;
pub use json_database_generator::EmbeddedChunk;
pub use json_database_generator::EmbeddingsDatabase;
pub use json_database_generator::JsonDatabaseGenerator;
pub use json_database_generator::JsonDatabaseOptions;
pub use json_database_generator::JsonDatabaseResult;
pub use markdown_generator::MarkdownGenerator;
pub use markdown_generator::MarkdownGeneratorOptions;
pub use markdown_generator::MarkdownResult;
pub use semantic_search::EmbeddingChunk;
pub use semantic_search::EmbeddingsDatabaseMetadata;
pub use semantic_search::SearchResult;
pub use semantic_search::SemanticSearch;
pub use text_chunker::chunk_text;
pub use text_chunker::ChunkerConfig;
pub use text_chunker::TextChunk;
pub use token_cleaner::clean_and_redact;
pub use token_cleaner::clean_code;
pub use token_cleaner::count_tokens;
pub use token_cleaner::redact_secrets;

§Modules

embeddings_generator
Utilities for creating semantic embeddings via the fastembed crate. This module powers the embedding generation features that back the JSON database exporter and any higher-level tooling.
json_database_generator
Helpers that walk a git repository, chunk the code, and persist embeddings into a JSON database.
markdown_generator
Utilities that turn a repository into a human-readable markdown file, handling ignore files and ensuring the generated artifacts are listed in .gitignore.
prelude
Prelude module for convenient imports.
semantic_search
Semantic search functionality for querying embeddings databases.
text_chunker
Helpers for slicing strings into token-aware chunks for embeddings and documentation.
token_cleaner
Utility routines for sanitizing code before chunking/embedding.
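
To illustrate the kind of sanitizing the token_cleaner module describes, the sketch below replaces any key-like token that begins with a known prefix (such as `sk-`) with a placeholder. This is a simplified, standard-library-only approach written for illustration; `redact_prefixed` is a hypothetical function, and the crate's `redact_secrets` may use different rules and a different placeholder.

```rust
/// Replace every run of key-like characters (ASCII letters, digits, '-', '_')
/// that starts with one of `prefixes` with the placeholder "[REDACTED]".
/// Hypothetical helper for illustration; not the toak_rs `redact_secrets` API.
fn redact_prefixed(input: &str, prefixes: &[&str]) -> String {
    let is_key_char = |c: char| c.is_ascii_alphanumeric() || c == '-' || c == '_';
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while !rest.is_empty() {
        // Length in bytes of the candidate token at the start of `rest`.
        let token_len = rest.find(|c: char| !is_key_char(c)).unwrap_or(rest.len());
        if token_len == 0 {
            // Not a key character: copy it through unchanged.
            let ch = rest.chars().next().unwrap();
            out.push(ch);
            rest = &rest[ch.len_utf8()..];
            continue;
        }
        let (token, tail) = rest.split_at(token_len);
        if prefixes.iter().any(|p| token.starts_with(*p)) {
            out.push_str("[REDACTED]");
        } else {
            out.push_str(token);
        }
        rest = tail;
    }
    out
}
```

With the Quick Start input, `redact_prefixed("let api_key = 'sk-1234567890';", &["sk-"])` returns `let api_key = '[REDACTED]';`. Prefix matching alone misses secrets without a recognizable prefix, so a production redactor would typically add further patterns.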