docs.rs failed to build niblits-0.3.7
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
niblits
A powerful, token-aware text chunking library for processing multiple file formats with language-aware semantic splitting.
Overview
This library provides streaming, async-first text chunking capabilities designed for ingestion pipelines and search systems. It handles diverse document types while maintaining semantic boundaries and offering configurable tokenization strategies.
Features
Multi-Format Support
- Plain Text: Basic text splitting with configurable overlap
- Markdown: Structure-aware chunking preserving headers and sections
- HTML: Tag-aware splitting that respects document structure
- PDF: Text extraction with intelligent chunking of document content
- DOCX: Word document parsing and content chunking
- Source Code: Semantic chunking for 50+ programming languages using tree-sitter grammars
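The plain-text strategy above (fixed-size chunks with configurable overlap) can be sketched in a few lines. This is an illustrative stand-alone function, not the crate's implementation: windows of `size` characters that advance by `size - overlap`.

```rust
/// Split text into fixed-size character windows with a configurable overlap.
/// Illustrative sketch only; the library's text chunker is token-aware and
/// respects semantic boundaries.
fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size, "overlap must be smaller than the chunk size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // final window reached the end of the text
        }
        start += step;
    }
    chunks
}

fn main() {
    let chunks = chunk_with_overlap("abcdefghij", 4, 2);
    // Windows of 4 chars advancing by 2: each chunk shares 2 chars with the next.
    assert_eq!(chunks, ["abcd", "cdef", "efgh", "ghij"]);
    println!("{chunks:?}");
}
```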
Language-Aware Code Chunking
- Grammar-aware parsing using tree-sitter
- Semantic boundary detection (functions, classes, etc.)
- Language-specific chunking strategies
- Support for Rust, Python, JavaScript, TypeScript, Go, and many more
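To see what "semantic boundary detection" means in miniature, here is a deliberately naive sketch that cuts Rust source at top-level items by tracking brace depth. It is not how the tree-sitter-based chunker works (a real grammar handles strings, comments, nesting, and dozens of languages); it only illustrates cutting at structural boundaries instead of fixed offsets.

```rust
/// Naive top-level item splitter: emit a chunk whenever brace depth returns
/// to zero. Illustrative only; grammar-aware parsing is far more robust.
fn split_top_level_items(src: &str) -> Vec<String> {
    let mut items = Vec::new();
    let mut depth = 0usize;
    let mut current = String::new();
    for line in src.lines() {
        current.push_str(line);
        current.push('\n');
        depth += line.matches('{').count();
        depth = depth.saturating_sub(line.matches('}').count());
        if depth == 0 && !current.trim().is_empty() {
            items.push(current.trim().to_string());
            current.clear();
        }
    }
    items
}

fn main() {
    let src = "fn a() {\n    println!(\"a\");\n}\nfn b() {}\n";
    let items = split_top_level_items(src);
    // Two chunks, one per function, each cut at a structural boundary.
    assert_eq!(items.len(), 2);
    println!("{items:#?}");
}
```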
Flexible Tokenization
- Character-based: Simple character counting
- OpenAI tiktoken: cl100k_base, p50k_base, p50k_edit, r50k_base, o200k_base
- HuggingFace: Custom model tokenizers for specialized embeddings
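The reason these strategies are interchangeable is that chunk sizing only needs a function from text to a token count. The sketch below shows that idea with a hypothetical `TokenCounter` trait (the trait and type names are illustrative, not the crate's API); a whitespace word counter stands in for a real subword tokenizer.

```rust
/// Hypothetical abstraction over tokenization strategies: anything that can
/// count "tokens" in a piece of text can drive chunk sizing.
trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

/// Character-based counting: simple and dependency-free.
struct Characters;
impl TokenCounter for Characters {
    fn count(&self, text: &str) -> usize {
        text.chars().count()
    }
}

/// Crude stand-in for a model tokenizer: whitespace-separated words.
struct Words;
impl TokenCounter for Words {
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Would this text fit in a chunk budget under the given strategy?
fn fits(counter: &dyn TokenCounter, text: &str, budget: usize) -> bool {
    counter.count(text) <= budget
}

fn main() {
    // The same text "costs" a different amount under each strategy.
    assert!(fits(&Characters, "hello world", 11));
    assert!(!fits(&Characters, "hello world", 10));
    assert!(fits(&Words, "hello world", 2));
}
```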
Streaming Architecture
- Async-first design with Stream API
- Memory-efficient processing of large files
- Progress tracking with file size monitoring
- Graceful error handling and recovery
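The streaming idea, stripped of async machinery, is to consume input incrementally and track progress against a known total size rather than loading the whole document into memory. A minimal synchronous sketch (names are illustrative; `Cursor` stands in for any reader, such as an open file):

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Consume input line by line, returning (percent-complete, line) pairs.
/// Illustrative sketch of progress tracking; the library does this with
/// async readers and a Stream of chunks.
fn stream_with_progress(data: &str) -> Vec<(u64, String)> {
    let total = data.len() as u64;
    let mut seen = 0u64;
    let mut out = Vec::new();
    let reader = BufReader::new(Cursor::new(data.to_string()));
    for line in reader.lines() {
        let line = line.unwrap();
        seen += line.len() as u64 + 1; // +1 for the newline stripped by lines()
        out.push((seen * 100 / total, line));
    }
    out
}

fn main() {
    for (pct, line) in stream_with_progress("first\nsecond\nthird\n") {
        println!("{pct:>3}% {line}");
    }
}
```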
Quick Start
Add to your Cargo.toml:
```toml
[dependencies]
niblits = "0.3.0"
tokio = { version = "1", features = ["rt", "macros"] }
futures = "0.3"
```
A minimal example, following the `chunk_stream(path, reader, config)` signature from the API reference below (exact import paths and types are inferred and may differ):

```rust
use std::io::Cursor;

use futures::StreamExt;
use niblits::{chunk_stream, ChunkerConfig};

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ChunkerConfig::default();
    // Any async reader works; a Cursor keeps the example self-contained.
    let reader = Cursor::new("# Title\n\nSome markdown.".as_bytes());
    let mut stream = chunk_stream("notes.md", reader, config).await?;
    while let Some(chunk) = stream.next().await {
        println!("{chunk:?}");
    }
    Ok(())
}
```
Configuration
ChunkerConfig
Tokenizer Options
Supported Languages
Check supported programming languages:
```rust
use niblits::{is_language_supported, supported_languages};

// Get all supported languages
let languages = supported_languages();
println!("{languages:?}");

// Check specific languages (name strings here are illustrative)
assert!(is_language_supported("rust"));
assert!(is_language_supported("python"));
```
Commonly supported languages include: Rust, Python, JavaScript, TypeScript, Go, Java, C++, C#, Ruby, PHP, Swift, Kotlin, and many more.
API Reference
Core Functions
- `chunk_stream(path, reader, config)` - Process a file stream and yield chunks
- `walk_project(path, options)` - Recursively walk a directory and stream chunks
- `walk_files(files, project_root, options)` - Chunk a stream of file paths with ignore rules
- `walker_includes_path(project_root, path, max_file_size)` - Check if a path would be included
- `supported_languages()` - Get list of supported programming languages
- `is_language_supported(name)` - Check if a language is supported
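A check like `walker_includes_path` typically combines three tests: the path is inside the project root, not in an ignored directory, and under the size limit. The sketch below is hypothetical (the function name, ignore rules, and parameters are illustrative, not the crate's logic):

```rust
use std::path::Path;

/// Hypothetical include check: inside the root, not ignored, small enough.
fn includes_path(project_root: &Path, path: &Path, file_size: u64, max_file_size: u64) -> bool {
    let in_root = path.starts_with(project_root);
    // Illustrative ignore list; real walkers honor .gitignore-style rules.
    let ignored = path.components().any(|c| {
        let s = c.as_os_str().to_string_lossy();
        s == ".git" || s == "target"
    });
    in_root && !ignored && file_size <= max_file_size
}

fn main() {
    let root = Path::new("/repo");
    assert!(includes_path(root, Path::new("/repo/src/lib.rs"), 1_000, 1_000_000));
    assert!(!includes_path(root, Path::new("/repo/target/debug/app"), 1_000, 1_000_000));
    assert!(!includes_path(root, Path::new("/repo/big.bin"), 2_000_000, 1_000_000));
}
```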
Types
- `Chunk` - Represents different chunk types (Semantic, Text, EndOfFile, Delete)
- `SemanticChunk` - Contains text, tokens, and byte offset information
- `ProjectChunk` - File path, chunk data, and file size
- `ChunkError` - Error types for parsing, IO, and unsupported formats
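One plausible shape for these types, to make the descriptions concrete. The variant payloads and field names below are illustrative guesses, not the crate's exact definitions:

```rust
/// Hypothetical shape of the chunk types described above.
#[allow(dead_code)]
#[derive(Debug)]
enum Chunk {
    Semantic(SemanticChunk),
    Text(SemanticChunk),
    EndOfFile,
    Delete,
}

/// Text plus token count and byte offsets into the source document.
#[derive(Debug)]
struct SemanticChunk {
    text: String,
    tokens: usize,
    start_byte: usize,
    end_byte: usize,
}

fn main() {
    let chunk = Chunk::Semantic(SemanticChunk {
        text: "fn main() {}".to_string(),
        tokens: 5, // illustrative count
        start_byte: 0,
        end_byte: 12,
    });
    if let Chunk::Semantic(s) = &chunk {
        // Byte offsets delimit exactly the chunk's text in the source.
        assert_eq!(s.end_byte - s.start_byte, s.text.len());
    }
    println!("{chunk:?}");
}
```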
Examples
Processing Different File Types
```rust
// Paths, readers, and the `clone()` calls below are illustrative.

// Markdown file
let config = ChunkerConfig::default();
let reader = Cursor::new("# Title\n\nBody text.".as_bytes());
let stream = chunk_stream("notes.md", reader, config.clone()).await;

// PDF file
let file = File::open("report.pdf").await?;
let stream = chunk_stream("report.pdf", file, config.clone()).await;

// Code file
let code_stream = chunk_stream("src/main.rs", code_reader, config).await;
```
Walking Projects
```rust
use niblits::walk_project;
use futures::StreamExt;

// The options argument's type was elided here; defaults shown schematically.
let mut stream = walk_project("./my-project", Default::default());
while let Some(chunk) = stream.next().await {
    println!("{chunk:?}");
}
```
Custom Tokenizer
```rust
// Using a HuggingFace tokenizer (the config fields were elided; shown schematically)
let config = ChunkerConfig { /* HuggingFace model tokenizer */ ..Default::default() };

// Using characters for simple cases
let config = ChunkerConfig { /* character-based counting */ ..Default::default() };
```
Architecture
```text
src/
├── lib.rs            # Public API and main exports
├── types.rs          # Core data structures and error types
├── chunker/          # Format-specific chunkers
│   ├── code.rs       # Language-aware code chunking
│   ├── text.rs       # Plain text chunking
│   ├── markdown.rs   # Markdown-aware chunking
│   ├── html.rs       # HTML-aware chunking
│   ├── pdf.rs        # PDF processing
│   └── docx.rs       # Word document processing
├── languages.rs      # Language support utilities
├── grammars.rs       # Tree-sitter grammar management
└── grammar_loader.rs # Dynamic grammar loading
```
Performance Considerations
- Streaming: All processing is streaming-based to handle large files efficiently
- Memory: Minimal memory footprint with async I/O
- Tokenizers: Preload tokenizers for better performance in batch processing
- Grammars: Tree-sitter grammars are loaded on-demand and cached
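The load-on-demand-and-cache pattern described for grammars can be sketched with a plain `HashMap` (the `GrammarCache` type and its string payload are stand-ins, not the crate's internals): the expensive load runs once per language, and later lookups reuse the cached value.

```rust
use std::collections::HashMap;

/// Illustrative on-demand cache: load a grammar the first time a language is
/// requested, then serve it from memory. `loads` counts the expensive loads.
struct GrammarCache {
    loaded: HashMap<String, String>,
    loads: usize,
}

impl GrammarCache {
    fn new() -> Self {
        Self { loaded: HashMap::new(), loads: 0 }
    }

    fn get(&mut self, language: &str) -> &String {
        if !self.loaded.contains_key(language) {
            self.loads += 1; // the expensive load happens only once per language
            self.loaded
                .insert(language.to_string(), format!("grammar for {language}"));
        }
        &self.loaded[language]
    }
}

fn main() {
    let mut cache = GrammarCache::new();
    cache.get("rust");
    cache.get("rust"); // cache hit, no second load
    cache.get("python");
    assert_eq!(cache.loads, 2);
}
```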
Development
Building

```bash
cargo build
```

Testing

```bash
cargo test
```
Dependencies
Key dependencies:
- `text-splitter`: Core splitting logic with tokenization support
- `tree-sitter`: Code parsing for semantic chunking
- `tiktoken-rs`: OpenAI tokenizer implementation
- `tokenizers`: HuggingFace tokenizer support
- `oxidize-pdf`: PDF text extraction
- `docx-parser`: Word document parsing
- `htmd`: HTML processing
- `palate`: Language detection