# wg-ragsmith
Semantic chunking and RAG utilities for document processing and retrieval-augmented generation.
wg-ragsmith provides high-performance semantic chunking algorithms and vector storage utilities designed for building RAG (Retrieval-Augmented Generation) applications. It supports multiple document formats (HTML, JSON, plain text) and integrates with popular embedding providers.
## ⚠️ EARLY BETA WARNING
This crate is in early development (v0.1.x). APIs are unstable and will change between minor versions.
Breaking changes may arrive without fanfare. Pin exact versions in production, and check release notes carefully before upgrading.
That said, the core algorithms work; just expect some assembly required.
## ✨ Key Features
- 📝 Semantic Chunking: Intelligent document segmentation using embeddings and structural analysis
- 📄 Multi-format Support: Process HTML, JSON, and plain text documents
- 🧠 Embedding Integration: Built-in support for Rig-based embedding providers
- 💾 Vector Storage: SQLite-based vector store with efficient similarity search
- 🔄 Async Processing: Full async/await support with the tokio runtime
- 📊 Rich Metadata: Preserve document structure and provenance information
- 🎛️ Configurable: Extensive tuning options for different use cases
## 🚀 Quick Start
Add wg-ragsmith to your Cargo.toml:
```toml
[dependencies]
wg-ragsmith = "0.1"
```
### Basic Document Chunking
```rust
use wg_ragsmith::SemanticChunkingService;
use wg_ragsmith::MockEmbeddingProvider;

// NOTE: module paths and the chunking method name below are reconstructed for
// illustration; check the crate docs for the exact API.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the chunking service with a mock embedding provider.
    let service = SemanticChunkingService::builder()
        .with_embedding_provider(MockEmbeddingProvider::default())
        .build()?;

    // Chunk a plain-text document and inspect the result.
    let response = service.chunk_text("Your document text here...").await?;
    println!("{:#?}", response);
    Ok(())
}
```
### Vector Storage and Retrieval
```rust
use std::sync::Arc;
use wg_ragsmith::SqliteChunkStore;
use wg_ragsmith::chunk_response_to_ingestion;

// NOTE: module paths, constructor arguments, and search parameters below are
// reconstructed for illustration; check the crate docs for the exact API.

// Set up vector store (wrapped in Arc so it can be shared across tasks)
let store = Arc::new(SqliteChunkStore::new("chunks.db").await?);

// Store chunks (from previous example)
let ingestion = chunk_response_to_ingestion(&response)?;
store.store_batch(ingestion).await?;

// Search for similar content
let query_embedding = vec![0.1_f32; 1536]; // Your query embedding
let results = store.search_similar(&query_embedding, 5).await?;
for result in results {
    println!("{:#?}", result);
}
```
## 📋 Feature Flags
- `semantic-chunking-tiktoken` (default): Enable OpenAI tiktoken-based tokenization
- `semantic-chunking-rust-bert`: Enable Rust BERT integration for advanced NLP
- `semantic-chunking-segtok`: Enable segtok sentence segmentation
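
For example, to opt out of the default tiktoken tokenizer and rely on segtok sentence segmentation instead, a `Cargo.toml` entry along these lines should work (double-check the crate metadata for how the features interact):

```toml
[dependencies]
wg-ragsmith = { version = "0.1", default-features = false, features = ["semantic-chunking-segtok"] }
```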
## 🏗️ Architecture

### Core Components
- `SemanticChunkingService`: Main entry point for document processing
- `HtmlSemanticChunker`: HTML-specific chunking with DOM awareness
- `JsonSemanticChunker`: JSON document processing with structural preservation
- `SqliteChunkStore`: Vector storage with SQLite backend
- Embedding Providers: Pluggable embedding generation (Rig, custom implementations)
### Chunking Strategies
- Percentile: Breakpoints based on embedding similarity percentiles (see the sketch below)
- Standard Deviation: Statistical outlier detection for breakpoints
- Interquartile: Robust statistical breakpoint detection
- Gradient: Similarity gradient analysis
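
To make the strategy descriptions concrete, here is a minimal sketch of the percentile approach, assuming you already have the similarity between each pair of adjacent sentence embeddings; it illustrates the idea only and is not wg-ragsmith's actual implementation:

```rust
/// Conceptual sketch of percentile-based breakpoints: split wherever the
/// similarity between adjacent sentences falls below the given percentile
/// of all adjacent similarities (a low similarity suggests a topic shift).
fn percentile_breakpoints(similarities: &[f32], percentile: f32) -> Vec<usize> {
    if similarities.is_empty() {
        return Vec::new();
    }

    // Similarity value at the requested percentile of the sorted series.
    let mut sorted = similarities.to_vec();
    sorted.sort_by(f32::total_cmp);
    let idx = ((percentile / 100.0) * (sorted.len() - 1) as f32).round() as usize;
    let threshold = sorted[idx];

    // similarities[i] compares sentence i with sentence i + 1, so a breakpoint
    // goes after sentence i.
    (0..similarities.len())
        .filter(|&i| similarities[i] < threshold)
        .map(|i| i + 1)
        .collect()
}
```

The standard deviation and interquartile strategies derive the threshold from statistical outlier tests over the same adjacent-similarity series, while the gradient strategy analyzes the rate of change of similarity rather than its absolute level.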
## 🤝 Integration Examples

### With Rig (Recommended)
```rust
use rig::providers::openai::Client as OpenAIClient;
use wg_ragsmith::SemanticChunkingService;

// NOTE: API key handling and the exact embedding-provider adapter are
// illustrative; see the rig and wg-ragsmith docs for details.
let openai_client = OpenAIClient::new("your-openai-api-key");

let service = SemanticChunkingService::builder()
    .with_embedding_provider(openai_client.embedding_model("text-embedding-3-small"))
    .build()?;
```
### Custom Embedding Provider
```rust
// Sketch only: the trait name, module path, and method signature below are
// assumptions, not the crate's confirmed API; check the crate docs for the
// real embedding-provider trait.
use wg_ragsmith::EmbeddingProvider;

struct MyProvider;

#[async_trait::async_trait]
impl EmbeddingProvider for MyProvider {
    async fn embed(&self, texts: &[String]) -> anyhow::Result<Vec<Vec<f32>>> {
        // Call your own model or embedding service here.
        Ok(texts.iter().map(|_| vec![0.0_f32; 384]).collect())
    }
}
```
## 📈 Performance
- Memory Efficient: Streaming processing for large documents
- Concurrent: Parallel embedding generation with configurable batching
- Cached: Built-in embedding caching to reduce API calls (sketched conceptually below)
- Scalable: SQLite backend supports large document collections
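
Batching and caching are handled inside the crate; purely as a conceptual sketch of the caching idea (not wg-ragsmith's internals), an embedding cache keyed by input text avoids re-embedding repeated content:

```rust
use std::collections::HashMap;

/// Conceptual sketch of embedding caching: embed each distinct text once,
/// then reuse the stored vector on subsequent requests.
struct EmbeddingCache {
    by_text: HashMap<String, Vec<f32>>,
}

impl EmbeddingCache {
    fn new() -> Self {
        Self { by_text: HashMap::new() }
    }

    /// Returns the cached embedding, calling `embed` only on a cache miss.
    fn get_or_compute(&mut self, text: &str, embed: impl FnOnce(&str) -> Vec<f32>) -> &[f32] {
        self.by_text
            .entry(text.to_owned())
            .or_insert_with(|| embed(text))
    }
}
```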
## 🔧 Configuration
Extensive configuration options for tuning chunking behavior:
```rust
use wg_ragsmith::ChunkingConfig;

// Field names below are illustrative; see the crate docs for the full set of
// tuning options (chunk sizes, breakpoint strategy, overlap, batching, ...).
let config = ChunkingConfig {
    max_chunk_tokens: 512, // illustrative field name
    ..Default::default()
};
```
## 📚 Documentation
## 🤝 Contributing
Contributions welcome! Please see the main Weavegraph repository for contribution guidelines.
## 📄 License
Licensed under the MIT License. See LICENSE for details.