LangExtract (Rust Implementation)
A powerful Rust library for extracting structured and grounded information from text using Large Language Models (LLMs).
LangExtract processes unstructured text and extracts specific information with precise character-level alignment, which makes it well suited to document analysis, research paper processing, product catalogs, and similar tasks.
Key Features
- High-Performance Async Processing - Concurrent chunk processing with configurable parallelism
- Universal Provider Support - OpenAI, Ollama, and custom HTTP APIs
- Character-Level Alignment - Precise text positioning with fuzzy matching fallback
- Advanced Validation System - Schema validation, type coercion, and raw data preservation
- Rich Visualization - Export to HTML, Markdown, JSON, and CSV formats
- Multi-Pass Extraction - Improved recall through multiple extraction rounds
- Intelligent Chunking - Automatic text splitting with overlap handling
- Memory-safe and thread-safe by design
Quick Start
Add this to your Cargo.toml:
[dependencies]
= "0.1.0"
Basic Usage Example
use ;
async
Advanced Features
Validation and Type Coercion
use ;
// Enable advanced validation
let validation_config = ValidationConfig ;
// Automatic type coercion handles:
// - Currencies: "$1,234.56" → 1234.56
// - Percentages: "95.5%" → 0.955
// - Booleans: "true", "yes", "1" → true
// - Numbers: "42" → 42, "3.14" → 3.14
// - Emails, phones, URLs, dates
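The coercion rules listed in the comments above can be illustrated with a small standard-library sketch. The function names, the `Number` enum, and the false-side boolean spellings ("false", "no", "0") are assumptions made for illustration; they are not the library's API, which is presumably driven by a schema rather than by separate helpers.

```rust
/// Hypothetical number type for the sketch: integers stay integers.
#[derive(Debug, PartialEq)]
pub enum Number {
    Int(i64),
    Float(f64),
}

/// "$1,234.56" -> 1234.56: drop the dollar sign and thousands separators.
pub fn coerce_currency(raw: &str) -> Option<f64> {
    let cleaned: String = raw
        .trim()
        .chars()
        .filter(|c| !matches!(c, '$' | ','))
        .collect();
    cleaned.parse().ok()
}

/// "95.5%" -> 0.955: require a trailing percent sign, then divide by 100.
pub fn coerce_percentage(raw: &str) -> Option<f64> {
    let number = raw.trim().strip_suffix('%')?;
    number.trim().parse::<f64>().ok().map(|v| v / 100.0)
}

/// "true", "yes", "1" -> true. The false-side spellings are an assumption.
pub fn coerce_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => Some(true),
        "false" | "no" | "0" => Some(false),
        _ => None,
    }
}

/// "42" -> Int(42), "3.14" -> Float(3.14): try the integer parse first.
pub fn coerce_number(raw: &str) -> Option<Number> {
    let s = raw.trim();
    if let Ok(i) = s.parse::<i64>() {
        return Some(Number::Int(i));
    }
    s.parse::<f64>().ok().map(Number::Float)
}
```

Each helper returns `None` when the input does not match, so a caller can keep the raw string instead of losing data. Note that "1" is ambiguous between a boolean and a number, which is why the sketch picks the target type explicitly rather than guessing.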
Rich Visualization
use ;
// Export to interactive HTML
let html_config = ExportConfig ;
let html_output = export_document?;
write?;
// Also supports Markdown, JSON, and CSV exports
Provider Configuration
use ProviderConfig;
// OpenAI configuration
let openai_config = openai;
// Ollama configuration
let ollama_config = ollama;
// Custom HTTP API
let custom_config = custom;
Example Applications
Product Catalog Processing
# Extract product information from catalogs
Academic Paper Analysis
# Extract research information from papers
End-to-End Provider Testing
# Test with multiple LLM providers
Supported Providers
| Provider | Models | Features | Use Case |
|---|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | High accuracy, JSON mode | Production applications |
| Ollama | mistral, llama2, codellama, qwen | Local, privacy-first | Development, sensitive data |
| Custom | Any OpenAI-compatible API | Flexible integration | Custom deployments |
Environment Setup
# For OpenAI
# For Ollama (local)
# For custom providers
Performance Configuration
The ExtractConfig struct provides fine-grained control over extraction performance:
let config = ExtractConfig ;
Performance Tuning Tips
- max_workers: Increase for faster processing (6-12 recommended)
- batch_length: Larger batches = better throughput (4-8 optimal)
- max_char_buffer: Balance speed vs accuracy (6000-12000 characters)
- temperature: Lower values (0.1-0.3) for consistent extraction
See PERFORMANCE_TUNING.md for a detailed optimization guide.
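To get a feel for how these settings interact, the following back-of-the-envelope sketch estimates how much work a document turns into. `TuningSketch` is a hypothetical stand-in, not the library's `ExtractConfig`, and it assumes that chunks hold at most `max_char_buffer` characters (overlap ignored), that chunks are grouped into batches of `batch_length`, and that each worker handles one chunk at a time. How the library actually schedules work may differ.

```rust
/// Hypothetical mirror of three tuning knobs; all fields must be non-zero.
pub struct TuningSketch {
    pub max_workers: usize,
    pub batch_length: usize,
    pub max_char_buffer: usize,
}

/// Ceiling division for a positive divisor.
fn div_ceil(n: usize, d: usize) -> usize {
    (n + d - 1) / d
}

impl TuningSketch {
    /// Chunks needed when each chunk holds at most `max_char_buffer` characters.
    pub fn chunk_count(&self, text_chars: usize) -> usize {
        div_ceil(text_chars, self.max_char_buffer)
    }

    /// Batches needed when chunks are grouped `batch_length` at a time.
    pub fn batch_count(&self, text_chars: usize) -> usize {
        div_ceil(self.chunk_count(text_chars), self.batch_length)
    }

    /// Sequential "waves" if each worker processes one chunk at a time.
    pub fn waves(&self, text_chars: usize) -> usize {
        div_ceil(self.chunk_count(text_chars), self.max_workers)
    }
}
```

Under these assumptions, a 100,000-character document with `max_char_buffer = 8000`, `batch_length = 6`, and `max_workers = 8` yields 13 chunks, 3 batches, and 2 waves. Doubling `max_char_buffer` roughly halves the number of model calls, at the cost of longer prompts per call.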
Real-World Examples
Document Analysis
Perfect for processing contracts, research papers, or reports:
let examples = vec!;
Large Document Processing
The library handles large documents automatically with intelligent chunking:
// Configure for academic papers or catalogs
let config = ExtractConfig ;
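As a rough illustration of chunking with overlap, the sketch below cuts text into fixed-size windows measured in characters rather than bytes, so multi-byte UTF-8 characters are never split. The function `chunk_with_overlap` and its parameters are hypothetical; the library's own chunker is described as "intelligent" and may well prefer sentence or paragraph boundaries, which this sketch does not attempt.

```rust
/// Split `text` into chunks of at most `max_chars` characters. Every chunk
/// after the first starts `overlap` characters before the previous chunk ended.
/// Returns (start_char_offset, chunk) pairs.
pub fn chunk_with_overlap(text: &str, max_chars: usize, overlap: usize) -> Vec<(usize, &str)> {
    assert!(max_chars > 0 && overlap < max_chars, "need overlap < max_chars");

    // Byte offset of every character boundary, plus the end of the text.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let total_chars = bounds.len() - 1;

    let mut chunks = Vec::new();
    let mut start = 0;
    while start < total_chars {
        let end = (start + max_chars).min(total_chars);
        // Slicing at recorded boundaries can never cut a character in half.
        chunks.push((start, &text[bounds[start]..bounds[end]]));
        if end == total_chars {
            break;
        }
        // Step back by `overlap` so neighbouring chunks share some context.
        start = end - overlap;
    }
    chunks
}
```

Keeping the starting character offset with each chunk is what lets positions found inside a chunk be mapped back to positions in the full document.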
Error Handling
The library provides comprehensive error types:
use LangExtractError;
match extract.await
Architecture & Performance
High-Performance Features
- Concurrent processing: Multiple workers process chunks in parallel
- UTF-8 safe: Handles Unicode text with proper character boundary detection
- Memory efficient: Streaming processing for large documents
- Async I/O: Non-blocking network operations
- Smart chunking: Intelligent text splitting with overlap handling
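The idea of several workers processing chunks in parallel can be sketched with the standard library alone. This example deliberately swaps the library's async pipeline for scoped OS threads (`std::thread::scope`) so that it needs no external crates; `process_chunks` and its signature are hypothetical, and the `work` closure stands in for a model call.

```rust
use std::thread;

/// Apply `work` to every chunk using at most `max_workers` threads and
/// return the results in the original chunk order.
pub fn process_chunks<T, F>(chunks: &[&str], max_workers: usize, work: F) -> Vec<T>
where
    T: Send,
    F: Fn(&str) -> T + Sync,
{
    if chunks.is_empty() {
        return Vec::new();
    }
    let workers = max_workers.max(1).min(chunks.len());
    // Ceiling division, so no more than `workers` slices are produced.
    let per_worker = (chunks.len() + workers - 1) / workers;

    thread::scope(|scope| {
        let work = &work;
        // Spawn one thread per contiguous slice of chunks.
        let handles: Vec<_> = chunks
            .chunks(per_worker)
            .map(|slice| {
                scope.spawn(move || slice.iter().map(|&c| work(c)).collect::<Vec<T>>())
            })
            .collect();
        // Joining in spawn order keeps the results aligned with the input.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

For example, `process_chunks(&["alpha", "be"], 2, |c| c.len())` returns the lengths in input order. For network-bound model calls an async runtime is usually the better fit, since a task waiting on I/O does not occupy a thread.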
Development Status
This Rust implementation provides a complete, production-ready text extraction system:
Core Infrastructure (COMPLETE)
- Data structures and type system - Robust extraction and document models
- Error handling and results - Comprehensive error types with context
- Universal provider system - OpenAI, Ollama, and custom HTTP APIs
- Async processing pipeline - High-performance concurrent chunk processing
Text Processing (COMPLETE)
- Intelligent chunking - Automatic document splitting with overlap management
- Character alignment - Precise text positioning with fuzzy matching fallback
- Multi-pass extraction - Improved recall through multiple extraction rounds
- Prompt template system - Flexible LLM prompt generation
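One plausible way to combine the results of multiple extraction rounds is sketched below: earlier passes take priority, and an extraction from a later pass is added only if its character span does not overlap one that has already been kept. The `Extraction` struct and `merge_passes` function are hypothetical, and this merge policy is an assumption rather than a description of the library's behavior.

```rust
/// Hypothetical extraction record with a character span (end exclusive).
#[derive(Debug, Clone, PartialEq)]
pub struct Extraction {
    pub class: String,
    pub text: String,
    pub start: usize,
    pub end: usize,
}

/// Two half-open spans overlap when each starts before the other ends.
fn overlaps(a: &Extraction, b: &Extraction) -> bool {
    a.start < b.end && b.start < a.end
}

/// Merge passes in order. A candidate is kept only if it does not overlap
/// any extraction that has already been kept.
pub fn merge_passes(passes: Vec<Vec<Extraction>>) -> Vec<Extraction> {
    let mut merged: Vec<Extraction> = Vec::new();
    for pass in passes {
        for candidate in pass {
            if !merged.iter().any(|kept| overlaps(kept, &candidate)) {
                merged.push(candidate);
            }
        }
    }
    // Return the result in document order.
    merged.sort_by_key(|e| e.start);
    merged
}
```

With this policy a second pass can only add spans the first pass missed, which is how extra rounds improve recall without producing duplicates.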
Validation & Quality (COMPLETE)
- Advanced validation system - Schema validation with type coercion
- Raw data preservation - Save original LLM outputs before processing
- Type coercion - Automatic conversion of strings to appropriate types
- Quality assurance - Validation reporting and data correction
Visualization & Export (COMPLETE)
- Rich HTML export - Interactive highlighting with modern styling
- Multiple formats - HTML, Markdown, JSON, and CSV export options
- Character-level highlighting - Precise extraction positioning in source text
- Statistical reporting - Comprehensive extraction analytics
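Character-level highlighting in an HTML export can be sketched as follows: the source text is walked once, and each extraction span is wrapped in a `<mark>` element while everything is HTML-escaped. The function names, the `(start, end, label)` tuple layout, and the choice of `<mark>` with a `title` attribute are illustrative assumptions; the library's actual HTML output is not shown in this document.

```rust
/// Escape the characters that are special in HTML text and attribute values.
fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Wrap each (start, end, label) character span in a <mark> element.
/// Spans should be sorted by start; invalid or overlapping spans are skipped.
pub fn highlight_html(source: &str, spans: &[(usize, usize, &str)]) -> String {
    let chars: Vec<char> = source.chars().collect();
    let mut html = String::new();
    let mut cursor = 0;
    for &(start, end, label) in spans {
        if start < cursor || end <= start || end > chars.len() {
            continue;
        }
        let before: String = chars[cursor..start].iter().collect();
        let inside: String = chars[start..end].iter().collect();
        html.push_str(&escape_html(&before));
        html.push_str(&format!(
            "<mark title=\"{}\">{}</mark>",
            escape_html(label),
            escape_html(&inside)
        ));
        cursor = end;
    }
    // Emit whatever follows the last highlighted span.
    let rest: String = chars[cursor..].iter().collect();
    html.push_str(&escape_html(&rest));
    html
}
```

Indexing by character rather than by byte keeps the highlight boundaries correct for non-ASCII text, and escaping both the label and the text avoids injecting markup from the source document into the report.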
Architecture Advantages
- Type Safety: Compile-time guarantees for configurations and data structures
- Memory Safety: Rust's ownership system prevents common memory errors
- Performance: Zero-cost abstractions and efficient async processing
- Explicit Configuration: Clear, predictable provider and processing setup
- Unicode Support: Proper handling of international text and mathematical symbols
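To make the combination of character-level alignment and Unicode handling concrete, here is a minimal sketch of locating an extracted string in its source text and reporting the position in characters, not bytes. `CharSpan` and `align` are hypothetical names, and the fallback shown is a plain case-insensitive scan used as a simple stand-in for fuzzy matching; the library's actual fuzzy matcher is not described in this document.

```rust
/// Character-offset span of an extraction in the source text (end exclusive).
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct CharSpan {
    pub start: usize,
    pub end: usize,
}

/// Try an exact match first, then fall back to a case-insensitive scan.
pub fn align(source: &str, needle: &str) -> Option<CharSpan> {
    if needle.is_empty() {
        return None;
    }
    let needle_len = needle.chars().count();

    // Exact match: `find` returns a byte offset, so convert it to characters.
    if let Some(byte_pos) = source.find(needle) {
        let start = source[..byte_pos].chars().count();
        return Some(CharSpan { start, end: start + needle_len });
    }

    // Fallback: slide a window over the characters and compare ignoring case.
    let src: Vec<char> = source.chars().collect();
    let pat: Vec<char> = needle.chars().collect();
    if pat.len() > src.len() {
        return None;
    }
    (0..=src.len() - pat.len())
        .find(|&i| {
            src[i..i + pat.len()]
                .iter()
                .zip(&pat)
                .all(|(a, b)| a.to_lowercase().eq(b.to_lowercase()))
        })
        .map(|start| CharSpan { start, end: start + pat.len() })
}
```

In "Café costs $5", the word "costs" starts at byte 6 but at character 5, because "é" occupies two bytes in UTF-8. Reporting character offsets keeps spans stable regardless of how the text is encoded.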
Testing & Examples
Run the included test scripts to explore LangExtract capabilities:
# Test with product catalogs
# Test with academic papers
# Test multiple LLM providers
Each test generates interactive HTML reports, structured JSON data, and CSV exports for analysis.
Documentation
- SPEC.md - Complete technical specification and implementation status
- PERFORMANCE_TUNING.md - Detailed performance optimization guide
- E2E_TEST_README.md - End-to-end testing instructions
Contributing
We welcome contributions! Key areas for enhancement:
- Additional LLM provider implementations
- New export formats and visualization options
- Performance optimizations for specific document types
- Enhanced validation and quality assurance features
License
Licensed under the Apache License, Version 2.0. See LICENSE for details. For health-related applications, use of LangExtract is also subject to the Health AI Developer Foundations Terms of Use.
Citations & Acknowledgments
This work builds upon research and implementations from the broader NLP and information extraction community:
Acknowledgments:
- Inspired by the folks at Google who open-sourced langextract
- Thank you so much for providing such a complicated tool to the AI engineers of the world who are striving for more deterministic outcomes.