LangExtract (Rust Implementation)
A powerful Rust library for extracting structured and grounded information from text using Large Language Models (LLMs).
LangExtract processes unstructured text and extracts specific information with precise character-level alignment, making it perfect for document analysis, research paper processing, product catalogs, and more.
Key Features
- High-Performance Async Processing - Concurrent chunk processing with configurable parallelism
- Universal Provider Support - OpenAI, Ollama, and custom HTTP APIs
- Character-Level Alignment - Precise text positioning with fuzzy matching fallback
- Advanced Validation System - Schema validation, type coercion, and raw data preservation
- Rich Visualization - Export to HTML, Markdown, JSON, and CSV formats
- Multi-Pass Extraction - Improved recall through multiple extraction rounds
- Intelligent Chunking - Automatic text splitting with overlap handling
- Memory-safe and thread-safe by design
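Each result is grounded: the extracted value carries the character span it was aligned to in the source text. A simplified sketch of that model in plain Rust (type and function names are illustrative, not the crate's actual API):

```rust
// Simplified model of a grounded extraction: the value plus the
// character span in the source text it was aligned to.
#[derive(Debug, PartialEq)]
struct CharInterval {
    start: usize,
    end: usize,
}

#[derive(Debug)]
struct Extraction {
    class: String,
    text: String,
    interval: Option<CharInterval>,
}

// Align an extracted string to the source by exact search, falling
// back to a case-insensitive match (a stand-in for fuzzy matching).
fn align(source: &str, extracted: &str) -> Option<CharInterval> {
    source
        .find(extracted)
        .or_else(|| source.to_lowercase().find(&extracted.to_lowercase()))
        .map(|start| CharInterval { start, end: start + extracted.len() })
}

fn main() {
    let source = "Dr. Ada Lovelace wrote the first algorithm.";
    let extraction = Extraction {
        class: "person".into(),
        text: "Ada Lovelace".into(),
        interval: align(source, "Ada Lovelace"),
    };
    println!("{extraction:?}");
}
```

The fallback here is deliberately crude; the library's fuzzy matcher is more sophisticated, but the shape of the result (value plus optional interval) is the key idea.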
Quick Start
CLI Installation
Quick Install (Recommended)
Linux/macOS (Auto-detect best method):
Windows (PowerShell):

```shell
iwr -useb https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.ps1 | iex
```
Alternative Installation Methods
From crates.io (requires Rust):
Pre-built binaries (no Rust required):
```shell
# Download from GitHub releases
```
Homebrew (macOS/Linux): coming soon.
From source:
CLI Quick Start
```shell
# Initialize configuration (provider required)
# Extract from text (provider required)
# Test your setup
# Process files
# Check available providers
```
Library Usage
Add this to your Cargo.toml:

```toml
[dependencies]
langextract-rust = "0.1.0"
```
Basic Usage Example

A minimal sketch of the call flow; exact item paths and signatures may differ from the published crate, so check the crate docs:

```rust
use langextract_rust::{extract, ExtractConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ExtractConfig::default();
    let examples = vec![/* few-shot example extractions */];
    let result = extract("Dr. Ada Lovelace wrote the first algorithm.", &examples, &config).await?;
    println!("{result:?}");
    Ok(())
}
```
Command Line Interface
The CLI provides a powerful interface for text extraction without writing code.
Installation Options
Quick Install (Recommended)
```shell
# Linux/macOS
# Windows PowerShell
iwr -useb https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.ps1 | iex
```
Manual Install
```shell
# From source with CLI features
# Or clone and build
```
CLI Commands
Extract Command
Extract structured information from text, files, or URLs:
```shell
# Basic extraction
# From file with custom examples
# With specific provider and model
# From URL
# Advanced options
```
Configuration Commands
```shell
# Initialize configuration files (provider required)
# Initialize for OpenAI provider
# Force overwrite existing configs
# Test provider connectivity (provider required)
```
Information Commands
```shell
# List available providers and models
# Show example configurations
# Get help
```
Conversion Commands
```shell
# Convert between formats
```
Configuration Files
The CLI supports configuration files for easier management:
examples.json
.env
```shell
# Provider API keys
OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_gemini_key_here

# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434
```
langextract.yaml
```yaml
# Default configuration
model: "mistral"
provider: "ollama"
model_url: "http://localhost:11434"
temperature: 0.3
max_char_buffer: 8000
max_workers: 6
batch_length: 4
multipass: false
extraction_passes: 1
```
CLI Examples by Use Case
Document Processing
```shell
# Academic papers
# Legal documents
```
Data Extraction
```shell
# Product catalogs
# Contact information
```
Batch Processing
```shell
# Process multiple files
for file in *.txt; do
  ...
done

# URL processing
```
Provider-Specific Setup
Ollama (Local)
```shell
# Install and start Ollama
# Test connection
```
OpenAI
```shell
# Set API key
# Test connection
```
Gemini
```shell
# Set API key
# Test connection
```
Performance Optimization
```shell
# High-performance extraction
# Memory-efficient processing
```
Troubleshooting
```shell
# Verbose output for debugging
# Test specific provider
# Check installation
# Reset configuration
```
Advanced Features
Validation and Type Coercion
```rust
use langextract_rust::ValidationConfig; // crate path may differ

// Enable advanced validation (remaining fields elided)
let validation_config = ValidationConfig { /* … */ };

// Automatic type coercion handles:
// - Currencies: "$1,234.56" → 1234.56
// - Percentages: "95.5%" → 0.955
// - Booleans: "true", "yes", "1" → true
// - Numbers: "42" → 42, "3.14" → 3.14
// - Emails, phones, URLs, dates
```
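The coercion rules above can be sketched in plain Rust. This is illustrative logic only, not the crate's implementation; the `Coerced` enum and `coerce` function are invented for the example:

```rust
// Illustrative type coercion: turn common string shapes produced by
// LLMs into typed values, keeping the raw string when nothing matches.
#[derive(Debug, PartialEq)]
enum Coerced {
    Number(f64),
    Bool(bool),
    Raw(String),
}

fn coerce(raw: &str) -> Coerced {
    let s = raw.trim();

    // Currencies: "$1,234.56" → 1234.56
    if let Some(stripped) = s.strip_prefix('$') {
        if let Ok(n) = stripped.replace(',', "").parse::<f64>() {
            return Coerced::Number(n);
        }
    }
    // Percentages: "95.5%" → 0.955
    if let Some(stripped) = s.strip_suffix('%') {
        if let Ok(n) = stripped.parse::<f64>() {
            return Coerced::Number(n / 100.0);
        }
    }
    // Booleans: "true", "yes", "1" → true
    match s.to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => return Coerced::Bool(true),
        "false" | "no" | "0" => return Coerced::Bool(false),
        _ => {}
    }
    // Plain numbers: "42" → 42.0, "3.14" → 3.14
    if let Ok(n) = s.parse::<f64>() {
        return Coerced::Number(n);
    }
    Coerced::Raw(s.to_string())
}

fn main() {
    println!("{:?}", coerce("$1,234.56"));
    println!("{:?}", coerce("95.5%"));
    println!("{:?}", coerce("yes"));
}
```

Note the ordering: boolean-ish strings like "1" are checked before numeric parsing, matching the table above.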
Rich Visualization
```rust
use langextract_rust::{export_document, ExportConfig}; // paths may differ

// Export to interactive HTML (config fields elided)
let html_config = ExportConfig { /* … */ };
let html_output = export_document(&document, &html_config)?;
std::fs::write("extraction.html", html_output)?;

// Also supports Markdown, JSON, and CSV exports
```
Provider Configuration
```rust
use langextract_rust::ProviderConfig; // crate path may differ

// OpenAI configuration (arguments elided)
let openai_config = ProviderConfig::openai(/* … */);

// Ollama configuration
let ollama_config = ProviderConfig::ollama(/* … */);

// Custom HTTP API
let custom_config = ProviderConfig::custom(/* … */);
```
Example Applications

Product Catalog Processing

```shell
# Extract product information from catalogs
```

Academic Paper Analysis

```shell
# Extract research information from papers
```

End-to-End Provider Testing

```shell
# Test with multiple LLM providers
```
Supported Providers

| Provider | Models | Features | Use Case |
|---|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | High accuracy, JSON mode | Production applications |
| Ollama | mistral, llama2, codellama, qwen | Local, privacy-first | Development, sensitive data |
| Custom | Any OpenAI-compatible API | Flexible integration | Custom deployments |
Environment Setup

```shell
# For OpenAI
export OPENAI_API_KEY=your_openai_key_here

# For Ollama (local)
export OLLAMA_BASE_URL=http://localhost:11434

# For custom providers
```
Performance Configuration

The ExtractConfig struct provides fine-grained control over extraction performance:

```rust
let config = ExtractConfig { /* … */ };
```
Performance Tuning Tips
- max_workers: Increase for faster processing (6-12 recommended)
- batch_length: Larger batches = better throughput (4-8 optimal)
- max_char_buffer: Balance speed vs accuracy (6000-12000 characters)
- temperature: Lower values (0.1-0.3) for consistent extraction
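The tuning knobs above map directly onto the fields shown in langextract.yaml. As a sketch, a high-throughput profile looks something like this (field names follow the YAML; the struct and its types are illustrative, not the crate's real definition):

```rust
// Illustrative mirror of the langextract.yaml knobs; the crate's real
// ExtractConfig may differ in field names and types.
#[derive(Debug)]
struct ExtractConfig {
    model: String,
    provider: String,
    temperature: f32,       // lower (0.1-0.3) for consistent extraction
    max_char_buffer: usize, // chunk size: balance speed vs accuracy
    max_workers: usize,     // concurrent chunk workers
    batch_length: usize,    // chunks submitted per batch
    multipass: bool,
    extraction_passes: usize,
}

fn main() {
    let config = ExtractConfig {
        model: "mistral".into(),
        provider: "ollama".into(),
        temperature: 0.3,
        max_char_buffer: 8000,
        max_workers: 6,
        batch_length: 4,
        multipass: false,
        extraction_passes: 1,
    };
    println!("{config:?}");
}
```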
See PERFORMANCE_TUNING.md for detailed optimization guide.
Real-World Examples

Document Analysis

Perfect for processing contracts, research papers, or reports:

```rust
let examples = vec![/* few-shot example extractions */];
```
Large Document Processing
The library handles large documents automatically with intelligent chunking:
```rust
// Configure for academic papers or catalogs (fields elided)
let config = ExtractConfig { /* … */ };
```
Error Handling
The library provides comprehensive error types:
```rust
use langextract_rust::LangExtractError; // crate path may differ

match extract(text, &examples, &config).await {
    Ok(document) => { /* use the grounded extractions */ }
    Err(e) => eprintln!("extraction failed: {e}"),
}
```
Architecture & Performance
High-Performance Features
- Concurrent processing: Multiple workers process chunks in parallel
- UTF-8 safe: Handles Unicode text with proper character boundary detection
- Memory efficient: Streaming processing for large documents
- Async I/O: Non-blocking network operations
- Smart chunking: Intelligent text splitting with overlap handling
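The smart chunking described above can be illustrated with a UTF-8-safe splitter that shares an overlap between consecutive chunks. This is a simplified stand-in for the library's chunker, not its actual code:

```rust
// Split text into chunks of at most `max_chars` characters, with
// `overlap` characters shared between consecutive chunks. Operating on
// char boundaries (not bytes) keeps the split UTF-8 safe.
fn chunk(text: &str, max_chars: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < max_chars);
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_chars).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap; // step back to create the overlap
    }
    chunks
}

fn main() {
    // Consecutive chunks share one character of context.
    for c in chunk("abcdefghij", 4, 1) {
        println!("{c}");
    }
}
```

The overlap gives the LLM context across chunk boundaries, so extractions that straddle a split are not lost; the real chunker also prefers natural break points such as sentence boundaries.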
Development Status
This Rust implementation provides a complete, production-ready text extraction system:
Core Infrastructure (COMPLETE)
- Data structures and type system - Robust extraction and document models
- Error handling and results - Comprehensive error types with context
- Universal provider system - OpenAI, Ollama, and custom HTTP APIs
- Async processing pipeline - High-performance concurrent chunk processing
Text Processing (COMPLETE)
- Intelligent chunking - Automatic document splitting with overlap management
- Character alignment - Precise text positioning with fuzzy matching fallback
- Multi-pass extraction - Improved recall through multiple extraction rounds
- Prompt template system - Flexible LLM prompt generation
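Multi-pass extraction improves recall by unioning results across rounds. A simplified merge in plain Rust (illustrative only; the crate's merge also reconciles overlapping character intervals):

```rust
use std::collections::HashSet;

// Union the (class, text) extractions found across several passes,
// dropping exact duplicates, so extra passes can only add recall.
fn merge_passes(passes: Vec<Vec<(String, String)>>) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for pass in passes {
        for item in pass {
            if seen.insert(item.clone()) {
                merged.push(item);
            }
        }
    }
    merged
}

fn main() {
    let merged = merge_passes(vec![
        vec![("person".to_string(), "Ada".to_string())],
        vec![
            ("person".to_string(), "Ada".to_string()),
            ("person".to_string(), "Grace".to_string()),
        ],
    ]);
    println!("{merged:?}");
}
```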
Validation & Quality (COMPLETE)
- Advanced validation system - Schema validation with type coercion
- Raw data preservation - Save original LLM outputs before processing
- Type coercion - Automatic conversion of strings to appropriate types
- Quality assurance - Validation reporting and data correction
Visualization & Export (COMPLETE)
- Rich HTML export - Interactive highlighting with modern styling
- Multiple formats - HTML, Markdown, JSON, and CSV export options
- Character-level highlighting - Precise extraction positioning in source text
- Statistical reporting - Comprehensive extraction analytics
Architecture Advantages
- Type Safety: Compile-time guarantees for configurations and data structures
- Memory Safety: Rust's ownership system prevents common memory errors
- Performance: Zero-cost abstractions and efficient async processing
- Explicit Configuration: Clear, predictable provider and processing setup
- Unicode Support: Proper handling of international text and mathematical symbols
Testing & Examples
Run the included test scripts to explore LangExtract capabilities:
```shell
# Test with product catalogs
# Test with academic papers
# Test multiple LLM providers
```
Each test generates interactive HTML reports, structured JSON data, and CSV exports for analysis.
Documentation
- SPEC.md - Complete technical specification and implementation status
- PERFORMANCE_TUNING.md - Detailed performance optimization guide
- E2E_TEST_README.md - End-to-end testing instructions
Contributing
We welcome contributions! Key areas for enhancement:
- Additional LLM provider implementations
- New export formats and visualization options
- Performance optimizations for specific document types
- Enhanced validation and quality assurance features
License
Licensed under the Apache License, Version 2.0. See LICENSE for details. For health-related applications, use of LangExtract is also subject to the Health AI Developer Foundations Terms of Use.
Citations & Acknowledgments
This work builds upon research and implementations from the broader NLP and information extraction community:
Acknowledgments:
- Inspired by the team at Google that open-sourced langextract
- Thank you for providing such a capable tool to the AI engineers of the world working toward more deterministic outcomes.