LangExtract (Rust)
A Rust library for extracting structured, source-grounded information from unstructured text using LLMs. Every extraction is mapped back to exact character offsets in the original document, enabling verification and interactive highlighting.
The core workflow: you provide a few examples of what to extract. The library then builds a few-shot prompt, chunks large documents, sends the chunks to an LLM in parallel, parses the structured JSON responses, aligns the extracted values back to character positions in the source text, and finally deduplicates and aggregates the results.
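The shape of that workflow can be illustrated with a small, std-only sketch. It is not the library's implementation: `Extraction`, `chunk`, and `extract_with` are hypothetical names, the `infer` closure stands in for the LLM call, chunks are processed sequentially rather than in parallel, and offsets are byte offsets for simplicity.

```rust
use std::collections::HashSet;

/// A source-grounded extraction: the value plus its byte span in the document.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Extraction {
    pub class: String,
    pub text: String,
    pub start: usize,
    pub end: usize,
}

/// Split `doc` into chunks of at most `max_bytes` bytes, preferring to break
/// after ASCII whitespace. Returns each chunk with its start offset in `doc`.
pub fn chunk(doc: &str, max_bytes: usize) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < doc.len() {
        let mut end = (start + max_bytes).min(doc.len());
        // Back up to a valid UTF-8 boundary.
        while !doc.is_char_boundary(end) {
            end -= 1;
        }
        // Always make progress, even if the budget is smaller than one char.
        if end <= start {
            end = start + doc[start..].chars().next().map_or(1, |c| c.len_utf8());
        }
        // Prefer to end the chunk just after the last whitespace in the window.
        if end < doc.len() {
            if let Some(ws) = doc[start..end].rfind(|c: char| c.is_ascii_whitespace()) {
                if ws > 0 {
                    end = start + ws + 1;
                }
            }
        }
        chunks.push((start, &doc[start..end]));
        start = end;
    }
    chunks
}

/// Run `infer` over every chunk, map each returned (class, value) pair back to
/// document offsets with an exact search, and drop duplicate extractions.
pub fn extract_with<F>(doc: &str, max_bytes: usize, infer: F) -> Vec<Extraction>
where
    F: Fn(&str) -> Vec<(String, String)>,
{
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for (offset, text) in chunk(doc, max_bytes) {
        for (class, value) in infer(text) {
            // Chunk-local position + chunk offset = position in the document.
            if let Some(local) = text.find(value.as_str()) {
                let e = Extraction {
                    class,
                    start: offset + local,
                    end: offset + local + value.len(),
                    text: value,
                };
                if seen.insert(e.clone()) {
                    out.push(e);
                }
            }
        }
    }
    out
}
```

The key point is the offset translation: a value is located inside its chunk and then shifted by the chunk's start, so every result can be checked with `&doc[e.start..e.end] == e.text`.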
Key Features
- High-performance async processing with configurable concurrency via buffer_unordered
- Multiple provider support: OpenAI, Ollama, and custom HTTP APIs
- Character-level alignment: exact match, then fuzzy word-overlap fallback
- Validation and type coercion: schema validation, raw data preservation, automatic type detection
- Visualization: export to interactive HTML, Markdown, JSON, and CSV
- Multi-pass extraction: improved recall through targeted reprocessing of low-yield chunks
- Semantic chunking: intelligent text splitting via semchunk-rs with sentence boundary awareness
- Memory efficient: zero-copy document sharing via Arc, pre-computed tokenization
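To make the "exact match, then fuzzy word-overlap fallback" idea concrete, here is a minimal std-only sketch. It is an assumption about how such a fallback could work, not the crate's actual algorithm: the `Alignment` enum, the `align` function, the sliding-window scoring, and the `threshold` parameter are all hypothetical, and offsets are byte offsets.

```rust
/// Outcome of mapping an extracted value back onto the source text.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
pub enum Alignment {
    Exact { start: usize, end: usize },
    Fuzzy { start: usize, end: usize, overlap: f64 },
    Unaligned,
}

/// Split text into lowercased alphanumeric words with their byte spans.
fn words(text: &str) -> Vec<(usize, usize, String)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            out.push((s, i, text[s..i].to_lowercase()));
        }
    }
    if let Some(s) = start {
        out.push((s, text.len(), text[s..].to_lowercase()));
    }
    out
}

/// Try an exact substring match first; otherwise slide a window of as many
/// words as the value contains over the source and keep the window that
/// contains the largest fraction of the value's words.
pub fn align(source: &str, value: &str, threshold: f64) -> Alignment {
    if value.is_empty() {
        return Alignment::Unaligned;
    }
    if let Some(start) = source.find(value) {
        return Alignment::Exact { start, end: start + value.len() };
    }
    let src = words(source);
    let wanted: Vec<String> = words(value).into_iter().map(|w| w.2).collect();
    if wanted.is_empty() || src.len() < wanted.len() {
        return Alignment::Unaligned;
    }
    let mut best: Option<(usize, usize, f64)> = None;
    for window in src.windows(wanted.len()) {
        let hits = wanted
            .iter()
            .filter(|w| window.iter().any(|s| s.2 == **w))
            .count();
        let overlap = hits as f64 / wanted.len() as f64;
        // Strict comparison keeps the earliest window among ties.
        if best.map_or(true, |b| overlap > b.2) {
            best = Some((window[0].0, window[window.len() - 1].1, overlap));
        }
    }
    match best {
        Some((start, end, overlap)) if overlap >= threshold => {
            Alignment::Fuzzy { start, end, overlap }
        }
        _ => Alignment::Unaligned,
    }
}
```

A fuzzy hit is useful when the model normalizes casing or drops punctuation: against the source "The patient was given amoxicillin, 250 mg daily.", the value "Amoxicillin 250 mg" has no exact match, but the window covering "amoxicillin, 250 mg" contains all three of its words.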
Quick Start
CLI Installation
From source (requires Rust):
From repository:
CLI Usage
# Initialize configuration
# Extract from text
# Process files with HTML export
# Test provider connectivity
# List available providers
Library Usage
Add to your Cargo.toml:
[]
= "0.4"
Basic example:
use ;
async
CLI Reference
Extract Command
# From file with options
# From URL
Configuration Commands
Configuration Files
examples.json
.env
OPENAI_API_KEY=your_openai_key_here
OLLAMA_BASE_URL=http://localhost:11434
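As a rough illustration of how these two variables might be consumed, the sketch below reads them with a fallback for the Ollama URL. `ProviderEnv`, `from_lookup`, and `from_env` are hypothetical names, not part of the crate's API. Note that the standard library does not parse `.env` files itself, so this assumes the variables have already been loaded into the process environment.

```rust
/// Provider settings taken from the environment.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
pub struct ProviderEnv {
    /// `None` when OPENAI_API_KEY is unset or blank.
    pub openai_api_key: Option<String>,
    /// Falls back to the local Ollama default when OLLAMA_BASE_URL is unset.
    pub ollama_base_url: String,
}

/// Build the settings from any key lookup, which keeps the logic testable
/// without touching the real process environment.
pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> ProviderEnv {
    ProviderEnv {
        openai_api_key: get("OPENAI_API_KEY").filter(|k| !k.trim().is_empty()),
        ollama_base_url: get("OLLAMA_BASE_URL")
            // Normalize away a trailing slash so paths can be appended safely.
            .map(|u| u.trim().trim_end_matches('/').to_string())
            .filter(|u| !u.is_empty())
            .unwrap_or_else(|| "http://localhost:11434".to_string()),
    }
}

/// Convenience wrapper over the real process environment.
pub fn from_env() -> ProviderEnv {
    from_lookup(|key| std::env::var(key).ok())
}
```

Taking a lookup closure instead of calling `std::env::var` directly is a design choice for testability: the same function can be driven from a map in tests and from the environment in production.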
Supported Providers
| Provider | Models | Notes |
|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | Via async-openai, feature-gated (--features openai) |
| Ollama | mistral, llama2, codellama, qwen | Local inference via HTTP to /api/generate |
| Custom | Any OpenAI-compatible API | For vLLM, LiteLLM, and other compatible endpoints |
Provider Setup
# OpenAI
# Ollama (local)
Configuration
The ExtractConfig struct controls extraction behavior:
let config = ExtractConfig ;
Tuning Guidelines
- max_workers: 6-12 for parallel throughput
- batch_length: 4-8 for optimal batching
- max_char_buffer: 6000-12000 characters per chunk
- temperature: 0.1-0.3 for consistent extraction
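These ranges can be encoded as a small sanity check. The sketch below is illustrative only: `Tuning` is a hypothetical stand-in that borrows the four field names from the guidelines above (the field types are assumptions), and `tuning_warnings` is not a function provided by the library.

```rust
/// Stand-in for the tunable fields named in the guidelines.
/// (Hypothetical type; field types are assumptions.)
pub struct Tuning {
    pub max_workers: usize,
    pub batch_length: usize,
    pub max_char_buffer: usize,
    pub temperature: f32,
}

/// Return one warning per field that falls outside its suggested range.
/// An empty vector means every value is within the guidelines.
pub fn tuning_warnings(t: &Tuning) -> Vec<String> {
    let mut warnings = Vec::new();
    let mut check = |name: &str, ok: bool, range: &str| {
        if !ok {
            warnings.push(format!("{name} is outside the suggested range {range}"));
        }
    };
    check("max_workers", (6..=12).contains(&t.max_workers), "6-12");
    check("batch_length", (4..=8).contains(&t.batch_length), "4-8");
    check(
        "max_char_buffer",
        (6000..=12000).contains(&t.max_char_buffer),
        "6000-12000",
    );
    check("temperature", (0.1..=0.3).contains(&t.temperature), "0.1-0.3");
    warnings
}
```

The check only warns: the ranges are guidelines, and values outside them may still be reasonable for a particular provider or document type.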
Advanced Features
Validation and Type Coercion
use ;
let validation_config = ValidationConfig ;
Supported coercion types: integers, floats, booleans, currencies, percentages, emails, phone numbers, dates, URLs.
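The sketch below shows what automatic type detection could look like for a subset of these types (booleans, percentages, currencies, integers, floats). It is a simplified illustration under stated assumptions, not the crate's coercion logic: the `Coerced` enum and `coerce` function are hypothetical, only a leading `$` is treated as a currency marker, and anything unrecognized is kept as the raw text.

```rust
/// A coerced value, or the raw text when no type is detected.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
pub enum Coerced {
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Percentage(f64),
    Currency(f64),
    Text(String),
}

/// Detect the most specific type for a raw extracted string.
pub fn coerce(raw: &str) -> Coerced {
    let s = raw.trim();
    match s.to_ascii_lowercase().as_str() {
        "true" | "yes" => return Coerced::Boolean(true),
        "false" | "no" => return Coerced::Boolean(false),
        _ => {}
    }
    // Require a digit so that strings like "inf" or "nan", which Rust's
    // f64 parser accepts, are preserved as text.
    if s.chars().any(|c| c.is_ascii_digit()) {
        if let Some(num) = s.strip_suffix('%') {
            if let Ok(v) = num.trim().parse::<f64>() {
                return Coerced::Percentage(v);
            }
        }
        if let Some(num) = s.strip_prefix('$') {
            // Drop thousands separators before parsing.
            if let Ok(v) = num.replace(',', "").trim().parse::<f64>() {
                return Coerced::Currency(v);
            }
        }
        // Try the narrower integer type before falling back to float.
        if let Ok(i) = s.parse::<i64>() {
            return Coerced::Integer(i);
        }
        if let Ok(f) = s.parse::<f64>() {
            return Coerced::Float(f);
        }
    }
    Coerced::Text(s.to_string())
}
```

The order of the checks matters: more specific formats (percentage, currency) are tried before plain numbers, and integers before floats, so that "42" does not become `Float(42.0)`.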
Visualization
use ;
let config = ExportConfig ;
let html = export_document?;
write?;
Provider Configuration
use ProviderConfig;
let openai = openai;
let ollama = ollama;
Error Handling
use LangExtractError;
match extract.await
Architecture
Text Input -> extract() -> Prompting -> Chunking -> LLM Inference -> Parsing -> Alignment -> Aggregation -> Result
Key modules:
- annotation.rs: orchestrates the chunk-infer-parse-align loop
- chunking.rs: semantic and token-based text splitting
- alignment.rs: exact + fuzzy character offset mapping
- resolver.rs: JSON parsing, repair, and type coercion
- multipass.rs: multi-pass extraction with quality scoring
- pipeline.rs: multi-step extraction with dependency resolution
- visualization.rs: HTML, Markdown, CSV, JSON export
See SPEC.md for the complete technical specification.
Testing
Documentation
- SPEC.md — Technical specification, architecture, known issues, and fix priorities
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Acknowledgments
This is a Rust port of Google's langextract Python library.