# LangExtract (Rust Implementation)

A powerful Rust library for extracting structured and grounded information from text using Large Language Models (LLMs).

LangExtract processes unstructured text and extracts specific information with precise character-level alignment, making it perfect for document analysis, research paper processing, product catalogs, and more.

## ✨ Key Features

- 🚀 **High-Performance Async Processing** - Concurrent chunk processing with configurable parallelism
- 🎯 **Universal Provider Support** - OpenAI, Ollama, and custom HTTP APIs
- 📍 **Character-Level Alignment** - Precise text positioning with fuzzy matching fallback
- 🔧 **Advanced Validation System** - Schema validation, type coercion, and raw data preservation
- 🎨 **Rich Visualization** - Export to HTML, Markdown, JSON, and CSV formats
- 📊 **Multi-Pass Extraction** - Improved recall through multiple extraction rounds
- 🧩 **Intelligent Chunking** - Automatic text splitting with overlap handling
- 🔒 **Memory-safe** and **thread-safe** by design

## Quick Start

### ๐Ÿ–ฅ๏ธ CLI Installation

#### Quick Install (Recommended)

**Linux/macOS (Auto-detect best method):**
```bash
curl -fsSL https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.sh | bash
```

**Windows (PowerShell):**
```powershell
iwr -useb https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.ps1 | iex
```

#### Alternative Installation Methods

**From crates.io (requires Rust):**
```bash
cargo install langextract-rust --features cli
```

**Pre-built binaries (no Rust required):**
```bash
# Download from GitHub releases
curl -fsSL https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.sh | bash -s -- --prebuilt
```

**Homebrew (macOS/Linux - coming soon):**
```bash
brew install modularflow/tap/lx-rs
```

**From source:**
```bash
git clone https://github.com/modularflow/langextract-rust
cd langextract-rust
cargo install --path . --features cli
```

#### CLI Quick Start

```bash
# Initialize configuration (provider required)
lx-rs init --provider ollama

# Extract from text (provider required)
lx-rs extract "John Doe is 30 years old" --prompt "Extract names and ages" --provider ollama

# Test your setup
lx-rs test --provider ollama

# Process files
lx-rs extract document.txt --examples examples.json --export html --provider ollama

# Check available providers
lx-rs providers
```

### 📦 Library Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
langextract-rust = "0.4.3"
```

#### Basic Usage Example

```rust
use langextract::{
    extract, ExtractConfig, FormatType,
    data::{ExampleData, Extraction},
    providers::ProviderConfig,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Set up examples to guide extraction
    let examples = vec![
        ExampleData::new(
            "John Doe is 30 years old and works as a doctor".to_string(),
            vec![
                Extraction::new("person".to_string(), "John Doe".to_string()),
                Extraction::new("age".to_string(), "30".to_string()),
                Extraction::new("profession".to_string(), "doctor".to_string()),
            ],
        )
    ];

    // Configure for Ollama
    let provider_config = ProviderConfig::ollama("mistral", None);
    
    let config = ExtractConfig {
        model_id: "mistral".to_string(),
        format_type: FormatType::Json,
        max_char_buffer: 8000,
        max_workers: 6,
        batch_length: 4,
        temperature: 0.3,
        model_url: Some("http://localhost:11434".to_string()),
        language_model_params: {
            let mut params = std::collections::HashMap::new();
            params.insert("provider_config".to_string(), serde_json::to_value(&provider_config)?);
            params
        },
        debug: true,
        ..Default::default()
    };

    // Extract information
    let result = extract(
        "Alice Smith is 25 years old and works as a doctor. Bob Johnson is 35 and is an engineer.",
        Some("Extract person names, ages, and professions from the text"),
        &examples,
        config,
    ).await?;

    println!("✅ Extracted {} items", result.extraction_count());
    
    // Show extractions with character positions
    if let Some(extractions) = &result.extractions {
        for extraction in extractions {
            println!("• [{}] '{}' at {:?}", 
                extraction.extraction_class, 
                extraction.extraction_text,
                extraction.char_interval
            );
        }
    }
    
    Ok(())
}
```

## ๐Ÿ–ฅ๏ธ Command Line Interface

The CLI provides a powerful interface for text extraction without writing code.

### Installation Options

#### Quick Install (Recommended)
```bash
# Linux/macOS
curl -fsSL https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.sh | bash

# Windows PowerShell
iwr -useb https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.ps1 | iex
```

#### Manual Install
```bash
# From source with CLI features
cargo install langextract-rust --features cli

# Or clone and build
git clone https://github.com/modularflow/langextract-rust
cd langextract-rust
cargo install --path . --features cli
```

### CLI Commands

#### Extract Command
Extract structured information from text, files, or URLs:

```bash
# Basic extraction
lx-rs extract "Alice Smith is 25 years old" --prompt "Extract names and ages" --provider ollama

# From file with custom examples
lx-rs extract document.txt \
  --examples my_examples.json \
  --output results.json \
  --export html \
  --provider ollama

# With specific provider and model
lx-rs extract text.txt \
  --provider ollama \
  --model mistral \
  --workers 8 \
  --multipass

# From URL
lx-rs extract "https://example.com/article.html" \
  --prompt "Extract key facts" \
  --format yaml \
  --provider openai

# Advanced options
lx-rs extract large_document.txt \
  --examples patterns.json \
  --provider openai \
  --model gpt-4o \
  --max-chars 12000 \
  --workers 10 \
  --batch-size 6 \
  --temperature 0.1 \
  --multipass \
  --passes 3 \
  --export html \
  --show-intervals \
  --verbose
```

#### Configuration Commands

```bash
# Initialize configuration files (provider required)
lx-rs init --provider ollama

# Initialize for OpenAI provider
lx-rs init --provider openai

# Force overwrite existing configs
lx-rs init --provider ollama --force

# Test provider connectivity (provider required)
lx-rs test --provider ollama
lx-rs test --provider ollama --model mistral
lx-rs test --provider openai --api-key your_key
```

#### Information Commands

```bash
# List available providers and models
lx-rs providers

# Show example configurations
lx-rs examples

# Get help
lx-rs --help
lx-rs extract --help
```

#### Conversion Commands

```bash
# Convert between formats
lx-rs convert results.json --output report.html --format html
lx-rs convert data.json --output summary.csv --format csv
```

### Configuration Files

The CLI supports configuration files for easier management:

#### examples.json
```json
[
  {
    "text": "Dr. Sarah Johnson works at Mayo Clinic in Rochester, MN",
    "extractions": [
      {"extraction_class": "person", "extraction_text": "Dr. Sarah Johnson"},
      {"extraction_class": "organization", "extraction_text": "Mayo Clinic"},
      {"extraction_class": "location", "extraction_text": "Rochester, MN"}
    ]
  }
]
```

#### .env
```bash
# Provider API keys
OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_gemini_key_here

# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434
```

#### langextract.yaml
```yaml
# Default configuration
model: "mistral"
provider: "ollama"
model_url: "http://localhost:11434"
temperature: 0.3
max_char_buffer: 8000
max_workers: 6
batch_length: 4
multipass: false
extraction_passes: 1
```

### CLI Examples by Use Case

#### Document Processing
```bash
# Academic papers
lx-rs extract research_paper.pdf \
  --prompt "Extract authors, institutions, key findings, and methodology" \
  --examples academic_examples.json \
  --export html \
  --show-intervals

# Legal documents
lx-rs extract contract.txt \
  --prompt "Extract parties, dates, obligations, and key terms" \
  --provider openai \
  --model gpt-4o \
  --temperature 0.1
```

#### Data Extraction
```bash
# Product catalogs
lx-rs extract catalog.txt \
  --prompt "Extract product names, prices, descriptions, and specs" \
  --multipass \
  --passes 2 \
  --export csv

# Contact information
lx-rs extract directory.txt \
  --prompt "Extract names, emails, phone numbers, and addresses" \
  --format yaml \
  --show-intervals
```

#### Batch Processing
```bash
# Process multiple files
for file in documents/*.txt; do
  lx-rs extract "$file" \
    --examples patterns.json \
    --output "results/$(basename "$file" .txt).json"
done

# URL processing
lx-rs extract "https://news.site.com/article" \
  --prompt "Extract headline, author, date, and key points" \
  --export html
```

### Provider-Specific Setup

#### Ollama (Local)
```bash
# Install and start Ollama
ollama serve
ollama pull mistral

# Test connection
lx-rs test --provider ollama --model mistral
```

#### OpenAI
```bash
# Set API key
export OPENAI_API_KEY="your-key-here"

# Test connection
lx-rs test --provider openai --model gpt-4o-mini
```

#### Gemini
```bash
# Set API key
export GEMINI_API_KEY="your-key-here"

# Test connection
lx-rs test --provider gemini --model gemini-2.5-flash
```

### Performance Optimization

```bash
# High-performance extraction: more parallel workers, larger batches,
# larger chunks, local inference, and a low temperature for consistency
lx-rs extract large_file.txt \
  --workers 12 \
  --batch-size 8 \
  --max-chars 10000 \
  --provider ollama \
  --temperature 0.2

# Memory-efficient processing: smaller chunks, fewer workers, smaller batches
lx-rs extract huge_file.txt \
  --max-chars 6000 \
  --workers 4 \
  --batch-size 2
```

### Troubleshooting

```bash
# Verbose output for debugging
lx-rs extract text.txt --verbose --debug

# Test specific provider
lx-rs test --provider ollama --verbose

# Check installation
lx-rs --version
lx-rs providers

# Reset configuration (overwrites existing config files)
lx-rs init --provider ollama --force
```

### Advanced Features

#### Validation and Type Coercion

```rust
use langextract::{ValidationConfig, ValidationResult};

// Enable advanced validation
let validation_config = ValidationConfig {
    enable_schema_validation: true,
    enable_type_coercion: true,
    save_raw_output: true,
    validate_required_fields: true,
    raw_output_dir: Some("./raw_outputs".to_string()),
    ..Default::default()
};

// Automatic type coercion handles:
// - Currencies: "$1,234.56" → 1234.56
// - Percentages: "95.5%" → 0.955
// - Booleans: "true", "yes", "1" → true
// - Numbers: "42" → 42, "3.14" → 3.14
// - Emails, phones, URLs, dates
```
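The conversions above can be sketched in plain Rust. This is an illustrative, self-contained function, not the library's internal coercion code; the `Coerced` enum and `coerce` function are hypothetical names:

```rust
#[derive(Debug, PartialEq)]
enum Coerced {
    Number(f64),
    Bool(bool),
    Text(String),
}

/// Illustrative coercion of a raw extracted string into a typed value.
fn coerce(raw: &str) -> Coerced {
    let s = raw.trim();

    // Booleans: "true"/"yes" and "false"/"no"
    match s.to_ascii_lowercase().as_str() {
        "true" | "yes" => return Coerced::Bool(true),
        "false" | "no" => return Coerced::Bool(false),
        _ => {}
    }

    // Percentages: "95.5%" -> 0.955
    if let Some(pct) = s.strip_suffix('%') {
        if let Ok(n) = pct.trim().parse::<f64>() {
            return Coerced::Number(n / 100.0);
        }
    }

    // Currencies and plain numbers: "$1,234.56" -> 1234.56, "42" -> 42.0
    let cleaned: String = s.chars().filter(|c| *c != '$' && *c != ',').collect();
    if let Ok(n) = cleaned.parse::<f64>() {
        return Coerced::Number(n);
    }

    Coerced::Text(s.to_string())
}

fn main() {
    assert_eq!(coerce("$1,234.56"), Coerced::Number(1234.56));
    assert_eq!(coerce("yes"), Coerced::Bool(true));
    assert!(matches!(coerce("95.5%"), Coerced::Number(n) if (n - 0.955).abs() < 1e-9));
    println!("{:?}", coerce("42")); // Number(42.0)
}
```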

#### Rich Visualization

```rust
use langextract::visualization::{export_document, ExportConfig, ExportFormat};

// Export to interactive HTML
let html_config = ExportConfig {
    format: ExportFormat::Html,
    title: Some("Document Analysis".to_string()),
    highlight_extractions: true,
    show_char_intervals: true,
    include_statistics: true,
    ..Default::default()
};

let html_output = export_document(&annotated_doc, &html_config)?;
std::fs::write("analysis.html", html_output)?;

// Also supports Markdown, JSON, and CSV exports
```

#### Provider Configuration

```rust
use langextract::providers::ProviderConfig;

// OpenAI configuration  
let openai_config = ProviderConfig::openai("gpt-4o-mini", Some(api_key));

// Ollama configuration
let ollama_config = ProviderConfig::ollama("mistral", Some("http://localhost:11434".to_string()));

// Custom HTTP API
let custom_config = ProviderConfig::custom("https://my-api.com/v1", "my-model");
```

## 🚀 Example Applications

### Product Catalog Processing
```bash
# Extract product information from catalogs
./test_product_extraction.sh
```

### Academic Paper Analysis  
```bash
# Extract research information from papers
./test_academic_extraction.sh
```

### End-to-End Provider Testing
```bash
# Test with multiple LLM providers
./test_providers.sh
```

## 📋 Supported Providers

| Provider | Models | Features | Use Case |
|----------|--------|----------|----------|
| **OpenAI** | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | High accuracy, JSON mode | Production applications |
| **Ollama** | mistral, llama2, codellama, qwen | Local, privacy-first | Development, sensitive data |
| **Custom** | Any OpenAI-compatible API | Flexible integration | Custom deployments |

### Environment Setup

```bash
# For OpenAI
export OPENAI_API_KEY="your-openai-key"

# For Ollama (local)
ollama serve
ollama pull mistral

# For custom providers
export CUSTOM_API_KEY="your-key"
```

## โš™๏ธ Performance Configuration

The `ExtractConfig` struct provides fine-grained control over extraction performance:

```rust
let config = ExtractConfig {
    model_id: "mistral".to_string(),
    temperature: 0.3,                    // Lower = more consistent
    max_char_buffer: 8000,               // Chunk size for large documents
    batch_length: 6,                     // Chunks per batch  
    max_workers: 8,                      // Parallel workers (key for speed!)
    extraction_passes: 1,                // Multiple passes for better recall
    enable_multipass: false,             // Advanced multi-pass extraction
    multipass_min_extractions: 5,        // Minimum extractions to avoid re-processing
    multipass_quality_threshold: 0.8,    // Quality threshold for keeping extractions
    debug: true,                         // Enable debug information
    ..Default::default()
};
```

### Performance Tuning Tips

- **max_workers**: Increase for faster processing (6-12 recommended)
- **batch_length**: Larger batches = better throughput (4-8 optimal)  
- **max_char_buffer**: Balance speed vs accuracy (6000-12000 characters)
- **temperature**: Lower values (0.1-0.3) for consistent extraction

See [PERFORMANCE_TUNING.md](PERFORMANCE_TUNING.md) for detailed optimization guide.

## 📚 Real-World Examples

### Document Analysis
Perfect for processing contracts, research papers, or reports:

```rust
let examples = vec![
    ExampleData::new(
        "Dr. Sarah Johnson (contact: s.johnson@mayo.edu) works at Mayo Clinic in Rochester, MN since 2019".to_string(),
        vec![
            Extraction::new("person".to_string(), "Dr. Sarah Johnson".to_string()),
            Extraction::new("email".to_string(), "s.johnson@mayo.edu".to_string()),
            Extraction::new("institution".to_string(), "Mayo Clinic".to_string()),
            Extraction::new("location".to_string(), "Rochester, MN".to_string()),
            Extraction::new("year".to_string(), "2019".to_string()),
        ],
    )
];
```

### Large Document Processing

The library handles large documents automatically with intelligent chunking:

```rust
// Configure for academic papers or catalogs
let config = ExtractConfig {
    max_char_buffer: 8000,     // Optimal chunk size
    max_workers: 8,            // High parallelism  
    batch_length: 6,           // Process multiple chunks per batch
    enable_multipass: true,    // Multiple extraction rounds
    multipass_min_extractions: 3,
    multipass_quality_threshold: 0.8,
    debug: true,               // See processing details
    ..Default::default()
};
```
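Some back-of-the-envelope math for these settings (ignoring chunk overlap, so real counts may be slightly higher): a hypothetical 100,000-character document splits into 13 chunks, processed in 3 batches:

```rust
fn main() {
    let doc_chars: usize = 100_000;     // hypothetical document length
    let max_char_buffer: usize = 8_000; // chunk size from the config above
    let batch_length: usize = 6;        // chunks per batch from the config above

    // Ceiling division: the last chunk/batch may be partially filled.
    let chunks = (doc_chars + max_char_buffer - 1) / max_char_buffer;
    let batches = (chunks + batch_length - 1) / batch_length;

    println!("{} chunks in {} batches", chunks, batches); // 13 chunks in 3 batches
}
```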

## Error Handling

The library provides comprehensive error types:

```rust
use langextract::LangExtractError;

match extract(/* ... */).await {
    Ok(result) => println!("Success: {} extractions", result.extraction_count()),
    Err(LangExtractError::ConfigurationError(msg)) => {
        eprintln!("Configuration issue: {}", msg);
    }
    Err(LangExtractError::InferenceError { message, provider, .. }) => {
        eprintln!("Inference failed ({}): {}", provider.unwrap_or("unknown"), message);
    }
    Err(LangExtractError::NetworkError(e)) => {
        eprintln!("Network error: {}", e);
    }
    Err(e) => eprintln!("Other error: {}", e),
}
```

## ๐Ÿ—๏ธ Architecture & Performance

### High-Performance Features
- **Concurrent processing**: Multiple workers process chunks in parallel
- **UTF-8 safe**: Handles Unicode text with proper character boundary detection
- **Memory efficient**: Streaming processing for large documents  
- **Async I/O**: Non-blocking network operations
- **Smart chunking**: Intelligent text splitting with overlap handling
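The chunk-then-parallelize idea can be sketched with the standard library alone. This is a simplified illustration, not the crate's internals (the real pipeline is async, bounds its worker count, and adds overlap between chunks); `chunk` is a hypothetical helper:

```rust
use std::thread;

/// Split text into chunks of at most `max_chars` characters,
/// always cutting on character boundaries (UTF-8 safe).
fn chunk(text: &str, max_chars: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.chunks(max_chars).map(|c| c.iter().collect()).collect()
}

fn main() {
    let doc = "Ünïcode-heavy text that is split and processed in parallel.";

    // One worker thread per chunk; the real pipeline uses an async
    // task pool with a configurable number of workers instead.
    let handles: Vec<_> = chunk(doc, 16)
        .into_iter()
        .map(|c| thread::spawn(move || c.to_uppercase()))
        .collect();

    let processed: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    println!("processed {} chunks", processed.len());
}
```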

### Development Status

This Rust implementation provides a complete, production-ready text extraction system:

#### ✅ Core Infrastructure (COMPLETE)
- **Data structures and type system** - Robust extraction and document models
- **Error handling and results** - Comprehensive error types with context
- **Universal provider system** - OpenAI, Ollama, and custom HTTP APIs
- **Async processing pipeline** - High-performance concurrent chunk processing

#### ✅ Text Processing (COMPLETE)
- **Intelligent chunking** - Automatic document splitting with overlap management
- **Character alignment** - Precise text positioning with fuzzy matching fallback
- **Multi-pass extraction** - Improved recall through multiple extraction rounds
- **Prompt template system** - Flexible LLM prompt generation

#### ✅ Validation & Quality (COMPLETE)
- **Advanced validation system** - Schema validation with type coercion
- **Raw data preservation** - Save original LLM outputs before processing
- **Type coercion** - Automatic conversion of strings to appropriate types
- **Quality assurance** - Validation reporting and data correction

#### ✅ Visualization & Export (COMPLETE)
- **Rich HTML export** - Interactive highlighting with modern styling
- **Multiple formats** - HTML, Markdown, JSON, and CSV export options
- **Character-level highlighting** - Precise extraction positioning in source text
- **Statistical reporting** - Comprehensive extraction analytics

### Architecture Advantages

- **Type Safety**: Compile-time guarantees for configurations and data structures
- **Memory Safety**: Rust's ownership system prevents common memory errors
- **Performance**: Zero-cost abstractions and efficient async processing
- **Explicit Configuration**: Clear, predictable provider and processing setup
- **Unicode Support**: Proper handling of international text and mathematical symbols
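To make the Unicode point concrete: extraction intervals count Unicode characters, not bytes, so multi-byte characters shift the two offsets apart. A minimal exact-match locator (the library additionally falls back to fuzzy matching; `char_interval` is a hypothetical helper, not the crate's API):

```rust
/// Locate `needle` in `haystack` and return its (start, end) character
/// interval. Exact matching only; counts chars, not bytes.
fn char_interval(haystack: &str, needle: &str) -> Option<(usize, usize)> {
    let byte_start = haystack.find(needle)?;
    let char_start = haystack[..byte_start].chars().count();
    Some((char_start, char_start + needle.chars().count()))
}

fn main() {
    // 'é' is two bytes but one character, so byte and char offsets diverge.
    let text = "café costs $4";
    assert_eq!(text.find("costs"), Some(6));                 // byte offset
    assert_eq!(char_interval(text, "costs"), Some((5, 10))); // char interval
    println!("{:?}", char_interval(text, "costs"));
}
```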

## 🧪 Testing & Examples

Run the included test scripts to explore LangExtract capabilities:

```bash
# Test with product catalogs
./test_product_extraction.sh

# Test with academic papers  
./test_academic_extraction.sh

# Test multiple LLM providers
./test_providers.sh
```

Each test generates interactive HTML reports, structured JSON data, and CSV exports for analysis.

## 📄 Documentation

- **[SPEC.md](SPEC.md)** - Complete technical specification and implementation status
- **[PERFORMANCE_TUNING.md](PERFORMANCE_TUNING.md)** - Detailed performance optimization guide
- **[E2E_TEST_README.md](E2E_TEST_README.md)** - End-to-end testing instructions

## ๐Ÿค Contributing

We welcome contributions! Key areas for enhancement:

- Additional LLM provider implementations
- New export formats and visualization options  
- Performance optimizations for specific document types
- Enhanced validation and quality assurance features

## 📜 License

Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details. For health-related applications, use of LangExtract is also subject to the [Health AI Developer Foundations Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms).


## 📖 Citations & Acknowledgments

This work builds upon research and implementations from the broader NLP and information extraction community:

```bibtex
@misc{langextract,
  title={langextract},
  author={Google Research Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/google/langextract}
}
```

**Acknowledgments:**
- Inspired by the folks at Google who open-sourced [langextract](https://github.com/google/langextract)
- Thank you for providing such a sophisticated tool to the AI engineers of the world striving for more deterministic outcomes.