langextract-rust 0.5.0

A Rust library for extracting structured and grounded information from text using LLMs
# LangExtract Pipeline Feature

The Pipeline feature runs multi-step information extraction in which later steps process the output of earlier ones. This lets you break a complex extraction task into sequential steps that turn unstructured text into nested, structured output.

## 🚀 Quick Start

### 1. Create a Sample Pipeline

```bash
# Create a sample requirements extraction pipeline
lx-rs pipeline --create-sample --config requirements_pipeline.yaml

# Or specify a different sample type
lx-rs pipeline --create-sample --config medical_pipeline.yaml --sample-type medical
```

### 2. Execute the Pipeline

```bash
# Process text directly
lx-rs pipeline --config requirements_pipeline.yaml "The system shall process 100 transactions per second."

# Process a file
lx-rs pipeline --config requirements_pipeline.yaml requirements.txt --output results.json

# Process a URL
lx-rs pipeline --config requirements_pipeline.yaml "https://example.com/requirements" --output results.json
```

## 📋 Pipeline Configuration

Pipelines are defined in YAML format with the following structure:

```yaml
name: "Requirements Extraction Pipeline"
description: "Extract requirements and sub-divide into values, units, and specifications"
version: "1.0.0"

global_config:
  model_id: "gpt-4o-mini"
  format_type: "json"
  temperature: 0.3
  max_char_buffer: 8000
  max_workers: 6
  language_model_params:
    provider_config:
      provider_type: "openai"
      base_url: "https://api.openai.com/v1"
      model: "gpt-4o-mini"
  # ... other global settings

steps:
  - id: "extract_requirements"
    name: "Extract Requirements"
    description: "Extract all 'shall' statements and requirements"
    prompt: "Extract all requirements, 'shall' statements, and specifications from the text."
    output_field: "requirements"
    depends_on: []  # No dependencies - first step
    examples:
      - text: "The system shall process 100 transactions per second."
        extractions:
          - extraction_class: "requirement"
            extraction_text: "The system shall process 100 transactions per second."

  - id: "extract_values"
    name: "Extract Values"
    description: "Extract numeric values and units from requirements"
    prompt: "Extract all numeric values and their units from this requirement."
    output_field: "values"
    depends_on: ["extract_requirements"]  # Depends on previous step
    filter:
      class_filter: "requirement"  # Only process requirement extractions
    examples:
      - text: "The system shall process 100 transactions per second."
        extractions:
          - extraction_class: "value"
            extraction_text: "100"
          - extraction_class: "unit"
            extraction_text: "transactions per second"
```

## 🔧 Configuration Options

### Global Configuration

| Field | Description | Example |
|-------|-------------|---------|
| `model_id` | LLM model to use | `"gpt-4o-mini"`, `"mistral"` |
| `format_type` | Output format | `"json"`, `"yaml"` |
| `temperature` | Sampling temperature (0.0-1.0) | `0.3` |
| `max_char_buffer` | Maximum characters per chunk | `8000` |
| `max_workers` | Concurrent processing workers | `6` |
| `language_model_params.provider_config` | LLM provider configuration (provider type, base URL, model) | See provider docs |

### Step Configuration

| Field | Description | Required |
|-------|-------------|----------|
| `id` | Unique step identifier | ✅ |
| `name` | Human-readable step name | ✅ |
| `description` | Step description | ✅ |
| `prompt` | Extraction prompt | ✅ |
| `output_field` | Output field name | ✅ |
| `depends_on` | Step dependencies | ✅ |
| `examples` | Few-shot examples (input text with expected extractions) | ✅ |
| `filter` | Input filtering | ❌ |
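
The required fields above suggest a few checks worth running before a pipeline is executed. The following is a minimal sketch, not the library's own validation: the `Step` struct and `validate_steps` function are hypothetical names introduced here, and only `id` and `depends_on` are modeled.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of one `steps` entry; only the fields checked here.
struct Step {
    id: String,
    depends_on: Vec<String>,
}

/// Returns human-readable problems; an empty list means the checks passed.
fn validate_steps(steps: &[Step]) -> Vec<String> {
    let mut problems = Vec::new();
    let mut seen: HashSet<&str> = HashSet::new();

    // Every step needs a non-empty id that no other step uses.
    for step in steps {
        if step.id.is_empty() {
            problems.push("step with empty id".to_string());
        } else if !seen.insert(step.id.as_str()) {
            problems.push(format!("duplicate step id '{}'", step.id));
        }
    }

    // Every dependency must name another step that exists.
    for step in steps {
        for dep in &step.depends_on {
            if dep == &step.id {
                problems.push(format!("step '{}' depends on itself", step.id));
            } else if !seen.contains(dep.as_str()) {
                problems.push(format!(
                    "step '{}' depends on unknown step '{}'",
                    step.id, dep
                ));
            }
        }
    }
    problems
}
```

Calling `validate_steps` on the two steps from the configuration example (`extract_requirements` and `extract_values`) should return an empty list, since every `depends_on` entry names an existing step.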

### Filter Configuration

```yaml
filter:
  class_filter: "requirement"    # Only process specific extraction classes
  text_pattern: "shall.*"        # Regex pattern for text filtering
  max_items: 10                  # Maximum items to process
```
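
To make the three filter fields concrete, here is a rough sketch of how they might combine. It rests on assumptions not stated above: that the conditions are ANDed, and that `max_items` is applied after the other two. The `Extraction` and `StepFilter` types and `apply_filter` are hypothetical, and `text_contains` is a plain substring check standing in for the regex-based `text_pattern` so the example needs only the standard library.

```rust
/// Hypothetical stand-in for one extraction produced by an earlier step.
#[derive(Debug, Clone, PartialEq)]
struct Extraction {
    class: String,
    text: String,
}

/// Hypothetical mirror of the `filter` block.
struct StepFilter {
    class_filter: Option<String>,
    /// Simplified stand-in for `text_pattern`: substring match, not a regex.
    text_contains: Option<String>,
    max_items: Option<usize>,
}

/// Keeps items that satisfy every configured condition, then caps the count.
fn apply_filter(items: &[Extraction], filter: &StepFilter) -> Vec<Extraction> {
    let limit = filter.max_items.unwrap_or(usize::MAX);
    items
        .iter()
        // An unset field places no restriction on the item.
        .filter(|e| filter.class_filter.as_ref().map_or(true, |c| &e.class == c))
        .filter(|e| {
            filter
                .text_contains
                .as_deref()
                .map_or(true, |needle| e.text.contains(needle))
        })
        .take(limit)
        .cloned()
        .collect()
}
```

With `class_filter: Some("requirement")`, `text_contains: Some("shall")`, and `max_items: Some(10)`, only requirement-class extractions whose text contains "shall" are passed on, at most ten of them.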

## 📊 Pipeline Execution

### Execution Flow

1. **Dependency Resolution**: Steps are executed in dependency order
2. **Input Processing**: Each step processes the outputs of the steps it depends on
3. **Filtering**: Optional filtering of input data
4. **Extraction**: LLM processing with step-specific prompts
5. **Aggregation**: Results are collected and structured
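
The dependency-resolution step can be illustrated with a topological sort. The sketch below uses Kahn's algorithm; it is one common way to order a dependency graph and is not necessarily how the library resolves steps internally. The function name `execution_order` and the `(id, depends_on)` tuple representation are assumptions made for the example, and step ids are assumed to be unique.

```rust
use std::collections::{HashMap, VecDeque};

/// Orders step ids so that each step comes after everything it depends on.
/// On failure, returns the ids that could not be scheduled (they are part of
/// a cycle, depend on one, or name an unknown step).
fn execution_order<'a>(
    steps: &[(&'a str, Vec<&'a str>)],
) -> Result<Vec<String>, Vec<String>> {
    // Number of unmet dependencies per step.
    let mut pending: HashMap<&str, usize> =
        steps.iter().map(|(id, deps)| (*id, deps.len())).collect();

    // Reverse edges: dependency -> steps waiting on it.
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (id, deps) in steps {
        for dep in deps {
            dependents.entry(*dep).or_default().push(*id);
        }
    }

    // Steps without dependencies are ready immediately, in declaration order.
    let mut ready: VecDeque<&str> = steps
        .iter()
        .filter(|(_, deps)| deps.is_empty())
        .map(|(id, _)| *id)
        .collect();

    let mut order = Vec::new();
    while let Some(id) = ready.pop_front() {
        order.push(id.to_string());
        for next in dependents.get(id).into_iter().flatten() {
            let unmet = pending.get_mut(next).expect("dependent is a known step");
            *unmet -= 1;
            if *unmet == 0 {
                ready.push_back(*next);
            }
        }
    }

    if order.len() == steps.len() {
        Ok(order)
    } else {
        Err(steps
            .iter()
            .map(|(id, _)| id.to_string())
            .filter(|id| !order.contains(id))
            .collect())
    }
}
```

For the two-step configuration shown earlier, this places `extract_requirements` before `extract_values` regardless of the order in which the steps are declared.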

### Output Structure

Pipeline results are nested JSON structures:

```json
{
  "extract_requirements": {
    "extractions": [
      {
        "class": "requirement",
        "text": "The system shall process 100 transactions per second",
        "start": 0,
        "end": 52
      }
    ],
    "count": 1,
    "processing_time_ms": 1250
  },
  "extract_values": {
    "extractions": [
      {
        "class": "value",
        "text": "100"
      },
      {
        "class": "unit",
        "text": "transactions per second"
      }
    ],
    "count": 2,
    "processing_time_ms": 980
  }
}
```

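## 🛠️ Use Cases

The snippets in this section and the next are abbreviated: fields marked as required above, such as `description`, `output_field`, and `examples`, are omitted for brevity.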

### Requirements Engineering

Extract and categorize requirements from specifications:

```yaml
steps:
  - id: "extract_functional_reqs"
    name: "Functional Requirements"
    prompt: "Extract all functional requirements (what the system shall do)"
    filter:
      class_filter: "requirement"

  - id: "extract_performance_reqs"
    name: "Performance Requirements"
    prompt: "Extract performance metrics, timing, and capacity requirements"
    filter:
      class_filter: "requirement"
```

### Medical Record Processing

Process medical documents hierarchically:

```yaml
steps:
  - id: "extract_symptoms"
    name: "Extract Symptoms"
    prompt: "Extract all mentioned symptoms and conditions"

  - id: "extract_medications"
    name: "Extract Medications"
    prompt: "Extract medication names, dosages, and frequencies"
    depends_on: ["extract_symptoms"]

  - id: "extract_treatments"
    name: "Extract Treatments"
    prompt: "Extract treatment procedures and interventions"
    depends_on: ["extract_symptoms", "extract_medications"]
```

### Financial Document Analysis

Process financial statements and reports:

```yaml
steps:
  - id: "extract_financial_statements"
    name: "Financial Statements"
    prompt: "Extract balance sheet, income statement, and cash flow items"

  - id: "extract_values"
    name: "Extract Values"
    prompt: "Extract monetary values, percentages, and ratios"
    depends_on: ["extract_financial_statements"]

  - id: "categorize_accounts"
    name: "Categorize Accounts"
    prompt: "Categorize extracted items by account type and classification"
    depends_on: ["extract_values"]
```

## 🔍 Advanced Features

### Conditional Processing

Use filters to create conditional processing paths:

```yaml
steps:
  - id: "check_document_type"
    name: "Document Classification"
    prompt: "Classify the document type and main topic"

  - id: "process_contract"
    name: "Process Contract"
    prompt: "Extract contract terms, parties, and obligations"
    filter:
      text_pattern: "contract|agreement"

  - id: "process_technical_spec"
    name: "Process Technical Spec"
    prompt: "Extract technical specifications and requirements"
    filter:
      text_pattern: "specification|requirement"
```

### Parallel Processing

Independent steps can be processed in parallel:

```yaml
steps:
  - id: "extract_entities"
    name: "Entity Extraction"
    depends_on: []
    # Runs in parallel with other root steps

  - id: "extract_relationships"
    name: "Relationship Extraction"
    depends_on: []
    # Also runs in parallel

  - id: "combine_results"
    name: "Combine Results"
    depends_on: ["extract_entities", "extract_relationships"]
    # Waits for both previous steps to complete
```
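
One way to see which steps are eligible to run together is to group them into "waves": a step joins a wave once all of its dependencies finished in earlier waves. This is only an illustration of the scheduling idea; `parallel_waves` is a hypothetical helper, and how the library actually schedules concurrent steps is not specified here.

```rust
use std::collections::HashSet;

/// Groups step ids into waves. Every step in a wave depends only on steps
/// from earlier waves, so the steps inside one wave are independent of each
/// other. Steps caught in a dependency cycle never appear in the result.
fn parallel_waves<'a>(steps: &[(&'a str, Vec<&'a str>)]) -> Vec<Vec<&'a str>> {
    let mut done: HashSet<&str> = HashSet::new();
    let mut waves = Vec::new();
    loop {
        // Runnable now: not yet done, and every dependency already done.
        let wave: Vec<&str> = steps
            .iter()
            .filter(|(id, deps)| {
                !done.contains(id) && deps.iter().all(|dep| done.contains(dep))
            })
            .map(|(id, _)| *id)
            .collect();
        if wave.is_empty() {
            break;
        }
        done.extend(wave.iter().copied());
        waves.push(wave);
    }
    waves
}
```

Applied to the three steps above, the first wave contains `extract_entities` and `extract_relationships`, and the second contains only `combine_results`.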

## 📈 Performance Optimization

### Chunking Strategy

Configure text chunking for large documents:

```yaml
global_config:
  max_char_buffer: 8000  # Maximum characters per chunk
  max_workers: 8         # Concurrent processing workers
  batch_length: 6        # Batches per worker
```
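
To illustrate what a character budget like `max_char_buffer` implies, here is a simplified chunker that caps each chunk at a given number of characters and prefers to break after whitespace. It is a sketch under stated assumptions: `chunk_text` is a hypothetical function, and the library's real chunking strategy may differ (for example, by respecting sentence boundaries).

```rust
/// Splits `text` into chunks of at most `max_chars` characters, breaking
/// after the last whitespace inside each window when one exists.
/// Concatenating the chunks reproduces the original text.
fn chunk_text(text: &str, max_chars: usize) -> Vec<&str> {
    assert!(max_chars > 0, "max_chars must be at least 1");
    let mut chunks = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Byte offset just past the window of `max_chars` characters.
        let hard_end = rest
            .char_indices()
            .nth(max_chars)
            .map_or(rest.len(), |(i, _)| i);
        let end = if hard_end == rest.len() {
            // The remainder fits in one chunk.
            hard_end
        } else {
            // Prefer to cut right after the last whitespace in the window;
            // fall back to a hard cut if the window has none.
            match rest[..hard_end]
                .char_indices()
                .rev()
                .find(|(_, c)| c.is_whitespace())
            {
                Some((i, c)) => i + c.len_utf8(),
                None => hard_end,
            }
        };
        let (chunk, tail) = rest.split_at(end);
        chunks.push(chunk);
        rest = tail;
    }
    chunks
}
```

Counting characters rather than bytes keeps the cut points on valid UTF-8 boundaries, which matters for non-ASCII documents.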

### Caching and Reuse

A pipeline configuration is independent of its input, so the same file can be applied to many documents:

```bash
# Process multiple files with same pipeline
for file in documents/*.txt; do
  lx-rs pipeline --config analysis_pipeline.yaml "$file" \
    --output "results/$(basename "$file" .txt).json"
done
```

## 🐛 Troubleshooting

### Common Issues

**Dependency Cycle Error**
```
Error: Circular dependency detected
```
- Check that `depends_on` doesn't create loops
- Ensure dependency graph is a DAG (Directed Acyclic Graph)
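
When the loop is hard to spot by eye, a depth-first search over `depends_on` can report the steps that form it. The following is a standalone debugging sketch, not part of the library: `find_cycle` and `visit` are hypothetical helpers, and steps are again modeled as `(id, depends_on)` pairs.

```rust
use std::collections::HashMap;

/// Returns one dependency loop as a list of step ids, with the first id
/// repeated at the end, or `None` if the graph has no cycle.
fn find_cycle<'a>(steps: &[(&'a str, Vec<&'a str>)]) -> Option<Vec<&'a str>> {
    let deps: HashMap<&str, &Vec<&str>> =
        steps.iter().map(|(id, d)| (*id, d)).collect();
    // 1 = on the current DFS path, 2 = fully explored.
    let mut state: HashMap<&str, u8> = HashMap::new();
    let mut path = Vec::new();
    for (id, _) in steps {
        if visit(*id, &deps, &mut state, &mut path) {
            return Some(path);
        }
    }
    None
}

/// Depth-first walk along `depends_on` edges. Returns true once a loop is
/// found, leaving the loop itself in `path`.
fn visit<'a>(
    id: &'a str,
    deps: &HashMap<&'a str, &Vec<&'a str>>,
    state: &mut HashMap<&'a str, u8>,
    path: &mut Vec<&'a str>,
) -> bool {
    match state.get(id).copied() {
        Some(2) => return false, // already explored, no loop through here
        Some(_) => {
            // `id` is on the current path: keep only the looping part.
            let start = path
                .iter()
                .position(|p| *p == id)
                .expect("step marked as on-path is in the path");
            let cycle = path.split_off(start);
            *path = cycle;
            path.push(id);
            return true;
        }
        None => {}
    }
    state.insert(id, 1);
    path.push(id);
    for dep in deps.get(id).into_iter().flat_map(|d| d.iter()) {
        if visit(*dep, deps, state, path) {
            return true;
        }
    }
    path.pop();
    state.insert(id, 2);
    false
}
```

The returned list reads as a chain of "depends on" links, so removing any one of those links from the YAML breaks the loop.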

**Provider Configuration Missing**
```
Error: Provider configuration is required
```
- Add provider configuration to `language_model_params.provider_config`
- Or set environment variables for your provider

**Step Execution Failed**
```
Step 'step_name' failed: Parse error
```
- Check step examples and prompts
- Verify input data format
- Enable debug mode by adding `debug: true` to `global_config`

### Debug Mode

Enable detailed logging:

```yaml
global_config:
  debug: true
```

## 📚 Examples

### Complete Requirements Pipeline

See `requirements_pipeline.yaml` for a complete working example that:
- Extracts "shall" statements from requirements documents
- Sub-divides them into numeric values and units
- Categorizes specifications and constraints

### Custom Pipeline Creation

Create custom pipelines for your domain:

```bash
# Start with a sample
lx-rs pipeline --create-sample --config my_pipeline.yaml

# Edit the configuration
nano my_pipeline.yaml

# Test with sample data
lx-rs pipeline --config my_pipeline.yaml "Your test text here"
```

## 🔗 Integration

### Programmatic Usage

```rust
use langextract_rust::pipeline::PipelineExecutor;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load pipeline from YAML
    let executor = PipelineExecutor::from_yaml_file("pipeline.yaml")?;

    // Execute pipeline
    let result = executor.execute("Your text here").await?;

    // Process results
    println!("Pipeline completed in {}ms", result.total_time_ms);
    println!("Results: {}", serde_json::to_string_pretty(&result.nested_output)?);

    Ok(())
}
```

### CI/CD Integration

```yaml
# .github/workflows/extract.yml
name: Document Extraction
on: [push]

jobs:
  extract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Extract Requirements
        run: |
          # Assumes lx-rs is already available on the runner's PATH
          mkdir -p results
          lx-rs pipeline --config requirements_pipeline.yaml \
            docs/requirements.md \
            --output results/extracted.json
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: extraction-results
          path: results/
```

## 🎯 Best Practices

### Pipeline Design

1. **Start Simple**: Begin with 2-3 steps maximum
2. **Clear Dependencies**: Keep dependency chains short
3. **Meaningful Names**: Use descriptive step IDs and names
4. **Comprehensive Examples**: Provide diverse few-shot examples for each step
5. **Incremental Testing**: Test each step individually

### Performance

1. **Chunk Wisely**: Balance chunk size with processing speed
2. **Parallel Processing**: Maximize worker utilization
3. **Filter Early**: Use filters to reduce processing load
4. **Reuse Configurations**: Apply the same pipeline file across similar documents

### Maintenance

1. **Version Control**: Track pipeline configuration changes
2. **Documentation**: Document pipeline purpose and usage
3. **Monitoring**: Track performance and accuracy metrics
4. **Updates**: Regularly update examples and prompts

The pipeline system extends LangExtract from single-step extraction to multi-step processing for complex document analysis workflows.