langextract-rust 0.5.0

A Rust library for extracting structured and grounded information from text using LLMs
Documentation
# Alignment Demo

This example demonstrates LangExtract's **character-level text alignment** capabilities, showing how extracted information is precisely positioned within the original text.

## What This Example Does

Extracts various entities (**dates**, **times**, **locations**, **emails**, **phone numbers**, **prices**, **URLs**) and shows their exact character positions in the source text.

## Key Features Demonstrated

- 🎯 **Character-level positioning** - Exact start/end positions for each extraction
- 📍 **Alignment algorithms** - How LangExtract maps extracted text back to source
- 🔍 **Position accuracy** - Verification that extracted text matches source positions
- 🎨 **Visual highlighting** - HTML output showing extractions highlighted in context

## Files

- **`examples.json`** - Training examples with contact/event information
- **`config.yaml`** - Configuration optimized for alignment accuracy (low temperature)
- **`input.txt`** - Sample conference announcement with rich entity types
- **`run.sh`** - Script demonstrating alignment features
- **`output/`** - Generated results with character positions

## Quick Start

```bash
# Ensure you have a provider running
ollama serve
ollama pull mistral

# Run the alignment demo
./run.sh
```

## Understanding Character Intervals

The output shows character positions for each extraction:

```json
{
  "extraction_class": "date",
  "extraction_text": "April 22, 2024",
  "char_interval": {
    "start_pos": 45,
    "end_pos": 59
  }
}
```

This means:
- The text "April 22, 2024" was found at characters 45-58 (end_pos is exclusive)
- You can verify: `input_text[45:59]` equals "April 22, 2024"

## What to Look For

### 1. Position Accuracy
Check that extracted text exactly matches the source at the given positions:
- **dates**: "April 22, 2024" should align perfectly
- **emails**: "m.chen@mit.edu" should have correct boundaries  
- **phones**: "(617) 555-0123" should include parentheses and formatting
- **prices**: "$250.00" should include the dollar sign

### 2. Alignment Status
The system reports alignment quality:
- **MatchExact**: Perfect character-level match
- **MatchFuzzy**: Close match with minor differences
- **MatchGreater/Lesser**: Length differences between extracted and source text

### 3. Complex Entity Handling
See how the system handles:
- **Multi-word entities**: "Convention Center in Austin, Texas"
- **Formatted data**: Phone numbers with parentheses and dashes
- **URLs**: Complete web addresses with protocols
- **Monetary values**: Currency symbols and decimal places

## Configuration for Alignment

The example uses settings optimized for alignment accuracy:

```yaml
temperature: 0.2      # Lower = more consistent positioning
max_workers: 2        # Fewer workers = more consistent results
show_intervals: true  # Always display character positions
```

## Troubleshooting Alignment Issues

### Poor Alignment Accuracy
- **Lower temperature**: Try 0.1 for more deterministic results
- **Simpler examples**: Use more precise training examples
- **Shorter text**: Long documents can have cumulative alignment errors

### Missing Character Intervals
- **Check model output**: Some models may not preserve exact text
- **Verify examples**: Ensure training examples show desired precision
- **Enable debug**: Use `--debug` to see alignment process details

### Misaligned Positions
- **Unicode handling**: Special characters can affect positioning
- **Whitespace**: Extra spaces in model output vs. source text
- **Fuzzy matching**: System falls back to approximate matching

## Advanced Usage

### Custom Alignment Examples
Edit `examples.json` to train for specific entity types:

```json
{
  "text": "The meeting is scheduled for 2:30 PM on January 15th",
  "extractions": [
    {
      "extraction_class": "time", 
      "extraction_text": "2:30 PM"
    },
    {
      "extraction_class": "date",
      "extraction_text": "January 15th" 
    }
  ]
}
```

### Different Input Types
Test alignment with various text types:
- **Structured documents**: Forms, tables, lists
- **Natural language**: Narratives, descriptions
- **Mixed content**: Code snippets, formatted text
- **International text**: Unicode characters, different languages

### Visualization Analysis
The HTML output provides:
- **Color-coded highlighting** for different entity types
- **Hover tooltips** showing exact character positions
- **Statistics panel** with alignment accuracy metrics
- **Side-by-side comparison** of extracted vs. source text

## Next Steps

- Try **validation_demo** to see how alignment works with type validation
- Explore **advanced_examples** for alignment with large documents and chunking
- Test with your own text to see alignment accuracy for your use case
- Experiment with different models to compare alignment quality