# Extracting Structured Data
Learn how to extract specific fields from documents using JSON schemas.
---
## Prerequisites
- [Datalab CLI installed](../getting-started/installation.md)
- [API key configured](../getting-started/configuration.md)
---
## Basic Extraction
Extract fields from a document using an inline schema:
```bash
datalab extract invoice.pdf --schema '{
"fields": [
{"name": "invoice_number", "type": "string"},
{"name": "total", "type": "number"},
{"name": "date", "type": "string"}
]
}'
```
Output:
```json
{
"invoice_number": "INV-2024-001",
"total": 1250.00,
"date": "2024-01-15"
}
```
---
## Schema Basics
### Schema Structure
```json
{
"fields": [
{
"name": "field_name",
"type": "string|number|boolean|array|object",
"description": "Optional hint for extraction"
}
]
}
```
### Field Types
| `string` | Text value | `"John Doe"` |
| `number` | Numeric value | `1250.00` |
| `boolean` | True/false | `true` |
| `array` | List of values | `["item1", "item2"]` |
| `object` | Nested object | `{"name": "John", "age": 30}` |
---
## Using Schema Files
For complex schemas, use a file:
```bash
datalab extract document.pdf --schema schema.json
```
### Example: Invoice Schema
Create `invoice-schema.json`:
```json
{
"fields": [
{
"name": "invoice_number",
"type": "string",
"description": "The unique invoice identifier"
},
{
"name": "vendor_name",
"type": "string",
"description": "Name of the vendor/seller"
},
{
"name": "invoice_date",
"type": "string",
"description": "Date the invoice was issued"
},
{
"name": "due_date",
"type": "string",
"description": "Payment due date"
},
{
"name": "subtotal",
"type": "number",
"description": "Total before tax"
},
{
"name": "tax",
"type": "number",
"description": "Tax amount"
},
{
"name": "total",
"type": "number",
"description": "Total amount due"
}
]
}
```
Run extraction:
```bash
datalab extract invoice.pdf --schema invoice-schema.json
```
---
## Adding Descriptions
Descriptions help the AI understand what to extract:
```json
{
"fields": [
{
"name": "parties",
"type": "array",
"description": "Names of all parties signing the contract, including both individuals and organizations"
},
{
"name": "effective_date",
"type": "string",
"description": "The date when the contract becomes effective, not the signing date"
}
]
}
```
Good descriptions:
- Clarify ambiguous terms
- Specify format expectations
- Distinguish similar fields
---
## Nested Objects
Extract complex structured data:
```json
{
"fields": [
{
"name": "customer",
"type": "object",
"fields": [
{"name": "name", "type": "string"},
{"name": "email", "type": "string"},
{"name": "phone", "type": "string"},
{
"name": "address",
"type": "object",
"fields": [
{"name": "street", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "zip", "type": "string"}
]
}
]
}
]
}
```
Output:
```json
{
"customer": {
"name": "John Doe",
"email": "john@example.com",
"phone": "555-123-4567",
"address": {
"street": "123 Main St",
"city": "Anytown",
"state": "CA",
"zip": "90210"
}
}
}
```
---
## Arrays of Items
Extract lists of items:
```json
{
"fields": [
{
"name": "line_items",
"type": "array",
"description": "Individual items on the invoice",
"items": {
"type": "object",
"fields": [
{"name": "description", "type": "string"},
{"name": "quantity", "type": "number"},
{"name": "unit_price", "type": "number"},
{"name": "total", "type": "number"}
]
}
}
]
}
```
Output:
```json
{
"line_items": [
{
"description": "Widget A",
"quantity": 10,
"unit_price": 25.00,
"total": 250.00
},
{
"description": "Widget B",
"quantity": 5,
"unit_price": 50.00,
"total": 250.00
}
]
}
```
---
## Confidence Scores
Get confidence scores for extracted values:
```bash
datalab extract invoice.pdf --schema schema.json --include-scores
```
Output:
```json
{
"invoice_number": "INV-2024-001",
"invoice_number_score": 0.95,
"total": 1250.00,
"total_score": 0.98,
"date": "2024-01-15",
"date_score": 0.92
}
```
### Interpreting Scores
| 0.9 - 1.0 | High | Trust the value |
| 0.7 - 0.9 | Good | Likely accurate |
| 0.5 - 0.7 | Medium | Review recommended |
| 0.0 - 0.5 | Low | Manual verification needed |
---
## Using Checkpoints
### Save Checkpoint
Save parsing work for reuse:
```bash
datalab extract document.pdf --schema schema1.json --save-checkpoint
# Returns checkpoint_id: "ckpt_abc123"
```
### Reuse Checkpoint
Run additional extractions without re-parsing:
```bash
# Extract with different schema, same document
datalab extract document.pdf --schema schema2.json --checkpoint-id ckpt_abc123
```
This is faster and more cost-effective for multiple extractions.
---
## Processing Options
### Processing Mode
```bash
# Fast mode for simple documents
datalab extract invoice.pdf --schema schema.json --mode fast
# Accurate mode for complex documents
datalab extract contract.pdf --schema schema.json --mode accurate
```
### Page Selection
```bash
# Limit pages
datalab extract report.pdf --schema schema.json --max-pages 10
# Specific pages
datalab extract report.pdf --schema schema.json --page-range "0-5,20-25"
```
---
## Practical Examples
### Invoice Processing
```json
{
"fields": [
{"name": "invoice_number", "type": "string"},
{"name": "vendor_name", "type": "string"},
{"name": "vendor_address", "type": "string"},
{"name": "invoice_date", "type": "string"},
{"name": "due_date", "type": "string"},
{
"name": "line_items",
"type": "array",
"items": {
"type": "object",
"fields": [
{"name": "description", "type": "string"},
{"name": "quantity", "type": "number"},
{"name": "unit_price", "type": "number"},
{"name": "amount", "type": "number"}
]
}
},
{"name": "subtotal", "type": "number"},
{"name": "tax", "type": "number"},
{"name": "total", "type": "number"}
]
}
```
### Resume Parsing
```json
{
"fields": [
{"name": "name", "type": "string"},
{"name": "email", "type": "string"},
{"name": "phone", "type": "string"},
{"name": "summary", "type": "string", "description": "Professional summary or objective"},
{
"name": "experience",
"type": "array",
"items": {
"type": "object",
"fields": [
{"name": "company", "type": "string"},
{"name": "title", "type": "string"},
{"name": "start_date", "type": "string"},
{"name": "end_date", "type": "string"},
{"name": "responsibilities", "type": "array"}
]
}
},
{
"name": "education",
"type": "array",
"items": {
"type": "object",
"fields": [
{"name": "institution", "type": "string"},
{"name": "degree", "type": "string"},
{"name": "graduation_date", "type": "string"}
]
}
},
{"name": "skills", "type": "array"}
]
}
```
### Contract Analysis
```json
{
"fields": [
{"name": "contract_type", "type": "string", "description": "Type of contract (NDA, Employment, Service, etc.)"},
{"name": "parties", "type": "array", "description": "All parties to the contract"},
{"name": "effective_date", "type": "string"},
{"name": "termination_date", "type": "string"},
{"name": "governing_law", "type": "string", "description": "Jurisdiction governing the contract"},
{
"name": "key_terms",
"type": "array",
"description": "Important terms, conditions, or obligations"
}
]
}
```
---
## Batch Extraction
Process multiple documents:
```bash
#!/bin/bash
schema="invoice-schema.json"
for file in invoices/*.pdf; do
output="${file%.pdf}.json"
echo "Extracting $file..."
datalab extract "$file" --schema "$schema" --output "$output"
done
```
---
## Troubleshooting
### Missing Fields
If fields are not extracted:
1. Add descriptions to clarify what to look for
2. Try accurate mode for complex documents
3. Check if the field exists in the document
### Wrong Values
If values are incorrect:
1. Add more specific descriptions
2. Use confidence scores to identify low-confidence extractions
3. Try accurate mode
### Nested Data Not Extracted
Ensure your schema structure matches the document:
```json
{
"fields": [
{
"name": "address",
"type": "object",
"fields": [
{"name": "street", "type": "string"},
{"name": "city", "type": "string"}
]
}
]
}
```
---
## Next Steps
- [Fill forms with extracted data](fill-forms.md)
- [Score extractions](../commands/extract-score.md)
- [Learn about checkpoints](../concepts/checkpoints.md)