# extract
Extract structured data from a document using a JSON schema.
## Synopsis
```
datalab extract <FILE|URL> --schema <SCHEMA> [OPTIONS]
```
## Description
Extract specific fields from documents using a JSON schema. Returns structured data matching your schema definition.
Use `--checkpoint-id` to reuse a previously parsed document, avoiding redundant processing.
---
## Arguments
| Argument | Description |
|----------|-------------|
| `<FILE\|URL>` | File path or URL to extract from |
---
## Options
### Schema Options
| Option | Description |
|--------|-------------|
| `--schema <SCHEMA>` | JSON schema file path or inline JSON string **(required)** |
### Output Options
| Option | Description | Default |
|--------|-------------|---------|
| `--include-scores` | Include per-field confidence scores | - |
| `-o, --output <FILE>` | Write result to file | - |
### Processing Options
| Option | Description | Default |
|--------|-------------|---------|
| `--mode <MODE>` | Processing mode: `fast`, `balanced`, `accurate` | `fast` |
| `--max-pages <N>` | Maximum number of pages to process | - |
| `--page-range <RANGE>` | Page range (e.g., `"0-5,10"`) | - |
| `--checkpoint-id <ID>` | Checkpoint ID of a previously parsed document to reuse | - |
| `--save-checkpoint` | Save a checkpoint for later reuse | - |
| `--timeout <SECS>` | Request timeout in seconds | `300` |
### Cache Options
| Option | Description |
|--------|-------------|
| `--skip-cache` | Skip local cache lookup |
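
When `extract` is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of the CLI: the helper name `build_extract_command` and its keyword arguments are hypothetical, and only the flags listed in the tables above are used. Options left unset are omitted so the CLI's own defaults apply. The resulting list could be passed to `subprocess.run`.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    source: str,
    schema: str,
    *,
    mode: Optional[str] = None,
    max_pages: Optional[int] = None,
    page_range: Optional[str] = None,
    checkpoint_id: Optional[str] = None,
    timeout: Optional[int] = None,
    output: Optional[str] = None,
    save_checkpoint: bool = False,
    include_scores: bool = False,
    skip_cache: bool = False,
) -> List[str]:
    """Assemble an argv list for `datalab extract` (hypothetical helper)."""
    if mode is not None and mode not in ("fast", "balanced", "accurate"):
        raise ValueError(f"unknown mode: {mode!r}")

    cmd = ["datalab", "extract", source, "--schema", schema]

    # Value-taking options are added only when explicitly set.
    for flag, value in (
        ("--mode", mode),
        ("--max-pages", max_pages),
        ("--page-range", page_range),
        ("--checkpoint-id", checkpoint_id),
        ("--timeout", timeout),
        ("--output", output),
    ):
        if value is not None:
            cmd += [flag, str(value)]

    # Boolean switches take no value.
    for flag, enabled in (
        ("--save-checkpoint", save_checkpoint),
        ("--include-scores", include_scores),
        ("--skip-cache", skip_cache),
    ):
        if enabled:
            cmd.append(flag)
    return cmd


cmd = build_extract_command(
    "invoice.pdf",
    "schema.json",
    mode="accurate",
    page_range="0-5,10",
    include_scores=True,
)
# shlex.join quotes arguments where needed, so the line can be pasted into a shell.
print(shlex.join(cmd))
```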
---
## Schema Format
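The schema is a JSON object with a `fields` array. Each entry names a field, gives its type, and may include a description to guide extraction: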
```json
{
  "fields": [
    {
      "name": "field_name",
      "type": "string|number|boolean|array|object",
      "description": "Optional description to guide extraction"
    }
  ]
}
```
### Field Types
| Type | Description |
|------|-------------|
| `string` | Text value |
| `number` | Numeric value (integer or decimal) |
| `boolean` | True/false value |
| `array` | List of values |
| `object` | Nested object with sub-fields |
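
A schema can be checked locally before it is sent, which catches typos in type names early. The sketch below is an illustration only: `check_fields` is a hypothetical helper, not something the CLI provides, and it assumes that nested sub-fields reuse the same `name`/`type` shape, either directly under `fields` or under `items` for arrays. The service may accept or reject schemas differently.

```python
ALLOWED_TYPES = ("string", "number", "boolean", "array", "object")


def check_fields(fields, path="fields"):
    """Return a list of problems found in a list of field definitions."""
    if not isinstance(fields, list):
        return [f"{path}: expected a list of field definitions"]

    problems = []
    for i, field in enumerate(fields):
        where = f"{path}[{i}]"
        if not isinstance(field, dict):
            problems.append(f"{where}: expected an object")
            continue
        name = field.get("name")
        if not isinstance(name, str) or not name:
            problems.append(f"{where}: missing or empty 'name'")
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"{where}: unsupported type {field.get('type')!r}")
        # Assumption: nested definitions follow the same shape as top-level ones.
        if "fields" in field:
            problems += check_fields(field["fields"], f"{where}.fields")
        items = field.get("items")
        if isinstance(items, dict) and "fields" in items:
            problems += check_fields(items["fields"], f"{where}.items.fields")
    return problems


schema = {
    "fields": [
        {"name": "invoice_number", "type": "string"},
        {"name": "total", "type": "float"},  # not one of the documented types
    ]
}
print(check_fields(schema["fields"]))
```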
---
## Examples
### Basic Extraction
```bash
# Inline schema
datalab extract invoice.pdf --schema '{
  "fields": [
    {"name": "invoice_number", "type": "string"},
    {"name": "total", "type": "number"},
    {"name": "date", "type": "string"}
  ]
}'
```
### Schema from File
```bash
# Using a schema file
datalab extract invoice.pdf --schema schema.json
```
Where `schema.json` contains:
```json
{
  "fields": [
    {"name": "invoice_number", "type": "string"},
    {"name": "total", "type": "number"},
    {"name": "date", "type": "string"}
  ]
}
```
### With Descriptions
```bash
datalab extract contract.pdf --schema '{
  "fields": [
    {
      "name": "parties",
      "type": "array",
      "description": "Names of all parties to the contract"
    },
    {
      "name": "effective_date",
      "type": "string",
      "description": "Date when the contract becomes effective"
    }
  ]
}'
```
### Nested Objects
```bash
datalab extract receipt.pdf --schema '{
  "fields": [
    {
      "name": "items",
      "type": "array",
      "items": {
        "type": "object",
        "fields": [
          {"name": "description", "type": "string"},
          {"name": "quantity", "type": "number"},
          {"name": "price", "type": "number"}
        ]
      }
    }
  ]
}'
```
### With Confidence Scores
```bash
datalab extract document.pdf --schema schema.json --include-scores
```
Output includes confidence scores:
```json
{
  "invoice_number": "INV-001",
  "invoice_number_score": 0.95,
  "total": 1250.00,
  "total_score": 0.98
}
```
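
The sample output above pairs each field with a `<field>_score` key. A small post-processing step can separate values from scores and flag fields that may need review. This is a sketch under that naming assumption, taken from the sample only; `split_scores`, `low_confidence`, and the `0.96` threshold are hypothetical and chosen for illustration.

```python
import json

raw = """
{
  "invoice_number": "INV-001",
  "invoice_number_score": 0.95,
  "total": 1250.00,
  "total_score": 0.98
}
"""


def split_scores(result, suffix="_score"):
    """Separate extracted values from their '<field>_score' companions."""
    values, scores = {}, {}
    for key, value in result.items():
        base = key[: -len(suffix)]
        # Treat a key as a score only if the field it refers to is also present.
        if key.endswith(suffix) and base in result:
            scores[base] = value
        else:
            values[key] = value
    return values, scores


def low_confidence(scores, threshold):
    """Names of fields whose score falls below the threshold, sorted."""
    return sorted(name for name, score in scores.items() if score < threshold)


values, scores = split_scores(json.loads(raw))
print(values)
print(low_confidence(scores, threshold=0.96))
```

Checking that the base field exists keeps a genuine field whose name happens to end in `_score` from being misread as a confidence value.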
### Using Checkpoints
```bash
# First, convert with checkpoint
datalab convert document.pdf --save-checkpoint
# Returns checkpoint_id: "abc123"
# Then extract using checkpoint (faster)
datalab extract document.pdf --schema schema.json --checkpoint-id abc123
```
---
## Output Schema
```json
{
  "field_name": "extracted_value",
  "another_field": 123,
  "metadata": {
    "processing_time": 1.5,
    "checkpoint_id": "abc123"
  }
}
```
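
Because the extracted fields and the `metadata` object share the top level of the result, a consumer will usually want to separate them. The sketch below assumes the shape shown above; `split_metadata` is a hypothetical helper, and `checkpoint_id` is read with `.get` on the assumption that it may be absent. The value, when present, could then be passed to `--checkpoint-id` on a later run.

```python
import json

raw = """
{
  "field_name": "extracted_value",
  "another_field": 123,
  "metadata": {
    "processing_time": 1.5,
    "checkpoint_id": "abc123"
  }
}
"""


def split_metadata(result):
    """Return (fields, metadata) from a parsed extract result."""
    fields = {key: value for key, value in result.items() if key != "metadata"}
    metadata = result.get("metadata") or {}
    return fields, metadata


fields, metadata = split_metadata(json.loads(raw))
checkpoint_id = metadata.get("checkpoint_id")  # None if no checkpoint was reported

print(fields)
print(checkpoint_id)
```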
---
## Related Commands
- [`convert`](convert.md) - Convert documents (can save checkpoints)
- [`extract-score`](extract-score.md) - Score extraction results
- [`segment`](segment.md) - Segment documents
---
## See Also
- [Extracting Data Tutorial](../tutorials/extract-data.md)
- [Checkpoints](../concepts/checkpoints.md)