datalab-cli 0.1.0

# extract

Extract structured data from a document using a JSON schema.

## Synopsis

```
datalab extract [OPTIONS] <FILE|URL> --schema <SCHEMA>
```

## Description

`datalab extract` pulls the fields named in a JSON schema out of a document and returns them as structured data matching that schema.

Use `--checkpoint-id` to reuse a previously parsed document, avoiding redundant processing.

---

## Arguments

| Argument | Description |
|----------|-------------|
| `<FILE\|URL>` | File path or URL to extract from |

---

## Options

### Schema Options

| Option | Description |
|--------|-------------|
| `--schema <SCHEMA>` | JSON schema file path or inline JSON string **(required)** |

### Output Options

| Option | Description | Default |
|--------|-------------|---------|
| `--include-scores` | Include per-field confidence scores | - |
| `-o, --output <FILE>` | Write result to file | - |

### Processing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--mode <MODE>` | Processing mode: `fast`, `balanced`, `accurate` | `fast` |
| `--max-pages <N>` | Maximum pages to process | - |
| `--page-range <RANGE>` | Page range (e.g., `"0-5,10"`) | - |
| `--checkpoint-id <ID>` | ID of a saved checkpoint; reuses the already-parsed document instead of parsing it again | - |
| `--save-checkpoint` | Save a checkpoint of the parsed document for later reuse | - |
| `--timeout <SECS>` | Request timeout in seconds | `300` |
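
The `--page-range` example suggests a comma-separated list of single pages and `start-end` ranges. A quick pre-check of such a string could be sketched as follows. `parse_page_range` is a hypothetical helper, not part of the CLI, and treating ranges as inclusive at both ends is an assumption that this page does not confirm.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a string such as "0-5,10" into a sorted list of page numbers.

    Assumption: "a-b" is inclusive at both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range {part!r} runs backwards")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


# Under the inclusive assumption, "0-5,10" covers seven pages.
pages = parse_page_range("0-5,10")
```

A malformed entry such as `"a-3"` raises `ValueError` from `int`, which makes a typo visible before a request is sent.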

### Cache Options

| Option | Description |
|--------|-------------|
| `--skip-cache` | Skip local cache lookup |

---

## Schema Format

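The schema is a JSON object with a `fields` array. Each entry names one field to extract, gives its type, and may include a description. In the template below, `type` takes exactly one of the listed values; the `|`-separated string shows the choices and is not itself a valid type.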

```json
{
  "fields": [
    {
      "name": "field_name",
      "type": "string|number|boolean|array|object",
      "description": "Optional description to guide extraction"
    }
  ]
}
```

### Field Types

| Type | Description |
|------|-------------|
| `string` | Text value |
| `number` | Numeric value (integer or decimal) |
| `boolean` | True/false value |
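| `array` | List of values; the element type can be given under `items` (see Nested Objects) |
| `object` | Nested object with sub-fields, listed under its own `fields` key |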
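
A schema can be sanity-checked locally before it is passed to `--schema`. The sketch below assumes only what this page documents: a `fields` list whose entries carry a `name` and one of the five types above. `check_schema` is a hypothetical helper, and the requirement that `fields` be non-empty is an assumption; the CLI may accept or reject more than this check does.

```python
import json

# Type names taken from the Field Types table.
ALLOWED_TYPES = {"string", "number", "boolean", "array", "object"}


def check_schema(schema: dict) -> list[str]:
    """Return a list of problems found in a schema (empty if none)."""
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        return ["schema must contain a non-empty 'fields' list"]
    problems = []
    for i, field in enumerate(fields):
        if not isinstance(field, dict):
            problems.append(f"fields[{i}]: must be an object")
            continue
        name = field.get("name")
        if not isinstance(name, str) or not name:
            problems.append(f"fields[{i}]: missing 'name'")
        ftype = field.get("type")
        if not isinstance(ftype, str) or ftype not in ALLOWED_TYPES:
            problems.append(f"fields[{i}]: unsupported type {ftype!r}")
    return problems


schema = json.loads("""
{
  "fields": [
    {"name": "invoice_number", "type": "string"},
    {"name": "total", "type": "number"}
  ]
}
""")
problems = check_schema(schema)  # empty list for this schema
```

The check only covers top-level fields; nested `items` and `fields` entries are left to the CLI.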

---

## Examples

### Basic Extraction

```bash
# Inline schema
datalab extract invoice.pdf --schema '{
  "fields": [
    {"name": "invoice_number", "type": "string"},
    {"name": "total", "type": "number"},
    {"name": "date", "type": "string"}
  ]
}'
```

### Schema from File

```bash
# Using a schema file
datalab extract invoice.pdf --schema schema.json
```

Where `schema.json` contains:

```json
{
  "fields": [
    {"name": "invoice_number", "type": "string"},
    {"name": "total", "type": "number"},
    {"name": "date", "type": "string"}
  ]
}
```

### With Descriptions

```bash
datalab extract contract.pdf --schema '{
  "fields": [
    {
      "name": "parties",
      "type": "array",
      "description": "Names of all parties to the contract"
    },
    {
      "name": "effective_date",
      "type": "string",
      "description": "Date when the contract becomes effective"
    }
  ]
}'
```

### Nested Objects

```bash
datalab extract receipt.pdf --schema '{
  "fields": [
    {
      "name": "items",
      "type": "array",
      "items": {
        "type": "object",
        "fields": [
          {"name": "description", "type": "string"},
          {"name": "quantity", "type": "number"},
          {"name": "price", "type": "number"}
        ]
      }
    }
  ]
}'
```

### With Confidence Scores

```bash
datalab extract document.pdf --schema schema.json --include-scores
```

With `--include-scores`, each extracted field is accompanied by a `<field>_score` entry:

```json
{
  "invoice_number": "INV-001",
  "invoice_number_score": 0.95,
  "total": 1250.00,
  "total_score": 0.98
}
```
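
For downstream use it is often convenient to separate values from their scores and flag weak ones. The sketch below assumes the `<field>_score` naming shown in the sample output; `split_scores`, `low_confidence`, and the `0.9` threshold are illustrative choices, not part of the CLI.

```python
import json

SCORE_SUFFIX = "_score"


def split_scores(result: dict) -> tuple[dict, dict]:
    """Split a result into (values, scores), keyed by field name."""
    values, scores = {}, {}
    for key, value in result.items():
        base = key[: -len(SCORE_SUFFIX)]
        # Treat a key as a score only if its base field is also present,
        # so a real field that happens to end in "_score" is kept as a value.
        if key.endswith(SCORE_SUFFIX) and base in result:
            scores[base] = value
        else:
            values[key] = value
    return values, scores


def low_confidence(scores: dict, threshold: float = 0.9) -> list[str]:
    """Return field names whose score falls below the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


result = json.loads("""
{
  "invoice_number": "INV-001",
  "invoice_number_score": 0.95,
  "total": 1250.00,
  "total_score": 0.98
}
""")
values, scores = split_scores(result)
needs_review = low_confidence(scores, threshold=0.96)
```

With the sample output and a threshold of `0.96`, only `invoice_number` is flagged for review.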

### Using Checkpoints

```bash
# First, convert with checkpoint
datalab convert document.pdf --save-checkpoint
# Returns checkpoint_id: "abc123"

# Then extract using checkpoint (faster)
datalab extract document.pdf --schema schema.json --checkpoint-id abc123
```
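
When several schemas are run against the same document, the checkpoint ID can be threaded through each call. The sketch below only assembles argument lists from the options documented on this page and does not run anything. `extract_command` and the schema file names are hypothetical.

```python
import shlex


def extract_command(source, schema_path, checkpoint_id=None, output=None):
    """Build the argv for one `datalab extract` call."""
    cmd = ["datalab", "extract", source, "--schema", schema_path]
    if checkpoint_id:
        cmd += ["--checkpoint-id", checkpoint_id]
    if output:
        cmd += ["-o", output]
    return cmd


# Hypothetical schema files, all reusing the checkpoint from the example above.
schemas = ["parties.json", "dates.json"]
commands = [
    extract_command(
        "document.pdf",
        schema,
        checkpoint_id="abc123",
        output=schema.replace(".json", ".out.json"),
    )
    for schema in schemas
]

# Shell-quoted form, convenient for logging or a dry run.
lines = [shlex.join(cmd) for cmd in commands]
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the commands look right.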

---

## Output Schema

The result is a JSON object with one key per extracted field, alongside a `metadata` object:

```json
{
  "field_name": "extracted_value",
  "another_field": 123,
  "metadata": {
    "processing_time": 1.5,
    "checkpoint_id": "abc123"
  }
}
```
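
A consumer of this output usually wants the extracted fields apart from the bookkeeping. The sketch below assumes the layout shown above, with `metadata` as the only non-field key; `split_result` is a hypothetical helper.

```python
import json


def split_result(result: dict) -> tuple[dict, dict]:
    """Separate extracted fields from the `metadata` object."""
    fields = {key: value for key, value in result.items() if key != "metadata"}
    metadata = result.get("metadata") or {}
    return fields, metadata


result = json.loads("""
{
  "field_name": "extracted_value",
  "another_field": 123,
  "metadata": {
    "processing_time": 1.5,
    "checkpoint_id": "abc123"
  }
}
""")
fields, metadata = split_result(result)
checkpoint_id = metadata.get("checkpoint_id")  # None if no checkpoint is reported
```

The returned `checkpoint_id` can then be passed to a later call through `--checkpoint-id`.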

---

## Related Commands

- [`convert`](convert.md) - Convert documents (can save checkpoints)
- [`extract-score`](extract-score.md) - Score extraction results
- [`segment`](segment.md) - Segment documents

---

## See Also

- [Extracting Data Tutorial](../tutorials/extract-data.md)
- [Checkpoints](../concepts/checkpoints.md)