datalab-cli 0.1.0

# segment

Segment a document into logical sections.

## Synopsis

```
datalab segment [OPTIONS] <FILE|URL> --schema <SCHEMA>
```

## Description

Splits a multi-document PDF into logical sections and labels each section with a document type from the schema. This is useful for document bundles that mix several document types, such as invoices, receipts, and contracts.

---

## Arguments

| Argument | Description |
|----------|-------------|
| `<FILE\|URL>` | File path or URL to segment |

---

## Options

### Schema Options

| Option | Description |
|--------|-------------|
| `--schema <SCHEMA>` | JSON schema file path or inline JSON string **(required)** |

### Processing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--mode <MODE>` | Processing mode: `fast`, `balanced`, `accurate` | `fast` |
| `--max-pages <N>` | Maximum pages to process | - |
| `--checkpoint-id <ID>` | Checkpoint ID to reuse parsed document | - |
| `--save-checkpoint` | Save checkpoint for reuse | - |
| `--timeout <SECS>` | Request timeout in seconds | `300` |

### Output Options

| Option | Description |
|--------|-------------|
| `-o, --output <FILE>` | Write result to file |

### Cache Options

| Option | Description |
|--------|-------------|
| `--skip-cache` | Skip local cache lookup |

---

## Schema Format

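The schema is a JSON object whose `segments` array lists the document types to identify. In the template below, `...` stands for further type names and is not literal JSON: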

```json
{
  "segments": ["document_type_1", "document_type_2", ...]
}
```
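If the list of document types is assembled in a script, the schema string can be generated rather than hand-written. The following is a minimal sketch; `build_segment_schema` is a hypothetical helper, not part of datalab-cli, and its duplicate and empty-list checks are precautions of this sketch rather than documented CLI behavior.

```python
import json


def build_segment_schema(document_types):
    """Return a JSON string in the {"segments": [...]} shape shown above.

    Hypothetical helper for illustration; not part of datalab-cli.
    """
    # Drop duplicate type names while keeping the original order.
    unique_types = list(dict.fromkeys(document_types))
    if not unique_types:
        raise ValueError("at least one document type is required")
    return json.dumps({"segments": unique_types})


schema = build_segment_schema(["invoice", "receipt", "contract"])
print(schema)
```

The resulting string can be passed inline to `--schema` or written to a file and passed by path.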

---

## Examples

### Basic Segmentation

```bash
# Identify invoices, receipts, and contracts
datalab segment bundle.pdf --schema '{
  "segments": ["invoice", "receipt", "contract"]
}'
```

### Schema from File

```bash
datalab segment bundle.pdf --schema segments.json
```

Where `segments.json` contains:
```json
{
  "segments": ["invoice", "receipt", "contract", "form"]
}
```

### With Checkpoint

```bash
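# First run: parse the document and save a checkpoint
datalab segment bundle.pdf --schema schema.json --save-checkpoint

# Later run: reuse the parsed document (abc123 is a placeholder checkpoint ID)
datalab segment bundle.pdf --schema schema.json --checkpoint-id abc123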
```

---

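## Output Schema

The result is a JSON object with a `segments` array and a `metadata` object. In the example below, page numbers are zero-based and each range is inclusive: 11 total pages are covered by pages 0 through 10.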

```json
{
  "segments": [
    {
      "type": "invoice",
      "start_page": 0,
      "end_page": 2,
      "confidence": 0.95
    },
    {
      "type": "receipt",
      "start_page": 3,
      "end_page": 3,
      "confidence": 0.92
    },
    {
      "type": "contract",
      "start_page": 4,
      "end_page": 10,
      "confidence": 0.98
    }
  ],
  "metadata": {
    "total_pages": 11,
    "processing_time": 3.2
  }
}
```
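A result in this shape can be checked with a few lines of Python. The sketch below reuses the example output above and assumes, based on that example only, that page numbers are zero-based and ranges are inclusive. `uncovered_pages` is a hypothetical helper that reports pages not assigned to any segment.

```python
import json

# Example result copied from the output shown above.
RESULT_JSON = """
{
  "segments": [
    {"type": "invoice", "start_page": 0, "end_page": 2, "confidence": 0.95},
    {"type": "receipt", "start_page": 3, "end_page": 3, "confidence": 0.92},
    {"type": "contract", "start_page": 4, "end_page": 10, "confidence": 0.98}
  ],
  "metadata": {"total_pages": 11, "processing_time": 3.2}
}
"""


def uncovered_pages(result):
    """Return the page numbers that no segment covers.

    Assumes zero-based, inclusive page ranges (inferred from the example).
    """
    covered = set()
    for seg in result["segments"]:
        covered.update(range(seg["start_page"], seg["end_page"] + 1))
    all_pages = set(range(result["metadata"]["total_pages"]))
    return sorted(all_pages - covered)


result = json.loads(RESULT_JSON)
for seg in result["segments"]:
    page_count = seg["end_page"] - seg["start_page"] + 1
    print(f'{seg["type"]}: pages {seg["start_page"]}-{seg["end_page"]} ({page_count} pages)')
print("uncovered pages:", uncovered_pages(result))
```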

---

## Use Cases

### Document Bundle Processing

Process a scanned bundle of documents:

```bash
# 1. Segment the bundle
datalab segment scanned-bundle.pdf --schema '{"segments": ["invoice", "receipt"]}'

# 2. Extract data from each segment, using the start_page-end_page ranges reported in step 1
datalab extract scanned-bundle.pdf --schema invoice-schema.json --page-range "0-2"
datalab extract scanned-bundle.pdf --schema receipt-schema.json --page-range "3-3"
```
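Step 2 can be scripted once the segment result is available as JSON. The sketch below only builds the `datalab extract` command lines and does not run them. `extract_commands` and the type-to-schema mapping are hypothetical, and the `--page-range` value format is assumed to match the example above (`start-end`).

```python
import shlex


def extract_commands(pdf_path, result, schema_files):
    """Build one `datalab extract` argument list per recognized segment.

    schema_files maps a segment type to an extraction schema file
    (hypothetical mapping; segment types without an entry are skipped).
    """
    commands = []
    for seg in result["segments"]:
        schema_file = schema_files.get(seg["type"])
        if schema_file is None:
            continue
        page_range = f'{seg["start_page"]}-{seg["end_page"]}'
        commands.append([
            "datalab", "extract", pdf_path,
            "--schema", schema_file,
            "--page-range", page_range,
        ])
    return commands


# Segment result in the shape documented under "Output Schema".
segment_result = {
    "segments": [
        {"type": "invoice", "start_page": 0, "end_page": 2, "confidence": 0.95},
        {"type": "receipt", "start_page": 3, "end_page": 3, "confidence": 0.92},
    ]
}
schemas = {"invoice": "invoice-schema.json", "receipt": "receipt-schema.json"}

for cmd in extract_commands("scanned-bundle.pdf", segment_result, schemas):
    # shlex.join quotes arguments where needed so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping each command as an argument list, rather than a single string, means it can later be passed to `subprocess.run` without shell quoting concerns.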

### Mail Room Automation

Classify incoming mail:

```bash
datalab segment incoming-mail.pdf --schema '{
  "segments": ["invoice", "statement", "letter", "advertisement"]
}'
```
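For routing, the segments can be grouped by type, with low-confidence segments set aside for manual review. This is a sketch under assumptions: `route_segments` is a hypothetical helper, the `0.9` threshold is an arbitrary illustration, and the sample data is invented in the shape of the documented output.

```python
from collections import defaultdict


def route_segments(result, min_confidence=0.9):
    """Group page ranges by segment type; collect low-confidence segments separately.

    min_confidence is an illustrative threshold, not a datalab-cli setting.
    """
    routed = defaultdict(list)
    needs_review = []
    for seg in result["segments"]:
        if seg["confidence"] < min_confidence:
            needs_review.append(seg)
        else:
            routed[seg["type"]].append((seg["start_page"], seg["end_page"]))
    return dict(routed), needs_review


# Invented sample in the shape documented under "Output Schema".
mail_result = {
    "segments": [
        {"type": "invoice", "start_page": 0, "end_page": 1, "confidence": 0.97},
        {"type": "letter", "start_page": 2, "end_page": 2, "confidence": 0.71},
        {"type": "invoice", "start_page": 3, "end_page": 4, "confidence": 0.93},
    ]
}

routed, needs_review = route_segments(mail_result)
print(routed)
print([seg["type"] for seg in needs_review])
```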

---

## Related Commands

- [`convert`](convert.md) - Convert documents
- [`extract`](extract.md) - Extract data from segments

---

## See Also

- [Checkpoints](../concepts/checkpoints.md)