# segment
Segment a document into logical sections.
## Synopsis
```
datalab segment <FILE|URL> --schema <SCHEMA> [OPTIONS]
```
## Description
Split multi-document PDFs into logical sections based on a schema. Useful for processing document bundles containing multiple document types (invoices, receipts, contracts, etc.).
---
## Arguments
| Argument | Description |
|----------|-------------|
| `<FILE\|URL>` | File path or URL to segment |
---
## Options
### Schema Options
| Option | Description |
|--------|-------------|
| `--schema <SCHEMA>` | JSON schema file path or inline JSON string **(required)** |
### Processing Options
| Option | Description | Default |
|--------|-------------|---------|
| `--mode <MODE>` | Processing mode: `fast`, `balanced`, `accurate` | `fast` |
| `--max-pages <N>` | Maximum pages to process | - |
| `--checkpoint-id <ID>` | Checkpoint ID to reuse parsed document | - |
| `--save-checkpoint` | Save checkpoint for reuse | - |
| `--timeout <SECS>` | Request timeout in seconds | `300` |
### Output Options
| Option | Description |
|--------|-------------|
| `-o, --output <FILE>` | Write result to file |
### Cache Options
| Option | Description |
|--------|-------------|
| `--skip-cache` | Skip local cache lookup |
---
## Schema Format
The schema defines the document types to identify:
```json
{
  "segments": ["document_type_1", "document_type_2", ...]
}
```
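As a minimal sketch, the schema can also be generated programmatically before it is passed to `--schema`. The helper name `build_segment_schema` and its validation rules (non-empty, unique names) are assumptions made for illustration, not documented requirements of the CLI.

```python
import json


def build_segment_schema(document_types):
    """Return a dict in the {"segments": [...]} shape shown above."""
    names = [name.strip() for name in document_types]
    # Assumed sanity checks; the CLI may accept or reject other inputs.
    if not names:
        raise ValueError("at least one document type is required")
    if any(not name for name in names):
        raise ValueError("document type names must be non-empty")
    if len(set(names)) != len(names):
        raise ValueError("document type names must be unique")
    return {"segments": names}


schema = build_segment_schema(["invoice", "receipt", "contract"])
schema_json = json.dumps(schema)
print(schema_json)
```

The printed string can be passed inline to `--schema` or saved to a file such as `segments.json`.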
---
## Examples
### Basic Segmentation
```bash
# Identify invoices, receipts, and contracts
datalab segment bundle.pdf --schema '{
  "segments": ["invoice", "receipt", "contract"]
}'
```
### Schema from File
```bash
datalab segment bundle.pdf --schema segments.json
```
Where `segments.json` contains:
```json
{
  "segments": ["invoice", "receipt", "contract", "form"]
}
```
### With Checkpoint
```bash
# Reuse a previously parsed document
datalab segment bundle.pdf --schema schema.json --checkpoint-id abc123
```
---
## Output Schema
```json
{
  "segments": [
    {
      "type": "invoice",
      "start_page": 0,
      "end_page": 2,
      "confidence": 0.95
    },
    {
      "type": "receipt",
      "start_page": 3,
      "end_page": 3,
      "confidence": 0.92
    },
    {
      "type": "contract",
      "start_page": 4,
      "end_page": 10,
      "confidence": 0.98
    }
  ],
  "metadata": {
    "total_pages": 11,
    "processing_time": 3.2
  }
}
```
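The sketch below shows one way to read this output in Python. It assumes, based on the example above (pages `0` through `10` with `total_pages` of `11`), that `start_page` and `end_page` are zero-based and inclusive. The function name `summarize` is hypothetical.

```python
import json

# The example output from this section, embedded as a string.
result_json = """
{
  "segments": [
    {"type": "invoice", "start_page": 0, "end_page": 2, "confidence": 0.95},
    {"type": "receipt", "start_page": 3, "end_page": 3, "confidence": 0.92},
    {"type": "contract", "start_page": 4, "end_page": 10, "confidence": 0.98}
  ],
  "metadata": {"total_pages": 11, "processing_time": 3.2}
}
"""


def summarize(result):
    """Return (type, start_page, end_page, page_count) for each segment."""
    rows = []
    for seg in result["segments"]:
        # Assumption: page indices are zero-based and the range is inclusive.
        page_count = seg["end_page"] - seg["start_page"] + 1
        rows.append((seg["type"], seg["start_page"], seg["end_page"], page_count))
    return rows


result = json.loads(result_json)
rows = summarize(result)
covered = sum(row[3] for row in rows)

for row in rows:
    print(row)
# Check that the segments account for every page in the document.
print(covered == result["metadata"]["total_pages"])
```

Comparing the covered page count against `metadata.total_pages` is a cheap way to spot gaps or overlaps between segments.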
---
## Use Cases
### Document Bundle Processing
Process a scanned bundle of documents:
```bash
# 1. Segment the bundle
datalab segment scanned-bundle.pdf --schema '{"segments": ["invoice", "receipt"]}'
# 2. Extract data from each segment, using the page ranges from the segment output
datalab extract scanned-bundle.pdf --schema invoice-schema.json --page-range "0-2"
datalab extract scanned-bundle.pdf --schema receipt-schema.json --page-range "3-3"
```
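The second step can be scripted. The sketch below builds one `datalab extract` command per segment from the `start_page` and `end_page` values, using only the flags shown above. The `SCHEMA_FILES` mapping and the `extract_commands` helper are hypothetical, and the commands are only printed, not executed.

```python
import shlex

# Hypothetical mapping from segment type to an extraction schema file.
SCHEMA_FILES = {
    "invoice": "invoice-schema.json",
    "receipt": "receipt-schema.json",
}


def extract_commands(pdf_path, segments, schema_files):
    """Build an argv list for `datalab extract` for each known segment type."""
    commands = []
    for seg in segments:
        schema_file = schema_files.get(seg["type"])
        if schema_file is None:
            continue  # no extraction schema configured for this type
        page_range = f'{seg["start_page"]}-{seg["end_page"]}'
        commands.append([
            "datalab", "extract", pdf_path,
            "--schema", schema_file,
            "--page-range", page_range,
        ])
    return commands


# Segments as they appear in the "segments" array of the segment output.
segments = [
    {"type": "invoice", "start_page": 0, "end_page": 2, "confidence": 0.95},
    {"type": "receipt", "start_page": 3, "end_page": 3, "confidence": 0.92},
]

commands = extract_commands("scanned-bundle.pdf", segments, SCHEMA_FILES)
for argv in commands:
    print(shlex.join(argv))
```

Keeping each command as an argument list means it can be handed to `subprocess.run` without shell quoting concerns, should you choose to execute it.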
### Mail Room Automation
Classify incoming mail:
```bash
datalab segment incoming-mail.pdf --schema '{
  "segments": ["invoice", "statement", "letter", "advertisement"]
}'
```
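Once the mail is classified, the segments can be routed by type. The following is a sketch only: the `route_segments` helper, the `manual_review` queue name, and the `0.9` confidence threshold are all assumptions, and the sample segments are made up for illustration.

```python
from collections import defaultdict

REVIEW_THRESHOLD = 0.9  # hypothetical cut-off; tune for your own documents


def route_segments(segments, threshold=REVIEW_THRESHOLD):
    """Group (start_page, end_page) ranges by type; low confidence goes to review."""
    routed = defaultdict(list)
    for seg in segments:
        page_range = (seg["start_page"], seg["end_page"])
        queue = seg["type"] if seg["confidence"] >= threshold else "manual_review"
        routed[queue].append(page_range)
    return dict(routed)


# Made-up sample in the shape of the "segments" array from the output schema.
segments = [
    {"type": "invoice", "start_page": 0, "end_page": 1, "confidence": 0.97},
    {"type": "advertisement", "start_page": 2, "end_page": 2, "confidence": 0.71},
    {"type": "letter", "start_page": 3, "end_page": 4, "confidence": 0.93},
    {"type": "invoice", "start_page": 5, "end_page": 5, "confidence": 0.95},
]

routed = route_segments(segments)
for queue, ranges in routed.items():
    print(queue, ranges)
```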
---
## Related Commands
- [`convert`](convert.md) - Convert documents
- [`extract`](extract.md) - Extract data from segments
---
## See Also
- [Checkpoints](../concepts/checkpoints.md)