datalab-cli 0.1.0

A powerful CLI for converting, extracting, and processing documents using the Datalab API
# Checkpoints

Checkpoints allow you to reuse parsed documents across multiple operations, saving time and API costs.

---

## What Are Checkpoints?

When you process a document, the Datalab API parses and analyzes its structure. A checkpoint saves this parsed state on the server, allowing you to:

- Run multiple extractions without re-parsing
- Segment documents after conversion
- Score extraction results

```mermaid
flowchart LR
    A[Document] --> B[Parse]
    B --> C[Checkpoint]
    C --> D[Extract 1]
    C --> E[Extract 2]
    C --> F[Segment]
    C --> G[Score]
```

---

## Creating Checkpoints

Add `--save-checkpoint` to a `convert`, `extract`, or `segment` command:

```bash
# During conversion
datalab convert document.pdf --save-checkpoint

# During extraction
datalab extract document.pdf --schema schema.json --save-checkpoint
```

The JSON response includes a `checkpoint_id` under `metadata`:

```json
{
  "content": "...",
  "metadata": {
    "checkpoint_id": "ckpt_abc123def456"
  }
}
```

---

## Using Checkpoints

Reference a checkpoint with `--checkpoint-id`:

```bash
# Extract using existing checkpoint
datalab extract document.pdf --schema schema.json --checkpoint-id ckpt_abc123def456

# Segment using existing checkpoint
datalab segment document.pdf --schema segments.json --checkpoint-id ckpt_abc123def456
```

---

## Checkpoint Workflow

### Example: Multiple Extractions

```bash
# 1. Convert and save checkpoint
result=$(datalab convert invoice.pdf --save-checkpoint)
checkpoint_id=$(echo "$result" | jq -r '.metadata.checkpoint_id')

# 2. Extract different data using the same checkpoint
datalab extract invoice.pdf --schema header.json --checkpoint-id "$checkpoint_id"
datalab extract invoice.pdf --schema line_items.json --checkpoint-id "$checkpoint_id"
datalab extract invoice.pdf --schema totals.json --checkpoint-id "$checkpoint_id"
```
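The three `extract` calls above differ only in the schema file, so they can be wrapped in a loop. The following is a minimal sketch, not part of the CLI: `extract_each_schema` is a hypothetical helper name, and it assumes `datalab` exits non-zero when an extraction fails.

```bash
# Run one extraction per schema against the same checkpoint.
# Usage: extract_each_schema <document> <checkpoint-id> <schema>...
# ASSUMPTION: datalab returns a non-zero exit status on failure.
extract_each_schema() {
  doc=$1
  checkpoint_id=$2
  shift 2
  for schema in "$@"; do
    # Stop at the first failed extraction instead of continuing silently.
    datalab extract "$doc" --schema "$schema" --checkpoint-id "$checkpoint_id" || return 1
  done
}

# Example call, reusing the variable from the workflow above:
# extract_each_schema invoice.pdf "$checkpoint_id" header.json line_items.json totals.json
```

Stopping at the first failure keeps an expired checkpoint from producing a long run of identical errors.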

### Example: Convert Then Segment

```bash
# 1. Convert document
datalab convert bundle.pdf --save-checkpoint
# Returns checkpoint_id: "ckpt_abc123"

# 2. Segment using checkpoint
datalab segment bundle.pdf --schema '{"segments": ["invoice", "receipt"]}' \
  --checkpoint-id ckpt_abc123
```

### Example: Extract Then Score

```bash
# 1. Extract with checkpoint
datalab extract invoice.pdf --schema schema.json --save-checkpoint
# Returns checkpoint_id: "ckpt_xyz789"

# 2. Score the extraction
datalab extract-score --checkpoint-id ckpt_xyz789
```

---

## Cost Benefits

Checkpoints reduce processing costs by avoiding redundant parsing:

| Operation | Without Checkpoint | With Checkpoint |
|-----------|-------------------|-----------------|
| Convert | Full parse | Full parse |
| Extract #1 | Full parse | Reuse parse |
| Extract #2 | Full parse | Reuse parse |
| Segment | Full parse | Reuse parse |
| **Total** | 4x parse cost | 1x parse cost |
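
The totals in the table are simple counting: without a checkpoint every operation pays for its own parse, and with one the parse happens once. A small sketch of that arithmetic follows. `parse_count` is an illustrative helper, and it counts parses only, since actual pricing is not specified on this page.

```bash
# Number of full parses needed for a run of N operations on one document.
# mode is "with" (checkpoint reused) or "without" (every operation re-parses).
parse_count() {
  operations=$1
  mode=$2
  if [ "$mode" = "with" ] && [ "$operations" -gt 0 ]; then
    echo 1
  else
    echo "$operations"
  fi
}

# The four operations from the table: convert, two extracts, one segment.
parse_count 4 without   # prints 4
parse_count 4 with      # prints 1
```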

---

## Checkpoint Retention

Datalab stores checkpoints on its servers for a period that depends on your plan:

| Plan | Retention Period |
|------|------------------|
| Free | 1 hour |
| Pro | 24 hours |
| Enterprise | Configurable |

!!! warning "Checkpoint Expiration"
    Checkpoints expire after their retention period. Plan your workflow to use checkpoints within the retention window.

---

## Commands Supporting Checkpoints

### Can Create Checkpoints

| Command | Flag |
|---------|------|
| `convert` | `--save-checkpoint` |
| `extract` | `--save-checkpoint` |
| `segment` | `--save-checkpoint` |

### Can Use Checkpoints

| Command | Flag |
|---------|------|
| `extract` | `--checkpoint-id <ID>` |
| `segment` | `--checkpoint-id <ID>` |
| `extract-score` | `--checkpoint-id <ID>` |

---

## Checkpoints vs. Caching

| Feature | Checkpoints | Local Cache |
|---------|-------------|-------------|
| **Location** | Datalab servers | Your machine |
| **Purpose** | Reuse parsed document | Avoid duplicate API calls |
| **Retention** | Hours (server-defined) | Until you clear it |
| **Cross-operation** | Yes | No (same operation only) |
| **Cost** | Included in API | Free (local storage) |

### When to Use Checkpoints

- Multiple extractions from the same document
- Extract → Score workflow
- Convert → Segment workflow
- Processing the same document with different schemas

### When to Use Local Cache

- Repeated identical operations
- Development and testing
- Avoiding redundant API calls

---

## Best Practices

### Save Checkpoints Proactively

If you might need to re-process a document, save a checkpoint:

```bash
# Save a checkpoint for any document you expect to process again
datalab convert document.pdf --save-checkpoint > result.json
```

### Store Checkpoint IDs

Save checkpoint IDs for later use:

```bash
# Parse and store checkpoint
checkpoint=$(datalab convert doc.pdf --save-checkpoint | jq -r '.metadata.checkpoint_id')
echo "$checkpoint" > checkpoint.txt

# Use later
datalab extract doc.pdf --schema schema.json --checkpoint-id "$(cat checkpoint.txt)"
```

### Plan Workflows Within Retention

Ensure all checkpoint operations complete before expiration:

```bash
# Good: All operations in sequence
datalab convert doc.pdf --save-checkpoint  # t=0
datalab extract doc.pdf --schema a.json --checkpoint-id ...  # t=1min
datalab extract doc.pdf --schema b.json --checkpoint-id ...  # t=2min

# Risk: Long delay between operations
datalab convert doc.pdf --save-checkpoint  # t=0
# ... 2 hours later (already past the 1-hour Free retention) ...
datalab extract doc.pdf --schema a.json --checkpoint-id ...  # May fail if expired
```
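
A script can guard against the risky case by recording when the checkpoint was created and comparing that against the retention window before reusing it. This is a hedged sketch: `checkpoint_is_fresh` is a hypothetical helper, the timestamp is tracked locally by your script (this page does not say whether the API reports an expiry time), and a checkpoint at exactly the retention limit is treated as expired to stay on the safe side.

```bash
# Succeeds if a checkpoint saved at saved_epoch is still inside the
# retention window. All times are Unix seconds.
# Usage: checkpoint_is_fresh <saved_epoch> <retention_seconds> [now_epoch]
checkpoint_is_fresh() {
  saved_epoch=$1
  retention_seconds=$2
  now_epoch=${3:-$(date +%s)}   # defaults to the current time
  age=$((now_epoch - saved_epoch))
  [ "$age" -ge 0 ] && [ "$age" -lt "$retention_seconds" ]
}

# Example (3600 seconds = the 1-hour Free retention):
# saved_at=$(date +%s)    # record this right after --save-checkpoint
# checkpoint_is_fresh "$saved_at" 3600 || echo "checkpoint likely expired" >&2
```

The local clock is only an estimate of the server's view, so leave some margin rather than running close to the limit.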

---

## Troubleshooting

### "Checkpoint not found" Error

The checkpoint may have expired (see Checkpoint Retention above). Create a new one by re-running the command with `--save-checkpoint`:

```bash
datalab convert document.pdf --save-checkpoint
```
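
In a script, the same recovery can be automated: try the stored checkpoint first and fall back to a full run that saves a fresh one. This sketch assumes `datalab` exits non-zero when the checkpoint is not found, which this page does not confirm. `extract_with_fallback` is a hypothetical helper name.

```bash
# Try the stored checkpoint; on failure, extract from scratch and
# save a new checkpoint.
# ASSUMPTION: datalab returns a non-zero exit status on errors such as
# "Checkpoint not found".
extract_with_fallback() {
  doc=$1
  schema=$2
  checkpoint_id=$3
  datalab extract "$doc" --schema "$schema" --checkpoint-id "$checkpoint_id" \
    || datalab extract "$doc" --schema "$schema" --save-checkpoint
}

# Example:
# extract_with_fallback document.pdf schema.json "$(cat checkpoint.txt)"
```

The fallback fires on any failure, not only on expiry, so a network error would also trigger a full re-parse.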

### Checkpoint Not Returned

The response only contains a `checkpoint_id` when the command is run with `--save-checkpoint`:

```bash
# Wrong: no checkpoint
datalab convert document.pdf

# Right: checkpoint saved
datalab convert document.pdf --save-checkpoint
```
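
When the ID is pulled out with `jq -r '.metadata.checkpoint_id'`, as in the workflow examples on this page, a missing field does not cause an error: `jq -r` prints the literal string `null`. A script can catch that before passing the value on. The helper name `valid_checkpoint_id` below is illustrative only.

```bash
# Succeeds only if the argument looks like a usable checkpoint ID.
# An empty string or the literal "null" (jq's output for a missing
# field) is rejected.
valid_checkpoint_id() {
  case "$1" in
    ""|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Example:
# valid_checkpoint_id "$checkpoint_id" || {
#   echo "no checkpoint_id in response; was --save-checkpoint set?" >&2
#   exit 1
# }
```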

---

## See Also

- [extract-score command](../commands/extract-score.md)
- [Caching](caching.md)
- [Extracting Data Tutorial](../tutorials/extract-data.md)