datalab-cli 0.1.0

A powerful CLI for converting, extracting, and processing documents using the Datalab API
# Caching

The Datalab CLI includes built-in caching to reduce API costs and speed up repeated operations.

---

## How It Works

When you run a command, the CLI:

1. **Generates a cache key** from the file contents, endpoint, and parameters
2. **Checks the local cache** for a matching entry
3. If found, **returns the cached result** immediately
4. If not found, **calls the API** and caches the response

```mermaid
flowchart LR
    A[Command] --> B{Cache Hit?}
    B -->|Yes| C[Return Cached]
    B -->|No| D[Call API]
    D --> E[Cache Response]
    E --> F[Return Result]
```
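The flow above can be sketched as a small read-through cache. This is a minimal illustration, not the CLI's actual implementation: the function name `read_through`, the one-JSON-file-per-key layout, and the `fake_api` stand-in are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def read_through(cache_dir: Path, key: str, call_api: Callable[[], dict]) -> Tuple[dict, bool]:
    """Return (response, cache_hit) for the given cache key (hypothetical helper)."""
    entry = cache_dir / f"{key}.json"
    if entry.exists():
        # Steps 2-3: a matching entry exists, so return it without an API call.
        return json.loads(entry.read_text(encoding="utf-8")), True
    # Step 4: no entry, so call the API and store the response for next time.
    response = call_api()
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(response), encoding="utf-8")
    return response, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fake_api = lambda: {"markdown": "# Title"}  # stand-in for a real API request
        for _ in range(2):
            _, hit = read_through(Path(tmp), "a1b2c3d4", fake_api)
            print("cache hit" if hit else "cache miss, called API")
```

The first call in the demo misses and stores the response; the second call finds the stored file and never invokes `fake_api`.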

---

## Cache Location

The cache is stored in your system's cache directory:

| Operating System | Path |
|-----------------|------|
| Linux | `~/.cache/datalab/` |
| macOS | `~/Library/Caches/datalab/` |
| Windows | `%LOCALAPPDATA%\datalab\cache\` |

### Directory Structure

```
~/.cache/datalab/
├── responses/           # JSON API responses
│   ├── a1b2c3d4.json   # Cached response
│   ├── e5f6g7h8.json
│   └── ...
└── files/              # Binary files (filled forms, created documents)
    ├── i9j0k1l2.pdf
    └── ...
```
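If a script needs to locate the cache itself, the per-OS paths from the table above can be resolved roughly as follows. This is a sketch under assumptions: `default_cache_dir` is a hypothetical helper, and the CLI may resolve its directory differently (for example, by honoring environment overrides that this page does not describe).

```python
import os
import sys
from pathlib import Path, PurePosixPath, PureWindowsPath
from typing import Optional


def default_cache_dir(platform: Optional[str] = None,
                      home: Optional[str] = None,
                      local_app_data: Optional[str] = None) -> str:
    """Map an OS identifier to the cache path listed in the table (hypothetical helper)."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        # Windows: %LOCALAPPDATA%\datalab\cache\
        base = os.environ.get("LOCALAPPDATA", "") if local_app_data is None else local_app_data
        return str(PureWindowsPath(base) / "datalab" / "cache")
    home = str(Path.home()) if home is None else home
    if platform == "darwin":
        # macOS: ~/Library/Caches/datalab/
        return str(PurePosixPath(home) / "Library" / "Caches" / "datalab")
    # Linux and other POSIX systems: ~/.cache/datalab/
    return str(PurePosixPath(home) / ".cache" / "datalab")
```

Calling `default_cache_dir()` with no arguments uses the current platform and home directory; the explicit parameters exist only so the mapping can be checked on any machine.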

---

## Cache Key Generation

Cache keys are SHA256 hashes computed from:

| Component | Description |
|-----------|-------------|
| File hash | SHA256 of file contents (for local files) |
| URL | Full URL (for remote files) |
| Endpoint | API endpoint name (e.g., `convert`, `extract`) |
| Parameters | Sorted JSON of all request parameters |

This ensures:

- **Same file + same options** → Cache hit
- **Same file + different options** → Cache miss
- **Modified file** → Cache miss (different hash)
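To make these properties concrete, here is a minimal sketch of one way to derive such a key. The exact byte layout the CLI hashes is not documented here, so the newline-joined payload, the function name `cache_key`, and the parameter names in the usage note are assumptions.

```python
import hashlib
import json
from typing import Optional


def cache_key(endpoint: str, params: dict,
              file_bytes: Optional[bytes] = None, url: Optional[str] = None) -> str:
    """Combine source identity, endpoint, and sorted parameters into one SHA256 hex digest."""
    if (file_bytes is None) == (url is None):
        raise ValueError("pass exactly one of file_bytes or url")
    # Local files are identified by a hash of their contents, remote files by their URL.
    source = hashlib.sha256(file_bytes).hexdigest() if file_bytes is not None else url
    # sort_keys makes the serialization independent of parameter order.
    canonical_params = json.dumps(params, sort_keys=True, separators=(",", ":"))
    payload = "\n".join([source, endpoint, canonical_params])  # assumed layout
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this scheme, `cache_key("convert", {"output_format": "markdown"}, file_bytes=data)` returns the same digest on every call, while changing either `data` or any parameter value produces a different one.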

### Example

```bash
# These commands have different cache keys:
datalab convert doc.pdf --output-format markdown
datalab convert doc.pdf --output-format html
datalab convert doc.pdf --output-format markdown --mode accurate
```

---

## Bypassing the Cache

### Skip Local Cache

Use `--skip-cache` to ignore the local cache:

```bash
datalab convert document.pdf --skip-cache
```

This still uses the API's server-side cache.

### Force Reprocessing

Use `--force` to bypass the API's server-side cache:

```bash
datalab convert document.pdf --force
```

The CLI still checks the local cache first, so a local hit returns without reaching the API.

### Skip Both Caches

Combine both flags for fully fresh processing:

```bash
datalab convert document.pdf --skip-cache --force
```
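The interaction of the two flags can be summarized as a small truth table. The sketch below only restates the behavior described in this section; `caches_in_play` is a hypothetical name, not part of the CLI.

```python
from typing import List


def caches_in_play(skip_cache: bool = False, force: bool = False) -> List[str]:
    """List which caches may still answer a request for a given flag combination."""
    caches = []
    if not skip_cache:
        caches.append("local")   # --skip-cache removes the local cache
    if not force:
        caches.append("server")  # --force removes the API's server-side cache
    return caches


if __name__ == "__main__":
    for skip in (False, True):
        for force in (False, True):
            print(f"skip_cache={skip!s:5} force={force!s:5} ->", caches_in_play(skip, force) or "none")
```

Only the combination of both flags leaves no cache in play.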

---

## Managing the Cache

### View Statistics

```bash
datalab cache stats
```

Output:
```json
{
  "cache_dir": "/home/user/.cache/datalab",
  "response_count": 150,
  "response_size": 52428800,
  "file_count": 10,
  "file_size": 314572800
}
```
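The size fields in the sample appear to be byte counts (an assumption based on their magnitude). A short sketch for turning output of this shape into a readable summary follows; `human_size` and `summarize` are hypothetical helpers.

```python
import json


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GiB"


def summarize(stats_json: str) -> str:
    """Summarize a stats document with the fields shown in the sample output."""
    stats = json.loads(stats_json)
    total = stats["response_size"] + stats["file_size"]
    return (f"{stats['response_count']} responses ({human_size(stats['response_size'])}), "
            f"{stats['file_count']} files ({human_size(stats['file_size'])}), "
            f"total {human_size(total)}")
```

Applied to the sample above, `summarize` reports 50.0 MiB of responses and 300.0 MiB of files, for 350.0 MiB in total.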

### Clear All Cache

```bash
datalab cache clear
```

### Clear Old Entries

Remove entries older than a specified number of days:

```bash
# Clear entries older than 7 days
datalab cache clear --older-than 7

# Clear entries older than 30 days
datalab cache clear --older-than 30
```
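For readers who want to see what age-based cleanup amounts to, here is a rough equivalent. How the CLI determines an entry's age is not stated on this page; this sketch assumes file modification time, and `clear_older_than` is a hypothetical function rather than the command's implementation.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 86400


def clear_older_than(cache_dir: Path, days: float, now: Optional[float] = None) -> int:
    """Delete cached files last modified more than `days` ago; return how many were removed."""
    cutoff = (time.time() if now is None else now) - days * SECONDS_PER_DAY
    removed = 0
    # Snapshot the listing first so deletions do not disturb the directory walk.
    for entry in list(cache_dir.rglob("*")):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a1b2c3d4.json").write_text("{}")
        print(clear_older_than(Path(tmp), days=7))  # the fresh file is kept
```

The `now` parameter exists so the cutoff can be simulated without waiting; in normal use it is left unset.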

---

## Cache Metadata

Each cached response includes metadata:

```json
{
  "created_at": "2024-01-15T10:30:00Z",
  "endpoint": "convert",
  "params_hash": "abc123...",
  "file_hash": "def456...",
  "file_path": "/path/to/original.pdf"
}
```
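Metadata of this shape supports two simple checks: how old an entry is, and whether the source file has changed since it was cached. The sketch below assumes `created_at` is an ISO 8601 UTC timestamp and `file_hash` is the full SHA256 hex digest of the file contents; both helper names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def age_in_days(metadata: dict, now: datetime) -> float:
    """Return the entry's age in days relative to a timezone-aware `now`."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalize it.
    created = datetime.fromisoformat(metadata["created_at"].replace("Z", "+00:00"))
    return (now - created).total_seconds() / 86400


def source_changed(metadata: dict, current_bytes: bytes) -> bool:
    """Return True if the file's current contents no longer match the recorded hash."""
    return hashlib.sha256(current_bytes).hexdigest() != metadata["file_hash"]
```

For example, `age_in_days(meta, datetime.now(timezone.utc))` gives the age of an entry at the moment of the call.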

---

## Best Practices

### During Development

Keep caching enabled to minimize API costs:

```bash
# First run: API call (~$0.01)
datalab convert document.pdf

# Subsequent runs: cached (free!)
datalab convert document.pdf
```

### For Production Pipelines

Consider cache management strategies:

```bash
# Option 1: Scheduled cleanup (crontab entry, runs Sundays at 00:00)
0 0 * * 0 datalab cache clear --older-than 7

# Option 2: Fresh processing for critical documents
datalab convert important.pdf --skip-cache --force
```

### For Testing

Bypass both caches so that every test run exercises the full processing path instead of a stored result:

```bash
datalab convert test.pdf --skip-cache --force
```

---

## Cache vs. Checkpoints

| Feature | Cache | Checkpoints |
|---------|-------|-------------|
| Stored | Locally | On Datalab servers |
| Purpose | Reduce API calls | Reuse parsed documents |
| Scope | Full response | Document parse state |
| Duration | Until cleared | Server-defined retention |
| Cost | Free (local storage) | Included in API usage |

Use **cache** to avoid repeating identical requests.
Use **checkpoints** to efficiently run multiple operations on the same document.

---

## Troubleshooting

### Cache Not Working

If results aren't being cached:

1. Check that the cache directory exists and is writable
2. Verify you're not using `--skip-cache`
3. Check available disk space
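The directory and disk-space checks can be scripted. This is a minimal diagnostic sketch, not a CLI feature: `diagnose_cache_dir` and its 100 MiB free-space threshold are assumptions chosen for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import List


def diagnose_cache_dir(cache_dir: Path, min_free_bytes: int = 100 * 1024 * 1024) -> List[str]:
    """Return a list of human-readable problems found with the cache directory."""
    if not cache_dir.is_dir():
        return [f"{cache_dir} does not exist or is not a directory"]
    problems = []
    # Writing new entries needs both write and search permission on the directory.
    if not os.access(cache_dir, os.W_OK | os.X_OK):
        problems.append(f"{cache_dir} is not writable by the current user")
    free = shutil.disk_usage(cache_dir).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free on the volume holding {cache_dir}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(diagnose_cache_dir(Path(tmp)) or "no problems found")
```

Pass the cache path for your operating system in place of the temporary directory used in the demo.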

### Stale Results

If you're getting outdated results:

```bash
# Clear and retry
datalab cache clear
datalab convert document.pdf
```

### Cache Too Large

If the cache is using too much disk space:

```bash
# Check size
datalab cache stats

# Clear old entries
datalab cache clear --older-than 7

# Or clear everything
datalab cache clear
```

---

## See Also

- [cache command](../commands/cache.md)
- [Checkpoints](checkpoints.md)