datalab-cli 0.1.0

# Agent Integration

Learn how to integrate the Datalab CLI with AI agents and automation pipelines.

---

## Overview

The Datalab CLI is designed for programmatic use:

- **JSON output**: All results are structured JSON
- **Progress events**: Streamable status updates
- **Exit codes**: Standard 0/1 for success/failure
- **Quiet mode**: Suppress human-readable output

---

## Output Design

### stdout: Data

Results are always JSON on stdout:

```bash
datalab convert document.pdf
```

```json
{
  "content": "# Document Title\n\nContent...",
  "metadata": {
    "pages": 5,
    "processing_time": 2.3
  }
}
```

### stderr: Progress (Optional)

Progress events go to stderr:

```json
{"type":"start","operation":"convert","file":"document.pdf"}
{"type":"complete","elapsed_secs":3.4}
```
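
On the consuming side, these events can be read line by line. The following is a minimal Python sketch, assuming each event arrives as one JSON object per line (as in the sample above) and that any other stderr text can be ignored; `parse_progress_line` is a hypothetical helper name, not part of the CLI.

```python
import json

def parse_progress_line(line):
    """Return the event dict for one stderr line, or None if it is not a JSON event."""
    line = line.strip()
    if not line:
        return None
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # plain-text noise on stderr is ignored
    # Only JSON objects carrying a "type" field count as progress events.
    if isinstance(event, dict) and "type" in event:
        return event
    return None

# Sample stderr text copied from the events shown above.
stderr_text = (
    '{"type":"start","operation":"convert","file":"document.pdf"}\n'
    '{"type":"complete","elapsed_secs":3.4}\n'
)
events = [e for e in map(parse_progress_line, stderr_text.splitlines()) if e]
print([e["type"] for e in events])
```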

### Separation

This separation allows:

```bash
# Pipe data to the next step; progress stays visible on stderr
datalab convert document.pdf | jq '.content'

# Capture data, discard progress
result=$(datalab -q convert document.pdf)

# Capture both separately
datalab convert document.pdf > result.json 2> progress.log
```

---

## Quiet Mode

Suppress all progress output:

```bash
datalab -q convert document.pdf
```

Or:

```bash
datalab --quiet convert document.pdf
```

This is recommended for:
- Scripts
- CI/CD pipelines
- Agent integrations
- Background processing

---

## Parsing Results

### With jq

```bash
# Get specific field
datalab extract invoice.pdf --schema schema.json | jq '.total'

# Get nested data
datalab extract document.pdf --schema schema.json | jq '.customer.address.city'

# Format for display
datalab convert document.pdf | jq -r '.content'
```

### In Python

```python
import subprocess
import json

def convert_document(file_path):
    result = subprocess.run(
        ["datalab", "-q", "convert", file_path],
        capture_output=True,
        text=True
    )

    if result.returncode != 0:
        raise Exception(f"Conversion failed: {result.stderr}")

    return json.loads(result.stdout)

# Usage
data = convert_document("document.pdf")
content = data["content"]
```

### In Node.js

```javascript
const { execFileSync } = require('child_process');

function convertDocument(filePath) {
    try {
        // Pass arguments as an array so the path is never interpreted by a shell
        const result = execFileSync('datalab', ['-q', 'convert', filePath], {
            encoding: 'utf-8'
        });
        return JSON.parse(result);
    } catch (error) {
        throw new Error(`Conversion failed: ${error.message}`);
    }
}

// Usage
const data = convertDocument('document.pdf');
console.log(data.content);
```

### In Bash

```bash
#!/bin/bash

convert_document() {
    local file="$1"
    local result

    result=$(datalab -q convert "$file" 2>&1)
    local exit_code=$?

    if [ $exit_code -ne 0 ]; then
        echo "Error: $result" >&2
        return 1
    fi

    echo "$result"
}

# Usage
data=$(convert_document "document.pdf")
content=$(echo "$data" | jq -r '.content')
```

---

## Error Handling

### Exit Codes

| Code | Meaning |
|------|---------|
| `0` | Success |
| `1` | Error |

### JSON Errors

When piped, errors are JSON:

```bash
datalab -q convert nonexistent.pdf 2>&1
```

```json
{"error":"File not found: nonexistent.pdf","code":"FILE_NOT_FOUND"}
```

### Pattern: Check and Handle

```bash
#!/bin/bash
set -e

file="$1"  # path of the document to convert

if output=$(datalab -q convert "$file" 2>&1); then
    # Success - process output
    echo "$output" | jq '.content'
else
    # Error - handle it
    error_code=$(echo "$output" | jq -r '.code')
    error_msg=$(echo "$output" | jq -r '.error')
    echo "Failed with $error_code: $error_msg" >&2
    exit 1
fi
```
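
The same check can be written in Python. This is a sketch under stated assumptions: the error payload uses the `error` and `code` fields shown above, it may be preceded by other lines, and `"UNKNOWN"` is a placeholder code introduced here for output that is not JSON. `parse_cli_error` is a hypothetical helper name.

```python
import json

def parse_cli_error(stderr_text):
    """Return (code, message) extracted from the CLI's error output."""
    # Scan from the last line backwards: other lines may precede the error.
    for line in reversed(stderr_text.strip().splitlines()):
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "error" in payload:
            # "UNKNOWN" is a placeholder used when no code field is present.
            return payload.get("code", "UNKNOWN"), payload["error"]
    # No JSON error object found: fall back to the raw text.
    return "UNKNOWN", stderr_text.strip()

code, message = parse_cli_error(
    '{"error":"File not found: nonexistent.pdf","code":"FILE_NOT_FOUND"}'
)
print(code, message)
```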

---

## Progress Monitoring

For long operations, monitor progress:

### Verbose Mode

Force progress even when piped:

```bash
# Send progress (stderr) into the pipe and save the result (stdout) to a file
datalab -v convert large-document.pdf 2>&1 >result.json | while IFS= read -r line; do
    event_type=$(echo "$line" | jq -r '.type // empty')
    case "$event_type" in
        start)
            echo "Starting..."
            ;;
        poll)
            elapsed=$(echo "$line" | jq -r '.elapsed_secs')
            echo "Processing... ${elapsed}s"
            ;;
        complete)
            echo "Done!"
            ;;
    esac
done
```

### Progress Events

| Event | Fields | Use |
|-------|--------|-----|
| `start` | `operation`, `file` | Log operation start |
| `upload` | `bytes_sent`, `total_bytes` | Show upload progress |
| `submit` | `request_id` | Track request |
| `poll` | `status`, `elapsed_secs` | Show processing status |
| `cache_hit` | `cache_key` | Note cached result |
| `complete` | `elapsed_secs` | Log completion |
| `error` | `code`, `message` | Handle errors |

---

## Caching Strategy

### Enable Caching (Default)

Reduce API costs during development:

```bash
# First call: API request
datalab -q convert document.pdf

# Second call: instant from cache
datalab -q convert document.pdf
```

### Skip Cache for Fresh Data

```bash
datalab -q convert document.pdf --skip-cache
```

### Force Reprocessing

```bash
datalab -q convert document.pdf --force
```

---

## Checkpoints for Efficiency

Save parsed documents for multiple operations:

```bash
# Parse once
result=$(datalab -q convert document.pdf --save-checkpoint)
checkpoint_id=$(echo "$result" | jq -r '.metadata.checkpoint_id')

# Reuse for multiple extractions
datalab -q extract document.pdf --schema schema1.json --checkpoint-id "$checkpoint_id"
datalab -q extract document.pdf --schema schema2.json --checkpoint-id "$checkpoint_id"
datalab -q extract document.pdf --schema schema3.json --checkpoint-id "$checkpoint_id"
```

This is faster and more cost-effective than re-parsing each time.
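
When driving the CLI from Python, the repeated `extract` calls can be generated from a list of schemas. This is a minimal sketch that only builds argument lists using the flags shown above; `checkpoint_commands` is a hypothetical helper and `"abc123"` is a placeholder checkpoint ID.

```python
def checkpoint_commands(pdf_path, schema_paths, checkpoint_id):
    """Build one `datalab extract` argument list per schema, all sharing a checkpoint."""
    return [
        ["datalab", "-q", "extract", pdf_path,
         "--schema", schema, "--checkpoint-id", checkpoint_id]
        for schema in schema_paths
    ]

# "abc123" stands in for the ID read from .metadata.checkpoint_id.
commands = checkpoint_commands("document.pdf", ["schema1.json", "schema2.json"], "abc123")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed directly to `subprocess.run`, which avoids shell quoting issues with file names.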

---

## Batch Processing Pattern

### Sequential

```bash
#!/bin/bash
for file in documents/*.pdf; do
    echo "Processing $file..." >&2
    datalab -q convert "$file" > "${file%.pdf}.json"
done
```

### Parallel (with GNU Parallel)

```bash
# Process 4 files at a time
parallel -j4 'datalab -q convert {} > {.}.json' ::: documents/*.pdf
```

### With Rate Limiting

```bash
# Start one request every 0.5 s (about 2 per second); jobs run in the background
for file in documents/*.pdf; do
    datalab -q convert "$file" > "${file%.pdf}.json" &
    sleep 0.5
done
wait
```
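
The same pacing can be sketched in Python. The `Pacer` class below is a hypothetical helper, not part of the CLI: it spaces out call starts to a given rate and takes injectable `clock` and `sleep` functions so it can be tested without real delays. The loop only prints; the actual `datalab` invocation is left as a comment.

```python
import time

class Pacer:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_start = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self.next_start is not None and now < self.next_start:
            # Too early: sleep until the next permitted start time.
            self.sleep(self.next_start - now)
            now = self.next_start
        self.next_start = now + self.interval

pacer = Pacer(rate=2)  # 2 starts per second, matching `sleep 0.5` above
for name in ["a.pdf", "b.pdf", "c.pdf"]:
    pacer.wait()
    print("would convert", name)  # replace with a subprocess call to datalab
```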

---

## LLM Integration Example

Use Datalab CLI to prepare documents for LLM analysis:

```python
import subprocess
import json
import openai

def extract_and_analyze(pdf_path, question):
    # Convert PDF to text
    result = subprocess.run(
        ["datalab", "-q", "convert", pdf_path, "--output-format", "markdown"],
        capture_output=True,
        text=True
    )

    if result.returncode != 0:
        raise Exception(f"Conversion failed: {result.stderr}")

    doc_content = json.loads(result.stdout)["content"]

    # Send to LLM (openai>=1.0 client; reads OPENAI_API_KEY from the environment)
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a document analyst."},
            {"role": "user", "content": f"Document:\n{doc_content}\n\nQuestion: {question}"}
        ]
    )

    return response.choices[0].message.content

# Usage
answer = extract_and_analyze("report.pdf", "What are the key findings?")
print(answer)
```

---

## RAG Pipeline Example

Prepare documents for retrieval-augmented generation:

```bash
#!/bin/bash
# Convert to chunks for vector database

for file in knowledge-base/*.pdf; do
    echo "Chunking $file..." >&2

    # Convert to chunks with block IDs
    datalab -q convert "$file" \
        --output-format chunks \
        --add-block-ids \
        > "chunks/$(basename "$file" .pdf).json"
done

# Now chunks are ready for embedding
```

```python
import json
import glob

def load_chunks():
    chunks = []
    for file in glob.glob("chunks/*.json"):
        with open(file) as f:
            data = json.load(f)
            for chunk in data["chunks"]:
                chunks.append({
                    "content": chunk["content"],
                    "source": file,
                    "block_id": chunk.get("block_id"),
                    "metadata": chunk.get("metadata", {})
                })
    return chunks

# Now embed and store in vector database
chunks = load_chunks()
# embeddings = embed(chunks)
# vector_db.upsert(chunks, embeddings)
```

---

## Best Practices

### Always Use Quiet Mode

```bash
# In scripts, always use -q
datalab -q convert document.pdf
```

### Handle Errors Gracefully

```python
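import time

# Placeholders: run_datalab_command, DatalabError, retry, and log_missing_file
# stand in for your own helpers.
try: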
    result = run_datalab_command(...)
except DatalabError as e:
    if e.code == "RATE_LIMITED":
        time.sleep(30)
        retry()
    elif e.code == "FILE_NOT_FOUND":
        log_missing_file(e.file)
    else:
        raise
```
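
One possible shape for the helpers in that pattern is sketched below. Everything here is an assumption rather than a documented API: `DatalabError` and `run_datalab_command` are implemented from scratch, the error JSON is assumed to arrive on stderr with `error` and `code` fields, `"UNKNOWN"` is a placeholder code, and the `retries`, `delay`, `runner`, and `sleep` parameters exist only to make the retry loop configurable and testable.

```python
import json
import subprocess
import time

class DatalabError(Exception):
    """Hypothetical exception carrying the CLI's JSON error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def _run_cli(args):
    # Default runner: invoke the real CLI and return (exit code, stdout, stderr).
    proc = subprocess.run(["datalab", "-q", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr

def run_datalab_command(args, retries=3, delay=30, runner=_run_cli, sleep=time.sleep):
    """Run the CLI, retrying on RATE_LIMITED and raising DatalabError otherwise."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        returncode, stdout, stderr = runner(args)
        if returncode == 0:
            return json.loads(stdout)
        try:
            payload = json.loads(stderr)
        except json.JSONDecodeError:
            payload = {}
        if not isinstance(payload, dict):
            payload = {}
        code = payload.get("code", "UNKNOWN")  # placeholder when no code is given
        if code == "RATE_LIMITED" and attempt < retries:
            sleep(delay)  # back off, then try again
            continue
        raise DatalabError(code, payload.get("error", stderr.strip()))
```

A call such as `run_datalab_command(["convert", "document.pdf"])` then either returns the parsed result or raises `DatalabError`, whose `code` attribute the caller can branch on.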

### Use Checkpoints

For multiple operations on the same document, always use checkpoints.

### Respect Rate Limits

Add delays between requests when processing many documents.

### Cache During Development

Keep caching enabled during development to minimize API costs.

---

## Next Steps

- [Output Formats](../concepts/output-formats.md)
- [Caching](../concepts/caching.md)
- [Checkpoints](../concepts/checkpoints.md)
- [Rate Limits](../concepts/rate-limits.md)