# Agent Integration
Learn how to integrate the Datalab CLI with AI agents and automation pipelines.
---
## Overview
The Datalab CLI is designed for programmatic use:
- **JSON output**: All results are structured JSON
- **Progress events**: Streamable status updates
- **Exit codes**: Standard 0/1 for success/failure
- **Quiet mode**: Suppress human-readable output
---
## Output Design
### stdout: Data
Results are always JSON on stdout:
```bash
datalab convert document.pdf
```
```json
{
"content": "# Document Title\n\nContent...",
"metadata": {
"pages": 5,
"processing_time": 2.3
}
}
```
### stderr: Progress (Optional)
Progress events go to stderr:
```json
{"type":"start","operation":"convert","file":"document.pdf"}
{"type":"complete","elapsed_secs":3.4}
```
### Separation
This separation allows:
```bash
# Pipe data to next step, progress visible
datalab convert document.pdf | jq '.content'
# Capture data, discard progress
result=$(datalab -q convert document.pdf)
# Capture both separately
datalab convert document.pdf > result.json 2> progress.log
```
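
The same separation can be handled from Python. The following is a minimal sketch, assuming progress events arrive on stderr as one JSON object per line, as in the sample above. `split_output` and the `"raw"` fallback event are hypothetical names chosen for illustration, not part of the CLI.

```python
import json

def split_output(stdout_text, stderr_text):
    """Parse the result from stdout and progress events from stderr."""
    result = json.loads(stdout_text) if stdout_text.strip() else None
    events = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            # Keep non-JSON lines instead of failing ("raw" is a made-up label).
            events.append({"type": "raw", "text": line})
    return result, events

# Sample streams modeled on the examples above.
sample_stdout = '{"content": "# Document Title", "metadata": {"pages": 5}}'
sample_stderr = (
    '{"type":"start","operation":"convert","file":"document.pdf"}\n'
    '{"type":"complete","elapsed_secs":3.4}\n'
)
result, events = split_output(sample_stdout, sample_stderr)
print(result["metadata"]["pages"], [e["type"] for e in events])
```

In a real integration the two strings would come from `subprocess.run(..., capture_output=True, text=True)`, using its `stdout` and `stderr` attributes.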
---
## Quiet Mode
Suppress all progress output:
```bash
datalab -q convert document.pdf
```
Or:
```bash
datalab --quiet convert document.pdf
```
This is recommended for:
- Scripts
- CI/CD pipelines
- Agent integrations
- Background processing
---
## Parsing Results
### With jq
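```bash
# Get specific field
datalab -q convert document.pdf | jq -r '.content'

# Get nested data
datalab -q convert document.pdf | jq '.metadata.pages'

# Format for display
datalab -q convert document.pdf | jq '.'
```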
### In Python
```python
import subprocess
import json


def convert_document(file_path):
    # -q suppresses progress output, so stderr only carries errors
    result = subprocess.run(
        ["datalab", "-q", "convert", file_path],
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        raise Exception(f"Conversion failed: {result.stderr}")
    return json.loads(result.stdout)


# Usage
data = convert_document("document.pdf")
content = data["content"]
```
### In Node.js
```javascript
const { execFileSync } = require('child_process');

function convertDocument(filePath) {
  try {
    // Pass arguments as an array so the path is never interpreted by a shell
    const result = execFileSync('datalab', ['-q', 'convert', filePath], {
      encoding: 'utf-8'
    });
    return JSON.parse(result);
  } catch (error) {
    throw new Error(`Conversion failed: ${error.message}`);
  }
}

// Usage
const data = convertDocument('document.pdf');
console.log(data.content);
```
### In Bash
```bash
#!/bin/bash
convert_document() {
  local file="$1"
  local result
  result=$(datalab -q convert "$file" 2>&1)
  local exit_code=$?
  if [ $exit_code -ne 0 ]; then
    echo "Error: $result" >&2
    return 1
  fi
  echo "$result"
}

# Usage
data=$(convert_document "document.pdf")
```
---
## Error Handling
### Exit Codes
| Code | Meaning |
|------|---------|
| `0` | Success |
| `1` | Error |
### JSON Errors
When piped, errors are JSON:
```bash
datalab -q convert nonexistent.pdf 2>&1
```
```json
{"error":"File not found: nonexistent.pdf","code":"FILE_NOT_FOUND"}
```
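
An agent usually wants the error code and message as separate values. The following is a minimal sketch, assuming the error object has the `error` and `code` keys shown above. `parse_error` and the `"UNKNOWN"` fallback code are hypothetical, introduced here only for illustration.

```python
import json

def parse_error(stderr_text):
    """Return (code, message) from the CLI's error output."""
    # Scan from the last line backwards: the error object is expected last.
    for line in reversed(stderr_text.strip().splitlines()):
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "error" in payload:
            return payload.get("code", "UNKNOWN"), payload["error"]
    # Fall back to the raw text when no JSON error object is found.
    return "UNKNOWN", stderr_text.strip()

sample = '{"error":"File not found: nonexistent.pdf","code":"FILE_NOT_FOUND"}'
code, message = parse_error(sample)
print(code, message)
```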
### Pattern: Check and Handle
```bash
#!/bin/bash
set -e
file="$1"  # path of the document to convert

if output=$(datalab -q convert "$file" 2>&1); then
  # Success - process output
  echo "$output" | jq '.content'
else
  # Error - handle it
  error_code=$(echo "$output" | jq -r '.code')
  error_msg=$(echo "$output" | jq -r '.error')
  echo "Failed with $error_code: $error_msg" >&2
  exit 1
fi
```
---
## Progress Monitoring
For long operations, monitor progress:
### Verbose Mode
Force progress even when piped:
```bash
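# Pipe the CLI's progress stream (one JSON event per line) into this loop
while IFS= read -r line; do
  event_type=$(echo "$line" | jq -r '.type')
  case "$event_type" in
    start)
      echo "Starting..."
      ;;
    poll)
      elapsed=$(echo "$line" | jq -r '.elapsed_secs')
      echo "Processing... ${elapsed}s"
      ;;
    complete)
      echo "Done!"
      ;;
  esac
done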
```
### Progress Events
| Event | Fields | Use |
|-------|--------|-----|
| `start` | `operation`, `file` | Log operation start |
| `upload` | `bytes_sent`, `total_bytes` | Show upload progress |
| `submit` | `request_id` | Track request |
| `poll` | `status`, `elapsed_secs` | Show processing status |
| `cache_hit` | `cache_key` | Note cached result |
| `complete` | `elapsed_secs` | Log completion |
| `error` | `code`, `message` | Handle errors |
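
The event types above map naturally onto a small dispatcher. The following is a minimal sketch, assuming each event is a single JSON line with a `type` key and the fields listed in the table. `describe_event` and the wording of its messages are hypothetical.

```python
import json

def describe_event(line):
    """Turn one JSON progress event into a human-readable status string."""
    event = json.loads(line)
    kind = event.get("type")
    if kind == "start":
        return f"Starting {event.get('operation')} on {event.get('file')}"
    if kind == "upload":
        sent, total = event.get("bytes_sent", 0), event.get("total_bytes", 0)
        pct = 100 * sent / total if total else 0
        return f"Uploading: {pct:.0f}%"
    if kind == "submit":
        return f"Submitted request {event.get('request_id')}"
    if kind == "poll":
        return f"{event.get('status')} after {event.get('elapsed_secs')}s"
    if kind == "cache_hit":
        return f"Served from cache ({event.get('cache_key')})"
    if kind == "complete":
        return f"Done in {event.get('elapsed_secs')}s"
    if kind == "error":
        return f"Error {event.get('code')}: {event.get('message')}"
    # Unknown types are reported rather than raising, so new events do not break callers.
    return f"Unrecognized event: {kind}"

print(describe_event('{"type":"upload","bytes_sent":512,"total_bytes":2048}'))
print(describe_event('{"type":"complete","elapsed_secs":3.4}'))
```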
---
## Caching Strategy
### Enable Caching (Default)
Reduce API costs during development:
```bash
# First call: API request
datalab -q convert document.pdf
# Second call: instant from cache
datalab -q convert document.pdf
```
### Skip Cache for Fresh Data
```bash
datalab -q convert document.pdf --skip-cache
```
### Force Reprocessing
```bash
datalab -q convert document.pdf --force
```
---
## Checkpoints for Efficiency
Save parsed documents for multiple operations:
```bash
# Parse once
result=$(datalab -q convert document.pdf --save-checkpoint)
checkpoint_id=$(echo "$result" | jq -r '.metadata.checkpoint_id')
# Reuse for multiple extractions
datalab -q extract document.pdf --schema schema1.json --checkpoint-id "$checkpoint_id"
datalab -q extract document.pdf --schema schema2.json --checkpoint-id "$checkpoint_id"
datalab -q extract document.pdf --schema schema3.json --checkpoint-id "$checkpoint_id"
```
This is faster and more cost-effective than re-parsing each time.
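
From Python, the same flow amounts to reading the checkpoint ID once and building one argument list per schema. The following is a minimal sketch, assuming the ID sits at `metadata.checkpoint_id` as in the shell example above. The helper names and the sample ID value are hypothetical.

```python
import json

def checkpoint_id_from(convert_stdout):
    """Read the checkpoint ID from the JSON printed by the convert step."""
    return json.loads(convert_stdout)["metadata"]["checkpoint_id"]

def extract_commands(pdf_path, schema_paths, checkpoint_id):
    """Build one `datalab extract` argument list per schema."""
    return [
        ["datalab", "-q", "extract", pdf_path,
         "--schema", schema, "--checkpoint-id", checkpoint_id]
        for schema in schema_paths
    ]

# Stand-in for the stdout of `datalab -q convert document.pdf --save-checkpoint`;
# the ID value is made up for illustration.
sample_stdout = '{"content": "...", "metadata": {"checkpoint_id": "ckpt-123"}}'
checkpoint_id = checkpoint_id_from(sample_stdout)
for cmd in extract_commands("document.pdf", ["schema1.json", "schema2.json"], checkpoint_id):
    print(" ".join(cmd))  # pass `cmd` to subprocess.run(...) to execute it
```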
---
## Batch Processing Pattern
### Sequential
```bash
#!/bin/bash
for file in documents/*.pdf; do
  echo "Processing $file..." >&2
  datalab -q convert "$file" > "${file%.pdf}.json"
done
```
### Parallel (with GNU Parallel)
```bash
# Process 4 files at a time
parallel -j4 'datalab -q convert {} > {.}.json' ::: documents/*.pdf
```
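
Where GNU Parallel is not available, a thread pool gives a similar bound on concurrency. The following is a minimal sketch using Python's standard `concurrent.futures`. `convert` and `convert_all` are hypothetical helper names, and the demo swaps in a stand-in worker so it runs without the CLI installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def convert(path):
    """Run the CLI for one file and return (path, exit code, stdout)."""
    proc = subprocess.run(
        ["datalab", "-q", "convert", path],
        capture_output=True,
        text=True,
    )
    return path, proc.returncode, proc.stdout

def convert_all(paths, worker=convert, max_workers=4):
    """Process files with at most `max_workers` conversions in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `paths`.
        return list(pool.map(worker, paths))

# Stand-in worker for demonstration; omit `worker=` to call the real CLI.
demo = convert_all(["a.pdf", "b.pdf"], worker=lambda p: (p, 0, "{}"))
print(demo)
```

Threads are sufficient here because each worker spends its time waiting on a subprocess rather than running Python code.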
### With Rate Limiting
```bash
# 2 requests per second
for file in documents/*.pdf; do
  datalab -q convert "$file" > "${file%.pdf}.json" &
  sleep 0.5
done
wait
```
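
The fixed `sleep` above can be replaced by a pacing helper that only waits as long as needed. The following is a minimal sketch, not a feature of the CLI. `paced` is a hypothetical generator, and its `clock` and `sleep` parameters exist only so the timing can be tested without real delays.

```python
import time

def paced(items, per_second, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than `per_second` items per second."""
    interval = 1.0 / per_second
    next_slot = clock()
    for item in items:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        # Schedule the next slot relative to when this item was released.
        next_slot = max(next_slot, clock()) + interval
        yield item

# Release at most 2 files per second; replace print with the real conversion call.
for path in paced(["a.pdf", "b.pdf"], per_second=2):
    print("would convert", path)
```

Unlike an unconditional `sleep 0.5`, this waits only for the time remaining in the current slot, so slow conversions are not delayed further.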
---
## LLM Integration Example
Use Datalab CLI to prepare documents for LLM analysis:
```python
import subprocess
import json
from openai import OpenAI  # requires openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_and_analyze(pdf_path, question):
    # Convert PDF to Markdown
    result = subprocess.run(
        ["datalab", "-q", "convert", pdf_path, "--output-format", "markdown"],
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        raise Exception(f"Conversion failed: {result.stderr}")
    doc_content = json.loads(result.stdout)["content"]

    # Send to LLM
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a document analyst."},
            {"role": "user", "content": f"Document:\n{doc_content}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content


# Usage
answer = extract_and_analyze("report.pdf", "What are the key findings?")
print(answer)
```
---
## RAG Pipeline Example
Prepare documents for retrieval-augmented generation:
```bash
#!/bin/bash
# Convert to chunks for vector database
mkdir -p chunks  # the output directory must exist before redirecting into it
for file in knowledge-base/*.pdf; do
  echo "Chunking $file..." >&2
  # Convert to chunks with block IDs
  datalab -q convert "$file" \
    --output-format chunks \
    --add-block-ids \
    > "chunks/$(basename "$file" .pdf).json"
done
# Now chunks are ready for embedding
```
```python
import json
import glob


def load_chunks():
    chunks = []
    for file in glob.glob("chunks/*.json"):
        with open(file) as f:
            data = json.load(f)
        for chunk in data["chunks"]:
            chunks.append({
                "content": chunk["content"],
                "source": file,
                "block_id": chunk.get("block_id"),
                "metadata": chunk.get("metadata", {})
            })
    return chunks


# Now embed and store in vector database
chunks = load_chunks()
# embeddings = embed(chunks)
# vector_db.upsert(chunks, embeddings)
```
---
## Best Practices
### Always Use Quiet Mode
```bash
# In scripts, always use -q
datalab -q convert document.pdf
```
### Handle Errors Gracefully
```python
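import time

# run_datalab_command, DatalabError, retry, and log_missing_file are
# placeholders for your own helpers
try:
    result = run_datalab_command(...)
except DatalabError as e:
    if e.code == "RATE_LIMITED":
        time.sleep(30)
        retry()
    elif e.code == "FILE_NOT_FOUND":
        log_missing_file(e.file)
    else:
        raise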
```
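
The pattern above can be made concrete with a bounded retry loop. The following is a minimal sketch under stated assumptions: `DatalabError` and `with_retry` are hypothetical names, not part of the CLI, and only the `RATE_LIMITED` code from the example above triggers a retry. The `sleep` parameter is injectable so the demo does not actually wait.

```python
import time

class DatalabError(Exception):
    """Hypothetical error type carrying the CLI's error code."""
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def with_retry(call, retries=3, wait_secs=30, sleep=time.sleep):
    """Run `call`, retrying only when it fails with RATE_LIMITED."""
    for attempt in range(retries + 1):
        try:
            return call()
        except DatalabError as err:
            # Re-raise anything that is not a rate limit, or the final failure.
            if err.code != "RATE_LIMITED" or attempt == retries:
                raise
            sleep(wait_secs)

# Demo with a stand-in call that is rate limited twice, then succeeds.
outcomes = iter(["RATE_LIMITED", "RATE_LIMITED", None])

def fake_call():
    code = next(outcomes)
    if code:
        raise DatalabError(code, "simulated failure")
    return {"content": "ok"}

print(with_retry(fake_call, sleep=lambda secs: None))
```

Bounding the number of retries keeps a persistently rate-limited job from looping forever, and other error codes surface immediately to the caller.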
### Use Checkpoints
For multiple operations on the same document, always use checkpoints.
### Respect Rate Limits
Add delays between requests when processing many documents.
### Cache During Development
Keep caching enabled during development to minimize API costs.
---
## Next Steps
- [Output Formats](../concepts/output-formats.md)
- [Caching](../concepts/caching.md)
- [Checkpoints](../concepts/checkpoints.md)
- [Rate Limits](../concepts/rate-limits.md)