datalab-cli 0.1.0

# Datalab CLI

**Convert, extract, and process documents from the command line**

[![Crates.io](https://img.shields.io/crates/v/datalab-cli.svg)](https://crates.io/crates/datalab-cli)
[![CI](https://github.com/dipankar/datalab-cli/workflows/CI/badge.svg)](https://github.com/dipankar/datalab-cli/actions)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/dipankar/datalab-cli/blob/main/LICENSE)
[![Downloads](https://img.shields.io/crates/d/datalab-cli.svg)](https://crates.io/crates/datalab-cli)

[Installation](#installation) | [Quick Start](#quick-start) | [Usage](#usage) | [Documentation](https://github.com/dipankar/datalab-cli/tree/main/documentation)

---

A powerful command-line interface for the [Datalab](https://www.datalab.to) document processing API. Built in Rust for speed and reliability.

## Features

- **📄 Document Conversion** — Convert PDFs, images, and documents to Markdown, HTML, JSON, or semantic chunks
- **🔍 Structured Extraction** — Extract data using JSON schemas with confidence scores
- **📝 Form Filling** — Fill PDF forms programmatically with smart field matching
- **⚡ Smart Caching** — Local file-based caching reduces API costs on repeated requests
- **🤖 Agent-Friendly** — JSON output to stdout, progress events to stderr, designed for piping
- **📊 Progress Streaming** — Real-time JSON progress events for monitoring long operations

## Installation

### From crates.io

```bash
cargo install datalab-cli
```

### From source

```bash
git clone https://github.com/dipankar/datalab-cli
cd datalab-cli
cargo install --path .
```

### Pre-built binaries

Download from [GitHub Releases](https://github.com/dipankar/datalab-cli/releases).

## Quick Start

**1. Get your API key** from [datalab.to/app/keys](https://www.datalab.to/app/keys)

**2. Set the environment variable**

```bash
export DATALAB_API_KEY="your-api-key"
```

**3. Convert your first document**

```bash
datalab convert document.pdf
```

That's it! The converted Markdown is written to stdout as JSON.

## Usage

### Convert Documents

```bash
# Convert to markdown (default)
datalab convert document.pdf

# Convert to HTML
datalab convert document.pdf --output-format html

# High-quality mode for complex documents
datalab convert report.pdf --mode accurate

# Convert specific pages
datalab convert book.pdf --page-range "0-10"

# Save to file
datalab convert document.pdf --output result.json
```
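To convert several files in one go, the single-file commands above can be wrapped in a loop. The following is a minimal sketch, not part of the CLI: `convert_all` is a hypothetical helper name, and it uses only the `convert` subcommand and the `--output` flag shown above.

```bash
# Sketch: convert every PDF in a directory, writing one JSON file per input.
# convert_all is a hypothetical helper; only `datalab convert` and `--output`
# come from the examples above.
convert_all() {
  dir="${1:-.}"
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    # Report a failure on stderr and carry on with the next file.
    datalab convert "$pdf" --output "${pdf%.pdf}.json" || echo "failed: $pdf" >&2
  done
}
```

Calling `convert_all ./invoices` would then write `./invoices/<name>.json` next to each `./invoices/<name>.pdf`.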

### Extract Structured Data

```bash
# Extract with inline schema
datalab extract invoice.pdf --schema '{
  "fields": [
    {"name": "total", "type": "number"},
    {"name": "date", "type": "string"}
  ]
}'

# Extract with schema file
datalab extract invoice.pdf --schema schema.json

# Include confidence scores
datalab extract invoice.pdf --schema schema.json --include-scores
```
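The examples above refer to `schema.json` without showing its contents. As a sketch, assuming the file form takes the same JSON as the inline `--schema` argument (this README does not state it explicitly), the file could be created like this:

```bash
# Write the inline schema from the example above to schema.json.
# Assumption: the schema file uses the same JSON as the inline --schema value.
cat > schema.json <<'EOF'
{
  "fields": [
    {"name": "total", "type": "number"},
    {"name": "date", "type": "string"}
  ]
}
EOF
```

Keeping the schema in a file avoids shell-quoting problems and lets the same schema be reused across documents.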

### Fill Forms

```bash
# Fill a form
datalab fill application.pdf \
  --fields '{"name": "John Doe", "email": "john@example.com"}' \
  --output filled.pdf
```

### File Management

```bash
# Upload a file
datalab files upload document.pdf

# List files
datalab files list

# Download a file
datalab files download file_abc123 --output downloaded.pdf
```

### Cache Management

```bash
# View cache stats
datalab cache stats

# Clear old entries
datalab cache clear --older-than 7
```

## Output Format

All commands output JSON to **stdout** for easy piping:

```bash
# Pipe to jq
datalab convert document.pdf | jq '.content'

# Save to file
datalab convert document.pdf > result.json
```

Progress events stream to **stderr** as JSON:

```json
{"type":"start","operation":"convert","file":"document.pdf"}
{"type":"poll","status":"processing","elapsed_secs":1.2}
{"type":"complete","elapsed_secs":3.4}
```

Use `--quiet` to suppress progress, `--verbose` to force it.
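
Because results and progress use different streams, they can be captured separately with ordinary shell redirection. The sketch below is illustrative only: `run_convert` and `last_complete` are hypothetical helper names, the file names are arbitrary, and `--verbose` is included on the assumption that progress may otherwise be suppressed when stderr is not a terminal.

```bash
# Sketch: keep the result and the progress log in separate files.
# run_convert is a hypothetical helper name.
run_convert() {
  # stdout carries the result JSON; stderr carries one progress event per line.
  datalab convert "$1" --verbose > result.json 2> progress.jsonl
}

# Print the last "complete" event from a saved progress log, if there is one.
# Assumes events are serialized without spaces, as in the sample above.
last_complete() {
  grep '"type":"complete"' "$1" | tail -n 1
}
```

After `run_convert document.pdf`, the result is in `result.json` and `last_complete progress.jsonl` prints the final event, including `elapsed_secs`. For anything beyond a quick check, a JSON-aware tool such as `jq` is more robust than `grep`.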

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `DATALAB_API_KEY` | Yes | Your API key |
| `DATALAB_BASE_URL` | No | Custom API endpoint (for on-prem) |
| `NO_COLOR` | No | Disable colored output |
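
Since `DATALAB_API_KEY` is required, a script can check for it before calling the CLI. This is a minimal sketch; `require_api_key` is a hypothetical helper name and the error message is illustrative, not the CLI's own.

```bash
# Sketch: fail early when the required API key is missing.
# require_api_key is a hypothetical helper, not part of the CLI.
require_api_key() {
  if [ -z "${DATALAB_API_KEY:-}" ]; then
    echo "DATALAB_API_KEY is not set" >&2
    return 1
  fi
}
```

A script would then run `require_api_key && datalab convert document.pdf`, so that a missing key is reported before any work starts.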

## Caching

Results are cached locally in `~/.cache/datalab/` to reduce API costs:

```bash
# First run: calls API
datalab convert document.pdf

# Second run: instant from cache
datalab convert document.pdf

# Bypass cache
datalab convert document.pdf --skip-cache
```
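
Because the cache is a plain directory, standard tools can inspect it alongside `datalab cache stats`. The sketch below counts the files under the cache path named above; `cache_file_count` is a hypothetical helper name, and the layout inside the directory is not documented here, so the count is only a rough indicator.

```bash
# Sketch: report how many files sit under a cache directory.
# cache_file_count is a hypothetical helper; the default path is the one
# named above.
cache_file_count() {
  dir="${1:-$HOME/.cache/datalab}"
  if [ -d "$dir" ]; then
    # Count regular files at any depth; strip the padding some wc builds add.
    find "$dir" -type f | wc -l | tr -d ' '
  else
    # A missing directory means nothing has been cached yet.
    echo 0
  fi
}
```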

## Documentation

Full documentation is available in the [documentation](./documentation) directory. To view locally:

```bash
cd documentation
pip install -r requirements.txt
mkdocs serve
```

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

```bash
# Development setup
git clone https://github.com/dipankar/datalab-cli
cd datalab-cli
cargo build

# Run tests
cargo test

# Run lints
cargo clippy
cargo fmt --check
```

## License

MIT License - see [LICENSE](LICENSE) for details.

---

Built with Rust | Powered by [Datalab](https://www.datalab.to)