datalab-cli 0.1.0

# convert

Convert a document to markdown, HTML, JSON, or chunks.

## Synopsis

```
datalab convert [OPTIONS] <FILE|URL>
```

## Description

Convert PDFs, images, and other documents to structured formats. Supports markdown (default), HTML, JSON, and chunked output for RAG applications.

Use `--save-checkpoint` to enable efficient follow-up extraction or segmentation on the same document.

---

## Arguments

| Argument | Description |
|----------|-------------|
| `<FILE\|URL>` | File path or URL to convert |

---

## Options

### Output Options

| Option | Description | Default |
|--------|-------------|---------|
| `--output-format <FORMAT>` | Output format: `markdown`, `html`, `json`, `chunks` | `markdown` |
| `--paginate` | Add page delimiters to output | - |
| `--token-efficient-markdown` | Use token-efficient markdown format | - |
| `-o, --output <FILE>` | Write result to file instead of stdout | - |

### Processing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--mode <MODE>` | Processing mode: `fast`, `balanced`, `accurate` | `fast` |
| `--max-pages <N>` | Maximum pages to process | - |
| `--page-range <RANGE>` | Page range (e.g., `"0-5,10"`) | - |

### Advanced Options

| Option | Description | Default |
|--------|-------------|---------|
| `--save-checkpoint` | Save checkpoint for follow-up extraction/segmentation | - |
| `--extras <FEATURES>` | Extra features: `track_changes`, `chart_understanding`, `extract_links` | - |
| `--add-block-ids` | Add block IDs for citation tracking | - |
| `--disable-image-extraction` | Disable image extraction from document | - |
| `--disable-image-captions` | Disable image caption generation | - |
| `--timeout <SECS>` | Request timeout in seconds | `300` |

### Cache Options

| Option | Description |
|--------|-------------|
| `--skip-cache` | Skip local cache lookup |
| `--force` | Force reprocessing (skip API-side cache) |
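
The two flags address different caches, so a fully fresh run would plausibly pass both. A minimal sketch, assuming the flags can be combined (this page does not say so explicitly); `fresh_convert` is a hypothetical helper name, not part of the CLI:

```bash
# Hypothetical helper: skip the local cache lookup and the API-side cache.
# Assumes --skip-cache and --force may be passed together. All arguments
# are forwarded to `datalab convert` unchanged.
fresh_convert() {
  datalab convert "$@" --skip-cache --force
}

# Usage (not run here):
#   fresh_convert document.pdf --output-format json
```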

---

## Output Formats

### Markdown (default)

Standard markdown with headers, lists, tables, and images.

```bash
datalab convert document.pdf --output-format markdown
```

### HTML

Structured HTML output.

```bash
datalab convert document.pdf --output-format html
```

### JSON

Full structured JSON with metadata.

```bash
datalab convert document.pdf --output-format json
```

### Chunks

Semantic chunks for RAG applications.

```bash
datalab convert document.pdf --output-format chunks
```

---

## Processing Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `fast` | Quick processing, good for simple documents | Invoices, forms |
| `balanced` | Balance of speed and quality | General documents |
| `accurate` | Highest quality, slower | Complex layouts, charts |

---

## Examples

### Basic Conversion

```bash
# Convert to markdown
datalab convert document.pdf

# Convert from URL
datalab convert https://example.com/document.pdf
```
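
For more than one file, the basic invocation can be wrapped in a loop. The sketch below uses only the `--output` flag documented on this page; `convert_all` is a hypothetical helper, the `.json` extension follows the Save to File example, and the error branch assumes `datalab` exits with a nonzero status on failure, which this page does not confirm:

```bash
# Hypothetical helper: convert each input and write one result file per input.
# Usage: convert_all <output-dir> <file>...
convert_all() {
  local out_dir=$1 input name
  shift
  mkdir -p "$out_dir"
  for input in "$@"; do
    name=$(basename "$input")
    # report.pdf -> <output-dir>/report.json
    # Assumption: a failed conversion returns a nonzero exit status.
    datalab convert "$input" --output "$out_dir/${name%.*}.json" ||
      echo "conversion failed: $input" >&2
  done
}

# Usage (not run here):
#   convert_all converted ./*.pdf
```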

### Output Format

```bash
# HTML output
datalab convert document.pdf --output-format html

# Chunked output for RAG
datalab convert document.pdf --output-format chunks
```

### Quality Control

```bash
# High accuracy mode
datalab convert report.pdf --mode accurate

# Enable chart understanding
datalab convert report.pdf --extras chart_understanding
```

### Page Selection

```bash
# First 10 pages only
datalab convert book.pdf --max-pages 10

# Specific page ranges
datalab convert book.pdf --page-range "0-5,10-15,20"
```
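
To sanity-check a range string before sending it, it can be expanded locally. This is a sketch under stated assumptions: both ends of a range are treated as inclusive, and the numbers are kept exactly as written. This page does not state whether pages are zero-based, although the `0-5` example suggests so. `expand_page_range` is a hypothetical helper, not part of the CLI:

```bash
# Expand a page-range string such as "0-5,10" into one page number per line.
# Assumption (not confirmed by this page): ranges are inclusive on both ends.
expand_page_range() {
  local spec=$1 part start end
  local IFS=','
  for part in $spec; do
    case $part in
      *-*) start=${part%-*}; end=${part#*-} ;;  # "0-5" -> start=0, end=5
      *)   start=$part; end=$part ;;            # "10"  -> single page
    esac
    seq "$start" "$end"
  done
}

expand_page_range "0-5,10" | tr '\n' ' '; echo
# prints: 0 1 2 3 4 5 10
```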

### Save to File

```bash
# Using --output flag
datalab convert document.pdf --output result.json

# Using redirection
datalab convert document.pdf > result.json
```

### Checkpoints

```bash
# Save checkpoint for later extraction
datalab convert document.pdf --save-checkpoint
```

The checkpoint ID is returned in the response as `metadata.checkpoint_id` (see Output Schema) and can be used with `extract` or `segment`.

---

## Output Schema

```json
{
  "content": "# Document Title\n\nDocument content...",
  "metadata": {
    "pages": 5,
    "processing_time": 2.3,
    "checkpoint_id": "abc123"
  }
}
```
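
Scripts that chain commands usually need individual fields from this response. A minimal sketch that reads two of them from a saved file: it reuses the sample above as stand-in data and assumes the real output places one field per line, as the sample does. The file name `response.json` is illustrative:

```bash
# Sample response copied from the schema above. In practice this would be a
# file written by `datalab convert ... --output response.json`.
cat > response.json <<'EOF'
{
  "content": "# Document Title\n\nDocument content...",
  "metadata": {
    "pages": 5,
    "processing_time": 2.3,
    "checkpoint_id": "abc123"
  }
}
EOF

# Extract scalar metadata fields with sed. This relies on one field per line;
# a JSON-aware tool such as jq is more robust if it is available.
checkpoint_id=$(sed -n 's/.*"checkpoint_id": *"\([^"]*\)".*/\1/p' response.json)
pages=$(sed -n 's/.*"pages": *\([0-9][0-9]*\).*/\1/p' response.json)

echo "checkpoint: $checkpoint_id, pages: $pages"
# prints: checkpoint: abc123, pages: 5
```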

---

## Related Commands

- [`extract`](extract.md) - Extract structured data from converted documents
- [`segment`](segment.md) - Segment documents into sections
- [`cache`](cache.md) - Manage cached conversions

---

## See Also

- [Converting Documents Tutorial](../tutorials/convert-documents.md)
- [Checkpoints](../concepts/checkpoints.md)
- [Caching](../concepts/caching.md)