# convert
Convert a document to markdown, HTML, JSON, or chunks.
## Synopsis
```
## Description
Convert PDF, images, or documents to structured formats. Supports markdown (default), HTML, JSON, and chunked output for RAG applications.
Use `--save-checkpoint` to enable efficient follow-up extraction or segmentation on the same document.
---
## Arguments
| `<FILE\|URL>` | File path or URL to convert |
---
## Options
### Output Options
| `--output-format <FORMAT>` | Output format: `markdown`, `html`, `json`, `chunks` | `markdown` |
| `--paginate` | Add page delimiters to output | - |
| `--token-efficient-markdown` | Use token-efficient markdown format | - |
| `-o, --output <FILE>` | Write result to file instead of stdout | - |
### Processing Options
| `--mode <MODE>` | Processing mode: `fast`, `balanced`, `accurate` | `fast` |
| `--max-pages <N>` | Maximum pages to process | - |
| `--page-range <RANGE>` | Page range (e.g., `"0-5,10"`) | - |
### Advanced Options
| `--save-checkpoint` | Save checkpoint for follow-up extraction/segmentation | - |
| `--extras <FEATURES>` | Extra features: `track_changes`, `chart_understanding`, `extract_links` | - |
| `--add-block-ids` | Add block IDs for citation tracking | - |
| `--disable-image-extraction` | Disable image extraction from document | - |
| `--disable-image-captions` | Disable image caption generation | - |
| `--timeout <SECS>` | Request timeout in seconds | `300` |
### Cache Options
| `--skip-cache` | Skip local cache lookup |
| `--force` | Force reprocessing (skip API-side cache) |
---
## Output Formats
### Markdown (default)
Standard markdown with headers, lists, tables, and images.
```bash
datalab convert document.pdf --output-format markdown
```
### HTML
Structured HTML output.
```bash
datalab convert document.pdf --output-format html
```
### JSON
Full structured JSON with metadata.
```bash
datalab convert document.pdf --output-format json
```
### Chunks
Semantic chunks for RAG applications.
```bash
datalab convert document.pdf --output-format chunks
```
---
## Processing Modes
| `fast` | Quick processing, good for simple documents | Invoices, forms |
| `balanced` | Balance of speed and quality | General documents |
| `accurate` | Highest quality, slower | Complex layouts, charts |
---
## Examples
### Basic Conversion
```bash
# Convert to markdown
datalab convert document.pdf
# Convert from URL
datalab convert https://example.com/document.pdf
```
### Output Format
```bash
# HTML output
datalab convert document.pdf --output-format html
# Chunked output for RAG
datalab convert document.pdf --output-format chunks
```
### Quality Control
```bash
# High accuracy mode
datalab convert report.pdf --mode accurate
# Enable chart understanding
datalab convert report.pdf --extras chart_understanding
```
### Page Selection
```bash
# First 10 pages only
datalab convert book.pdf --max-pages 10
# Specific page ranges
datalab convert book.pdf --page-range "0-5,10-15,20"
```
### Save to File
```bash
# Using --output flag
datalab convert document.pdf --output result.json
# Using redirection
datalab convert document.pdf > result.json
```
### Checkpoints
```bash
# Save checkpoint for later extraction
datalab convert document.pdf --save-checkpoint
```
The checkpoint ID is returned in the response and can be used with `extract` or `segment`.
---
## Output Schema
```json
{
"content": "# Document Title\n\nDocument content...",
"metadata": {
"pages": 5,
"processing_time": 2.3,
"checkpoint_id": "abc123"
}
}
```
---
## Related Commands
- [`extract`](extract.md) - Extract structured data from converted documents
- [`segment`](segment.md) - Segment documents into sections
- [`cache`](cache.md) - Manage cached conversions
---
## See Also
- [Converting Documents Tutorial](../tutorials/convert-documents.md)
- [Checkpoints](../concepts/checkpoints.md)
- [Caching](../concepts/caching.md)