# Converting Documents
Learn how to convert PDFs, images, and documents to structured formats using the Datalab CLI.
---
## Prerequisites
- [Datalab CLI installed](../getting-started/installation.md)
- [API key configured](../getting-started/configuration.md)
---
## Basic Conversion
Convert a PDF to markdown:
```bash
datalab convert document.pdf
```
Output:
```json
{
  "content": "# Document Title\n\n## Introduction\n\nThis is the document content...",
  "metadata": {
    "pages": 5,
    "processing_time": 2.3
  }
}
```
### From URL
Convert a document from a URL:
```bash
datalab convert https://example.com/report.pdf
```
### Save to File
```bash
# Using --output flag
datalab convert document.pdf --output result.json
# Using redirection
datalab convert document.pdf > result.json
```
---
## Output Formats
### Markdown (Default)
Best for human-readable content and LLM processing:
```bash
datalab convert document.pdf --output-format markdown
```
Features:
- Headers preserved as `#`, `##`, etc.
- Tables converted to markdown tables
- Lists preserved
- Images extracted with alt text
### HTML
Structured HTML with semantic tags:
```bash
datalab convert document.pdf --output-format html
```
Features:
- Semantic HTML5 elements
- Preserved styling hints
- Links and images included
### JSON
Full structured representation:
```bash
datalab convert document.pdf --output-format json
```
Features:
- Block-level structure
- Rich metadata
- Position information
### Chunks
Semantic chunks for RAG applications:
```bash
datalab convert document.pdf --output-format chunks
```
Features:
- Intelligently split content
- Chunk metadata (page, section)
- Optimized for vector databases
---
## Processing Modes
### Fast Mode (Default)
Quick processing, suitable for most documents:
```bash
datalab convert document.pdf --mode fast
```
Best for:
- Simple documents
- Text-heavy content
- High volume processing
### Balanced Mode
Balance of speed and quality:
```bash
datalab convert document.pdf --mode balanced
```
Best for:
- Documents with some complexity
- Mixed content types
### Accurate Mode
Highest quality, slower processing:
```bash
datalab convert document.pdf --mode accurate
```
Best for:
- Complex layouts
- Scientific papers
- Tables and charts
- Critical documents
---
## Page Selection
### Limit Pages
Process only the first N pages:
```bash
datalab convert book.pdf --max-pages 10
```
### Page Ranges
Process specific page ranges (0-indexed, with both ends of each range included):
```bash
# First 5 pages
datalab convert book.pdf --page-range "0-4"
# Multiple ranges
datalab convert book.pdf --page-range "0-4,10-14,20-24"
# Single page
datalab convert book.pdf --page-range "5"
```
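Because `--page-range` is 0-indexed, page numbers read off a PDF viewer (which usually start at 1) need shifting by one. The sketch below shows that arithmetic. `page_range_from_viewer` is a hypothetical helper written for this guide, not part of the Datalab CLI.

```bash
# Hypothetical helper (not part of the CLI): convert a 1-indexed,
# inclusive page span into the 0-indexed string used by --page-range.
page_range_from_viewer() {
  first=$1
  last=${2:-$1}   # a single argument means a single page
  if [ "$first" -lt 1 ] || [ "$last" -lt "$first" ]; then
    echo "invalid page span: $first-$last" >&2
    return 1
  fi
  if [ "$first" -eq "$last" ]; then
    echo "$((first - 1))"
  else
    echo "$((first - 1))-$((last - 1))"
  fi
}

# Viewer pages 1 through 5 are the first five pages.
page_range_from_viewer 1 5    # prints 0-4
```

The result can be passed straight to the flag, for example `datalab convert book.pdf --page-range "$(page_range_from_viewer 51 76)"`, which requests the range `50-75`.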
---
## Advanced Features
### Page Markers
Add page delimiters to output:
```bash
datalab convert document.pdf --paginate
```
Output includes `<!-- Page X -->` markers in markdown.
### Block IDs
Add unique IDs for citation tracking:
```bash
datalab convert document.pdf --add-block-ids
```
Each block gets a unique ID for reference.
### Token-Efficient Markdown
Optimized format for LLM token usage:
```bash
datalab convert document.pdf --token-efficient-markdown
```
Reduces token count while preserving meaning.
### Extra Features
Enable additional processing features:
```bash
# Chart understanding
datalab convert report.pdf --extras chart_understanding
# Extract links
datalab convert document.pdf --extras extract_links
# Track changes (for Word docs converted to PDF)
datalab convert document.pdf --extras track_changes
# Multiple extras
datalab convert report.pdf --extras chart_understanding,extract_links
```
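When a script decides which extras to enable at run time, the comma-separated value can be assembled from a list. This is a minimal sketch; `join_extras` is a hypothetical helper, and the only assumption about the CLI is the comma-separated format shown above.

```bash
# Hypothetical helper (not part of the CLI): join all arguments with
# commas, matching the value format that --extras takes above.
join_extras() {
  joined=""
  for extra in "$@"; do
    # Prepend "previous," only when something has been collected already.
    joined="${joined:+$joined,}$extra"
  done
  echo "$joined"
}

join_extras chart_understanding extract_links
# prints chart_understanding,extract_links
```

A call such as `datalab convert report.pdf --extras "$(join_extras chart_understanding extract_links)"` is then equivalent to the "Multiple extras" command above.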
### Image Handling
Control image extraction:
```bash
# Disable image extraction
datalab convert document.pdf --disable-image-extraction
# Disable image captions
datalab convert document.pdf --disable-image-captions
```
---
## Using Checkpoints
Save a checkpoint for later operations:
```bash
datalab convert document.pdf --save-checkpoint
```
The response includes a `checkpoint_id` you can use with `extract` or `segment`:
```bash
# Use checkpoint for extraction
datalab extract document.pdf --schema schema.json --checkpoint-id ckpt_abc123
```
See [Checkpoints](../concepts/checkpoints.md) for details.
---
## Practical Examples
### Convert Resume to JSON
```bash
datalab convert resume.pdf --output-format json --mode accurate
```
### Process Book Chapter
```bash
datalab convert textbook.pdf \
  --page-range "50-75" \
  --mode balanced \
  --paginate \
  --output chapter3.json
```
### Prepare for RAG
```bash
datalab convert knowledge-base.pdf \
  --output-format chunks \
  --add-block-ids \
  --output chunks.json
```
### High-Quality Report
```bash
datalab convert financial-report.pdf \
  --mode accurate \
  --extras chart_understanding \
  --output report.json
```
### Batch Processing
```bash
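#!/bin/bash
# Convert every PDF in documents/ to a JSON file next to it.
shopt -s nullglob  # run the loop zero times when no PDFs match
for file in documents/*.pdf; do
  output="${file%.pdf}.json"
  echo "Converting $file..."
  datalab convert "$file" --output "$output"
done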
```
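For larger batches it can help to keep going after a failure and report a count at the end. The sketch below wraps the loop above in a function. `convert_dir` is a hypothetical name, and the code assumes that `datalab convert` exits with a non-zero status when a conversion fails, which this guide does not confirm.

```bash
# Hypothetical wrapper around the batch loop above.
# Assumption: `datalab convert` exits non-zero when a conversion fails.
convert_dir() {
  dir=$1
  failed=0
  for file in "$dir"/*.pdf; do
    # With no matches the pattern stays literal, so skip anything
    # that is not an existing file.
    [ -f "$file" ] || continue
    output="${file%.pdf}.json"
    if datalab convert "$file" --output "$output"; then
      echo "OK: $file"
    else
      echo "FAILED: $file" >&2
      failed=$((failed + 1))
    fi
  done
  echo "$failed file(s) failed"
  # Return success only when every conversion succeeded.
  [ "$failed" -eq 0 ]
}
```

Calling `convert_dir documents` converts everything in `documents/`, and the function's return status lets a calling script or CI job detect a partial failure.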
---
## Tips and Tricks
### Combine with jq
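Extract just the content:
```bash
datalab convert document.pdf | jq -r '.content'
```
### Suppress Progress
For clean output in scripts, suppress the progress display. See the [convert command reference](../commands/convert.md) for the available options.
### Check Processing Time
```bash
datalab convert document.pdf | jq '.metadata.processing_time'
```
### Validate Before Processing
Check that the file exists and is readable:
```bash
if [ -f "document.pdf" ] && [ -r "document.pdf" ]; then
  datalab convert document.pdf
else
  echo "Cannot read document.pdf"
fi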
```
---
## Troubleshooting
### "File too large" Error
Split the document, or process only part of it with `--max-pages` or `--page-range`:
```bash
datalab convert large.pdf --max-pages 100
```
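To cover a whole large document rather than only its first pages, one option is to convert it in consecutive page ranges. The sketch below only generates the range strings. `page_chunks` is a hypothetical helper, and the total page count is assumed to be known already, since this guide does not show a way to query it.

```bash
# Hypothetical helper (not part of the CLI): print consecutive 0-indexed
# page ranges, one per line, that together cover `total` pages.
page_chunks() {
  total=$1
  size=$2
  if [ "$size" -lt 1 ]; then
    echo "chunk size must be at least 1" >&2
    return 1
  fi
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + size - 1))
    # Clamp the final chunk to the last page index.
    if [ "$end" -ge "$total" ]; then
      end=$((total - 1))
    fi
    if [ "$start" -eq "$end" ]; then
      echo "$start"          # single page, same form as --page-range "5"
    else
      echo "$start-$end"
    fi
    start=$((start + size))
  done
}

page_chunks 250 100
# prints:
# 0-99
# 100-199
# 200-249
```

Each printed line can be passed to `--page-range` in a separate `datalab convert` call, with a different `--output` file per chunk.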
### Poor Quality Output
Try accurate mode:
```bash
datalab convert complex.pdf --mode accurate
```
### Missing Images
Ensure image extraction is enabled (default):
```bash
# Don't use --disable-image-extraction
datalab convert document.pdf
```
### Slow Processing
Use fast mode or limit pages:
```bash
datalab convert document.pdf --mode fast --max-pages 50
```
---
## Next Steps
- [Extract structured data](extract-data.md)
- [Learn about checkpoints](../concepts/checkpoints.md)
- [Explore the convert command reference](../commands/convert.md)