kreuzberg-cli
Command-line interface for the Kreuzberg document intelligence library.
Overview
This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, embeddings, chunking, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface.
The CLI supports 91+ file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents.
Architecture
Binary Structure
Kreuzberg Core Library (crates/kreuzberg)
|
Kreuzberg CLI (crates/kreuzberg-cli) <- This crate
|
Command-line interface with configuration and caching
Commands
| Command | Description |
|---|---|
| extract | Extract text from a document |
| batch | Batch extract from multiple documents |
| detect | Detect MIME type of a file |
| formats | List all supported document formats |
| version | Show version information |
| cache | Cache management (stats, clear, warm, manifest) |
| serve | Start the API server (requires api feature) |
| mcp | Start the MCP server (requires mcp feature) |
| api | API utilities (schema) (requires api feature) |
| embed | Generate embeddings for text (requires embeddings feature) v4.5.2 |
| chunk | Chunk text for processing v4.5.2 |
| completions | Generate shell completions v4.5.2 |
Platform Support
The CLI is tested and officially supported on:
- Linux x86_64 (glibc and musl static)
- Linux aarch64 / ARM64 (glibc and musl static)
- macOS aarch64 (Apple Silicon)
- Windows x86_64
All platforms receive precompiled binaries through GitHub releases. Linux musl binaries are fully statically linked with zero runtime dependencies.
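As a rough aid, the sketch below checks the current machine against the supported list using the strings that uname reports. The function name is_supported_platform is hypothetical, and Windows is left out because uname is not normally available there.

```shell
# Classify an OS/architecture pair against the supported-platform list.
# Inputs are the strings printed by `uname -s` and `uname -m`.
is_supported_platform() {
  case "$1/$2" in
    Linux/x86_64|Linux/aarch64) return 0 ;;  # Linux x86_64 and ARM64
    Darwin/arm64)               return 0 ;;  # macOS on Apple Silicon
    *)                          return 1 ;;
  esac
}

os=$(uname -s)
arch=$(uname -m)
if is_supported_platform "$os" "$arch"; then
  echo "$os $arch is on the supported list"
else
  echo "$os $arch is not on the supported list; building from source may be needed"
fi
```

Note that macOS reports Apple Silicon as arm64, while Linux reports the same architecture as aarch64.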
Installation
Install Script (Linux / macOS)
Homebrew
Cargo
Docker
From Source
Or via the workspace:
Platform-Specific Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
# Ubuntu/Debian
# Windows (MSVC)
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
OCR Support (Optional)
To enable optical character recognition for scanned documents:
- macOS: brew install tesseract
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- Windows: Download from tesseract-ocr/tesseract
Quick Start
The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.
Basic Text Extraction
# Extract text from a PDF
# Extract with JSON output
Extract with OCR
# Enable OCR for scanned documents
# Force OCR even if text extraction succeeds
Batch Processing
# Process multiple documents in parallel
# Process with custom configuration
MIME Type Detection
# Detect file type
# JSON output
Generate Embeddings (with embeddings feature)
# Embed a single text
# Embed multiple texts
# Read from stdin
Chunk Text
# Chunk text from flag
# Chunk from stdin with overlap
# Use markdown-aware chunking
Cache Management
# View cache statistics
# Clear the cache
# Pre-download all models
# Pre-download models including all embedding presets
# Show model manifest (paths, checksums, sizes)
Shell Completions
# Bash
# Zsh
# Fish
API Server (with api feature)
# Start API server on localhost:8000
# Custom host and port
# With configuration file
API Utilities (with api feature)
# Dump the OpenAPI 3.1 schema
MCP Server (with mcp feature)
# Start Model Context Protocol server (stdio transport)
# With HTTP transport
# With configuration file
Global Flags
| Flag | Description |
|---|---|
| --log-level <LEVEL> | Set log level (trace, debug, info, warn, error). Overrides RUST_LOG env var. v4.5.2 |
Configuration
The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be:
- Explicit: Passed via --config /path/to/config.{toml,yaml,json}
- Auto-discovered: Searches for kreuzberg.{toml,yaml,json} in current and parent directories
- Inline JSON: Passed via --config-json '{"ocr":{"backend":"tesseract"}}'
- Base64 JSON: Passed via --config-json-base64 <BASE64> (useful when shell quoting is tricky)
- Default: Uses built-in defaults if no config found
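As a hedged illustration of the Base64 option, the sketch below encodes the inline JSON example with the standard base64 utility and checks the round trip. The variable names are illustrative, and the invocation shown in the final comment assumes a placeholder input file named document.pdf.

```shell
# Inline JSON config, identical to the --config-json example.
config_json='{"ocr":{"backend":"tesseract"}}'

# Encode it; tr removes any line wrapping that base64 may insert.
config_b64=$(printf '%s' "$config_json" | base64 | tr -d '\n')

# Decode again to confirm nothing was lost along the way.
decoded=$(printf '%s' "$config_b64" | base64 --decode)
if [ "$decoded" = "$config_json" ]; then
  echo "round trip ok: $config_b64"
fi

# Assumed usage (placeholder file name):
#   kreuzberg extract document.pdf --config-json-base64 "$config_b64"
```

Because the encoded value contains no quotes or braces, it can be passed through nested shells, CI variables, or container arguments without extra escaping.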
Configuration precedence (highest to lowest):
- Individual CLI flags (--ocr, --chunk-size, etc.)
- Inline JSON config (--config-json or --config-json-base64)
- Config file (--config path.toml)
- Default values
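The precedence order can be pictured as a chain of fallbacks. The sketch below models it with shell parameter expansion; it illustrates the rule only and is not how the CLI is implemented. The function name resolve_setting and the sample values are hypothetical.

```shell
# Resolve one setting across the four layers; an empty string means "not set".
# Arguments: CLI flag, inline JSON, config file, built-in default.
resolve_setting() {
  flag_value="$1"; inline_value="$2"; file_value="$3"; default_value="$4"
  # ${a:-b} yields b when a is empty, so the first non-empty layer wins.
  echo "${flag_value:-${inline_value:-${file_value:-$default_value}}}"
}

# No CLI flag; inline JSON says 500; config file says 800.
resolve_setting "" "500" "800" "1000"     # prints 500
# An explicit flag beats every other layer.
resolve_setting "250" "500" "800" "1000"  # prints 250
# Nothing set anywhere: the built-in default applies.
resolve_setting "" "" "" "1000"           # prints 1000
```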
Example Configuration (TOML)
# Basic extraction settings
= true
= true
= false
# OCR configuration
[]
= "tesseract"
= "eng"
[]
= true
= 6
= 50.0
# Text chunking (useful for LLM processing)
[]
= 1000
= 200
# PDF-specific options
[]
= true
= true
= []
# Language detection
[]
= true
= 0.8
= false
# Image extraction
[]
= true
= 300
= 4096
= true
Configuration Overrides
Command-line flags override configuration file settings:
# Override OCR setting from config
# Override chunking settings
# Disable cache despite config file
# Enable language detection
Command Reference
extract
Extract text, tables, and metadata from a document.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --config-json <JSON>: Inline JSON configuration
- --config-json-base64 <BASE64>: Base64-encoded JSON configuration
- --mime-type <TYPE>: MIME type hint (auto-detected if not provided)
- --format <FORMAT>: Output format (text or json), default: text
Extraction override flags v4.5.2 (also available on batch):
| Flag | Description |
|---|---|
| --ocr <true\|false> | Enable/disable OCR |
| --ocr-backend <BACKEND> | OCR backend: tesseract, paddle-ocr, easyocr |
| --ocr-language <LANG> | OCR language code (e.g. eng, fra, ch) |
| --ocr-auto-rotate <true\|false> | Auto-rotate images before OCR |
| --force-ocr <true\|false> | Force OCR even if text extraction succeeds |
| --no-cache <true\|false> | Disable result caching |
| --chunk <true\|false> | Enable text chunking |
| --chunk-size <SIZE> | Maximum chunk size in characters (default: 1000) |
| --chunk-overlap <SIZE> | Overlap between chunks in characters (default: 200) |
| --chunking-tokenizer <MODEL> | Tokenizer model for token-based sizing (e.g. Xenova/gpt-4o) |
| --output-format <FORMAT> | Content format: plain, markdown, djot, html |
| --include-structure <true\|false> | Include hierarchical document structure |
| --quality <true\|false> | Enable quality post-processing |
| --detect-language <true\|false> | Enable language detection |
| --layout-preset <PRESET> | Layout detection preset (e.g. accurate) |
| --layout-confidence <FLOAT> | Layout confidence threshold (0.0 - 1.0) |
| --acceleration <PROVIDER> | ONNX execution provider: auto, cpu, coreml, cuda, tensorrt |
| --max-concurrent <N> | Max parallel extractions in batch mode |
| --max-threads <N> | Cap all internal thread pools |
| --extract-pages <true\|false> | Extract pages as separate array |
| --page-markers <true\|false> | Insert page marker comments |
| --extract-images <true\|false> | Enable image extraction |
| --target-dpi <DPI> | Target DPI for images (36 - 2400) |
| --pdf-password <PASS> | Password for encrypted PDFs (repeatable) |
| --pdf-extract-images <true\|false> | Extract images from PDF pages |
| --pdf-extract-metadata <true\|false> | Extract PDF metadata |
| --token-reduction <LEVEL> | Token reduction: off, light, moderate, aggressive, maximum |
| --msg-codepage <CODE> | Windows codepage fallback for MSG files |
Examples:
# Simple extraction
# With configuration and JSON output
# With chunking for LLM processing
# With OCR for scanned document
# Markdown output with page markers
# GPU-accelerated extraction
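The sketch below shows how the scenarios listed above might be written using the documented flags. It assumes the input path is passed as a positional argument after extract, and it uses placeholder file names (document.pdf, kreuzberg.toml). The kz helper is illustrative and not part of the CLI: it prints each command by default so the block can be run safely without the binary or input files.

```shell
# Dry-run helper (illustrative, not part of the CLI): prints each command by
# default; set KZ_DRY_RUN=0 to run it against an installed kreuzberg binary.
kz() {
  if [ "${KZ_DRY_RUN:-1}" = "0" ]; then
    kreuzberg "$@"
  else
    echo "kreuzberg $*"
  fi
}

doc="document.pdf"   # placeholder input file

# Simple extraction
kz extract "$doc"
# With configuration and JSON output
kz extract "$doc" --config kreuzberg.toml --format json
# With chunking for LLM processing
kz extract "$doc" --chunk true --chunk-size 1000 --chunk-overlap 200
# With OCR for a scanned document
kz extract "$doc" --ocr true --ocr-language eng
# Markdown output with page markers
kz extract "$doc" --output-format markdown --page-markers true
# GPU-accelerated extraction (CUDA execution provider)
kz extract "$doc" --acceleration cuda
```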
batch
Process multiple documents in parallel.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --config-json <JSON>: Inline JSON configuration
- --config-json-base64 <BASE64>: Base64-encoded JSON configuration
- --format <FORMAT>: Output format (text or json), default: json
- --file-configs <PATH>: JSON file mapping per-file extraction overrides
- All extraction override flags (see extract above)
Examples:
# Batch process multiple files
# With glob patterns
# With custom configuration
# With OCR and concurrency limit
# Per-file overrides
detect
Identify the MIME type of a file.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# Simple detection
# JSON output
formats
List all supported document formats.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# List formats as table
# JSON output for tooling
cache
Manage extraction result cache and model downloads.
Subcommands:
stats
Show cache statistics.
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text
clear
Clear the cache.
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text
warm
Pre-download all models (OCR, layout detection, and optionally embeddings).
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg or KREUZBERG_CACHE_DIR)
- --format <FORMAT>: Output format (text or json), default: text
- --all-embeddings: Download all 4 embedding model presets (fast, balanced, quality, multilingual)
- --embedding-model <PRESET>: Download a specific embedding model preset
manifest
Output model manifest (expected model files, checksums, sizes).
Options:
- --format <FORMAT>: Output format (text or json), default: json
Examples:
# View cache statistics
# Clear cache with custom directory
# Pre-download all models for offline/container use
# Also download embedding models
# Download only the fast embedding model
# Get model manifest as JSON
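A possible rendering of the scenarios above, built only from the subcommands and options documented in this section. The directory /tmp/kreuzberg-cache is a placeholder, and the kz helper is illustrative rather than part of the CLI: it prints each command unless KZ_DRY_RUN=0 is set.

```shell
# Dry-run helper (illustrative, not part of the CLI): prints each command by
# default; set KZ_DRY_RUN=0 to run it against an installed kreuzberg binary.
kz() {
  if [ "${KZ_DRY_RUN:-1}" = "0" ]; then
    kreuzberg "$@"
  else
    echo "kreuzberg $*"
  fi
}

# View cache statistics
kz cache stats
# Clear cache with custom directory
kz cache clear --cache-dir /tmp/kreuzberg-cache
# Pre-download all models for offline/container use
kz cache warm
# Also download embedding models
kz cache warm --all-embeddings
# Download only the fast embedding model
kz cache warm --embedding-model fast
# Get model manifest as JSON
kz cache manifest --format json
```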
serve (requires api feature)
Start the REST API server.
Options:
- -H, --host <HOST>: Host to bind to (default: 127.0.0.1)
- --port <PORT>: Port to bind to (default: 8000)
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Default: localhost:8000
# Public access on port 3000
# With custom configuration
mcp (requires mcp feature)
Start the Model Context Protocol server.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --transport <MODE>: Transport mode: stdio (default) or http
- --host <HOST>: HTTP host (only for --transport http, default: 127.0.0.1)
- --port <PORT>: HTTP port (only for --transport http, default: 8001)
Examples:
# Start MCP server (stdio, for editor integration)
# HTTP transport for remote access
# With custom configuration
api (requires api feature)
API utility commands.
schema v4.5.2
Output the full OpenAPI 3.1 specification as JSON.
Examples:
# Dump OpenAPI spec to file
# Pipe to jq for inspection
embed v4.5.2 (requires embeddings feature)
Generate vector embeddings for text.
Options:
- --text <TEXT>: Text to embed (repeatable for batch embedding; reads from stdin if omitted)
- --preset <PRESET>: Embedding preset (fast, balanced, quality, multilingual), default: balanced
- --format <FORMAT>: Output format (text or json), default: json
Examples:
# Embed a single text
# Batch embed multiple texts
# Use a specific preset
# Read from stdin
chunk v4.5.2
Chunk text for processing (useful for LLM context windows).
Options:
- --text <TEXT>: Text to chunk (reads from stdin if omitted)
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --chunk-size <SIZE>: Chunk size in characters
- --chunk-overlap <SIZE>: Chunk overlap in characters
- --chunker-type <TYPE>: Chunker type: text (default) or markdown
- --chunking-tokenizer <MODEL>: Tokenizer model for token-based sizing (e.g. Xenova/gpt-4o)
- --format <FORMAT>: Output format (text or json), default: json
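To show how --chunk-size and --chunk-overlap interact, the sketch below estimates a chunk count under a simplifying assumption: each chunk starts exactly (size - overlap) characters after the previous one. The real chunker may split on sentence or Markdown boundaries and produce different counts, so treat this as a sizing aid only. The function name estimate_chunks is hypothetical.

```shell
# Estimate the chunk count for a text of n characters, assuming every chunk
# starts (size - overlap) characters after the previous one.
# Requires overlap < size.
estimate_chunks() {
  n=$1; size=$2; overlap=$3
  stride=$((size - overlap))
  if [ "$n" -le "$size" ]; then
    echo 1
  else
    # One full chunk, then ceil((n - size) / stride) further chunks.
    echo $(( 1 + (n - size + stride - 1) / stride ))
  fi
}

# 5000 characters with the documented defaults (size 1000, overlap 200):
# chunks start at 0, 800, 1600, 2400, 3200 and 4000.
estimate_chunks 5000 1000 200   # prints 6
# Text shorter than one chunk.
estimate_chunks 800 1000 200    # prints 1
```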
Examples:
# Chunk text with defaults
# Custom chunk size and overlap
# Markdown-aware chunking
# Token-based chunking with specific tokenizer
# Read from stdin
completions v4.5.2
Generate shell completion scripts.
Supported shells: bash, zsh, fish, elvish, powershell
Examples:
# Bash (add to .bashrc)
# Zsh (add to fpath)
# Fish
# PowerShell
version
Show version information.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# Display version
# JSON output
Output Formats
Text Format
The default human-readable format:
# Output:
# Document content here...
JSON Format
For programmatic integration:
# Output:
# {
# "content": "Document content...",
# "mime_type": "application/pdf",
# "metadata": { "title": "...", "author": "..." },
# "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }]
# }
Supported File Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | 30+ additional formats |
Exit Codes
- 0: Successful execution
- Non-zero: Error occurred (check stderr for details)
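In scripts, this convention can be handled with a small wrapper like the sketch below. The helper name run_checked is hypothetical, and the commands true and false stand in for a kreuzberg invocation so the block runs anywhere.

```shell
# Run any command and report its exit status: 0 is success, anything else
# is an error whose details are expected on stderr.
run_checked() {
  if "$@"; then
    echo "ok"
  else
    status=$?
    echo "failed with exit code $status" >&2
    return "$status"
  fi
}

run_checked true
run_checked false || echo "caller saw a non-zero exit"

# Assumed usage (placeholder file name):
#   run_checked kreuzberg extract document.pdf
```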
Logging
Control logging verbosity with the --log-level flag or RUST_LOG environment variable:
# Using --log-level flag (overrides RUST_LOG)
# Using RUST_LOG environment variable
RUST_LOG=info
RUST_LOG=debug
RUST_LOG=warn
# Show logs from specific modules
RUST_LOG=kreuzberg=debug
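A VAR=value prefix applies the variable to a single command only, which is the usual way to set RUST_LOG for one run. In the sketch below, sh -c stands in for the kreuzberg binary so the effect is visible without it. The invocations in the trailing comments are assumptions, including the placement of --log-level before the subcommand and the placeholder file name.

```shell
# The prefix assignment is visible to the child process only.
seen=$(RUST_LOG=kreuzberg=debug sh -c 'printf "%s" "$RUST_LOG"')
echo "child process saw RUST_LOG=$seen"

# Assumed equivalents with the real binary (placeholder file name):
#   RUST_LOG=kreuzberg=debug kreuzberg extract document.pdf
#   kreuzberg --log-level debug extract document.pdf
```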
Performance Tips
- Use batch processing for multiple files instead of sequential extraction.
- Enable caching to avoid reprocessing the same documents. The cache is enabled by default.
- Use appropriate chunk sizes for LLM processing.
- Tune OCR settings for better performance by adjusting tesseract_config in the configuration file.
- Monitor cache size and clear the cache when needed.
- Pre-warm models for containerized deployments.
- Use hardware acceleration when available.
Features
Default Features
None by default. The binary includes core extraction.
Optional Features
- api: Enable the REST API server (kreuzberg serve command) and API utilities (kreuzberg api schema)
- mcp: Enable Model Context Protocol server (kreuzberg mcp command)
- embeddings: Enable embedding generation (kreuzberg embed command)
- layout-detection: Enable layout detection flags (--layout-preset, --layout-confidence)
- chunking-tokenizers: Enable token-based chunking (--chunking-tokenizer)
- all: Enable all features (api + mcp)
Building with Features
# Build with all features
# Build with specific features
Troubleshooting
File Not Found Error
Ensure the file path is correct and the file is readable:
# Check if file exists
# Try with absolute path
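Both checks can be combined in a small helper, sketched below with standard POSIX tools. The function name readable_abs_path and the file name document.pdf are placeholders.

```shell
# Print the absolute form of a path if the file is readable; fail otherwise.
readable_abs_path() {
  [ -r "$1" ] || { echo "not readable: $1" >&2; return 1; }
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Prints an absolute path, or a diagnostic on stderr if the file is missing.
readable_abs_path document.pdf || true
```

Passing the resulting absolute path to the CLI rules out problems caused by the current working directory.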
OCR Not Working
Verify Tesseract is installed:
# If not found:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/tesseract-ocr/tesseract
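A minimal check along these lines, using Tesseract's own --version and --list-langs options. The have_tool helper is illustrative.

```shell
# Report whether a tool is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

if have_tool tesseract; then
  # Older Tesseract releases print the version to stderr, hence 2>&1.
  tesseract --version 2>&1 | head -n 1
  # List installed language packs; the code passed to --ocr-language
  # (for example eng) needs to appear here.
  tesseract --list-langs 2>&1 || true
else
  echo "tesseract not found on PATH" >&2
fi
```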
Configuration File Not Found
Check that the configuration file has the correct format and location:
# Use explicit path
# Or place kreuzberg.toml in current directory
Out of Memory with Large Files
Use chunking to reduce memory usage:
Cache Directory Permissions
Ensure write access to the cache directory:
# Check permissions
# Or use a custom directory with appropriate permissions
# In config.toml: cache_dir = "/tmp/kreuzberg-cache"
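The permission check can be scripted as in the sketch below. The function name check_cache_dir is hypothetical, and the use of KREUZBERG_CACHE_DIR as an override for the default .kreuzberg directory is an assumption based on the cache warm options.

```shell
# Classify a cache directory as missing, writable, or not writable.
check_cache_dir() {
  if [ ! -d "$1" ]; then
    echo "missing"
  elif [ -w "$1" ]; then
    echo "writable"
  else
    echo "not writable"
  fi
}

# .kreuzberg is the documented default; the environment override is assumed.
cache_dir="${KREUZBERG_CACHE_DIR:-.kreuzberg}"
echo "$cache_dir: $(check_cache_dir "$cache_dir")"
```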
Key Files
- src/main.rs: CLI implementation with command definitions and argument parsing
- src/commands/overrides.rs: Extraction override flags (OCR, chunking, layout, acceleration, etc.)
- Cargo.toml: Package metadata and dependencies
Building
Development Build
Release Build
With All Features
Testing
# Run CLI tests
# With logging
RUST_LOG=debug
Performance Characteristics
- Single file extraction: Typically 10-100ms depending on file size and format
- Batch processing: Near-linear scaling with 8 concurrent extractions by default
- OCR processing: 100-500ms per page depending on image quality and language
- Caching: Sub-millisecond retrieval for cached results
References
- Kreuzberg Core: ../kreuzberg/
- Main Documentation: https://docs.kreuzberg.dev
- GitHub Repository: https://github.com/kreuzberg-dev/kreuzberg
- Configuration Guide: See example configuration sections above
Contributing
We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines.
License
MIT