kreuzberg-cli
Command-line interface for the Kreuzberg document intelligence library.
Overview
This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface.
The CLI supports 88+ file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents.
Architecture
Binary Structure
Kreuzberg Core Library (crates/kreuzberg)
↓
Kreuzberg CLI (crates/kreuzberg-cli) ← This crate
↓
Command-line interface with configuration and caching
Features
- Extract Command: Extract text, tables, and metadata from single documents
- Batch Command: Process multiple documents in parallel with optimized concurrency
- Detect Command: Identify MIME type of any file with magic byte analysis
- Cache Commands: Manage extraction result cache (stats, clear)
- Serve Command (requires api feature): Start REST API server for remote document processing
- MCP Command (requires mcp feature): Start Model Context Protocol server for AI integration
- Version Command: Display version information in text or JSON format
- Configuration: TOML, YAML, or JSON config files with auto-discovery
Platform Support
The CLI is tested and officially supported on:
- ✅ Linux x86_64 (glibc and musl static)
- ✅ Linux aarch64 / ARM64 (glibc and musl static)
- ✅ macOS aarch64 (Apple Silicon)
- ✅ Windows x86_64
All platforms receive precompiled binaries through GitHub releases. Linux musl binaries are fully statically linked with zero runtime dependencies.
Installation
Install Script (Linux / macOS)
Homebrew
Cargo
Docker
From Source
Or via the workspace:
Platform-Specific Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
# Ubuntu/Debian
# Windows (MSVC)
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
OCR Support (Optional)
To enable optical character recognition for scanned documents:
- macOS: brew install tesseract
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- Windows: Download from https://github.com/tesseract-ocr/tesseract
Quick Start
The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.
Basic Text Extraction
# Extract text from a PDF
# Extract with JSON output
Extract with OCR
# Enable OCR for scanned documents
# Force OCR even if text extraction succeeds
Batch Processing
# Process multiple documents in parallel
# Process with custom configuration
MIME Type Detection
# Detect file type
# JSON output
Cache Management
# View cache statistics
# Clear the cache
# Custom cache directory
API Server (with api feature)
# Start API server on localhost:8000
# Custom host and port
# With configuration file
MCP Server (with mcp feature)
# Start Model Context Protocol server
# With configuration file
Configuration
The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be:
- Explicit: Passed via --config /path/to/config.{toml,yaml,json}
- Auto-discovered: Searches for kreuzberg.{toml,yaml,json} in current and parent directories
- Default: Uses built-in defaults if no config found
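To see which file auto-discovery would likely pick up, the search described above can be approximated in plain shell. This is a minimal sketch, not the CLI's actual implementation: find_kreuzberg_config is a hypothetical helper, and the extension order (toml, yaml, json) is an assumption.

```shell
# Hypothetical helper: walk from a start directory up to the filesystem root
# and print the first kreuzberg.{toml,yaml,json} found.
# The extension order tried here is an assumption.
find_kreuzberg_config() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    for ext in toml yaml json; do
      if [ -f "$dir/kreuzberg.$ext" ]; then
        printf '%s\n' "$dir/kreuzberg.$ext"
        return 0
      fi
    done
    if [ "$dir" = "/" ]; then
      return 1   # reached the root without a match
    fi
    dir=$(dirname "$dir")
  done
}

# Usage: report which file would be found from the current directory.
find_kreuzberg_config . || echo "no config found; built-in defaults apply"
```

Passing the result to --config makes the choice explicit instead of relying on discovery.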
Example Configuration (TOML)
# Basic extraction settings
= true
= true
= false
# OCR configuration
[]
= "tesseract"
= "eng"
[]
= true
= 6
= 50.0
# Text chunking (useful for LLM processing)
[]
= 1000
= 200
# PDF-specific options
[]
= true
= true
= []
# Language detection
[]
= true
= 0.8
= false
# Image extraction
[]
= true
= 300
= 4096
= true
Configuration Overrides
Command-line flags override configuration file settings:
# Override OCR setting from config
# Override chunking settings
# Disable cache despite config file
# Enable language detection
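As an illustration of flags taking precedence over a config file, the sketch below loads a config and then overrides two settings on the command line. It assumes the document path is a positional argument to extract; the wrapper name and the file names are hypothetical.

```shell
# Sketch: load settings from a config file, then override OCR and caching
# from the command line. Assumes the document path is positional.
extract_with_overrides() {
  doc=$1
  config=$2
  kreuzberg extract "$doc" --config "$config" --ocr false --no-cache true
}

# Only runs when the binary and both (hypothetical) files are present.
if command -v kreuzberg >/dev/null 2>&1 && [ -f document.pdf ] && [ -f kreuzberg.toml ]; then
  extract_with_overrides document.pdf kreuzberg.toml
fi
```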
Command Reference
extract
Extract text, tables, and metadata from a document.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --mime-type <TYPE>: MIME type hint (auto-detected if not provided)
- --format <FORMAT>: Output format (text or json), default: text
- --ocr <true|false>: Enable/disable OCR
- --force-ocr <true|false>: Force OCR even if text extraction succeeds
- --no-cache <true|false>: Disable result caching
- --chunk <true|false>: Enable text chunking
- --chunk-size <SIZE>: Chunk size in characters (default: 1000)
- --chunk-overlap <SIZE>: Overlap between chunks (default: 200)
- --quality <true|false>: Enable quality processing
- --detect-language <true|false>: Enable language detection
Examples:
# Simple extraction
# With configuration and JSON output
# With chunking for LLM processing
# With OCR for scanned document
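A minimal sketch of the chunking case, built only from the options listed above. It assumes the document path is passed as a positional argument; extract_chunks is a hypothetical wrapper name.

```shell
# Sketch: extract as JSON with chunking enabled.
# $1 = document path, $2 = chunk size, $3 = chunk overlap (defaults match the
# documented defaults of 1000 and 200).
extract_chunks() {
  kreuzberg extract "$1" --format json --chunk true \
    --chunk-size "${2:-1000}" --chunk-overlap "${3:-200}"
}

# Only runs when the binary and a (hypothetical) input file are present.
if command -v kreuzberg >/dev/null 2>&1 && [ -f document.pdf ]; then
  extract_chunks document.pdf 500 100
fi
```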
batch
Process multiple documents in parallel.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --format <FORMAT>: Output format (text or json), default: json
- --ocr <true|false>: Enable/disable OCR
- --force-ocr <true|false>: Force OCR even if text extraction succeeds
- --no-cache <true|false>: Disable result caching
- --quality <true|false>: Enable quality processing
Examples:
# Batch process multiple files
# With glob patterns
# With custom configuration
# With OCR
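For the glob case, one way to guard against an empty match before invoking batch is sketched below. It assumes batch accepts several document paths as positional arguments; batch_pdfs is a hypothetical helper.

```shell
# Sketch: pass every PDF in a directory to a single batch invocation.
# Assumes batch takes multiple positional paths.
batch_pdfs() {
  dir=$1
  set -- "$dir"/*.pdf
  # An unmatched glob stays literal, so check that the first entry exists.
  if [ ! -e "$1" ]; then
    echo "no PDF files in $dir" >&2
    return 1
  fi
  kreuzberg batch "$@" --format json
}

# Only runs when the binary is installed and ./docs exists.
if command -v kreuzberg >/dev/null 2>&1 && [ -d docs ]; then
  batch_pdfs docs
fi
```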
detect
Identify the MIME type of a file.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# Simple detection
# JSON output
cache
Manage extraction result cache.
Subcommands:
stats
Show cache statistics.
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text
clear
Clear the cache.
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# View cache statistics
# Clear cache with custom directory
# JSON output
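The two subcommands can be combined into a small maintenance step. This is a sketch using only the options listed above; reset_cache is a hypothetical helper name.

```shell
# Sketch: print cache statistics as JSON, then clear the same cache directory.
# Falls back to the documented default directory, .kreuzberg.
reset_cache() {
  cache_dir=${1:-.kreuzberg}
  kreuzberg cache stats --cache-dir "$cache_dir" --format json
  kreuzberg cache clear --cache-dir "$cache_dir"
}

# Only runs when the binary is installed and a cache directory exists here.
if command -v kreuzberg >/dev/null 2>&1 && [ -d .kreuzberg ]; then
  reset_cache
fi
```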
serve (requires api feature)
Start the REST API server.
Options:
- --host <HOST>: Host to bind to (default: 127.0.0.1)
- --port <PORT>: Port to bind to (default: 8000)
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Default: localhost:8000
# Public access on port 3000
# With custom configuration
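A sketch of starting the server with explicit host and port, using the documented defaults as fallbacks. start_api is a hypothetical wrapper; the function is only defined here, because calling it blocks until the server stops.

```shell
# Sketch: start the REST API server (requires a build with the api feature).
# $1 = host (default 127.0.0.1), $2 = port (default 8000).
start_api() {
  host=${1:-127.0.0.1}
  port=${2:-8000}
  kreuzberg serve --host "$host" --port "$port"
}

# Usage (blocks until interrupted):
#   start_api               # localhost:8000
#   start_api 0.0.0.0 3000  # listen on all interfaces, port 3000
```

Binding to 0.0.0.0 makes the server reachable from other machines on the network, so it is usually reserved for deployments behind a firewall or reverse proxy.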
mcp (requires mcp feature)
Start the Model Context Protocol server.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Start MCP server
# With custom configuration
version
Show version information.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# Display version
# JSON output
Output Formats
Text Format
The default human-readable format:
# Output:
# Document content here...
JSON Format
For programmatic integration:
# Output:
# {
# "content": "Document content...",
# "mime_type": "application/pdf",
# "metadata": { "title": "...", "author": "..." },
# "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }]
# }
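The JSON output can be post-processed with standard tools. The sketch below uses jq (a separate tool that must be installed) and the field names from the sample output above; summarize_result is a hypothetical helper.

```shell
# Sketch: read one JSON result on stdin and print
# "<mime_type><TAB><number of tables>".
summarize_result() {
  jq -r '"\(.mime_type)\t\(.tables | length)"'
}

# Demonstration on a hand-written sample shaped like the output above.
if command -v jq >/dev/null 2>&1; then
  printf '%s' '{"content":"Hello","mime_type":"application/pdf","metadata":{},"tables":[]}' \
    | summarize_result
fi
```

With the CLI installed, the same helper could sit at the end of a pipeline whose first stage is an extract call with --format json.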
Supported File Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | 30+ additional formats |
Exit Codes
- 0: Successful execution
- Non-zero: Error occurred (check stderr for details)
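In scripts, the exit code can drive error handling. A minimal sketch, assuming the document path is a positional argument to extract; extract_all and the .txt/.err output naming are hypothetical.

```shell
# Sketch: extract each file, keep going on failure, and return non-zero
# if any extraction failed. stderr of failed runs is kept next to the input.
extract_all() {
  failed=0
  for doc in "$@"; do
    if kreuzberg extract "$doc" > "$doc.txt" 2> "$doc.err"; then
      rm -f "$doc.err"
    else
      rm -f "$doc.txt"
      echo "failed: $doc (see $doc.err)" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Only runs when the binary and a (hypothetical) input file are present.
if command -v kreuzberg >/dev/null 2>&1 && [ -f document.pdf ]; then
  extract_all document.pdf || echo "some extractions failed" >&2
fi
```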
Logging
Control logging verbosity with the RUST_LOG environment variable:
# Show info-level logs (default)
RUST_LOG=info
# Show detailed debug logs
RUST_LOG=debug
# Show only warnings and errors
RUST_LOG=warn
# Show only errors
RUST_LOG=error
# Show logs from specific modules
RUST_LOG=kreuzberg=debug
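RUST_LOG is usually set for a single invocation by prefixing the command. A small sketch of that pattern; with_log_level is a hypothetical helper, and the kreuzberg commands in the usage comments assume the document path is positional.

```shell
# Sketch: run any external command with RUST_LOG set for that invocation only.
# $1 = log filter (for example debug or kreuzberg=debug), rest = the command.
with_log_level() {
  level=$1
  shift
  RUST_LOG=$level "$@"
}

# Demonstration with a child shell that just echoes the variable it received.
with_log_level debug sh -c 'echo "RUST_LOG is $RUST_LOG"'

# Usage with the CLI:
#   with_log_level debug kreuzberg extract document.pdf
#   with_log_level kreuzberg=debug kreuzberg extract document.pdf
```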
Performance Tips
- Use batch processing for multiple files instead of sequential extraction.
- Enable caching to avoid reprocessing the same documents (the cache is enabled by default).
- Use appropriate chunk sizes for LLM processing.
- Tune OCR settings for better performance by adjusting tesseract_config in the configuration file.
- Monitor cache size and clear it when needed.
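When choosing a chunk size, it helps to estimate how many chunks a document will produce. The sketch below models chunking as a fixed-size sliding window over characters; this is an illustrative assumption, not necessarily the splitting strategy Kreuzberg uses, and estimate_chunks is a hypothetical helper.

```shell
# Sketch: estimate the chunk count for a text of $1 characters with chunk
# size $2 (default 1000) and overlap $3 (default 200), assuming a plain
# sliding window that advances by (size - overlap) characters.
estimate_chunks() {
  length=$1
  size=${2:-1000}
  overlap=${3:-200}
  if [ "$overlap" -ge "$size" ]; then
    echo "overlap must be smaller than chunk size" >&2
    return 1
  fi
  if [ "$length" -le 0 ]; then
    echo 0
  elif [ "$length" -le "$size" ]; then
    echo 1
  else
    stride=$((size - overlap))
    # Ceiling of (length - overlap) / stride using integer arithmetic.
    echo $(( (length - overlap + stride - 1) / stride ))
  fi
}

estimate_chunks 10000          # prints 13
estimate_chunks 10000 500 50   # prints 23
```

Smaller chunks or larger overlaps increase the chunk count, and with it the number of downstream LLM calls.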
Features
Default Features
No optional features are enabled by default. The default binary includes core extraction.
Optional Features
- api: Enable the REST API server (kreuzberg serve command)
- mcp: Enable Model Context Protocol server (kreuzberg mcp command)
- all: Enable all features (api + mcp)
Building with Features
# Build with all features
# Build with specific features
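A sketch of the corresponding cargo invocations, using standard cargo flags and the feature names listed above. It assumes the commands run from the workspace root, where the package is addressed as kreuzberg-cli; build_cli is a hypothetical helper and is only defined here, since a real build takes a while.

```shell
# Sketch: release build of the CLI with an optional comma-separated
# feature list, for example "all" or "api,mcp".
build_cli() {
  features=${1:-}
  if [ -n "$features" ]; then
    cargo build --release -p kreuzberg-cli --features "$features"
  else
    cargo build --release -p kreuzberg-cli
  fi
}

# Usage:
#   build_cli           # core extraction only
#   build_cli all       # api + mcp
#   build_cli api,mcp   # explicit feature list
```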
Troubleshooting
File Not Found Error
Ensure the file path is correct and the file is readable:
# Check if file exists
# Try with absolute path
OCR Not Working
Verify Tesseract is installed:
# If not found:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/tesseract-ocr/tesseract
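The check can be scripted so it fails fast before an OCR run. A minimal sketch; check_tesseract is a hypothetical helper built on the standard tesseract --version flag.

```shell
# Sketch: print the Tesseract version line if the binary is on PATH,
# otherwise print a hint and return non-zero.
check_tesseract() {
  if command -v tesseract >/dev/null 2>&1; then
    # Some Tesseract versions print the version to stderr, so merge streams.
    tesseract --version 2>&1 | head -n 1
    return 0
  fi
  echo "tesseract not found on PATH; install it with the commands above" >&2
  return 1
}

check_tesseract || true
```

Running tesseract --list-langs afterwards shows which language packs are installed, which matters when the OCR language in the configuration is not eng.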
Configuration File Not Found
Check that the configuration file has the correct format and location:
# Use explicit path
# Or place kreuzberg.toml in current directory
Out of Memory with Large Files
Use chunking to reduce memory usage:
Cache Directory Permissions
Ensure write access to the cache directory:
# Check permissions
# Or use a custom directory with appropriate permissions
# In config.toml: cache_dir = "/tmp/kreuzberg-cache"
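Before pointing the cache at a custom location, a script can confirm the directory exists and is writable. A minimal sketch; ensure_writable_dir and the example path are hypothetical.

```shell
# Sketch: create the directory if needed and verify write access.
# Prints the directory's permissions to stderr when the check fails.
ensure_writable_dir() {
  dir=$1
  if ! mkdir -p "$dir" 2>/dev/null; then
    echo "cannot create $dir" >&2
    return 1
  fi
  if [ -w "$dir" ]; then
    return 0
  fi
  echo "no write access to $dir" >&2
  ls -ld "$dir" >&2
  return 1
}

ensure_writable_dir "${TMPDIR:-/tmp}/kreuzberg-cache" && echo "cache directory is writable"
```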
Key Files
- src/main.rs: CLI implementation with command definitions and argument parsing
- Cargo.toml: Package metadata and dependencies
Building
Development Build
Release Build
With All Features
Testing
# Run CLI tests
# With logging
RUST_LOG=debug
Performance Characteristics
- Single file extraction: Typically 10-100ms depending on file size and format
- Batch processing: Near-linear scaling with 8 concurrent extractions by default
- OCR processing: 100-500ms per page depending on image quality and language
- Caching: Sub-millisecond retrieval for cached results
References
- Kreuzberg Core: ../kreuzberg/
- Main Documentation: https://docs.kreuzberg.dev
- GitHub Repository: https://github.com/kreuzberg-dev/kreuzberg
- Configuration Guide: See example configuration sections above
Contributing
We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines.
License
MIT