kreuzberg-cli
Command-line interface for the Kreuzberg document intelligence library.
Overview
This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface.
The CLI supports 56 file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents.
Architecture
Binary Structure
Kreuzberg Core Library (crates/kreuzberg)
↓
Kreuzberg CLI (crates/kreuzberg-cli) ← This crate
↓
Command-line interface with configuration and caching
Features
- Extract Command: Extract text, tables, and metadata from single documents
- Batch Command: Process multiple documents in parallel with optimized concurrency
- Detect Command: Identify MIME type of any file with magic byte analysis
- Cache Commands: Manage extraction result cache (stats, clear)
- Serve Command (requires api feature): Start REST API server for remote document processing
- MCP Command (requires mcp feature): Start Model Context Protocol server for AI integration
- Version Command: Display version information in text or JSON format
- Configuration: TOML, YAML, or JSON config files with auto-discovery
Platform Support
The CLI is tested and officially supported on:
- ✅ Linux x86_64
- ✅ Linux aarch64 (ARM64)
- ✅ macOS aarch64 (Apple Silicon)
- ✅ Windows x86_64
All platforms receive precompiled binaries through GitHub releases and are tested in continuous integration.
Installation
From Source
Or via the workspace:
Platform-Specific Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
# Ubuntu/Debian
# Windows (MSVC)
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
OCR Support (Optional)
To enable optical character recognition for scanned documents:
- macOS: brew install tesseract
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- Windows: Download from https://github.com/tesseract-ocr/tesseract
Legacy Office Format Support (Optional)
For .doc and .ppt file extraction:
- macOS: brew install libreoffice
- Ubuntu/Debian: sudo apt-get install libreoffice
Quick Start
The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.
Basic Text Extraction
# Extract text from a PDF
# Extract with JSON output
Extract with OCR
# Enable OCR for scanned documents
# Force OCR even if text extraction succeeds
Batch Processing
# Process multiple documents in parallel
# Process with custom configuration
MIME Type Detection
# Detect file type
# JSON output
Cache Management
# View cache statistics
# Clear the cache
# Custom cache directory
API Server (with api feature)
# Start API server on localhost:8000
# Custom host and port
# With configuration file
MCP Server (with mcp feature)
# Start Model Context Protocol server
# With configuration file
Configuration
The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be:
- Explicit: Passed via --config /path/to/config.{toml,yaml,json}
- Auto-discovered: Searches for kreuzberg.{toml,yaml,json} in the current and parent directories
- Default: Uses built-in defaults if no config file is found
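The auto-discovery rule can be pictured as a walk up the directory tree. The shell sketch below is only an illustration, not the CLI's actual implementation: the function name find_kreuzberg_config is hypothetical, and the order in which the three extensions are tried is an assumption.

```shell
# Hypothetical helper: print the first kreuzberg.{toml,yaml,json} found in
# the given directory (default: current directory) or any of its parents.
# Returns 1 if no config file exists anywhere up to the filesystem root.
find_kreuzberg_config() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    # Assumed precedence: toml, then yaml, then json.
    for ext in toml yaml json; do
      if [ -f "$dir/kreuzberg.$ext" ]; then
        printf '%s\n' "$dir/kreuzberg.$ext"
        return 0
      fi
    done
    # Stop once the filesystem root has been checked.
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}
```

Running find_kreuzberg_config from a project subdirectory shows which file a nearest-first search would pick up, which can help when an unexpected configuration seems to be in effect.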
Example Configuration (TOML)
# Basic extraction settings
= true
= true
= false
# OCR configuration
[]
= "tesseract"
= "eng"
[]
= true
= 6
= 50.0
# Text chunking (useful for LLM processing)
[]
= 1000
= 200
# PDF-specific options
[]
= true
= true
= []
# Language detection
[]
= true
= 0.8
= false
# Image extraction
[]
= true
= 300
= 4096
= true
Configuration Overrides
Command-line flags override configuration file settings:
# Override OCR setting from config
# Override chunking settings
# Disable cache despite config file
# Enable language detection
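As a sketch of how an override combines with a config file, the function below passes a config file and then overrides one of its settings on the command line. It uses only the --config and --ocr flags listed in the Command Reference; passing the input document as a positional argument is an assumption, and the function name is hypothetical.

```shell
# Hypothetical wrapper: extract one document using a shared config file,
# but turn OCR off for this run regardless of what the file says.
# Assumes the document path is a positional argument of `extract`.
# Usage: extract_without_ocr <document> <config-file>
extract_without_ocr() {
  kreuzberg extract "$1" --config "$2" --ocr false
}
```

For example, extract_without_ocr report.pdf kreuzberg.toml would keep every other setting from kreuzberg.toml and change only the OCR behavior.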
Command Reference
extract
Extract text, tables, and metadata from a document.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --mime-type <TYPE>: MIME type hint (auto-detected if not provided)
- --format <FORMAT>: Output format (text or json), default: text
- --ocr <true|false>: Enable/disable OCR
- --force-ocr <true|false>: Force OCR even if text extraction succeeds
- --no-cache <true|false>: Disable result caching
- --chunk <true|false>: Enable text chunking
- --chunk-size <SIZE>: Chunk size in characters (default: 1000)
- --chunk-overlap <SIZE>: Overlap between chunks in characters (default: 200)
- --quality <true|false>: Enable quality processing
- --detect-language <true|false>: Enable language detection
Examples:
# Simple extraction
# With configuration and JSON output
# With chunking for LLM processing
# With OCR for scanned document
batch
Process multiple documents in parallel.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
- --format <FORMAT>: Output format (text or json), default: json
- --ocr <true|false>: Enable/disable OCR
- --force-ocr <true|false>: Force OCR even if text extraction succeeds
- --no-cache <true|false>: Disable result caching
- --quality <true|false>: Enable quality processing
Examples:
# Batch process multiple files
# With glob patterns
# With custom configuration
# With OCR
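A minimal sketch of the glob-pattern case, assuming batch accepts the input files as positional arguments (this is not confirmed by the options list). The function name batch_pdfs is hypothetical; --format json comes from the options above.

```shell
# Hypothetical wrapper: run `kreuzberg batch` over every PDF in a directory.
# The shell expands the glob, so file names containing spaces stay intact.
# Assumes input files are positional arguments of `batch`.
batch_pdfs() {
  dir=$1
  set -- "$dir"/*.pdf
  # If the glob matched nothing, "$1" is the unexpanded pattern.
  if [ ! -e "$1" ]; then
    echo "no PDF files in $dir" >&2
    return 1
  fi
  kreuzberg batch "$@" --format json
}
```

Checking for an empty match before invoking the CLI avoids passing a literal *.pdf pattern as a file name.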
detect
Identify the MIME type of a file.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# Simple detection
# JSON output
cache
Manage extraction result cache.
Subcommands:
stats
Show cache statistics.
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text
clear
Clear the cache.
Options:
- --cache-dir <DIR>: Cache directory (default: .kreuzberg in current directory)
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# View cache statistics
# Clear cache with custom directory
# JSON output
serve (requires api feature)
Start the REST API server.
Options:
- --host <HOST>: Host to bind to (default: 127.0.0.1)
- --port <PORT>: Port to bind to (default: 8000)
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Default: localhost:8000
# Public access on port 3000
# With custom configuration
mcp (requires mcp feature)
Start the Model Context Protocol server.
Options:
- --config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Start MCP server
# With custom configuration
version
Show version information.
Options:
- --format <FORMAT>: Output format (text or json), default: text
Examples:
# Display version
# JSON output
Output Formats
Text Format
The default human-readable format:
# Output:
# Document content here...
JSON Format
For programmatic integration:
# Output:
# {
# "content": "Document content...",
# "mime_type": "application/pdf",
# "metadata": { "title": "...", "author": "..." },
# "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }]
# }
Supported File Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | 30+ additional formats |
Exit Codes
- 0: Successful execution
- Non-zero: Error occurred (check stderr for details)
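In scripts, the exit code can be checked directly. The sketch below is illustrative only: the function name extract_checked is hypothetical, and it assumes the document path is a positional argument and that extracted text is written to stdout.

```shell
# Hypothetical helper: extract "$1" into the file "$2" and report failures.
# Assumes the document path is positional and the result goes to stdout.
extract_checked() {
  if kreuzberg extract "$1" > "$2"; then
    echo "ok: $1"
  else
    # In the else branch, $? still holds the exit code of the extraction.
    status=$?
    echo "failed with exit code $status: $1" >&2
    return "$status"
  fi
}
```

Because the helper returns the original exit code, callers can chain it with && or use it under set -e without losing the failure.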
Logging
Control logging verbosity with the RUST_LOG environment variable:
# Show info-level logs (default)
RUST_LOG=info
# Show detailed debug logs
RUST_LOG=debug
# Show only warnings and errors
RUST_LOG=warn
# Show only errors
RUST_LOG=error
# Show logs from specific modules
RUST_LOG=kreuzberg=debug
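RUST_LOG is read from the process environment, so it can be set for a single invocation by prefixing the command. A minimal sketch, assuming the document path is a positional argument of extract; the function name extract_debug is hypothetical.

```shell
# Hypothetical wrapper: run one extraction with debug logging enabled,
# without changing RUST_LOG for the rest of the shell session.
# Assumes the document path is a positional argument of `extract`.
extract_debug() {
  RUST_LOG=debug kreuzberg extract "$1"
}
```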
Performance Tips
- Use batch processing for multiple files instead of sequential extraction.
- Enable caching to avoid reprocessing the same documents. The cache is enabled by default.
- Use appropriate chunk sizes for LLM processing.
- Tune OCR settings for better performance by adjusting tesseract_config in the configuration file.
- Monitor cache size and clear the cache when needed.
Cargo Features
Default Features
No optional features are enabled by default. The default binary includes core extraction only.
Optional Features
- api: Enable the REST API server (kreuzberg serve command)
- mcp: Enable the Model Context Protocol server (kreuzberg mcp command)
- all: Enable all features (api + mcp)
Building with Features
# Build with all features
# Build with specific features
Troubleshooting
File Not Found Error
Ensure the file path is correct and the file is readable:
# Check if file exists
# Try with absolute path
OCR Not Working
Verify Tesseract is installed:
# If not found:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/tesseract-ocr/tesseract
Configuration File Not Found
Check that the configuration file has the correct format and location:
# Use explicit path
# Or place kreuzberg.toml in current directory
Out of Memory with Large Files
Use chunking to reduce memory usage:
Cache Directory Permissions
Ensure write access to the cache directory:
# Check permissions
# Or use a custom directory with appropriate permissions
# In config.toml: cache_dir = "/tmp/kreuzberg-cache"
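A quick way to confirm write access is to test the directory before running the CLI. The helper below is a sketch: the function name check_cache_dir is hypothetical, and note that it creates the directory if it does not exist yet.

```shell
# Hypothetical helper: make sure the cache directory exists and is writable.
# Defaults to .kreuzberg in the current directory, the documented default.
# Side effect: creates the directory when it is missing.
check_cache_dir() {
  dir=${1:-.kreuzberg}
  if mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]; then
    echo "cache directory is writable: $dir"
  else
    echo "cannot write to cache directory: $dir" >&2
    return 1
  fi
}
```

If the check fails for the default location, a directory that passes it can be supplied through the --cache-dir option of the cache subcommands.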
Key Files
- src/main.rs: CLI implementation with command definitions and argument parsing
- Cargo.toml: Package metadata and dependencies
Building
Development Build
Release Build
With All Features
Testing
# Run CLI tests
# With logging
RUST_LOG=debug
Performance Characteristics
- Single file extraction: Typically 10-100ms depending on file size and format
- Batch processing: Near-linear scaling with 8 concurrent extractions by default
- OCR processing: 100-500ms per page depending on image quality and language
- Caching: Sub-millisecond retrieval for cached results
References
- Kreuzberg Core: ../kreuzberg/
- Main Documentation: https://docs.kreuzberg.dev
- GitHub Repository: https://github.com/kreuzberg-dev/kreuzberg
- Configuration Guide: See example configuration sections above
Contributing
We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines.
License
MIT