Kreuzberg
High-performance document intelligence library for Rust. Extract text, metadata, and structured information from PDFs, Office documents, images, and 56 formats.
This is the core Rust library that powers the Python, TypeScript, and Ruby bindings.
🚀 Version 4.1.2 Release This is a pre-release version. We invite you to test the library and report any issues you encounter.
Note: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.
Installation
[]
= "4.0"
= { = "1", = ["rt", "macros"] }
PDFium Linking Options
Kreuzberg offers flexible PDFium linking strategies for different deployment scenarios. Note: Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) automatically bundle PDFium—no configuration needed. This section applies only to the Rust crate.
| Strategy | Feature Flag | Description | Use Case |
|---|---|---|---|
| Default (Dynamic) | None | Links to system PDFium at runtime | Development, system package users |
| Static | pdf-static |
Statically links PDFium into binary | Single binary distribution, no runtime dependencies |
| Bundled | pdf-bundled |
Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size |
| System | pdf-system |
Uses system PDFium via pkg-config | Linux distributions with PDFium package |
Example Cargo.toml configurations:
# Default (dynamic linking)
[]
= "4.0"
# Static linking
[]
= { = "4.0", = ["pdf-static"] }
# Bundled in binary
[]
= { = "4.0", = ["pdf-bundled"] }
# System library (requires PDFium installed)
[]
= { = "4.0", = ["pdf-system"] }
For more details on feature flags and configuration options, see the Features documentation.
System Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
# Ubuntu/Debian
# Windows (MSVC)
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
Quick Start
use ;
Async Extraction
use ;
async
Batch Processing
use ;
async
OCR with Table Extraction
use ;
Password-Protected PDFs
use ;
Extract from Bytes
use ;
use fs;
Features
The crate uses feature flags for optional functionality:
[]
= { = "4.0", = ["pdf", "excel", "ocr"] }
Available Features
| Feature | Description | Binary Size |
|---|---|---|
pdf |
PDF extraction via pdfium | +25MB |
excel |
Excel/spreadsheet parsing | +3MB |
office |
DOCX, PPTX extraction | +1MB |
email |
EML, MSG extraction | +500KB |
html |
HTML to markdown | +1MB |
xml |
XML streaming parser | +500KB |
archives |
ZIP, TAR, 7Z extraction | +2MB |
ocr |
OCR with Tesseract | +5MB |
language-detection |
Language detection | +100KB |
chunking |
Text chunking | +200KB |
quality |
Text quality processing | +500KB |
Feature Bundles
= { = "4.0", = ["full"] }
= { = "4.0", = ["server"] }
= { = "4.0", = ["cli"] }
PDF Support and Linking Options
Kreuzberg supports three PDFium linking strategies. Default is bundled-pdfium (best developer experience).
| Strategy | Feature | Use Case | Binary Size | Runtime Deps |
|---|---|---|---|---|
| Bundled (default) | bundled-pdfium |
Development, production | +8-15MB | None |
| Static | static-pdfium |
Docker, musl, standalone binaries | +200MB | None |
| System | system-pdfium |
Package managers, distros | +2MB | libpdfium.so |
Quick Start
# Default - bundled PDFium (recommended)
[]
= "4.0"
# Static linking (Docker, musl)
[]
= { = "4.0", = ["static-pdfium"] }
# System PDFium (package managers)
[]
= { = "4.0", = ["system-pdfium"] }
For detailed information, see the PDFium Linking Guide.
Note: Language bindings (Python, TypeScript, Ruby, Java, Go) automatically bundle PDFium. No configuration needed.
Documentation
API Documentation – Complete API reference with examples
https://docs.kreuzberg.dev – User guide and tutorials
License
MIT License - see LICENSE for details.