Kreuzberg - High-Performance Document Intelligence Library
Kreuzberg is a Rust-first document extraction library with language-agnostic plugin support. It provides fast, accurate extraction from PDFs, images, Office documents, emails, and more.
§Quick Start
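use kreuzberg::{extract_file_sync, ExtractionConfig, KreuzbergError};
fn main() -> Result<(), KreuzbergError> {
    // Extract content from a file using the default configuration
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    // Print the extracted text
    println!("Extracted: {}", result.content);
    Ok(())
}
§Architecture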
- Core Module (core): Main extraction orchestration, MIME detection, config loading
- Plugin System: Language-agnostic plugin architecture
- Extractors: Format-specific extraction (PDF, images, Office docs, email, etc.)
- OCR: Multiple OCR backend support (Tesseract, EasyOCR, PaddleOCR)
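The MIME detection handled by the core module can be pictured as a lookup from file extension to MIME type. The following is a minimal, standard-library-only sketch of that idea. It is not Kreuzberg's implementation: the table `EXTENSION_TABLE` and the function `guess_mime_from_extension` are hypothetical names introduced here for illustration, and the table covers only a handful of extensions.

```rust
use std::path::Path;

// Hypothetical lookup table for illustration only; it is an assumption,
// not the table Kreuzberg uses, and it lists just a few common extensions.
const EXTENSION_TABLE: &[(&str, &str)] = &[
    ("pdf", "application/pdf"),
    ("html", "text/html"),
    ("htm", "text/html"),
    ("json", "application/json"),
    ("md", "text/markdown"),
    ("txt", "text/plain"),
    ("xml", "application/xml"),
];

/// Guess a MIME type from the file extension (case-insensitive).
/// Returns `None` when the path has no extension or the extension is unknown.
fn guess_mime_from_extension(path: &str) -> Option<&'static str> {
    // `extension()` yields the part after the final dot, without the dot.
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    EXTENSION_TABLE
        .iter()
        .find(|(known, _)| *known == ext)
        .map(|(_, mime)| *mime)
}
```

For real use, the re-exported `detect_mime_type` family of functions is the intended entry point; the sketch only shows the shape of an extension-based lookup.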
§Features
- Fast parallel processing with async/await
- Priority-based extractor selection
- Comprehensive MIME type detection (118+ file extensions)
- Configurable caching and quality processing
- Cross-language plugin support (Python, Node.js planned)
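Priority-based extractor selection can be illustrated with a small sketch: among the extractors that claim a MIME type, pick the one with the highest priority. The code below is an assumption-laden illustration, not Kreuzberg's plugin API. `ExtractorEntry` and `select_extractor` are hypothetical names, and treating a larger number as a higher priority is also an assumption.

```rust
/// Hypothetical extractor descriptor for illustration; this is not
/// Kreuzberg's actual plugin trait or registry entry.
#[derive(Debug, Clone, PartialEq)]
struct ExtractorEntry {
    name: &'static str,
    mime_types: &'static [&'static str],
    // Assumption: a larger value means the extractor is preferred.
    priority: i32,
}

/// Return the highest-priority extractor that supports `mime_type`, if any.
/// On ties, `max_by_key` returns the last of the equally ranked entries.
fn select_extractor<'a>(
    entries: &'a [ExtractorEntry],
    mime_type: &str,
) -> Option<&'a ExtractorEntry> {
    entries
        .iter()
        .filter(|e| e.mime_types.iter().any(|m| *m == mime_type))
        .max_by_key(|e| e.priority)
}
```

With a generic extractor registered at priority 10 for both `text/plain` and `application/pdf`, and a PDF-specific one at priority 50, a lookup for `application/pdf` would return the PDF-specific entry, while `text/plain` would fall back to the generic one.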
Re-exports§
pub use error::KreuzbergError;
pub use error::Result;
pub use core::extractor::batch_extract_bytes;
pub use core::extractor::batch_extract_file;
pub use core::extractor::extract_bytes;
pub use core::extractor::extract_file;
pub use core::extractor::batch_extract_bytes_sync;
pub use core::extractor::extract_bytes_sync;
pub use core::extractor::batch_extract_file_sync;
pub use core::extractor::extract_file_sync;
pub use core::config::ChunkerType;
pub use core::config::ChunkingConfig;
pub use core::config::EmbeddingConfig;
pub use core::config::EmbeddingModelType;
pub use core::config::ExtractionConfig;
pub use core::config::ImageExtractionConfig;
pub use core::config::LanguageDetectionConfig;
pub use core::config::OcrConfig;
pub use core::config::OutputFormat;
pub use core::config::PageConfig;
pub use core::config::PostProcessorConfig;
pub use core::config::TokenReductionConfig;
pub use core::mime::DOCX_MIME_TYPE;
pub use core::mime::EXCEL_MIME_TYPE;
pub use core::mime::HTML_MIME_TYPE;
pub use core::mime::JSON_MIME_TYPE;
pub use core::mime::MARKDOWN_MIME_TYPE;
pub use core::mime::PDF_MIME_TYPE;
pub use core::mime::PLAIN_TEXT_MIME_TYPE;
pub use core::mime::POWER_POINT_MIME_TYPE;
pub use core::mime::XML_MIME_TYPE;
pub use core::mime::detect_mime_type;
pub use core::mime::detect_mime_type_from_bytes;
pub use core::mime::detect_or_validate;
pub use core::mime::get_extensions_for_mime;
pub use core::mime::validate_mime_type;
pub use core::formats::KNOWN_FORMATS;
pub use core::formats::is_valid_format_field;
pub use plugins::registry::get_document_extractor_registry;
pub use plugins::registry::get_ocr_backend_registry;
pub use plugins::registry::get_post_processor_registry;
pub use plugins::registry::get_validator_registry;
pub use types::*;
Modules§
- cache
- Generic cache implementation with lock poisoning recovery.
- core
- Core extraction orchestration module.
- error
- Error types for Kreuzberg.
- extraction
- extractors
- Built-in document extractors.
- panic_context
- plugins
- Plugin system for extending Kreuzberg functionality.
- text
- types
- Core types for document extraction.
- utils
- Text utility functions for quality processing and string manipulation.
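The cache module is described above as recovering from lock poisoning. In Rust, a `std::sync::Mutex` becomes poisoned when a thread panics while holding the lock, and later `lock()` calls return an error that still carries the guard. A minimal sketch of one common recovery pattern follows. `SimpleCache` and its methods are hypothetical names used for illustration and are not taken from Kreuzberg's `cache` module.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, MutexGuard};

/// Hypothetical minimal cache for illustration; not Kreuzberg's `cache` API.
struct SimpleCache {
    entries: Mutex<HashMap<String, String>>,
}

impl SimpleCache {
    fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Acquire the lock. If another thread panicked while holding it,
    /// take the guard out of the `PoisonError` instead of propagating the panic.
    fn lock(&self) -> MutexGuard<'_, HashMap<String, String>> {
        self.entries
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner())
    }

    fn insert(&self, key: &str, value: &str) {
        self.lock().insert(key.to_string(), value.to_string());
    }

    fn get(&self, key: &str) -> Option<String> {
        self.lock().get(key).cloned()
    }
}
```

Recovering the guard this way keeps the cache usable after a panic, at the cost of accepting whatever state the panicking thread left behind. That trade-off is usually acceptable for a cache, where a stale or missing entry only costs a recomputation.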