Crate kreuzberg

Kreuzberg - High-Performance Document Intelligence Library

Kreuzberg is a Rust-first document extraction library with language-agnostic plugin support. It provides fast, accurate extraction from PDFs, images, Office documents, emails, and more.

Quick Start

use kreuzberg::{extract_file_sync, ExtractionConfig, KreuzbergError};

fn main() -> Result<(), KreuzbergError> {
    // Extract content from a file using the default configuration
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Extracted: {}", result.content);
    Ok(())
}

Architecture

  • Core Module (core): Main extraction orchestration, MIME detection, config loading
  • Plugin System: Language-agnostic plugin architecture
  • Extractors: Format-specific extraction (PDF, images, Office docs, email, etc.)
  • OCR: Multiple OCR backend support (Tesseract, EasyOCR, PaddleOCR)
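
To make the MIME detection step listed under the core module more concrete, here is a minimal sketch of extension-based detection. The helper `guess_mime_from_extension` and its lookup table are hypothetical and are not part of the kreuzberg API. The strings are standard IANA media types, and this page does not show the actual values of the crate's `*_MIME_TYPE` constants. In real code, prefer the re-exported `detect_mime_type` family of functions.

```rust
use std::path::Path;

/// Hypothetical helper (not part of kreuzberg): maps a file extension to a
/// MIME type using a small, illustrative table.
fn guess_mime_from_extension(path: &str) -> Option<&'static str> {
    // Take the text after the final dot and normalise its case.
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "txt" => Some("text/plain"),
        "html" | "htm" => Some("text/html"),
        "json" => Some("application/json"),
        "md" => Some("text/markdown"),
        "docx" => Some(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        ),
        // Unknown extensions are reported as "no guess" rather than an error.
        _ => None,
    }
}
```

A production detector would likely also inspect file contents, which is presumably what `detect_mime_type_from_bytes` is for. This sketch only covers the extension lookup.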

Features

  • Fast parallel processing with async/await
  • Priority-based extractor selection
  • Comprehensive MIME type detection (118+ file extensions)
  • Configurable caching and quality processing
  • Cross-language plugin support (Python, Node.js planned)
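
Priority-based extractor selection can be pictured with the sketch below. It is an illustration under stated assumptions, not the crate's implementation. The `Extractor` trait, `SimpleExtractor`, `select_extractor`, and `sample_registry` are all hypothetical names, and the rule "the highest priority wins" is assumed rather than taken from the kreuzberg source.

```rust
// Hypothetical sketch: none of these names come from the kreuzberg API.

/// A minimal stand-in for a format-specific extractor.
trait Extractor {
    fn name(&self) -> &str;
    /// Assumption: higher values win when several extractors match.
    fn priority(&self) -> i32;
    fn supports(&self, mime_type: &str) -> bool;
}

struct SimpleExtractor {
    name: &'static str,
    priority: i32,
    mime_types: &'static [&'static str],
}

impl Extractor for SimpleExtractor {
    fn name(&self) -> &str {
        self.name
    }
    fn priority(&self) -> i32 {
        self.priority
    }
    fn supports(&self, mime_type: &str) -> bool {
        self.mime_types.iter().any(|m| *m == mime_type)
    }
}

/// Returns the highest-priority extractor that supports `mime_type`, if any.
/// On equal priorities, `max_by_key` keeps the last matching entry.
fn select_extractor<'a>(
    extractors: &'a [Box<dyn Extractor>],
    mime_type: &str,
) -> Option<&'a dyn Extractor> {
    extractors
        .iter()
        .filter(|e| e.supports(mime_type))
        .max_by_key(|e| e.priority())
        .map(|boxed| &**boxed) // unwrap the Box into a plain trait-object reference
}

/// Two overlapping extractors: both accept PDFs, but with different priorities.
fn sample_registry() -> Vec<Box<dyn Extractor>> {
    vec![
        Box::new(SimpleExtractor {
            name: "generic-text",
            priority: 10,
            mime_types: &["text/plain", "application/pdf"],
        }),
        Box::new(SimpleExtractor {
            name: "pdf",
            priority: 50,
            mime_types: &["application/pdf"],
        }),
    ]
}
```

With this registry, `select_extractor(&sample_registry(), "application/pdf")` picks the `pdf` entry over `generic-text`. A MIME type that no extractor supports yields `None`.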

Re-exports

pub use error::KreuzbergError;
pub use error::Result;
pub use core::extractor::batch_extract_bytes;
pub use core::extractor::batch_extract_file;
pub use core::extractor::extract_bytes;
pub use core::extractor::extract_file;
pub use core::extractor::batch_extract_bytes_sync;
pub use core::extractor::extract_bytes_sync;
pub use core::extractor::batch_extract_file_sync;
pub use core::extractor::extract_file_sync;
pub use core::config::AccelerationConfig;
pub use core::config::ChunkSizing;
pub use core::config::ChunkerType;
pub use core::config::ChunkingConfig;
pub use core::config::EmailConfig;
pub use core::config::EmbeddingConfig;
pub use core::config::EmbeddingModelType;
pub use core::config::ExecutionProviderType;
pub use core::config::ExtractionConfig;
pub use core::config::FileExtractionConfig;
pub use core::config::ImageExtractionConfig;
pub use core::config::LanguageDetectionConfig;
pub use core::config::OcrConfig;
pub use core::config::OutputFormat;
pub use core::config::PageConfig;
pub use core::config::PostProcessorConfig;
pub use core::config::TokenReductionConfig;
pub use core::mime::DOCX_MIME_TYPE;
pub use core::mime::EXCEL_MIME_TYPE;
pub use core::mime::HTML_MIME_TYPE;
pub use core::mime::JSON_MIME_TYPE;
pub use core::mime::MARKDOWN_MIME_TYPE;
pub use core::mime::PDF_MIME_TYPE;
pub use core::mime::PLAIN_TEXT_MIME_TYPE;
pub use core::mime::POWER_POINT_MIME_TYPE;
pub use core::mime::SupportedFormat;
pub use core::mime::XML_MIME_TYPE;
pub use core::mime::detect_mime_type;
pub use core::mime::detect_mime_type_from_bytes;
pub use core::mime::detect_or_validate;
pub use core::mime::get_extensions_for_mime;
pub use core::mime::list_supported_formats;
pub use core::mime::validate_mime_type;
pub use core::formats::KNOWN_FORMATS;
pub use core::formats::is_valid_format_field;
pub use plugins::registry::get_document_extractor_registry;
pub use plugins::registry::get_ocr_backend_registry;
pub use plugins::registry::get_post_processor_registry;
pub use plugins::registry::get_validator_registry;
pub use types::*;

Modules

cache
Generic cache implementation with lock poisoning recovery.
core
Core extraction orchestration module.
error
Error types for Kreuzberg.
extraction
extractors
Built-in document extractors.
model_cache
Generic global model cache for ONNX-based models.
panic_context
plugins
Plugin system for extending Kreuzberg functionality.
rendering
Unified rendering of DocumentStructure to output formats.
text
types
Core types for document extraction.
utils
Text utility functions for quality processing and string manipulation.
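
The `cache` module above is described as recovering from lock poisoning. The sketch below shows one common way to do this with the standard library: take the guard out of the `PoisonError` instead of propagating the panic. `RecoveringCache` and `poison` are hypothetical names, and this is not the `cache` module's actual API or implementation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Hypothetical cache type; not the kreuzberg `cache` module's API.
#[derive(Clone, Default)]
struct RecoveringCache {
    inner: Arc<Mutex<HashMap<String, String>>>,
}

impl RecoveringCache {
    /// Acquires the lock. If an earlier holder panicked, the mutex is
    /// poisoned; the guard is recovered from the error instead of panicking.
    fn lock(&self) -> MutexGuard<'_, HashMap<String, String>> {
        self.inner
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner())
    }

    fn insert(&self, key: &str, value: &str) {
        self.lock().insert(key.to_owned(), value.to_owned());
    }

    fn get(&self, key: &str) -> Option<String> {
        self.lock().get(key).cloned()
    }
}

/// Poisons the mutex by panicking in another thread while it holds the lock.
/// The panic message printed to stderr is expected.
fn poison(cache: &RecoveringCache) {
    let clone = cache.clone();
    let _ = thread::spawn(move || {
        let _guard = clone.inner.lock().unwrap();
        panic!("simulated failure while holding the cache lock");
    })
    .join();
}
```

After `poison(&cache)`, a plain `lock().unwrap()` would panic, while `get` and `insert` on this type keep working. The trade-off is that data written by the panicking thread may be incomplete. That is usually acceptable for a cache, where entries can be recomputed.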