Crate kreuzberg

Kreuzberg - High-Performance Document Intelligence Library

Kreuzberg is a Rust-first document extraction library with language-agnostic plugin support. It provides fast, accurate extraction from PDFs, images, Office documents, emails, and more.

§Quick Start

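use kreuzberg::{extract_file_sync, ExtractionConfig};

// `?` needs a Result-returning function, so the calls are wrapped in main
fn main() -> kreuzberg::Result<()> {
    // Extract content from a file
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Extracted: {}", result.content);
    Ok(())
}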

§Architecture

  • Core Module (core): Main extraction orchestration, MIME detection, config loading
  • Plugin System: Language-agnostic plugin architecture
  • Extractors: Format-specific extraction (PDF, images, Office docs, email, etc.)
  • OCR: Multiple OCR backend support (Tesseract, EasyOCR, PaddleOCR)
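The MIME detection step listed under the core module can be pictured as a lookup from file extension to MIME type. The following is a minimal, standard-library-only sketch of that idea, not Kreuzberg's implementation: the function `mime_from_extension` and the three constants are hypothetical names introduced for illustration, and the crate's own `detect_mime_type` may behave differently.

```rust
use std::path::Path;

// Hypothetical constants for this sketch only. The crate re-exports its own
// constants (for example PDF_MIME_TYPE); their values are not shown here.
const PDF: &str = "application/pdf";
const HTML: &str = "text/html";
const PLAIN_TEXT: &str = "text/plain";

/// Map a file path to a MIME type using its extension, case-insensitively.
/// Returns `None` when the extension is missing or unknown.
fn mime_from_extension(path: &str) -> Option<&'static str> {
    // `extension()` yields the text after the final dot, if there is one.
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(PDF),
        "html" | "htm" => Some(HTML),
        "txt" => Some(PLAIN_TEXT),
        _ => None,
    }
}
```

In this sketch, `mime_from_extension("report.PDF")` yields `Some("application/pdf")`, while a path without an extension yields `None`, leaving the caller to fall back to another strategy such as inspecting the file's bytes.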

§Features

  • Fast parallel processing with async/await
  • Priority-based extractor selection
  • Comprehensive MIME type detection (118+ file extensions)
  • Configurable caching and quality processing
  • Cross-language plugin support (Python, Node.js planned)
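Priority-based extractor selection can be sketched as follows: among all registered extractors that support a given MIME type, pick the one with the highest priority. This is an illustrative sketch under stated assumptions, not the crate's plugin API. The `Extractor` trait, `SimpleExtractor`, and `select_extractor` are hypothetical names, and the convention that a larger number means higher priority is assumed here.

```rust
/// Hypothetical extractor interface for this sketch (not the crate's trait).
trait Extractor {
    fn name(&self) -> &str;
    /// Assumption: a larger value means a higher priority.
    fn priority(&self) -> i32;
    fn supports(&self, mime_type: &str) -> bool;
}

/// A trivial extractor description used only to exercise the selection logic.
struct SimpleExtractor {
    name: &'static str,
    priority: i32,
    mime_types: &'static [&'static str],
}

impl Extractor for SimpleExtractor {
    fn name(&self) -> &str {
        self.name
    }
    fn priority(&self) -> i32 {
        self.priority
    }
    fn supports(&self, mime_type: &str) -> bool {
        self.mime_types.iter().any(|m| *m == mime_type)
    }
}

/// Return the highest-priority extractor that supports `mime_type`,
/// or `None` when no registered extractor supports it.
fn select_extractor<'a>(
    extractors: &'a [Box<dyn Extractor>],
    mime_type: &str,
) -> Option<&'a dyn Extractor> {
    extractors
        .iter()
        .filter(|e| e.supports(mime_type))
        .max_by_key(|e| e.priority())
        // Unbox: &Box<dyn Extractor> -> &dyn Extractor
        .map(|e| &**e)
}
```

With two extractors registered for `application/pdf` at priorities 10 and 50, `select_extractor` returns the one at priority 50. For equal priorities, `max_by_key` returns the last maximal element, so registration order acts as the tie-breaker in this sketch.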

Re-exports§

pub use error::KreuzbergError;
pub use error::Result;
pub use core::extractor::batch_extract_bytes;
pub use core::extractor::batch_extract_file;
pub use core::extractor::extract_bytes;
pub use core::extractor::extract_file;
pub use core::extractor::batch_extract_bytes_sync;
pub use core::extractor::extract_bytes_sync;
pub use core::extractor::batch_extract_file_sync;
pub use core::extractor::extract_file_sync;
pub use core::config::ChunkerType;
pub use core::config::ChunkingConfig;
pub use core::config::EmbeddingConfig;
pub use core::config::EmbeddingModelType;
pub use core::config::ExtractionConfig;
pub use core::config::ImageExtractionConfig;
pub use core::config::LanguageDetectionConfig;
pub use core::config::OcrConfig;
pub use core::config::OutputFormat;
pub use core::config::PageConfig;
pub use core::config::PostProcessorConfig;
pub use core::config::TokenReductionConfig;
pub use core::mime::DOCX_MIME_TYPE;
pub use core::mime::EXCEL_MIME_TYPE;
pub use core::mime::HTML_MIME_TYPE;
pub use core::mime::JSON_MIME_TYPE;
pub use core::mime::MARKDOWN_MIME_TYPE;
pub use core::mime::PDF_MIME_TYPE;
pub use core::mime::PLAIN_TEXT_MIME_TYPE;
pub use core::mime::POWER_POINT_MIME_TYPE;
pub use core::mime::XML_MIME_TYPE;
pub use core::mime::detect_mime_type;
pub use core::mime::detect_mime_type_from_bytes;
pub use core::mime::detect_or_validate;
pub use core::mime::get_extensions_for_mime;
pub use core::mime::validate_mime_type;
pub use core::formats::KNOWN_FORMATS;
pub use core::formats::is_valid_format_field;
pub use plugins::registry::get_document_extractor_registry;
pub use plugins::registry::get_ocr_backend_registry;
pub use plugins::registry::get_post_processor_registry;
pub use plugins::registry::get_validator_registry;
pub use types::*;

Modules§

  • cache: Generic cache implementation with lock poisoning recovery.
  • core: Core extraction orchestration module.
  • error: Error types for Kreuzberg.
  • extraction
  • extractors: Built-in document extractors.
  • panic_context
  • plugins: Plugin system for extending Kreuzberg functionality.
  • text
  • types: Core types for document extraction.
  • utils: Text utility functions for quality processing and string manipulation.