Crate doc_loader

§Doc Loader

A toolkit for extracting and processing document content from multiple file formats.

This library provides unified processing for different document types:

  • PDF documents
  • Plain text files
  • JSON documents
  • CSV files
  • DOCX documents

Each processor extracts content and metadata, then emits both in a single shared JSON structure suitable for vector stores and retrieval-augmented generation (RAG) systems.
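As an illustration of how a front end might decide which of the listed formats a file belongs to, here is a minimal sketch that maps a file extension to a document kind. The `SketchKind` enum and `kind_from_path` function are hypothetical names invented for this example; they are not part of the crate, and the real `DocumentType` and `UniversalProcessor` may work differently.

```rust
use std::path::Path;

/// Hypothetical stand-in for the document kinds listed above.
/// The crate's real `DocumentType` may be defined differently.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchKind {
    Pdf,
    Text,
    Json,
    Csv,
    Docx,
}

/// Guess the document kind from the file extension (case-insensitive).
/// Returns `None` when the extension is missing or not recognized.
fn kind_from_path(path: &Path) -> Option<SketchKind> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(SketchKind::Pdf),
        "txt" => Some(SketchKind::Text),
        "json" => Some(SketchKind::Json),
        "csv" => Some(SketchKind::Csv),
        "docx" => Some(SketchKind::Docx),
        _ => None,
    }
}
```

Returning `Option` rather than panicking lets the caller decide how to treat unsupported files, for example by skipping them or reporting an error.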

§Features

  • Universal JSON Output: Consistent format across all document types
  • Intelligent Text Processing: Smart chunking, cleaning, and metadata extraction
  • Modular Architecture: Each document type has its own specialized processor
  • Vector Store Ready: Optimized output for embedding and indexing
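To make the chunking feature concrete, the following is a minimal sketch of fixed-size chunking with overlap, a common way to prepare text for embedding. The `chunk_with_overlap` function and its parameters are assumptions made for this example; it is not the crate's `chunk_text`, whose signature and strategy are not shown on this page.

```rust
/// Split `text` into chunks of at most `size` characters, where each
/// chunk after the first starts with the last `overlap` characters of
/// the previous one. Hypothetical helper, not the crate's `chunk_text`.
fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0, "chunk size must be positive");
    assert!(overlap < size, "overlap must be smaller than chunk size");

    // Work on chars rather than bytes so multi-byte UTF-8 is never split.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence that straddles a boundary is still partly visible in both. A production chunker would more likely split on sentence or paragraph boundaries than on raw character counts.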

§Example

use doc_loader::{UniversalProcessor, ProcessingParams};
 
// Create a processor instance
let processor = UniversalProcessor::new();
let params = ProcessingParams::default();
 
// Get supported extensions
let extensions = UniversalProcessor::supported_extensions();
assert!(!extensions.is_empty());
assert!(extensions.contains(&"pdf"));
 
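// Processing a real file requires an existing document on disk,
// `use std::path::Path;`, and a function that returns `Result` so `?` can propagate errors:
// let result = processor.process_file(Path::new("document.pdf"), Some(params))?;
// println!("Extracted {} chunks", result.chunks.len());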

Re-exports§

pub use error::DocLoaderError;
pub use error::Result;
pub use core::UniversalOutput;
pub use core::DocumentChunk;
pub use core::ChunkMetadata;
pub use core::DocumentMetadata;
pub use core::ProcessingParams;
pub use core::DocumentType;
pub use core::ProcessingInfo;
pub use processors::UniversalProcessor;
pub use processors::DocumentProcessor;
pub use utils::clean_text;
pub use utils::chunk_text;
pub use utils::extract_text_metadata;
pub use utils::detect_language;

Modules§

core
error
processors
utils