§Doc Loader
A comprehensive toolkit for extracting and processing documentation from multiple file formats.
This library provides unified processing for different document types:
- PDF documents
- Plain text files
- JSON documents
- CSV files
- DOCX documents
Each processor extracts content and metadata, then formats everything into a universal JSON structure ready for vector stores and RAG systems.
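As a rough illustration of how a unified entry point might route a file to a format-specific processor, the sketch below maps a file extension to one of the five formats listed above. The names `SketchFormat` and `format_from_path` are hypothetical and not part of doc_loader, and the `txt` extension for plain text is an assumption; the crate's own `DocumentType` and `UniversalProcessor` may work differently.

```rust
use std::path::Path;

/// Hypothetical stand-in for the five formats listed above
/// (not doc_loader's own `DocumentType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchFormat {
    Pdf,
    Txt,
    Json,
    Csv,
    Docx,
}

/// Guess the format from the file extension, ignoring ASCII case.
/// Returns `None` when the extension is missing or not recognized.
pub fn format_from_path(path: &Path) -> Option<SketchFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(SketchFormat::Pdf),
        "txt" => Some(SketchFormat::Txt), // assumed extension for plain text
        "json" => Some(SketchFormat::Json),
        "csv" => Some(SketchFormat::Csv),
        "docx" => Some(SketchFormat::Docx),
        _ => None,
    }
}
```

A caller could then `match` on the returned variant to pick the processor for that format and treat `None` as an unsupported file.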
§Features
- Universal JSON Output: Consistent format across all document types
- Intelligent Text Processing: Smart chunking, cleaning, and metadata extraction
- Modular Architecture: Each document type has its own specialized processor
- Vector Store Ready: Optimized output for embedding and indexing
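To make the chunking feature more concrete, here is a minimal sketch of fixed-size chunking with overlap, a common way to prepare text for embedding. The function `chunk_with_overlap` and its parameters are hypothetical; this is not the implementation behind the crate's `chunk_text`, whose signature and strategy are not shown in this page.

```rust
/// Split `text` into chunks of at most `chunk_size` characters, where each
/// chunk repeats the last `overlap` characters of the previous one.
/// Hypothetical helper for illustration; not doc_loader's `chunk_text`.
pub fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    // Index by `char` so multi-byte UTF-8 sequences are never split.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

The overlap keeps some shared context between neighboring chunks, so a sentence that straddles a boundary is less likely to be lost to retrieval. Splitting on sentence or paragraph boundaries instead is a reasonable alternative.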
§Example
use doc_loader::{UniversalProcessor, ProcessingParams};
// Create a processor instance
let processor = UniversalProcessor::new();
let params = ProcessingParams::default();
// Get supported extensions
let extensions = UniversalProcessor::supported_extensions();
assert!(!extensions.is_empty());
assert!(extensions.contains(&"pdf"));
// Processing a file (needs a real file on disk and `use std::path::Path;`)
// let result = processor.process_file(Path::new("document.pdf"), Some(params))?;
// println!("Extracted {} chunks", result.chunks.len());
§Re-exports
pub use error::DocLoaderError;
pub use error::Result;
pub use core::UniversalOutput;
pub use core::DocumentChunk;
pub use core::ChunkMetadata;
pub use core::DocumentMetadata;
pub use core::ProcessingParams;
pub use core::DocumentType;
pub use core::ProcessingInfo;
pub use processors::UniversalProcessor;
pub use processors::DocumentProcessor;
pub use utils::clean_text;
pub use utils::chunk_text;
pub use utils::extract_text_metadata;
pub use utils::detect_language;