Parser Core
The core engine of the parser project, providing functionality for extracting text from various file formats.
Features
- Parse a wide variety of document formats:
  - PDF files (.pdf)
  - Office documents (.docx, .xlsx, .pptx)
  - Plain text files (.txt, .csv, .json)
  - Images with OCR (.png, .jpg, .webp)
- Automatic format detection
- Parallel processing support via Rayon
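As a rough illustration of parsing several documents concurrently, the sketch below uses only the standard library's scoped threads as a stand-in; it is not the crate's Rayon-based implementation. The names `extract_text` and `parse_all` are hypothetical, and `extract_text` merely decodes UTF-8 in place of a real parser.

```rust
use std::string::FromUtf8Error;
use std::thread;

/// Hypothetical stand-in for a real parser: decodes the bytes as UTF-8 text.
fn extract_text(data: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(data.to_vec())
}

/// Parses each document on its own scoped thread.
/// Results are returned in the same order as the input slice.
fn parse_all(documents: &[Vec<u8>]) -> Vec<Result<String, FromUtf8Error>> {
    thread::scope(|scope| {
        // Spawn one thread per document; scoped threads may borrow `documents`.
        let handles: Vec<_> = documents
            .iter()
            .map(|doc| scope.spawn(move || extract_text(doc)))
            .collect();

        // Join in spawn order so output order matches input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("parser thread panicked"))
            .collect()
    })
}
```

Spawning one thread per document is only reasonable for small batches. With Rayon, the same shape would typically be written as a parallel iterator over the documents (`par_iter().map(...).collect()`), which schedules the work on a shared thread pool.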
Dependencies
This package requires the following system libraries:
- Tesseract OCR - Used for image text extraction
- Leptonica - Image processing library used by Tesseract
- Clang - Required for some build dependencies
Installation on Debian/Ubuntu
Installation on macOS
Installation on Windows
Follow the installation instructions in the Tesseract GitHub repository.
Usage
Add as a dependency in your Cargo.toml:
Basic usage:
use parse;
Architecture
The crate is organized around a central parse function that:
- Detects the MIME type of the provided data
- Routes to the appropriate parser module
- Returns the extracted text
Each parser is implemented in its own module:
- docx.rs - Microsoft Word documents
- pdf.rs - PDF documents
- xlsx.rs - Microsoft Excel spreadsheets
- pptx.rs - Microsoft PowerPoint presentations
- text.rs - Plain text formats, including CSV and JSON
- image.rs - Image formats using OCR
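The detect, route, and return flow described above can be sketched as follows. This is a minimal, standard-library-only illustration, not the crate's actual code: the `Format` and `ParseError` types, the `detect_format` and `parse_text` helpers, and the magic-byte checks are all assumptions made for the example, and only the plain-text branch is implemented.

```rust
/// Formats recognised by this sketch (a hypothetical subset).
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    Zip, // container format shared by .docx, .xlsx, and .pptx
    Png,
    Text,
}

/// Errors this sketch can return (hypothetical).
#[derive(Debug, PartialEq)]
enum ParseError {
    Unsupported(Format),
    InvalidUtf8,
}

/// Step 1: guess the format from the leading "magic" bytes.
/// Anything unrecognised is treated as plain text.
fn detect_format(data: &[u8]) -> Format {
    if data.starts_with(b"%PDF-") {
        Format::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        Format::Zip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Format::Png
    } else {
        Format::Text
    }
}

/// Plain-text "parser": the bytes must be valid UTF-8.
fn parse_text(data: &[u8]) -> Result<String, ParseError> {
    String::from_utf8(data.to_vec()).map_err(|_| ParseError::InvalidUtf8)
}

/// Steps 2 and 3: route to the matching parser and return the extracted text.
/// Formats without a parser in this sketch are reported as unsupported.
fn parse(data: &[u8]) -> Result<String, ParseError> {
    match detect_format(data) {
        Format::Text => parse_text(data),
        other => Err(ParseError::Unsupported(other)),
    }
}
```

In this sketch, calling `parse(b"hello")` returns the text unchanged, while a buffer starting with `%PDF-` is routed to the unsupported branch. In a full implementation each `match` arm would call into the corresponding module instead.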
Development
Testing
Run tests with:
Benchmarking
Benchmark sequential vs. parallel parsing: