Parser Core
The core engine of the parser project, providing functionality for extracting text from various file formats.
Features
- Parse a wide variety of document formats:
  - PDF files (.pdf)
  - Office documents (.docx, .xlsx, .pptx)
  - Plain text files (.txt, .csv, .json)
  - Images with OCR (.png, .jpg, .webp)
- Automatic format detection
- Parallel processing support via Rayon
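To illustrate what parsing several documents in parallel looks like, here is a minimal sketch that uses only the standard library. It is not the crate's implementation: `extract_text` is a hypothetical stand-in for the real parser, and scoped threads from `std::thread` are swapped in for Rayon so the example has no external dependencies.

```rust
use std::thread;

/// Hypothetical stand-in for the crate's parser: decodes the bytes as UTF-8,
/// replacing invalid sequences instead of failing.
fn extract_text(data: &[u8]) -> String {
    String::from_utf8_lossy(data).into_owned()
}

/// Extracts text from every document on its own scoped thread and returns
/// the results in the same order as the input.
fn parse_all_parallel(docs: &[Vec<u8>]) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn one worker per document; collecting forces all spawns to
        // happen before any join.
        let handles: Vec<_> = docs
            .iter()
            .map(|doc| scope.spawn(move || extract_text(doc)))
            .collect();

        // Join in spawn order so output order matches input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("parser thread panicked"))
            .collect()
    })
}

fn main() {
    let docs = vec![b"first document".to_vec(), b"second document".to_vec()];
    for text in parse_all_parallel(&docs) {
        println!("{text}");
    }
}
```

With Rayon, the same shape is usually written as `docs.par_iter().map(...).collect()`, which reuses a thread pool instead of spawning one thread per document.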
Dependencies
This package requires the following system libraries:
- Tesseract OCR - Used for image text extraction
- Leptonica - Image processing library used by Tesseract
- Clang - Required for some build dependencies
Installation on Debian/Ubuntu
Installation on macOS
Installation on Windows
Follow the installation instructions in the Tesseract GitHub repository.
Usage
Add as a dependency in your Cargo.toml:
Basic usage:
use parser_core::parse;
Architecture
The crate is organized around a central parse function that:
- Detects the MIME type of the provided data
- Routes to the appropriate parser module
- Returns the extracted text
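The three steps above can be sketched as follows. This is an illustrative outline, not the crate's actual code: `detect_mime`, `parse_text`, and `ParseError` are hypothetical names, detection is a simple check of leading magic bytes, and only the plain-text route is implemented.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the crate's real error type may differ.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnsupportedFormat(&'static str),
    InvalidText,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::UnsupportedFormat(mime) => write!(f, "no parser for {mime}"),
            ParseError::InvalidText => write!(f, "input is not valid UTF-8"),
        }
    }
}

impl std::error::Error for ParseError {}

/// Step 1: guess a MIME type from the leading "magic" bytes.
fn detect_mime(data: &[u8]) -> &'static str {
    if data.starts_with(b"%PDF-") {
        "application/pdf"
    } else if data.starts_with(b"PK\x03\x04") {
        // DOCX, XLSX and PPTX are all ZIP containers; telling them apart
        // requires looking inside the archive, which this sketch skips.
        "application/zip"
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        "image/png"
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        "image/jpeg"
    } else {
        // Fallback: treat anything unrecognized as plain text.
        "text/plain"
    }
}

/// Plain-text route: the bytes are the text, provided they are valid UTF-8.
fn parse_text(data: &[u8]) -> Result<String, ParseError> {
    String::from_utf8(data.to_vec()).map_err(|_| ParseError::InvalidText)
}

/// Steps 2 and 3: route to the matching parser and return the extracted text.
fn parse(data: &[u8]) -> Result<String, ParseError> {
    match detect_mime(data) {
        "text/plain" => parse_text(data),
        // The other formats would each dispatch to their own parser module.
        other => Err(ParseError::UnsupportedFormat(other)),
    }
}

fn main() {
    match parse(b"hello, world") {
        Ok(text) => println!("extracted: {text}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```

Keeping detection separate from routing means a new format needs only one more signature check and one more `match` arm.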
Each parser is implemented in its own module:
- docx.rs - Microsoft Word documents
- pdf.rs - PDF documents
- xlsx.rs - Microsoft Excel spreadsheets
- pptx.rs - Microsoft PowerPoint presentations
- text.rs - Plain text formats, including CSV and JSON
- image.rs - Image formats using OCR
Development
Testing
Run tests with:
Benchmarking
Benchmark sequential vs. parallel parsing: