Parser Core

Parser Core is the core engine of the parser project. It extracts text from a variety of file formats.

Features

- Parse a wide variety of document formats:
  - PDF files (.pdf)
  - Office documents (.docx, .xlsx, .pptx)
  - Plain text files (.txt, .csv, .json)
  - Images with OCR (.png, .jpg, .webp)
- Automatic format detection
- Parallel processing support via Rayon

Dependencies

This package requires the following system libraries:

Tesseract OCR - Used for image text extraction
Leptonica - Image processing library used by Tesseract
Clang - Required for some build dependencies

Installation on Debian/Ubuntu

sudo apt install libtesseract-dev libleptonica-dev libclang-dev

Installation on macOS

brew install tesseract

Installation on Windows

Follow the installation instructions in the Tesseract GitHub repository.

Usage

Add the crate as a dependency by running this command in your project directory (it updates Cargo.toml for you):

cargo add parser-core

Basic usage:

use parser_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read a file
    let data = std::fs::read("document.pdf")?;
    
    // Parse the document
    let text = parse(&data)?;
    
    println!("Extracted text: {}", text);
    
    Ok(())
}

Architecture

The crate is organized around a central parse function that:

Detects the MIME type of the provided data
Routes to the appropriate parser module
Returns the extracted text

Each parser is implemented in its own module:

docx.rs - Microsoft Word documents
pdf.rs - PDF documents
xlsx.rs - Microsoft Excel spreadsheets
pptx.rs - Microsoft PowerPoint presentations
text.rs - Plain text formats, including CSV and JSON
image.rs - Image formats using OCR
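
The detect-then-route flow described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the `Format` enum, `detect_format`, and `parse_sketch` are hypothetical names, and the magic-byte checks stand in for whatever MIME detection the crate actually performs. Only the plain text branch returns content; the other branches simply show where a per-format module would be called.

```rust
// Illustrative sketch only: the names and detection rules below are
// assumptions, not parser-core's actual internals.

/// Hypothetical set of formats the dispatcher distinguishes.
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    OfficeZip, // .docx, .xlsx and .pptx are all ZIP containers
    Png,
    Jpeg,
    Text,
    Unknown,
}

/// Step 1: guess the format from the leading "magic" bytes.
fn detect_format(data: &[u8]) -> Format {
    if data.starts_with(b"%PDF-") {
        Format::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        Format::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Format::Png
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Format::Jpeg
    } else if !data.is_empty() && std::str::from_utf8(data).is_ok() {
        // No known signature, but valid UTF-8: treat as plain text.
        Format::Text
    } else {
        Format::Unknown
    }
}

/// Steps 2 and 3: route to a per-format parser and return the text.
fn parse_sketch(data: &[u8]) -> Result<String, String> {
    match detect_format(data) {
        Format::Text => Ok(String::from_utf8_lossy(data).into_owned()),
        Format::Unknown => Err("unsupported or unrecognized format".to_string()),
        // A real implementation would call the matching module here.
        other => Err(format!("{:?} parser not implemented in this sketch", other)),
    }
}
```

Keeping detection separate from routing means a new format only needs one extra signature check and one extra match arm, which fits the one-module-per-format layout listed above.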

Development

Testing

Run tests with:

cargo test

Benchmarking

Benchmark sequential vs. parallel parsing:

cargo bench
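
As a rough picture of what a sequential vs. parallel comparison exercises, the sketch below processes a batch of in-memory documents both ways. It is not the crate's benchmark code: `extract_text` is a hypothetical stand-in for a real parser call, and `std::thread::scope` is used in place of Rayon so the example compiles without external dependencies.

```rust
use std::thread;

/// Hypothetical stand-in for a real parser call; it just upper-cases the input.
fn extract_text(data: &[u8]) -> String {
    String::from_utf8_lossy(data).to_uppercase()
}

/// Process documents one after another on the calling thread.
fn parse_sequential(docs: &[Vec<u8>]) -> Vec<String> {
    docs.iter().map(|d| extract_text(d)).collect()
}

/// Process each document on its own scoped thread, preserving input order.
fn parse_parallel(docs: &[Vec<u8>]) -> Vec<String> {
    thread::scope(|s| {
        // Spawn one worker per document; scoped threads may borrow `docs`.
        let handles: Vec<_> = docs
            .iter()
            .map(|d| s.spawn(move || extract_text(d)))
            .collect();
        // Join in spawn order so results line up with the inputs.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<String>>()
    })
}
```

Spawning one thread per document is only sensible for small batches. With Rayon, the parallel variant would typically be written with `par_iter()` from `rayon::prelude`, which schedules the work on a shared thread pool instead.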

parser-core 0.1.3