parser-core 0.1.0

A library for extracting text from various file formats, including PDF, DOCX, XLSX, PPTX, and images (via OCR).

Parser Core

The core engine of the parser project, providing functionality for extracting text from various file formats.

Features

  • Parse a wide variety of document formats:
    • PDF files (.pdf)
    • Office documents (.docx, .xlsx, .pptx)
    • Plain text files (.txt, .csv, .json)
    • Images with OCR (.png, .jpg, .webp)
  • Automatic format detection
  • Parallel processing support via Rayon

Dependencies

This crate requires the following system dependencies:

  • Tesseract OCR - Used for image text extraction
  • Leptonica - Image processing library used by Tesseract
  • Clang - Required for some build dependencies

Installation on Debian/Ubuntu

sudo apt install libtesseract-dev libleptonica-dev libclang-dev

Installation on macOS

brew install tesseract

Installation on Windows

Follow the installation instructions in the Tesseract GitHub repository.

Usage

Add the crate to your Cargo.toml by running:

cargo add parser-core

Basic usage:

use parser_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read a file
    let data = std::fs::read("document.pdf")?;
    
    // Parse the document
    let text = parse(&data)?;
    
    println!("Extracted text: {}", text);
    
    Ok(())
}

Architecture

The crate is organized around a central parse function that:

  1. Detects the MIME type of the provided data
  2. Routes to the appropriate parser module
  3. Returns the extracted text
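
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual implementation: the `Format` enum and the `detect_format` and `route` functions are hypothetical names, and detection here relies on well-known file signatures (magic bytes) rather than whatever MIME detection the crate uses internally.

```rust
/// Formats this sketch distinguishes (hypothetical enum, not the crate's API).
#[derive(Debug, PartialEq)]
enum Format {
    Pdf,
    OfficeZip, // .docx, .xlsx and .pptx are all ZIP containers
    Png,
    Jpeg,
    Webp,
    Text,
    Unknown,
}

/// Step 1: guess the format from the leading bytes of the input.
fn detect_format(data: &[u8]) -> Format {
    if data.starts_with(b"%PDF-") {
        Format::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        Format::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Format::Png
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Format::Jpeg
    } else if data.len() >= 12 && &data[..4] == b"RIFF" && &data[8..12] == b"WEBP" {
        Format::Webp
    } else if !data.is_empty() && std::str::from_utf8(data).is_ok() {
        // Assumption: any non-empty valid UTF-8 input is treated as plain text.
        Format::Text
    } else {
        Format::Unknown
    }
}

/// Steps 2 and 3: route to a parser and return the extracted text.
/// Only the plain-text branch is implemented; the others are placeholders.
fn route(data: &[u8]) -> Result<String, String> {
    match detect_format(data) {
        Format::Text => Ok(String::from_utf8_lossy(data).into_owned()),
        Format::Unknown => Err("unsupported format".to_string()),
        other => Ok(format!("<would dispatch to the {:?} parser>", other)),
    }
}

fn main() {
    println!("{:?}", detect_format(b"%PDF-1.7\n"));
    println!("{:?}", route(b"hello, world"));
}
```

Note that a ZIP signature alone cannot tell .docx, .xlsx, and .pptx apart; a real implementation would need to inspect the archive contents to pick the right parser module.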

Each parser is implemented in its own module:

  • docx.rs - Microsoft Word documents
  • pdf.rs - PDF documents
  • xlsx.rs - Microsoft Excel spreadsheets
  • pptx.rs - Microsoft PowerPoint presentations
  • text.rs - Plain text formats, including CSV and JSON
  • image.rs - Image formats using OCR

Development

Testing

Run tests with:

cargo test

Benchmarking

Benchmark sequential vs. parallel parsing:

cargo bench
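
To make the comparison concrete, the sketch below parses a batch of in-memory documents first sequentially and then in parallel, and checks that both paths return the same results in the same order. It is an illustration under assumptions, not the crate's benchmark code: `extract_stub` is a hypothetical stand-in for a real parser call, and the parallel path uses scoped threads from the standard library in place of Rayon so that the example has no external dependencies.

```rust
use std::thread;
use std::time::Instant;

/// Hypothetical stand-in for a real parser call: decodes bytes as UTF-8 text.
fn extract_stub(data: &[u8]) -> String {
    String::from_utf8_lossy(data).trim().to_string()
}

/// Parse every document on the calling thread, one after another.
fn parse_sequential(docs: &[Vec<u8>]) -> Vec<String> {
    docs.iter().map(|d| extract_stub(d)).collect()
}

/// Split the batch into contiguous chunks and parse each chunk on its own thread.
/// Results are joined in spawn order, so the output order matches the input order.
fn parse_parallel(docs: &[Vec<u8>], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so that `chunks` never receives 0.
    let chunk_size = ((docs.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || parse_sequential(chunk)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let docs: Vec<Vec<u8>> = (0..8)
        .map(|i| format!("document {i}\n").into_bytes())
        .collect();

    let start = Instant::now();
    let seq = parse_sequential(&docs);
    let seq_time = start.elapsed();

    let start = Instant::now();
    let par = parse_parallel(&docs, 4);
    let par_time = start.elapsed();

    assert_eq!(seq, par);
    println!("sequential: {seq_time:?}, parallel: {par_time:?}");
}
```

With inputs this small the thread start-up cost is likely to outweigh any gain; parallel parsing tends to pay off only when each document is expensive to process, as with OCR or large PDFs.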