High-performance PDF text extraction for vectorization pipelines.
pdfvec provides fast, reliable text extraction from PDF documents,
optimized for NLP and machine learning preprocessing workflows.
§Quick Start
Extract text with a single call:

```rust
let text = pdfvec::extract(&std::fs::read("doc.pdf")?)?;
```

Or use the Extractor for more control:

```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let pdf_data = std::fs::read("document.pdf")?;

    // Simple extraction
    let text = Extractor::new().extract(&pdf_data)?;

    // With configuration
    let text = Extractor::new()
        .parallel(false)
        .page_separator("\n---\n")
        .extract(&pdf_data)?;
    println!("{text}");
    Ok(())
}
```

§Streaming Extraction
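Under the hood, streaming is an ordinary Rust Iterator that yields one page at a time, so only the current page's text is materialized. A minimal standalone sketch of that pattern (illustrative only, not pdfvec's internals; the form feed `'\x0c'` stands in for real page boundaries):

```rust
// Sketch of a streaming page iterator: each call to next() produces one
// numbered page, so only that page's text is held at a time.
struct PageIter<'a> {
    parts: std::str::Split<'a, char>,
    number: usize,
}

impl<'a> Iterator for PageIter<'a> {
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        let text = self.parts.next()?;
        self.number += 1;
        Some((self.number, text))
    }
}

fn main() {
    // '\x0c' (form feed) stands in for real PDF page boundaries.
    let doc = "first page\x0csecond page";
    let iter = PageIter { parts: doc.split('\x0c'), number: 0 };
    for (n, text) in iter {
        println!("Page {n}: {} chars", text.chars().count());
    }
}
```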
For large PDFs, use the streaming API to maintain constant memory:
```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("large.pdf")?;
    for page_result in Extractor::new().pages(&data) {
        let page = page_result?;
        println!("Page {}: {} chars", page.number(), page.char_count());
    }
    Ok(())
}
```

§Structured Output
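Conceptually, a document here is an ordered collection of pages with 1-based lookup. A minimal sketch of that shape (an assumption for illustration; `Doc` is a hypothetical stand-in, not pdfvec's real types):

```rust
// Minimal sketch of a document as an ordered page collection with
// 1-based lookup (a hypothetical stand-in for illustration).
struct Doc {
    pages: Vec<String>,
}

impl Doc {
    fn page_count(&self) -> usize {
        self.pages.len()
    }

    // Pages are numbered from 1, matching how PDF pages are usually cited;
    // page(0) and out-of-range numbers return None.
    fn page(&self, number: usize) -> Option<&str> {
        self.pages.get(number.checked_sub(1)?).map(String::as_str)
    }
}

fn main() {
    let doc = Doc { pages: vec!["intro".into(), "body".into()] };
    println!("{} pages, first: {:?}", doc.page_count(), doc.page(1));
}
```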
Get a Document with individual Page access:
```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let doc = Extractor::new().extract_document(&data)?;
    println!("Total pages: {}", doc.page_count());
    if let Some(page) = doc.page(1) {
        println!("First page: {}", page.text());
    }
    Ok(())
}
```

§Performance
pdfvec achieves a 10-50x speedup over pdf-extract through:

- Lazy PDF parsing with pdf-rs
- Parallel page processing with rayon
- Minimal allocations during extraction

Typical throughput: 40-137 MiB/s on academic papers.
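The parallel path can be pictured as fanning per-page work out across threads and collecting the results in page order. A standalone sketch of that idea using std::thread::scope (pdfvec itself uses rayon; `char_counts` is an illustrative helper, not part of the API):

```rust
use std::thread;

// Count characters on each page in parallel, one scoped thread per page.
// Illustrates the fan-out/collect idea only; rayon adds work stealing
// and a bounded thread pool on top of this.
fn char_counts(pages: &[&str]) -> Vec<usize> {
    thread::scope(|s| {
        // Spawn one scoped thread per page...
        let handles: Vec<_> = pages
            .iter()
            .map(|p| s.spawn(move || p.chars().count()))
            .collect();
        // ...then join them in order, preserving page order in the output.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    println!("{:?}", char_counts(&["page one", "page two", "page three"]));
}
```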
§Text Chunking
Split extracted text for embedding and RAG pipelines:
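With the Fixed strategy, chunk_size and overlap presumably slide a character window forward by chunk_size − overlap each step, so consecutive chunks share their boundary characters. A standalone sketch of that behavior (an assumption about the semantics, not pdfvec's actual implementation):

```rust
// Sketch of fixed-size chunking with overlap: a window of `size` characters
// advances by (size - overlap) per step, so adjacent chunks share `overlap`
// characters. Character-based sizing is assumed here.
fn fixed_chunks(text: &str, size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // Guard against overlap >= size, which would otherwise never advance.
    let step = size.saturating_sub(overlap).max(1);
    let mut out = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        out.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    out
}

fn main() {
    for chunk in fixed_chunks("First sentence. Second sentence.", 20, 5) {
        println!("{chunk:?}");
    }
}
```

With size 20 and overlap 5, each chunk repeats the previous chunk's last 5 characters, which helps keep sentence fragments intact across embedding boundaries.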
```rust
use pdfvec::{Chunker, ChunkStrategy};

let text = "First sentence. Second sentence.\n\nNew paragraph.";

// Fixed-size chunks with overlap
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Fixed)
    .chunk_size(20)
    .overlap(5)
    .chunks(text)
    .collect();

// Paragraph-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Paragraph)
    .chunks(text)
    .collect();

// Sentence-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Sentence)
    .chunks(text)
    .collect();
```

§Metadata Extraction
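PDF files store metadata dates as strings of the form `D:YYYYMMDDHHMMSS` (per ISO 32000). As a rough sketch of recovering the calendar date from that form, with `pdf_date_ymd` a hypothetical helper (pdfvec's creation_date() presumably does a fuller parse into a real date type):

```rust
// Extract "YYYY-MM-DD" from a PDF date string like "D:20240315120000Z".
// Only the leading 8 digits are used; timezone suffixes are ignored.
fn pdf_date_ymd(raw: &str) -> Option<String> {
    let digits = raw.strip_prefix("D:").unwrap_or(raw);
    if digits.len() < 8 || !digits[..8].bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(format!("{}-{}-{}", &digits[..4], &digits[4..6], &digits[6..8]))
}

fn main() {
    println!("{:?}", pdf_date_ymd("D:20240315120000Z"));
}
```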
Extract document metadata without processing page content:
```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let meta = Extractor::new().extract_metadata(&data)?;
    println!("Title: {:?}", meta.title());
    println!("Author: {:?}", meta.author());
    if let Some(date) = meta.creation_date() {
        println!("Created: {}", date.format("%Y-%m-%d"));
    }
    Ok(())
}
```

§Error Handling
All errors provide actionable context:
```rust
use pdfvec::{Extractor, Error};

let result = Extractor::new().extract(&[]);
match result {
    Err(Error::EmptyDocument) => eprintln!("Provide a valid PDF file"),
    Err(e) => eprintln!("Extraction failed: {e}"),
    Ok(text) => println!("{text}"),
}
```

Modules§
- cli
- Command-line interface for pdfvec.
Structs§
- Chunk
- A chunk of text with positional metadata.
- Chunker
- Text chunker with configurable strategy and parameters.
- Config
- Configuration for PDF text extraction.
- Document
- A fully extracted PDF document.
- Extractor
- PDF text extractor with configurable behavior.
- Metadata
- PDF document metadata.
- Page
- A single extracted page from a PDF document.
- PageIterator
- Streaming iterator over PDF pages.
Enums§
- ChunkStrategy
- Strategy for splitting text into chunks.
- Error
- Errors that can occur during PDF text extraction.
Functions§
- extract
- Extracts all text from a PDF in one call.
Type Aliases§
- Result
- A specialized Result type for pdfvec operations.