pdfvec 0.1.0

High-performance PDF text extraction library for vectorization pipelines

pdfvec provides fast, reliable text extraction from PDF documents, optimized for NLP and machine learning preprocessing workflows.

Quick Start

Extract text with a single call:

let text = pdfvec::extract(&std::fs::read("doc.pdf")?)?;

Or use the [Extractor] for more control:

use pdfvec::{Extractor, Config, Result};

fn main() -> Result<()> {
    let pdf_data = std::fs::read("document.pdf")?;

    // Simple extraction
    let text = Extractor::new().extract(&pdf_data)?;

    // With configuration
    let text = Extractor::new()
        .parallel(false)
        .page_separator("\n---\n")
        .extract(&pdf_data)?;

    println!("{text}");
    Ok(())
}

Streaming Extraction

For large PDFs, use the streaming API to process pages one at a time instead of buffering the full document's text:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("large.pdf")?;

    for page_result in Extractor::new().pages(&data) {
        let page = page_result?;
        println!("Page {}: {} chars", page.number(), page.char_count());
    }
    Ok(())
}

Structured Output

Get a [Document] with individual [Page] access:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let doc = Extractor::new().extract_document(&data)?;

    println!("Total pages: {}", doc.page_count());
    if let Some(page) = doc.page(1) {
        println!("First page: {}", page.text());
    }
    Ok(())
}

Performance

pdfvec achieves a 10-50x speedup over pdf-extract through:

  • Lazy PDF parsing with pdf-rs
  • Parallel page processing with rayon
  • Minimal allocations during extraction

Typical throughput: 40-137 MiB/s on academic papers.
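The parallel-page idea above can be sketched in plain Rust. This is an illustration of the pattern, not pdfvec's internals (pdfvec uses rayon; the sketch uses `std::thread::scope` so it needs no dependencies, and `extract_page` is a hypothetical stand-in for real per-page work):

```rust
use std::thread;

// Hypothetical stand-in for the per-page extraction work.
fn extract_page(raw: &str) -> String {
    raw.to_uppercase()
}

fn main() {
    let pages = vec!["page one", "page two", "page three"];

    // Fan pages out across threads, then join in order so the
    // resulting Vec preserves page order.
    let texts: Vec<String> = thread::scope(|s| {
        let handles: Vec<_> = pages
            .iter()
            .map(|p| s.spawn(move || extract_page(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    for t in &texts {
        println!("{t}");
    }
}
```

Because each page is independent, this kind of fan-out scales with core count; joining handles in spawn order is what keeps the output deterministic.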

Text Chunking

Split extracted text for embedding and RAG pipelines:

use pdfvec::{Chunker, ChunkStrategy};

let text = "First sentence. Second sentence.\n\nNew paragraph.";

// Fixed-size chunks with overlap
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Fixed)
    .chunk_size(20)
    .overlap(5)
    .chunks(text)
    .collect();

// Paragraph-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Paragraph)
    .chunks(text)
    .collect();

// Sentence-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Sentence)
    .chunks(text)
    .collect();

Metadata Extraction

Extract document metadata without processing page content:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let meta = Extractor::new().extract_metadata(&data)?;

    println!("Title: {:?}", meta.title());
    println!("Author: {:?}", meta.author());
    if let Some(date) = meta.creation_date() {
        println!("Created: {}", date.format("%Y-%m-%d"));
    }
    Ok(())
}

Error Handling

All errors provide actionable context:

use pdfvec::{Extractor, Error};

let result = Extractor::new().extract(&[]);
match result {
    Err(Error::EmptyDocument) => eprintln!("Provide a valid PDF file"),
    Err(e) => eprintln!("Extraction failed: {e}"),
    Ok(text) => println!("{text}"),
}