Crate pdfvec

High-performance PDF text extraction for vectorization pipelines.

pdfvec provides fast, reliable text extraction from PDF documents, optimized for NLP and machine learning preprocessing workflows.

§Quick Start

Extract text in a single call:

let text = pdfvec::extract(&std::fs::read("doc.pdf")?)?;

Or use the Extractor for more control:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let pdf_data = std::fs::read("document.pdf")?;

    // Simple extraction
    let text = Extractor::new().extract(&pdf_data)?;

    // With configuration
    let text = Extractor::new()
        .parallel(false)
        .page_separator("\n---\n")
        .extract(&pdf_data)?;

    println!("{text}");
    Ok(())
}
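
To illustrate why a distinctive page separator is useful downstream, here is a minimal standalone sketch. It assumes the separator is placed between consecutive pages, which the example above suggests but does not state. `join_pages` and `split_pages` are hypothetical helpers written for this illustration and are not part of pdfvec.

```rust
/// Hypothetical helper (not a pdfvec API): joins per-page text,
/// placing `separator` between consecutive pages.
fn join_pages(pages: &[&str], separator: &str) -> String {
    pages.join(separator)
}

/// Hypothetical helper (not a pdfvec API): recovers page boundaries
/// from joined text, assuming `separator` never occurs inside a page.
fn split_pages<'a>(text: &'a str, separator: &str) -> Vec<&'a str> {
    text.split(separator).collect()
}
```

Under that assumption, a separator such as "\n---\n" lets later pipeline stages split the extracted string back into pages. A separator that can also occur in page text would make the split ambiguous.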

§Streaming Extraction

For large PDFs, use the streaming API to process one page at a time instead of holding the full extracted text in memory:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("large.pdf")?;

    for page_result in Extractor::new().pages(&data) {
        let page = page_result?;
        println!("Page {}: {} chars", page.number(), page.char_count());
    }
    Ok(())
}
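
The page-at-a-time pattern above can be sketched independently of pdfvec. The function below takes any iterator of fallible page results, so `total_chars` and its `String` page type are stand-ins chosen for illustration, not the crate's `PageIterator` or `Page`.

```rust
/// Sums character counts over a stream of fallible pages,
/// stopping at the first error.
///
/// `pages` stands in for any per-page result iterator; it is not
/// the pdfvec `PageIterator` type.
fn total_chars<I, E>(pages: I) -> Result<usize, E>
where
    I: IntoIterator<Item = Result<String, E>>,
{
    let mut total = 0;
    for page in pages {
        // `?` returns early on the first failed page.
        // Each page's text is dropped at the end of its iteration,
        // so only one page of text is alive at a time.
        total += page?.chars().count();
    }
    Ok(total)
}
```

Because nothing is collected into a `Vec`, the memory held for extracted text is bounded by the largest single page, provided the underlying iterator produces pages lazily.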

§Structured Output

Get a Document with individual Page access:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let doc = Extractor::new().extract_document(&data)?;

    println!("Total pages: {}", doc.page_count());
    if let Some(page) = doc.page(1) {
        println!("First page: {}", page.text());
    }
    Ok(())
}

§Performance

pdfvec achieves a 10-50x speedup over pdf-extract through:

  • Lazy PDF parsing with pdf-rs
  • Parallel page processing with rayon
  • Minimal allocations during extraction

Typical throughput: 40-137 MiB/s on academic papers.
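
For comparing your own measurements against the figure above, throughput in MiB/s is the input size in mebibytes divided by elapsed seconds. The helper below is a small sketch of that arithmetic. `throughput_mib_per_sec` is a hypothetical name, and the source does not say how its own numbers were measured.

```rust
use std::time::Duration;

/// Converts a byte count and an elapsed time into MiB/s
/// (1 MiB = 1,048,576 bytes).
///
/// Hypothetical helper for illustration; not part of pdfvec.
/// A zero `elapsed` yields infinity (or NaN when `bytes` is also zero).
fn throughput_mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    const MIB: f64 = 1024.0 * 1024.0;
    (bytes as f64 / MIB) / elapsed.as_secs_f64()
}
```

A typical use is to record `std::time::Instant::now()` before extraction, call `elapsed()` afterwards, and pass the PDF's byte length together with that duration.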

§Text Chunking

Split extracted text for embedding and RAG pipelines:

use pdfvec::{Chunker, ChunkStrategy};

let text = "First sentence. Second sentence.\n\nNew paragraph.";

// Fixed-size chunks with overlap
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Fixed)
    .chunk_size(20)
    .overlap(5)
    .chunks(text)
    .collect();

// Paragraph-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Paragraph)
    .chunks(text)
    .collect();

// Sentence-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Sentence)
    .chunks(text)
    .collect();
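
To make the interaction between `chunk_size` and `overlap` concrete, here is a standalone sketch of fixed-size chunking with overlap. It assumes sizes are measured in Unicode scalar values (`char`s). Whether pdfvec counts characters, bytes, or tokens is not stated in the source, and `fixed_chunks` is a hypothetical function, not the crate's implementation.

```rust
/// Illustrative only (not pdfvec's implementation): splits `text` into
/// windows of at most `size` characters, where each window starts
/// `size - overlap` characters after the previous one.
fn fixed_chunks(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");

    // Collect chars first so slicing never lands inside a UTF-8 sequence.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;

    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // The final window reached the end of the text.
        }
        start += step;
    }
    chunks
}
```

With `size = 4` and `overlap = 1`, the input "abcdefghij" yields "abcd", "defg", and "ghij": each chunk repeats the last character of the one before it, which helps preserve context across chunk boundaries when embedding.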

§Metadata Extraction

Extract document metadata without processing page content:

use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let meta = Extractor::new().extract_metadata(&data)?;

    println!("Title: {:?}", meta.title());
    println!("Author: {:?}", meta.author());
    if let Some(date) = meta.creation_date() {
        println!("Created: {}", date.format("%Y-%m-%d"));
    }
    Ok(())
}

§Error Handling

All errors provide actionable context:

use pdfvec::{Extractor, Error};

let result = Extractor::new().extract(&[]);
match result {
    Err(Error::EmptyDocument) => eprintln!("Provide a valid PDF file"),
    Err(e) => eprintln!("Extraction failed: {e}"),
    Ok(text) => println!("{text}"),
}

Modules§

cli
Command-line interface for pdfvec.

Structs§

Chunk
A chunk of text with positional metadata.
Chunker
Text chunker with configurable strategy and parameters.
Config
Configuration for PDF text extraction.
Document
A fully extracted PDF document.
Extractor
PDF text extractor with configurable behavior.
Metadata
PDF document metadata.
Page
A single extracted page from a PDF document.
PageIterator
Streaming iterator over PDF pages.

Enums§

ChunkStrategy
Strategy for splitting text into chunks.
Error
Errors that can occur during PDF text extraction.

Functions§

extract
Extracts all text from a PDF in one call.

Type Aliases§

Result
A specialized Result type for pdfvec operations.