High-performance PDF text extraction for vectorization pipelines.
pdfvec provides fast, reliable text extraction from PDF documents,
optimized for NLP and machine learning preprocessing workflows.
§Quick Start
Extract text with a single call:

```rust
let text = pdfvec::extract(&std::fs::read("doc.pdf")?)?;
```

Or use the Extractor for more control:

```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let pdf_data = std::fs::read("document.pdf")?;

    // Simple extraction
    let text = Extractor::new().extract(&pdf_data)?;

    // With configuration
    let text = Extractor::new()
        .parallel(false)
        .page_separator("\n---\n")
        .extract(&pdf_data)?;
    println!("{text}");
    Ok(())
}
```

§Streaming Extraction
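Under the hood, streaming is an ordinary Rust Iterator that yields one page at a time, so only the current page's text is materialized. A minimal standalone sketch of that pattern (illustrative only, not pdfvec's internals; the form feed `'\x0c'` stands in for real page boundaries):

```rust
// Sketch of a streaming page iterator: each call to next() produces one
// numbered page, so only that page's text is held at a time.
struct PageIter<'a> {
    parts: std::str::Split<'a, char>,
    number: usize,
}

impl<'a> Iterator for PageIter<'a> {
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        let text = self.parts.next()?;
        self.number += 1;
        Some((self.number, text))
    }
}

fn main() {
    // '\x0c' (form feed) stands in for real PDF page boundaries.
    let doc = "first page\x0csecond page";
    let iter = PageIter { parts: doc.split('\x0c'), number: 0 };
    for (n, text) in iter {
        println!("Page {n}: {} chars", text.chars().count());
    }
}
```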
For large PDFs, use the streaming API to maintain constant memory:
```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("large.pdf")?;
    for page_result in Extractor::new().pages(&data) {
        let page = page_result?;
        println!("Page {}: {} chars", page.number(), page.char_count());
    }
    Ok(())
}
```

§Structured Output
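Conceptually, a document here is an ordered collection of pages with 1-based lookup. A minimal sketch of that shape (an assumption for illustration; `Doc` is a hypothetical stand-in, not pdfvec's real types):

```rust
// Minimal sketch of a document as an ordered page collection with
// 1-based lookup (a hypothetical stand-in for illustration).
struct Doc {
    pages: Vec<String>,
}

impl Doc {
    fn page_count(&self) -> usize {
        self.pages.len()
    }

    // Pages are numbered from 1, matching how PDF pages are usually cited;
    // page(0) and out-of-range numbers return None.
    fn page(&self, number: usize) -> Option<&str> {
        self.pages.get(number.checked_sub(1)?).map(String::as_str)
    }
}

fn main() {
    let doc = Doc { pages: vec!["intro".into(), "body".into()] };
    println!("{} pages, first: {:?}", doc.page_count(), doc.page(1));
}
```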
Get a Document with individual Page access:
```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let doc = Extractor::new().extract_document(&data)?;
    println!("Total pages: {}", doc.page_count());
    if let Some(page) = doc.page(1) {
        println!("First page: {}", page.text());
    }
    Ok(())
}
```

§Performance
pdfvec achieves a 10-50x speedup over pdf-extract through:

- Lazy PDF parsing with pdf-rs
- Parallel page processing with rayon
- Minimal allocations during extraction

Typical throughput: 40-137 MiB/s on academic papers.
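The parallel path can be pictured as fanning per-page work out across threads and collecting the results in page order. A standalone sketch of that idea using std::thread::scope (pdfvec itself uses rayon; `char_counts` is an illustrative helper, not part of the API):

```rust
use std::thread;

// Count characters on each page in parallel, one scoped thread per page.
// Illustrates the fan-out/collect idea only; rayon adds work stealing
// and a bounded thread pool on top of this.
fn char_counts(pages: &[&str]) -> Vec<usize> {
    thread::scope(|s| {
        // Spawn one scoped thread per page...
        let handles: Vec<_> = pages
            .iter()
            .map(|p| s.spawn(move || p.chars().count()))
            .collect();
        // ...then join them in order, preserving page order in the output.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    println!("{:?}", char_counts(&["page one", "page two", "page three"]));
}
```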
§Text Chunking
Split extracted text for embedding and RAG pipelines:
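With the Fixed strategy, chunk_size and overlap presumably slide a character window forward by chunk_size − overlap each step, so consecutive chunks share their boundary characters. A standalone sketch of that behavior (an assumption about the semantics, not pdfvec's actual implementation):

```rust
// Sketch of fixed-size chunking with overlap: a window of `size` characters
// advances by (size - overlap) per step, so adjacent chunks share `overlap`
// characters. Character-based sizing is assumed here.
fn fixed_chunks(text: &str, size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // Guard against overlap >= size, which would otherwise never advance.
    let step = size.saturating_sub(overlap).max(1);
    let mut out = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        out.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    out
}

fn main() {
    for chunk in fixed_chunks("First sentence. Second sentence.", 20, 5) {
        println!("{chunk:?}");
    }
}
```

With size 20 and overlap 5, each chunk repeats the previous chunk's last 5 characters, which helps keep sentence fragments intact across embedding boundaries.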
```rust
use pdfvec::{Chunker, ChunkStrategy};

let text = "First sentence. Second sentence.\n\nNew paragraph.";

// Fixed-size chunks with overlap
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Fixed)
    .chunk_size(20)
    .overlap(5)
    .chunks(text)
    .collect();

// Paragraph-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Paragraph)
    .chunks(text)
    .collect();

// Sentence-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Sentence)
    .chunks(text)
    .collect();
```

§Metadata Extraction
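PDF files store metadata dates as strings of the form `D:YYYYMMDDHHMMSS` (per ISO 32000). As a rough sketch of recovering the calendar date from that form, with `pdf_date_ymd` a hypothetical helper (pdfvec's creation_date() presumably does a fuller parse into a real date type):

```rust
// Extract "YYYY-MM-DD" from a PDF date string like "D:20240315120000Z".
// Only the leading 8 digits are used; timezone suffixes are ignored.
fn pdf_date_ymd(raw: &str) -> Option<String> {
    let digits = raw.strip_prefix("D:").unwrap_or(raw);
    if digits.len() < 8 || !digits[..8].bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(format!("{}-{}-{}", &digits[..4], &digits[4..6], &digits[6..8]))
}

fn main() {
    println!("{:?}", pdf_date_ymd("D:20240315120000Z"));
}
```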
Extract document metadata without processing page content:
```rust
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let meta = Extractor::new().extract_metadata(&data)?;
    println!("Title: {:?}", meta.title());
    println!("Author: {:?}", meta.author());
    if let Some(date) = meta.creation_date() {
        println!("Created: {}", date.format("%Y-%m-%d"));
    }
    Ok(())
}
```

§Error Handling
All errors provide actionable context:
```rust
use pdfvec::{Extractor, Error};

let result = Extractor::new().extract(&[]);
match result {
    Err(Error::EmptyDocument) => eprintln!("Provide a valid PDF file"),
    Err(e) => eprintln!("Extraction failed: {e}"),
    Ok(text) => println!("{text}"),
}
```

Modules§
- cli
- Command-line interface for pdfvec.
Structs§
- Chunk
- A chunk of text with positional metadata.
- Chunker
- Text chunker with configurable strategy and parameters.
- Config
- Configuration for PDF text extraction.
- Document
- A fully extracted PDF document.
- Extractor
- PDF text extractor with configurable behavior.
- Metadata
- PDF document metadata.
- Page
- A single extracted page from a PDF document.
- PageIterator
- Streaming iterator over PDF pages.
Enums§
- ChunkStrategy
- Strategy for splitting text into chunks.
- Error
- Errors that can occur during PDF text extraction.
Functions§
- extract
- Extracts all text from a PDF in one call.
Type Aliases§
- Result
- A specialized Result type for pdfvec operations.