High-performance PDF text extraction for vectorization pipelines.
pdfvec provides fast, reliable text extraction from PDF documents,
optimized for NLP and machine learning preprocessing workflows.
Quick Start
Extract text with a single call:
let text = pdfvec::extract(&std::fs::read("doc.pdf")?)?;
# Ok::<(), pdfvec::Error>(())
Or use the [Extractor] for more control:
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let pdf_data = std::fs::read("document.pdf")?;

    // Simple extraction
    let text = Extractor::new().extract(&pdf_data)?;

    // With configuration
    let text = Extractor::new()
        .parallel(false)
        .page_separator("\n---\n")
        .extract(&pdf_data)?;

    println!("{text}");
    Ok(())
}
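The chained configuration calls above follow the consuming-builder pattern. The sketch below illustrates that pattern and what a page separator does to the joined output. `PageJoiner`, its `join` method, and the blank-line default are hypothetical names and values chosen for illustration; they are not part of pdfvec.

```rust
/// Hypothetical builder used only for illustration; not pdfvec's `Extractor`.
#[derive(Debug, Clone)]
pub struct PageJoiner {
    page_separator: String,
}

impl PageJoiner {
    /// Starts with a blank-line separator (an assumed default).
    pub fn new() -> Self {
        Self { page_separator: "\n\n".to_string() }
    }

    /// Consumes and returns the builder so calls can be chained.
    pub fn page_separator(mut self, separator: &str) -> Self {
        self.page_separator = separator.to_string();
        self
    }

    /// Joins per-page text into one string using the configured separator.
    pub fn join(&self, pages: &[&str]) -> String {
        pages.join(self.page_separator.as_str())
    }
}
```

Each setter takes `self` by value and returns it, which is what allows `new().page_separator(..).join(..)` to be written as one expression.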
Streaming Extraction
For large PDFs, use the streaming API, which yields one page at a time so extracted text does not accumulate in memory:
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("large.pdf")?;

    for page_result in Extractor::new().pages(&data) {
        let page = page_result?;
        println!("Page {}: {} chars", page.number(), page.char_count());
    }
    Ok(())
}
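The loop above relies on an iterator whose items are per-page results. As a rough sketch of that shape, the following std-only iterator produces one page per `next()` call and reports failures per page. `PageText`, `PageStream`, and `total_chars` are hypothetical names, and treating an empty source as a failure is an assumption made only to show the error path; none of this is pdfvec's implementation.

```rust
/// Hypothetical stand-in for an extracted page (not pdfvec's `Page`).
#[derive(Debug, PartialEq)]
pub struct PageText {
    pub number: usize,
    pub text: String,
}

/// Lazily converts raw page sources into `PageText`, one per `next()` call.
pub struct PageStream<'a> {
    raw_pages: std::slice::Iter<'a, &'a str>,
    next_number: usize,
}

impl<'a> PageStream<'a> {
    pub fn new(raw_pages: &'a [&'a str]) -> Self {
        Self { raw_pages: raw_pages.iter(), next_number: 1 }
    }
}

impl<'a> Iterator for PageStream<'a> {
    type Item = Result<PageText, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let raw = self.raw_pages.next()?;
        let number = self.next_number;
        self.next_number += 1;
        // Assumed failure mode: an empty source counts as undecodable.
        if raw.is_empty() {
            return Some(Err(format!("page {number} could not be decoded")));
        }
        Some(Ok(PageText { number, text: raw.trim().to_string() }))
    }
}

/// Counts characters while holding only one decoded page at a time.
pub fn total_chars(stream: PageStream<'_>) -> Result<usize, String> {
    let mut total = 0;
    for page in stream {
        total += page?.text.chars().count();
    }
    Ok(total)
}
```

Because each page is dropped at the end of its loop iteration, the consumer's memory use is bounded by the largest single page rather than by the whole document's text.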
Structured Output
Get a [Document] with individual [Page] access:
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let doc = Extractor::new().extract_document(&data)?;

    println!("Total pages: {}", doc.page_count());
    if let Some(page) = doc.page(1) {
        println!("First page: {}", page.text());
    }
    Ok(())
}
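The example above prints `doc.page(1)` as the first page, which suggests 1-based page numbers. A minimal sketch of such a lookup follows, assuming that reading is correct; `Doc` is a hypothetical type and not pdfvec's `Document`.

```rust
/// Hypothetical document type for illustration; not pdfvec's `Document`.
pub struct Doc {
    pages: Vec<String>,
}

impl Doc {
    pub fn new(pages: Vec<String>) -> Self {
        Self { pages }
    }

    pub fn page_count(&self) -> usize {
        self.pages.len()
    }

    /// 1-based lookup: `page(1)` is the first page and `page(0)` is `None`.
    pub fn page(&self, number: usize) -> Option<&str> {
        // `checked_sub` turns page 0 into `None` instead of underflowing.
        let index = number.checked_sub(1)?;
        self.pages.get(index).map(String::as_str)
    }
}
```

Returning `Option` for out-of-range numbers is what makes the `if let Some(page)` form in the example natural.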
Performance
pdfvec achieves 10-50x speedup over pdf-extract through:
- Lazy PDF parsing with pdf-rs
- Parallel page processing with rayon
- Minimal allocations during extraction
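To illustrate the parallel page processing item, the sketch below splits pages across scoped threads from the standard library and returns results in page order. It uses `std::thread::scope` as a stand-in for rayon so that it has no dependencies; `process_page` and `process_pages_parallel` are hypothetical helpers, and the whitespace normalization merely stands in for real per-page extraction work.

```rust
use std::thread;

/// Stand-in for per-page extraction work: collapses runs of whitespace.
fn process_page(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Processes pages on up to `workers` threads, preserving page order.
pub fn process_pages_parallel(pages: &[&str], workers: usize) -> Vec<String> {
    if pages.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every page lands in exactly one contiguous batch.
    let batch_len = (pages.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn one thread per batch; scoped threads may borrow `pages`.
        let handles: Vec<_> = pages
            .chunks(batch_len)
            .map(|batch| {
                scope.spawn(move || {
                    batch.iter().map(|page| process_page(page)).collect::<Vec<String>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the output in page order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Pages are independent units of work, so contiguous batches need no synchronization beyond the final join.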
Typical throughput: 40-137 MiB/s on academic papers.
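For reference, a figure in MiB/s is the input size divided by 1024 * 1024 and then by the elapsed seconds. The helper below is a hypothetical sketch of that arithmetic, not the benchmark code behind the numbers above.

```rust
use std::time::Duration;

/// Converts a byte count and elapsed time into MiB per second.
/// A zero `elapsed` yields infinity (or NaN when `bytes` is also zero).
pub fn throughput_mib_per_s(bytes: u64, elapsed: Duration) -> f64 {
    let mib = bytes as f64 / (1024.0 * 1024.0);
    mib / elapsed.as_secs_f64()
}
```

In practice `elapsed` would come from `std::time::Instant::now()` taken before the extraction call and `.elapsed()` taken after it.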
Text Chunking
Split extracted text for embedding and RAG pipelines:
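// Assumed names: `Chunker`, `ChunkStrategy` and its variants; the sizes are illustrative.
use pdfvec::{Chunker, ChunkStrategy};

let text = "First sentence. Second sentence.\n\nNew paragraph.";

// Fixed-size chunks with overlap
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Fixed)
    .chunk_size(512)
    .overlap(50)
    .chunks(text)
    .collect();

// Paragraph-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Paragraph)
    .chunks(text)
    .collect();

// Sentence-based chunking
let chunks: Vec<_> = Chunker::new(ChunkStrategy::Sentence)
    .chunks(text)
    .collect();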
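To make the three strategies concrete, here is a dependency-free sketch of each. The function names are hypothetical, sizes are counted in characters (an assumption), and the sentence splitter is deliberately naive; pdfvec's own chunker may behave differently.

```rust
/// Splits `text` into windows of at most `chunk_size` characters, where
/// consecutive windows share `overlap` characters.
pub fn fixed_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    // Index by `char` so multi-byte characters are never split.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // The last window reached the end of the text.
        }
        start += step;
    }
    chunks
}

/// Splits on blank lines and drops empty paragraphs.
pub fn paragraph_chunks(text: &str) -> Vec<&str> {
    text.split("\n\n").map(str::trim).filter(|p| !p.is_empty()).collect()
}

/// Naive sentence split on `.`, `!` and `?`; abbreviations such as "e.g."
/// will be split incorrectly.
pub fn sentence_chunks(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}
```

Overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.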
Metadata Extraction
Extract document metadata without processing page content:
use pdfvec::{Extractor, Result};

fn main() -> Result<()> {
    let data = std::fs::read("document.pdf")?;
    let meta = Extractor::new().extract_metadata(&data)?;

    println!("Title: {:?}", meta.title());
    println!("Author: {:?}", meta.author());
    if let Some(date) = meta.creation_date() {
        println!("Created: {}", date.format("%Y-%m-%d"));
    }
    Ok(())
}
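For background on the creation date: PDF files store dates as strings of the form `D:YYYYMMDDHHmmSS` followed by an optional timezone, where only the year is mandatory. The sketch below parses just the calendar date from such a string. `parse_pdf_date` and `digits_at` are hypothetical helpers, and this is not how pdfvec produces the value returned by `creation_date()`.

```rust
/// Reads `len` ASCII digits starting at byte offset `start`, if present.
fn digits_at(s: &str, start: usize, len: usize) -> Option<u16> {
    let part = s.get(start..start + len)?;
    if part.bytes().all(|b| b.is_ascii_digit()) {
        part.parse().ok()
    } else {
        None
    }
}

/// Parses the date part of a PDF date string such as "D:20240131120000Z".
/// Returns `(year, month, day)`; time and timezone are ignored.
pub fn parse_pdf_date(raw: &str) -> Option<(u16, u8, u8)> {
    // Some producers omit the "D:" prefix, so accept both forms.
    let body = raw.strip_prefix("D:").unwrap_or(raw);
    let year = digits_at(body, 0, 4)?;
    // Month and day are optional in the PDF date format and default to 1.
    let month = if body.len() >= 6 { digits_at(body, 4, 2)? } else { 1 };
    let day = if body.len() >= 8 { digits_at(body, 6, 2)? } else { 1 };
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some((year, month as u8, day as u8))
}
```

The range check is coarse (it accepts February 31, for example); a real implementation would hand the fields to a date library for full validation.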
Error Handling
All errors provide actionable context:
use pdfvec::Extractor;
let result = Extractor::new().extract(&data);
match result