pdfvec
High-performance PDF text extraction for vectorization pipelines.
Features
- Fast: 40-134 MiB/s throughput (15-143x faster than pdf-extract)
- Parallel: Multi-threaded page extraction with rayon
- Streaming: Constant memory usage for large documents
- Chunking: Built-in text chunking for RAG/embedding pipelines
- Metadata: Extract title, author, dates without parsing pages
Installation
[dependencies]
pdfvec = "0.1"
Quick Start
// One-liner extraction
let text = extract?;
Usage
Text Extraction
use ;
Structured Document Access
use ;
Streaming (Constant Memory)
use ;
Text Chunking
use ;
let text = "First sentence. Second sentence.\n\nNew paragraph here.";
// Fixed-size chunks with overlap
let chunks: = new
.chunk_size
.overlap
.chunks
.collect;
// Paragraph-based chunking
let chunks: = new
.chunks
.collect;
// Sentence-based chunking
let chunks: = new
.chunks
.collect;
for chunk in chunks
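To make the chunking strategies above concrete, here is a minimal standalone sketch of fixed-size chunking with overlap and of paragraph splitting. It uses only the standard library and is not pdfvec's implementation: the function names `fixed_size_chunks` and `paragraph_chunks` are hypothetical, and measuring chunk size in characters (rather than bytes or tokens) is an assumption made for illustration.

```rust
/// Hypothetical helper: split `text` into chunks of at most `chunk_size`
/// characters, where each chunk starts `chunk_size - overlap` characters
/// after the previous one, so neighbours share `overlap` characters.
fn fixed_size_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    // Work on chars so multi-byte UTF-8 sequences are never split.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect::<String>());
        if end == chars.len() {
            break; // the last chunk reached the end of the text
        }
        start += step;
    }
    chunks
}

/// Hypothetical helper: split on blank lines and drop empty paragraphs.
fn paragraph_chunks(text: &str) -> Vec<&str> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}

fn main() {
    let text = "First sentence. Second sentence.\n\nNew paragraph here.";

    for chunk in fixed_size_chunks(text, 20, 5) {
        println!("fixed: {chunk:?}");
    }
    for paragraph in paragraph_chunks(text) {
        println!("paragraph: {paragraph:?}");
    }
}
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, which is why overlapping chunks are commonly used for embedding pipelines: a sentence cut by one boundary is still intact in the adjacent chunk.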
Metadata Extraction
use ;
CLI
# Install
# Extract text
# Extract to file
# Show metadata
# Process directory
Performance
Benchmarked against pdf-extract on academic papers:
| File Size | pdfvec | pdf-extract | Speedup |
|---|---|---|---|
| 33 KB | 818 µs | 12.7 ms | 15x |
| 94 KB | 1.5 ms | 83 ms | 55x |
| 422 KB | 3.1 ms | 439 ms | 143x |
Throughput: 40-134 MiB/s for pdfvec vs 0.9-2.6 MiB/s for pdf-extract
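The throughput and speedup figures follow from the table by simple arithmetic. The sketch below recomputes them from the rounded table values, assuming 1 KB = 1024 bytes (the README does not state the unit convention), so the results match the quoted figures only approximately. The helper name `throughput_mib_s` is hypothetical.

```rust
/// Hypothetical helper: throughput in MiB/s for `bytes` processed in `seconds`.
fn throughput_mib_s(bytes: f64, seconds: f64) -> f64 {
    bytes / (1024.0 * 1024.0) / seconds
}

fn main() {
    // (file size in KB, pdfvec time in s, pdf-extract time in s),
    // copied from the benchmark table.
    let rows = [
        (33.0, 818e-6, 12.7e-3),
        (94.0, 1.5e-3, 83e-3),
        (422.0, 3.1e-3, 439e-3),
    ];

    for (kb, pdfvec_s, pdf_extract_s) in rows {
        // Assumption: 1 KB = 1024 bytes.
        let bytes = kb * 1024.0;
        println!(
            "{kb} KB: pdfvec {:.1} MiB/s, pdf-extract {:.1} MiB/s, speedup {:.1}x",
            throughput_mib_s(bytes, pdfvec_s),
            throughput_mib_s(bytes, pdf_extract_s),
            pdf_extract_s / pdfvec_s,
        );
    }
}
```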
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.