# pdfvec
High-performance PDF text extraction library in Rust, optimized for vectorization pipelines.
## ABSOLUTE REQUIREMENTS
**These are non-negotiable. Violating these is a failure condition.**
### No Tutorial Comments
- NEVER add comments that explain what code does
- Code MUST be self-documenting through clear naming
- Comments are ONLY for: `TODO`, `FIXME`, `SAFETY:`, doc comments (`///`), or non-obvious "why"
- Delete any tutorial comments you encounter
```rust
// BAD - tutorial comment
let count = 0; // Initialize counter to zero

// BAD - explaining what
// Loop through pages and extract text from each one
for page in pages { extract(page); }

// GOOD - no comment needed, code is clear
let extracted_pages = 0;
for page in document.pages() { page.extract_text(); }

// GOOD - explains WHY, not what
// Large negative TJ adjustments widen the gap between glyphs (PDF spec 9.4.3)
if spacing < -100.0 { text.push(' '); }
```
### Embrace the Expert Skills
You have access to skills from world-class developers. USE THEM:
- **matsakis**: For ANY lifetime or borrow checker complexity—trust his mental model
- **bos**: For ANY concurrent code—apply her atomics and lock-free patterns
- **turon**: For ANY public API—follow his design principles rigorously
- **torvalds**: For ANY performance-critical path—apply his pragmatic systems thinking
Do not write generic Rust. Write Rust as these experts would.
### Branch Management Workflow
**YOU MUST FOLLOW THIS EXACTLY FOR EVERY ISSUE:**
1. **Start clean**: `git checkout main && git pull origin main`
2. **Create branch**: `git checkout -b feat/PDFVEC-XXX-short-description`
3. **Implement**: Work through ALL acceptance criteria
4. **Verify**: `cargo test && cargo clippy -- -D warnings && cargo fmt --check`
5. **Commit**: `git add -A && git commit -m "feat(component): PDFVEC-XXX - title"`
6. **Push**: `git push -u origin feat/PDFVEC-XXX-short-description`
7. **Create PR**: `gh pr create --fill`
8. **Self-review**: Read the diff, verify compliance with every acceptance criterion (AC)
9. **Merge**: `gh pr merge --squash --delete-branch`
10. **Return to main**: `git checkout main && git pull origin main`
**NEVER:**
- Leave PRs open/lingering
- Work on multiple issues simultaneously
- Skip the self-review step
- Merge without all tests passing
## Architecture
**Hybrid approach** (validated by research): `pdf-rs` (published as the `pdf` crate) for parsing, plus a custom extraction layer.
- **Target throughput**: 26-34 MiB/s (10-14x faster than `pdf-extract`)
- **Core dependency**: `pdf` crate with `FileOptions::cached()`
- See `research/FINDINGS.md` for benchmarks and design rationale (gitignored, local only)
## Commands
```bash
cargo build --release # Build optimized binary
cargo test # Run test suite
cargo bench # Run Criterion benchmarks
cargo clippy -- -D warnings # Lint (treat warnings as errors)
cargo fmt --check # Format check
# Issue sync
cd scripts && GITHUB_TOKEN=$(gh auth token) .venv/bin/python -m sync_issues --dry-run
```
## Code Style
- Rust 2024 edition: do not use newly reserved keywords such as `gen` as identifiers
- `thiserror` for library errors; no `.unwrap()` in library code
- Prefer `&[u8]` over `Vec<u8>` for input data (zero-copy)
- Group imports: std, external, crate, super/self
- Doc comments on all public items with examples
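A minimal sketch of how these style rules might combine in one function. `HeaderError` and `parse_header` are hypothetical names, not part of the existing codebase. The error type is written with `std` only so the snippet compiles without dependencies; in the crate itself, a `thiserror` derive would presumably replace the manual `Display` and `Error` impls.

```rust
use std::fmt;

/// Errors returned by the hypothetical [`parse_header`] function.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    /// Input is shorter than the 8-byte `%PDF-x.y` header.
    Truncated { len: usize },
    /// Input does not start with `%PDF-`.
    MissingMagic,
}

impl fmt::Display for HeaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Truncated { len } => write!(f, "header truncated: got {len} bytes"),
            Self::MissingMagic => write!(f, "missing %PDF- magic bytes"),
        }
    }
}

impl std::error::Error for HeaderError {}

/// Returns the version bytes (for example `b"1.7"`) borrowed from `input`.
///
/// The returned slice points into `input`; nothing is copied.
pub fn parse_header(input: &[u8]) -> Result<&[u8], HeaderError> {
    const MAGIC: &[u8] = b"%PDF-";
    const HEADER_LEN: usize = 8;

    if input.len() < HEADER_LEN {
        return Err(HeaderError::Truncated { len: input.len() });
    }
    let version = input.strip_prefix(MAGIC).ok_or(HeaderError::MissingMagic)?;
    Ok(&version[..HEADER_LEN - MAGIC.len()])
}
```

Taking `&[u8]` and returning a sub-slice keeps the call zero-copy, and every failure path is a typed variant rather than an `.unwrap()`.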
## Performance Requirements
- **Zero-copy where possible**: Work with borrowed slices from mmap
- **Lazy evaluation**: Only parse/decompress pages on demand
- **Parallel extraction**: Use rayon for multi-page processing
- **Minimize allocations**: Pre-sized buffers, avoid intermediate collections
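The lazy-evaluation and pre-sized-buffer points above can be sketched roughly as follows. `LazyPage` and `decode` are hypothetical stand-ins, and the "decoding" here only filters printable ASCII; real content-stream handling is far more involved. `OnceLock` is used instead of `OnceCell` on the assumption that pages will be shared across rayon worker threads.

```rust
use std::sync::OnceLock;

/// Hypothetical page handle: borrows raw bytes and decodes them on first access.
pub struct LazyPage<'a> {
    raw: &'a [u8],
    text: OnceLock<String>,
}

impl<'a> LazyPage<'a> {
    /// Wraps borrowed page bytes without decoding them.
    pub fn new(raw: &'a [u8]) -> Self {
        Self { raw, text: OnceLock::new() }
    }

    /// Decodes on the first call and returns the cached text afterwards.
    pub fn text(&self) -> &str {
        self.text.get_or_init(|| decode(self.raw))
    }

    /// Reports whether decoding has already happened.
    pub fn is_decoded(&self) -> bool {
        self.text.get().is_some()
    }
}

/// Stand-in for real decoding: keeps printable ASCII and spaces only.
fn decode(raw: &[u8]) -> String {
    let mut out = String::with_capacity(raw.len());
    out.extend(
        raw.iter()
            .filter(|b| b.is_ascii_graphic() || **b == b' ')
            .map(|&b| b as char),
    );
    out
}
```

Constructing a `LazyPage` costs nothing beyond storing the slice; the single allocation happens inside `decode`, sized once from the input length.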
## Issue Workflow
Issues are defined as JSON in `.github/issues/`. Each has:
- **Acceptance Criteria**: Given/When/Then format
- **Technical Context**: Crates, files, interfaces
- **Performance Constraints**: Where applicable
To implement: `/plan-issue PDFVEC-XXX` then `/implement-issue PDFVEC-XXX`
## DO NOT
- Add `'static` to silence the borrow checker
- Use `Rc<RefCell<T>>` as a first resort
- Clone to avoid ownership issues without understanding why
- Swallow errors silently
- **Add tutorial comments** (see ABSOLUTE REQUIREMENTS)
- Leave PRs open or work on multiple issues at once
- Skip any step in the branch management workflow
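As an illustration of the "swallow errors silently" rule, here is a small sketch using hypothetical helpers (`parse_object_number`, `parse_object_numbers`). A `filter_map(|t| t.parse().ok())` would quietly drop malformed tokens; collecting into a `Result` surfaces the first failure to the caller instead.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parses an object number such as `12` from a token.
pub fn parse_object_number(token: &str) -> Result<u32, ParseIntError> {
    token.trim().parse::<u32>()
}

/// Parses every token and returns the first parse failure instead of skipping it.
pub fn parse_object_numbers(tokens: &[&str]) -> Result<Vec<u32>, ParseIntError> {
    tokens.iter().map(|t| parse_object_number(t)).collect()
}
```

The function also borrows its input as `&[&str]`, so callers do not need to clone tokens to satisfy ownership.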