cli-pdf-extract
cli-pdf-extract is a fast Rust CLI for LLM-friendly PDF inspection. It wraps pdf_oxide and focuses on practical research workflows: quick triage, targeted extraction, and lightweight downstream parsing.
Why This Tool
LLMs are often slow when they must open and synthesize full PDFs. This CLI gives a faster path:
- Extract only what you need (--abstract, --highlight, page/range)
- Avoid heavy image payloads by default
- Prefer plain text output for fast agent throughput (default --mode text)
Modalities
- full text/pages: extract one page, a range, or all pages (default if no page flags)
- abstract: extract only the abstract block for paper triage
- highlight: extract only PDF highlights and their notes
Extraction Modes
For non-highlight extraction, choose with --mode:
- text (default): plain text, usually fastest for agents
- markdown: preserves heading/list structure when available
- auto: tries markdown first, then falls back to text if quality looks poor
Spacing normalization is enabled by default to reduce merged-word artifacts. Disable with --no-normalize-spacing.
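The auto behavior described above can be pictured with a small Python sketch. This is only an illustration of the "markdown first, text as fallback" idea: the quality check `looks_poor`, its thresholds, and the function names are hypothetical and are not taken from the tool's implementation, which may judge quality quite differently.

```python
def looks_poor(markdown: str, min_chars: int = 40,
               max_long_token_ratio: float = 0.2) -> bool:
    """Hypothetical quality check (an assumption, not the tool's logic):
    flag output that is nearly empty or dominated by very long tokens,
    which is a typical symptom of merged words."""
    tokens = markdown.split()
    if len(markdown.strip()) < min_chars or not tokens:
        return True
    long_tokens = sum(1 for t in tokens if len(t) > 30)
    return long_tokens / len(tokens) > max_long_token_ratio


def auto_mode(markdown: str, text: str) -> tuple[str, str]:
    """Return (mode_used, content): prefer markdown, fall back to text."""
    if looks_poor(markdown):
        return ("text", text)
    return ("markdown", markdown)
```

A caller would pass both extraction results and keep whichever one `auto_mode` returns, for example `mode, content = auto_mode(md_output, text_output)`.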
Install
Local build/run
Install binary locally
Then:
Usage
Help
Single page
Page range (inclusive)
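Because the range is inclusive and page indices are zero-based (see Notes), page numbers read off a PDF viewer need to be shifted by one. The Python helper below is a minimal sketch of that conversion. The helper name is hypothetical, and it assumes that --start-page and --end-page each take an integer value as the next argument.

```python
def page_range_flags(first_page: int, last_page: int) -> list[str]:
    """Convert 1-based, inclusive viewer page numbers into zero-based
    --start-page/--end-page arguments (flag value syntax is assumed)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return ["--start-page", str(first_page - 1),
            "--end-page", str(last_page - 1)]


# Viewer pages 3 to 5 map to zero-based indices 2 to 4.
print(page_range_flags(3, 5))  # ['--start-page', '2', '--end-page', '4']
```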
All pages (default behavior)
Abstract-only
Highlights + notes only
Force markdown mode
Auto fallback mode
Write to file
Recommended Presets
Paper triage (fast)
Quick skim with structure
Robust default for noisy PDFs
Annotation mining workflow
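For the downstream side of this workflow, a short Python sketch can count tags across annotation notes. It assumes the notes have already been collected into a list of strings and that tags follow a `<tag-name>` convention such as `<method>` or `<limitation>`. The exact layout of the tool's highlight output is not specified here, so the input format and the function name are assumptions.

```python
import re
from collections import Counter

# Matches tags like <method> or <paper-idea> (lowercase, digits, hyphens).
TAG_PATTERN = re.compile(r"<([a-z][a-z0-9-]*)>")


def count_tags(notes: list[str]) -> Counter:
    """Count in how many notes each tag appears (once per note)."""
    counts: Counter = Counter()
    for note in notes:
        counts.update(set(TAG_PATTERN.findall(note.lower())))
    return counts


# Example notes, invented purely for illustration.
notes = [
    "<method> contrastive pretraining on page crops",
    "<limitation> only evaluated on English PDFs",
    "<method> <paper-idea> reuse the encoder for table detection",
]
print(count_tags(notes).most_common())
```

Counting each tag once per note keeps a single heavily tagged note from dominating the totals, which matters when the counts feed clustering or trend analysis.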
Notes
- Page indices are zero-based.
- --start-page and --end-page must be used together.
- --abstract cannot be combined with page/range/all or --highlight.
- Pro-tip: add standardized tags to annotation notes (e.g., <paper-idea>, <method>, <limitation>) for downstream clustering and trend discovery.
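A wrapper script that calls the CLI can check these rules before spawning a process. The Python sketch below mirrors the constraints listed above. The parameter names `page` and `all_pages` are hypothetical stand-ins for the single-page and all-pages options, whose actual flag names are not given here, and the tool's own error messages may differ.

```python
from typing import Optional


def check_flags(abstract: bool = False, highlight: bool = False,
                page: Optional[int] = None,
                start_page: Optional[int] = None,
                end_page: Optional[int] = None,
                all_pages: bool = False) -> list[str]:
    """Return a list of rule violations; an empty list means the
    combination is consistent with the notes above."""
    errors = []
    # --start-page and --end-page must be used together.
    if (start_page is None) != (end_page is None):
        errors.append("--start-page and --end-page must be used together")
    # --abstract excludes page, range, all-pages, and --highlight.
    page_selection = (page is not None or start_page is not None
                      or end_page is not None or all_pages)
    if abstract and (page_selection or highlight):
        errors.append("--abstract cannot be combined with "
                      "page/range/all or --highlight")
    # Zero-based indices cannot be negative.
    for value in (page, start_page, end_page):
        if value is not None and value < 0:
            errors.append("page indices are zero-based and must be >= 0")
            break
    return errors
```

For example, `check_flags(start_page=0)` reports the missing --end-page, while `check_flags(start_page=0, end_page=2)` returns an empty list.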
License
MIT. See LICENSE.
Author
Edgar Torres (edgar.torres@ki.uni-stuttgart.de)