cli-pdf-extract-0.1.4 is not a library.

cli-pdf-extract

cli-pdf-extract is a fast Rust CLI for LLM-friendly PDF inspection. It wraps pdf_oxide and focuses on practical research workflows: quick triage, targeted extraction, and lightweight downstream parsing.

Why This Tool

LLMs are often slow when they must open and synthesize full PDFs. This CLI gives a faster path:

Extract only what you need (--abstract, --highlight, page/range)
Avoid heavy image payloads by default
Prefer plain text output for fast agent throughput (default --mode text)

Modalities

full text/pages: extract one page, a range, or all pages (default if no page flags)
abstract: extract only the abstract block for paper triage
highlight: extract only PDF highlights and their notes

Extraction Modes

For non-highlight extraction, choose with --mode:

text (default): plain text, usually fastest for agents
markdown: preserves heading/list structure when available
auto: tries markdown first, then falls back to text if quality looks poor

Spacing normalization is enabled by default to reduce merged-word artifacts. Disable with --no-normalize-spacing.

Install

Local build/run

cargo run -- <PDF_PATH> --page 0

Install binary locally

cargo install --path .

Then:

cli-pdf-extract <PDF_PATH> --page 0

Usage

Help

cli-pdf-extract --help

Single page

cli-pdf-extract examples/main.pdf --page 0

Page range (inclusive)

cli-pdf-extract examples/main.pdf --start-page 0 --end-page 5

All pages (default behavior)

cli-pdf-extract examples/main.pdf

Abstract-only

cli-pdf-extract examples/main.pdf --abstract

Highlights + notes only

cli-pdf-extract examples/main.pdf --highlight

Force markdown mode

cli-pdf-extract examples/main.pdf --mode markdown

Auto fallback mode

cli-pdf-extract examples/main.pdf --mode auto

Write to file

cli-pdf-extract examples/main.pdf --abstract --output abstract.txt

Recommended Presets

Paper triage (fast)

cli-pdf-extract paper.pdf --abstract

Quick skim with structure

cli-pdf-extract paper.pdf --mode markdown --start-page 0 --end-page 2

Robust default for noisy PDFs

cli-pdf-extract paper.pdf --mode auto --start-page 0 --end-page 5

Annotation mining workflow

cli-pdf-extract paper.pdf --highlight

Notes

Page indices are zero-based.
--start-page and --end-page must be used together.
--abstract cannot be combined with page/range/all or --highlight.
Pro-tip: add standardized tags to annotation notes (e.g., <paper-idea>, <method>, <limitation>) for downstream clustering and trend discovery.

License

MIT. See LICENSE.

Author

Edgar Torres (edgar.torres@ki.uni-stuttgart.de)

cli-pdf-extract 0.1.4