cli-pdf-extract 0.1.4

Fast Rust CLI wrapper around pdf_oxide for LLM-friendly PDF extraction
# cli-pdf-extract

`cli-pdf-extract` is a fast Rust CLI for LLM-friendly PDF inspection. It wraps `pdf_oxide` and focuses on practical research workflows: quick triage, targeted extraction, and lightweight downstream parsing.

## Why This Tool

LLMs are often slow when they must open and synthesize full PDFs. This CLI gives a faster path:

- Extract only what you need (`--abstract`, `--highlight`, page/range)
- Avoid heavy image payloads by default
- Prefer plain text output for fast agent throughput (default `--mode text`)

## Modalities

- `full text/pages`: extract one page, a range, or all pages (default if no page flags)
- `abstract`: extract only the abstract block for paper triage
- `highlight`: extract only PDF highlights and their notes

## Extraction Modes

For non-highlight extraction, choose with `--mode`:

- `text` (default): plain text, usually fastest for agents
- `markdown`: preserves heading/list structure when available
- `auto`: tries markdown first, then falls back to text if quality looks poor

Spacing normalization is enabled by default to reduce merged-word artifacts. Disable with `--no-normalize-spacing`.

## Install

### Local build/run

```bash
cargo run -- <PDF_PATH> --page 0
```

### Install binary locally

```bash
cargo install --path .
```

Then:

```bash
cli-pdf-extract <PDF_PATH> --page 0
```

## Usage

### Help

```bash
cli-pdf-extract --help
```

### Single page

```bash
cli-pdf-extract examples/main.pdf --page 0
```

### Page range (inclusive)

```bash
cli-pdf-extract examples/main.pdf --start-page 0 --end-page 5
```

### All pages (default behavior)

```bash
cli-pdf-extract examples/main.pdf
```

### Abstract-only

```bash
cli-pdf-extract examples/main.pdf --abstract
```

### Highlights + notes only

```bash
cli-pdf-extract examples/main.pdf --highlight
```

### Force markdown mode

```bash
cli-pdf-extract examples/main.pdf --mode markdown
```

### Auto fallback mode

```bash
cli-pdf-extract examples/main.pdf --mode auto
```

### Write to file

```bash
cli-pdf-extract examples/main.pdf --abstract --output abstract.txt
```

## Recommended Presets

### Paper triage (fast)

```bash
cli-pdf-extract paper.pdf --abstract
```

### Quick skim with structure

```bash
cli-pdf-extract paper.pdf --mode markdown --start-page 0 --end-page 2
```

### Robust default for noisy PDFs

```bash
cli-pdf-extract paper.pdf --mode auto --start-page 0 --end-page 5
```

### Annotation mining workflow

```bash
cli-pdf-extract paper.pdf --highlight
```

## Notes

- Page indices are zero-based.
- `--start-page` and `--end-page` must be used together.
- `--abstract` cannot be combined with page/range/all or `--highlight`.
- Pro-tip: add standardized tags to annotation notes (e.g., `<paper-idea>`, `<method>`, `<limitation>`) for downstream clustering and trend discovery.

## License

MIT. See `LICENSE`.

## Author

Edgar Torres (edgar.torres@ki.uni-stuttgart.de)