pdfplumber-cli 0.2.0

Command-line tool to extract text, characters, words, and tables from PDF documents
pdfplumber-cli-0.2.0 is not a library.

pdfplumber-cli

Command-line tool to extract text, characters, words, and tables from PDF documents.

pdfplumber-cli is the CLI frontend for pdfplumber-rs, a Rust port of Python's pdfplumber.

Installation

cargo install pdfplumber-cli

Usage

pdfplumber <COMMAND> [OPTIONS] <FILE>

Subcommands

Command Description
text Extract text from PDF pages
chars Extract individual characters with coordinates
words Extract words with bounding box coordinates
tables Detect and extract tables from PDF pages
info Display PDF metadata and page information

Global Options

Option Description
--version Print version number
--help Print help information

Extract Text

# Extract all text
pdfplumber text document.pdf

# Extract text from specific pages
pdfplumber text document.pdf --pages 1,3-5

# Layout-preserving extraction
pdfplumber text document.pdf --layout

# JSON output (one object per page)
pdfplumber text document.pdf --format json

Extract Characters

# Tab-separated output (default)
pdfplumber chars document.pdf

# JSON output with all fields (text, fontname, size, bbox, etc.)
pdfplumber chars document.pdf --format json

# CSV output
pdfplumber chars document.pdf --format csv --pages 1

Example CSV output:

page,text,x0,top,x1,bottom,fontname,size
1,H,72.00,72.00,84.00,84.00,Helvetica,12.00
1,e,84.00,72.00,90.72,84.00,Helvetica,12.00

Extract Words

# Tab-separated output (default)
pdfplumber words document.pdf

# JSON output
pdfplumber words document.pdf --format json

# CSV output with custom tolerances
pdfplumber words document.pdf --format csv --x-tolerance 5.0 --y-tolerance 2.5

Example CSV output:

page,text,x0,top,x1,bottom
1,Hello,72.00,72.00,108.00,84.00
1,World,112.00,72.00,148.00,84.00

Extract Tables

# Human-readable grid format (default)
pdfplumber tables document.pdf

# JSON output
pdfplumber tables document.pdf --format json

# CSV output
pdfplumber tables document.pdf --format csv

# Use stream strategy instead of lattice
pdfplumber tables document.pdf --strategy stream

# Tune detection parameters
pdfplumber tables document.pdf --snap-tolerance 5.0 --join-tolerance 4.0 --text-tolerance 2.0

Example grid output:

--- Table 1 (page 1, bbox: [72.00, 100.00, 540.00, 300.00]) ---
Name   | Age | City
Alice  | 30  | New York
Bob    | 25  | London

Inspect PDF Info

# Text summary
pdfplumber info document.pdf

# JSON output
pdfplumber info document.pdf --format json

# Specific pages only
pdfplumber info document.pdf --pages 1-3

Example text output:

=== PDF Info ===
Pages: 3

--- Page 1 (612.00 x 792.00, rotation: 0°) ---
  Chars:  1250
  Lines:  45
  Rects:  12
  Curves: 0
  Images: 2

=== Summary ===
Total chars:  3200
Total tables: 1

Output Formats

Subcommand text (default) json csv
text Plain text JSON lines
chars TSV JSON array CSV
words TSV JSON array CSV
tables Grid JSON array CSV
info Summary JSON

Page Selection

Use --pages to select specific pages (1-indexed):

  • --pages 1 — single page
  • --pages 1-5 — range
  • --pages 1,3,5 — list
  • --pages 1-3,7,10-12 — mixed

Omit --pages to process all pages.

License

MIT OR Apache-2.0