pdf-dump
A CLI tool for inspecting and debugging the internal structure of PDF files.
pdf-dump shows you what's actually inside a PDF: objects, streams, fonts, images, form fields, bookmarks, annotations, tagged structure, and more. It is useful for debugging PDF generation, understanding why a PDF looks wrong, or exploring the format.
Installation
Requires a Rust toolchain that supports edition 2024 (Rust 1.85 or later).
Quick Start
# Overview: metadata, validation summary, stream stats, feature indicators
# Extract text
# Search for text across pages
# Page info: dimensions, resources, fonts, annotations, text preview
# List fonts or images
# Explain a specific object
# Find all font objects
# Structural validation
# One-line listing of every object
Modes
Document-level modes (combinable)
These flags can be combined in a single invocation; each mode's output gets its own section header automatically:
| Flag | Description |
|---|---|
| `--text` | Extract readable text from content streams |
| `--operators` | Show content stream operators |
| `--find-text "pattern"` | Case-insensitive text search with context |
| `--fonts` | List all fonts with encoding and embedding details |
| `--images` | List all images with dimensions, color space, filters |
| `--forms` | List AcroForm fields with names, types, values |
| `--bookmarks` | Show the document outline tree |
| `--annotations` | Show annotations with link targets |
| `--tags` | Show tagged PDF structure tree (accessibility) |
| `--tree` | Show the object graph as an indented reference tree |
| `--validate` | Structural checks: broken refs, unreachable objects, required keys |
| `--list` | One-line-per-object table |
| `--detail <view>` | Detail views: security, embedded, labels, layers |
# Combine freely
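To make the `--validate` checks more concrete, here is a minimal sketch of two of them (broken references and unreachable objects) over a simplified object graph. It is not pdf-dump's implementation: the `Graph` type, the `check` function, and the idea of starting the walk from a single root object are assumptions made for illustration, and the required-keys check is not covered.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical simplified model: object number -> object numbers it references.
type Graph = BTreeMap<u32, Vec<u32>>;

/// Returns (broken references as (from, to) pairs, unreachable object numbers).
fn check(graph: &Graph, root: u32) -> (Vec<(u32, u32)>, Vec<u32>) {
    // A reference is broken when its target object does not exist.
    let mut broken = Vec::new();
    for (&from, targets) in graph {
        for &to in targets {
            if !graph.contains_key(&to) {
                broken.push((from, to));
            }
        }
    }

    // Depth-first walk from the root; anything never visited is unreachable.
    let mut seen = BTreeSet::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        if !graph.contains_key(&id) || !seen.insert(id) {
            continue; // missing target, or already visited
        }
        stack.extend(graph[&id].iter().copied());
    }
    let unreachable = graph
        .keys()
        .copied()
        .filter(|id| !seen.contains(id))
        .collect();

    (broken, unreachable)
}

fn main() {
    let mut graph = Graph::new();
    graph.insert(1, vec![2, 3]); // root object
    graph.insert(2, vec![9]);    // object 9 does not exist
    graph.insert(3, vec![]);
    graph.insert(4, vec![1]);    // nothing points at object 4

    let (broken, unreachable) = check(&graph, 1);
    println!("broken refs: {:?}", broken);
    println!("unreachable: {:?}", unreachable);
}
```

In a real PDF the walk would start from the trailer dictionary rather than a single hand-picked root.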
Standalone modes (one at a time)
| Flag | Description |
|---|---|
| `--object N` | Print object(s) by number (`5`, `1,5,12`, `3-7`) |
| `--inspect N` | Full explanation of an object's role and relationships |
| `--search <expr>` | Find objects matching criteria (`Type=Font`, `key=MediaBox`, `stream=text`) |
| `--extract-stream N --output file` | Extract a decoded stream to a file |
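The `--object` argument accepts a single number, a comma-separated list, or a range. The sketch below shows one way such a spec could be parsed; it is an illustration, not pdf-dump's parser. The function name `parse_object_spec`, the error messages, and the choice to accept mixed forms such as `1,3-5` are assumptions.

```rust
/// Parse a spec such as "5", "1,5,12", or "3-7" into object numbers.
/// Hypothetical helper; mixed specs like "1,3-5" are accepted as well.
fn parse_object_spec(spec: &str) -> Result<Vec<u32>, String> {
    let mut out = Vec::new();
    for part in spec.split(',') {
        let part = part.trim();
        if let Some((lo, hi)) = part.split_once('-') {
            // Range form "lo-hi", inclusive on both ends.
            let lo = lo
                .trim()
                .parse::<u32>()
                .map_err(|_| format!("bad range start in {part:?}"))?;
            let hi = hi
                .trim()
                .parse::<u32>()
                .map_err(|_| format!("bad range end in {part:?}"))?;
            if lo > hi {
                return Err(format!("descending range {part:?}"));
            }
            out.extend(lo..=hi);
        } else {
            // Single object number.
            let n = part
                .parse::<u32>()
                .map_err(|_| format!("bad object number {part:?}"))?;
            out.push(n);
        }
    }
    Ok(out)
}

fn main() {
    for spec in ["5", "1,5,12", "3-7", "7-3"] {
        println!("{spec:>8} -> {:?}", parse_object_spec(spec));
    }
}
```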
Modifiers
| Flag | Effect |
|---|---|
| `--page N` or `--page N-M` | Filter to specific pages; shows page info when used alone |
| `--json` | Structured JSON output (works with every mode) |
| `--decode` | Decompress stream contents |
| `--deref` | Inline-expand references (with `--object`) |
| `--depth N` | Limit traversal depth (with `--tree`, `--tags`, `--json`) |
| `--hex` | Hex dump for binary streams |
| `--raw` | Raw undecoded stream bytes (with `--object`) |
| `--truncate N` | Limit binary output to N bytes |
| `--dot` | GraphViz DOT output (with `--tree`) |
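As a rough illustration of how `--hex` and `--truncate N` interact, the following sketch formats a byte slice as a hex dump and stops after a byte limit. The `hex_dump` function and the exact line layout (offset, 16 hex bytes, ASCII column, trailing "more bytes" note) are assumptions for this example and may not match pdf-dump's actual output.

```rust
/// Format up to `limit` bytes as hex dump lines: offset, hex bytes, ASCII column.
/// Hypothetical layout, chosen only for illustration.
fn hex_dump(data: &[u8], limit: Option<usize>) -> String {
    // Apply the truncation limit, never reading past the end of the data.
    let shown = &data[..limit.map_or(data.len(), |n| n.min(data.len()))];
    let mut out = String::new();

    for (i, chunk) in shown.chunks(16).enumerate() {
        let hex: Vec<String> = chunk.iter().map(|b| format!("{b:02x}")).collect();
        // Printable ASCII is shown as-is; everything else becomes '.'.
        let ascii: String = chunk
            .iter()
            .map(|&b| if b.is_ascii_graphic() || b == b' ' { b as char } else { '.' })
            .collect();
        // 16 bytes * 2 hex digits + 15 separators = 47 columns.
        out.push_str(&format!("{:08x}  {:<47}  |{}|\n", i * 16, hex.join(" "), ascii));
    }

    if shown.len() < data.len() {
        out.push_str(&format!("... {} more bytes\n", data.len() - shown.len()));
    }
    out
}

fn main() {
    let stream = b"%PDF-1.7\n%binary stream bytes follow";
    print!("{}", hex_dump(stream, Some(16)));
}
```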
JSON Output
Every mode supports --json for structured output:
Supported Stream Filters
FlateDecode, ASCII85Decode, ASCIIHexDecode, LZWDecode, RunLengthDecode. When a stream declares more than one filter, they are applied sequentially, in the order listed.
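The sequential application of filters can be sketched with the two simplest ones, ASCIIHexDecode and RunLengthDecode, written here with only the standard library. This is a simplified illustration of the decoding rules from the PDF specification, not pdf-dump's or lopdf's code; the `Filter` enum and the `decode` function are hypothetical names.

```rust
#[derive(Clone, Copy)]
enum Filter {
    AsciiHex,
    RunLength,
}

/// ASCIIHexDecode: pairs of hex digits, whitespace ignored, '>' ends the data.
/// A trailing odd digit is treated as if followed by 0.
fn ascii_hex_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    let mut pending: Option<u8> = None;
    for &b in input {
        if b == b'>' {
            break;
        }
        if b.is_ascii_whitespace() {
            continue;
        }
        let nibble = (b as char)
            .to_digit(16)
            .ok_or_else(|| format!("invalid hex byte 0x{b:02x}"))? as u8;
        match pending.take() {
            Some(high) => out.push((high << 4) | nibble),
            None => pending = Some(nibble),
        }
    }
    if let Some(high) = pending {
        out.push(high << 4);
    }
    Ok(out)
}

/// RunLengthDecode: length byte 0..=127 copies the next len+1 bytes,
/// 129..=255 repeats the next byte 257-len times, 128 ends the data.
fn run_length_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let len = input[i];
        i += 1;
        match len {
            128 => break,
            0..=127 => {
                let n = len as usize + 1;
                let run = input.get(i..i + n).ok_or("truncated literal run")?;
                out.extend_from_slice(run);
                i += n;
            }
            129..=255 => {
                let byte = *input.get(i).ok_or("truncated repeat run")?;
                out.extend(std::iter::repeat(byte).take(257 - len as usize));
                i += 1;
            }
        }
    }
    Ok(out)
}

/// Apply each filter in order, feeding the output of one into the next.
fn decode(data: &[u8], filters: &[Filter]) -> Result<Vec<u8>, String> {
    let mut buf = data.to_vec();
    for filter in filters {
        buf = match filter {
            Filter::AsciiHex => ascii_hex_decode(&buf)?,
            Filter::RunLength => run_length_decode(&buf)?,
        };
    }
    Ok(buf)
}

fn main() {
    // Hex-encoded run-length data: literal "abc", then 'x' repeated 3 times.
    let raw = b"02 61 62 63 FE 78 80>";
    let decoded = decode(raw, &[Filter::AsciiHex, Filter::RunLength]).unwrap();
    println!("{}", String::from_utf8_lossy(&decoded));
}
```

The order matters: running the same filters in reverse on this input would treat the hex text as run-length data and produce garbage or an error.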
Acknowledgments
Built on lopdf, a pure-Rust PDF parsing library.
Related Projects
- medpdf — Medium-level PDF API over lopdf (includes medpdf-image for image embedding)
- pdf-maker — CLI tool for merging, watermarking, and manipulating PDF files
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.