spdfdiff_cli 0.1.12

Command-line semantic PDF diff and comparison tool with JSON, Markdown, and HTML output.
spdfdiff_cli-0.1.12 is not a library.

spdfdiff_cli

Command-line semantic PDF diff and PDF comparison tool.

spdfdiff_cli provides the spdfdiff executable. It compares digitally generated PDFs through the workspace pipeline and writes deterministic JSON, AI-review JSON, Markdown, and self-contained HTML reports. It is intended for automation, regression checks, release gates, and evidence-preserving review workflows where a text-only or screenshot-only PDF diff is not enough.

Commands

  • diff <old.pdf> <new.pdf> runs the semantic PDF comparison pipeline and emits JSON, AI-review JSON, Markdown, or HTML. --fail-on-changes exits with code 1 when a completed diff contains changes.
  • inspect <file.pdf> parses a PDF with pdf_core and reports deterministic parser/object diagnostics plus simple tagged-structure and parent-tree summaries and incremental-update offsets when present.
  • extract <file.pdf> runs parse/content/text/semantic extraction across parsed page content and reports paragraph text, aligned text-grid table row/cell, row-span, column-span, merged-cell, repeated table header-row evidence, rectangle table-border hints, diagnostics, and tagged-structure summaries.
  • corpus <folder> scans .pdf files, runs parse/extract for each file, and writes stable aggregate totals, per-file status, extracted node counts, and diagnostic-code frequencies. With --manifest <json>, it also checks required files, runs declared diff pairs, emits diff diagnostic counts, and reports a deterministic release gate with a manifest compatibility label. Manifests can pin maximum partial-file counts, file diagnostic counts, and diff diagnostic counts as compatibility regression baselines. Files with only informational diagnostics remain parsed; partial is reserved for warning or error diagnostics. A public-alpha label is release-blocking unless the corpus gate has curated release evidence. With --fail-on-gate, a failed gate exits with code 1.
  • check --config .spdfdiff.toml runs configured PDF pairs for CI, writes deterministic artifacts, applies threshold and baseline suppression rules, and emits stable summary JSON on stdout. A failed check exits with code 1.
  • benchmark --pages <n> runs the synthetic benchmark path and reports deterministic phase timing fields for parse, extract, semantic, diff, report, and total work.
  • review <review.ai.json> sends deterministic AI-review JSON to an optional OpenAI-compatible HTTP endpoint such as local llama.cpp llama-server and writes a request/response envelope. This is outside the deterministic diff path.
  • visual-diff <old.pdf> <new.pdf> invokes an external renderer command twice, compares deterministic PPM page images pixel-by-pixel, writes stable JSON, and can write PPM heatmaps under --artifacts-dir.

Example

spdfdiff diff old.pdf new.pdf --format html --output diff.html
spdfdiff diff old.pdf new.pdf --format ai-json --output review.json
spdfdiff review review.json --endpoint http://127.0.0.1:8080/v1 --model local-model --output llm-review.json
spdfdiff extract old.pdf --format json --output extract.json
spdfdiff visual-diff old.pdf new.pdf --renderer-command .\render-to-ppm.cmd --artifacts-dir visual-artifacts --output visual.json
spdfdiff corpus samples --manifest samples\compatibility_corpus_manifest.json --output corpus.json --fail-on-gate
spdfdiff check --config .spdfdiff.toml

Visual Diff Renderer Contract

visual-diff keeps PDF rendering outside the core crates. The configured renderer command is executed once for the old PDF and once for the new PDF. The CLI sets these environment variables:

  • SPDFDIFF_RENDER_INPUT: PDF path to render.
  • SPDFDIFF_RENDER_OUTPUT_DIR: directory where rendered page images must be written.
  • SPDFDIFF_RENDER_ROLE: old or new.
  • SPDFDIFF_RENDER_FORMAT: currently ppm.
  • SPDFDIFF_RENDER_PAGE_PATTERN: page-%04d.ppm.

The renderer must write RGB PPM files such as page-0001.ppm. The command can be supplied with --renderer-command or SPDFDIFF_RENDER_COMMAND. When --artifacts-dir is set, rendered pages are preserved in old-rendered/ and new-rendered/, and changed pages get heatmaps in heatmaps/.

CI Check Config

schema_version = "1"
output_dir = "spdfdiff-check"
formats = ["json", "html"]
fail_on_changes = true

[[pairs]]
name = "contract"
old = "old.pdf"
new = "new.pdf"
baseline = "approved-contract-diff.json"
max_diagnostics = 0

Baseline files are normal spdfdiff diff --format json reports. Matching baseline changes and configured ignore_change_kinds values are counted as suppressed, while remaining unsuppressed changes drive the check exit code and summary report.

Local LLM Review

The review command targets local OpenAI-compatible HTTP servers. For llama.cpp:

llama-server -m C:\models\model.gguf --host 127.0.0.1 --port 8080 -c 8192
spdfdiff review review.ai.json --endpoint http://127.0.0.1:8080/v1 --model local-model --output review.llm.json

The command supports optional --api-key, --timeout-seconds, and --max-review-items. It supports plain http:// endpoints so local-first review works without adding TLS or hosted provider dependencies.

What It Compares Today

  • Extracted paragraph text and deterministic text hunks.
  • Controlled multi-column reading order plus repeated header, footer, and page-template candidate counts in extract JSON.
  • Semantic-role evidence in diff and AI-review reports, including RepeatedPageRegion tags for changed header, footer, and page-template candidates.
  • Moved blocks and layout-only changes when text anchors and bounding boxes support them.
  • Simple aligned text-grid table candidates with row/cell, sparse blank-cell, row-span, column-span, merged-cell, repeated header-row, continuation-group, and rectangle border-hint evidence.
  • Image XObject payload changes by deterministic stream hash.
  • Native vector path operations and graphic-style operations by deterministic parsed content-operation signature.
  • Rendered page pixels through the optional external visual-diff adapter, including changed-pixel counts, max channel deltas, page status, and optional heatmap artifacts.
  • Text font resource and font-size changes for unchanged text as deterministic StyleChanged entries.
  • Text extraction that prefers /ToUnicode and uses a conservative Base14 Latin fallback for safe Helvetica, Times, and Courier-family simple fonts.
  • Link/annotation semantic fields, including subtype, rectangle, URI or destination, contents, color, border, and quad-point evidence.
  • Selected report-facing document surfaces such as embedded-file/FileSpec objects, outline-like objects, and metadata/XMP objects by deterministic object hash.
  • Simple tagged-PDF structure markers, /RoleMap summaries, and MCID-backed text mapping.

OCR Path

For image-only PDFs, the CLI can OCR supported high-contrast image XObjects with an external engine. Set SPDFDIFF_OCR_COMMAND to a command that accepts a PPM path and writes recognized text to stdout, or install tesseract for the default adapter:

tesseract <image> stdout --psm 6

OCR is an adapter path, not a replacement for parser/content diagnostics.

Current Compatibility Boundary

Native vector/style comparison is a parsed-operation signature comparison, not a pixel renderer. Text style classification currently covers content-stream font resource and font-size changes for unchanged text. Base14 fallback is limited to safe Latin simple-font bytes and does not claim broad font encoding support. Link/annotation comparison is field-level semantic comparison, not JavaScript/action execution. Renderer-grade visual diffing is available only through the external visual-diff renderer adapter; native renderer integration and renderer-grade table reconstruction from arbitrary drawing geometry remain incremental compatibility work. Unsupported surfaces are reported through stable diagnostics instead of being silently treated as supported semantic diffs.