spdfdiff_cli 0.1.2

Command-line interface for semantic PDF diffing.

spdfdiff_cli

CLI entry point for the spdfdiff commands diff, inspect, extract, and corpus.

Current command behavior:

  • diff: runs the vertical-slice semantic diff pipeline and emits JSON/Markdown/HTML. --fail-on-changes exits with code 1 when a completed diff contains changes.
  • inspect: parses a PDF with pdf_core and reports deterministic parser/object diagnostics plus simple tagged-structure and parent-tree summaries when present.
  • extract: runs parse/content/text/semantic extraction across parsed page content and reports extracted paragraph text, simple aligned text-grid table row/cell evidence, diagnostics summary, and simple tagged-structure summary when present.
  • corpus: scans a folder for .pdf files, runs parse/extract on each file, and writes stable aggregate totals (total, parsed, partial, failed), per-file status, extracted node counts, and diagnostic-code frequency. With --manifest <json>, it also checks required files, runs declared diff pairs, emits diff diagnostic counts, and reports a deterministic release gate. With --fail-on-gate, a failed gate exits with code 1.
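
The "stable aggregate totals" produced by corpus can be pictured with a short sketch. This is an illustration only, not the CLI's implementation: the record fields (file, status, nodes, diagnostics), the diagnostic code strings, and the JSON layout are all hypothetical names chosen for the example. The point is that sorting files and diagnostic codes before serializing makes the report independent of directory-scan order.

```python
import json
from collections import Counter


def aggregate(records):
    """Fold hypothetical per-file records into order-independent corpus totals."""
    status_counts = Counter(r["status"] for r in records)
    code_counts = Counter(code for r in records for code in r["diagnostics"])
    return {
        "total": len(records),
        "parsed": status_counts["parsed"],
        "partial": status_counts["partial"],
        "failed": status_counts["failed"],
        # Sort by file name so the scan order cannot change the output.
        "files": sorted(
            (
                {"file": r["file"], "status": r["status"], "nodes": r["nodes"]}
                for r in records
            ),
            key=lambda entry: entry["file"],
        ),
        # Sort diagnostic codes so the frequency table is stable as well.
        "diagnostic_codes": dict(sorted(code_counts.items())),
    }


def render(summary):
    """Serialize with sorted keys so equal summaries produce identical text."""
    return json.dumps(summary, indent=2, sort_keys=True)


# Hypothetical sample data; the status values mirror the totals named above.
records = [
    {"file": "b.pdf", "status": "partial", "nodes": 4, "diagnostics": ["D002", "D001"]},
    {"file": "a.pdf", "status": "parsed", "nodes": 12, "diagnostics": []},
    {"file": "c.pdf", "status": "failed", "nodes": 0, "diagnostics": ["D001"]},
]
summary = aggregate(records)
```

Feeding the same records in a different order yields byte-identical rendered output, which is the property a deterministic release gate would need from its input.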

The CLI compares image XObject payloads and selected annotation, attachment, outline, and metadata objects by deterministic hash and emits object-level changes in diff reports. It still emits stable unsupported-feature diagnostics for native vector graphic comparison and incomplete annotation/link semantics.

For image-only PDFs, the CLI can OCR supported high-contrast image XObjects with an external engine. Set SPDFDIFF_OCR_COMMAND to a command that accepts a PPM path and writes recognized text to stdout, or install tesseract to use the default adapter, which invokes tesseract <image> stdout --psm 6.
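
A custom OCR command only has to honor the contract described above: take a PPM path and print recognized text to stdout. The following is a minimal sketch of such an adapter, under stated assumptions: that the PPM path arrives as the single command-line argument, and that the adapter's own exit codes (2 for bad usage, 1 for an unreadable image or missing engine) are acceptable to the CLI. Neither is confirmed by this documentation. The script name ocr_adapter.py and the function names are hypothetical. The sketch checks the PPM header and then delegates to tesseract with the same arguments as the default adapter.

```python
import subprocess
import sys


def read_ppm_header(path):
    """Return (magic, width, height, maxval) from a PPM file, skipping # comments."""
    with open(path, "rb") as handle:
        data = handle.read(512)
    tokens = []
    i = 0
    while len(tokens) < 4 and i < len(data):
        char = data[i:i + 1]
        if char == b"#":
            # A comment runs to the end of the line.
            while i < len(data) and data[i:i + 1] != b"\n":
                i += 1
        elif char.isspace():
            i += 1
        else:
            start = i
            while i < len(data) and not data[i:i + 1].isspace():
                i += 1
            tokens.append(data[start:i].decode("ascii", errors="replace"))
    if len(tokens) < 4 or tokens[0] not in ("P3", "P6"):
        raise ValueError(f"{path}: not a PPM file")
    return tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3])


def main(argv):
    """Adapter entry point. Assumption: the PPM path is the only argument."""
    if len(argv) != 2:
        print("usage: ocr_adapter.py <image.ppm>", file=sys.stderr)
        return 2
    path = argv[1]
    try:
        read_ppm_header(path)
        # Same invocation as the default adapter described above.
        result = subprocess.run(
            ["tesseract", path, "stdout", "--psm", "6"],
            capture_output=True,
            text=True,
        )
    except (OSError, ValueError) as err:
        # Covers unreadable files, non-PPM input, and a missing tesseract binary.
        print(f"ocr adapter: {err}", file=sys.stderr)
        return 1
    sys.stdout.write(result.stdout)
    return result.returncode
```

To use this as a standalone script, append a final line such as sys.exit(main(sys.argv)) and point SPDFDIFF_OCR_COMMAND at it. Swapping in a different engine only means changing the subprocess.run argument list, as long as the recognized text still ends up on stdout.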