Crate crw_diff

Stateless change-tracking diff engine for CRW monitors.

Pure, synchronous, no I/O, no LLM. Given the current scrape (markdown + optionally extracted JSON) and a caller-supplied previous snapshot, it classifies the page (same / changed), computes the requested diff surfaces, and returns the current snapshot to persist as the next baseline.

§Caller-supplied JSON invariant

current_json is the already-extracted structured JSON supplied by the orchestration layer. This crate NEVER extracts JSON itself and does not depend on crw-extract — the LLM/judge live upstream.

§Mode-aware hashing

content_hash is the normalized-markdown hash in gitDiff/mixed mode, and the canonicalized tracked-JSON hash in json-only mode. The SaaS store-skip short-circuit keys off this hash.
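
As a rough illustration of the mode-aware hashing described above, the sketch below normalizes markdown (CRLF, trailing whitespace, blank-line runs) and then picks the hash input by mode. Everything named here (`Mode`, `normalize_markdown`, `content_hash`, `should_skip_store`) is hypothetical and not taken from the crate's API. The 64-bit `DefaultHasher` is only a stand-in, since the docs do not say which hash algorithm the crate uses, and the handling of a missing JSON extraction is likewise an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical mode enum; the crate's real type names are not shown in these docs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    GitDiff,
    Mixed,
    JsonOnly,
}

/// Strip cosmetic churn: CRLF line endings, trailing whitespace,
/// runs of blank lines, and leading/trailing blank lines.
pub fn normalize_markdown(input: &str) -> String {
    let unified = input.replace("\r\n", "\n").replace('\r', "\n");
    let mut out: Vec<&str> = Vec::new();
    let mut prev_blank = true; // starting as true drops leading blank lines
    for line in unified.lines() {
        let line = line.trim_end();
        if line.is_empty() {
            if !prev_blank {
                out.push(""); // keep one blank line per run
            }
            prev_blank = true;
        } else {
            out.push(line);
            prev_blank = false;
        }
    }
    while out.last() == Some(&"") {
        out.pop(); // drop a trailing blank line
    }
    out.join("\n")
}

/// Placeholder hash. DefaultHasher output is not guaranteed stable across
/// Rust releases, so a persisted baseline would need a fixed algorithm.
fn hash_str(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Markdown drives the hash in gitDiff/mixed mode; the canonicalized
/// tracked JSON (supplied by the caller as a string) drives it in json-only mode.
pub fn content_hash(mode: Mode, markdown: &str, canonical_tracked_json: Option<&str>) -> u64 {
    match mode {
        Mode::GitDiff | Mode::Mixed => hash_str(&normalize_markdown(markdown)),
        // Assumption: a missing extraction hashes like JSON null.
        Mode::JsonOnly => hash_str(canonical_tracked_json.unwrap_or("null")),
    }
}

/// A store-skip check only needs to compare the two hashes.
pub fn should_skip_store(previous_hash: Option<u64>, current_hash: u64) -> bool {
    previous_hash == Some(current_hash)
}
```

Under these assumptions, a page whose markdown changes only in line endings keeps the same hash in gitDiff/mixed mode, and a json-only monitor ignores markdown edits entirely as long as the tracked JSON is unchanged.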

Modules§

git_diff
Git-diff (markdown) surface: a unified text diff plus a parse-diff-style AST, BOTH derived from the same similar op stream so they can never disagree. There is no parse-diff crate in Rust; the AST is synthesized directly from similar’s DiffOp/ChangeTag stream.
json_diff
JSON-mode per-field diff. Walks two extractions and emits a map keyed by field path (plans[0].price, Firecrawl style) to {previous, current} pairs. Added fields have previous: null; removed fields have current: null.
snapshot
Markdown normalization + content hashing. Single source of truth for the content_hash so cosmetic churn (trailing whitespace, blank-line runs, CRLF) never flips a page from same to changed.
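
To make the json_diff output shape listed above more concrete, here is a minimal, dependency-free sketch of a per-field diff keyed by field path. The `Json` enum, `FieldChange`, `flatten`, and `diff_fields` are all hypothetical stand-ins written for this example (the crate's own types and its JSON representation are not shown in these docs), and treating empty objects and arrays as leaves is an assumption.

```rust
use std::collections::BTreeMap;

/// Tiny stand-in JSON value so the sketch needs no external crate.
#[derive(Clone, Debug, PartialEq)]
pub enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

/// One entry of the diff map: the value before and after.
#[derive(Clone, Debug, PartialEq)]
pub struct FieldChange {
    pub previous: Json,
    pub current: Json,
}

/// Flatten a value into leaf-path -> leaf-value pairs, using
/// `a.b` for object keys and `a[0]` for array indices.
fn flatten(prefix: &str, value: &Json, out: &mut BTreeMap<String, Json>) {
    match value {
        Json::Obj(map) if !map.is_empty() => {
            for (key, child) in map {
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}.{key}")
                };
                flatten(&path, child, out);
            }
        }
        Json::Arr(items) if !items.is_empty() => {
            for (index, child) in items.iter().enumerate() {
                flatten(&format!("{prefix}[{index}]"), child, out);
            }
        }
        // Scalars (and, by assumption, empty containers) are leaves.
        leaf => {
            out.insert(prefix.to_string(), leaf.clone());
        }
    }
}

/// Compare two extractions and report only the paths whose values differ.
/// Added paths get `previous: Null`; removed paths get `current: Null`.
pub fn diff_fields(previous: &Json, current: &Json) -> BTreeMap<String, FieldChange> {
    let (mut old, mut new) = (BTreeMap::new(), BTreeMap::new());
    flatten("", previous, &mut old);
    flatten("", current, &mut new);

    let mut changes = BTreeMap::new();
    // Changed or removed paths.
    for (path, old_value) in &old {
        let new_value = new.get(path).cloned().unwrap_or(Json::Null);
        if *old_value != new_value {
            changes.insert(
                path.clone(),
                FieldChange { previous: old_value.clone(), current: new_value },
            );
        }
    }
    // Added paths.
    for (path, new_value) in &new {
        if !old.contains_key(path) && *new_value != Json::Null {
            changes.insert(
                path.clone(),
                FieldChange { previous: Json::Null, current: new_value.clone() },
            );
        }
    }
    changes
}
```

One consequence of using null for both "absent" and "explicitly null" is that a field moving between those two states produces no entry in this sketch; whether the crate distinguishes them is not stated in these docs.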

Structs§

DiffLimits
Tunable limits for diff computation.

Constants§

DEFAULT_MAX_DIFF_CHANGES
Default cap on AST change-lines before the diff AST is truncated.

Functions§

compute_change_tracking
Compute change tracking with default limits. See the crate-level docs for the caller-supplied-JSON invariant.
compute_change_tracking_with_limits
Compute change tracking with explicit limits.