Internal entropy shape-classification predicates, exposed for the
canonical-shape unit tests migrated out of src/entropy/scanner.rs
(KH-GAP-004). credential_keyword_context builds the production
credential anchor so tests need not know the private tuning constants.

Canonical detector_id per hot pattern - the id of the named detector the
fast-path represents, so scan output (JSON/SARIF/text/baselines) is
identical regardless of which engine path made the find. sq0csp- keeps
hot-square_secret: no standalone square-secret detector exists yet, so it
is genuinely fast-path-only (keyhog explain documents this). Static (not
format!-per-match) to keep the per-hit allocation the perf audit removed.

Canonical human-readable detector name per hot pattern (matches the name
field of the corresponding detectors/*.toml). Square has no canonical
detector, so it carries a plain “Square Secret” label.

service field per hot pattern - the CANONICAL service of the detector
this fast-path stands in for, NOT an internal *_key label. The hot path
is a perf optimization, not a distinct detector: a leaked AKIA… is an
aws-access-key finding however the engine found it. Before 2026-05-29
these were aws_key/github_pat/… so the SAME secret surfaced as
hot-aws_key/service aws_key on Linux (Hyperscan path) but
aws-access-key/service aws on macOS/Windows (portable, no hot path) -
a cross-platform id divergence. Emitting canonical identity here makes all
platforms agree and matches what keyhog explain already resolves hot ids
to. Index-parallel with HOT_PATTERNS / the two arrays below.

Attribute each global GPU match to its source chunk using the
coalesce-entry table (chunk_index, offset, len). Matches that
straddle a chunk boundary are dropped (the coalesce separator
makes a true cross-chunk hit impossible; this skip is the safety
net for any pid > total_patterns smuggled through).

Sort by (pid, start, end), fold same-pid overlapping spans, then
re-sort by start. The downstream chunk-attribution walk expects
matches in start-ascending order; the per-pid fold collapses the
duplicate (pid, start, end) triples that subgroup-ballot can
emit when a hit straddles a workgroup boundary.

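The two steps described above, the per-pid fold followed by the start-ascending chunk-attribution walk over the coalesce table, as an added example. This is illustrative code, not keyhog's own implementation; the struct names, the separator width and the sample hits and table are constructed assumptions, and the fold treats a span that starts strictly before the previous span's end as overlapping (adjacent spans are kept separate).

```rust
// Added example, not keyhog's own code: the merge step plus the coalesce-table
// attribution walk that consumes its output. Names and sample data are constructed.

/// One raw GPU hit: pattern id plus a half-open byte range in the coalesced buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Hit {
    pub pid: u32,
    pub start: usize,
    pub end: usize,
}

/// One coalesce-table entry: which chunk sits at which buffer offset, and how long it is.
#[derive(Clone, Copy, Debug)]
pub struct CoalesceEntry {
    pub chunk_index: usize,
    pub offset: usize,
    pub len: usize,
}

/// A hit mapped back to its chunk, with chunk-relative offsets.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ChunkHit {
    pub chunk_index: usize,
    pub pid: u32,
    pub start: usize,
    pub end: usize,
}

/// Sort by (pid, start, end), fold overlapping or duplicate spans of the same pid
/// into one, then re-sort by start so the attribution walk sees ascending starts.
pub fn merge_hits(mut hits: Vec<Hit>) -> Vec<Hit> {
    hits.sort_by_key(|h| (h.pid, h.start, h.end));
    let mut merged: Vec<Hit> = Vec::with_capacity(hits.len());
    for h in hits {
        match merged.last_mut() {
            // Same pid and this span starts before the previous one ends: extend it.
            // Exact duplicate triples (the subgroup-ballot case) take this branch too.
            Some(last) if last.pid == h.pid && h.start < last.end => {
                last.end = last.end.max(h.end);
            }
            _ => merged.push(h),
        }
    }
    merged.sort_by_key(|h| (h.start, h.end, h.pid));
    merged
}

/// Attribute each merged hit (start-ascending) to its source chunk. `entries` is
/// ordered by offset. Hits that are not wholly inside one entry (straddling the
/// separator) are dropped, as are hits whose pid is outside the pattern table.
pub fn attribute_hits(hits: &[Hit], entries: &[CoalesceEntry], total_patterns: u32) -> Vec<ChunkHit> {
    let mut out = Vec::new();
    let mut ei = 0usize;
    for h in hits {
        if h.pid >= total_patterns || h.end <= h.start {
            continue;
        }
        // Both inputs are start-ascending, so a single forward cursor suffices.
        while ei < entries.len() && entries[ei].offset + entries[ei].len <= h.start {
            ei += 1;
        }
        let e = match entries.get(ei) {
            Some(e) => e,
            None => break, // past the last chunk; every later hit is too
        };
        if h.start >= e.offset && h.end <= e.offset + e.len {
            out.push(ChunkHit { chunk_index: e.chunk_index, pid: h.pid, start: h.start - e.offset, end: h.end - e.offset });
        }
        // Otherwise the hit begins in the separator or crosses the chunk's end: dropped.
    }
    out
}

/// Constructed layout: chunk 0 occupies [0, 10), one separator byte, chunk 1 occupies [11, 19).
pub fn sample_entries() -> Vec<CoalesceEntry> {
    vec![
        CoalesceEntry { chunk_index: 0, offset: 0, len: 10 },
        CoalesceEntry { chunk_index: 1, offset: 11, len: 8 },
    ]
}

/// Constructed raw hits: two overlapping pid-0 spans, a duplicated pid-1 triple,
/// a pid-0 span that straddles the separator, and a pid beyond the table.
pub fn sample_hits() -> Vec<Hit> {
    vec![
        Hit { pid: 1, start: 12, end: 15 },
        Hit { pid: 0, start: 4, end: 8 },
        Hit { pid: 7, start: 0, end: 3 },
        Hit { pid: 0, start: 2, end: 6 },
        Hit { pid: 1, start: 12, end: 15 },
        Hit { pid: 0, start: 8, end: 13 },
    ]
}

fn main() {
    let merged = merge_hits(sample_hits());
    for h in attribute_hits(&merged, &sample_entries(), 3) {
        println!("chunk {} pid {} [{}, {})", h.chunk_index, h.pid, h.start, h.end);
    }
}
```

With the constructed input, the fold turns six raw hits into four, and the walk keeps two: pid 0 at [2, 8) in chunk 0 and pid 1 at [1, 4) in chunk 1. The straddling span [8, 13) is dropped because it ends past chunk 0, and the pid-7 hit is dropped by the `total_patterns` guard.
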
Large many-file batches with dense literal-prefix output are pathological
for the two-phase literal GPU path: phase 1 is fast, but phase 2 has to
confirm too many broad detector prefixes on CPU. Rerouting that batch
through the existing SIMD coalesced scanner preserves the finding contract
and avoids turning permissive prefixes into thousands of whole-chunk regex
confirmations.

FNV-1a hash of data. Non-cryptographic; used as a content key for dedup
and memoization across the scanner. Keep the seed/prime in sync here only -
every cache that keys on this depends on the value being identical.
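An added example of the hash and of a cache keyed on it; this is not keyhog's own implementation. It assumes the standard FNV-1a 64-bit offset basis and prime (the project's actual seed/prime are private, so those are an assumption), and the memo cache, its method names and the sample input are constructed for illustration.

```rust
use std::collections::HashMap;

// Added example, not keyhog's code. Standard FNV-1a 64-bit parameters are assumed;
// the project's real seed/prime are private to the scanner.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a over a byte slice: xor each byte in, then multiply by the prime (wrapping).
pub fn fnv1a_64(data: &[u8]) -> u64 {
    let mut h = FNV_OFFSET_BASIS;
    for &b in data {
        h ^= u64::from(b);
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Constructed memo cache: remembers the finding count for each distinct content so a
/// chunk seen twice is scanned once. The key pairs the hash with the length, a cheap
/// guard against two different inputs of different sizes sharing one 64-bit hash.
pub struct ScanMemo {
    results: HashMap<(u64, usize), usize>,
    pub scans_run: usize,
}

impl ScanMemo {
    pub fn new() -> Self {
        ScanMemo { results: HashMap::new(), scans_run: 0 }
    }

    /// Return the cached finding count for `data`, running `scan` only on a miss.
    pub fn findings_for(&mut self, data: &[u8], scan: fn(&[u8]) -> usize) -> usize {
        let key = (fnv1a_64(data), data.len());
        if let Some(&n) = self.results.get(&key) {
            return n;
        }
        self.scans_run += 1;
        let n = scan(data);
        self.results.insert(key, n);
        n
    }
}

/// Stand-in scanner for the demo: counts occurrences of the constructed marker "AKIA".
pub fn count_akia(data: &[u8]) -> usize {
    data.windows(4).filter(|w| *w == b"AKIA".as_slice()).count()
}

fn main() {
    let mut memo = ScanMemo::new();
    let chunk = b"token = AKIA_EXAMPLE_PLACEHOLDER"; // constructed input, not a real key
    let first = memo.findings_for(chunk, count_akia);
    let second = memo.findings_for(chunk, count_akia);
    println!("findings {} / {} (scans run: {})", first, second, memo.scans_run);
}
```

FNV-1a is a speed hash, not a collision-resistant one, which is the practical reason the comment above insists every cache agree on one seed/prime: the value is only a content key. Because a secret scanner's inputs are attacker-influenced, any cache that keys purely on this hash can in principle be made to treat two different inputs as the same content, so pairing the hash with the length (as here), or keeping such caches internal to a single trusted pipeline, is the defensive posture; across trust boundaries a cryptographic hash is the right key.
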