Module decode_structure


Decode-structure analysis: classify what a candidate base64/hex-decodes to (binary asset magic bytes, protobuf wire) so that keyhog’s decode-through advantage feeds scoring.

A generic high-entropy candidate (caught by generic-secret, generic-password, entropy-*) is ambiguous on its surface: a real base64/hex secret and a base64-wrapped binary asset (a PNG, a gzip blob, a serialized protobuf, an embedded cert) look identical to an entropy/regex/token-efficiency filter. The distinguishing signal is what the candidate decodes to - and keyhog already decodes. This module turns the decoded bytes into a verdict the confidence pipeline (and, later, the ML feature vector) can use.

The verdict is built only on definitional signals, so it never false-suppresses a real credential:

Magic bytes. A blob that decodes to a PNG/JPEG/GIF/gzip/zip/PDF/ELF/Mach-O/PE/zstd/xz/bzip2/7z/SQLite/Java-class header IS that format. Over 3000 random 24-48 byte secrets, ZERO carry any of these headers at offset 0 (they are 4-8 specific bytes out of 256^k).
Full protobuf-wire parse. Bytes that parse end-to-end as a protobuf wire stream (valid field tags, valid wire types, length-delimited fields that stay in bounds, whole buffer consumed) with several fields are a serialized message. Random bytes parse this way <0.5% of the time, and we additionally require >= 3 fields and >= 8 bytes.
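
The two definitional signals above can be sketched roughly as follows. This is an illustration only, not the module's actual code: the names MAGIC, magic_format, read_varint, count_protobuf_fields and looks_like_protobuf are hypothetical, the magic table is a small subset, and the sketch rejects group wire types (3 and 4), which the real parser may or may not accept.

```rust
/// Hypothetical subset of a magic-byte table (the module's real table is not shown).
const MAGIC: &[(&str, &[u8])] = &[
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("gzip", b"\x1f\x8b"),
    ("zip", b"PK\x03\x04"),
    ("pdf", b"%PDF-"),
    ("elf", b"\x7fELF"),
];

/// Return the format whose signature appears at offset 0, if any.
fn magic_format(bytes: &[u8]) -> Option<&'static str> {
    MAGIC
        .iter()
        .find(|(_, sig)| bytes.starts_with(sig))
        .map(|(name, _)| *name)
}

/// Read a base-128 varint; returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // empty, truncated, or longer than 10 bytes
}

/// Parse the whole buffer as a protobuf wire stream and count its fields.
/// Returns None as soon as anything is invalid or runs out of bounds.
fn count_protobuf_fields(buf: &[u8]) -> Option<usize> {
    let mut pos = 0usize;
    let mut fields = 0usize;
    while pos < buf.len() {
        let (tag, n) = read_varint(&buf[pos..])?;
        pos += n;
        let field_number = tag >> 3;
        if field_number == 0 || field_number > 0x1FFF_FFFF {
            return None; // field numbers are 1..=2^29-1
        }
        let skip = match tag & 7 {
            0 => read_varint(&buf[pos..])?.1, // varint
            1 => 8,                           // 64-bit fixed
            2 => {
                // length-delimited: length prefix plus payload
                let (len, n) = read_varint(&buf[pos..])?;
                n.checked_add(usize::try_from(len).ok()?)?
            }
            5 => 4,           // 32-bit fixed
            _ => return None, // groups and undefined wire types rejected in this sketch
        };
        pos = pos.checked_add(skip)?;
        if pos > buf.len() {
            return None; // a field ran past the end of the buffer
        }
        fields += 1;
    }
    Some(fields)
}

/// Thresholds from the text: at least 8 bytes and at least 3 fields.
fn looks_like_protobuf(buf: &[u8]) -> bool {
    buf.len() >= 8 && matches!(count_protobuf_fields(buf), Some(n) if n >= 3)
}
```

Requiring that the whole buffer be consumed is what keeps the check strict: a single out-of-bounds length or dangling tag anywhere in the stream rejects the candidate.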

Printable-ratio is recorded for the future ML feature but is NOT used in the boolean verdict: random secret bytes and binary blobs both sit around 37-50% printable, so it is too weak to gate suppression on its own.
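
For reference, a printable ratio can be computed as below. This is a minimal sketch that assumes "printable" means ASCII 0x20 through 0x7E; the module's exact definition is not stated here, and printable_ratio is a hypothetical name. Under that assumption, uniformly random bytes land near 95/256, about 37%, which matches the low end of the range quoted above.

```rust
/// Fraction of bytes that are printable ASCII (space through '~').
/// Assumption: the real feature may define "printable" differently.
fn printable_ratio(bytes: &[u8]) -> f64 {
    if bytes.is_empty() {
        return 0.0;
    }
    let printable = bytes
        .iter()
        .filter(|b| b.is_ascii_graphic() || **b == b' ')
        .count();
    printable as f64 / bytes.len() as f64
}
```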

Tests live in tests/unit/decode_structure*.rs (Santh no-inline-tests contract).

Structs§

DecodeStructure: Structured view of what a candidate decodes to. Carried as-is into the ML feature vector once the model is retrained; consumed today by is_encoded_binary.

Constants§

PLACEHOLDER_WORDS: Placeholder words that mark a credential as a documentation sample, not a real secret. The single source of truth for the lowercase byte-slice placeholder set: consumed for the SURFACE form by confidence::penalties::contains_placeholder_word and for the BASE64 / HEX decoded form by this module’s decoded_contains_placeholder (so a base64-wrapped AKIAEXAMPLEEXAMPLE12 = QUtJQUVYQU1QTEVFWEFNUExFMTI= is still caught).

Functions§

analyze: Decode candidate (base64 standard, base64 url-safe, or hex) and describe the resulting bytes. Returns a default (non-decodable) structure when the candidate is too short or not a clean encoding.
decoded_contains_placeholder: Decode candidate (base64 / url-safe-base64 / hex) and check whether the decoded bytes contain any placeholder word case-insensitively. Composes keyhog’s decode-through with the placeholder suppression: a docs sample that arrives base64-wrapped (e.g. AWS docs publishing AKIAEXAMPLEEXAMPLE12 as the base64-encoded body of a yaml secret) is now recognized as a sample even though the surface form looks like high-entropy random bytes. Mirror v26: 9 docs-example-marker FPs (all QUtJQUVYQU1QTEVFWEFNUExFMTI=, base64 of AKIA…EXAMPLE…12) collapsed by this gate. Memoized to match the existing is_encoded_binary call cadence.
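
A rough sketch of the decode-then-match idea behind decoded_contains_placeholder, using only the standard library. Everything here is an assumption made for illustration: the word list is a made-up stand-in for PLACEHOLDER_WORDS, decode_base64 is a hand-rolled lenient decoder rather than whatever keyhog uses, and the hex path and memoization are left out.

```rust
/// Hypothetical stand-in for the real PLACEHOLDER_WORDS set (lowercase byte slices).
const PLACEHOLDER_WORDS_SKETCH: &[&[u8]] = &[b"example", b"placeholder", b"changeme"];

/// Decode standard or URL-safe base64, ignoring trailing '=' padding.
fn decode_base64(s: &str) -> Option<Vec<u8>> {
    let body = s.trim_end_matches('=');
    if body.len() % 4 == 1 {
        return None; // impossible length for base64
    }
    let mut out = Vec::with_capacity(body.len() * 3 / 4);
    let (mut acc, mut bits) = (0u32, 0u32);
    for c in body.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            _ => return None, // not a clean encoding
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8); // keep the low 8 bits
        }
    }
    Some(out)
}

/// True when the decoded bytes contain any placeholder word, case-insensitively.
fn decoded_contains_placeholder_sketch(candidate: &str) -> bool {
    let Some(bytes) = decode_base64(candidate) else {
        return false;
    };
    let lower = bytes.to_ascii_lowercase();
    PLACEHOLDER_WORDS_SKETCH
        .iter()
        .any(|w| lower.windows(w.len()).any(|win| win == *w))
}
```

With the sample from the text, QUtJQUVYQU1QTEVFWEFNUExFMTI= decodes to AKIAEXAMPLEEXAMPLE12, which lowercases to a string containing "example", so the sketch flags it.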
decoded_is_base64_blob: True when value base64-decodes to bytes that are themselves all in the base64 alphabet (double-encoded base64). k8s data: fields wrap their values in another base64 layer; the inner decoded bytes are the actual user content, and when those bytes are themselves a printable base64 blob the outer wrapper is categorically data, not a credential.
is_encoded_binary: Conservative verdict for the confidence pipeline: does this generic candidate decode to identifiable binary / serialized data? Real secrets return false.
is_random_base64_blob: Unified shape-only gate for the “uniform random base64 blob” class - the single parameterized definition behind every base64-protobuf-decoy gate in the scanner. Reconciles two previously-divergent copies (this module’s penalty-path looks_like_uniform_base64_blob and the entropy-path’s engine::fallback_entropy_helpers::entropy_path_looks_like_random_base64_blob) so their length/diversity bands are tuned in one place and can never drift in opposite directions un-benched again.
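
One way to picture a single parameterized gate is sketched below. The signature, the name is_random_base64_blob_sketch, and the reading of "diversity" as the count of distinct characters are all assumptions; the real function's parameters and extra structure-marker checks are not shown on this page.

```rust
use std::collections::HashSet;
use std::ops::RangeInclusive;

/// Shape-only check: length inside `len_band`, only base64-alphabet characters,
/// and at least `min_distinct` different characters.
/// Assumption: "diversity" here means distinct-character count.
fn is_random_base64_blob_sketch(
    value: &str,
    len_band: RangeInclusive<usize>,
    min_distinct: usize,
) -> bool {
    if !len_band.contains(&value.len()) {
        return false;
    }
    let body = value.trim_end_matches('=');
    let in_alphabet = body
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'-' | b'_'));
    if !in_alphabet {
        return false;
    }
    let distinct: HashSet<u8> = body.bytes().collect();
    distinct.len() >= min_distinct
}
```

A penalty-path wrapper would then be a one-line call such as is_random_base64_blob_sketch(value, 44..=600, 32), using the band and floor quoted for looks_like_uniform_base64_blob, while the entropy path would pass its own band. Keeping both callers on one function is what prevents the two definitions from drifting apart.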
looks_like_uniform_base64_blob: Shape-only check: does value look like a uniform base64 blob with no structure markers? Thin wrapper over is_random_base64_blob with the penalty-path band (44..=600) and diversity floor (32). Matches the random-base64-protobuf corpus shape (random bytes base64-encoded into a password=/secret= slot) without firing on real service-anchored credentials: