Decode-structure analysis: classify what a candidate base64/hex-decodes to (binary asset magic bytes, protobuf wire) so that keyhog’s decode-through advantage feeds into scoring.
A generic high-entropy candidate (caught by generic-secret,
generic-password, entropy-*) is ambiguous on its surface: a real
base64/hex secret and a base64-wrapped binary asset (a PNG, a gzip blob,
a serialized protobuf, an embedded cert) look identical to an
entropy/regex/token-efficiency filter. The distinguishing signal is what
the candidate decodes to - and keyhog already decodes. This module turns
the decoded bytes into a verdict the confidence pipeline (and, later, the ML
feature vector) can use.
The verdict is built only on definitional signals, so it never false-suppresses a real credential:
- Magic bytes. A blob that decodes to a PNG/JPEG/GIF/gzip/zip/PDF/ELF/Mach-O/PE/zstd/xz/bzip2/7z/SQLite/Java-class header IS that format. Over 3000 random 24-48 byte secrets, ZERO carry any of these headers at offset 0 (they are 4-8 specific bytes out of 256^k).
- Full protobuf-wire parse. Bytes that parse end-to-end as a protobuf wire stream (valid field tags, valid wire types, length-delimited fields that stay in bounds, whole buffer consumed) with several fields are a serialized message. Random bytes parse this way <0.5% of the time, and we additionally require >= 3 fields and >= 8 bytes.
Printable-ratio is recorded for the future ML feature but is NOT used in the boolean verdict: random secret bytes and binary blobs both sit around 37-50% printable, so it is too weak to gate suppression on its own.
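The two definitional signals above can be sketched in a few lines of std-only Rust. This is a minimal illustration, not the module's implementation: the names MAGIC, magic_format, protobuf_field_count and looks_like_encoded_binary are hypothetical, the magic table is a small subset of the formats listed, and rejecting group wire types (3 and 4) is an assumption of this sketch. The input is the already-decoded byte slice.

```rust
/// Hypothetical subset of the magic-byte table (the real list is longer).
const MAGIC: &[(&[u8], &str)] = &[
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF8", "gif"),
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),
    (b"%PDF-", "pdf"),
    (b"\x7fELF", "elf"),
];

/// Name of the format whose header sits at offset 0, if any.
fn magic_format(decoded: &[u8]) -> Option<&'static str> {
    MAGIC
        .iter()
        .find(|(sig, _)| decoded.starts_with(sig))
        .map(|(_, name)| *name)
}

/// Read one base-128 varint; returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // truncated or over-long varint
}

/// Number of fields if `buf` parses end-to-end as a protobuf wire stream.
fn protobuf_field_count(mut buf: &[u8]) -> Option<usize> {
    let mut fields = 0;
    while !buf.is_empty() {
        let (tag, n) = read_varint(buf)?;
        buf = &buf[n..];
        let field_number = tag >> 3;
        if field_number == 0 || field_number > 0x1FFF_FFFF {
            return None; // invalid field tag
        }
        let skip = match tag & 7 {
            0 => read_varint(buf)?.1, // varint
            1 => 8,                   // fixed 64-bit
            2 => {
                // length-delimited: the payload must stay in bounds
                let (len, n) = read_varint(buf)?;
                if len > (buf.len() - n) as u64 {
                    return None;
                }
                n + len as usize
            }
            5 => 4,           // fixed 32-bit
            _ => return None, // assumption: groups and unknown types are rejected
        };
        if skip > buf.len() {
            return None;
        }
        buf = &buf[skip..];
        fields += 1;
    }
    Some(fields) // whole buffer consumed
}

/// Verdict sketch: a known header, or a full wire parse with >= 3 fields and >= 8 bytes.
fn looks_like_encoded_binary(decoded: &[u8]) -> bool {
    magic_format(decoded).is_some()
        || (decoded.len() >= 8 && matches!(protobuf_field_count(decoded), Some(n) if n >= 3))
}
```

Printable ratio is deliberately absent from the verdict here, mirroring the note above that it is too weak to gate suppression on its own.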
Tests live in tests/unit/decode_structure*.rs (Santh no-inline-tests
contract).
Structs§
- DecodeStructure - Structured view of what a candidate decodes to. Carried as-is into the ML feature vector once the model is retrained; consumed today by is_encoded_binary.
Constants§
- PLACEHOLDER_WORDS - Placeholder words that mark a credential as a documentation sample, not a real secret. The single source of truth for the lowercase byte-slice placeholder set: consumed for the SURFACE form by confidence::penalties::contains_placeholder_word and for the BASE64 / HEX decoded form by this module’s decoded_contains_placeholder (so AKIAEXAMPLEEXAMPLE12 wrapped in base64 as QUtJQUVYQU1QTEVFWEFNUExFMTI= is still caught).
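A rough sketch of the decoded-form placeholder check, assuming a hand-rolled base64 decoder so the example stays std-only. SAMPLE_WORDS is a hypothetical stand-in for the real placeholder set (whose contents are not shown here), b64_decode and decoded_has_placeholder are illustrative names, and hex decoding and memoization are omitted.

```rust
/// Hypothetical stand-in for the placeholder set; lowercase byte slices.
const SAMPLE_WORDS: &[&[u8]] = &[b"example", b"placeholder", b"changeme"];

/// Minimal base64 decoder accepting both the standard and url-safe alphabets.
fn b64_decode(s: &str) -> Option<Vec<u8>> {
    let body = s.trim_end_matches('=');
    let mut out = Vec::with_capacity(body.len() * 3 / 4);
    let (mut acc, mut bits) = (0u32, 0u32);
    for c in body.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            _ => return None, // not a clean encoding
        };
        acc = (acc << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
        }
    }
    Some(out)
}

/// True when the decoded bytes contain a placeholder word, case-insensitively.
fn decoded_has_placeholder(candidate: &str) -> bool {
    let bytes = match b64_decode(candidate) {
        Some(b) => b,
        None => return false,
    };
    let lower = bytes.to_ascii_lowercase();
    SAMPLE_WORDS
        .iter()
        .any(|w| lower.windows(w.len()).any(|win| win == *w))
}
```

With this sketch, the base64 form QUtJQUVYQU1QTEVFWEFNUExFMTI= decodes to AKIAEXAMPLEEXAMPLE12 and is flagged because the lowercased bytes contain "example".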
Functions§
- analyze - Decode candidate (base64 standard, base64 url-safe, or hex) and describe the resulting bytes. Returns a default (non-decodable) structure when the candidate is too short or not a clean encoding.
- decoded_contains_placeholder - Decode candidate (base64 / url-safe-base64 / hex) and check whether the decoded bytes contain any placeholder word case-insensitively. Composes keyhog’s decode-through with the placeholder suppression: a docs sample that arrives base64-wrapped (e.g. AWS docs publishing AKIAEXAMPLEEXAMPLE12 as the base64-encoded body of a yaml secret) is now recognized as a sample even though the surface form looks like high-entropy random bytes. Mirror v26: 9 docs-example-marker FPs (all QUtJQUVYQU1QTEVFWEFNUExFMTI=, base64 of AKIA…EXAMPLE…12) collapsed by this gate. Memoized to match the existing is_encoded_binary call cadence.
- decoded_is_base64_blob - True when value base64-decodes to bytes that are themselves all in the base64 alphabet (double-encoded base64). k8s data: fields wrap their values in another base64 layer; the inner decoded bytes are the actual user content, and when those bytes are themselves a printable base64 blob the outer wrapper is categorically data, not a credential.
- is_encoded_binary - Conservative verdict for the confidence pipeline: does this generic candidate decode to identifiable binary / serialized data? Real secrets return false.
- is_random_base64_blob - Unified shape-only gate for the “uniform random base64 blob” class - the single parameterized definition behind every base64-protobuf-decoy gate in the scanner. Reconciles two previously-divergent copies (this module’s penalty-path looks_like_uniform_base64_blob and the entropy-path’s engine::fallback_entropy_helpers::entropy_path_looks_like_random_base64_blob) so their length/diversity bands are tuned in one place and can never drift in opposite directions un-benched again.
- looks_like_uniform_base64_blob - Shape-only check: does value look like a uniform base64 blob with no structure markers? Thin wrapper over is_random_base64_blob with the penalty-path band (44..=600) and diversity floor (32). Matches the random-base64-protobuf corpus shape (random bytes base64-encoded into a password=/secret= slot) without firing on real service-anchored credentials:
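
The parameterized shape gate and its thin wrapper can be pictured roughly as follows. This is only a sketch under stated assumptions: random_base64_shape and uniform_blob_sketch are hypothetical names, "diversity" is taken to mean the count of distinct alphabet characters, the band is applied to the full length including padding, and the real gate's structure-marker checks are not modeled at all.

```rust
use std::ops::RangeInclusive;

/// Shape-only gate sketch: length inside `band`, base64 alphabet only,
/// and at least `diversity_floor` distinct characters (assumed meaning).
fn random_base64_shape(
    value: &str,
    band: RangeInclusive<usize>,
    diversity_floor: usize,
) -> bool {
    if !band.contains(&value.len()) {
        return false;
    }
    let body = value.trim_end_matches('=');
    let in_alphabet =
        |b: u8| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'-' | b'_');
    if !body.bytes().all(in_alphabet) {
        return false;
    }
    // Count distinct byte values seen in the body.
    let mut seen = [false; 256];
    for b in body.bytes() {
        seen[b as usize] = true;
    }
    seen.iter().filter(|&&s| s).count() >= diversity_floor
}

/// Penalty-path wrapper sketch using the band and floor quoted above.
fn uniform_blob_sketch(value: &str) -> bool {
    random_base64_shape(value, 44..=600, 32)
}
```

Keeping the band and floor as parameters of one function, with each call site as a one-line wrapper, is what lets the two paths share a single definition instead of drifting apart.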