keyhog-sources 0.2.0

$ keyhog scan --path .

  ██   ██ ████████ ██    ██ ██   ██  ██████   ██████
  ██  ██  ██        ██  ██  ██   ██ ██    ██ ██
  █████   █████      ████   ███████ ██    ██ ██   ███
  ██  ██  ██          ██    ██   ██ ██    ██ ██    ██
  ██   ██ ████████    ██    ██   ██  ██████   ██████
  v1.0.0 · Secret Scanner · 886 detectors
  by SanthSecurity

  critical  82%  ██████░░  GitHub Classic PAT
                 ghp_...7890  src/config.py:42
  critical  78%  █████░░░  Stripe Secret Key
                 sk_l...ab12  .env:7
  critical  78%  █████░░░  GitHub PAT (decoded from base64)
                 ghp_...7890  k8s/secret.yaml:12

  3 secrets found · 2 unique credentials · 0 false positives

Why KeyHog

Most secret scanners run regex against plaintext. They miss anything encoded, embedded, or obfuscated. KeyHog doesn't.

Decode-through scanning recursively unwraps base64, hex, URL encoding, quoted-printable, and Unicode escapes before pattern matching — catching secrets buried in Kubernetes manifests, CI configs, Docker layers, and compiled artifacts that other tools never see.

ML confidence scoring uses a 3,969-parameter neural network trained on 200K real credentials to separate secrets from hashes, test fixtures, and documentation strings. Every finding carries a 0–100% confidence score, and the project reports zero false positives at the default 70% threshold (`min_confidence = 0.7`).

Live verification hits real APIs (AWS, GitHub, Stripe, Slack, OpenAI, and more) to confirm whether a leaked credential is actually active.

Feature Comparison

Feature	KeyHog	TruffleHog	Gitleaks	Semgrep
Detectors	886+	800+	150+	Rules
Recall (blind test)	98%	32%	~30%	~40%
False positives	Zero	Moderate	Low	High
Base64 decode	✓	✗	✗	✗
Hex decode	✓	✗	✗	✗
ML scoring	✓ (99.5% accuracy)	Partial	✗	✗
Live verify	✓	✓	✗	✗
Throughput	~50 MB/s	~10–30 MB/s	~5–15 MB/s	~20 MB/s
License	MIT	AGPL	MIT	LGPL

KeyHog finds 74 credentials that TruffleHog misses. TruffleHog finds 0 that KeyHog misses.

Choosing Between Alternatives

Use KeyHog when you need high recall on encoded secrets, embeddable Rust crates, and optional live verification.
Use TruffleHog when its existing verification workflows matter more to you than a lightweight, Rust-native integration.
Use Gitleaks when plaintext regex scanning is enough and you want a simpler rule engine.
Use Semgrep when your main goal is broad static analysis rather than secret-specific recall.

Quick Start

# Install
cargo install keyhog

# Scan a directory
keyhog scan --path .

# Scan with verification
keyhog scan --path . --verify

# Scan a git repo's full history
keyhog scan --git ./repo

# CI mode: only changed files, SARIF output
keyhog scan --git-diff origin/main --format sarif --fail-on-findings

Install

# Install the published CLI
cargo install keyhog

# Or build from source
git clone https://github.com/santhsecurity/keyhog.git
cd keyhog
cargo install --path crates/cli

Standalone Crates

[dependencies]
keyhog-core = "0.1.0"
keyhog-scanner = "0.1.0"
keyhog-sources = "0.1.0"
keyhog-verifier = "0.1.0"

keyhog-core provides detector specs, findings, reporting, and allowlists.
keyhog-scanner compiles detectors and scans Chunk values.
keyhog-sources provides filesystem, stdin, git, Docker, S3, and binary inputs.
keyhog-verifier verifies deduplicated findings asynchronously.
keyhog is the end-user binary package.

Library Quick Start

use keyhog_core::{Chunk, ChunkMetadata, DetectorSpec, PatternSpec, Severity};
use keyhog_scanner::CompiledScanner;

fn main() -> Result<(), keyhog_scanner::ScanError> {
    // Compile a single custom detector; `?` propagates any compilation error.
    let scanner = CompiledScanner::compile(vec![DetectorSpec {
        id: "demo-token".into(),
        name: "Demo Token".into(),
        service: "demo".into(),
        severity: Severity::High,
        patterns: vec![PatternSpec {
            regex: "demo_[A-Z0-9]{8}".into(),
            description: None,
            group: None,
        }],
        companion: None,
        verify: None,
        keywords: vec!["demo_".into()],
    }])?;

    // Scan one in-memory chunk, labeled as if it had been read from `.env`.
    let findings = scanner.scan(&Chunk {
        data: "TOKEN=demo_ABC12345".into(),
        metadata: ChunkMetadata {
            source_type: "filesystem".into(),
            path: Some(".env".into()),
            commit: None,
            author: None,
            date: None,
        },
    });

    assert_eq!(findings.len(), 1);
    Ok(())
}

Docker

docker run --rm -v "$(pwd)":/scan ghcr.io/keyhog/keyhog:latest scan --path /scan

GitHub Actions

- uses: keyhog/keyhog-action@v1
  with:
    path: .
    min-confidence: 0.7
    format: sarif

Pre-commit

repos:
  - repo: https://github.com/santhsecurity/keyhog
    rev: v0.1.0
    hooks:
      - id: keyhog-secret-scan

Usage

# Scan directory
keyhog scan --path ./src

# JSON output
keyhog scan --path . --format json

# Only high-severity findings
keyhog scan --path . --severity high

# Scan changes made in the last 5 commits
keyhog scan --git-diff HEAD~5

# Staged files only (for pre-commit)
keyhog scan --git-diff --staged

# Custom confidence threshold
keyhog scan --path . --min-confidence 0.8

# Fail CI on any finding
keyhog scan --path . --fail-on-findings

Output Formats

Format	Flag	Use for
Text	`--format text`	Human reading (default)
JSON	`--format json`	Programmatic use
JSONL	`--format jsonl`	Streaming / log ingestion
SARIF	`--format sarif`	GitHub code scanning

Architecture

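KeyHog uses a two-phase matching pipeline built on Aho-Corasick automata: a single-pass prefilter selects candidate regions, regexes confirm them, and confirmed matches are scored by the ML classifier and optionally verified live: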

Input          Phase 1: Prefilter           Phase 2: Confirm          Score & Verify
─────          ──────────────────           ────────────────          ──────────────

              ┌───────────────────┐     ┌──────────────────┐     ┌────────────────┐
 file         │  Decode-Through   │     │  Regex Confirm   │     │  ML Classifier │
 stdin  ────▶ │  Aho-Corasick     │────▶│  Match regions   │────▶│  3,969 params  │
 git          │  O(n) single-pass │     │  per candidate   │     │  99.5% acc     │
              └───────────────────┘     └──────────────────┘     └───────┬────────┘
                                                                         │
                                                                         ▼
                                                                 ┌────────────────┐
                                                                 │  Live Verify   │
                                                                 │  (optional)    │
                                                                 │  async tokio   │
                                                                 └────────────────┘
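
The prefilter-then-confirm split in the diagram can be illustrated with a small std-only sketch. This is not KeyHog's implementation: the `Detector` struct, `confirm_demo`, and `scan` are hypothetical names, a plain substring search stands in for the Aho-Corasick automaton, and a hand-written check stands in for the regex `demo_[A-Z0-9]{8}`.

```rust
/// Hypothetical detector: a literal keyword for the prefilter plus a
/// hand-written confirm function standing in for the detector's regex.
struct Detector {
    id: &'static str,
    keyword: &'static str,
    confirm: fn(&str) -> Option<usize>,
}

/// Confirms `demo_[A-Z0-9]{8}` at the start of `s` and returns the match length.
fn confirm_demo(s: &str) -> Option<usize> {
    let rest = s.strip_prefix("demo_")?;
    let tail = rest.as_bytes().get(..8)?;
    if tail.iter().all(|b| b.is_ascii_uppercase() || b.is_ascii_digit()) {
        Some("demo_".len() + 8)
    } else {
        None
    }
}

/// Phase 1 locates keyword positions (substring search here, an assumption
/// standing in for Aho-Corasick); phase 2 runs the confirm step only there.
fn scan<'a>(data: &'a str, detectors: &[Detector]) -> Vec<(&'static str, &'a str)> {
    let mut findings = Vec::new();
    for d in detectors {
        for (pos, _) in data.match_indices(d.keyword) {
            if let Some(len) = (d.confirm)(&data[pos..]) {
                findings.push((d.id, &data[pos..pos + len]));
            }
        }
    }
    findings
}
```

The point of the split is that the expensive confirm step runs only where a cheap keyword hit occurred, so most of the input is never touched by a regex.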

Decode-Through Scanning

Before pattern matching, KeyHog recursively decodes:

Base64 (standard + URL-safe)
Hexadecimal
URL encoding
Quoted-printable
Unicode escapes

# KeyHog catches this. Other scanners don't.
encoded = "Z2hwX3h4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4"  # base64(ghp_...)
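
As a rough illustration of the decode-through idea, the sketch below tries base64 and hex on each token and searches every decoded layer for a marker such as `ghp_`. It is a minimal std-only approximation under stated assumptions, not KeyHog's code: the function names, the tokenization on whitespace and quotes, the 8-character minimum, and the depth limit are all hypothetical, and URL encoding, quoted-printable, and Unicode escapes are left out.

```rust
/// Decodes standard or URL-safe base64; returns None on any invalid character.
fn decode_base64(s: &str) -> Option<Vec<u8>> {
    let s = s.trim_end_matches('=');
    if s.is_empty() || s.len() % 4 == 1 {
        return None;
    }
    let mut out = Vec::with_capacity(s.len() * 3 / 4);
    let (mut buf, mut bits) = (0u32, 0u32);
    for c in s.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            _ => return None,
        };
        // Accumulate 6 bits per character and emit a byte once 8 are available.
        buf = (buf << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1;
        }
    }
    Some(out)
}

/// Decodes an even-length hexadecimal string.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.is_empty() || s.len() % 2 != 0 {
        return None;
    }
    s.as_bytes()
        .chunks(2)
        .map(|pair| {
            let hi = (pair[0] as char).to_digit(16)?;
            let lo = (pair[1] as char).to_digit(16)?;
            Some((hi * 16 + lo) as u8)
        })
        .collect()
}

/// Returns true if `needle` appears in `text` or in any decoded layer of it,
/// following at most `depth` nested encodings.
fn contains_after_decoding(text: &str, needle: &str, depth: u32) -> bool {
    if text.contains(needle) {
        return true;
    }
    if depth == 0 {
        return false;
    }
    // Treat each whitespace- or quote-delimited token as a candidate blob.
    text.split(|c: char| c.is_whitespace() || c == '"' || c == '\'')
        .filter(|tok| tok.len() >= 8)
        .any(|tok| {
            vec![decode_base64(tok), decode_hex(tok)]
                .into_iter()
                .flatten()
                .filter_map(|bytes| String::from_utf8(bytes).ok())
                .any(|decoded| contains_after_decoding(&decoded, needle, depth - 1))
        })
}
```

A depth limit keeps the recursion bounded on adversarial input; a real scanner would run its full detector set on each decoded layer rather than search for one marker.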

Structural Context

Same credential, different context, different confidence:

# 82% — production config
production_config = "ghp_xxxxxxxxxxxxxxxxxxxx"

# 25% — test fixture (auto-detected via AST context)
def test_auth():
    token = "ghp_xxxxxxxxxxxxxxxxxxxx"

Adding Detectors

Detectors are TOML — no code changes needed:

# detectors/my-service.toml
[detector]
id = "my-service-api-key"
name = "My Service API Key"
severity = "critical"
keywords = ["ms_live_", "ms_test_"]

[[detector.patterns]]
regex = 'ms_(live|test)_[a-zA-Z0-9]{32}'

[detector.verify]
method = "GET"
url = "https://api.myservice.com/v1/status"
[detector.verify.auth]
type = "bearer"
field = "match"

Configuration

`.keyhog.toml`

detectors = "detectors"       # Path to detector TOML files
severity = "medium"            # Minimum: info | low | medium | high | critical
format = "text"                # Output: text | json | jsonl | sarif
min_confidence = 0.7           # ML confidence threshold (0.0–1.0)
threads = 8                    # Parallel scan threads
dedup = "credential"           # Dedup: credential | file | none
deep = true                    # Enable decode-through + entropy + multiline
timeout = 10                   # Verification timeout (seconds)
show_secrets = false            # false redacts credentials in output
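
The `dedup` setting can be pictured with a small sketch. The semantics below are an assumption rather than documented behavior: `credential` is taken to keep one finding per distinct secret, `file` one per secret and path pair, and `none` to keep everything. `Finding`, `Dedup`, and `dedup` are hypothetical names, not types from the KeyHog crates.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of the `dedup` config values (`Off` maps to "none").
#[derive(Clone, Copy)]
enum Dedup {
    Credential,
    File,
    Off,
}

/// Hypothetical finding record: the matched secret and where it was found.
struct Finding<'a> {
    credential: &'a str,
    path: &'a str,
}

/// Keeps the first finding for each dedup key, preserving input order.
fn dedup<'f, 'a>(findings: &'f [Finding<'a>], mode: Dedup) -> Vec<&'f Finding<'a>> {
    let mut seen = HashSet::new();
    findings
        .iter()
        .filter(|f| match mode {
            // Assumed: one finding per distinct credential, regardless of file.
            Dedup::Credential => seen.insert((f.credential, "")),
            // Assumed: one finding per (credential, file) pair.
            Dedup::File => seen.insert((f.credential, f.path)),
            Dedup::Off => true,
        })
        .collect()
}
```

Applied to the sample output at the top of this README, credential-level dedup would collapse the two `ghp_...7890` findings into one, which matches the "3 secrets found · 2 unique credentials" summary line.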

`.keyhogignore`

# Paths
path:tests/**
path:**/*.md

# Detectors
detector:entropy
detector:generic-api-key

# Specific findings by hash
hash:abc123def456
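
For readers integrating with this file format, here is a minimal parsing sketch. It assumes only what the example above shows: one `kind:value` rule per line, `#` comment lines, and the three kinds `path`, `detector`, and `hash`. `IgnoreRule` and `parse_ignore_file` are hypothetical names, and unknown kinds are silently skipped here, which may differ from what KeyHog does.

```rust
/// Hypothetical in-memory form of one `.keyhogignore` rule.
#[derive(Debug, PartialEq)]
enum IgnoreRule {
    Path(String),
    Detector(String),
    Hash(String),
}

/// Parses `kind:value` lines, skipping blank lines and `#` comments.
fn parse_ignore_file(contents: &str) -> Vec<IgnoreRule> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            // Split on the first colon only, so values may contain colons.
            let (kind, value) = line.split_once(':')?;
            match kind {
                "path" => Some(IgnoreRule::Path(value.to_string())),
                "detector" => Some(IgnoreRule::Detector(value.to_string())),
                "hash" => Some(IgnoreRule::Hash(value.to_string())),
                _ => None,
            }
        })
        .collect()
}
```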

Inline suppression

# keyhog:ignore
GITHUB_TOKEN = "ghp_xxxxxxxxxxxxxxxxxxxx"

# keyhog:ignore detector=github-token
api_key = "ghp_yyyyyyyyyyyyyyyyyyyy"

# keyhog:ignore reason="public CI token"
TOKEN = "ghp_zzzzzzzzzzzzzzzzzzzz"
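
The three directive forms above could be recognized with something like the following sketch. It assumes a directive carries at most one option (`detector=` or `reason=`), as in the examples; `IgnoreDirective` and `parse_ignore_directive` are hypothetical names, and how KeyHog associates a directive with the line it suppresses is not modeled.

```rust
/// Hypothetical parsed form of a `keyhog:ignore` comment.
#[derive(Debug, Default, PartialEq)]
struct IgnoreDirective {
    detector: Option<String>,
    reason: Option<String>,
}

/// Returns the directive found on `line`, or None if the line has none.
fn parse_ignore_directive(line: &str) -> Option<IgnoreDirective> {
    let (_, rest) = line.split_once("keyhog:ignore")?;
    let rest = rest.trim();
    let mut directive = IgnoreDirective::default();
    if let Some(value) = rest.strip_prefix("detector=") {
        // Limit the suppression to one detector id.
        directive.detector = Some(value.trim().to_string());
    } else if let Some(value) = rest.strip_prefix("reason=") {
        // Reasons are quoted in the examples; strip the surrounding quotes.
        directive.reason = Some(value.trim().trim_matches('"').to_string());
    }
    Some(directive)
}
```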

Modular Builds

# Full build (default)
cargo build --release

# Fast mode: regex-only, no ML/decode/multiline — for pre-commit hooks
cargo build --release --no-default-features --features fast

# With live verification
cargo build --release --features verify

Performance

All benchmarks: AMD Ryzen 9 5900X, 32 GB RAM, NVMe SSD.

Throughput

Detectors	1 MB input	10 MB input	100 MB input
100	55 MB/s	58 MB/s	62 MB/s
500	48 MB/s	52 MB/s	56 MB/s
886	42 MB/s	46 MB/s	50 MB/s

Real-World Repos

Repository	Size	KeyHog	TruffleHog	Gitleaks
facebook/react	350 MB	8s	25s	45s
denoland/deno	900 MB	18s	55s	95s
rust-lang/rust	2.1 GB	42s	120s	200s
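
To relate these timings to the throughput figures, the effective rate and relative speedup can be computed directly from the table. The helpers below are a trivial sketch with hypothetical names, and they assume the listed times are end-to-end scan times and that 1 GB is counted as 1000 MB.

```rust
/// Effective throughput in MB/s for a scan of `size_mb` taking `seconds`.
fn throughput_mb_per_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

/// How many times faster the candidate is than the baseline.
fn speedup(baseline_seconds: f64, candidate_seconds: f64) -> f64 {
    baseline_seconds / candidate_seconds
}
```

Under those assumptions the KeyHog column works out to about 43.8 MB/s for facebook/react (350 / 8) and 50 MB/s for denoland/deno (900 / 18) and rust-lang/rust (2100 / 42), and the TruffleHog column is roughly 2.9x to 3.1x slower across the three repositories.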

Verification Latency

Service	Status	Latency
AWS	✓	~200ms
GitHub	✓	~150ms
Slack	✓	~180ms
Stripe	✓	~220ms
OpenAI	✓	~250ms

License

MIT — see LICENSE.