Crate secret_scraper

§SecretScraper

Rust library for crawling web targets, discovering URLs and JavaScript links, and detecting secrets (API keys, credentials, internal IPs, PII, and more) with configurable regular-expression rules. Also scans local files and directories recursively.

§Quick Start

Crawl a website with the built-in detection rules:

use secret_scraper::{
    cli::{Config, Mode},
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

let mut config = Config::default_with_rules();
config.url = Some("https://example.com".to_string());
config.mode = Mode::Thorough;
config.detail = true;
config.outfile = Some("crawl.csv".into());

match Box::new(CrawlerFacade::new(config).unwrap()).scan().unwrap() {
    ScanResult::CrawlResult(result) => {
        println!(
            "{} domains, {} URL groups, {} secret-bearing URLs",
            result.hosts.len(),
            result.urls.len(),
            result.secrets.len()
        );
    }
    ScanResult::LocalScanResult(_) => unreachable!(),
}

Scan a local directory recursively:

use secret_scraper::{
    cli::Config,
    facade::{FileScannerFacade, ScanFacade, ScanResult},
};

let mut config = Config::default_with_rules();
config.local = Some("./samples".into());
config.outfile = Some("local-scan.yml".into());

match Box::new(FileScannerFacade::new(config).unwrap()).scan().unwrap() {
    ScanResult::LocalScanResult(result) => {
        println!("{} files scanned", result.len());
        for (path, secrets) in &result {
            println!("{}: {} secrets", path.display(), secrets.len());
        }
    }
    ScanResult::CrawlResult(_) => unreachable!(),
}

§Features

  • Web crawling — crawl seed URLs with configurable depth, following HTML links, JavaScript sources, and regex-discovered URLs.
  • Local file scanning — scan a single file or walk a directory tree recursively for secrets.
  • Built-in secret rules — detects Swagger docs, ID cards, phone numbers, email addresses, internal IPs, cloud keys, Shiro keys, API keys, and more.
  • Custom rules — add your own regex patterns for URL discovery, JavaScript link extraction, and secret detection.
  • Domain filtering — allow-list or block-list domains with wildcard patterns (*.example.com); see the sketch after this list.
  • Rate limiting — per-domain concurrency caps and minimum request intervals.
  • Proxy support — HTTP and SOCKS5 proxies.
  • Status filtering — filter displayed results by HTTP status codes or ranges.
  • Validation mode — verify discovered link statuses without crawling them.
  • Output formats — crawl results as CSV, local scan results as YAML.
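
The domain-filtering and rate-limiting features map directly onto Config fields documented in the next section. A minimal sketch, assuming the hostnames are purely illustrative:

use std::time::Duration;
use secret_scraper::cli::Config;

let mut config = Config::default_with_rules();
config.url = Some("https://app.example.com".to_string());
// Stay on the target's subdomains, but skip a static-asset host.
config.allow_domains = Some(vec!["*.example.com".into()]);
config.disallow_domains = Some(vec!["static.example.com".into()]);
// At most 5 in-flight requests per domain, spaced at least 500 ms apart.
config.max_concurrency_per_domain = 5;
config.min_request_interval = Duration::from_millis(500);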

§Configuration

Build a Config by starting from a default and setting fields directly on the struct. The layering order used by the CLI (defaults → YAML → CLI flags) is also available programmatically via apply_file_layer and apply_cli_layer, but for library usage direct field assignment is usually all you need.

Two constructors are available:

| Method | Description |
| --- | --- |
| Config::default() | Empty rule lists — add your own rules. |
| Config::default_with_rules() | Pre-populated with 5 URL-find, 3 JS-find, and 10 secret-detection rules. |
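
For a quick check of the difference, the rule counts from the table above can be asserted directly (a minimal sketch):

use secret_scraper::cli::Config;

let empty = Config::default();
assert!(empty.custom_rules.is_empty()); // add your own rules

let with_rules = Config::default_with_rules();
assert_eq!(with_rules.url_find_rules.len(), 5);  // URL-discovery rules
assert_eq!(with_rules.js_find_rules.len(), 3);   // JavaScript-link rules
assert_eq!(with_rules.custom_rules.len(), 10);   // secret-detection rules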

Key configuration fields on Config:

| Field | Type | Description |
| --- | --- | --- |
| url | Option<String> | Single seed URL for crawling. |
| url_file | Option<PathBuf> | Newline-delimited file of seed URLs. |
| local | Option<PathBuf> | File or directory for local scanning. |
| mode | Mode | Normal (depth 1) or Thorough (depth 2). |
| max_depth | Option<u32> | Override crawl depth; 0 = seed URLs only. |
| max_page | Option<u32> | Maximum pages to crawl (default 1000). |
| detail | bool | Show per-URL hierarchy in output. |
| validate | bool | Validate discovered link statuses. |
| follow_redirect | bool | Follow HTTP redirects. |
| hide_regex | bool | Suppress secret output. |
| outfile | Option<PathBuf> | Write results to file (CSV for crawl, YAML for scan). |
| timeout | Duration | Request timeout (default 30s). |
| proxy | Option<String> | Proxy URL (http://host:port or socks5://host:port). |
| user_agent | Option<String> | Override User-Agent header. |
| cookie | Option<String> | Set Cookie header. |
| allow_domains | Option<Vec<String>> | Domain allow-list with wildcards. |
| disallow_domains | Option<Vec<String>> | Domain block-list with wildcards. |
| max_concurrency_per_domain | usize | Concurrent request cap per domain (default 50). |
| min_request_interval | Duration | Minimum interval between requests to the same domain (default 200 ms). |
| dangerous_paths | Option<Vec<String>> | Path fragments to avoid requesting (e.g. logout, delete). |
| url_find_rules | Vec<Rule> | Regex rules for discovering URLs in response text. |
| js_find_rules | Vec<Rule> | Regex rules for discovering JavaScript URLs. |
| custom_rules | Vec<Rule> | Regex rules for secret detection. |
| custom_headers | Option<HeaderMap> | Extra HTTP headers sent with requests. |
| status_filter | Option<StatusRangeRule> | Filter output by response status. |
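
For example, seeding the crawler from a URL file and limiting the run to the seeds themselves might look like the sketch below. targets.txt is an illustrative path, and the effect of combining max_depth = 0 with validate (status-checking only the seed URLs) is an assumption based on the field descriptions above.

use secret_scraper::{
    cli::Config,
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

let mut config = Config::default_with_rules();
// Newline-delimited list of seed URLs instead of a single url field.
config.url_file = Some("targets.txt".into());
// Depth 0 keeps the crawl to the seed URLs only.
config.max_depth = Some(0);
// Check link statuses instead of crawling further.
config.validate = true;

match Box::new(CrawlerFacade::new(config).unwrap()).scan().unwrap() {
    ScanResult::CrawlResult(result) => {
        println!("checked {} URL groups", result.urls.len());
    }
    ScanResult::LocalScanResult(_) => unreachable!(),
}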

§Custom Rules

Use Rule::new to compile a named regex:

use secret_scraper::cli::{Config, Rule};

let mut config = Config::default();
config.url_find_rules.push(
    Rule::new_with_group("api_path".into(), r#""(/api/v[0-9]+/[^"]+)""#, true).unwrap()
);
config.custom_rules.push(
    Rule::new("Custom Token".into(), r"TOKEN_[A-Z0-9]{16}").unwrap()
);

Rule::new emits the full regex match. Use Rule::new_with_group when capture groups should be emitted instead, which is usually what URL-discovery rules need.

With Config::default(), the rule lists start empty. With Config::default_with_rules(), your custom rules are appended to the built-in lists.

§Result Handling

The high-level API uses ScanFacade::scan, which returns ScanStdResult — an alias for Result<ScanResult, SecretScraperError>.

use secret_scraper::{
    cli::Config,
    error::{Result as SsResult, SecretScraperError},
    facade::{FileScannerFacade, ScanFacade, ScanResult},
};

fn try_scan() -> SsResult<()> {
    let mut config = Config::default_with_rules();
    config.local = Some("./src".into());

    match Box::new(FileScannerFacade::new(config)?).scan() {
        Ok(ScanResult::LocalScanResult(files)) => {
            for (path, secrets) in &files {
                for s in secrets {
                    println!("{}: [{}] {}", path.display(), s.secret_type, s.data);
                }
            }
        }
        Ok(ScanResult::CrawlResult(_)) => unreachable!(),
        Err(SecretScraperError::Scanner(msg)) => eprintln!("scan failed: {msg}"),
        Err(e) => eprintln!("error: {e}"),
    }
    Ok(())
}

§Advanced: Crawl with Full Options

use std::time::Duration;
use secret_scraper::{
    cli::{Config, Mode, Rule},
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

let mut config = Config::default_with_rules();
config.url = Some("https://example.com".to_string());
config.mode = Mode::Thorough;
config.max_depth = Some(3);
config.max_page = Some(500);
config.max_concurrency_per_domain = 10;
config.min_request_interval = Duration::from_millis(500);
config.timeout = Duration::from_secs(15);
config.follow_redirect = true;
config.validate = true;
config.detail = true;
config.user_agent = Some("SecretScraper/0.1".into());
config.proxy = Some("http://127.0.0.1:8080".into());
config.allow_domains = Some(vec!["*.example.com".into()]);
config.dangerous_paths = Some(vec!["logout".into(), "delete".into()]);
config.outfile = Some("crawl.csv".into());
config.custom_rules.push(
    Rule::new("JWT".into(), r"eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+").unwrap()
);

match Box::new(CrawlerFacade::new(config).unwrap()).scan().unwrap() {
    ScanResult::CrawlResult(result) => {
        println!(
            "Done: {} domains, {} URLs, {} JS files, {} secrets",
            result.hosts.len(),
            result.urls.len(),
            result.js.len(),
            result.secrets.len(),
        );
    }
    ScanResult::LocalScanResult(_) => unreachable!(),
}

§Module Overview

| Module | Purpose |
| --- | --- |
| cli | Configuration types: Config, Mode, Rule. |
| facade | High-level entry points: CrawlerFacade, FileScannerFacade. |
| error | Error types: SecretScraperError and the Result alias. |
| handler | Secret detection: RegexHandler, Secret. |
| urlparser | URL representation: URLNode, ResponseStatus. |
| filter | Domain allow-list / block-list filter chain. |
| output | Human-readable and CSV output formatting. |
| rate_limiter | Per-domain request rate limiting. |
| scanner | Local file traversal and scanning engine. |
| scraper | Lower-level crawler actor implementation. |
| logging | Tracing and log subscriber initialization. |

Modules§

cli
Command-line, YAML, and runtime configuration types.
error
Library error and result types used by the public facade API.
facade
High-level crawler and file-scanner facades for web and local scanning workflows.
filter
URL filtering primitives used by the crawler.
handler
Secret detection handlers and result types.
logging
Tracing and logging initialization helpers.
output
Human-readable and CSV output formatting.
rate_limiter
Per-domain crawler rate limiting.
scanner
Local file scanning engine.
scraper
Actor-based crawler internals.
urlparser
URL node representation and link extraction helpers.