SecretScraper

SecretScraper is a Rust CLI and library for crawling web targets, discovering URLs and JavaScript links, and detecting secrets with regular-expression rules. It can also scan local files or directories recursively.

The package is prepared for use as both a CLI binary and a Rust library. Until a crates.io release is available, build, run, and depend on it from source.

Features

  • Crawl a single URL or newline-delimited URL seed file.
  • Extract links from HTML (a[href], link[href], and JavaScript script tags) and from regex-based URL rules.
  • Detect secrets with built-in and custom regex rules.
  • Scan a single local file or a local directory tree.
  • Allow-list and block-list domains with wildcard patterns.
  • Configure headers, user agent, cookie, proxy, timeout, crawl depth, redirects, validation, and per-domain request limits.
  • Write crawler results as CSV and local scan results as YAML.
  • Use as a Rust library through Config, CrawlerFacade, FileScannerFacade, ScanFacade, and typed SecretScraperError results.

Install

cargo install --path .

This builds an optimized release binary and installs it as secret_scraper in your Cargo bin directory (typically ~/.cargo/bin). Make sure that directory is on your PATH.
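
If it is not, you can add it for the current shell session:

export PATH="$HOME/.cargo/bin:$PATH"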

For development without installing, you can use cargo run -- in place of secret_scraper throughout the examples below.
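
For example, these two commands run the same crawl:

secret_scraper --url https://example.com
cargo run -- --url https://example.com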

CLI Usage

secret_scraper --help

Crawl One URL

secret_scraper --url https://example.com

Crawl Multiple URLs

secret_scraper --url-file urls.txt

urls.txt is newline-delimited. Blank lines are ignored.

https://example.com/
https://example.com/docs
https://example.org/

Scan Local Files

secret_scraper --local ./samples

If --local points to a file, SecretScraper scans that file. If it points to a directory, files are scanned recursively.
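
For example, to scan only one file (the app.js name below is illustrative):

secret_scraper --local ./samples/app.js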

Write Output

Crawler output is CSV:

secret_scraper --url https://example.com --outfile crawl.csv

Local file scan output is YAML:

secret_scraper --local ./samples --outfile local-scan.yml

Crawl Modes And Depth

secret_scraper --url https://example.com --mode normal
secret_scraper --url https://example.com --mode thorough

normal uses a crawl depth preset of 1. thorough uses a crawl depth preset of 2. If --max-depth is set, it overrides the mode preset.

secret_scraper --url https://example.com --mode thorough --max-depth 4

--max-depth 0 fetches only the seed URL(s).

Detail, Validation, Redirects, And Regex Output

Boolean CLI options are flags. Include the flag to enable the behavior; omit it to keep the default or YAML-configured value.

secret_scraper --url https://example.com --detail
secret_scraper --url https://example.com --validate
secret_scraper --url https://example.com --follow-redirect
secret_scraper --url https://example.com --hide-regex

--validate sends follow-up requests for discovered links to verify HTTP status. This can add requests even for links that are not crawled because of depth limits.
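
For example, a seed-only fetch can still check the status of every link it discovers:

secret_scraper --url https://example.com --max-depth 0 --validate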

Domain Filters

Allow-list and block-list filters accept comma-separated wildcard patterns:

secret_scraper --url https://example.com --allow-domains '*.example.com,example.org'
secret_scraper --url https://example.com --disallow-domains '*.gov,logout.example.com'

Filters apply to seed URLs and discovered URLs.

Status Filters

Use --status with exact status codes and inclusive ranges:

secret_scraper --url https://example.com --status 200,300-399

Headers, Cookie, User Agent, Proxy, And Rate Limits

secret_scraper \
  --url https://example.com \
  --ua "SecretScraper/0.1" \
  --cookie "session=abc" \
  --proxy "http://127.0.0.1:8080" \
  --max-concurrency-per-domain 10 \
  --min-request-interval 0.2

--max-concurrency-per-domain caps concurrent requests per domain. --min-request-interval is the minimum number of seconds between request starts for the same domain.
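
The same limits can be set persistently through the corresponding setting.yaml fields (the values below are illustrative):

max_concurrent_per_domain: 10
min_request_interval: 0.5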

CLI Options Summary

The authoritative list is secret_scraper --help. Current key options are:

  • --url, -u: crawl one seed URL.
  • --url-file, -f: crawl seed URLs from a newline-delimited file.
  • --local, -l: scan a local file or directory recursively.
  • --config, -i: load a YAML config file.
  • --mode, -m: normal or thorough.
  • --max-page: maximum number of pages to crawl.
  • --max-depth: explicit crawl depth override.
  • --max-concurrency-per-domain: max concurrent requests per domain.
  • --min-request-interval: seconds between requests to the same domain.
  • --outfile, -o: write crawler CSV or local scan YAML output.
  • --status, -s: filter displayed response statuses.
  • --allow-domains, -d: allow-list domains.
  • --disallow-domains, -D: block-list domains.
  • --ua, -a: set User-Agent.
  • --cookie, -c: set Cookie.
  • --proxy, -x: set HTTP/SOCKS proxy.
  • --debug: enable debug logging.
  • --detail: print detailed crawl output.
  • --validate: validate discovered link statuses.
  • --follow-redirect, -F: follow redirects.
  • --hide-regex, -H: hide regex/secret output.

At least one of --url, --url-file, or --local is required.

Configuration

The runtime configuration is built in this order:

  1. Start from Config::default() or Config::default_with_rules().
  2. Apply YAML with Config::apply_file_layer(...).
  3. Apply CLI options with Config::apply_cli_layer(...).
  4. Validate with Config::validate().

CLI values override YAML values. Missing CLI/YAML fields do not clear existing values.

The default config file path is setting.yaml. When that file does not exist, the binary writes a generated default configuration to that path and exits after printing a message.

Using setting.yaml

Create a config file with the default path:

secret_scraper

Edit setting.yaml, then run the CLI again with the same default path:

secret_scraper

Use a different config file with --config:

secret_scraper --config showcase.yaml

CLI values override values loaded from YAML. For example, this uses all values from showcase.yaml except url and outfile:

secret_scraper --config showcase.yaml --url https://override.example --outfile override.csv

Showcase Configuration

This example shows the expected shape of each configurable field with non-default demonstration values. Use it as a template, not as the generated default. The urlFind, jsFind, and rules entries shown here are custom additions; the generated setting.yaml already contains the built-in lists.

debug: true
user_agent: "SecretScraper/0.1 (+https://example.local)"
cookie: "session=demo; theme=dark"
allow_domains:
  - "*.example.com"
  - "api.example.org"
disallow_domains:
  - "*.gov"
  - "logout.example.com"
url_file: "urls.txt"
config: "showcase.yaml"
timeout: 10.0
mode: Thorough
max_page: 500
max_depth: 3
max_concurrent_per_domain: 10
min_request_interval: 0.5
outfile: "result.csv"
status_filter:
  - Exact: 200
  - Range:
      - 300
      - 399
proxy: "http://127.0.0.1:8080"
hide_regex: false
follow_redirects: true
dangerousPath:
  - logout
  - update
  - remove
  - insert
  - delete
url: "https://example.com"
detail: true
validate: true
local: null

headers:
  accept: "application/json,text/html,*/*"
  user-agent: "SecretScraper/0.1 (+https://example.local)"
  x-demo-header: "demo"

urlFind:
  - "https?://[A-Za-z0-9._~:/?#\\[\\]@!$&'()*+,;=%-]+"

jsFind:
  - "[\"']([^\"']+\\.js)[\"']"

rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
  - name: Disabled Rule
    regex: "IGNORE_ME"
    loaded: false
    group: false

For local scanning, replace the crawl target fields with local:

url: null
url_file: null
local: "./samples"
outfile: "local-scan.yml"

Default Configuration Values

The generated setting.yaml is the serialized form of Config::default_with_rules(). It includes all built-in URL, JavaScript, and secret rules.

The generated rule and header sections use the following shapes; the values here are abbreviated or illustrative:

urlFind:
  - "https?://..."
jsFind:
  - "[\"']([^\"']+\\.js)[\"']"
rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
headers:
  accept: "*/*"
  user-agent: "Mozilla/5.0 ..."

urlFind and jsFind are lists of regex strings that emit capture groups. rules is a list of named secret rules with regex, loaded, and group fields.

Each field, its default value, and its meaning:

  • debug (default: false): Enable debug logging.
  • user_agent (default: null): Optional user-agent override. When set, it is inserted into request headers.
  • cookie (default: null): Optional cookie header value.
  • allow_domains (default: null): Optional allow-list of wildcard domain patterns.
  • disallow_domains (default: null): Optional block-list of wildcard domain patterns.
  • url_file (default: null): Optional newline-delimited seed URL file.
  • config (default: setting.yaml): Config file path used by the CLI.
  • timeout (default: 30.0): Request timeout in seconds.
  • mode (default: Normal): Crawl mode preset. Normal uses depth 1; Thorough uses depth 2.
  • max_page (default: 1000): Maximum number of pages to crawl.
  • max_depth (default: null): Optional explicit crawl depth override. 0 means seed URLs only.
  • max_concurrent_per_domain (default: 50): Maximum concurrent requests per domain.
  • min_request_interval (default: 0.2): Minimum seconds between requests to the same domain.
  • outfile (default: null): Optional output path. Crawl output is CSV; local scan output is YAML.
  • status_filter (default: null): Optional response status display filter.
  • proxy (default: null): Optional proxy URL such as http://127.0.0.1:8080 or socks5://127.0.0.1:7890.
  • hide_regex (default: false): Hide regex/secret output in human-readable output.
  • follow_redirects (default: false): Follow HTTP redirects while crawling.
  • dangerousPath (default: null): Optional path fragments to avoid requesting.
  • url (default: null): Optional single crawl seed URL.
  • detail (default: false): Show detailed crawl output.
  • validate (default: false): Validate discovered link status after crawling.
  • local (default: null): Optional local file or directory to scan recursively.
  • urlFind (default: five built-in regex strings): Regex rules used to discover URLs from text.
  • jsFind (default: three built-in regex strings): Regex rules used to discover JavaScript URLs from text.
  • rules (default: ten built-in named secret rules): Regex rules used to detect secrets.
  • headers (default: accept: "*/*" and user-agent: "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0"): Default HTTP headers sent by crawler requests.

Field Notes

  • At least one of url, url_file, or local must be set before scanning.
  • mode accepts Normal/Thorough in generated YAML and normal/thorough on the CLI.
  • follow_redirects is the YAML field name; the CLI flag is --follow-redirect.
  • headers values are merged onto the default header map. Reusing accept or user-agent overrides the default values.
  • urlFind and jsFind entries are plain regex strings. They do not use name or loaded, and they emit capture groups by default.
  • rules entries use name, regex, loaded, and optional group. group: true emits capture groups instead of the full match; omitted group defaults to false (see the example after this list).
  • urlFind, jsFind, and loaded rules entries are appended to any existing rules when you apply a YAML layer to Config::default_with_rules().
  • rules entries with loaded: false are skipped.
  • Invalid regex patterns or invalid header names/values cause configuration loading to fail.
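
For example, a rule that emits only its capture group; the name and pattern below are illustrative, not built-in:

rules:
  - name: Bearer Token
    regex: "[Bb]earer\\s+([A-Za-z0-9._~+/=-]+)"
    loaded: true
    group: true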

status_filter is serialized as Exact and Range entries. You normally set it from the CLI with --status 200,300-399; if writing YAML manually, use the tagged enum shape shown in the showcase.

Built-In Rules

Use Config::default_with_rules() to populate built-in URL, JavaScript, and secret-detection rules. The built-in secret rules currently include:

  • Swagger
  • ID Card
  • Phone
  • JS Map
  • URL as a value
  • Email
  • Internal IP
  • Cloud Key
  • Shiro
  • Suspicious API Key

YAML urlFind, jsFind, and loaded rules entries are appended to the existing rule lists.

Library Usage

SecretScraper can be used directly from Rust code. Before the first crates.io release, depend on it by local path while developing.

[dependencies]
secret_scraper = { path = "../secret-scraper-in-rust" }

After publication, use the crates.io dependency form:

[dependencies]
secret_scraper = "0.1"

Crawl From Library Code

use secret_scraper::{
    cli::{Config, Mode},
    error::Result as SecretScraperResult,
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

fn crawl() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.url = Some("https://example.com".to_string());
    config.mode = Mode::Thorough;
    config.detail = true;
    config.outfile = Some("crawl.csv".into());

    match Box::new(CrawlerFacade::new(config)?).scan()? {
        ScanResult::CrawlResult(result) => {
            println!(
                "{} domains, {} URL groups, {} secret-bearing URLs",
                result.hosts.len(),
                result.urls.len(),
                result.secrets.len()
            );
        }
        ScanResult::LocalScanResult(_) => unreachable!("CrawlerFacade returns crawl results"),
    }

    Ok(())
}

Scan Local Files From Library Code

use secret_scraper::{
    cli::Config,
    error::Result as SecretScraperResult,
    facade::{FileScannerFacade, ScanFacade, ScanResult},
};

fn scan_local() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.local = Some("./samples".into());
    config.outfile = Some("local-scan.yml".into());

    match Box::new(FileScannerFacade::new(config)?).scan()? {
        ScanResult::LocalScanResult(result) => {
            println!("{} files scanned", result.len());
        }
        ScanResult::CrawlResult(_) => unreachable!("FileScannerFacade returns local scan results"),
    }

    Ok(())
}

The public facade result type is ScanStdResult, an alias for secret_scraper::error::Result<ScanResult>. Errors are represented by SecretScraperError.
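
As a sketch, both facades can also be driven through the shared trait behind one return type. This assumes ScanFacade is object-safe, which the examples above suggest; check the crate docs to confirm.

use secret_scraper::{
    cli::Config,
    error::Result as SecretScraperResult,
    facade::{CrawlerFacade, FileScannerFacade, ScanFacade, ScanResult},
};

// Pick a facade from the configuration and return the shared result type
// (equivalent to ScanStdResult). Assumes ScanFacade is object-safe.
fn run(config: Config) -> SecretScraperResult<ScanResult> {
    let facade: Box<dyn ScanFacade> = if config.local.is_some() {
        Box::new(FileScannerFacade::new(config)?)
    } else {
        Box::new(CrawlerFacade::new(config)?)
    };
    facade.scan()
}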

Examples

Run the local crawler example:

cargo run --example crawler

Run the local file-scanner facade example:

cargo run --example scan_facade

The crawler example starts a local HTTP server and runs both normal and thorough crawls. The scan facade example creates temporary files, scans them, writes YAML, and handles SecretScraperResult explicitly.

Development

cargo fmt --check
cargo check
cargo check --examples
cargo test

Generate docs locally:

cargo doc --no-deps --open

Notes

  • Crawler output files are CSV.
  • Local scan output files are YAML.
  • The project currently uses the Rust regex crate for rule matching.
  • Public package links are intentionally omitted until the crate is published.