SecretScraper
SecretScraper is a Rust CLI and library for crawling web targets, discovering URLs and JavaScript links, and detecting secrets with regular-expression rules. It can also scan local files or directories recursively.
The package is prepared for use as both a CLI binary and a Rust library. Until a crates.io release is available, build, run, and depend on it from source.
Library Doc: https://docs.rs/secret_scraper/latest/secret_scraper/
Features
- Crawl a single URL or newline-delimited URL seed file.
- Extract links from HTML (`a[href]`, `link[href]`, JavaScript scripts) and regex-based URL rules.
- Detect secrets with built-in and custom regex rules.
- Scan a single local file or a local directory tree.
- Allow-list and block-list domains with wildcard patterns.
- Configure headers, user agent, cookie, proxy, timeout, crawl depth, redirects, validation, and per-domain request limits.
- Write crawler results as CSV and local scan results as YAML.
- Use as a Rust library through `Config`, `CrawlerFacade`, `FileScannerFacade`, `ScanFacade`, and typed `SecretScraperError` results.
Install
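Assuming you have cloned the repository, one way to install from source is with Cargo:

```sh
# From the repository root: build a release binary and install it
# into the Cargo bin directory.
cargo install --path .
```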
This builds an optimized release binary and installs it as secret_scraper in your Cargo bin directory (typically ~/.cargo/bin). Make sure that directory is on your PATH.
For development without installing, you can still use cargo run -- in place of secret_scraper throughout the examples below.
CLI Usage
Crawl One URL
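A minimal crawl of a single seed URL (the URL is illustrative):

```sh
secret_scraper --url https://example.com/
```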
Crawl Multiple URLs
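To crawl every seed listed in a file (the file name is illustrative):

```sh
secret_scraper --url-file urls.txt
```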
urls.txt is newline-delimited. Blank lines are ignored.
```
https://example.com/
https://example.com/docs
https://example.org/
```
Scan Local Files
If --local points to a file, SecretScraper scans that file. If it points to a directory, files are scanned recursively.
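For example (both paths are illustrative):

```sh
# Scan one file.
secret_scraper --local ./app/main.js

# Scan a directory tree recursively.
secret_scraper --local ./app
```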
Write Output
Crawler output is CSV:
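For example, with an illustrative target and output path:

```sh
secret_scraper --url https://example.com/ --outfile result.csv
```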
Local file scan output is YAML:
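For example, with an illustrative directory and output path:

```sh
secret_scraper --local ./samples --outfile local-scan.yml
```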
Crawl Modes And Depth
normal uses a crawl depth preset of 1. thorough uses a crawl depth preset of 2. If --max-depth is set, it overrides the mode preset.
--max-depth 0 fetches only the seed URL(s).
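For example, an explicit depth override on top of the thorough preset (all values are illustrative):

```sh
secret_scraper --url https://example.com/ --mode thorough --max-depth 3 --max-page 500
```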
Detail, Validation, Redirects, And Regex Output
Boolean CLI options are flags. Include the flag to enable the behavior; omit it to keep the default or YAML-configured value.
--validate sends follow-up requests for discovered links to verify HTTP status. This can add requests even for links that are not crawled because of depth limits.
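For example, combining the boolean flags on one crawl (the target URL is illustrative):

```sh
secret_scraper --url https://example.com/ --detail --validate --follow-redirect --hide-regex
```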
Domain Filters
Allow-list and block-list filters accept comma-separated wildcard patterns:
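For example, allowing two domain patterns and blocking two others (the patterns are illustrative):

```sh
secret_scraper --url https://example.com/ \
  --allow-domains "*.example.com,api.example.org" \
  --disallow-domains "*.gov,logout.example.com"
```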
Filters apply to seed URLs and discovered URLs.
Status Filters
Use --status with exact status codes and inclusive ranges:
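Assuming a comma-separated value, as with the domain filters, a status filter might look like:

```sh
secret_scraper --url https://example.com/ --status "200,301-302"
```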
Headers, Cookie, User Agent, Proxy, And Rate Limits
--max-concurrency-per-domain caps concurrent requests per domain. --min-request-interval is the minimum number of seconds between request starts for the same domain.
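Arbitrary headers are configured through YAML (see the showcase configuration below); the remaining request options have CLI flags. The values here are illustrative and mirror the showcase configuration:

```sh
secret_scraper --url https://example.com/ \
  --ua "SecretScraper/0.1 (+https://example.local)" \
  --cookie "session=demo; theme=dark" \
  --proxy socks5://127.0.0.1:7890 \
  --max-concurrency-per-domain 10 \
  --min-request-interval 0.5
```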
CLI Options Summary
The authoritative list is secret_scraper --help. Current key options are:
- `--url`, `-u`: crawl one seed URL.
- `--url-file`, `-f`: crawl seed URLs from a newline-delimited file.
- `--local`, `-l`: scan a local file or directory recursively.
- `--config`, `-i`: load a YAML config file.
- `--mode`, `-m`: `normal` or `thorough`.
- `--max-page`: maximum number of pages to crawl.
- `--max-depth`: explicit crawl depth override.
- `--max-concurrency-per-domain`: max concurrent requests per domain.
- `--min-request-interval`: seconds between requests to the same domain.
- `--outfile`, `-o`: write crawler CSV or local scan YAML output.
- `--status`, `-s`: filter displayed response statuses.
- `--allow-domains`, `-d`: allow-list domains.
- `--disallow-domains`, `-D`: block-list domains.
- `--ua`, `-a`: set `User-Agent`.
- `--cookie`, `-c`: set `Cookie`.
- `--proxy`, `-x`: set HTTP/SOCKS proxy.
- `--debug`: enable debug logging.
- `--detail`: print detailed crawl output.
- `--validate`: validate discovered link statuses.
- `--follow-redirect`, `-F`: follow redirects.
- `--hide-regex`, `-H`: hide regex/secret output.
At least one of --url, --url-file, or --local is required.
Configuration
The runtime configuration is built in this order:
- Start from `Config::default()` or `Config::default_with_rules()`.
- Apply YAML with `Config::apply_file_layer(...)`.
- Apply CLI options with `Config::apply_cli_layer(...)`.
- Validate with `Config::validate()`.
CLI values override YAML values. Missing CLI/YAML fields do not clear existing values.
The default config file path is setting.yaml. When that file does not exist, the binary writes a generated default configuration to that path and exits after printing a message.
Using setting.yaml
Create a config file with the default path:
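Any valid invocation works here; on a first run with no setting.yaml present, the binary writes the default configuration and exits (the URL is illustrative):

```sh
# First run: writes setting.yaml with defaults, prints a message, and exits.
secret_scraper --url https://example.com/
```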
Edit setting.yaml, then run the CLI again with the same default path:
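The same invocation now picks up the edited file automatically (the URL is illustrative):

```sh
secret_scraper --url https://example.com/
```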
Use a different config file with --config:
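For example, loading the showcase file described below:

```sh
secret_scraper --url https://example.com/ --config showcase.yaml
```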
CLI values override values loaded from YAML. For example, this uses all values from showcase.yaml except url and outfile:
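Here the URL and output path on the command line win over the YAML values (both are illustrative):

```sh
secret_scraper --config showcase.yaml --url https://example.org/ --outfile showcase-run.csv
```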
Showcase Configuration
This example shows the expected shape of each configurable field with non-default demonstration values. Use it as a template, not as the generated default. The urlFind, jsFind, and rules entries shown here are custom additions; the generated setting.yaml already contains the built-in lists.
```yaml
debug: true
user_agent: "SecretScraper/0.1 (+https://example.local)"
cookie: "session=demo; theme=dark"
allow_domains:
  - "*.example.com"
  - "api.example.org"
disallow_domains:
  - "*.gov"
  - "logout.example.com"
url_file: "urls.txt"
config: "showcase.yaml"
timeout: 10.0
mode: thorough
max_page: 500
max_depth: 3
max_concurrent_per_domain: 10
min_request_interval: 0.5
outfile: "result.csv"
status_filter:
  - "200"
  - "301-302"
proxy: "http://127.0.0.1:8080"
hide_regex: false
follow_redirects: true
dangerousPath:
  - logout
  - update
  - remove
  - insert
  - delete
url: "https://example.com"
detail: true
validate: true
local: null
headers:
  accept: "application/json,text/html,*/*"
  user-agent: "SecretScraper/0.1 (+https://example.local)"
  x-demo-header: "demo"
urlFind:
  - "https?://[A-Za-z0-9._~:/?#\\[\\]@!$&'()*+,;=%-]+"
jsFind:
  - "[\"']([^\"']+\\.js)[\"']"
rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
  - name: Disabled Rule
    regex: "IGNORE_ME"
    loaded: false
    group: false
```
For local scanning, replace the crawl target fields with local:
```yaml
url: null
url_file: null
local: "./samples"
outfile: "local-scan.yml"
```
Default Configuration Values
The generated setting.yaml is the serialized form of Config::default_with_rules(). It includes all built-in URL, JavaScript, and secret rules.
The generated rule sections use two different shapes:
```yaml
urlFind:
  - "https?://..."
jsFind:
  - "[\"']([^\"']+\\.js)[\"']"
rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
headers:
  accept: "*/*"
  user-agent: "Mozilla/5.0 ..."
```
urlFind and jsFind are lists of regex strings that emit capture groups. rules is a list of named secret rules with regex, loaded, and group fields.
| Field | Default value | Meaning |
|---|---|---|
| `debug` | `false` | Enable debug logging. |
| `user_agent` | `null` | Optional user-agent override. When set, it is inserted into request headers. |
| `cookie` | `null` | Optional cookie header value. |
| `allow_domains` | `null` | Optional allow-list of wildcard domain patterns. |
| `disallow_domains` | `null` | Optional block-list of wildcard domain patterns. |
| `url_file` | `null` | Optional newline-delimited seed URL file. |
| `config` | `setting.yaml` | Config file path used by the CLI. |
| `timeout` | `30.0` | Request timeout in seconds. |
| `mode` | `normal` | Crawl mode preset. `normal` uses depth 1; `thorough` uses depth 2. |
| `max_page` | `1000` | Maximum number of pages to crawl. |
| `max_depth` | `null` | Optional explicit crawl depth override. `0` means seed URLs only. |
| `max_concurrent_per_domain` | `50` | Maximum concurrent requests per domain. |
| `min_request_interval` | `0.2` | Minimum seconds between requests to the same domain. |
| `outfile` | `null` | Optional output path. Crawl output is CSV; local scan output is YAML. |
| `status_filter` | `null` | Optional response status display filter. |
| `proxy` | `null` | Optional proxy URL such as `http://127.0.0.1:8080` or `socks5://127.0.0.1:7890`. |
| `hide_regex` | `false` | Hide regex/secret output in human-readable output. |
| `follow_redirects` | `false` | Follow HTTP redirects while crawling. |
| `dangerousPath` | `null` | Optional path fragments to avoid requesting. |
| `url` | `null` | Optional single crawl seed URL. |
| `detail` | `false` | Show detailed crawl output. |
| `validate` | `false` | Validate discovered link status after crawling. |
| `local` | `null` | Optional local file or directory to scan recursively. |
| `urlFind` | five built-in regex strings | Regex rules used to discover URLs from text. |
| `jsFind` | three built-in regex strings | Regex rules used to discover JavaScript URLs from text. |
| `rules` | ten built-in named secret rules | Regex rules used to detect secrets. |
| `headers` | `accept: "*/*"` and `user-agent: "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0"` | Default HTTP headers sent by crawler requests. |
Field Notes
- At least one of `url`, `url_file`, or `local` must be set before scanning.
- `follow_redirects` is the YAML field name; the CLI flag is `--follow-redirect`.
- `headers` values are merged onto the default header map. Reusing `accept` or `user-agent` overrides the default values.
- `urlFind` and `jsFind` entries are plain regex strings. They do not use `name` or `loaded`, and they emit capture groups by default.
- `rules` entries use `name`, `regex`, `loaded`, and optional `group`. `group: true` emits capture groups instead of the full match; an omitted `group` defaults to `false`.
- `urlFind`, `jsFind`, and loaded `rules` entries are appended to any existing rules when you apply a YAML layer to `Config::default_with_rules()`.
Built-In Rules
Use `Config::default_with_rules()` to populate the built-in URL, JavaScript, and secret-detection rules. The built-in secret rules currently include:
- Swagger
- ID Card
- Phone
- JS Map
- URL as a value
- Internal IP
- Cloud Key
- Shiro
- Suspicious API Key
YAML `urlFind`, `jsFind`, and loaded `rules` entries are appended to the existing rule lists.
Library Usage
SecretScraper can be used directly from Rust code. Before the first crates.io release, depend on it by local path while developing.
```toml
[dependencies]
secret_scraper = { path = "../secret-scraper-in-rust" }
```
After publication, use the crates.io dependency form:
```toml
[dependencies]
secret_scraper = "0.1"
```
Crawl From Library Code
A crawl from library code builds a `Config`, validates it with `Config::validate()`, and runs it through `CrawlerFacade`:

```rust
// Adjust the import paths if these types are not re-exported at the crate root.
use secret_scraper::{Config, CrawlerFacade};
```
Scan Local Files From Library Code
A local scan builds a `Config` with `local` set and runs it through `FileScannerFacade`:

```rust
// Adjust the import paths if these types are not re-exported at the crate root.
use secret_scraper::{Config, FileScannerFacade};
```
The public facade result type is `ScanStdResult`, an alias for `secret_scraper::error::Result<ScanResult>`. Errors are represented by `SecretScraperError`.
Examples
Run the local crawler example and the local file-scanner facade example through Cargo's example runner.
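Assuming the examples live in the conventional `examples/` directory, the concrete names are the file names there; the placeholder below stands in for whichever one you want to run:

```sh
ls examples/                  # list the available example names
cargo run --example <name>    # run one of them
```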
The crawler example starts a local HTTP server and runs both normal and thorough crawls. The scan facade example creates temporary files, scans them, writes YAML, and handles SecretScraperResult explicitly.
Development
Generate docs locally:
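For example, with Cargo:

```sh
cargo doc --no-deps --open
```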
Notes
- Crawler output files are CSV.
- Local scan output files are YAML.
- The project currently uses the Rust `regex` crate for rule matching.