# SecretScraper
SecretScraper is a Rust CLI and library for crawling web targets, discovering URLs and JavaScript links, and detecting secrets with regular-expression rules. It can also scan local files or directories recursively.
The package works as both a CLI binary and a Rust library. Until a crates.io release is available, build, run, and depend on it from source.
**Library Doc**: <https://docs.rs/secret_scraper/latest/secret_scraper/>
## Features
- Crawl a single URL or newline-delimited URL seed file.
- Extract links from HTML (`a[href]`, `link[href]`, and script references) and from regex-based URL rules.
- Detect secrets with built-in and custom regex rules.
- Scan a single local file or a local directory tree.
- Allow-list and block-list domains with wildcard patterns.
- Configure headers, user agent, cookie, proxy, timeout, crawl depth, redirects, validation, and per-domain request limits.
- Write crawler results as CSV and local scan results as YAML.
- Use as a Rust library through `Config`, `CrawlerFacade`, `FileScannerFacade`, `ScanFacade`, and typed `SecretScraperError` results.
## Install
```bash
cargo install secret_scraper
```
This builds an optimized release binary and installs it as `secret_scraper` in your Cargo bin directory (typically `~/.cargo/bin`). Make sure that directory is on your `PATH`.
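Until the crate is published on crates.io, you can install the same binary from a checkout of the source tree instead:
```bash
cargo install --path .
```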
For development without installing, you can still use `cargo run --` in place of `secret_scraper` throughout the examples below.
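For example:
```bash
cargo run -- --url https://example.com --detail
```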
## CLI Usage
```bash
secret_scraper --help
```
### Crawl One URL
```bash
secret_scraper --url https://example.com
```
### Crawl Multiple URLs
```bash
secret_scraper --url-file urls.txt
```
`urls.txt` is newline-delimited. Blank lines are ignored.
```text
https://example.com/
https://example.com/docs
https://example.org/
```
### Scan Local Files
```bash
secret_scraper --local ./samples
```
If `--local` points to a file, SecretScraper scans that file. If it points to a directory, files are scanned recursively.
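For example, pointing `--local` at a single file scans just that file (the path here is only illustrative):
```bash
secret_scraper --local ./samples/app.js
```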
### Write Output
Crawler output is CSV:
```bash
secret_scraper --url https://example.com --outfile crawl.csv
```
Local file scan output is YAML:
```bash
secret_scraper --local ./samples --outfile local-scan.yml
```
### Crawl Modes And Depth
```bash
secret_scraper --url https://example.com --mode normal
secret_scraper --url https://example.com --mode thorough
```
`normal` uses a crawl depth preset of `1`. `thorough` uses a crawl depth preset of `2`. If `--max-depth` is set, it overrides the mode preset.
```bash
secret_scraper --url https://example.com --mode thorough --max-depth 4
```
`--max-depth 0` fetches only the seed URL(s).
### Detail, Validation, Redirects, And Regex Output
Boolean CLI options are flags. Include the flag to enable the behavior; omit it to keep the default or YAML-configured value.
```bash
secret_scraper --url https://example.com --detail
secret_scraper --url https://example.com --validate
secret_scraper --url https://example.com --follow-redirect
secret_scraper --url https://example.com --hide-regex
```
`--validate` sends follow-up requests for discovered links to verify HTTP status. This can add requests even for links that are not crawled because of depth limits.
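For example, a seed-only crawl can still validate the links it discovers on those pages (the URL is illustrative; the flags are documented above):
```bash
secret_scraper --url https://example.com --max-depth 0 --validate --detail
```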
### Domain Filters
Allow-list and block-list filters accept comma-separated wildcard patterns:
```bash
secret_scraper --url https://example.com --allow-domains '*.example.com,example.org'
secret_scraper --url https://example.com --disallow-domains '*.gov,logout.example.com'
```
Filters apply to seed URLs and discovered URLs.
### Status Filters
Use `--status` with exact status codes and inclusive ranges:
```bash
secret_scraper --url https://example.com --status 200,300-399
```
### Headers, Cookie, User Agent, Proxy, And Rate Limits
```bash
secret_scraper \
  --url https://example.com \
  --ua "SecretScraper/0.1" \
  --cookie "session=abc" \
  --proxy "http://127.0.0.1:8080" \
  --max-concurrency-per-domain 10 \
  --min-request-interval 0.2
```
`--max-concurrency-per-domain` caps concurrent requests per domain. `--min-request-interval` is the minimum number of seconds between request starts for the same domain.
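For example, a stricter per-domain budget might look like this (the values are illustrative; 0.5 seconds between request starts allows at most two new requests per second per domain):
```bash
secret_scraper \
  --url https://example.com \
  --max-concurrency-per-domain 2 \
  --min-request-interval 0.5
```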
## CLI Options Summary
The authoritative list is `secret_scraper --help`. Current key options are:
- `--url`, `-u`: crawl one seed URL.
- `--url-file`, `-f`: crawl seed URLs from a newline-delimited file.
- `--local`, `-l`: scan a local file or directory recursively.
- `--config`, `-i`: load a YAML config file.
- `--mode`, `-m`: `normal` or `thorough`.
- `--max-page`: maximum number of pages to crawl.
- `--max-depth`: explicit crawl depth override.
- `--max-concurrency-per-domain`: max concurrent requests per domain.
- `--min-request-interval`: seconds between requests to the same domain.
- `--outfile`, `-o`: write crawler CSV or local scan YAML output.
- `--status`, `-s`: filter displayed response statuses.
- `--allow-domains`, `-d`: allow-list domains.
- `--disallow-domains`, `-D`: block-list domains.
- `--ua`, `-a`: set `User-Agent`.
- `--cookie`, `-c`: set `Cookie`.
- `--proxy`, `-x`: set HTTP/SOCKS proxy.
- `--debug`: enable debug logging.
- `--detail`: print detailed crawl output.
- `--validate`: validate discovered link statuses.
- `--follow-redirect`, `-F`: follow redirects.
- `--hide-regex`, `-H`: hide regex/secret output.
At least one of `--url`, `--url-file`, or `--local` is required.
## Configuration
The runtime configuration is built in this order:
1. Start from `Config::default()` or `Config::default_with_rules()`.
2. Apply YAML with `Config::apply_file_layer(...)`.
3. Apply CLI options with `Config::apply_cli_layer(...)`.
4. Validate with `Config::validate()`.
CLI values override YAML values. Missing CLI/YAML fields do not clear existing values.
The default config file path is `setting.yaml`. When that file does not exist, the binary writes a generated default configuration to that path and exits after printing a message.
### Using `setting.yaml`
Create a config file with the default path:
```bash
secret_scraper
```
Edit `setting.yaml`, then run the CLI again with the same default path:
```bash
secret_scraper
```
Use a different config file with `--config`:
```bash
secret_scraper --config showcase.yaml
```
CLI values override values loaded from YAML. For example, this uses all values from `showcase.yaml` except `url` and `outfile`:
```bash
secret_scraper --config showcase.yaml --url https://override.example --outfile override.csv
```
### Showcase Configuration
This example shows the expected shape of each configurable field with non-default demonstration values. Use it as a template, not as the generated default. The `urlFind`, `jsFind`, and `rules` entries shown here are custom additions; the generated `setting.yaml` already contains the built-in lists.
```yaml
debug: true
user_agent: "SecretScraper/0.1 (+https://example.local)"
cookie: "session=demo; theme=dark"
allow_domains:
  - "*.example.com"
  - "api.example.org"
disallow_domains:
  - "*.gov"
  - "logout.example.com"
url_file: "urls.txt"
config: "showcase.yaml"
timeout: 10.0
mode: thorough
max_page: 500
max_depth: 3
max_concurrent_per_domain: 10
min_request_interval: 0.5
outfile: "result.csv"
status_filter:
  - ["200", "200-400"]
proxy: "http://127.0.0.1:8080"
hide_regex: false
follow_redirects: true
dangerousPath:
  - logout
  - update
  - remove
  - insert
  - delete
url: "https://example.com"
detail: true
validate: true
local: null
headers:
  accept: "application/json,text/html,*/*"
  user-agent: "SecretScraper/0.1 (+https://example.local)"
  x-demo-header: "demo"
urlFind:
  - "https?://[A-Za-z0-9._~:/?#\\[\\]@!$&'()*+,;=%-]+"
jsFind:
  - "[\"']([^\"']+\\.js)[\"']"
rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
  - name: Disabled Rule
    regex: "IGNORE_ME"
    loaded: false
    group: false
```
For local scanning, replace the crawl target fields with `local`:
```yaml
url: null
url_file: null
local: "./samples"
outfile: "local-scan.yml"
```
### Default Configuration Values
The generated `setting.yaml` is the serialized form of `Config::default_with_rules()`. It includes all built-in URL, JavaScript, and secret rules.
The generated rule sections use two different shapes (the `headers` map is shown alongside them for context):
```yaml
urlFind:
  - "https?://..."
jsFind:
  - "[\"']([^\"']+\\.js)[\"']"
rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
headers:
  accept: "*/*"
  user-agent: "Mozilla/5.0 ..."
```
`urlFind` and `jsFind` are lists of regex strings that emit capture groups. `rules` is a list of named secret rules with `regex`, `loaded`, and `group` fields.
| Field | Default value | Meaning |
| --- | --- | --- |
| `debug` | `false` | Enable debug logging. |
| `user_agent` | `null` | Optional user-agent override. When set, it is inserted into request headers. |
| `cookie` | `null` | Optional cookie header value. |
| `allow_domains` | `null` | Optional allow-list of wildcard domain patterns. |
| `disallow_domains` | `null` | Optional block-list of wildcard domain patterns. |
| `url_file` | `null` | Optional newline-delimited seed URL file. |
| `config` | `setting.yaml` | Config file path used by the CLI. |
| `timeout` | `30.0` | Request timeout in seconds. |
| `mode` | `normal` | Crawl mode preset. `normal` uses depth 1; `thorough` uses depth 2. |
| `max_page` | `1000` | Maximum number of pages to crawl. |
| `max_depth` | `null` | Optional explicit crawl depth override. `0` means seed URLs only. |
| `max_concurrent_per_domain` | `50` | Maximum concurrent requests per domain. |
| `min_request_interval` | `0.2` | Minimum seconds between requests to the same domain. |
| `outfile` | `null` | Optional output path. Crawl output is CSV; local scan output is YAML. |
| `status_filter` | `null` | Optional response status display filter. |
| `proxy` | `null` | Optional proxy URL such as `http://127.0.0.1:8080` or `socks5://127.0.0.1:7890`. |
| `hide_regex` | `false` | Hide regex/secret output in human-readable output. |
| `follow_redirects` | `false` | Follow HTTP redirects while crawling. |
| `dangerousPath` | `null` | Optional path fragments to avoid requesting. |
| `url` | `null` | Optional single crawl seed URL. |
| `detail` | `false` | Show detailed crawl output. |
| `validate` | `false` | Validate discovered link status after crawling. |
| `local` | `null` | Optional local file or directory to scan recursively. |
| `urlFind` | five built-in regex strings | Regex rules used to discover URLs from text. |
| `jsFind` | three built-in regex strings | Regex rules used to discover JavaScript URLs from text. |
| `rules` | ten built-in named secret rules | Regex rules used to detect secrets. |
| `headers` | `accept: "*/*"` and `user-agent: "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0"` | Default HTTP headers sent by crawler requests. |
### Field Notes
- At least one of `url`, `url_file`, or `local` must be set before scanning.
- `follow_redirects` is the YAML field name; the CLI flag is `--follow-redirect`.
- `max_concurrent_per_domain` is the YAML field name; the CLI flag is `--max-concurrency-per-domain`.
- `headers` values are merged onto the default header map. Reusing `accept` or `user-agent` overrides the default values.
- `urlFind` and `jsFind` entries are plain regex strings. They do not use `name` or `loaded`, and they emit capture groups by default.
- `rules` entries use `name`, `regex`, `loaded`, and optional `group`. `group: true` emits capture groups instead of the full match; omitted `group` defaults to `false` (see the example after this list).
- `urlFind`, `jsFind`, and loaded `rules` entries are appended to any existing rules when you apply a YAML layer to `Config::default_with_rules()`.
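As an illustration of `group: true`, a hypothetical rule that reports only the captured token instead of the full match could look like this (the rule name and regex are invented for the example):
```yaml
rules:
  - name: Demo Bearer Token
    regex: "Bearer ([A-Za-z0-9._-]+)"
    loaded: true
    # group: true reports the capture group rather than the whole "Bearer ..." match
    group: true
```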
### Built-In Rules
Use `Config::default_with_rules()` to populate built-in URL, JavaScript, and secret-detection rules. The built-in secret-detection rules currently include:
- Swagger
- ID Card
- Phone
- JS Map
- URL as a value
- Email
- Internal IP
- Cloud Key
- Shiro
- Suspicious API Key
YAML `urlFind`, `jsFind`, and loaded `rules` entries are appended to the existing rule lists.
## Library Usage
SecretScraper can be used directly from Rust code. Before the first crates.io release, depend on it by local path while developing.
```toml
[dependencies]
secret_scraper = { path = "../secret-scraper-in-rust" }
```
After publication, use the crates.io dependency form:
```toml
[dependencies]
secret_scraper = "0.1"
```
### Crawl From Library Code
```rust
use secret_scraper::{
    cli::{Config, Mode},
    error::Result as SecretScraperResult,
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

fn crawl() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.url = Some("https://example.com".to_string());
    config.mode = Mode::Thorough;
    config.detail = true;
    config.outfile = Some("crawl.csv".into());

    match Box::new(CrawlerFacade::new(config)?).scan()? {
        ScanResult::CrawlResult(result) => {
            println!(
                "{} domains, {} URL groups, {} secret-bearing URLs",
                result.hosts.len(),
                result.urls.len(),
                result.secrets.len()
            );
        }
        ScanResult::LocalScanResult(_) => unreachable!("CrawlerFacade returns crawl results"),
    }

    Ok(())
}
```
### Scan Local Files From Library Code
```rust
use secret_scraper::{
    cli::Config,
    error::Result as SecretScraperResult,
    facade::{FileScannerFacade, ScanFacade, ScanResult},
};

fn scan_local() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.local = Some("./samples".into());
    config.outfile = Some("local-scan.yml".into());

    match Box::new(FileScannerFacade::new(config)?).scan()? {
        ScanResult::LocalScanResult(result) => {
            println!("{} files scanned", result.len());
        }
        ScanResult::CrawlResult(_) => unreachable!("FileScannerFacade returns local scan results"),
    }

    Ok(())
}
```
The public facade result type is `ScanStdResult`, an alias for `secret_scraper::error::Result<ScanResult>`. Errors are represented by `SecretScraperError`.
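A minimal sketch of consuming these types from a binary crate follows; it assumes `SecretScraperError` implements `Display`, which is conventional for error types but not shown above:
```rust
use secret_scraper::{
    cli::Config,
    error::Result as SecretScraperResult,
    facade::{FileScannerFacade, ScanFacade},
};

fn run() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.local = Some("./samples".into());
    // Construction and scanning errors propagate as `SecretScraperError`.
    Box::new(FileScannerFacade::new(config)?).scan()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        // Assumes `SecretScraperError: Display`; adjust if the crate only derives Debug.
        eprintln!("secret_scraper failed: {err}");
        std::process::exit(1);
    }
}
```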
## Examples
Run the local crawler example:
```bash
cargo run --example crawler
```
Run the local file-scanner facade example:
```bash
cargo run --example scan_facade
```
The crawler example starts a local HTTP server and runs both normal and thorough crawls. The scan facade example creates temporary files, scans them, writes YAML, and handles `SecretScraperResult` explicitly.
## Development
```bash
cargo fmt --check
cargo check
cargo check --examples
cargo test
```
Generate docs locally:
```bash
cargo doc --no-deps --open
```
## Notes
- Crawler output files are CSV.
- Local scan output files are YAML.
- The project currently uses Rust `regex` for rule matching.