# SecretScraper

SecretScraper is a Rust CLI and library for crawling web targets, discovering URLs and JavaScript links, and detecting secrets with regular-expression rules. It can also scan local files or directories recursively.

The package is prepared for use as both a CLI binary and a Rust library. Until a crates.io release is available, build, run, and depend on it from source.

## Features

- Crawl a single URL or newline-delimited URL seed file.
- Extract links from HTML (`a[href]`, `link[href]`, script tags) and from text via regex-based URL rules.
- Detect secrets with built-in and custom regex rules.
- Scan a single local file or a local directory tree.
- Allow-list and block-list domains with wildcard patterns.
- Configure headers, user agent, cookie, proxy, timeout, crawl depth, redirects, validation, and per-domain request limits.
- Write crawler results as CSV and local scan results as YAML.
- Use as a Rust library through `Config`, `CrawlerFacade`, `FileScannerFacade`, `ScanFacade`, and typed `SecretScraperError` results.

## Install

```bash
cargo install --path .
```

This builds an optimized release binary and installs it as `secret_scraper` in your Cargo bin directory (typically `~/.cargo/bin`). Make sure that directory is on your `PATH`.

For development without installing, you can still use `cargo run --` in place of `secret_scraper` throughout the examples below.

## CLI Usage

```bash
secret_scraper --help
```

### Crawl One URL

```bash
secret_scraper --url https://example.com
```

### Crawl Multiple URLs

```bash
secret_scraper --url-file urls.txt
```

`urls.txt` is newline-delimited. Blank lines are ignored.

```text
https://example.com/
https://example.com/docs
https://example.org/
```

### Scan Local Files

```bash
secret_scraper --local ./samples
```

If `--local` points to a file, SecretScraper scans that file. If it points to a directory, files are scanned recursively.
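
As a small self-contained sketch of that behavior (plain `std::fs` only; the helper names and the `./samples` path are illustrative, not crate APIs), directory scanning amounts to a recursive walk:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect the files that would be scanned: a file is taken as-is,
/// a directory is walked recursively. Illustrative only; this is not
/// the crate's implementation.
fn collect_files(path: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if path.is_dir() {
        for entry in fs::read_dir(path)? {
            collect_files(&entry?.path(), out)?;
        }
    } else {
        out.push(path.to_path_buf());
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut files = Vec::new();
    collect_files(Path::new("./samples"), &mut files)?;
    println!("{} file(s) would be scanned", files.len());
    Ok(())
}
```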

### Write Output

Crawler output is CSV:

```bash
secret_scraper --url https://example.com --outfile crawl.csv
```

Local file scan output is YAML:

```bash
secret_scraper --local ./samples --outfile local-scan.yml
```

### Crawl Modes And Depth

```bash
secret_scraper --url https://example.com --mode normal
secret_scraper --url https://example.com --mode thorough
```

`normal` uses a crawl depth preset of `1`. `thorough` uses a crawl depth preset of `2`. If `--max-depth` is set, it overrides the mode preset.

```bash
secret_scraper --url https://example.com --mode thorough --max-depth 4
```

`--max-depth 0` fetches only the seed URL(s).
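
The precedence can be stated in a few lines of Rust. Everything below is local to the sketch; only the presets (`1`, `2`) and the override rule come from the behavior described above:

```rust
/// Illustrative only: resolve the effective crawl depth from the mode
/// preset and the optional explicit override.
enum Mode {
    Normal,
    Thorough,
}

fn effective_depth(mode: Mode, max_depth: Option<u32>) -> u32 {
    max_depth.unwrap_or(match mode {
        Mode::Normal => 1,
        Mode::Thorough => 2,
    })
}

fn main() {
    assert_eq!(effective_depth(Mode::Thorough, None), 2); // mode preset
    assert_eq!(effective_depth(Mode::Thorough, Some(4)), 4); // explicit override wins
    assert_eq!(effective_depth(Mode::Normal, Some(0)), 0); // seed URLs only
}
```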

### Detail, Validation, Redirects, And Regex Output

Boolean CLI options are flags. Include the flag to enable the behavior; omit it to keep the default or YAML-configured value.

```bash
secret_scraper --url https://example.com --detail
secret_scraper --url https://example.com --validate
secret_scraper --url https://example.com --follow-redirect
secret_scraper --url https://example.com --hide-regex
```

`--validate` sends follow-up requests for discovered links to verify HTTP status. This can add requests even for links that are not crawled because of depth limits.

### Domain Filters

Allow-list and block-list filters accept comma-separated wildcard patterns:

```bash
secret_scraper --url https://example.com --allow-domains '*.example.com,example.org'
secret_scraper --url https://example.com --disallow-domains '*.gov,logout.example.com'
```

Filters apply to seed URLs and discovered URLs.
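
As a rough illustration of how patterns of this shape can be matched against a host (whether `*.example.com` also matches the bare apex `example.com` is an assumption of this sketch, not documented behavior):

```rust
/// Illustrative matcher for comma-separated wildcard domain patterns;
/// not the crate's implementation.
fn matches_pattern(pattern: &str, host: &str) -> bool {
    match pattern.strip_prefix("*.") {
        // `*.example.com`: match subdomains, and (assumption) the apex too.
        Some(suffix) => host == suffix || host.ends_with(&format!(".{suffix}")),
        // Bare pattern: exact host match.
        None => host == pattern,
    }
}

fn host_allowed(patterns: &str, host: &str) -> bool {
    patterns.split(',').any(|p| matches_pattern(p.trim(), host))
}

fn main() {
    assert!(host_allowed("*.example.com,example.org", "api.example.com"));
    assert!(host_allowed("*.example.com,example.org", "example.org"));
    assert!(!host_allowed("*.example.com", "example.net"));
}
```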

### Status Filters

Use `--status` with exact status codes and inclusive ranges:

```bash
secret_scraper --url https://example.com --status 200,300-399
```
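
The same `Exact`/`Range` shapes appear in the YAML `status_filter` field (see the showcase configuration below). Here is a standalone sketch of parsing and applying such a spec, illustrative rather than the crate's own parser:

```rust
/// Illustrative mirror of the documented Exact/Range filter shapes;
/// not the crate's own types or parser.
enum StatusFilter {
    Exact(u16),
    Range(u16, u16),
}

/// Parse a spec such as "200,300-399".
fn parse_status(spec: &str) -> Option<Vec<StatusFilter>> {
    spec.split(',')
        .map(|part| {
            let part = part.trim();
            if let Some((lo, hi)) = part.split_once('-') {
                Some(StatusFilter::Range(lo.parse().ok()?, hi.parse().ok()?))
            } else {
                Some(StatusFilter::Exact(part.parse().ok()?))
            }
        })
        .collect()
}

fn status_shown(filters: &[StatusFilter], status: u16) -> bool {
    filters.iter().any(|f| match f {
        StatusFilter::Exact(code) => *code == status,
        StatusFilter::Range(lo, hi) => (*lo..=*hi).contains(&status),
    })
}

fn main() {
    let filters = parse_status("200,300-399").expect("valid spec");
    assert!(status_shown(&filters, 200));
    assert!(status_shown(&filters, 302));
    assert!(!status_shown(&filters, 404));
}
```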

### Headers, Cookie, User Agent, Proxy, And Rate Limits

```bash
secret_scraper \
  --url https://example.com \
  --ua "SecretScraper/0.1" \
  --cookie "session=abc" \
  --proxy "http://127.0.0.1:8080" \
  --max-concurrency-per-domain 10 \
  --min-request-interval 0.2
```

`--max-concurrency-per-domain` caps concurrent requests per domain. `--min-request-interval` is the minimum number of seconds between request starts for the same domain.
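
To make the pacing concrete, here is a simplified single-threaded sketch of a per-domain minimum interval. Only the observable limits are documented; the mechanism shown here is an assumption, and the concurrency cap is omitted:

```rust
use std::collections::HashMap;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Illustrative pacer: remembers the last request start per domain and
/// sleeps until at least `min_interval` has elapsed. Not the crate's code.
struct DomainPacer {
    min_interval: Duration,
    last_start: HashMap<String, Instant>,
}

impl DomainPacer {
    fn new(min_interval_secs: f64) -> Self {
        Self {
            min_interval: Duration::from_secs_f64(min_interval_secs),
            last_start: HashMap::new(),
        }
    }

    /// Block until a request to `domain` may start, then record the start time.
    fn wait_for_slot(&mut self, domain: &str) {
        if let Some(last) = self.last_start.get(domain) {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                sleep(self.min_interval - elapsed);
            }
        }
        self.last_start.insert(domain.to_string(), Instant::now());
    }
}

fn main() {
    let mut pacer = DomainPacer::new(0.2);
    for _ in 0..3 {
        pacer.wait_for_slot("example.com"); // at least ~0.2 s between request starts
    }
}
```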

## CLI Options Summary

The authoritative list is `secret_scraper --help`. Current key options are:

- `--url`, `-u`: crawl one seed URL.
- `--url-file`, `-f`: crawl seed URLs from a newline-delimited file.
- `--local`, `-l`: scan a local file or directory recursively.
- `--config`, `-i`: load a YAML config file.
- `--mode`, `-m`: `normal` or `thorough`.
- `--max-page`: maximum number of pages to crawl.
- `--max-depth`: explicit crawl depth override.
- `--max-concurrency-per-domain`: max concurrent requests per domain.
- `--min-request-interval`: seconds between requests to the same domain.
- `--outfile`, `-o`: write crawler CSV or local scan YAML output.
- `--status`, `-s`: filter displayed response statuses.
- `--allow-domains`, `-d`: allow-list domains.
- `--disallow-domains`, `-D`: block-list domains.
- `--ua`, `-a`: set `User-Agent`.
- `--cookie`, `-c`: set `Cookie`.
- `--proxy`, `-x`: set HTTP/SOCKS proxy.
- `--debug`: enable debug logging.
- `--detail`: print detailed crawl output.
- `--validate`: validate discovered link statuses.
- `--follow-redirect`, `-F`: follow redirects.
- `--hide-regex`, `-H`: hide regex/secret output.

At least one of `--url`, `--url-file`, or `--local` is required.

## Configuration

The runtime configuration is built in this order:

1. Start from `Config::default()` or `Config::default_with_rules()`.
2. Apply YAML with `Config::apply_file_layer(...)`.
3. Apply CLI options with `Config::apply_cli_layer(...)`.
4. Validate with `Config::validate()`.

CLI values override YAML values. Missing CLI/YAML fields do not clear existing values.
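
The merge rule can be pictured with a generic sketch. The struct and helper below are local to this example, not the crate's API; a later layer only replaces a field when it actually provides a value:

```rust
/// Local illustration only; this is not the crate's API.
#[derive(Debug)]
struct Layer {
    url: Option<String>,
    timeout: Option<f64>,
}

/// The later layer wins, but only for fields it actually sets.
fn merge(base: Layer, over: Layer) -> Layer {
    Layer {
        url: over.url.or(base.url),
        timeout: over.timeout.or(base.timeout),
    }
}

fn main() {
    let yaml = Layer { url: Some("https://example.com".into()), timeout: Some(10.0) };
    let cli = Layer { url: Some("https://override.example".into()), timeout: None };
    let merged = merge(yaml, cli);
    // The CLI URL wins; the YAML timeout survives because the CLI left it unset.
    println!("{merged:?}");
}
```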

The default config file path is `setting.yaml`. When that file does not exist, the binary writes a generated default configuration to that path and exits after printing a message.

### Using `setting.yaml`

Create a config file with the default path:

```bash
secret_scraper
```

Edit `setting.yaml` (setting at least one of `url`, `url_file`, or `local`), then run the CLI again with the same default path:

```bash
secret_scraper
```

Use a different config file with `--config`:

```bash
secret_scraper --config showcase.yaml
```

CLI values override values loaded from YAML. For example, this uses all values from `showcase.yaml` except `url` and `outfile`:

```bash
secret_scraper --config showcase.yaml --url https://override.example --outfile override.csv
```

### Showcase Configuration

This example shows the expected shape of each configurable field with non-default demonstration values. Use it as a template, not as the generated default. The `urlFind`, `jsFind`, and `rules` entries shown here are custom additions; the generated `setting.yaml` already contains the built-in lists.

```yaml
debug: true
user_agent: "SecretScraper/0.1 (+https://example.local)"
cookie: "session=demo; theme=dark"
allow_domains:
  - "*.example.com"
  - "api.example.org"
disallow_domains:
  - "*.gov"
  - "logout.example.com"
url_file: "urls.txt"
config: "showcase.yaml"
timeout: 10.0
mode: Thorough
max_page: 500
max_depth: 3
max_concurrent_per_domain: 10
min_request_interval: 0.5
outfile: "result.csv"
status_filter:
  - Exact: 200
  - Range:
      - 300
      - 399
proxy: "http://127.0.0.1:8080"
hide_regex: false
follow_redirects: true
dangerousPath:
  - logout
  - update
  - remove
  - insert
  - delete
url: "https://example.com"
detail: true
validate: true
local: null

headers:
  accept: "application/json,text/html,*/*"
  user-agent: "SecretScraper/0.1 (+https://example.local)"
  x-demo-header: "demo"

urlFind:
  - "https?://[A-Za-z0-9._~:/?#\\[\\]@!$&'()*+,;=%-]+"

jsFind:
  - "[\"']([^\"']+\\.js)[\"']"

rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
  - name: Disabled Rule
    regex: "IGNORE_ME"
    loaded: false
    group: false
```

For local scanning, replace the crawl target fields with `local`:

```yaml
url: null
url_file: null
local: "./samples"
outfile: "local-scan.yml"
```

### Default Configuration Values

The generated `setting.yaml` is the serialized form of `Config::default_with_rules()`. It includes all built-in URL, JavaScript, and secret rules.

The generated rule sections use two different shapes:

```yaml
urlFind:
  - "https?://..."
jsFind:
  - "[\"']([^\"']+\\.js)[\"']"
rules:
  - name: Custom Secret
    regex: "SECRET_[A-Z0-9]+"
    loaded: true
    group: false
headers:
  accept: "*/*"
  user-agent: "Mozilla/5.0 ..."
```

`urlFind` and `jsFind` are lists of regex strings that emit capture groups. `rules` is a list of named secret rules with `regex`, `loaded`, and `group` fields.

| Field | Default value | Meaning |
| --- | --- | --- |
| `debug` | `false` | Enable debug logging. |
| `user_agent` | `null` | Optional user-agent override. When set, it is inserted into request headers. |
| `cookie` | `null` | Optional cookie header value. |
| `allow_domains` | `null` | Optional allow-list of wildcard domain patterns. |
| `disallow_domains` | `null` | Optional block-list of wildcard domain patterns. |
| `url_file` | `null` | Optional newline-delimited seed URL file. |
| `config` | `setting.yaml` | Config file path used by the CLI. |
| `timeout` | `30.0` | Request timeout in seconds. |
| `mode` | `Normal` | Crawl mode preset. `Normal` uses depth 1; `Thorough` uses depth 2. |
| `max_page` | `1000` | Maximum number of pages to crawl. |
| `max_depth` | `null` | Optional explicit crawl depth override. `0` means seed URLs only. |
| `max_concurrent_per_domain` | `50` | Maximum concurrent requests per domain. |
| `min_request_interval` | `0.2` | Minimum seconds between requests to the same domain. |
| `outfile` | `null` | Optional output path. Crawl output is CSV; local scan output is YAML. |
| `status_filter` | `null` | Optional response status display filter. |
| `proxy` | `null` | Optional proxy URL such as `http://127.0.0.1:8080` or `socks5://127.0.0.1:7890`. |
| `hide_regex` | `false` | Hide regex/secret output in human-readable output. |
| `follow_redirects` | `false` | Follow HTTP redirects while crawling. |
| `dangerousPath` | `null` | Optional path fragments to avoid requesting. |
| `url` | `null` | Optional single crawl seed URL. |
| `detail` | `false` | Show detailed crawl output. |
| `validate` | `false` | Validate discovered link status after crawling. |
| `local` | `null` | Optional local file or directory to scan recursively. |
| `urlFind` | five built-in regex strings | Regex rules used to discover URLs from text. |
| `jsFind` | three built-in regex strings | Regex rules used to discover JavaScript URLs from text. |
| `rules` | ten built-in named secret rules | Regex rules used to detect secrets. |
| `headers` | `accept: "*/*"` and `user-agent: "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0"` | Default HTTP headers sent by crawler requests. |

### Field Notes

- At least one of `url`, `url_file`, or `local` must be set before scanning.
- `mode` accepts `Normal`/`Thorough` in generated YAML and `normal`/`thorough` on the CLI.
- `follow_redirects` is the YAML field name; the CLI flag is `--follow-redirect`.
- `headers` values are merged onto the default header map. Redefining `accept` or `user-agent` overrides the default values.
- `urlFind` and `jsFind` entries are plain regex strings. They do not use `name` or `loaded`, and they emit capture groups by default.
- `rules` entries use `name`, `regex`, `loaded`, and optional `group`. `group: true` emits capture groups instead of the full match; omitted `group` defaults to `false`.
- `urlFind`, `jsFind`, and loaded `rules` entries are appended to any existing rules when you apply a YAML layer to `Config::default_with_rules()`.
- `rules` entries with `loaded: false` are skipped.
- Invalid regex patterns or invalid header names/values cause configuration loading to fail.

`status_filter` is serialized as `Exact` and `Range` entries. You normally set it from the CLI with `--status 200,300-399`; if writing YAML manually, use the tagged enum shape shown in the showcase.
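
To make the `group` distinction concrete, the sketch below uses the `regex` crate directly (which the project also uses for matching). The rule pattern and input are invented for illustration: with `group: false` a rule reports the full match, with `group: true` only the capture groups.

```rust
use regex::Regex;

fn main() {
    // Hypothetical secret rule with a single capture group.
    let rule = Regex::new(r#"api_key\s*=\s*"([A-Za-z0-9_-]+)""#).unwrap();
    let text = r#"config.api_key = "SECRET_ABC123";"#;

    for caps in rule.captures_iter(text) {
        // group: false -> the whole match is reported, quotes and all.
        println!("full match:    {}", &caps[0]);
        // group: true -> only the captured value is reported.
        println!("capture group: {}", &caps[1]);
    }
}
```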

### Built-In Rules

Use `Config::default_with_rules()` to populate built-in URL, JavaScript, and secret-detection rules. The built-in secret rules currently include:

- Swagger
- ID Card
- Phone
- JS Map
- URL as a value
- Email
- Internal IP
- Cloud Key
- Shiro
- Suspicious API Key

YAML `urlFind`, `jsFind`, and loaded `rules` entries are appended to the existing rule lists.

## Library Usage

SecretScraper can be used directly from Rust code. Until the first crates.io release, depend on it via a local path while developing.

```toml
[dependencies]
secret_scraper = { path = "../secret-scraper-in-rust" }
```

After publication, use the crates.io dependency form:

```toml
[dependencies]
secret_scraper = "0.1"
```

### Crawl From Library Code

```rust
use secret_scraper::{
    cli::{Config, Mode},
    error::Result as SecretScraperResult,
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

fn crawl() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.url = Some("https://example.com".to_string());
    config.mode = Mode::Thorough;
    config.detail = true;
    config.outfile = Some("crawl.csv".into());

    match Box::new(CrawlerFacade::new(config)?).scan()? {
        ScanResult::CrawlResult(result) => {
            println!(
                "{} domains, {} URL groups, {} secret-bearing URLs",
                result.hosts.len(),
                result.urls.len(),
                result.secrets.len()
            );
        }
        ScanResult::LocalScanResult(_) => unreachable!("CrawlerFacade returns crawl results"),
    }

    Ok(())
}
```

### Scan Local Files From Library Code

```rust
use secret_scraper::{
    cli::Config,
    error::Result as SecretScraperResult,
    facade::{FileScannerFacade, ScanFacade, ScanResult},
};

fn scan_local() -> SecretScraperResult<()> {
    let mut config = Config::default_with_rules();
    config.local = Some("./samples".into());
    config.outfile = Some("local-scan.yml".into());

    match Box::new(FileScannerFacade::new(config)?).scan()? {
        ScanResult::LocalScanResult(result) => {
            println!("{} files scanned", result.len());
        }
        ScanResult::CrawlResult(_) => unreachable!("FileScannerFacade returns local scan results"),
    }

    Ok(())
}
```

The public facade result type is `ScanStdResult`, an alias for `secret_scraper::error::Result<ScanResult>`. Errors are represented by `SecretScraperError`.
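
For callers that want to handle the error branch explicitly, here is a minimal sketch. It assumes `SecretScraperError` implements `Display` (typical for error types, but treat the formatting as an assumption):

```rust
use secret_scraper::{
    cli::Config,
    facade::{CrawlerFacade, ScanFacade, ScanResult},
};

fn run() -> secret_scraper::error::Result<ScanResult> {
    let mut config = Config::default_with_rules();
    config.url = Some("https://example.com".to_string());
    Box::new(CrawlerFacade::new(config)?).scan()
}

fn main() {
    match run() {
        Ok(_result) => println!("scan finished"),
        // Assumes SecretScraperError implements Display.
        Err(err) => eprintln!("scan failed: {err}"),
    }
}
```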

## Examples

Run the local crawler example:

```bash
cargo run --example crawler
```

Run the local file-scanner facade example:

```bash
cargo run --example scan_facade
```

The crawler example starts a local HTTP server and runs both normal and thorough crawls. The scan facade example creates temporary files, scans them, writes YAML, and handles `SecretScraperResult` explicitly.

## Development

```bash
cargo fmt --check
cargo check
cargo check --examples
cargo test
```

Generate docs locally:

```bash
cargo doc --no-deps --open
```

## Notes

- Crawler output files are CSV.
- Local scan output files are YAML.
- The project currently uses the Rust `regex` crate for rule matching.
- Public package links are intentionally omitted until the crate is published.