# Advanced Content Scanner
> **Module:** `advanced_content_scanner`
> **Feature Flag:** `advanced-content-scanner`
> **Source:** [`src/advanced_content_scanner.rs`](../src/advanced_content_scanner.rs)
> **Lines:** ~755 | **Dependencies:** `reqwest`, `scraper`, `regex`, `serde`
A comprehensive web content security scanner that crawls target domains and performs deep analysis for leaked secrets, JavaScript vulnerabilities, SSRF attack surfaces, and misconfigured security policies. Ported from the Python WebAnalyzer `modules/advanced_content_scanner.py` (1,516 lines) with full feature parity.
---
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
  - [Scan Pipeline](#scan-pipeline)
  - [Dependency Graph](#dependency-graph)
- [Public API](#public-api)
  - [`scan_content()`](#scan_content)
- [Data Structures](#data-structures)
  - [`ScannerResult`](#scannerresult)
  - [`SecretFinding`](#secretfinding)
  - [`JsVulnerability`](#jsvulnerability)
  - [`SsrfFinding`](#ssrffinding)
  - [`ScanSummary`](#scansummary)
- [Scan Phases](#scan-phases)
  - [Phase 1: Pre-Crawl Reconnaissance](#phase-1-pre-crawl-reconnaissance)
    - [robots.txt Parsing](#robotstxt-parsing)
    - [sitemap.xml Processing](#sitemapxml-processing)
  - [Phase 2: BFS Web Crawling](#phase-2-bfs-web-crawling)
    - [Crawl Configuration](#crawl-configuration)
    - [Link Extraction & Queueing](#link-extraction--queueing)
    - [Same-Domain Filtering](#same-domain-filtering)
  - [Phase 3: Secret Detection](#phase-3-secret-detection)
    - [Secret Pattern Catalog (24 patterns)](#secret-pattern-catalog-24-patterns)
    - [Shannon Entropy Validation](#shannon-entropy-validation)
    - [False Positive Filtering](#false-positive-filtering)
    - [Secret Masking](#secret-masking)
  - [Phase 4: JavaScript Security Analysis](#phase-4-javascript-security-analysis)
    - [Inline JS Extraction](#inline-js-extraction)
    - [External JS Fetching](#external-js-fetching)
    - [Known Library Exclusion](#known-library-exclusion)
    - [Vulnerability Categories (13 types)](#vulnerability-categories-13-types)
    - [Minified File Handling](#minified-file-handling)
  - [Phase 5: SSRF Detection](#phase-5-ssrf-detection)
    - [Form Parameter Scanning](#form-parameter-scanning)
    - [URL Parameter Scanning](#url-parameter-scanning)
    - [API Endpoint SSRF Probing](#api-endpoint-ssrf-probing)
    - [SSRF Parameter Name List (60+ params)](#ssrf-parameter-name-list-60-params)
  - [Phase 6: HTML/Meta Security Checks](#phase-6-htmlmeta-security-checks)
    - [Weak CSP Detection](#weak-csp-detection)
    - [Missing CSRF Token Detection](#missing-csrf-token-detection)
  - [Phase 7: API Endpoint Discovery](#phase-7-api-endpoint-discovery)
    - [Regex-Based Extraction](#regex-based-extraction)
  - [Phase 8: Post-Processing](#phase-8-post-processing)
    - [Deduplication](#deduplication)
- [Internal Functions](#internal-functions)
  - [`scan_for_secrets()`](#scan_for_secrets)
  - [`scan_js_security()`](#scan_js_security)
  - [`check_url_params_ssrf()`](#check_url_params_ssrf)
  - [`extract_api_endpoints()`](#extract_api_endpoints)
  - [`resolve_url()`](#resolve_url)
  - [`is_same_domain()`](#is_same_domain)
  - [`shannon_entropy()`](#shannon_entropy)
  - [`mask_secret()`](#mask_secret)
  - [`is_false_positive_context()`](#is_false_positive_context)
  - [`is_known_library()`](#is_known_library)
  - [`dedup_secrets()`](#dedup_secrets)
  - [`dedup_js_vulns()`](#dedup_js_vulns)
- [Usage Example](#usage-example)
- [Testing](#testing)
- [Configuration Constants](#configuration-constants)
---
## Overview
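The scanner's work on each target domain falls into 8 logical phases. Several of them run interleaved inside the crawl loop, which is why the pipeline diagram under [Architecture](#architecture) groups the same work into 5 execution stages: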
1. **Pre-Crawl Recon** — Parse `robots.txt` and `sitemap.xml`
2. **BFS Crawl** — Breadth-first crawl up to `max_depth=2`, `max_pages=50`
3. **Secret Detection** — Regex scan all content for 24 types of leaked secrets
4. **JS Security Analysis** — Analyze inline + external JS for 13 vulnerability categories
5. **SSRF Detection** — Scan forms, URL params, and actively probe API endpoints
6. **HTML/Meta Checks** — Detect weak CSP and missing CSRF tokens
7. **API Endpoint Discovery** — Extract API endpoints from HTML and JS content
8. **Post-Processing** — Deduplicate findings and assemble the result
---
## Architecture
### Scan Pipeline
```
┌──────────────────────────────────────────────────────────────────────────┐
│                           scan_content(domain)                           │
├─────────────────┬────────────────────────────────────────────────────────┤
│  Phase 1        │  Fetch /robots.txt → parse Disallow rules              │
│  Pre-Crawl      │  Fetch /sitemap.xml → extract <loc> URLs → seed queue  │
├─────────────────┼────────────────────────────────────────────────────────┤
│  Phase 2        │  BFS queue loop (max_depth=2, max_pages=50)            │
│  Crawl          │  ├─ Check robots.txt disallow                          │
│                 │  ├─ Check URL params for SSRF                          │
│                 │  ├─ Fetch page                                         │
│                 │  ├─ Run scan_for_secrets() on body                     │
│                 │  ├─ Run extract_api_endpoints() on body                │
│                 │  └─ If HTML:                                           │
│                 │     ├─ Extract <a href> → queue new links              │
│                 │     ├─ Extract <script> inline JS                      │
│                 │     │  ├─ scan_js_security()                           │
│                 │     │  └─ scan_for_secrets()                           │
│                 │     ├─ Collect <script src> external JS URLs           │
│                 │     ├─ Check forms for SSRF params                     │
│                 │     ├─ Check <meta> for weak CSP                       │
│                 │     └─ Check forms for missing CSRF                    │
├─────────────────┼────────────────────────────────────────────────────────┤
│  Phase 3        │  Fetch external JS files (skip known libraries)        │
│  External JS    │  ├─ scan_js_security()                                 │
│                 │  ├─ scan_for_secrets()                                 │
│                 │  └─ extract_api_endpoints()                            │
├─────────────────┼────────────────────────────────────────────────────────┤
│  Phase 4        │  Active SSRF probing on discovered API endpoints       │
│  SSRF Probing   │  └─ Up to 20 endpoints × 5 SSRF payloads               │
├─────────────────┼────────────────────────────────────────────────────────┤
│  Phase 5        │  Deduplicate all findings                              │
│  Post-Process   │  └─ Assemble ScannerResult                             │
└─────────────────┴────────────────────────────────────────────────────────┘
```
### Dependency Graph
```
advanced_content_scanner
├── crate::payloads (SSRF probe URLs from payloads/ssrf.txt)
├── reqwest::Client (HTTP client with 15s timeout, TLS skip)
├── scraper::{Html,Selector} (CSS selector-based HTML parsing)
├── regex::Regex (Pattern matching for secrets, JS vulns, API paths)
└── serde::{Serialize,Deserialize} (JSON serialization for results)
```
---
## Public API
### `scan_content()`
```rust
pub async fn scan_content(
    domain: &str,
) -> Result<ScannerResult, Box<dyn std::error::Error + Send + Sync>>
```
**Parameters:**
| Parameter | Type | Description |
|-----------|------|-------------|
| `domain` | `&str` | Target domain. Accepts bare domain (`example.com`) or full URL (`https://example.com`). If no scheme is provided, `https://` is prepended. |
**Returns:** `Result<ScannerResult, Error>` — Full scan results or an error.
**Behavior:**
- Creates an HTTP client with 15-second timeout and TLS certificate validation disabled
- Executes all 8 scan phases sequentially
- All regex patterns are compiled once at scan start for performance
---
## Data Structures
### `ScannerResult`
Top-level result container returned by `scan_content()`.
```rust
pub struct ScannerResult {
    pub domain: String, // Target domain
    pub secrets: Vec<SecretFinding>, // Leaked secrets found
    pub js_vulnerabilities: Vec<JsVulnerability>, // JavaScript security issues
    pub ssrf_vulnerabilities: Vec<SsrfFinding>, // SSRF attack surfaces
    pub api_endpoints_discovered: Vec<String>, // API endpoints found in content
    pub summary: ScanSummary, // Aggregate stats
}
```
### `SecretFinding`
Represents a single leaked secret or credential.
```rust
pub struct SecretFinding {
    pub secret_type: String, // e.g. "AWS Access Key", "JWT Token"
    pub severity: String, // "High", "Medium", or "Low"
    pub masked_value: String, // e.g. "AKIA****WXYZ" (first 4 + last 4)
    pub source_url: String, // URL where the secret was found
    pub line: usize, // Line number in the source content
    pub entropy: f64, // Shannon entropy (rounded to 2 decimals)
    pub recommendation: String, // Remediation guidance
}
```
### `JsVulnerability`
Represents a JavaScript security vulnerability or misconfiguration.
```rust
pub struct JsVulnerability {
    pub vuln_type: String, // e.g. "DOM XSS", "Weak CSP"
    pub severity: String, // "High", "Medium", or "Low"
    pub source_url: String, // URL where the vulnerability was found
    pub matched_code: String, // Code snippet (truncated to 200 chars)
    pub description: String, // What the vulnerability means
    pub recommendation: String, // How to fix it
}
```
### `SsrfFinding`
Represents a Server-Side Request Forgery attack surface.
```rust
pub struct SsrfFinding {
    // "Potential SSRF in Form", "Potential SSRF in URL Parameter",
    // or "Confirmed SSRF in API Endpoint"
    pub finding_type: String,
    pub severity: String, // "Medium" (passive) or "High" (confirmed)
    pub source_url: String, // URL containing the vulnerability
    pub vulnerable_params: Vec<String>, // Parameter names flagged
    pub description: String, // Details of the finding
}
```
### `ScanSummary`
Aggregate statistics for the scan run.
```rust
pub struct ScanSummary {
    pub total_urls_crawled: usize, // Pages visited by the BFS crawler
    pub total_js_files: usize, // External JS files fetched and analyzed
    pub total_api_endpoints: usize, // API endpoints discovered in content
    pub secrets_count: usize, // Total unique secrets found
    pub js_vulnerabilities_count: usize, // Total unique JS vulnerabilities
    pub ssrf_vulnerabilities_count: usize, // Total SSRF findings
}
```
---
## Scan Phases
### Phase 1: Pre-Crawl Reconnaissance
#### robots.txt Parsing
- Fetches `{base_url}/robots.txt`
- Parses `User-agent: *` blocks
- Collects all `Disallow:` paths into a list
- During BFS crawl, any URL path matching a disallowed prefix is **skipped**
- Gracefully handles missing or inaccessible `robots.txt`
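
The parsing and prefix-matching steps above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the function names `parse_disallow_rules` and `is_disallowed` are hypothetical, and the sketch assumes each `User-agent:` line starts a new block (grouped `User-agent` lines are not handled).

```rust
/// Collects `Disallow:` paths from blocks that apply to `User-agent: *`.
/// Hypothetical helper; simplified robots.txt handling.
fn parse_disallow_rules(robots_txt: &str) -> Vec<String> {
    let mut rules = Vec::new();
    let mut in_wildcard_block = false;
    for raw in robots_txt.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else { continue };
        let value = value.trim();
        match key.trim().to_ascii_lowercase().as_str() {
            "user-agent" => in_wildcard_block = value == "*",
            // An empty `Disallow:` means "allow everything", so it is ignored.
            "disallow" if in_wildcard_block && !value.is_empty() => {
                rules.push(value.to_string());
            }
            _ => {}
        }
    }
    rules
}

/// True when `path` starts with any disallowed prefix.
fn is_disallowed(path: &str, rules: &[String]) -> bool {
    rules.iter().any(|rule| path.starts_with(rule.as_str()))
}
```

With rules parsed from `User-agent: *` / `Disallow: /admin`, a call such as `is_disallowed("/admin/users", &rules)` returns `true`, so the crawler would skip that URL.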
#### sitemap.xml Processing
- Fetches `{base_url}/sitemap.xml`
- Extracts all `<loc>` URLs using regex: `<loc>([^<]+)</loc>`
- Filters to same-domain URLs only
- Seeds the BFS queue at depth 1 (giving the crawler better starting coverage)
- Gracefully handles missing or malformed sitemaps
---
### Phase 2: BFS Web Crawling
#### Crawl Configuration
| Setting | Value | Description |
|---------|-------|-------------|
| `max_depth` | `2` | Maximum link-following depth from the root URL |
| `max_pages` | `50` | Maximum total pages to visit before stopping |
| HTTP timeout | `15s` | Per-request timeout |
| TLS validation | `disabled` | Accepts self-signed certificates |
#### Link Extraction & Queueing
For every HTML page visited, all `<a href="...">` elements are extracted. Each `href` is:
1. Resolved to an absolute URL via `resolve_url()`
2. Filtered to same-domain only via `is_same_domain()`
3. Checked for duplicates against the `visited` set
4. Added to the BFS queue with `depth + 1`
`resolve_url()` **skips** the following `href` values:
- `javascript:`, `mailto:`, and `tel:` URLs, plus `#` anchor fragments
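
The queueing logic can be sketched as a breadth-first traversal with a depth limit and a page budget. In this sketch an in-memory link map stands in for "fetch the page and extract `<a href>` links"; the function name `bfs_crawl` and the constant names are illustrative assumptions, not the module's own identifiers.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

const MAX_DEPTH: usize = 2; // mirrors max_depth
const MAX_PAGES: usize = 50; // mirrors max_pages

/// Breadth-first crawl over an in-memory link graph; returns URLs in visit order.
fn bfs_crawl(start: &str, links: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut order = Vec::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        if order.len() >= MAX_PAGES {
            break; // page budget exhausted
        }
        // Skip URLs that are too deep or were already visited.
        if depth > MAX_DEPTH || !visited.insert(url.clone()) {
            continue;
        }
        order.push(url.clone());
        // A real crawler would fetch `url` here and extract its links.
        for next in links.get(url.as_str()).into_iter().flatten() {
            if !visited.contains(*next) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    order
}
```

Because the queue is first-in, first-out, every depth-1 page is visited before any depth-2 page, so the 50-page budget is spent on pages closest to the root first.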
#### Same-Domain Filtering
`is_same_domain()` extracts the hostname from both URLs (stripping scheme and path) and performs a case-insensitive comparison. Only URLs on the same host are queued.
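
A minimal sketch of that comparison, assuming the host is whatever sits between the scheme and the first `/`, `?`, or `#`. The helper `host_of` is hypothetical, and this simplified version keeps any port or userinfo as part of the host, which the real function may or may not do.

```rust
/// Returns the host portion of a URL (scheme, path, query, and fragment removed).
/// Hypothetical helper; ports and userinfo are left in place.
fn host_of(url: &str) -> &str {
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    rest.split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()
        .unwrap_or("")
}

/// Case-insensitive host comparison.
fn is_same_domain(base: &str, url: &str) -> bool {
    host_of(base).eq_ignore_ascii_case(host_of(url))
}
```

Under these assumptions `https://Example.com/a` and `http://example.COM/b?x=1` count as the same domain, while a subdomain such as `api.example.com` does not.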
---
### Phase 3: Secret Detection
#### Secret Pattern Catalog (24 patterns)
| # | Secret Type | Pattern | Severity |
|---|-------------|---------|----------|
| 1 | AWS Access Key | `\bAKIA[0-9A-Z]{16}\b` | Medium |
| 2 | AWS Secret Key | `\b[0-9a-zA-Z/+]{40}\b` | High |
| 3 | Google API Key | `\bAIza[0-9A-Za-z\-_]{35}\b` | Medium |
| 4 | Google OAuth | `[0-9]+-[0-9A-Za-z_]{32}\.apps\.googleusercontent\.com` | Medium |
| 5 | Stripe API Key | `\b(?:sk\|pk)_(live\|test)_[0-9a-zA-Z]{24,34}\b` | High |
| 6 | GitHub Token | `\b(?:github\|gh)(?:_pat)?_[0-9a-zA-Z]{36,40}\b` | High |
| 7 | GitHub OAuth | `\bgho_[0-9a-zA-Z]{36,40}\b` | High |
| 8 | Facebook Access Token | `EAACEdEose0cBA[0-9A-Za-z]+` | Medium |
| 9 | JWT Token | `eyJ[a-zA-Z0-9_\-]*\.[a-zA-Z0-9_\-]*\.[a-zA-Z0-9_\-]*` | Medium |
| 10 | SSH Private Key | `-----BEGIN\s+(?:RSA\|DSA\|EC\|OPENSSH)\s+PRIVATE\s+KEY` | High |
| 11 | Password in URL | `[a-zA-Z]{3,10}://[^/\s:@]{3,20}:[^/\s:@]{3,20}@.{1,100}` | High |
| 12 | Firebase URL | `https://[a-z0-9-]+\.firebaseio\.com` | Low |
| 13 | MongoDB Connection String | `mongodb(?:\+srv)?://[^/\s]+:[^/\s]+@[^/\s]+` | High |
| 14 | Slack Token | `xox[baprs]-[0-9a-zA-Z\-]{10,48}` | Medium |
| 15 | Slack Webhook | `https://hooks\.slack\.com/services/T.../B.../...` | Medium |
| 16 | API Key (generic) | ``(?i)\b(?:api[_\-]?key\|apikey)\b\s*[=:]\s*["'`](...)["'`]`` | Medium |
| 17 | Secret Key (generic) | ``(?i)\b(?:secret[_\-]?key\|secretkey)\b\s*[=:]\s*["'`](...)["'`]`` | Medium |
| 18 | Auth Token (generic) | ``(?i)\b(?:auth[_\-]?token\|authtoken)\b\s*[=:]\s*["'`](...)["'`]`` | Medium |
| 19 | Access Token (generic) | ``(?i)\b(?:access[_\-]?token\|accesstoken)\b\s*[=:]\s*["'`](...)["'`]`` | Medium |
| 20 | Encryption Key | ``(?i)(?:encryption\|aes\|des\|blowfish)[\s_-]?key[\s=:]+["'`]...["'`]`` | High |
| 21 | Stripe Publishable Key | `\bpk_(live\|test)_[0-9a-zA-Z]{24,34}\b` | Low |
| 22 | Twitter Bearer | `AAAAAAAAAAAAAAAAAAA[A-Za-z0-9%]+` | Medium |
| 23 | Password (hardcoded) | ``(?i)(?:password\|passwd\|pwd)[\s=:]+["'`](...)["'`]`` | High |
| 24 | Database Credentials | ``(?i)(?:db_pass\|db_password\|database_password)[\s=:]+["'`](...)["'`]`` | High |
#### Shannon Entropy Validation
Calculates the [Shannon entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) of matched strings:
```
H(X) = -Σ p(x) × log₂(p(x))
```
- **Threshold: 3.5 bits per character.** Matches scoring below this are discarded, but only for these pattern types: `AWS Secret Key`, `Google API Key`, `API Key`, `Secret Key`
- Low-entropy strings such as `aaaaaaaaaaaaaaaa` (H = 0.0) or `abcabcabcabc` (H ≈ 1.58) are filtered out
- High-entropy strings such as `aB3xZ9kL2mN7pQ4s` (H = 4.0, 16 distinct characters) pass through
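
The formula and the threshold rule can be sketched as follows. The 256-slot byte table matches the description of `shannon_entropy()` under Internal Functions; the gate function `passes_entropy_gate` and the constant names are assumptions made for illustration.

```rust
const ENTROPY_THRESHOLD: f64 = 3.5;
// Pattern types that are subject to the entropy gate.
const ENTROPY_GATED_TYPES: [&str; 4] =
    ["AWS Secret Key", "Google API Key", "API Key", "Secret Key"];

/// Shannon entropy in bits per byte, using a 256-slot frequency table.
fn shannon_entropy(data: &str) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &byte in data.as_bytes() {
        counts[byte as usize] += 1;
    }
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&n| n > 0)
        .map(|&n| {
            let p = n as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical gate: gated types must reach the threshold, other types always pass.
fn passes_entropy_gate(secret_type: &str, matched: &str) -> bool {
    let gated = ENTROPY_GATED_TYPES.iter().any(|t| *t == secret_type);
    !gated || shannon_entropy(matched) >= ENTROPY_THRESHOLD
}
```

For `aB3xZ9kL2mN7pQ4s` every one of the 16 bytes is distinct, so each has p = 1/16 and contributes 0.25 bits, giving 4.0 in total. That clears the gate, while `abcabcabcabc` (three symbols, about 1.58 bits) does not.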
#### False Positive Filtering
A match is discarded when the context around it (80 characters on either side) contains any of these 12 false-positive indicators:
| Indicator | Example |
|-----------|---------|
| `example` | `"key": "example-api-key-here"` |
| `sample` | `sample_token = "..."` |
| `placeholder` | `placeholder_secret` |
| `dummy` | `dummy_password` |
| `test` | `test_api_key` |
| `demo` | `demo_credentials` |
| `your_` | `your_api_key_here` |
| `my_` | `my_secret_key` |
| `template` | `template_token` |
| `undefined` | `undefined` |
| `localhost` | `mongodb://user:pass@localhost` |
| `127.0.0.1` | `http://127.0.0.1:3000` |
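
A sketch of the context check, under two assumptions that the text does not confirm: the window is measured in bytes and snapped outward to UTF-8 character boundaries, and the indicator match is case-insensitive. `context_window` is a hypothetical helper.

```rust
const CONTEXT_RADIUS: usize = 80;
const FP_INDICATORS: [&str; 12] = [
    "example", "sample", "placeholder", "dummy", "test", "demo",
    "your_", "my_", "template", "undefined", "localhost", "127.0.0.1",
];

/// Slices `content` around the byte range `start..end`, widened by the radius
/// and adjusted so the slice never splits a multi-byte character.
fn context_window(content: &str, start: usize, end: usize) -> &str {
    let mut lo = start.saturating_sub(CONTEXT_RADIUS);
    while !content.is_char_boundary(lo) {
        lo -= 1;
    }
    let mut hi = end.saturating_add(CONTEXT_RADIUS).min(content.len());
    while !content.is_char_boundary(hi) {
        hi += 1;
    }
    &content[lo..hi]
}

/// True when the context contains any false-positive indicator.
fn is_false_positive_context(context: &str) -> bool {
    let lower = context.to_lowercase();
    FP_INDICATORS.iter().any(|indicator| lower.contains(*indicator))
}
```

Broad indicators such as `test` and `my_` trade recall for precision: a real credential that sits near the word "test" would also be dropped by this check.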
#### Secret Masking
All reported secret values are masked for safe display:
| Secret length | Mask format | Example |
|---------------|-------------|---------|
| ≤ 2 chars | `****` | `****` |
| 3–8 chars | `****XX` | `****xy` |
| > 8 chars | `XXXX****XXXX` | `AKIA****WXYZ` |
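
The three masking tiers can be sketched like this. The sketch reads the `****XX` format as "mask everything except the last two characters", which is an interpretation of the table rather than something the text states outright, and it counts characters rather than bytes.

```rust
/// Masks a secret according to its length (in characters).
fn mask_secret(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    let n = chars.len();
    if n <= 2 {
        // Too short to reveal anything safely.
        "****".to_string()
    } else if n <= 8 {
        // Assumed reading of `****XX`: keep only the last two characters.
        let tail: String = chars[n - 2..].iter().collect();
        format!("****{tail}")
    } else {
        // Keep the first four and last four characters.
        let head: String = chars[..4].iter().collect();
        let tail: String = chars[n - 4..].iter().collect();
        format!("{head}****{tail}")
    }
}
```

For example, a 20-character value beginning with `AKIA` and ending with `WXYZ` is reported as `AKIA****WXYZ`, so the key type stays recognizable without exposing the full secret.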
---
### Phase 4: JavaScript Security Analysis
#### Inline JS Extraction
All `<script>` tags without a `src` attribute are collected. Inline scripts shorter than 10 characters are skipped.
#### External JS Fetching
All `<script src="...">` URLs are:
1. Resolved to absolute URLs
2. Filtered through `is_known_library()` — known libraries are skipped
3. Collected in a dedicated set
4. Fetched and analyzed after the BFS crawl completes
#### Known Library Exclusion
External JS URLs containing any of these 19 substrings are skipped entirely:
`jquery`, `bootstrap`, `modernizr`, `polyfill`, `vendor`, `bundle`, `analytics`, `tracking`, `ga.js`, `gtm.js`, `react`, `angular`, `vue`, `lodash`, `moment`, `cdn`, `static`, `dist`, `chunk`
#### Vulnerability Categories (13 types)
| # | Category | Severity | Patterns | Detects |
|---|----------|----------|----------|---------|
| 1 | **DOM XSS** | High | 4 | User-controllable data (`location`, `URL`, `referrer`, `hash`) passed to `document.write()`, `.innerHTML`, `.outerHTML`, or `eval()` |
| 2 | **Open Redirect** | High | 3 | `location.href`, `location.replace()`, `location.assign()` set from user-controlled variables |
| 3 | **CORS Misconfiguration** | Medium | 3 | `Access-Control-Allow-Origin: *`, `null`, or `Allow-Credentials: true` |
| 4 | **Insecure Cookie** | Medium | 1 | Direct `document.cookie =` assignment without flags |
| 5 | **Insecure Data Transmission** | Medium | 1 | `postMessage()` with wildcard `"*"` origin |
| 6 | **Prototype Pollution** | Medium | 2 | `__proto__` assignment or `prototype[` access |
| 7 | **Command Injection** | High | 2 | `exec()` or `spawn()` with user-input arguments |
| 8 | **Insecure Data Storage** | Low | 2 | `localStorage.setItem()` / `sessionStorage.setItem()` storing passwords, tokens, keys, etc. |
| 9 | **Event Handler XSS** | Medium | 1 | `.setAttribute('on...',` dynamic event handler assignment |
| 10 | **CSP Bypass** | Medium | 1 | `document.createElement('script')` dynamic script injection |
| 11 | **WebSocket Insecurity** | High | 1 | `new WebSocket('ws://...')` using unencrypted protocol |
| 12 | **Insecure Crypto** | High | 2 | Use of MD5/SHA1 or `Math.random()` for security purposes |
| 13 | **Path Traversal** | Medium | 1 | `../` or `..\` path traversal patterns |
#### Minified File Handling
Files are detected as **minified** when:
- Content length > 5,000 characters **AND**
- Newline count < 50
For minified files, **only High-severity** vulnerability checks are performed. Medium and Low severity checks are skipped to reduce noise.
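
The minified-file rule reduces to two small predicates. This sketch assumes "content length" means byte length (`str::len`); the `Severity` enum and the function `should_run_check` are illustrative, since the documented structs store severity as a `String`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Severity {
    High,
    Medium,
    Low,
}

/// A file counts as minified when it is long but has very few line breaks.
fn is_minified(content: &str) -> bool {
    content.len() > 5_000 && content.matches('\n').count() < 50
}

/// On minified files only High-severity checks run; otherwise every check runs.
fn should_run_check(content: &str, severity: Severity) -> bool {
    !is_minified(content) || severity == Severity::High
}
```

A 6,000-byte single-line bundle would therefore still be checked for DOM XSS (High) but not for insecure cookie assignments (Medium).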
---
### Phase 5: SSRF Detection
#### Form Parameter Scanning
For every `<form>` on each crawled HTML page:
1. All `<input name="...">` and `<textarea name="...">` elements are extracted
2. Each parameter name is checked (case-insensitive) against the SSRF parameter list
3. If any matches, an `SsrfFinding` with `finding_type: "Potential SSRF in Form"` is emitted
#### URL Parameter Scanning
For every URL the crawler visits:
1. The query string is parsed (`?key=value&key2=value2`)
2. Each parameter name is checked against the SSRF parameter list
3. If any matches, an `SsrfFinding` with `finding_type: "Potential SSRF in URL Parameter"` is emitted
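
The query-string check can be sketched without a URL-parsing crate. The list below is a small subset of the full parameter list for brevity, the function name `ssrf_params_in_url` is hypothetical, and the sketch assumes case-insensitive matching with no percent-decoding of parameter names.

```rust
// Subset of the SSRF parameter list, for illustration only.
const SSRF_PARAMS: &[&str] = &["url", "uri", "redirect", "callback", "next", "file"];

/// Returns the query parameter names in `url` that appear in the SSRF list.
fn ssrf_params_in_url(url: &str) -> Vec<String> {
    let Some((_, rest)) = url.split_once('?') else {
        return Vec::new(); // no query string
    };
    // Ignore any fragment that follows the query string.
    let query = rest.split('#').next().unwrap_or("");
    query
        .split('&')
        .filter_map(|pair| pair.split('=').next())
        .filter(|name| !name.is_empty())
        .filter(|name| SSRF_PARAMS.iter().any(|p| p.eq_ignore_ascii_case(name)))
        .map(|name| name.to_string())
        .collect()
}
```

A non-empty result would populate `vulnerable_params` on the emitted `SsrfFinding`. Note that this is a name-based heuristic: it flags an attack surface, not a confirmed vulnerability.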
#### API Endpoint SSRF Probing
After all crawling and JS analysis completes:
1. Up to **20 discovered API endpoints** are selected
2. For each, the **top 5 SSRF probe URLs** from `payloads/ssrf.txt` are tested
3. A `GET` request is sent to `{endpoint}?url={probe}`
4. If the response is a **redirect** and the `Location` header contains the probe URL, a **confirmed SSRF** finding with `severity: "High"` is emitted
SSRF probes include AWS metadata (`http://169.254.169.254/...`), GCP metadata, Azure metadata, localhost ports, and local file schemes.
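
The probing steps can be sketched as pure functions, leaving the HTTP call itself out. All names here (`probe_url`, `planned_requests`, `is_confirmed_ssrf`) are hypothetical, and the sketch assumes the probe is appended verbatim as shown in step 3, without percent-encoding.

```rust
const MAX_ENDPOINTS: usize = 20;
const PROBES_PER_ENDPOINT: usize = 5;

/// Builds the request URL for one endpoint and one probe.
fn probe_url(endpoint: &str, probe: &str) -> String {
    format!("{endpoint}?url={probe}")
}

/// Lists every request the prober would send, capped at 20 endpoints x 5 probes.
fn planned_requests(endpoints: &[&str], probes: &[&str]) -> Vec<String> {
    endpoints
        .iter()
        .take(MAX_ENDPOINTS)
        .flat_map(|ep| {
            probes
                .iter()
                .take(PROBES_PER_ENDPOINT)
                .map(move |p| probe_url(ep, p))
        })
        .collect()
}

/// A finding is confirmed when the response is a 3xx redirect whose
/// `Location` header contains the probe URL.
fn is_confirmed_ssrf(status: u16, location: Option<&str>, probe: &str) -> bool {
    (300..400).contains(&status) && location.map_or(false, |loc| loc.contains(probe))
}
```

With these caps a single scan sends at most 100 probe requests, regardless of how many API endpoints were discovered.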
#### SSRF Parameter Name List (60+ params)
The full list of parameter names checked:
```
url, uri, link, src, href, target, destination, redirect, redirect_to,
redirecturl, redirect_uri, return, return_to, returnurl, return_path,
path, load, file, filename, folder, folder_url, image, img, image_url,
image_path, avatar, document, doc, document_url, fetch, get, view,
content, domain, callback, reference, site, page, data, data_url,
resource, template, api_endpoint, endpoint, proxy, feed, host, webhook,
address, media, video, audio, download, upload, preview, source,
location, goto, callback_url, forward, next, origin, continue
```
---
### Phase 6: HTML/Meta Security Checks
#### Weak CSP Detection
Scans for `<meta http-equiv="Content-Security-Policy" content="...">` tags. Flags as vulnerable if the `content` attribute contains:
- `unsafe-inline`
- `unsafe-eval`
Reported as `JsVulnerability` with `vuln_type: "Weak CSP"`, severity Medium.
#### Missing CSRF Token Detection
For every `<form>` element, checks for the presence of hidden inputs named `csrf`, `xsrf`, or `token` (case-insensitive substring match). If none are found, reported as `JsVulnerability` with `vuln_type: "Missing CSRF Protection"`, severity Medium.
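
Both Phase 6 checks come down to substring tests once the relevant attribute values have been pulled out of the HTML. The sketch below takes those values as plain strings (the `scraper`-based extraction is omitted); the function names are hypothetical, and treating the CSP check as case-insensitive is an assumption.

```rust
/// Returns the risky keywords found in a CSP `<meta>` tag's `content` value.
fn weak_csp_keywords(csp_content: &str) -> Vec<&'static str> {
    let lower = csp_content.to_ascii_lowercase();
    ["unsafe-inline", "unsafe-eval"]
        .iter()
        .copied()
        .filter(|keyword| lower.contains(*keyword))
        .collect()
}

/// True when any hidden input name contains `csrf`, `xsrf`, or `token`.
fn has_csrf_token(hidden_input_names: &[&str]) -> bool {
    hidden_input_names.iter().any(|name| {
        let lower = name.to_ascii_lowercase();
        ["csrf", "xsrf", "token"]
            .iter()
            .any(|marker| lower.contains(*marker))
    })
}
```

A form whose only hidden input is named `page` would be reported as missing CSRF protection, while one with `_csrf_token` or `authenticity_token` would pass.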
---
### Phase 7: API Endpoint Discovery
#### Regex-Based Extraction
All fetched content (HTML bodies and JS files) is scanned for API path patterns:
| Pattern | Example match |
|---------|---------------|
| `/api/v\d+/` | `/api/v1/`, `/api/v2/users` |
| `/api/` | `/api/login` |
| `/graphql` | `/graphql` |
| `/rest/` | `/rest/v1/endpoint` |
| `/v\d+/\w+` | `/v2/users` |
| `/service/` | `/service/auth` |
| `/json/` | `/json/data` |
| `/rpc/` | `/rpc/call` |
| `/gateway/` | `/gateway/api` |
| `/ajax/` | `/ajax/handler` |
| `/data/` | `/data/export` |
| `/query/` | `/query/search` |
| `/feeds/` | `/feeds/rss` |
| `/svc/` | `/svc/core` |
| `/soap/` | `/soap/endpoint` |
Discovered paths are appended to the base URL and collected in a `HashSet` for deduplication.
---
### Phase 8: Post-Processing
#### Deduplication
Both `secrets` and `js_vulnerabilities` are deduplicated using composite hash keys:
- **Secrets:** `{secret_type}:{source_url}:{masked_value}`
- **JS Vulns:** `{vuln_type}:{source_url}:{matched_code}`
Duplicate entries with the same key are removed, keeping the first occurrence.
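
Keeping the first occurrence per composite key maps naturally onto `Vec::retain` with a `HashSet` of seen keys. The struct below is a trimmed stand-in for `SecretFinding` with only the fields the key needs, plus `line` to show which duplicate survives.

```rust
use std::collections::HashSet;

// Trimmed stand-in for the real SecretFinding struct.
#[derive(Debug, Clone)]
struct SecretFinding {
    secret_type: String,
    source_url: String,
    masked_value: String,
    line: usize,
}

/// Removes later duplicates in place, preserving the original order.
fn dedup_secrets(findings: &mut Vec<SecretFinding>) {
    let mut seen = HashSet::new();
    // `insert` returns false when the key was already present,
    // so `retain` drops every repeat after the first occurrence.
    findings.retain(|f| {
        seen.insert(format!(
            "{}:{}:{}",
            f.secret_type, f.source_url, f.masked_value
        ))
    });
}
```

Because the key uses the masked value rather than the raw secret, two different secrets that share the same first and last four characters on the same page would collapse into one finding.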
---
## Internal Functions
### `scan_for_secrets()`
```rust
fn scan_for_secrets(content, source_url, patterns, results)
```
Iterates all 24 compiled regex patterns against the content. For each match: calculates entropy, checks false positive context, masks the value, and appends a `SecretFinding`.
### `scan_js_security()`
```rust
fn scan_js_security(content, source_url, categories, results)
```
Iterates all 13 vulnerability categories. For files detected as minified, only High-severity checks run. Truncates matched code to 200 chars.
### `check_url_params_ssrf()`
```rust
fn check_url_params_ssrf(url, findings)
```
Parses query string from a URL and checks each parameter name against the 60+ SSRF parameter list.
### `extract_api_endpoints()`
```rust
fn extract_api_endpoints(content, base_url, patterns, endpoints)
```
Runs 15 API path regexes against content and inserts full URLs into the endpoints set.
### `resolve_url()`
```rust
fn resolve_url(base, href) -> Option<String>
```
Resolves relative URLs to absolute. Handles `//protocol-relative`, `http(s)://absolute`, and `relative/path` forms. Returns `None` for `javascript:`, `mailto:`, `tel:`, and `#` anchors.
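
A simplified sketch of those cases using only string operations. It is not a full RFC 3986 resolver: it does not normalize `../` segments, and it assumes the base URL carries a scheme and no query string. The module's actual implementation may differ.

```rust
/// Resolves `href` against `base`, or returns `None` for non-navigable links.
fn resolve_url(base: &str, href: &str) -> Option<String> {
    let href = href.trim();
    let lower = href.to_ascii_lowercase();
    // Skip in-page anchors and non-HTTP schemes.
    if href.starts_with('#')
        || ["javascript:", "mailto:", "tel:"]
            .iter()
            .any(|scheme| lower.starts_with(*scheme))
    {
        return None;
    }
    // Already absolute.
    if lower.starts_with("http://") || lower.starts_with("https://") {
        return Some(href.to_string());
    }
    let (scheme, rest) = base.split_once("://")?;
    // Protocol-relative: reuse the base URL's scheme.
    if href.starts_with("//") {
        return Some(format!("{scheme}:{href}"));
    }
    let host = rest.split('/').next().unwrap_or(rest);
    let origin = format!("{scheme}://{host}");
    if href.starts_with('/') {
        // Root-relative path.
        return Some(format!("{origin}{href}"));
    }
    // Relative path: resolve against the directory of the base URL's path.
    let path = &rest[host.len()..];
    let dir = match path.rfind('/') {
        Some(i) => &path[..=i],
        None => "/",
    };
    Some(format!("{origin}{dir}{href}"))
}
```

For example, resolving `img/a.png` against `https://example.com/docs/page.html` yields `https://example.com/docs/img/a.png`.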
### `is_same_domain()`
```rust
fn is_same_domain(base, url) -> bool
```
Extracts hostnames from both URLs and compares case-insensitively.
### `shannon_entropy()`
```rust
fn shannon_entropy(data) -> f64
```
Calculates Shannon entropy in bits per character. Uses a 256-slot byte frequency table.
### `mask_secret()`
```rust
fn mask_secret(s) -> String
```
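Returns a masked version of the secret. Values longer than 8 characters show only the first 4 and last 4 characters with `****` in between; shorter values are masked more aggressively (see [Secret Masking](#secret-masking)).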
### `is_false_positive_context()`
```rust
fn is_false_positive_context(context) -> bool
```
Checks 80 chars of surrounding context against 12 false positive indicator strings.
### `is_known_library()`
```rust
fn is_known_library(url) -> bool
```
Checks if a JS URL contains any of 19 known library/framework substrings.
### `dedup_secrets()`
```rust
fn dedup_secrets(v: &mut Vec<SecretFinding>)
```
Removes duplicate secrets using composite hash key `{type}:{url}:{masked_value}`.
### `dedup_js_vulns()`
```rust
fn dedup_js_vulns(v: &mut Vec<JsVulnerability>)
```
Removes duplicate JS vulnerabilities using composite hash key `{type}:{url}:{code}`.
---
## Usage Example
```rust
use web_analyzer::advanced_content_scanner::scan_content;

#[tokio::main]
async fn main() {
    let result = scan_content("example.com").await.unwrap();

    println!("Crawled {} pages, found {} JS files",
        result.summary.total_urls_crawled,
        result.summary.total_js_files);

    for secret in &result.secrets {
        println!("[{}] {} at {} (line {}, entropy {:.2})",
            secret.severity, secret.secret_type,
            secret.source_url, secret.line, secret.entropy);
        println!(" Value: {}", secret.masked_value);
        println!(" Fix: {}", secret.recommendation);
    }

    for vuln in &result.js_vulnerabilities {
        println!("[{}] {} at {}",
            vuln.severity, vuln.vuln_type, vuln.source_url);
        println!(" Code: {}", vuln.matched_code);
        println!(" Fix: {}", vuln.recommendation);
    }

    for ssrf in &result.ssrf_vulnerabilities {
        println!("[{}] {} at {} — params: {:?}",
            ssrf.severity, ssrf.finding_type,
            ssrf.source_url, ssrf.vulnerable_params);
    }

    println!("API endpoints discovered: {:?}", result.api_endpoints_discovered);
}
```
---
## Testing
```bash
# Run the test suite with the scanner feature enabled
cargo test --features advanced-content-scanner
# Run with output
cargo test --features advanced-content-scanner -- --nocapture
```
**Test suite covers:**
- `test_scan_content` — End-to-end scan against `example.com`, verifies result structure and crawl metrics
- `test_shannon_entropy` — Validates entropy calculation: high-entropy strings > 3.5, low-entropy < 1.0
---
## Configuration Constants
| Constant | Value | Location | Description |
|----------|-------|----------|-------------|
| `max_depth` | `2` | `scan_content()` | Maximum BFS crawl depth |
| `max_pages` | `50` | `scan_content()` | Maximum pages to visit |
| HTTP timeout | `15s` | `Client::builder()` | Per-request timeout |
| TLS validation | `false` | `danger_accept_invalid_certs(true)` | Accept self-signed certs |
| Entropy threshold | `3.5` | `scan_for_secrets()` | Minimum entropy for key-type secrets |
| Context window | `±80 chars` | `scan_for_secrets()` | Chars around match for FP check |
| Minified threshold | `>5000 chars, <50 newlines` | `scan_js_security()` | Minified file detection |
| Matched code limit | `200 chars` | `scan_js_security()` | Truncation for `matched_code` field |
| SSRF endpoint limit | `20` | SSRF probing | Max API endpoints to actively probe |
| SSRF probes per endpoint | `5` | SSRF probing | Top N SSRF payloads per endpoint |
| `SECRET_PATTERNS` | `24` | module-level | Number of secret regex patterns |
| `JS_VULN_CATEGORIES` | `13` | module-level | Number of JS vulnerability categories |
| `SSRF_PARAMS` | `60+` | module-level | Number of SSRF-vulnerable parameter names |
| Known libraries | `19` | `is_known_library()` | Library substrings to skip |
| False positive indicators | `12` | `is_false_positive_context()` | Context-based FP filters |