# webrisk_hash
URL canonicalization and hashing for the [Google Web Risk API](https://docs.cloud.google.com/web-risk/docs/urls-hashing).
Implements the full canonicalization spec: percent-encoding normalization, IP address parsing (decimal/hex/octal), hostname normalization, path resolution, and suffix/prefix expression generation with SHA-256 hash prefixes.
## Installation
```sh
cargo add webrisk_hash
```
## Usage
### Canonicalize a URL
```rust
use webrisk_hash::canonicalize;
let url = canonicalize("http://www.GOOgle.com/foo/../bar");
assert_eq!(url, Some("http://www.google.com/bar".to_string()));
// Integer IP normalization
let url = canonicalize("http://3279880203/blah");
assert_eq!(url, Some("http://195.127.0.11/blah".to_string()));
// Returns None for invalid URLs
assert_eq!(canonicalize(""), None);
```
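To illustrate the integer IP case above, here is a minimal standalone sketch using only the standard library. It is not the crate's implementation: `decimal_host_to_dotted_quad` is a hypothetical helper, and it covers only plain decimal hosts, not the hex and octal forms the crate also handles.

```rust
use std::net::Ipv4Addr;

/// Hypothetical helper (not part of webrisk_hash): interpret a host made only
/// of decimal digits as a 32-bit IPv4 address and render it as a dotted quad.
fn decimal_host_to_dotted_quad(host: &str) -> Option<String> {
    // Reject empty input and anything that is not purely ASCII digits
    // (this also rules out a leading '+', which `parse::<u32>` would accept).
    if host.is_empty() || !host.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Assumption: a multi-digit host with a leading zero is an octal form,
    // which this sketch deliberately does not handle.
    if host.len() > 1 && host.starts_with('0') {
        return None;
    }
    // Values above u32::MAX cannot be IPv4 addresses, so parsing fails there.
    let value: u32 = host.parse().ok()?;
    Some(Ipv4Addr::from(value).to_string())
}
```

Calling `decimal_host_to_dotted_quad("3279880203")` yields `Some("195.127.0.11".to_string())`, matching the canonicalized host in the example above.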
### Generate suffix/prefix expressions
```rust
use webrisk_hash::suffix_postfix_expressions;
let exprs = suffix_postfix_expressions("http://a.b.c/1/2.html?param=1");
assert!(exprs.contains(&"a.b.c/1/2.html?param=1".to_string()));
assert!(exprs.contains(&"b.c/".to_string()));
```
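The combination rule behind these expressions can be sketched roughly as follows. This is an illustration of the Web Risk rule (up to five host suffixes combined with up to six path prefixes), not the crate's actual code. The function names are hypothetical, and the sketch assumes an already-canonicalized, non-IP host with the path and query passed separately.

```rust
/// Append `s` to `v` unless it is already present (keeps insertion order).
fn push_unique(v: &mut Vec<String>, s: String) {
    if !v.contains(&s) {
        v.push(s);
    }
}

/// Hypothetical sketch: the exact host plus up to four suffixes taken from
/// the last five components, never shorter than two components.
fn host_suffixes(host: &str) -> Vec<String> {
    let mut out = vec![host.to_string()];
    let parts: Vec<&str> = host.split('.').collect();
    let start = parts.len().saturating_sub(5);
    for i in start..parts.len().saturating_sub(1) {
        push_unique(&mut out, parts[i..].join("."));
    }
    out
}

/// Hypothetical sketch: exact path with query, exact path without query,
/// then up to four directory prefixes starting at the root.
fn path_prefixes(path: &str, query: Option<&str>) -> Vec<String> {
    let mut out = Vec::new();
    if let Some(q) = query {
        push_unique(&mut out, format!("{}?{}", path, q));
    }
    push_unique(&mut out, path.to_string());

    let trimmed = path.strip_prefix('/').unwrap_or(path);
    let segments: Vec<&str> = trimmed.split('/').collect();
    let mut prefix = String::from("/");
    push_unique(&mut out, prefix.clone());
    // Every segment except the last is a directory; the root plus three
    // directories gives the four prefixes.
    for seg in segments[..segments.len() - 1].iter().take(3) {
        prefix.push_str(seg);
        prefix.push('/');
        push_unique(&mut out, prefix.clone());
    }
    out
}

/// Combine every host suffix with every path prefix.
fn expressions(host: &str, path: &str, query: Option<&str>) -> Vec<String> {
    let paths = path_prefixes(path, query);
    let mut out = Vec::new();
    for h in host_suffixes(host) {
        for p in &paths {
            out.push(format!("{}{}", h, p));
        }
    }
    out
}
```

For `expressions("a.b.c", "/1/2.html", Some("param=1"))` the sketch produces eight expressions, including `a.b.c/1/2.html?param=1` and `b.c/`.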
### Compute hash prefixes
```rust
use webrisk_hash::{get_prefixes, truncated_sha256_prefix};
// Get 32-bit hash prefixes for all expressions of a URL
let prefixes = get_prefixes("https://example.com/path", 32);
// Or hash a single string
let hash = truncated_sha256_prefix("abc", 32);
assert_eq!(hash, vec![0xba, 0x78, 0x16, 0xbf]);
```
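The truncation step itself can be pictured with a small standalone sketch. It assumes that only whole-byte widths up to the full 256 bits are valid; how the crate treats other values of `bits` is not stated here. `truncate_digest` and `ABC_DIGEST` are hypothetical names, and the constant is the published SHA-256 digest of `"abc"`.

```rust
/// SHA-256 digest of the ASCII string "abc" (standard test vector).
const ABC_DIGEST: [u8; 32] = [
    0xba, 0x78, 0x16, 0xbf, 0x8f, 0x01, 0xcf, 0xea,
    0x41, 0x41, 0x40, 0xde, 0x5d, 0xae, 0x22, 0x23,
    0xb0, 0x03, 0x61, 0xa3, 0x96, 0x17, 0x7a, 0x9c,
    0xb4, 0x10, 0xff, 0x61, 0xf2, 0x00, 0x15, 0xad,
];

/// Hypothetical helper (not part of webrisk_hash): keep the leading
/// `bits / 8` bytes of a full SHA-256 digest.
fn truncate_digest(digest: &[u8; 32], bits: usize) -> Option<Vec<u8>> {
    // Assumption: reject zero, non-byte-aligned widths, and anything
    // longer than the 256-bit digest.
    if bits == 0 || bits % 8 != 0 || bits > 256 {
        return None;
    }
    Some(digest[..bits / 8].to_vec())
}
```

With `bits = 32`, `truncate_digest(&ABC_DIGEST, 32)` keeps the first four bytes, the same `ba 78 16 bf` prefix asserted in the example above.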
### End-to-end: URL to hash prefix set
```rust
use webrisk_hash::get_prefixes;
let prefixes = get_prefixes("https://google.com/a/test/index.html?abc123", 32);
assert_eq!(prefixes.len(), 5);
```
## API
| Function | Description |
|---|---|
| `canonicalize(url)` | Canonicalize a URL per the Web Risk spec. Returns `Option<String>`. |
| `suffix_postfix_expressions(url)` | Generate up to 30 host suffix / path prefix combinations. |
| `truncated_sha256_prefix(s, bits)` | SHA-256 hash truncated to `bits/8` bytes (max 32). |
| `get_prefixes(url, bits)` | Canonicalize + expressions + hash. Returns `HashSet<Vec<u8>>`. |
| `get_prefix_map(url, bits)` | Like `get_prefixes` but returns `Vec<(expression, hash)>`. |
## Limitations
- Accepts `&str` input only (valid UTF-8). Raw `&[u8]` byte sequences are not supported.
- URLs longer than 8192 bytes return `None`.
- Hostnames longer than 255 characters return `None`.
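If your input arrives as raw bytes, one way to work within the first two limits is to validate it before handing it to the crate. The sketch below uses only the standard library. `prepare_input` and `MAX_URL_BYTES` are hypothetical names, with the 8192-byte cap mirroring the limit listed above.

```rust
/// Assumed to match the URL length limit listed above.
const MAX_URL_BYTES: usize = 8192;

/// Hypothetical pre-check (not part of webrisk_hash): return the input as
/// `&str` only if it is within the length limit and valid UTF-8.
fn prepare_input(raw: &[u8]) -> Option<&str> {
    // Over-long input would be rejected anyway, so bail out early.
    if raw.len() > MAX_URL_BYTES {
        return None;
    }
    // Non-UTF-8 byte sequences cannot be represented as `&str`.
    std::str::from_utf8(raw).ok()
}
```

A `Some(url)` result can then be passed on to `canonicalize(url)`. Bytes that are not valid UTF-8 would need to be percent-encoded or rejected by the caller first.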
## License
MIT