webrisk_hash 0.1.0

# webrisk_hash

URL canonicalization and hashing for the [Google Web Risk API](https://docs.cloud.google.com/web-risk/docs/urls-hashing).

Implements the full canonicalization spec: percent-encoding normalization, IP address parsing (decimal/hex/octal), hostname normalization, and path resolution. It also generates the host-suffix/path-prefix expressions for a URL and computes their truncated SHA-256 hash prefixes.

## Installation

```sh
cargo add webrisk_hash
```

## Usage

### Canonicalize a URL

```rust
use webrisk_hash::canonicalize;

let url = canonicalize("http://www.GOOgle.com/foo/../bar");
assert_eq!(url, Some("http://www.google.com/bar".to_string()));

// Integer IP normalization
let url = canonicalize("http://3279880203/blah");
assert_eq!(url, Some("http://195.127.0.11/blah".to_string()));

// Returns None for invalid URLs
assert_eq!(canonicalize(""), None);
```
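
To illustrate what the integer IP normalization above involves, here is a minimal standalone sketch using only the standard library. It is not the crate's implementation: `decimal_host_to_ipv4` is a hypothetical helper, and it covers only the plain decimal case, not the hex or octal forms.

```rust
use std::net::Ipv4Addr;

/// Hypothetical helper: interpret a host made only of decimal digits
/// as a single 32-bit IPv4 address.
/// Returns None if the host is not a plain decimal integer that fits in a u32.
fn decimal_host_to_ipv4(host: &str) -> Option<Ipv4Addr> {
    // Reject empty hosts and anything containing non-digit characters.
    if host.is_empty() || !host.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Parsing fails (and we return None) if the value overflows 32 bits.
    let value: u32 = host.parse().ok()?;
    // The most significant byte becomes the first octet.
    Some(Ipv4Addr::from(value))
}
```

Calling `decimal_host_to_ipv4("3279880203")` yields the address `195.127.0.11`, matching the canonicalized host in the example above.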

### Generate host-suffix/path-prefix expressions

```rust
use webrisk_hash::suffix_postfix_expressions;

let exprs = suffix_postfix_expressions("http://a.b.c/1/2.html?param=1");
assert!(exprs.contains(&"a.b.c/1/2.html?param=1".to_string()));
assert!(exprs.contains(&"b.c/".to_string()));
```
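
The expression set combines host suffixes with path prefixes. The sketch below shows one way to build it with the standard library, following the rules in the Web Risk documentation (the exact host plus up to four suffixes taken from the last five components, and the exact path with and without the query plus up to four directory prefixes). It is an illustration rather than the crate's code: `host_suffixes`, `path_prefixes`, and `expressions` are hypothetical names, and the input is assumed to be already canonicalized and split into host, path (starting with `/`), and optional query.

```rust
use std::net::Ipv4Addr;

/// Append `s` to `v` unless it is already present.
fn push_unique(v: &mut Vec<String>, s: String) {
    if !v.contains(&s) {
        v.push(s);
    }
}

/// The exact host, plus up to four suffixes built from the last five
/// components. The bare top-level label is never emitted on its own.
fn host_suffixes(host: &str) -> Vec<String> {
    let mut out = vec![host.to_string()];
    // IP address hosts are used only as-is.
    if host.parse::<Ipv4Addr>().is_ok() {
        return out;
    }
    let parts: Vec<&str> = host.split('.').collect();
    let start = parts.len().saturating_sub(5);
    for i in start..parts.len().saturating_sub(1) {
        push_unique(&mut out, parts[i..].join("."));
    }
    out
}

/// The exact path with and without the query, plus up to four prefixes
/// starting at "/" and appending one directory at a time.
fn path_prefixes(path: &str, query: Option<&str>) -> Vec<String> {
    let mut out = Vec::new();
    if let Some(q) = query {
        push_unique(&mut out, format!("{path}?{q}"));
    }
    push_unique(&mut out, path.to_string());
    // Split off the leading "" and drop the final piece (file name, or ""
    // when the path ends with a slash), leaving only directory names.
    let mut dirs: Vec<&str> = path.split('/').skip(1).collect();
    dirs.pop();
    let mut prefix = String::from("/");
    push_unique(&mut out, prefix.clone());
    for dir in dirs.into_iter().take(3) {
        prefix.push_str(dir);
        prefix.push('/');
        push_unique(&mut out, prefix.clone());
    }
    out
}

/// Every host suffix combined with every path prefix.
fn expressions(host: &str, path: &str, query: Option<&str>) -> Vec<String> {
    let mut out = Vec::new();
    for h in host_suffixes(host) {
        for p in path_prefixes(path, query) {
            out.push(format!("{h}{p}"));
        }
    }
    out
}
```

For `expressions("a.b.c", "/1/2.html", Some("param=1"))` this produces eight expressions: two hosts (`a.b.c`, `b.c`) times four paths (`/1/2.html?param=1`, `/1/2.html`, `/`, `/1/`).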

### Compute hash prefixes

```rust
use webrisk_hash::{get_prefixes, truncated_sha256_prefix};

// Get 32-bit hash prefixes for all expressions of a URL
let prefixes = get_prefixes("https://example.com/path", 32);

// Or hash a single string
let hash = truncated_sha256_prefix("abc", 32);
assert_eq!(hash, vec![0xba, 0x78, 0x16, 0xbf]);
```
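
The truncation step itself is simple: keep the leading `bits/8` bytes of the 32-byte digest. The sketch below shows only that step on a precomputed digest, so it needs no hashing dependency. `truncate_digest` is a hypothetical helper, and rejecting bit lengths that are zero, not a multiple of 8, or larger than 256 is an assumption of this sketch; the crate may handle such inputs differently.

```rust
/// The well-known SHA-256 digest of the ASCII string "abc".
const ABC_DIGEST: [u8; 32] = [
    0xba, 0x78, 0x16, 0xbf, 0x8f, 0x01, 0xcf, 0xea,
    0x41, 0x41, 0x40, 0xde, 0x5d, 0xae, 0x22, 0x23,
    0xb0, 0x03, 0x61, 0xa3, 0x96, 0x17, 0x7a, 0x9c,
    0xb4, 0x10, 0xff, 0x61, 0xf2, 0x00, 0x15, 0xad,
];

/// Hypothetical helper: keep the first `bits / 8` bytes of a SHA-256 digest.
/// Assumption: `bits` must be a non-zero multiple of 8 and at most 256.
fn truncate_digest(digest: &[u8; 32], bits: usize) -> Option<Vec<u8>> {
    if bits == 0 || bits % 8 != 0 || bits > 256 {
        return None;
    }
    Some(digest[..bits / 8].to_vec())
}
```

With the digest of `"abc"`, `truncate_digest(&ABC_DIGEST, 32)` returns the same four bytes asserted in the example above.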

### End-to-end: URL to hash prefix set

```rust
use webrisk_hash::get_prefixes;

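let prefixes = get_prefixes("https://google.com/a/test/index.html?abc123", 32);

// One host (google.com) combined with five paths:
//   /a/test/index.html?abc123, /a/test/index.html, /, /a/, /a/test/
assert_eq!(prefixes.len(), 5);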
```

## API

| Function | Description |
|---|---|
| `canonicalize(url)` | Canonicalize a URL per the Web Risk spec. Returns `Option<String>`. |
| `suffix_postfix_expressions(url)` | Generate up to 30 host-suffix/path-prefix combinations (at most 5 hosts × 6 paths). |
| `truncated_sha256_prefix(s, bits)` | SHA-256 hash of `s`, truncated to the first `bits/8` bytes (at most 32 bytes, the full digest). |
| `get_prefixes(url, bits)` | Canonicalize + expressions + hash. Returns `HashSet<Vec<u8>>`. |
| `get_prefix_map(url, bits)` | Like `get_prefixes` but returns `Vec<(expression, hash)>`. |

## Limitations

- Accepts `&str` input only (valid UTF-8). Raw `&[u8]` byte sequences are not supported.
- URLs longer than 8192 bytes return `None`.
- Hostnames longer than 255 characters return `None`.
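
If your input arrives as raw bytes, one option is to validate it before handing it to the crate. The sketch below is a hypothetical pre-check using only the standard library: `prepare_input` is not part of `webrisk_hash`, and the 8192-byte limit is simply copied from the list above.

```rust
/// Maximum URL length in bytes, taken from the limitation listed above.
const MAX_URL_BYTES: usize = 8192;

/// Hypothetical helper: return the input as `&str` if it is valid UTF-8
/// and no longer than `MAX_URL_BYTES`, otherwise None.
fn prepare_input(raw: &[u8]) -> Option<&str> {
    if raw.len() > MAX_URL_BYTES {
        return None;
    }
    // `from_utf8` borrows the bytes without copying and fails on invalid UTF-8.
    std::str::from_utf8(raw).ok()
}
```

The returned `&str` can then be passed to `canonicalize` or `get_prefixes`. Bytes that are not valid UTF-8 are rejected here rather than lossily converted, since a lossy conversion would change the URL being hashed.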

## License

MIT