htmlsanitizer 0.2.0

# htmlsanitizer

[![Crates.io](https://img.shields.io/crates/v/htmlsanitizer)](https://crates.io/crates/htmlsanitizer)
[![docs.rs](https://img.shields.io/docsrs/htmlsanitizer)](https://docs.rs/htmlsanitizer)
[![npm](https://img.shields.io/npm/v/@bytevet/htmlsanitizer)](https://www.npmjs.com/package/@bytevet/htmlsanitizer)
[![CI](https://github.com/SYM01/htmlsanitizer-rs/actions/workflows/ci.yml/badge.svg)](https://github.com/SYM01/htmlsanitizer-rs/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A fast, allowlist-based HTML sanitizer, available as a **Rust crate** and an **npm package** (via WebAssembly). In the project's Node.js benchmarks, the npm package is [**3.7–44x faster**](#performance) than DOMPurify (running on jsdom) on HTML content.

Also available in [Go](https://github.com/SYM01/htmlsanitizer).

## Features

- **O(n) streaming parser** — DFA-based finite state machine; no DOM tree, no backtracking
- **Allowlist-based** — only explicitly permitted tags and attributes pass through; everything else is stripped
- **URL sanitization** — rejects `javascript:`, `data:`, `ftp:`, control characters, and opaque URIs
- **Customizable** — add/remove tags, modify allowed attributes, supply a custom URL validator
- **Streaming writer** (Rust) — implements `std::io::Write`; process HTML in chunks without buffering the entire document
- **Cross-platform** — native Rust crate + WASM-powered npm package with identical sanitization logic

## Installation

### Rust

```toml
[dependencies]
htmlsanitizer = "0.2"
```

### npm / TypeScript

```bash
npm install @bytevet/htmlsanitizer
```

The npm package ships a pre-built WASM binary — no native toolchain required.

## Quick Start

### Rust

```rust
use htmlsanitizer::sanitize_string;

let safe = sanitize_string(r#"<p>Hello</p><script>alert("xss")</script>"#);
assert_eq!(safe, "<p>Hello</p>");
```

### TypeScript / JavaScript

```ts
import { sanitize } from "@bytevet/htmlsanitizer";

const safe = sanitize('<p>Hello</p><script>alert("xss")</script>');
// => "<p>Hello</p>"
```

## Usage (Rust)

### Default Sanitization

Use the convenience functions for one-shot sanitization with the default allow list:

```rust
use htmlsanitizer::{sanitize_string, sanitize};

// Sanitize a string
let input = r#"<p>Hello <b>world</b></p><script>alert("xss")</script>"#;
let clean = sanitize_string(input);
// => "<p>Hello <b>world</b></p>"

// Sanitize bytes
let bytes = b"<img src=\"http://example.com/img.png\" onerror=\"alert(1)\">";
let clean_bytes = sanitize(bytes);
// => b"<img src=\"http://example.com/img.png\">"
```

### Custom Allow List

Create an `HtmlSanitizer` instance to customize which tags and attributes are allowed:

```rust
use htmlsanitizer::{HtmlSanitizer, Tag};

// Remove a tag from the default allow list
let mut sanitizer = HtmlSanitizer::new();
sanitizer.allow_list.remove_tag("a");

let input = r#"<a href="http://example.com">click</a> <p>safe</p>"#;
let clean = sanitizer.sanitize_string(input);
// => "click <p>safe</p>"

// Add a custom tag with specific allowed attributes
let mut sanitizer = HtmlSanitizer::new();
sanitizer.allow_list.add_tag(Tag::new("custom-el", &["data-x"], &[]));

let input = r#"<custom-el data-x="1" onclick="bad">content</custom-el>"#;
let clean = sanitizer.sanitize_string(input);
// => "<custom-el data-x=\"1\">content</custom-el>"
```

### Custom URL Sanitizer

Supply a custom URL validator using the builder pattern. The built-in `default_url_sanitizer` is exported so you can compose it with your own logic:

```rust
use htmlsanitizer::HtmlSanitizer;

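let sanitizer = HtmlSanitizer::new().with_url_sanitizer(|raw_url| {
    // Run the built-in checks first; `?` rejects the URL if they fail.
    let sanitized = htmlsanitizer::default_url_sanitizer(raw_url)?;
    // Substring matching keeps this example short, but it can also match
    // URLs such as "http://evil.com/?ref=trusted.com". Compare the parsed
    // host exactly in production code.
    if sanitized.contains("trusted.com") {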
        Some(sanitized)
    } else {
        None
    }
});

let input = r#"<a href="http://trusted.com/page">ok</a> <a href="http://evil.com">bad</a>"#;
let clean = sanitizer.sanitize_string(input);
// => "<a href=\"http://trusted.com/page\">ok</a> <a>bad</a>"
```
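
Substring checks on a URL are easy to get wrong, so a stricter validator usually compares the host exactly. The following is a minimal, standard-library-only sketch of that idea. It is not part of the crate: `host_of` and `allow_trusted_hosts` are hypothetical helpers, the trusted host names are placeholders, and the parsing is deliberately conservative (anything it does not understand is rejected). It assumes the validator has the same shape as `default_url_sanitizer`, that is `&str -> Option<String>`.

```rust
/// Hypothetical helper: extract the host of an absolute http(s) URL.
/// Returns `None` for anything else (relative URLs, other schemes).
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'. A backslash is treated
    // the same way because browsers read it as '/' in http(s) URLs.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '\\' | '?' | '#'))
        .next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Hypothetical validator: keep the URL only if its host is on a fixed list.
fn allow_trusted_hosts(url: &str) -> Option<String> {
    // Placeholder host names, chosen for illustration only.
    const TRUSTED: &[&str] = &["trusted.com", "www.trusted.com"];
    let host = host_of(url)?;
    TRUSTED
        .iter()
        .any(|t| t.eq_ignore_ascii_case(host))
        .then(|| url.to_string())
}
```

With this check, `http://trusted.com/page` is kept, while `http://evil.com/?q=trusted.com`, `http://trusted.com.evil.com/`, and `http://trusted.com@evil.com/` are all rejected. A closure that first calls `default_url_sanitizer` and then applies a check like this one would combine both layers.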

### Streaming Writer

The streaming interface implements `std::io::Write`, enabling you to process HTML in chunks. State is preserved between writes:

```rust
use std::io::Write;
use htmlsanitizer::HtmlSanitizer;

let sanitizer = HtmlSanitizer::new();
let mut output = Vec::new();

{
    let mut writer = sanitizer.new_writer(&mut output);

    // Write HTML in chunks — state is preserved between writes
    writer.write_all(b"<p>Hello </p><scr").expect("write failed");
    writer.write_all(b"ipt>alert('xss')</script>").expect("write failed");
    writer.write_all(b"<b>world</b>").expect("write failed");
}

let result = String::from_utf8(output).unwrap();
// => "<p>Hello </p><b>world</b>"
```
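
To see why parser state has to survive between writes, consider a toy filter that drops everything between `<` and `>`. The sketch below is standard-library-only and is not the crate's `SanitizeWriter`: `TagStripper` and `strip_chunks` are hypothetical names, and the filter has no allow list, no attribute handling, and no awareness of quoted `>` characters. It only shows the pattern of a state machine sitting behind `std::io::Write`.

```rust
use std::io::{self, Write};

/// Parser state that must survive across `write` calls.
#[derive(Clone, Copy)]
enum State {
    Text,  // outside any tag: bytes are forwarded
    InTag, // between '<' and '>': bytes are dropped
}

/// Toy streaming filter (hypothetical, not the crate's writer).
struct TagStripper<W: Write> {
    inner: W,
    state: State,
}

impl<W: Write> TagStripper<W> {
    fn new(inner: W) -> Self {
        TagStripper { inner, state: State::Text }
    }
}

impl<W: Write> Write for TagStripper<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let mut kept = Vec::with_capacity(buf.len());
        for &byte in buf {
            match (self.state, byte) {
                (State::Text, b'<') => self.state = State::InTag,
                (State::Text, _) => kept.push(byte),
                (State::InTag, b'>') => self.state = State::Text,
                (State::InTag, _) => {}
            }
        }
        self.inner.write_all(&kept)?;
        Ok(buf.len()) // the whole chunk was consumed
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Feed the chunks one by one and collect the filtered output.
fn strip_chunks(chunks: &[&[u8]]) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    {
        let mut writer = TagStripper::new(&mut out);
        for chunk in chunks {
            writer.write_all(chunk)?;
        }
    }
    Ok(out)
}
```

Because `state` lives in the struct rather than in `write`, a tag split across two chunks (`<b` followed by `>world</b>`) is handled the same way as if it had arrived in one piece.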

## Usage (npm / TypeScript)

### Default Sanitization

```ts
import { sanitize } from "@bytevet/htmlsanitizer";

sanitize('<img src=x onerror="alert(1)">');
// => '<img src="x">'

sanitize('<a href="javascript:alert(1)">click</a>');
// => '<a>click</a>'
```

### Custom Configuration

```ts
import { HtmlSanitizer } from "@bytevet/htmlsanitizer";

const s = new HtmlSanitizer();

// Remove a tag from the allow list
s.removeTag("a");
s.sanitize('<a href="http://example.com">link</a>');
// => "link"

// Add a custom tag
// Arguments: name, comma-separated attributes, comma-separated URL attributes
s.addTag("custom-el", "data-x,title", "href");
s.sanitize('<custom-el data-x="1" onclick="bad">content</custom-el>');
// => '<custom-el data-x="1">content</custom-el>'

// Add a global attribute (allowed on all tags)
s.addGlobalAttr("data-testid");

// Release WASM memory when done (instance is unusable after this)
s.free();
```

## API Reference

### Rust

| Function / Type | Description |
|---|---|
| `sanitize(data: &[u8]) -> Vec<u8>` | One-shot sanitization (bytes) with the default allow list |
| `sanitize_string(data: &str) -> String` | One-shot sanitization (string) with the default allow list |
| `HtmlSanitizer::new()` | Create a sanitizer with the default allow list |
| `HtmlSanitizer::with_url_sanitizer(f)` | Builder: attach a custom URL validator |
| `HtmlSanitizer::set_url_sanitizer(&mut self, f)` | Set a custom URL validator on an existing instance |
| `HtmlSanitizer::sanitize(&self, &[u8]) -> Vec<u8>` | Sanitize bytes |
| `HtmlSanitizer::sanitize_string(&self, &str) -> String` | Sanitize a string |
| `HtmlSanitizer::new_writer(w) -> SanitizeWriter<W>` | Create a streaming writer (`impl io::Write`) |
| `AllowList` | Tag/attribute configuration; fields: `tags`, `global_attr`, `non_html_tags` |
| `AllowList::add_tag(&mut self, tag: Tag)` | Add a tag to the allow list |
| `AllowList::remove_tag(&mut self, name: &str)` | Remove a tag by name |
| `Tag::new(name, attr, url_attr)` | Define a tag with its allowed regular and URL attributes |
| `default_allow_list() -> AllowList` | Returns the built-in default allow list |
| `default_url_sanitizer(&str) -> Option<String>` | The built-in URL validator (reusable in custom validators) |

For complete API documentation, see [docs.rs](https://docs.rs/htmlsanitizer).

### npm / TypeScript

| Export | Description |
|---|---|
| `sanitize(input: string): string` | One-shot sanitization with the default allow list |
| `new HtmlSanitizer()` | Create a configurable sanitizer instance |
| `.sanitize(input: string): string` | Sanitize HTML using the instance's configuration |
| `.addTag(name, attrs?, urlAttrs?)` | Add a tag; `attrs` and `urlAttrs` are comma-separated strings |
| `.removeTag(name: string)` | Remove a tag from the allow list |
| `.addGlobalAttr(name: string)` | Allow an attribute on all tags |
| `.free()` | Release WASM memory; the instance is unusable after this |

## Default Allow List

The default allow list permits 68 commonly used HTML tags. All other tags are stripped, but their text content is preserved. Tags on the non-HTML list (`script`, `style`, `object`) are removed together with their content.

**Global attributes** (allowed on every permitted tag): `class`, `id`
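
The three outcomes described above (keep the tag, strip the tag but keep its text, drop the tag with its content) can be sketched as a pair of lookups. This is a simplified, standard-library-only illustration, not the crate's `AllowList`: `MiniAllowList`, `TagAction`, and the method names are hypothetical, the field names merely mirror the ones listed in the API reference, and the sample data is a small subset of the defaults.

```rust
use std::collections::{HashMap, HashSet};

/// What to do with an element, following the rules described above.
#[derive(Debug, PartialEq)]
enum TagAction {
    Keep,            // tag is on the allow list
    StripKeepText,   // tag removed, inner text preserved
    DropWithContent, // tag and its content removed (e.g. `script`)
}

/// Hypothetical, simplified stand-in for an allow list.
struct MiniAllowList {
    tags: HashMap<&'static str, HashSet<&'static str>>,
    global_attr: HashSet<&'static str>,
    non_html_tags: HashSet<&'static str>,
}

impl MiniAllowList {
    /// A tiny subset of the defaults listed in this README.
    fn sample() -> Self {
        let mut tags = HashMap::new();
        tags.insert("p", HashSet::new());
        tags.insert("a", HashSet::from(["href", "rel", "target", "referrerpolicy"]));
        tags.insert("img", HashSet::from(["src", "alt", "height", "width"]));
        MiniAllowList {
            tags,
            global_attr: HashSet::from(["class", "id"]),
            non_html_tags: HashSet::from(["script", "style", "object"]),
        }
    }

    /// Decide the fate of a tag; names are compared case-insensitively.
    fn tag_action(&self, tag: &str) -> TagAction {
        let tag = tag.to_ascii_lowercase();
        if self.tags.contains_key(tag.as_str()) {
            TagAction::Keep
        } else if self.non_html_tags.contains(tag.as_str()) {
            TagAction::DropWithContent
        } else {
            TagAction::StripKeepText
        }
    }

    /// An attribute survives only on an allowed tag, and only if it is
    /// listed for that tag or is a global attribute.
    fn attr_allowed(&self, tag: &str, attr: &str) -> bool {
        let (tag, attr) = (tag.to_ascii_lowercase(), attr.to_ascii_lowercase());
        match self.tags.get(tag.as_str()) {
            Some(attrs) => {
                attrs.contains(attr.as_str()) || self.global_attr.contains(attr.as_str())
            }
            None => false,
        }
    }
}
```

With `MiniAllowList::sample()`, `tag_action("script")` yields `DropWithContent`, `tag_action("blink")` yields `StripKeepText`, and `attr_allowed("img", "onerror")` is `false` while `attr_allowed("p", "class")` is `true`.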

<details>
<summary>View all 68 default tags</summary>

| Category | Tags |
|---|---|
| Structural | `address`, `article`, `aside`, `footer`, `header`, `h1`–`h6`, `hgroup`, `main`, `nav`, `section` |
| Block content | `blockquote`, `dd`, `div`, `dl`, `dt`, `figcaption`, `figure`, `hr`, `li`, `ol`, `p`, `pre`, `ul` |
| Inline text | `a`, `abbr`, `b`, `bdi`, `bdo`, `br`, `cite`, `code`, `data`, `em`, `i`, `kbd`, `mark`, `q`, `s`, `small`, `span`, `strong`, `sub`, `sup`, `time`, `u` |
| Media | `area`, `audio`, `img`, `map`, `track`, `video`, `picture`, `source` |
| Table | `caption`, `col`, `colgroup`, `table`, `tbody`, `td`, `tfoot`, `th`, `thead`, `tr` |
| Edit marks | `del`, `ins` |
| Interactive | `details`, `summary` |

**Notable tag-specific attributes:**

| Tag | Regular attributes | URL attributes |
|---|---|---|
| `a` | `rel`, `target`, `referrerpolicy` | `href` |
| `img` | `alt`, `crossorigin`, `height`, `width`, `loading`, `referrerpolicy` | `src` |
| `video` | `autoplay`, `buffered`, `controls`, `crossorigin`, `duration`, `loop`, `muted`, `preload`, `height`, `width` | `src`, `poster` |
| `audio` | `autoplay`, `controls`, `crossorigin`, `duration`, `loop`, `muted`, `preload` | `src` |
| `td` / `th` | `colspan`, `rowspan` (+ `scope` for `th`) | — |

</details>

## URL Sanitization

Attributes marked as URL attributes (`href`, `src`, `poster`, `cite`, etc.) are validated by the URL sanitizer. The default behavior:

**Accepted:**
- `http://` and `https://` URLs
- Relative URLs (paths, fragments, query strings)

**Rejected:**
- `javascript:` (including case variations and HTML-entity-encoded forms)
- `data:` URIs
- `ftp:` and all other non-HTTP schemes
- URLs containing ASCII control characters (bytes below 0x20, or 0x7F)
- Opaque (cannot-be-a-base) URIs
- Percent-encoded ASCII in hostnames

When a URL is rejected, the attribute is removed but the tag and its content are preserved (e.g., `<a href="javascript:...">text</a>` becomes `<a>text</a>`).
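
A subset of the rules above can be sketched in a few lines of standard-library Rust. This is an illustration only, not the crate's `default_url_sanitizer`: `looks_acceptable` is a hypothetical name, and the sketch does not decode HTML entities or inspect hostnames for percent-encoding. It covers the scheme allow list, the control-character check, and the rejection of `http:` URLs that lack a `//` authority.

```rust
/// Simplified illustration of the accept/reject rules (hypothetical helper).
fn looks_acceptable(url: &str) -> bool {
    // Reject ASCII control characters (bytes below 0x20, and 0x7F).
    if url.bytes().any(|b| b < 0x20 || b == 0x7F) {
        return false;
    }
    // A scheme, if present, is the text before the first ':' as long as that
    // ':' appears before any '/', '?' or '#'.
    let head_end = url
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(url.len());
    match url[..head_end].split_once(':') {
        Some((scheme, tail)) => {
            let is_http = scheme.eq_ignore_ascii_case("http")
                || scheme.eq_ignore_ascii_case("https");
            // Require "scheme://": this rejects opaque forms like "http:foo".
            is_http && tail.is_empty() && url[head_end..].starts_with("//")
        }
        // No scheme: a relative path, query string, or fragment.
        None => true,
    }
}
```

Under these assumptions, `https://example.com/a?b=c`, `/docs/page.html#top`, and `?q=a:b` pass, while `JaVaScRiPt:alert(1)`, `data:text/html,...`, `ftp://example.com/file`, and `http:opaque` do not.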

In Rust, you can supply a custom URL validator via `with_url_sanitizer` (or `set_url_sanitizer` on an existing instance) to implement domain restrictions or additional checks; see [Custom URL Sanitizer](#custom-url-sanitizer) for an example that composes it with the built-in `default_url_sanitizer`.

## Security Considerations

- **Defense in depth.** This sanitizer is designed as one layer of an XSS mitigation strategy. Combine it with Content Security Policy headers and context-aware output encoding.
- **Not a full HTML parser.** The DFA-based approach handles real-world HTML effectively but does not build a DOM tree. It is designed to be conservative — when in doubt, content is stripped.
- **Fuzz-tested.** The project includes a `cargo-fuzz` harness. If you discover a bypass, please report it via [GitHub Issues](https://github.com/SYM01/htmlsanitizer-rs/issues).
- **Consistent cross-platform behavior.** The Rust and WASM/npm builds share the same sanitization engine, ensuring identical output.
- **Tested against known XSS vectors.** The test suite includes vectors from OWASP and other common XSS payloads.

## Performance

The sanitizer operates in a single O(n) pass over the input using a 17-state DFA. It allocates no DOM tree and performs no backtracking.

### npm: `@bytevet/htmlsanitizer` vs [DOMPurify](https://github.com/cure53/DOMPurify)

Benchmarked with [Vitest bench](https://vitest.dev/guide/features#benchmarking) on Node.js. DOMPurify needs a DOM implementation under Node.js, so it runs on top of jsdom:

| Payload | @bytevet/htmlsanitizer | DOMPurify + jsdom | Ratio |
|---|---|---|---|
| Simple HTML (small) | 56,716 ops/s | 15,253 ops/s | **3.7x faster** |
| XSS vectors | 40,908 ops/s | 5,373 ops/s | **7.6x faster** |
| Blog post (medium) | 33,259 ops/s | 1,381 ops/s | **24x faster** |
| Mixed safe + dangerous | 40,326 ops/s | 3,987 ops/s | **10x faster** |
| Large document (~50 KB) | 1,054 ops/s | 24 ops/s | **44x faster** |

> On tiny plain-text inputs (no HTML tags), DOMPurify is faster, because every call into the WASM module carries a fixed overhead of about 10 µs. For any real HTML content, `@bytevet/htmlsanitizer` is **3.7–44x faster**, and the advantage grows with input size.
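
The ratio column is the quotient of the two throughput columns, and converting ops/s into mean time per call makes the size effect easier to see. The short sketch below reproduces that arithmetic from the numbers in the table; the function and constant names are made up for illustration.

```rust
/// (payload, @bytevet/htmlsanitizer ops/s, DOMPurify + jsdom ops/s),
/// copied from the benchmark table above.
const RESULTS: [(&str, f64, f64); 5] = [
    ("Simple HTML (small)", 56_716.0, 15_253.0),
    ("XSS vectors", 40_908.0, 5_373.0),
    ("Blog post (medium)", 33_259.0, 1_381.0),
    ("Mixed safe + dangerous", 40_326.0, 3_987.0),
    ("Large document (~50 KB)", 1_054.0, 24.0),
];

/// Speedup of one throughput over another, both in operations per second.
fn speedup(ours_ops: f64, theirs_ops: f64) -> f64 {
    ours_ops / theirs_ops
}

/// Mean time per call in microseconds for a given throughput.
fn micros_per_op(ops_per_sec: f64) -> f64 {
    1_000_000.0 / ops_per_sec
}

/// Recompute the ratio column for every payload.
fn ratios() -> Vec<(&'static str, f64)> {
    RESULTS
        .iter()
        .map(|&(name, ours, theirs)| (name, speedup(ours, theirs)))
        .collect()
}
```

For the large document, this works out to roughly 0.95 ms per call for `@bytevet/htmlsanitizer` against about 42 ms for DOMPurify with jsdom; for the small payload, it is about 18 µs against about 66 µs.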

Reproduce with:

```bash
cd bench-npm && npm install && npm run bench
```

### Rust

```bash
cargo bench
```

## Development

```bash
# Run tests
cargo test

# Run clippy
cargo clippy --all-targets --all-features

# Run benchmarks
cargo bench

# Build WASM and run npm tests
cd npm && npm run build && npm test

# Fuzz testing
cargo +nightly fuzz run sanitize
```

## Related Projects

- [SYM01/htmlsanitizer](https://github.com/SYM01/htmlsanitizer): the Go version

## License

MIT — see [LICENSE](LICENSE).