htmlsanitizer 0.2.0

A fast, allowlist-based HTML sanitizer

htmlsanitizer

A fast, allowlist-based HTML sanitizer, available as a Rust crate and an npm package (via WebAssembly). In the Node.js benchmarks reported under Performance, it runs 3.7–44x faster than DOMPurify with jsdom on HTML input.

Also available in Go.

Features

  • O(n) streaming parser — DFA-based finite state machine; no DOM tree, no backtracking
  • Allowlist-based — only explicitly permitted tags and attributes pass through; everything else is stripped
  • URL sanitization — rejects javascript:, data:, ftp:, control characters, and opaque URIs
  • Customizable — add/remove tags, modify allowed attributes, supply a custom URL validator
  • Streaming writer (Rust) — implements std::io::Write; process HTML in chunks without buffering the entire document
  • Cross-platform — native Rust crate + WASM-powered npm package with identical sanitization logic

Installation

Rust

[dependencies]
htmlsanitizer = "0.2"

npm / TypeScript

npm install @bytevet/htmlsanitizer

The npm package ships a pre-built WASM binary — no native toolchain required.

Quick Start

Rust

use htmlsanitizer::sanitize_string;

let safe = sanitize_string(r#"<p>Hello</p><script>alert("xss")</script>"#);
assert_eq!(safe, "<p>Hello</p>");

TypeScript / JavaScript

import { sanitize } from "@bytevet/htmlsanitizer";

const safe = sanitize('<p>Hello</p><script>alert("xss")</script>');
// => "<p>Hello</p>"

Usage (Rust)

Default Sanitization

Use the convenience functions for one-shot sanitization with the default allow list:

use htmlsanitizer::{sanitize_string, sanitize};

// Sanitize a string
let input = r#"<p>Hello <b>world</b></p><script>alert("xss")</script>"#;
let clean = sanitize_string(input);
// => "<p>Hello <b>world</b></p>"

// Sanitize bytes
let bytes = b"<img src=\"http://example.com/img.png\" onerror=\"alert(1)\">";
let clean_bytes = sanitize(bytes);
// => b"<img src=\"http://example.com/img.png\">"

Custom Allow List

Create an HtmlSanitizer instance to customize which tags and attributes are allowed:

use htmlsanitizer::{HtmlSanitizer, Tag};

// Remove a tag from the default allow list
let mut sanitizer = HtmlSanitizer::new();
sanitizer.allow_list.remove_tag("a");

let input = r#"<a href="http://example.com">click</a> <p>safe</p>"#;
let clean = sanitizer.sanitize_string(input);
// => "click <p>safe</p>"

// Add a custom tag with specific allowed attributes
let mut sanitizer = HtmlSanitizer::new();
sanitizer.allow_list.add_tag(Tag::new("custom-el", &["data-x"], &[]));

let input = r#"<custom-el data-x="1" onclick="bad">content</custom-el>"#;
let clean = sanitizer.sanitize_string(input);
// => "<custom-el data-x=\"1\">content</custom-el>"

Custom URL Sanitizer

Supply a custom URL validator using the builder pattern. The built-in default_url_sanitizer is exported so you can compose it with your own logic:

use htmlsanitizer::HtmlSanitizer;

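let sanitizer = HtmlSanitizer::new().with_url_sanitizer(|raw_url| {
    // Run the built-in checks first; `?` returns None if they reject the URL
    let sanitized = htmlsanitizer::default_url_sanitizer(raw_url)?;
    // Substring matching keeps this example short, but it also accepts URLs
    // such as "http://evil.com/?next=trusted.com"; compare the parsed host
    // when enforcing a real domain restriction
    if sanitized.contains("trusted.com") {
        Some(sanitized)
    } else {
        None
    }
});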

let input = r#"<a href="http://trusted.com/page">ok</a> <a href="http://evil.com">bad</a>"#;
let clean = sanitizer.sanitize_string(input);
// => "<a href=\"http://trusted.com/page\">ok</a> <a>bad</a>"
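
A substring check such as `contains("trusted.com")` also matches URLs that only mention the domain in a path or query string. A stricter validator would compare the host itself. The following is a minimal, std-only sketch of that idea; `http_host`, `host_matches`, and `is_trusted` are hypothetical helpers that are not part of htmlsanitizer, and a production validator would more likely rely on a dedicated URL-parsing crate.

```rust
/// Extract the host from an absolute http(s) URL using only std.
/// Hypothetical helper for illustration; it does not handle IPv6 literals
/// or upper-case schemes (both are simply treated as untrusted).
fn http_host(url: &str) -> Option<&str> {
    // Only consider absolute http:// or https:// URLs.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?', '#' or '\'
    // (browsers treat a backslash like a slash in http URLs).
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#' || c == '\\')
        .next()?;
    // Drop any userinfo ("user:pass@") and any port (":8080").
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// True if `host` is `domain` itself or a subdomain of it.
fn host_matches(host: &str, domain: &str) -> bool {
    let host = host.to_ascii_lowercase();
    host == domain || host.ends_with(&format!(".{}", domain))
}

/// Assumed policy for this sketch: only trusted.com and its subdomains pass.
fn is_trusted(url: &str) -> bool {
    http_host(url).map_or(false, |h| host_matches(h, "trusted.com"))
}
```

A closure passed to `with_url_sanitizer` could call `default_url_sanitizer` first and keep the result only when a check like `is_trusted` succeeds.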

Streaming Writer

The streaming interface implements std::io::Write, enabling you to process HTML in chunks. State is preserved between writes:

use std::io::Write;
use htmlsanitizer::HtmlSanitizer;

let sanitizer = HtmlSanitizer::new();
let mut output = Vec::new();

{
    let mut writer = sanitizer.new_writer(&mut output);

    // Write HTML in chunks — state is preserved between writes
    writer.write_all(b"<p>Hello </p><scr").expect("write failed");
    writer.write_all(b"ipt>alert('xss')</script>").expect("write failed");
    writer.write_all(b"<b>world</b>").expect("write failed");
}

let result = String::from_utf8(output).unwrap();
// => "<p>Hello </p><b>world</b>"
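
To illustrate what "state is preserved between writes" means, here is a toy `std::io::Write` adapter written with std only. It is not the crate's `SanitizeWriter`: the hypothetical `TagStripper` simply drops everything between `<` and `>` and keeps all text (including text inside `script` elements, which the real sanitizer removes). The point of the sketch is only that a tag split across two chunks is still recognized, because the `in_tag` flag lives in the writer rather than in a single `write` call.

```rust
use std::io::{self, Write};

/// Toy adapter (hypothetical): forwards text to `inner` and drops
/// every byte between '<' and '>'.
struct TagStripper<W: Write> {
    inner: W,
    in_tag: bool, // parser state carried across `write` calls
}

impl<W: Write> TagStripper<W> {
    fn new(inner: W) -> Self {
        TagStripper { inner, in_tag: false }
    }
}

impl<W: Write> Write for TagStripper<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let mut kept = Vec::with_capacity(buf.len());
        for &byte in buf {
            match (self.in_tag, byte) {
                (false, b'<') => self.in_tag = true,  // tag opens
                (false, _) => kept.push(byte),        // plain text
                (true, b'>') => self.in_tag = false,  // tag closes
                (true, _) => {}                       // inside a tag: drop
            }
        }
        self.inner.write_all(&kept)?;
        // Report the whole input chunk as consumed.
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Feed the chunks through one writer and collect the output.
fn strip_in_chunks(chunks: &[&str]) -> String {
    let mut out = Vec::new();
    {
        let mut writer = TagStripper::new(&mut out);
        for chunk in chunks {
            writer
                .write_all(chunk.as_bytes())
                .expect("writing to a Vec does not fail");
        }
    } // writer is dropped here, releasing the borrow of `out`
    String::from_utf8(out).expect("dropping whole tags keeps UTF-8 valid")
}
```

Calling `strip_in_chunks(&["<p>Hello </", "p><b>wor", "ld</b>"])` gives the same result as passing the whole string in one chunk.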

Usage (npm / TypeScript)

Default Sanitization

import { sanitize } from "@bytevet/htmlsanitizer";

sanitize('<img src=x onerror="alert(1)">');
// => '<img src="x">'

sanitize('<a href="javascript:alert(1)">click</a>');
// => '<a>click</a>'

Custom Configuration

import { HtmlSanitizer } from "@bytevet/htmlsanitizer";

const s = new HtmlSanitizer();

// Remove a tag from the allow list
s.removeTag("a");
s.sanitize('<a href="http://example.com">link</a>');
// => "link"

// Add a custom tag
// Arguments: name, comma-separated attributes, comma-separated URL attributes
s.addTag("custom-el", "data-x,title", "href");
s.sanitize('<custom-el data-x="1" onclick="bad">content</custom-el>');
// => '<custom-el data-x="1">content</custom-el>'

// Add a global attribute (allowed on all tags)
s.addGlobalAttr("data-testid");

// Release WASM memory when done (instance is unusable after this)
s.free();

API Reference

Rust

| Function / Type | Description |
| --- | --- |
| sanitize(data: &[u8]) -> Vec<u8> | One-shot sanitization (bytes) with the default allow list |
| sanitize_string(data: &str) -> String | One-shot sanitization (string) with the default allow list |
| HtmlSanitizer::new() | Create a sanitizer with the default allow list |
| HtmlSanitizer::with_url_sanitizer(f) | Builder: attach a custom URL validator |
| HtmlSanitizer::set_url_sanitizer(&mut self, f) | Set a custom URL validator on an existing instance |
| HtmlSanitizer::sanitize(&self, &[u8]) -> Vec<u8> | Sanitize bytes |
| HtmlSanitizer::sanitize_string(&self, &str) -> String | Sanitize a string |
| HtmlSanitizer::new_writer(w) -> SanitizeWriter<W> | Create a streaming writer (impl io::Write) |
| AllowList | Tag/attribute configuration; fields: tags, global_attr, non_html_tags |
| AllowList::add_tag(&mut self, tag: Tag) | Add a tag to the allow list |
| AllowList::remove_tag(&mut self, name: &str) | Remove a tag by name |
| Tag::new(name, attr, url_attr) | Define a tag with its allowed regular and URL attributes |
| default_allow_list() -> AllowList | Returns the built-in default allow list |
| default_url_sanitizer(&str) -> Option<String> | The built-in URL validator (reusable in custom validators) |

For complete API documentation, see docs.rs.

npm / TypeScript

| Export | Description |
| --- | --- |
| sanitize(input: string): string | One-shot sanitization with the default allow list |
| new HtmlSanitizer() | Create a configurable sanitizer instance |
| .sanitize(input: string): string | Sanitize HTML using the instance's configuration |
| .addTag(name, attrs?, urlAttrs?) | Add a tag; attrs and urlAttrs are comma-separated strings |
| .removeTag(name: string) | Remove a tag from the allow list |
| .addGlobalAttr(name: string) | Allow an attribute on all tags |
| .free() | Release WASM memory; the instance is unusable after this |

Default Allow List

The default allow list permits 68 commonly used HTML tags. All other tags are stripped — their text content is preserved. Tags in the non-HTML list (script, style, object) have both their tags and content removed.

Global attributes (allowed on every permitted tag): class, id

| Category | Tags |
| --- | --- |
| Structural | address, article, aside, footer, header, h1-h6, hgroup, main, nav, section |
| Block content | blockquote, dd, div, dl, dt, figcaption, figure, hr, li, ol, p, pre, ul |
| Inline text | a, abbr, b, bdi, bdo, br, cite, code, data, em, i, kbd, mark, q, s, small, span, strong, sub, sup, time, u |
| Media | area, audio, img, map, track, video, picture, source |
| Table | caption, col, colgroup, table, tbody, td, tfoot, th, thead, tr |
| Edit marks | del, ins |
| Interactive | details, summary |

Notable tag-specific attributes:

| Tag | Regular attributes | URL attributes |
| --- | --- | --- |
| a | rel, target, referrerpolicy | href |
| img | alt, crossorigin, height, width, loading, referrerpolicy | src |
| video | autoplay, buffered, controls, crossorigin, duration, loop, muted, preload, height, width | src, poster |
| audio | autoplay, controls, crossorigin, duration, loop, muted, preload | src |
| td / th | colspan, rowspan (+ scope for th) | (none) |
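
The lookup rule implied by the tables above (an attribute survives only if the tag is permitted and the attribute is either listed for that tag or is global) can be sketched with std collections. `ToyAllowList` is a hypothetical stand-in, not the crate's `AllowList`; it covers only three sample tags and, unlike the real type, does not distinguish URL attributes from regular ones.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for an allow list (hypothetical type):
/// per-tag attributes plus attributes allowed on every permitted tag.
struct ToyAllowList {
    global_attrs: HashSet<&'static str>,
    tags: HashMap<&'static str, HashSet<&'static str>>,
}

impl ToyAllowList {
    /// A small sample taken from the attribute table above.
    fn sample() -> Self {
        let mut tags = HashMap::new();
        tags.insert("a", HashSet::from(["rel", "target", "referrerpolicy", "href"]));
        tags.insert("td", HashSet::from(["colspan", "rowspan"]));
        tags.insert("th", HashSet::from(["colspan", "rowspan", "scope"]));
        ToyAllowList {
            global_attrs: HashSet::from(["class", "id"]),
            tags,
        }
    }

    /// An attribute passes only on a permitted tag, and only if it is
    /// listed for that tag or is a global attribute.
    fn allows(&self, tag: &str, attr: &str) -> bool {
        match self.tags.get(tag) {
            Some(attrs) => attrs.contains(attr) || self.global_attrs.contains(attr),
            None => false, // unknown tag: nothing is allowed
        }
    }
}
```

With this sample, `scope` is accepted on `th` but not on `td`, and `onclick` is rejected everywhere because it appears in neither set.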

URL Sanitization

Attributes marked as URL attributes (href, src, poster, cite, etc.) are validated by the URL sanitizer. The default behavior:

Accepted:

  • http:// and https:// URLs
  • Relative URLs (paths, fragments, query strings)

Rejected:

  • javascript: (including case variations and HTML-entity-encoded forms)
  • data: URIs
  • ftp: and all other non-HTTP schemes
  • URLs containing ASCII control characters (bytes below 0x20, or 0x7F)
  • Opaque (cannot-be-a-base) URIs
  • Percent-encoded ASCII in hostnames

When a URL is rejected, the attribute is removed but the tag and its content are preserved (e.g., <a href="javascript:...">text</a> becomes <a>text</a>).

You can supply a custom URL validator via with_url_sanitizer (Rust) to implement domain restrictions or additional checks. The built-in default_url_sanitizer is exported so you can compose it with your own logic.
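
As a rough illustration of the accept/reject rules listed above, the following std-only sketch classifies a URL by its scheme. It is not the crate's `default_url_sanitizer`: the hypothetical `toy_url_ok` assumes the attribute value has already been HTML-entity-decoded, and it does not implement the hostname percent-encoding check.

```rust
/// Simplified illustration of the rules above (hypothetical function).
/// Assumes `url` has already been HTML-entity-decoded.
fn toy_url_ok(url: &str) -> bool {
    // Reject ASCII control characters (bytes below 0x20, and 0x7F).
    if url.bytes().any(|b| b < 0x20 || b == 0x7f) {
        return false;
    }
    // A scheme, if present, ends at the first ':' that appears
    // before any '/', '?' or '#'.
    let head_end = url
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(url.len());
    match url[..head_end].find(':') {
        // No scheme: treat the value as a relative URL.
        None => true,
        Some(colon) => {
            let scheme = url[..colon].to_ascii_lowercase();
            // "http:foo" has no "//" authority part, so it is opaque.
            let hierarchical = url[colon + 1..].starts_with("//");
            (scheme == "http" || scheme == "https") && hierarchical
        }
    }
}
```

Under these assumptions, `http://example.com/a`, `/path?x=1#frag`, and `images/pic.png` pass, while `JavaScript:alert(1)`, `data:text/html,x`, `ftp://host/file`, and `http:opaque` are rejected.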

Security Considerations

  • Defense in depth. This sanitizer is designed as one layer of an XSS mitigation strategy. Combine it with Content Security Policy headers and context-aware output encoding.
  • Not a full HTML parser. The DFA-based approach handles real-world HTML effectively but does not build a DOM tree. It is designed to be conservative — when in doubt, content is stripped.
  • Fuzz-tested. The project includes a cargo-fuzz harness. If you discover a bypass, please report it via GitHub Issues.
  • Consistent cross-platform behavior. The Rust and WASM/npm builds share the same sanitization engine, ensuring identical output.
  • Tested against known XSS vectors. The test suite includes vectors from OWASP and other common XSS payloads.

Performance

The sanitizer operates in a single O(n) pass over the input using a 17-state DFA. It allocates no DOM tree and performs no backtracking.

npm: @bytevet/htmlsanitizer vs DOMPurify

Benchmarked with Vitest bench on Node.js (DOMPurify uses jsdom):

| Payload | @bytevet/htmlsanitizer | DOMPurify + jsdom | Ratio |
| --- | --- | --- | --- |
| Simple HTML (small) | 56,716 ops/s | 15,253 ops/s | 3.7x faster |
| XSS vectors | 40,908 ops/s | 5,373 ops/s | 7.6x faster |
| Blog post (medium) | 33,259 ops/s | 1,381 ops/s | 24x faster |
| Mixed safe + dangerous | 40,326 ops/s | 3,987 ops/s | 10x faster |
| Large document (~50 KB) | 1,054 ops/s | 24 ops/s | 44x faster |

DOMPurify is faster on tiny plain-text inputs (no HTML tags) due to WASM call overhead (~10 µs). For any real HTML content, @bytevet/htmlsanitizer is 3.7–44x faster, with the advantage growing as input size increases.

Reproduce with:

cd bench-npm && npm install && npm run bench

Rust

cargo bench

Development

# Run tests
cargo test

# Run clippy
cargo clippy --all-targets --all-features

# Run benchmarks
cargo bench

# Build WASM and run npm tests
cd npm && npm run build && npm test

# Fuzz testing
cargo +nightly fuzz run sanitize

Related Projects

License

MIT — see LICENSE.