mailrs-clean
Email content cleanup primitives for Rust — multi-stage HTML sanitization, tracking-pixel detection, bulk/automated sender heuristics, and quoted-reply splitting. Zero I/O, no async runtime.
Extracted from mailrs so any mail client / inbound pipeline can run an email body through the same battle-tested pipeline that fronts a production server.
What you get
| Entry point | What it does |
|---|---|
[clean_email_html] |
strip unsafe elements + tracking pixels + hidden blocks + marketing chrome, then convert to paragraph-aware plain text. Returns a [CleanResult] with the text plus flags for tracking / template-heavy / unsubscribe-link presence. |
[detect_bulk_sender] |
RFC 2369 List-* header heuristic — true for mailing-list / newsletter traffic. |
[is_automated_sender] |
local-part pattern check (no-reply@, notification@, …). |
[split_quoted_content] |
separate a fresh reply from quoted ancestry so UIs can collapse old context. |
Quick start
use ;
let result = clean_email_html;
assert!;
// result.text is the cleaned, paragraph-aware plain text
let headers = "From: news@example.com\r\nList-Id: <news.example.com>\r\n";
assert!;
let = split_quoted_content;
// fresh = "Sounds good.", quoted = vec!["Could we move it to 10am?"]
What the HTML cleaner removes
<script>,<style>,<iframe>,<object>,<embed>— unsafe.<img width=1 height=1>from known tracking domains (Mailchimp, SendGrid, HubSpot, Mailgun, Amazon SES, …) — privacy.- Blocks with
display:none,visibility:hidden,opacity:0— usually hidden spam-keyword harvesting. - Marketing-template chrome: outer
<table>wrappers around a single content block, repeated unsubscribe footers. - After cleanup, the residue goes through
html2textfor paragraph-aware plain-text conversion.
Tracking-domain list is const and exhaustive for the top 18 commercial email-marketing platforms — see lib.rs for the current list.
Performance
benches/clean.rs covers the HTML cleaner across realistic email sizes plus the sender heuristics + quote splitter. Measured with criterion 0.8 on Apple Silicon (M-series), cargo bench, release profile.
| Operation | Input size | Median | Throughput |
|---|---|---|---|
clean_email_html (short body) |
60 B | ~20 µs | — |
clean_email_html (short marketing) |
560 B | ~56 µs | ~10 MB/s |
clean_email_html (5 KB marketing) |
5.6 KB | ~336 µs | ~17 MB/s |
clean_email_html (50 KB worst-case) |
56 KB | ~2.5 ms | ~22 MB/s |
detect_bulk_sender (with List-Id) |
— | ~60 ns | — |
is_automated_sender("no-reply@…") |
— | ~50 ns | — |
is_automated_sender(human address) |
— | ~50 ns | — |
split_quoted_content (typical reply) |
— | ~970 ns | — |
The cost is dominated by scraper parser construction + tree walk; the per-byte rate is roughly flat across input sizes once parser startup is amortized. For workloads that process every inbound message at SMTP time, this means a 50 KB worst-case marketing email costs ~2.5 ms of CPU on top of the DATA-stage I/O.
Run with cargo bench -p mailrs-clean. See tests/perf_gate.rs for the regression budgets.
What this is NOT
- Not a general HTML sanitizer — for that, use
ammonia. We make email-specific assumptions (single-document message, no script execution, structured-output text). - Not a parser for MIME bodies — bring your own (
mail-parser,lettre, etc.); feed us thetext/htmlpart. - Not opinionated about how you score senders — the heuristics return booleans, the policy decisions are yours.
License
Licensed under either of Apache License 2.0 or MIT license at your option.