rustrict
rustrict is a profanity filter for Rust.
Disclaimer: Multiple source files (.txt, .csv, .rs test cases) contain profanity. Viewer discretion is advised.
Features
- Multiple types (profane, offensive, sexual, mean, spam)
- Multiple levels (mild, moderate, severe)
- Resistant to evasion
  - Alternative spellings (like "fck")
  - Repeated characters (like "craaaap")
  - Confusable characters (like 'ᑭ', '𝕡', and '🅿')
  - Spacing (like "c r_a-p")
  - Accents (like "pÓöp")
  - Bidirectional Unicode (related reading)
  - Self-censoring (like "f*ck")
  - Safe phrase list for known bad actors
  - Censors invalid Unicode characters
  - Battle-tested in Mk48.io
- Resistant to false positives
  - One word (like "assassin")
  - Two words (like "push it")
- Flexible
  - Censor and/or analyze
  - Input &str or Iterator<Item = char>
  - Can track per-user state with the context feature
  - Can add words with the customize feature
  - Accurately reports the width of Unicode via the width feature
  - Plenty of options
- Performant
  - O(n) analysis and censoring
  - No regex (uses custom trie)
  - 3 MB/s in release mode
  - 100 KB/s in debug mode
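The list above mentions a custom trie and tolerance for repeated characters. As a rough illustration of that general idea only (this is not rustrict's actual implementation), the sketch below matches a single token against a trie while letting a run of the same character stand in for one occurrence. The `Trie` and `Node` types, their methods, and the one-word list are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical trie node; not rustrict's internal data structure.
#[derive(Default)]
struct Node {
    children: HashMap<char, Node>,
    terminal: bool, // true if a stored word ends at this node
}

#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    /// Stores `word` one character per trie level.
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for c in word.chars() {
            node = node.children.entry(c).or_default();
        }
        node.terminal = true;
    }

    /// Returns true if `token`, lowercased and with extra repeats of a
    /// character ignored, spells a stored word. Each input character is
    /// visited once, so the work is linear in the token's length.
    fn matches(&self, token: &str) -> bool {
        let mut node = &self.root;
        let mut prev: Option<char> = None;
        for c in token.chars().flat_map(char::to_lowercase) {
            if let Some(next) = node.children.get(&c) {
                // The trie has an edge for this character: follow it.
                node = next;
            } else if prev != Some(c) {
                // No edge, and not a repeat of the previous character.
                return false;
            }
            // Otherwise it is a repeated character: stay on the same node.
            prev = Some(c);
        }
        node.terminal
    }
}
```

With "crap" inserted, `matches` accepts "crap", "craaaap", and "CRAAP" but rejects "crab" and "cra". A real filter also has to scan inside longer text and handle spacing, confusable characters, and false positives, which this sketch does not attempt.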
Limitations
- Mostly English/emoji
- Censoring removes most diacritics (accents)
- Does not detect right-to-left profanity while analyzing, so...
- Censoring forces Unicode to be left-to-right
- Doesn't understand context
- Not resistant to false positives affecting profanities added at runtime
Usage
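Strings (&str)
use rustrict::CensorStr;
// Censor a string and check whether it is inappropriate.
let censored: String = "hello crap".censor();
let inappropriate: bool = "f u c k".is_inappropriate();
assert_eq!(censored, "hello c***");
assert!(inappropriate);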
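Iterators (Iterator<Item = char>)
use rustrict::CensorIter;
// Censor a stream of characters without building an intermediate String.
let censored: String = "hello crap".chars().censor().collect();
assert_eq!(censored, "hello c***");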
Advanced
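By constructing a Censor, one can avoid scanning text multiple times to get a censored String and/or answer multiple is queries. This also opens up more customization options (defaults are below).
use rustrict::{Censor, Type};
// Censor and analyze in a single pass over the input.
let (censored, analysis) = Censor::from_str("123 Crap")
    .with_censor_threshold(Type::INAPPROPRIATE)
    .with_censor_first_character_threshold(Type::OFFENSIVE & Type::SEVERE)
    .with_ignore_false_positives(false)
    .with_ignore_self_censoring(false)
    .with_censor_replacement('*')
    .censor_and_analyze();
assert_eq!(censored, "123 C***");
assert!(analysis.is(Type::INAPPROPRIATE));
assert!(analysis.isnt(Type::PROFANE & Type::SEVERE | Type::SEXUAL));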
If you cannot afford to let anything slip through, or have reason to believe a particular user is trying to evade the filter, you can check if their input matches a short list of safe strings:
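use rustrict::{CensorStr, Type};
// Figure out if a user is trying to evade the filter.
assert!("pron".is(Type::EVASIVE));
assert!("porn".isnt(Type::EVASIVE));
// Only let safe messages through.
assert!("Hello there!".is(Type::SAFE));
assert!("nice work.".is(Type::SAFE));
assert!("yes".is(Type::SAFE));
assert!("NVM".is(Type::SAFE));
assert!("gtg".is(Type::SAFE));
assert!("not a common phrase".isnt(Type::SAFE));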
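One way to picture a safe-string check is as an allow-list lookup after light normalization. The sketch below assumes that general technique and is not rustrict's implementation; `normalize`, `is_safe`, and the contents of the list passed in are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: lowercases, drops punctuation, and collapses
/// runs of whitespace into single spaces.
fn normalize(message: &str) -> String {
    let cleaned: String = message
        .chars()
        .flat_map(char::to_lowercase)
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns true only if the normalized message is on the allow-list.
/// Anything not listed is rejected, including harmless but uncommon text.
fn is_safe(message: &str, safe_list: &HashSet<&str>) -> bool {
    safe_list.contains(normalize(message).as_str())
}
```

The trade-off is deliberate: an allow-list blocks every unknown message, so it suits users who are already suspected of evasion rather than general chat.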
If you want to add custom profanities or safe words, enable the customize feature.
If your use-case is chat moderation, and you store data on a per-user basis, you can use rustrict::Context as a reference implementation.
Comparison
To compare filters, the first 100,000 items of this list are used as a dataset. Positive accuracy is the percentage of profanity detected as profanity. Negative accuracy is the percentage of clean text detected as clean.
Crate | Accuracy | Positive Accuracy | Negative Accuracy | Time |
---|---|---|---|---|
rustrict | 80.00% | 94.01% | 76.50% | 9s |
censor | 76.16% | 72.76% | 77.01% | 23s |
stfu | 91.74% | 77.69% | 95.25% | 45s |
profane-rs | 80.47% | 73.79% | 82.14% | 52s |
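To make the three accuracy columns concrete, the sketch below computes them from confusion-matrix counts, following the definitions given above. The `Counts` type and any numbers fed to it are hypothetical; they are not the benchmark's actual data or code.

```rust
/// Confusion-matrix counts for one filter run (hypothetical type).
struct Counts {
    true_positives: u32,  // profanity detected as profanity
    false_negatives: u32, // profanity that slipped through
    true_negatives: u32,  // clean text detected as clean
    false_positives: u32, // clean text flagged as profanity
}

impl Counts {
    /// Fraction of profanity detected as profanity.
    /// Yields NaN if the dataset contains no profanity.
    fn positive_accuracy(&self) -> f64 {
        f64::from(self.true_positives)
            / f64::from(self.true_positives + self.false_negatives)
    }

    /// Fraction of clean text detected as clean.
    /// Yields NaN if the dataset contains no clean text.
    fn negative_accuracy(&self) -> f64 {
        f64::from(self.true_negatives)
            / f64::from(self.true_negatives + self.false_positives)
    }

    /// Fraction of all items classified correctly.
    fn accuracy(&self) -> f64 {
        let correct = self.true_positives + self.true_negatives;
        let total = correct + self.false_negatives + self.false_positives;
        f64::from(correct) / f64::from(total)
    }
}
```

Overall accuracy is a weighted average of the positive and negative accuracy, weighted by how much of the dataset is profane versus clean. When clean text dominates, a filter with high positive accuracy can still show lower overall accuracy than one that misses more profanity but raises fewer false positives.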
Development
If you make an adjustment that would affect false positives, such as adding profanity, you will need to run false_positive_finder:
- Run make downloads to download the required word lists and dictionaries
- Run make false_positives to automatically find false positives
If you modify replacements_extra.csv, run make replacements to rebuild replacements.csv.
Finally, run make test for a full test or make test_debug for a fast test.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.