eunicode 0.1.5

A text processing CLI and library that helps sanitize text by removing the naughty bits to make strings good and safe.

eunicode

eunicode is a text processing CLI and library that helps sanitize text by removing the naughty bits to make strings good and safe:

  1. No unexpected Unicode characters, to prevent homograph attacks.
  2. No dangerous characters (zero-width, text direction changes, etc.) used to hide dangerous text in innocent-looking strings.
  3. "De-fanged" links, so that shared links to IoCs and malware analysis repositories can't be clicked accidentally.
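
To make the homograph problem concrete, here is a minimal sketch using only the Rust standard library. It is not eunicode's code; the `non_ascii` helper is a hypothetical name introduced for illustration. It shows that two strings which render almost identically can differ by a single code point.

```rust
/// Returns every non-ASCII character in `input` together with its
/// character index. Hypothetical helper, not part of eunicode.
fn non_ascii(input: &str) -> Vec<(usize, char)> {
    input
        .chars()
        .enumerate()
        .filter(|(_, c)| !c.is_ascii())
        .collect()
}

fn main() {
    let genuine = "paypal.com";
    // The second letter is U+0430 CYRILLIC SMALL LETTER A, not Latin "a".
    let spoofed = "p\u{0430}ypal.com";

    // The strings look alike on screen but are not equal.
    assert_ne!(genuine, spoofed);

    for (index, c) in non_ascii(spoofed) {
        println!("char {}: U+{:04X}", index, c as u32);
    }
}
```

A plain ASCII check like this only flags the difference; it does not decide whether a non-ASCII character is legitimate.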

This was built partly as a demo for a talk about NewType and TypeState patterns, and also because I couldn't find a simple CLI to sanitize strings even though there are some great Rust crates to do exactly that.
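
For readers unfamiliar with the two patterns, the following is a rough sketch of how they could combine in a sanitizer. It is an assumption-laden illustration, not eunicode's actual types: `Text`, `Raw`, and `Clean` are hypothetical names, and the cleaning step simply drops non-ASCII characters as a stand-in for real normalization.

```rust
use std::marker::PhantomData;

// Marker types for the typestate (hypothetical names).
struct Raw;
struct Clean;

/// Newtype around `String`, tagged at the type level with its state.
struct Text<State> {
    inner: String,
    _state: PhantomData<State>,
}

impl Text<Raw> {
    fn new(input: &str) -> Self {
        Text { inner: input.to_string(), _state: PhantomData }
    }

    /// Consumes the raw text and returns cleaned text.
    /// Stand-in logic: keep ASCII characters only.
    fn clean(self) -> Text<Clean> {
        let inner: String = self.inner.chars().filter(|c| c.is_ascii()).collect();
        Text { inner, _state: PhantomData }
    }
}

impl Text<Clean> {
    /// Only cleaned text exposes its contents.
    fn as_str(&self) -> &str {
        &self.inner
    }
}

fn main() {
    let raw = Text::<Raw>::new("caf\u{00E9}\u{200B}!");
    let clean = raw.clean();
    println!("{}", clean.as_str());
}
```

Because `as_str` exists only on `Text<Clean>`, code that tries to read unsanitized text fails to compile instead of failing at run time.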

Naming

From the prefix eu-:

  1. eu-: Used to form taxonomic names corresponding to English vernacular names beginning with "true".

  2. eu-: Pronunciation: /iːu̟/. Origin: Ancient Greek: εὖ (eû). Meaning: "good", "well"; also extended via Neo-Latin to mean "true". Used in a variety of ways, often to indicate well-preserved specimens, well-developed bones, "truer" examples of fossil forms, or simply admiration on the part of the discoverer.

Also, from eunuch:

  1. eunuch: n 1: a man who has been castrated and is incapable of reproduction; "eunuchs guarded the harem"

  2. eunuch: Such a man employed as harem guard or in certain (mainly Eastern) monarchies (e.g. late Roman and Chinese Empires) as court or state officials.

Features

Output

  • --output <FILE>: File to write results to
  • --clipboard: Also copy output to clipboard (suppresses stdout unless redirected)

Sanitization

  • --clean: Normalize Unicode characters to only safe, ASCII text chars. Uses deunicode.
  • --strip: Remove HTML tags. Uses ammonia.
  • --defang: De-fang hyperlinks. Uses linkify to find links with custom code to de-fang them.
  • --censor: Replace profanity with placeholders. Uses rustrict.
  • --slugify: Convert text into chars suitable for a URI slug or filename. Uses limace.
  • --keep-colors: Keep CSI SGR codes to allow text formatting in the terminal. Uses a custom implementation via termwiz.
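
As an illustration of what de-fanging means, here is a minimal standard-library sketch. It assumes one common convention ("http" becomes "hxxp", every "." becomes "[.]") and finds links by splitting on spaces; eunicode itself uses linkify plus custom code, and its exact output format may differ. The function names are hypothetical.

```rust
/// De-fangs a single URL: "http" becomes "hxxp" and "." becomes "[.]".
/// This convention is an assumption, not necessarily eunicode's output.
fn defang_url(url: &str) -> String {
    let rewritten = if let Some(rest) = url.strip_prefix("https://") {
        format!("hxxps://{}", rest)
    } else if let Some(rest) = url.strip_prefix("http://") {
        format!("hxxp://{}", rest)
    } else {
        url.to_string()
    };
    rewritten.replace('.', "[.]")
}

/// De-fangs every space-separated word that starts with an HTTP(S) scheme.
/// A real link finder handles far more cases than this.
fn defang_text(text: &str) -> String {
    text.split(' ')
        .map(|word| {
            if word.starts_with("http://") || word.starts_with("https://") {
                defang_url(word)
            } else {
                word.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{}", defang_text("sample at https://evil.example.com/a.exe today"));
}
```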

Inspection

  • --detect: Detect dangerous characters in the input
  • --chars: Show characters present in input, their names, and code points
  • Highlight confusables: potentially confusable characters are shown in bold and in a different color.
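
The kind of check behind detection can be sketched with the standard library alone. The character set below is a small illustrative selection of zero-width and direction-changing code points chosen for this example; it is an assumption, not the list eunicode checks, and the function names are hypothetical.

```rust
/// Returns true for a small, illustrative set of invisible or
/// direction-changing code points (assumed for this sketch).
fn is_suspicious(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200F}'   // zero-width space and joiners, LRM, RLM
        | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
        | '\u{2066}'..='\u{2069}' // bidi isolates
        | '\u{FEFF}'              // zero-width no-break space (BOM)
    )
}

/// Collects (byte offset, character) pairs for each suspicious character.
fn find_suspicious(input: &str) -> Vec<(usize, char)> {
    input
        .char_indices()
        .filter(|&(_, c)| is_suspicious(c))
        .collect()
}

fn main() {
    // A zero-width space inside a word and a right-to-left override
    // in front of a file name.
    let text = "pay\u{200B}pal \u{202E}txt.exe";
    for (offset, c) in find_suspicious(text) {
        println!("byte {}: U+{:04X}", offset, c as u32);
    }
}
```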

Testing

  • --raw-bytes: Emit the raw bytes after ANSI stripping but without further cleaning (mostly useful for testing)

Caveats

  • clean is overly aggressive and emits only ASCII text. We probably need to allow different levels of cleaning.
  • strip removes all HTML tags. We might want to allow for some exceptions.
  • censor uses the rustrict library, so it will not censor words that rustrict misses. This can be tested via the swear-words repo with the command eunicode --censor < ~/Repos/swear-words/en or similarly with other lists of profanity.

Unicode Standard References

RFC 9839 and Bad Unicode by Tim Bray