Deencode: Reverse engineer encoding errors

My first name is Clément. Throughout my life, I've seen my name printed wrong more than my fair share of times because of encoding mismatches: the text is encoded (turned from an internal representation into a sequence of bytes) with one scheme, then decoded (turned from a sequence of bytes back into an internal representation) with a different one. This often leads to non-ASCII characters being mangled, replaced, or dropped outright.

For example:

The string "Clément"
└╴encoded as UTF-8 is 43 6C C3 A9 6D 65 6E 74
  └╴decoded as Latin-1 / Codepage 1252 is "ClÃ©ment"
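
The same round trip can be reproduced with the Rust standard library alone. This is only a minimal sketch, not part of this crate: the helper names `hex` and `decode_latin1` are made up for illustration, and the decoder implements strict ISO-8859-1 (each byte becomes the code point of the same value), which agrees with Codepage 1252 for the bytes C3 and A9 used here.

```rust
/// Format bytes as space-separated uppercase hex, e.g. "43 6C".
/// (Hypothetical helper, not part of the deencode crate.)
fn hex(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Decode bytes as ISO-8859-1 (Latin-1): every byte maps to the Unicode
/// code point with the same numeric value, so this can never fail.
/// (Hypothetical helper, not part of the deencode crate.)
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    let original = "Clément";
    // Rust strings are stored as UTF-8, so this is the UTF-8 encoding.
    let bytes = original.as_bytes();
    println!("{}", hex(bytes));
    // Reading those UTF-8 bytes as Latin-1 mangles the "é".
    println!("{}", decode_latin1(bytes));
}
```

The two-byte UTF-8 sequence for "é" is read back as two separate Latin-1 characters, which is where the extra "Ã©" comes from.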

I created this crate to produce this sort of visualisation. You pick a number of engines, pass them to deencode::deencode() along with the input string, and get back a tree of possible sequences of encodings and decodings that you can then work on.
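
To make the idea of exploring encode/decode combinations concrete, here is a standalone sketch of a single level of that exploration, flattened into a list instead of a tree. It does not use this crate's API: the `Codec` struct, the `CODECS` table, and `one_level` are all hypothetical names, and only UTF-8 and strict ISO-8859-1 are modelled.

```rust
/// A hypothetical codec: a name plus encode/decode functions that may fail.
/// (Illustration only; this is not the deencode crate's `Engine` trait.)
struct Codec {
    name: &'static str,
    encode: fn(&str) -> Option<Vec<u8>>,
    decode: fn(&[u8]) -> Option<String>,
}

fn utf8_encode(s: &str) -> Option<Vec<u8>> {
    Some(s.as_bytes().to_vec())
}

fn utf8_decode(b: &[u8]) -> Option<String> {
    // Fails on byte sequences that are not valid UTF-8.
    String::from_utf8(b.to_vec()).ok()
}

fn latin1_encode(s: &str) -> Option<Vec<u8>> {
    // Fails if any character lies outside U+0000..=U+00FF.
    s.chars().map(|c| u8::try_from(u32::from(c)).ok()).collect()
}

fn latin1_decode(b: &[u8]) -> Option<String> {
    // Every byte is a valid Latin-1 character, so this always succeeds.
    Some(b.iter().map(|&x| char::from(x)).collect())
}

static CODECS: [Codec; 2] = [
    Codec { name: "UTF-8", encode: utf8_encode, decode: utf8_decode },
    Codec { name: "Latin-1", encode: latin1_encode, decode: latin1_decode },
];

/// One level of exploration: try every (encoder, decoder) pair and keep
/// the pairs where both steps succeed.
fn one_level(input: &str) -> Vec<(&'static str, &'static str, String)> {
    let mut out = Vec::new();
    for enc in CODECS.iter() {
        if let Some(bytes) = (enc.encode)(input) {
            for dec in CODECS.iter() {
                if let Some(text) = (dec.decode)(&bytes) {
                    out.push((enc.name, dec.name, text));
                }
            }
        }
    }
    out
}

fn main() {
    for (enc, dec, text) in one_level("ClÃ©ment") {
        println!("encoded as {}, decoded as {}: {}", enc, dec, text);
    }
}
```

Run on the mangled string "ClÃ©ment", the Latin-1 then UTF-8 pair recovers "Clément", which is the kind of reverse engineering the tree is meant to support.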

This crate is published on crates.io, with documentation on docs.rs.

Example usage

// List the engines to use.
let engines: Vec<&dyn Engine> = vec![&UTF8, &LATIN1, &MIXED816BE, &MIXED816LE, &UTF7];
// Explore the tree of possible encodings and decodings.
let mut tree = deencode("Clément", &engines, 1);
// Remove duplicate entries from the tree.
let _ = tree.deduplicate();
// Export the tree with box drawings.
println!("{}", tree);
// Export the tree as JSON.
println!("{}", serde_json::to_string(&tree).unwrap());

The provided executable runs a 1-level deencoding on each argument using all engines, and prints each resulting tree using box drawings:

$ deencode Clément ミク

deencode 1.0.3
