chardetng 0.1.2

A character encoding detector for legacy Web content

chardetng

crates.io docs.rs Apache 2 / MIT dual-licensed

Licensing

Please see the file named COPYRIGHT.

Documentation

Generated API documentation is available online.

Purpose

The purpose of this detector is user retention for Firefox by ensuring that the long tail of the legacy Web is not more convenient to use in Chrome than in Firefox. (Chrome deployed ced, which left Firefox less convenient to use until the deployment of this detector.)

About the Name

chardet was the name of Mozilla's old encoding detector. I named this one chardetng, because this is the next generation of encoding detector in Firefox. There is no code reuse from the old chardet.

Optimization Goals

This crate aims to be more accurate than ICU, more complete than chardet, more explainable and modifiable than compact_enc_det (a.k.a. ced), and, in an application that already depends on encoding_rs for other reasons, smaller in added binary footprint than compact_enc_det.

Principle of Operation

In general, chardetng prefers negative matching (ruling out possibilities from the set of plausible encodings) over positive matching. Since negative matching alone is insufficient, there is positive matching, too.

  • Except for ISO-2022-JP, pairs of ASCII bytes never contribute to the detection, which has the effect of ignoring HTML syntax without an HTML-aware state machine.
  • A single encoding error disqualifies an encoding from the set of possible outcomes. Notably, as the length of the input increases, it becomes increasingly improbable for the input to be valid according to a legacy CJK encoding without being intended as such. Also, there are single-byte encodings that have unmapped bytes in areas that are in active use by other encodings, so such bytes narrow the set of possibilities very effectively.
  • A single occurrence of a C1 control character disqualifies an encoding from possible outcomes.
  • The first non-ASCII character being a half-width katakana character disqualifies an encoding. (This is very effective for deciding between Shift_JIS and EUC-JP.)
  • For single-byte encodings, character pairs are given scores according to their relative frequencies in the applicable Wikipedias.
  • There's a variety of smaller penalty rules, such as:
    • For encodings for bicameral scripts, having an upper-case letter follow a lower-case letter is penalized.
    • For Latin encodings, having three non-ASCII letters in a row is penalized a little and having four or more is penalized a lot.
    • For non-Latin encodings, having a non-Latin letter right next to a Latin letter is penalized.
    • For single-byte encodings, having a character pair (excluding pairs where both characters are ASCII) that never occurs in the Wikipedias for the applicable languages is heavily penalized.
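
The rules above can be pictured with a small, self-contained sketch. It is not the crate's implementation: `PairVerdict`, `score_candidate`, the two toy judges, and all point values are hypothetical and chosen only for illustration. It shows three ideas from the list: pairs of ASCII bytes are skipped, a single error disqualifies a candidate, and the surviving candidates are compared by score.

```rust
/// What a (hypothetical) per-encoding judge says about one byte pair.
enum PairVerdict {
    /// The pair cannot occur in this encoding, so the encoding is ruled out.
    Error,
    /// The pair is plausible; positive values reward it, negative values penalize it.
    Score(i64),
}

/// Scores `bytes` for one candidate encoding.
/// Returns `None` once the candidate has been disqualified.
fn score_candidate(bytes: &[u8], judge: impl Fn(u8, u8) -> PairVerdict) -> Option<i64> {
    let mut score = 0i64;
    for pair in bytes.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        // Pairs of ASCII bytes never contribute, which skips HTML syntax.
        if prev.is_ascii() && cur.is_ascii() {
            continue;
        }
        match judge(prev, cur) {
            // A single error is enough to rule the candidate out.
            PairVerdict::Error => return None,
            PairVerdict::Score(points) => score += points,
        }
    }
    Some(score)
}

/// Toy judge for an ISO-8859-like encoding, where 0x80..=0x9F are C1 controls.
fn judge_iso_like(prev: u8, cur: u8) -> PairVerdict {
    let is_c1 = |b: u8| (0x80..=0x9F).contains(&b);
    if is_c1(prev) || is_c1(cur) {
        PairVerdict::Error
    } else {
        PairVerdict::Score(1)
    }
}

/// Toy judge for a windows-125x-like encoding with no C1 controls.
/// The penalty for two adjacent non-ASCII bytes is made up.
fn judge_windows_like(prev: u8, cur: u8) -> PairVerdict {
    if !prev.is_ascii() && !cur.is_ascii() {
        PairVerdict::Score(-5)
    } else {
        PairVerdict::Score(1)
    }
}

/// Picks the highest-scoring candidate that has not been disqualified.
fn best_guess(bytes: &[u8]) -> Option<&'static str> {
    let candidates = [
        ("iso-like", score_candidate(bytes, judge_iso_like)),
        ("windows-like", score_candidate(bytes, judge_windows_like)),
    ];
    candidates
        .iter()
        .filter_map(|&(name, score)| score.map(|s| (name, s)))
        .max_by_key(|&(_, s)| s)
        .map(|(name, _)| name)
}
```

Under these toy rules, `best_guess(b"\x93hi\x94")` returns `Some("windows-like")`, because the byte 0x93 is a C1 control in the ISO-like candidate and disqualifies it outright. Real scoring would draw on per-language frequency data rather than fixed constants.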

Notes About Encodings

Of the detected encodings, ISO-8859-5, ISO-8859-6, and ISO-8859-4 are the ones that other content is most likely to be misdetected as. They are also, I believe, the three least-used of the detected encodings, so these three are the most likely to be removed or downplayed.

Known Problems

  • GBK detection is less accurate than in ced for short titles consisting of fewer than six hanzi. This is mostly because the design prioritizes binary size over accuracy on very short inputs.
  • Thai detection is inaccurate for short inputs.
  • windows-1257 detection is very inaccurate. (This detector currently doesn't use trigrams. ced uses 8 KB of trigram data to solve this.)
  • On non-generic domains, some encodings that are confusable with the legacy encodings native to the TLD are excluded from guesses outright unless the input is invalid according to all the TLD-native encodings.

Roadmap

  • Investigate parallelizing the feed method using Rayon.
  • Improve windows-874 detection for short inputs.
  • Improve GBK detection for short inputs.
  • Reorganize the frequency data for telling short GBK, EUC-JP, and EUC-KR inputs apart.
  • Make windows-1257 detection on generic domains a lot more accurate (likely requires looking at trigrams).
  • Tune Central European detection.
  • Tune the penalties applied to confusable encodings on non-generic TLDs to make detection of confusable encodings possible on non-generic TLDs.

Release Notes

0.1.2

  • Return UTF-8 if valid and allowed even if all-ASCII.
  • Return windows-1252 if UTF-8 valid and prohibited, because various test cases require this.
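
One reading of the two 0.1.2 changes is the following decision sketch. It is an assumption-laden illustration, not the crate's code: the `Guess` enum and `guess_sketch` are hypothetical names, validity is checked with the standard library's `std::str::from_utf8`, and everything else the detector does (ISO-2022-JP handling, scoring, TLD hints) is collapsed into a placeholder variant.

```rust
/// Hypothetical outcome type used only for this sketch.
#[derive(Debug, PartialEq)]
enum Guess {
    Utf8,
    Windows1252,
    /// Placeholder for whatever the rest of the detection would pick.
    Other,
}

/// Sketch of the 0.1.2 behavior as described in the release notes.
/// `allow_utf8` is assumed to mean "the caller permits a UTF-8 result".
fn guess_sketch(input: &[u8], allow_utf8: bool) -> Guess {
    if std::str::from_utf8(input).is_ok() {
        // All-ASCII input is also valid UTF-8, so it takes this branch.
        if allow_utf8 {
            Guess::Utf8
        } else {
            Guess::Windows1252
        }
    } else {
        Guess::Other
    }
}
```

With this sketch, all-ASCII input such as `b"hello"` yields `Guess::Utf8` when UTF-8 is allowed and `Guess::Windows1252` when it is prohibited.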

0.1.1

  • Detect Visual Hebrew more often.

0.1.0

  • Initial release.