Struct wordfreq::preprocessers::Standardizer

pub struct Standardizer { /* private fields */ }

This struct provides preprocessing steps that convert forms of a word considered equivalent into one standardized form.

As one straightforward step, it case-folds the text. For the purposes of wordfreq and related tools, a capitalized word shouldn’t have a different frequency from its lowercase version.

The steps that are applied in order, only some of which apply to each language, are:

NFC or NFKC normalization, as needed for the language
Transliteration of multi-script languages
Abjad mark removal
Case folding
Fixing of diacritics

We’ll describe these steps out of order, to start with the more obvious steps.
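As a rough illustration of how an ordered pipeline with language-dependent steps can be put together, here is a minimal sketch that uses only the standard library. The names `Step`, `steps_for`, and `apply_steps` are hypothetical and are not part of this crate, and the two toy steps merely stand in for the real ones.

```rust
/// A preprocessing step: takes text, returns transformed text.
/// (Hypothetical type alias, for illustration only.)
type Step = fn(&str) -> String;

/// Toy stand-in for abjad mark removal: drops the Arabic tatweel (U+0640).
fn strip_tatweel(text: &str) -> String {
    text.replace('\u{0640}', "")
}

/// Toy stand-in for case folding: plain lowercasing.
fn lowercase(text: &str) -> String {
    text.to_lowercase()
}

/// Selects the steps for a language, keeping them in pipeline order.
fn steps_for(language: &str) -> Vec<Step> {
    let mut steps: Vec<Step> = Vec::new();
    if language == "ar" {
        // Only some steps apply to each language.
        steps.push(strip_tatweel);
    }
    steps.push(lowercase);
    steps
}

/// Runs every selected step in order, feeding each output to the next step.
fn apply_steps(language: &str, text: &str) -> String {
    steps_for(language)
        .iter()
        .fold(text.to_string(), |acc, step| step(&acc))
}
```

Keeping the steps in a list makes the order explicit, which matters here because later steps (such as fixing diacritics) assume that earlier ones (such as case folding) have already run.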

Case folding

The most common effect of preprocessing is that alphabetic text is case-folded to lowercase:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("en").unwrap();
assert_eq!(standardizer.apply("Word"), "word");

This is proper Unicode-aware case-folding, so it eliminates distinctions in lowercase letters that would not appear in uppercase. This accounts for the German ß and the Greek final sigma:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("de").unwrap();
assert_eq!(standardizer.apply("groß"), "gross");

use wordfreq::Standardizer;
let standardizer = Standardizer::new("el").unwrap();
assert_eq!(standardizer.apply("λέξις"), "λέξισ");
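To see why case folding is not the same as lowercasing, consider this minimal sketch built on the standard library's `str::to_lowercase`. The helper `fold_case_sketch` is hypothetical and hard-codes only the two cases discussed above; a real implementation would presumably rely on the full Unicode case-folding data.

```rust
/// Approximates Unicode case folding for the German sharp s and the
/// Greek final sigma. Hypothetical helper, not part of this crate.
fn fold_case_sketch(text: &str) -> String {
    text.to_lowercase()
        // Lowercasing leaves U+00DF (sharp s) alone; case folding expands it.
        .replace('\u{00DF}', "ss")
        // Lowercasing keeps U+03C2 (final sigma); case folding maps it
        // to the ordinary sigma U+03C3.
        .replace('\u{03C2}', "\u{03C3}")
}
```

With this sketch, `fold_case_sketch("groß")` yields `"gross"`, whereas `"groß".to_lowercase()` alone would leave the word unchanged.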

In Turkish (and Azerbaijani), case-folding is different, because the uppercase and lowercase I come in two variants, one with a dot and one without. They are matched in a way that preserves the number of dots, which the usual pair of “I” and “i” does not.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("tr").unwrap();
assert_eq!(standardizer.apply("HAKKINDA İSTANBUL"), "hakkında istanbul");
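The dot-preserving pairing can be sketched with the standard library alone. The helper `turkish_fold_sketch` below is hypothetical; it remaps the two capital letters before lowercasing, because the default `to_lowercase` turns “I” into “i” and turns “İ” into “i” followed by a combining dot (U+0307).

```rust
/// Lowercases text using the Turkish/Azerbaijani pairing of dotted and
/// dotless I. Hypothetical helper, not part of this crate.
fn turkish_fold_sketch(text: &str) -> String {
    text.replace('\u{0130}', "i") // dotted capital I (U+0130) -> i
        .replace('I', "\u{0131}") // dotless capital I -> dotless i (U+0131)
        .to_lowercase() // everything else lowercases as usual
}
```

The replacements have to run before `to_lowercase`, since afterwards the two capitals can no longer be told apart from ordinary lowercase output.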

Fixing of diacritics

While we’re talking about Turkish: the Turkish alphabet contains letters with cedillas attached to the bottom. The cedilla letters “ş” and “ţ” are very similar to two Romanian letters, “ș” and “ț”, which have separate commas below them.

(Did you know that a cedilla is not the same as a comma under a letter? I didn’t until I started dealing with text normalization. My keyboard layout even inputs a letter with a cedilla when you hit Compose+comma.)

Because these letters look so similar, and because some fonts only include one pair of letters and not the other, there are many cases where the letters are confused with each other. Our preprocessing normalizes these Turkish and Romanian letters to the letters each language prefers.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("tr").unwrap();
assert_eq!(standardizer.apply("kișinin"), "kişinin");

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ro").unwrap();
assert_eq!(standardizer.apply("ACELAŞI"), "același");
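A minimal sketch of the two mappings follows, assuming the text has already been case-folded (case folding precedes this step in the list above), so only lowercase letters are handled. The function names are hypothetical and the code points are written as escapes because the letters are so hard to tell apart visually.

```rust
/// Turkish: comma-below letters become the cedilla letters.
/// Hypothetical helper; expects already case-folded input.
fn fix_turkish_sketch(text: &str) -> String {
    text.replace('\u{0219}', "\u{015F}") // s with comma below -> s with cedilla
        .replace('\u{021B}', "\u{0163}") // t with comma below -> t with cedilla
}

/// Romanian: cedilla letters become the comma-below letters.
/// Hypothetical helper; expects already case-folded input.
fn fix_romanian_sketch(text: &str) -> String {
    text.replace('\u{015F}', "\u{0219}") // s with cedilla -> s with comma below
        .replace('\u{0163}', "\u{021B}") // t with cedilla -> t with comma below
}
```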

Unicode normalization

Unicode text is NFC normalized in most languages, removing trivial distinctions between strings that should be considered equivalent in all cases:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("de").unwrap();
let word = standardizer.apply("natu\u{0308}rlich");
assert!(word.contains("ü"));

NFC normalization is sufficient (and NFKC normalization is a bit too strong) for many languages that are written in cased, alphabetic scripts. Languages in other scripts tend to need stronger normalization to properly compare text. So we use NFC normalization when the language’s script is Latin, Greek, or Cyrillic, and we use NFKC normalization for all other languages.
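The rule in the previous paragraph amounts to a small decision on the script. The sketch below assumes scripts are identified by their four-letter ISO 15924 codes; the `NormForm` enum and `norm_form_for_script` function are hypothetical names, and the normalization itself is not implemented here.

```rust
/// Which Unicode normalization form to apply. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NormForm {
    Nfc,
    Nfkc,
}

/// Picks NFC for cased alphabetic scripts and NFKC for everything else.
/// `script` is assumed to be an ISO 15924 code such as "Latn" or "Arab".
fn norm_form_for_script(script: &str) -> NormForm {
    match script {
        "Latn" | "Grek" | "Cyrl" => NormForm::Nfc,
        _ => NormForm::Nfkc,
    }
}
```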

Here’s an example in Japanese, where preprocessing changes the width (and the case) of a Latin letter that’s used as part of a word:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ja").unwrap();
assert_eq!(standardizer.apply("Ｕターン"), "uターン");

In Korean, NFKC normalization is important because it aligns two different ways of encoding text – as individual letters that are grouped together into square characters, or as the entire syllables that those characters represent:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ko").unwrap();
let word = "\u{1102}\u{1161}\u{11c0}\u{1106}\u{1161}\u{11af}";
assert_eq!(word, "낱말");
assert_eq!(word.chars().count(), 6);
let word = standardizer.apply(word);
assert_eq!(word, "낱말");
assert_eq!(word.chars().count(), 2);
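The Hangul part of this normalization is purely arithmetic, so it can be sketched without any Unicode tables. The function below is a hypothetical illustration of composing conjoining jamo into precomposed syllables, using the constants from the Unicode Standard's Hangul composition algorithm. It covers only leading consonant + vowel (+ optional trailing consonant) sequences and is not a full NFC or NFKC implementation.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;

/// Composes conjoining jamo sequences into precomposed Hangul syllables.
/// Hypothetical helper, not part of this crate.
fn compose_hangul_sketch(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let l = chars[i] as u32;
        // A syllable starts with a leading consonant followed by a vowel.
        if (L_BASE..L_BASE + L_COUNT).contains(&l) && i + 1 < chars.len() {
            let v = chars[i + 1] as u32;
            if (V_BASE..V_BASE + V_COUNT).contains(&v) {
                let mut s = S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT;
                i += 2;
                // An optional trailing consonant completes the syllable.
                if i < chars.len() {
                    let t = chars[i] as u32;
                    if (T_BASE + 1..T_BASE + T_COUNT).contains(&t) {
                        s += t - T_BASE;
                        i += 1;
                    }
                }
                out.push(char::from_u32(s).unwrap_or('\u{FFFD}'));
                continue;
            }
        }
        // Anything that is not the start of a jamo sequence is copied as-is.
        out.push(chars[i]);
        i += 1;
    }
    out
}
```

Applied to the six jamo from the example above, the sketch produces the two syllables U+B0B1 and U+B9D0.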

Abjad mark removal

There are many abjad languages, such as Arabic, Hebrew, Persian, and Urdu, where words can be marked with vowel points but rarely are. In languages that use abjad scripts, we remove all modifiers that are classified by Unicode as “marks”. We also remove an Arabic character called the tatweel, which is used to visually lengthen a word.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ar").unwrap();
assert_eq!(standardizer.apply("كَلِمَة"), "كلمة");

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ar").unwrap();
assert_eq!(standardizer.apply("الحمــــــد"), "الحمد");
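For Arabic, the effect can be approximated with a hard-coded filter. This sketch is an assumption-laden simplification: instead of consulting the Unicode “mark” categories, it drops only the common Arabic vowel points (U+064B through U+065F and U+0670) plus the tatweel (U+0640). The function name is hypothetical.

```rust
/// Removes common Arabic vowel points and the tatweel.
/// Hypothetical helper; a real implementation would check the Unicode
/// general category of each character instead of fixed ranges.
fn strip_arabic_marks_sketch(text: &str) -> String {
    text.chars()
        .filter(|&c| {
            let is_mark = ('\u{064B}'..='\u{065F}').contains(&c) || c == '\u{0670}';
            let is_tatweel = c == '\u{0640}';
            !(is_mark || is_tatweel)
        })
        .collect()
}
```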

Transliteration of multi-script languages

Some languages are written in multiple scripts, and require special care. These languages include Chinese, Serbian, and Azerbaijani.

In Serbian, there is a well-established mapping from Cyrillic letters to Latin letters. We apply this mapping so that Serbian is always represented in Latin letters.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("sr").unwrap();
assert_eq!(standardizer.apply("схваташ"), "shvataš");

The transliteration is more complete than it needs to be to cover just Serbian, so that – for example – borrowings from Russian can be transliterated, instead of coming out in a mixed script.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("sr").unwrap();
assert_eq!(standardizer.apply("культуры"), "kul'tury");
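A table-driven transliteration of this kind might look like the following sketch. It covers only the thirty lowercase letters of the Serbian Cyrillic alphabet, not the extended mapping used for Russian borrowings, and it assumes the input is lowercase. The function name is hypothetical.

```rust
/// Transliterates lowercase Serbian Cyrillic into Serbian Latin.
/// Hypothetical helper; characters outside the table pass through.
fn serbian_to_latin_sketch(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        let latin = match c {
            'а' => "a", 'б' => "b", 'в' => "v", 'г' => "g", 'д' => "d",
            'ђ' => "đ", 'е' => "e", 'ж' => "ž", 'з' => "z", 'и' => "i",
            'ј' => "j", 'к' => "k", 'л' => "l", 'љ' => "lj", 'м' => "m",
            'н' => "n", 'њ' => "nj", 'о' => "o", 'п' => "p", 'р' => "r",
            'с' => "s", 'т' => "t", 'ћ' => "ć", 'у' => "u", 'ф' => "f",
            'х' => "h", 'ц' => "c", 'ч' => "č", 'џ' => "dž", 'ш' => "š",
            // Anything outside the table is copied unchanged.
            other => {
                out.push(other);
                continue;
            }
        };
        out.push_str(latin);
    }
    out
}
```

Because three Cyrillic letters map to Latin digraphs, the table returns string slices rather than single characters.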

Azerbaijani (Azeri) has a similar transliteration step to Serbian, and then the Latin-alphabet text is handled similarly to Turkish.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("az").unwrap();
assert_eq!(standardizer.apply("бағырты"), "bağırtı");

In Chinese, there is a transliteration step from traditional characters to simplified ones.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("zh").unwrap();
assert_eq!(standardizer.apply("愛情"), "爱情");
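Conceptually this step is a character-by-character lookup. The sketch below uses a tiny hypothetical table of three entries purely for illustration; a usable mapping needs thousands of entries, and the function name is an assumption.

```rust
use std::collections::HashMap;

/// Maps traditional characters to simplified ones using a toy table.
/// Hypothetical helper; characters missing from the table pass through.
fn simplify_sketch(text: &str) -> String {
    let table: HashMap<char, char> = HashMap::from([
        ('愛', '爱'),
        ('語', '语'),
        ('國', '国'),
    ]);
    text.chars()
        .map(|c| *table.get(&c).unwrap_or(&c))
        .collect()
}
```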

Differences from the original Python implementation

This struct is a straightforward port of preprocess_text in wordfreq/preprocess.py, but it differs in the following ways:

Chinese transliteration step: The original implementation performs this step during tokenization, but ours supports it in this class, because our library does not support tokenization.
Language tag parsing: Our implementation employs a simple approach to parse language tags, just looking up language::LIKELY_SUBTAGS.
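A lookup-based approach to language tags could be sketched as follows. The exact shape of language::LIKELY_SUBTAGS is not shown on this page, so the table layout, its contents, and the function name here are all assumptions; the script values follow the CLDR likely-subtags data for these languages.

```rust
/// Returns the most likely script for a language tag by looking up its
/// primary language subtag. Hypothetical helper with a toy table.
fn likely_script_sketch(language_tag: &str) -> Option<&'static str> {
    // Hypothetical excerpt: (language, likely script) pairs.
    const TABLE: &[(&str, &str)] = &[
        ("ar", "Arab"),
        ("de", "Latn"),
        ("en", "Latn"),
        ("ja", "Jpan"),
        ("ko", "Kore"),
        ("sr", "Cyrl"),
        ("zh", "Hans"),
    ];
    // Keep only the part before the first '-' or '_', e.g. "en" in "en-US".
    let primary = language_tag
        .split(|c: char| c == '-' || c == '_')
        .next()?
        .to_ascii_lowercase();
    TABLE
        .iter()
        .find(|(lang, _)| *lang == primary.as_str())
        .map(|(_, script)| *script)
}
```

Returning `Option` leaves it to the caller to decide how an unknown language should be reported.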

Implementations

impl Standardizer

pub fn new(language_tag: &str) -> Result<Self>

pub fn apply(&self, text: &str) -> String

Trait Implementations

impl Clone for Standardizer

fn clone(&self) -> Standardizer

fn clone_from(&mut self, source: &Self)

Auto Trait Implementations

impl RefUnwindSafe for Standardizer

impl Send for Standardizer

impl Sync for Standardizer

impl Unpin for Standardizer

impl UnwindSafe for Standardizer

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>

fn into(self) -> U

impl<T> ToOwned for T where T: Clone

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>