pub struct Standardizer { /* private fields */ }

This struct provides preprocessing steps that convert word forms considered equivalent into one standardized form.

As one straightforward step, it case-folds the text. For the purposes of wordfreq and related tools, a capitalized word shouldn’t have a different frequency from its lowercase version.

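The steps, applied in this order (only some of which apply to each language), are:

  • NFC or NFKC normalization, as needed for the language
  • Transliteration of multi-script languages
  • Abjad mark removal
  • Case folding
  • Fixing of diacritics

We’ll describe these steps out of order, to start with the more obvious steps.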

Case folding

The most common effect of this function is that it case-folds alphabetic text to lowercase:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("en").unwrap();
assert_eq!(standardizer.apply("Word"), "word");

This is proper Unicode-aware case-folding, so it eliminates distinctions in lowercase letters that would not appear in uppercase. This accounts for the German ß and the Greek final sigma:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("de").unwrap();
assert_eq!(standardizer.apply("groß"), "gross");

use wordfreq::Standardizer;
let standardizer = Standardizer::new("el").unwrap();
assert_eq!(standardizer.apply("λέξις"), "λέξισ");

In Turkish (and Azerbaijani), case-folding is different because the uppercase and lowercase I each come in two variants, one with a dot and one without. They are matched in a way that preserves the number of dots, which the usual pairing of “I” and “i” does not.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("tr").unwrap();
assert_eq!(standardizer.apply("HAKKINDA İSTANBUL"), "hakkında istanbul");
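
The dot-preserving pairing can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `turkish_lowercase` is a hypothetical helper, and it performs lowercasing rather than full case folding.

```rust
/// Lowercases text using the Turkish/Azerbaijani pairing of dotted and dotless I.
/// Hypothetical helper for illustration; not part of the `wordfreq` API.
fn turkish_lowercase(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            // Dotted capital I (U+0130) pairs with the ordinary dotted "i".
            '\u{0130}' => out.push('i'),
            // Plain capital "I" pairs with dotless "ı" (U+0131).
            'I' => out.push('\u{0131}'),
            // Every other character uses the default Unicode lowercase mapping.
            _ => out.extend(ch.to_lowercase()),
        }
    }
    out
}
```

The two special cases have to be handled before the default mapping, because the default lowercase of U+0130 is "i" followed by a combining dot above (U+0307), and the default lowercase of "I" is the dotted "i".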

Fixing of diacritics

While we’re talking about Turkish: the Turkish alphabet contains letters with cedillas attached to the bottom. In the case of “ş” and “ţ”, these letters are very similar to two Romanian letters, “ș” and “ț”, which have separate commas below them.

(Did you know that a cedilla is not the same as a comma under a letter? I didn’t until I started dealing with text normalization. My keyboard layout even inputs a letter with a cedilla when you hit Compose+comma.)

Because these letters look so similar, and because some fonts only include one pair of letters and not the other, there are many cases where the letters are confused with each other. Our preprocessing normalizes these Turkish and Romanian letters to the letters each language prefers.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("tr").unwrap();
assert_eq!(standardizer.apply("kișinin"), "kişinin");

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ro").unwrap();
assert_eq!(standardizer.apply("ACELAŞI"), "același");
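
As a rough sketch of what this step amounts to, the two directions can be written as plain character replacements. The function names below are illustrative assumptions, and the sketch covers only lowercase letters, on the assumption that case folding has already run.

```rust
/// Turkish prefers cedillas: rewrite comma-below letters as cedilla letters.
/// Illustrative helper; assumes the text is already lowercase.
fn commas_to_cedillas(text: &str) -> String {
    text.replace('\u{0219}', "\u{015F}") // s with comma below -> s with cedilla
        .replace('\u{021B}', "\u{0163}") // t with comma below -> t with cedilla
}

/// Romanian prefers commas below: rewrite cedilla letters as comma-below letters.
/// Illustrative helper; assumes the text is already lowercase.
fn cedillas_to_commas(text: &str) -> String {
    text.replace('\u{015F}', "\u{0219}") // s with cedilla -> s with comma below
        .replace('\u{0163}', "\u{021B}") // t with cedilla -> t with comma below
}
```

Writing the letters as `\u{...}` escapes keeps the two visually similar forms distinguishable in source code.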

Unicode normalization

Unicode text is NFC normalized in most languages, removing trivial distinctions between strings that should be considered equivalent in all cases:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("de").unwrap();
let word = standardizer.apply("natu\u{0308}rlich");
assert!(word.contains("ü"));

NFC normalization is sufficient (and NFKC normalization is a bit too strong) for many languages that are written in cased, alphabetic scripts. Languages in other scripts tend to need stronger normalization to properly compare text. So we use NFC normalization when the language’s script is Latin, Greek, or Cyrillic, and we use NFKC normalization for all other languages.
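
The script rule above can be sketched as a small lookup. The `NormalForm` enum and `normal_form_for_script` function are hypothetical names, and representing the script as an ISO 15924 code such as "Latn" is an assumption made for this illustration; the sketch only selects the form and does not perform normalization itself.

```rust
/// Which Unicode normalization form to apply (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NormalForm {
    Nfc,
    Nfkc,
}

/// Picks the normalization form from an ISO 15924 script code.
/// Assumption: the caller has already resolved the language tag to a script.
fn normal_form_for_script(script: &str) -> NormalForm {
    match script {
        // Cased, alphabetic scripts: NFC is sufficient.
        "Latn" | "Grek" | "Cyrl" => NormalForm::Nfc,
        // Everything else gets the stronger compatibility normalization.
        _ => NormalForm::Nfkc,
    }
}
```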

Here’s an example in Japanese, where preprocessing changes the width (and the case) of a Latin letter that’s used as part of a word:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ja").unwrap();
// \u{ff35} is FULLWIDTH LATIN CAPITAL LETTER U.
assert_eq!(standardizer.apply("\u{ff35}ターン"), "uターン");

In Korean, NFKC normalization is important because it aligns two different ways of encoding text – as individual letters that are grouped together into square characters, or as the entire syllables that those characters represent:

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ko").unwrap();
let word = "\u{1102}\u{1161}\u{11c0}\u{1106}\u{1161}\u{11af}";
assert_eq!(word, "낱말");
assert_eq!(word.chars().count(), 6);
let word = standardizer.apply(word);
assert_eq!(word, "낱말");
assert_eq!(word.chars().count(), 2);
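
The composition of individual letters into syllables is arithmetic defined by the Unicode Standard. The sketch below shows that arithmetic for a single syllable; `compose_hangul` is a hypothetical helper written for this illustration and is not how the crate performs normalization.

```rust
// Constants from the Unicode Standard's Hangul syllable composition algorithm.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;

/// Composes conjoining jamo into one precomposed syllable.
/// Returns `None` if any argument is outside the modern jamo ranges.
fn compose_hangul(lead: char, vowel: char, trail: Option<char>) -> Option<char> {
    let l = (lead as u32).checked_sub(L_BASE).filter(|&i| i < L_COUNT)?;
    let v = (vowel as u32).checked_sub(V_BASE).filter(|&i| i < V_COUNT)?;
    let t = match trail {
        // Trailing index 0 means "no trailing consonant", so a real one is 1..=27.
        Some(c) => (c as u32)
            .checked_sub(T_BASE)
            .filter(|&i| i > 0 && i < T_COUNT)?,
        None => 0,
    };
    char::from_u32(S_BASE + (l * V_COUNT + v) * T_COUNT + t)
}
```

Applied to the six code points in the example above, the first three compose to U+B0B1 and the last three to U+B9D0, which is why the character count drops from 6 to 2.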

Abjad mark removal

There are many abjad languages, such as Arabic, Hebrew, Persian, and Urdu, where words can be marked with vowel points but rarely are. In languages that use abjad scripts, we remove all modifiers that are classified by Unicode as “marks”. We also remove an Arabic character called the tatweel, which is used to visually lengthen a word.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ar").unwrap();
assert_eq!(standardizer.apply("كَلِمَة"), "كلمة");

use wordfreq::Standardizer;
let standardizer = Standardizer::new("ar").unwrap();
assert_eq!(standardizer.apply("الحمــــــد"), "الحمد");
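
A simplified sketch of this step for Arabic follows. The standard library has no lookup for Unicode general categories, so this hypothetical `strip_arabic_marks` helper hard-codes only the common vowel points (U+064B to U+065F and U+0670) and the tatweel (U+0640). It is narrower than removing everything Unicode classifies as a mark.

```rust
/// Removes the tatweel and the common Arabic vowel points from `text`.
/// Simplified illustration: only a fixed subset of Unicode marks is covered.
fn strip_arabic_marks(text: &str) -> String {
    text.chars()
        .filter(|&ch| {
            let is_tatweel = ch == '\u{0640}';
            let is_vowel_point = matches!(ch, '\u{064B}'..='\u{065F}' | '\u{0670}');
            !(is_tatweel || is_vowel_point)
        })
        .collect()
}
```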

Transliteration of multi-script languages

Some languages are written in multiple scripts, and require special care. These languages include Chinese, Serbian, and Azerbaijani.

In Serbian, there is a well-established mapping from Cyrillic letters to Latin letters. We apply this mapping so that Serbian is always represented in Latin letters.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("sr").unwrap();
assert_eq!(standardizer.apply("схваташ"), "shvataš");
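
For illustration, the Serbian mapping can be sketched as a per-character table. The hypothetical `serbian_cyrillic_to_latin` helper below covers only the 30 lowercase letters of the Serbian alphabet, assuming case folding has already run, and passes every other character through unchanged. The crate's own table is broader, as noted below.

```rust
/// Transliterates lowercase Serbian Cyrillic to Serbian Latin.
/// Illustrative helper; characters outside the table are left unchanged.
fn serbian_cyrillic_to_latin(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        let latin = match ch {
            'а' => "a", 'б' => "b", 'в' => "v", 'г' => "g", 'д' => "d",
            'ђ' => "đ", 'е' => "e", 'ж' => "ž", 'з' => "z", 'и' => "i",
            'ј' => "j", 'к' => "k", 'л' => "l", 'љ' => "lj", 'м' => "m",
            'н' => "n", 'њ' => "nj", 'о' => "o", 'п' => "p", 'р' => "r",
            'с' => "s", 'т' => "t", 'ћ' => "ć", 'у' => "u", 'ф' => "f",
            'х' => "h", 'ц' => "c", 'ч' => "č", 'џ' => "dž", 'ш' => "š",
            // Not a Serbian Cyrillic letter: copy it through as-is.
            other => {
                out.push(other);
                continue;
            }
        };
        out.push_str(latin);
    }
    out
}
```

The table maps to string slices rather than single characters because three Cyrillic letters (љ, њ, џ) become Latin digraphs.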

The transliteration is more complete than it needs to be to cover just Serbian, so that – for example – borrowings from Russian can be transliterated, instead of coming out in a mixed script.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("sr").unwrap();
assert_eq!(standardizer.apply("культуры"), "kul'tury");

Azerbaijani (Azeri) has a similar transliteration step to Serbian, and then the Latin-alphabet text is handled similarly to Turkish.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("az").unwrap();
assert_eq!(standardizer.apply("бағырты"), "bağırtı");

In Chinese, there is a transliteration step from traditional characters to simplified ones.

use wordfreq::Standardizer;
let standardizer = Standardizer::new("zh").unwrap();
assert_eq!(standardizer.apply("愛情"), "爱情");

Differences from the original Python implementation

This struct is a straightforward port of preprocess_text in wordfreq/preprocess.py, but differs in the following ways:

  • Chinese transliteration step: The original implementation performs this step during tokenization. This port performs it in Standardizer instead, because the library does not support tokenization.
  • Language tag parsing: This port parses language tags with a simple approach that just looks them up in language::LIKELY_SUBTAGS.

Implementations

impl Standardizer

pub fn new(language_tag: &str) -> Result<Self>

Creates a new Standardizer for the given language.

Arguments

pub fn apply(&self, text: &str) -> String

Standardizes the given text.

Trait Implementations

impl Clone for Standardizer

fn clone(&self) -> Standardizer

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.