About
Stop words are words that carry little meaning on their own and are typically removed as a preprocessing step before text analysis or natural language processing. This crate bundles common stop words for a variety of languages, drawing its lists from this resource and from NLTK.
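To make the preprocessing step concrete, here is a minimal sketch of stop word removal that uses only the standard library. The function name remove_stop_words and the short word list in the example are illustrative assumptions, not part of this crate; in practice the list would come from the crate itself.

```rust
use std::collections::HashSet;

/// Removes stop words from `text`, comparing case-insensitively.
/// `stop_words` is any caller-supplied list; this helper is a hypothetical
/// illustration and not a function exported by the crate.
fn remove_stop_words(text: &str, stop_words: &[&str]) -> Vec<String> {
    // A set gives constant-time membership checks.
    let stops: HashSet<String> = stop_words.iter().map(|w| w.to_lowercase()).collect();

    text.split_whitespace()
        // Trim surrounding punctuation so "den." matches "den".
        .map(|token| token.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        // Drop empty tokens and anything found in the stop word set.
        .filter(|token| !token.is_empty() && !stops.contains(token))
        .collect()
}
```

For example, calling remove_stop_words("The quick fox is in the den.", &["the", "is", "in"]) yields the tokens quick, fox, and den.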
This crate currently includes the following languages:
- Arabic
- Azerbaijani
- Bulgarian
- Catalan
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Indonesian
- Italian
- Kazakh
- Nepali
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tajik
- Turkish
- Ukrainian
- Vietnamese
Installation
This is a library crate, so add it as a dependency instead of installing it. From your project directory, run:
cargo add stop_words
Or add it to your Cargo.toml manually with:
[dependencies]
stop_words = "0.3.2"
and add this to your crate root:
use stop_words;
Usage
Using this crate is fairly straightforward:
use stop_words;
let words = stop_words::get("english");
The get function returns stop words for any of the languages listed above, drawing on
this resource and falling back to
NLTK if the target language doesn't exist in the former. If you specifically want the stop
words from NLTK, use get_nltk instead:
let words = stop_words::get_nltk("english");
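The fallback behaviour described above can be sketched with two lookup tables. This is only an illustration under stated assumptions: the function lookup_with_fallback and both maps are hypothetical stand-ins for the two bundled sources, and the crate's actual storage and lookup logic may differ.

```rust
use std::collections::HashMap;

/// Looks up `language` in `primary` first and falls back to `secondary`.
/// Both maps are hypothetical stand-ins for the two word list sources.
fn lookup_with_fallback(
    language: &str,
    primary: &HashMap<&str, Vec<&str>>,
    secondary: &HashMap<&str, Vec<&str>>,
) -> Option<Vec<String>> {
    primary
        .get(language)
        // Only consult the secondary source when the primary has no entry.
        .or_else(|| secondary.get(language))
        // Convert to owned strings, mirroring the Vec<String> return type.
        .map(|words| words.iter().map(|w| w.to_string()).collect())
}
```

A language present in both maps resolves to the primary list, one present only in the secondary map resolves to that list, and an unknown language yields None.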
Both get and get_nltk accept full language names (in English), ISO 639-1 language codes (2-letter codes), and
ISO 639-2/T language codes (3-letter codes). This means you can also do this:
let words = stop_words::get("en");
or this:
let words = stop_words::get("eng");
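One way to picture how a full name, a 2-letter code, and a 3-letter code can all refer to the same list is a small normalization step. The helper below is a hypothetical sketch that covers three languages only; canonical_language is not part of this crate's API, and the crate may resolve identifiers differently.

```rust
/// Maps a full English name, a 2-letter code, or a 3-letter code to one
/// canonical name. Hypothetical helper for illustration only.
fn canonical_language(input: &str) -> Option<&'static str> {
    // Normalize whitespace and case before matching.
    match input.trim().to_lowercase().as_str() {
        "english" | "en" | "eng" => Some("english"),
        "french" | "fr" | "fra" => Some("french"),
        "german" | "de" | "deu" => Some("german"),
        // Anything else is unknown to this sketch.
        _ => None,
    }
}
```

With this sketch, "en", "eng", and "English" all resolve to the same canonical name, while an unrecognized identifier returns None.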
Finally, if you prefer to have a HashSet<String> of words instead of a Vec<String>, you can do this:
use std::collections::HashSet;
let vec = stop_words::get("english");
let set: HashSet<String> = vec.into_iter().collect();
So you want to use enums?
That's easy too! All of the above functionality still works; just add the following to your Cargo.toml instead of what is shown above:
[dependencies]
stop_words = { version = "0.3.2", features = ["enum"] }
And then swap out the language string (e.g., "english") with the LANGUAGE enum (e.g., stop_words::LANGUAGE::English).
use stop_words;
let words = stop_words::get(stop_words::LANGUAGE::English);
Of course, there are benefits and downsides to using this version. The primary benefit is that the enum is self-documenting, so it becomes much less likely that you will misspell a language name. The downside is that the enum implementation relies on strum, which adds some size to the build.
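The misspelling argument can be illustrated with a small hand-written enum. This is a sketch under stated assumptions: the Language type, its three variants, and the describe function are hypothetical, and they do not reproduce the crate's LANGUAGE enum or its strum-based implementation.

```rust
/// Hypothetical three-variant enum for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    English,
    French,
    German,
}

impl Language {
    /// Returns the string form that a string-based API would expect.
    fn as_str(self) -> &'static str {
        match self {
            Language::English => "english",
            Language::French => "french",
            Language::German => "german",
        }
    }
}

/// Accepts only valid variants, so a typo such as Language::Englsh is a
/// compile error rather than a problem discovered at run time.
fn describe(language: Language) -> String {
    format!("stop words for {}", language.as_str())
}
```

Here describe(Language::English) produces "stop words for english", whereas a misspelled string such as "englsh" would compile without complaint in a string-based call.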