About
Stop words are words that carry little meaning on their own and are typically removed as a preprocessing step before text analysis or natural language processing. This crate provides common stop words for a variety of languages, using the lists from this resource and from NLTK.
This crate currently includes the following languages:
- Arabic
- Azerbaijani
- Bulgarian
- Catalan
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Indonesian
- Italian
- Kazakh
- Nepali
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tajik
- Turkish
- Ukrainian
- Vietnamese
Installation
This crate is a library, so rather than running `cargo install` (which only installs binaries), add it from crates.io as a dependency in your `Cargo.toml`:
[dependencies]
stop_words = "0.2.1"
and add this to your crate root:
use stop_words;
Usage
Using this crate is fairly straightforward:
use stop_words;
The function get returns stop words for any of the languages listed above, drawing on
this resource and falling back to
NLTK if the target language doesn't exist in the former. If you specifically want the stop
words from NLTK, that's easy too:
let words = stop_words::get_nltk("en");
Both get and get_nltk accept full language names (in English), ISO 639-1 language codes (2-letter codes), and
ISO 639-2/T language codes (3-letter codes). This means you can also do this:
let words = stop_words::get("en");
or this:
let words = stop_words::get("eng");
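To make the three accepted input forms concrete, here is a minimal sketch of how a language name or ISO 639 code could be mapped to a single canonical code. The function `normalize_language` is hypothetical and covers only three languages; it is not the crate's implementation, and the trimming and case-insensitive matching are assumptions for illustration.

```rust
/// Maps a full English language name, an ISO 639-1 code, or an
/// ISO 639-2/T code to the ISO 639-1 code.
/// Hypothetical helper for illustration only; not part of the crate.
fn normalize_language(input: &str) -> Option<&'static str> {
    // Assumption: surrounding whitespace and letter case are ignored.
    match input.trim().to_lowercase().as_str() {
        "english" | "en" | "eng" => Some("en"),
        "french" | "fr" | "fra" => Some("fr"),
        "german" | "de" | "deu" => Some("de"),
        // Anything else is unknown to this tiny table.
        _ => None,
    }
}
```

With a table like this, "English", "en", and "eng" all resolve to the same list, which is why the three spellings are interchangeable from the caller's point of view.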
Finally, you can convert the Vec<String> of words to a HashSet<String>. I'm not here to judge you.
let vec = stop_words::get("en");
let set = stop_words::vec_to_set(vec);
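A set is handy because removing stop words boils down to a membership test per token. The sketch below shows one way that filtering step might look using only the standard library. The names `sample_stop_words` and `remove_stop_words` are hypothetical, the short word list is a hand-written stand-in for what the crate would return, and splitting on whitespace with lowercasing is a simplifying assumption rather than a recommended tokenizer.

```rust
use std::collections::HashSet;

/// Stand-in for the crate's output: a tiny hand-written English list.
/// Hypothetical; a real program would get this Vec<String> from the crate.
fn sample_stop_words() -> Vec<String> {
    ["the", "on", "a", "is"].iter().map(|w| w.to_string()).collect()
}

/// Returns the tokens of `text` that are not in `stop_words`.
/// Tokens are split on whitespace and lowercased before the lookup.
fn remove_stop_words(text: &str, stop_words: &HashSet<String>) -> Vec<String> {
    text.split_whitespace()
        .map(|token| token.to_lowercase())
        .filter(|token| !stop_words.contains(token))
        .collect()
}
```

Collecting the `Vec<String>` into a `HashSet<String>` once and reusing it keeps each lookup at roughly constant time, whereas calling `contains` on the vector would scan the whole list for every token.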