Rust-based Natural Language Toolkit (rsnltk)
A Rust library to support natural language processing with pure Rust implementation and Python bindings
Rust Docs | Crates Home Page | Tests | NER-Kit
Features
The `rsnltk` library integrates various existing Python-written NLP toolkits for powerful text analysis in Rust-based applications.
Functions
This toolkit is built on the Python-written Stanza package and other important NLP crates.
The functions from Stanza and others that we bind here include:
- Tokenize
- Sentence Segmentation
- Multi-Word Token Expansion
- Part-of-Speech & Morphological Features
- Named Entity Recognition
- Sentiment Analysis
- Language Identification
- Dependency Tree Analysis
Some excellent crates are also included in `rsnltk`, with simplified APIs for practical use.
Additionally, we can calculate the similarity between words based on WordNet through the `semantic-kit` PyPI project, installed via `pip install semantic-kit`.
Installation
1. Make sure Python 3.6.6+ and pip are installed on your computer. Typing `python -V` in the terminal should print no error message;
2. Install our Python-based ner-kit package (version >= 0.0.5a2), which binds the Stanza package, via `pip install ner-kit==0.0.5a2`;
3. Rust should also be installed on your computer. I use IntelliJ to develop Rust-based applications, where you can write Rust code;
4. Create a simple Rust application project with a `main()` function;
5. Add the `rsnltk` dependency to the `Cargo.toml` file, keeping it at the latest version;
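The `rsnltk` dependency entry in `Cargo.toml` can look like the following (the version number is illustrative — check the crate's page on crates.io for the latest release):

```toml
[dependencies]
rsnltk = "0.1.1"
```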
6. After you add the `rsnltk` dependency to the `Cargo.toml` file, install the necessary language models from Stanza the first time you use this package. Alternatively, you can manually install those language models via the Python-written ner-kit package, which provides more features for using Stanza. Go to: ner-kit
If no error occurs in the steps above, the installation works. Finally, you can try the following advanced example usages.
Currently, we have tested the English and Chinese language models; however, other language models should work as well.
Examples with Stanza Bindings
Example 1: Part-of-speech Analysis
Example 2: Sentiment Analysis
Example 3: Named Entity Recognition
Example 4: Tokenize for Multiple Languages
Example 5: Tokenize Sentence
Example 6: Language Identification
Example 7: MWT expand
Example 8: Estimate the similarity between words in WordNet
You need to firstly install semantic-kit
PyPI package!
Example 9: Obtain a dependency tree from a text
Examples in Pure Rust
Example 1: Word2Vec similarity
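The crate's Word2Vec example requires a trained embedding model. To illustrate the underlying computation, here is a self-contained sketch of cosine similarity — the measure that word2vec-style similarity functions compute between embedding vectors. The `cosine_similarity` helper and the toy 3-dimensional vectors are our own illustration, not part of the `rsnltk` API.

```rust
/// Cosine similarity between two embedding vectors:
/// dot(a, b) / (|a| * |b|), in [-1, 1] for real embeddings.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Toy 3-d "embeddings"; a real word2vec model supplies
    // vectors with hundreds of dimensions per word.
    let king = [0.9, 0.8, 0.1];
    let queen = [0.85, 0.75, 0.2];
    let apple = [0.1, 0.2, 0.9];
    println!("king~queen: {:.3}", cosine_similarity(&king, &queen));
    println!("king~apple: {:.3}", cosine_similarity(&king, &apple));
}
```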
Example 2: Text summarization
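To show the idea behind extractive summarization, here is a minimal self-contained sketch: score each sentence by the corpus frequency of its words and keep the top `k` sentences in their original order. The `summarize` function below is our own toy illustration, not the crate's summarizer (which, for instance, also takes a stopword list).

```rust
use std::collections::HashMap;

/// Tiny frequency-based extractive summarizer:
/// keep the `k` highest-scoring sentences, in original order.
fn summarize(text: &str, k: usize) -> String {
    // Split into sentences on ., !, ? and drop empty fragments.
    let sentences: Vec<&str> = text
        .split(|c: char| c == '.' || c == '!' || c == '?')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();
    // Count word frequencies over the whole text.
    let mut freq: HashMap<String, usize> = HashMap::new();
    for s in &sentences {
        for w in s.split_whitespace() {
            *freq.entry(w.to_lowercase()).or_insert(0) += 1;
        }
    }
    // Score each sentence by the summed frequency of its words.
    let mut scored: Vec<(usize, usize)> = sentences
        .iter()
        .enumerate()
        .map(|(i, s)| (i, s.split_whitespace().map(|w| freq[&w.to_lowercase()]).sum()))
        .collect();
    scored.sort_by(|a, b| b.1.cmp(&a.1)); // stable sort, highest score first
    let mut keep: Vec<usize> = scored.into_iter().take(k).map(|(i, _)| i).collect();
    keep.sort(); // restore original sentence order
    keep.iter().map(|&i| sentences[i]).collect::<Vec<_>>().join(". ")
}

fn main() {
    let text = "the cat sat. the cat ran. dogs bark.";
    println!("{}", summarize(text, 1)); // prints "the cat sat"
}
```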
Example 3: Get a token list from English strings (via the `get_token_list` function)
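As a rough illustration of what English tokenization involves, here is a self-contained sketch that splits on whitespace and peels punctuation off token boundaries. The `get_tokens` helper is our own toy function, not the crate's `get_token_list`.

```rust
/// A minimal English tokenizer: split on whitespace, then trim
/// leading/trailing punctuation from each token. Interior
/// characters (e.g. the apostrophe in "It's") are preserved.
fn get_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|w| !w.is_empty())
        .map(|w| w.to_string())
        .collect()
}

fn main() {
    let tokens = get_tokens("Hello, world! It's fine.");
    println!("{:?}", tokens); // ["Hello", "world", "It's", "fine"]
}
```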
Example 4: Word segmentation for languages where no space exists between terms, e.g. Chinese text
We implement three word segmentation methods in this version:
- Forward Maximum Matching (fmm), the baseline method
- Backward Maximum Matching (bmm), generally considered better than fmm
- Bidirectional Maximum Matching (bimm), higher accuracy but lower speed
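To make the three methods concrete, here is a self-contained sketch of the baseline, Forward Maximum Matching: starting at the current position, greedily take the longest dictionary word (falling back to a single character). Backward Maximum Matching does the same from the end of the string, and bimm runs both and picks the better segmentation. The `fmm_segment` function and its toy dictionary are our own illustration, not the crate's API.

```rust
use std::collections::HashSet;

/// Forward Maximum Matching: at each position, match the longest
/// dictionary entry (up to `max_len` characters), else emit one
/// character. Works on Unicode scalar values, so multi-byte
/// scripts such as Chinese are handled correctly.
fn fmm_segment(text: &str, dict: &HashSet<String>, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut result = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut matched = 1; // fall back to a single character
        let upper = max_len.min(chars.len() - i);
        for len in (1..=upper).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if dict.contains(&candidate) {
                matched = len;
                break;
            }
        }
        result.push(chars[i..i + matched].iter().collect());
        i += matched;
    }
    result
}

fn main() {
    // Toy dictionary; a real segmenter ships a large lexicon.
    let dict: HashSet<String> = ["我们", "在", "野生动物园", "野生", "动物", "玩"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let tokens = fmm_segment("我们在野生动物园玩", &dict, 5);
    println!("{:?}", tokens); // ["我们", "在", "野生动物园", "玩"]
}
```

Note the classic weakness of the greedy forward pass: because it always prefers the longest match from the left, it can mis-segment ambiguous spans that bmm or bimm would resolve correctly.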
Credits
Thanks to the Stanford NLP Group for their hard work on Stanza.
License
The `rsnltk` library is provided by Donghua Chen under the MIT License.