# vtext

NLP in Rust with Python bindings
This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications.
The API is currently unstable.
## Features
- Tokenization: regexp tokenizer, Unicode segmentation + language-specific rules
- Stemming: Snowball (in Python, 15-20x faster than NLTK); see the sketch after this list
- Analyzers (planned): word and character n-grams, skip grams
- Token counting: converting token counts to sparse matrices for use in machine learning libraries, similar to `CountVectorizer` and `HashingVectorizer` in scikit-learn
- Feature weighting (planned): feature weighting based on document frequency (TF-IDF), feature normalization
- Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
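
As a rough illustration of the stemming API mentioned above, here is a minimal sketch assuming `vtext.stem.SnowballStemmer` accepts a `lang` argument as described in the documentation (the exact language identifier, e.g. `"en"` vs `"english"`, may differ):

```python
# Minimal sketch; assumes vtext.stem.SnowballStemmer takes a `lang` argument
# (check the vtext documentation for the exact language identifier).
from vtext.stem import SnowballStemmer

stemmer = SnowballStemmer(lang="english")
print([stemmer.stem(word) for word in ["flights", "departing", "tokenization"]])
```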
## Usage
### Usage in Python
vtext requires Python 3.5+ and can be installed with,

```bash
pip install --pre vtext
```
Below is a simple tokenization example,

```python
>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer(lang="en").tokenize("Flights can't depart after 2:00 pm.")
```
For more details see the project documentation: vtext.io/doc/latest/index.html
### Usage in Rust
Add the following to Cargo.toml,

```toml
[dependencies]
vtext = "0.1.0-alpha.1"
```
For more details see the Rust documentation: docs.rs/vtext
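
As a rough sketch of what calling the tokenizer from Rust could look like (the `VTextTokenizer` constructor and the iterator-based `tokenize` method shown here are assumptions, not a verified excerpt of the 0.1.0-alpha.1 API; check docs.rs/vtext for the version you install):

```rust
// Hypothetical sketch; item paths and constructor signatures are assumptions
// and may not match the published crate API exactly.
use vtext::tokenize::VTextTokenizer;

fn main() {
    let tokenizer = VTextTokenizer::new("en");
    let tokens: Vec<&str> = tokenizer
        .tokenize("Flights can't depart after 2:00 pm.")
        .collect();
    println!("{:?}", tokens);
}
```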
## Benchmarks
### Tokenization
The following benchmarks illustrate tokenization accuracy (F1 score) on UD treebanks,
| lang | dataset | regexp | spacy 2.1 | vtext |
|---|---|---|---|---|
| en | EWT | 0.812 | 0.972 | 0.966 |
| en | GUM | 0.881 | 0.989 | 0.996 |
| de | GSD | 0.896 | 0.944 | 0.964 |
| fr | Sequoia | 0.844 | 0.968 | 0.971 |
and the English tokenization speed,
| | regexp | spacy 2.1 | vtext |
|---|---|---|---|
| Speed (10⁶ tokens/s) | 3.1 | 0.14 | 2.1 |
### Text vectorization
Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset,
| Speed (MB/s) | scikit-learn 0.20.1 | vtext 0.1.0a1 |
|---|---|---|
| CountVectorizer | 14 | 35 |
| HashingVectorizer | 19 | 68 |
See benchmarks/README.md for more details.
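
For context, here is a minimal sketch of the kind of comparison these numbers come from, assuming `vtext.vectorize.CountVectorizer` mirrors scikit-learn's fit/transform interface (as documented); the data loading and timing below are illustrative, not the actual benchmark script:

```python
# Illustrative sketch only; the real benchmark lives in benchmarks/README.md.
# Assumes vtext.vectorize.CountVectorizer exposes fit_transform like scikit-learn.
from time import perf_counter

from sklearn.datasets import fetch_20newsgroups
from vtext.vectorize import CountVectorizer

docs = fetch_20newsgroups(subset="train").data
size_mb = sum(len(doc.encode("utf-8")) for doc in docs) / 1e6

start = perf_counter()
X = CountVectorizer().fit_transform(docs)
elapsed = perf_counter() - start

print(f"{size_mb / elapsed:.1f} MB/s, matrix shape {X.shape}")
```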
## License
vtext is released under the Apache License, Version 2.0.