# vtext

NLP in Rust with Python bindings
This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications.
The API is currently unstable.
## Features
- Tokenization: regexp tokenizer, Unicode segmentation + language-specific rules
- Stemming: Snowball (in Python, 15-20x faster than NLTK); see the sketch after this list
- Analyzers (planned): word and character n-grams, skip grams
- Token counting: converting token counts to sparse matrices for use in machine learning libraries, similar to `CountVectorizer` and `HashingVectorizer` in scikit-learn
- Feature weighting (planned): feature weighting based on document frequency (TF-IDF), feature normalization
- Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
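
As a rough illustration of the stemming API mentioned above, here is a minimal sketch assuming `vtext.stem.SnowballStemmer` accepts a `lang` argument as described in the documentation (the exact language identifier, e.g. `"en"` vs `"english"`, may differ):

```python
# Minimal sketch; assumes vtext.stem.SnowballStemmer takes a `lang` argument
# (check the vtext documentation for the exact language identifier).
from vtext.stem import SnowballStemmer

stemmer = SnowballStemmer(lang="english")
print([stemmer.stem(word) for word in ["flights", "departing", "tokenization"]])
```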
## Usage
### Usage in Python
vtext requires Python 3.5+ and can be installed with,

```bash
pip install --pre vtext
```
Below is a simple tokenization example,

```python
>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer(lang="en").tokenize("Flights can't depart after 2:00 pm.")
```
For more details see the project documentation: vtext.io/doc/latest/index.html
### Usage in Rust
Add the following to Cargo.toml,

```toml
[dependencies]
vtext = "0.1.0-alpha.1"
```
For more details see the Rust documentation: docs.rs/vtext
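
As a rough sketch of what calling the tokenizer from Rust could look like (the `VTextTokenizer` constructor and the iterator-based `tokenize` method shown here are assumptions, not a verified excerpt of the 0.1.0-alpha.1 API; check docs.rs/vtext for the version you install):

```rust
// Hypothetical sketch; item paths and constructor signatures are assumptions
// and may not match the published crate API exactly.
use vtext::tokenize::VTextTokenizer;

fn main() {
    let tokenizer = VTextTokenizer::new("en");
    let tokens: Vec<&str> = tokenizer
        .tokenize("Flights can't depart after 2:00 pm.")
        .collect();
    println!("{:?}", tokens);
}
```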
## Benchmarks
### Tokenization
The following benchmarks illustrate tokenization accuracy (F1 score) on UD treebanks,
| lang | dataset | regexp | spacy 2.1 | vtext |
|---|---|---|---|---|
| en | EWT | 0.812 | 0.972 | 0.966 |
| en | GUM | 0.881 | 0.989 | 0.996 |
| de | GSD | 0.896 | 0.944 | 0.964 |
| fr | Sequoia | 0.844 | 0.968 | 0.971 |
and the English tokenization speed,
| | regexp | spacy 2.1 | vtext |
|---|---|---|---|
| Speed (10⁶ tokens/s) | 3.1 | 0.14 | 2.1 |
### Text vectorization
Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset,
| Speed (MB/s) | scikit-learn 0.20.1 | vtext 0.1.0a1 |
|---|---|---|
| CountVectorizer | 14 | 35 |
| HashingVectorizer | 19 | 68 |
See benchmarks/README.md for more details.
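
For context, here is a minimal sketch of the kind of comparison these numbers come from, assuming `vtext.vectorize.CountVectorizer` mirrors scikit-learn's fit/transform interface (as documented); the data loading and timing below are illustrative, not the actual benchmark script:

```python
# Illustrative sketch only; the real benchmark lives in benchmarks/README.md.
# Assumes vtext.vectorize.CountVectorizer exposes fit_transform like scikit-learn.
from time import perf_counter

from sklearn.datasets import fetch_20newsgroups
from vtext.vectorize import CountVectorizer

docs = fetch_20newsgroups(subset="train").data
size_mb = sum(len(doc.encode("utf-8")) for doc in docs) / 1e6

start = perf_counter()
X = CountVectorizer().fit_transform(docs)
elapsed = perf_counter() - start

print(f"{size_mb / elapsed:.1f} MB/s, matrix shape {X.shape}")
```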
## License
vtext is released under the Apache License, Version 2.0.