Crate vtext

NLP in Rust with Python bindings

This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.

The API is currently unstable.

Features

  • Tokenization: regexp tokenizer, Unicode segmentation + language-specific rules
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with a narrower feature set.
  • String metrics: Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
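To illustrate what the edit-distance feature computes, here is a minimal, self-contained dynamic-programming sketch of Levenshtein distance. This is an illustration only, not vtext's own implementation or API.

```rust
// Levenshtein edit distance between two strings, using a single
// rolling row of the dynamic-programming table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j]
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

fn main() {
    // "kitten" -> "sitting" requires 3 edits
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}
```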

Example

A simple tokenization example is shown below:

extern crate vtext;

use vtext::tokenize::{Tokenizer, VTextTokenizerParams};

let tok = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens = tok.tokenize("Flights can't depart after 2:00 pm.");

// returns &["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
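The token-counting feature maps tokens to column indices and accumulates counts, the core idea behind a CountVectorizer-style sparse representation. The sketch below shows that idea with plain standard-library types; the function and names are hypothetical and do not reflect vtext's actual vectorize API.

```rust
use std::collections::HashMap;

// Assign each distinct token a column index in `vocab` and count
// occurrences, yielding a sparse {column -> count} representation.
fn count_tokens<'a>(
    tokens: &[&'a str],
    vocab: &mut HashMap<&'a str, usize>,
) -> HashMap<usize, usize> {
    let mut counts = HashMap::new();
    for &tok in tokens {
        let next_id = vocab.len();
        let id = *vocab.entry(tok).or_insert(next_id);
        *counts.entry(id).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let mut vocab = HashMap::new();
    let counts = count_tokens(&["the", "cat", "the"], &mut vocab);
    assert_eq!(counts[&vocab["the"]], 2);
    assert_eq!(counts[&vocab["cat"]], 1);
}
```

A hashing-based variant (as in HashingVectorizer) would replace the vocabulary map with a hash of the token modulo a fixed number of columns, trading exact lookup for constant memory.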

Modules

  • errors
  • metrics: Metrics module
  • tokenize: Tokenization module
  • vectorize: Vectorization module