Crate vtext


§vtext

NLP in Rust with Python bindings

This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.

§Features

Tokenization: Regexp tokenizer, Unicode segmentation + language-specific rules
Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn but with less broad functionality.
String metrics: Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
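
To illustrate what "converting token counts to sparse matrices" means, here is a minimal sketch that uses only the standard library. It is not the vtext API: `CsrCounts` and `count_tokens` are hypothetical names, tokens are split on whitespace and lowercased for simplicity, and the output follows the usual CSR (compressed sparse row) layout of `indptr`, `indices`, and `data`.

```rust
use std::collections::{BTreeMap, HashMap};

/// Token counts in CSR (compressed sparse row) layout.
/// Hypothetical type for illustration; not part of vtext.
#[derive(Debug, PartialEq)]
pub struct CsrCounts {
    /// Row `i` occupies `indices[indptr[i]..indptr[i + 1]]` (same range in `data`).
    pub indptr: Vec<usize>,
    /// Column (vocabulary) index of each stored count.
    pub indices: Vec<usize>,
    /// The stored counts themselves.
    pub data: Vec<u32>,
    /// Token for each column, in order of first appearance.
    pub vocabulary: Vec<String>,
}

/// Count whitespace-separated, lowercased tokens per document.
/// Hypothetical helper for illustration; not part of vtext.
pub fn count_tokens(docs: &[&str]) -> CsrCounts {
    let mut vocab_index: HashMap<String, usize> = HashMap::new();
    let mut vocabulary: Vec<String> = Vec::new();
    let mut indptr = vec![0usize];
    let mut indices = Vec::new();
    let mut data = Vec::new();

    for doc in docs {
        // Per-document counts keyed by column; BTreeMap keeps columns sorted.
        let mut row: BTreeMap<usize, u32> = BTreeMap::new();
        for raw in doc.split_whitespace() {
            let token = raw.to_lowercase();
            // Look up the column for this token, assigning a new one if unseen.
            let col = match vocab_index.get(&token).copied() {
                Some(c) => c,
                None => {
                    let c = vocabulary.len();
                    vocab_index.insert(token.clone(), c);
                    vocabulary.push(token);
                    c
                }
            };
            *row.entry(col).or_insert(0) += 1;
        }
        // Append this row's non-zero entries and record where the row ends.
        for (col, count) in row {
            indices.push(col);
            data.push(count);
        }
        indptr.push(indices.len());
    }

    CsrCounts { indptr, indices, data, vocabulary }
}
```

Calling `count_tokens(&["the cat sat", "the cat the dog"])` yields two rows over a four-token vocabulary. Only non-zero counts are stored, which is what keeps this representation compact for large vocabularies.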

§Example

A simple tokenization example:

extern crate vtext;

use vtext::tokenize::{Tokenizer, VTextTokenizerParams};

let tok = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens: Vec<&str> = tok.tokenize("Flights can't depart after 2:00 pm.").collect();

assert_eq!(tokens, vec!["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]);

Modules§

errors
metrics: Metrics module
tokenize: Tokenization module
tokenize_sentence: Sentence tokenization module
vectorize: Vectorization module

Macros§

vecString