Crate vtext

§vtext

NLP in Rust with Python bindings

This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications.

§Features

  • Tokenization: regexp tokenizer, and Unicode segmentation with language-specific rules
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
  • Levenshtein edit distance; Sørensen–Dice, Jaro, and Jaro–Winkler string similarities
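To illustrate what the edit-distance feature computes (this is a standalone sketch, not the crate's own implementation, which lives in the metrics module), a minimal dynamic-programming Levenshtein distance over Unicode scalar values:

```rust
// Minimal dynamic-programming Levenshtein edit distance, shown for
// illustration only; it counts the insertions, deletions, and
// substitutions needed to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the first i chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}
```

The classic example: "kitten" to "sitting" requires 3 edits.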

§Example

A simple tokenization example:

extern crate vtext;

use vtext::tokenize::{Tokenizer, VTextTokenizerParams};

let tok = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens: Vec<&str> = tok.tokenize("Flights can't depart after 2:00 pm.").collect();

assert_eq!(tokens, vec!["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]);
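The token-counting feature maps tokenized documents to a sparse document-term matrix. As a conceptual sketch only (using the standard library, not the crate's vectorize API), counting per-document token occurrences against a shared vocabulary looks like this:

```rust
use std::collections::HashMap;

// Conceptual sketch of token counting: assign each distinct token a
// column index, then count occurrences per document. The per-document
// (column -> count) maps are the rows of a sparse document-term matrix.
fn count_tokens<'a>(
    docs: &[Vec<&'a str>],
) -> (HashMap<&'a str, usize>, Vec<HashMap<usize, u64>>) {
    let mut vocab: HashMap<&'a str, usize> = HashMap::new();
    let mut rows = Vec::new();
    for doc in docs {
        let mut counts: HashMap<usize, u64> = HashMap::new();
        for &tok in doc {
            let next = vocab.len();
            let idx = *vocab.entry(tok).or_insert(next);
            *counts.entry(idx).or_insert(0) += 1;
        }
        rows.push(counts);
    }
    (vocab, rows)
}
```

A hashing vectorizer differs only in replacing the explicit vocabulary with a hash of each token modulo the number of columns, trading exact feature names for bounded memory.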

Modules§

  • errors
  • metrics: Metrics module
  • tokenize: Tokenization module
  • tokenize_sentence: Sentence tokenization module
  • vectorize: Vectorization module

Macros§

  • vecString