Crate vtext

NLP in Rust with Python bindings

This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.

The API is currently unstable.

Features

  • Tokenization: regexp tokenizer, Unicode segmentation + language-specific rules
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with a narrower feature set.
  • String metrics: Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
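To illustrate what the edit-distance feature computes, here is a minimal, self-contained dynamic-programming sketch of Levenshtein distance. This is an illustration only, not vtext's own implementation or API.

```rust
// Levenshtein edit distance between two strings, using a single
// rolling row of the dynamic-programming table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j]
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

fn main() {
    // "kitten" -> "sitting" requires 3 edits
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}
```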

Example

A simple tokenization example is shown below:

extern crate vtext;

use vtext::tokenize::{Tokenizer, VTextTokenizerParams};

let tok = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens = tok.tokenize("Flights can't depart after 2:00 pm.");

// returns &["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
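The token-counting feature maps tokens to column indices and accumulates counts, the core idea behind a CountVectorizer-style sparse representation. The sketch below shows that idea with plain standard-library types; the function and names are hypothetical and do not reflect vtext's actual vectorize API.

```rust
use std::collections::HashMap;

// Assign each distinct token a column index in `vocab` and count
// occurrences, yielding a sparse {column -> count} representation.
fn count_tokens<'a>(
    tokens: &[&'a str],
    vocab: &mut HashMap<&'a str, usize>,
) -> HashMap<usize, usize> {
    let mut counts = HashMap::new();
    for &tok in tokens {
        let next_id = vocab.len();
        let id = *vocab.entry(tok).or_insert(next_id);
        *counts.entry(id).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let mut vocab = HashMap::new();
    let counts = count_tokens(&["the", "cat", "the"], &mut vocab);
    assert_eq!(counts[&vocab["the"]], 2);
    assert_eq!(counts[&vocab["cat"]], 1);
}
```

A hashing-based variant (as in HashingVectorizer) would replace the vocabulary map with a hash of the token modulo a fixed number of columns, trading exact lookup for constant memory.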

Modules

  • errors
  • metrics: Metrics module
  • tokenize: Tokenization module
  • vectorize: Vectorization module