Crate charabia

The Charabia library tokenizes text by detecting its Script/Language, then segmenting, normalizing, and classifying it.

§Examples

§Tokenization
use charabia::Tokenize;

let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// tokenize the text.
let mut tokens = orig.tokenize();

let token = tokens.next().unwrap();
// the lemma in the token is normalized: `Thé` became `the`.
assert_eq!(token.lemma(), "the");
// the token is classified as a word
assert!(token.is_word());

let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// the token is classified as a separator
assert!(token.is_separator());
§Segmentation
use charabia::Segment;

let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

let mut segments = orig.segment_str();

assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));
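
The stages named in the crate description (segmenting, normalizing, classifying) can be illustrated with a small standard-library-only sketch. This is a toy model and not Charabia's implementation: the names `ToyKind`, `ToyToken`, `toy_segment`, and `toy_tokenize` are hypothetical, segmentation simply splits on alphanumeric boundaries, and normalization only lowercases (it does not strip diacritics, so `Thé` would become `thé` here rather than `the`). Script/Language detection is omitted entirely.

```rust
// Toy illustration only: NOT Charabia's implementation.
// All names below are hypothetical and exist only for this sketch.

#[derive(Debug, PartialEq, Clone, Copy)]
enum ToyKind {
    Word,
    Separator,
}

#[derive(Debug, PartialEq)]
struct ToyToken {
    lemma: String,
    kind: ToyKind,
}

/// Segment: split the text into maximal runs of alphanumeric characters
/// and maximal runs of everything else.
fn toy_segment(text: &str) -> Vec<&str> {
    let mut segments = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (idx, ch) in text.char_indices() {
        let is_alnum = ch.is_alphanumeric();
        if let Some(p) = prev {
            // A change of character class closes the current segment.
            if p != is_alnum {
                segments.push(&text[start..idx]);
                start = idx;
            }
        }
        prev = Some(is_alnum);
    }
    if start < text.len() {
        segments.push(&text[start..]);
    }
    segments
}

/// Normalize (lowercase only) and classify each segment.
fn toy_tokenize(text: &str) -> Vec<ToyToken> {
    toy_segment(text)
        .into_iter()
        .map(|segment| {
            let kind = if segment.chars().any(char::is_alphanumeric) {
                ToyKind::Word
            } else {
                ToyKind::Separator
            };
            ToyToken {
                lemma: segment.to_lowercase(),
                kind,
            }
        })
        .collect()
}
```

Calling `toy_tokenize("Hello, World")` yields three tokens: a word with lemma `hello`, a separator with lemma `, `, and a word with lemma `world`. The real library handles far more (script-specific segmenters, dictionaries, richer normalization), which is why its output for the same text can differ from this sketch.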

§Build features

Charabia comes with default features that can be deactivated at compile time. These features add support for additional languages, each of which needs to download and/or build a specialized dictionary, which increases compilation time. They are listed in charabia’s Cargo.toml and can be deactivated via dependency features.
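
As a minimal sketch of deactivating the defaults, the dependency entry below uses Cargo's standard `default-features = false` key. The version number is a placeholder assumption, not a recommendation. Individual language features can be re-enabled through the `features` list, using the feature names listed in charabia's Cargo.toml.

```toml
[dependencies]
# Placeholder version: replace it with the release you actually depend on.
# `default-features = false` turns off every default language feature.
charabia = { version = "0.9", default-features = false }
```

With the defaults turned off, the dictionaries for the optional languages are neither downloaded nor built, which should shorten compilation at the cost of less specialized handling for those languages.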

Re-exports§

pub use normalizer::Normalize;
pub use segmenter::Segment;

Modules§

normalizer
segmenter
separators

Structs§

ReconstructedTokenIter
Iterator over tuples of &str (part of the original text) and Token.
StrDetection
Token
Tokenizer
Structure used to tokenize a text with custom configurations.
TokenizerBuilder
Structure to build a tokenizer with custom settings.

Enums§

Language
Script
SeparatorKind
Defines the kind of a TokenKind::Separator.
TokenKind
Defines the kind of a Token.

Traits§

Tokenize
Trait defining methods to tokenize a text.