Module segmentation


Things that split tokens into one or more subtokens.

Structs§

DecompositionAhoCorasick
Decompose compound words into their parts found in a given dictionary. Useful for small or on-the-fly generated dictionaries (see the dictionary-lookup sketch after this list).
DecompositionFst
Decompose compound words into their parts found in a given dictionary. Useful for compressed, memory-mapped dictionaries.
NaiveWordSplitter
Naive word splitting that cuts tokens where alphanumerics, spaces, and symbols change from one type to another.
UnicodeSentenceSplitter
Split large chunks of text into sentences according to the Unicode definition of a sentence.
UnicodeWordSplitter
Split text into words according to the Unicode definition of a word. While not perfect, it should work well enough as an easy starting point.
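The Unicode definitions referenced by the two Unicode splitters above are the default text-segmentation rules of Unicode Standard Annex #29. As a rough sketch of those rules rather than of this module's API, the `unicode-segmentation` crate exposes equivalent word and sentence iterators:

```rust
// Sketch of UAX #29 word and sentence boundaries using the
// `unicode-segmentation` crate; it illustrates the rules the Unicode
// splitters above refer to, not their actual API.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "The quick brown fox. It jumped over the lazy dog.";

    // Word boundaries: whitespace and punctuation are dropped.
    let words: Vec<&str> = text.unicode_words().collect();
    println!("{:?}", words);

    // Sentence boundaries: the text is split after ". " before "It".
    let sentences: Vec<&str> = text.unicode_sentences().collect();
    println!("{:?}", sentences);
}
```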
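DecompositionAhoCorasick and DecompositionFst differ in how the dictionary is stored (an Aho-Corasick automaton versus an FST), but both rest on the same idea: locate dictionary entries inside a compound token. The sketch below shows that lookup step using the `aho-corasick` crate directly (assuming `aho-corasick` 1.x); the constructors, dictionary formats, and splitting policy of the structs above may differ.

```rust
// Sketch of dictionary-based compound decomposition using the
// `aho-corasick` crate directly; it only illustrates finding dictionary
// entries inside a compound token, not this module's exact splitting policy.
use aho_corasick::AhoCorasick;

fn main() {
    // A tiny dictionary of known word parts.
    let dictionary = ["wind", "mill", "farm"];
    let ac = AhoCorasick::new(dictionary).unwrap();

    // Collect every dictionary entry found inside the compound word.
    let compound = "windmillfarm";
    let parts: Vec<&str> = ac
        .find_iter(compound)
        .map(|m| &compound[m.start()..m.end()])
        .collect();

    println!("{:?}", parts); // ["wind", "mill", "farm"]
}
```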

Traits§

Segmenter
Allows a segmenter to be part of a SubdivisionMap iterator and to be chained using the chain_segmenter methods.
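As a hypothetical sketch only, chaining segmenters amounts to feeding every subtoken produced by one segmenter into the next; the real Segmenter trait, SubdivisionMap, and chain_segmenter methods have their own signatures, which are not reproduced here.

```rust
// Hypothetical sketch of segmenter chaining; names and signatures below are
// illustrative and do not match this module's actual `Segmenter` trait.
trait Segmenter {
    /// Split one token into zero or more subtokens.
    fn segment<'a>(&self, token: &'a str) -> Vec<&'a str>;
}

struct WhitespaceSplitter;
impl Segmenter for WhitespaceSplitter {
    fn segment<'a>(&self, token: &'a str) -> Vec<&'a str> {
        token.split_whitespace().collect()
    }
}

struct HyphenSplitter;
impl Segmenter for HyphenSplitter {
    fn segment<'a>(&self, token: &'a str) -> Vec<&'a str> {
        token.split('-').filter(|part| !part.is_empty()).collect()
    }
}

/// Chain two segmenters: the second re-segments every subtoken from the first.
fn chain<'a>(a: &impl Segmenter, b: &impl Segmenter, token: &'a str) -> Vec<&'a str> {
    a.segment(token)
        .into_iter()
        .flat_map(|sub| b.segment(sub))
        .collect()
}

fn main() {
    let parts = chain(&WhitespaceSplitter, &HyphenSplitter, "state-of-the-art word splitting");
    println!("{:?}", parts); // ["state", "of", "the", "art", "word", "splitting"]
}
```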