Crate that gives you building blocks for putting together text segmentation pipelines.
The base unit of data it works with is a SegmentedToken. The pipeline splits these into increasingly smaller tokens using segmenters, adds metadata in between using augmenters, and finally applies normalization.
All of this builds on Rust's built-in Iterator framework: whenever a “magic” method from this crate operates on multiple tokens, the input can be any iterator that yields SegmentedToken items, and whatever is returned can be treated like any other iterator.
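Because everything is plain Iterator composition, the same chaining style can be sketched with only the standard library. This is not the crate's API, just the underlying pattern: each adapter lazily wraps the previous iterator, and nothing runs until the chain is driven by `collect`.

```rust
fn main() {
    let tokens: Vec<String> = "The first digits"
        .split_whitespace()        // "segment": &str -> iterator of words
        .map(|w| w.to_lowercase()) // "normalize": lowercase each token
        .collect();                // drive the whole chain once
    assert_eq!(tokens, ["the", "first", "digits"]);
}
```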
use unobtanium_segmenter::augmentation::AugmentationClassify;
use unobtanium_segmenter::augmentation::AugmentationDetectLanguage;
use unobtanium_segmenter::chain::ChainAugmenter;
use unobtanium_segmenter::chain::ChainSegmenter;
use unobtanium_segmenter::chain::StartSegmentationChain;
use unobtanium_segmenter::normalization::NormalizationLowercase;
use unobtanium_segmenter::normalization::NormalizationRustStemmers;
use unobtanium_segmenter::segmentation::UnicodeSentenceSplitter;
use unobtanium_segmenter::segmentation::UnicodeWordSplitter;
let sample_text = "The first digits of π are 3.141592. Dieser Satz ist in deutscher Sprache verfasst.";
let output: Vec<String> = sample_text
    .start_segmentation_chain() // text to token iterator
    .chain_segmenter(&UnicodeSentenceSplitter::new())
    .chain_augmenter(&AugmentationDetectLanguage::new())
    .inspect(|t| println!("{t:?}")) // debug helper
    .chain_segmenter(&UnicodeWordSplitter::new())
    .chain_augmenter(&AugmentationClassify::new()) // adds useful metadata and speeds up stemming
    .chain_augmenter(&NormalizationRustStemmers::new())
    .chain_augmenter(&NormalizationLowercase::new())
    .map(|t| t.get_text_prefer_normalized_owned()) // token to text mapping
    .collect();
let expected_output: Vec<String> = vec![
    "the", " ", "first", " ", "digit", " ", "of", " ", "π", " ", "are", " ", "3.141592", ".", " ", "",
    "dies", " ", "satz", " ", "ist", " ", "in", " ", "deutsch", " ", "sprach", " ", "verfasst", ".", ""
].iter().map(|s| s.to_string()).collect();
assert_eq!(output, expected_output);
Modules§
- augmentation: Things that add more metadata to tokens.
- chain: Helpers for applying segmentation, augmentation and normalization to iterators of tokens.
- normalization: Things that turn different-looking things into same-looking things.
- segmentation: Things that split tokens into one or more subtokens.
Structs§
- Segmented
Token - The main representation of data this crate works on.
- Sentence
Grouped Iterator - Iterator that wraps another Iterator of
SegmentedToken
s and inserts a break after each end of sentence token, which interrupts afor
-loop but can be used. - Subdivision
Map - Iterator that allows subdividing each item into zero, one or multiple of itself.
Enums§
- SegmentedTokenKind: What kind of content to expect from a SegmentedToken.
- UseOrSubdivide: An owned iterator wrapper that provides shortcuts for the common empty and one-element cases. It is intended for use with the SubdivisionMap iterator as the return type of the callback.
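The zero/one/many behaviour that SubdivisionMap describes resembles the standard library's `flat_map`, where a callback returns any number of replacement items per input item. A std-only illustration (not this crate's API):

```rust
fn main() {
    let subdivided: Vec<&str> = ["a,b", "", "c"]
        .into_iter()
        // Each input item yields zero ("" is dropped), one ("c"),
        // or multiple ("a,b" splits in two) output items.
        .flat_map(|s| s.split(',').filter(|part| !part.is_empty()))
        .collect();
    assert_eq!(subdivided, ["a", "b", "c"]);
}
```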