Crate unobtanium_segmenter

Crate that provides building blocks for putting together text segmentation pipelines.

The base unit of data it works with is a SegmentedToken. A pipeline splits these into increasingly smaller tokens using segmenters, adds metadata in between using augmenters, and finally applies normalization.

All of this is built on Rust’s built-in Iterator framework, so whenever a “magic” method from this crate operates on multiple tokens, it accepts any iterator that happens to yield SegmentedToken items, and whatever it returns can be treated like any other iterator.

use unobtanium_segmenter::augmentation::AugmentationClassify;
use unobtanium_segmenter::augmentation::AugmentationDetectLanguage;
use unobtanium_segmenter::chain::ChainAugmenter;
use unobtanium_segmenter::chain::ChainSegmenter;
use unobtanium_segmenter::chain::StartSegmentationChain;
use unobtanium_segmenter::normalization::NormalizationLowercase;
use unobtanium_segmenter::normalization::NormalizationRustStemmers;
use unobtanium_segmenter::segmentation::UnicodeSentenceSplitter;
use unobtanium_segmenter::segmentation::UnicodeWordSplitter;

let sample_text = "The first digits of π are 3.141592. Dieser Satz ist in deutscher Sprache verfasst.";

let output: Vec<String> = sample_text
	.start_segmentation_chain() // Text to token iterator
	.chain_segmenter(&UnicodeSentenceSplitter::new())
	.chain_augmenter(&AugmentationDetectLanguage::new())
	.inspect(|t| println!("{t:?}")) // Debug helper
	.chain_segmenter(&UnicodeWordSplitter::new())
	.chain_augmenter(&AugmentationClassify::new()) // adds useful metadata and speeds up stemming
	.chain_augmenter(&NormalizationRustStemmers::new())
	.chain_augmenter(&NormalizationLowercase::new())
	.map(|t| t.get_text_prefer_normalized_owned()) // Token to text mapping
	.collect();

let expected_output: Vec<String> = vec![
	"the", " ", "first", " ", "digit", " ", "of", " ", "π", " ", "are", " ", "3.141592", ".", " ", "",
	"dies", " ", "satz", " ", "ist", " ", "in", " ", "deutsch", " ", "sprach", " ", "verfasst", ".", ""
].iter().map(|s| s.to_string()).collect();

assert_eq!(output, expected_output);

Modules§

augmentation
Things that add more metadata to tokens.
chain
Helpers for applying segmentation, augmentation and normalization to iterators of tokens.
normalization
Things that turn differently written forms of the same content into one canonical form.
segmentation
Things that split tokens into one or more subtokens.

Structs§

SegmentedToken
The main representation of data this crate works on.
SentenceGroupedIterator
Iterator that wraps another iterator of SegmentedTokens and inserts a break after each end-of-sentence token. The break ends a for loop over the iterator, but the same iterator can be picked up again for the following sentence (see the sketch after this list).
SubdivisionMap
Iterator that allows subdividing each item into zero, one, or multiple items of the same type.
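
Below is a minimal sketch of how SentenceGroupedIterator might be driven sentence by sentence. The SentenceGroupedIterator::new constructor and the resume-after-break behaviour shown here are assumptions for illustration, not part of the documented API.

use unobtanium_segmenter::SentenceGroupedIterator;
use unobtanium_segmenter::chain::ChainSegmenter;
use unobtanium_segmenter::chain::StartSegmentationChain;
use unobtanium_segmenter::segmentation::UnicodeSentenceSplitter;

let text = "First sentence. Second sentence.";

// Hypothetical constructor: wrap a token stream so that each for loop pass
// stops at an end-of-sentence token instead of exhausting the iterator.
let mut grouped = SentenceGroupedIterator::new(
	text.start_segmentation_chain()
		.chain_segmenter(&UnicodeSentenceSplitter::new()),
);

// The first pass yields the tokens of the first sentence, then stops.
for token in &mut grouped {
	println!("sentence 1: {token:?}");
}

// Iterating again continues with the tokens of the next sentence.
for token in &mut grouped {
	println!("sentence 2: {token:?}");
}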

Enums§

SegmentedTokenKind
What kind of content to expect from a SegmentedToken.
UseOrSubdivide
An owned iterator wrapper that provides shortcuts for the common empty and single-element cases. It is intended for use with the SubdivisionMap iterator as the return type of the callback; a minimal sketch follows below.
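
A minimal sketch of how SubdivisionMap and UseOrSubdivide might fit together. The SubdivisionMap::new constructor and the variant names Use and Subdivide are assumed here for illustration and may differ from the real API.

use unobtanium_segmenter::SubdivisionMap;
use unobtanium_segmenter::UseOrSubdivide;

// Hypothetical constructor and variant names: the callback decides, per item,
// whether to keep it as-is (the single-element shortcut) or to replace it
// with several items of the same type.
let tokens = vec!["well-known", "words"].into_iter();
let subdivided: Vec<&str> = SubdivisionMap::new(tokens, |t| {
	if t.contains('-') {
		// Replace one item with several items of the same type.
		UseOrSubdivide::Subdivide(t.split('-').collect::<Vec<_>>())
	} else {
		// Pass the item through unchanged.
		UseOrSubdivide::Use(t)
	}
}).collect();

assert_eq!(subdivided, vec!["well", "known", "words"]);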