Crate unobtanium_segmenter

Crate that provides building blocks for putting together text segmentation pipelines.

The base unit of data it works with is a SegmentedToken. A pipeline splits these into increasingly smaller tokens using segmenters, adds metadata in between using augmenters, and finally applies normalization.

All of this is built on Rust’s built-in Iterator framework, so whenever a “magic” method from this crate operates on multiple tokens, it accepts any iterator that happens to yield SegmentedToken items, and whatever it returns can be treated like any other iterator.

use unobtanium_segmenter::augmentation::AugmentationClassify;
use unobtanium_segmenter::augmentation::AugmentationDetectLanguage;
use unobtanium_segmenter::chain::ChainAugmenter;
use unobtanium_segmenter::chain::ChainSegmenter;
use unobtanium_segmenter::chain::StartSegmentationChain;
use unobtanium_segmenter::normalization::NormalizationLowercase;
use unobtanium_segmenter::normalization::NormalizationRustStemmers;
use unobtanium_segmenter::segmentation::UnicodeSentenceSplitter;
use unobtanium_segmenter::segmentation::UnicodeWordSplitter;

let sample_text = "The first digits of π are 3.141592. Dieser Satz ist in deutscher Sprache verfasst.";

let output: Vec<String> = sample_text
	.start_segmentation_chain() // Text to token iterator
	.chain_segmenter(&UnicodeSentenceSplitter::new())
	.chain_augmenter(&AugmentationDetectLanguage::new())
	.inspect(|t| println!("{t:?}")) // Debug helper
	.chain_segmenter(&UnicodeWordSplitter::new())
	.chain_augmenter(&AugmentationClassify::new()) // adds useful metadata and speeds up stemming
	.chain_augmenter(&NormalizationRustStemmers::new())
	.chain_augmenter(&NormalizationLowercase::new())
	.map(|t| t.get_text_prefer_normalized_owned()) // Token to text mapping
	.collect();

let expected_output: Vec<String> = vec![
	"the", " ", "first", " ", "digit", " ", "of", " ", "π", " ", "are", " ", "3.141592", ".", " ", "",
	"dies", " ", "satz", " ", "ist", " ", "in", " ", "deutsch", " ", "sprach", " ", "verfasst", ".", ""
].iter().map(|s| s.to_string()).collect();

assert_eq!(output, expected_output);

Modules§

augmentation
Things that add more metadata to tokens.
chain
Helpers for applying segmentation, augmentation and normalization to iterators of tokens.
normalization
Things that turn differently written forms of the same content into one canonical form.
segmentation
Things that split tokens into one or more subtokens.

Structs§

SegmentedToken
The main representation of data this crate works on.
SentenceGroupedIterator
Iterator that wraps another iterator of SegmentedTokens and inserts a break after each end-of-sentence token. The break ends a for loop over the iterator, but the same iterator can be picked up again for the following sentence (see the sketch after this list).
SubdivisionMap
Iterator that allows subdividing each item into zero, one, or multiple items of the same type.
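
Below is a minimal sketch of how SentenceGroupedIterator might be driven sentence by sentence. The SentenceGroupedIterator::new constructor and the resume-after-break behaviour shown here are assumptions for illustration, not part of the documented API.

use unobtanium_segmenter::SentenceGroupedIterator;
use unobtanium_segmenter::chain::ChainSegmenter;
use unobtanium_segmenter::chain::StartSegmentationChain;
use unobtanium_segmenter::segmentation::UnicodeSentenceSplitter;

let text = "First sentence. Second sentence.";

// Hypothetical constructor: wrap a token stream so that each for loop pass
// stops at an end-of-sentence token instead of exhausting the iterator.
let mut grouped = SentenceGroupedIterator::new(
	text.start_segmentation_chain()
		.chain_segmenter(&UnicodeSentenceSplitter::new()),
);

// The first pass yields the tokens of the first sentence, then stops.
for token in &mut grouped {
	println!("sentence 1: {token:?}");
}

// Iterating again continues with the tokens of the next sentence.
for token in &mut grouped {
	println!("sentence 2: {token:?}");
}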

Enums§

SegmentedTokenKind
What kind of content to expect from a SegmentedToken.
UseOrSubdivide
An owned iterator wrapper that provides shortcuts for the common empty and single-element cases. It is intended for use with the SubdivisionMap iterator as the return type of the callback; a minimal sketch follows below.
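
A minimal sketch of how SubdivisionMap and UseOrSubdivide might fit together. The SubdivisionMap::new constructor and the variant names Use and Subdivide are assumed here for illustration and may differ from the real API.

use unobtanium_segmenter::SubdivisionMap;
use unobtanium_segmenter::UseOrSubdivide;

// Hypothetical constructor and variant names: the callback decides, per item,
// whether to keep it as-is (the single-element shortcut) or to replace it
// with several items of the same type.
let tokens = vec!["well-known", "words"].into_iter();
let subdivided: Vec<&str> = SubdivisionMap::new(tokens, |t| {
	if t.contains('-') {
		// Replace one item with several items of the same type.
		UseOrSubdivide::Subdivide(t.split('-').collect::<Vec<_>>())
	} else {
		// Pass the item through unchanged.
		UseOrSubdivide::Use(t)
	}
}).collect();

assert_eq!(subdivided, vec!["well", "known", "words"]);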