pub struct SegmentedToken<'a> {
    pub text: &'a str,
    pub normalized_text: NormalizedText,
    pub normalization_language: Option<Lang>,
    pub kind: Option<SegmentedTokenKind>,
    pub detected_script: Option<Script>,
    pub detected_language: Option<Lang>,
    pub detected_language_confidence: f64,
    pub is_detected_language_relible: bool,
    pub is_known_word: bool,
    pub is_end_of_sentence: bool,
}
The main representation of data this crate works on.
A token is effectively text with metadata attached; this struct is the metadata carrier.
Fields
text: &'a str
The piece of text that this token represents.
This should be borrowed from the initial text that was fed to the segmenter chain.
normalized_text: NormalizedText
If a normalizer's output differs from text, the result is stored here.
It is recommended that you fetch normalized text using get_text_prefer_normalized() or get_text_prefer_normalized_owned().
normalization_language: Option<Lang>
The language for which the normalization that produced normalized_text was performed.
None means that only language-independent normalizations were applied. For language-dependent normalizations, this field should ensure that they are all applied for the same language.
If already set to Some, it shouldn’t be changed.
kind: Option<SegmentedTokenKind>
detected_script: Option<Script>
The primary script as detected by a script or language detection augmenter.
Information about detected scripts is inherited across splitting.
detected_language: Option<Lang>
The primary language detected by a language detection augmenter.
Information about detected languages is inherited across splitting.
detected_language_confidence: f64
How confident the language detector was about the language that it detected.
This ranges from 0 (not confident at all) to 1 (most confident).
is_detected_language_relible: bool
Whether the language detector considers its output to be reliable.
is_known_word: bool
Indicates that no further splitting is necessary.
This should be set to true if the token was a valid word in a dictionary.
is_end_of_sentence: bool
Indicates that this token marks the end of a sentence.
This should only be set on tokens with an empty text field. It is not inherited.
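To illustrate how an empty text can still borrow from the original input, the following sketch uses only the standard library's `str::split_at`. The helper names `split_off_empty_tail` and `offset_in` are made up for this example and are not part of the crate.

```rust
/// Splits `sentence` into its full text and an empty tail slice.
/// The tail has length 0 but still points at the end of `sentence`,
/// so it borrows from the original input rather than from a literal "".
pub fn split_off_empty_tail(sentence: &str) -> (&str, &str) {
    // `sentence.len()` is always a valid char boundary, so this cannot panic.
    sentence.split_at(sentence.len())
}

/// Byte offset of `part` inside `whole`.
/// Only meaningful if `part` was sliced out of `whole`.
pub fn offset_in(whole: &str, part: &str) -> usize {
    part.as_ptr() as usize - whole.as_ptr() as usize
}
```

An empty slice obtained this way keeps the same lifetime as the input, which is what a `&'a str` field that should borrow from the initial text needs.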
Implementations
impl<'a> SegmentedToken<'a>
pub fn new(text: &'a str, kind: Option<SegmentedTokenKind>) -> Self
Create a segmented token from scratch. (You likely won’t need it.)
If you are writing a segmenter, have a look at new_derived_from().
For creating the initial token, consider using the From implementations or the StartSegmentationChain trait.
pub fn new_derived_from(text: &'a str, from: &Self) -> Self
Create a token with a given text that inherits metadata from the from token.
This is the recommended constructor to use inside a segmenter after splitting.
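As a rough sketch of what deriving after a split can look like, the example below uses `MiniToken`, a hypothetical cut-down stand-in and not the crate's `SegmentedToken`. It assumes only what the field documentation states: detection results are inherited, and the end-of-sentence marker is not. The whitespace segmenter is likewise invented for illustration.

```rust
/// Hypothetical, cut-down stand-in for `SegmentedToken`; not the crate's type.
#[derive(Debug, Clone, PartialEq)]
pub struct MiniToken<'a> {
    pub text: &'a str,
    pub detected_language: Option<&'static str>,
    pub is_end_of_sentence: bool,
}

impl<'a> MiniToken<'a> {
    /// Builds a token for `text` that copies inheritable metadata from `from`.
    pub fn new_derived_from(text: &'a str, from: &Self) -> Self {
        MiniToken {
            text,
            // Detection results are documented as inherited across splitting.
            detected_language: from.detected_language,
            // The end-of-sentence marker is documented as not inherited.
            is_end_of_sentence: false,
        }
    }
}

/// A toy segmenter: splits on whitespace and derives each piece from `parent`.
pub fn split_on_whitespace<'a>(parent: &MiniToken<'a>) -> Vec<MiniToken<'a>> {
    parent
        .text
        .split_whitespace()
        .map(|piece| MiniToken::new_derived_from(piece, parent))
        .collect()
}
```

Every child text is a subslice of the parent text, so the borrow chain back to the original input stays intact.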
pub fn new_end_of_sentence(empty_text: &'a str) -> Self
Create a new token that carries an is_end_of_sentence marker.
Recommended way of deriving the empty text:
    // tail is empty but still borrows from the end of sentence.
    let (main, tail) = sentence.split_at(sentence.len());
    let main_token = SegmentedToken::new_derived_from(main, &token);
    let end_marker = SegmentedToken::new_end_of_sentence(tail);
pub fn covert_to_child_segements_of_self(&'a self, texts: &'a [&'a str]) -> impl Iterator<Item = SegmentedToken<'a>> + 'a
Helper function to convert texts that came out of a simple helper function back into segments.
Using this implies that further segmenting didn’t change any of the child segments’ metadata.
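The lifetime shape of that signature can be sketched with a hypothetical stand-in type. `MiniToken` and `to_children` below are illustrative names, and the body is an assumption about how such a helper might work, not the crate's implementation.

```rust
/// Hypothetical stand-in for `SegmentedToken`; not the crate's type.
#[derive(Debug, Clone, PartialEq)]
pub struct MiniToken<'a> {
    pub text: &'a str,
    pub detected_script: Option<&'static str>,
}

impl<'a> MiniToken<'a> {
    /// Lazily turns plain text pieces back into tokens that carry the
    /// parent's metadata unchanged.
    pub fn to_children(
        &'a self,
        texts: &'a [&'a str],
    ) -> impl Iterator<Item = MiniToken<'a>> + 'a {
        // `copied()` turns `&&str` items into `&str`; `move` lets the closure
        // keep the `&'a self` reference for as long as the iterator lives.
        texts.iter().copied().map(move |text| MiniToken {
            text,
            detected_script: self.detected_script,
        })
    }
}
```

A caller would typically collect the output of something like `str::split` into a `Vec<&str>` first and then pass a slice of it to the helper.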
pub fn with_is_kown_word(self, is_known_word: bool) -> Self
Builder-like convenience function to set the is_known_word flag.
pub fn get_text_prefer_normalized(&self) -> &str
Return the normalized_text of this token as a &str if present, and text otherwise.
pub fn get_text_prefer_normalized_owned(&self) -> String
This is the same as get_text_prefer_normalized(), but returns an owned String instead.
pub fn get_normalized_text(&self) -> Option<&str>
Returns the normalized text behind this token.
If the normalization is NormalizedText::NormalizedToSelf it’ll return the original text.
It will only return None if no normalization was applied.
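The three cases described for get_normalized_text(), and the fallback in get_text_prefer_normalized(), can be sketched with stand-in types. Only the `NormalizedToSelf` variant name comes from this page. `MiniNormalized`, `MiniToken`, and the other two variant names are assumptions made for the example.

```rust
/// Hypothetical stand-in for `NormalizedText`. Only `NormalizedToSelf` is
/// named in the documentation; the other variant names are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum MiniNormalized {
    NotNormalized,
    NormalizedToSelf,
    Normalized(String),
}

/// Hypothetical stand-in for `SegmentedToken`; not the crate's type.
pub struct MiniToken<'a> {
    pub text: &'a str,
    pub normalized_text: MiniNormalized,
}

impl<'a> MiniToken<'a> {
    /// `None` only when no normalization was applied at all.
    pub fn get_normalized_text(&self) -> Option<&str> {
        match &self.normalized_text {
            MiniNormalized::NotNormalized => None,
            // Normalization ran but changed nothing: hand back the original.
            MiniNormalized::NormalizedToSelf => Some(self.text),
            MiniNormalized::Normalized(s) => Some(s.as_str()),
        }
    }

    /// Normalized text if there is any, the original text otherwise.
    pub fn get_text_prefer_normalized(&self) -> &str {
        self.get_normalized_text().unwrap_or(self.text)
    }
}
```

Under these assumptions, a separate "normalized to self" state lets callers tell "nothing ran" apart from "ran and changed nothing" without storing a second copy of the text.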
pub fn update_normalized_str(&mut self, normalized: &str, lang: Option<Lang>)
Update this token’s normalized text with a borrowed &str.
If normalized already matches the original text, normalized_text will be set to NormalizedText::NormalizedToSelf.
lang is the language that the normalization happened for; set it to None if the normalization was language-independent.
pub fn update_normalized_string(&mut self, normalized: String, lang: Option<Lang>)
Update this token’s normalized text with an owned String.
If normalized already matches the original text, normalized_text will be set to NormalizedText::NormalizedToSelf.
lang is the language that the normalization happened for; set it to None if the normalization was language-independent.
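A minimal sketch of the comparison rule shared by both update functions, using hypothetical stand-in types. `MiniNormalized`, `MiniToken`, and the variant names other than `NormalizedToSelf` are assumptions, and the `lang` parameter is left out to keep the example small.

```rust
/// Hypothetical stand-in for `NormalizedText`; only `NormalizedToSelf`
/// is a name taken from the documentation.
#[derive(Debug, Clone, PartialEq)]
pub enum MiniNormalized {
    NotNormalized,
    NormalizedToSelf,
    Normalized(String),
}

/// Hypothetical stand-in for `SegmentedToken`; not the crate's type.
pub struct MiniToken<'a> {
    pub text: &'a str,
    pub normalized_text: MiniNormalized,
}

impl<'a> MiniToken<'a> {
    /// Owned variant: the `String` is moved in, so nothing is copied.
    pub fn update_normalized_string(&mut self, normalized: String) {
        self.normalized_text = if normalized == self.text {
            // Output equals input: record that fact instead of the string.
            MiniNormalized::NormalizedToSelf
        } else {
            MiniNormalized::Normalized(normalized)
        };
    }

    /// Borrowed variant: allocates only when the text actually changed.
    pub fn update_normalized_str(&mut self, normalized: &str) {
        self.normalized_text = if normalized == self.text {
            MiniNormalized::NormalizedToSelf
        } else {
            MiniNormalized::Normalized(normalized.to_owned())
        };
    }
}
```

In this sketch the only difference between the two is where the allocation happens, which is a plausible reason for offering both a `&str` and a `String` entry point.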
pub fn update_normalization_language(&mut self, lang: Option<Lang>)
Update the normalization language. None means language-independent.
pub fn was_normalized(&self) -> bool
Returns whether the text was normalized or not.
Trait Implementations
impl<'a> Clone for SegmentedToken<'a>
fn clone(&self) -> SegmentedToken<'a>
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.