pub struct SegmentedToken<'a> {
pub text: &'a str,
pub normalized_text: Option<String>,
pub kind: Option<SegmentedTokenKind>,
pub detected_script: Option<Script>,
pub detected_language: Option<Lang>,
pub detected_language_confidence: f64,
pub is_detected_language_relible: bool,
pub is_known_word: bool,
pub is_end_of_sentence: bool,
}
The main representation of data this crate works on.
A token is effectively text with metadata attached; this struct is the metadata carrier.
Fields§
§text: &'a str
The piece of text that this token represents.
This should be borrowed from the initial text that was fed to the segmenter chain.
normalized_text: Option<String>
If a normalizer's output differs from text, the result is stored here.
It is recommended that you fetch normalized text using get_text_prefer_normalized() or get_text_prefer_normalized_owned().
kind: Option<SegmentedTokenKind>
§detected_script: Option<Script>
The primary script as detected by a script or language detection augmenter.
Information about detected scripts is inherited across splitting.
detected_language: Option<Lang>
The primary language detected by a language detection augmenter.
Information about detected languages is inherited across splitting.
detected_language_confidence: f64
How confident the language detector was about the language that it detected.
This ranges from 0 (not confident at all) to 1 (most confident).
is_detected_language_relible: bool
Whether the language detector considers its output to be reliable.
is_known_word: bool
Indicates that no further splitting is necessary.
This should be set to true if the token was a valid word in a dictionary.
is_end_of_sentence: bool
Indicates that this token marks the end of a sentence.
This should only be set on tokens with an empty text field. It is not inherited.
Implementations§
impl<'a> SegmentedToken<'a>
pub fn new(text: &'a str, kind: Option<SegmentedTokenKind>) -> Self
Create a segmented token from scratch. (You likely won’t need it.)
If you are writing a segmenter, have a look at new_derived_from().
For creating the initial token, consider using the From implementations or the StartSegmentationChain trait.
pub fn new_derived_from(text: &'a str, from: &Self) -> Self
Create a token with the given text that inherits metadata from the from token.
This is the recommended constructor to use inside a segmenter after splitting.
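As a rough illustration of what deriving a token inside a segmenter can look like, here is a minimal sketch. It does not use the crate itself: `SketchToken`, `derived_from` and `split_on_whitespace` are hypothetical stand-ins, the language is a plain string instead of `Lang`, and the choice of which fields get copied is an assumption based on the field descriptions (detected language is inherited, the end-of-sentence marker is not).

```rust
// Hypothetical stand-in for the crate's token type, reduced to three fields.
// Which fields the real `new_derived_from()` copies is an assumption here.
#[derive(Debug, Clone, PartialEq)]
struct SketchToken<'a> {
    text: &'a str,
    detected_language: Option<&'static str>,
    is_end_of_sentence: bool,
}

impl<'a> SketchToken<'a> {
    /// Builds a child token that keeps the parent's inheritable metadata.
    fn derived_from(text: &'a str, from: &Self) -> Self {
        SketchToken {
            text,
            detected_language: from.detected_language,
            // Not inherited: a child never carries the sentence marker.
            is_end_of_sentence: false,
        }
    }
}

/// Splits a token on whitespace, deriving every child from the parent.
fn split_on_whitespace<'a>(parent: &SketchToken<'a>) -> Vec<SketchToken<'a>> {
    parent
        .text
        .split_whitespace()
        .map(|piece| SketchToken::derived_from(piece, parent))
        .collect()
}

fn main() {
    let parent = SketchToken {
        text: "hello world",
        detected_language: Some("eng"),
        is_end_of_sentence: false,
    };
    let children = split_on_whitespace(&parent);
    assert_eq!(children.len(), 2);
    assert_eq!(children[0].text, "hello");
    assert_eq!(children[1].detected_language, Some("eng"));
}
```

The children borrow from the same underlying text as the parent, so no string data is copied while splitting.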
pub fn new_end_of_sentence(empty_text: &'a str) -> Self
Create a new token that carries an is_end_of_sentence marker.
Recommended way of deriving the empty text:
let (main, tail) = sentence.split_at(sentence.len());
SegmentedToken::new_derived_from(main, &token);
SegmentedToken::new_end_of_sentence(tail);
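The reason this works is that splitting at `sentence.len()` yields the whole sentence plus an empty tail that still points into the original buffer. A small standard-library-only sketch of that behaviour follows; the helper name `split_off_empty_tail` and the sample sentence are made up for illustration.

```rust
/// Splits `sentence` into its full text and an empty tail slice.
/// The tail has length zero but still borrows from `sentence`.
fn split_off_empty_tail(sentence: &str) -> (&str, &str) {
    sentence.split_at(sentence.len())
}

fn main() {
    let sentence = "This is a sentence.";
    let (main_part, tail) = split_off_empty_tail(sentence);

    assert_eq!(main_part, sentence);
    assert!(tail.is_empty());
    // The empty tail starts exactly where the sentence ends.
    assert_eq!(tail.as_ptr(), sentence[sentence.len()..].as_ptr());
}
```

Because the tail is a slice of the input rather than a fresh `""` literal, it satisfies the same `'a` lifetime as every other token derived from that text.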
pub fn covert_to_child_segements_of_self(
    &'a self,
    texts: &'a [&'a str],
) -> impl Iterator<Item = SegmentedToken<'a>> + 'a
Helper function to convert texts that came out of a simple helper function back into segments.
Using this implies that further segmenting didn’t change anything for the metadata of the child segments.
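To make the shape of this helper more concrete, the following sketch mirrors the signature on a hypothetical stand-in type. `SketchToken` and `to_child_segments` are not part of the crate, and copying the parent's metadata unchanged onto every child is an assumption taken from the description above.

```rust
// Hypothetical, reduced stand-in for the crate's token type.
#[derive(Debug, Clone, PartialEq)]
struct SketchToken<'a> {
    text: &'a str,
    is_known_word: bool,
}

impl<'a> SketchToken<'a> {
    /// Lazily turns plain string pieces into child tokens of `self`.
    /// Every child gets the parent's metadata unchanged (assumed behavior).
    fn to_child_segments(
        &'a self,
        texts: &'a [&'a str],
    ) -> impl Iterator<Item = SketchToken<'a>> + 'a {
        texts.iter().map(move |&piece| SketchToken {
            text: piece,
            is_known_word: self.is_known_word,
        })
    }
}

fn main() {
    let parent = SketchToken { text: "foo-bar", is_known_word: false };
    // A simple helper that only knows about strings, not tokens.
    let pieces: Vec<&str> = parent.text.split('-').collect();
    let children: Vec<SketchToken> = parent.to_child_segments(&pieces).collect();
    assert_eq!(children.len(), 2);
    assert_eq!(children[1].text, "bar");
}
```

The iterator is lazy, so nothing is built until the caller consumes it, for example with `collect()`.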
pub fn with_is_kown_word(self, is_known_word: bool) -> Self
Builder-like convenience function to set the is_known_word flag.
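A builder-like setter of this kind takes the token by value and hands it back with the flag changed. The sketch below shows the pattern on a hypothetical stand-in type; `SketchToken`, its `new` constructor, the `false` default and the tiny dictionary are all assumptions for illustration, not the crate's code.

```rust
// Hypothetical stand-in for the crate's token type.
#[derive(Debug, Clone, PartialEq)]
struct SketchToken<'a> {
    text: &'a str,
    is_known_word: bool,
}

impl<'a> SketchToken<'a> {
    /// Assumed default for this sketch: a fresh token is not a known word.
    fn new(text: &'a str) -> Self {
        SketchToken { text, is_known_word: false }
    }

    /// Consumes the token and returns it with the flag updated.
    fn with_is_known_word(mut self, is_known_word: bool) -> Self {
        self.is_known_word = is_known_word;
        self
    }
}

fn main() {
    let dictionary = ["hello", "world"];
    let token = SketchToken::new("hello");
    // Look the text up first, then record the result on the token.
    let is_known = dictionary.iter().any(|word| *word == token.text);
    let token = token.with_is_known_word(is_known);
    assert!(token.is_known_word);
}
```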
pub fn get_text_prefer_normalized(&self) -> &str
Return the normalized_text of this token if present, otherwise text, as a str.
pub fn get_text_prefer_normalized_owned(&self) -> String
This is the same as get_text_prefer_normalized(), but returns an owned String instead.
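The fallback logic of both accessors can be sketched in a few lines. This is an illustration on a hypothetical stand-in type, not the crate's implementation: `SketchToken` and the shortened method names are assumptions, and only the two text fields are modelled.

```rust
// Hypothetical stand-in with only the two text fields of the real struct.
struct SketchToken<'a> {
    text: &'a str,
    normalized_text: Option<String>,
}

impl<'a> SketchToken<'a> {
    /// Borrowing variant: normalized text if set, original text otherwise.
    fn text_prefer_normalized(&self) -> &str {
        self.normalized_text.as_deref().unwrap_or(self.text)
    }

    /// Owned variant: same choice, but allocates a new `String`.
    fn text_prefer_normalized_owned(&self) -> String {
        self.text_prefer_normalized().to_owned()
    }
}

fn main() {
    let raw = SketchToken { text: "Straße", normalized_text: None };
    let normalized = SketchToken {
        text: "Straße",
        normalized_text: Some("strasse".to_owned()),
    };

    assert_eq!(raw.text_prefer_normalized(), "Straße");
    assert_eq!(normalized.text_prefer_normalized(), "strasse");
    assert_eq!(normalized.text_prefer_normalized_owned(), String::from("strasse"));
}
```

The borrowing variant avoids an allocation, so the owned variant is mainly useful when the result has to outlive the token.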
Trait Implementations§
impl<'a> Clone for SegmentedToken<'a>
fn clone(&self) -> SegmentedToken<'a>
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.