Struct SegmentedToken 

pub struct SegmentedToken<'a> {
    pub text: &'a str,
    pub normalized_text: NormalizedText,
    pub normalization_language: Option<Lang>,
    pub kind: Option<SegmentedTokenKind>,
    pub detected_script: Option<Script>,
    pub detected_language: Option<Lang>,
    pub detected_language_confidence: f64,
    pub is_detected_language_relible: bool,
    pub is_known_word: bool,
    pub is_end_of_sentence: bool,
}

The main representation of data this crate works on.

A token is effectively text with metadata attached; this struct is the metadata carrier.

Fields§

§text: &'a str

The piece of text that this token represents.

This should be borrowed from the initial text that was fed to the segmenter chain.

§normalized_text: NormalizedText

If a normalizer’s output differs from text, the result is stored here.

It is recommended that you fetch normalized text using get_text_prefer_normalized() or get_text_prefer_normalized_owned().

§normalization_language: Option<Lang>

The language for which the normalization that produced normalized_text happened.

None means that only language-independent normalizations were applied. For language-dependent normalizations, this field is meant to ensure that they are all applied for the same language.

If it is already set to Some, it shouldn’t be changed.
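One way a language-dependent normalizer could honor this rule is to check the field before doing any work. The following is a minimal, self-contained sketch: `Lang`, `Token` and `may_normalize_for` are hypothetical stand-ins for illustration and not the crate’s API.

```rust
// Hypothetical stand-ins for illustration; not the crate's real types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Lang {
    Eng,
    Deu,
}

struct Token {
    normalization_language: Option<Lang>,
}

/// Returns true if a normalization for `lang` may run on this token
/// and records the language on first use.
fn may_normalize_for(token: &mut Token, lang: Lang) -> bool {
    match token.normalization_language {
        // No language-dependent normalization happened yet: claim the language.
        None => {
            token.normalization_language = Some(lang);
            true
        }
        // Already set: never change it, only allow the same language again.
        Some(existing) => existing == lang,
    }
}
```

With this guard, a second normalizer configured for a different language would skip the token instead of mixing normalizations.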

§kind: Option<SegmentedTokenKind>

What kind of token this is.

Set by:

§detected_script: Option<Script>

The primary script as detected by a script or language detection augmenter.

Information about detected scripts is inherited across splitting.

§detected_language: Option<Lang>

The primary language detected by a language detection augmenter.

Information about detected languages is inherited across splitting.

§detected_language_confidence: f64

How confident the language detector was about the language that it detected.

This ranges from 0 (not confident at all) to 1 (most confident).

§is_detected_language_relible: bool

Whether the language detector considers its output to be reliable.
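A consumer might combine the confidence value and the reliability flag before trusting a detected language. The sketch below assumes a reduced stand-in struct (`Detection`) and a caller-chosen threshold; `trusted_language` is a hypothetical helper, and languages are plain string slices here rather than the crate’s `Lang` type.

```rust
// Hypothetical stand-in carrying only the detection-related fields.
struct Detection<'a> {
    detected_language: Option<&'a str>,
    detected_language_confidence: f64,
    // Spelled as in the documented field name.
    is_detected_language_relible: bool,
}

/// Returns the detected language only if the detector flagged it as reliable
/// and the confidence reaches `min_confidence` (expected in 0.0..=1.0).
fn trusted_language<'a>(d: &Detection<'a>, min_confidence: f64) -> Option<&'a str> {
    if d.is_detected_language_relible && d.detected_language_confidence >= min_confidence {
        d.detected_language
    } else {
        None
    }
}
```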

§is_known_word: bool

Indicates that no further splitting is necessary.

This should be set to true if the token was a valid word in a dictionary.

§is_end_of_sentence: bool

Indicates that this token marks the end of a sentence.

This should only be set on tokens with an empty text field. It is not inherited.

Implementations§

impl<'a> SegmentedToken<'a>

pub fn new(text: &'a str, kind: Option<SegmentedTokenKind>) -> Self

Create a segmented token from scratch. (You likely won’t need it.)

If you are writing a segmenter, have a look at new_derived_from().

For creating the initial token, consider using the From implementations or the StartSegmentationChain trait.
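To illustrate how From conversions and a chain-starting trait can fit together, here is a self-contained sketch with stand-in types. `Token`, `StartChain` and `start_chain` are hypothetical names, and yielding exactly one token via `iter::once` is an assumption about how such a blanket implementation could work, not a statement about the crate’s internals.

```rust
use std::iter;

// Hypothetical, reduced stand-in for the token type.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
}

impl<'a> From<&'a str> for Token<'a> {
    fn from(value: &'a str) -> Self {
        Token { text: value }
    }
}

impl<'a> From<&'a String> for Token<'a> {
    fn from(value: &'a String) -> Self {
        Token { text: value.as_str() }
    }
}

// Assumed shape of a chain-starting trait.
trait StartChain<'a> {
    fn start_chain(self) -> impl Iterator<Item = Token<'a>>;
}

// Anything convertible into a token can start a chain
// that yields exactly that one token.
impl<'a, T> StartChain<'a> for T
where
    T: Into<Token<'a>>,
{
    fn start_chain(self) -> impl Iterator<Item = Token<'a>> {
        let token: Token<'a> = self.into();
        iter::once(token)
    }
}
```

In this sketch, both `"some text".start_chain()` and `(&owned_string).start_chain()` produce an iterator over a single token that borrows the input.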

pub fn new_derived_from(text: &'a str, from: &Self) -> Self

Create a token with a given text that inherits metadata from the from token.

This is the recommended constructor to use inside a segmenter after splitting.
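The pattern can be sketched with a reduced stand-in type. `Token` and `split_on_whitespace` below are hypothetical; which fields are inherited follows the field descriptions on this page (detected language is inherited, the end-of-sentence marker is not), and languages are plain string slices to keep the example self-contained.

```rust
// Hypothetical, reduced stand-in for the token type.
#[derive(Clone, Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
    detected_language: Option<&'static str>,
    is_end_of_sentence: bool,
}

impl<'a> Token<'a> {
    /// Builds a child token: detection metadata is copied from `from`,
    /// while the end-of-sentence marker is not inherited.
    fn new_derived_from(text: &'a str, from: &Self) -> Self {
        Token {
            text,
            detected_language: from.detected_language,
            is_end_of_sentence: false,
        }
    }
}

/// A toy segmenter that splits on whitespace and derives every
/// child token from its parent.
fn split_on_whitespace<'a>(parent: &Token<'a>) -> Vec<Token<'a>> {
    parent
        .text
        .split_whitespace()
        .map(|word| Token::new_derived_from(word, parent))
        .collect()
}
```

Because every child is derived from the parent, a language detected before splitting remains available on each resulting word.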

pub fn new_end_of_sentence(empty_text: &'a str) -> Self

Create a new token that carries an is_end_of_sentence marker.

Recommended way of deriving the empty text:

let (main, tail) = sentence.split_at(sentence.len());
SegmentedToken::new_derived_from(main, &token);
SegmentedToken::new_end_of_sentence(tail);
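The split_at() call above relies only on standard library behavior, which can be checked in isolation. The sketch below uses two hypothetical helpers, `split_off_empty_tail` and `offset_in`, to show that the tail is empty yet still points into the original sentence, which fits the requirement that text be borrowed from the initial input.

```rust
/// Splits `sentence` into its full text and an empty tail slice.
/// The tail has length zero but still points into `sentence`.
fn split_off_empty_tail(sentence: &str) -> (&str, &str) {
    sentence.split_at(sentence.len())
}

/// Byte offset of `part` inside `whole`, assuming `part` was sliced from it.
fn offset_in(whole: &str, part: &str) -> usize {
    part.as_ptr() as usize - whole.as_ptr() as usize
}
```

For a sentence of length n, the tail returned here sits at byte offset n, so an end-of-sentence token built from it keeps its position relative to the source text.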

pub fn covert_to_child_segements_of_self(&'a self, texts: &'a [&'a str]) -> impl Iterator<Item = SegmentedToken<'a>> + 'a

Helper function to convert texts that came out of a simple helper function back into segments.

Using this implies that further segmenting didn’t change anything about the metadata of the child segments.

pub fn with_is_kown_word(self, is_known_word: bool) -> Self

Builder-like convenience function to set the is_known_word flag.
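A builder-style setter takes self by value and returns it, so it can be chained onto a constructor call. The sketch below uses a reduced stand-in `Token` and a hypothetical `mark_known` helper with a slice as dictionary; only the method name is taken from the signature above.

```rust
// Hypothetical, reduced stand-in for the token type.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
    is_known_word: bool,
}

impl<'a> Token<'a> {
    fn new(text: &'a str) -> Self {
        Token { text, is_known_word: false }
    }

    /// Takes `self` by value, sets the flag and hands the token back.
    /// The name is spelled as in the documented signature.
    fn with_is_kown_word(mut self, is_known_word: bool) -> Self {
        self.is_known_word = is_known_word;
        self
    }
}

/// Creates a token and flags it if `text` appears in `dictionary`.
fn mark_known<'a>(text: &'a str, dictionary: &[&str]) -> Token<'a> {
    Token::new(text).with_is_kown_word(dictionary.contains(&text))
}
```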

pub fn get_text_prefer_normalized(&self) -> &str

Returns the normalized_text of this token if present, otherwise text, as a str.

pub fn get_text_prefer_normalized_owned(&self) -> String

This is the same as get_text_prefer_normalized(), but returns an owned String instead.

pub fn get_normalized_text(&self) -> Option<&str>

Returns the normalized text behind this token.

If the normalization is NormalizedText::NormalizedToSelf it’ll return the original text.

It will only return None if no normalization was applied.
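The three cases can be modeled with a small enum. In the sketch below only `NormalizedToSelf` is a name taken from this page; the variants `NotNormalized` and `Normalized(String)` and the reduced `Token` struct are assumptions made for illustration.

```rust
// Assumed shape: only `NormalizedToSelf` is named in the documentation,
// the other two variant names are made up for this sketch.
enum NormalizedText {
    NotNormalized,
    NormalizedToSelf,
    Normalized(String),
}

struct Token<'a> {
    text: &'a str,
    normalized_text: NormalizedText,
}

impl<'a> Token<'a> {
    /// `None` only if no normalization ran at all.
    fn get_normalized_text(&self) -> Option<&str> {
        match &self.normalized_text {
            NormalizedText::NotNormalized => None,
            NormalizedText::NormalizedToSelf => Some(self.text),
            NormalizedText::Normalized(s) => Some(s.as_str()),
        }
    }

    /// Falls back to the original text when nothing was normalized.
    fn get_text_prefer_normalized(&self) -> &str {
        self.get_normalized_text().unwrap_or(self.text)
    }
}
```

Under these assumptions, the prefer-normalized accessor never needs to allocate, while the Option-returning accessor lets callers tell "unchanged by normalization" apart from "never normalized".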

pub fn update_normalized_str(&mut self, normalized: &str, lang: Option<Lang>)

Update this token’s normalized text with an unowned &str.

If normalized already matches the unnormalized text, the normalized_text will be set to NormalizedText::NormalizedToSelf.

lang is the language that the normalization happened for; set it to None if the normalization was language independent.
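The comparison against the original text can be sketched as follows. This is a simplified model: the enum variants other than `NormalizedToSelf` are assumed names, the stand-in `Token` is reduced to two fields, and the lang parameter is left out to keep the focus on the text handling.

```rust
// Assumed shape: only `NormalizedToSelf` is named in the documentation.
#[derive(Debug, PartialEq)]
enum NormalizedText {
    NotNormalized,
    NormalizedToSelf,
    Normalized(String),
}

struct Token<'a> {
    text: &'a str,
    normalized_text: NormalizedText,
}

impl<'a> Token<'a> {
    /// Stores `normalized`, but avoids allocating a copy when the
    /// normalizer returned the original text unchanged.
    fn update_normalized_str(&mut self, normalized: &str) {
        self.normalized_text = if normalized == self.text {
            NormalizedText::NormalizedToSelf
        } else {
            NormalizedText::Normalized(normalized.to_owned())
        };
    }
}
```

A plausible motivation for the dedicated variant is that tokens which normalize to themselves do not need an owned String; that reading is an inference, not something the page states.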

pub fn update_normalized_string(&mut self, normalized: String, lang: Option<Lang>)

Update this token’s normalized text with an owned String.

If normalized already matches the unnormalized text, the normalized_text will be set to NormalizedText::NormalizedToSelf.

lang is the language that the normalization happened for; set it to None if the normalization was language independent.

pub fn update_normalization_language(&mut self, lang: Option<Lang>)

Update the normalization language. None means language independent.

pub fn was_normalized(&self) -> bool

Returns whether the text was normalized or not.

Trait Implementations§

impl<'a> Clone for SegmentedToken<'a>

fn clone(&self) -> SegmentedToken<'a>

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl<'a> Debug for SegmentedToken<'a>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl<'a> From<&'a String> for SegmentedToken<'a>

fn from(value: &'a String) -> Self

Converts to this type from the input type.

impl<'a> From<&'a str> for SegmentedToken<'a>

fn from(value: &'a str) -> Self

Converts to this type from the input type.

impl<'a> PartialEq for SegmentedToken<'a>

fn eq(&self, other: &SegmentedToken<'a>) -> bool

Tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl<'a> StructuralPartialEq for SegmentedToken<'a>

Auto Trait Implementations§

impl<'a> Freeze for SegmentedToken<'a>

impl<'a> RefUnwindSafe for SegmentedToken<'a>

impl<'a> Send for SegmentedToken<'a>

impl<'a> Sync for SegmentedToken<'a>

impl<'a> Unpin for SegmentedToken<'a>

impl<'a> UnwindSafe for SegmentedToken<'a>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<'a, T> StartSegmentationChain<'a> for T
where T: Into<SegmentedToken<'a>>,

fn start_segmentation_chain(self) -> impl Iterator<Item = SegmentedToken<'a>>

Create the iterator.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.