Struct UnicodeTokenizer

pub struct UnicodeTokenizer { /* private fields */ }

Language-agnostic Unicode-aware tokenizer.

Handles:

ASCII and multi-byte Unicode text
CJK ideographs (each character becomes a separate token)
Punctuation splitting
Optional accent stripping (approximate, pure Rust — no ICU)
Optional lowercasing

Implementations

impl UnicodeTokenizer

pub fn new(config: UnicodeTokenizerConfig) -> Self

Create a new tokenizer with the given configuration.

pub fn tokenize(&self, text: &str) -> Vec<String>

Tokenize text into a Vec<String> of Unicode-aware tokens.
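The splitting behaviour listed in the description can be illustrated with a minimal sketch. This is not the crate's implementation: `tokenize_sketch`, `cjk_char`, and `flush` are hypothetical names, whitespace splitting is an assumption, only ASCII punctuation is handled, and the optional lowercasing and accent stripping are left out.

```rust
/// Hypothetical helper: true for the three CJK ranges documented for `is_cjk`.
fn cjk_char(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}' | '\u{20000}'..='\u{2A6DF}'
    )
}

/// Hypothetical helper: move the pending word, if any, into the token list.
fn flush(current: &mut String, tokens: &mut Vec<String>) {
    if !current.is_empty() {
        tokens.push(std::mem::take(current));
    }
}

/// Sketch of the splitting rules (assumed, not taken from the source code):
/// whitespace separates tokens, and each CJK ideograph or ASCII punctuation
/// character becomes a token of its own.
fn tokenize_sketch(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            flush(&mut current, &mut tokens);
        } else if cjk_char(c) || c.is_ascii_punctuation() {
            flush(&mut current, &mut tokens);
            tokens.push(c.to_string());
        } else {
            current.push(c);
        }
    }
    flush(&mut current, &mut tokens);
    tokens
}
```

Under these assumptions, the input "Hello, 世界!" yields the five tokens "Hello", ",", "世", "界", and "!".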

pub fn encode(&self, text: &str, vocab: &HashMap<String, usize>) -> Vec<usize>

Tokenize text and convert tokens to vocabulary indices.

Tokens that are not present in vocab are silently dropped, so the returned vector may be shorter than the token sequence.
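The lookup step can be sketched as follows. `encode_tokens` is a hypothetical free function that takes already-tokenized input; it only illustrates the documented drop-on-miss behaviour and is not the crate's code.

```rust
use std::collections::HashMap;

/// Sketch: map each token to its vocabulary index, skipping tokens
/// that have no entry in `vocab`.
fn encode_tokens(tokens: &[String], vocab: &HashMap<String, usize>) -> Vec<usize> {
    tokens
        .iter()
        .filter_map(|t| vocab.get(t).copied())
        .collect()
}
```

Because misses are skipped rather than mapped to a placeholder, positions in the output do not necessarily line up with positions in the token list. A caller that needs alignment would have to reserve an index for unknown tokens itself.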

pub fn is_cjk(c: char) -> bool

Returns true for CJK Unified Ideographs and CJK Extension A/B.

Covers:

U+4E00–U+9FFF: CJK Unified Ideographs
U+3400–U+4DBF: CJK Extension A
U+20000–U+2A6DF: CJK Extension B
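The three ranges above translate directly into a range match. The function below is a sketch under a hypothetical name (`is_cjk_sketch`), not the crate's source.

```rust
/// Sketch: true only for the three documented code point ranges.
fn is_cjk_sketch(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'        // CJK Unified Ideographs
            | '\u{3400}'..='\u{4DBF}'  // CJK Extension A
            | '\u{20000}'..='\u{2A6DF}' // CJK Extension B
    )
}
```

Note that these ranges do not include kana or Hangul, so a check written this way returns false for characters such as あ (U+3042).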

pub fn is_punctuation(c: char) -> bool

Returns true for ASCII punctuation and common Unicode punctuation.
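The documentation does not say which Unicode punctuation counts as "common". The sketch below therefore makes an explicit assumption: it accepts ASCII punctuation plus the General Punctuation block (U+2000–U+206F) and the CJK Symbols and Punctuation block (U+3000–U+303F). The name `is_punctuation_sketch` and the choice of blocks are illustrative only.

```rust
/// Sketch: ASCII punctuation plus two Unicode blocks chosen for illustration.
/// The blocks are an assumption; the crate may cover a different set.
fn is_punctuation_sketch(c: char) -> bool {
    c.is_ascii_punctuation()
        || matches!(
            c,
            '\u{2000}'..='\u{206F}'      // General Punctuation (assumed)
                | '\u{3000}'..='\u{303F}' // CJK Symbols and Punctuation (assumed)
        )
}
```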

pub fn strip_accents_approx(s: &str) -> String

Approximate accent stripping: removes characters in the Unicode Combining Diacritical Marks range U+0300–U+036F.

This removes the combining marks but does not perform NFKD decomposition, so a precomposed character such as é (U+00E9) is left unchanged unless the input has already been decomposed. For Latin-script text in decomposed form the result is correct. Full NFKD would require the unicode-normalization crate.
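A filter over the documented range is enough to sketch this behaviour. `strip_combining_marks` is a hypothetical name, and the function assumes nothing beyond what the description states: characters in U+0300–U+036F are dropped and everything else is kept.

```rust
/// Sketch: drop every character in the combining diacritics range
/// U+0300..=U+036F and keep all other characters as they are.
fn strip_combining_marks(s: &str) -> String {
    s.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .collect()
}
```

With this sketch, the decomposed sequence "e\u{0301}" becomes "e", while the precomposed "\u{00E9}" is returned unchanged because no decomposition takes place.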

Trait Implementations

impl Debug for UnicodeTokenizer

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations

impl UnwindSafe for UnicodeTokenizer

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
