pub struct UnicodeTokenizer { /* private fields */ }
Language-agnostic Unicode-aware tokenizer.
Handles:
- ASCII and multi-byte Unicode text
- CJK ideographs (each character becomes a separate token)
- Punctuation splitting
- Optional accent stripping (approximate, pure Rust — no ICU)
- Optional lowercasing
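The behaviours listed above can be illustrated with a small, self-contained sketch. This is not the crate's implementation: `tokenize_sketch` and its `lowercase` flag are hypothetical names, whitespace is assumed to separate words, only ASCII punctuation is split, and accent stripping is left out.

```rust
/// Hypothetical stand-in for the tokenizer described above.
/// Assumptions: whitespace separates words, only ASCII punctuation is split,
/// and the CJK test uses the three ranges documented for `is_cjk`.
fn tokenize_sketch(text: &str, lowercase: bool) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        let is_cjk = matches!(
            c,
            '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}' | '\u{20000}'..='\u{2A6DF}'
        );
        if c.is_whitespace() {
            // Whitespace ends the pending word and is discarded.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if is_cjk || c.is_ascii_punctuation() {
            // Flush the pending word, then emit this character as its own token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else if lowercase {
            // `to_lowercase` may yield more than one char for some inputs.
            current.extend(c.to_lowercase());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Under these assumptions, `tokenize_sketch("Hello, 世界!", true)` yields the five tokens `hello`, `,`, `世`, `界`, `!`.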
Implementations§
impl UnicodeTokenizer
pub fn new(config: UnicodeTokenizerConfig) -> Self
Create a new tokenizer with the given configuration.
pub fn tokenize(&self, text: &str) -> Vec<String>
Tokenize text into a Vec<String> of Unicode-aware tokens.
pub fn encode(&self, text: &str, vocab: &HashMap<String, usize>) -> Vec<usize>
Tokenize text and convert tokens to vocabulary indices.
Tokens that are not present in vocab are silently dropped, so the
returned Vec<usize> may be shorter than the token sequence.
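The drop-on-miss behaviour can be sketched as a lookup that skips missing keys. `encode_sketch` is a hypothetical helper, not the crate's method: it takes tokens that have already been produced rather than calling `tokenize`.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the lookup step: map each token to its vocabulary
/// index and skip tokens that have no entry.
fn encode_sketch(tokens: &[String], vocab: &HashMap<String, usize>) -> Vec<usize> {
    tokens
        .iter()
        .filter_map(|t| vocab.get(t).copied()) // `None` (unknown token) is dropped
        .collect()
}
```

Because misses are skipped rather than mapped to an unknown-token id, positions in the output do not line up with positions in the token list once any token is missing from the vocabulary.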
pub fn is_cjk(c: char) -> bool
Returns true for CJK Unified Ideographs and CJK Extension A/B.
Covers:
- U+4E00–U+9FFF: CJK Unified Ideographs
- U+3400–U+4DBF: CJK Extension A
- U+20000–U+2A6DF: CJK Extension B
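A range check equivalent to the list above might look like the following. `is_cjk_sketch` is a hypothetical name, and the body is built only from the three documented ranges, not from the crate's source.

```rust
/// Hypothetical stand-in for `UnicodeTokenizer::is_cjk`, using only the
/// three ranges documented above.
fn is_cjk_sketch(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'         // CJK Unified Ideographs
            | '\u{3400}'..='\u{4DBF}'   // CJK Extension A
            | '\u{20000}'..='\u{2A6DF}' // CJK Extension B
    )
}
```

With these ranges, `is_cjk_sketch('中')` is true, while kana such as `'あ'` (U+3042) and Hangul syllables fall outside all three blocks and return false.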
pub fn is_punctuation(c: char) -> bool
Returns true for ASCII punctuation and common Unicode punctuation.
pub fn strip_accents_approx(s: &str) -> String
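Approximate accent stripping: removes characters in the Unicode
Combining Diacritical Marks block, U+0300–U+036F.
This removes the combining marks but does not perform NFKD
decomposition. For Latin-script text that is already in decomposed
form the result is correct; precomposed characters such as é (U+00E9)
are not decomposed first and therefore pass through unchanged.
Full NFKD would require the unicode-normalization crate.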
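The filtering step described here can be sketched in a few lines. `strip_combining_marks` is a hypothetical name, and the body follows only the documented range, not the crate's source.

```rust
/// Hypothetical sketch of the documented behaviour: drop every char in
/// U+0300–U+036F and keep everything else untouched.
fn strip_combining_marks(s: &str) -> String {
    s.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .collect()
}
```

In this sketch the decomposed sequence `"e\u{0301}"` becomes `"e"`, whereas the precomposed `"\u{00E9}"` is returned as is, because no decomposition happens before the filter runs.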
Trait Implementations§
Auto Trait Implementations§
impl Freeze for UnicodeTokenizer
impl RefUnwindSafe for UnicodeTokenizer
impl Send for UnicodeTokenizer
impl Sync for UnicodeTokenizer
impl Unpin for UnicodeTokenizer
impl UnsafeUnpin for UnicodeTokenizer
impl UnwindSafe for UnicodeTokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the
equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks.
Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.