pub struct UnicodeTokenizer { /* private fields */ }
Language-agnostic Unicode tokenizer.
Works for any writing system:
- CJK ideographs become individual tokens (spaces inserted around them).
- Optional accent stripping via a pure-Rust NFD approximation.
- Optional lowercasing.
- Punctuation splitting (Unicode category Po/Ps/Pe/…).
No external Unicode library is required.
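As a rough illustration of what a dependency-free accent-stripping approximation can look like, here is a minimal sketch. It is not the crate's implementation: the function names `strip_accent` and `strip_accents_sketch` are hypothetical, the table covers only lowercase Latin-1 letters, and it assumes lowercasing has already been applied.

```rust
/// Hypothetical helper: map a precomposed lowercase Latin-1 letter to its
/// unaccented base letter. Anything outside this small table is returned
/// unchanged, so this only approximates NFD decomposition.
fn strip_accent(c: char) -> char {
    match c {
        '\u{00E0}'..='\u{00E5}' => 'a', // a with grave .. a with ring
        '\u{00E7}' => 'c',              // c with cedilla
        '\u{00E8}'..='\u{00EB}' => 'e', // e with grave .. e with diaeresis
        '\u{00EC}'..='\u{00EF}' => 'i', // i with grave .. i with diaeresis
        '\u{00F1}' => 'n',              // n with tilde
        '\u{00F2}'..='\u{00F6}' => 'o', // o with grave .. o with diaeresis
        '\u{00F9}'..='\u{00FC}' => 'u', // u with grave .. u with diaeresis
        '\u{00FD}' | '\u{00FF}' => 'y', // y with acute, y with diaeresis
        other => other,
    }
}

/// Hypothetical sketch of accent stripping without a Unicode library.
fn strip_accents_sketch(text: &str) -> String {
    text.chars()
        // Drop combining diacritical marks (U+0300..=U+036F), which is what
        // remains of an accent when the input is already decomposed.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        // Fold precomposed letters through the lookup table.
        .map(strip_accent)
        .collect()
}
```

A table like this trades coverage for zero dependencies: scripts and letters outside the table pass through untouched.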
Implementations
impl UnicodeTokenizer
pub fn new(config: UnicodeTokenizerConfig) -> Self
Create a new tokenizer with the given configuration.
pub fn default_tokenizer() -> Self
Create a tokenizer with sensible defaults: lowercase=true, strip_accents=true, split on whitespace + punctuation.
pub fn tokenize(&self, text: &str) -> Vec<String>
Tokenize text into a list of tokens (Unicode-aware).
Processing order:
- Insert spaces around CJK characters.
- Optionally lowercase.
- Optionally strip accents.
- Split on whitespace (always) and optionally on punctuation.
- Discard empty tokens and enforce max_token_length.
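The processing order above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the crate's code: `is_cjk` and `tokenize_sketch` are hypothetical names, the CJK check covers only U+4E00..=U+9FFF, punctuation is limited to ASCII and emitted as its own token, accent stripping is left out, and over-long tokens are assumed to be dropped rather than truncated.

```rust
/// Hypothetical CJK check: only the CJK Unified Ideographs block.
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hypothetical sketch of the documented pipeline (accent stripping omitted).
fn tokenize_sketch(text: &str, lowercase: bool, split_punct: bool, max_len: usize) -> Vec<String> {
    // Step 1: insert spaces around CJK characters so each one splits off.
    let mut spaced = String::with_capacity(text.len());
    for c in text.chars() {
        if is_cjk(c) {
            spaced.push(' ');
            spaced.push(c);
            spaced.push(' ');
        } else {
            spaced.push(c);
        }
    }

    // Step 2: optional lowercasing.
    let normalized = if lowercase { spaced.to_lowercase() } else { spaced };

    // Step 4: split on whitespace, then optionally on punctuation.
    // `split_whitespace` never yields empty pieces.
    let mut tokens = Vec::new();
    for word in normalized.split_whitespace() {
        let mut current = String::new();
        for c in word.chars() {
            if split_punct && c.is_ascii_punctuation() {
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
                // Assumption: punctuation is kept as a separate token.
                tokens.push(c.to_string());
            } else {
                current.push(c);
            }
        }
        if !current.is_empty() {
            tokens.push(current);
        }
    }

    // Step 5: assumption: tokens longer than `max_len` characters are dropped.
    tokens.retain(|t| t.chars().count() <= max_len);
    tokens
}
```

Doing the CJK spacing first means the later whitespace split needs no script-specific logic.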
pub fn detect_script(&self, text: &str) -> ScriptFamily
Detect the dominant script family of text by majority vote over
non-whitespace characters.
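A majority vote of this kind might look like the following sketch. The `Script` enum, `classify`, and `detect_script_sketch` are hypothetical stand-ins (the variants of the real `ScriptFamily` are not shown on this page), the code-point ranges are deliberately coarse, and the tie-breaking rule is an assumption.

```rust
/// Hypothetical stand-in for `ScriptFamily`; the real variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Cyrillic,
    Cjk,
    Other,
}

/// Variants in discriminant order, so `ALL[s as usize] == s`.
const ALL: [Script; 4] = [Script::Latin, Script::Cyrillic, Script::Cjk, Script::Other];

/// Coarse per-character classification by code-point range.
fn classify(c: char) -> Script {
    match c {
        'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Script::Latin,
        '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
        '\u{4E00}'..='\u{9FFF}' => Script::Cjk,
        _ => Script::Other,
    }
}

/// Hypothetical sketch: every non-whitespace character casts one vote.
fn detect_script_sketch(text: &str) -> Script {
    let mut counts = [0usize; 4];
    for c in text.chars().filter(|c| !c.is_whitespace()) {
        counts[classify(c) as usize] += 1;
    }

    // Pick the strictly largest count; ties go to the earlier variant in
    // `ALL`, and empty input falls back to `Other`.
    let mut best = Script::Other;
    let mut best_count = 0;
    for (i, &n) in counts.iter().enumerate() {
        if n > best_count {
            best = ALL[i];
            best_count = n;
        }
    }
    best
}
```

Note that in this sketch digits and punctuation vote for `Other`, so punctuation-heavy input can skew the result.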
pub fn tokenize_cjk(&self, text: &str) -> Vec<String>
Tokenize a (potentially mixed) text where CJK characters each become their own token, while non-CJK sequences are tokenized by whitespace.
Trait Implementations§
Auto Trait Implementations§
impl Freeze for UnicodeTokenizer
impl RefUnwindSafe for UnicodeTokenizer
impl Send for UnicodeTokenizer
impl Sync for UnicodeTokenizer
impl Unpin for UnicodeTokenizer
impl UnsafeUnpin for UnicodeTokenizer
impl UnwindSafe for UnicodeTokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP where
SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.