pub struct UnicodeNormalizer { /* private fields */ }
Unicode-aware text normalizer.
Supports NFC/NFD normalization, accent stripping, case folding, and language-agnostic CJK tokenization.
§Example
use scirs2_text::tokenization::unicode_normalizer::{UnicodeNormalizer, UnicodeNormalizerConfig, NormForm};
let config = UnicodeNormalizerConfig {
form: NormForm::Nfc,
strip_accents: true,
lowercase: true,
tokenize_cjk: true,
};
let normalizer = UnicodeNormalizer::new(config);
let tokens = normalizer.tokenize_language_agnostic("Héllo 世界");
assert!(tokens.len() >= 3); // "hello", "世", "界"
Implementations§
impl UnicodeNormalizer
pub fn new(config: UnicodeNormalizerConfig) -> Self
Create a new UnicodeNormalizer with the given configuration.
pub fn default_normalizer() -> Self
Create a normalizer with default settings.
pub fn normalize(&self, text: &str) -> String
Normalize text according to the configuration.
Steps applied in order:
- Lowercase (if configured)
- NFD decomposition + accent stripping (if configured)
- NFC composition (if configured), applied after any NFD-based accent stripping
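The ordering of the first two steps can be illustrated with a minimal, std-only sketch. It is not the crate's implementation: `normalize_sketch` and `is_combining_mark` are hypothetical names, the input is assumed to be already NFD-decomposed (the standard library has no Unicode normalization tables), and only the Combining Diacritical Marks block (U+0300..=U+036F) is treated as an accent. The final NFC composition step is omitted for the same reason.

```rust
/// Hypothetical helper: true for code points in the Combining Diacritical
/// Marks block. A real accent stripper would likely cover more ranges.
fn is_combining_mark(c: char) -> bool {
    matches!(c as u32, 0x0300..=0x036F)
}

/// Hypothetical sketch of the step order: lowercase first, then strip accents.
/// Assumes `text` is already in NFD form, so accents are separate code points.
fn normalize_sketch(text: &str, lowercase: bool, strip_accents: bool) -> String {
    // Step 1: lowercase, if configured.
    let lowered = if lowercase {
        text.to_lowercase()
    } else {
        text.to_owned()
    };
    // Step 2: drop combining marks, if configured.
    if strip_accents {
        lowered.chars().filter(|c| !is_combining_mark(*c)).collect()
    } else {
        lowered
    }
}
```

For example, calling `normalize_sketch("He\u{0301}llo", true, true)` on the decomposed form of "Héllo" lowercases the base letters and then removes the combining acute accent.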
pub fn tokenize_language_agnostic(&self, text: &str) -> Vec<String>
Tokenize text in a language-agnostic manner.
Algorithm:
- Normalize the text.
- Insert whitespace around CJK characters (when tokenize_cjk is set).
- Split on Unicode whitespace.
- Filter empty tokens.
This approach works across scripts without any language-specific logic.
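The splitting steps of the algorithm above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `tokenize_sketch` and `is_cjk` are hypothetical names, the normalization step is skipped, and the set of code point ranges treated as CJK is a guess that the real implementation may extend.

```rust
/// Hypothetical helper: true for a few common CJK ideograph blocks.
/// The exact ranges used by the crate are not documented here.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF       // CJK Unified Ideographs
            | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
            | 0xF900..=0xFAFF // CJK Compatibility Ideographs
    )
}

/// Hypothetical sketch: pad CJK characters with spaces, then split.
fn tokenize_sketch(text: &str, tokenize_cjk: bool) -> Vec<String> {
    let mut spaced = String::with_capacity(text.len() * 2);
    for c in text.chars() {
        if tokenize_cjk && is_cjk(c) {
            // Surround each CJK character with whitespace so it becomes
            // its own token after splitting.
            spaced.push(' ');
            spaced.push(c);
            spaced.push(' ');
        } else {
            spaced.push(c);
        }
    }
    // `split_whitespace` splits on Unicode whitespace and never yields
    // empty substrings, which covers the "filter empty tokens" step.
    spaced.split_whitespace().map(str::to_owned).collect()
}
```

With `tokenize_cjk` enabled, `tokenize_sketch("hello 世界", true)` yields one token for the Latin word and one per ideograph; with it disabled, "世界" stays a single token.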
pub fn config(&self) -> &UnicodeNormalizerConfig
Return the configuration.
Trait Implementations§
impl Clone for UnicodeNormalizer
fn clone(&self) -> UnicodeNormalizer
Returns a duplicate of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for UnicodeNormalizer
Auto Trait Implementations§
impl Freeze for UnicodeNormalizer
impl RefUnwindSafe for UnicodeNormalizer
impl Send for UnicodeNormalizer
impl Sync for UnicodeNormalizer
impl Unpin for UnicodeNormalizer
impl UnsafeUnpin for UnicodeNormalizer
impl UnwindSafe for UnicodeNormalizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.
impl<T> CloneToUninit for T
where
T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<SS, SP> SupersetOf<SS> for SP
where
SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.