Struct tantivy::tokenizer::SplitCompoundWords
pub struct SplitCompoundWords { /* private fields */ }
A TokenFilter which splits compound words into their parts based on a given dictionary.
Words will only be split if they can be fully decomposed into consecutive matches against the given dictionary.
This is mostly useful to split compound nouns common to many Germanic languages into their constituents.
§Example
The quality of the dictionary determines the quality of the splits: for example, because the stem “back” of “backen” is missing from the dictionary, “brotbackautomat” is not split in the following example.
use tantivy::tokenizer::{SimpleTokenizer, SplitCompoundWords, TextAnalyzer};

let mut tokenizer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(
        SplitCompoundWords::from_dictionary([
            "dampf", "schiff", "fahrt", "brot", "backen", "automat",
        ])
        .unwrap(),
    )
    .build();

{
    let mut stream = tokenizer.token_stream("dampfschifffahrt");
    assert_eq!(stream.next().unwrap().text, "dampf");
    assert_eq!(stream.next().unwrap().text, "schiff");
    assert_eq!(stream.next().unwrap().text, "fahrt");
    assert_eq!(stream.next(), None);
}

let mut stream = tokenizer.token_stream("brotbackautomat");
assert_eq!(stream.next().unwrap().text, "brotbackautomat");
assert_eq!(stream.next(), None);
Implementations§
impl SplitCompoundWords
pub fn from_dictionary<I, P>(dict: I) -> Result<Self>
Create a filter from a given dictionary.
The dictionary will be used to construct an AhoCorasick automaton with reasonable defaults. See from_automaton if more control over its construction is required.
pub fn from_automaton(dict: AhoCorasick) -> Self
Create a filter from a given automaton.
The automaton should use one of the leftmost-first match kinds and it should not be anchored.
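For instance, a suitable automaton could be built along these lines (a sketch assuming the aho-corasick 1.x builder API; the dictionary entries are illustrative):

```rust
use aho_corasick::{AhoCorasick, MatchKind};
use tantivy::tokenizer::SplitCompoundWords;

// Build an unanchored automaton with a leftmost match kind, as
// required by `from_automaton`. LeftmostLongest prefers the longest
// dictionary entry when matches overlap.
let dict = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostLongest)
    .build(["dampf", "schiff", "fahrt"])
    .unwrap();

let filter = SplitCompoundWords::from_automaton(dict);
```

The resulting filter behaves like one built with from_dictionary, except that the match kind and other automaton settings are under the caller's control.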
Trait Implementations§
impl Clone for SplitCompoundWords
fn clone(&self) -> SplitCompoundWords
1.0.0 · fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl TokenFilter for SplitCompoundWords
Auto Trait Implementations§
impl Freeze for SplitCompoundWords
impl RefUnwindSafe for SplitCompoundWords
impl Send for SplitCompoundWords
impl Sync for SplitCompoundWords
impl Unpin for SplitCompoundWords
impl UnwindSafe for SplitCompoundWords
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Downcast for T
where
    T: Any,
fn into_any(self: Box<T>) -> Box<dyn Any>
Converts Box<dyn Trait> (where Trait: Downcast) to Box<dyn Any>, which can then be further downcast into Box<ConcreteType> where ConcreteType implements Trait.
fn into_any_rc(self: Rc<T>) -> Rc<dyn Any>
Converts Rc<Trait> (where Trait: Downcast) to Rc<Any>, which can then be further downcast into Rc<ConcreteType> where ConcreteType implements Trait.
fn as_any(&self) -> &(dyn Any + 'static)
Converts &Trait (where Trait: Downcast) to &Any. This is needed since Rust cannot generate &Any’s vtable from &Trait’s.
fn as_any_mut(&mut self) -> &mut (dyn Any + 'static)
Converts &mut Trait (where Trait: Downcast) to &Any. This is needed since Rust cannot generate &mut Any’s vtable from &mut Trait’s.