Struct tantivy::tokenizer::SplitCompoundWords

pub struct SplitCompoundWords { /* private fields */ }

A TokenFilter which splits compound words into their parts based on a given dictionary.

Words will only be split if they can be fully decomposed into consecutive matches from the given dictionary.

This is mostly useful to split compound nouns common to many Germanic languages into their constituents.

Example

The quality of the dictionary determines the quality of the splits. In the following example, the dictionary contains “backen” but not its stem “back”, so “brotbackautomat” is not split.

use tantivy::tokenizer::{SimpleTokenizer, SplitCompoundWords, TextAnalyzer};

// Build an analyzer whose filter splits tokens using the given dictionary.
let mut tokenizer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(
        SplitCompoundWords::from_dictionary([
            "dampf", "schiff", "fahrt", "brot", "backen", "automat",
        ])
        .unwrap(),
    )
    .build();
{
    // Every part is in the dictionary, so the word is split.
    let mut stream = tokenizer.token_stream("dampfschifffahrt");
    assert_eq!(stream.next().unwrap().text, "dampf");
    assert_eq!(stream.next().unwrap().text, "schiff");
    assert_eq!(stream.next().unwrap().text, "fahrt");
    assert_eq!(stream.next(), None);
}
// "back" is not in the dictionary, so the word is emitted unchanged.
let mut stream = tokenizer.token_stream("brotbackautomat");
assert_eq!(stream.next().unwrap().text, "brotbackautomat");
assert_eq!(stream.next(), None);
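
The "fully decomposed into consecutive matches" rule can also be illustrated without tantivy. What follows is a minimal standard-library sketch, not the crate's implementation: `split_compound` is a hypothetical helper, and it assumes that at each position the longest dictionary entry matching the remaining text is taken. If no entry matches at some position, the whole word is rejected.

```rust
/// Hypothetical helper (not part of tantivy): split `word` into consecutive
/// dictionary entries, or return `None` if the word cannot be fully covered.
fn split_compound<'a>(dict: &[&str], word: &'a str) -> Option<Vec<&'a str>> {
    let mut parts = Vec::new();
    let mut rest = word;
    while !rest.is_empty() {
        // Assumption: prefer the longest entry that is a prefix of `rest`.
        // `?` aborts the whole split when nothing matches at this position.
        let len = dict
            .iter()
            .filter(|entry| !entry.is_empty() && rest.starts_with(**entry))
            .map(|entry| entry.len())
            .max()?;
        // `len` is the byte length of a matched prefix, so it is a valid
        // char boundary for `split_at`.
        let (head, tail) = rest.split_at(len);
        parts.push(head);
        rest = tail;
    }
    Some(parts)
}
```

With the dictionary from the example above, `split_compound(&dict, "dampfschifffahrt")` returns the three parts, while `"brotbackautomat"` returns `None`, which corresponds to the filter leaving the token unchanged.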

Implementations

impl SplitCompoundWords

pub fn from_dictionary<I, P>(dict: I) -> Result<Self>
where I: IntoIterator<Item = P>, P: AsRef<[u8]>,

Create a filter from a given dictionary.

The dictionary will be used to construct an AhoCorasick automaton with reasonable defaults. See from_automaton if more control over its construction is required.

pub fn from_automaton(dict: AhoCorasick) -> Self

Create a filter from a given automaton.

The automaton should use one of the leftmost match kinds (leftmost-first or leftmost-longest) and should not be anchored.
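
How the choice between preferring earlier patterns and preferring longer ones can change a split can be illustrated with a small standard-library model. This sketch does not use the aho-corasick crate: `Prefer` and `split_with` are hypothetical names, and the two variants only mimic the documented preferences of leftmost-first matching (the earlier pattern wins) and leftmost-longest matching (the longer pattern wins).

```rust
/// Hypothetical stand-in for the two leftmost match preferences.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Prefer {
    /// Among entries matching at a position, take the one listed first.
    ListedFirst,
    /// Among entries matching at a position, take the longest one.
    Longest,
}

/// Hypothetical helper: split `word` into consecutive dictionary entries
/// using the given preference, or return `None` if it cannot be covered.
fn split_with<'a>(dict: &[&str], word: &'a str, prefer: Prefer) -> Option<Vec<&'a str>> {
    let mut parts = Vec::new();
    let mut rest = word;
    while !rest.is_empty() {
        // Lengths of all entries that match at the current position,
        // in dictionary order.
        let mut candidates = dict
            .iter()
            .filter(|entry| !entry.is_empty() && rest.starts_with(**entry))
            .map(|entry| entry.len());
        let len = match prefer {
            Prefer::ListedFirst => candidates.next(),
            Prefer::Longest => candidates.max(),
        }?;
        let (head, tail) = rest.split_at(len);
        parts.push(head);
        rest = tail;
    }
    Some(parts)
}
```

For the dictionary `["dampf", "schiff", "schifffahrt"]` and the word "dampfschifffahrt", `Prefer::ListedFirst` picks "schiff", is left with "fahrt", and fails. `Prefer::Longest` picks "schifffahrt" and succeeds. Under this model, the order of dictionary entries matters only for the first-listed preference.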

Trait Implementations

impl Clone for SplitCompoundWords

fn clone(&self) -> SplitCompoundWords

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl TokenFilter for SplitCompoundWords

type Tokenizer<T: Tokenizer> = SplitCompoundWordsFilter<T>

The Tokenizer type returned by this filter, typically parametrized by the underlying Tokenizer.

fn transform<T: Tokenizer>(self, tokenizer: T) -> SplitCompoundWordsFilter<T>

Wraps a Tokenizer and returns a new one.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> Downcast for T
where T: Any,

fn into_any(self: Box<T>) -> Box<dyn Any>

Convert Box<dyn Trait> (where Trait: Downcast) to Box<dyn Any>. Box<dyn Any> can then be further downcast into Box<ConcreteType> where ConcreteType implements Trait.

fn into_any_rc(self: Rc<T>) -> Rc<dyn Any>

Convert Rc<Trait> (where Trait: Downcast) to Rc<Any>. Rc<Any> can then be further downcast into Rc<ConcreteType> where ConcreteType implements Trait.

fn as_any(&self) -> &(dyn Any + 'static)

Convert &Trait (where Trait: Downcast) to &Any. This is needed since Rust cannot generate &Any’s vtable from &Trait’s.

fn as_any_mut(&mut self) -> &mut (dyn Any + 'static)

Convert &mut Trait (where Trait: Downcast) to &mut Any. This is needed since Rust cannot generate &mut Any’s vtable from &mut Trait’s.

impl<T> DowncastSync for T
where T: Any + Send + Sync,

fn into_any_arc(self: Arc<T>) -> Arc<dyn Any + Sync + Send>

Convert Arc<Trait> (where Trait: Downcast) to Arc<Any>. Arc<Any> can then be further downcast into Arc<ConcreteType> where ConcreteType implements Trait.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Pointable for T

const ALIGN: usize = _

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> Fruit for T
where T: Send + Downcast,