Struct tantivy::tokenizer::NgramTokenizer

pub struct NgramTokenizer { /* fields omitted */ }

Tokenizes the text by splitting words into n-grams of the given size(s).

With this tokenizer, the position is always 0. Beware, however: when a field has multiple values, the position will be POSITION_GAP * index of value.
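
The position rule above is plain arithmetic, and can be sketched as follows. This is a std-only illustration, not tantivy code: `positions_for_values` is a hypothetical helper, and the gap is passed in as a parameter because the actual value of POSITION_GAP is not stated here.

```rust
/// Hypothetical helper, not part of tantivy: returns the position shared by
/// all n-gram tokens of each value in a multi-valued field.
/// `position_gap` stands in for POSITION_GAP, whose value is assumed unknown.
fn positions_for_values(position_gap: usize, num_values: usize) -> Vec<usize> {
    // Every token of the value at `index` gets position `position_gap * index`.
    (0..num_values).map(|index| position_gap * index).collect()
}
```

For a single-valued field this gives one entry, 0, which matches the Position rows in the examples below.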

Example 1: hello would be tokenized as (min_gram: 2, max_gram: 3, prefix_only: false)

Term      he   hel  el   ell  ll   llo  lo
Position  0    0    0    0    0    0    0
Offsets   0,2  0,3  1,3  1,4  2,4  2,5  3,5
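
The order of the terms in Example 1 (every start position in turn, shortest n-gram first) can be reproduced with a small std-only sketch. `all_ngrams` here is a hypothetical free function written for illustration, not tantivy's implementation; it assumes gram sizes are counted in characters and offsets in bytes.

```rust
/// Hypothetical helper, not tantivy's implementation: returns every n-gram of
/// `text` whose length in characters lies in `min_gram..=max_gram`, as
/// `(term, offset_from, offset_to)` with byte offsets.
fn all_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<(String, usize, usize)> {
    // Byte index of every character boundary, including the end of the string.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let num_chars = bounds.len() - 1;

    let mut out = Vec::new();
    for start in 0..num_chars {
        for len in min_gram..=max_gram {
            let end = start + len;
            if end > num_chars {
                // Longer n-grams from this start would run past the text.
                break;
            }
            let (from, to) = (bounds[start], bounds[end]);
            out.push((text[from..to].to_string(), from, to));
        }
    }
    out
}
```

Under these assumptions, `all_ngrams("hello", 2, 3)` yields he, hel, el, ell, ll, llo, lo with the offsets listed in the table.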

Example 2: hello would be tokenized as (min_gram: 2, max_gram: 5, prefix_only: true)

Term      he   hel  hell  hello
Position  0    0    0     0
Offsets   0,2  0,3  0,4   0,5
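
With prefix_only set, only n-grams starting at offset 0 are kept, as Example 2 shows. A minimal std-only sketch of that behaviour follows; `prefix_ngrams` is a hypothetical helper and not tantivy code. It returns each term with its end offset, since the start offset is always 0.

```rust
/// Hypothetical helper, not tantivy's implementation: returns the prefixes of
/// `text` that are between `min_gram` and `max_gram` characters long, each
/// paired with its end byte offset (the start offset is always 0).
fn prefix_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<(&str, usize)> {
    text.char_indices()
        // Byte offset just past each character.
        .map(|(i, c)| i + c.len_utf8())
        .enumerate()
        // Pair each end offset with the prefix length in characters.
        .map(|(idx, end)| (idx + 1, end))
        .filter(|&(num_chars, _)| num_chars >= min_gram && num_chars <= max_gram)
        .map(|(_, end)| (&text[..end], end))
        .collect()
}
```

Prefix n-grams of this kind are typically what a search-as-you-type field needs, because a partially typed query is always a prefix of the indexed word.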

Example 3: hεllo (non-ascii) would be tokenized as (min_gram: 2, max_gram: 5, prefix_only: true)

Term      hε   hεl  hεll  hεllo
Position  0    0    0     0
Offsets   0,3  0,4  0,5   0,6
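
The offsets in Example 3 are byte offsets into the UTF-8 text, not character counts: ε occupies two bytes, so the two-character prefix hε ends at byte 3. The sketch below uses only the standard library to show where each character ends; `char_end_offsets` is a hypothetical helper introduced for illustration.

```rust
/// Hypothetical helper: byte offset just past each character of `text`.
fn char_end_offsets(text: &str) -> Vec<usize> {
    text.char_indices()
        // `len_utf8` is 1 for ASCII characters and 2 for 'ε'.
        .map(|(start, c)| start + c.len_utf8())
        .collect()
}
```

For "hεllo" this gives 1, 3, 4, 5, 6; dropping the first entry (a one-character prefix is below min_gram) leaves the end offsets 3, 4, 5, 6 shown in the table.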

Example

use tantivy::tokenizer::*;

let tokenizer = NgramTokenizer::new(2, 3, false);
let mut stream = tokenizer.token_stream("hello");
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "he");
    assert_eq!(token.offset_from, 0);
    assert_eq!(token.offset_to, 2);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "hel");
    assert_eq!(token.offset_from, 0);
    assert_eq!(token.offset_to, 3);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "el");
    assert_eq!(token.offset_from, 1);
    assert_eq!(token.offset_to, 3);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "ell");
    assert_eq!(token.offset_from, 1);
    assert_eq!(token.offset_to, 4);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "ll");
    assert_eq!(token.offset_from, 2);
    assert_eq!(token.offset_to, 4);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "llo");
    assert_eq!(token.offset_from, 2);
    assert_eq!(token.offset_to, 5);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "lo");
    assert_eq!(token.offset_from, 3);
    assert_eq!(token.offset_to, 5);
}
assert!(stream.next().is_none());

Implementations

Configures a new n-gram tokenizer.

Creates an NgramTokenizer which generates tokens for all inner n-grams.

This is as opposed to only prefix n-grams.

Creates an NgramTokenizer which only generates tokens for the prefix n-grams.
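
The three constructors described above differ only in how the prefix flag is set. The sketch below illustrates that relationship with a stand-in type. `NgramConfig` is hypothetical and is not tantivy's struct; the method names `all_ngrams` and `prefix_only` are assumptions, since the signatures are not shown on this page. Only the three-argument `new` is confirmed by the example above.

```rust
/// Hypothetical stand-in for the tokenizer's configuration; not tantivy's type.
#[derive(Clone, Debug, PartialEq)]
struct NgramConfig {
    min_gram: usize,
    max_gram: usize,
    prefix_only: bool,
}

impl NgramConfig {
    /// General constructor: the caller chooses the prefix flag explicitly.
    fn new(min_gram: usize, max_gram: usize, prefix_only: bool) -> Self {
        NgramConfig { min_gram, max_gram, prefix_only }
    }

    /// Assumed shorthand: tokens for all inner n-grams.
    fn all_ngrams(min_gram: usize, max_gram: usize) -> Self {
        Self::new(min_gram, max_gram, false)
    }

    /// Assumed shorthand: tokens for prefix n-grams only.
    fn prefix_only(min_gram: usize, max_gram: usize) -> Self {
        Self::new(min_gram, max_gram, true)
    }
}
```

Named constructors of this kind make call sites easier to read than a bare boolean argument.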

Trait Implementations

Returns a copy of the value.

Performs copy-assignment from source.

Creates a token stream for a given str.
