Struct tantivy::tokenizer::NgramTokenizer
pub struct NgramTokenizer { /* fields omitted */ }
Tokenizes the text by splitting words into n-grams of the given size(s).

With this tokenizer, the position is always 0. Beware, however: in the presence of multiple values for the same field, the position will be `POSITION_GAP` * the index of the value.
Example 1: `hello` would be tokenized as (min_gram: 2, max_gram: 3, prefix_only: false):

| Term | he | hel | el | ell | ll | llo | lo |
|---|---|---|---|---|---|---|---|
| Position | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Offsets | 0,2 | 0,3 | 1,3 | 1,4 | 2,4 | 2,5 | 3,5 |
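The enumeration behind the table above can be reproduced with a small standalone sketch. Note this is an illustration of the behavior, not tantivy's actual implementation; the function name `all_ngrams` is chosen here for clarity. Offsets are byte offsets, so the sketch walks char boundaries rather than byte indices.

```rust
// Illustrative sketch (not tantivy's code): enumerate all inner n-grams of
// length min_gram..=max_gram (in chars), reporting byte offsets.
fn all_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<(String, usize, usize)> {
    // Byte index of every char boundary, plus the end of the string.
    let mut boundaries: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    boundaries.push(text.len());
    let n_chars = boundaries.len() - 1;

    let mut out = Vec::new();
    for start in 0..n_chars {
        for len in min_gram..=max_gram {
            if start + len > n_chars {
                break;
            }
            let (from, to) = (boundaries[start], boundaries[start + len]);
            out.push((text[from..to].to_string(), from, to));
        }
    }
    out
}

fn main() {
    // Matches the (min_gram: 2, max_gram: 3, prefix_only: false) table:
    // he 0,2 / hel 0,3 / el 1,3 / ell 1,4 / ll 2,4 / llo 2,5 / lo 3,5
    for (gram, from, to) in all_ngrams("hello", 2, 3) {
        println!("{} {},{}", gram, from, to);
    }
}
```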
Example 2: `hello` would be tokenized as (min_gram: 2, max_gram: 5, prefix_only: true):

| Term | he | hel | hell | hello |
|---|---|---|---|---|
| Position | 0 | 0 | 0 | 0 |
| Offsets | 0,2 | 0,3 | 0,4 | 0,5 |
Example 3: `hεllo` (non-ASCII) would be tokenized as (min_gram: 2, max_gram: 5, prefix_only: true). Note that the offsets are byte offsets: `ε` occupies two bytes in UTF-8.

| Term | hε | hεl | hεll | hεllo |
|---|---|---|---|---|
| Position | 0 | 0 | 0 | 0 |
| Offsets | 0,3 | 0,4 | 0,5 | 0,6 |
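The byte-offset behavior for prefix n-grams can likewise be sketched in standalone Rust (again an illustration, not tantivy's implementation; `prefix_ngrams` is a name chosen here). Slicing at char boundaries is what makes `hε` span bytes 0..3.

```rust
// Illustrative sketch (not tantivy's code): generate only prefix n-grams of
// length min_gram..=max_gram (in chars), reporting byte offsets.
fn prefix_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<(String, usize, usize)> {
    // Byte index of every char boundary, plus the end of the string.
    let mut boundaries: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    boundaries.push(text.len());
    let n_chars = boundaries.len() - 1;

    let mut out = Vec::new();
    for len in min_gram..=max_gram.min(n_chars) {
        let (from, to) = (boundaries[0], boundaries[len]);
        out.push((text[from..to].to_string(), from, to));
    }
    out
}

fn main() {
    // Matches the (min_gram: 2, max_gram: 5, prefix_only: true) table:
    // hε 0,3 / hεl 0,4 / hεll 0,5 / hεllo 0,6
    for (gram, from, to) in prefix_ngrams("hεllo", 2, 5) {
        println!("{} {},{}", gram, from, to);
    }
}
```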
Example

```rust
use tantivy::tokenizer::*;

let tokenizer = NgramTokenizer::new(2, 3, false);
let mut stream = tokenizer.token_stream("hello");
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "he");
    assert_eq!(token.offset_from, 0);
    assert_eq!(token.offset_to, 2);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "hel");
    assert_eq!(token.offset_from, 0);
    assert_eq!(token.offset_to, 3);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "el");
    assert_eq!(token.offset_from, 1);
    assert_eq!(token.offset_to, 3);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "ell");
    assert_eq!(token.offset_from, 1);
    assert_eq!(token.offset_to, 4);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "ll");
    assert_eq!(token.offset_from, 2);
    assert_eq!(token.offset_to, 4);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "llo");
    assert_eq!(token.offset_from, 2);
    assert_eq!(token.offset_to, 5);
}
{
    let token = stream.next().unwrap();
    assert_eq!(token.text, "lo");
    assert_eq!(token.offset_from, 3);
    assert_eq!(token.offset_to, 5);
}
assert!(stream.next().is_none());
```
Implementations
Configures a new `NgramTokenizer`.

Creates an `NgramTokenizer` which generates tokens for all inner ngrams, as opposed to prefix ngrams only.

Creates an `NgramTokenizer` which generates tokens for the prefix ngrams only.
Trait Implementations
Creates a token stream for a given `&str`.
Auto Trait Implementations
impl RefUnwindSafe for NgramTokenizer
impl Send for NgramTokenizer
impl Sync for NgramTokenizer
impl Unpin for NgramTokenizer
impl UnwindSafe for NgramTokenizer
Blanket Implementations
Mutably borrows from an owned value.

Converts `Box<dyn Trait>` (where `Trait: Downcast`) to `Box<dyn Any>`, which can then be further downcast into `Box<ConcreteType>`, where `ConcreteType` implements `Trait`.

Converts `Rc<Trait>` (where `Trait: Downcast`) to `Rc<Any>`, which can then be further downcast into `Rc<ConcreteType>`, where `ConcreteType` implements `Trait`.

Converts `&Trait` (where `Trait: Downcast`) to `&Any`. This is needed since Rust cannot generate `&Any`'s vtable from `&Trait`'s.

Converts `&mut Trait` (where `Trait: Downcast`) to `&mut Any`. This is needed since Rust cannot generate `&mut Any`'s vtable from `&mut Trait`'s.