Module tantivy::tokenizer


Tokenizers are in charge of chopping text into a stream of tokens ready for indexing.

You must define in your schema which tokenizer should be used for each of your fields:

use tantivy::schema::*;

let mut schema_builder = Schema::builder();

let text_options = TextOptions::default()
    .set_indexing_options(
        TextFieldIndexing::default()
            .set_tokenizer("en_stem")
            .set_index_option(IndexRecordOption::Basic)
    )
    .set_stored();

let id_options = TextOptions::default()
    .set_indexing_options(
        TextFieldIndexing::default()
            .set_tokenizer("raw_ids")
            .set_index_option(IndexRecordOption::WithFreqsAndPositions)
    )
    .set_stored();

schema_builder.add_text_field("title", text_options.clone());
schema_builder.add_text_field("text", text_options);
schema_builder.add_text_field("uuid", id_options);

let schema = schema_builder.build();

By default, tantivy offers the following tokenizers:

§default

default is the tokenizer that will be used if you do not assign a specific tokenizer to your text field. It chops your text on punctuation and whitespace, removes tokens that are longer than 40 characters, and lowercases your text.
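
For reference, a pipeline with the behaviour described above can be assembled by hand. The composition below (SimpleTokenizer, RemoveLongFilter, LowerCaser) is an assumption based on this description, not necessarily the exact internal definition of default:

use tantivy::tokenizer::*;

// Assumed equivalent of the `default` pipeline: split on punctuation and
// whitespace, drop tokens longer than 40 bytes, lowercase the rest.
let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(RemoveLongFilter::limit(40))
    .filter(LowerCaser)
    .build();

let mut stream = analyzer.token_stream("Hello, Happy Tax Payer!");
while stream.advance() {
    println!("{:?}", stream.token().text); // "hello", "happy", "tax", "payer"
}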

§raw

Does not actually tokenize your text; it keeps it entirely unprocessed. It can be useful for indexing uuids or urls, for instance.
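
This behaviour corresponds to the RawTokenizer struct. Here is a small sketch of using it directly (assuming a recent tantivy version); the sample value is arbitrary:

use tantivy::tokenizer::*;

// RawTokenizer keeps the whole value as a single, unprocessed token.
let mut tokenizer = RawTokenizer::default();
let mut stream = tokenizer.token_stream("0a1b-47e2-9d3f/some-url-or-uuid");
while stream.advance() {
    println!("{:?}", stream.token().text); // the entire input, unchanged
}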

§en_stem

In addition to what default does, the en_stem tokenizer also applies stemming to your tokens. Stemming consists of trimming words to remove their inflection. This tokenizer is slower than the default one, but is recommended for improving recall.
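
To illustrate the effect, here is a small sketch of a lowercasing and stemming pipeline; the exact stems produced depend on the stemming algorithm:

use tantivy::tokenizer::*;

// Lowercase, then stem with the English stemmer.
let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(LowerCaser)
    .filter(Stemmer::new(Language::English))
    .build();

let mut stream = analyzer.token_stream("Running runners ran");
while stream.advance() {
    print!("{} ", stream.token().text); // e.g. "run runner ran"
}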

§Custom tokenizer Library

When writing a library of custom tokenizers, avoid using tantivy as a dependency and prefer tantivy-tokenizer-api instead.

§Custom tokenizers

You can write your own tokenizer by implementing the Tokenizer trait or you can extend an existing Tokenizer by chaining it with several TokenFilters.
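
If you take the first route, your type needs to produce a TokenStream. The sketch below is a hypothetical CommaTokenizer that splits the input on commas; it targets recent tantivy versions, where Tokenizer uses a generic associated TokenStream type, and all names in it are illustrative rather than part of tantivy:

use tantivy::tokenizer::{Token, TokenStream, Tokenizer};

// Hypothetical tokenizer: splits the input text on commas.
#[derive(Clone, Default)]
pub struct CommaTokenizer;

pub struct CommaTokenStream<'a> {
    text: &'a str,
    offset: usize,
    position: usize,
    token: Token,
}

impl Tokenizer for CommaTokenizer {
    type TokenStream<'a> = CommaTokenStream<'a>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> CommaTokenStream<'a> {
        CommaTokenStream {
            text,
            offset: 0,
            position: 0,
            token: Token::default(),
        }
    }
}

impl TokenStream for CommaTokenStream<'_> {
    fn advance(&mut self) -> bool {
        while self.offset < self.text.len() {
            let rest = &self.text[self.offset..];
            let len = rest.find(',').unwrap_or(rest.len());
            let (start, end) = (self.offset, self.offset + len);
            self.offset = end + 1; // skip past the comma (or past the end)
            if len == 0 {
                continue; // ignore empty segments, e.g. in "a,,b"
            }
            self.token.text.clear();
            self.token.text.push_str(&self.text[start..end]);
            self.token.offset_from = start;
            self.token.offset_to = end;
            self.token.position = self.position;
            self.position += 1;
            return true;
        }
        false
    }

    fn token(&self) -> &Token {
        &self.token
    }

    fn token_mut(&mut self) -> &mut Token {
        &mut self.token
    }
}

Such a tokenizer can then be registered or wrapped in a TextAnalyzer exactly like the built-in tokenizers below.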

Following the second approach, the en_stem tokenizer, for instance, is defined by chaining filters as follows.

use tantivy::tokenizer::*;

let en_stem = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(RemoveLongFilter::limit(40))
    .filter(LowerCaser)
    .filter(Stemmer::new(Language::English))
    .build();

Once your tokenizer is defined, you need to register it with a name in your index’s TokenizerManager.

let custom_en_tokenizer = SimpleTokenizer::default();
let index = Index::create_in_ram(schema);
index.tokenizers()
     .register("custom_en", custom_en_tokenizer);

If you built your schema programmatically, a complete example could look like the following.

Note that tokens with a length greater than or equal to MAX_TOKEN_LEN are ignored by the indexer.

§Example

use tantivy::schema::{Schema, IndexRecordOption, TextOptions, TextFieldIndexing};
use tantivy::tokenizer::*;
use tantivy::Index;

let mut schema_builder = Schema::builder();
let text_field_indexing = TextFieldIndexing::default()
    .set_tokenizer("custom_en")
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_field_indexing)
    .set_stored();
schema_builder.add_text_field("title", text_options);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);

// We need to register our tokenizer:
let custom_en_tokenizer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(RemoveLongFilter::limit(40))
    .filter(LowerCaser)
    .build();
index
    .tokenizers()
    .register("custom_en", custom_en_tokenizer);

Structs§

  • AlphaNumOnlyFilter: TokenFilter that removes all tokens containing characters that are not ASCII alphanumeric.
  • AsciiFoldingFilter: Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the “Basic Latin” Unicode block) into their ASCII equivalents, if one exists.
  • BoxTokenStream: Simple wrapper of Box<dyn TokenStream + 'a>.
  • FacetTokenizer: Processes a Facet binary representation and emits a token for each of its parents.
  • LowerCaser: Token filter that lowercases terms.
  • NgramTokenizer: Tokenizes the text by splitting words into n-grams of the given size(s).
  • PreTokenizedStream: TokenStream implementation which wraps PreTokenizedString.
  • PreTokenizedString: Struct representing pre-tokenized text.
  • RawTokenizer: For each value of the field, emits a single unprocessed token.
  • RegexTokenizer: Tokenizes the text by using a regex pattern to split. Each match of the regex emits a distinct token; empty tokens are not emitted. Anchors such as \A will match the text from the point where the last token was emitted, or the beginning of the complete text if no token was emitted yet.
  • RemoveLongFilter: Removes tokens that are longer than a given number of bytes (in UTF-8 representation).
  • SimpleTokenStream: TokenStream produced by the SimpleTokenizer.
  • SimpleTokenizer: Tokenizes the text by splitting on whitespace and punctuation.
  • SplitCompoundWords: A TokenFilter which splits compound words into their parts based on a given dictionary.
  • Stemmer: Stemmer token filter. Several languages are supported; see Language for the available languages. Tokens are expected to be lowercased beforehand.
  • StopWordFilter: TokenFilter that removes stop words from a token stream.
  • TextAnalyzer: Tokenizes an input text into tokens and modifies the resulting TokenStream.
  • TextAnalyzerBuilder: Builder helper for TextAnalyzer.
  • Token: A single token, with its text, position, and byte offsets in the original text.
  • TokenizerManager: Serves as a store for all of the pre-configured tokenizer pipelines.
  • WhitespaceTokenizer: Tokenizes the text by splitting on whitespace.

Enums§

  • Language: Available stemmer languages.

Constants§

  • MAX_TOKEN_LEN: Maximum allowed length (in bytes) for a token.

Traits§

  • TokenFilter: Trait for the pluggable components of Tokenizers.
  • TokenStream: The result of the tokenization.
  • Tokenizer: Tokenizers are in charge of splitting text into a stream of tokens before indexing.