Module rten_text::tokenizers
Tokenizers for converting text into sequences of token IDs.
There are two ways to construct a tokenizer:
- Load a preconfigured tokenizer from JSON using Tokenizer::from_json. This crate supports a subset of the tokenizer.json format that Hugging Face Tokenizers generates.
- Manually configure a Tokenizer by creating an Encoder implementation, such as WordPiece, and then wrapping it in a tokenizer using Tokenizer::new.
Modules§
- Regex patterns used by popular tokenizer models.
Structs§
- Byte Pair Encoding tokenizer used by GPT-2 and subsequently used by many other models.
- Options that control chunking and truncation by Tokenizer::encode and Tokenizer::encode_chunks.
- Output produced by a Tokenizer::encode implementation.
- Tokenizes text inputs into sequences of token IDs that can be fed to a machine learning model.
- Configuration for a Tokenizer.
- WordPiece tokenizer used by BERT models.
- Configuration for a WordPiece tokenizer.
Enums§
- Errors that can occur when building a Bpe tokenizer or encoding or decoding text using it.
- Input sequences for Tokenizer::encode.
- Errors returned by Tokenizer::from_json.
- Error type returned when tokenizing a string.
Traits§
- An Encoder implements a specific method of converting strings into token IDs using a pre-computed model.
Type Aliases§
- Integer type used to represent token IDs.