Module rten_text::tokenizers
Tokenizers for converting text into sequences of token IDs.
There are two ways to construct a tokenizer:
- Load a preconfigured tokenizer from JSON using Tokenizer::from_json. This crate supports a subset of the tokenizer.json format that Hugging Face Tokenizers generates.
- Manually configure a Tokenizer by creating an Encoder implementation, such as WordPiece, and then wrapping it in a tokenizer using Tokenizer::new.
Modules§
- Regex patterns used by popular tokenizer models.
Structs§
- Byte Pair Encoding tokenizer used by GPT-2 and subsequently used by many other models.
- Options that control chunking and truncation by Tokenizer::encode and Tokenizer::encode_chunks.
- Output produced by a Tokenizer::encode implementation.
- Tokenizes text inputs into sequences of token IDs that can be fed to a machine learning model.
- Configuration for a Tokenizer.
- WordPiece tokenizer used by BERT models.
- Configuration for a WordPiece tokenizer.
Enums§
- Errors that can occur when building a Bpe tokenizer or encoding or decoding text using it.
- Input sequences for Tokenizer::encode.
- Errors returned by Tokenizer::from_json.
- Error type returned when tokenizing a string.
Traits§
- An Encoder implements a specific method of converting strings into token IDs using a pre-computed model.
Type Aliases§
- Integer type used to represent token IDs.