This crate provides tokenizers for encoding text into token IDs for model inputs and decoding output token IDs back into text.
The tokenization process follows the pipeline used by the Hugging Face Tokenizers library. Tokenizers can either be constructed manually or loaded from Hugging Face tokenizer.json files.
§Comparison to tokenizers crate
The canonical implementation of this tokenization pipeline is the tokenizers crate. The main differences compared to that crate are:
- rten-text focuses on inference only and does not support training tokenizers.
- rten-text is a pure Rust library with no dependencies written in C/C++. This means it is easy to build for WebAssembly and other targets where non-Rust dependencies may cause difficulties.
- rten-text is integrated with the rten-generate library which handles running the complete inference loop for auto-regressive transformer models. Note that you can use rten-generate’s outputs with other tokenizer libraries if rten-text is not suitable.
- Not all tokenizer features are currently implemented in rten-text. Please file an issue if you find that rten-text is missing a feature needed for a particular model’s tokenizer.
§Loading a pre-trained tokenizer
The main entry point is the Tokenizer type. Use Tokenizer::from_file or Tokenizer::from_json to construct a tokenizer from a tokenizer.json file.
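For example, a tokenizer can be loaded from a file path or from JSON that has already been read into memory. A minimal sketch (the path is illustrative, and the assumption here is that Tokenizer::from_json takes the JSON content as a string):
use rten_text::Tokenizer;
// Load directly from a tokenizer.json file on disk.
let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
// Or construct from JSON already read into a string (sketch; the
// &str argument is an assumption about from_json's exact signature).
let json = std::fs::read_to_string("gpt2/tokenizer.json")?;
let tokenizer = Tokenizer::from_json(&json)?;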
§Encoding text
The Tokenizer::encode method is used to encode text into token IDs.
This can be used, for example, to encode a model's prompt:
use rten_text::Tokenizer;
let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
let encoded = tokenizer.encode("some text to tokenize", None)?;
let token_ids = encoded.token_ids(); // Sequence of token IDs
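The returned IDs can then be assembled into whatever input format the model expects. As a minimal sketch, assuming a model that takes i32 token IDs for a batch of one (the target integer type is a property of the model, not of rten-text):
// Hypothetical conversion; many models take i32 or i64 inputs,
// while the token IDs produced here are assumed to be unsigned.
let input: Vec<i32> = encoded.token_ids().iter().map(|&id| id as i32).collect();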
§Decoding text
Given token IDs generated by a model, you can decode them back into text using the Tokenizer::decode method:
use rten_text::Tokenizer;
let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
// Run model and get token IDs from outputs...
let token_ids = [101, 4256, 300];
let text = tokenizer.decode(&token_ids)?;
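When consuming tokens from a generation loop one at a time, a simple approach is to re-decode the growing sequence and emit only the new suffix. A minimal sketch, assuming that decoding a longer prefix never rewrites previously emitted text (typical for byte-level BPE vocabularies, but not guaranteed in general):
use rten_text::Tokenizer;
let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
let stream = [101, 4256, 300]; // Token IDs arriving one at a time.
let mut ids = Vec::new();
let mut printed = 0;
for id in stream {
    ids.push(id);
    let text = tokenizer.decode(&ids)?;
    // Emit only the part that has not been printed yet. `get` avoids a
    // panic if the new text is shorter or splits a UTF-8 boundary.
    print!("{}", text.get(printed..).unwrap_or(""));
    printed = text.len();
}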
§More examples
See the rten-examples crate for various examples showing how to use this crate as part of an end-to-end pipeline.
§Re-exports
pub use tokenizer::TokenId;
pub use tokenizer::Tokenizer;
pub use tokenizer::TokenizerError;
§Modules
- models: Implementations of popular tokenization models, including WordPiece and Byte Pair Encoding (BPE).
- normalizers: Tools for performing string normalization prior to tokenization.
- pre_tokenizers: Pre-tokenizers, which split text after normalization and before the model encodes it into token IDs.
- tokenizer: Defines the Tokenizer type that implements the tokenization pipeline.