// rten_text/lib.rs

//! This crate provides tokenizers for encoding text into token IDs
//! for model inputs and decoding output token IDs back into text.
//!
//! The tokenization process follows the
//! [pipeline](https://huggingface.co/docs/tokenizers/en/pipeline) used by the
//! Hugging Face [Tokenizers](https://huggingface.co/docs/tokenizers/en/)
//! library. Tokenizers can either be constructed manually or loaded from
//! Hugging Face `tokenizer.json` files.
//!
//! ## Comparison to _tokenizers_ crate
//!
//! The canonical implementation of this tokenization pipeline is the
//! [`tokenizers`](https://github.com/huggingface/tokenizers) crate. The main
//! differences compared to that crate are:
//!
//! - rten-text focuses on inference only and does not support training
//!   tokenizers.
//! - rten-text is a pure Rust library with no dependencies written in C/C++.
//!   This makes it easy to build for WebAssembly and other targets where
//!   non-Rust dependencies may cause difficulties.
//! - rten-text is integrated with the
//!   [rten-generate](https://docs.rs/rten-generate/) library, which handles
//!   running the complete inference loop for auto-regressive transformer
//!   models. Note that you can use rten-generate's outputs with other tokenizer
//!   libraries if rten-text is not suitable.
//! - Not all tokenizer features are currently implemented in rten-text. Please
//!   file an issue if you find that rten-text is missing a feature needed for a
//!   particular model's tokenizer.
//!
//! ## Loading a pre-trained tokenizer
//!
//! The main entry point is the [`Tokenizer`] type. Use [`Tokenizer::from_file`]
//! or [`Tokenizer::from_json`] to construct a tokenizer from a `tokenizer.json`
//! file.
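//!
//! If the `tokenizer.json` contents are already in memory (for example, when
//! bundling the tokenizer data into a WebAssembly build), they can be passed
//! to [`Tokenizer::from_json`] instead. A minimal sketch, assuming
//! `from_json` accepts the JSON contents as a string:
//!
//! ```no_run
//! use rten_text::Tokenizer;
//!
//! // Read the tokenizer configuration ourselves, then parse it in memory.
//! let json = std::fs::read_to_string("gpt2/tokenizer.json")?;
//! let tokenizer = Tokenizer::from_json(&json)?;
//! # Ok::<_, Box<dyn std::error::Error>>(())
//! ```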
//!
//! ## Encoding text
//!
//! The [`Tokenizer::encode`] method encodes text into token IDs. This can be
//! used, for example, to encode a model's prompt:
//!
//! ```no_run
//! use rten_text::Tokenizer;
//!
//! let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
//! let encoded = tokenizer.encode("some text to tokenize", None)?;
//! let token_ids = encoded.token_ids(); // Sequence of token IDs
//! # Ok::<_, Box<dyn std::error::Error>>(())
//! ```
//!
//! ## Decoding text
//!
//! Given token IDs generated by a model, you can decode them back into text
//! using the [`Tokenizer::decode`] method:
//!
//! ```no_run
//! use rten_text::Tokenizer;
//!
//! let tokenizer = Tokenizer::from_file("gpt2/tokenizer.json")?;
//! // Run the model and get token IDs from its outputs...
//! let token_ids = [101, 4256, 300];
//! let text = tokenizer.decode(&token_ids)?;
//! # Ok::<_, Box<dyn std::error::Error>>(())
//! ```
//!
//! ## More examples
//!
//! See the
//! [rten-examples](https://github.com/robertknight/rten/tree/main/rten-examples)
//! crate for various examples showing how to use this crate as part of an
//! end-to-end pipeline.

pub mod models;
pub mod normalizers;
pub mod pre_tokenizers;
pub mod tokenizer;

mod serde;
mod split;

pub use tokenizer::{TokenId, Tokenizer, TokenizerError};