The core of `tokenizers`, written in Rust.

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
## What is a Tokenizer
A `Tokenizer` works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
The various steps of the pipeline are:

- The `Normalizer`: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as `NFD` or `NFKC`. More details about how to use the `Normalizer`s are available on the Hugging Face blog.
- The `PreTokenizer`: in charge of creating initial word splits in the text. The most common way of splitting text is simply on whitespace.
- The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be `BPE` or `WordPiece`.
- The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant that, for example, a language model would need, such as special tokens.
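The four stages above can be sketched end to end with a toy, dependency-free example. This does **not** use the crate's API; the function names and the trivial "one word = one token" model are simplifications standing in for real `Normalizer`, `PreTokenizer`, `Model`, and `PostProcessor` components:

```rust
// Toy illustration of the four pipeline stages (NOT the tokenizers crate API).

fn normalize(text: &str) -> String {
    // Normalizer: lowercasing stands in for Unicode normalization (NFD/NFKC).
    text.to_lowercase()
}

fn pre_tokenize(text: &str) -> Vec<String> {
    // PreTokenizer: the most common split, on whitespace.
    text.split_whitespace().map(String::from).collect()
}

fn model(words: &[String]) -> Vec<String> {
    // Model: a stand-in for BPE/WordPiece; here each word maps to one token.
    words.to_vec()
}

fn post_process(mut tokens: Vec<String>) -> Vec<String> {
    // PostProcessor: add special tokens, as a language model might need.
    tokens.insert(0, "[CLS]".to_string());
    tokens.push("[SEP]".to_string());
    tokens
}

fn main() {
    let encoding = post_process(model(&pre_tokenize(&normalize("Hey There!"))));
    println!("{:?}", encoding); // ["[CLS]", "hey", "there!", "[SEP]"]
}
```

In the real library each stage is a pluggable component, and the `Encoding` carries much more than the token strings (ids, offsets, attention masks, and so on).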
## Loading a pretrained tokenizer from the Hub
```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Requires the `http` feature (see "Features" below).
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```
## Deserialization and tokenization example
```rust
use tokenizers::tokenizer::{Result, Tokenizer};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    // Build a BPE model from vocabulary and merges files.
    let bpe_builder = BPE::from_file("./path/to/vocab.json", "./path/to/merges.txt");
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    let mut tokenizer = Tokenizer::new(bpe);

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```
## Training and serialization example
```rust
use tokenizers::decoders::DecoderWrapper;
use tokenizers::models::bpe::{BpeTrainerBuilder, BPE};
use tokenizers::normalizers::{strip::Strip, unicode::NFC, utils::Sequence, NormalizerWrapper};
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::pre_tokenizers::PreTokenizerWrapper;
use tokenizers::processors::PostProcessorWrapper;
use tokenizers::{AddedToken, Result, TokenizerBuilder};

use std::path::Path;

fn main() -> Result<()> {
    let vocab_size: usize = 100;

    // Configure a trainer for the BPE model.
    let mut trainer = BpeTrainerBuilder::new()
        .show_progress(true)
        .vocab_size(vocab_size)
        .min_frequency(0)
        .special_tokens(vec![
            AddedToken::from(String::from("<s>"), true),
            AddedToken::from(String::from("<pad>"), true),
            AddedToken::from(String::from("</s>"), true),
            AddedToken::from(String::from("<unk>"), true),
            AddedToken::from(String::from("<mask>"), true),
        ])
        .build();

    // Assemble the full pipeline: normalizer, pre-tokenizer, model,
    // post-processor, and decoder.
    let mut tokenizer: TokenizerBuilder<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerBuilder::new()
        .with_model(BPE::default())
        .with_normalizer(Some(Sequence::new(vec![
            Strip::new(true, true).into(),
            NFC.into(),
        ])))
        .with_pre_tokenizer(Some(ByteLevel::default()))
        .with_post_processor(Some(ByteLevel::default()))
        .with_decoder(Some(ByteLevel::default()));

    let pretty = false;
    tokenizer
        .build()?
        .train_from_files(&mut trainer, vec!["path/to/vocab.txt".to_string()])?
        .save(Path::new("tokenizer.json"), pretty)?;

    Ok(())
}
```
## Additional information
- `tokenizers` is designed to leverage CPU parallelism when possible. The level of parallelism is determined by the total number of cores/threads your CPU provides, but this can be tuned by setting the `RAYON_RS_NUM_THREADS` environment variable. As an example, setting `RAYON_RS_NUM_THREADS=4` will allocate a maximum of 4 threads. Please note this behavior may evolve in the future.
## Features
- `progressbar`: The progress bar visualization is enabled by default. It might be disabled if compilation for certain targets is not supported by the `termios` dependency of the `indicatif` progress bar.
- `http`: This feature enables downloading the tokenizer via HTTP. It is disabled by default. With this feature enabled, `Tokenizer::from_pretrained` becomes accessible.
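Since `http` is off by default, it has to be requested in your `Cargo.toml`. A minimal sketch (the version number is illustrative, not a recommendation):

```toml
[dependencies]
# Enable the optional `http` feature so Tokenizer::from_pretrained is available.
# Replace the version with the latest release of the crate.
tokenizers = { version = "0.21", features = ["http"] }
```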