Crate tokenizers
Tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
What is a Tokenizer
A `Tokenizer` works as a pipeline: it processes some raw text as input and outputs an `Encoding`.
The various steps of the pipeline are:

- The `Normalizer`: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as `NFD` or `NFKC`.
- The `PreTokenizer`: in charge of creating initial word splits in the text. The most common way of splitting text is simply on whitespace.
- The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be `BPE` or `WordPiece`.
- The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant that, for example, a language model would need, such as special tokens.
Quick example
```rust
use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    let bpe_builder = BPE::from_files("./path/to/vocab.json", "./path/to/merges.txt");
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    let mut tokenizer = Tokenizer::new(Box::new(bpe));

    let encoding = tokenizer.encode(EncodeInput::Single("Hey there!".into()), false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```
Modules
| Module | Description |
|---|---|
| `decoders` | |
| `models` | Popular tokenizer models. |
| `normalizers` | |
| `pre_tokenizers` | |
| `processors` | |
| `tokenizer` | Represents a tokenization pipeline. |
| `utils` | |