Crate tokenizers

Tokenizers

Provides an implementation of today's most widely used tokenizers, with a focus on performance and versatility.

What is a Tokenizer

A Tokenizer works as a pipeline: it takes raw text as input and outputs an Encoding. The steps of the pipeline are:

  1. The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization forms, such as NFD or NFKC.
  2. The PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
  3. The Model: in charge of doing the actual tokenization. An example of a Model would be BPE or WordPiece.
  4. The PostProcessor: in charge of post-processing the Encoding to add anything a downstream consumer, such as a language model, would need, for example special tokens.
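
The four steps above can be sketched with the standard library alone. This is a toy illustration of the data flow, not this crate's API: ToyEncoding, toy_vocab, normalize, pre_tokenize, apply_model, post_process and encode are all hypothetical names. Lowercasing stands in for real Unicode normalization, and a whole-word vocabulary lookup stands in for a real Model such as BPE or WordPiece.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Encoding: the final tokens and their ids.
#[derive(Debug, PartialEq)]
struct ToyEncoding {
    tokens: Vec<String>,
    ids: Vec<u32>,
}

/// Hypothetical five-entry vocabulary; ids follow the order of the array.
fn toy_vocab() -> HashMap<String, u32> {
    ["[UNK]", "[CLS]", "[SEP]", "hey", "there"]
        .iter()
        .enumerate()
        .map(|(id, tok)| (tok.to_string(), id as u32))
        .collect()
}

/// Step 1, Normalizer: lowercasing stands in for NFD/NFKC here.
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Step 2, PreTokenizer: split the normalized text on whitespace.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Step 3, Model: keep words found in the vocabulary, map the rest to [UNK].
fn apply_model(words: &[&str], vocab: &HashMap<String, u32>) -> Vec<String> {
    words
        .iter()
        .map(|w| {
            if vocab.contains_key(*w) {
                (*w).to_string()
            } else {
                "[UNK]".to_string()
            }
        })
        .collect()
}

/// Step 4, PostProcessor: wrap the tokens in special tokens and attach ids.
fn post_process(mut tokens: Vec<String>, vocab: &HashMap<String, u32>) -> ToyEncoding {
    tokens.insert(0, "[CLS]".to_string());
    tokens.push("[SEP]".to_string());
    let ids = tokens.iter().map(|t| vocab[t.as_str()]).collect();
    ToyEncoding { tokens, ids }
}

/// Runs the four steps in order.
fn encode(text: &str, vocab: &HashMap<String, u32>) -> ToyEncoding {
    let normalized = normalize(text);
    let words = pre_tokenize(&normalized);
    let tokens = apply_model(&words, vocab);
    post_process(tokens, vocab)
}

fn main() {
    let vocab = toy_vocab();
    let encoding = encode("Hey there friend", &vocab);
    println!("{:?}", encoding.tokens); // ["[CLS]", "hey", "there", "[UNK]", "[SEP]"]
    println!("{:?}", encoding.ids); // [1, 3, 4, 0, 2]
}
```

Each stage consumes only the output of the previous one, so any stage can be swapped without touching the others. That independence is the property the pipeline design is meant to provide.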

Quick example

use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    // Load the vocabulary and merge rules, then configure the BPE model.
    let bpe_builder = BPE::from_files("./path/to/vocab.json", "./path/to/merges.txt")?;
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    // Build a tokenizer around the model.
    let mut tokenizer = Tokenizer::new(Box::new(bpe));

    // Encode a single sequence and print the resulting tokens.
    let encoding = tokenizer.encode(EncodeInput::Single("Hey there!".into()))?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}

Modules

decoders
models: Popular tokenizer models.
normalizers
pre_tokenizers
processors
tokenizer: Represents a tokenization pipeline.
utils