1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
#![warn(clippy::all)] #![doc(html_favicon_url = "https://huggingface.co/favicon.ico")] #![doc(html_logo_url = "https://huggingface.co/landing/assets/huggingface_logo.svg")] //! # Tokenizers //! //! Provides an implementation of today's most used tokenizers, with a focus on performance and //! versatility. //! //! ## What is a Tokenizer //! //! A Tokenizer works as a pipeline, it processes some raw text as input and outputs an //! `Encoding`. //! The various steps of the pipeline are: //! //! 1. The `Normalizer`: in charge of normalizing the text. Common examples of normalization are //! the [unicode normalization standards](https://unicode.org/reports/tr15/#Norm_Forms), such as `NFD` or `NFKC`. //! 2. The `PreTokenizer`: in charge of creating initial words splits in the text. The most common way of //! splitting text is simply on whitespace. //! 3. The `Model`: in charge of doing the actual tokenization. An example of a `Model` would be //! `BPE` or `WordPiece`. //! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant //! that, for example, a language model would need, such as special tokens. //! //! ## Quick example //! //! ```no_run //! use tokenizers::tokenizer::{Result, Tokenizer, EncodeInput}; //! use tokenizers::models::bpe::BPE; //! //! fn main() -> Result<()> { //! let bpe_builder = BPE::from_files("./path/to/vocab.json", "./path/to/merges.txt"); //! let bpe = bpe_builder //! .dropout(0.1) //! .unk_token("[UNK]".into()) //! .build()?; //! //! let mut tokenizer = Tokenizer::new(Box::new(bpe)); //! //! let encoding = tokenizer.encode(EncodeInput::Single("Hey there!".into()), false)?; //! println!("{:?}", encoding.get_tokens()); //! //! Ok(()) //! } //! ``` #[macro_use] extern crate lazy_static; pub mod decoders; pub mod models; pub mod normalizers; pub mod pre_tokenizers; pub mod processors; pub mod tokenizer; pub mod utils;