SmolToken: A fast library for Byte Pair Encoding (BPE) tokenization.
SmolToken is a fast, lightweight tokenizer library designed to tokenize text using
Byte Pair Encoding (BPE), a widely used subword tokenization algorithm. Inspired by OpenAI’s
tiktoken, SmolToken aims to provide a robust solution for encoding and decoding text,
with additional flexibility to train tokenizers from scratch on your own data.
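To make the idea behind BPE training concrete, here is a minimal, self-contained sketch of the core loop: start from raw bytes, repeatedly find the most frequent adjacent pair, and replace it with a new token id. This is an illustration only and not SmolToken's actual implementation. The function names (`count_pairs`, `merge_pair`, `train_merges`) are hypothetical, and the sketch assumes no regex pre-splitting and no special-token handling.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn count_pairs(tokens: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` (scanning left to right) with `new_id`.
fn merge_pair(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges from `text` (hypothetical helper, not a SmolToken API).
/// Returns the final token sequence and the learned (pair, new_id) merges in order.
fn train_merges(text: &str, num_merges: usize) -> (Vec<u32>, Vec<((u32, u32), u32)>) {
    // Start with one token per byte, so ids 0..=255 are reserved for raw bytes.
    let mut tokens: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        let counts = count_pairs(&tokens);
        // Most frequent pair wins; ties go to the numerically smallest pair
        // so the result does not depend on HashMap iteration order.
        let best = counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(pair, count)| (*pair, *count));
        let Some((pair, count)) = best else { break };
        if count < 2 {
            break; // Nothing repeats, so further merges would not compress anything.
        }
        let new_id = 256 + step as u32;
        tokens = merge_pair(&tokens, pair, new_id);
        merges.push((pair, new_id));
    }
    (tokens, merges)
}

fn main() {
    let (tokens, merges) = train_merges("hello hello world", 3);
    println!("Tokens after merging: {:?}", tokens);
    println!("Learned merges: {:?}", merges);
}
```

A production tokenizer typically splits the text with a regex pattern first and merges only within each piece, which is the role the pattern argument plays in the example below.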
§Example
use std::collections::HashSet;
use smoltoken::{TokenizerDataSource, BytePairTokenizer};
// Define a name, a simple split pattern, and some training data.
let name = String::from("simple_tokenizer");
let pattern = r"\w+|\S";
let data = TokenizerDataSource::Text("hello hello world");
// Special tokens to be handled explicitly.
let special_tokens: HashSet<&str> = HashSet::from(["<unk>", "<pad>"]);
// Train a BPE tokenizer with a vocabulary size of 300.
let tokenizer = BytePairTokenizer::train(name, pattern, 300, special_tokens.clone(), data).unwrap();
// Encode text into token ranks.
let encoded = tokenizer.encode("hello <unk> world", &special_tokens);
println!("Encoded: {:?}", encoded);
// Decode token ranks back into text.
let decoded = tokenizer.decode(&encoded).unwrap();
println!("Decoded: {}", decoded);
Structs§
- BytePairTokenizer - A tokenizer that uses the byte pair encoding algorithm to encode and decode text.
- FakeThreadId
Enums§
- DecodeError - Represents errors that may occur during decoding in the BPE algorithm.
- TokenizerDataSource - An enum representing the different types of input data that can be used to train the BPE tokenizer from scratch.