Crate smoltoken


SmolToken: A fast library for Byte Pair Encoding (BPE) tokenization.

SmolToken is a fast, lightweight tokenizer library that tokenizes text using Byte Pair Encoding (BPE), a widely used subword tokenization algorithm. Inspired by OpenAI’s tiktoken, SmolToken aims to provide a robust solution for encoding and decoding text, with the added flexibility of training tokenizers from scratch on your own data.
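For readers new to BPE, the core training loop can be sketched in a few lines of plain Rust. This is a minimal illustration of the general algorithm, not SmolToken's implementation: the names `count_pairs`, `merge_pair`, and `train_merges` are hypothetical, and the sketch assumes byte-level tokens with new ids starting at 256. It skips the regex pre-splitting and special-token handling that a real tokenizer applies.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn count_pairs(tokens: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` (left to right) with `new_id`.
fn merge_pair(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges (hypothetical helper).
/// Ids 0..=255 are raw bytes; each learned merge gets the next id from 256 up.
fn train_merges(text: &str, num_merges: usize) -> (Vec<u32>, Vec<((u32, u32), u32)>) {
    let mut tokens: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        // Most frequent pair wins; ties go to the smallest pair so runs are deterministic.
        let best = count_pairs(&tokens)
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)));
        let Some((pair, count)) = best else { break };
        if count < 2 {
            break; // Nothing repeats, so further merges would not compress anything.
        }
        let new_id = 256 + step as u32;
        tokens = merge_pair(&tokens, pair, new_id);
        merges.push((pair, new_id));
    }
    (tokens, merges)
}
```

Under this sketch's tie-breaking rule, `train_merges("hello hello world", 1)` merges the byte pair `(101, 108)` (`e`, `l`) into id 256, shrinking the sequence from 17 tokens to 15. Encoding then amounts to replaying the learned merges in order on new text.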

§Example

use std::collections::HashSet;

use smoltoken::{BytePairTokenizer, TokenizerDataSource};

// Define a simple pattern and some training data.
let name = String::from("simple_tokenizer");
let pattern = r"\w+|\S";
let data = TokenizerDataSource::Text("hello hello world");

// Special tokens to be handled explicitly.
let special_tokens: HashSet<&str> = HashSet::from(["<unk>", "<pad>"]);

// Train a BPE tokenizer with a vocabulary size of 300.
let tokenizer = BytePairTokenizer::train(name, pattern, 300, special_tokens.clone(), data).unwrap();

// Encode text into token ranks.
let encoded = tokenizer.encode("hello <unk> world", &special_tokens);
println!("Encoded: {:?}", encoded);

// Decode token ranks back into text.
let decoded = tokenizer.decode(&encoded).unwrap();
println!("Decoded: {}", decoded);

Structs§

BytePairTokenizer: A tokenizer that uses the byte pair encoding algorithm to encode and decode text.
FakeThreadId

Enums§

DecodeError: Represents errors that may occur during decoding in the BPE algorithm.
TokenizerDataSource: An enum representing the different types of input data that can be used to train the BPE tokenizer from scratch.
