# FIBpeTokenizer 🚀
A blazing fast Byte Pair Encoding (BPE) tokenizer library written in Rust with Python bindings.
## Features ✨
- 🔥 Blazing Fast: Written in Rust with parallel processing support
- 🐍 Python Support: Use it in Python via PyO3 bindings
- 🎯 Flexible Pre-tokenization: Choose between whitespace or punctuation-based splitting
- 🔖 Special Token Handling: Built-in support for special tokens like `<pad>`, `<mask>`, etc.
- 💾 Save/Load Models: Train once, reuse anywhere
- 🔧 Customizable: Configure vocabulary size, special tokens, and more
## Installation

### Rust

Add this to your `Cargo.toml` (the crate name below is assumed from the project title; check the published name on crates.io):

```toml
[dependencies]
fibpe_tokenizer = "0.1.0"
```
### Python

Install from PyPI (the package name below is assumed):

```bash
pip install fibpe-tokenizer
```
## Quick Start

### Rust Usage

A minimal sketch; the crate and item paths below are assumed from the API reference, not confirmed:

```rust
use fibpe_tokenizer::{BpeTokenizer, PreTokenization};
```
### Python Usage

A sketch of the Python workflow; it assumes the PyO3 bindings mirror the Rust method names, and the constructor arguments shown are illustrative:

```python
from fibpe_tokenizer import BpeTokenizer  # module name assumed

# Define special tokens
special_tokens = ["<pad>", "<mask>"]

# Create tokenizer
tokenizer = BpeTokenizer(corpus, vocab_size=1000, special_tokens=special_tokens)

# Train the tokenizer
tokenizer.train()

# Encode text
encoder = tokenizer.encode("Hello world")
ids = encoder.ids

# Decode back to text
text = tokenizer.decode(ids)

# Load a pretrained tokenizer
tokenizer = BpeTokenizer.new_from_pretrained("path/to/model")
```
## API Reference

### BpeTokenizer

The main tokenizer class.

#### Constructor

- `new`

#### Methods

- `train(&mut self) -> Result<(), TokenizerError>`: Train the tokenizer on the corpus
- `encode(&self, text: &str) -> Result<Encoder, TokenizerError>`: Encode text into tokens and IDs
- `decode(&self, ids: &Vec<u32>) -> Result<String, TokenizerError>`: Decode token IDs back to text
- `new_from_pretrained(files_path: &str) -> Self`: Load a pretrained tokenizer
- `get_id_by_token(&self, token: String) -> Result<u32, TokenizerError>`: Get ID for a token
- `get_token_by_id(&self, id: u32) -> Result<String, TokenizerError>`: Get token for an ID
### PreTokenization

Pre-tokenization strategies:

- `PreTokenization::Whitespace`: Split on whitespace
- `PreTokenization::Punctuation`: Split on whitespace and punctuation
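As an illustration, the two strategies behave roughly like this pure-Python sketch (illustrative only, not the library's Rust implementation):

```python
import re

def pre_tokenize(text: str, strategy: str) -> list[str]:
    """Rough equivalents of the two splitting strategies."""
    if strategy == "whitespace":
        # Split on runs of whitespace; punctuation stays attached to words
        return text.split()
    if strategy == "punctuation":
        # Split on whitespace AND treat each punctuation mark as its own token
        return re.findall(r"\w+|[^\w\s]", text)
    raise ValueError(f"unknown strategy: {strategy}")

print(pre_tokenize("Hello, world!", "whitespace"))   # → ['Hello,', 'world!']
print(pre_tokenize("Hello, world!", "punctuation"))  # → ['Hello', ',', 'world', '!']
```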
### SpecialTokenRemovalMethod

Methods for removing special tokens from the training corpus:

- `SpecialTokenRemovalMethod::Simple`: Simple string replacement
- `SpecialTokenRemovalMethod::AhoCorasick`: Fast multi-pattern search using the Aho-Corasick algorithm
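The difference between the two is how many passes over the text they make: simple replacement scans once per token, while Aho-Corasick finds all patterns in a single scan. A pure-Python sketch of the idea (here the single pass is approximated with a combined regex, which uses backtracking rather than a true Aho-Corasick automaton):

```python
import re

def remove_simple(text: str, tokens: list[str]) -> str:
    # One full pass over the text per special token
    for tok in tokens:
        text = text.replace(tok, "")
    return text

def remove_single_pass(text: str, tokens: list[str]) -> str:
    # One combined pattern -> one scan, the property that makes Aho-Corasick fast
    pattern = "|".join(re.escape(t) for t in tokens)
    return re.sub(pattern, "", text)

s = "<pad>hello <mask>world"
print(remove_simple(s, ["<pad>", "<mask>"]))       # → hello world
print(remove_single_pass(s, ["<pad>", "<mask>"]))  # → hello world
```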
### Encoder

The result of encoding text.

#### Fields

- `original_text: String`: The original input text
- `tokens: Vec<String>`: The tokenized representation
- `ids: Vec<u32>`: Token IDs
- `token_types: Vec<TokenType>`: Type of each token (`WORD`, `SUBWORD`, or `SPECIALTOKEN`)

#### Methods

- `get_token_type(&self, token: &str) -> Result<TokenType, TokenizerError>`: Get the type of a specific token
## How It Works

BPE (Byte Pair Encoding) is a data compression technique that iteratively merges the most frequent pair of bytes (or characters) in a sequence. This library:

1. Pre-tokenizes the input text based on the selected strategy
2. Builds an initial vocabulary from individual characters
3. Iteratively merges the most frequent adjacent token pairs
4. Stops when the target vocabulary size is reached
5. Saves the vocabulary, merge rules, and configuration for later use
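The training loop above can be sketched in a few lines of pure Python (illustrative only; the library's Rust implementation parallelizes this work):

```python
from collections import Counter

def train_bpe(words, target_vocab):
    # Start from individual characters (step 2)
    seqs = [list(w) for w in words]
    vocab = {ch for seq in seqs for ch in seq}
    merges = []
    # Keep merging until the target vocabulary size is reached (step 4)
    while len(vocab) < target_vocab:
        # Count adjacent token pairs across the corpus (step 3)
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break  # nothing left to merge
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Apply the new merge rule everywhere
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i : i + 2] = [a + b]
                else:
                    i += 1
    return vocab, merges  # vocab + merge rules get saved (step 5)

vocab, merges = train_bpe(["aab", "aab", "ab"], target_vocab=4)
print(merges)  # → [('a', 'b'), ('a', 'ab')]
```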
## Performance
- Parallel processing using Rayon for fast training
- Efficient special token removal using Aho-Corasick algorithm
- Optimized data structures for merge operations
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
## Credits
Developed with ❤️ using Rust and PyO3.