# bpe-tokenizer
A Rust implementation of Byte Pair Encoding (BPE) tokenization. BPE is widely used in natural language processing (NLP): it breaks words into subword tokens drawn from a vocabulary built from the most frequent token pairs. This crate provides functionality to tokenize text into such subword units using pre-trained vocabularies.
It supports Unicode-aware text segmentation for sentence and word splitting, making it suitable for processing a variety of languages and scripts.
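To make the idea concrete, here is a toy sketch of the BPE merge loop itself. This is illustrative only, not this crate's implementation; the `bpe_merge` helper and the merge rules are made up for the example:

```rust
/// Toy BPE merge loop: start from individual characters, then apply merge
/// rules in priority order, replacing adjacent pairs with their concatenation.
fn bpe_merge(word: &str, merges: &[(&str, &str)]) -> Vec<String> {
    // "▁" marks the start of a word, as in BPEmb/SentencePiece vocabularies.
    let mut symbols: Vec<String> = std::iter::once("▁".to_string())
        .chain(word.chars().map(|c| c.to_string()))
        .collect();

    for (a, b) in merges {
        let mut i = 0;
        while i + 1 < symbols.len() {
            if symbols[i] == *a && symbols[i + 1] == *b {
                // Merge the pair in place and drop the second element.
                symbols[i] = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.remove(i + 1);
            } else {
                i += 1;
            }
        }
    }
    symbols
}

fn main() {
    // Made-up merge rules that a corpus with many occurrences of "low"
    // might produce.
    let merges = [("▁", "l"), ("▁l", "o"), ("▁lo", "w")];
    println!("{:?}", bpe_merge("low", &merges)); // ["▁low"]
}
```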
## Features
- Bring your own BPE token vocabularies, or use one of the bundled pre-trained defaults (see below).
- Pre-trained multilingual vocabularies sourced from the BPEmb project, with support for tokenizing text in 275 languages.
- Unicode-aware sentence and word segmentation: leverages the `unicode-segmentation` crate for proper text splitting.
## Installation

To add this crate to your project, run:

```sh
cargo add bpe-tokenizer
```

Or manually include it in your `Cargo.toml`:

```toml
[dependencies]
bpe-tokenizer = "<version>"
```
## Full Example

Here is an example of how to create a `BytePairEncoder` from a string and use it to tokenize text:
```rust
use bpe_tokenizer::BytePairEncoder;

// The vocabulary entries below are illustrative: one token per line with a
// tab-separated count, in the style of BPEmb .vocab files.
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello world");
println!("{:?}", tokenized);
```
The output will be a vector of tokens:

```text
["<s>", "▁hello", "▁world", "</s>"]
```
Or load a vocabulary from a file:

```rust
use bpe_tokenizer::BytePairEncoder;

// The path is illustrative.
let vocab = BytePairEncoder::new_from_file("path/to/vocab.txt").unwrap();
```
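Since the constructors return a `Result`, you can also propagate failures with `?` instead of calling `.unwrap()`. A minimal sketch, assuming the crate's error type implements `std::error::Error` (the `load_vocab` helper is illustrative):

```rust
use bpe_tokenizer::BytePairEncoder;

// Illustrative helper: propagate the load error instead of panicking.
// (Assumes the crate's error type implements std::error::Error.)
fn load_vocab(path: &str) -> Result<BytePairEncoder, Box<dyn std::error::Error>> {
    Ok(BytePairEncoder::new_from_file(path)?)
}
```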
## Cargo Features
The crate also includes optional default pre-trained vocabularies, enabled via Cargo features. They are trained on Wikipedia data as part of the BPEmb project. These MIT-licensed vocabularies support 275 languages and come in three sizes to match different memory and coverage needs:
### Available Optional Features
- `default-small` (100,000 tokens): Suitable for memory-constrained environments.
- `default-medium` (320,000 tokens): Balances token coverage and memory efficiency.
- `default-large` (1,000,000 tokens): Provides the most detailed token representations for high-granularity tasks.
### Enabling Optional Features
To use these default vocabularies, specify the feature in your `Cargo.toml`:

```toml
[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }
```
### Example with the `default-medium` Vocabulary

An example of using the medium vocabulary (320,000 tokens):

```rust
use bpe_tokenizer::BytePairEncoder;

// Requires the `default-medium` Cargo feature. The constructor name below is
// an assumption; check the crate documentation for the exact API.
let vocab = BytePairEncoder::new_default_medium().unwrap();
let tokenized = vocab.tokenize("This is a test sentence.");
println!("{:?}", tokenized);
```
## Tokenization Functions
The crate provides various ways to interact with the tokenizer:
- Tokenize into a flat `Vec<String>`: `BytePairEncoder::tokenize`

  Splits and flattens the text into tokens.

  ```rust
  let tokenized = vocab.tokenize("Example sentence");
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```

- Tokenize into nested sentence vectors (`Vec<Vec<String>>`): `BytePairEncoder::tokenize_sentences`

  Useful for processing multiple sentences separately.

  ```rust
  let tokenized = vocab.tokenize_sentences("This is sentence one. And this is sentence two.");
  // Output: [["<s>", "▁this", "▁is", "▁sentence", "▁one", "</s>"],
  //          ["<s>", "▁and", "▁this", "▁is", "▁sentence", "▁two", "</s>"]]
  ```

- Iterative tokenization: `BytePairEncoder::tokenize_iter` and `BytePairEncoder::tokenize_sentences_iter`

  Provides an iterator over generated tokens for better memory efficiency on large inputs (see the streaming sketch after this list).

  ```rust
  let tokens: Vec<String> = vocab.tokenize_iter("Example sentence").collect();
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```
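For very large inputs, the iterator variants can be consumed lazily without ever building a `Vec`. A minimal sketch (the `count_tokens` helper is illustrative, and the iterator is assumed to yield `String`s):

```rust
use bpe_tokenizer::BytePairEncoder;

// Count tokens in a large text without collecting them into a Vec.
// (Illustrative helper; assumes `tokenize_iter` yields `String`s.)
fn count_tokens(vocab: &BytePairEncoder, text: &str) -> usize {
    vocab.tokenize_iter(text).count()
}
```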
## Licensing
This crate is licensed under the MIT License.
## Contributing
Contributions are welcome! Please open an issue, submit a pull request, or reach out if you'd like to contribute awesome new features or fixes to this crate.