# bpe-tokenizer
A Rust implementation of Byte Pair Encoding (BPE) tokenization. BPE is widely used in natural language processing (NLP): it breaks words into subword tokens drawn from a vocabulary built from the most frequent token pairs. This crate provides functionality to tokenize text into such subword units using pre-trained vocabularies.
It supports Unicode-aware text segmentation for sentence and word splitting, making it suitable for processing a variety of languages and scripts.
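To make the idea concrete, here is a toy sketch of the BPE merge loop itself. This is illustrative only, not this crate's implementation; the `bpe_merge` helper and the merge rules are made up for the example:

```rust
/// Toy BPE merge loop: start from individual characters, then apply merge
/// rules in priority order, replacing adjacent pairs with their concatenation.
fn bpe_merge(word: &str, merges: &[(&str, &str)]) -> Vec<String> {
    // "▁" marks the start of a word, as in BPEmb/SentencePiece vocabularies.
    let mut symbols: Vec<String> = std::iter::once("▁".to_string())
        .chain(word.chars().map(|c| c.to_string()))
        .collect();

    for (a, b) in merges {
        let mut i = 0;
        while i + 1 < symbols.len() {
            if symbols[i] == *a && symbols[i + 1] == *b {
                // Merge the pair in place and drop the second element.
                symbols[i] = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.remove(i + 1);
            } else {
                i += 1;
            }
        }
    }
    symbols
}

fn main() {
    // Made-up merge rules that a corpus with many occurrences of "low"
    // might produce.
    let merges = [("▁", "l"), ("▁l", "o"), ("▁lo", "w")];
    println!("{:?}", bpe_merge("low", &merges)); // ["▁low"]
}
```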
## Features
- Bring your own BPE token vocabularies, or use one of the bundled pre-trained defaults (see below).
- Pre-trained multilingual vocabularies sourced from the BPEmb project, with support for tokenizing text in 275 languages.
- Unicode-aware sentence and word segmentation: leverages the `unicode-segmentation` crate for proper text splitting.
## Installation

To add this crate to your project, run:

```sh
cargo add bpe-tokenizer
```

Or manually include it in your `Cargo.toml`:

```toml
[dependencies]
bpe-tokenizer = "<version>"
```
## Full Example

Here is an example of how to create a `BytePairEncoder` from a string and use it to tokenize text:
```rust
use bpe_tokenizer::BytePairEncoder;

// The vocabulary entries below are illustrative: one token per line with a
// tab-separated count, in the style of BPEmb .vocab files.
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello world");
println!("{:?}", tokenized);
```
The output will be a vector of tokens:

```text
["<s>", "▁hello", "▁world", "</s>"]
```
Or load a vocabulary from a file:

```rust
use bpe_tokenizer::BytePairEncoder;

// The path is illustrative.
let vocab = BytePairEncoder::new_from_file("path/to/vocab.txt").unwrap();
```
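Since the constructors return a `Result`, you can also propagate failures with `?` instead of calling `.unwrap()`. A minimal sketch, assuming the crate's error type implements `std::error::Error` (the `load_vocab` helper is illustrative):

```rust
use bpe_tokenizer::BytePairEncoder;

// Illustrative helper: propagate the load error instead of panicking.
// (Assumes the crate's error type implements std::error::Error.)
fn load_vocab(path: &str) -> Result<BytePairEncoder, Box<dyn std::error::Error>> {
    Ok(BytePairEncoder::new_from_file(path)?)
}
```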
## Cargo Features
The crate also includes optional default pre-trained vocabularies, enabled via Cargo features. They are trained on Wikipedia data as part of the BPEmb project. These MIT-licensed vocabularies support 275 languages and come in three sizes to match different memory and coverage needs:
### Available Optional Features
- `default-small` (100,000 tokens): Suitable for memory-constrained environments.
- `default-medium` (320,000 tokens): Balances token coverage and memory efficiency.
- `default-large` (1,000,000 tokens): Provides the most detailed token representations for high-granularity tasks.
### Enabling Optional Features
To use these default vocabularies, specify the feature in your `Cargo.toml`:

```toml
[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }
```
### Example with the `default-medium` Vocabulary

An example of using the medium vocabulary (320,000 tokens):

```rust
use bpe_tokenizer::BytePairEncoder;

// Requires the `default-medium` Cargo feature. The constructor name below is
// an assumption; check the crate documentation for the exact API.
let vocab = BytePairEncoder::new_default_medium().unwrap();
let tokenized = vocab.tokenize("This is a test sentence.");
println!("{:?}", tokenized);
```
## Tokenization Functions
The crate provides various ways to interact with the tokenizer:
- Tokenize into a flat `Vec<String>`: `BytePairEncoder::tokenize`

  Splits and flattens the text into tokens.

  ```rust
  let tokenized = vocab.tokenize("Example sentence");
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```

- Tokenize into nested sentence vectors (`Vec<Vec<String>>`): `BytePairEncoder::tokenize_sentences`

  Useful for processing multiple sentences separately.

  ```rust
  let tokenized = vocab.tokenize_sentences("This is sentence one. And this is sentence two.");
  // Output: [["<s>", "▁this", "▁is", "▁sentence", "▁one", "</s>"],
  //          ["<s>", "▁and", "▁this", "▁is", "▁sentence", "▁two", "</s>"]]
  ```

- Iterative tokenization: `BytePairEncoder::tokenize_iter` and `BytePairEncoder::tokenize_sentences_iter`

  Provides an iterator over generated tokens for better memory efficiency on large inputs (see the streaming sketch after this list).

  ```rust
  let tokens: Vec<String> = vocab.tokenize_iter("Example sentence").collect();
  // Output: ["<s>", "▁example", "▁sentence", "</s>"]
  ```
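For very large inputs, the iterator variants can be consumed lazily without ever building a `Vec`. A minimal sketch (the `count_tokens` helper is illustrative, and the iterator is assumed to yield `String`s):

```rust
use bpe_tokenizer::BytePairEncoder;

// Count tokens in a large text without collecting them into a Vec.
// (Illustrative helper; assumes `tokenize_iter` yields `String`s.)
fn count_tokens(vocab: &BytePairEncoder, text: &str) -> usize {
    vocab.tokenize_iter(text).count()
}
```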
## Licensing
This crate is licensed under the MIT License.
## Contributing
Contributions are welcome! Please open an issue, submit a pull request, or reach out if you'd like to contribute awesome new features or fixes to this crate.