bpe_tokenizer/lib.rs
//! # A Byte Pair Encoding (BPE) tokenizer implementation.
//!
//! This module provides functionality for [BPE
//! tokenization](https://en.wikipedia.org/wiki/Byte_pair_encoding), a text tokenization technique
//! that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused
//! byte. In natural language processing, it's used to break down words into subword
//! tokens.
//!
//! This implementation does not start with bytes and iteratively replace them with pairs as
//! described above. Instead, it uses a pre-trained token vocabulary to identify the most frequent
//! pairs.
//!
//! Text input for tokenization is first split into sentences, which are then split into words.
//! All sentence and word splitting is Unicode-aware through the functionality provided by the
//! [`unicode-segmentation`](https://docs.rs/unicode-segmentation) crate. Next, each word (`&str`)
//! is tokenized into a vector of tokens (`Vec<String>`) as follows:
//!
//! 1. Iterate through possible substrings of the word, from longest to shortest.
//! 1. For each substring length, find any matching token in the vocabulary.
//! 1. Choose the matching token with the highest score in the vocabulary.
//! 1. Split the word at the chosen token and recursively tokenize the parts before and after it.
//!
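//! As a minimal, self-contained sketch of these steps (illustrative only: the `tokenize_word`
//! helper below is hypothetical and not part of this crate's API, which also handles Unicode
//! segmentation and sentence/word markers):
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! // Hypothetical helper: greedily pick the longest, highest-scoring vocabulary
//! // match, then recursively tokenize the remainders on either side of it.
//! fn tokenize_word(word: &str, vocab: &HashMap<&str, i64>) -> Vec<String> {
//!     for len in (1..=word.len()).rev() {
//!         // Among all matches of this length, keep the highest-scoring one.
//!         let best = (0..=word.len() - len)
//!             .filter_map(|start| {
//!                 let tok = word.get(start..start + len)?; // respects UTF-8 boundaries
//!                 vocab.get(tok).map(|score| (score, start, tok))
//!             })
//!             .max_by_key(|(score, ..)| **score);
//!         if let Some((_, start, tok)) = best {
//!             // Split at the match and recurse on the parts before and after it.
//!             let mut tokens = tokenize_word(&word[..start], vocab);
//!             tokens.push(tok.to_string());
//!             tokens.extend(tokenize_word(&word[start + len..], vocab));
//!             return tokens;
//!         }
//!     }
//!     Vec::new() // empty word, or no match at any length
//! }
//!
//! let vocab = HashMap::from([("low", 2i64), ("er", 1)]);
//! assert_eq!(tokenize_word("lower", &vocab), vec!["low", "er"]);
//! ```
//!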
//! ## Main Components
//!
//! ### Initialization
//!
//! A `BytePairEncoder` is created from a pre-trained token vocabulary file. You can find
//! MIT-licensed vocabulary files at the [BPEmb](https://github.com/bheinzerling/bpemb) project.
//!
//! Initialization can be done in two ways:
//!
//! - [`BytePairEncoder::new_from_file`]: Create a `BytePairEncoder` from a file.
//! - [`BytePairEncoder::new_from_str`]: Create a `BytePairEncoder` from a string.
//!
//! The crate also includes default token vocabularies which support 275 languages. These are
//! disabled by default and can be enabled with the `default-{small,medium,large}` features.
//!
//! - [`BytePairEncoder::new_default_small`]: Create a `BytePairEncoder` for the default small
//!   model (100k vocabulary).
//! - [`BytePairEncoder::new_default_medium`]: Create a `BytePairEncoder` for the default medium
//!   model (320k vocabulary).
//! - [`BytePairEncoder::new_default_large`]: Create a `BytePairEncoder` for the default large
//!   model (1M vocabulary).
//!
//! For more information on these, see the **Features** section below.
//!
//! ### Tokenization into `Vec<String>` or `Vec<Vec<String>>`
//!
//! Once you have a `BytePairEncoder`, you can use the following associated functions to tokenize
//! text into vectors of tokens:
//!
//! - [`BytePairEncoder::tokenize`]: Tokenize text into a flat vector of BPE tokens.
//! - [`BytePairEncoder::tokenize_sentences`]: Tokenize text into nested vectors of sentences and tokens.
//!
//! ### Tokenization via Iterators
//!
//! Alternatively, you can use the following associated functions to tokenize text into iterators:
//!
//! - [`BytePairEncoder::tokenize_iter`]: Tokenize text into a flat sequence of BPE tokens.
//! - [`BytePairEncoder::tokenize_sentences_iter`]: Tokenize text into nested sentences and tokens.
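//!
//! For example (a sketch; this assumes the iterator yields owned `String` tokens and reuses the
//! tiny two-token vocabulary from the example below):
//!
//! ```
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
//! // Lazily yields one BPE token at a time instead of collecting a Vec up front.
//! for token in vocab.tokenize_iter("Hello, world!") {
//!     println!("{token}");
//! }
//! ```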
//!
//! ## Example
//!
//! ```
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
//! let tokenized = vocab.tokenize("Hello, world!");
//! ```
//!
//! ## Features
//!
//! This crate offers optional Cargo features that can be enabled in your `Cargo.toml`. Depending
//! on your application, you can choose a default vocabulary size for the `BytePairEncoder` to
//! work with multilingual tokens. The default vocabularies are pre-trained on Wikipedia data by
//! the [BPEmb](https://github.com/bheinzerling/bpemb) project, providing multilingual
//! tokenization support for 275 languages.
//!
//! ### `default-small` (100,000 tokens)
//!
//! - Enables construction of `BytePairEncoder` with a smaller vocabulary size of 100,000 tokens.
//! - Suitable for memory-constrained environments and simpler tasks where fine-grained
//!   tokenization is less necessary.
//!
//! Example of enabling this in your `Cargo.toml`:
//! ```toml
//! [dependencies]
//! bpe-tokenizer = { version = "<version>", features = ["default-small"] }
//! ```
//!
//! ### `default-medium` (320,000 tokens)
//!
//! - Enables construction of `BytePairEncoder` with a vocabulary size of 320,000 tokens.
//! - Provides a balance between vocabulary size and memory usage, making it suitable for a
//!   broader range of tasks.
//!
//! Example of enabling this in your `Cargo.toml`:
//! ```toml
//! [dependencies]
//! bpe-tokenizer = { version = "<version>", features = ["default-medium"] }
//! ```
//!
//! ### `default-large` (1,000,000 tokens)
//!
//! - Enables construction of `BytePairEncoder` with a vocabulary size of 1,000,000 tokens.
//! - Ideal for tasks that require high token coverage, providing the most detailed token
//!   representations at the expense of additional memory usage.
//!
//! Example of enabling this in your `Cargo.toml`:
//! ```toml
//! [dependencies]
//! bpe-tokenizer = { version = "<version>", features = ["default-large"] }
//! ```
//!
//! The vocabulary size directly impacts tokenization granularity and memory consumption, so
//! choose based on your application's needs.
//!
//! ### Example with Default Vocabularies
//!
//! ```rust
//! # #[cfg(feature = "default-medium")] {
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let encoder = BytePairEncoder::new_default_medium().unwrap();
//! let tokenized = encoder.tokenize("This is a test sentence.");
//! assert_eq!(tokenized[0], "<s>".to_string());
//! # }
//! ```
//!
//! Note that when multiple features are enabled, the corresponding `new_default_*` functions
//! (e.g., [`BytePairEncoder::new_default_small`], [`BytePairEncoder::new_default_medium`],
//! [`BytePairEncoder::new_default_large`]) become available for constructing a `BytePairEncoder`.
//! Enable only the features you need, to keep memory usage and binary size to a minimum.

mod constants;
mod default_vocabs;
mod errors;
mod tokenizer;

// tests
#[cfg(test)]
mod tests;

// re-exports
pub use errors::BytePairEncoderError;
pub use tokenizer::BytePairEncoder;