//! # A Byte Pair Encoding (BPE) tokenizer implementation.
//!
//! This module provides functionality for [BPE
//! tokenization](https://en.wikipedia.org/wiki/Byte_pair_encoding), a text tokenization technique
//! that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused
//! byte. In natural language processing, it's used to break down words into subword
//! tokens.
//!
//! This implementation does not start from raw bytes and iteratively merge the most frequent
//! pairs as described above. Instead, it relies on a pre-trained token vocabulary in which the
//! most frequent pairs have already been merged; tokenization then reduces to matching against
//! that vocabulary.
//!
//! Text input for tokenization is first split into sentences, which are then split into words.
//! All sentence and word splitting is Unicode-aware through the functionality provided by the
//! [`unicode-segmentation`](https://docs.rs/unicode-segmentation) crate. Next, each word (`&str`)
//! is tokenized into a vector of tokens (`Vec<String>`) as follows:
//!
//! 1. Iterate through possible substrings of the word, from longest to shortest.
//! 1. For each substring length, find all substrings of that length that match a token in the
//!    vocabulary.
//! 1. Choose the matching token with the highest vocabulary score.
//! 1. Split the word at the chosen token and recursively tokenize the parts before and after it,
//!    as sketched below.
//!
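//! The sketch below illustrates this longest-match recursion. It is not the crate's actual
//! implementation: for brevity it indexes by bytes (so it assumes ASCII input, whereas the crate
//! is Unicode-aware) and uses a plain `HashMap<String, i64>` as a stand-in vocabulary mapping
//! tokens to scores.
//!
//! ```
//! use std::collections::HashMap;
//!
//! fn tokenize_word(word: &str, vocab: &HashMap<String, i64>) -> Vec<String> {
//!     if word.is_empty() {
//!         return Vec::new();
//!     }
//!     // Steps 1-2: walk substring lengths from longest to shortest and
//!     // collect the vocabulary matches at each length.
//!     for len in (1..=word.len()).rev() {
//!         // Step 3: among the matches of this length, pick the one with the
//!         // highest score.
//!         let best = (0..=word.len() - len)
//!             .filter_map(|start| {
//!                 vocab
//!                     .get(&word[start..start + len])
//!                     .map(|score| (start, *score))
//!             })
//!             .max_by_key(|&(_, score)| score);
//!         if let Some((start, _)) = best {
//!             // Step 4: split at the chosen token and recurse on both sides.
//!             let mut tokens = tokenize_word(&word[..start], vocab);
//!             tokens.push(word[start..start + len].to_string());
//!             tokens.extend(tokenize_word(&word[start + len..], vocab));
//!             return tokens;
//!         }
//!     }
//!     // Placeholder for unmatched input; the crate's handling may differ.
//!     vec!["<unk>".to_string()]
//! }
//!
//! let vocab: HashMap<String, i64> = [("hell", 4), ("he", 2), ("llo", 3), ("o", 1)]
//!     .into_iter()
//!     .map(|(token, score)| (token.to_string(), score))
//!     .collect();
//! assert_eq!(tokenize_word("hello", &vocab), vec!["hell", "o"]);
//! ```
//!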
//! ## Main Components
//!
//! ### Initialization
//!
//! A `BytePairEncoder` is created from a pre-trained token vocabulary file. You can find
//! MIT-licensed vocabulary files at the [BPEmb](https://github.com/bheinzerling/bpemb) project.
//!
//! Initialization can be done in two ways, both sketched below:
//!
//! - [`BytePairEncoder::new_from_file`]: Create a `BytePairEncoder` from a file.
//! - [`BytePairEncoder::new_from_str`]: Create a `BytePairEncoder` from a string.
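//!
//! A minimal construction sketch (the vocabulary format follows the Example section below: each
//! line holds one token and a number, tab-separated; the file path is illustrative):
//!
//! ```
//! use bpe_tokenizer::BytePairEncoder;
//!
//! // From an in-memory vocabulary string:
//! let encoder = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
//!
//! // Or from a vocabulary file on disk (hypothetical path):
//! // let encoder = BytePairEncoder::new_from_file("path/to/vocab.tsv").unwrap();
//! ```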
//!
//! The crate also includes default token vocabularies which support 275 languages. These are
//! disabled by default and can be enabled with the "default-{small,medium,large}" features.
//!
//! - [`BytePairEncoder::new_default_small`]: Create a `BytePairEncoder` for the default small
//! model (100k vocabulary).
//! - [`BytePairEncoder::new_default_medium`]: Create a `BytePairEncoder` for the default medium
//! model (320k vocabulary).
//! - [`BytePairEncoder::new_default_large`]: Create a `BytePairEncoder` for the default large
//! model (1M vocabulary).
//!
//! For more information on these, see the **Features** section below.
//!
//! ### Tokenization into `Vec<String>` or `Vec<Vec<String>>`
//!
//! Once you have a `BytePairEncoder`, you can use the following methods to tokenize text into
//! vectors of tokens, as shown in the example after this list:
//!
//! - [`BytePairEncoder::tokenize`]: Tokenize text into a flat vector of BPE tokens.
//! - [`BytePairEncoder::tokenize_sentences`]: Tokenize text into nested vectors of sentences and tokens.
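//!
//! A short usage sketch (the exact tokens produced depend on the vocabulary; the tiny vocabulary
//! here is the one from the Example section below):
//!
//! ```
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
//!
//! // Flat: one Vec<String> for the whole input.
//! let flat: Vec<String> = vocab.tokenize("Hello, world! Hi there.");
//!
//! // Nested: one inner Vec<String> per sentence.
//! let nested: Vec<Vec<String>> = vocab.tokenize_sentences("Hello, world! Hi there.");
//! ```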
//!
//! ### Tokenization via Iterators
//!
//! Alternatively, you can use the following methods to tokenize text into iterators, as sketched
//! after this list:
//!
//! - [`BytePairEncoder::tokenize_iter`]: Tokenize text into a flat sequence of BPE tokens.
//! - [`BytePairEncoder::tokenize_sentences_iter`]: Tokenize text into nested sentences and tokens.
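//!
//! A lazy-consumption sketch (this assumes the iterator yields owned `String` tokens):
//!
//! ```
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
//!
//! // Stream tokens without collecting them into a Vec first.
//! for token in vocab.tokenize_iter("Hello, world!") {
//!     println!("{token}");
//! }
//! ```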
//!
//! ## Example
//!
//! ```
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
//! let tokenized = vocab.tokenize("Hello, world!");
//! ```
//!
//! ## Features
//!
//! This crate ships optional default vocabularies that can be enabled via Cargo features in your
//! `Cargo.toml`. Depending on your application, you can choose one of three vocabulary sizes for
//! the `BytePairEncoder` to work with multilingual tokens. The default vocabularies are
//! pre-trained on Wikipedia data by the [BPEmb](https://github.com/bheinzerling/bpemb) project,
//! providing multilingual tokenization support for 275 languages.
//!
//! ### `default-small` (100,000 tokens):
//! - Enables construction of `BytePairEncoder` with a smaller vocabulary size of 100,000 tokens.
//! - Suitable for memory-constrained environments and simpler tasks where fine-grained
//! tokenization is less critical.
//!
//! Example of enabling this in your `Cargo.toml`:
//! ```toml
//! [dependencies]
//! bpe-tokenizer = { version = "<version>", features = ["default-small"] }
//! ```
//!
//! ### `default-medium` (320,000 tokens):
//! - Enables construction of `BytePairEncoder` with a vocabulary size of 320,000 tokens.
//! - Provides a balance between vocabulary size and memory usage, making it suitable for a
//! broader range of tasks.
//!
//! Example of enabling this in your `Cargo.toml`:
//! ```toml
//! [dependencies]
//! bpe-tokenizer = { version = "<version>", features = ["default-medium"] }
//! ```
//!
//! ### `default-large` (1,000,000 tokens):
//! - Enables construction of `BytePairEncoder` with a vocabulary size of 1,000,000 tokens.
//! - Ideal for tasks that require high token coverage, providing the most detailed token
//! representations at the expense of additional memory usage.
//!
//! Example of enabling this in your `Cargo.toml`:
//! ```toml
//! [dependencies]
//! bpe-tokenizer = { version = "<version>", features = ["default-large"] }
//! ```
//!
//! The vocabulary size directly impacts the granularity of the tokenization and memory
//! consumption, so choose based on your application's needs.
//!
//! ### Example with Default Vocabularies
//!
//! ```rust
//! # #[cfg(feature = "default-medium")] {
//! use bpe_tokenizer::BytePairEncoder;
//!
//! let encoder = BytePairEncoder::new_default_medium().unwrap();
//! let tokenized = encoder.tokenize("This is a test sentence.");
//! assert_eq!(tokenized[0], "<s>".to_string());
//! # }
//! ```
//!
//! Note that each of these features, when enabled, makes the corresponding `new_default_*`
//! function (e.g., [`BytePairEncoder::new_default_small`], [`BytePairEncoder::new_default_medium`],
//! [`BytePairEncoder::new_default_large`]) available for constructing a `BytePairEncoder`.
//! Enable only the features you need, to keep memory use and binary size to a minimum.
mod constants;
mod default_vocabs;
mod errors;
mod tokenizer;
// tests
#[cfg(test)]
mod tests;
// re-exports
pub use errors::BytePairEncoderError;
pub use tokenizer::BytePairEncoder;