Crate bpe_tokenizer

§A Byte Pair Encoding (BPE) tokenizer implementation.

This module provides functionality for BPE tokenization, a text tokenization technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. In natural language processing, it’s used to break down words into subword tokens.

This implementation does not train a vocabulary by iteratively merging byte pairs as described above. Instead, it uses a pre-trained token vocabulary to match known subword tokens in the input text.

Text input for tokenization is first split into sentences, which are then split into words. All sentence and word splitting is Unicode-aware through the functionality provided by the unicode-segmentation crate. Next, each word (&str) is tokenized into a vector of tokens (Vec<String>) as follows:

  1. Iterate through possible substrings of the word, from longest to shortest.
  2. For each substring length, find any matching token in the vocabulary.
  3. Choose the matching token with the highest score in the vocabulary.
  4. Split the word at the chosen token and recursively tokenize the parts before and after it.
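The four steps above can be sketched in plain Rust with a score map standing in for the vocabulary. This is a hypothetical, simplified illustration of the greedy matching, not the crate's actual implementation (which, for example, also handles word markers and unknown-token output):

```rust
use std::collections::HashMap;

// Hypothetical sketch of the greedy word tokenization described above.
fn tokenize_word(word: &str, vocab: &HashMap<String, f64>) -> Vec<String> {
    if word.is_empty() {
        return Vec::new();
    }
    let chars: Vec<char> = word.chars().collect();
    // 1. Iterate through possible substring lengths, longest to shortest.
    for len in (1..=chars.len()).rev() {
        // 2.-3. Among substrings of this length, remember the vocabulary
        // match with the highest score.
        let mut best: Option<(usize, f64)> = None;
        for start in 0..=chars.len() - len {
            let cand: String = chars[start..start + len].iter().collect();
            if let Some(&score) = vocab.get(&cand) {
                if best.map_or(true, |(_, s)| score > s) {
                    best = Some((start, score));
                }
            }
        }
        // 4. Split at the chosen token and recurse on both remainders.
        if let Some((start, _)) = best {
            let before: String = chars[..start].iter().collect();
            let after: String = chars[start + len..].iter().collect();
            let mut out = tokenize_word(&before, vocab);
            out.push(chars[start..start + len].iter().collect());
            out.extend(tokenize_word(&after, vocab));
            return out;
        }
    }
    // No vocabulary match at any length: fall back to per-character tokens.
    chars.iter().map(|c| c.to_string()).collect()
}

fn main() {
    let vocab: HashMap<String, f64> = [("hello", 1.0), ("wor", 2.0), ("ld", 1.5)]
        .into_iter()
        .map(|(t, s)| (t.to_string(), s))
        .collect();
    // "hello" matches at length 5; the remainder "world" splits into "wor" + "ld".
    println!("{:?}", tokenize_word("helloworld", &vocab));
}
```

Because the search recurses on both sides of each chosen token, every character of the word ends up covered by either a vocabulary token or a fallback token.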

§Main Components

§Initialization

A BytePairEncoder is created from a pre-trained token vocabulary file. You can find MIT-licensed vocabulary files at the BPEmb project.

A BytePairEncoder can be initialized in two ways: from a vocabulary file on disk, or directly from an in-memory vocabulary string (as with BytePairEncoder::new_from_str in the example below).

The crate also includes default token vocabularies which support 275 languages. These are disabled by default and can be enabled with the “default-{small,medium,large}” features.

For more information on these, see the Features section below.

§Tokenization into Vec<String> or Vec<Vec<String>>

Once you have a BytePairEncoder, its associated tokenization functions produce vectors of tokens: a flat Vec<String> for the whole text, or a Vec<Vec<String>> with one inner vector per sentence.

§Tokenization via Iterators

Alternatively, iterator-based variants of these functions yield tokens lazily, without allocating the full result up front.

§Example

use bpe_tokenizer::{BytePairEncoder, BytePairEncoderError};

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
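The vocabulary string above uses one entry per line, with the token and its score separated by a tab. A std-only sketch of parsing that format (a hypothetical illustration; the crate's actual loader may differ in error handling and format details):

```rust
use std::collections::HashMap;

// Hypothetical parser for the "token\tscore" line format shown above.
// Lines that do not contain a tab or a numeric score are skipped.
fn parse_vocab(s: &str) -> HashMap<String, f64> {
    s.lines()
        .filter_map(|line| {
            let (token, score) = line.split_once('\t')?;
            Some((token.to_string(), score.trim().parse().ok()?))
        })
        .collect()
}

fn main() {
    let vocab = parse_vocab("hello\t1\nworld\t2");
    println!("{:?}", vocab.get("world")); // Some(2.0)
}
```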

§Features

This crate offers optional Cargo features, enabled in your Cargo.toml, that bundle a default vocabulary for the BytePairEncoder. Choose a vocabulary size based on your application: the default vocabularies are pre-trained on Wikipedia data by the BPEmb project and provide multilingual tokenization support for 275 languages.

§default-small (100,000 tokens):

  • Enables construction of BytePairEncoder with a smaller vocabulary size of 100,000 tokens.

  • Suitable for memory-constrained environments and simpler tasks where fine-grained tokenization is less necessary.

    Example of enabling this in your Cargo.toml:

    [dependencies]
    bpe-tokenizer = { version = "<version>", features = ["default-small"] }

§default-medium (320,000 tokens):

  • Enables construction of BytePairEncoder with a vocabulary size of 320,000 tokens.

  • Provides a balance between vocabulary size and memory usage, making it suitable for a broader range of tasks.

    Example of enabling this in your Cargo.toml:

    [dependencies]
    bpe-tokenizer = { version = "<version>", features = ["default-medium"] }

§default-large (1,000,000 tokens):

  • Enables construction of BytePairEncoder with a vocabulary size of 1,000,000 tokens.

  • Ideal for tasks that require high token coverage, providing the most detailed token representations at the expense of additional memory usage.

    Example of enabling this in your Cargo.toml:

    [dependencies]
    bpe-tokenizer = { version = "<version>", features = ["default-large"] }

The vocabulary size directly impacts the granularity of the tokenization and memory consumption, so choose based on your application’s needs.

§Example with Default Vocabularies

use bpe_tokenizer::{BytePairEncoder, BytePairEncoderError};

let encoder = BytePairEncoder::new_default_medium().unwrap();
let tokenized = encoder.tokenize("This is a test sentence.");
assert_eq!(tokenized[0], "<s>".to_string());

Note that each enabled feature makes the corresponding new_default_* constructor (BytePairEncoder::new_default_small, BytePairEncoder::new_default_medium, or BytePairEncoder::new_default_large) available for constructing a BytePairEncoder. Enable only the features you need to keep memory usage and binary size to a minimum.

Structs§

BytePairEncoder
Represents a Byte Pair Encoding (BPE) vocabulary used for tokenization.

Enums§

BytePairEncoderError
Represents errors that can occur during BPE tokenization operations.