Struct bpe_tokenizer::BytePairEncoder

pub struct BytePairEncoder { /* private fields */ }

§Represents a Byte Pair Encoding (BPE) vocabulary used for tokenization.

This struct holds the mapping of tokens to their respective scores and provides methods for tokenizing text using the BPE algorithm.

The vocabulary is typically loaded from a file or string where each line contains a token and its score, separated by a tab character.

§Example

use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
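
With only the two entries above in the vocabulary, most of the input falls back to the <unk> token; a minimal sketch for inspecting the result (exact tokens depend on the vocabulary used):

for token in &tokenized {
    println!("{token}");
}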

Implementations§

impl BytePairEncoder

pub fn new_from_file(file_path: &str) -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder from a file containing token-score pairs.

This function reads the contents of the file specified by file_path and constructs a BytePairEncoder from it. The file should contain token-score pairs, with each pair on a separate line and the token and score separated by a tab character (\t).

§Input Format

The file is expected to follow this format:

<token>\t<score>\n

Each line should consist of:

  • A token (a string) followed by a tab character (\t)
  • A score (an integer), which may be positive or negative.

Example lines from the file:

<unk>    0
▁t       -0
▁the     -4
§Arguments
  • file_path - A string slice that holds the path to the file containing token-score pairs.
§Returns
  • Result<Self, BytePairEncoderError> - A Result containing the created BytePairEncoder if successful, or a BytePairEncoderError if there was an error reading the file or parsing its contents.
§Errors

This function will return an error if:

  • The file cannot be read (returns BytePairEncoderError::InvalidFile)
  • The file contents are not in the expected format (returns BytePairEncoderError::InvalidVocabularyInput)
§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_file("path/to/vocabulary/file.txt");
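
A minimal sketch of handling the returned Result, continuing the hypothetical path above and assuming the error type implements Display (typical for Rust error enums):

match BytePairEncoder::new_from_file("path/to/vocabulary/file.txt") {
    Ok(vocab) => {
        let tokens = vocab.tokenize("Hello, world!");
        println!("loaded vocabulary; produced {} tokens", tokens.len());
    }
    // The documented failure modes are InvalidFile and InvalidVocabularyInput.
    Err(err) => eprintln!("failed to load vocabulary: {err}"),
}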

pub fn new_from_str(input: &str) -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder from a string containing token-score pairs.

This function parses the input string to construct a BytePairEncoder. The input should contain token-score pairs, with each pair on a separate line and the token and score separated by a tab character (\t).

§Input Format

The string must follow this format:

<token>\t<score>\n

Each line in the string should consist of:

  • A token (a string) followed by a tab character (\t)
  • A score (an integer), which may be positive or negative.

For example:

hello   1
world   2
▁the    -4
§Arguments
  • input - A string slice that holds the token-score pairs.
§Returns
  • Result<Self, BytePairEncoderError> - A Result containing the created BytePairEncoder if successful, or a BytePairEncoderError if there was an error parsing the input.
§Errors

This function will return BytePairEncoderError::InvalidVocabularyInput if:

  • A line doesn’t contain a tab character to separate token and score.
  • The score cannot be parsed as an isize.
§Example
use bpe_tokenizer::BytePairEncoder;

let input = "hello\t1\nworld\t2";
let vocab = BytePairEncoder::new_from_str(input).unwrap();
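
Conversely, input that violates the format is rejected; a brief sketch based on the error conditions above, using a line whose token and score are separated by a space instead of a tab:

let bad_input = "hello 1"; // no tab separator between token and score
assert!(BytePairEncoder::new_from_str(bad_input).is_err());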

pub fn new_default_small() -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder with a default small vocabulary size (100,000 tokens).

This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary that supports 275 languages. The vocabulary is sourced from the BPEmb project, licensed under MIT. The small vocabulary file contains 100,000 tokens, making it the most compact of the bundled vocabularies and well suited to memory-constrained tasks.

§Returns

A Result<Self, BytePairEncoderError> containing the constructed BytePairEncoder if the vocabulary loads successfully, or a corresponding error if initialization fails.

§Example
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_small().unwrap();
§Note

This constructor is only available when the default-small feature is enabled in Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-small"] }

pub fn new_default_medium() -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder with a default medium vocabulary size (320,000 tokens).

This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary that supports 275 languages. The vocabulary is sourced from the BPEmb project, licensed under MIT. The medium-sized vocabulary file consists of 320,000 tokens, offering a balance between token coverage and memory efficiency, making it suitable for a wide variety of NLP tasks.

§Returns

A Result<Self, BytePairEncoderError> containing the constructed BytePairEncoder if the vocabulary loads successfully, or a corresponding error if initialization fails.

§Example
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_medium().unwrap();
§Note

This constructor is only available when the default-medium feature is enabled in Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }

pub fn new_default_large() -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder with a default large vocabulary size (1,000,000 tokens).

This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary that supports 275 languages. The vocabulary is sourced from the BPEmb project, licensed under MIT. The large-sized vocabulary consists of 1,000,000 tokens, providing maximum coverage for detailed language representation, especially useful in applications requiring high granularity.

§Returns

A Result<Self, BytePairEncoderError> containing the constructed BytePairEncoder if the vocabulary loads successfully, or a corresponding error if initialization fails.

§Example
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_large().unwrap();
§Note

This constructor is only available when the default-large feature is enabled in Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-large"] }

pub fn tokenize_sentences_iter<'a>(&'a self, text: &'a str) -> impl Iterator<Item = impl Iterator<Item = String> + 'a> + 'a

§Tokenizes a text into sentences, then words, and finally into BPE tokens.

This function takes a string of text and returns a nested iterator: the outer iterator yields one item per sentence, and each item is itself an iterator over that sentence's tokens.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

An iterator of iterators, where each inner iterator yields the String tokens of one tokenized sentence.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized: Vec<Vec<String>> = vocab
    .tokenize_sentences_iter(text)
    .map(|sentence_iter| sentence_iter.collect())  // Collect each inner iterator into a Vec<String>
    .collect();  // Then collect everything into Vec<Vec<String>>
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.
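
A lazy-usage sketch (same toy vocabulary as above): counting tokens per sentence without collecting intermediate vectors:

use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
for (i, sentence) in vocab.tokenize_sentences_iter("Hello, world! How are you?").enumerate() {
    // Each `sentence` is an iterator over that sentence's tokens.
    println!("sentence {i}: {} tokens", sentence.count());
}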

pub fn tokenize_iter<'a>(&'a self, text: &'a str) -> impl Iterator<Item = String> + 'a

§Tokenizes a text into a flat sequence of BPE tokens.

This function takes a string of text and returns an iterator that yields individual tokens. It first tokenizes the text into sentences, then words, and finally into BPE tokens, flattening the result into a single sequence.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

An iterator that yields String, where each String represents a token.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized: Vec<String> = vocab.tokenize_iter(text).collect();
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.
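
Because the returned iterator is lazy, downstream adapters can stop early; a short sketch (same toy vocabulary) that takes only the first few tokens:

use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let first_five: Vec<String> = vocab
    .tokenize_iter("Hello, world! How are you?")
    .take(5)
    .collect();
assert!(first_five.len() <= 5);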

pub fn tokenize_sentences(&self, text: &str) -> Vec<Vec<String>>

§Tokenizes a text into sentences, then words, and finally into BPE tokens.

This function takes a string of text and returns a vector of tokenized sentences, where each sentence is represented as a vector of tokens.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

A Vec<Vec<String>>, where each inner Vec<String> represents a tokenized sentence.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized = vocab.tokenize_sentences(text);
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.
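
Continuing the example above, a small sketch that walks the returned sentences (exact tokens depend on the vocabulary):

for (i, sentence) in tokenized.iter().enumerate() {
    println!("sentence {i}: {:?}", sentence);
}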

pub fn tokenize(&self, text: &str) -> Vec<String>

§Tokenizes a text into a flat sequence of BPE tokens.

This function takes a string of text and returns a vector of tokens. It first tokenizes the text into sentences, then words, and finally into BPE tokens, flattening the result into a single sequence.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

A Vec<String>, where each String represents a token.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized = vocab.tokenize(text);
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.

Trait Implementations§

impl Clone for BytePairEncoder

fn clone(&self) -> BytePairEncoder

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for BytePairEncoder

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl PartialEq for BytePairEncoder

fn eq(&self, other: &BytePairEncoder) -> bool

Tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl Eq for BytePairEncoder

impl StructuralPartialEq for BytePairEncoder

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.