Struct bpe_tokenizer::BytePairEncoder

pub struct BytePairEncoder { /* private fields */ }

§Represents a Byte Pair Encoding (BPE) vocabulary used for tokenization.

This struct holds the mapping of tokens to their respective scores and provides methods for tokenizing text using the BPE algorithm.

The vocabulary is typically loaded from a file or string where each line contains a token and its score, separated by a tab character.

§Example

use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
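
With only the two entries above in the vocabulary, most of the input falls back to the <unk> token; a minimal sketch for inspecting the result (exact tokens depend on the vocabulary used):

for token in &tokenized {
    println!("{token}");
}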

Implementations§

impl BytePairEncoder

pub fn new_from_file(file_path: &str) -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder from a file containing token-score pairs.

This function reads the contents of the file specified by file_path and constructs a BytePairEncoder from it. The file should contain token-score pairs, with each pair on a separate line and the token and score separated by a tab character (\t).

§Input Format

The file is expected to follow this format:

<token>\t<score>\n

Each line should consist of:

  • A token (a string) followed by a tab character (\t)
  • A score (an integer), which may be positive or negative.

Example lines from the file:

<unk>    0
▁t       -0
▁the     -4
§Arguments
  • file_path - A string slice that holds the path to the file containing token-score pairs.
§Returns
  • Result<Self, BytePairEncoderError> - A Result containing the created BytePairEncoder if successful, or a BytePairEncoderError if there was an error reading the file or parsing its contents.
§Errors

This function will return an error if:

  • The file cannot be read (returns BytePairEncoderError::InvalidFile)
  • The file contents are not in the expected format (returns BytePairEncoderError::InvalidVocabularyInput)
§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_file("path/to/vocabulary/file.txt");
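
A minimal sketch of handling the returned Result, continuing the hypothetical path above and assuming the error type implements Display (typical for Rust error enums):

match BytePairEncoder::new_from_file("path/to/vocabulary/file.txt") {
    Ok(vocab) => {
        let tokens = vocab.tokenize("Hello, world!");
        println!("loaded vocabulary; produced {} tokens", tokens.len());
    }
    // The documented failure modes are InvalidFile and InvalidVocabularyInput.
    Err(err) => eprintln!("failed to load vocabulary: {err}"),
}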

pub fn new_from_str(input: &str) -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder from a string containing token-score pairs.

This function parses the input string to construct a BytePairEncoder. The input should contain token-score pairs, with each pair on a separate line and the token and score separated by a tab character (\t).

§Input Format

The string must follow this format:

<token>\t<score>\n

Each line in the string should consist of:

  • A token (a string) followed by a tab character (\t)
  • A score (an integer), which may be positive or negative.

For example:

hello   1
world   2
▁the    -4
§Arguments
  • input - A string slice that holds the token-score pairs.
§Returns
  • Result<Self, BytePairEncoderError> - A Result containing the created BytePairEncoder if successful, or a BytePairEncoderError if there was an error parsing the input.
§Errors

This function will return BytePairEncoderError::InvalidVocabularyInput if:

  • A line doesn’t contain a tab character to separate token and score.
  • The score cannot be parsed as an isize.
§Example
use bpe_tokenizer::BytePairEncoder;

let input = "hello\t1\nworld\t2";
let vocab = BytePairEncoder::new_from_str(input).unwrap();
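
Conversely, input that violates the format is rejected; a brief sketch based on the error conditions above, using a line whose token and score are separated by a space instead of a tab:

let bad_input = "hello 1"; // no tab separator between token and score
assert!(BytePairEncoder::new_from_str(bad_input).is_err());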

pub fn new_default_small() -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder with a default small vocabulary size (100,000 tokens).

This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary that supports 275 languages. The vocabulary is sourced from the BPEmb project, licensed under MIT. The small vocabulary file contains 100,000 tokens, making it the most compact of the bundled vocabularies and well suited to memory-constrained tasks.

§Returns

A Result<Self, BytePairEncoderError> containing the constructed BytePairEncoder if the vocabulary loads successfully, or a corresponding error if initialization fails.

§Example
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_small().unwrap();
§Note

This constructor is only available when the default-small feature is enabled in Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-small"] }

pub fn new_default_medium() -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder with a default medium vocabulary size (320,000 tokens).

This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary that supports 275 languages. The vocabulary is sourced from the BPEmb project, licensed under MIT. The medium-sized vocabulary file consists of 320,000 tokens, offering a balance between token coverage and memory efficiency, making it suitable for a wide variety of NLP tasks.

§Returns

A Result<Self, BytePairEncoderError> containing the constructed BytePairEncoder if the vocabulary loads successfully, or a corresponding error if initialization fails.

§Example
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_medium().unwrap();
§Note

This constructor is only available when the default-medium feature is enabled in Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }

pub fn new_default_large() -> Result<Self, BytePairEncoderError>

§Creates a new BytePairEncoder with a default large vocabulary size (1,000,000 tokens).

This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary that supports 275 languages. The vocabulary is sourced from the BPEmb project, licensed under MIT. The large-sized vocabulary consists of 1,000,000 tokens, providing maximum coverage for detailed language representation, especially useful in applications requiring high granularity.

§Returns

A Result<Self, BytePairEncoderError> containing the constructed BytePairEncoder if the vocabulary loads successfully, or a corresponding error if initialization fails.

§Example
use bpe_tokenizer::BytePairEncoder;

let encoder = BytePairEncoder::new_default_large().unwrap();
§Note

This constructor is only available when the default-large feature is enabled in Cargo.toml:

[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-large"] }

pub fn tokenize_sentences_iter<'a>(&'a self, text: &'a str) -> impl Iterator<Item = impl Iterator<Item = String> + 'a> + 'a

§Tokenizes a text into sentences, then words, and finally into BPE tokens.

This function takes a string of text and returns a nested iterator: the outer iterator yields one item per sentence, and each item is itself an iterator over that sentence's tokens.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

An iterator of iterators, where each inner iterator yields the String tokens of one tokenized sentence.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized: Vec<Vec<String>> = vocab
    .tokenize_sentences_iter(text)
    .map(|sentence_iter| sentence_iter.collect())  // Collect each inner iterator into a Vec<String>
    .collect();  // Then collect everything into Vec<Vec<String>>
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.
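
A lazy-usage sketch (same toy vocabulary as above): counting tokens per sentence without collecting intermediate vectors:

use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
for (i, sentence) in vocab.tokenize_sentences_iter("Hello, world! How are you?").enumerate() {
    // Each `sentence` is an iterator over that sentence's tokens.
    println!("sentence {i}: {} tokens", sentence.count());
}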

pub fn tokenize_iter<'a>(&'a self, text: &'a str) -> impl Iterator<Item = String> + 'a

§Tokenizes a text into a flat sequence of BPE tokens.

This function takes a string of text and returns an iterator that yields individual tokens. It first tokenizes the text into sentences, then words, and finally into BPE tokens, flattening the result into a single sequence.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

An iterator that yields String, where each String represents a token.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized: Vec<String> = vocab.tokenize_iter(text).collect();
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.
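
Because the returned iterator is lazy, downstream adapters can stop early; a short sketch (same toy vocabulary) that takes only the first few tokens:

use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let first_five: Vec<String> = vocab
    .tokenize_iter("Hello, world! How are you?")
    .take(5)
    .collect();
assert!(first_five.len() <= 5);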

pub fn tokenize_sentences(&self, text: &str) -> Vec<Vec<String>>

§Tokenizes a text into sentences, then words, and finally into BPE tokens.

This function takes a string of text and returns a vector of tokenized sentences, where each sentence is represented as a vector of tokens.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

A Vec<Vec<String>>, where each inner Vec<String> represents a tokenized sentence.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized = vocab.tokenize_sentences(text);
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.
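
Continuing the example above, a small sketch that walks the returned sentences (exact tokens depend on the vocabulary):

for (i, sentence) in tokenized.iter().enumerate() {
    println!("sentence {i}: {:?}", sentence);
}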

pub fn tokenize(&self, text: &str) -> Vec<String>

§Tokenizes a text into a flat sequence of BPE tokens.

This function takes a string of text and returns a vector of tokens. It first tokenizes the text into sentences, then words, and finally into BPE tokens, flattening the result into a single sequence.

§Arguments
  • text - A string slice containing the text to be tokenized.
§Returns

A Vec<String>, where each String represents a token.

§Example
use bpe_tokenizer::BytePairEncoder;

let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized = vocab.tokenize(text);
§Notes
  • This function uses Unicode-aware sentence and word segmentation.
  • Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
  • Words are prefixed with the word break character (▁).
  • Unknown tokens are replaced with the <unk> token.

Trait Implementations§

impl Clone for BytePairEncoder

fn clone(&self) -> BytePairEncoder

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for BytePairEncoder

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl PartialEq for BytePairEncoder

fn eq(&self, other: &BytePairEncoder) -> bool

Tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl Eq for BytePairEncoder

impl StructuralPartialEq for BytePairEncoder

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.