pub struct BytePairEncoder { /* private fields */ }
Represents a Byte Pair Encoding (BPE) vocabulary used for tokenization.
This struct holds the mapping of tokens to their respective scores and provides methods for tokenizing text using the BPE algorithm.
The vocabulary is typically loaded from a file or string where each line contains a token and its score, separated by a tab character.
§Example
use bpe_tokenizer::BytePairEncoder;
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let tokenized = vocab.tokenize("Hello, world!");
Implementations§
impl BytePairEncoder
pub fn new_from_file(file_path: &str) -> Result<Self, BytePairEncoderError>
Creates a new BytePairEncoder from a file containing token-score pairs.
This function reads the contents of the file specified by file_path and constructs
a BytePairEncoder from it. The file should contain token-score pairs, with each pair
on a separate line and the token and score separated by a tab character (\t).
§Input Format
The file is expected to follow this format:
<token>\t<score>
Each line should consist of:
- A token (a string) followed by a tab character (\t)
- A score (an integer), either positive or negative
Example lines from the file:
<unk>	0
▁t	-0
▁the	-4
§Arguments
file_path - A string slice that holds the path to the file containing token-score pairs.
§Returns
Result<Self, BytePairEncoderError> - A Result containing the created BytePairEncoder if successful, or a BytePairEncoderError if there was an error reading the file or parsing its contents.
§Errors
This function will return an error if:
- The file cannot be read (returns BytePairEncoderError::InvalidFile)
- The file contents are not in the expected format (returns BytePairEncoderError::InvalidVocabularyInput)
§Example
use bpe_tokenizer::BytePairEncoder;
let vocab = BytePairEncoder::new_from_file("path/to/vocabulary/file.txt");
pub fn new_from_str(input: &str) -> Result<Self, BytePairEncoderError>
Creates a new BytePairEncoder from a string containing token-score pairs.
This function parses the input string to construct a BytePairEncoder. The input should
contain token-score pairs, with each pair on a separate line and the token and score
separated by a tab character (\t).
§Input Format
The string must follow this format:
<token>\t<score>
Each line in the string should consist of:
- A token (a string) followed by a tab character (\t)
- A score (an integer), either positive or negative
For example:
hello	1
world	2
▁the	-4
§Arguments
input - A string slice that holds the token-score pairs.
§Returns
Result<Self, BytePairEncoderError> - A Result containing the created BytePairEncoder if successful, or a BytePairEncoderError if there was an error parsing the input.
§Errors
This function will return BytePairEncoderError::InvalidVocabularyInput if:
- A line doesn’t contain a tab character to separate token and score.
- The score cannot be parsed as an isize.
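For illustration, the two validation rules above can be reproduced outside the crate in plain Rust. This is a hypothetical standalone sketch (parse_vocab is not part of this crate's API) showing how each line is split on the tab and the score parsed as an isize:

```rust
use std::collections::HashMap;

// Hypothetical standalone parser (not this crate's API) mirroring the two
// documented error conditions: a line must contain a tab separator, and the
// score must parse as an isize.
fn parse_vocab(input: &str) -> Result<HashMap<String, isize>, String> {
    let mut vocab = HashMap::new();
    for (i, line) in input.lines().enumerate() {
        // Split on the first tab only.
        let (token, score) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {}: missing tab separator", i + 1))?;
        let score: isize = score
            .parse()
            .map_err(|_| format!("line {}: score is not an integer", i + 1))?;
        vocab.insert(token.to_string(), score);
    }
    Ok(vocab)
}

fn main() {
    let vocab = parse_vocab("hello\t1\nworld\t2").unwrap();
    assert_eq!(vocab["hello"], 1);
    // A line without a tab is rejected, just as new_from_str would reject it.
    assert!(parse_vocab("broken line").is_err());
}
```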
§Example
use bpe_tokenizer::BytePairEncoder;
let input = "hello\t1\nworld\t2";
let vocab = BytePairEncoder::new_from_str(input).unwrap();
pub fn new_default_small() -> Result<Self, BytePairEncoderError>
Creates a new BytePairEncoder with a default small vocabulary size (100,000 tokens).
This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary
that supports 275 languages. The vocabulary is sourced from the
BPEmb project, licensed under MIT. The small-sized
vocabulary file consists of 100,000 tokens, allowing for highly compressed tokenization
suitable for tasks with limited memory constraints.
§Returns
A Result<Self, BytePairEncoderError>, constructing the BytePairEncoder on successful
vocabulary loading, or a corresponding error if initialization fails.
§Example
use bpe_tokenizer::BytePairEncoder;
let encoder = BytePairEncoder::new_default_small().unwrap();
§Note
This is only available when the default-small feature is enabled in Cargo.toml:
[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-small"] }
pub fn new_default_medium() -> Result<Self, BytePairEncoderError>
Creates a new BytePairEncoder with a default medium vocabulary size (320,000 tokens).
This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary
that supports 275 languages. The vocabulary is sourced from the
BPEmb project, licensed under MIT. The
medium-sized vocabulary file consists of 320,000 tokens, offering a balance between token
coverage and memory efficiency, making it suitable for a wide variety of NLP tasks.
§Returns
A Result<Self, BytePairEncoderError>, constructing the BytePairEncoder on successful
vocabulary loading, or a corresponding error if initialization fails.
§Example
use bpe_tokenizer::BytePairEncoder;
let encoder = BytePairEncoder::new_default_medium().unwrap();
§Note
This is only available when the default-medium feature is enabled in Cargo.toml:
[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-medium"] }
pub fn new_default_large() -> Result<Self, BytePairEncoderError>
Creates a new BytePairEncoder with a default large vocabulary size (1,000,000 tokens).
This function constructs a BytePairEncoder using a pre-trained multilingual vocabulary
that supports 275 languages. The vocabulary is sourced from the
BPEmb project, licensed under MIT. The large-sized
vocabulary consists of 1,000,000 tokens, providing maximum coverage for detailed language
representation, especially useful in applications requiring high granularity.
§Returns
A Result<Self, BytePairEncoderError>, constructing the BytePairEncoder on successful
vocabulary loading, or a corresponding error if initialization fails.
§Example
use bpe_tokenizer::BytePairEncoder;
let encoder = BytePairEncoder::new_default_large().unwrap();
§Note
This is only available when the default-large feature is enabled in Cargo.toml:
[dependencies]
bpe-tokenizer = { version = "<version>", features = ["default-large"] }
pub fn tokenize_sentences_iter<'a>(
    &'a self,
    text: &'a str,
) -> impl Iterator<Item = impl Iterator<Item = String> + 'a> + 'a
Tokenizes a text into sentences, then words, and finally into BPE tokens.
This function takes a string of text and returns an iterator that yields, for each sentence, an inner iterator over that sentence's tokens.
§Arguments
text - A string slice containing the text to be tokenized.
§Returns
An iterator of inner iterators, where each inner Iterator<Item = String> yields the tokens of one sentence.
§Example
use bpe_tokenizer::BytePairEncoder;
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized: Vec<Vec<String>> = vocab
.tokenize_sentences_iter(text)
.map(|sentence_iter| sentence_iter.collect()) // Collect each inner iterator into a Vec<String>
.collect(); // Then collect everything into Vec<Vec<String>>
§Notes
- This function uses Unicode-aware sentence and word segmentation.
- Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
- Words are prefixed with the word break character (▁).
- Unknown tokens are replaced with the <unk> token.
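As a rough illustration of the notes above, here is a simplified, hypothetical sketch of the per-sentence shaping. The crate's actual segmentation is Unicode-aware and also applies BPE merges; shape_sentence below only demonstrates the <s>/</s> wrapping, the ▁ prefix, and the <unk> fallback:

```rust
// Hypothetical sketch, not the crate's implementation: wrap a sentence with
// <s>/</s>, prefix each word with ▁, and fall back to <unk> for words whose
// marked form is not in the vocabulary.
fn shape_sentence(words: &[&str], vocab: &[&str]) -> Vec<String> {
    let mut tokens = vec!["<s>".to_string()];
    for w in words {
        // Words get the ▁ word-break prefix before vocabulary lookup.
        let marked = format!("▁{}", w);
        if vocab.contains(&marked.as_str()) {
            tokens.push(marked);
        } else {
            // Out-of-vocabulary words fall back to the <unk> token.
            tokens.push("<unk>".to_string());
        }
    }
    tokens.push("</s>".to_string());
    tokens
}

fn main() {
    // Only "▁hello" is in this toy vocabulary, so "world" maps to <unk>.
    let toks = shape_sentence(&["hello", "world"], &["▁hello"]);
    assert_eq!(toks, ["<s>", "▁hello", "<unk>", "</s>"]);
}
```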
pub fn tokenize_iter<'a>(
    &'a self,
    text: &'a str,
) -> impl Iterator<Item = String> + 'a
Tokenizes a text into a flat sequence of BPE tokens.
This function takes a string of text and returns an iterator that yields individual tokens. It first tokenizes the text into sentences, then words, and finally into BPE tokens, flattening the result into a single sequence.
§Arguments
text - A string slice containing the text to be tokenized.
§Returns
An iterator that yields String, where each String represents a token.
§Example
use bpe_tokenizer::BytePairEncoder;
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized: Vec<String> = vocab.tokenize_iter(text).collect();
§Notes
- This function uses Unicode-aware sentence and word segmentation.
- Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
- Words are prefixed with the word break character (▁).
- Unknown tokens are replaced with the <unk> token.
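For background, the BPE step itself can be sketched as a greedy merge loop over the vocabulary scores. This is a simplified illustration of the general algorithm under an assumed "higher score = higher merge priority" rule, not the crate's implementation:

```rust
use std::collections::HashMap;

// Simplified greedy BPE merge, assuming higher score means higher merge
// priority. Illustration only; not the crate's internals.
fn bpe_merge(word: &str, scores: &HashMap<&str, isize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the highest-scoring adjacent pair present in the vocabulary.
        let mut best: Option<(usize, isize)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let cand = format!("{}{}", parts[i], parts[i + 1]);
            if let Some(&s) = scores.get(cand.as_str()) {
                if best.map_or(true, |(_, bs)| s > bs) {
                    best = Some((i, s));
                }
            }
        }
        match best {
            Some((i, _)) => {
                // Merge the winning pair in place.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => break, // no mergeable pairs left
        }
    }
    parts
}

fn main() {
    let scores: HashMap<&str, isize> =
        HashMap::from([("th", 2), ("he", 1), ("the", 5)]);
    // "the" merges t+h first (score 2), then th+e (score 5).
    assert_eq!(bpe_merge("the", &scores), ["the"]);
    // No pair of "tea" is in the vocabulary, so characters stay separate.
    assert_eq!(bpe_merge("tea", &scores), ["t", "e", "a"]);
}
```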
pub fn tokenize_sentences(&self, text: &str) -> Vec<Vec<String>>
Tokenizes a text into sentences, then words, and finally into BPE tokens.
This function takes a string of text and returns a vector of tokenized sentences, where each sentence is represented as a vector of tokens.
§Arguments
text - A string slice containing the text to be tokenized.
§Returns
A Vec<Vec<String>>, where each inner Vec<String> represents a tokenized sentence.
§Example
use bpe_tokenizer::BytePairEncoder;
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized = vocab.tokenize_sentences(text);
§Notes
- This function uses Unicode-aware sentence and word segmentation.
- Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
- Words are prefixed with the word break character (▁).
- Unknown tokens are replaced with the <unk> token.
pub fn tokenize(&self, text: &str) -> Vec<String>
Tokenizes a text into a flat sequence of BPE tokens.
This function takes a string of text and returns a vector of tokens. It first tokenizes the text into sentences, then words, and finally into BPE tokens, flattening the result into a single sequence.
§Arguments
text - A string slice containing the text to be tokenized.
§Returns
A Vec<String>, where each String represents a token.
§Example
use bpe_tokenizer::BytePairEncoder;
let vocab = BytePairEncoder::new_from_str("hello\t1\nworld\t2").unwrap();
let text = "Hello, world! How are you?";
let tokenized = vocab.tokenize(text);
§Notes
- This function uses Unicode-aware sentence and word segmentation.
- Each sentence is wrapped with sentence start (<s>) and end (</s>) tokens.
- Words are prefixed with the word break character (▁).
- Unknown tokens are replaced with the <unk> token.
Trait Implementations§
impl Clone for BytePairEncoder
fn clone(&self) -> BytePairEncoder
fn clone_from(&mut self, source: &Self)
impl Debug for BytePairEncoder
impl PartialEq for BytePairEncoder
impl Eq for BytePairEncoder
impl StructuralPartialEq for BytePairEncoder
Auto Trait Implementations§
impl Freeze for BytePairEncoder
impl RefUnwindSafe for BytePairEncoder
impl Send for BytePairEncoder
impl Sync for BytePairEncoder
impl Unpin for BytePairEncoder
impl UnwindSafe for BytePairEncoder
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T where T: Clone
unsafe fn clone_to_uninit(&self, dest: *mut u8)