Struct rust_tokenizers::tokenizer::Gpt2Tokenizer
pub struct Gpt2Tokenizer { /* private fields */ }
GPT2 tokenizer
GPT2 tokenizer performing:
- splitting on special characters
- whitespace splitting
- (optional) lower casing
- BPE tokenization
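The BPE step above can be sketched in plain Rust. The following is a hypothetical, simplified greedy merge loop for illustration (the pair ranks in `main` are made up); it is not the crate's implementation:

```rust
use std::collections::HashMap;

/// Greedy BPE sketch: repeatedly merge the adjacent symbol pair with the
/// lowest merge rank until no mergeable pair remains.
fn bpe(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest (earliest-learned) rank.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|r| (*r, i)))
            .min();
        match best {
            Some((_, i)) => {
                // Replace the pair at position i with its merged symbol.
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.splice(i..i + 2, std::iter::once(merged));
            }
            None => break,
        }
    }
    symbols
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    assert_eq!(bpe("low", &ranks), vec!["low"]);
    assert_eq!(bpe("owl", &ranks), vec!["o", "w", "l"]);
}
```

In the real tokenizer the ranks come from the merges file passed to `from_file`, and merging operates on byte-level symbols rather than `char`s.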
Implementations
impl Gpt2Tokenizer
pub fn from_file(
    vocab_path: &str,
    merges_path: &str,
    lower_case: bool
) -> Result<Gpt2Tokenizer, TokenizerError>
Creates a new instance of a Gpt2Tokenizer. Expects a JSON vocabulary file and a merges file as inputs.
Parameters
- vocab_path (&str): path to the vocabulary file
- merges_path (&str): path to the merges file (used as part of the BPE encoding process)
- lower_case (bool): flag indicating whether the text should be lower-cased as part of tokenization
Example
use rust_tokenizers::tokenizer::{Gpt2Tokenizer, Tokenizer};
let lower_case = false;
let tokenizer =
Gpt2Tokenizer::from_file("path/to/vocab/file", "path/to/merges/file", lower_case).unwrap();
pub fn from_existing_vocab_and_merges(
    vocab: Gpt2Vocab,
    merges: BpePairVocab,
    lower_case: bool
) -> Gpt2Tokenizer
Creates a new instance of a Gpt2Tokenizer from an existing vocabulary and merges.
Parameters
- vocab (Gpt2Vocab): GPT-like vocabulary
- merges (BpePairVocab): BPE pairs vocabulary
- lower_case (bool): flag indicating whether the text should be lower-cased as part of tokenization
Example
use rust_tokenizers::tokenizer::{Gpt2Tokenizer, Tokenizer};
use rust_tokenizers::vocab::{BpePairVocab, Gpt2Vocab, Vocab};
let lower_case = false;
let vocab = Gpt2Vocab::from_file("path/to/vocab/file").unwrap();
let merges = BpePairVocab::from_file("path/to/merges/file").unwrap();
let tokenizer = Gpt2Tokenizer::from_existing_vocab_and_merges(vocab, merges, lower_case);
Trait Implementations
impl MultiThreadedTokenizer<Gpt2Vocab> for Gpt2Tokenizer
fn tokenize_list_with_offsets<S>(
    &self,
    text_list: &[S]
) -> Vec<TokensWithOffsets>
where
    S: AsRef<str> + Sync,

Tokenizes a list of strings in parallel, where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each item of the list.
fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>>
where
    S: AsRef<str> + Sync,

Multithreaded tokenization of a list of strings, returning a vector of tokens for each string.
fn encode_list<S>(
    &self,
    text_list: &[S],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str> + Sync,

Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with encode and its optional second text, each text provided is encoded independently.
fn encode_pair_list<S>(
    &self,
    text_list: &[(S, S)],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str> + Sync,

Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines the pair handling of encode with the list processing of encode_list.
fn decode_list(
    &self,
    token_ids_list: &[Vec<i64>],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool
) -> Vec<String>

Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.
impl Tokenizer<Gpt2Vocab> for Gpt2Tokenizer
fn tokenize_to_tokens(&self, initial_token: TokenRef<'_>) -> Vec<Token>

Tokenizes a TokenRef, returning a sequence of tokens.
fn convert_tokens_to_string(&self, tokens: Vec<String>) -> String

Converts a sequence of token strings into a single string. This cleans up artifacts from tokenization (for example sub-word markers, as in sub ##word) and generates a single output string.
fn tokenize(&self, text: &str) -> Vec<String>

Tokenizes a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to return offset information.
fn tokenize_with_offsets(&self, text: &str) -> TokensWithOffsets

Tokenizes a string, returning tokens with offset information.
fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>>
where
    S: AsRef<str>,

Tokenizes a list of strings, returning a vector of tokens for each string.
fn tokenize_list_with_offsets<S>(
    &self,
    text_list: &[S]
) -> Vec<TokensWithOffsets>
where
    S: AsRef<str>,

Tokenizes a list of strings, where each corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each item of the list.
fn convert_tokens_to_ids<S>(&self, tokens: &[S]) -> Vec<i64>
where
    S: AsRef<str>,

Converts a slice of string-like tokens to a vector of token indices.
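The lookup behind this method can be sketched with a plain HashMap, falling back to a hypothetical unknown-token id for out-of-vocabulary tokens. This is a simplified illustration, not the crate's code (in the real tokenizer the mapping lives in Gpt2Vocab):

```rust
use std::collections::HashMap;

/// Map each token string to its vocabulary index, falling back to `unk_id`
/// for tokens missing from the vocabulary.
fn tokens_to_ids<S: AsRef<str>>(
    tokens: &[S],
    vocab: &HashMap<String, i64>,
    unk_id: i64,
) -> Vec<i64> {
    tokens
        .iter()
        .map(|t| *vocab.get(t.as_ref()).unwrap_or(&unk_id))
        .collect()
}

fn main() {
    let vocab: HashMap<String, i64> =
        [("hello".to_string(), 0), ("world".to_string(), 1)].into_iter().collect();
    // "oov" is not in the toy vocabulary, so it maps to the fallback id.
    assert_eq!(tokens_to_ids(&["hello", "oov", "world"], &vocab, -1), vec![0, -1, 1]);
}
```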
fn encode(
    &self,
    text_1: &str,
    text_2: Option<&str>,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> TokenizedInput

Encodes a string-like input (tokenization followed by encoding).
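The interplay of max_len and stride can be sketched as follows. This is a hypothetical, simplified truncation that keeps the first max_len ids and returns an overflow window sharing stride ids of context with the kept part; the actual TruncationStrategy handling in the crate is richer than this:

```rust
/// Keep the first `max_len` ids; the overflowing remainder retains `stride`
/// ids of overlap with the kept window so the next chunk has some context.
fn truncate_with_stride(ids: &[i64], max_len: usize, stride: usize) -> (Vec<i64>, Vec<i64>) {
    if ids.len() <= max_len {
        // Nothing to truncate: the whole sequence fits.
        return (ids.to_vec(), Vec::new());
    }
    let kept = ids[..max_len].to_vec();
    // Start the overflow `stride` positions before the cut point.
    let overflow = ids[max_len.saturating_sub(stride)..].to_vec();
    (kept, overflow)
}

fn main() {
    let (kept, overflow) = truncate_with_stride(&[10, 11, 12, 13, 14], 3, 1);
    assert_eq!(kept, vec![10, 11, 12]);
    // The overflow repeats id 12, the one token of stride overlap.
    assert_eq!(overflow, vec![12, 13, 14]);
}
```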
fn encode_list<S>(
    &self,
    text_list: &[S],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str>,

Encodes a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with encode and its optional second text, each text provided is encoded independently.
fn encode_pair_list<S>(
    &self,
    text_list: &[(S, S)],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str>,

Encodes a sequence of string-like text pairs (tokenization followed by encoding). This combines the pair handling of encode with the list processing of encode_list.
fn decode_to_vec(
    &self,
    token_ids: &[i64],
    skip_special_tokens: bool
) -> Vec<String>

Decodes a sequence of token indices to a sequence of Strings, optionally skipping special indices.
fn decode(
    &self,
    token_ids: &[i64],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool
) -> String

Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
fn clean_up_tokenization(&self, input_string: String) -> String

Cleans up tokenization artifacts (for example whitespace before punctuation).
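The kind of clean-up involved can be sketched as simple string rewriting. This is a guessed, minimal version that removes the space naive detokenization leaves before common punctuation; the crate's exact rules may differ:

```rust
/// Remove the space that naive detokenization leaves before punctuation.
fn clean_up(text: &str) -> String {
    text.replace(" .", ".")
        .replace(" ,", ",")
        .replace(" !", "!")
        .replace(" ?", "?")
        .replace(" '", "'")
}

fn main() {
    assert_eq!(clean_up("Hello , world !"), "Hello, world!");
}
```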
fn decode_list(
    &self,
    token_ids_list: &[Vec<i64>],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool
) -> Vec<String>

Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.
fn build_input_with_special_tokens(
    &self,
    tokens_ids_with_offsets_1: TokenIdsWithOffsets,
    tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
Auto Trait Implementations
impl RefUnwindSafe for Gpt2Tokenizer
impl Send for Gpt2Tokenizer
impl Sync for Gpt2Tokenizer
impl Unpin for Gpt2Tokenizer
impl UnwindSafe for Gpt2Tokenizer
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.