Struct rust_tokenizers::tokenizer::XLNetTokenizer

pub struct XLNetTokenizer { /* fields omitted */ }
Expand description

XLNet tokenizer

XLNet tokenizer performing:

  • Splitting on special tokens
  • Text cleaning
  • NFKC decomposition
  • (optional) lower casing
  • (optional) accents stripping
  • SentencePiece decomposition

Implementations

Creates a new instance of an XLNetTokenizer. Expects a SentencePiece protobuf file as input.

Parameters

  • path (&str): path to the SentencePiece model file
  • lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
  • strip_accents (bool): flag indicating if accents should be stripped from the text

Example

use rust_tokenizers::tokenizer::{Tokenizer, XLNetTokenizer};
let lower_case = false;
let strip_accents = false;
let tokenizer =
    XLNetTokenizer::from_file("path/to/vocab/file", lower_case, strip_accents).unwrap();

Creates a new instance of an XLNetTokenizer from an existing vocabulary and model.

Parameters

  • vocab (XLNetVocab): vocabulary
  • model (SentencePieceModel): SentencePiece model
  • lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
  • strip_accents (bool): flag indicating if accents should be stripped from the text

Example

use rust_tokenizers::tokenizer::{Tokenizer, XLNetTokenizer};
use rust_tokenizers::vocab::{SentencePieceModel, Vocab, XLNetVocab};
let lower_case = false;
let strip_accents = false;
let vocab = XLNetVocab::from_file("path/to/vocab/file").unwrap();
let model = SentencePieceModel::from_file("path/to/model/file").unwrap();

let tokenizer =
    XLNetTokenizer::from_existing_vocab_and_model(vocab, model, lower_case, strip_accents);

Trait Implementations

Returns a reference to the tokenizer's vocabulary.

Tokenize a list of strings (with multithreading), where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided. Read more

Multithreaded tokenization of a list of strings, returning tokens with offset information Read more

Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with encode's optional second text, each text provided is encoded independently. Read more

Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list. Read more

Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids. Read more

Returns a reference to the tokenizer's vocabulary.

Tokenize a TokenRef, returning a sequence of tokens Read more

Converts a sequence of strings into a single string. This will clean up artifacts from tokenization (for example sub ##word) and generate a single output string. Read more
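For intuition: SentencePiece marks word boundaries with the "▁" character (U+2581) rather than with BERT-style "##" continuation prefixes, so joining pieces back into a string can be pictured as below. This is a simplified illustration, not the crate's internal code.

```rust
// Simplified sketch: SentencePiece pieces prefix word starts with "▁"
// (U+2581); joining concatenates the pieces and turns each "▁" into a space.
fn pieces_to_string(pieces: &[&str]) -> String {
    pieces.concat().replace('\u{2581}', " ").trim().to_string()
}

fn main() {
    let pieces = ["\u{2581}Hello", "\u{2581}wor", "ld", "!"];
    assert_eq!(pieces_to_string(&pieces), "Hello world!");
}
```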

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. Read more
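XLNet appends its special tokens at the end of the sequence (unlike BERT, which leads with a classification token): a single sequence becomes `A <sep> <cls>`, and a pair becomes `A <sep> B <sep> <cls>`. A hedged sketch of that layout, working on token strings rather than the crate's actual types:

```rust
// Sketch of XLNet's special-token layout (tokens as plain strings, not
// the crate's internal representation): single: A <sep> <cls>;
// pair: A <sep> B <sep> <cls>.
fn build_inputs(tokens_a: &[&str], tokens_b: Option<&[&str]>) -> Vec<String> {
    let mut out: Vec<String> = tokens_a.iter().map(|s| s.to_string()).collect();
    out.push("<sep>".to_string());
    if let Some(b) = tokens_b {
        out.extend(b.iter().map(|s| s.to_string()));
        out.push("<sep>".to_string());
    }
    out.push("<cls>".to_string());
    out
}

fn main() {
    assert_eq!(build_inputs(&["▁hello"], None), vec!["▁hello", "<sep>", "<cls>"]);
    assert_eq!(
        build_inputs(&["▁a"], Some(&["▁b"])),
        vec!["▁a", "<sep>", "▁b", "<sep>", "<cls>"]
    );
}
```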

Tokenize a string, returns a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to return offset information. Read more

Tokenize a string, returning tokens with offset information Read more
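Offset tracking can be pictured with a toy whitespace tokenizer that records each token's byte span in the input. This is an illustration only; the actual trait method returns TokensWithOffsets with richer metadata and uses the SentencePiece model rather than whitespace splitting.

```rust
// Toy illustration of offset bookkeeping: each token carries its
// [start, end) byte span in the original text.
fn whitespace_tokenize_with_offsets(text: &str) -> Vec<(String, usize, usize)> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            // Close the current token, if any, at this boundary.
            if let Some(s) = start.take() {
                tokens.push((text[s..i].to_string(), s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a trailing token that runs to the end of the input.
    if let Some(s) = start {
        tokens.push((text[s..].to_string(), s, text.len()));
    }
    tokens
}

fn main() {
    assert_eq!(
        whitespace_tokenize_with_offsets("hello world"),
        vec![("hello".to_string(), 0, 5), ("world".to_string(), 6, 11)]
    );
}
```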

Tokenize a list of strings, returning tokens with offset information Read more

Tokenize a list of strings, where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided. Read more

Convert a slice of string-like to a vector of token indices Read more

Encode a string-like (tokenization followed by encoding) Read more

Encode a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with encode's optional second text, each text provided is encoded independently. Read more

Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list. Read more

Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices Read more

Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. Read more

Cleans-up tokenization artifacts (for example whitespace before punctuation) Read more
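The kind of clean-up meant here can be sketched with a few string replacements. This is illustrative only; the trait's actual rules cover more punctuation and contraction cases.

```rust
// Illustrative clean-up of detokenization artifacts: remove the stray
// space that tokenization inserts before common punctuation marks.
fn clean_up(text: &str) -> String {
    text.replace(" .", ".")
        .replace(" ,", ",")
        .replace(" !", "!")
        .replace(" ?", "?")
}

fn main() {
    assert_eq!(clean_up("Hello , world !"), "Hello, world!");
}
```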

Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids. Read more

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self. Read more

Immutably borrows from an owned value. Read more

Mutably borrows from an owned value. Read more

Performs the conversion.

Performs the conversion.

The alignment of pointer.

The type for initializers.

Initializes a pointer with the given initializer. Read more

Dereferences the given pointer. Read more

Mutably dereferences the given pointer. Read more

Drops the object pointed to by the given pointer. Read more

The type returned in the event of a conversion error.

Performs the conversion.

The type returned in the event of a conversion error.

Performs the conversion.