Module tokenizers::processors::template


§Template Processing

Provides a way to specify templates in order to add the special tokens to each input sequence as relevant.

§Example

Let’s take the BERT tokenizer as an example. It uses two special tokens to delimit each sequence. [CLS] is always placed at the beginning of the first sequence, and [SEP] is added at the end of both the first sequence and the pair sequence. The final result looks like this:

  • Single sequence: [CLS] Hello there [SEP]
  • Pair sequences: [CLS] My name is Anthony [SEP] What is my name? [SEP]

With the type ids as follows:

[CLS]   ...   [SEP]   ...   [SEP]
  0      0      0      1      1
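
To make the layout above concrete, here is a minimal sketch in plain Rust (standard library only) that builds the same token and type id arrangement by hand. The `Slot` struct and the `bert_layout` function are hypothetical names used for illustration; they are not part of the tokenizers crate, and the inputs are assumed to be already split into tokens.

```rust
/// One output position: the token text and the type id assigned to it.
#[derive(Debug, Clone, PartialEq)]
struct Slot {
    token: String,
    type_id: u32,
}

/// Hypothetical helper: lays out `[CLS] A [SEP]` for a single sequence,
/// or `[CLS] A [SEP] B [SEP]` for a pair, with the type ids shown above.
fn bert_layout(first: &[&str], second: Option<&[&str]>) -> Vec<Slot> {
    fn slot(token: &str, type_id: u32) -> Slot {
        Slot { token: token.to_string(), type_id }
    }

    // [CLS], the first sequence, and its closing [SEP] all get type id 0.
    let mut out = vec![slot("[CLS]", 0)];
    for t in first {
        out.push(slot(t, 0));
    }
    out.push(slot("[SEP]", 0));

    // The pair sequence and its closing [SEP] get type id 1.
    if let Some(second) = second {
        for t in second {
            out.push(slot(t, 1));
        }
        out.push(slot("[SEP]", 1));
    }
    out
}
```

Calling `bert_layout(&["Hello", "there"], None)` yields the four slots `[CLS] Hello there [SEP]`, all with type id 0. Passing a second sequence appends its tokens and a final `[SEP]`, all with type id 1.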

So, we can define a TemplateProcessing that will achieve this result:

use tokenizers::processors::template::TemplateProcessing;

let template = TemplateProcessing::builder()
    // The template when we only have a single sequence:
    .try_single(vec!["[CLS]", "$0", "[SEP]"]).unwrap()
    // Same as:
    .try_single("[CLS] $0 [SEP]").unwrap()

    // The template when we have both sequences:
    .try_pair(vec!["[CLS]:0", "$A:0", "[SEP]:0", "$B:1", "[SEP]:1"]).unwrap()
    // Same as:
    .try_pair("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1").unwrap()
    // Or:
    .try_pair("[CLS] $0 [SEP] $B:1 [SEP]:1").unwrap()

    // The list of special tokens used by each sequence
    .special_tokens(vec![("[CLS]", 1), ("[SEP]", 0)])
    .build()
    .unwrap();

In this example, each input sequence is identified using a $ construct. This identifier lets us specify each input sequence, and the type_id to use. When nothing is specified, it uses the default values. Here are the different ways to specify it:

  • Specifying the sequence, with default type_id == 0: $A or $B
  • Specifying the type_id with default sequence == A: $0, $1, $2, …
  • Specifying both: $A:0, $B:1, …

The same construct is used for special tokens: <identifier>(:<type_id>)?.
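
The defaulting rules for these constructs can be illustrated with a small standalone parser. This is only a sketch of the rules as described here, not the crate's actual parsing code: `Seq`, `ParsedPiece`, `parse_piece`, and `parse_template` are hypothetical names, and the sketch assumes special token names contain no `:` other than the one introducing the type id.

```rust
/// Which input sequence a `$` piece refers to.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Seq {
    A,
    B,
}

/// A parsed template piece (illustrative type, not the crate's `Piece`).
#[derive(Debug, Clone, PartialEq)]
enum ParsedPiece {
    Sequence { seq: Seq, type_id: u32 },
    Special { name: String, type_id: u32 },
}

/// Parses one piece such as `$A`, `$1`, `$B:1`, `[CLS]`, or `[SEP]:1`.
fn parse_piece(s: &str) -> Result<ParsedPiece, String> {
    // Split off an optional `:<type_id>` suffix.
    let (head, explicit_id) = match s.rsplit_once(':') {
        Some((head, id)) => {
            let id = id
                .parse::<u32>()
                .map_err(|e| format!("bad type_id in {s:?}: {e}"))?;
            (head, Some(id))
        }
        None => (s, None),
    };

    match head.strip_prefix('$') {
        // `$A` / `$B`: sequence given, type_id defaults to 0.
        Some("A") => Ok(ParsedPiece::Sequence { seq: Seq::A, type_id: explicit_id.unwrap_or(0) }),
        Some("B") => Ok(ParsedPiece::Sequence { seq: Seq::B, type_id: explicit_id.unwrap_or(0) }),
        // `$0`, `$1`, ...: type_id given, sequence defaults to A.
        Some(rest) => {
            if explicit_id.is_some() {
                return Err(format!("unexpected type_id suffix in {s:?}"));
            }
            let type_id = rest
                .parse::<u32>()
                .map_err(|e| format!("bad sequence identifier in {s:?}: {e}"))?;
            Ok(ParsedPiece::Sequence { seq: Seq::A, type_id })
        }
        // Anything else is treated as a special token, type_id defaults to 0.
        None => Ok(ParsedPiece::Special { name: head.to_string(), type_id: explicit_id.unwrap_or(0) }),
    }
}

/// Parses a whitespace-separated template string into pieces.
fn parse_template(s: &str) -> Result<Vec<ParsedPiece>, String> {
    s.split_whitespace().map(parse_piece).collect()
}
```

Under these rules, `parse_template("[CLS] $0 [SEP] $B:1 [SEP]:1")` and `parse_template("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1")` produce the same list of pieces, which mirrors the equivalence between the pair templates in the example.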

Warning: You must ensure that you are giving the correct tokens/ids as these will be added to the Encoding without any further check. If the given ids correspond to something totally different in a Tokenizer using this PostProcessor, it might lead to unexpected results.
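
One way to guard against this is to compare the (token, id) pairs against the vocabulary before building the processor. The sketch below uses a plain `HashMap` as a stand-in for the tokenizer's vocabulary; `check_special_tokens` is a hypothetical helper, not a function provided by the crate.

```rust
use std::collections::HashMap;

/// Hypothetical check: every special token must exist in the vocabulary
/// with exactly the id that will be handed to the template processor.
/// Returns a list of human-readable problems if any pair does not match.
fn check_special_tokens(
    vocab: &HashMap<String, u32>,
    special: &[(&str, u32)],
) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    for (token, id) in special {
        match vocab.get(*token) {
            // The vocabulary agrees with the declared id.
            Some(actual) if actual == id => {}
            // The token exists but maps to a different id.
            Some(actual) => problems.push(format!(
                "{token} has id {actual} in the vocabulary, not {id}"
            )),
            // The token is not in the vocabulary at all.
            None => problems.push(format!("{token} is missing from the vocabulary")),
        }
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

With an illustrative vocabulary that maps `[CLS]` to 101 and `[SEP]` to 102, passing `[("[CLS]", 101), ("[SEP]", 102)]` returns `Ok(())`, while passing `[("[CLS]", 1)]` returns an error describing the mismatch.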

Structs§

  • SpecialToken: Represents a bunch of tokens to be used in a template. Usually, special tokens have only one associated id/token but in some cases, it might be interesting to have multiple ids/tokens.
  • Template: A Template represents a Vec<Piece>.
  • TemplateProcessing: This PostProcessor takes care of processing each input Encoding by applying the corresponding template, before merging them in the final Encoding.
  • TemplateProcessingBuilder: Builder for TemplateProcessing.
  • Tokens: A bunch of SpecialToken represented by their ID. Internally, Tokens is a HashMap<String, SpecialToken> and can be built from a HashMap or a Vec<SpecialToken>.

Enums§