Function rust_tokenizers::tokenizer::truncate_sequences

pub fn truncate_sequences(
    token_ids_with_offsets_1: TokenIdsWithOffsets,
    token_ids_with_offsets_2: Option<TokenIdsWithOffsets>,
    num_tokens_to_remove: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Result<(TokenIdsWithOffsets, Option<TokenIdsWithOffsets>, Vec<i64>, Vec<Option<Offset>>), TokenizerError>

Truncates a sequence pair to the maximum length, returning the truncated sequences along with the overflowing token ids and their offsets.

  • token_ids_with_offsets_1: first sequence of input ids with their offsets. The ids can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods; the offsets must have the same length as the ids, or be empty if offsets are not used at all.
  • token_ids_with_offsets_2: optional second sequence of input ids with offsets, built the same way and subject to the same length constraint.
  • num_tokens_to_remove: number of tokens to remove using the truncation strategy
  • truncation_strategy: strategy to use for truncation, one of:
    • TruncationStrategy::LongestFirst (default): iteratively reduces the input, removing one token at a time from the longest sequence (when there is a pair of input sequences) until the input fits under max_length. The overflowing tokens only contain overflow from the first sequence.
    • TruncationStrategy::OnlyFirst: only truncates the first sequence. Raises an error if the first sequence is shorter than or equal in length to num_tokens_to_remove.
    • TruncationStrategy::OnlySecond: only truncates the second sequence.
    • TruncationStrategy::DoNotTruncate: does not truncate (raises an error if the input sequence is longer than max_length).
  • stride: if set to a number along with max_length, the overflowing tokens returned will also contain that many tokens from the end of the main sequence returned, providing an overlap between the kept and overflowing parts.
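
A minimal usage sketch, assuming the crate-root re-exports of TokenIdsWithOffsets, Offset, and Mask and the rust_tokenizers::error::TokenizerError path found in recent releases; the toy ids below stand in for real tokenizer output:

use rust_tokenizers::tokenizer::{truncate_sequences, TruncationStrategy};
use rust_tokenizers::{Mask, Offset, TokenIdsWithOffsets};

fn main() -> Result<(), rust_tokenizers::error::TokenizerError> {
    // Ten toy token ids; in practice these come from chaining
    // `tokenize` and `convert_tokens_to_ids`.
    let ids: Vec<i64> = (0..10).collect();
    // One offset per id (the offsets list must match the ids in length).
    let offsets: Vec<Option<Offset>> =
        (0u32..10).map(|i| Some(Offset { begin: i, end: i + 1 })).collect();
    let sequence_1 = TokenIdsWithOffsets {
        ids,
        offsets,
        reference_offsets: vec![vec![]; 10],
        masks: vec![Mask::None; 10],
    };

    // Remove 4 tokens from a single input sequence; with a nonzero stride
    // the overflow additionally includes that many tokens from the end of
    // the kept sequence.
    let (truncated_1, truncated_2, overflow_ids, overflow_offsets) = truncate_sequences(
        sequence_1,
        None,
        4,
        &TruncationStrategy::LongestFirst,
        2,
    )?;

    assert_eq!(truncated_1.ids.len(), 6); // 10 ids - 4 removed
    assert!(truncated_2.is_none());
    println!("kept ids: {:?}", truncated_1.ids);
    println!("overflow ids: {:?}, offsets: {:?}", overflow_ids, overflow_offsets);
    Ok(())
}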