Function rust_tokenizers::preprocessing::tokenizer::tokenization_utils::truncate_sequences

pub fn truncate_sequences(
    tokens_1: Vec<i64>,
    tokens_2: Option<Vec<i64>>,
    offsets_1: Vec<Option<Offset>>,
    offsets_2: Option<Vec<Option<Offset>>>,
    original_positions_1: Vec<Vec<OffsetSize>>,
    original_positions_2: Option<Vec<Vec<OffsetSize>>>,
    mask_1: Vec<Mask>,
    mask_2: Option<Vec<Mask>>,
    num_tokens_to_remove: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Result<(Vec<i64>, Option<Vec<i64>>, Vec<Option<Offset>>, Option<Vec<Option<Offset>>>, Vec<Vec<OffsetSize>>, Option<Vec<Vec<OffsetSize>>>, Vec<Mask>, Option<Vec<Mask>>, Vec<i64>, Vec<Option<Offset>>), Box<dyn Error>>

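Truncates a sequence or sequence pair by removing num_tokens_to_remove tokens according to the given truncation strategy. The inputs are taken by value; the truncated sequences are returned together with the overflowing tokens and their offsets.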

  • tokens_1: list of tokenized input ids. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
  • tokens_2: Optional second list of input ids. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
  • offsets_1: list of offsets for tokens_1 (must be same length or empty if not used at all)
  • offsets_2: optional second list of offsets for tokens_2 (must be same length or empty if not used at all)
  • num_tokens_to_remove: number of tokens to remove using the truncation strategy
  • truncation_strategy: truncation strategy
    • TruncationStrategy::LongestFirst (default): Iteratively shortens the input one token at a time, each time removing a token from whichever sequence is currently the longest (when there is a pair of input sequences), until the input is under max_length. The overflowing tokens only contain overflow from the first sequence.
    • TruncationStrategy::OnlyFirst: Only truncates the first sequence. Returns an error if the length of the first sequence is less than or equal to num_tokens_to_remove.
    • TruncationStrategy::OnlySecond: Only truncates the second sequence.
    • TruncationStrategy::DoNotTruncate: Does not truncate (returns an error if the input sequence is longer than max_length).
  • stride: If set to a number along with max_length, the overflowing tokens returned will also contain some tokens from the end of the main sequence that is returned. The value of this argument defines the number of these additional tokens.
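
The interaction between the truncation strategy, num_tokens_to_remove and stride described above can be illustrated with a small standalone sketch. It is not the library's implementation: the names Strategy, Truncated, with_stride and truncate_pair are hypothetical, offsets, original positions and masks are left out, and the tie-breaking rule (removing from the second sequence when both have the same length) and the overflow returned for the OnlySecond case are assumptions made for illustration only.

```rust
/// Hypothetical, simplified stand-in for the strategies listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strategy {
    LongestFirst,
    OnlyFirst,
    OnlySecond,
    DoNotTruncate,
}

/// (truncated first sequence, truncated second sequence, overflowing tokens)
type Truncated = (Vec<i64>, Option<Vec<i64>>, Vec<i64>);

/// Prepends the last `stride` tokens of the kept sequence to the removed tokens.
fn with_stride(kept: &[i64], mut removed: Vec<i64>, stride: usize) -> Vec<i64> {
    let window = stride.min(kept.len());
    let mut overflow = kept[kept.len() - window..].to_vec();
    overflow.append(&mut removed);
    overflow
}

/// Removes `n` tokens from the end of the input according to `strategy`.
fn truncate_pair(
    mut a: Vec<i64>,
    mut b: Option<Vec<i64>>,
    n: usize,
    strategy: Strategy,
    stride: usize,
) -> Result<Truncated, String> {
    if n == 0 {
        return Ok((a, b, Vec::new()));
    }
    match strategy {
        Strategy::LongestFirst => {
            let total = a.len() + b.as_ref().map_or(0, Vec::len);
            if total < n {
                return Err("combined input is shorter than n".to_string());
            }
            let mut removed = Vec::new();
            for _ in 0..n {
                let b_len = b.as_ref().map_or(0, Vec::len);
                if a.len() > b_len {
                    // Only tokens removed from the first sequence overflow.
                    removed.extend(a.pop());
                } else if let Some(second) = b.as_mut() {
                    // Assumption: on a tie, the second sequence is shortened.
                    second.pop();
                }
            }
            removed.reverse(); // restore original token order
            let overflow = with_stride(&a, removed, stride);
            Ok((a, b, overflow))
        }
        Strategy::OnlyFirst => {
            if a.len() <= n {
                return Err("first sequence is too short".to_string());
            }
            let removed = a.split_off(a.len() - n);
            let overflow = with_stride(&a, removed, stride);
            Ok((a, b, overflow))
        }
        Strategy::OnlySecond => match b {
            Some(mut second) if second.len() > n => {
                let removed = second.split_off(second.len() - n);
                // Assumption: overflow is taken from the second sequence here.
                let overflow = with_stride(&second, removed, stride);
                Ok((a, Some(second), overflow))
            }
            _ => Err("second sequence is missing or too short".to_string()),
        },
        Strategy::DoNotTruncate => Err("truncation required but disabled".to_string()),
    }
}
```

Under these assumptions, calling truncate_pair(vec![1, 2, 3, 4, 5], Some(vec![6, 7, 8]), 3, Strategy::LongestFirst, 1) removes two tokens from the first sequence and one from the second, leaving [1, 2, 3] and [6, 7]. The overflow is [3, 4, 5]: the two removed tokens preceded by one stride token that is also kept in the first sequence.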