pub struct BpeOptions<'a> {
pub merges: &'a [(Cow<'a, str>, Cow<'a, str>)],
pub vocab: Option<FxHashMap<EncodedBytes, TokenId>>,
pub added_tokens: FxHashMap<TokenId, String>,
pub end_of_word_suffix: Option<String>,
pub ignore_merges: bool,
}
Configuration for a Bpe tokenization model.
Fields
merges: &'a [(Cow<'a, str>, Cow<'a, str>)]
Ordered entries of the merge list. Each entry is a pair of strings representing byte sequences. See also merge_pairs_from_lines, which can be used to extract pairs from the space-separated format used in e.g. merges.txt files.
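As a rough illustration of the space-separated merges.txt format, the following sketch parses such lines into pairs. This is an assumption-laden stand-in for what `merge_pairs_from_lines` is described to do, not the library's implementation; the function name `parse_merge_lines` and the comment-skipping rule are illustrative.

```rust
/// Hedged sketch: turn space-separated merge lines (the merges.txt format)
/// into (left, right) pairs. Illustrative only; not the library's
/// `merge_pairs_from_lines`.
fn parse_merge_lines(lines: &[&str]) -> Vec<(String, String)> {
    lines
        .iter()
        // Skip blank lines and header/comment lines such as "#version: 0.2".
        .filter(|l| !l.trim().is_empty() && !l.starts_with('#'))
        .filter_map(|l| {
            // Split on the first space only: left and right byte sequences.
            let mut parts = l.splitn(2, ' ');
            match (parts.next(), parts.next()) {
                (Some(a), Some(b)) => Some((a.to_string(), b.to_string())),
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let lines = ["#version: 0.2", "f o", "fo o", "foo bar"];
    let pairs = parse_merge_lines(&lines);
    println!("{} merge pairs", pairs.len()); // prints "3 merge pairs"
}
```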
vocab: Option<FxHashMap<EncodedBytes, TokenId>>
Mapping from token strings to IDs. If not provided, the ID of a token is 256 plus the index of the merge-list pair whose concatenation forms the token string. For example, if index 10 in the merge list is “foo bar”, then the token ID of “foobar” is 266. Token IDs below 256 are reserved for individual bytes.
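The default ID scheme described above can be written out as a one-line calculation. This is a sketch of the documented rule only; the helper name `default_token_id` is illustrative, not part of the crate's API.

```rust
/// Hedged sketch of the documented default when `vocab` is None:
/// IDs 0..=255 are reserved for individual bytes, and the token formed by
/// merge entry `i` gets ID 256 + i.
fn default_token_id(merge_index: usize) -> u32 {
    256 + merge_index as u32
}

fn main() {
    // Per the docs: if index 10 in the merge list is "foo bar",
    // the merged token "foobar" gets ID 266.
    assert_eq!(default_token_id(10), 266);
    println!("ok");
}
```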
added_tokens: FxHashMap<TokenId, String>
Set of tokens which don’t appear in merges but do have a mapping in vocab. These are used for special purposes such as representing the end of output.
end_of_word_suffix: Option<String>
A string which is implicitly appended to each substring that is tokenized, after initial splitting.
ignore_merges: bool
When encoding a string piece, match the entire piece against the vocabulary before applying merge rules.
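The effect of ignore_merges can be sketched as a whole-piece vocabulary lookup that short-circuits the merge loop. The names below (`encode_piece`, the byte-level fallback, the use of std `HashMap` instead of `FxHashMap`) are assumptions for illustration, not the crate's actual encoding path.

```rust
use std::collections::HashMap;

/// Hedged sketch of what `ignore_merges` is described to do: before running
/// BPE merge rules on a piece, try to match the whole piece against the
/// vocabulary and emit a single token if it is present.
fn encode_piece(
    piece: &str,
    vocab: &HashMap<String, u32>,
    ignore_merges: bool,
) -> Vec<u32> {
    if ignore_merges {
        if let Some(&id) = vocab.get(piece) {
            // Whole piece matched: one token, no merges applied.
            return vec![id];
        }
    }
    // Byte-level fallback, standing in for the real merge loop.
    piece.bytes().map(|b| b as u32).collect()
}

fn main() {
    let vocab = HashMap::from([("hello".to_string(), 300u32)]);
    // Whole-piece match wins when ignore_merges is set.
    assert_eq!(encode_piece("hello", &vocab, true), vec![300]);
    // A piece absent from the vocab falls through to per-byte tokens.
    assert_eq!(encode_piece("hi", &vocab, true).len(), 2);
    println!("ok");
}
```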