convert_tokenizers

pub fn convert_tokenizers(
    data: impl AsRef<[u8]>,
) -> Result<Definition, ConversionError>
Available on crate feature convert-tokenizers only.

Converts a tokenizers definition into the definition format used by this crate.

data is the JSON data used by the tokenizers library, commonly stored as tokenizer.json.

Returns the tokenizer definition, or an error if the conversion fails.

§Examples

use kitoken::convert::convert_tokenizers;
use kitoken::Kitoken;

let data = std::fs::read("tests/models/tokenizers/llama2.json")?;
let definition = convert_tokenizers(data)?;

let tokenizer = Kitoken::try_from(definition)?;

Additional conversion utilities are defined in Definition and Kitoken.

§Format

The tokenizers definition is a JSON object with the following fields:

  • model: The model definition.
  • added_tokens: An optional array of added tokens.
  • normalizer: An optional normalizer definition array.
  • pre_tokenizer: An optional pre-tokenizer definition array.
  • post_processor: An optional post-processor definition array.
  • decoder: An optional decoder definition array.
  • truncation: An optional truncation definition.
  • padding: An optional padding definition.

See the tokenizers documentation for more information.
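The overall layout can be sketched as follows. This is a minimal illustrative example with a small BPE model; the field values are placeholders, not data shipped with this crate:

```json
{
  "model": {
    "type": "BPE",
    "vocab": { "a": 0, "b": 1, "ab": 2 },
    "merges": ["a b"]
  },
  "added_tokens": [
    { "id": 3, "content": "<s>", "special": true }
  ],
  "normalizer": null,
  "pre_tokenizer": null,
  "post_processor": null,
  "decoder": null,
  "truncation": null,
  "padding": null
}
```

Optional fields that are unused may be `null` or omitted; the `model` object determines how the remaining data is interpreted.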

Tokenizers definitions can contain different model types, including BPE, Unigram, WordPiece, and WordLevel. This function supports conversion of BPE, Unigram, and WordPiece models; WordLevel models are not supported.