Crate instant_clip_tokenizer

This crate provides a text tokenizer for OpenAI’s CLIP model.

It is intended to be a fast replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.

§Examples

Basic usage with the bundled vocabulary data suitable for OpenAI’s CLIP model (requires the openai-vocabulary-file crate feature):

```rust
use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();
let mut tokens = vec![tokenizer.start_of_text()];
tokenizer.encode("Hi there", &mut tokens);
tokens.push(tokenizer.end_of_text());
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
assert_eq!(tokens, [49406, 1883, 997, 49407]);
```

Using a custom vocabulary file:

```rust
use std::fs::File;
use std::io::BufReader;

use instant_clip_tokenizer::{Token, Tokenizer};

let f = BufReader::new(File::open("bpe_simple_vocab_16e6.txt")?);
let tokenizer = Tokenizer::with_vocabulary(f, 50_000)?;
let mut tokens = vec![tokenizer.start_of_text()];
tokenizer.encode("Hi there", &mut tokens);
tokens.push(tokenizer.end_of_text());
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
assert_eq!(tokens, [49998, 1883, 997, 49999]);
```

§Crate features

This crate provides two features:

  • ndarray - Enables the ndarray dependency and the Tokenizer::tokenize_batch method that can be used to tokenize several input strings at once, returning a matrix suitable for directly passing to the CLIP neural network.
  • openai-vocabulary-file - Bundles the default vocabulary file for OpenAI’s CLIP model with this crate, so that a tokenizer can be constructed by simply calling Tokenizer::new. When disabled, you must supply your own vocabulary file and construct the tokenizer with Tokenizer::with_vocabulary.
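As a sketch of what batch tokenization might look like (assuming both the ndarray and openai-vocabulary-file features are enabled; check the Tokenizer::tokenize_batch API docs for the exact signature):

```rust
use instant_clip_tokenizer::Tokenizer;

// Requires the `ndarray` and `openai-vocabulary-file` crate features.
let tokenizer = Tokenizer::new();

// Tokenize two inputs at once; each row is padded or truncated to the
// given context length (77 is the length expected by CLIP's text encoder).
let batch = tokenizer.tokenize_batch(["Hi there", "How are you?"], 77);

// `batch` is a 2-D ndarray of token ids with one row per input string,
// ready to be passed to the CLIP neural network.
assert_eq!(batch.shape(), [2, 77]);
```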

The openai-vocabulary-file feature is enabled by default. To disable it, set default-features = false when specifying the dependency on this crate in your Cargo.toml.
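For example, a Cargo.toml entry might look like this (the version shown is a placeholder, not a real release number):

```toml
[dependencies]
# Version is illustrative - use the latest published release.
instant-clip-tokenizer = { version = "x.y.z", default-features = false }
```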

§Structs

Token
Represents a single token.
Tokenizer
A text tokenizer for the CLIP neural network.