This crate provides a text tokenizer for OpenAI’s CLIP model.
It is intended to be a fast replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations that use the same tokenizer.
§Examples
Basic usage with the bundled vocabulary data suitable for OpenAI's CLIP model (requires the `openai-vocabulary-file` crate feature):
```rust
use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();
let mut tokens = vec![tokenizer.start_of_text()];
tokenizer.encode("Hi there", &mut tokens);
tokens.push(tokenizer.end_of_text());
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
assert_eq!(tokens, [49406, 1883, 997, 49407]);
```
Using a custom vocabulary file:
```rust
use std::fs::File;
use std::io::BufReader;
use instant_clip_tokenizer::{Token, Tokenizer};

let f = BufReader::new(File::open("bpe_simple_vocab_16e6.txt")?);
let tokenizer = Tokenizer::with_vocabulary(f, 50_000)?;
let mut tokens = vec![tokenizer.start_of_text()];
tokenizer.encode("Hi there", &mut tokens);
tokens.push(tokenizer.end_of_text());
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
assert_eq!(tokens, [49998, 1883, 997, 49999]);
```
§Crate features
This crate provides two features:
- `ndarray` - Enables the `ndarray` dependency and the `Tokenizer::tokenize_batch` method that can be used to tokenize several input strings at once, returning a matrix suitable for directly passing to the CLIP neural network.
- `openai-vocabulary-file` - Bundles the default vocabulary file used for OpenAI's CLIP model together with this crate and allows users to construct a new tokenizer simply by calling `Tokenizer::new`. When disabled, you will need to supply your own vocabulary file and construct the tokenizer using `Tokenizer::with_vocabulary`.
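To use `Tokenizer::tokenize_batch`, the `ndarray` feature must be enabled in your dependency declaration. A sketch of what that might look like in `Cargo.toml` (the crate name and version shown here are assumptions; adjust them to match your setup):

```toml
[dependencies]
# Illustrative name and version - use the actual crate name and a real version.
instant-clip-tokenizer = { version = "0.1", features = ["ndarray"] }
```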
The `openai-vocabulary-file` feature is enabled by default. To disable it, use `default-features = false` when specifying the dependency on this crate in your `Cargo.toml`.
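For example, a dependency declaration with the bundled vocabulary disabled might look like this in `Cargo.toml` (the crate name and version are illustrative assumptions):

```toml
[dependencies]
# Illustrative name and version - use the actual crate name and a real version.
# With default features off, Tokenizer::with_vocabulary must be used instead
# of Tokenizer::new.
instant-clip-tokenizer = { version = "0.1", default-features = false }
```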