Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network
Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for OpenAI's CLIP model. It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with OpenCLIP and other implementations using the same tokenizer.
In addition to being usable as a Rust crate, it includes Python bindings built with PyO3 so that it can be used as a native Python module.
In the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
Using the library
Rust
[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }
Python (>= 3.9)
Using the library requires numpy >= 1.16.0 installed in your Python environment (e.g., via pip install numpy).
Examples
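use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();

let mut tokens = Vec::new();
tokenizer.encode("A person riding a motorcycle", &mut tokens);
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
println!("{:?}", tokens);
// -> [320, 2533, 6765, 320, 10297]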
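import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)
# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)
# -> [[49406 320 2533 6765 49407]
#     [49406 1883 997 49407 0]]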
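
The batch output above shows every row with the same width (5 here): each row begins with 49406 and ends with 49407, CLIP's start-of-text and end-of-text token IDs; longer inputs are truncated and shorter ones are padded with zeros. The following is a minimal pure-Python sketch of that row layout, inferred from the two rows shown rather than taken from the library's source. `layout_row` is a hypothetical helper for illustration and is not part of the library's API.

```python
START_OF_TEXT = 49406  # CLIP <|startoftext|> token ID
END_OF_TEXT = 49407    # CLIP <|endoftext|> token ID


def layout_row(tokens, context_length):
    """Hypothetical helper: arrange encoded tokens into one fixed-width row.

    Layout: start marker, as many tokens as fit, end marker, zero padding.
    """
    if context_length < 2:
        # Assumption of this sketch: a row needs room for both markers.
        raise ValueError("context_length must be at least 2")
    # Reserve two slots for the start and end markers, truncating the rest.
    body = list(tokens)[: context_length - 2]
    row = [START_OF_TEXT] + body + [END_OF_TEXT]
    # Pad with zeros up to the requested width.
    return row + [0] * (context_length - len(row))


rows = [
    layout_row([320, 2533, 6765, 320, 10297], 5),  # truncated to fit
    layout_row([1883, 997], 5),                    # padded with one zero
]
print(rows)
```

With the token sequences from the examples above, the sketch reproduces both rows of the batch output: the first input loses its last two tokens to make room for the end marker, and the second gains one trailing zero.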
Testing
To run the tests, run the following:
You can also test the Python bindings with:
Acknowledgements
The vocabulary file and original Python tokenizer code included in this repository are copyright (c) 2021 OpenAI (MIT-License).