§OpenCLIP in Rust
Easily run pre-trained open_clip compatible models in Rust via ONNX Runtime.
§Features
- Run CLIP models in Rust via ONNX.
- Should support any model compatible with open_clip.
- Automatic model downloading: just provide the Hugging Face model ID (it has to point to a Hugging Face repo with ONNX files & open_clip_config.json).
  - Note: this is enabled by default via the hf-hub feature. Disable it to drop the tokio & hf-hub dependencies if you don't need to load models over the internet.
§Usage
Add open_clip_inference to your Cargo.toml.
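For example, the dependency entry might look like this (the version number below is illustrative; check crates.io for the current release):

```toml
[dependencies]
open_clip_inference = "0.1"
```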
§Option 1: Combined vision & text Clip API
Use the Clip struct to perform classification or image ranking.
```rust
use open_clip_inference::Clip;
use std::path::Path;

let model_id = "RuteNL/MobileCLIP2-S3-OpenCLIP-ONNX";
let clip = Clip::from_hf(model_id).build().await?;

let img = image::open(Path::new("assets/img/cat_face.jpg")).expect("Failed to load image");
let labels = &["cat", "dog"];

let results = clip.classify(&img, labels)?;
for (label, prob) in results {
    println!("{}: {:.2}%", label, prob * 100.0);
}
```

§Option 2: Individual text & vision embedders
Use VisionEmbedder or TextEmbedder standalone to reduce memory usage if you only need one or the other.
```rust
use open_clip_inference::{VisionEmbedder, TextEmbedder, Clip};
use std::path::Path;

let model_id = "timm/MobileCLIP2-S3-OpenCLIP";
let vision = VisionEmbedder::from_local_id(model_id).build()?;
let text = TextEmbedder::from_local_id(model_id).build()?;

let img = image::open(Path::new("assets/img/cat_face.jpg")).expect("Failed to load image");
let img_emb = vision.embed_image(&img)?;
let text_embs = text.embed_texts(&["cat", "dog"])?;

// Raw dot product
let similarities = text_embs.dot(&img_emb);

// Apply model scale and bias
let scale = text.model_config.logit_scale.unwrap_or(1.0);
let bias = text.model_config.logit_bias.unwrap_or(0.0);
let logits: Vec<f32> = similarities.iter().map(|&s| s.mul_add(scale, bias)).collect();

// Convert to probabilities
let probs = Clip::softmax(&logits);
```

§Execution Providers (Nvidia, AMD, Intel, Mac, Arm, etc.)
Since this crate is built on ort, many execution providers are available for hardware acceleration. You can enable an execution provider with this crate's cargo features. A full list of execution providers is
available here.
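Enabling a provider feature in Cargo.toml might look like this (the version number is illustrative):

```toml
[dependencies]
open_clip_inference = { version = "0.1", features = ["cuda"] }
```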
To enable cuda, add the “cuda” feature,
and pass the CUDA execution provider when creating the embedder:
```rust
use open_clip_inference::Clip;
use ort::ep::{CUDA, CoreML, DirectML, TensorRT};
use std::path::Path;

let model_id = "RuteNL/MobileCLIP2-S3-OpenCLIP-ONNX";

// Execution providers can be passed to the Clip, TextEmbedder, and VisionEmbedder builders.
// By default, an empty list is passed, which results in CPU inference.
// When multiple are passed, each execution provider is tried in order; if one doesn't work,
// the next one is tried, falling back to CPU when no options are left.
let clip = Clip::from_hf(model_id)
    .with_execution_providers(&[
        TensorRT::default().build(),
        CUDA::default().build(),
        DirectML::default().build(),
        CoreML::default().build(),
    ])
    .build()
    .await?;
```

§Model support
This crate is implemented with ort, so it runs ONNX models. I've uploaded the following ONNX CLIP embedding models to Hugging Face:
- RuteNL/ViT-SO400M-16-SigLIP2-384-ONNX
- RuteNL/ViT-gopt-16-SigLIP2-384-ONNX
- RuteNL/DFN5B-CLIP-ViT-H-14-378-ONNX
- RuteNL/MobileCLIP2-S2-OpenCLIP-ONNX
- RuteNL/MobileCLIP2-S3-OpenCLIP-ONNX
- RuteNL/MobileCLIP2-S4-OpenCLIP-ONNX
If you need a model that hasn't been converted to ONNX on Hugging Face yet, you can easily convert any open_clip-compatible model yourself using pull_onnx.py from this repo.
- Make sure you have uv.
- Run:

```shell
uv run pull_onnx.py --id timm/vit_base_patch32_clip_224.openai
```

- After the Python script is done, you can load the model in your Rust code:

```rust
let clip = Clip::from_local_id("timm/vit_base_patch32_clip_224.openai").build()?;
```

I've tested the following models to work with pull_onnx.py & this crate:
- timm/MobileCLIP2-S4-OpenCLIP *
- timm/ViT-SO400M-16-SigLIP2-384 *
- timm/ViT-SO400M-14-SigLIP-384 *
- timm/vit_base_patch32_clip_224.openai *
- Marqo/marqo-fashionSigLIP *
- laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
- imageomics/bioclip
- timm/PE-Core-bigG-14-448
* Verified to produce embedding outputs equal to the reference Python implementation.
See the examples/ directory for detailed usage.
§Re-exports
pub use clip::Clip;
pub use error::ClipError;
pub use text::TextEmbedder;
pub use vision::VisionEmbedder;