# OpenCLIP embedding in Rust

Easily run pre-trained open_clip-compatible embedding models in Rust via ONNX Runtime.
## Features

- Run CLIP models in Rust via ONNX.
- Should support any model compatible with open_clip (Python).
- Automatic model downloading: just provide the Hugging Face model ID (it has to point to a HuggingFace repo with ONNX files & an open_clip_config.json).
- Python is only needed if you want to convert new models yourself.
## Prerequisites

- Rust & Cargo.
- (Optional) uv - only if you want to convert models from HuggingFace to ONNX.
- (Optional) onnxruntime - only if you have to link dynamically (on Windows).
## Usage: Embedding text & image

### Option 1: `Clip` struct

The `Clip` struct is built for ease of use, handling both vision and text together, with convenience functions for similarity rankings.
A sketch of the intended usage; the crate path, builder, and ranking method names here are assumptions, so check the crate docs and examples for the real API:

```rust
use clip::Clip; // crate name assumed

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Fetch the model from HuggingFace by ID (default hf-hub feature).
    let clip = Clip::from_hf("RuteNL/MobileCLIP2-S2-OpenCLIP-ONNX").await?;
    // Convenience function for similarity ranking (method name is an assumption).
    let labels = ["A photo of a cat", "A photo of a dog", "A photo of a beignet"];
    let ranking = clip.rank_texts("poekie.jpg", &labels)?;
    println!("{ranking:?}");
    Ok(())
}
```
Input image: Poekie

Outputs:

```
A photo of a cat: 99.99%
A photo of a dog: 0.01%
A photo of a beignet: 0.00%
```
### Option 2: Individual vision & text embedders

Use `VisionEmbedder` or `TextEmbedder` standalone to just produce embeddings from images & text.
A sketch (crate path, constructor, and method names are assumptions):

```rust
use clip::{TextEmbedder, VisionEmbedder}; // crate name assumed

// Embed images & text separately with standalone embedders.
let vision = VisionEmbedder::from_local_id("timm/vit_base_patch32_clip_224.openai")?;
let text = TextEmbedder::from_local_id("timm/vit_base_patch32_clip_224.openai")?;
let image_embedding = vision.embed("poekie.jpg")?;
let text_embedding = text.embed("A photo of a cat")?;
```
## Examples

Run the included examples (ensure you have exported the relevant model first):

```sh
# Simple generic example
cargo run --example basic

# Semantic image search demo
cargo run --example search
```
## Model support

This crate is implemented with `ort`, which runs ONNX models. I've uploaded the following ONNX CLIP embedding models to HuggingFace. To give an idea of the speed/quality tradeoff for these models, I've benchmarked them and listed the results alongside each model's ImageNet zero-shot accuracy.
| Model ID | ImageNet Zero-Shot Accuracy | Vision Embedding (ms)* | Text Embedding (ms)* |
|---|---|---|---|
| RuteNL/ViT-gopt-16-SigLIP2-384-ONNX | 85.0% | 2354 | 128 |
| RuteNL/DFN5B-CLIP-ViT-H-14-378-ONNX | 84.4% | 1860 | 131 |
| RuteNL/ViT-SO400M-16-SigLIP2-384-ONNX | 84.1% | 988 | 136 |
| RuteNL/MobileCLIP2-S3-OpenCLIP-ONNX | 80.7% | 116 | 35 |
| RuteNL/MobileCLIP2-S4-OpenCLIP-ONNX | 79.4% | 192 | 38 |
| RuteNL/MobileCLIP2-S2-OpenCLIP-ONNX | 77.2% | 75 | 19 |
* Embedding speed measured on my CPU; vision embedding includes 10-20 ms of preprocessing.

Source for the MobileCLIP ImageNet accuracy.
Source for the other ImageNet accuracy numbers.
### Other models

If you need a model that hasn't been converted to ONNX on HuggingFace yet, you can easily convert any open_clip-compatible model yourself using `pull_onnx.py` from this repo.

- Make sure you have uv.
- Run:

  ```sh
  uv run pull_onnx.py --id timm/vit_base_patch32_clip_224.openai
  ```

- After the Python script is done, you can load the model in your Rust code:

  ```rust
  let clip = Clip::from_local_id("timm/vit_base_patch32_clip_224.openai").build()?;
  ```
I've tested the following models to work with pull_onnx.py & this crate:
- timm/MobileCLIP2-S4-OpenCLIP *
- timm/ViT-SO400M-16-SigLIP2-384 *
- timm/ViT-SO400M-14-SigLIP-384 *
- timm/vit_base_patch32_clip_224.openai *
- Marqo/marqo-fashionSigLIP *
- laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
- imageomics/bioclip
- timm/PE-Core-bigG-14-448
* Verified equal embedding outputs compared to the reference Python implementation.
## Execution Providers (Nvidia, AMD, Intel, Mac, Arm, etc.)

Since this crate is implemented with `ort`, many execution providers are available to enable hardware acceleration. You can enable an execution provider in this crate with cargo features; a full list of execution providers is available in the `ort` documentation.

To enable CUDA, add the "cuda" feature and pass the CUDA execution provider when creating the embedder:
A sketch of what that looks like; the crate name and builder method are assumptions, so check the crate's docs for the real API:

```rust
use clip::Clip; // crate name assumed
use ort::execution_providers::CUDAExecutionProvider; // module path per ort 2.x

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Builder method name is an assumption.
    let clip = Clip::from_local_id("RuteNL/MobileCLIP2-S2-OpenCLIP-ONNX")
        .with_execution_providers([CUDAExecutionProvider::default().build()])
        .build()?;
    // ... embed images & text as usual
    Ok(())
}
```
## Features

- [default] `hf-hub` - enables the `from_hf` function to fetch a model from HuggingFace; relies on `tokio`.
- [default] `fast_image_resize` - use fast_image_resize instead of `image` to resize images for preprocessing. It is about 77% faster, but differs slightly more from PIL than `image` does, which affects the embedding outputs slightly.
- `load-dynamic` - link ONNX Runtime dynamically instead of statically. See the troubleshooting section below, or the `ort` crate's feature documentation, for more info.
- Forwarded `ort` features - see Cargo.toml for the full list, and the `ort` docs for their explanation.
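For example, enabling the CUDA and dynamic-linking features in Cargo.toml would look something like this (the crate name `clip` and the version are placeholders; substitute this crate's actual name and version):

```toml
[dependencies]
# "clip" is a placeholder for this crate's real name.
clip = { version = "0.1", features = ["cuda", "load-dynamic"] }
# tokio is needed by the default hf-hub feature's async model download.
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```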
## Troubleshooting

### It doesn't build on Windows due to onnxruntime problems

Try the `load-dynamic` feature and point to the onnxruntime DLL as described below.

### [When using the `load-dynamic` feature] ONNX Runtime library not found

With `load-dynamic`, ONNX Runtime is loaded at runtime, so if it isn't found, download the correct onnxruntime library from its GitHub releases. Then put the dll/so/dylib location on your PATH, or point the `ORT_DYLIB_PATH` env var at it.
PowerShell example:

```powershell
# Adjust the path to where the DLL is.
$env:ORT_DYLIB_PATH = "C:/Apps/onnxruntime/lib/onnxruntime.dll"
```

Shell example:

```sh
export ORT_DYLIB_PATH="/usr/local/lib/libonnxruntime.so"
```