Rust ONNX inference library for Google's EmbeddingGemma text embeddings. Produces 768-dim,
L2-normalized sentence embeddings via `ort` and `tokenizers`.
## Install

```toml
[dependencies]
# Placeholder name: depend on whatever name this crate is published under.
embedding-gemma = "0.1"
```
Download the canonical fp32 export from
[onnx-community/embeddinggemma-300m-ONNX](https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX)
(`model.onnx` plus its `model.onnx_data` sidecar, and `tokenizer.json`).
The model card flags fp16 as an unsupported activation dtype for this
graph; pass `model_fp16.onnx` only if you've validated it for your
workload.
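One way to fetch the files, assuming `huggingface-cli` is installed and the export sits under `onnx/` in the repo (check the model card's file list for the exact paths):

```sh
huggingface-cli download onnx-community/embeddinggemma-300m-ONNX \
  onnx/model.onnx onnx/model.onnx_data tokenizer.json \
  --local-dir embeddinggemma-300m-onnx
```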
## Cargo features

| Feature | Default | Effect |
|---|---|---|
| `inference` | ✅ | Pulls `ort` + `tokenizers`; activates `TextEncoder`. Native targets only. |
| `serde` | | `Serialize` / `Deserialize` on `Options`, `BatchOptions`, `ThreadOptions`. |
| `cuda` | | NVIDIA GPUs (Linux/Windows). Requires CUDA toolkit + cuDNN at build and run time. |
| `tensorrt` | | NVIDIA, optimized inference. Falls back to CUDA, then CPU. Requires CUDA + TensorRT. |
| `directml` | | Windows GPUs (any vendor) via DirectX 12. |
| `rocm` | | AMD GPUs (Linux). Requires ROCm SDK. |
| `coreml` | | macOS / iOS via Core ML (Neural Engine + GPU + Metal Performance Shaders). |
The execution-provider features are off by default — none are needed for CPU inference, and each requires the corresponding vendor SDK at build time.
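For example, opting in to the CUDA execution provider (crate name again a placeholder):

```toml
[dependencies]
# Requires the CUDA toolkit + cuDNN at build and run time.
embedding-gemma = { version = "0.1", features = ["cuda"] }
```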
## Target / feature contract

The `inference` feature is native-only. It pulls `ort` (ONNX
Runtime FFI) and `tokenizers` (which transitively depends on C-only
libraries like `onig_sys`); neither builds on `wasm32-*` today.
Building for wasm with default features fails deep in `getrandom` /
`onig_sys` before this crate's code is reached.

Wasm consumers must opt out:
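A sketch with a placeholder crate name; `serde` is shown only to illustrate re-enabling individual features:

```toml
[dependencies]
# Dropping default features removes `inference` (and with it ort/tokenizers),
# which is what lets wasm32 targets build.
embedding-gemma = { version = "0.1", default-features = false, features = ["serde"] }
```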
Without `inference`, the public surface is the `Embedding` type,
`Options` / `BatchOptions` / `ThreadOptions`, and the `Error` enum
— useful when inference itself happens elsewhere (a server, a
different runtime) and only the value types and the similarity primitive
need to be present.
## API surface

The crate exposes:

- `TextEncoder` — owns one `ort::Session` and one `tokenizers::Tokenizer`. Methods: `embed`, `embed_batch`, `warmup`. `Send + !Sync` (mirrors `ort::Session`); for parallelism, instantiate one encoder per thread, or share one behind a `Mutex`.
- `Embedding` (`Arc<[f32]>`) — a 768-dim, L2-normalized sentence embedding. `try_cosine` returns `Result<f32, Error>` (no panic on dimension mismatch).
- `Options` / `BatchOptions` / `ThreadOptions` — session, batch, and threading configuration. The `with_*` / `set_*` builders are `const fn` where the underlying types permit.
- `Error` — `#[non_exhaustive]`, `thiserror`-derived.
`Embedding` deliberately does not implement `Serialize` /
`Deserialize` — see its docstring for the validated round-trip pattern
through the inner slice.
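A usage sketch; the constructor shown (`TextEncoder::new` taking model and tokenizer paths) and `Options::default()` are assumptions about the API shape, so check the rustdoc for the real signatures:

```rust
use embedding_gemma::{Options, TextEncoder}; // placeholder crate name

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical constructor: the real entry point may differ.
    let encoder = TextEncoder::new("model.onnx", "tokenizer.json", Options::default())?;

    let a = encoder.embed("The cat sat on the mat.")?;
    let b = encoder.embed("A feline rested on the rug.")?;

    // try_cosine returns Err on a dimension mismatch instead of panicking.
    let sim = a.try_cosine(&b)?;
    println!("cosine similarity: {sim:.3}");
    Ok(())
}
```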
## SIMD

`Embedding::try_cosine` dispatches the 768-element f32 dot product
through a runtime-detected backend:
- NEON on `aarch64` (a baseline ISA feature, always available).
- AVX2 + FMA on `x86_64` when both are detected.
- A scalar four-accumulator fallback elsewhere.
The `unsafe` per-arch kernels take `&[f32; 768]` rather than `&[f32]` —
the type-level length invariant is what makes the raw-pointer reads
sound, and a wrong-length slice can never reach the `unsafe` boundary.
The dispatcher short-circuits to scalar under `cfg!(miri)` so Miri
matrices exercise the same call sites without entering platform
intrinsics it can't model.
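A sketch of this dispatch pattern, not the crate's actual kernels (the NEON arm is omitted for brevity):

```rust
/// Runtime-dispatched dot product. The `&[f32; 768]` parameter type carries
/// the length invariant that keeps the unsafe kernel's raw reads in bounds.
fn dot_768(a: &[f32; 768], b: &[f32; 768]) -> f32 {
    #[cfg(all(target_arch = "x86_64", not(miri)))]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both required CPU features were detected at runtime.
            return unsafe { dot_768_avx2_fma(a, b) };
        }
    }
    dot_768_scalar(a, b)
}

/// Scalar fallback: four independent accumulators hide FP add latency.
fn dot_768_scalar(a: &[f32; 768], b: &[f32; 768]) -> f32 {
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0f32, 0.0f32, 0.0f32);
    for i in (0..768).step_by(4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    (s0 + s1) + (s2 + s3)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn dot_768_avx2_fma(a: &[f32; 768], b: &[f32; 768]) -> f32 {
    use std::arch::x86_64::*;
    // SAFETY: the caller verified AVX2 + FMA, and `i + 8 <= 768` holds by
    // the array type, so every 8-lane load stays in bounds.
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in (0..768).step_by(8) {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        // Horizontal sum of the eight lanes.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum()
    }
}
```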
## License
Dual-licensed under MIT or Apache-2.0, at your option.
See LICENSE-MIT and LICENSE-APACHE.