egemma 0.1.0

Rust ONNX inference library for Google's EmbeddingGemma text embeddings. Produces 768-dim L2-normalized sentence embeddings via ort and tokenizers.

Install

[dependencies]
egemma = "0.1"

Download the canonical fp32 export from onnx-community/embeddinggemma-300m-ONNX (model.onnx plus its model.onnx_data sidecar, and tokenizer.json). The model card flags fp16 as an unsupported activation dtype for this graph; pass model_fp16.onnx only if you've validated it for your workload.

Cargo features

Feature    Default  Effect
inference  yes      Pulls ort + tokenizers; activates TextEncoder. Native targets only.
serde      no       Serialize / Deserialize on Options, BatchOptions, ThreadOptions.
cuda       no       NVIDIA GPUs (Linux/Windows). Requires CUDA toolkit + cuDNN at build and run time.
tensorrt   no       NVIDIA, optimized inference. Falls back to CUDA, then CPU. Requires CUDA + TensorRT.
directml   no       Windows GPUs (any vendor) via DirectX 12.
rocm       no       AMD GPUs (Linux). Requires ROCm SDK.
coreml     no       macOS / iOS via Core ML (Neural Engine + GPU + Metal Performance Shaders).

The execution-provider features are off by default — none are needed for CPU inference, and each requires the corresponding vendor SDK at build time.
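
For example, to enable the CUDA execution provider:

[dependencies]
egemma = { version = "0.1", features = ["cuda"] }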

Target / feature contract

The inference feature is native-only. It pulls ort (ONNX Runtime FFI) and tokenizers (which transitively depends on C-only libraries like onig_sys); neither builds on wasm32-* today. Building wasm with default features fails deep in getrandom / onig_sys before this crate's code is reached.

Wasm consumers must opt out:

cargo check --target wasm32-unknown-unknown --no-default-features
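
The same opt-out in a downstream Cargo.toml:

[dependencies]
egemma = { version = "0.1", default-features = false }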

Without inference, the public surface is the Embedding type, Options / BatchOptions / ThreadOptions, and the Error enum — useful when inference itself happens elsewhere (a server, a different runtime) and only the value types and similarity primitive need to be present.

API surface

The crate exposes:

  • TextEncoder — owns one ort::Session and one tokenizers::Tokenizer. embed, embed_batch, warmup. Send + !Sync (mirrors ort::Session); for parallelism, instantiate one encoder per thread, or share one behind a Mutex.
  • Embedding(Arc<[f32]>) — 768-dim L2-normalized sentence embedding. try_cosine returns Result<f32, Error> (no panic on dim mismatch).
  • Options / BatchOptions / ThreadOptions — session, batch, and threading configuration. with_* / set_* builders are const fn where the underlying types permit.
  • Error (#[non_exhaustive], thiserror-derived).

Embedding deliberately does not implement Serialize / Deserialize — see its docstring for the validated round-trip pattern through the inner slice.
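
A minimal end-to-end sketch. embed and try_cosine are the documented calls; the constructor used here (TextEncoder::from_files) and its argument shapes are assumptions standing in for the real loading API, so check the crate docs for the actual signatures:

use egemma::{Embedding, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical constructor; consult the crate docs for the real name and
    // signature. model.onnx_data must sit next to model.onnx so ONNX Runtime
    // can resolve the external-data sidecar.
    let encoder = TextEncoder::from_files("model.onnx", "tokenizer.json")?;

    let a: Embedding = encoder.embed("the cat sat on the mat")?;
    let b: Embedding = encoder.embed("a feline rested on the rug")?;

    // Documented: try_cosine returns Result<f32, Error> rather than panicking
    // on a dimension mismatch.
    let sim = a.try_cosine(&b)?;
    println!("cosine similarity: {sim:.4}");
    Ok(())
}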

SIMD

Embedding::try_cosine dispatches the 768-element f32 dot product through a runtime-detected backend:

  • NEON on aarch64 (baseline ISA feature, always available).
  • AVX2 + FMA on x86_64 when both are detected.
  • Scalar four-accumulator fallback elsewhere.

The unsafe per-arch kernels take &[f32; 768] rather than &[f32] — the type-level length invariant is what makes the raw-pointer reads sound, and a wrong-length slice can never reach the unsafe boundary. The dispatcher short-circuits to scalar under cfg!(miri), so Miri test runs exercise the same call sites without entering platform intrinsics Miri can't model.
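
A condensed sketch of that dispatch shape (illustrative only; the kernel names and exact loop bodies are stand-ins, not the crate's code):

const DIM: usize = 768;

// Runtime dispatch: Miri first, then the best detected ISA, then scalar.
#[allow(unreachable_code)]
fn dot(a: &[f32; DIM], b: &[f32; DIM]) -> f32 {
    if cfg!(miri) {
        // Miri cannot model platform intrinsics; stay on the scalar path.
        return dot_scalar(a, b);
    }
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: AVX2+FMA presence was just verified at runtime, and the
            // fixed-length array type keeps every 8-lane load in bounds.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // SAFETY: NEON is a baseline aarch64 feature; no runtime check needed.
        return unsafe { dot_neon(a, b) };
    }
    dot_scalar(a, b)
}

// Portable fallback: four independent accumulators hide FP add latency.
fn dot_scalar(a: &[f32; DIM], b: &[f32; DIM]) -> f32 {
    let mut acc = [0.0f32; 4];
    for i in (0..DIM).step_by(4) {
        for lane in 0..4 {
            acc[lane] += a[i + lane] * b[i + lane];
        }
    }
    acc.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn dot_avx2_fma(a: &[f32; DIM], b: &[f32; DIM]) -> f32 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_ps();
    for i in (0..DIM).step_by(8) {
        // 768 is divisible by 8, so no remainder loop is needed.
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb (fused)
    }
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}

#[cfg(target_arch = "aarch64")]
unsafe fn dot_neon(a: &[f32; DIM], b: &[f32; DIM]) -> f32 {
    use std::arch::aarch64::*;
    let mut acc = vdupq_n_f32(0.0);
    for i in (0..DIM).step_by(4) {
        let va = vld1q_f32(a.as_ptr().add(i));
        let vb = vld1q_f32(b.as_ptr().add(i));
        acc = vfmaq_f32(acc, va, vb); // acc += va * vb
    }
    vaddvq_f32(acc)
}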

License

Dual-licensed under MIT or Apache-2.0, at your option.

See LICENSE-MIT and LICENSE-APACHE.