
Crate egemma


E-gemma

Rust ONNX inference library for Google’s EmbeddingGemma text embeddings. Produces 768-dim L2-normalized sentence embeddings via ort and tokenizers.


§Install

[dependencies]
egemma = "0.1"

Download the canonical fp32 export from onnx-community/embeddinggemma-300m-ONNX (model.onnx plus its model.onnx_data sidecar, and tokenizer.json). The model card flags fp16 as an unsupported activation dtype for this graph; pass model_fp16.onnx only if you’ve validated it for your workload.

§Cargo features

| Feature   | Default | Effect |
|-----------|---------|--------|
| inference | yes     | Pulls ort + tokenizers; activates TextEncoder. Native targets only. |
| serde     | no      | Serialize / Deserialize on Options, BatchOptions, ThreadOptions. |
| cuda      | no      | NVIDIA GPUs (Linux/Windows). Requires CUDA toolkit + cuDNN at build and run time. |
| tensorrt  | no      | NVIDIA, optimized inference. Falls back to CUDA, then CPU. Requires CUDA + TensorRT. |
| directml  | no      | Windows GPUs (any vendor) via DirectX 12. |
| rocm      | no      | AMD GPUs (Linux). Requires ROCm SDK. |
| coreml    | no      | macOS / iOS via Core ML (Neural Engine + GPU + Metal Performance Shaders). |

The execution-provider features are off by default — none are needed for CPU inference, and each requires the corresponding vendor SDK at build time.
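For example, opting in to an execution provider is a one-line change in the consumer's manifest (feature names as listed in the table above):

```toml
[dependencies]
egemma = { version = "0.1", features = ["cuda"] }
```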

§Target / feature contract

The inference feature is native-only. It pulls ort (ONNX Runtime FFI) and tokenizers (which transitively depends on C-only libraries like onig_sys); neither builds on wasm32-* today. Building wasm with default features fails deep in getrandom / onig_sys before this crate’s code is reached.

Wasm consumers must opt out:

cargo check --target wasm32-unknown-unknown --no-default-features

Without inference, the public surface is the Embedding type, Options / BatchOptions / ThreadOptions, and the Error enum — useful when inference itself happens elsewhere (a server, a different runtime) and only the value types and similarity primitive need to be present.
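Concretely, a wasm-targeting consumer disables the default inference feature in its manifest, keeping only the value types:

```toml
[dependencies]
egemma = { version = "0.1", default-features = false }
```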

§API surface

The crate exposes:

  • TextEncoder — owns one ort::Session and one tokenizers::Tokenizer. embed, embed_batch, warmup. Send + !Sync (mirrors ort::Session); for parallelism, instantiate one encoder per thread, or share one behind a Mutex.
  • Embedding(Arc<[f32]>) — 768-dim L2-normalized sentence embedding. try_cosine returns Result<f32, Error> (no panic on dim mismatch).
  • Options / BatchOptions / ThreadOptions — session, batch, and threading configuration. with_* / set_* builders are const fn where the underlying types permit.
  • Error (#[non_exhaustive], thiserror-derived).

Embedding deliberately does not implement Serialize / Deserialize — see its docstring for the validated round-trip pattern through the inner slice.
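The two parallelism patterns above can be sketched as follows. `Encoder` here is a hypothetical stand-in for a `Send + !Sync` encoder like `TextEncoder` (the real type wraps an `ort::Session` and a tokenizer, and its constructor differs); only the ownership patterns are the point.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a Send + !Sync encoder; real construction
// and signatures differ.
pub struct Encoder;

impl Encoder {
    pub fn embed(&mut self, _text: &str) -> Vec<f32> {
        vec![0.0; 768] // real output: a 768-dim L2-normalized embedding
    }
}

/// Pattern 1: one encoder per worker thread (no locking).
pub fn embed_per_thread(texts: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = texts
        .into_iter()
        .map(|t| {
            thread::spawn(move || {
                let mut enc = Encoder; // each thread owns its own encoder
                enc.embed(&t).len()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

/// Pattern 2: one encoder shared behind a Mutex (access is serialized,
/// which is valid because Mutex<Encoder> is Sync when Encoder is Send).
pub fn embed_shared(texts: &[&str]) -> Vec<usize> {
    let shared = Arc::new(Mutex::new(Encoder));
    texts
        .iter()
        .map(|t| shared.lock().unwrap().embed(t).len())
        .collect()
}
```

Pattern 1 maximizes throughput at the cost of one session's memory per thread; pattern 2 trades throughput for a single session.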

§SIMD

Embedding::try_cosine dispatches the 768-element f32 dot product through a runtime-detected backend:

  • NEON on aarch64 (baseline ISA feature, always available).
  • AVX2 + FMA on x86_64 when both are detected.
  • Scalar four-accumulator fallback elsewhere.

The unsafe per-arch kernels take &[f32; 768] rather than &[f32] — the type-level length invariant is what makes the raw-pointer reads sound, and a wrong-length slice can never reach the unsafe boundary. The dispatcher short-circuits to scalar under cfg!(miri) so Miri matrices exercise the same call sites without entering platform intrinsics it can’t model.
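As a sketch of the scalar fallback's shape (not the crate's actual kernel): four independent accumulators break the floating-point dependency chain, and the fixed-size `&[f32; 768]` parameter is the same type-level length invariant described above. For L2-normalized vectors, this dot product is exactly the cosine similarity.

```rust
/// Scalar 768-element f32 dot product with four independent accumulators.
/// The fixed-size reference makes a wrong-length input unrepresentable.
pub fn dot768_scalar(a: &[f32; 768], b: &[f32; 768]) -> f32 {
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0f32, 0.0f32, 0.0f32);
    let mut i = 0;
    while i < 768 {
        // Four partial sums per iteration; 768 is divisible by 4.
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        i += 4;
    }
    (s0 + s1) + (s2 + s3)
}
```

A unit vector dotted with itself yields 1.0, and dotting it with a sign-flipped copy of every other lane yields 0.0, which makes the kernel easy to sanity-check.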

§License

Dual-licensed under MIT or Apache-2.0, at your option.

See LICENSE-MIT and LICENSE-APACHE.

Re-exports§

pub use embedding::Embedding;
pub use error::Error;
pub use error::Result;
pub use options::BatchOptions;
pub use options::Options;
pub use options::ThreadOptions;
pub use text_enc::TextEncoder; (inference only)

Modules§

embedding
Embedding — L2-normalized 768-dim sentence embedding.
error
Error type for the egemma crate.
options
Session, batch, and threading options for crate::TextEncoder.
text_enc (inference only)
Text encoder for embedding-gemma.

Enums§

GraphOptimizationLevel (inference only)
ONNX Runtime provides various graph optimizations to improve performance. Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations.