SigLIP2 NaFlex
Rust ONNX inference library for SigLIP2 NaFlex (image + text embeddings).
A sibling of textclap (CLAP audio inference) and a downstream of scenesdetect (keyframe extraction).
§Install
[dependencies]
siglip2-naflex = "0.1"

The package is siglip2-naflex on crates.io, but the lib name is
siglip2, so import sites are unchanged: use siglip2::*.
§Examples
Runnable examples live in examples/. Notable entry
points:
- embed_keyframes.rs: single-tower ImageEncoder over a directory of images.
- index_and_search.rs: end-to-end retrieval using Siglip2 (bundled constructor, classify, top-K).
- bench_ep.rs / bench_ep_text.rs: execution-provider latency microbenchmarks for vision and text.
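The top-K step mentioned for index_and_search.rs can be sketched with the standard library alone. This is a minimal illustration, not the example's actual code: the names dot and top_k are hypothetical, and it assumes every embedding is already L2-normalized so that a dot product equals cosine similarity.

```rust
/// Dot product of two equal-length vectors. For unit-norm embeddings this
/// equals cosine similarity. (Hypothetical helper, not part of the crate.)
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Return `(index, score)` for the `k` entries of `index` most similar to
/// `query`, best match first. (Hypothetical helper, not part of the crate.)
fn top_k(query: &[f32], index: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = index
        .iter()
        .enumerate()
        .map(|(i, e)| (i, dot(query, e)))
        .collect();
    // `total_cmp` is a total order on f32, so a NaN score cannot panic the sort.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A full sort is O(n log n); for large indexes a bounded heap or an approximate-nearest-neighbor structure would be the usual replacement.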
§Parity-against-PyTorch testing
The parity-against-upstream-PyTorch tests live in tests/integration.rs
and are gated on SIGLIP2_MODELS_DIR (the released ONNX graphs). They
stay marked #[ignore] in default cargo test runs because they need the
release artifacts, but the golden fixtures
(tests/fixtures/{images,embeddings}/*, text_prompts.json,
text_embeddings.npy) are committed in-tree.
The CI workflow at .github/workflows/parity.yml is two-stage:
| stage | what it proves | requires |
|---|---|---|
| model-load-smoke | Runtime can load the released ONNX + tokenizer, session shapes match contract | FINDIT_INDEXER_TOKEN repo secret |
| parity-against-pytorch | Cosine-floor parity (≥ 0.99917) against the in-tree PyTorch reference | same secret |
The smoke gate is not a parity gate. The parity gate is, and it runs on every push and PR in any environment that has the secret. Forks without the secret skip the whole workflow (forks are not expected to hold the release-repo PAT).
Fixtures are reproducible end-to-end via
scripts/generate_synthetic_keyframes.py +
scripts/generate_parity_fixtures.py; see tests/fixtures/README.md.
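A cosine-floor check of the kind the parity stage describes can be sketched as follows. This is an illustrative sketch, not the code in tests/integration.rs: the names cosine, COSINE_FLOOR, and passes_parity are hypothetical, and only the 0.99917 threshold comes from the table above.

```rust
/// Cosine similarity of two vectors. Returns None on a length mismatch or a
/// zero-norm input. (Hypothetical helper, not part of the crate.)
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    // Accumulate in f64 so rounding error stays well below the floor's margin.
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// Parity floor quoted in the CI table.
const COSINE_FLOOR: f32 = 0.99917;

/// True when `actual` is at least COSINE_FLOOR-similar to `reference`.
/// Undefined similarity (None) counts as a failure.
fn passes_parity(actual: &[f32], reference: &[f32]) -> bool {
    matches!(cosine(actual, reference), Some(c) if c >= COSINE_FLOOR)
}
```

Treating an undefined similarity as a failure keeps a degenerate all-zero embedding from silently passing the gate.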
§Model files
The runtime expects the assets from
Findit-AI/indexer release models-siglip2-naflex-v1.
See models/MODELS.md for the download recipe.
§Cargo features
Defaults: ["inference", "bundled", "decoders"].
| Feature | Default | Effect |
|---|---|---|
| inference | ✅ | Pulls ort + tokenizers; activates ImageEncoder, TextEncoder, Siglip2. Native targets only. |
| bundled | ✅ | Embeds the 32.8 MB text-tower tokenizer.json via include_bytes! so Siglip2::bundled / TextEncoder::bundled_with_options work without a tokenizer file on disk. Implies inference. |
| decoders | ✅ | Activates image crate JPEG/PNG decoders. Without this, callers supply pre-decoded RGB pixels via ImageView. |
| serde | | Pulls serde + serde_json. Activates Serialize / Deserialize on Options, BatchOptions, ThreadOptions, LabeledScore (Serialize only), LabeledScoreOwned, plus Calibration::from_path / from_bytes and the Siglip2::from_files* constructors that load calibration.json. Embedding and Calibration deliberately do not derive serde. The bundled path (Siglip2::bundled / Calibration::bundled) does not need this feature; calibration is baked in at build time from models/siglip2/calibration.json. |
| cuda | | NVIDIA GPUs (Linux/Windows). Requires CUDA toolkit + cuDNN. Implies inference. |
| tensorrt | | NVIDIA, optimized inference. Falls back to CUDA, then CPU. Implies inference. |
| directml | | Windows GPUs (any vendor) via DirectX 12. Implies inference. |
| rocm | | AMD GPUs (Linux). Requires ROCm SDK. Implies inference. |
| coreml | | macOS / iOS via Core ML (Neural Engine + GPU + Metal). Implies inference. |
The execution-provider features are off by default — none are required
for CPU inference, and each requires its vendor SDK at build time.
Building with --features cuda (etc.) will fail on stock CI runners that
don’t have the SDK.
§Execution providers without a feature flag
If your deployment needs an EP that isn’t in the list above, build the
session yourself with the relevant ort EP enabled and pass it via
ImageEncoder::from_ort_session / TextEncoder::from_ort_session /
Siglip2::from_parts. ANE-on-Mac is an example: it requires explicit
opt-in via the coreml feature and Session::with_execution_providers
on the caller side.
§Target / feature contract
The inference family is native-only. ort (ONNX Runtime FFI) and
tokenizers (which transitively depends on onig_sys / esaxx-rs)
don’t build on wasm32-*. Building wasm with default features fails
deep in upstream C-toolchain code before this crate’s source is touched.
Wasm consumers must opt out:
cargo check --target wasm32-unknown-unknown --no-default-features

Without inference, the public surface is the preprocessing path
(Preprocessor, ImageView, PreprocessedBatch), the value types
(Embedding, Calibration, Options / BatchOptions / ThreadOptions,
LabeledScoreOwned, Error), and the SIMD primitives — useful for
browser / edge deployments that compute embeddings server-side and need
only the value types and similarity / decoding / preprocessing on the
client.
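For callers on this preprocessing-only surface who hand in pre-decoded pixels, the kind of validation a borrowed RGB view needs can be sketched as below. This is an assumption-laden illustration, not the crate's ImageView: the RgbView and ViewError names, the fields, and the tightly packed 3-bytes-per-pixel layout are all hypothetical.

```rust
/// Reasons a pixel buffer can be rejected. (Hypothetical, not the crate's Error.)
#[derive(Debug, PartialEq)]
enum ViewError {
    ZeroDimension,
    Overflow,
    LengthMismatch { expected: usize, actual: usize },
}

/// Borrowed, tightly packed RGB8 buffer. (Hypothetical stand-in for ImageView.)
#[derive(Debug)]
struct RgbView<'a> {
    pixels: &'a [u8],
    width: u32,
    height: u32,
}

impl<'a> RgbView<'a> {
    /// Validate dimensions against the buffer length before borrowing it.
    fn new(pixels: &'a [u8], width: u32, height: u32) -> Result<Self, ViewError> {
        if width == 0 || height == 0 {
            return Err(ViewError::ZeroDimension);
        }
        // checked_mul guards against overflow on 32-bit targets such as wasm32.
        let expected = (width as usize)
            .checked_mul(height as usize)
            .and_then(|n| n.checked_mul(3))
            .ok_or(ViewError::Overflow)?;
        if pixels.len() != expected {
            return Err(ViewError::LengthMismatch { expected, actual: pixels.len() });
        }
        Ok(Self { pixels, width, height })
    }

    /// RGB triple at (x, y), or None when the coordinate is out of bounds.
    fn pixel(&self, x: u32, y: u32) -> Option<[u8; 3]> {
        if x >= self.width || y >= self.height {
            return None;
        }
        let i = (y as usize * self.width as usize + x as usize) * 3;
        Some([self.pixels[i], self.pixels[i + 1], self.pixels[i + 2]])
    }
}
```

Validating once in the constructor means later indexing can rely on the length invariant instead of re-checking it per pixel.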
§License
MIT or Apache-2.0, at your option. The bundled tokenizer.json is derived
from google/siglip2-base-patch16-naflex (Apache-2.0); see
THIRD_PARTY_NOTICES.md.
Re-exports§
- pub use calibration::Calibration;
- pub use embedding::Embedding;
- pub use embedding::LabeledScore;
- pub use embedding::LabeledScoreOwned;
- pub use error::Error;
- pub use error::Result;
- pub use image_enc::ImageEncoder; (inference)
- pub use image_view::ImageView;
- pub use options::BatchOptions;
- pub use options::Options;
- pub use options::ThreadOptions;
- pub use preproc::PreprocessedBatch;
- pub use preproc::Preprocessor;
- pub use siglip2::Siglip2; (inference)
- pub use text_enc::TextEncoder; (inference)
Modules§
- calibration: Calibration, the sigmoid scale/bias for SigLIP2's calibrated probabilities.
- embedding: Embedding, LabeledScore[Owned].
- error: Error type for the full enum and its semantics.
- image_enc (inference): Image encoder. ImageView lives in crate::image_view (always compiled, used by the preprocessor on both wasm and native); this module is gated on feature = "inference" and provides the ORT-backed ImageEncoder.
- image_view: ImageView, a borrowed RGB pixel buffer with validating constructor.
- options: 6 for the full surface and rationale (defaults match the existing findit-siglip2-vision service's settings).
- preproc: Preprocessing pipeline. For the algorithm for the public Preprocessor API.
- siglip2 (inference): Siglip2 wrapper.
- text_enc (inference): Text encoder. 4 and §5.
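The calibration entry in the list above mentions a sigmoid scale/bias. A minimal sketch of how such a pair typically maps an image-text cosine similarity to a probability follows. It is an assumption, not the crate's Calibration type: the SigmoidCalibration name and its fields are hypothetical, the scale is assumed to be stored in linear (already exponentiated) form, and the numeric values in the usage note are made up for illustration.

```rust
/// Hypothetical scale/bias pair; not the crate's Calibration type.
#[derive(Debug, Clone, Copy)]
struct SigmoidCalibration {
    /// Multiplier applied to the cosine similarity (assumed linear, not log-space).
    scale: f32,
    /// Additive offset applied after scaling.
    bias: f32,
}

impl SigmoidCalibration {
    /// Map a cosine similarity in [-1, 1] to a probability in (0, 1)
    /// via sigmoid(scale * cosine + bias).
    fn probability(&self, cosine: f32) -> f32 {
        let logit = self.scale * cosine + self.bias;
        1.0 / (1.0 + (-logit).exp())
    }
}
```

With the made-up values scale = 10.0 and bias = -5.0, a cosine of 0.5 gives a logit of 0 and therefore a probability of 0.5; higher similarities map monotonically to higher probabilities.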
Enums§
- GraphOptimizationLevel (inference): ONNX Runtime provides various graph optimizations to improve performance. Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations.
Constants§
- BUNDLED_TOKENIZER (bundled): Raw bytes of the bundled google/siglip2-base-patch16-naflex tokenizer.json, embedded via include_bytes!. Used internally by the bundled constructors on TextEncoder and Siglip2; exposed publicly so callers who need to assemble a Tokenizer off the bundled JSON (for example, ahead of from_ort_session) can do so without round-tripping through disk.