# lfm

Rust ONNX inference for LiquidAI LFM2.5-VL — a 450M-parameter vision-language model with schema-constrained sampling via llguidance. Implements the engine-agnostic `llmtask::Task` contract, so any Task written against `llmtask` runs through `lfm` unchanged.
## Overview

`lfm` is the LiquidAI LFM2.5-VL inference engine, built on Rust + ONNX Runtime + llguidance:
- **`Engine`** — sync, single-threaded; built on `ort` 2.0. `Engine::run<T: Task<Value = serde_json::Value>>` accepts any `llmtask::Task` whose grammar is JSON Schema, Lark, or Regex; schema-constrained sampling is enforced by llguidance token-mask filtering. `Engine::generate` is the unconstrained path for free-form text.
- **`ImageAnalysisTask`** — built-in image-analysis preset that produces the canonical `llmtask::ImageAnalysis` output type, sharing the schema and parser with `qwen`.
- **Bundled assets** — the `bundled` feature ships LFM2.5-VL's tokenizer, chat template, and preprocessor configs as `include_bytes!`. `Engine::from_onnx_dir` then accepts a directory containing only the three ONNX graphs; no separate tokenizer download required.
- **Wasm-friendly preprocessing** — `preproc::Preprocessor`, `TileGrid`, and EXIF-aware decode helpers compile under `--no-default-features --features decoders` (no `ort`, no `tokenizers`).
## Why an `llmtask`-driven engine?

A bespoke `lfm::Task` would force every prompt + schema + parser to be rewritten against the next inference engine. Implementing `llmtask::Task` instead means the same Task code targets `lfm` (llguidance), `qwen` (mistralrs), or any future llmtask-compatible backend without modification — only the hardware backend selection differs.
```text
                            ┌──────────────────────────┐
YourTask: impl Task   ──▶   │  llmtask::Task contract  │   ──▶   lfm / qwen / …
                            │    prompt + Grammar      │
                            │    parse → Output        │
                            └──────────────────────────┘
```
Because `lfm`'s backend is llguidance, all three `llmtask::Grammar` variants (JSON Schema, Lark, Regex) are accepted — engines that only speak JSON Schema (e.g. `qwen`) reject the others via `UnsupportedGrammar`, and the caller can route to `lfm`.
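To make that concrete, here is a minimal sketch of a custom `Task`. The trait item names below (`prompt`, `grammar`, `parse`) and the error type are assumptions; this README only fixes the contract shape: prompt plus `Grammar` in, parsed `Output` out, with `Value = serde_json::Value` for `Engine::run`:

```rust
// Hedged sketch: trait item names and signatures are assumptions;
// only the contract shape comes from this README.
use llmtask::{Grammar, Task};

struct DominantColor;

impl Task for DominantColor {
    type Value = serde_json::Value; // required by Engine::run's bound
    type Output = String;

    fn prompt(&self) -> String {
        "Name the dominant color in the image, one lowercase word.".into()
    }

    fn grammar(&self) -> Grammar {
        // Regex is native to llguidance, so lfm accepts this Task.
        // A JSON-Schema-only engine (e.g. qwen) would return
        // UnsupportedGrammar, and the caller could fall back to lfm.
        Grammar::Regex("[a-z]+".into())
    }

    fn parse(&self, raw: &str) -> Result<Self::Output, String> {
        Ok(raw.trim().to_owned())
    }
}
```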
## Features

- **All three `Grammar` variants** — JSON Schema, Lark, and Regex are all native to llguidance, so any `llmtask::Task` runs through `Engine::run`. The HIR-anchored regex validator on the `Grammar` side matches engine semantics exactly (no substring vs. full-match drift).
- **Bundled tokenizer + configs** (`bundled` feature, default) — `Engine::from_onnx_dir` accepts an ONNX-only directory; the tokenizer, chat template, and preprocessor configs are embedded in the binary at compile time. `Engine::from_dir` is the strict constructor that byte-validates a supplied tokenizer + chat template against the bundled blobs to catch silent prompt-envelope drift.
- **Hybrid KV/conv-state cache decoder** — the LFM2 architecture has 10 conv-state layers and 6 attention layers at sparse indices; `decoder.rs` manages the non-contiguous cache layout transparently.
- **Wasm-friendly preprocessing** — drop the `inference` and `bundled` defaults to get a pure-CPU image-preprocessing surface (`Preprocessor`, `TileGrid`, EXIF-aware decode) usable from `wasm32-unknown-unknown`.
- **GPU acceleration** — `cuda`, `tensorrt`, `directml`, `rocm`, and `coreml` ORT execution providers are gated behind feature flags; none are required for CPU inference.
- **Admission-control DoS guards** — bounded request shape (max messages, max content parts), a text-size cap, an image-count lower bound from `min_image_tokens`, a header-time decoded-buffer cap, and a special-token denylist seeded from the live tokenizer's `added_vocabulary`. All run *before* any image decode or template render; a configuration sketch follows this list.
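A sketch of what tightening those guards could look like. Apart from `min_image_tokens`, every field name below is hypothetical; this README only states which limits exist, not how they are spelled:

```rust
// Hypothetical field names: only min_image_tokens appears in this README.
use lfm::RequestOptions;

let limits = RequestOptions {
    max_messages: 8,            // bounded request shape
    max_content_parts: 16,      // bounded request shape
    max_text_bytes: 64 * 1024,  // text-size cap
    min_image_tokens: 64,       // the image-count lower bound derives from this
    ..RequestOptions::default()
};
```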
## Examples

### From a HuggingFace download (`tokenizer.json` + configs in dir)
A minimal sketch of the strict path. Only the type and method names (`Engine::from_dir`, `Engine::generate`, `Options`, `SmolStr`) come from this crate's surface; the exact signatures are assumptions:
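```rust
use lfm::{Engine, Options};
use smol_str::SmolStr;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Directory holding the three ONNX graphs plus tokenizer.json and
    // the JSON configs from the HuggingFace download.
    let dir = std::env::var("LFM_MODEL_PATH")?;

    // Strict constructor: byte-validates the supplied tokenizer + chat
    // template against the bundled blobs.
    let mut engine = Engine::from_dir(&dir, Options::default())?;

    // Unconstrained free-form generation (illustrative signature).
    let image = std::fs::read("photo.jpg")?;
    let text = engine.generate(SmolStr::new("Describe this image."), &[image])?;
    println!("{text}");
    Ok(())
}
```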
### ONNX-only dir + bundled tokenizer

If you've downloaded just the ONNX files (not `tokenizer.json` and the JSON configs), use `Engine::from_onnx_dir`. The tokenizer + configs are embedded in the binary and written to a temp file on first use.
A sketch under the same caveats as above: the names are real, the signatures are assumptions:
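```rust
use lfm::{Engine, Options};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Directory containing only vision_encoder.onnx, embed_tokens.onnx,
    // and decoder_model_merged.onnx. The tokenizer + configs are embedded
    // in the binary and materialized to a temp file on first use.
    let dir = std::env::var("LFM_MODEL_PATH")?;
    let _engine = Engine::from_onnx_dir(&dir, Options::default())?;
    // Generation then proceeds exactly as with Engine::from_dir above.
    Ok(())
}
```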
### Structured output via the `ImageAnalysisTask` preset
A sketch of the constrained path; the `ImageAnalysisTask` constructor argument and the `Engine::run` call shape are assumptions:
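```rust
use lfm::{Engine, ImageAnalysisTask, Options};
use smol_str::SmolStr;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut engine =
        Engine::from_onnx_dir(std::env::var("LFM_MODEL_PATH")?, Options::default())?;

    // Built-in preset: prompt + JSON-Schema grammar + parser that yields
    // the canonical llmtask::ImageAnalysis output type.
    let task = ImageAnalysisTask::new(SmolStr::new("product photo")); // hypothetical ctor arg

    // Engine::run enforces the Task's Grammar via llguidance token masks,
    // then hands the completion to the Task's parser (illustrative signature).
    let image = std::fs::read("photo.jpg")?;
    let analysis = engine.run(&task, &[image])?;
    println!("{analysis:?}");
    Ok(())
}
```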
## Installation

```toml
[dependencies]
lfm = "0.1"
```
Download the ONNX artifacts from `LiquidAI/LFM2.5-VL-450M-ONNX` on HuggingFace and set `LFM_MODEL_PATH` to the directory containing them:
```text
vision_encoder.onnx
embed_tokens.onnx
decoder_model_merged.onnx
tokenizer.json            (optional — bundled if absent and the `bundled` feature is on)
```
## Cargo features

Defaults: `["inference", "bundled", "decoders"]`.

| Feature | Default | What it adds |
|---|---|---|
| `inference` | yes | Pulls `ort`, `tokenizers`, `llguidance`, `minijinja`. Activates `Engine`. Native targets only. |
| `bundled` | yes | Embeds `tokenizer.json` + JSON configs (~4.5 MB) at compile time; adds `Engine::from_onnx_dir`. Implies `inference`. |
| `decoders` | yes | Activates JPEG/PNG decoding via the `image` crate. |
| `serde` | no | `Serialize`/`Deserialize` on `Options`, `RequestOptions`, `ThreadOptions`, `ImageBudget`. |
| `cuda` | no | NVIDIA GPUs (Linux / Windows). Requires CUDA toolkit + cuDNN. Implies `inference`. |
| `tensorrt` | no | NVIDIA, optimized inference. Falls back to CUDA, then CPU. Implies `inference`. |
| `directml` | no | Windows GPUs (any vendor) via DirectX 12. Implies `inference`. |
| `rocm` | no | AMD GPUs (Linux). Requires ROCm SDK. Implies `inference`. |
| `coreml` | no | macOS / iOS via Core ML (Neural Engine + GPU + Metal). Implies `inference`. |
| `integration` | no | Enables the integration test (`tests/integration.rs`). Requires `LFM_MODEL_PATH`. |
GPU execution-provider features are off by default — none are required for CPU inference, and each requires its vendor SDK at build time.
## Wasm / preprocessing-only build

The public surface under `--no-default-features --features decoders` is `preproc::Preprocessor`, `preproc::TileGrid`, `preproc::PreprocessedImage`, `preproc::decode_bytes_with_orientation`, `options::*`, and `error::*`.
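A sketch of that surface; the item names are listed above, but the signatures and the `Preprocessor` method are assumptions:

```rust
// Compile with: --no-default-features --features decoders
// Item names come from this README; signatures are assumptions.
use lfm::preproc::{decode_bytes_with_orientation, Preprocessor};

fn preprocess(jpeg: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    // EXIF-aware decode: applies the orientation tag before tiling.
    let image = decode_bytes_with_orientation(jpeg)?;
    // Tile + normalize into the tensors the vision encoder expects
    // (hypothetical method name).
    let _preprocessed = Preprocessor::default().run(&image)?;
    Ok(())
}
```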
## Architecture
Per-image vision encoding → text+image embedding splice → hybrid KV/conv cache decoder loop → optional schema-constrained sampling.
| Graph | Role | Size |
|---|---|---|
| `vision_encoder.onnx` | SigLIP2 image encoder — single image per call | ~86M params |
| `embed_tokens.onnx` | Token embedding lookup table | — |
| `decoder_model_merged.onnx` | LFM2 hybrid LM: 10 conv-state + 6 KV-attn layers (sparse cache) | ~350M params |
The decoder manages a sparse hybrid cache: conv-state layers store recurrent state (not KV pairs), so cache indices are non-contiguous. Schema-constrained sampling is handled by llguidance masking the logits at each step to enforce the `Grammar` from the `Task`.
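The masking step itself is small; an illustrative filter (not the crate's actual code):

```rust
// llguidance yields a per-step allow-mask over the vocabulary; disallowed
// logits are forced to -inf so sampling can only pick tokens that keep the
// output inside the Grammar.
fn apply_token_mask(logits: &mut [f32], allowed: &[bool]) {
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}
```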
**Multi-image note:** the vision encoder accepts one image per call, and batched multi-image calls produce silently wrong embeddings, so `Engine::generate`/`run` iterate per image and concatenate the flat `image_features` outputs in source order.
## MSRV

Rust 1.95.
## License

`lfm` is dual-licensed under the MIT license and the Apache License, Version 2.0.
The LFM2.5-VL model weights this crate runs are governed by the LFM Open License v1.0. Verify your use case complies with Liquid AI's terms separately from this crate's license.
Copyright (c) 2026 FinDIT Studio authors.