Qwen3-VL-2B structured-output inference engine — async, mistralrs-backed, JSON-Schema-constrained. Implements the engine-agnostic llmtask::Task contract so the same prompt + schema + parser runs on any llmtask-compatible backend (lfm, qwen3-vl, …) without translation.
Overview
qwen3-vl runs the Qwen3-VL-2B-Instruct vision-language model through mistralrs with JSON-Schema-constrained sampling. It implements the engine-agnostic llmtask::Task contract — so any Task written against llmtask runs through qwen3-vl unchanged, and the public API stays backend-pluggable.
- Engine: async, mistralrs-backed Qwen3-VL inference. Engine::run<T: Task<Value = serde_json::Value>> accepts any JSON-schema task; the result is decoded by the task's parse impl.
- ImageAnalysisTask: built-in image-analysis preset (single-image VLM scene description). Owns the prompt, the JSON schema, and the resilient parser ported from the legacy findit-qwen service. Produces the canonical llmtask::ImageAnalysis output type.
- CPU by default, opt-in GPU: qwen3-vl does not re-export mistralrs's hardware-backend features. Consumers depend on mistralrs directly with the desired backend (metal / cuda / cudnn / …); Cargo unifies feature sets and qwen3-vl picks up the selection.
Why an llmtask-driven engine?
A bespoke qwen3_vl::Task would force every prompt + schema + parser to be rewritten against the next inference engine. Implementing llmtask::Task instead means the same Task code targets qwen3-vl (mistralrs), lfm (llguidance), or any future llmtask-compatible backend without modification — only the hardware backend selection differs.
                        ┌──────────────────────────┐
YourTask: impl Task ──▶ │  llmtask::Task contract  │ ──▶ qwen3-vl / lfm / …
                        │  prompt + Grammar        │
                        │  parse → Output          │
                        └──────────────────────────┘
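
The idea in the diagram can be sketched with two small traits. This is a minimal, std-only illustration: the `Task` and `Backend` traits, `CountTask`, and the two mock backends below are hypothetical stand-ins, not the real llmtask or qwen3-vl API, whose signatures are richer (async, images, typed grammars).

```rust
// Hypothetical stand-ins for illustration only; the real llmtask::Task
// contract and the real engines are not reproduced here.

/// A task owns its prompt, its output grammar, and its parser.
trait Task {
    type Output;
    fn prompt(&self) -> String;
    /// Kept as plain text in this sketch (e.g. a JSON Schema document).
    fn grammar(&self) -> String;
    fn parse(&self, raw: &str) -> Result<Self::Output, String>;
}

/// A backend only has to turn (prompt, grammar) into raw model text.
trait Backend {
    fn generate(&self, prompt: &str, grammar: &str) -> String;
}

/// Generic driver: the same code path for every backend.
fn run<T: Task, B: Backend>(task: &T, backend: &B) -> Result<T::Output, String> {
    let raw = backend.generate(&task.prompt(), &task.grammar());
    task.parse(&raw)
}

/// Example task: ask for an integer and parse it.
struct CountTask;

impl Task for CountTask {
    type Output = u32;
    fn prompt(&self) -> String {
        "How many objects are visible?".to_string()
    }
    fn grammar(&self) -> String {
        r#"{"type":"integer"}"#.to_string()
    }
    fn parse(&self, raw: &str) -> Result<u32, String> {
        raw.trim().parse::<u32>().map_err(|e| e.to_string())
    }
}

/// Two mock backends that return canned text instead of running a model.
struct MockQwen;
struct MockLfm;

impl Backend for MockQwen {
    fn generate(&self, _prompt: &str, _grammar: &str) -> String {
        "3".to_string()
    }
}

impl Backend for MockLfm {
    fn generate(&self, _prompt: &str, _grammar: &str) -> String {
        " 3\n".to_string()
    }
}

fn main() {
    let task = CountTask;
    // The task code is unchanged; only the backend value differs.
    println!("{:?}", run(&task, &MockQwen));
    println!("{:?}", run(&task, &MockLfm));
}
```

Because the task carries its own parser, differences in raw backend output (here, stray whitespace) are absorbed in one place rather than in per-engine glue code.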
Features
- Async, single-engine inference: Engine::run(&task, images).await. No built-in cancellation token; wrap with tokio::time::timeout or tokio::select!.
- Bounded inference timeout: every Engine::run is wrapped in tokio::time::timeout(EngineOptions::inference_timeout) (default 300 s). A stuck model (Metal JIT stall, GPU memory exhaustion) surfaces as Error::InferenceTimeout instead of blocking the caller indefinitely.
- finish_reason discipline: mistralrs's Choice::finish_reason != "stop" (e.g. "length", "model_length") is surfaced as Error::Truncated before the parser runs, so partial JSON can never silently land in a downstream search index.
- Sampler-options validation: RequestOptions::validate rejects out-of-range values (negative temperature, top_p > 1.0, top_k = 0) at the engine boundary instead of hitting undefined behavior inside mistralrs's sampler.
- Resilient JSON parser (ImageAnalysisTask): TagList / DetectionLabels accept list-or-string forms; #[serde(deny_unknown_fields)] on the schema struct; required arrays set to null are rejected (not coerced to empty); an indexable-content gate surfaces decoder/model regressions as JsonParseError::NoUsableFields by default.
- Indexing-safe greedy default: EngineOptions::new embeds RequestOptions::deterministic() (greedy, temperature = 0.0) so retries / timeouts / backfills produce bit-stable ImageAnalysis across runs. Swap to the model-card stochastic sampler with .with_request(RequestOptions::new()) or Engine::run_with.
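
Two of the checks above, sampler validation and the finish_reason gate, can be sketched as follows. This is a std-only illustration under stated assumptions: `SamplerOptions`, `SketchError`, and `gate_output` are hypothetical names, and the `top_p` / `top_k` values chosen for the greedy preset are assumptions. The crate's actual RequestOptions and Error types may differ.

```rust
// Hypothetical types: names, fields, and messages are illustrative and are
// not copied from the crate.

#[derive(Debug, Clone, PartialEq)]
struct SamplerOptions {
    temperature: f64,
    top_p: f64,
    top_k: usize,
}

#[derive(Debug, PartialEq)]
enum SketchError {
    InvalidOption(&'static str),
    Truncated(String),
}

impl SamplerOptions {
    /// Greedy preset: temperature 0.0. The top_p / top_k values here are
    /// assumptions made for this sketch.
    fn deterministic() -> Self {
        Self { temperature: 0.0, top_p: 1.0, top_k: 1 }
    }

    /// Reject out-of-range values before they reach the sampler.
    fn validate(&self) -> Result<(), SketchError> {
        if self.temperature.is_nan() || self.temperature < 0.0 {
            return Err(SketchError::InvalidOption("temperature must be >= 0"));
        }
        if self.top_p.is_nan() || self.top_p > 1.0 {
            return Err(SketchError::InvalidOption("top_p must be <= 1.0"));
        }
        if self.top_k == 0 {
            return Err(SketchError::InvalidOption("top_k must be >= 1"));
        }
        Ok(())
    }
}

/// Pass the raw text on to the parser only when generation ended normally.
/// Any other finish reason is reported as truncation, so the parser never
/// sees partial JSON.
fn gate_output<'a>(finish_reason: &str, raw: &'a str) -> Result<&'a str, SketchError> {
    if finish_reason == "stop" {
        Ok(raw)
    } else {
        Err(SketchError::Truncated(finish_reason.to_string()))
    }
}

fn main() {
    println!("{:?}", SamplerOptions::deterministic().validate());

    let bad = SamplerOptions { temperature: -0.1, top_p: 0.9, top_k: 40 };
    println!("{:?}", bad.validate());

    println!("{:?}", gate_output("stop", r#"{"tags":[]}"#));
    println!("{:?}", gate_output("length", r#"{"tags":["#));
}
```

Running both checks at the engine boundary keeps failures typed and early: a bad option never reaches the sampler, and a truncated completion never reaches the parser.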
Example
use ;
async
Engine::run consumes Vec<DynamicImage> because mistralrs 0.8's MultimodalMessages::add_image_message takes the vec by value — borrowing would force a silent .to_vec() clone of decoded image data.
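
The ownership trade-off can be illustrated without mistralrs. In this std-only sketch, `Image`, `downstream`, `run_owned`, and `run_borrowed` are hypothetical stand-ins: `downstream` plays the role of an API that takes the vec by value, and `Image` counts its own clones so the hidden copy becomes visible.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a decoded image; it records how often it is cloned.
struct Image {
    pixels: Vec<u8>,
    clones: Rc<Cell<usize>>,
}

impl Clone for Image {
    fn clone(&self) -> Self {
        // Count the clone, then copy the pixel buffer.
        self.clones.set(self.clones.get() + 1);
        Image {
            pixels: self.pixels.clone(),
            clones: Rc::clone(&self.clones),
        }
    }
}

/// Stand-in for a downstream API that takes the vec by value.
fn downstream(images: Vec<Image>) -> usize {
    images.iter().map(|img| img.pixels.len()).sum()
}

/// By-value entry point: the vec is moved straight through, no copies.
fn run_owned(images: Vec<Image>) -> usize {
    downstream(images)
}

/// Borrowing entry point: it must clone every image to satisfy `downstream`.
fn run_borrowed(images: &[Image]) -> usize {
    downstream(images.to_vec())
}

/// Build two small images that share one clone counter.
fn make_images(counter: &Rc<Cell<usize>>) -> Vec<Image> {
    (0..2)
        .map(|_| Image { pixels: vec![0u8; 4], clones: Rc::clone(counter) })
        .collect()
}

fn main() {
    let counter = Rc::new(Cell::new(0));

    let bytes = run_owned(make_images(&counter));
    println!("owned: {} bytes, {} clones", bytes, counter.get());

    let images = make_images(&counter);
    let bytes = run_borrowed(&images);
    println!("borrowed: {} bytes, {} clones", bytes, counter.get());
}
```

With a by-value signature, a caller who still needs the images afterwards clones explicitly at the call site, so the cost stays visible instead of being paid inside the engine on every call.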
Per-call sampler override
# use ;
# async #
Installation
[dependencies]
qwen3-vl = "0.1"
use ;
Hardware backend selection
Default features are CPU-only — qwen3-vl builds out of the box on every host mistralrs supports. To enable a hardware backend (Metal, CUDA, etc.), depend on mistralrs directly and select its feature; Cargo unifies feature sets across all references to the same crate, so qwen3-vl automatically picks up your selection:
[dependencies]
qwen3-vl = "0.1"
# Pick at most one primary GPU backend; accelerated BLAS / cuDNN /
# NCCL / flash-attn options layer on top.
mistralrs = { version = "0.8", features = ["metal"] } # Apple Metal
# mistralrs = { version = "0.8", features = ["cuda"] } # NVIDIA CUDA
# mistralrs = { version = "0.8", features = ["accelerate"] } # Apple Accelerate BLAS (CPU)
The full backend matrix mistralrs supports: metal, cuda, cudnn, flash-attn, accelerate, mkl, nccl, ring. Each may require an external toolchain (Xcode Command Line Tools for metal / accelerate, the CUDA toolkit for cuda, etc.) — see the mistralrs README for prerequisites.
Cargo features
| Feature | Default | What it adds |
|---|---|---|
| integration | no | Enables tests/integration_scene.rs (needs QWEN_MODEL_PATH and ~4 GB of weights) |
| trace-output | no | Logs raw model output at tracing::trace level; heavyweight, for debugging only |
MSRV
Rust 1.95.
License
qwen3-vl is dual-licensed under the MIT license and the Apache License, Version 2.0.
The Qwen3-VL model weights this crate runs are governed by their own license — see the model card for terms.
Copyright (c) 2026 FinDIT Studio authors.