rlx-locateanything
NVIDIA LocateAnything-3B on RLX — MoonViT-SO-400M + MLP projector + Qwen2.5-3B-Instruct with MTP box decoding.
Setup
Hugging Face cache (recommended) — weights live under ~/.cache/huggingface/hub/ (or $HF_HOME):
Speed
This is a 3B VLM (299 vision tokens at 640px side + hybrid MTP decode). Expect minutes on CPU.
| Cause | Mitigation |
|---|---|
| Default build is CPU-only (just locateanything) | just locateanything-demo uses smaller image + fewer tokens |
| First run compiles MoonViT + LM graphs | Skip --warmup for one-shot; reuse session for many images |
| Large images → many patches | --max-image-side 384 (or 320) |
| Long decode | --max-tokens 32, --generation-mode fast |
| OOM on WGPU / multi-backend bench | Mmap LM + prompt-only embed rows; WGPU uses one KV bucket (1024 cap) and GPU-resident KV (no per-step host mirror). Bench: just bench-locateanything-backends (one subprocess per backend). Single backend: --device wgpu --no-isolate. |
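To see why --max-image-side helps, here is a rough sketch of how the vision token count scales with image size. It assumes 14 px patches and a 2×2 patch merge for MoonViT (neither is stated in this README), and the function names are illustrative, not part of the crate. Under those assumptions the 640×360 sample yields the 299 tokens quoted above.

```rust
/// Assumed MoonViT geometry (not stated in this README): 14 px patches, 2x2 merge.
const PATCH: u32 = 14;
const MERGE: u32 = 2;

/// Round `n` up to the next multiple of `m`.
fn round_up(n: u32, m: u32) -> u32 {
    n.div_ceil(m) * m
}

/// Estimated vision tokens for a `width` x `height` image.
fn vision_tokens(width: u32, height: u32) -> u32 {
    // Patches per side, padded so each side divides evenly into merge groups.
    let cols = round_up(width.div_ceil(PATCH), MERGE);
    let rows = round_up(height.div_ceil(PATCH), MERGE);
    (cols / MERGE) * (rows / MERGE)
}

/// Scale so the longest edge is at most `max_side`, keeping the aspect ratio.
/// The rounding rule here is an assumption; the crate may round differently.
fn cap_longest_side(width: u32, height: u32, max_side: u32) -> (u32, u32) {
    let longest = width.max(height);
    if longest <= max_side {
        return (width, height);
    }
    let scale = max_side as f64 / longest as f64;
    let w = ((width as f64 * scale).round() as u32).max(1);
    let h = ((height as f64 * scale).round() as u32).max(1);
    (w, h)
}
```

With this estimate, capping the sample at 384 px gives a 384×216 image and roughly 112 tokens, and capping at 320 px gives roughly 84, so both vision encode and LM prefill shrink accordingly.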
GPU backends: build with --features apple-silicon (Metal + MLX + wgpu), cuda, or all-backends. just locateanything-demo picks apple-silicon on macOS and nvidia-gpu on Linux when available. MoonViT uses decomposed 2D RoPE on MLX/wgpu/Vulkan/Metal. Until per-backend GPU parity lands, vision + LM graphs run on the CPU host for every non-CPU --device (Metal/MLX/CUDA/WGPU/Vulkan) so grounding output matches CPU/HF; the CLI still reports your chosen device. GPU-resident KV is disabled while on the CPU host path.
Setup
# RLX (writes into HF cache via hf-hub)
# Or Hugging Face CLI
# Or rlx-locateanything binary
No env var required for Rust/CLI: [default_model_dir()] / [LocateAnythingSession::open_default()] resolve the cached snapshot.
Optional override:
# explicit dir
# cache root
Legacy layout (still supported): .cache/locateanything/LocateAnything-3B from an older fetch. Run just fetch-locateanything-tokenizer if tokenizer.json is missing (processor prompts).
Sample image
One JPEG ships with this crate:
| Path | API |
|---|---|
| fixtures/sample.jpg | rlx_locateanything::fixtures::sample_image_path() |
640×360 RGB (from yolov5 zidane.jpg). Used by the CLI (when --image is omitted), examples, and HF *_real parity probes.
Override:
Fastest path (CLI)
No image path needed — uses the bundled sample:
Equivalent:
GPU build:
CLI flags
| Flag | Default | Notes |
|---|---|---|
| --model-dir | (required) | Directory with config.json + safetensors |
| --image | fixtures/sample.jpg | Any JPEG/PNG |
| --device | auto | cpu, metal, cuda, … or RLX_DEVICE |
| --task | ground-single | See prompts module |
| --phrase | object | Target text for ground-* tasks |
| --processor-prompt | on | HF processor layout (boxes) |
| --rlx-prompt | off | RLX Qwen chat layout |
| --generation-mode | hybrid | fast / slow / hybrid |
| --max-tokens | 64 | New tokens cap |
| --temperature | 0 | Greedy grounding |
| --max-image-side | none | Resize longest edge before patchify |
| --warmup | off | Compile vision + LM prefill first |
| --preload-lm | off | Load LM at session open |
| --dry | | Validate checkpoint only |
Environment:
| Variable | Purpose |
|---|---|
| RLX_LOCATEANYTHING_DIR | Checkpoint directory (overrides HF cache) |
| HF_HOME / HUGGINGFACE_HUB_CACHE | Hugging Face cache root |
| RLX_LOCATEANYTHING_IMAGE | Replace bundled sample |
| RLX_DEVICE | Default device when --device omitted |
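The checkpoint lookup these variables describe can be sketched as a small resolution function. This is an illustration only: the function names and the exact precedence between HUGGINGFACE_HUB_CACHE and HF_HOME are assumptions, and the crate's own default_model_dir() may behave differently. The models--nvidia--LocateAnything-3B folder name follows the standard Hugging Face Hub cache layout, where actual files sit under snapshots/<revision>.

```rust
use std::path::PathBuf;

/// Hub cache root: explicit hub cache, else `$HF_HOME/hub`, else `~/.cache/huggingface/hub`.
/// (Precedence is an assumption modelled on the Hugging Face defaults.)
fn hf_hub_cache(hub_cache: Option<&str>, hf_home: Option<&str>, home: &str) -> PathBuf {
    if let Some(dir) = hub_cache {
        return PathBuf::from(dir);
    }
    if let Some(dir) = hf_home {
        return PathBuf::from(dir).join("hub");
    }
    PathBuf::from(home).join(".cache").join("huggingface").join("hub")
}

/// Hypothetical helper: where to look for the checkpoint.
/// `explicit` stands in for RLX_LOCATEANYTHING_DIR and wins over the HF cache.
fn checkpoint_root(
    explicit: Option<&str>,
    hub_cache: Option<&str>,
    hf_home: Option<&str>,
    home: &str,
) -> PathBuf {
    match explicit {
        Some(dir) => PathBuf::from(dir),
        // Repo folder in the HF cache; snapshots live under `snapshots/<revision>`.
        None => hf_hub_cache(hub_cache, hf_home, home).join("models--nvidia--LocateAnything-3B"),
    }
}
```

Passing the environment values in as arguments, instead of reading std::env::var inside the function, keeps the sketch easy to test without mutating process state.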
Rust API
High-level session (recommended):
use rlx_locateanything::infer::LocateAnythingSession;
// HF Hub cache (after `huggingface-cli download nvidia/LocateAnything-3B`)
let mut session = LocateAnythingSession::open_default()?;
let out = session.ground_path(/* … */)?;
for b in &out.boxes { /* … */ }
Hub id or filesystem path also work:
let mut session = LocateAnythingSession::open("nvidia/LocateAnything-3B")?;
// let mut session = LocateAnythingSession::open("/path/to/snapshot")?;
With options:
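use rlx_locateanything::infer::{InferenceOptions, LocateAnythingSession, PromptStyle};
let opts = InferenceOptions::for_grounding(/* … */)
    .device(/* … */)
    .max_image_side(/* … */)
    .prompt_style(/* … */);
let mut session = LocateAnythingSession::open_with_options(/* … */)?;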
Lower-level runner: LocateAnythingRunner::builder() → generate(), encode_vision_cached(), warmup_compile().
Modules
| Module | Role |
|---|---|
| infer | LocateAnythingSession, InferenceOptions, PromptStyle |
| hub | default_model_dir(), hf_snapshot_dir(), default_hf_cache_dir() |
| fixtures | sample_image_path(), probe_image_path() |
| device | resolve_device, pick_auto_device |
| runner | Vision + LM generation, compile caches |
| preprocess | Native-resolution patchify (in_token_limit from JSON) |
| processor_prompt | HF LocateAnythingProcessor token layout |
| parse | <box>, <ref> output parsing |
| prompts | Task strings (ground_single, detect, …) |
| generation / mtp | slow / hybrid / fast decode |
Example binary
Uses fixtures/sample.jpg unless --image is passed.
Validation
Crate tests on the sample (need RLX_LOCATEANYTHING_DIR):
RLX_LOCATEANYTHING_DIR= \
Backends
| Feature set | Devices |
|---|---|
| metal | Apple GPU (Metal) |
| mlx | Apple MLX |
| cuda / rocm | NVIDIA / AMD |
| gpu / vulkan | wgpu (Vulkan) |
| apple-silicon | Metal + MLX + wgpu |
| all-backends | All of the above |
--device auto (or RLX_DEVICE) picks the first compiled backend: CUDA → Metal → MLX → ROCm → wgpu → Vulkan → CPU.
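The auto-selection order can be pictured as a scan over a fixed priority list. The sketch below is illustrative only: the Backend enum and pick_auto function are hypothetical stand-ins, not the crate's device module, and the real pick_auto_device may also probe hardware availability at runtime.

```rust
/// Hypothetical backend tag; the crate's own device type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cuda,
    Metal,
    Mlx,
    Rocm,
    Wgpu,
    Vulkan,
    Cpu,
}

/// Priority order quoted in this README for `--device auto`.
const AUTO_ORDER: [Backend; 7] = [
    Backend::Cuda,
    Backend::Metal,
    Backend::Mlx,
    Backend::Rocm,
    Backend::Wgpu,
    Backend::Vulkan,
    Backend::Cpu,
];

/// Return the highest-priority backend present in `compiled`, falling back to CPU.
fn pick_auto(compiled: &[Backend]) -> Backend {
    AUTO_ORDER
        .iter()
        .copied()
        .find(|b| compiled.contains(b))
        .unwrap_or(Backend::Cpu)
}
```

For example, a build with the apple-silicon feature set (Metal, MLX, wgpu) would resolve to Metal under this ordering, since Metal precedes MLX and wgpu in the list.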
MoonViT, projector, and Qwen2.5 LM graphs compile per device; caches live on MoonVitCache and LmSessionCaches. See compile_support for per-backend compile options.
HF reference
The HF checkpoint uses custom transformers code (trust_remote_code=True). Task strings and generation modes match the model card (rlx_locateanything::prompts, GenerationMode).
Parity harness: crates/rlx-models/tests/locateanything_hf_parity.rs + locateanything_parity_helpers/hf_reference.py. Set RLX_LOCATEANYTHING_PYTHON to a venv with transformers, torch, safetensors.
Architecture notes
- Preprocess: rescale when patch count exceeds preprocessor_config.json in_token_limit (25600), then pad to merge kernel grid.
- MTP: incremental prefill on trailing block_size window; hybrid falls back to AR on decode errors.
- Coordinates: model emits integers in [0, 1000]; parse_grounding maps to pixel space using PreprocessedImage.pixel_w / pixel_h.
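The coordinate mapping in the last note amounts to a linear rescale from the model's [0, 1000] grid to image pixels. A minimal sketch, assuming an (x1, y1, x2, y2) corner order and using hypothetical NormBox / PixelBox types that are not the crate's own; parse_grounding may differ in rounding and clamping details:

```rust
/// Box in the model's integer space; each value is expected in [0, 1000].
/// The (x1, y1, x2, y2) corner order is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NormBox {
    x1: u32,
    y1: u32,
    x2: u32,
    y2: u32,
}

/// Same box in pixel units of the preprocessed image.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PixelBox {
    x1: f32,
    y1: f32,
    x2: f32,
    y2: f32,
}

/// Map one normalized coordinate onto an axis of length `dim` pixels.
fn scale(v: u32, dim: u32) -> f32 {
    // Clamp out-of-range values, then multiply before dividing to limit rounding error.
    v.min(1000) as f32 * dim as f32 / 1000.0
}

/// Convert a normalized box to pixel space given the image size.
fn to_pixels(b: NormBox, pixel_w: u32, pixel_h: u32) -> PixelBox {
    PixelBox {
        x1: scale(b.x1, pixel_w),
        y1: scale(b.y1, pixel_h),
        x2: scale(b.x2, pixel_w),
        y2: scale(b.y2, pixel_h),
    }
}
```

On the 640×360 sample, a box of (250, 250, 500, 1000) would map to pixel corners (160, 90) and (320, 360) under these assumptions.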