rlx-models

Concrete model graph builders + weight loaders for RLX — the "what actually runs" layer.

Standalone repo: github.com/MIT-RLX/rlx-models. Clone next to rlx:

rlx-workspace/
  rlx/          # github.com/MIT-RLX/rlx
  rlx-models/   # github.com/MIT-RLX/rlx-models
  candle/       # optional, for parity-candle

git clone https://github.com/MIT-RLX/rlx.git
git clone https://github.com/MIT-RLX/rlx-models.git
cd rlx-models && cargo test -p rlx-models

The RLX monorepo lists ../rlx-models/crates/rlx-models as a workspace member, so the same tests also run from the rlx checkout: cd rlx && cargo test -p rlx-models.

Agent-oriented quick reference: AGENTS.md.

Architecture
Running models
What's here
Install
Quickstart — embeddings
High-level runner API
Adding a new model
Compile profiles
Qwen3
MiniCPM5
Qwen3-TTS
Voxtral TTS
VAD (Earshot + Silero)
Build and test
Status
Gotchas
Per-crate READMEs
License

Architecture

This repo is a Cargo workspace: one library crate per model family under crates/, plus shared infrastructure. The rlx-models package is a thin facade that re-exports historical paths (rlx_models::qwen3, rlx_models::sam, …).

rlx-models/
├── Cargo.toml              # workspace members + [workspace.dependencies]
├── justfile                # shortcuts (optional)
├── crates/
│   ├── rlx-models-core/    # config, weight_map, flow_bridge (package `rlx-core`)
│   ├── rlx-ssm/            # SSM flow stages + custom ops (Mamba, LFM, …)
│   ├── rlx-cli/            # shared CLI + rlx-inspect
│   ├── rlx-<model>/        # one crate per family
│   └── rlx-models/         # facade + optional rlx-run multiplexer
└── crates/rlx-models/examples/   # integration templates

Crates

Crate	Model / role
`rlx-models-core` (`rlx-core`)	config, `weight_map`, `weight_loader`, `flow_bridge`, `flow_util`
`rlx-ssm`	SSM flow stages (`MambaScanStage`, decode-step custom ops)
`rlx-mamba`	Mamba1 block + multi-backend driver
`rlx-bert`	BERT
`rlx-nomic`	NomicBERT
`rlx-vision`	NomicVision
`rlx-dinov2`	DINOv2
`rlx-embed`	embedding runtime
`rlx-sam` / `sam2` / `sam3`	SAM family
`rlx-sam-ir`	shared mask-decoder IR
`rlx-qwen3`	Qwen3 LM
`rlx-qwen35`	Qwen3.5 / 3.6
`rlx-llama32`	LLaMA 3.2
`rlx-minicpm5`	MiniCPM5 (Llama-shaped; openbmb/MiniCPM5-1B)
`rlx-gemma`	Gemma / Gemma 2
`rlx-llada2`	LLaDA2 + TIDE offload
`rlx-flux2`	FLUX.2
`rlx-vjepa2`	V-JEPA 2
`rlx-wav2vec2-bert`	Wav2Vec2-BERT
`rlx-whisper`	OpenAI Whisper ASR
`rlx-vad`	Earshot + Silero VAD (embedded weights, 16 kHz)
`rlx-voxtral`	Mistral Voxtral speech LM
`rlx-voxtral-tts`	Voxtral-4B-TTS inference (codec + Ministral LM)
`rlx-voxtral-tts-train`	Native RLX voice-clone training (encoder + LoRA)
`rlx-qwen3-tts`	Qwen3-TTS — voice clone + CustomVoice TTS, progressive streaming, duplex voice chat (Whisper + Qwen3 LM). JFK samples + roundtrip audio ship in the crate.
`rlx-locateanything`	NVIDIA LocateAnything-3B VLM (grounding)
`rlx-cli`	shared CLI helpers + `rlx-inspect`
`rlx-models`	facade (re-exports) + optional `rlx-run` multiplexer

How to depend

Goal	Depend on
One model only (fast builds)	`rlx-qwen3`, `rlx-sam3`, …
Stable `rlx_models::qwen3` paths	`rlx-models` facade
CLI / inspect only	`rlx-cli`

New code that only needs Qwen3 should depend on rlx-qwen3 directly.

Per-crate binaries

Each model crate with a CLI has src/cli.rs (pub fn run) and src/bin/rlx-<name>.rs. Shared flag parsing lives in rlx-cli.

rlx-run (in rlx-models) is an optional multiplexer over all built-in CLIs. Prefer per-crate binaries when you only need one family — they link less and compile faster.

SAM unified runner: SamRunner (SAM1/2/3) stays on the facade (rlx-models/src/sam_runner.rs) because rlx-sam2 depends on rlx-sam. Per-arch CLIs are on rlx-sam, rlx-sam2, rlx-sam3.

Published rlx* crates (rlx-runtime, rlx-flow, …) are pinned at 0.2.4 in the root [workspace.dependencies]; every crate inherits them with { workspace = true }. For local development against a sibling ../rlx checkout, run cp .cargo/config.toml.example .cargo/config.toml (the copy is gitignored and carries the patches). Publishing and CI use crates.io only: no .cargo/config.toml, and no [patch.crates-io] in any committed Cargo.toml.

Running models

just (shortcuts)

Install just (brew install just). From the repo root:

just                          # list recipes
just qwen3 -- --weights model.gguf --prompt-ids 1,2,3
just inspect weights/model.gguf
just qwen3-metal -- --weights model.gguf --device metal --prompt-ids 1,2,3
just fetch-minicpm5
just minicpm5 -- --weights /tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors --device cpu --prompt-ids 1,42 --max-tokens 16
just minicpm5-chat "Hello from MiniCPM5"

Pass model CLI flags after --. MiniCPM5 details: crates/rlx-minicpm5/README.md and the MiniCPM5 section. GPU backends: just features=all-backends qwen3 -- --device metal, just qwen35-all-backends -- …, or the per-crate recipes qwen3-all-backends / qwen35-all-backends.

Per-crate binaries (recommended)

Binary	Crate	Example
`rlx-qwen3`	`rlx-qwen3`	`cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- --weights model.gguf --prompt-ids 1,2,3`
`rlx-qwen35`	`rlx-qwen35`	`cargo run -p rlx-qwen35 --bin rlx-qwen35 --release -- …`
`rlx-llama32`	`rlx-llama32`	`cargo run -p rlx-llama32 --bin rlx-llama32 --release -- …`
`rlx-minicpm5`	`rlx-minicpm5`	`cargo run -p rlx-minicpm5 --features tokenizer --release -- --weights …/model.safetensors --prompt-ids 1,42`
`rlx-gemma`	`rlx-gemma`	`cargo run -p rlx-gemma --bin rlx-gemma --release -- --weights model.gguf --prompt-ids 1,2,3`
`rlx-dinov2`	`rlx-dinov2`	`cargo run -p rlx-dinov2 --bin rlx-dinov2 --release -- …`
`rlx-vjepa2`	`rlx-vjepa2`	`cargo run -p rlx-vjepa2 --bin rlx-vjepa2 --release -- …`
`rlx-wav2vec2-bert`	`rlx-wav2vec2-bert`	`cargo run -p rlx-wav2vec2-bert --bin rlx-wav2vec2-bert --release -- …`
`rlx-whisper`	`rlx-whisper`	`cargo run -p rlx-whisper --bin rlx-whisper --release -- --weights model.safetensors --wav audio16k.wav`
`rlx-vad`	`rlx-vad`	`cargo run -p rlx-vad --release -- --backend silero --wav audio16k.wav` (docs)
`rlx-voxtral`	`rlx-voxtral`	`cargo run -p rlx-voxtral --bin rlx-voxtral --release -- --weights model_dir --wav audio16k.wav --transcribe`
`rlx-voxtral-tts`	`rlx-voxtral-tts`	`just voxtral-tts -- --model-dir DIR --text "Hello" --voice neutral_female -o out.wav`
`rlx-voxtral-tts-train`	`rlx-voxtral-tts-train`	`just voxtral-tts-train-production -- --model-dir DIR --wav-dir WAVS --device auto`
`rlx-locateanything`	`rlx-locateanything`	`cargo run -p rlx-locateanything --bin rlx-locateanything --release -- --model-dir DIR --dry`
`rlx-sam1`	`rlx-sam`	`cargo run -p rlx-sam --bin rlx-sam1 --release -- …`
`rlx-sam2`	`rlx-sam2`	`cargo run -p rlx-sam2 --bin rlx-sam2 --release -- …`
`rlx-sam3`	`rlx-sam3`	`cargo run -p rlx-sam3 --bin rlx-sam3 --release -- …`
`rlx-flux2`	`rlx-flux2`	`cargo run -p rlx-flux2 --bin rlx-flux2 --release -- …`
`rlx-flux2-serve`	`rlx-flux2`	JSON-lines server on stdin
`rlx-inspect`	`rlx-cli`	`cargo run -p rlx-cli --bin rlx-inspect -- model.gguf`

Flags match the corresponding rlx-run subcommand (without the subcommand name).

Multiplexer (`rlx-run`)

cargo run -p rlx-models --bin rlx-run --release --features metal -- \
    qwen3 --weights Qwen3-0.6B-Q4_K_M.gguf --device metal --prompt-ids 1,17,42

cargo run -p rlx-models --bin rlx-run -- inspect Qwen3-0.6B-Q4_K_M.gguf

rlx-inspect dumps format, tensor count, dtype histogram, GGUF metadata, MTP heads, and multi-.gguf dir hints (--prefer Q4_K_M).

Custom CLI

Downstream tools can register runners without forking rlx-models:

use rlx_cli::{dispatch, register_cli};

register_cli("my-model", "…", |args| { /* … */ });
dispatch(&argv)?;

See crates/rlx-models/examples/register_custom_runner.rs.
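
For readers new to the pattern, a multiplexer of this kind can be sketched with the standard library alone. This is an illustration of name-based dispatch, not the rlx_cli implementation: Registry, Runner, usage, and my_model_run are hypothetical names, and the real register_cli / dispatch signatures may differ.

```rust
use std::collections::BTreeMap;

/// A runner receives the arguments that follow the subcommand name.
pub type Runner = fn(&[String]) -> Result<(), String>;

/// Hypothetical registry for this sketch (not the rlx_cli type).
#[derive(Default)]
pub struct Registry {
    runners: BTreeMap<String, (String, Runner)>, // name -> (help text, runner)
}

impl Registry {
    pub fn register(&mut self, name: &str, help: &str, run: Runner) {
        self.runners.insert(name.to_string(), (help.to_string(), run));
    }

    /// One "name: help" line per registered subcommand, sorted by name.
    pub fn usage(&self) -> String {
        self.runners
            .iter()
            .map(|(name, (help, _))| format!("{name}: {help}"))
            .collect::<Vec<_>>()
            .join("\n")
    }

    /// `argv[0]` selects the subcommand; the rest is forwarded to its runner.
    pub fn dispatch(&self, argv: &[String]) -> Result<(), String> {
        let (name, rest) = argv
            .split_first()
            .ok_or_else(|| "missing subcommand".to_string())?;
        match self.runners.get(name) {
            Some((_, run)) => run(rest),
            None => Err(format!("unknown subcommand `{name}`\n{}", self.usage())),
        }
    }
}

/// Example runner for the sketch: only checks that `--weights` was passed.
pub fn my_model_run(args: &[String]) -> Result<(), String> {
    if args.iter().any(|a| a == "--weights") {
        Ok(())
    } else {
        Err("--weights is required".to_string())
    }
}
```

Keeping runners as plain function pointers means a downstream crate can register its own cli::run without the multiplexer depending on it at the type level.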

Examples (facade)

Integration templates on the rlx-models package:

cargo run -p rlx-models --example run_qwen3_gguf --release -- [args]
just example-qwen3-gguf -- /path/to/model.gguf

File	What it does
`run_qwen3_safetensors.rs`	Qwen3 from HF safetensors, builder API, streaming greedy decode
`run_qwen3_gguf.rs`	Same from `.gguf` (Q4_K_M / Q5_K_M / Q6_K), MTP head detection
`run_sam1.rs`	SAM 1 — encode image, prompt encoder + mask decoder
`run_sam2.rs`	SAM 2 — FPN + memory attention
`run_sam3.rs`	SAM 3 — text-conditioned detection + masks
`qwen3_gguf_inference.rs`	Detailed Qwen3 GGUF walk-through
`gguf_qwen3_probe.rs`	Validate `hf_to_gguf_name` against a real GGUF
`qwen3_matrix.rs`	(B, L, mode) × (CPU, Metal, MLX, wgpu) parity + perf vs candle
`minicpm5_download.rs`	Fetch openbmb/MiniCPM5-1B safetensors (`hf-download`)
`minicpm5_gguf_download.rs`	Fetch GGUF quants (Q4_K_M / Q8_0 / F16)
`run_minicpm5.rs`	`MiniCpm5Runner` prefill + greedy decode from safetensors
`minicpm5_forward_bench.rs`	Wall-clock prefill/decode across backends (real 1B weights)
`minicpm5_chat.py`	HF chat template → `rlx-minicpm5` (`just minicpm5-chat`)

Qwen3-TTS samples

Audio and charts live in crates/rlx-qwen3-tts. Duplex voice chat (bundled question → JFK-clone reply):

[Figure: voice-chat roundtrip latency]

Three JFK voice-clone clips (ask_not, moon, rlx_intro) — ECAPA cosine 0.95+, WER 0–3.8 %. Full metrics, streaming API, and just voice-chat-demo: crate README.

SAM examples synthesize a 1024×1024 RGB gradient — swap in image::open(path) for real images.

just fetch-minicpm5
just example run_minicpm5 --release

Weight fetch (optional)

docker/qwen3-fetch/ — container pulls HF checkpoints into ./weights; host runs cargo test / benches natively.

just fetch-qwen3
# or: docker build -t rlx-qwen3-fetch docker/qwen3-fetch && …
just fetch-minicpm5
just fetch-minicpm5-gguf Q4_K_M

What's here

qwen3 — Qwen3 decoder LM (GQA, QK-norm, RoPE, SwiGLU, tied embeddings). Safetensors + GGUF; optional qwen3.rlx.toml. See Qwen3.
qwen35 — Qwen3.5 / 3.6 hybrid (gated DeltaNet + periodic attention + optional MTP). GGUF via Qwen35Runner; optional qwen35.rlx.toml. Parity: examples/qwen35_compare.rs with the llama.cpp reference script in examples/.
gemma — Gemma / Gemma 2 / 3 / 4 (GQA, RoPE, GeGLU, tied weights, Gemma2 softcap). Safetensors + GGUF; optional gemma.rlx.toml. See crates/rlx-gemma/README.md. CLI: rlx-gemma / rlx-run gemma. Parity: just test-gemma-parity gemma2_synthetic; backends: just features=all-backends test-gemma-backends.
bert — BERT graph builder (MiniLM, BGE, all-MiniLM-L6-v2).
nomic — NomicBERT (RoPE + SwiGLU).
vision — NomicVision-style encoders.
dinov2 — DINOv2 ViT (B/14, L/14, g/14).
sam, sam2, sam3 — Segment Anything encoders + mask decoders. Optional sam.rlx.toml next to weights (reference: crates/rlx-sam/src/sam.rlx.toml).
flux2 — FLUX.2 rectified-flow denoiser. rlx-flux2 CLI; presets flux2_dev(), flux2_klein_4b(), flux2_klein_9b(). VAE, CFG, img2img, LoRA, hf-download, rlx-flux2-serve. GPU backends via rlx-models features (metal, cuda, …).
embed — RlxEmbed, registry, tokenizers, pooling. from_pretrained with hf-download.
config, weight_loader — HF config parsing; WeightMap + GgufLoader (K-quants, MTP isolation).
mamba — Mamba1 SSM block (rlx-mamba); SSM via rlx-ssm + SelectiveScan. See crates/rlx-mamba/README.md.
lfm, minimax, nemotron — hybrid runners using rlx-ssm decode-step stages.
minicpm5 — MiniCPM5 edge LMs (Llama-shaped 1B). Wraps Llama32Runner; safetensors + GGUF. See MiniCPM5 and crates/rlx-minicpm5/README.md.
qwen3-tts — Qwen3-TTS Base (voice clone) + CustomVoice. ECAPA x-vector, 28-layer talker, 16-group code predictor, 12 Hz Mimi decode. VoiceClone API, progressive streaming, and bidirectional_voice_chat (Whisper → Qwen3-0.6B → TTS). See Qwen3-TTS.
voxtral-tts — Voxtral-4B-TTS native inference (Tekken tokenizer, codec decode, compiled LM). voxtral-tts-train — RLX autodiff training for reference-audio cloning (codec encoder + full attention LoRA). See Voxtral TTS.
run — Qwen3Runner, SamRunner, … builders for one-call inference.

Install

[dependencies]
rlx-models = "0.2"

HF-hub download:

rlx-models = { version = "0.2", features = ["hf-download"] }

Quickstart — embeddings

use rlx_models::embed::{Pooling, RlxEmbed};

// `ids` and `mask` hold the tokenized input and its attention mask (tokenization not shown).
let mut model = RlxEmbed::from_pretrained("sentence-transformers/all-MiniLM-L6-v2")?;
let hidden = model.forward(&[("input_ids", &ids), ("attention_mask", &mask)], 1, 16)?;

High-level runner API

rlx_models::run exposes builder-style entry points (also rlx::run in the monorepo):

use rlx_models::run::{Qwen3Runner, Precision};
use rlx_runtime::Device;

let mut runner = Qwen3Runner::builder()
    .weights("Qwen3-0.6B-Q4_K_M.gguf")
    .device(Device::Metal)
    .max_seq(128)
    .precision(Precision::F32)
    .max_memory_gb(16.0)
    .stream(true)
    .use_mtp(false)
    .packed_weights(false)
    .build()?;

runner.generate(&prompt_ids, 32, |tok| print!("{tok} "))?;
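
The shape of a streaming greedy decode with a per-token callback can be sketched independently of any runner. This is a minimal illustration, not Qwen3Runner's implementation: greedy_generate, next_logits (a stand-in for the model forward pass), and the eos parameter are hypothetical, and whether the real generate counts new tokens the same way is an assumption.

```rust
/// Greedy decoding: repeatedly append the highest-scoring token.
/// `next_logits` stands in for a forward pass over the tokens so far;
/// `on_token` is invoked as soon as each new token is chosen.
pub fn greedy_generate<M, C>(
    mut next_logits: M,
    prompt: &[u32],
    max_new_tokens: usize,
    eos: Option<u32>,
    mut on_token: C,
) -> Vec<u32>
where
    M: FnMut(&[u32]) -> Vec<f32>,
    C: FnMut(u32),
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new_tokens {
        let logits = next_logits(&tokens);
        if logits.is_empty() {
            break; // nothing to choose from
        }
        // Argmax over the vocabulary; the first maximum wins on ties.
        let mut best = 0usize;
        for (i, &v) in logits.iter().enumerate() {
            if v > logits[best] {
                best = i;
            }
        }
        let tok = best as u32;
        tokens.push(tok);
        on_token(tok); // stream the token to the caller immediately
        if eos == Some(tok) {
            break;
        }
    }
    tokens
}
```

The callback is what makes the decode "streaming": the caller sees each token before the next forward pass starts.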

Packed weights (large GGUF on limited RAM — CPU-only, memory-frugal, slower):

let mut runner = Qwen3Runner::builder()
    .weights("Qwen3-14B-Q4_K_M.gguf")
    .packed_weights(true)
    .max_seq(128)
    .build()?;
runner.generate(&prompt_ids, 16, |tok| print!(" {tok}"))?;
let logits = runner.predict_logits(&prompt_ids)?;

Format (safetensors vs gguf) is auto-detected. SAM uses SamRunner::builder(SamArch::Sam2).
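
One plausible way to auto-detect the container is to look at the first bytes of the file. The sketch below relies on two format facts: GGUF files begin with the ASCII magic "GGUF", and safetensors files begin with an 8-byte little-endian header length followed by a JSON object. SniffedFormat and sniff_format are hypothetical names; how rlx-models actually detects the format (content, extension, or both) is not stated here.

```rust
/// Hypothetical result type for this sketch (not rlx_core's `WeightFormat`).
#[derive(Debug, PartialEq)]
pub enum SniffedFormat {
    Gguf,
    Safetensors,
}

/// Guess the container format from the leading bytes of a weights file.
pub fn sniff_format(head: &[u8]) -> Option<SniffedFormat> {
    // GGUF: the file starts with the 4-byte ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return Some(SniffedFormat::Gguf);
    }
    // safetensors: u64 little-endian header length, then a JSON object.
    if head.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        let header_len = u64::from_le_bytes(len_bytes);
        if header_len > 0 && head[8] == b'{' {
            return Some(SniffedFormat::Safetensors);
        }
    }
    None
}
```

Reading the first nine bytes of the file is enough input for this check.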

CLI equivalent:

just qwen3 -- --weights Qwen3-14B-Q4_K_M.gguf --packed --max-seq 128 --max-tokens 16 --prompt-ids 1,17,42
# or: cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- …

Adding a new model

Borrowed from Max's four-file layout; each architecture is a workspace crate crates/rlx-<name>/.

1. Create the crate

Root Cargo.toml:

[workspace]
members = [
    # … existing members …
    "crates/rlx-myarch",
]

[workspace.dependencies]
rlx-myarch = { path = "crates/rlx-myarch" }

Depend on rlx-core, rlx-ir, rlx-flow, rlx-runtime as needed.

2. Source layout

crates/rlx-myarch/src/
├── lib.rs
├── arch.rs       # ArchSpec registration (optional)
├── config.rs     # HF config.json
├── weights.rs    # HF → RLX name map
├── builder.rs    # graph construction
├── flow.rs       # compile helpers (optional split)
└── cli.rs        # pub fn run(args: &[String])

arch.rs registers with rlx_core::arch_registry. weights.rs holds rename rules; builder.rs emits IR. Reference: crates/rlx-qwen3.

3. Facade re-export

In crates/rlx-models/src/lib.rs:

pub mod myarch {
    pub use rlx_myarch::*;
}

4. CLI (optional)

cli.rs + [[bin]] name = "rlx-myarch"
Register in crates/rlx-models/src/bin/rlx_run.rs: register_cli("myarch", "…", rlx_myarch::cli::run)
Add a just recipe in justfile (optional)

5. High-level runner (optional)

Put MyArchRunner in the model crate; re-export from crates/rlx-models/src/run.rs.

Legacy flat modules (rlx-bert, rlx-nomic) stay as-is until they grow — use this layout for new architectures.

Compile profiles (tier-1)

Compile through tier-1 profiles, not bare Session::compile(graph):

Model	Profile helper	Optional file next to weights
Qwen3	`flow_util::compile_graph_qwen3_prefill_with_params`	`qwen3.rlx.toml`
Qwen3.5	`compile_support::compile_qwen35_prefill` / `compile_qwen35_decode`	`qwen35.rlx.toml`
SAM / SAM3	`flow_util::compile_graph_sam_with_params`	`sam.rlx.toml`
Encoders	`flow_util::compile_graph_encoder_with_params`	—

Synthetic Qwen3.5 weights for CPU checks: rlx_models::qwen35::synth (tiny_cfg, medium_cfg, bench_cfg, …).

just test-quick
# cargo test -p rlx-models --test qwen35_forward_check --test compile_profile_quick_check

Real-GGUF / backend checks: set QWEN35_GGUF_PATH (LMs) or the vision env vars (SAM3_GGUF_PATH, DINOV2_GGUF_PATH, FLUX_GGUF_PATH, W2V_BERT_GGUF_PATH), then run the matching test:

Drain: cargo test -p rlx-models --test vision_gguf_load --release
Compile quick check: cargo test -p rlx-models --test vision_gguf_compile --release (SAM3 also needs VISION_GGUF_COMPILE=1; W2V-BERT needs RLX_W2V_BERT_DIR with config.json)
FLUX: cargo test -p rlx-models --test flux2_gguf_runner_quick_check --release (FLUX_GGUF_PATH / FLUX_MODEL_ROOT; optional FLUX_VAE_DIR for VAE encode)
Q4_0 fused matmul: cargo test -p rlx-models --test gguf_legacy_quant_matmul --release (Metal parity: GGUF_LEGACY_METAL_PARITY=1 with --features metal)

Enable metal / mlx / cuda / parity-llama per test file where noted.

Qwen3

Prefill + decode on all seven standard backends (CPU, Metal, MLX, CUDA, ROCm, WGPU, Vulkan). Enable matching features at build time (cargo build -p rlx-qwen3 --features all-backends). Synthetic checks: just features=all-backends test-qwen3-backends. Parity: 100% top-1 vs HF (tests/qwen3_parity.rs).
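
A top-1 parity number of this kind is the fraction of positions at which two models choose the same argmax token. The sketch below shows that metric in isolation; argmax and top1_agreement are hypothetical helpers, not the code in tests/qwen3_parity.rs.

```rust
/// Index of the largest logit (first maximum wins; NaNs are skipped).
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if v <= b => {} // keep the earlier maximum
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// Fraction of positions where both logit sets pick the same top-1 token.
/// `a[p]` and `b[p]` are the vocabulary logits at position `p`.
pub fn top1_agreement(a: &[Vec<f32>], b: &[Vec<f32>]) -> f64 {
    assert_eq!(a.len(), b.len(), "position counts differ");
    if a.is_empty() {
        return 1.0; // vacuously in agreement
    }
    let same = a
        .iter()
        .zip(b)
        .filter(|(x, y)| argmax(x) == argmax(y))
        .count();
    same as f64 / a.len() as f64
}
```

A result of 1.0 corresponds to "100% top-1"; it does not imply the logits themselves are numerically identical.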

Safetensors

use rlx_models::qwen3::{Qwen3Config, build_qwen3_graph_sized_last_logits};
use rlx_models::weight_map::WeightMap;
use rlx_runtime::Device;

let cfg = Qwen3Config::from_file("weights/Qwen3-0.6B/config.json".as_ref())?;
let mut wm = WeightMap::from_file("weights/Qwen3-0.6B/model.safetensors")?;
let (graph, params) = build_qwen3_graph_sized_last_logits(&cfg, &mut wm, 1, 128, false)?;
let mut compiled = rlx_models::flow_util::compile_graph_qwen3_prefill_with_params(
    Device::Metal, graph, params,
)?;

GGUF

use rlx_models::weight_loader::GgufLoader;
let mut wm = GgufLoader::from_file("Qwen3-0.6B-Q4_K_M.gguf")?;
// same compile + run as safetensors

Demo: just example-qwen3-gguf -- path/to/model.gguf. Verified vs unsloth/Qwen3-0.6B-GGUF (cosine ≈ 0.976 vs F32 safetensors on Q4_K_M).
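
The cosine figure compares two output vectors (for example, quantized versus F32 logits) by direction rather than magnitude. A minimal sketch of the metric follows; cosine_similarity is a hypothetical helper and which tensors the repo compares is not specified here.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None; // direction is undefined for a zero vector
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```

Values near 1.0 mean the quantized output points in almost the same direction as the reference; 4-bit quantization typically lands somewhat below 1.0.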

Directories with several .gguf files: pass ResolveWeightsOptions { prefer_gguf_substring: Some("Q4_K_M"), .. } or gguf_index: Some(0) (see rlx_core::gguf_support). Multi-part split GGUF (split.count > 1) auto-merges when all shards sit in the same directory; otherwise rlx-inspect lists missing parts.
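
The selection rule can be pictured as a small pure function over the directory listing. The sketch below is an assumption about the behavior, not rlx_core::gguf_support itself: pick_gguf is a hypothetical name, and the precedence (explicit index first, then substring, then a lone candidate) and the sorted ordering are choices made for the example.

```rust
/// Pick one `.gguf` file from a directory listing.
/// Assumed precedence: explicit index, then preferred substring,
/// then the only candidate when exactly one `.gguf` is present.
pub fn pick_gguf<'a>(
    files: &[&'a str],
    prefer_substring: Option<&str>,
    index: Option<usize>,
) -> Option<&'a str> {
    let mut ggufs: Vec<&'a str> = files
        .iter()
        .copied()
        .filter(|f| f.to_ascii_lowercase().ends_with(".gguf"))
        .collect();
    ggufs.sort_unstable(); // deterministic order for index-based selection

    if let Some(i) = index {
        return ggufs.get(i).copied();
    }
    if let Some(sub) = prefer_substring {
        if let Some(hit) = ggufs.iter().copied().find(|f| f.contains(sub)) {
            return Some(hit);
        }
    }
    if ggufs.len() == 1 {
        Some(ggufs[0])
    } else {
        None // ambiguous or empty: the caller must disambiguate
    }
}
```

Returning None for an ambiguous directory, rather than guessing, keeps the failure visible to the caller.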

Weights API (model-agnostic loader)

rlx_core::weights only handles paths, file formats, and drain policy. It does not know about Qwen, FLUX, BERT, etc.

use rlx_core::weights::{self, LoadOpts};

let (path, map) = weights::open_map("weights/")?;
let (path, map) = weights::open_map_with(LoadOpts::map().prefer_q4_k_m(), "weights/")?;
let loaded = weights::open_with(LoadOpts::loader(), "model.gguf")?; // packed take / MTP

Model-specific policy belongs in each runner:

use rlx_core::{load_weight_map, gguf_validate_arch, EMBED_GGUF_ARCHES, DINOV2_GGUF_ARCHES};

// One call: resolve path, validate arch on .gguf, drain to F32 map
let map = load_weight_map(path, DINOV2_GGUF_ARCHES)?;

// Or split validate + open (embed / custom drain policy)
gguf_validate_arch(&path, EMBED_GGUF_ARCHES)?;
let (_path, map) = weights::open_map(path)?;

Layer	Responsibility
`weights` / `weight_registry`	`.gguf` / `.safetensors`, resolve dir, custom extensions
`gguf_validate_arch`, `assert_gguf_family`	Optional arch guard in your crate
`register_gguf_tensor_resolver`	HF ↔ `blk.*` / prefix strip per checkpoint layout
`BertConfig::from_gguf`, `Flux2Config::from_gguf`	Hyperparameters from metadata
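
To make the HF ↔ blk.* row concrete, here is a sketch of a name resolver for a Llama-style decoder. The target names follow the llama.cpp tensor naming convention; the function hf_to_gguf_sketch is hypothetical, covers only a subset of tensors, and is not the resolver registered through register_gguf_tensor_resolver.

```rust
/// Map a Hugging Face Llama-style tensor name to its llama.cpp GGUF name.
/// Illustrative subset only; unknown names return None.
pub fn hf_to_gguf_sketch(hf: &str) -> Option<String> {
    // Top-level tensors.
    match hf {
        "model.embed_tokens.weight" => return Some("token_embd.weight".into()),
        "model.norm.weight" => return Some("output_norm.weight".into()),
        "lm_head.weight" => return Some("output.weight".into()),
        _ => {}
    }
    // Per-layer tensors: model.layers.<N>.<module>.<weight|bias>
    let rest = hf.strip_prefix("model.layers.")?;
    let (layer, tail) = rest.split_once('.')?;
    let layer: usize = layer.parse().ok()?;
    let (module, suffix) = tail.rsplit_once('.')?;
    let mapped = match module {
        "self_attn.q_proj" => "attn_q",
        "self_attn.k_proj" => "attn_k",
        "self_attn.v_proj" => "attn_v",
        "self_attn.o_proj" => "attn_output",
        "mlp.gate_proj" => "ffn_gate",
        "mlp.up_proj" => "ffn_up",
        "mlp.down_proj" => "ffn_down",
        "input_layernorm" => "attn_norm",
        "post_attention_layernorm" => "ffn_norm",
        _ => return None,
    };
    Some(format!("blk.{layer}.{mapped}.{suffix}"))
}
```

Returning None for unmapped names lets a caller report exactly which checkpoint tensors have no counterpart instead of silently dropping them.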

Inspect: rlx-inspect path [--prefer Q4_K_M] [--json] — directory listing, split-part hints, runner suggestions.

CLI: LM / FLUX binaries accept --prefer-quant and --gguf-index (via rlx_cli::resolve_weights_cli); default quant preference is Q4_K_M in multi-file dirs.

Splits: Multi-part GGUF (split.count > 1) auto-merges when all parts are in the same directory; otherwise rlx-inspect lists missing shards.
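
A missing-shard report can be derived from the shard count and the file names present. The sketch assumes shards follow the llama.cpp gguf-split naming pattern (prefix-00001-of-00003.gguf); shard_name and missing_shards are hypothetical helpers, and rlx-inspect may derive the same information differently (for example, from split metadata alone).

```rust
/// Expected shard file name under the llama.cpp `gguf-split` convention.
/// Parts are 1-based and zero-padded to five digits.
pub fn shard_name(prefix: &str, part: u32, count: u32) -> String {
    format!("{prefix}-{part:05}-of-{count:05}.gguf")
}

/// Names of the shards that are absent from `present`.
pub fn missing_shards(prefix: &str, count: u32, present: &[&str]) -> Vec<String> {
    (1..=count)
        .map(|part| shard_name(prefix, part, count))
        .filter(|name| !present.contains(&name.as_str()))
        .collect()
}
```

An empty result means every part is in the directory and a merge can proceed.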

Legacy quants: Q4_0 / Q8_0 support packed DequantMatMul on CPU and Metal (fused MSL dequant+matmul, 32-element blocks). Set RLX_DISABLE_METAL_DEQUANT_GPU=1 to force host dequant on Apple GPUs.
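
For reference, the 32-element Q4_0 block mentioned above decodes as follows in the ggml reference layout: 16 bytes carry 32 four-bit values, byte j holding element j in its low nibble and element j + 16 in its high nibble, each offset by 8 and multiplied by the block scale. The sketch below is a scalar host-side illustration, not the fused kernel; dequantize_q4_0_block is a hypothetical name and the scale, stored as f16 on disk, is assumed to be already widened to f32.

```rust
/// Number of weights that share one scale in a Q4_0 block.
pub const QK4_0: usize = 32;

/// Dequantize one Q4_0 block (ggml reference layout).
/// `scale` is the block scale, assumed already converted from f16 to f32.
pub fn dequantize_q4_0_block(scale: f32, quants: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (j, &byte) in quants.iter().enumerate() {
        // 4-bit values are stored unsigned; subtracting 8 recenters them to -8..=7.
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * scale; // first half of the block
        out[j + QK4_0 / 2] = hi as f32 * scale; // second half of the block
    }
    out
}
```

A fused dequant+matmul applies the same arithmetic per block but multiplies into the accumulator directly instead of materializing the F32 weights.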

Example: cargo run -p rlx-models --example custom_weight_format

Apple Silicon

Metal lowers to MPSGraph (per shape). Env toggles:

env var	effect
`RLX_DISABLE_MPSGRAPH=1`	per-op Metal thunks
`RLX_DISABLE_MPSGRAPH_EXECUTABLE=1`	JIT MPSGraph
`RLX_MPSGRAPH_PARAM_CONST=1`	bake weights into executable
`RLX_QWEN3_F16_LM_HEAD=1`	F16 final matmul
`RLX_MPSGRAPH_TRACE=1`	print lowering blockers

Harness: examples/qwen3_matrix.rs.

MiniCPM5

openbmb/MiniCPM5-1B — 1B Llama decoder (GQA, RoPE, SwiGLU). Implemented in rlx-minicpm5 on top of rlx-llama32 with HF config.json / GGUF arch checks. Full runbook: crates/rlx-minicpm5/README.md.

Download

just fetch-minicpm5                              # safetensors → /tmp/rlx-weights/MiniCPM5-1B
just fetch-minicpm5-gguf Q4_K_M                  # GGUF → …/MiniCPM5-1B-GGUF

CLI

Uses the same flags as rlx-llama32 (--weights, --device, --prompt-ids, --tokenizer, --packed, --max-seq, --max-tokens, …). Build with the tokenizer feature for decode:

W=/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors

just minicpm5 -- --weights "$W" --device cpu --prompt-ids 1,42,314 --max-tokens 16

# GGUF packed prefill (CPU + Metal native; MLX/wgpu/CUDA use CPU execution path today):
just minicpm5 -- --weights /tmp/rlx-weights/MiniCPM5-1B-GGUF/MiniCPM5-1B-Q4_K_M.gguf \
  --packed --device metal --prompt-ids 1,42 --max-tokens 8

Chat (HF template)

pip install transformers
just fetch-minicpm5
just minicpm5-chat "What is 2+2? Answer in one sentence."

minicpm5_chat.py tokenizes with the official template, then runs rlx-minicpm5 (defaults to CPU for reliable KV decode on Apple Silicon).

Library

use rlx_minicpm5::MiniCpm5Runner;
use rlx_runtime::Device;

let mut runner = MiniCpm5Runner::builder()
    .weights("/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors")
    .device(Device::Cpu)
    .max_seq(512)
    .build()?;
let logits = runner.predict_logits(&[1, 42, 314])?;

Example: just example run_minicpm5 --release (or cargo run -p rlx-models --example run_minicpm5 --release).

Tests

Command	What
`just test-minicpm5-parity-full`	RLX vs PyTorch (safetensors, needs weights)
`just test-minicpm5-backends-all`	Synthetic 1B-shaped graph, all backends
`just test-minicpm5-gguf-backends`	Real Q4_K_M GGUF packed prefill
`../rlx/rig.sh test-minicpm5`	Remote rig: CPU + CUDA + WGPU on Windows/WSL (after `sync` + `sync-minicpm5-gguf`)
`just bench-minicpm5-real --device cpu`	Wall-clock prefill/decode on 1B weights

Multiplexer: cargo run -p rlx-models --bin rlx-run --features tokenizer -- minicpm5 --weights ….

LocateAnything

NVIDIA LocateAnything-3B — MoonViT vision + mlp1 projector + Qwen2.5-3B with MTP box decoding. Crate: rlx-locateanything; runbook: crates/rlx-locateanything/README.md.

just fetch-locateanything
export RLX_LOCATEANYTHING_DIR=.cache/locateanything/LocateAnything-3B

just test-locateanything-checkpoint
just locateanything-demo   # bundled sample in rlx-locateanything/fixtures/sample.jpg
just locateanything -- --model-dir $RLX_LOCATEANYTHING_DIR \
  --image page.png --task ground-single --phrase "red backpack" \
  --generation-mode hybrid --device cpu

Command	What
`just test-locateanything-backends`	Synthetic projector + LM on all RLX backends
`just test-locateanything-moonvit-backends`	Compiled MoonViT on GPU backends
`just test-locateanything-parity`	Full tensor + MTP decode + RLX/HF processor prompts + tasks + slow/fast/hybrid `generate()` vs HF (28 tests; real JPEG fixture)
`just test-locateanything-parity-real`	Real-photo subset (`fixtures/sample.jpg`; `RLX_LOCATEANYTHING_IMAGE` optional)
`just locateanything-demo`	Quick ground on bundled sample (no `--image`)
`just bench-locateanything-backends`	E2E timing per backend; one subprocess per backend by default (avoids OOM). Single backend: `--device wgpu --no-isolate`

Weights are HF safetensors only (770 tensors: vision / projector / language_model.*).

Qwen3-TTS

Qwen3-TTS-12Hz-0.6B — native Rust voice clone and CustomVoice synthesis in rlx-qwen3-tts. Full runbook: crates/rlx-qwen3-tts/README.md.

Download and clone

just fetch-qwen3-tts-base
export RLX_QWEN3_TTS_DIR=.cache/qwen3-tts/Qwen3-TTS-12Hz-0.6B-Base

cargo build -p rlx-qwen3-tts --release --features apple-silicon --bin jfk_voice_clone
./target/release/jfk_voice_clone \
  --model-dir $RLX_QWEN3_TTS_DIR \
  --ref-wav assets/jfk/jfk_voice_clone.wav \
  --target-text "Hello from native Rust TTS." \
  --out-wav /tmp/hello.wav --device metal

Duplex voice chat

Mic WAV → Whisper → Qwen3-0.6B → progressive TTS (JFK clone). Bundled roundtrip audio under crates/rlx-qwen3-tts/examples/audio/.

just fetch-qwen3 && just fetch-whisper-base   # LM + ASR weights
just voice-chat-demo                          # → /tmp/voice_chat_roundtrip/

--turbo preloads all models, streams LM tokens, and uses batched TTS by default (--streaming-tts for progressive partial-decode). Measured stop-speaking → first audio ≈ 5.1 s on Apple Silicon (see voice_chat_latency.svg).

Streaming API

VoiceClone::generate_stream supports StreamMode::Batched (lossless chunking of full generate()) and StreamMode::Progressive (codec frames decoded during AR). Progressive speech decode uses CPU on Metal/MLX (GPU prefix-length mismatch); CUDA and other backends use GPU speech decode when available.
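
The "lossless chunking" property of the batched mode can be stated as: concatenating the emitted chunks reproduces the full PCM buffer exactly. A minimal sketch of that idea follows; stream_batched is a hypothetical helper, the f32 sample type and fixed chunk size are assumptions, and it does not model the progressive mode.

```rust
/// Hand fully generated PCM to `sink` in fixed-size chunks.
/// The final chunk may be shorter; no samples are dropped or duplicated.
pub fn stream_batched<F: FnMut(&[f32])>(pcm: &[f32], chunk_samples: usize, mut sink: F) {
    assert!(chunk_samples > 0, "chunk size must be non-zero");
    for chunk in pcm.chunks(chunk_samples) {
        sink(chunk);
    }
}
```

Because the chunks are slices of one finished buffer, this mode trades first-audio latency for bit-identical output relative to a non-streaming call.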

Tests

Command	What
`just test-qwen3-tts-parity`	Codec frames + speech decode vs reference (`RLX_QWEN3_TTS_DIR`)
`just features=all-backends test-qwen3-tts-backends`	Talker prefill/decode per backend
`just features=all-backends test-qwen3-tts-streaming`	Streaming PCM parity (batched + progressive)
`just qwen3-tts-vivian-demo`	CustomVoice preset speaker → `/tmp/vivian-demo.wav`

Env: RLX_QWEN3_TTS_CP_EAGER=1 / RLX_QWEN3_TTS_SPEECH_EAGER=1 force CPU paths; RLX_QWEN3_TTS_TIMING=1 prints stage breakdown.

Voxtral TTS

Mistral Voxtral-4B-TTS-2603 — native Rust inference in rlx-voxtral-tts, voice-clone training in rlx-voxtral-tts-train. Full runbook: docker/voxtral-tts/README.md.

Download and synthesize

just fetch-voxtral-tts
export RLX_VOXTRAL_TTS_DIR=.cache/voxtral/Voxtral-4B-TTS-2603

just voxtral-tts-prepare-voices
just voxtral-tts -- --model-dir $RLX_VOXTRAL_TTS_DIR \
  --text "Hello world" --voice neutral_female -o out.wav

Voice cloning (native RLX training)

Public checkpoints omit the codec encoder. Train it in RLX, inject into consolidated.safetensors, then synthesize from a reference WAV:

# Optional manifest (transcript field improves ASR auxiliary loss):
just voxtral-tts-train-manifest -- --wav-dir ./wavs --out ./wavs/manifest.json

PRODUCTION=1 just features=all-backends voxtral-tts-train-production -- \
  --model-dir $RLX_VOXTRAL_TTS_DIR --wav-dir ./wavs \
  --manifest ./wavs/manifest.json --out-dir ./out/train --device auto

just voxtral-tts -- --model-dir $RLX_VOXTRAL_TTS_DIR \
  --reference-wav ./ref.wav --text "Hello from my voice" -o cloned.wav

Periodic checkpoints during long runs: CHECKPOINT_EVERY=500. Resume: --resume-weights ./out/train/encoder/encoder_step_5000.safetensors --resume-step 5000. Rig validation: RLX_VOXTRAL_TTS_TRAIN_RIG=1 RLX_VOXTRAL_TTS_REF_WAV=./ref.wav just test-voxtral-tts-train-synthesize-rig (reports mel similarity).

Tests

Command	What
`just test-voxtral-tts-train`	Train crate unit + integration tests
`just test-voxtral-tts-train-backends`	Encoder/LoRA backward compile on all GPU backends
`just test-voxtral-tts-codec`	Codec round-trip
`just test-voxtral-tts-native-parity`	Native vs Docker reference export

VAD (Earshot + Silero)

rlx-vad — 16 kHz voice activity detection with embedded weights (no ONNX Runtime):

Earshot — weights/earshot_weights.bin (~75 KiB)
Silero — weights/silero_vad_16k.safetensors (~920 KiB), exported from official silero_vad.onnx 16 kHz branch

cargo run -p rlx-vad --release -- --backend silero --wav audio16k.wav
cargo run -p rlx-vad --example jfk_bench --release
cargo test -p rlx-vad

Regenerate Silero embed: python3 scripts/export_silero_onnx_weights.py … (see crate README). The Hugging Face file named silero_vad_16k.safetensors is a different (8 kHz) graph — do not substitute it.

Shared loader: rlx_core::embedded_safetensors::EmbeddedSafetensors.

Build and test

just check
just test
just build

cargo build -p rlx-models
cargo test  -p rlx-models
cargo test  -p rlx-models --features parity-candle

burnembed (local checkout at /Users/Shared/burnembed) re-exports rlx_models::embed when built with --features rlx.

Real-weight integration tests

just fetch-real-weights              # downloads ~1.5 GB of small Q4_K_M GGUFs (idempotent)
just test-real-weights               # config + compat + chat-template across 4 families (~2 s/suite)
just test-real-weights-inference     # adds end-to-end forward inference (slow on CPU)
just test-net-hf                     # live HuggingFace Hub compat check (RLX_NET_TESTS=1)

Covers SmolLM2 135M (llama), Qwen 2.5 0.5B (qwen2), Gemma 3 270M (gemma3 — currently KnownUnimplemented(M2)), and Llama 3.2 1B (llama + Llama-3 RoPE scaling). The inference path verifies the full Llama32Runner/Qwen3Runner packed-decode pipeline against real downloaded GGUFs.

Auto-dispatch + compatibility check

rlx-run check <path-or-hf-repo>      # `SUPPORTED`, `KnownUnimplemented(<milestone>)`, `MissingMetadata`, or `Unknown`
rlx-run check <path> --json          # machine-readable verdict
rlx-run auto <weights> [args...]     # sniffs arch, dispatches to the right runner

Programmatic: rlx_models::run::check_path, check_hf_repo (requires the compat-net feature), auto_dispatch, ChatTemplate::from_gguf. The check implements the same load-time-field predicate llama.cpp uses (general.architecture + <arch>.context_length + <arch>.embedding_length + <arch>.block_count + tokenizer.ggml.{model,tokens}).
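
The predicate can be sketched as a lookup over the set of metadata keys present in the file. Everything named below is hypothetical (Verdict, required_keys, check_metadata, and the supported / planned lists passed in by the caller); the sketch drops the milestone tag carried by the real KnownUnimplemented verdict and assumes a missing general.architecture is reported as missing metadata.

```rust
use std::collections::HashSet;

/// Hypothetical verdict type mirroring the four outcomes listed above.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Supported,
    KnownUnimplemented,
    MissingMetadata(Vec<String>),
    Unknown,
}

/// Load-time keys required for architecture `arch`.
pub fn required_keys(arch: &str) -> Vec<String> {
    vec![
        format!("{arch}.context_length"),
        format!("{arch}.embedding_length"),
        format!("{arch}.block_count"),
        "tokenizer.ggml.model".to_string(),
        "tokenizer.ggml.tokens".to_string(),
    ]
}

/// `arch` is the value of `general.architecture`, if the key exists;
/// `keys` is the set of metadata keys found in the file.
pub fn check_metadata(
    arch: Option<&str>,
    keys: &HashSet<String>,
    supported: &[&str],
    planned: &[&str],
) -> Verdict {
    let arch = match arch {
        Some(a) => a,
        None => return Verdict::MissingMetadata(vec!["general.architecture".to_string()]),
    };
    if planned.contains(&arch) {
        return Verdict::KnownUnimplemented;
    }
    if !supported.contains(&arch) {
        return Verdict::Unknown;
    }
    let missing: Vec<String> = required_keys(arch)
        .into_iter()
        .filter(|k| !keys.contains(k))
        .collect();
    if missing.is_empty() {
        Verdict::Supported
    } else {
        Verdict::MissingMetadata(missing)
    }
}
```

Listing the missing keys in the verdict gives a caller enough detail to explain why a file was rejected.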

Status

Weights and parity

rlx GGUF = this repo can load .gguf through GgufLoader and the family runner. GGUF on HF = models on the Hub tagged library:gguf (counts are approximate; use the search link to browse).

family	safetensors	rlx GGUF	GGUF on Hugging Face	parity
`bert`, `nomic`, `vision` (`embed`)	yes	yes (`bert`, `nomic-bert`, …)	yes — minilm (~128), bge (~247), nomic (~60); e.g. nomic-embed-text-v1.5-GGUF (`nomic-bert`), bge-small-en-v1.5-gguf. Vision embed: no GGUF sibling.	production (safetensors)
`dinov2`	yes	yes (`dinov2`; F32 drain or K-quant/Q4_0/Q8_0 packed `DequantMatMul` when quant tensors present)	no for `facebook/dinov2-*` — dinov2 (0). Community converters (dinov2.cpp) use `dinov2` arch; tensor names must match HF/candle keys.	production
`sam`, `sam2`, `sam3`	yes	yes (`sam` / `mobile-sam` / `sam2` F32 drain). SAM3: F32 drain or K-quant via fused CPU `gguf_matmul` (ViT, text, detector host+IR, seg cross-attn/mask/scoring, 1×1 inst/sem `DequantMatMul` IR); 3×3 pixel conv stays packed at load (one-time dequant cache on host, materialize for tier-1 IR compile)	SAM1 ViT-H / SAM2: no official Hub GGUF — segment+anything (0), sam2.1 (0). MobileSAM: mobilesam (2), e.g. Acly/MobileSAM-GGUF (`mobile-sam`). SAM3: sam3 (1) — rob-laz/sam3-gguf (`sam3`). Beware TheBloke/SAM-GGUF — 7B chat LM (`llama`), not Segment Anything.	production (encoder + mask path)
`qwen3`	yes	yes (Q4_K_M / Q5_K_M / Q6_K)	yes — qwen3 (many); e.g. `unsloth/Qwen3-*-GGUF`	top-1 vs HF (`parity-candle` + weights)
`qwen35`	—	yes	yes — same hub space; e.g. `unsloth/Qwen3.5-*-GGUF`	vs llama.cpp when `QWEN35_GGUF_PATH` / `parity-llama`
`llama32`	yes	yes	yes — llama-3.2 (~5k)	vs llama.cpp when `LLAMA32_GGUF_PATH`
`minicpm5`	yes	yes (`llama`)	yes — MiniCPM5-1B-GGUF (Q4_K_M / Q8_0 / F16)	vs PyTorch (`minicpm5_parity`); `rlx-minicpm5` 0.2.1 on `rlx-llama32` 0.2.1; GGUF packed CPU/Metal
`llada2`	yes	—	preview — llada2 (1): LLaDA2.0-mini-preview-GGUF (`llada2`)	vs PyTorch when `LLADA2_MODEL_DIR`
`flux2`	yes (BFL / NVFP4 safetensors)	yes (denoiser `.gguf`, `architecture: flux`; K-quant GGUF uses packed `DequantMatMul`; `Flux2Runner` + VAE/TE safetensors)	yes — flux2 (~53); e.g. unsloth/FLUX.2-klein-9B-GGUF, city96/FLUX.2-dev-gguf	GGUF = denoiser only; VAE + Qwen3 TE still safetensors dirs
`vjepa2`	yes	yes (`vjepa2` / `vjepa`, F32 drain)	no Hub GGUF yet — vjepa (0)	synthetic + optional weight checks
`wav2vec2-bert`	yes	yes (`w2v-bert` / `wav2vec2`, F32 drain)	no for Seamless W2V-BERT — w2v-bert (0). Classic ASR: wav2vec2 (~7), e.g. `cstr/wav2vec2-*-GGUF` (`wav2vec2` arch; keys may not match W2V-BERT)	vs HF when `RLX_W2V_BERT_DIR` + python reference

To discover GGUF on the Hub: open Models → library GGUF and add a search term matching the family (qwen3, bge, flux2, …). Check the model card Architecture field — many repos share a name but are unrelated LMs.

Backends

Every model family targets the same standard backends: CPU, Metal, MLX, CUDA, ROCm, WGPU (gpu), Vulkan. SAM also accepts tpu. Policy lives in rlx_core::device_capabilities; runners call validate_standard_device (or validate_sam_device) at build time.

Enable GPU at compile time with matching features on rlx-models or any model crate, e.g. cargo build -p rlx-qwen3 --features all-backends or cargo run -p rlx-models --features metal --bin rlx-run -- qwen3 …. Per-crate binaries (rlx-qwen3, rlx-sam3, …) expose the same feature names. CLI: cpu, metal/mps, mlx, cuda, rocm/hip, gpu/wgpu, vulkan.
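
The CLI aliases listed above amount to a many-to-one mapping onto the seven standard backends. A sketch of such a parser follows; Backend and parse_backend are hypothetical names (the real type is rlx_runtime::Device), case-insensitive matching is an assumption, and the SAM-only tpu value is deliberately left out.

```rust
/// Hypothetical stand-in for the seven standard backends.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Metal,
    Mlx,
    Cuda,
    Rocm,
    Wgpu,
    Vulkan,
}

/// Parse a `--device` value, accepting the aliases from the README.
pub fn parse_backend(s: &str) -> Option<Backend> {
    match s.trim().to_ascii_lowercase().as_str() {
        "cpu" => Some(Backend::Cpu),
        "metal" | "mps" => Some(Backend::Metal),
        "mlx" => Some(Backend::Mlx),
        "cuda" => Some(Backend::Cuda),
        "rocm" | "hip" => Some(Backend::Rocm),
        "gpu" | "wgpu" => Some(Backend::Wgpu),
        "vulkan" => Some(Backend::Vulkan),
        _ => None, // unknown names are rejected rather than defaulted
    }
}
```

Parsing only resolves the name; whether the chosen backend was compiled in still depends on the Cargo features enabled at build time.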

Legend: ✅ supported · ⚠️ partial (host fallback or open runtime gap) · ❌ not supported

family	cpu	metal	mlx	cuda	rocm	wgpu	vulkan	notes
`embed` (`bert`, `nomic`, `vision`)	✅	✅	✅	✅	✅	✅	✅	`RlxEmbed::from_dir_on`; `from_dir` defaults to CPU
`dinov2`	✅	✅	✅	✅	✅	✅	✅	`DinoV2Runner` `--device`
`sam`, `sam2`, `sam3`	✅	✅	✅	✅	✅	✅	✅	SAM v1 also accepts `tpu`; CPU/Metal/MLX most exercised in CI
`qwen3`	✅	✅	✅	✅	✅	✅	✅	packed GGUF: CPU + Metal native; MLX/wgpu/CUDA prefill via CPU path (`rlx_core::packed_gguf_*`); MTP decode not wired
`qwen35`	✅	✅	✅	✅	✅	✅	✅	`--device` on all backends; some ops use host GDN/dequant on GPU; MoE offload may keep experts on host
`llama32`	✅	✅	✅	✅	✅	✅	✅	`rlx-llama32` 0.2.1: Metal decode guard + packed GGUF helpers; same packed rules as Qwen3
`minicpm5`	✅	✅	⚠️	⚠️	⚠️	⚠️	⚠️	Wraps `rlx-llama32`; safetensors decode on CPU/Metal; GGUF `--packed` parity on CPU/Metal (MLX/wgpu tests use CPU prefill path)
`llada2`	✅	✅	✅	✅	✅	✅	✅	MoE predictive expert offload on all standard backends (GPU uses resident experts + host fallback)
`flux2`	✅	✅	✅	✅	✅	✅	✅	Full pipeline; text encoder is compiled on Metal/MLX by default and runs once on the host on CUDA/ROCm/WGPU/Vulkan
`vjepa2`	✅	✅	✅	✅	✅	✅	✅	Runner `--device`
`wav2vec2-bert`	✅	✅	✅	✅	✅	✅	✅	`Wav2Vec2BertRunner` `--device`

Multi-tenant serving (paged KV, continuous batching) lives in rlx_runtime::paged_kv; qwen3::generator is single-stream.

Gotchas

Safetensors tensor names differ from IR Param names: weight_map.rs handles the renaming, and GGUF files go through GgufLoader.
GGUF LMs (qwen3, qwen35, llama32, minicpm5): pass a .gguf file or a directory with one .gguf / model.safetensors. Wrong-family files get a redirect (rlx_core::assert_gguf_family). Shared helpers: resolve_weights_file, WeightFormat::resolve, open_loader_resolved. MiniCPM5 expects general.architecture = llama and HF model_type = llama.
Packed GGUF prefill (--packed, K-quant): use rlx_core::{packed_gguf_compile_guard, compile_options_for_packed_gguf_prefill_with_profile, packed_gguf_execution_device} in rlx-llama32, rlx-qwen3, rlx-gemma, and rlx-minicpm5. Metal sets RLX_DISABLE_MPSGRAPH=1 during compile; MLX uses RLX_MLX_MODE=lazy (host GGUF dequant); wgpu/CUDA/ROCm disable fusion and may run prefill on CPU until upstream GPU parity.
GGUF elsewhere on HF (embed, FLUX, SAM3, …) does not imply rlx support; see the "GGUF on Hugging Face" column under Weights and parity.
GGUF shapes are innermost-first labels; byte layout matches safetensors row-major — do not transpose in take.
Unsupported GGUF quants (Q1_0, Q2_K, IQ*, …) error cleanly.
27B GGUF on Mac: F32 dequant ≈ 108 GB; needs Metal Op::DequantMatMul to stay packed (~13.5 GB).
Pooling lives in embed::pooling.
New arch: new crate under crates/, facade hook, optional parity test.
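
Two of the gotchas above reduce to small calculations, sketched here with hypothetical helpers (neither function exists in rlx-core). `row_major_shape` only relabels GGUF's innermost-first dimensions and leaves the bytes alone. `weight_gigabytes` reproduces the 27B figures under the assumption of decimal gigabytes and roughly 4 bits per packed weight; real K-quant blocks also store scales, so actual files are somewhat larger.

```rust
/// GGUF lists dimensions innermost-first. Reversing the labels gives the
/// row-major (outermost-first) shape used by safetensors. Only the labels
/// change; the underlying bytes are not transposed or copied.
fn row_major_shape(gguf_dims: &[u64]) -> Vec<u64> {
    gguf_dims.iter().rev().copied().collect()
}

/// Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)
/// for `params` weights stored at `bits_per_weight` bits each.
fn weight_gigabytes(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}
```

For a 27B-parameter model, `weight_gigabytes(27e9, 32.0)` gives about 108 GB for an F32 dequant, and `weight_gigabytes(27e9, 4.0)` gives about 13.5 GB when the weights stay packed. A GGUF tensor labelled `[4096, 11008]` corresponds to a row-major shape of `[11008, 4096]` over the same bytes.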

Per-crate READMEs

Model-specific runbooks live next to each crate. Agent quick reference: AGENTS.md.

Crate	README
`rlx-qwen3-tts`	crates/rlx-qwen3-tts/README.md
`rlx-gemma`	crates/rlx-gemma/README.md
`rlx-minicpm5`	crates/rlx-minicpm5/README.md
`rlx-llama32`	crates/rlx-llama32/README.md
`rlx-locateanything`	crates/rlx-locateanything/README.md
`rlx-kittentts`	crates/rlx-kittentts/README.md
`rlx-vad`	crates/rlx-vad/README.md
`rlx-mamba`	crates/rlx-mamba/README.md
`rlx-ssm`	crates/rlx-ssm/README.md
`rlx-models-core` (`rlx-core`)	crates/rlx-models-core/README.md
`rlx-clinicalbert`	crates/rlx-clinicalbert/README.md
`rlx-onnx-import`	crates/rlx-onnx-import/README.md
`rlx-onnx-decompose`	crates/rlx-onnx-decompose/README.md
`kitten_tts_mini_rlx`	crates/kitten_tts_mini_rlx/README.md
Voxtral TTS training	docker/voxtral-tts/README.md

Crates without a dedicated README are documented in What's here and the facade examples under crates/rlx-models/examples/.

License

GPL-3.0-only.

rlx-models 0.2.4
