rlx-models 0.2.4

Model loading for RLX — config parsing, safetensors weights, graph builders
# rlx-models

Concrete model graph builders + weight loaders for RLX — the "what actually runs" layer.

Standalone repo: [github.com/MIT-RLX/rlx-models](https://github.com/MIT-RLX/rlx-models). Clone next to [`rlx`](https://github.com/MIT-RLX/rlx):

```text
rlx-workspace/
  rlx/          # github.com/MIT-RLX/rlx
  rlx-models/   # github.com/MIT-RLX/rlx-models
  candle/       # optional, for parity-candle
```

```bash
git clone https://github.com/MIT-RLX/rlx.git
git clone https://github.com/MIT-RLX/rlx-models.git
cd rlx-models && cargo test -p rlx-models
```

The RLX monorepo lists `../rlx-models/crates/rlx-models` as a workspace member; you can also run `cd rlx && cargo test -p rlx-models` there.

Agent-oriented quick reference: [AGENTS.md](AGENTS.md).

## Contents

- [Architecture](#architecture)
- [Running models](#running-models)
- [What's here](#whats-here)
- [Install](#install)
- [Quickstart — embeddings](#quickstart--embeddings)
- [High-level runner API](#high-level-runner-api)
- [Adding a new model](#adding-a-new-model)
- [Compile profiles](#compile-profiles-tier-1)
- [Qwen3](#qwen3)
- [MiniCPM5](#minicpm5)
- [LocateAnything](#locateanything)
- [Qwen3-TTS](#qwen3-tts)
- [Voxtral TTS](#voxtral-tts)
- [VAD (Earshot + Silero)](#vad-earshot--silero)
- [Build and test](#build-and-test)
- [Status](#status)
- [Gotchas](#gotchas)
- [Per-crate READMEs](#per-crate-readmes)
- [License](#license)

## Architecture

This repo is a **Cargo workspace**: one library crate per model family under `crates/`, plus shared infrastructure. The `rlx-models` package is a thin **facade** that re-exports historical paths (`rlx_models::qwen3`, `rlx_models::sam`, …).

```text
rlx-models/
├── Cargo.toml              # workspace members + [workspace.dependencies]
├── justfile                # shortcuts (optional)
├── crates/
│   ├── rlx-models-core/    # config, weight_map, flow_bridge (package `rlx-core`)
│   ├── rlx-ssm/            # SSM flow stages + custom ops (Mamba, LFM, …)
│   ├── rlx-cli/            # shared CLI + rlx-inspect
│   ├── rlx-<model>/        # one crate per family
│   └── rlx-models/         # facade + optional rlx-run multiplexer
└── crates/rlx-models/examples/   # integration templates
```

### Crates

| Crate | Model / role |
|---|---|
| `rlx-models-core` (`rlx-core`) | config, `weight_map`, `weight_loader`, `flow_bridge`, `flow_util` |
| `rlx-ssm` | SSM flow stages (`MambaScanStage`, decode-step custom ops) |
| `rlx-mamba` | Mamba1 block + multi-backend driver |
| `rlx-bert` | BERT |
| `rlx-nomic` | NomicBERT |
| `rlx-vision` | NomicVision |
| `rlx-dinov2` | DINOv2 |
| `rlx-embed` | embedding runtime |
| `rlx-sam` / `rlx-sam2` / `rlx-sam3` | SAM family |
| `rlx-sam-ir` | shared mask-decoder IR |
| `rlx-qwen3` | Qwen3 LM |
| `rlx-qwen35` | Qwen3.5 / 3.6 |
| `rlx-llama32` | LLaMA 3.2 |
| `rlx-minicpm5` | MiniCPM5 (Llama-shaped; [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B)) |
| `rlx-gemma` | Gemma / Gemma 2 |
| `rlx-llada2` | LLaDA2 + TIDE offload |
| `rlx-flux2` | FLUX.2 |
| `rlx-vjepa2` | V-JEPA 2 |
| `rlx-wav2vec2-bert` | Wav2Vec2-BERT |
| `rlx-whisper` | OpenAI Whisper ASR |
| [`rlx-vad`](crates/rlx-vad/README.md) | Earshot + Silero VAD (embedded weights, 16 kHz) |
| `rlx-voxtral` | Mistral Voxtral speech LM |
| `rlx-voxtral-tts` | Voxtral-4B-TTS inference (codec + Ministral LM) |
| `rlx-voxtral-tts-train` | Native RLX voice-clone training (encoder + LoRA) |
| [`rlx-qwen3-tts`](crates/rlx-qwen3-tts/README.md) | Qwen3-TTS — voice clone + CustomVoice TTS, progressive streaming, duplex voice chat (Whisper + Qwen3 LM). [JFK samples + roundtrip audio](crates/rlx-qwen3-tts/examples/audio/) ship in the crate. |
| `rlx-locateanything` | NVIDIA LocateAnything-3B VLM (grounding) |
| `rlx-cli` | shared CLI helpers + `rlx-inspect` |
| `rlx-models` | facade (re-exports) + optional `rlx-run` multiplexer |

### How to depend

| Goal | Depend on |
|------|-----------|
| One model only (fast builds) | `rlx-qwen3`, `rlx-sam3`, … |
| Stable `rlx_models::qwen3` paths | `rlx-models` facade |
| CLI / inspect only | `rlx-cli` |

New code that only needs Qwen3 should depend on `rlx-qwen3` directly.

### Per-crate binaries

Each model crate with a CLI has `src/cli.rs` (`pub fn run`) and `src/bin/rlx-<name>.rs`. Shared flag parsing lives in `rlx-cli`.

**`rlx-run`** (in `rlx-models`) is an optional multiplexer over all built-in CLIs. Prefer per-crate binaries when you only need one family — they link less and compile faster.

**SAM unified runner:** `SamRunner` (SAM1/2/3) stays on the facade (`rlx-models/src/sam_runner.rs`) because `rlx-sam2` depends on `rlx-sam`. Per-arch CLIs are on `rlx-sam`, `rlx-sam2`, `rlx-sam3`.

Published `rlx*` crates (`rlx-runtime`, `rlx-flow`, …) are pinned at **0.2.4** in the root `[workspace.dependencies]`; every crate uses `{ workspace = true }`.

- **Local dev** with a sibling `../rlx` checkout: `cp .cargo/config.toml.example .cargo/config.toml` (gitignored patches).
- **Publish / CI** uses crates.io only: no `.cargo/config.toml`, no `[patch.crates-io]` in the committed `Cargo.toml`.

## Running models

### just (shortcuts)

Install [just](https://github.com/casey/just) (`brew install just`). From the repo root:

```sh
just                          # list recipes
just qwen3 -- --weights model.gguf --prompt-ids 1,2,3
just inspect weights/model.gguf
just qwen3-metal -- --weights model.gguf --device metal --prompt-ids 1,2,3
just fetch-minicpm5
just minicpm5 -- --weights /tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors --device cpu --prompt-ids 1,42 --max-tokens 16
just minicpm5-chat "Hello from MiniCPM5"
```

Pass model CLI flags after `--`. MiniCPM5 details: [crates/rlx-minicpm5/README.md](crates/rlx-minicpm5/README.md) and [MiniCPM5](#minicpm5). GPU backends: `just features=all-backends qwen3 -- --device metal`, `just qwen35-all-backends -- …`, or per-crate `qwen3-all-backends` / `qwen35-all-backends`.

### Per-crate binaries (recommended)

| Binary | Crate | Example |
|--------|-------|---------|
| `rlx-qwen3` | `rlx-qwen3` | `cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- --weights model.gguf --prompt-ids 1,2,3` |
| `rlx-qwen35` | `rlx-qwen35` | `cargo run -p rlx-qwen35 --bin rlx-qwen35 --release -- …` |
| `rlx-llama32` | `rlx-llama32` | `cargo run -p rlx-llama32 --bin rlx-llama32 --release -- …` |
| `rlx-minicpm5` | `rlx-minicpm5` | `cargo run -p rlx-minicpm5 --features tokenizer --release -- --weights …/model.safetensors --prompt-ids 1,42` |
| `rlx-gemma` | `rlx-gemma` | `cargo run -p rlx-gemma --bin rlx-gemma --release -- --weights model.gguf --prompt-ids 1,2,3` |
| `rlx-dinov2` | `rlx-dinov2` | `cargo run -p rlx-dinov2 --bin rlx-dinov2 --release -- …` |
| `rlx-vjepa2` | `rlx-vjepa2` | `cargo run -p rlx-vjepa2 --bin rlx-vjepa2 --release -- …` |
| `rlx-wav2vec2-bert` | `rlx-wav2vec2-bert` | `cargo run -p rlx-wav2vec2-bert --bin rlx-wav2vec2-bert --release -- …` |
| `rlx-whisper` | `rlx-whisper` | `cargo run -p rlx-whisper --bin rlx-whisper --release -- --weights model.safetensors --wav audio16k.wav` |
| `rlx-vad` | `rlx-vad` | `cargo run -p rlx-vad --release -- --backend silero --wav audio16k.wav` ([docs](crates/rlx-vad/README.md)) |
| `rlx-voxtral` | `rlx-voxtral` | `cargo run -p rlx-voxtral --bin rlx-voxtral --release -- --weights model_dir --wav audio16k.wav --transcribe` |
| `rlx-voxtral-tts` | `rlx-voxtral-tts` | `just voxtral-tts -- --model-dir DIR --text "Hello" --voice neutral_female -o out.wav` |
| `rlx-voxtral-tts-train` | `rlx-voxtral-tts-train` | `just voxtral-tts-train-production -- --model-dir DIR --wav-dir WAVS --device auto` |
| `rlx-locateanything` | `rlx-locateanything` | `cargo run -p rlx-locateanything --bin rlx-locateanything --release -- --model-dir DIR --dry` |
| `rlx-sam1` | `rlx-sam` | `cargo run -p rlx-sam --bin rlx-sam1 --release -- …` |
| `rlx-sam2` | `rlx-sam2` | `cargo run -p rlx-sam2 --bin rlx-sam2 --release -- …` |
| `rlx-sam3` | `rlx-sam3` | `cargo run -p rlx-sam3 --bin rlx-sam3 --release -- …` |
| `rlx-flux2` | `rlx-flux2` | `cargo run -p rlx-flux2 --bin rlx-flux2 --release -- …` |
| `rlx-flux2-serve` | `rlx-flux2` | JSON-lines server on stdin |
| `rlx-inspect` | `rlx-cli` | `cargo run -p rlx-cli --bin rlx-inspect -- model.gguf` |

Flags match the corresponding `rlx-run` subcommand (without the subcommand name).

### Multiplexer (`rlx-run`)

```sh
cargo run -p rlx-models --bin rlx-run --release --features metal -- \
    qwen3 --weights Qwen3-0.6B-Q4_K_M.gguf --device metal --prompt-ids 1,17,42

cargo run -p rlx-models --bin rlx-run -- inspect Qwen3-0.6B-Q4_K_M.gguf
```

`rlx-inspect` dumps format, tensor count, dtype histogram, GGUF metadata, MTP heads, and multi-`.gguf` dir hints (`--prefer Q4_K_M`).

### Custom CLI

Downstream tools can register runners without forking `rlx-models`:

```rust
use rlx_cli::{dispatch, register_cli};

register_cli("my-model", "…", |args| { /* … */ });
dispatch(&argv)?;
```

See `crates/rlx-models/examples/register_custom_runner.rs`.
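
The register-then-dispatch pattern can be sketched with the standard library alone. This is a hypothetical stand-in, not the `rlx-cli` implementation: the `Registry` type, the `Handler` signature, and `run_my_model` are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Handler signature assumed for this sketch: it receives the flags that
/// follow the subcommand name.
type Handler = fn(&[String]) -> Result<(), String>;

/// Hypothetical registry; the real `rlx-cli` internals are not shown in this README.
#[derive(Default)]
struct Registry {
    entries: BTreeMap<String, (String, Handler)>,
}

impl Registry {
    /// Store a handler and its help text under a subcommand name.
    fn register(&mut self, name: &str, help: &str, handler: Handler) {
        self.entries.insert(name.to_string(), (help.to_string(), handler));
    }

    /// `argv[0]` selects the subcommand; the remaining arguments pass through untouched.
    fn dispatch(&self, argv: &[String]) -> Result<(), String> {
        let (name, rest) = argv.split_first().ok_or("missing subcommand")?;
        match self.entries.get(name) {
            Some((_, handler)) => handler(rest),
            None => Err(format!("unknown subcommand `{name}`")),
        }
    }

    /// One line per registered subcommand, sorted by name.
    fn help(&self) -> String {
        self.entries
            .iter()
            .map(|(name, (help, _))| format!("{name}  {help}"))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

/// Illustrative runner: it only checks that `--weights` was supplied.
fn run_my_model(args: &[String]) -> Result<(), String> {
    if args.iter().any(|a| a == "--weights") {
        Ok(())
    } else {
        Err("--weights is required".to_string())
    }
}
```

Keeping handlers as plain function pointers means a downstream binary can add its own entry before calling dispatch, which is the property the section above relies on.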

### Examples (facade)

Integration templates on the `rlx-models` package:

```sh
cargo run -p rlx-models --example run_qwen3_gguf --release -- [args]
just example-qwen3-gguf -- /path/to/model.gguf
```

| File | What it does |
|---|---|
| `run_qwen3_safetensors.rs` | Qwen3 from HF safetensors, builder API, streaming greedy decode |
| `run_qwen3_gguf.rs` | Same from `.gguf` (Q4_K_M / Q5_K_M / Q6_K), MTP head detection |
| `run_sam1.rs` | SAM 1 — encode image, prompt encoder + mask decoder |
| `run_sam2.rs` | SAM 2 — FPN + memory attention |
| `run_sam3.rs` | SAM 3 — text-conditioned detection + masks |
| `qwen3_gguf_inference.rs` | Detailed Qwen3 GGUF walk-through |
| `gguf_qwen3_probe.rs` | Validate `hf_to_gguf_name` against a real GGUF |
| `qwen3_matrix.rs` | (B, L, mode) × (CPU, Metal, MLX, wgpu) parity + perf vs candle |
| `minicpm5_download.rs` | Fetch [openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) safetensors (`hf-download`) |
| `minicpm5_gguf_download.rs` | Fetch GGUF quants (Q4_K_M / Q8_0 / F16) |
| `run_minicpm5.rs` | `MiniCpm5Runner` prefill + greedy decode from safetensors |
| `minicpm5_forward_bench.rs` | Wall-clock prefill/decode across backends (real 1B weights) |
| `minicpm5_chat.py` | HF chat template → `rlx-minicpm5` (`just minicpm5-chat`) |

#### Qwen3-TTS samples

Audio and charts live in [`crates/rlx-qwen3-tts`](crates/rlx-qwen3-tts/README.md). **Duplex voice chat** (bundled question → JFK-clone reply):

<video controls preload="metadata" src="crates/rlx-qwen3-tts/examples/audio/voice_chat_question.mp4"></video>

<video controls preload="metadata" src="crates/rlx-qwen3-tts/examples/audio/voice_chat_reply.mp4"></video>

![Voice-chat roundtrip latency](crates/rlx-qwen3-tts/examples/charts/voice_chat_latency.svg)

Three **JFK voice-clone** clips (`ask_not`, `moon`, `rlx_intro`) — ECAPA cosine 0.95+, WER 0–3.8 %. Full metrics, streaming API, and `just voice-chat-demo`: [crate README](crates/rlx-qwen3-tts/README.md).

SAM examples synthesize a 1024×1024 RGB gradient — swap in `image::open(path)` for real images.

```sh
just fetch-minicpm5
just example run_minicpm5 --release
```

### Weight fetch (optional)

`docker/qwen3-fetch/` — container pulls HF checkpoints into `./weights`; host runs `cargo test` / benches natively.

```sh
just fetch-qwen3
# or: docker build -t rlx-qwen3-fetch docker/qwen3-fetch && …
just fetch-minicpm5
just fetch-minicpm5-gguf Q4_K_M
```

## What's here

- **`qwen3`** — Qwen3 decoder LM (GQA, QK-norm, RoPE, SwiGLU, tied embeddings). Safetensors + GGUF; optional `qwen3.rlx.toml`. See [Qwen3](#qwen3).
- **`qwen35`** — Qwen3.5 / 3.6 hybrid (gated DeltaNet + periodic attention + optional MTP). GGUF via `Qwen35Runner`; optional `qwen35.rlx.toml`. Parity: `examples/qwen35_compare.rs` with the llama.cpp reference script in `examples/`.
- **`gemma`** — Gemma / Gemma 2 / 3 / 4 (GQA, RoPE, GeGLU, tied weights, Gemma2 softcap). Safetensors + GGUF; optional `gemma.rlx.toml`. See [crates/rlx-gemma/README.md](crates/rlx-gemma/README.md). CLI: `rlx-gemma` / `rlx-run gemma`. Parity: `just test-gemma-parity gemma2_synthetic`; backends: `just features=all-backends test-gemma-backends`.
- **`bert`** — BERT graph builder (MiniLM, BGE, all-MiniLM-L6-v2).
- **`nomic`** — NomicBERT (RoPE + SwiGLU).
- **`vision`** — NomicVision-style encoders.
- **`dinov2`** — DINOv2 ViT (B/14, L/14, g/14).
- **`sam`**, **`sam2`**, **`sam3`** — Segment Anything encoders + mask decoders. Optional `sam.rlx.toml` next to weights (reference: `crates/rlx-sam/src/sam.rlx.toml`).
- **`flux2`** — FLUX.2 rectified-flow denoiser. `rlx-flux2` CLI; presets `flux2_dev()`, `flux2_klein_4b()`, `flux2_klein_9b()`. VAE, CFG, img2img, LoRA, `hf-download`, `rlx-flux2-serve`. GPU backends via `rlx-models` features (`metal`, `cuda`, …).
- **`embed`** — `RlxEmbed`, registry, tokenizers, pooling. `from_pretrained` with `hf-download`.
- **`config`**, **`weight_loader`** — HF config parsing; `WeightMap` + `GgufLoader` (K-quants, MTP isolation).
- **`mamba`** — Mamba1 SSM block (`rlx-mamba`); SSM via `rlx-ssm` + `SelectiveScan`. See [crates/rlx-mamba/README.md](crates/rlx-mamba/README.md).
- **`lfm`**, **`minimax`**, **`nemotron`** — hybrid runners using `rlx-ssm` decode-step stages.
- **`minicpm5`** — MiniCPM5 edge LMs (Llama-shaped 1B). Wraps `Llama32Runner`; safetensors + GGUF. See [MiniCPM5](#minicpm5) and [crates/rlx-minicpm5/README.md](crates/rlx-minicpm5/README.md).
- **`qwen3-tts`** — Qwen3-TTS Base (voice clone) + CustomVoice. ECAPA x-vector, 28-layer talker, 16-group code predictor, 12 Hz Mimi decode. [`VoiceClone`](crates/rlx-qwen3-tts/README.md#library-api) API, progressive streaming, and `bidirectional_voice_chat` (Whisper → Qwen3-0.6B → TTS). See [Qwen3-TTS](#qwen3-tts).
- **`voxtral-tts`** — Voxtral-4B-TTS native inference (Tekken tokenizer, codec decode, compiled LM). **`voxtral-tts-train`** — RLX autodiff training for reference-audio cloning (codec encoder + full attention LoRA). See [Voxtral TTS](#voxtral-tts).
- **`run`** — `Qwen3Runner`, `SamRunner`, … builders for one-call inference.

## Install

```toml
[dependencies]
rlx-models = "0.2"
```

HF-hub download:

```toml
rlx-models = { version = "0.2", features = ["hf-download"] }
```

## Quickstart — embeddings

```rust
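use rlx_models::embed::{Pooling, RlxEmbed};

// `ids` and `mask` hold the tokenized input and its attention mask (tokenization not shown).
let mut model = RlxEmbed::from_pretrained("sentence-transformers/all-MiniLM-L6-v2")?;
// The trailing `1, 16` are presumably batch size and sequence length.
let hidden = model.forward(&[("input_ids", &ids), ("attention_mask", &mask)], 1, 16)?;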
```

## High-level runner API

`rlx_models::run` exposes builder-style entry points (also `rlx::run` in the monorepo):

```rust
use rlx_models::run::{Qwen3Runner, Precision};
use rlx_runtime::Device;

let mut runner = Qwen3Runner::builder()
    .weights("Qwen3-0.6B-Q4_K_M.gguf")
    .device(Device::Metal)
    .max_seq(128)
    .precision(Precision::F32)
    .max_memory_gb(16.0)
    .stream(true)
    .use_mtp(false)
    .packed_weights(false)
    .build()?;

runner.generate(&prompt_ids, 32, |tok| print!("{tok} "))?;
```
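
To make the streaming callback concrete, here is a minimal greedy-decode loop using only the standard library. It is a sketch under assumptions, not `Qwen3Runner` internals: `greedy_generate` and the `next_logits` closure are hypothetical, and the loop re-feeds the whole context each step instead of using a KV cache.

```rust
/// Index of the largest logit (first one wins on ties; 0 for an empty slice).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy decoding: pick the arg-max token each step and hand it to `on_token`
/// as soon as it is known. `next_logits` is a hypothetical model closure that
/// maps the full context to next-token logits.
fn greedy_generate<M, F>(
    mut next_logits: M,
    prompt: &[u32],
    max_tokens: usize,
    mut on_token: F,
) -> Vec<u32>
where
    M: FnMut(&[u32]) -> Vec<f32>,
    F: FnMut(u32),
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_tokens {
        let logits = next_logits(&tokens);
        let tok = argmax(&logits);
        on_token(tok); // streamed before the next step runs
        tokens.push(tok);
    }
    // Return only the newly generated tokens.
    tokens[prompt.len()..].to_vec()
}
```

The callback fires once per generated token, so a caller can print or forward tokens without waiting for the full sequence.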

**Packed weights** (large GGUF on limited RAM — CPU-only, memory-frugal, slower):

```rust,ignore
let mut runner = Qwen3Runner::builder()
    .weights("Qwen3-14B-Q4_K_M.gguf")
    .packed_weights(true)
    .max_seq(128)
    .build()?;
runner.generate(&prompt_ids, 16, |tok| print!(" {tok}"))?;
let logits = runner.predict_logits(&prompt_ids)?;
```

Format (`safetensors` vs `gguf`) is auto-detected. SAM uses `SamRunner::builder(SamArch::Sam2)`.
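
One way such auto-detection can work is by sniffing the first bytes of the file. The sketch below relies on two facts about the formats themselves: GGUF files start with the ASCII magic `GGUF`, and safetensors files start with a little-endian `u64` header length followed by a JSON object. Whether RLX detects formats this way or by file extension is not stated here; `WeightFormat` and `sniff_format` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum WeightFormat {
    Gguf,
    Safetensors,
}

/// Guess the container format from the first bytes of a file.
/// Returns `None` when neither signature matches.
fn sniff_format(head: &[u8]) -> Option<WeightFormat> {
    // GGUF: 4-byte ASCII magic at offset 0.
    if head.starts_with(b"GGUF") {
        return Some(WeightFormat::Gguf);
    }
    // safetensors: u64 little-endian header length, then a JSON object.
    if head.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        let header_len = u64::from_le_bytes(len_bytes);
        if header_len > 0 && head[8] == b'{' {
            return Some(WeightFormat::Safetensors);
        }
    }
    None
}
```

Nine bytes are enough for either check, so a caller only needs to read a small prefix of the file.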

CLI equivalent:

```sh
just qwen3 -- --weights Qwen3-14B-Q4_K_M.gguf --packed --max-seq 128 --max-tokens 16 --prompt-ids 1,17,42
# or: cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- …
```

## Adding a new model

Borrowed from Max's four-file layout; each architecture is a workspace crate `crates/rlx-<name>/`.

### 1. Create the crate

Root `Cargo.toml`:

```toml
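# Add to the existing [workspace] table:
[workspace]
members = [
    # … existing members …
    "crates/rlx-myarch",
]

# Add to the existing [workspace.dependencies] table:
[workspace.dependencies]
rlx-myarch = { path = "crates/rlx-myarch" }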
```

Depend on `rlx-core`, `rlx-ir`, `rlx-flow`, `rlx-runtime` as needed.

### 2. Source layout

```text
crates/rlx-myarch/src/
├── lib.rs
├── arch.rs       # ArchSpec registration (optional)
├── config.rs     # HF config.json
├── weights.rs    # HF → RLX name map
├── builder.rs    # graph construction
├── flow.rs       # compile helpers (optional split)
└── cli.rs        # pub fn run(args: &[String])
```

`arch.rs` registers with `rlx_core::arch_registry`. `weights.rs` holds rename rules; `builder.rs` emits IR. Reference: `crates/rlx-qwen3`.
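
As an illustration of what a rename table in `weights.rs` might look like, the sketch below maps a few Hugging Face tensor names to llama.cpp-style GGUF names (`blk.N.*`). It is an assumption-laden example, not the crate's `hf_to_gguf_name`: the function name is hypothetical, only a subset of tensors is covered, and RLX's internal names are not documented in this README.

```rust
/// Map a Hugging Face tensor name to a llama.cpp-style GGUF name.
/// Covers only a small, illustrative subset; unknown names yield `None`.
fn hf_to_gguf_name_sketch(hf: &str) -> Option<String> {
    // Tensors that live outside the per-layer blocks.
    match hf {
        "model.embed_tokens.weight" => return Some("token_embd.weight".to_string()),
        "model.norm.weight" => return Some("output_norm.weight".to_string()),
        "lm_head.weight" => return Some("output.weight".to_string()),
        _ => {}
    }
    // Per-layer tensors: `model.layers.<N>.<tail>` becomes `blk.<N>.<mapped>`.
    let rest = hf.strip_prefix("model.layers.")?;
    let (layer, tail) = rest.split_once('.')?;
    let layer: usize = layer.parse().ok()?;
    let mapped = match tail {
        "self_attn.q_proj.weight" => "attn_q.weight",
        "self_attn.k_proj.weight" => "attn_k.weight",
        "self_attn.v_proj.weight" => "attn_v.weight",
        "self_attn.o_proj.weight" => "attn_output.weight",
        "input_layernorm.weight" => "attn_norm.weight",
        "post_attention_layernorm.weight" => "ffn_norm.weight",
        "mlp.gate_proj.weight" => "ffn_gate.weight",
        "mlp.up_proj.weight" => "ffn_up.weight",
        "mlp.down_proj.weight" => "ffn_down.weight",
        _ => return None,
    };
    Some(format!("blk.{layer}.{mapped}"))
}
```

Returning `None` for unknown names lets the caller decide whether an unmapped tensor is an error or something to skip.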

### 3. Facade re-export

In `crates/rlx-models/src/lib.rs`:

```rust
pub mod myarch {
    pub use rlx_myarch::*;
}
```

### 4. CLI (optional)

- `cli.rs` + `[[bin]] name = "rlx-myarch"`
- Register in `crates/rlx-models/src/bin/rlx_run.rs`: `register_cli("myarch", "…", rlx_myarch::cli::run)`
- Add a `just` recipe in `justfile` (optional)

### 5. High-level runner (optional)

Put `MyArchRunner` in the model crate; re-export from `crates/rlx-models/src/run.rs`.

Legacy flat modules (`rlx-bert`, `rlx-nomic`) stay as-is until they grow — use this layout for **new** architectures.

## Compile profiles (tier-1)

Compile through tier-1 profiles, not bare `Session::compile(graph)`:

| Model | Profile helper | Optional file next to weights |
|---|---|---|
| Qwen3 | `flow_util::compile_graph_qwen3_prefill_with_params` | `qwen3.rlx.toml` |
| Qwen3.5 | `compile_support::compile_qwen35_prefill` / `compile_qwen35_decode` | `qwen35.rlx.toml` |
| SAM / SAM3 | `flow_util::compile_graph_sam_with_params` | `sam.rlx.toml` |
| Encoders | `flow_util::compile_graph_encoder_with_params` | — |

Synthetic Qwen3.5 weights for CPU checks: `rlx_models::qwen35::synth` (`tiny_cfg`, `medium_cfg`, `bench_cfg`, …).

```sh
just test-quick
# cargo test -p rlx-models --test qwen35_forward_check --test compile_profile_quick_check
```

Real-GGUF / backend checks: set `QWEN35_GGUF_PATH` (LMs) or the vision env vars (`SAM3_GGUF_PATH`, `DINOV2_GGUF_PATH`, `FLUX_GGUF_PATH`, `W2V_BERT_GGUF_PATH`), then run:

- **Drain:** `cargo test -p rlx-models --test vision_gguf_load --release`
- **Compile quick check:** `cargo test -p rlx-models --test vision_gguf_compile --release` (SAM3 also needs `VISION_GGUF_COMPILE=1`; W2V-BERT needs `RLX_W2V_BERT_DIR` with `config.json`)
- **FLUX:** `cargo test -p rlx-models --test flux2_gguf_runner_quick_check --release` (`FLUX_GGUF_PATH` / `FLUX_MODEL_ROOT`; optional `FLUX_VAE_DIR` for VAE encode)
- **Q4_0 fused matmul:** `cargo test -p rlx-models --test gguf_legacy_quant_matmul --release`; Metal parity: `GGUF_LEGACY_METAL_PARITY=1` with `--features metal`

Enable `metal` / `mlx` / `cuda` / `parity-llama` per test file where noted.

## Qwen3

Prefill + decode on all seven standard backends (CPU, Metal, MLX, CUDA, ROCm, WGPU, Vulkan). Enable matching features at build time (`cargo build -p rlx-qwen3 --features all-backends`). Synthetic checks: `just features=all-backends test-qwen3-backends`. Parity: 100% top-1 vs HF (`tests/qwen3_parity.rs`).
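
For readers unfamiliar with the metric, top-1 parity can be read as the fraction of positions where two sets of logits pick the same arg-max token. The sketch below shows that computation; it is an interpretation rather than the code in `tests/qwen3_parity.rs`, and `top1_agreement` is a hypothetical helper.

```rust
/// Index of the largest value in a row, or `None` for an empty row.
fn argmax(row: &[f32]) -> Option<usize> {
    row.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Fraction of positions where both logit sets choose the same top token.
/// Returns `None` if the inputs are empty or differ in length.
fn top1_agreement(ours: &[Vec<f32>], reference: &[Vec<f32>]) -> Option<f64> {
    if ours.is_empty() || ours.len() != reference.len() {
        return None;
    }
    let mut hits = 0usize;
    for (a, b) in ours.iter().zip(reference) {
        if argmax(a)? == argmax(b)? {
            hits += 1;
        }
    }
    Some(hits as f64 / ours.len() as f64)
}
```

A result of `1.0` corresponds to the 100% figure quoted above.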

### Safetensors

```rust
use rlx_models::qwen3::{Qwen3Config, build_qwen3_graph_sized_last_logits};
use rlx_models::weight_map::WeightMap;
use rlx_runtime::Device;

let cfg = Qwen3Config::from_file("weights/Qwen3-0.6B/config.json".as_ref())?;
let mut wm = WeightMap::from_file("weights/Qwen3-0.6B/model.safetensors")?;
let (graph, params) = build_qwen3_graph_sized_last_logits(&cfg, &mut wm, 1, 128, false)?;
let mut compiled = rlx_models::flow_util::compile_graph_qwen3_prefill_with_params(
    Device::Metal, graph, params,
)?;
```

### GGUF

```rust
use rlx_models::weight_loader::GgufLoader;
let mut wm = GgufLoader::from_file("Qwen3-0.6B-Q4_K_M.gguf")?;
// same compile + run as safetensors
```

Demo: `just example-qwen3-gguf -- path/to/model.gguf`. Verified vs `unsloth/Qwen3-0.6B-GGUF` (cosine ≈ 0.976 vs F32 safetensors on Q4_K_M).
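
The cosine figure is the standard cosine similarity, cos(a, b) = a·b / (‖a‖ ‖b‖). The README does not say which tensors were compared (final logits seem likely, but that is an assumption), and `cosine` below is a hypothetical helper, not an RLX function.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns `None` for empty, mismatched, or zero-norm inputs.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}
```

A value near 1.0 means the quantized outputs point in almost the same direction as the F32 reference, even if their magnitudes differ.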

**Directories with several `.gguf` files:** pass `ResolveWeightsOptions { prefer_gguf_substring: Some("Q4_K_M"), .. }` or `gguf_index: Some(0)` (see `rlx_core::gguf_support`). Multi-part split GGUF (`split.count` > 1) auto-merges when all shards sit in the same directory; otherwise `rlx-inspect` lists missing parts.
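
The selection rule for multi-file directories can be sketched as a pure function over file names. The precedence used here (explicit index first, then substring match, then the first file in sorted order) is an assumption for illustration; `pick_gguf` is hypothetical and does not reproduce `ResolveWeightsOptions` or `rlx_core::gguf_support`.

```rust
/// Choose one `.gguf` file from a directory listing.
/// Assumed precedence: explicit index, then preferred substring, then first sorted name.
fn pick_gguf<'a>(files: &[&'a str], prefer: Option<&str>, index: Option<usize>) -> Option<&'a str> {
    // Keep only GGUF files and sort them so indices are stable.
    let mut ggufs: Vec<&'a str> = files
        .iter()
        .copied()
        .filter(|f| f.ends_with(".gguf"))
        .collect();
    ggufs.sort_unstable();

    if let Some(i) = index {
        return ggufs.get(i).copied();
    }
    if let Some(needle) = prefer {
        if let Some(hit) = ggufs.iter().copied().find(|f| f.contains(needle)) {
            return Some(hit);
        }
    }
    ggufs.first().copied()
}
```

Sorting before indexing keeps the choice independent of directory iteration order.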

### Weights API (model-agnostic loader)

**`rlx_core::weights`** only handles paths, file formats, and drain policy. It does **not** know about Qwen, FLUX, BERT, etc.

```rust
use rlx_core::weights::{self, LoadOpts};

let (path, map) = weights::open_map("weights/")?;
let (path, map) = weights::open_map_with(LoadOpts::map().prefer_q4_k_m(), "weights/")?;
let loaded = weights::open_with(LoadOpts::loader(), "model.gguf")?; // packed take / MTP
```

**Model-specific policy** belongs in each runner:

```rust
use rlx_core::{load_weight_map, gguf_validate_arch, EMBED_GGUF_ARCHES, DINOV2_GGUF_ARCHES};

// One call: resolve path, validate arch on .gguf, drain to F32 map
let map = load_weight_map(path, DINOV2_GGUF_ARCHES)?;

// Or split validate + open (embed / custom drain policy)
gguf_validate_arch(&path, EMBED_GGUF_ARCHES)?;
let (_path, map) = weights::open_map(path)?;
```

| Layer | Responsibility |
|-------|----------------|
| `weights` / `weight_registry` | `.gguf` / `.safetensors`, resolve dir, custom extensions |
| `gguf_validate_arch`, `assert_gguf_family` | Optional arch guard in **your** crate |
| `register_gguf_tensor_resolver` | HF ↔ `blk.*` / prefix strip per checkpoint layout |
| `BertConfig::from_gguf`, `Flux2Config::from_gguf` | Hyperparameters from metadata |

**Inspect:** `rlx-inspect path [--prefer Q4_K_M] [--json]` — directory listing, split-part hints, runner suggestions.

**CLI:** LM / FLUX binaries accept `--prefer-quant` and `--gguf-index` (via `rlx_cli::resolve_weights_cli`); default quant preference is `Q4_K_M` in multi-file dirs.

**Splits:** Multi-part GGUF (`split.count` > 1) auto-merges when all parts are in the same directory; otherwise `rlx-inspect` lists missing shards.

**Legacy quants:** `Q4_0` / `Q8_0` support packed `DequantMatMul` on **CPU** and **Metal** (fused MSL dequant+matmul, 32-element blocks). Set `RLX_DISABLE_METAL_DEQUANT_GPU=1` to force host dequant on Apple GPUs.
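
To show what fused dequant plus matmul means for a 32-element block, the sketch below follows the ggml `Q4_0` layout as commonly documented: one scale per block and 16 bytes of packed 4-bit values, where the low nibble of byte `j` is element `j` and the high nibble is element `j + 16`, each offset by 8. The scale is stored as f16 on disk; here it is passed as an already-decoded `f32` to stay within the standard library. This is not the RLX or MSL kernel, and both function names are hypothetical.

```rust
/// Elements per Q4_0 block.
const QK4_0: usize = 32;

/// Expand one Q4_0 block to f32: value = (nibble - 8) * scale.
fn dequant_q4_0_block(scale: f32, quants: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (j, &byte) in quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // element j
        let hi = (byte >> 4) as i32 - 8; // element j + 16
        out[j] = lo as f32 * scale;
        out[j + QK4_0 / 2] = hi as f32 * scale;
    }
    out
}

/// Fused form: accumulate (nibble - 8) * x and apply the scale once,
/// without materializing the dequantized block.
fn dot_q4_0_block(scale: f32, quants: &[u8; QK4_0 / 2], x: &[f32; QK4_0]) -> f32 {
    let mut acc = 0.0f32;
    for (j, &byte) in quants.iter().enumerate() {
        acc += ((byte & 0x0F) as i32 - 8) as f32 * x[j];
        acc += ((byte >> 4) as i32 - 8) as f32 * x[j + QK4_0 / 2];
    }
    scale * acc
}
```

The fused variant avoids writing 32 floats per block to memory, which is the saving a packed matmul path is after.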

**Example:** `cargo run -p rlx-models --example custom_weight_format`

### Apple Silicon

Metal lowers to MPSGraph (per shape). Env toggles:

| env var | effect |
|---|---|
| `RLX_DISABLE_MPSGRAPH=1` | per-op Metal thunks |
| `RLX_DISABLE_MPSGRAPH_EXECUTABLE=1` | JIT MPSGraph |
| `RLX_MPSGRAPH_PARAM_CONST=1` | bake weights into executable |
| `RLX_QWEN3_F16_LM_HEAD=1` | F16 final matmul |
| `RLX_MPSGRAPH_TRACE=1` | print lowering blockers |
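
A boolean toggle of this kind is typically read once with `std::env::var`. The sketch below treats only the exact value `1` as enabled, which matches how the table writes the flags but is an assumption about RLX's parsing; `flag_enabled` and `env_flag` are hypothetical helpers.

```rust
/// Interpret a raw environment value; only "1" (ignoring surrounding
/// whitespace) counts as enabled in this sketch.
fn flag_enabled(value: Option<&str>) -> bool {
    matches!(value.map(str::trim), Some("1"))
}

/// Read a toggle from the process environment. Unset or non-UTF-8 values
/// are treated as disabled.
fn env_flag(name: &str) -> bool {
    flag_enabled(std::env::var(name).ok().as_deref())
}
```

Splitting the parsing from the environment lookup keeps the rule testable without mutating process state.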

Harness: `examples/qwen3_matrix.rs`.

## MiniCPM5

[openbmb/MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) — 1B Llama decoder (GQA, RoPE, SwiGLU). Implemented in `rlx-minicpm5` on top of `rlx-llama32` with HF `config.json` / GGUF arch checks. **Full runbook:** [crates/rlx-minicpm5/README.md](crates/rlx-minicpm5/README.md).

### Download

```sh
just fetch-minicpm5                              # safetensors → /tmp/rlx-weights/MiniCPM5-1B
just fetch-minicpm5-gguf Q4_K_M                  # GGUF → …/MiniCPM5-1B-GGUF
```

### CLI

Uses the same flags as `rlx-llama32` (`--weights`, `--device`, `--prompt-ids`, `--tokenizer`, `--packed`, `--max-seq`, `--max-tokens`, …). Build with the `tokenizer` feature for decode:

```sh
W=/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors

just minicpm5 -- --weights "$W" --device cpu --prompt-ids 1,42,314 --max-tokens 16

# GGUF packed prefill (CPU + Metal native; MLX/wgpu/CUDA use CPU execution path today):
just minicpm5 -- --weights /tmp/rlx-weights/MiniCPM5-1B-GGUF/MiniCPM5-1B-Q4_K_M.gguf \
  --packed --device metal --prompt-ids 1,42 --max-tokens 8
```

### Chat (HF template)

```sh
pip install transformers
just fetch-minicpm5
just minicpm5-chat "What is 2+2? Answer in one sentence."
```

`minicpm5_chat.py` tokenizes with the official template, then runs `rlx-minicpm5` (defaults to **CPU** for reliable KV decode on Apple Silicon).

### Library

```rust
use rlx_minicpm5::MiniCpm5Runner;
use rlx_runtime::Device;

let mut runner = MiniCpm5Runner::builder()
    .weights("/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors")
    .device(Device::Cpu)
    .max_seq(512)
    .build()?;
let logits = runner.predict_logits(&[1, 42, 314])?;
```

Example: `just example run_minicpm5 --release` (or `cargo run -p rlx-models --example run_minicpm5 --release`).

### Tests

| Command | What |
|---------|------|
| `just test-minicpm5-parity-full` | RLX vs PyTorch (safetensors, needs weights) |
| `just test-minicpm5-backends-all` | Synthetic 1B-shaped graph, all backends |
| `just test-minicpm5-gguf-backends` | Real Q4_K_M GGUF packed prefill |
| `../rlx/rig.sh test-minicpm5` | Remote rig: CPU + CUDA + WGPU on Windows/WSL (after `sync` + `sync-minicpm5-gguf`) |
| `just bench-minicpm5-real --device cpu` | Wall-clock prefill/decode on 1B weights |

Multiplexer: `cargo run -p rlx-models --bin rlx-run --features tokenizer -- minicpm5 --weights …`.

## LocateAnything

[NVIDIA LocateAnything-3B](https://huggingface.co/nvidia/LocateAnything-3B) — MoonViT vision + `mlp1` projector + Qwen2.5-3B with MTP box decoding. Crate: `rlx-locateanything`; runbook: [crates/rlx-locateanything/README.md](crates/rlx-locateanything/README.md).

```bash
just fetch-locateanything
export RLX_LOCATEANYTHING_DIR=.cache/locateanything/LocateAnything-3B

just test-locateanything-checkpoint
just locateanything-demo   # bundled sample in rlx-locateanything/fixtures/sample.jpg
just locateanything -- --model-dir $RLX_LOCATEANYTHING_DIR \
  --image page.png --task ground-single --phrase "red backpack" \
  --generation-mode hybrid --device cpu
```

| Command | What |
|---------|------|
| `just test-locateanything-backends` | Synthetic projector + LM on all RLX backends |
| `just test-locateanything-moonvit-backends` | Compiled MoonViT on GPU backends |
| `just test-locateanything-parity` | Full tensor + MTP decode + RLX/HF processor prompts + tasks + slow/fast/hybrid `generate()` vs HF (28 tests; real JPEG fixture) |
| `just test-locateanything-parity-real` | Real-photo subset (`fixtures/sample.jpg`; `RLX_LOCATEANYTHING_IMAGE` optional) |
| `just locateanything-demo` | Quick ground on bundled sample (no `--image`) |
| `just bench-locateanything-backends` | E2E timing per backend; **one subprocess per backend** by default (avoids OOM). Single backend: `--device wgpu --no-isolate` |

Weights are HF safetensors only (770 tensors: vision / projector / `language_model.*`).

## Qwen3-TTS

[Qwen3-TTS-12Hz-0.6B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base) — native Rust voice clone and CustomVoice synthesis in `rlx-qwen3-tts`. Full runbook: [crates/rlx-qwen3-tts/README.md](crates/rlx-qwen3-tts/README.md).

### Download and clone

```bash
just fetch-qwen3-tts-base
export RLX_QWEN3_TTS_DIR=.cache/qwen3-tts/Qwen3-TTS-12Hz-0.6B-Base

cargo build -p rlx-qwen3-tts --release --features apple-silicon --bin jfk_voice_clone
./target/release/jfk_voice_clone \
  --model-dir $RLX_QWEN3_TTS_DIR \
  --ref-wav assets/jfk/jfk_voice_clone.wav \
  --target-text "Hello from native Rust TTS." \
  --out-wav /tmp/hello.wav --device metal
```

### Duplex voice chat

Mic WAV → Whisper → Qwen3-0.6B → progressive TTS (JFK clone). Bundled roundtrip audio under `crates/rlx-qwen3-tts/examples/audio/`.

```bash
just fetch-qwen3 && just fetch-whisper-base   # LM + ASR weights
just voice-chat-demo                          # → /tmp/voice_chat_roundtrip/
```

`--turbo` preloads all models, streams LM tokens, and uses batched TTS by default (`--streaming-tts` for progressive partial-decode). Measured stop-speaking → first audio ≈ **5.1 s** on Apple Silicon (see [voice_chat_latency.svg](crates/rlx-qwen3-tts/examples/charts/voice_chat_latency.svg)).

### Streaming API

[`VoiceClone::generate_stream`](crates/rlx-qwen3-tts/README.md#live-streaming-api) supports `StreamMode::Batched` (lossless chunking of full `generate()`) and `StreamMode::Progressive` (codec frames decoded during autoregressive generation). Progressive speech decode uses CPU on Metal/MLX (GPU prefix-length mismatch); CUDA and other backends use GPU speech decode when available.
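
Lossless chunking means that concatenating the emitted chunks reproduces the full waveform sample for sample. A minimal sketch of that property follows, assuming `f32` PCM samples and a caller-chosen chunk length; `chunk_pcm` is a hypothetical helper, not part of the `VoiceClone` API.

```rust
/// Split a finished PCM buffer into fixed-size chunks (the last may be shorter).
/// Concatenating the chunks yields the original buffer unchanged.
fn chunk_pcm(pcm: &[f32], chunk_len: usize) -> Vec<Vec<f32>> {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    pcm.chunks(chunk_len).map(|c| c.to_vec()).collect()
}
```

Because the audio is generated in full before chunking, this mode trades first-audio latency for exact parity with the non-streaming output.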

### Tests

| Command | What |
|---------|------|
| `just test-qwen3-tts-parity` | Codec frames + speech decode vs reference (`RLX_QWEN3_TTS_DIR`) |
| `just features=all-backends test-qwen3-tts-backends` | Talker prefill/decode per backend |
| `just features=all-backends test-qwen3-tts-streaming` | Streaming PCM parity (batched + progressive) |
| `just qwen3-tts-vivian-demo` | CustomVoice preset speaker → `/tmp/vivian-demo.wav` |

Env: `RLX_QWEN3_TTS_CP_EAGER=1` / `RLX_QWEN3_TTS_SPEECH_EAGER=1` force CPU paths; `RLX_QWEN3_TTS_TIMING=1` prints stage breakdown.

## Voxtral TTS

[Mistral Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) — native Rust inference in `rlx-voxtral-tts`, voice-clone training in `rlx-voxtral-tts-train`. Full runbook: [docker/voxtral-tts/README.md](docker/voxtral-tts/README.md).

### Download and synthesize

```bash
just fetch-voxtral-tts
export RLX_VOXTRAL_TTS_DIR=.cache/voxtral/Voxtral-4B-TTS-2603

just voxtral-tts-prepare-voices
just voxtral-tts -- --model-dir "$RLX_VOXTRAL_TTS_DIR" \
  --text "Hello world" --voice neutral_female -o out.wav
```

### Voice cloning (native RLX training)

Public checkpoints omit the codec **encoder**. Train it in RLX, inject into `consolidated.safetensors`, then synthesize from a reference WAV:

```bash
# Optional manifest (transcript field improves ASR auxiliary loss):
just voxtral-tts-train-manifest -- --wav-dir ./wavs --out ./wavs/manifest.json

PRODUCTION=1 just features=all-backends voxtral-tts-train-production -- \
  --model-dir "$RLX_VOXTRAL_TTS_DIR" --wav-dir ./wavs \
  --manifest ./wavs/manifest.json --out-dir ./out/train --device auto

just voxtral-tts -- --model-dir "$RLX_VOXTRAL_TTS_DIR" \
  --reference-wav ./ref.wav --text "Hello from my voice" -o cloned.wav
```

For periodic checkpoints during long runs, set `CHECKPOINT_EVERY=500`. To resume, pass `--resume-weights ./out/train/encoder/encoder_step_5000.safetensors --resume-step 5000`. For rig validation, run `RLX_VOXTRAL_TTS_TRAIN_RIG=1 RLX_VOXTRAL_TTS_REF_WAV=./ref.wav just test-voxtral-tts-train-synthesize-rig` (reports mel similarity).
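When scripting resumes, the value for `--resume-step` can be read off the checkpoint file name. The sketch below assumes that checkpoints always follow the `encoder_step_<N>.safetensors` pattern seen in the example path; that pattern is inferred from a single example, and `resume_step` is a hypothetical helper, not part of `rlx-voxtral-tts-train`.

```rust
use std::path::Path;

/// Extract the step number from a checkpoint path such as
/// `./out/train/encoder/encoder_step_5000.safetensors`.
/// Assumes the `encoder_step_<N>.safetensors` naming pattern (an inference
/// from the README example); returns `None` for any other file name.
fn resume_step(checkpoint: &Path) -> Option<u64> {
    checkpoint
        .file_name()?
        .to_str()?
        .strip_prefix("encoder_step_")?
        .strip_suffix(".safetensors")?
        .parse()
        .ok()
}
```

For the path in the resume example this yields `Some(5000)`, which matches the `--resume-step 5000` argument shown there.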

### Tests

| Command | What |
|---------|------|
| `just test-voxtral-tts-train` | Train crate unit + integration tests |
| `just test-voxtral-tts-train-backends` | Encoder/LoRA backward compile on all GPU backends |
| `just test-voxtral-tts-codec` | Codec round-trip |
| `just test-voxtral-tts-native-parity` | Native vs Docker reference export |

## VAD (Earshot + Silero)

[`rlx-vad`](crates/rlx-vad/README.md) — 16 kHz voice activity detection with **embedded weights** (no ONNX Runtime):

- **Earshot** — `weights/earshot_weights.bin` (~75 KiB)
- **Silero** — `weights/silero_vad_16k.safetensors` (~920 KiB), exported from official `silero_vad.onnx` 16 kHz branch

```sh
cargo run -p rlx-vad --release -- --backend silero --wav audio16k.wav
cargo run -p rlx-vad --example jfk_bench --release
cargo test -p rlx-vad
```

Regenerate Silero embed: `python3 scripts/export_silero_onnx_weights.py …` (see crate README). The Hugging Face file named `silero_vad_16k.safetensors` is a different (8 kHz) graph — do not substitute it.

Shared loader: `rlx_core::embedded_safetensors::EmbeddedSafetensors`.

## Build and test

```sh
just check
just test
just build

cargo build -p rlx-models
cargo test  -p rlx-models
cargo test  -p rlx-models --features parity-candle
```

burnembed (`/Users/Shared/burnembed`) re-exports `rlx_models::embed` with `--features rlx`.

### Real-weight integration tests

```sh
just fetch-real-weights              # downloads ~1.5 GB of small Q4_K_M GGUFs (idempotent)
just test-real-weights               # config + compat + chat-template across 4 families (~2 s/suite)
just test-real-weights-inference     # adds end-to-end forward inference (slow on CPU)
just test-net-hf                     # live HuggingFace Hub compat check (RLX_NET_TESTS=1)
```

Covers SmolLM2 135M (`llama`), Qwen 2.5 0.5B (`qwen2`), Gemma 3 270M (`gemma3` — currently `KnownUnimplemented(M2)`), and Llama 3.2 1B (`llama` + Llama-3 RoPE scaling). The inference path verifies the full `Llama32Runner`/`Qwen3Runner` packed-decode pipeline against real downloaded GGUFs.

### Auto-dispatch + compatibility check

```sh
rlx-run check <path-or-hf-repo>      # `SUPPORTED`, `KnownUnimplemented(<milestone>)`, `MissingMetadata`, or `Unknown`
rlx-run check <path> --json          # machine-readable verdict
rlx-run auto <weights> [args...]     # sniffs arch, dispatches to the right runner
```

Programmatic: [`rlx_models::run::check_path`](crates/rlx-cli/src/compat.rs), [`check_hf_repo`](crates/rlx-cli/src/compat.rs) (requires `compat-net` feature), [`auto_dispatch`](crates/rlx-cli/src/auto_dispatch.rs), [`ChatTemplate::from_gguf`](crates/rlx-cli/src/chat.rs). Implements the same load-time-field predicate llama.cpp uses (`general.architecture` + `<arch>.context_length` + `<arch>.embedding_length` + `<arch>.block_count` + `tokenizer.ggml.{model,tokens}`).
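The load-time-field predicate can be sketched in a few lines of std-only Rust. This is an illustration of the key list above, not the code in `compat.rs`: the `MetadataCheck` enum and `check_load_time_fields` function are hypothetical, and GGUF metadata values are flattened to strings here because only key presence matters for the check.

```rust
use std::collections::HashMap;

/// Outcome of the presence check (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum MetadataCheck {
    /// All required keys are present for this architecture.
    Complete { arch: String },
    /// Required keys that were not found, in the order they were checked.
    Missing(Vec<String>),
}

/// Check that the metadata carries `general.architecture` plus the
/// per-architecture and tokenizer keys listed in the README.
fn check_load_time_fields(meta: &HashMap<String, String>) -> MetadataCheck {
    let arch = match meta.get("general.architecture") {
        Some(a) => a.clone(),
        None => return MetadataCheck::Missing(vec!["general.architecture".to_string()]),
    };
    let required = vec![
        format!("{}.context_length", arch),
        format!("{}.embedding_length", arch),
        format!("{}.block_count", arch),
        "tokenizer.ggml.model".to_string(),
        "tokenizer.ggml.tokens".to_string(),
    ];
    let missing: Vec<String> = required
        .into_iter()
        .filter(|key| !meta.contains_key(key))
        .collect();
    if missing.is_empty() {
        MetadataCheck::Complete { arch }
    } else {
        MetadataCheck::Missing(missing)
    }
}
```

Passing this check only means the file is structurally loadable. Whether a runner exists for the architecture is a separate question, which is presumably what separates `SUPPORTED` from `KnownUnimplemented(<milestone>)` in the CLI output.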

## Status

### Weights and parity

**rlx GGUF** = this repo can load `.gguf` through `GgufLoader` and the family runner. **GGUF on HF** = models on the Hub tagged `library:gguf` (counts are approximate; use the search link to browse).

| family | safetensors | rlx GGUF | GGUF on Hugging Face | parity |
|---|---|---|---|---|
| `bert`, `nomic`, `vision` (`embed`) | yes | yes (`bert`, `nomic-bert`, …) | **yes** — [minilm](https://huggingface.co/models?library=gguf&search=minilm) (~128), [bge](https://huggingface.co/models?library=gguf&search=bge) (~247), [nomic](https://huggingface.co/models?library=gguf&search=nomic) (~60); e.g. [nomic-embed-text-v1.5-GGUF](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF) (`nomic-bert`), [bge-small-en-v1.5-gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf). Vision embed: no GGUF sibling. | production (safetensors) |
| `dinov2` | yes | yes (`dinov2`; F32 drain or K-quant/Q4_0/Q8_0 packed `DequantMatMul` when quant tensors present) | **no** for `facebook/dinov2-*` — [dinov2](https://huggingface.co/models?library=gguf&search=dinov2) (0). Community converters (dinov2.cpp) use `dinov2` arch; tensor names must match HF/candle keys. | production |
| `sam`, `sam2`, `sam3` | yes | yes (`sam` / `mobile-sam` / `sam2` F32 drain). **SAM3**: F32 drain or K-quant via fused CPU `gguf_matmul` (ViT, text, detector host+IR, seg cross-attn/mask/scoring, 1×1 inst/sem `DequantMatMul` IR); 3×3 pixel conv stays packed at load (one-time dequant cache on host, materialize for tier-1 IR compile) | **SAM1 ViT-H / SAM2**: no official Hub GGUF — [segment+anything](https://huggingface.co/models?library=gguf&search=segment+anything) (0), [sam2.1](https://huggingface.co/models?library=gguf&search=sam2.1) (0). **MobileSAM**: [mobilesam](https://huggingface.co/models?library=gguf&search=mobilesam) (2), e.g. [Acly/MobileSAM-GGUF](https://huggingface.co/Acly/MobileSAM-GGUF) (`mobile-sam`). **SAM3**: [sam3](https://huggingface.co/models?library=gguf&search=sam3) (1) — [rob-laz/sam3-gguf](https://huggingface.co/rob-laz/sam3-gguf) (`sam3`). Beware [TheBloke/SAM-GGUF](https://huggingface.co/TheBloke/SAM-GGUF) — 7B **chat LM** (`llama`), not Segment Anything. | production (encoder + mask path) |
| `qwen3` | yes | yes (Q4_K_M / Q5_K_M / Q6_K) | **yes** — [qwen3](https://huggingface.co/models?library=gguf&search=qwen3) (many); e.g. `unsloth/Qwen3-*-GGUF` | top-1 vs HF (`parity-candle` + weights) |
| `qwen35` | — | yes | **yes** — same hub space; e.g. `unsloth/Qwen3.5-*-GGUF` | vs llama.cpp when `QWEN35_GGUF_PATH` / `parity-llama` |
| `llama32` | yes | yes | **yes** — [llama-3.2](https://huggingface.co/models?library=gguf&search=llama-3.2) (~5k) | vs llama.cpp when `LLAMA32_GGUF_PATH` |
| `minicpm5` | yes | yes (`llama`) | **yes** — [MiniCPM5-1B-GGUF](https://huggingface.co/openbmb/MiniCPM5-1B-GGUF) (Q4_K_M / Q8_0 / F16) | vs PyTorch (`minicpm5_parity`); `rlx-minicpm5` 0.2.1 on `rlx-llama32` 0.2.1; GGUF packed CPU/Metal |
| `llada2` | yes | — | **preview** — [llada2](https://huggingface.co/models?library=gguf&search=llada2) (1): [LLaDA2.0-mini-preview-GGUF](https://huggingface.co/wsbagnsv1/LLaDA2.0-mini-preview-GGUF) (`llada2`) | vs PyTorch when `LLADA2_MODEL_DIR` |
| `flux2` | yes (BFL / NVFP4 safetensors) | yes (denoiser `.gguf`, `architecture: flux`; K-quant GGUF uses packed `DequantMatMul`; `Flux2Runner` + VAE/TE safetensors) | **yes** — [flux2](https://huggingface.co/models?library=gguf&search=flux2) (~53); e.g. [unsloth/FLUX.2-klein-9B-GGUF](https://huggingface.co/unsloth/FLUX.2-klein-9B-GGUF), [city96/FLUX.2-dev-gguf](https://huggingface.co/city96/FLUX.2-dev-gguf) | GGUF = denoiser only; VAE + Qwen3 TE still safetensors dirs |
| `vjepa2` | yes | yes (`vjepa2` / `vjepa`, F32 drain) | **no** Hub GGUF yet — [vjepa](https://huggingface.co/models?library=gguf&search=vjepa) (0) | synthetic + optional weight checks |
| `wav2vec2-bert` | yes | yes (`w2v-bert` / `wav2vec2`, F32 drain) | **no** for Seamless W2V-BERT — [w2v-bert](https://huggingface.co/models?library=gguf&search=w2v-bert) (0). Classic ASR: [wav2vec2](https://huggingface.co/models?library=gguf&search=wav2vec2) (~7), e.g. `cstr/wav2vec2-*-GGUF` (`wav2vec2` arch; keys may not match W2V-BERT) | vs HF when `RLX_W2V_BERT_DIR` + python reference |

To discover GGUF on the Hub: open [Models → library GGUF](https://huggingface.co/models?library=gguf) and add a **search term** matching the family (`qwen3`, `bge`, `flux2`, …). Check the model card **Architecture** field — many repos share a name but are unrelated LMs.

### Backends

Every model family targets the same standard backends: **CPU, Metal, MLX, CUDA, ROCm, WGPU (`gpu`), Vulkan**. SAM also accepts **`tpu`**. Policy lives in [`rlx_core::device_capabilities`](crates/rlx-core/src/device_capabilities.rs); runners call `validate_standard_device` (or `validate_sam_device`) at build time.

Enable GPU at compile time with matching features on `rlx-models` or any model crate, e.g. `cargo build -p rlx-qwen3 --features all-backends` or `cargo run -p rlx-models --features metal --bin rlx-run -- qwen3 …`. Per-crate binaries (`rlx-qwen3`, `rlx-sam3`, …) expose the same feature names. CLI: `cpu`, `metal`/`mps`, `mlx`, `cuda`, `rocm`/`hip`, `gpu`/`wgpu`, `vulkan`.
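The CLI alias list above maps several spellings onto one backend. A minimal sketch of that mapping follows; the `Backend` enum and `parse_backend` function are hypothetical names rather than the types in `rlx_core::device_capabilities`, and the case-insensitive matching is an assumption.

```rust
/// Standard backends named in this README (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Metal,
    Mlx,
    Cuda,
    Rocm,
    Wgpu,
    Vulkan,
}

/// Map a CLI device string to a backend, accepting the documented aliases.
/// Lowercasing the input is an assumption; the real CLI may be stricter.
/// Names outside the standard list (for example `tpu`) return `None` here.
fn parse_backend(name: &str) -> Option<Backend> {
    match name.trim().to_ascii_lowercase().as_str() {
        "cpu" => Some(Backend::Cpu),
        "metal" | "mps" => Some(Backend::Metal),
        "mlx" => Some(Backend::Mlx),
        "cuda" => Some(Backend::Cuda),
        "rocm" | "hip" => Some(Backend::Rocm),
        "gpu" | "wgpu" => Some(Backend::Wgpu),
        "vulkan" => Some(Backend::Vulkan),
        _ => None,
    }
}
```

Note that parsing a name says nothing about whether the binary was compiled with the matching feature; that check happens separately at build time.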

Legend: ✅ supported · ⚠️ partial (host fallback or open runtime gap) · ❌ not supported

| family | cpu | metal | mlx | cuda | rocm | wgpu | vulkan | notes |
|---|---|---|---|---|---|---|---|---|
| `embed` (`bert`, `nomic`, `vision`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [`RlxEmbed::from_dir_on`](crates/rlx-embed/src/runtime.rs); `from_dir` defaults to CPU |
| `dinov2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [`DinoV2Runner`](crates/rlx-dinov2/src/runner.rs) `--device` |
| `sam`, `sam2`, `sam3` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | SAM v1 also accepts `tpu`; CPU/Metal/MLX most exercised in CI |
| `qwen3` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | packed GGUF: CPU + Metal native; MLX/wgpu/CUDA prefill via CPU path (`rlx_core::packed_gguf_*`); MTP decode not wired |
| `qwen35` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | `--device` on all backends; some ops use host GDN/dequant on GPU; MoE offload may keep experts on host |
| `llama32` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | `rlx-llama32` 0.2.1: Metal decode guard + packed GGUF helpers; same packed rules as Qwen3 |
| `minicpm5` | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ | ⚠️ | ⚠️ | Wraps `rlx-llama32`; safetensors decode on CPU/Metal; GGUF `--packed` parity on CPU/Metal (MLX/wgpu tests use CPU prefill path) |
| `llada2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | MoE predictive expert offload on all standard backends (GPU uses resident experts + host fallback) |
| `flux2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Full pipeline; text encoder compiled on Metal/MLX by default, host once on CUDA/ROCm/WGPU/Vulkan |
| `vjepa2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Runner `--device` |
| `wav2vec2-bert` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [`Wav2Vec2BertRunner`](crates/rlx-wav2vec2-bert/src/runner.rs) `--device` |

Multi-tenant serving (paged KV, continuous batching) lives in `rlx_runtime::paged_kv`; `qwen3::generator` is single-stream.

## Gotchas

- Safetensors names ≠ IR `Param` names — `weight_map.rs` renames; GGUF uses `GgufLoader`.
- **GGUF LMs** (`qwen3`, `qwen35`, `llama32`, `minicpm5`): pass a `.gguf` file or a directory with one `.gguf` / `model.safetensors`. Wrong-family files get a redirect (`rlx_core::assert_gguf_family`). Shared helpers: `resolve_weights_file`, `WeightFormat::resolve`, `open_loader_resolved`. MiniCPM5 expects `general.architecture = llama` and HF `model_type = llama`.
- **Packed GGUF prefill** (`--packed`, K-quant): use `rlx_core::{packed_gguf_compile_guard, compile_options_for_packed_gguf_prefill_with_profile, packed_gguf_execution_device}` in `rlx-llama32`, `rlx-qwen3`, `rlx-gemma`, and `rlx-minicpm5`. Metal sets `RLX_DISABLE_MPSGRAPH=1` during compile; MLX uses `RLX_MLX_MODE=lazy` (host GGUF dequant); wgpu/CUDA/ROCm disable fusion and may run prefill on CPU until upstream GPU parity.
- **GGUF elsewhere on HF** (embed, FLUX, SAM3, …) does not imply rlx support — see [Weights and parity](#weights-and-parity) column *GGUF on Hugging Face*.
- **GGUF shapes** are innermost-first labels; byte layout matches safetensors row-major — do not transpose in `take`.
- Unsupported GGUF quants (Q1_0, Q2_K, IQ*, …) error cleanly.
- **27B GGUF on Mac**: F32 dequant ≈ 108 GB; needs Metal `Op::DequantMatMul` to stay packed (~13.5 GB).
- Pooling lives in `embed::pooling`.
- Adding a new architecture: create a crate under `crates/`, add a facade hook, and optionally add a parity test.
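Two of the gotchas above reduce to simple arithmetic, sketched here with hypothetical helper names. The first converts GGUF's innermost-first dimension labels to a row-major shape by reversing the list, leaving the bytes untouched. The second reproduces the 27B memory figures, assuming decimal gigabytes and exactly 4 bits per weight for the packed case, which ignores per-block scales and is therefore a lower bound for real K-quants.

```rust
/// GGUF lists dimensions innermost (fastest-varying) first.
/// Reversing the labels gives the row-major shape; no data is moved.
fn row_major_shape(gguf_dims: &[u64]) -> Vec<u64> {
    gguf_dims.iter().rev().copied().collect()
}

/// Approximate weight storage in bytes for `params` parameters stored at
/// `bits_per_weight` bits each. Ignores quantization block overhead.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}
```

With 27 billion parameters, `weight_bytes(27_000_000_000, 32)` is 108 × 10⁹ bytes (the F32 dequant figure) and `weight_bytes(27_000_000_000, 4)` is 13.5 × 10⁹ bytes (the packed figure). A GGUF tensor labelled `[4096, 11008]` corresponds to a row-major `[11008, 4096]` array over the same bytes.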

## Per-crate READMEs

Model-specific runbooks live next to each crate. Agent quick reference: [AGENTS.md](AGENTS.md).

| Crate | README |
|-------|--------|
| `rlx-qwen3-tts` | [crates/rlx-qwen3-tts/README.md](crates/rlx-qwen3-tts/README.md) |
| `rlx-gemma` | [crates/rlx-gemma/README.md](crates/rlx-gemma/README.md) |
| `rlx-minicpm5` | [crates/rlx-minicpm5/README.md](crates/rlx-minicpm5/README.md) |
| `rlx-llama32` | [crates/rlx-llama32/README.md](crates/rlx-llama32/README.md) |
| `rlx-locateanything` | [crates/rlx-locateanything/README.md](crates/rlx-locateanything/README.md) |
| `rlx-kittentts` | [crates/rlx-kittentts/README.md](crates/rlx-kittentts/README.md) |
| `rlx-vad` | [crates/rlx-vad/README.md](crates/rlx-vad/README.md) |
| `rlx-mamba` | [crates/rlx-mamba/README.md](crates/rlx-mamba/README.md) |
| `rlx-ssm` | [crates/rlx-ssm/README.md](crates/rlx-ssm/README.md) |
| `rlx-models-core` (`rlx-core`) | [crates/rlx-models-core/README.md](crates/rlx-models-core/README.md) |
| `rlx-clinicalbert` | [crates/rlx-clinicalbert/README.md](crates/rlx-clinicalbert/README.md) |
| `rlx-onnx-import` | [crates/rlx-onnx-import/README.md](crates/rlx-onnx-import/README.md) |
| `rlx-onnx-decompose` | [crates/rlx-onnx-decompose/README.md](crates/rlx-onnx-decompose/README.md) |
| `kitten_tts_mini_rlx` | [crates/kitten_tts_mini_rlx/README.md](crates/kitten_tts_mini_rlx/README.md) |
| Voxtral TTS training | [docker/voxtral-tts/README.md](docker/voxtral-tts/README.md) |

Crates without a dedicated README are documented in [What's here](#whats-here) and the facade examples under `crates/rlx-models/examples/`.

## License

GPL-3.0-only.