rlx-models 0.2.0

Model loading for RLX — config parsing, safetensors weights, graph builders
# rlx-models

Concrete model graph builders + weight loaders for RLX — the "what actually runs" layer.

Standalone repo: [github.com/MIT-RLX/rlx-models](https://github.com/MIT-RLX/rlx-models). Clone next to [`rlx`](https://github.com/MIT-RLX/rlx):

```text
rlx-workspace/
  rlx/          # github.com/MIT-RLX/rlx
  rlx-models/   # github.com/MIT-RLX/rlx-models
  candle/       # optional, for parity-candle
```

```bash
git clone https://github.com/MIT-RLX/rlx.git
git clone https://github.com/MIT-RLX/rlx-models.git
cd rlx-models && cargo test -p rlx-models
```

The RLX monorepo lists `../rlx-models/crates/rlx-models` as a workspace member, so the same tests also run from there: `cd rlx && cargo test -p rlx-models`.

Agent-oriented quick reference: [AGENTS.md](AGENTS.md).

## Contents

- [Architecture](#architecture)
- [Running models](#running-models)
- [What's here](#whats-here)
- [Install](#install)
- [Quickstart — embeddings](#quickstart--embeddings)
- [High-level runner API](#high-level-runner-api)
- [Adding a new model](#adding-a-new-model)
- [Compile profiles](#compile-profiles-tier-1)
- [Qwen3](#qwen3)
- [Build and test](#build-and-test)
- [Status](#status)
- [Gotchas](#gotchas)
- [License](#license)

## Architecture

This repo is a **Cargo workspace**: one library crate per model family under `crates/`, plus shared infrastructure. The `rlx-models` package is a thin **facade** that re-exports historical paths (`rlx_models::qwen3`, `rlx_models::sam`, …).

```text
rlx-models/
├── Cargo.toml              # workspace members + [workspace.dependencies]
├── justfile                # shortcuts (optional)
├── crates/
│   ├── rlx-models-core/    # config, weight_map, flow_bridge (package `rlx-core`)
│   ├── rlx-ssm/            # SSM flow stages + custom ops (Mamba, LFM, …)
│   ├── rlx-cli/            # shared CLI + rlx-inspect
│   ├── rlx-<model>/        # one crate per family
│   └── rlx-models/         # facade + optional rlx-run multiplexer
└── crates/rlx-models/examples/   # integration templates
```

### Crates

| Crate | Model / role |
|---|---|
| `rlx-models-core` (`rlx-core`) | config, `weight_map`, `weight_loader`, `flow_bridge`, `flow_util` |
| `rlx-ssm` | SSM flow stages (`MambaScanStage`, decode-step custom ops) |
| `rlx-mamba` | Mamba1 block + multi-backend driver |
| `rlx-bert` | BERT |
| `rlx-nomic` | NomicBERT |
| `rlx-vision` | NomicVision |
| `rlx-dinov2` | DINOv2 |
| `rlx-embed` | embedding runtime |
| `rlx-sam` / `sam2` / `sam3` | SAM family |
| `rlx-sam-ir` | shared mask-decoder IR |
| `rlx-qwen3` | Qwen3 LM |
| `rlx-qwen35` | Qwen3.5 / 3.6 |
| `rlx-llama32` | LLaMA 3.2 |
| `rlx-gemma` | Gemma / Gemma 2 |
| `rlx-llada2` | LLaDA2 + TIDE offload |
| `rlx-flux2` | FLUX.2 |
| `rlx-vjepa2` | V-JEPA 2 |
| `rlx-wav2vec2-bert` | Wav2Vec2-BERT |
| `rlx-whisper` | OpenAI Whisper ASR |
| `rlx-cli` | shared CLI helpers + `rlx-inspect` |
| `rlx-models` | facade (re-exports) + optional `rlx-run` multiplexer |

### How to depend

| Goal | Depend on |
|------|-----------|
| One model only (fast builds) | `rlx-qwen3`, `rlx-sam3`, … |
| Stable `rlx_models::qwen3` paths | `rlx-models` facade |
| CLI / inspect only | `rlx-cli` |

New code that only needs Qwen3 should depend on `rlx-qwen3` directly.

### Per-crate binaries

Each model crate with a CLI has `src/cli.rs` (`pub fn run`) and `src/bin/rlx-<name>.rs`. Shared flag parsing lives in `rlx-cli`.

**`rlx-run`** (in `rlx-models`) is an optional multiplexer over all built-in CLIs. Prefer per-crate binaries when you only need one family — they link less and compile faster.

**SAM unified runner:** `SamRunner` (SAM1/2/3) stays on the facade (`rlx-models/src/sam_runner.rs`) because `rlx-sam2` depends on `rlx-sam`. Per-arch CLIs are on `rlx-sam`, `rlx-sam2`, `rlx-sam3`.

Published `rlx*` crates (`rlx-runtime`, `rlx-flow`, …) are pulled from **crates.io** (`0.2.1` in `[workspace.dependencies]`). Optional local overrides: see `.cargo/config.toml.example`.

## Running models

### just (shortcuts)

Install [just](https://github.com/casey/just) (`brew install just`). From the repo root:

```sh
just                          # list recipes
just qwen3 -- --weights model.gguf --prompt-ids 1,2,3
just inspect weights/model.gguf
just qwen3-metal -- --weights model.gguf --device metal --prompt-ids 1,2,3
```

Pass model CLI flags after `--`. GPU backends: `just features=all-backends qwen3 -- --device metal`, `just qwen35-all-backends -- …`, or per-crate `qwen3-all-backends` / `qwen35-all-backends`.

### Per-crate binaries (recommended)

| Binary | Crate | Example |
|--------|-------|---------|
| `rlx-qwen3` | `rlx-qwen3` | `cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- --weights model.gguf --prompt-ids 1,2,3` |
| `rlx-qwen35` | `rlx-qwen35` | `cargo run -p rlx-qwen35 --bin rlx-qwen35 --release -- …` |
| `rlx-llama32` | `rlx-llama32` | `cargo run -p rlx-llama32 --bin rlx-llama32 --release -- …` |
| `rlx-gemma` | `rlx-gemma` | `cargo run -p rlx-gemma --bin rlx-gemma --release -- --weights model.gguf --prompt-ids 1,2,3` |
| `rlx-dinov2` | `rlx-dinov2` | `cargo run -p rlx-dinov2 --bin rlx-dinov2 --release -- …` |
| `rlx-vjepa2` | `rlx-vjepa2` | `cargo run -p rlx-vjepa2 --bin rlx-vjepa2 --release -- …` |
| `rlx-wav2vec2-bert` | `rlx-wav2vec2-bert` | `cargo run -p rlx-wav2vec2-bert --bin rlx-wav2vec2-bert --release -- …` |
| `rlx-whisper` | `rlx-whisper` | `cargo run -p rlx-whisper --bin rlx-whisper --release -- --weights model.safetensors --wav audio16k.wav` |
| `rlx-sam1` | `rlx-sam` | `cargo run -p rlx-sam --bin rlx-sam1 --release -- …` |
| `rlx-sam2` | `rlx-sam2` | `cargo run -p rlx-sam2 --bin rlx-sam2 --release -- …` |
| `rlx-sam3` | `rlx-sam3` | `cargo run -p rlx-sam3 --bin rlx-sam3 --release -- …` |
| `rlx-flux2` | `rlx-flux2` | `cargo run -p rlx-flux2 --bin rlx-flux2 --release -- …` |
| `rlx-flux2-serve` | `rlx-flux2` | JSON-lines server on stdin |
| `rlx-inspect` | `rlx-cli` | `cargo run -p rlx-cli --bin rlx-inspect -- model.gguf` |

Flags match the corresponding `rlx-run` subcommand (without the subcommand name).

### Multiplexer (`rlx-run`)

```sh
cargo run -p rlx-models --bin rlx-run --release --features metal -- \
    qwen3 --weights Qwen3-0.6B-Q4_K_M.gguf --device metal --prompt-ids 1,17,42

cargo run -p rlx-models --bin rlx-run -- inspect Qwen3-0.6B-Q4_K_M.gguf
```

`rlx-inspect` dumps format, tensor count, dtype histogram, GGUF metadata, MTP heads, and multi-`.gguf` dir hints (`--prefer Q4_K_M`).

### Custom CLI

Downstream tools can register runners without forking `rlx-models`:

```rust
use rlx_cli::{dispatch, register_cli};

register_cli("my-model", "…", |args| { /* … */ });
dispatch(&argv)?;
```

See `crates/rlx-models/examples/register_custom_runner.rs`.
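
To make the register-then-dispatch pattern concrete, here is a minimal std-only sketch of a subcommand registry. It is not the `rlx_cli` implementation: `Registry`, `RunFn`, the `String` error type, and the rule that the first argument names the subcommand are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical runner signature: receives the arguments after the subcommand.
type RunFn = fn(&[String]) -> Result<(), String>;

/// Hypothetical registry mapping a subcommand name to (description, runner).
struct Registry {
    entries: BTreeMap<&'static str, (&'static str, RunFn)>,
}

impl Registry {
    fn new() -> Self {
        Registry { entries: BTreeMap::new() }
    }

    /// Register a runner under `name`; a later registration replaces an earlier one.
    fn register(&mut self, name: &'static str, about: &'static str, run: RunFn) {
        self.entries.insert(name, (about, run));
    }

    /// Treat `argv[0]` as the subcommand and forward the rest to its runner.
    fn dispatch(&self, argv: &[String]) -> Result<(), String> {
        let (name, rest) = argv
            .split_first()
            .ok_or_else(|| "missing subcommand".to_string())?;
        match self.entries.get(name.as_str()) {
            Some((_, run)) => run(rest),
            None => Err(format!("unknown subcommand `{}`", name)),
        }
    }
}
```

A downstream tool would call `register` once per runner at startup and then hand the process arguments (minus the program name) to `dispatch`.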

### Examples (facade)

Integration templates on the `rlx-models` package:

```sh
cargo run -p rlx-models --example run_qwen3_gguf --release -- [args]
just example-qwen3-gguf -- /path/to/model.gguf
```

| File | What it does |
|---|---|
| `run_qwen3_safetensors.rs` | Qwen3 from HF safetensors, builder API, streaming greedy decode |
| `run_qwen3_gguf.rs` | Same from `.gguf` (Q4_K_M / Q5_K_M / Q6_K), MTP head detection |
| `run_sam1.rs` | SAM 1 — encode image, prompt encoder + mask decoder |
| `run_sam2.rs` | SAM 2 — FPN + memory attention |
| `run_sam3.rs` | SAM 3 — text-conditioned detection + masks |
| `qwen3_gguf_inference.rs` | Detailed Qwen3 GGUF walk-through |
| `gguf_qwen3_probe.rs` | Validate `hf_to_gguf_name` against a real GGUF |
| `qwen3_matrix.rs` | (B, L, mode) × (CPU, Metal, MLX, wgpu) parity + perf vs candle |

SAM examples synthesize a 1024×1024 RGB gradient — swap in `image::open(path)` for real images.

### Weight fetch (optional)

`docker/qwen3-fetch/` — container pulls HF checkpoints into `./weights`; host runs `cargo test` / benches natively.

```sh
just fetch-qwen3
# or: docker build -t rlx-qwen3-fetch docker/qwen3-fetch && …
```

## What's here

- **`qwen3`** — Qwen3 decoder LM (GQA, QK-norm, RoPE, SwiGLU, tied embeddings). Safetensors + GGUF; optional `qwen3.rlx.toml`. See [Qwen3](#qwen3).
- **`qwen35`** — Qwen3.5 / 3.6 hybrid (gated DeltaNet + periodic attention + optional MTP). GGUF via `Qwen35Runner`; optional `qwen35.rlx.toml`. Parity: `examples/qwen35_compare.rs` with the llama.cpp reference script in `examples/`.
- **`gemma`** — Gemma / Gemma 2 (GQA, RoPE, GeGLU, embed `sqrt(d)` scaling, tied weights, Gemma2 logit softcap + V2 norms). Safetensors + GGUF (`gemma` / `gemma2` arch); optional `gemma.rlx.toml` (`dce = false` for V2). CLI: `rlx-gemma` / `rlx-run gemma`. Candle parity (CPU): `cargo test -p rlx-models --test gemma_parity --features parity-candle gemma2_synthetic --release`. Multi-backend: `just features=all-backends test-gemma-backends`.
- **`bert`** — BERT graph builder (MiniLM, BGE, all-MiniLM-L6-v2).
- **`nomic`** — NomicBERT (RoPE + SwiGLU).
- **`vision`** — NomicVision-style encoders.
- **`dinov2`** — DINOv2 ViT (B/14, L/14, g/14).
- **`sam`**, **`sam2`**, **`sam3`** — Segment Anything encoders + mask decoders. Optional `sam.rlx.toml` next to weights (reference: `crates/rlx-sam/src/sam.rlx.toml`).
- **`flux2`** — FLUX.2 rectified-flow denoiser. `rlx-flux2` CLI; presets `flux2_dev()`, `flux2_klein_4b()`, `flux2_klein_9b()`. VAE, CFG, img2img, LoRA, `hf-download`, `rlx-flux2-serve`. GPU backends via `rlx-models` features (`metal`, `cuda`, …).
- **`embed`** — `RlxEmbed`, registry, tokenizers, pooling. `from_pretrained` with `hf-download`.
- **`config`**, **`weight_loader`** — HF config parsing; `WeightMap` + `GgufLoader` (K-quants, MTP isolation).
- **`mamba`** — Mamba1 SSM block (`rlx-mamba`); SSM via `rlx-ssm` + `SelectiveScan`. See [crates/rlx-mamba/README.md](crates/rlx-mamba/README.md).
- **`lfm`**, **`minimax`**, **`nemotron`** — hybrid runners using `rlx-ssm` decode-step stages.
- **`run`** — `Qwen3Runner`, `SamRunner`, … builders for one-call inference.

## Install

```toml
[dependencies]
rlx-models = "0.2"
```

HF-hub download:

```toml
rlx-models = { version = "0.2", features = ["hf-download"] }
```

## Quickstart — embeddings

```rust
use rlx_models::embed::{Pooling, RlxEmbed};

let mut model = RlxEmbed::from_pretrained("sentence-transformers/all-MiniLM-L6-v2")?;
let hidden = model.forward(&[("input_ids", &ids), ("attention_mask", &mask)], 1, 16)?;
```
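
The quickstart stops at per-token hidden states; a sentence embedding usually needs a pooling step on top. The sketch below shows masked mean pooling under the assumption that `hidden` is a row-major `[batch, seq, dim]` buffer of `f32`. `mean_pool` is a hypothetical helper, not the `rlx_models::embed::pooling` API, and the buffer layout is an assumption.

```rust
/// Masked mean pooling over a row-major `[batch, seq, dim]` buffer.
/// `mask` is `[batch, seq]` with 1 for real tokens and 0 for padding.
/// Hypothetical helper for illustration only.
fn mean_pool(hidden: &[f32], mask: &[i64], batch: usize, seq: usize, dim: usize) -> Vec<f32> {
    assert_eq!(hidden.len(), batch * seq * dim);
    assert_eq!(mask.len(), batch * seq);
    let mut out = vec![0.0f32; batch * dim];
    for b in 0..batch {
        let mut count = 0.0f32;
        for t in 0..seq {
            if mask[b * seq + t] == 0 {
                continue; // skip padding positions
            }
            count += 1.0;
            let start = (b * seq + t) * dim;
            let row = &hidden[start..start + dim];
            for (o, h) in out[b * dim..(b + 1) * dim].iter_mut().zip(row) {
                *o += *h;
            }
        }
        // Avoid dividing by zero when a row is fully masked.
        let denom = count.max(1.0);
        for o in &mut out[b * dim..(b + 1) * dim] {
            *o /= denom;
        }
    }
    out
}
```

With `batch = 1` and `seq = 16`, as in the call above, the result is a single `dim`-length vector.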

## High-level runner API

`rlx_models::run` exposes builder-style entry points (also `rlx::run` in the monorepo):

```rust
use rlx_models::run::{Qwen3Runner, Precision};
use rlx_runtime::Device;

let mut runner = Qwen3Runner::builder()
    .weights("Qwen3-0.6B-Q4_K_M.gguf")
    .device(Device::Metal)
    .max_seq(128)
    .precision(Precision::F32)
    .max_memory_gb(16.0)
    .stream(true)
    .use_mtp(false)
    .packed_weights(false)
    .build()?;

runner.generate(&prompt_ids, 32, |tok| print!("{tok} "))?;
```

**Packed weights** (large GGUF on limited RAM — CPU-only, memory-frugal, slower):

```rust,ignore
let mut runner = Qwen3Runner::builder()
    .weights("Qwen3-14B-Q4_K_M.gguf")
    .packed_weights(true)
    .max_seq(128)
    .build()?;
runner.generate(&prompt_ids, 16, |tok| print!(" {tok}"))?;
let logits = runner.predict_logits(&prompt_ids)?;
```

Format (`safetensors` vs `gguf`) is auto-detected. SAM uses `SamRunner::builder(SamArch::Sam2)`.

CLI equivalent:

```sh
just qwen3 -- --weights Qwen3-14B-Q4_K_M.gguf --packed --max-seq 128 --max-tokens 16 --prompt-ids 1,17,42
# or: cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- …
```
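
The runner notes above say the weight format is auto-detected. One common way to do that is to look at the first bytes of the file: GGUF files begin with the ASCII magic `GGUF`, and safetensors files begin with an 8-byte little-endian header length followed by a JSON object. The sketch below assumes that approach; how RLX actually detects the format is not stated here, and `WeightKind` and `sniff_format` are hypothetical names.

```rust
/// Hypothetical result type for the sniffing sketch.
#[derive(Debug, PartialEq)]
enum WeightKind {
    Gguf,
    Safetensors,
    Unknown,
}

/// Classify a weight file from its leading bytes (at least 9 bytes is enough).
fn sniff_format(head: &[u8]) -> WeightKind {
    // GGUF: 4-byte ASCII magic at offset 0.
    if head.starts_with(b"GGUF") {
        return WeightKind::Gguf;
    }
    // safetensors: u64 LE header length, then the JSON header starting with '{'.
    if head.len() >= 9 {
        let mut len = [0u8; 8];
        len.copy_from_slice(&head[..8]);
        let header_len = u64::from_le_bytes(len);
        if header_len > 0 && head[8] == b'{' {
            return WeightKind::Safetensors;
        }
    }
    WeightKind::Unknown
}
```

Reading the first 16 bytes of the file and passing them to `sniff_format` is sufficient; a file extension check can serve as a fallback for `Unknown`.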

## Adding a new model

Borrowed from Max's four-file layout; each architecture is a workspace crate `crates/rlx-<name>/`.

### 1. Create the crate

Root `Cargo.toml`:

```toml
[workspace]
members = [
    # … existing members …
    "crates/rlx-myarch",
]

[workspace.dependencies]
rlx-myarch = { path = "crates/rlx-myarch" }
```

Depend on `rlx-core`, `rlx-ir`, `rlx-flow`, `rlx-runtime` as needed.

### 2. Source layout

```text
crates/rlx-myarch/src/
├── lib.rs
├── arch.rs       # ArchSpec registration (optional)
├── config.rs     # HF config.json
├── weights.rs    # HF → RLX name map
├── builder.rs    # graph construction
├── flow.rs       # compile helpers (optional split)
└── cli.rs        # pub fn run(args: &[String])
```

`arch.rs` registers with `rlx_core::arch_registry`. `weights.rs` holds rename rules; `builder.rs` emits IR. Reference: `crates/rlx-qwen3`.

### 3. Facade re-export

In `crates/rlx-models/src/lib.rs`:

```rust
pub mod myarch {
    pub use rlx_myarch::*;
}
```

### 4. CLI (optional)

- `cli.rs` + `[[bin]] name = "rlx-myarch"`
- Register in `crates/rlx-models/src/bin/rlx_run.rs`: `register_cli("myarch", "…", rlx_myarch::cli::run)`
- Add a `just` recipe in `justfile` (optional)

### 5. High-level runner (optional)

Put `MyArchRunner` in the model crate; re-export from `crates/rlx-models/src/run.rs`.

Legacy flat modules (`rlx-bert`, `rlx-nomic`) stay as-is until they grow — use this layout for **new** architectures.

## Compile profiles (tier-1)

Compile through tier-1 profiles, not bare `Session::compile(graph)`:

| Model | Profile helper | Optional file next to weights |
|---|---|---|
| Qwen3 | `flow_util::compile_graph_qwen3_prefill_with_params` | `qwen3.rlx.toml` |
| Qwen3.5 | `compile_support::compile_qwen35_prefill` / `compile_qwen35_decode` | `qwen35.rlx.toml` |
| SAM / SAM3 | `flow_util::compile_graph_sam_with_params` | `sam.rlx.toml` |
| Encoders | `flow_util::compile_graph_encoder_with_params` | — |

Synthetic Qwen3.5 weights for CPU checks: `rlx_models::qwen35::synth` (`tiny_cfg`, `medium_cfg`, `bench_cfg`, …).

```sh
just test-quick
# cargo test -p rlx-models --test qwen35_forward_check --test compile_profile_quick_check
```

Real-GGUF / backend checks:

- **Weights:** set `QWEN35_GGUF_PATH` (LMs) or the vision env vars (`SAM3_GGUF_PATH`, `DINOV2_GGUF_PATH`, `FLUX_GGUF_PATH`, `W2V_BERT_GGUF_PATH`).
- **Drain:** `cargo test -p rlx-models --test vision_gguf_load --release`.
- **Compile quick check:** `cargo test -p rlx-models --test vision_gguf_compile --release` (SAM3 also needs `VISION_GGUF_COMPILE=1`; W2V-BERT needs `RLX_W2V_BERT_DIR` with `config.json`).
- **FLUX:** `cargo test -p rlx-models --test flux2_gguf_runner_quick_check --release` (`FLUX_GGUF_PATH` / `FLUX_MODEL_ROOT`; optional `FLUX_VAE_DIR` for VAE encode).
- **Q4_0 fused matmul:** `cargo test -p rlx-models --test gguf_legacy_quant_matmul --release`; Metal parity: `GGUF_LEGACY_METAL_PARITY=1` with `--features metal`.
- **Features:** enable `metal` / `mlx` / `cuda` / `parity-llama` per test file where noted.

## Qwen3

Prefill + decode on all seven standard backends (CPU, Metal, MLX, CUDA, ROCm, WGPU, Vulkan). Enable matching features at build time (`cargo build -p rlx-qwen3 --features all-backends`). Synthetic checks: `just features=all-backends test-qwen3-backends`. Parity: 100% top-1 vs HF (`tests/qwen3_parity.rs`).

### Safetensors

```rust
use rlx_models::qwen3::{Qwen3Config, build_qwen3_graph_sized_last_logits};
use rlx_models::weight_map::WeightMap;
use rlx_runtime::Device;

let cfg = Qwen3Config::from_file("weights/Qwen3-0.6B/config.json".as_ref())?;
let mut wm = WeightMap::from_file("weights/Qwen3-0.6B/model.safetensors")?;
let (graph, params) = build_qwen3_graph_sized_last_logits(&cfg, &mut wm, 1, 128, false)?;
let mut compiled = rlx_models::flow_util::compile_graph_qwen3_prefill_with_params(
    Device::Metal, graph, params,
)?;
```

### GGUF

```rust
use rlx_models::weight_loader::GgufLoader;
let mut wm = GgufLoader::from_file("Qwen3-0.6B-Q4_K_M.gguf")?;
// same compile + run as safetensors
```

Demo: `just example-qwen3-gguf -- path/to/model.gguf`. Verified vs `unsloth/Qwen3-0.6B-GGUF` (cosine ≈ 0.976 vs F32 safetensors on Q4_K_M).

**Directories with several `.gguf` files:** pass `ResolveWeightsOptions { prefer_gguf_substring: Some("Q4_K_M"), .. }` or `gguf_index: Some(0)` (see `rlx_core::gguf_support`). Multi-part split GGUF (`split.count` > 1) auto-merges when all shards sit in the same directory; otherwise `rlx-inspect` lists missing parts.
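
As an illustration of the two selection modes for multi-file directories, the sketch below picks one `.gguf` from a list of file names by index or by preferred substring. `pick_gguf` is a hypothetical stand-in, not `ResolveWeightsOptions`; the sorted ordering, the precedence of index over substring, and the refusal to guess when several files remain are assumptions.

```rust
/// Pick one `.gguf` file name from a directory listing.
/// Hypothetical helper: an explicit index wins, then a substring match,
/// and an unambiguous single file is accepted as a last resort.
fn pick_gguf<'a>(
    files: &[&'a str],
    prefer: Option<&str>,
    index: Option<usize>,
) -> Option<&'a str> {
    // Keep only GGUF files and sort them so that indices are stable.
    let mut ggufs: Vec<&'a str> = files
        .iter()
        .copied()
        .filter(|f| f.ends_with(".gguf"))
        .collect();
    ggufs.sort();

    if let Some(i) = index {
        return ggufs.get(i).copied();
    }
    if let Some(p) = prefer {
        if let Some(hit) = ggufs.iter().copied().find(|f| f.contains(p)) {
            return Some(hit);
        }
    }
    // Without a hint, only an unambiguous directory resolves.
    if ggufs.len() == 1 {
        Some(ggufs[0])
    } else {
        None
    }
}
```

Returning `None` for an ambiguous directory keeps the caller in charge of reporting which files were found.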

### Weights API (model-agnostic loader)

**`rlx_core::weights`** only handles paths, file formats, and drain policy. It does **not** know about Qwen, FLUX, BERT, etc.

```rust
use rlx_core::weights::{self, LoadOpts};

let (path, map) = weights::open_map("weights/")?;
let (path, map) = weights::open_map_with(LoadOpts::map().prefer_q4_k_m(), "weights/")?;
let loaded = weights::open_with(LoadOpts::loader(), "model.gguf")?; // packed take / MTP
```

**Model-specific policy** belongs in each runner:

```rust
use rlx_core::{load_weight_map, gguf_validate_arch, EMBED_GGUF_ARCHES, DINOV2_GGUF_ARCHES};

// One call: resolve path, validate arch on .gguf, drain to F32 map
let map = load_weight_map(path, DINOV2_GGUF_ARCHES)?;

// Or split validate + open (embed / custom drain policy)
gguf_validate_arch(&path, EMBED_GGUF_ARCHES)?;
let (_path, map) = weights::open_map(path)?;
```

| Layer | Responsibility |
|-------|----------------|
| `weights` / `weight_registry` | `.gguf` / `.safetensors`, resolve dir, custom extensions |
| `gguf_validate_arch`, `assert_gguf_family` | Optional arch guard in **your** crate |
| `register_gguf_tensor_resolver` | HF ↔ `blk.*` / prefix strip per checkpoint layout |
| `BertConfig::from_gguf`, `Flux2Config::from_gguf` | Hyperparameters from metadata |

**Inspect:** `rlx-inspect path [--prefer Q4_K_M] [--json]` — directory listing, split-part hints, runner suggestions.

**CLI:** LM / FLUX binaries accept `--prefer-quant` and `--gguf-index` (via `rlx_cli::resolve_weights_cli`); default quant preference is `Q4_K_M` in multi-file dirs.

**Splits:** Multi-part GGUF (`split.count` > 1) auto-merges when all parts are in the same directory; otherwise `rlx-inspect` lists missing shards.

**Legacy quants:** `Q4_0` / `Q8_0` support packed `DequantMatMul` on **CPU** and **Metal** (fused MSL dequant+matmul, 32-element blocks). Set `RLX_DISABLE_METAL_DEQUANT_GPU=1` to force host dequant on Apple GPUs.
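
For readers unfamiliar with the 32-element block layout, the sketch below dequantizes a single `Q8_0` and a single `Q4_0` block following the GGML block definitions. It is a plain host-side reference, not the fused `DequantMatMul` kernel; the function names are hypothetical, and the per-block scale is passed as `f32` here although it is stored as `f16` on disk.

```rust
/// Elements per legacy-quant block.
const BLOCK: usize = 32;

/// Q8_0: one scale `d` plus 32 signed 8-bit values; y[j] = d * q[j].
fn dequant_q8_0(d: f32, qs: &[i8; BLOCK]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(qs.iter()) {
        *o = d * f32::from(*q);
    }
    out
}

/// Q4_0: one scale `d` plus 16 bytes holding 32 unsigned 4-bit values.
/// Byte j carries element j in its low nibble and element j + 16 in its
/// high nibble; each nibble is offset by 8 before scaling.
fn dequant_q4_0(d: f32, qs: &[u8; BLOCK / 2]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (j, byte) in qs.iter().enumerate() {
        let lo = i32::from(*byte & 0x0F) - 8;
        let hi = i32::from(*byte >> 4) - 8;
        out[j] = d * lo as f32;
        out[j + BLOCK / 2] = d * hi as f32;
    }
    out
}
```

A fused kernel performs the same per-block arithmetic inside the matmul loop, so the full F32 weight tensor never has to be materialized.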

**Example:** `cargo run -p rlx-models --example custom_weight_format`

### Apple Silicon

Metal lowers to MPSGraph (per shape). Env toggles:

| env var | effect |
|---|---|
| `RLX_DISABLE_MPSGRAPH=1` | per-op Metal thunks |
| `RLX_DISABLE_MPSGRAPH_EXECUTABLE=1` | JIT MPSGraph |
| `RLX_MPSGRAPH_PARAM_CONST=1` | bake weights into executable |
| `RLX_QWEN3_F16_LM_HEAD=1` | F16 final matmul |
| `RLX_MPSGRAPH_TRACE=1` | print lowering blockers |

Harness: `examples/qwen3_matrix.rs`.

## Build and test

```sh
just check
just test
just build

cargo build -p rlx-models
cargo test  -p rlx-models
cargo test  -p rlx-models --features parity-candle
```

burnembed (`/Users/Shared/burnembed`) re-exports `rlx_models::embed` with `--features rlx`.

### Real-weight integration tests

```sh
just fetch-real-weights              # downloads ~1.5 GB of small Q4_K_M GGUFs (idempotent)
just test-real-weights               # config + compat + chat-template across 4 families (~2 s/suite)
just test-real-weights-inference     # adds end-to-end forward inference (slow on CPU)
just test-net-hf                     # live HuggingFace Hub compat check (RLX_NET_TESTS=1)
```

Covers SmolLM2 135M (`llama`), Qwen 2.5 0.5B (`qwen2`), Gemma 3 270M (`gemma3` — currently `KnownUnimplemented(M2)`), and Llama 3.2 1B (`llama` + Llama-3 RoPE scaling). The inference path verifies the full `Llama32Runner`/`Qwen3Runner` packed-decode pipeline against real downloaded GGUFs.

### Auto-dispatch + compatibility check

```sh
rlx-run check <path-or-hf-repo>      # `SUPPORTED`, `KnownUnimplemented(<milestone>)`, `MissingMetadata`, or `Unknown`
rlx-run check <path> --json          # machine-readable verdict
rlx-run auto <weights> [args...]     # sniffs arch, dispatches to the right runner
```

Programmatic: [`rlx_models::run::check_path`](crates/rlx-cli/src/compat.rs), [`check_hf_repo`](crates/rlx-cli/src/compat.rs) (requires `compat-net` feature), [`auto_dispatch`](crates/rlx-cli/src/auto_dispatch.rs), [`ChatTemplate::from_gguf`](crates/rlx-cli/src/chat.rs). The check applies the same load-time-field predicate that llama.cpp uses: `general.architecture` + `<arch>.context_length` + `<arch>.embedding_length` + `<arch>.block_count` + `tokenizer.ggml.{model,tokens}`.
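
The load-time-field predicate can be sketched over a plain key/value view of the GGUF metadata. This is an illustration, not the code in `compat.rs`: `Verdict`, `check_metadata`, the `known_arches` parameter, and the order of the checks are assumptions, and the `KnownUnimplemented` verdict is left out for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical verdict type mirroring three of the CLI outcomes.
#[derive(Debug, PartialEq)]
enum Verdict {
    Supported,
    MissingMetadata(Vec<String>),
    Unknown,
}

/// Check that the metadata carries every field needed at load time.
fn check_metadata(meta: &HashMap<String, String>, known_arches: &[&str]) -> Verdict {
    // The architecture name decides which `<arch>.*` keys to look for.
    let arch = match meta.get("general.architecture") {
        Some(a) => a.as_str(),
        None => return Verdict::MissingMetadata(vec!["general.architecture".to_string()]),
    };
    if !known_arches.contains(&arch) {
        return Verdict::Unknown;
    }
    let required = vec![
        format!("{}.context_length", arch),
        format!("{}.embedding_length", arch),
        format!("{}.block_count", arch),
        "tokenizer.ggml.model".to_string(),
        "tokenizer.ggml.tokens".to_string(),
    ];
    // Collect every absent key so the caller can report all of them at once.
    let missing: Vec<String> = required
        .into_iter()
        .filter(|k| !meta.contains_key(k))
        .collect();
    if missing.is_empty() {
        Verdict::Supported
    } else {
        Verdict::MissingMetadata(missing)
    }
}
```

Reporting the full list of missing keys, rather than the first one, makes the `--json` style of output more useful to tooling.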

## Status

### Weights and parity

**rlx GGUF** = this repo can load `.gguf` through `GgufLoader` and the family runner. **GGUF on HF** = models on the Hub tagged `library:gguf` (counts are approximate; use the search link to browse).

| family | safetensors | rlx GGUF | GGUF on Hugging Face | parity |
|---|---|---|---|---|
| `bert`, `nomic`, `vision` (`embed`) | yes | yes (`bert`, `nomic-bert`, …) | **yes** — [minilm](https://huggingface.co/models?library=gguf&search=minilm) (~128), [bge](https://huggingface.co/models?library=gguf&search=bge) (~247), [nomic](https://huggingface.co/models?library=gguf&search=nomic) (~60); e.g. [nomic-embed-text-v1.5-GGUF](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF) (`nomic-bert`), [bge-small-en-v1.5-gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf). Vision embed: no GGUF sibling. | production (safetensors) |
| `dinov2` | yes | yes (`dinov2`; F32 drain or K-quant/Q4_0/Q8_0 packed `DequantMatMul` when quant tensors present) | **no** for `facebook/dinov2-*` — [dinov2](https://huggingface.co/models?library=gguf&search=dinov2) (0). Community converters (dinov2.cpp) use `dinov2` arch; tensor names must match HF/candle keys. | production |
| `sam`, `sam2`, `sam3` | yes | yes (`sam` / `mobile-sam` / `sam2` F32 drain). **SAM3**: F32 drain or K-quant via fused CPU `gguf_matmul` (ViT, text, detector host+IR, seg cross-attn/mask/scoring, 1×1 inst/sem `DequantMatMul` IR); 3×3 pixel conv stays packed at load (one-time dequant cache on host, materialize for tier-1 IR compile) | **SAM1 ViT-H / SAM2**: no official Hub GGUF — [segment+anything](https://huggingface.co/models?library=gguf&search=segment+anything) (0), [sam2.1](https://huggingface.co/models?library=gguf&search=sam2.1) (0). **MobileSAM**: [mobilesam](https://huggingface.co/models?library=gguf&search=mobilesam) (2), e.g. [Acly/MobileSAM-GGUF](https://huggingface.co/Acly/MobileSAM-GGUF) (`mobile-sam`). **SAM3**: [sam3](https://huggingface.co/models?library=gguf&search=sam3) (1) — [rob-laz/sam3-gguf](https://huggingface.co/rob-laz/sam3-gguf) (`sam3`). Beware [TheBloke/SAM-GGUF](https://huggingface.co/TheBloke/SAM-GGUF) — 7B **chat LM** (`llama`), not Segment Anything. | production (encoder + mask path) |
| `qwen3` | yes | yes (Q4_K_M / Q5_K_M / Q6_K) | **yes** — [qwen3](https://huggingface.co/models?library=gguf&search=qwen3) (many); e.g. `unsloth/Qwen3-*-GGUF` | top-1 vs HF (`parity-candle` + weights) |
| `qwen35` | — | yes | **yes** — same hub space; e.g. `unsloth/Qwen3.5-*-GGUF` | vs llama.cpp when `QWEN35_GGUF_PATH` / `parity-llama` |
| `llama32` | yes | yes | **yes** — [llama-3.2](https://huggingface.co/models?library=gguf&search=llama-3.2) (~5k) | vs llama.cpp when `LLAMA32_GGUF_PATH` |
| `llada2` | yes | — | **preview** — [llada2](https://huggingface.co/models?library=gguf&search=llada2) (1): [LLaDA2.0-mini-preview-GGUF](https://huggingface.co/wsbagnsv1/LLaDA2.0-mini-preview-GGUF) (`llada2`) | vs PyTorch when `LLADA2_MODEL_DIR` |
| `flux2` | yes (BFL / NVFP4 safetensors) | yes (denoiser `.gguf`, `architecture: flux`; K-quant GGUF uses packed `DequantMatMul`; `Flux2Runner` + VAE/TE safetensors) | **yes** — [flux2](https://huggingface.co/models?library=gguf&search=flux2) (~53); e.g. [unsloth/FLUX.2-klein-9B-GGUF](https://huggingface.co/unsloth/FLUX.2-klein-9B-GGUF), [city96/FLUX.2-dev-gguf](https://huggingface.co/city96/FLUX.2-dev-gguf) | GGUF = denoiser only; VAE + Qwen3 TE still safetensors dirs |
| `vjepa2` | yes | yes (`vjepa2` / `vjepa`, F32 drain) | **no** Hub GGUF yet — [vjepa](https://huggingface.co/models?library=gguf&search=vjepa) (0) | synthetic + optional weight checks |
| `wav2vec2-bert` | yes | yes (`w2v-bert` / `wav2vec2`, F32 drain) | **no** for Seamless W2V-BERT — [w2v-bert](https://huggingface.co/models?library=gguf&search=w2v-bert) (0). Classic ASR: [wav2vec2](https://huggingface.co/models?library=gguf&search=wav2vec2) (~7), e.g. `cstr/wav2vec2-*-GGUF` (`wav2vec2` arch; keys may not match W2V-BERT) | vs HF when `RLX_W2V_BERT_DIR` + python reference |

To discover GGUF on the Hub: open [Models → library GGUF](https://huggingface.co/models?library=gguf) and add a **search term** matching the family (`qwen3`, `bge`, `flux2`, …). Check the model card **Architecture** field — many repos share a name but are unrelated LMs.

### Backends

Every model family targets the same standard backends: **CPU, Metal, MLX, CUDA, ROCm, WGPU (`gpu`), Vulkan**. SAM also accepts **`tpu`**. Policy lives in [`rlx_core::device_capabilities`](crates/rlx-core/src/device_capabilities.rs); runners call `validate_standard_device` (or `validate_sam_device`) at build time.

Enable GPU at compile time with matching features on `rlx-models` or any model crate, e.g. `cargo build -p rlx-qwen3 --features all-backends` or `cargo run -p rlx-models --features metal --bin rlx-run -- qwen3 …`. Per-crate binaries (`rlx-qwen3`, `rlx-sam3`, …) expose the same feature names. CLI: `cpu`, `metal`/`mps`, `mlx`, `cuda`, `rocm`/`hip`, `gpu`/`wgpu`, `vulkan`.
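
The CLI device names and their aliases listed above map onto seven backends. A minimal sketch of that mapping follows; `Backend` and `parse_backend` are hypothetical names rather than the `rlx_runtime::Device` API, and exact, case-sensitive matching is an assumption.

```rust
/// Hypothetical enum for the seven standard backends.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Metal,
    Mlx,
    Cuda,
    Rocm,
    Wgpu,
    Vulkan,
}

/// Map a `--device` string (including aliases) to a backend.
/// Returns `None` for names outside the standard set, such as `tpu`.
fn parse_backend(name: &str) -> Option<Backend> {
    match name {
        "cpu" => Some(Backend::Cpu),
        "metal" | "mps" => Some(Backend::Metal),
        "mlx" => Some(Backend::Mlx),
        "cuda" => Some(Backend::Cuda),
        "rocm" | "hip" => Some(Backend::Rocm),
        "gpu" | "wgpu" => Some(Backend::Wgpu),
        "vulkan" => Some(Backend::Vulkan),
        _ => None,
    }
}
```

Whether a parsed backend is usable still depends on the Cargo features enabled at build time.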

Legend: ✅ supported · ⚠️ partial (host fallback or open runtime gap) · ❌ not supported

| family | cpu | metal | mlx | cuda | rocm | wgpu | vulkan | notes |
|---|---|---|---|---|---|---|---|---|
| `embed` (`bert`, `nomic`, `vision`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [`RlxEmbed::from_dir_on`](crates/rlx-embed/src/runtime.rs); `from_dir` defaults to CPU |
| `dinov2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [`DinoV2Runner`](crates/rlx-dinov2/src/runner.rs) `--device` |
| `sam`, `sam2`, `sam3` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | SAM v1 also accepts `tpu`; CPU/Metal/MLX most exercised in CI |
| `qwen3` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | packed GGUF on chosen device; MTP speculative decode not wired yet |
| `qwen35` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | `--device` on all backends; some ops use host GDN/dequant on GPU; MoE offload may keep experts on host |
| `llama32` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Same standard set as Qwen3-class LMs |
| `llada2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | MoE predictive expert offload on all standard backends (GPU uses resident experts + host fallback) |
| `flux2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Full pipeline; text encoder compiled on Metal/MLX by default, host once on CUDA/ROCm/WGPU/Vulkan |
| `vjepa2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Runner `--device` |
| `wav2vec2-bert` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [`Wav2Vec2BertRunner`](crates/rlx-wav2vec2-bert/src/runner.rs) `--device` |

Multi-tenant serving (paged KV, continuous batching) lives in `rlx_runtime::paged_kv`; `qwen3::generator` is single-stream.

## Gotchas

- Safetensors names ≠ IR `Param` names — `weight_map.rs` renames; GGUF uses `GgufLoader`.
- **GGUF LMs** (`qwen3`, `qwen35`, `llama32`): pass a `.gguf` file or a directory with one `.gguf` / `model.safetensors`. Wrong-family files get a redirect (`rlx_core::assert_gguf_family`). Shared helpers: `resolve_weights_file`, `WeightFormat::resolve`, `open_loader_resolved`.
- **GGUF elsewhere on HF** (embed, FLUX, SAM3, …) does not imply rlx support — see [Weights and parity](#weights-and-parity) column *GGUF on Hugging Face*.
- **GGUF shapes** are innermost-first labels; byte layout matches safetensors row-major — do not transpose in `take`.
- Unsupported GGUF quants (Q1_0, Q2_K, IQ*, …) error cleanly.
- **27B GGUF on Mac**: F32 dequant ≈ 108 GB; needs Metal `Op::DequantMatMul` to stay packed (~13.5 GB).
- Pooling in `embed::pooling`.
- New arch: new crate under `crates/`, facade hook, optional parity test.
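
The memory figures in the 27B bullet follow from simple arithmetic: parameters times bits per weight, divided by 8 and by 10^9 for decimal gigabytes. The sketch below reproduces them; `weight_gb` is a hypothetical helper, and the 4 bits per weight used for the packed case ignores per-block scales, so real quantized files are somewhat larger.

```rust
/// Approximate weight memory in decimal gigabytes.
/// Hypothetical helper; ignores activations, KV cache, and quant block scales.
fn weight_gb(params: u64, bits_per_weight: f64) -> f64 {
    params as f64 * bits_per_weight / 8.0 / 1e9
}
```

For 27 billion parameters, `weight_gb(27_000_000_000, 32.0)` gives 108 GB for F32, while `weight_gb(27_000_000_000, 4.0)` gives 13.5 GB for weights kept packed at 4 bits.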

## License

GPL-3.0-only.