rlx-models
Concrete model graph builders + weight loaders for RLX — the "what actually runs" layer.
Standalone repo: github.com/MIT-RLX/rlx-models. Clone next to rlx:
rlx-workspace/
rlx/ # github.com/MIT-RLX/rlx
rlx-models/ # github.com/MIT-RLX/rlx-models
candle/ # optional, for parity-candle
The RLX monorepo lists ../rlx-models/crates/rlx-models as a workspace member; you can also run cd rlx && cargo test -p rlx-models there.
Agent-oriented quick reference: AGENTS.md.
Contents
- Architecture
- Running models
- What's here
- Install
- Quickstart — embeddings
- High-level runner API
- Adding a new model
- Compile profiles
- Qwen3
- Build and test
- Status
- Gotchas
- License
Architecture
This repo is a Cargo workspace: one library crate per model family under crates/, plus shared infrastructure. The rlx-models package is a thin facade that re-exports historical paths (rlx_models::qwen3, rlx_models::sam, …).
rlx-models/
├── Cargo.toml # workspace members + [workspace.dependencies]
├── justfile # shortcuts (optional)
├── crates/
│ ├── rlx-models-core/ # config, weight_map, flow_bridge (package `rlx-core`)
│ ├── rlx-ssm/ # SSM flow stages + custom ops (Mamba, LFM, …)
│ ├── rlx-cli/ # shared CLI + rlx-inspect
│ ├── rlx-<model>/ # one crate per family
│ └── rlx-models/ # facade + optional rlx-run multiplexer
└── crates/rlx-models/examples/ # integration templates
Crates
| Crate | Model / role |
|---|---|
| `rlx-models-core` (`rlx-core`) | config, weight_map, weight_loader, flow_bridge, flow_util |
| `rlx-ssm` | SSM flow stages (MambaScanStage, decode-step custom ops) |
| `rlx-mamba` | Mamba1 block + multi-backend driver |
| `rlx-bert` | BERT |
| `rlx-nomic` | NomicBERT |
| `rlx-vision` | NomicVision |
| `rlx-dinov2` | DINOv2 |
| `rlx-embed` | embedding runtime |
| `rlx-sam` / `sam2` / `sam3` | SAM family |
| `rlx-sam-ir` | shared mask-decoder IR |
| `rlx-qwen3` | Qwen3 LM |
| `rlx-qwen35` | Qwen3.5 / 3.6 |
| `rlx-llama32` | LLaMA 3.2 |
| `rlx-gemma` | Gemma / Gemma 2 |
| `rlx-llada2` | LLaDA2 + TIDE offload |
| `rlx-flux2` | FLUX.2 |
| `rlx-vjepa2` | V-JEPA 2 |
| `rlx-wav2vec2-bert` | Wav2Vec2-BERT |
| `rlx-whisper` | OpenAI Whisper ASR |
| `rlx-cli` | shared CLI helpers + rlx-inspect |
| `rlx-models` | facade (re-exports) + optional rlx-run multiplexer |
How to depend
| Goal | Depend on |
|---|---|
| One model only (fast builds) | `rlx-qwen3`, `rlx-sam3`, … |
| Stable `rlx_models::qwen3` paths | `rlx-models` facade |
| CLI / inspect only | `rlx-cli` |
New code that only needs Qwen3 should depend on rlx-qwen3 directly.
Per-crate binaries
Each model crate with a CLI has src/cli.rs (pub fn run) and src/bin/rlx-<name>.rs. Shared flag parsing lives in rlx-cli.
rlx-run (in rlx-models) is an optional multiplexer over all built-in CLIs. Prefer per-crate binaries when you only need one family — they link less and compile faster.
SAM unified runner: SamRunner (SAM1/2/3) stays on the facade (rlx-models/src/sam_runner.rs) because rlx-sam2 depends on rlx-sam. Per-arch CLIs are on rlx-sam, rlx-sam2, rlx-sam3.
Published rlx* crates (rlx-runtime, rlx-flow, …) are pulled from crates.io (0.2.1 in [workspace.dependencies]). Optional local overrides: see .cargo/config.toml.example.
Running models
just (shortcuts)
Install just (brew install just). From the repo root:
Pass model CLI flags after --. GPU backends: just features=all-backends qwen3 -- --device metal, just qwen35-all-backends -- …, or per-crate qwen3-all-backends / qwen35-all-backends.
Per-crate binaries (recommended)
| Binary | Crate | Example |
|---|---|---|
| `rlx-qwen3` | `rlx-qwen3` | `cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- --weights model.gguf --prompt-ids 1,2,3` |
| `rlx-qwen35` | `rlx-qwen35` | `cargo run -p rlx-qwen35 --bin rlx-qwen35 --release -- …` |
| `rlx-llama32` | `rlx-llama32` | `cargo run -p rlx-llama32 --bin rlx-llama32 --release -- …` |
| `rlx-gemma` | `rlx-gemma` | `cargo run -p rlx-gemma --bin rlx-gemma --release -- --weights model.gguf --prompt-ids 1,2,3` |
| `rlx-dinov2` | `rlx-dinov2` | `cargo run -p rlx-dinov2 --bin rlx-dinov2 --release -- …` |
| `rlx-vjepa2` | `rlx-vjepa2` | `cargo run -p rlx-vjepa2 --bin rlx-vjepa2 --release -- …` |
| `rlx-wav2vec2-bert` | `rlx-wav2vec2-bert` | `cargo run -p rlx-wav2vec2-bert --bin rlx-wav2vec2-bert --release -- …` |
| `rlx-whisper` | `rlx-whisper` | `cargo run -p rlx-whisper --bin rlx-whisper --release -- --weights model.safetensors --wav audio16k.wav` |
| `rlx-sam1` | `rlx-sam` | `cargo run -p rlx-sam --bin rlx-sam1 --release -- …` |
| `rlx-sam2` | `rlx-sam2` | `cargo run -p rlx-sam2 --bin rlx-sam2 --release -- …` |
| `rlx-sam3` | `rlx-sam3` | `cargo run -p rlx-sam3 --bin rlx-sam3 --release -- …` |
| `rlx-flux2` | `rlx-flux2` | `cargo run -p rlx-flux2 --bin rlx-flux2 --release -- …` |
| `rlx-flux2-serve` | `rlx-flux2` | JSON-lines server on stdin |
| `rlx-inspect` | `rlx-cli` | `cargo run -p rlx-cli --bin rlx-inspect -- model.gguf` |
Flags match the corresponding rlx-run subcommand (without the subcommand name).
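As an illustration of the flag shapes shown in the table (`--weights <path>`, `--prompt-ids 1,2,3`), here is a minimal, self-contained sketch of how such arguments could be read. The helper names `flag_value` and `parse_prompt_ids` are hypothetical and are not the rlx-cli API; the real parsing lives in rlx-cli and may differ.

```rust
/// Return the value that follows `flag` in an argument list, if present.
/// Hypothetical helper for illustration only.
pub fn flag_value<'a>(args: &'a [String], flag: &str) -> Option<&'a str> {
    args.iter()
        .position(|a| a == flag)
        .and_then(|i| args.get(i + 1))
        .map(String::as_str)
}

/// Parse a comma-separated token-id list such as "1,2,3".
/// Empty segments are skipped; any non-numeric segment is an error.
pub fn parse_prompt_ids(raw: &str) -> Result<Vec<u32>, String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| s.parse::<u32>().map_err(|e| format!("bad token id {s:?}: {e}")))
        .collect()
}
```

Collecting into `Result<Vec<u32>, String>` stops at the first malformed id, so a typo in the prompt fails before any weights are loaded.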
Multiplexer (rlx-run)
rlx-inspect dumps format, tensor count, dtype histogram, GGUF metadata, MTP heads, and multi-.gguf dir hints (--prefer Q4_K_M).
Custom CLI
Downstream tools can register runners without forking rlx-models:
use ;
register_cli;
dispatch?;
See crates/rlx-models/examples/register_custom_runner.rs.
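The register-then-dispatch pattern can be sketched in isolation as follows. This is not the rlx-cli implementation: `CliRegistry`, the `Runner` signature, and `my_runner` are assumptions made for the example, and only the names `register_cli` and `dispatch` are taken from the text above.

```rust
use std::collections::BTreeMap;

/// Assumed entry-point shape for a subcommand (modelled on `pub fn run(args)`).
pub type Runner = fn(&[String]) -> Result<String, String>;

/// Minimal name -> (description, runner) table. Hypothetical type.
#[derive(Default)]
pub struct CliRegistry {
    runners: BTreeMap<String, (String, Runner)>,
}

impl CliRegistry {
    /// Register a runner under a subcommand name.
    pub fn register_cli(&mut self, name: &str, about: &str, run: Runner) {
        self.runners.insert(name.to_string(), (about.to_string(), run));
    }

    /// Route `args[0]` to the matching runner and pass it the remaining args.
    pub fn dispatch(&self, args: &[String]) -> Result<String, String> {
        let (name, rest) = args.split_first().ok_or("missing subcommand")?;
        match self.runners.get(name) {
            Some((_, run)) => run(rest),
            None => Err(format!(
                "unknown subcommand {name:?}; known: {:?}",
                self.runners.keys().collect::<Vec<_>>()
            )),
        }
    }
}

/// Example downstream runner: reports how many arguments it received.
pub fn my_runner(args: &[String]) -> Result<String, String> {
    Ok(format!("myarch got {} args", args.len()))
}
```

A downstream tool would register its own runner next to the built-in ones and call `dispatch` with the process arguments minus the program name.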
Examples (facade)
Integration templates on the rlx-models package:
| File | What it does |
|---|---|
| `run_qwen3_safetensors.rs` | Qwen3 from HF safetensors, builder API, streaming greedy decode |
| `run_qwen3_gguf.rs` | Same from .gguf (Q4_K_M / Q5_K_M / Q6_K), MTP head detection |
| `run_sam1.rs` | SAM 1: encode image, prompt encoder + mask decoder |
| `run_sam2.rs` | SAM 2: FPN + memory attention |
| `run_sam3.rs` | SAM 3: text-conditioned detection + masks |
| `qwen3_gguf_inference.rs` | Detailed Qwen3 GGUF walk-through |
| `gguf_qwen3_probe.rs` | Validate hf_to_gguf_name against a real GGUF |
| `qwen3_matrix.rs` | (B, L, mode) × (CPU, Metal, MLX, wgpu) parity + perf vs candle |
SAM examples synthesize a 1024×1024 RGB gradient — swap in image::open(path) for real images.
Weight fetch (optional)
docker/qwen3-fetch/ — container pulls HF checkpoints into ./weights; host runs cargo test / benches natively.
# or: docker build -t rlx-qwen3-fetch docker/qwen3-fetch && …
What's here
- `qwen3`: Qwen3 decoder LM (GQA, QK-norm, RoPE, SwiGLU, tied embeddings). Safetensors + GGUF; optional `qwen3.rlx.toml`. See Qwen3.
- `qwen35`: Qwen3.5 / 3.6 hybrid (gated DeltaNet + periodic attention + optional MTP). GGUF via `Qwen35Runner`; optional `qwen35.rlx.toml`. Parity: `examples/qwen35_compare.rs` with the llama.cpp reference script in `examples/`.
- `gemma`: Gemma / Gemma 2 (GQA, RoPE, GeGLU, embed `sqrt(d)` scaling, tied weights, Gemma2 logit softcap + V2 norms). Safetensors + GGUF (`gemma` / `gemma2` arch); optional `gemma.rlx.toml` (`dce = false` for V2). CLI: `rlx-gemma` / `rlx-run gemma`. Candle parity (CPU): `cargo test -p rlx-models --test gemma_parity --features parity-candle gemma2_synthetic --release`. Multi-backend: `just features=all-backends test-gemma-backends`.
- `bert`: BERT graph builder (MiniLM, BGE, all-MiniLM-L6-v2).
- `nomic`: NomicBERT (RoPE + SwiGLU).
- `vision`: NomicVision-style encoders.
- `dinov2`: DINOv2 ViT (B/14, L/14, g/14).
- `sam`, `sam2`, `sam3`: Segment Anything encoders + mask decoders. Optional `sam.rlx.toml` next to weights (reference: `crates/rlx-sam/src/sam.rlx.toml`).
- `flux2`: FLUX.2 rectified-flow denoiser. `rlx-flux2` CLI; presets `flux2_dev()`, `flux2_klein_4b()`, `flux2_klein_9b()`. VAE, CFG, img2img, LoRA, `hf-download`, `rlx-flux2-serve`. GPU backends via `rlx-models` features (`metal`, `cuda`, …).
- `embed`: `RlxEmbed`, registry, tokenizers, pooling. `from_pretrained` with `hf-download`.
- `config`, `weight_loader`: HF config parsing; `WeightMap` + `GgufLoader` (K-quants, MTP isolation).
- `mamba`: Mamba1 SSM block (`rlx-mamba`); SSM via `rlx-ssm` + `SelectiveScan`. See crates/rlx-mamba/README.md.
- `lfm`, `minimax`, `nemotron`: hybrid runners using `rlx-ssm` decode-step stages.
- `run`: `Qwen3Runner`, `SamRunner`, … builders for one-call inference.
Install
[dependencies]
rlx-models = "0.2"
HF-hub download:
rlx-models = { version = "0.2", features = ["hf-download"] }
Quickstart — embeddings
use ;
let mut model = from_pretrained?;
let hidden = model.forward?;
High-level runner API
rlx_models::run exposes builder-style entry points (also rlx::run in the monorepo):
use ;
use Device;
let mut runner = builder
.weights
.device
.max_seq
.precision
.max_memory_gb
.stream
.use_mtp
.packed_weights
.build?;
runner.generate?;
Packed weights (large GGUF on limited RAM — CPU-only, memory-frugal, slower):
let mut runner = builder
.weights
.packed_weights
.max_seq
.build?;
runner.generate?;
let logits = runner.predict_logits?;
Format (safetensors vs gguf) is auto-detected. SAM uses SamRunner::builder(SamArch::Sam2).
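How the runner detects the format is not spelled out here. One common approach is to look at the first bytes of the file: a GGUF file begins with the ASCII magic `GGUF`, and a safetensors file begins with an 8-byte little-endian header length followed by a JSON object. The sketch below assumes that approach; `SniffedFormat` and `sniff_weight_format` are hypothetical names, not the rlx_core API, which may instead rely on file extensions.

```rust
/// Result of inspecting the first bytes of a weight file. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum SniffedFormat {
    Gguf,
    Safetensors,
}

/// Guess the container format from the leading bytes of a file.
pub fn sniff_weight_format(head: &[u8]) -> Option<SniffedFormat> {
    // GGUF: 4-byte ASCII magic "GGUF" at offset 0.
    if head.starts_with(b"GGUF") {
        return Some(SniffedFormat::Gguf);
    }
    // safetensors: u64 little-endian header length, then a JSON object ('{').
    if head.len() > 8 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        let header_len = u64::from_le_bytes(len_bytes);
        if header_len > 0 && head[8] == b'{' {
            return Some(SniffedFormat::Safetensors);
        }
    }
    None
}
```

Reading only the first nine bytes is enough for this check, so it stays cheap even for multi-gigabyte checkpoints.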
CLI equivalent:
# or: cargo run -p rlx-qwen3 --bin rlx-qwen3 --release -- …
Adding a new model
Borrowed from Max's four-file layout; each architecture is a workspace crate crates/rlx-<name>/.
1. Create the crate
Root Cargo.toml:
# [workspace.members]
# [workspace.dependencies]
rlx-myarch = { path = "crates/rlx-myarch" }
Depend on rlx-core, rlx-ir, rlx-flow, rlx-runtime as needed.
2. Source layout
crates/rlx-myarch/src/
├── lib.rs
├── arch.rs # ArchSpec registration (optional)
├── config.rs # HF config.json
├── weights.rs # HF → RLX name map
├── builder.rs # graph construction
├── flow.rs # compile helpers (optional split)
└── cli.rs # pub fn run(args: &[String])
arch.rs registers with rlx_core::arch_registry. weights.rs holds rename rules; builder.rs emits IR. Reference: crates/rlx-qwen3.
3. Facade re-export
In crates/rlx-models/src/lib.rs:
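The facade pattern amounts to re-exporting the model crate under a short module name. The following self-contained sketch uses inline modules as stand-ins for the two crates; `MyArchConfig` and `describe` are made-up contents, and in the real workspace the line would presumably read `pub use rlx_myarch as myarch;` rather than going through `super`.

```rust
/// Stand-in for the model crate `rlx-myarch` (contents are hypothetical).
pub mod rlx_myarch {
    pub struct MyArchConfig {
        pub hidden: usize,
    }

    pub fn describe(cfg: &MyArchConfig) -> String {
        format!("myarch(hidden={})", cfg.hidden)
    }
}

/// Stand-in for the facade crate `rlx-models`.
pub mod rlx_models {
    // Re-export the whole model crate under the short historical path,
    // so callers can write `rlx_models::myarch::...`.
    pub use super::rlx_myarch as myarch;
}
```

Callers of the facade path and callers of the crate itself end up using the same items, so existing `rlx_models::…` paths keep working after a model moves into its own crate.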
4. CLI (optional)
- `cli.rs` + `[[bin]] name = "rlx-myarch"`
- Register in `crates/rlx-models/src/bin/rlx_run.rs`: `register_cli("myarch", "…", rlx_myarch::cli::run)`
- Add a `just` recipe in `justfile` (optional)
5. High-level runner (optional)
Put MyArchRunner in the model crate; re-export from crates/rlx-models/src/run.rs.
Legacy flat modules (rlx-bert, rlx-nomic) stay as-is until they grow — use this layout for new architectures.
Compile profiles (tier-1)
Compile through tier-1 profiles, not bare Session::compile(graph):
| Model | Profile helper | Optional file next to weights |
|---|---|---|
| Qwen3 | `flow_util::compile_graph_qwen3_prefill_with_params` | `qwen3.rlx.toml` |
| Qwen3.5 | `compile_support::compile_qwen35_prefill` / `compile_qwen35_decode` | `qwen35.rlx.toml` |
| SAM / SAM3 | `flow_util::compile_graph_sam_with_params` | `sam.rlx.toml` |
| Encoders | `flow_util::compile_graph_encoder_with_params` | - |
Synthetic Qwen3.5 weights for CPU checks: rlx_models::qwen35::synth (tiny_cfg, medium_cfg, bench_cfg, …).
# cargo test -p rlx-models --test qwen35_forward_check --test compile_profile_quick_check
Real-GGUF / backend checks: set QWEN35_GGUF_PATH (LMs) or the vision env vars (SAM3_GGUF_PATH, DINOV2_GGUF_PATH, FLUX_GGUF_PATH, W2V_BERT_GGUF_PATH).
- Drain: cargo test -p rlx-models --test vision_gguf_load --release
- Compile quick check: cargo test -p rlx-models --test vision_gguf_compile --release (SAM3 also needs VISION_GGUF_COMPILE=1; W2V-BERT needs RLX_W2V_BERT_DIR with config.json)
- FLUX: cargo test -p rlx-models --test flux2_gguf_runner_quick_check --release (FLUX_GGUF_PATH / FLUX_MODEL_ROOT; optional FLUX_VAE_DIR for VAE encode)
- Q4_0 fused matmul: cargo test -p rlx-models --test gguf_legacy_quant_matmul --release; Metal parity: GGUF_LEGACY_METAL_PARITY=1 with --features metal

Enable metal / mlx / cuda / parity-llama per test file where noted.
Qwen3
Prefill + decode on all seven standard backends (CPU, Metal, MLX, CUDA, ROCm, WGPU, Vulkan). Enable matching features at build time (cargo build -p rlx-qwen3 --features all-backends). Synthetic checks: just features=all-backends test-qwen3-backends. Parity: 100% top-1 vs HF (tests/qwen3_parity.rs).
Safetensors
use ;
use WeightMap;
use Device;
let cfg = from_file?;
let mut wm = from_file?;
let = build_qwen3_graph_sized_last_logits?;
let mut compiled = compile_graph_qwen3_prefill_with_params?;
GGUF
use GgufLoader;
let mut wm = from_file?;
// same compile + run as safetensors
Demo: just example-qwen3-gguf -- path/to/model.gguf. Verified vs unsloth/Qwen3-0.6B-GGUF (cosine ≈ 0.976 vs F32 safetensors on Q4_K_M).
Directories with several .gguf files: pass ResolveWeightsOptions { prefer_gguf_substring: Some("Q4_K_M"), .. } or gguf_index: Some(0) (see rlx_core::gguf_support). Multi-part split GGUF (split.count > 1) auto-merges when all shards sit in the same directory; otherwise rlx-inspect lists missing parts.
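The selection rule for a directory holding several quantizations can be sketched as below. This is a guess at the policy, not the `ResolveWeightsOptions` implementation: the function name `pick_gguf`, the alphabetical ordering, the precedence of an explicit index over the substring preference, and the fallback to the first file are all assumptions.

```rust
/// Choose one `.gguf` file name from a directory listing.
/// Assumed precedence: explicit index, then preferred substring, then first file.
pub fn pick_gguf<'a>(
    files: &[&'a str],
    prefer_substring: Option<&str>,
    index: Option<usize>,
) -> Option<&'a str> {
    // Keep only .gguf files and sort them so indices are stable across runs.
    let mut ggufs: Vec<&'a str> = files
        .iter()
        .copied()
        .filter(|f| f.to_ascii_lowercase().ends_with(".gguf"))
        .collect();
    ggufs.sort_unstable();

    if let Some(i) = index {
        return ggufs.get(i).copied();
    }
    if let Some(needle) = prefer_substring {
        if let Some(hit) = ggufs.iter().copied().find(|f| f.contains(needle)) {
            return Some(hit);
        }
    }
    ggufs.first().copied()
}
```

Sorting before indexing matters because directory iteration order is platform-dependent; without it, index 0 could name a different file on each machine.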
Weights API (model-agnostic loader)
rlx_core::weights only handles paths, file formats, and drain policy. It does not know about Qwen, FLUX, BERT, etc.
use ;
let = open_map?;
let = open_map_with?;
let loaded = open_with?; // packed take / MTP
Model-specific policy belongs in each runner:
use ;
// One call: resolve path, validate arch on .gguf, drain to F32 map
let map = load_weight_map?;
// Or split validate + open (embed / custom drain policy)
gguf_validate_arch?;
let = open_map?;
| Layer | Responsibility |
|---|---|
| `weights` / `weight_registry` | .gguf / .safetensors, resolve dir, custom extensions |
| `gguf_validate_arch`, `assert_gguf_family` | Optional arch guard in your crate |
| `register_gguf_tensor_resolver` | HF ↔ blk.* / prefix strip per checkpoint layout |
| `BertConfig::from_gguf`, `Flux2Config::from_gguf` | Hyperparameters from metadata |
Inspect: rlx-inspect path [--prefer Q4_K_M] [--json] — directory listing, split-part hints, runner suggestions.
CLI: LM / FLUX binaries accept --prefer-quant and --gguf-index (via rlx_cli::resolve_weights_cli); default quant preference is Q4_K_M in multi-file dirs.
Splits: Multi-part GGUF (split.count > 1) auto-merges when all parts are in the same directory; otherwise rlx-inspect lists missing shards.
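Detecting missing parts can be sketched from file names alone, assuming the shards follow the llama.cpp split naming convention `<prefix>-00001-of-00003.gguf`. Whether rlx-inspect uses file names or the `split.count` metadata for this is not stated above, and `parse_shard` / `missing_shards` are hypothetical helpers.

```rust
/// Parse "<prefix>-NNNNN-of-MMMMM.gguf" into (prefix, index, count).
/// Returns None for names that do not follow the assumed convention.
pub fn parse_shard(name: &str) -> Option<(&str, u32, u32)> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, count) = stem.rsplit_once("-of-")?;
    let (prefix, index) = rest.rsplit_once('-')?;
    let index: u32 = index.parse().ok()?;
    let count: u32 = count.parse().ok()?;
    if index >= 1 && index <= count {
        Some((prefix, index, count))
    } else {
        None
    }
}

/// List the 1-based shard indices that are absent from `names`,
/// relative to the first shard set found in the listing.
pub fn missing_shards(names: &[&str]) -> Vec<u32> {
    let parsed: Vec<(&str, u32, u32)> =
        names.iter().copied().filter_map(parse_shard).collect();
    let (prefix, count) = match parsed.first() {
        Some(&(prefix, _, count)) => (prefix, count),
        None => return Vec::new(),
    };
    (1..=count)
        .filter(|i| !parsed.iter().any(|&(p, idx, c)| p == prefix && c == count && idx == *i))
        .collect()
}
```

An empty result means either that all parts are present or that no file matched the naming pattern, so a caller that needs to tell those apart should check `parse_shard` on the chosen file first.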
Legacy quants: Q4_0 / Q8_0 support packed DequantMatMul on CPU and Metal (fused MSL dequant+matmul, 32-element blocks). Set RLX_DISABLE_METAL_DEQUANT_GPU=1 to force host dequant on Apple GPUs.
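For readers unfamiliar with the 32-element block layout, here is a CPU-side sketch of Q4_0 as defined by ggml: each block stores one scale and 16 bytes of packed 4-bit values, where the low nibble of byte j is element j and the high nibble is element j + 16, each offset by 8. The scale is stored as f16 on disk; it is passed as f32 here to keep the example dependency-free. The function names are hypothetical, and this is not the RLX kernel or its MSL counterpart.

```rust
/// Elements per Q4_0 block.
pub const QK4_0: usize = 32;

/// Dequantize one Q4_0 block into 32 f32 values.
pub fn dequant_q4_0_block(d: f32, qs: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (j, &byte) in qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // element j
        let hi = (byte >> 4) as i32 - 8; // element j + 16
        out[j] = lo as f32 * d;
        out[j + QK4_0 / 2] = hi as f32 * d;
    }
    out
}

/// Fused variant: dot product of one packed block with 32 activations,
/// without materializing the dequantized weights.
pub fn dot_q4_0_block(d: f32, qs: &[u8; QK4_0 / 2], x: &[f32; QK4_0]) -> f32 {
    let mut acc = 0.0f32;
    for (j, &byte) in qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        acc += lo as f32 * x[j] + hi as f32 * x[j + QK4_0 / 2];
    }
    // The scale is constant within a block, so it is applied once at the end.
    acc * d
}
```

The fused form is what lets the weights stay packed in memory: only the accumulator is in f32, never a full dequantized copy of the matrix.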
Example: cargo run -p rlx-models --example custom_weight_format
Apple Silicon
Metal lowers to MPSGraph (per shape). Env toggles:
| env var | effect |
|---|---|
| `RLX_DISABLE_MPSGRAPH=1` | per-op Metal thunks |
| `RLX_DISABLE_MPSGRAPH_EXECUTABLE=1` | JIT MPSGraph |
| `RLX_MPSGRAPH_PARAM_CONST=1` | bake weights into executable |
| `RLX_QWEN3_F16_LM_HEAD=1` | F16 final matmul |
| `RLX_MPSGRAPH_TRACE=1` | print lowering blockers |
Harness: examples/qwen3_matrix.rs.
Build and test
burnembed (/Users/Shared/burnembed) re-exports rlx_models::embed with --features rlx.
Real-weight integration tests
Covers SmolLM2 135M (llama), Qwen 2.5 0.5B (qwen2), Gemma 3 270M (gemma3 — currently KnownUnimplemented(M2)), and Llama 3.2 1B (llama + Llama-3 RoPE scaling). The inference path verifies the full Llama32Runner/Qwen3Runner packed-decode pipeline against real downloaded GGUFs.
Auto-dispatch + compatibility check
Programmatic: rlx_models::run::check_path, check_hf_repo (requires compat-net feature), auto_dispatch, ChatTemplate::from_gguf. Implements the same load-time-field predicate llama.cpp uses (general.architecture + <arch>.context_length + <arch>.embedding_length + <arch>.block_count + tokenizer.ggml.{model,tokens}).
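The load-time predicate listed above can be written down as a plain key check. The sketch below only encodes the key names given in this section; `required_keys` and `missing_keys` are hypothetical helpers, not the `check_path` implementation, and a real check would also need to read the metadata out of the GGUF file.

```rust
use std::collections::BTreeSet;

/// Metadata keys that must be present for architecture `arch`.
pub fn required_keys(arch: &str) -> Vec<String> {
    vec![
        "general.architecture".to_string(),
        format!("{arch}.context_length"),
        format!("{arch}.embedding_length"),
        format!("{arch}.block_count"),
        "tokenizer.ggml.model".to_string(),
        "tokenizer.ggml.tokens".to_string(),
    ]
}

/// Return the required keys that are absent from the file's metadata key set.
pub fn missing_keys(arch: &str, present: &BTreeSet<String>) -> Vec<String> {
    required_keys(arch)
        .into_iter()
        .filter(|k| !present.contains(k))
        .collect()
}
```

Returning the list of missing keys, rather than a bare boolean, gives the caller enough detail to print an actionable compatibility report.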
Status
Weights and parity
rlx GGUF = this repo can load .gguf through GgufLoader and the family runner. GGUF on HF = models on the Hub tagged library:gguf (counts are approximate; use the search link to browse).
| family | safetensors | rlx GGUF | GGUF on Hugging Face | parity |
|---|---|---|---|---|
| `bert`, `nomic`, `vision` (embed) | yes | yes (bert, nomic-bert, …) | yes: minilm (~128), bge (~247), nomic (~60); e.g. nomic-embed-text-v1.5-GGUF (nomic-bert), bge-small-en-v1.5-gguf. Vision embed: no GGUF sibling. | production (safetensors) |
| `dinov2` | yes | yes (dinov2; F32 drain or K-quant/Q4_0/Q8_0 packed DequantMatMul when quant tensors present) | no for `facebook/dinov2-*`: dinov2 (0). Community converters (dinov2.cpp) use dinov2 arch; tensor names must match HF/candle keys. | production |
| `sam`, `sam2`, `sam3` | yes | yes (sam / mobile-sam / sam2 F32 drain). SAM3: F32 drain or K-quant via fused CPU gguf_matmul (ViT, text, detector host+IR, seg cross-attn/mask/scoring, 1×1 inst/sem DequantMatMul IR); 3×3 pixel conv stays packed at load (one-time dequant cache on host, materialize for tier-1 IR compile) | SAM1 ViT-H / SAM2: no official Hub GGUF: segment+anything (0), sam2.1 (0). MobileSAM: mobilesam (2), e.g. Acly/MobileSAM-GGUF (mobile-sam). SAM3: sam3 (1), rob-laz/sam3-gguf (sam3). Beware TheBloke/SAM-GGUF: it is a 7B chat LM (llama), not Segment Anything. | production (encoder + mask path) |
| `qwen3` | yes | yes (Q4_K_M / Q5_K_M / Q6_K) | yes: qwen3 (many); e.g. `unsloth/Qwen3-*-GGUF` | top-1 vs HF (parity-candle + weights) |
| `qwen35` | - | yes | yes: same hub space; e.g. `unsloth/Qwen3.5-*-GGUF` | vs llama.cpp when QWEN35_GGUF_PATH / parity-llama |
| `llama32` | yes | yes | yes: llama-3.2 (~5k) | vs llama.cpp when LLAMA32_GGUF_PATH |
| `llada2` | yes | - | preview: llada2 (1), LLaDA2.0-mini-preview-GGUF (llada2) | vs PyTorch when LLADA2_MODEL_DIR |
| `flux2` | yes (BFL / NVFP4 safetensors) | yes (denoiser .gguf, architecture: flux; K-quant GGUF uses packed DequantMatMul; Flux2Runner + VAE/TE safetensors) | yes: flux2 (~53); e.g. unsloth/FLUX.2-klein-9B-GGUF, city96/FLUX.2-dev-gguf | GGUF = denoiser only; VAE + Qwen3 TE still safetensors dirs |
| `vjepa2` | yes | yes (vjepa2 / vjepa, F32 drain) | no Hub GGUF yet: vjepa (0) | synthetic + optional weight checks |
| `wav2vec2-bert` | yes | yes (w2v-bert / wav2vec2, F32 drain) | no for Seamless W2V-BERT: w2v-bert (0). Classic ASR: wav2vec2 (~7), e.g. `cstr/wav2vec2-*-GGUF` (wav2vec2 arch; keys may not match W2V-BERT) | vs HF when RLX_W2V_BERT_DIR + python reference |
To discover GGUF on the Hub: open Models → library GGUF and add a search term matching the family (qwen3, bge, flux2, …). Check the model card Architecture field — many repos share a name but are unrelated LMs.
Backends
Every model family targets the same standard backends: CPU, Metal, MLX, CUDA, ROCm, WGPU (gpu), Vulkan. SAM also accepts tpu. Policy lives in rlx_core::device_capabilities; runners call validate_standard_device (or validate_sam_device) at build time.
Enable GPU at compile time with matching features on rlx-models or any model crate, e.g. cargo build -p rlx-qwen3 --features all-backends or cargo run -p rlx-models --features metal --bin rlx-run -- qwen3 …. Per-crate binaries (rlx-qwen3, rlx-sam3, …) expose the same feature names. CLI: cpu, metal/mps, mlx, cuda, rocm/hip, gpu/wgpu, vulkan.
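The CLI device names and their aliases listed above map naturally onto an enum. The sketch below is illustrative only: `DeviceKind` and `parse_device` are hypothetical (the real type is `Device`, and validation goes through `validate_standard_device`), and the case-insensitive matching is an assumption.

```rust
/// The seven standard backends named in this section. Hypothetical enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceKind {
    Cpu,
    Metal,
    Mlx,
    Cuda,
    Rocm,
    Wgpu,
    Vulkan,
}

/// Map a `--device` string (including the documented aliases) to a backend.
pub fn parse_device(name: &str) -> Result<DeviceKind, String> {
    match name.trim().to_ascii_lowercase().as_str() {
        "cpu" => Ok(DeviceKind::Cpu),
        "metal" | "mps" => Ok(DeviceKind::Metal),
        "mlx" => Ok(DeviceKind::Mlx),
        "cuda" => Ok(DeviceKind::Cuda),
        "rocm" | "hip" => Ok(DeviceKind::Rocm),
        "gpu" | "wgpu" => Ok(DeviceKind::Wgpu),
        "vulkan" => Ok(DeviceKind::Vulkan),
        other => Err(format!(
            "unknown device {other:?}; expected cpu, metal|mps, mlx, cuda, rocm|hip, gpu|wgpu, or vulkan"
        )),
    }
}
```

Parsing succeeds independently of which Cargo features were enabled, so a runner still has to reject a device whose backend was not compiled in.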
Legend: ✅ supported · ⚠️ partial (host fallback or open runtime gap) · ❌ not supported
| family | cpu | metal | mlx | cuda | rocm | wgpu | vulkan | notes |
|---|---|---|---|---|---|---|---|---|
| embed (`bert`, `nomic`, `vision`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | RlxEmbed::from_dir_on; from_dir defaults to CPU |
| `dinov2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | DinoV2Runner --device |
| `sam`, `sam2`, `sam3` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | SAM v1 also accepts tpu; CPU/Metal/MLX most exercised in CI |
| `qwen3` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | packed GGUF on chosen device; MTP speculative decode not wired yet |
| `qwen35` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | --device on all backends; some ops use host GDN/dequant on GPU; MoE offload may keep experts on host |
| `llama32` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Same standard set as Qwen3-class LMs |
| `llada2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | MoE predictive expert offload on all standard backends (GPU uses resident experts + host fallback) |
| `flux2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Full pipeline; text encoder compiled on Metal/MLX by default, host once on CUDA/ROCm/WGPU/Vulkan |
| `vjepa2` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Runner --device |
| `wav2vec2-bert` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Wav2Vec2BertRunner --device |
Multi-tenant serving (paged KV, continuous batching) lives in rlx_runtime::paged_kv; qwen3::generator is single-stream.
Gotchas
- Safetensors names ≠ IR `Param` names: `weight_map.rs` renames; GGUF uses `GgufLoader`.
- GGUF LMs (`qwen3`, `qwen35`, `llama32`): pass a `.gguf` file or a directory with one `.gguf` / `model.safetensors`. Wrong-family files get a redirect (`rlx_core::assert_gguf_family`). Shared helpers: `resolve_weights_file`, `WeightFormat::resolve`, `open_loader_resolved`.
- GGUF elsewhere on HF (embed, FLUX, SAM3, …) does not imply rlx support; see the "GGUF on Hugging Face" column under Weights and parity.
- GGUF shapes are innermost-first labels; byte layout matches safetensors row-major, so do not transpose in `take`.
- Unsupported GGUF quants (Q1_0, Q2_K, IQ*, …) error cleanly.
- 27B GGUF on Mac: F32 dequant ≈ 108 GB; needs Metal `Op::DequantMatMul` to stay packed (~13.5 GB).
- Pooling lives in `embed::pooling`.
- New arch: new crate under `crates/`, facade hook, optional parity test.
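The two memory figures in the 27B item follow from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below reproduces them under the assumption of exactly 32 bits per weight for F32 and an idealized 4 bits per weight for the packed case. Real K-quant files carry per-block scales and are somewhat larger, so treat the second number as a lower bound. `weight_gb` is a hypothetical helper.

```rust
/// Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).
pub fn weight_gb(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}

/// The two cases from the list above, for a 27-billion-parameter model.
pub fn example_27b() -> (f64, f64) {
    let params = 27e9;
    let f32_gb = weight_gb(params, 32.0); // fully dequantized to F32
    let packed_gb = weight_gb(params, 4.0); // idealized 4-bit packing
    (f32_gb, packed_gb)
}
```

The 8× gap between the two results is why keeping the weights packed through the matmul decides whether such a model fits in unified memory at all.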
License
GPL-3.0-only.