rlx-llama32 0.2.1

LLaMA 3.2 for RLX

rlx-llama32

LLaMA 3.2–shaped causal LMs in RLX (runner, CLI, GGUF packed prefill).

Version 0.2.1 adds a Metal KV-decode compile guard (RLX_DISABLE_MPSGRAPH on decode). The packed GGUF helpers are re-exported via rlx_core::flow_bridge, which is also used by rlx-minicpm5, rlx-qwen3, and rlx-gemma.

CLI

cargo run -p rlx-llama32 --features tokenizer --release -- \
  --weights /path/to/model.gguf \
  --packed --device metal \
  --prompt-ids 1,42 --max-tokens 16

Packed GGUF

When building a packed prefill graph (Op::DequantMatMul), use the shared helpers from rlx_core:

  • compile_options_for_packed_gguf_prefill(device) — Llama 3.2 prefill profile
  • packed_gguf_compile_guard(device, || compile…) — Metal / MLX env overrides
  • packed_gguf_execution_device(device) — CPU fallback for MLX/wgpu/CUDA until upstream GPU parity
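The guard and fallback helpers listed above can be pictured with a small standalone sketch. Nothing here is the rlx_core implementation: the Device enum, the function names ending in _sketch, the value "1", and the choice to apply the override only on Metal are all assumptions made for illustration. Only the variable name RLX_DISABLE_MPSGRAPH and the MLX/wgpu/CUDA fallback list come from the text.

```rust
use std::env;

/// Hypothetical device enum; the real rlx_core type may look different.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Metal,
    Mlx,
    Wgpu,
    Cuda,
}

/// Stand-in for the idea behind packed_gguf_execution_device:
/// backends listed as lacking GPU parity are redirected to CPU.
pub fn execution_device_sketch(requested: Device) -> Device {
    match requested {
        Device::Mlx | Device::Wgpu | Device::Cuda => Device::Cpu,
        other => other,
    }
}

/// Stand-in for the idea behind packed_gguf_compile_guard:
/// set an env override while the closure runs, then restore the old value.
/// Assumption: the override is RLX_DISABLE_MPSGRAPH=1 and applies on Metal only.
#[allow(unused_unsafe)] // set_var/remove_var are unsafe only in the 2024 edition
pub fn compile_guard_sketch<T>(device: Device, compile: impl FnOnce() -> T) -> T {
    const KEY: &str = "RLX_DISABLE_MPSGRAPH";
    if device != Device::Metal {
        return compile();
    }
    // Remember whatever the caller had set so it can be put back afterwards.
    let previous = env::var_os(KEY);
    unsafe { env::set_var(KEY, "1") };
    let out = compile();
    unsafe {
        match previous {
            Some(value) => env::set_var(KEY, value),
            None => env::remove_var(KEY),
        }
    }
    out
}
```

A caller would resolve the device first, for example execution_device_sketch(Device::Cuda), and then wrap its compile step in compile_guard_sketch(device, || ...). This sketch does not restore the variable if the closure panics, and mutating process environment variables is not safe while other threads read them, so a real guard would need to account for both.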

See the gotchas in README.md and crates/rlx-minicpm5/README.md.

See also