rlx-minicpm5
MiniCPM5 edge LMs in RLX. The 1B checkpoint is Llama-shaped (LlamaForCausalLM, general.architecture = llama in GGUF). This crate validates HF/GGUF metadata and delegates inference to rlx-llama32.
Crate version: 0.2.1 (depends on rlx-llama32 0.2.1). Publish via scripts/publish.sh tier 4 before facade rlx-models 0.2.1.
Prerequisites
From the repo root (rlx-models/):
# Optional but recommended
# Build the CLI (tokenizer feature required for decode)
GPU backends: enable feature flags (metal, mlx, cuda, all-backends, …) on rlx-minicpm5, or pass them through the just recipe: just features=all-backends minicpm5 -- ….
Download weights
Safetensors (~2.1 GB, reference parity path):
# → /tmp/rlx-weights/MiniCPM5-1B/ (override with MINICPM5_MODEL_DIR=…)
GGUF (openbmb/MiniCPM5-1B-GGUF):
# → /tmp/rlx-weights/MiniCPM5-1B-GGUF/
Downloading requires the hf-download feature; the just fetch-* recipes enable it via the facade example.
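The default download location and its MINICPM5_MODEL_DIR override can be pictured with a small resolver. This is a minimal sketch, not the crate's code: resolve_dir and model_dir are hypothetical helpers, and treating an empty variable as unset is an assumption.

```rust
use std::env;
use std::path::PathBuf;

/// Default safetensors directory named in this README.
const DEFAULT_MODEL_DIR: &str = "/tmp/rlx-weights/MiniCPM5-1B";

/// Hypothetical helper: prefer a non-empty override, else fall back to the default.
fn resolve_dir(override_value: Option<String>, default_dir: &str) -> PathBuf {
    match override_value {
        // Assumption: an empty or whitespace-only value counts as "not set".
        Some(v) if !v.trim().is_empty() => PathBuf::from(v),
        _ => PathBuf::from(default_dir),
    }
}

/// Hypothetical helper: read MINICPM5_MODEL_DIR from the environment.
fn model_dir() -> PathBuf {
    resolve_dir(env::var("MINICPM5_MODEL_DIR").ok(), DEFAULT_MODEL_DIR)
}
```

Keeping the environment read in model_dir and the decision in resolve_dir makes the fallback logic testable without mutating process-wide state.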
CLI (rlx-minicpm5)
Same flags as rlx-llama32 (this binary wraps rlx_llama32::cli::run after weight-kind checks).
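A weight-kind check of this sort could look roughly like the following. It is an illustrative sketch only: WeightKind, weight_kind, and is_llama_shaped are hypothetical names, and classifying by file extension is an assumption about how the real check works. The only fact taken from this README is that GGUF files must report general.architecture = llama.

```rust
use std::path::Path;

/// Hypothetical weight kinds; the real crate may model this differently.
#[derive(Debug, PartialEq, Eq)]
enum WeightKind {
    Safetensors,
    Gguf,
}

/// Classify a weights file by its extension (assumption: the extension is reliable).
fn weight_kind(path: &Path) -> Option<WeightKind> {
    match path.extension()?.to_str()?.to_ascii_lowercase().as_str() {
        "safetensors" => Some(WeightKind::Safetensors),
        "gguf" => Some(WeightKind::Gguf),
        _ => None,
    }
}

/// Accept only Llama-shaped GGUF metadata (general.architecture = llama).
fn is_llama_shaped(general_architecture: &str) -> bool {
    general_architecture == "llama"
}
```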
Prefill / logits (prompt token ids)
WEIGHTS=/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors
# Or per-crate binary:
Greedy decode with HF tokenizer file
When --tokenizer is set, the decoded text is printed after the response: label.
GGUF packed prefill (Q4_K_M / Q8_0 / F16)
GGUF=/tmp/rlx-weights/MiniCPM5-1B-GGUF/MiniCPM5-1B-Q4_K_M.gguf
Packed graphs use Op::DequantMatMul. CPU and Metal run natively on the requested device (Metal uses an MPSGraph workaround). MLX / wgpu / CUDA / ROCm packed prefill currently executes on CPU for logits parity until upstream rlx 0.2.2 fixes land (see rlx_core::flow_bridge::packed_gguf_execution_device). Decode still uses the F32 generator on your --device.
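The device routing described above can be summarized as a small mapping. This sketch is not the body of rlx_core::flow_bridge::packed_gguf_execution_device; the Backend enum and both functions are hypothetical and only restate the behaviour in the paragraph.

```rust
/// Hypothetical backend list mirroring the devices named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Metal,
    Mlx,
    Wgpu,
    Cuda,
    Rocm,
}

/// Where packed GGUF prefill runs for a requested device, per the paragraph above.
fn packed_prefill_device(requested: Backend) -> Backend {
    match requested {
        // CPU and Metal execute packed graphs natively.
        Backend::Cpu | Backend::Metal => requested,
        // The remaining backends currently fall back to CPU for logits parity.
        Backend::Mlx | Backend::Wgpu | Backend::Cuda | Backend::Rocm => Backend::Cpu,
    }
}

/// Decode keeps using the F32 generator on the requested device.
fn decode_device(requested: Backend) -> Backend {
    requested
}
```

In practice this means a run with --packed --device cuda would prefill on CPU but decode on CUDA, so prefill timings on those backends reflect CPU speed.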
Useful flags
| Flag | Purpose |
|---|---|
| --device cpu\|metal\|mlx\|cuda\|… | Execution device |
| --packed | GGUF K-quant / F16 packed matmul path |
| --max-seq N | Compile / KV cap (default 512; lower for faster compile on long contexts) |
| --max-tokens N | Greedy decode steps after prefill |
| --no-bucketed-decode | One-shot decode graphs (try on MLX if output diverges) |
| --no-stream | Print full response: once at end |
| --temperature, --top-p | Sampling (default greedy) |
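For readers unfamiliar with the difference between the default greedy decode and --temperature / --top-p, the sketch below shows the textbook versions of both. It is not the crate's sampler: greedy and sample_top_p are hypothetical functions, and the uniform draw u is passed in explicitly so the example needs no random-number crate.

```rust
/// Greedy choice: index of the largest logit (None for empty input).
fn greedy(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Textbook temperature + nucleus (top-p) sampling.
/// `u` is a uniform draw in [0, 1) supplied by the caller.
fn sample_top_p(logits: &[f32], temperature: f32, top_p: f32, u: f32) -> Option<usize> {
    if logits.is_empty() {
        return None;
    }
    // Assumption: a non-positive temperature means "use greedy".
    if temperature <= 0.0 {
        return greedy(logits);
    }
    // Softmax over temperature-scaled logits, shifted by the max for stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, ((l - max) / temperature).exp()))
        .collect();
    let total: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= total;
    }
    // Sort by descending probability and keep the smallest prefix with mass >= top_p.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut kept = 0;
    let mut mass = 0.0;
    for p in &probs {
        kept += 1;
        mass += p.1;
        if mass >= top_p {
            break;
        }
    }
    let nucleus = &probs[..kept];
    // Walk the nucleus until the scaled draw falls inside a token's share.
    let mut threshold = u * mass;
    for &(i, p) in nucleus {
        if threshold < p {
            return Some(i);
        }
        threshold -= p;
    }
    nucleus.last().map(|&(i, _)| i)
}
```

Lower temperatures sharpen the distribution toward the greedy choice, while a smaller top-p trims the low-probability tail before sampling.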
Chat (HF chat template)
Applies the official MiniCPM5 chat template in Python, then calls rlx-minicpm5:
Override paths / device:
MINICPM5_MODEL_DIR=/path/to/MiniCPM5-1B \
RLX_MINICPM5_DEVICE=cpu \
Script: crates/rlx-models/examples/minicpm5_chat.py.
Library API
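// Module paths and call arguments were lost from this snippet.
// Everything marked "assumed" is a guess; check the crate docs for exact signatures.
use rlx_minicpm5::MiniCpm5Runner; // assumed crate path
use rlx_core::Device; // assumed location of Device
let weights = "/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors";
let mut runner = MiniCpm5Runner::builder()
    .weights(weights)
    .device(Device::Cpu) // assumed variant name
    .max_seq(512) // assumed value; 512 is the CLI default
    .build()?;
let prompt = /* prompt token ids */;
let logits = runner.predict_logits(&prompt)?; // assumed argument
let generated = runner.generate(&prompt, /* max new tokens */)?; // assumed arguments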
Facade path: rlx_models::minicpm5::….
Full programmatic example: cargo run -p rlx-models --example run_minicpm5 --release (set RLX_MINICPM5_WEIGHTS).
Examples and tests (facade rlx-models)
| Command / file | What it does |
|---|---|
| just fetch-minicpm5 | examples/minicpm5_download.rs |
| just fetch-minicpm5-gguf Q4_K_M | examples/minicpm5_gguf_download.rs |
| just bench-minicpm5-real --device cpu | examples/minicpm5_forward_bench.rs |
| just test-minicpm5-parity-full | RLX vs PyTorch (minicpm5_parity) |
| just test-minicpm5-backends-all | Synthetic graph, all backends |
| just test-minicpm5-gguf-backends | Real Q4_K_M packed prefill |
| cargo run -p rlx-models --example run_minicpm5 --release | Builder API walk-through |
Weights on Hugging Face
| Artifact | Hub |
|---|---|
| Safetensors 1B | openbmb/MiniCPM5-1B |
| GGUF Q4_K_M / Q8_0 / F16 | openbmb/MiniCPM5-1B-GGUF |
Remote CUDA rig (Windows + WSL)
From a Mac/Linux dev machine with SSH to the rig (rig.sh in a sibling rlx/ checkout):
Set MINICPM5_GGUF_SRC or run just fetch-minicpm5-gguf Q4_K_M locally before sync-minicpm5-gguf. WSL builds use ~/rlx-workspace-mirror/rlx-models (ext4); GGUF lands under models/MiniCPM5-1B-Q4_K_M.gguf.