rlx-minicpm5

MiniCPM5 edge LMs in RLX. The 1B checkpoint is Llama-shaped (`LlamaForCausalLM`, `general.architecture = llama` in GGUF). This crate validates HF/GGUF metadata and delegates inference to `rlx-llama32`.

Crate version: 0.2.1 (depends on rlx-llama32 0.2.1). Publish via scripts/publish.sh tier 4 before the facade crate rlx-models 0.2.1.

Prerequisites

From the repo root (rlx-models/):

# Optional but recommended
brew install just

# Build the CLI (tokenizer feature required for decode)
cargo build -p rlx-minicpm5 --features tokenizer --release

GPU backends: add feature flags (`metal`, `mlx`, `cuda`, `all-backends`, …) on `rlx-minicpm5`, or use `just features=all-backends minicpm5 -- …`.

Download weights

Safetensors (~2.1 GB, reference parity path):

just fetch-minicpm5
# → /tmp/rlx-weights/MiniCPM5-1B/  (override with MINICPM5_MODEL_DIR=…)

GGUF (openbmb/MiniCPM5-1B-GGUF):

just fetch-minicpm5-gguf Q4_K_M    # or Q8_0, F16, or `all`
# → /tmp/rlx-weights/MiniCPM5-1B-GGUF/

Requires the `hf-download` feature (the `just fetch-*` recipes enable it via the facade example).
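The download locations above can be resolved in code before a run. The following is a minimal sketch, not part of the crate: `model_dir` and `gguf_path` are hypothetical helpers, treating an empty `MINICPM5_MODEL_DIR` as unset is an assumption, and the `MiniCPM5-1B-{quant}.gguf` naming is extrapolated from the `Q4_K_M` file name used in this README.

```rust
use std::env;
use std::path::PathBuf;

// Default locations taken from the fetch commands in this README.
const DEFAULT_MODEL_DIR: &str = "/tmp/rlx-weights/MiniCPM5-1B";
const DEFAULT_GGUF_DIR: &str = "/tmp/rlx-weights/MiniCPM5-1B-GGUF";

/// Hypothetical helper: use the override when it is set and non-empty,
/// otherwise fall back to the default safetensors directory.
fn model_dir(override_dir: Option<&str>) -> PathBuf {
    match override_dir {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(DEFAULT_MODEL_DIR),
    }
}

/// Hypothetical helper: build the GGUF path for a quantization name.
/// The file naming pattern is an assumption based on the Q4_K_M example.
fn gguf_path(quant: &str) -> PathBuf {
    PathBuf::from(DEFAULT_GGUF_DIR).join(format!("MiniCPM5-1B-{quant}.gguf"))
}

fn main() {
    let override_dir = env::var("MINICPM5_MODEL_DIR").ok();
    let dir = model_dir(override_dir.as_deref());
    println!("safetensors dir: {}", dir.display());
    println!("tokenizer:       {}", dir.join("tokenizer.json").display());

    let gguf = gguf_path("Q4_K_M");
    println!("gguf:            {} (exists: {})", gguf.display(), gguf.exists());
}
```

Checking `exists()` up front gives a clearer message than a failed load when the fetch step was skipped.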

CLI (`rlx-minicpm5`)

Takes the same flags as `rlx-llama32`: this binary runs weight-kind checks, then calls `rlx_llama32::cli::run`.
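To illustrate what a weight-kind check can look like, here is a self-contained sketch that tells GGUF from safetensors by the first bytes of the file. It is not the crate's implementation: `WeightKind` and `sniff_weight_kind` are hypothetical names, and the real checks (which also validate metadata) may work differently. It relies only on the published container layouts: GGUF starts with the ASCII magic `GGUF`, and safetensors starts with a little-endian `u64` header length followed by a JSON object.

```rust
use std::fs::File;
use std::io::Read;

/// Container formats this sketch can tell apart (hypothetical type, not from rlx).
#[derive(Debug, PartialEq, Eq)]
enum WeightKind {
    Gguf,
    Safetensors,
    Unknown,
}

/// Classify a weights file from its first bytes.
fn sniff_weight_kind(head: &[u8]) -> WeightKind {
    // GGUF files start with the 4-byte ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return WeightKind::Gguf;
    }
    // Safetensors files start with a little-endian u64 header length,
    // followed by a JSON object, so byte 8 is expected to be '{'.
    if head.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        if u64::from_le_bytes(len_bytes) > 0 && head[8] == b'{' {
            return WeightKind::Safetensors;
        }
    }
    WeightKind::Unknown
}

fn main() -> std::io::Result<()> {
    // Usage: pass a weights file path as the first argument.
    if let Some(path) = std::env::args().nth(1) {
        let mut head = [0u8; 16];
        let n = File::open(&path)?.read(&mut head)?;
        println!("{path}: {:?}", sniff_weight_kind(&head[..n]));
    } else {
        eprintln!("usage: sniff <weights-file>");
    }
    Ok(())
}
```

Sniffing bytes rather than trusting the file extension catches renamed or truncated downloads early.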

Prefill / logits (prompt token ids)

WEIGHTS=/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors

just minicpm5 -- \
  --weights "$WEIGHTS" \
  --device cpu \
  --prompt-ids 1,42,314 \
  --max-tokens 0

# Or per-crate binary:
cargo run -p rlx-minicpm5 --features tokenizer --release -- \
  --weights "$WEIGHTS" \
  --device cpu \
  --prompt-ids 1,42,314 \
  --max-tokens 16 \
  --max-seq 512
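The `--prompt-ids` value is a comma-separated list of token ids. A minimal sketch of turning such a string into the `u32` ids used by the library API follows; `parse_prompt_ids` is a hypothetical helper, and the CLI's own parsing (for example its handling of whitespace) may differ.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse "1,42,314" into token ids.
/// Surrounding whitespace is trimmed; empty or non-numeric entries are errors.
fn parse_prompt_ids(arg: &str) -> Result<Vec<u32>, ParseIntError> {
    arg.split(',')
        .map(|piece| piece.trim().parse::<u32>())
        .collect()
}

fn main() {
    match parse_prompt_ids("1,42,314") {
        Ok(ids) => println!("prompt ids: {ids:?}"),
        Err(err) => eprintln!("invalid --prompt-ids value: {err}"),
    }
}
```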

Greedy decode with HF tokenizer file

just minicpm5 -- \
  --weights "$WEIGHTS" \
  --tokenizer /tmp/rlx-weights/MiniCPM5-1B/tokenizer.json \
  --device cpu \
  --prompt "What is 2+2? Answer briefly." \
  --max-tokens 32 \
  --no-stream

Decoded text is printed after `response:` when `--tokenizer` is set.

GGUF packed prefill (Q4_K_M / Q8_0 / F16)

GGUF=/tmp/rlx-weights/MiniCPM5-1B-GGUF/MiniCPM5-1B-Q4_K_M.gguf

just minicpm5 -- \
  --weights "$GGUF" \
  --packed \
  --device cpu \
  --prompt-ids 1,42,314 \
  --max-tokens 8

Packed graphs use `Op::DequantMatMul`. CPU and Metal run natively on the requested device (Metal uses an MPSGraph workaround). On MLX / wgpu / CUDA / ROCm, packed prefill currently executes on CPU for logits parity until upstream rlx 0.2.2 fixes land (see `rlx_core::flow_bridge::packed_gguf_execution_device`). Decode still uses the F32 generator on your `--device`.
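The device routing described above can be summarized as a small mapping. This is an illustrative sketch only: the `Device` enum and `packed_prefill_device` function here are hypothetical stand-ins, not the real `rlx_runtime::Device` or `rlx_core::flow_bridge::packed_gguf_execution_device`, whose signatures are not shown in this README.

```rust
/// Hypothetical device list mirroring the backends named in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Metal,
    Mlx,
    Wgpu,
    Cuda,
    Rocm,
}

/// Where packed GGUF prefill runs for a requested device, per the text above:
/// CPU and Metal run natively, the other backends fall back to CPU for now.
fn packed_prefill_device(requested: Device) -> Device {
    match requested {
        Device::Cpu | Device::Metal => requested,
        Device::Mlx | Device::Wgpu | Device::Cuda | Device::Rocm => Device::Cpu,
    }
}

fn main() {
    let all = [
        Device::Cpu,
        Device::Metal,
        Device::Mlx,
        Device::Wgpu,
        Device::Cuda,
        Device::Rocm,
    ];
    for d in all {
        // Decode stays on the requested device in every case.
        println!("{d:?}: packed prefill on {:?}, decode on {d:?}", packed_prefill_device(d));
    }
}
```

In practice this means `--packed --device cuda` can show CPU-bound prefill timings even though decode runs on the GPU.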

Useful flags

| Flag | Purpose |
| --- | --- |
| `--device cpu\|metal\|mlx\|cuda\|…` | Execution device |
| `--packed` | GGUF K-quant / F16 packed matmul path |
| `--max-seq N` | Compile / KV cap (default 512; lower for faster compile on long contexts) |
| `--max-tokens N` | Greedy decode steps after prefill |
| `--no-bucketed-decode` | One-shot decode graphs (try on MLX if output diverges) |
| `--no-stream` | Print full `response:` once at end |
| `--temperature`, `--top-p` | Sampling (default greedy) |
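For readers unfamiliar with the sampling flags, the sketch below shows the generic technique they are named after: temperature-scaled softmax followed by nucleus (top-p) filtering. It is a textbook formulation, not the code behind `--temperature` and `--top-p`; the function names are hypothetical, and the final random draw from the kept candidates is omitted to keep the example deterministic.

```rust
/// Softmax over logits divided by a temperature (must be > 0).
/// Lower temperatures sharpen the distribution toward the top token.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Keep the smallest set of highest-probability tokens whose cumulative
/// probability reaches `top_p`, then renormalize so the kept set sums to 1.
fn top_p_candidates(probs: &[f32], top_p: f32) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    // Sort by probability, highest first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));

    let mut kept = Vec::new();
    let mut cumulative = 0.0f32;
    for (idx, p) in indexed {
        kept.push((idx, p));
        cumulative += p;
        if cumulative >= top_p {
            break;
        }
    }
    let total: f32 = kept.iter().map(|(_, p)| *p).sum();
    kept.into_iter().map(|(idx, p)| (idx, p / total)).collect()
}

fn main() {
    let logits = [2.0f32, 1.0, 0.5, -1.0];
    let probs = softmax_with_temperature(&logits, 0.8);
    for (token, p) in top_p_candidates(&probs, 0.9) {
        println!("token {token}: p = {p:.3}");
    }
}
```

A sampler would then draw one token id from the renormalized candidates; with greedy decoding (the default) the highest-probability token is taken directly.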

Chat (HF chat template)

Applies the official MiniCPM5 chat template in Python, then calls rlx-minicpm5:

pip install transformers
just fetch-minicpm5
just minicpm5-chat "What is the capital of France? Reply in one short sentence."

Override paths / device:

MINICPM5_MODEL_DIR=/path/to/MiniCPM5-1B \
RLX_MINICPM5_DEVICE=cpu \
  just minicpm5-chat "Hello" --max-tokens 64

Script: crates/rlx-models/examples/minicpm5_chat.py.

Library API

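use rlx_minicpm5::MiniCpm5Runner;
use rlx_runtime::Device;

// Safetensors file fetched by `just fetch-minicpm5`.
let weights = "/tmp/rlx-weights/MiniCPM5-1B/model-00000-of-00001.safetensors";
// Configure and build the runner; `?` needs an enclosing function that returns a Result.
let mut runner = MiniCpm5Runner::builder()
    .weights(weights)
    .device(Device::Cpu)
    .max_seq(512)
    .build()?;

// Prompt as raw token ids, as with `--prompt-ids 1,42,314`.
let prompt = [1u32, 42, 314];
// Prefill: logits for the prompt.
let logits = runner.predict_logits(&prompt)?;
// Decode with a token budget of 16; the closure prints each token id it receives.
let generated = runner.generate(&prompt, 16, |tok| eprint!(" {tok}"))?;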

Facade path: rlx_models::minicpm5::….

Full programmatic example: `cargo run -p rlx-models --example run_minicpm5 --release` (set `RLX_MINICPM5_WEIGHTS`).
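Greedy decoding picks the highest-scoring token from the logits at each step. The sketch below shows that selection on a plain slice. It assumes the logits for the last position are available as `&[f32]` over the vocabulary, which is an assumption about the shape of what `predict_logits` returns; `greedy_next_token` is a hypothetical helper, not part of the crate.

```rust
/// Hypothetical helper: index of the largest logit, or None for an empty slice.
/// `total_cmp` gives a total order over f32, so NaN values cannot cause a panic
/// (a positive NaN would sort above every finite value).
fn greedy_next_token(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(idx, _)| idx as u32)
}

fn main() {
    // Toy vocabulary of four tokens; token 2 has the highest score.
    let last_position_logits = [0.1f32, -0.7, 3.2, 1.5];
    match greedy_next_token(&last_position_logits) {
        Some(tok) => println!("next token id: {tok}"),
        None => println!("no logits to choose from"),
    }
}
```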

Examples and tests (facade `rlx-models`)

| Command / file | What it does |
| --- | --- |
| `just fetch-minicpm5` | `examples/minicpm5_download.rs` |
| `just fetch-minicpm5-gguf Q4_K_M` | `examples/minicpm5_gguf_download.rs` |
| `just bench-minicpm5-real --device cpu` | `examples/minicpm5_forward_bench.rs` |
| `just test-minicpm5-parity-full` | RLX vs PyTorch (`minicpm5_parity`) |
| `just test-minicpm5-backends-all` | Synthetic graph, all backends |
| `just test-minicpm5-gguf-backends` | Real Q4_K_M packed prefill |
| `cargo run -p rlx-models --example run_minicpm5 --release` | Builder API walk-through |

Weights on Hugging Face

| Artifact | Hub |
| --- | --- |
| Safetensors 1B | openbmb/MiniCPM5-1B |
| GGUF Q4_K_M / Q8_0 / F16 | openbmb/MiniCPM5-1B-GGUF |

Remote CUDA rig (Windows + WSL)

From a Mac/Linux dev machine with SSH access to the rig (`rig.sh` lives in a sibling `rlx/` checkout):

cd ../rlx
./rig.sh sync
./rig.sh sync-minicpm5-gguf    # Q4_K_M from local HF cache or MINICPM5_GGUF_SRC
./rig.sh test-minicpm5         # CPU + CUDA + WGPU (synthetic + real GGUF)
./rig.sh --both test-minicpm5  # Windows MSVC, then WSL Ubuntu
./rig.sh --wsl test-minicpm5   # WSL only (CUDA needs libcublas in WSL)

Set `MINICPM5_GGUF_SRC`, or run `just fetch-minicpm5-gguf Q4_K_M` locally, before `sync-minicpm5-gguf`. WSL builds use `~/rlx-workspace-mirror/rlx-models` (ext4); the GGUF lands under `models/MiniCPM5-1B-Q4_K_M.gguf`.
