rlx-cpu
CPU backend for RLX — SIMD kernels, BLAS dispatch, persistent thread pool, arena executor.
Two execution paths share the same Thunk types:
- Direct execution (single
matchover&Thunkat line ~1280 onward inthunk.rs) — hot path, zero closure overhead. Performance-critical ops (Attention, FusedAttnBlock, matmul) live here. - Closure-based (
Box<dyn Fn(*mut u8)>per thunk; line ~780 onward inthunk.rs) — older, used for ops where dispatch overhead doesn't matter, and for some unit tests.
Keep both paths in sync when changing a Thunk variant.
Features
- NEON / AVX2 + FMA SIMD kernels for softmax, layer norm, RMS norm, GELU / SiLU / RoPE, fused matmul-bias-act.
- BLAS dispatch via Apple Accelerate (default on macOS) or OpenBLAS / MKL via Cargo features.
- LAPACK bindings (
dgesv,dpotrf,dgeqrf,dgesvd,dsyevd,dtrsm) forOp::DenseSolveand the downstreamrlx-linalgcrate. - Work-stealing thread pool with
par_for(total, grain, &|off, cnt| …)primitive. - Reverse-mode AD support: thunks for every backward op
rlx_opt::autodiffemits. - FFT —
Op::Fftfor F32 / F64 / C64 (2N real-block or native C64). Stockham radix-2 for pow-2; naive DFT for small composite N (≤16); Bluestein for other non-pow-2. Host entry shared with GPU fallbacks.
What's here
thunk.rs(2.5k LOC, the bulk) — Thunk enum + lowering fromOp+ both execution paths.executor.rs— alternate non-thunk executor used by old paths and some unit tests.kernels.rs— NEON intrinsics: softmax, layer norm, RMSNorm, matmul inner loops.blas.rs— Accelerate / MKL dispatch. SGEMM variants for different alignment regimes.naive.rs— reference scalar implementations. Used by tests for parity and as a fallback.pool.rs— work-stealing thread pool.arena.rs— buffer planning interface (the actual byte buffer comes from rlx-runtime).autotune.rs—Tick-based search overRuntimeConfig. Userlx_ir::Tickfor sub-ms timing.cost.rs/config.rs— model selection + runtime knobs (par_threshold, sdpa_seq_threshold, attn_mask_neg_inf, ...).
Cargo features
| feature | what it links |
|---|---|
blas (default) |
platform CBLAS via FFI |
blas-accelerate |
Apple Accelerate |
blas-mkl |
Intel MKL |
blas-openblas |
OpenBLAS |
With --no-default-features, a portable scalar gemm is linked instead
— slow, but useful on hosts without a system BLAS.
Install
[]
= "0.1"
Or via rlx's cpu feature.
Build / test
Gotchas
Thunk::Attentioncarriesmask_kind: MaskKind(plan #20). Custom readsmaskslice, others synthesize viaapply_synthetic_mask. Both execution paths handle this — keep them in sync.RuntimeConfig::global()is read once per thunk closure. If you need per-call config, pass it through the thunk fields, not via global.cfg.sdpa_seq_thresholdcontrols the NEON-vs-BLAS attention crossover. The NEON path skips dispatch for batch=1 / short seq.- Thunk-level fusion runs after compile_thunks (line ~990) — it rewrites Q/K/V → Narrow×3 → [Rope×2] → Attention → out_proj sequences into a single FusedAttnBlock. Fragile pattern matching.
License
GPL-3.0-only.