Crate rlx_cpu

RLX CPU backend — executes optimized IR graphs on CPU.

Takes a fused + memory-planned IR graph and executes it using:

  • BLAS (Accelerate/MKL/OpenBLAS) for matmul
  • NEON/AVX SIMD kernels for element-wise ops
  • Persistent Rayon thread pool for parallelism
  • Arena allocator for zero per-call allocation
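The arena idea in the list above can be sketched as follows. This is a minimal illustration, not the crate's actual API: `Slot`, `Arena`, `with_plan`, `in_out`, and `relu` are hypothetical names, and the sketch assumes a memory plan that hands out non-overlapping offsets into a single `f32` buffer.

```rust
/// Hypothetical slot from a memory planner: offset and length in f32 elements.
#[derive(Clone, Copy, Debug)]
struct Slot {
    offset: usize,
    len: usize,
}

/// Hypothetical arena: one backing allocation, made once up front.
struct Arena {
    buf: Vec<f32>,
}

impl Arena {
    /// Size the single allocation so that every planned slot fits.
    fn with_plan(slots: &[Slot]) -> Self {
        let size = slots.iter().map(|s| s.offset + s.len).max().unwrap_or(0);
        Arena { buf: vec![0.0; size] }
    }

    fn get(&self, s: Slot) -> &[f32] {
        &self.buf[s.offset..s.offset + s.len]
    }

    fn get_mut(&mut self, s: Slot) -> &mut [f32] {
        &mut self.buf[s.offset..s.offset + s.len]
    }

    /// Borrow one slot for reading and a disjoint slot for writing.
    /// Panics if the two slots overlap.
    fn in_out(&mut self, input: Slot, output: Slot) -> (&[f32], &mut [f32]) {
        if input.offset + input.len <= output.offset {
            let (lo, hi) = self.buf.split_at_mut(output.offset);
            (&lo[input.offset..input.offset + input.len], &mut hi[..output.len])
        } else {
            assert!(output.offset + output.len <= input.offset, "slots overlap");
            let (lo, hi) = self.buf.split_at_mut(input.offset);
            (&hi[..input.len], &mut lo[output.offset..output.offset + output.len])
        }
    }
}

/// Stand-in element-wise kernel (scalar here; a real backend would use SIMD).
fn relu(x: &[f32], y: &mut [f32]) {
    for (o, &v) in y.iter_mut().zip(x) {
        *o = v.max(0.0);
    }
}

fn main() {
    let input = Slot { offset: 0, len: 4 };
    let output = Slot { offset: 4, len: 4 };
    let mut arena = Arena::with_plan(&[input, output]);
    let base = arena.buf.as_ptr();

    // Repeated "forward calls" reuse the same buffer; nothing is allocated here.
    for step in 0..3 {
        arena.get_mut(input).copy_from_slice(&[-1.0, 2.0, -3.0, step as f32]);
        let (x, y) = arena.in_out(input, output);
        relu(x, y);
    }

    assert_eq!(arena.get(output), &[0.0, 2.0, 0.0, 2.0]);
    // The backing allocation never moved across calls.
    assert_eq!(base, arena.buf.as_ptr());
}
```

The point of the design is that all sizing happens once, at plan time, so each call only computes slice bounds. Whether `rlx_cpu` exposes slots, raw pointers, or something else is not stated in this page.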

Modules

arena
Arena allocator — ONE allocation, zero per-call overhead.
asm_check
FileCheck-style disassembly regression tests (plan #10).
attention_bwd
Scaled dot-product attention backward (recomputes scores + softmax).
autotune
Auto-tuner — finds the optimal RuntimeConfig for a model on current hardware.
blas
Direct BLAS FFI — zero abstraction overhead.
calibrate
Activation-scale calibration for post-training INT8 quantization.
config
Runtime configuration — compile-time platform defaults + runtime hardware detection.
cost
Cost model — estimates execution time for kernel dispatch decisions.
dequant_cache
Cache dequantized GGUF weight bytes for static params.
dispatch
Dispatch table — calibration-aware kernel selection (plan #2).
executor
Graph executor — runs a fused IR graph on CPU using the arena + kernels.
gdn
Gated-DeltaNet BLAS micro-kernels (Tier C.10).
gguf_matmul
Fused GGUF K-quant dequant + matmul without materializing full F32 weights (Tier C.11).
intrinsics
ISA-split intrinsics layer (plan #85).
kernel_config
Compile-time kernel-config tables (plan #14).
kernels
SIMD kernels for fused operations.
llada2_gate
lm_head
Greedy tied-LM-head argmax without materializing full vocab logits.
moe_residency
Per-forward MoE expert residency mask (TIDE placement) for CPU dispatch.
moe_topk_capture
Capture MoE router Op::TopK outputs during CPU forward (TIDE refresh input).
naive
Naive reference implementations — for correctness testing and benchmarking.
op_registry
Per-backend (CPU) kernel registry for Op::Custom.
pool
Rayon-backed parallel for: par_for(total, grain, |off, cnt| …).
splat
CPU dispatch hooks for rlx_ir::Op::GaussianSplatRender — bodies registered from rlx-splat.
thunk
Thunks — pre-compiled kernel dispatch with zero per-call overhead.
tile
CPU TileIO impls (plans #23 + #27).
training_bwd