Crate rlx_cpu

RLX CPU backend — executes optimized IR graphs on CPU.

Takes a fused + memory-planned IR graph and executes it using:

BLAS (Accelerate/MKL/OpenBLAS) for matmul
NEON/AVX SIMD kernels for element-wise ops
Persistent Rayon thread pool for parallelism
Arena allocator for zero per-call allocation
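
The "one allocation, zero per-call allocation" idea can be illustrated with a minimal sketch. Everything named here (`Arena`, `Slot`, `plan`) is hypothetical and is not the API of the `arena` module; the sketch assumes every buffer size is known before execution and places buffers back to back. A real memory planner would presumably also reuse space between buffers whose lifetimes do not overlap.

```rust
/// Hypothetical planned buffer: offset and length, counted in f32 elements.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Slot {
    pub offset: usize,
    pub len: usize,
}

/// Assign back-to-back offsets for the requested buffer sizes.
/// Returns the slots plus the total element count the arena must hold.
pub fn plan(sizes: &[usize]) -> (Vec<Slot>, usize) {
    let mut offset = 0;
    let mut slots = Vec::with_capacity(sizes.len());
    for &len in sizes {
        slots.push(Slot { offset, len });
        offset += len;
    }
    (slots, offset)
}

/// Hypothetical arena: a single up-front allocation that is never resized.
pub struct Arena {
    buf: Vec<f32>,
}

impl Arena {
    /// The only allocation happens here, before any kernel runs.
    pub fn with_capacity(total: usize) -> Self {
        Arena { buf: vec![0.0; total] }
    }

    /// Borrow a planned buffer for reading; no allocation involved.
    pub fn slice(&self, s: Slot) -> &[f32] {
        &self.buf[s.offset..s.offset + s.len]
    }

    /// Borrow a planned buffer for writing; no allocation involved.
    pub fn slice_mut(&mut self, s: Slot) -> &mut [f32] {
        &mut self.buf[s.offset..s.offset + s.len]
    }

    /// Total number of f32 elements held by the arena.
    pub fn capacity(&self) -> usize {
        self.buf.len()
    }
}
```

In this sketch a caller would run `plan` once per graph, build the `Arena` once, and then hand each kernel the slices for its inputs and outputs on every call.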

Modules

arena: Arena allocator — ONE allocation, zero per-call overhead.
asm_check: FileCheck-style disassembly regression tests (plan #10).
attention_bwd: Scaled dot-product attention backward (recomputes scores + softmax).
autotune: Auto-tuner — finds the optimal RuntimeConfig for a model on current hardware.
blas: Direct BLAS FFI — zero abstraction overhead.
calibrate: Activation-scale calibration for post-training INT8 quantization.
config: Runtime configuration — compile-time platform defaults + runtime hardware detection.
cost: Cost model — estimates execution time for kernel dispatch decisions.
dequant_cache: Cache dequantized GGUF weight bytes for static params.
dispatch: Dispatch table — calibration-aware kernel selection (plan #2).
executor: Graph executor — runs a fused IR graph on CPU using the arena + kernels.
gdn: Gated-DeltaNet BLAS micro-kernels (Tier C.10).
gguf_matmul: Fused GGUF K-quant dequant + matmul without materializing full F32 weights (Tier C.11).
intrinsics: ISA-split intrinsics layer (plan #85).
kernel_config: Compile-time kernel-config tables (plan #14).
kernels: SIMD kernels for fused operations.
llada2_gate
lm_head: Greedy tied-LM-head argmax without materializing full vocab logits.
moe_residency: Per-forward MoE expert residency mask (TIDE placement) for CPU dispatch.
moe_topk_capture: Capture MoE router [Op::TopK] outputs during CPU forward (TIDE refresh input).
naive: Naive reference implementations — for correctness testing and benchmarking.
op_registry: Per-backend (CPU) kernel registry for Op::Custom.
pool: Rayon-backed parallel for: par_for(total, grain, |off, cnt| …).
splat: CPU dispatch hooks for rlx_ir::Op::GaussianSplatRender — bodies registered from rlx-splat.
thunk: Thunks — pre-compiled kernel dispatch with zero per-call overhead.
tile: CPU TileIO impls (plans #23 + #27).
training_bwd
umap_knn: Reference k-NN from a row-major [n, n] pairwise distance matrix.
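
As an illustration of what a reference k-NN over a row-major [n, n] pairwise distance matrix involves (the `umap_knn` entry above), here is a minimal sketch. The function name `knn_from_distances`, its signature, the choice to exclude each point from its own neighbour list, and the tie-breaking by lower index are all assumptions made for this example, not the module's documented behaviour.

```rust
/// Hypothetical reference k-NN: for each point i, return the indices of its
/// k nearest other points, read from a row-major n x n distance matrix.
pub fn knn_from_distances(dist: &[f32], n: usize, k: usize) -> Vec<Vec<usize>> {
    assert_eq!(dist.len(), n * n, "dist must be row-major [n, n]");
    // A point has at most n - 1 neighbours other than itself.
    let k_eff = k.min(n.saturating_sub(1));

    (0..n)
        .map(|i| {
            // Row i holds the distances from point i to every point.
            let row = &dist[i * n..(i + 1) * n];
            // Assumption: a point is not its own neighbour.
            let mut idx: Vec<usize> = (0..n).filter(|&j| j != i).collect();
            // Stable sort by distance, so ties keep the lower index first.
            idx.sort_by(|&a, &b| row[a].total_cmp(&row[b]));
            idx.truncate(k_eff);
            idx
        })
        .collect()
}
```

This sorts every row in full, which costs O(n² log n) overall. That is acceptable for a correctness reference, but a production kernel would more likely use a partial selection.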
