Cross-backend op-diff harness — PLAYBOOK § 3 L1.
Runs the same op on CPU (reference) and on each available accelerator
(Metal / CUDA), then reports the NMSE (normalized mean-squared
error) of each accelerator’s output relative to the CPU output. The comparison is
modelled on llama.cpp’s tests/test-backend-ops.cpp, which uses
NMSE = mse(a,b) / mse(a,0), where a is the reference and b the output under
test, rather than a naive max-abs-diff. NMSE is invariant to the
magnitude of the reference output, so a well-tuned kernel sits
at the same NMSE regardless of input scaling.
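The formula above can be sketched in a few lines of standalone Rust. This is a minimal illustration, not the crate's `nmse` function: the names `mse` and `nmse_sketch`, the `f64` accumulation, and the slice-based signatures are all assumptions made for the example. It also shows the scale-invariance claim by multiplying both inputs by 1024 and getting the same NMSE back.

```rust
/// Mean-squared error between two equal-length slices, accumulated in f64.
/// (Hypothetical helper; the crate's own implementation may differ.)
fn mse(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    let sum: f64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as f64 - y as f64;
            d * d
        })
        .sum();
    sum / a.len() as f64
}

/// NMSE = mse(reference, candidate) / mse(reference, 0).
/// An all-zero (or empty) reference makes the denominator zero, so the
/// result is then NaN or infinite; callers need to handle that case.
fn nmse_sketch(reference: &[f32], candidate: &[f32]) -> f64 {
    let zeros = vec![0.0f32; reference.len()];
    mse(reference, candidate) / mse(reference, &zeros)
}

fn main() {
    let reference = [1.0f32, -2.0, 3.0, -4.0];
    // Candidate carrying roughly a 0.1% relative error on every element.
    let candidate: Vec<f32> = reference.iter().map(|x| x * 1.001).collect();
    let small = nmse_sketch(&reference, &candidate);

    // Scale both vectors by 1024; the NMSE is unchanged.
    let ref_big: Vec<f32> = reference.iter().map(|x| x * 1024.0).collect();
    let cand_big: Vec<f32> = candidate.iter().map(|x| x * 1024.0).collect();
    let big = nmse_sketch(&ref_big, &cand_big);

    println!("nmse = {small:e}, scaled nmse = {big:e}");
}
```

A max-abs-diff check on the same data would report an error 1024 times larger for the scaled inputs, which is why a single absolute threshold does not transfer across ops with different output magnitudes.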
§Usage
use ferrum_testkit::op_diff::{compare_backends, NMSE_FP16_TOL, rms_norm::RmsNormOp};
let report = compare_backends(&RmsNormOp { tokens: 4, dim: 4096, eps: 1e-6 }, 42);
if let Some(nmse) = report.metal_nmse {
    assert!(nmse < NMSE_FP16_TOL, "metal rms_norm NMSE {nmse} exceeds fp16 tol");
}
§Tolerance buckets (PLAYBOOK § 3.1)
- NMSE_FP32_TOL = 1e-7: fp32 kernels must agree with CPU below this.
- NMSE_FP16_TOL = 1e-6: fp16 kernels (Metal default storage).
Tighter bucketing per op is welcome — define op-specific constants once empirical baselines are stable.
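As a rough illustration of how the two buckets might be applied, the sketch below selects a tolerance by precision and checks an optional per-backend NMSE, mirroring the `Option` handling in the usage example. The `Precision` enum and the `tolerance_for` and `check_backend` helpers are hypothetical and not part of the crate; the constants are redeclared locally with the values listed above, and their `f64` type is an assumption.

```rust
// Bucket values copied from the list above; the f64 type is an assumption.
const NMSE_FP32_TOL: f64 = 1e-7;
const NMSE_FP16_TOL: f64 = 1e-6;

/// Hypothetical precision tag used only for this example.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Fp32,
    Fp16,
}

/// Pick the tolerance bucket for a kernel's storage precision.
fn tolerance_for(precision: Precision) -> f64 {
    match precision {
        Precision::Fp32 => NMSE_FP32_TOL,
        Precision::Fp16 => NMSE_FP16_TOL,
    }
}

/// Check one backend's NMSE against a tolerance.
/// `None` means the backend was not available in this build and is skipped.
/// A NaN fails the check, because `NaN < tol` is false.
fn check_backend(label: &str, nmse: Option<f64>, tol: f64) -> Result<(), String> {
    match nmse {
        None => Ok(()),
        Some(value) if value < tol => Ok(()),
        Some(value) => Err(format!("{label} NMSE {value:e} exceeds tolerance {tol:e}")),
    }
}

fn main() {
    let fp16 = tolerance_for(Precision::Fp16);
    let fp32 = tolerance_for(Precision::Fp32);
    println!("{:?}", check_backend("metal", Some(3.2e-7), fp16));
    println!("{:?}", check_backend("cuda", None, fp16));
    println!("{:?}", check_backend("cuda", Some(3.2e-7), fp32));
}
```

Returning a `Result` instead of asserting directly lets a caller collect failures across several backends before reporting them; whether the crate does this is not stated here.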
Modules§
- activation_bridge - activation_bridge op-diff harness; see crate::op_diff.
- argmax_rows - argmax_rows_f16 op-diff harness; see crate::op_diff.
- embedding_lookup - embedding_lookup op-diff harness; see crate::op_diff.
- flash_attention - flash_attention op-diff harness; see crate::op_diff.
- fused_add_rms_norm - fused_add_rms_norm op-diff harness; see crate::op_diff.
- gemm - gemm op-diff harness: covers the basic fp16 matmul that backs qkv_proj, o_proj, gate_up_proj, down_proj, and the lm_head projection. Per nsys profile on Vast 4090 / M3, Marlin<256,...> Marlin matmul accounts for ~55% of GPU time at c=16; this op-diff validates the non-quantized fallback path against CPU.
- kv_cache_append - kv_cache_append_head_major op-diff harness; see crate::op_diff.
- marlin_matmul - marlin_matmul (GPTQ INT4) op-diff harness. PARTIAL: planning stub.
- paged_varlen_attn - paged_varlen_attention op-diff harness. PARTIAL: planning stub.
- qk_norm_rope - qk_norm_rope op-diff harness: covers the fused rms_norm + rotary-position-embedding + head-major transpose used in all transformer attention layers (split_qkv_norm_rope_into_paged_cache_f16 in nsys traces, a top-10 kernel on M3).
- residual_add - residual_add (add_inplace) op-diff harness; see crate::op_diff.
- rms_norm - rms_norm op-diff harness; see crate::op_diff for the framework.
- silu_mul - fused_silu_mul_split op-diff harness.
- split_qkv - split_qkv op-diff harness; see crate::op_diff.
- transpose_head_to_token - transpose_head_to_token op-diff harness; see crate::op_diff.
Structs§
- NmseReport - Cross-backend comparison result.
Constants§
- NMSE_FP16_TOL - fp16 storage / Metal accumulation; slightly larger tol.
- NMSE_FP32_TOL - fp32 kernels; should agree with CPU below this.
Traits§
- OpUnderTest - A single op-under-test: knows how to run itself on each backend.
Functions§
- compare_backends - Run op on every backend the current build supports and assemble the comparison report.
- nmse - Normalized mean-squared error.
- random_vec - Convenience: deterministic uniform-random Vec<f32> in [lo, hi).
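To make "deterministic uniform-random in [lo, hi)" concrete, here is a dependency-free sketch of a seeded generator. It is not the crate's `random_vec`: the function name `random_vec_sketch`, its parameter order, and the choice of SplitMix64 as the generator are assumptions made for illustration, and the crate may use a different generator and signature.

```rust
/// One step of the SplitMix64 generator (swapped in for illustration;
/// the crate's actual generator is not documented here).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical stand-in for `random_vec`: `len` values in [lo, hi),
/// fully determined by `seed`.
fn random_vec_sketch(len: usize, lo: f32, hi: f32, seed: u64) -> Vec<f32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            // The top 24 bits give a value in [0, 1) that f32 represents exactly.
            let unit = (splitmix64(&mut state) >> 40) as f32 / (1u32 << 24) as f32;
            let v = lo + (hi - lo) * unit;
            // Guard against floating-point rounding landing exactly on `hi`.
            if v < hi { v } else { lo }
        })
        .collect()
}

fn main() {
    let a = random_vec_sketch(4, -1.0, 1.0, 42);
    let b = random_vec_sketch(4, -1.0, 1.0, 42);
    // The same seed yields the same inputs on every run and every backend.
    println!("{a:?}");
    println!("identical: {}", a == b);
}
```

Seeding matters for an op-diff harness because every backend must see the same input values; otherwise the NMSE would measure input differences rather than kernel differences.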
Type Aliases§
- Output - Output of a single op invocation. Each backend produces its own Vec<f32> after to_vec()-ing its buffer to host.