Module op_diff

Cross-backend op-diff harness — PLAYBOOK § 3 L1.

Runs the same op on CPU (the reference) and on each available accelerator (Metal / CUDA), then reports the NMSE (normalized mean-squared error) of each accelerator’s output relative to the CPU output. The metric follows llama.cpp’s tests/test-backend-ops.cpp, which compares NMSE = mse(a, b) / mse(a, 0) rather than a naive max-abs-diff. NMSE is invariant to the magnitude of the reference output, so a kernel with a given relative error reports the same NMSE regardless of input scaling.
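The formula above can be illustrated with a minimal sketch. The names `mse` and `nmse_sketch` are hypothetical stand-ins and are not the crate's actual `nmse` signature. Accumulating in f64 and falling back to the raw MSE for an all-zero reference are assumptions made for this example.

```rust
/// Mean-squared error between two equal-length slices, accumulated in f64.
fn mse(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "length mismatch");
    if a.is_empty() {
        return 0.0;
    }
    let sum: f64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as f64 - y as f64;
            d * d
        })
        .sum();
    sum / a.len() as f64
}

/// NMSE = mse(reference, candidate) / mse(reference, 0).
/// Hypothetical helper; the real `nmse` may differ in signature and edge cases.
fn nmse_sketch(reference: &[f32], candidate: &[f32]) -> f64 {
    let zeros = vec![0.0f32; reference.len()];
    let numerator = mse(reference, candidate);
    let denominator = mse(reference, &zeros);
    if denominator == 0.0 {
        // Assumption: for an all-zero reference, report the raw MSE
        // instead of dividing by zero.
        return numerator;
    }
    numerator / denominator
}
```

With a reference of `[1.0, 2.0, 3.0, 4.0]` and a candidate that is 1% larger in every element, this returns roughly 1e-4. Multiplying both slices by 1024 leaves the result unchanged, whereas the max-abs-diff would grow by the same factor of 1024.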

§Usage

use ferrum_testkit::op_diff::{compare_backends, NMSE_FP16_TOL, rms_norm::RmsNormOp};

let report = compare_backends(&RmsNormOp { tokens: 4, dim: 4096, eps: 1e-6 }, 42);
if let Some(nmse) = report.metal_nmse {
    assert!(nmse < NMSE_FP16_TOL, "metal rms_norm NMSE {nmse} exceeds fp16 tol");
}

§Tolerance buckets (PLAYBOOK § 3.1)

  • NMSE_FP32_TOL = 1e-7 — fp32 kernels must agree with CPU below this.
  • NMSE_FP16_TOL = 1e-6 — fp16 kernels (Metal default storage).

Tighter bucketing per op is welcome — define op-specific constants once empirical baselines are stable.
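One way an op-specific bucket could sit next to the two shared ones is sketched below. The constants are redeclared locally as f64 (their real type is not shown here), `NMSE_RMS_NORM_FP16_TOL` is a hypothetical name with a placeholder value rather than a measured baseline, and `check_nmse` is an illustrative helper that is not part of the crate.

```rust
// Local stand-ins for the shared buckets; the f64 type is an assumption.
const NMSE_FP32_TOL: f64 = 1e-7;
const NMSE_FP16_TOL: f64 = 1e-6;

// Hypothetical op-specific bucket. The value is a placeholder only.
const NMSE_RMS_NORM_FP16_TOL: f64 = 5e-7;

/// Check an optional per-backend NMSE against a tolerance.
/// `None` means the backend was not available in this build and is skipped.
/// A NaN value fails the `<` comparison and is therefore reported as an error.
fn check_nmse(label: &str, nmse: Option<f64>, tol: f64) -> Result<(), String> {
    match nmse {
        None => Ok(()),
        Some(v) if v < tol => Ok(()),
        Some(v) => Err(format!("{label} NMSE {v:e} exceeds tol {tol:e}")),
    }
}
```

Returning a `Result` instead of asserting directly lets a test collect failures from several backends before reporting them.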

Modules§

activation_bridge
activation_bridge op-diff harness — see crate::op_diff.
argmax_rows
argmax_rows_f16 op-diff harness — see crate::op_diff.
embedding_lookup
embedding_lookup op-diff harness — see crate::op_diff.
flash_attention
flash_attention op-diff harness — see crate::op_diff.
fused_add_rms_norm
fused_add_rms_norm op-diff harness — see crate::op_diff.
gemm
gemm op-diff harness — covers the basic fp16 matmul that backs qkv_proj, o_proj, gate_up_proj, down_proj, and the lm_head projection. Per an nsys profile on Vast 4090 / M3, the Marlin<256,...> matmul kernel accounts for ~55% of GPU time at c=16; this op-diff validates the non-quantized fallback path against CPU.
kv_cache_append
kv_cache_append_head_major op-diff harness — see crate::op_diff.
marlin_matmul
marlin_matmul (GPTQ INT4) op-diff harness — PARTIAL: planning stub.
paged_varlen_attn
paged_varlen_attention op-diff harness — PARTIAL: planning stub.
qk_norm_rope
qk_norm_rope op-diff harness — covers the fused rms_norm + rotary-position-embedding + head-major transpose used in all transformer attention layers (split_qkv_norm_rope_into_paged_cache_f16 in nsys traces — top-10 kernel on M3).
residual_add
residual_add (add_inplace) op-diff harness — see crate::op_diff.
rms_norm
rms_norm op-diff harness — see crate::op_diff for the framework.
silu_mul
fused_silu_mul_split op-diff harness.
split_qkv
split_qkv op-diff harness — see crate::op_diff.
transpose_head_to_token
transpose_head_to_token op-diff harness — see crate::op_diff.

Structs§

NmseReport
Cross-backend comparison result.

Constants§

NMSE_FP16_TOL
fp16 storage / Metal accumulation — slightly larger tol.
NMSE_FP32_TOL
fp32 kernels — should agree with CPU below this.

Traits§

OpUnderTest
A single op-under-test: knows how to run itself on each backend.

Functions§

compare_backends
Run op on every backend the current build supports and assemble the comparison report.
nmse
Normalized mean-squared error.
random_vec
Convenience: deterministic uniform-random Vec<f32> in [lo, hi).
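A seeded generator of this kind can be sketched without external crates. The function name `random_vec_sketch`, its parameter order, and the choice of SplitMix64 are all assumptions for illustration; the actual `random_vec` may use a different generator and signature.

```rust
/// Deterministic uniform-random values in [lo, hi), up to f32 rounding.
/// Hypothetical stand-in for `random_vec`, built on a SplitMix64 generator.
fn random_vec_sketch(len: usize, lo: f32, hi: f32, seed: u64) -> Vec<f32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            // One SplitMix64 step: advance the state, then mix its bits.
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            // Keep the top 24 bits so the value maps exactly onto an
            // f32 in [0, 1).
            let unit = (z >> 40) as f32 / (1u32 << 24) as f32;
            lo + (hi - lo) * unit
        })
        .collect()
}
```

The same seed always yields the same vector, so every backend under comparison can be fed identical inputs and a failing run can be reproduced from its seed alone.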

Type Aliases§

Output
Output of a single op invocation. Each backend produces its own Vec<f32> by copying its buffer to the host with to_vec().