Module op_diff

Cross-backend op-diff harness — PLAYBOOK § 3 L1.

Runs the same op on the CPU (reference) and on each available accelerator (Metal / CUDA), then reports the NMSE (normalized mean-squared error) of each accelerator’s output relative to the CPU’s. The comparison is modelled on llama.cpp’s tests/test-backend-ops.cpp, which uses NMSE = mse(a, b) / mse(a, 0) rather than a naive max-abs-diff. Because NMSE divides the error by the energy of the reference output, it is invariant to uniform scaling of that output, so a well-tuned kernel should sit at roughly the same NMSE regardless of input scaling.
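The NMSE formula above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's implementation: the name nmse_sketch, its signature, the f64 accumulation, and the handling of an all-zero reference are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the crate's `nmse`; the real signature may differ.
/// Computes mse(reference, candidate) / mse(reference, 0).
fn nmse_sketch(reference: &[f32], candidate: &[f32]) -> f64 {
    assert_eq!(reference.len(), candidate.len(), "length mismatch");

    let mut err = 0.0f64; // sum of squared differences
    let mut energy = 0.0f64; // sum of squared reference values
    for (&a, &b) in reference.iter().zip(candidate) {
        let (a, b) = (a as f64, b as f64);
        err += (a - b) * (a - b);
        energy += a * a;
    }

    // The 1/n factors of the two means cancel, so the ratio of sums suffices.
    if energy == 0.0 {
        // Assumption: an all-zero reference yields 0 on an exact match
        // and infinity otherwise.
        return if err == 0.0 { 0.0 } else { f64::INFINITY };
    }
    err / energy
}
```

Scaling both vectors by the same constant multiplies the numerator and the denominator by the square of that constant, which is why the result does not depend on the magnitude of the reference output.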

§Usage

use ferrum_testkit::op_diff::{compare_backends, NMSE_FP16_TOL, rms_norm::RmsNormOp};

let report = compare_backends(&RmsNormOp { tokens: 4, dim: 4096, eps: 1e-6 }, 42);
if let Some(nmse) = report.metal_nmse {
    assert!(nmse < NMSE_FP16_TOL, "metal rms_norm NMSE {nmse} exceeds fp16 tol");
}

§Tolerance buckets (PLAYBOOK § 3.1)

NMSE_FP32_TOL = 1e-7 — fp32 kernels must agree with CPU below this.
NMSE_FP16_TOL = 1e-6 — fp16 kernels (Metal default storage).

Tighter bucketing per op is welcome — define op-specific constants once empirical baselines are stable.
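One way to derive an op-specific constant from empirical baselines is sketched below. It is an assumption about workflow, not something this crate provides: the helper suggest_op_tol, the headroom factor, and the idea of capping at the bucket tolerance are all hypothetical.

```rust
/// Hypothetical helper: suggest a per-op tolerance from NMSE values observed
/// across several seeds or shapes. Takes the worst observed value, multiplies
/// it by a headroom factor, and never exceeds the bucket tolerance.
/// Returns `None` when there are no observations to base a constant on.
fn suggest_op_tol(observed: &[f64], headroom: f64, bucket_tol: f64) -> Option<f64> {
    if observed.is_empty() {
        return None;
    }
    // NMSE is non-negative, so 0.0 is a safe starting point for the maximum.
    let worst = observed.iter().copied().fold(0.0f64, f64::max);
    Some((worst * headroom).min(bucket_tol))
}
```

For example, observations of 2e-8, 5e-8 and 3e-8 with a 4x headroom and the fp16 bucket of 1e-6 would suggest 2e-7; the resulting value would then be recorded as a named constant next to the op's harness.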

Modules§

activation_bridge: activation_bridge op-diff harness — see crate::op_diff.
argmax_rows: argmax_rows_f16 op-diff harness — see crate::op_diff.
embedding_lookup: embedding_lookup op-diff harness — see crate::op_diff.
flash_attention: flash_attention op-diff harness — see crate::op_diff.
fused_add_rms_norm: fused_add_rms_norm op-diff harness — see crate::op_diff.
gemm: gemm op-diff harness. Covers the basic fp16 matmul that backs qkv_proj, o_proj, gate_up_proj, down_proj, and the lm_head projection. Per an nsys profile on Vast 4090 / M3, the Marlin<256,...> Marlin matmul kernel accounts for ~55% of GPU time at c=16; this op-diff validates the non-quantized fallback path against the CPU reference.
kv_cache_append: kv_cache_append_head_major op-diff harness — see crate::op_diff.
marlin_matmul: marlin_matmul (GPTQ INT4) op-diff harness — PARTIAL: planning stub.
paged_varlen_attn: paged_varlen_attention op-diff harness — PARTIAL: planning stub.
qk_norm_rope: qk_norm_rope op-diff harness — covers the fused rms_norm + rotary-position-embedding + head-major transpose used in all transformer attention layers (split_qkv_norm_rope_into_paged_cache_f16 in nsys traces — top-10 kernel on M3).
residual_add: residual_add (add_inplace) op-diff harness — see crate::op_diff.
rms_norm: rms_norm op-diff harness — see crate::op_diff for the framework.
silu_mul: fused_silu_mul_split op-diff harness.
split_qkv: split_qkv op-diff harness — see crate::op_diff.
transpose_head_to_token: transpose_head_to_token op-diff harness — see crate::op_diff.

Structs§

NmseReport: Cross-backend comparison result.

Constants§

NMSE_FP16_TOL: fp16 storage / Metal accumulation; slightly larger tolerance than fp32.
NMSE_FP32_TOL: fp32 kernels — should agree with CPU below this.

Traits§

OpUnderTest: A single op-under-test: knows how to run itself on each backend.

Functions§

compare_backends: Run op on every backend the current build supports and assemble the comparison report.
nmse: Normalized mean-squared error.
random_vec: Convenience: deterministic uniform-random Vec<f32> in [lo, hi).
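To illustrate what a deterministic uniform-random input generator such as random_vec might look like, here is a dependency-free sketch. The name random_vec_sketch, the argument order, and the choice of a SplitMix64 generator are assumptions for the example; the crate's actual function may be built differently.

```rust
/// Hypothetical sketch of a seeded generator of `len` values in [lo, hi).
/// Uses SplitMix64 so that the same seed always yields the same vector.
fn random_vec_sketch(len: usize, lo: f32, hi: f32, seed: u64) -> Vec<f32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            // One SplitMix64 step: advance the state, then scramble it.
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^= z >> 31;
            // Keep the top 24 bits so the value is exactly representable
            // as an f32 in [0, 1).
            let unit = (z >> 40) as f32 / (1u32 << 24) as f32;
            lo + (hi - lo) * unit
        })
        .collect()
}
```

Fixing the seed makes a failing comparison reproducible: the same seed regenerates the same inputs on every backend and on every run.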

Type Aliases§

Output: Output of a single op invocation. Each backend produces its own Vec<f32> after to_vec()-ing its buffer to host.
