pub fn batchnorm_ptx() -> &'static str
PTX assembly for BatchNorm kernel (training mode).
One block per channel. Each block reduces across the batch dimension to compute per-channel mean and variance, then normalizes.