pub fn gpu_mul_channel(a: u64, b: u64, m: u64) -> u64
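Reference implementation of the multiply performed by a single GPU thread: returns (a * b) % m. The product is computed in u64, so it overflows when a * b exceeds u64::MAX (a panic in debug builds, wrapping in release builds by default), and m == 0 panics.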
(a * b) % m
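
The expression above multiplies in u64 before reducing. A minimal sketch of an overflow-safe variant follows, assuming the intended result is the exact modular product for all u64 inputs. The source does not say whether the GPU kernel wraps on overflow or restricts its input range, so this may not match the kernel bit for bit. The name mul_mod_wide and the Option return for m == 0 are hypothetical and not part of the original API.

```rust
/// Hypothetical overflow-safe variant of `gpu_mul_channel` (an assumption,
/// not the original API). Widens both operands to u128 so the product
/// cannot overflow, then reduces modulo `m`.
/// Returns `None` when `m == 0` instead of panicking.
pub fn mul_mod_wide(a: u64, b: u64, m: u64) -> Option<u64> {
    if m == 0 {
        return None;
    }
    // The product of two u64 values always fits in a u128.
    let product = (a as u128) * (b as u128);
    // The remainder is strictly less than m, so it fits back into u64.
    Some((product % (m as u128)) as u64)
}
```

For small inputs this agrees with the reference: mul_mod_wide(3, 4, 5) returns Some(2), the same value as (3 * 4) % 5. The difference shows up only when a * b exceeds u64::MAX or when m is zero.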