pub fn mean_dim_shared_memory<E: WgpuElement, const D: usize>(
    input: WgpuTensor<E, D>,
    output: WgpuTensor<E, D>,
    dim: usize
) -> WgpuTensor<E, D>

Execute the mean dim kernel leveraging shared memory. Probably more efficient on tensors where the dimension to be reduced is much larger than the others.
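
To illustrate the general idea of a shared-memory reduction, the following is a minimal CPU-side sketch, not the actual kernel behind this function. It assumes a contiguous row-major `f32` buffer and emulates one workgroup per output element: each "thread" first accumulates a strided slice of the reduced dimension into a small shared buffer, then the buffer is folded with a tree reduction. The names `mean_dim_shared_sketch` and `WORKGROUP_SIZE`, and the workgroup size of 8, are hypothetical and chosen only for this example.

```rust
/// Assumed workgroup size for this sketch (must be a power of two).
const WORKGROUP_SIZE: usize = 8;

/// Hypothetical CPU emulation of a shared-memory mean reduction along `dim`.
/// `data` is a contiguous row-major buffer with the given `shape`.
/// Returns the output buffer, whose shape equals `shape` with `shape[dim]` set to 1.
fn mean_dim_shared_sketch(data: &[f32], shape: &[usize], dim: usize) -> Vec<f32> {
    let reduce_len = shape[dim];
    // Distance in elements between consecutive entries along `dim`.
    let stride: usize = shape[dim + 1..].iter().product();
    // Number of index combinations over the dimensions before `dim`.
    let outer: usize = shape[..dim].iter().product();

    let mut out = Vec::with_capacity(outer * stride);
    for o in 0..outer {
        for i in 0..stride {
            // One emulated workgroup per output element.
            let base = o * reduce_len * stride + i;
            let mut shared = [0.0f32; WORKGROUP_SIZE];

            // Phase 1: each emulated thread sums a strided slice of the reduced dimension.
            for t in 0..WORKGROUP_SIZE {
                let mut k = t;
                while k < reduce_len {
                    shared[t] += data[base + k * stride];
                    k += WORKGROUP_SIZE;
                }
            }

            // Phase 2: tree reduction over the shared buffer.
            // On a GPU, a workgroup barrier would separate these steps.
            let mut active = WORKGROUP_SIZE / 2;
            while active > 0 {
                for t in 0..active {
                    shared[t] += shared[t + active];
                }
                active /= 2;
            }

            out.push(shared[0] / reduce_len as f32);
        }
    }
    out
}
```

For a `[2, 3]` tensor holding `1.0..=6.0`, calling `mean_dim_shared_sketch(&data, &[2, 3], 1)` reduces each row of three elements. The scheme pays off when the reduced dimension is long, since all emulated threads then have work in phase 1; when it is short, most of the shared buffer stays at zero and the tree reduction is overhead.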