pub fn calculate_tensor_stats_cache_optimized(data: &[f32], cache_params: &CacheAwareParams) -> Result<(f32, f32, f32, f32)>
Calculates tensor statistics with cache-optimized blocking.
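
The entry above does not say what the four returned values are or what `CacheAwareParams` contains, so the following is only a minimal sketch of the general blocking idea, not the library's implementation. It assumes the tuple is `(min, max, mean, std_dev)` and uses a hypothetical `BlockParams` struct with a single `block_len` field in place of `CacheAwareParams`; the function name `blocked_stats` and the `String` error type are likewise illustrative.

```rust
/// Hypothetical stand-in for `CacheAwareParams`; the real struct's fields are not shown in the source.
struct BlockParams {
    /// Number of f32 elements per block, e.g. chosen so one block fits in L1 cache.
    block_len: usize,
}

/// Returns (min, max, mean, std_dev). This ordering is an assumption made for illustration.
fn blocked_stats(data: &[f32], params: &BlockParams) -> Result<(f32, f32, f32, f32), String> {
    if data.is_empty() {
        return Err("input slice is empty".to_string());
    }
    if params.block_len == 0 {
        return Err("block_len must be nonzero".to_string());
    }

    let mut min = f32::INFINITY;
    let mut max = f32::NEG_INFINITY;
    // Accumulate in f64 to limit rounding error on large inputs.
    let mut sum = 0.0f64;
    let mut sum_sq = 0.0f64;

    // Walk the slice one block at a time so each block stays cache-resident
    // while all four statistics are updated from it.
    for block in data.chunks(params.block_len) {
        let mut block_sum = 0.0f64;
        let mut block_sum_sq = 0.0f64;
        for &x in block {
            min = min.min(x);
            max = max.max(x);
            let x = x as f64;
            block_sum += x;
            block_sum_sq += x * x;
        }
        // Merge the per-block partial sums into the running totals.
        sum += block_sum;
        sum_sq += block_sum_sq;
    }

    let n = data.len() as f64;
    let mean = sum / n;
    // Population variance; clamp at zero to guard against tiny negative rounding results.
    let variance = (sum_sq / n - mean * mean).max(0.0);
    Ok((min, max, mean as f32, variance.sqrt() as f32))
}
```

Calling `blocked_stats(&values, &BlockParams { block_len: 4096 })` would return the four statistics or an error for empty input. For a single pass over a contiguous slice the hardware prefetcher already does much of the work, so explicit blocking tends to pay off mainly when several passes are made over each block before moving on.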