pub fn prefetch_vector<T>(vec: &[T])
Prefetches the given vector in chunks of 64 bytes, which is a typical cache line size.
NOTE: Efficiency is best when total_vec_size is an integral multiple of 64.
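
The chunked prefetch described above could look roughly like the following. This is a minimal sketch, not the crate's actual implementation. It assumes an x86_64 target for the real prefetch instruction (a no-op elsewhere) and a 64-byte cache line. The names `prefetch_vector_sketch`, `prefetch_line`, and `CACHE_LINE_BYTES` are hypothetical, and the returned chunk count exists only so the behavior can be observed; the documented function returns nothing.

```rust
/// Assumed cache line size in bytes (hypothetical constant).
const CACHE_LINE_BYTES: usize = 64;

/// Issue a prefetch hint for the cache line containing `p` (hypothetical helper).
#[cfg(target_arch = "x86_64")]
fn prefetch_line(p: *const u8) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: a prefetch is only a hint; it does not dereference the pointer.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(p as *const i8) }
}

/// Fallback for other targets: do nothing.
#[cfg(not(target_arch = "x86_64"))]
fn prefetch_line(_p: *const u8) {}

/// Prefetch `vec` one 64-byte chunk at a time and return the number of
/// chunks touched (the return value is for illustration only).
fn prefetch_vector_sketch<T>(vec: &[T]) -> usize {
    let total_bytes = std::mem::size_of_val(vec);
    // Round up so a trailing partial chunk is still prefetched.
    let chunks = (total_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
    let base = vec.as_ptr() as *const u8;
    for i in 0..chunks {
        // Each offset is strictly less than total_bytes, so it stays inside the slice.
        prefetch_line(base.wrapping_add(i * CACHE_LINE_BYTES));
    }
    chunks
}
```

In this sketch a 128-byte slice needs two prefetches and a 65-byte slice also needs two, with the second covering a single byte. That is one way to read the note on sizes that are multiples of 64. If the slice does not start on a 64-byte boundary, the chunks can straddle cache lines, so the last line may not be covered.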