Struct hwlocality::cpu::cache::CpuCacheStats
pub struct CpuCacheStats { /* private fields */ }
CPU cache statistics
These statistics can be used to perform simple cache locality optimizations when your performance requirements do not call for full locality-aware scheduling with manual task and memory pinning.
Implementations
impl CpuCacheStats
pub fn new(topology: &Topology) -> Option<Self>
Compute CPU cache statistics, if cache sizes are known
Returns None if cache size information is unavailable for at least some of the CPU caches on the system.
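Since cache size information may be missing (e.g. in some virtualized environments), callers typically fall back to a conservative guess when `new()` returns `None`. A minimal sketch of that pattern, with the hwlocality calls elided; the 32 KiB fallback is an assumption for illustration, not a constant from the library:

```rust
/// Fallback pattern for absent cache statistics: use the reported
/// smallest L1 data cache size when available, otherwise a
/// conservative 32 KiB guess.
fn l1_capacity_or_default(smallest_sizes: Option<&[u64]>) -> u64 {
    smallest_sizes
        .and_then(|sizes| sizes.first().copied())
        .unwrap_or(32 * 1024)
}

fn main() {
    // In real code, `smallest_sizes` would come from
    // `CpuCacheStats::new(&topology)` followed by
    // `smallest_data_cache_sizes()`.
    assert_eq!(l1_capacity_or_default(None), 32 * 1024);
    assert_eq!(l1_capacity_or_default(Some(&[49_152, 1_048_576])), 49_152);
    println!("ok");
}
```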
pub fn smallest_data_cache_sizes(&self) -> &[u64]
Smallest CPU data cache capacity at each cache level
This tells you how many cache levels there are in the deepest cache hierarchy on this system, and the minimum cache capacity at each level.
You should tune sequential algorithms such that they fit this effective cache hierarchy (first layer of loop blocking has a working set that can stay in the first reported cache capacity, second layer of loop blocking has a working set that can fit in the second reported capacity, etc.)
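As an illustration of the first layer of loop blocking, here is a minimal sketch that sizes a tile to fit the first-level capacity; the 32 KiB value is a placeholder assumption standing in for `smallest_data_cache_sizes()[0]`, and the half-cache headroom is a rule of thumb, not a library recommendation:

```rust
/// Pick a tile length (in elements) so that one tile of
/// `bytes_per_elem`-sized items fits in the given cache capacity,
/// keeping half the cache as headroom for other data.
fn block_len(cache_bytes: u64, bytes_per_elem: u64) -> usize {
    ((cache_bytes / 2) / bytes_per_elem) as usize
}

/// Sum a slice tile by tile, with tiles sized to the L1 data cache.
fn blocked_sum(data: &[f64], l1_bytes: u64) -> f64 {
    let block = block_len(l1_bytes, std::mem::size_of::<f64>() as u64).max(1);
    data.chunks(block).map(|tile| tile.iter().sum::<f64>()).sum()
}

fn main() {
    let data: Vec<f64> = (0..10_000).map(|i| i as f64).collect();
    // 32 KiB is a placeholder for smallest_data_cache_sizes()[0].
    println!("{}", blocked_sum(&data, 32 * 1024)); // 49995000
}
```

Deeper blocking layers repeat the same computation with the second, third, etc. reported capacities.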
pub fn smallest_data_cache_sizes_per_thread(&self) -> &[u64]
Smallest CPU data cache capacity at each cache level, per thread
This tells you how many cache levels there are in the deepest cache hierarchy on this system, and the minimum cache capacity available per thread sharing a cache at each level.
In parallel algorithms where all CPU threads are potentially used, and threads effectively share no common data, you should tune the private working set of each thread such that it fits this effective cache hierarchy (first layer of loop blocking has a working set that can stay in the first reported cache capacity, second layer of loop blocking has a working set that can fit in the second reported capacity, etc.).
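A minimal sketch of per-thread working set sizing; the 16 KiB figure (two hyperthreads sharing a 32 KiB L1) is an illustrative assumption standing in for `smallest_data_cache_sizes_per_thread()[0]`:

```rust
/// Given the per-thread capacity of some cache level, compute how many
/// `f64` elements each thread's private working set may hold, keeping
/// half the capacity as headroom.
fn private_elems(per_thread_bytes: u64) -> usize {
    ((per_thread_bytes / 2) as usize) / std::mem::size_of::<f64>()
}

fn main() {
    // Placeholder: 16 KiB of L1 per thread, i.e. two hyperthreads
    // sharing a 32 KiB L1 data cache.
    let per_thread_l1 = 16 * 1024;
    println!("{} f64 elements per thread", private_elems(per_thread_l1));
}
```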
pub fn total_data_cache_sizes(&self) -> &[u64]
Total CPU data cache capacity at each cache level
This tells you how many cache levels there are in the deepest cache hierarchy on this system, and the total cache capacity at each level.
You should tune parallel algorithms such that the total working set (summed across all threads without double-counting shared resources) fits in the reported aggregated cache capacities.
Beware that this is only a minimal requirement for cache locality, and programs honoring this criterion might still not achieve good cache performance due to CPU core heterogeneity or Non-Uniform Cache Access (NUCA) effects. To correctly handle these, you need to move to a fully locality-aware design with threads pinned to CPU cores and tree-like synchronization following the shape of the topology tree.
That being said, you may manage to reduce NUCA effects at the cost of using a smaller fraction of your CPU cache capacity by making your parallel algorithm collectively fit into the smallest last-level cache.
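A minimal sketch of the aggregate capacity check described above; the capacity figures (512 KiB L1, 8 MiB L2, 32 MiB L3 totals) are assumptions for a hypothetical system, standing in for `total_data_cache_sizes()`:

```rust
/// Check whether a parallel algorithm's total working set (summed over
/// threads, with shared data counted once) fits the aggregated
/// capacity of a given cache level (0 = L1, 1 = L2, ...).
fn fits_level(total_sizes: &[u64], level: usize, working_set_bytes: u64) -> bool {
    total_sizes
        .get(level)
        .map_or(false, |&capacity| working_set_bytes <= capacity)
}

fn main() {
    // Placeholder totals for a hypothetical system:
    // 512 KiB of L1, 8 MiB of L2, 32 MiB of L3 across all cores.
    let totals = [512 * 1024, 8 * 1024 * 1024, 32 * 1024 * 1024];
    // A 16 MiB collective working set fits the combined L3 but not L2.
    println!("{}", fits_level(&totals, 2, 16 * 1024 * 1024)); // true
    println!("{}", fits_level(&totals, 1, 16 * 1024 * 1024)); // false
}
```

To follow the NUCA-avoidance advice from the last paragraph, one would instead compare against the smallest last-level cache from `smallest_data_cache_sizes()`.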
Trait Implementations
impl Clone for CpuCacheStats
fn clone(&self) -> CpuCacheStats
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from `source`.
impl Debug for CpuCacheStats
impl Hash for CpuCacheStats
impl PartialEq for CpuCacheStats
fn eq(&self, other: &CpuCacheStats) -> bool
Tests for `self` and `other` values to be equal, and is used by `==`.