Crate many_cpus

Working on many-processor systems with 100+ logical processors can require extra attention to the specifics of the hardware to make optimal use of the available compute capacity and extract the most performance from the system.

This is part of the Folo project that provides mechanisms for high-performance hardware-aware programming in Rust.

§Why should one care?

Modern operating systems try to distribute work fairly between all processors. Typical Rust sync and async task runtimes like Rayon and Tokio likewise try to be efficient in occupying all processors with work, even moving work between processors if one risks becoming idle. This is fine but we can do better.

Taking direct control over the placement of work on specific processors can yield superior performance by taking advantage of factors under the service author’s control, which are not known to general-purpose tasking runtimes:

  1. A key insight we can use is that most service apps exist to process requests or execute jobs - each unit of work being done is related to a specific data set. We can ensure we only process the data associated with a specific HTTP/gRPC request on a single processor to ensure optimal data locality. This means the data related to the request is likely to be in the caches of that processor, speeding up all operations related to that request by avoiding expensive memory accesses.
  2. Even when data is intentionally shared across processors (e.g. because one processor is not capable enough to do the work and parallelization is required), performance differences exist between different pairs of processors because different processors can be connected to different physical memory modules. Access to non-cached data is optimal when that data is in the same memory region as the current processor (i.e. on the physical memory modules directly wired to the current processor).

§How does this package help?

The many_cpus package provides mechanisms to schedule threads on specific processors and in specific memory regions. Work assigned to those threads stays on the same hardware, and data shared between threads stays local to one memory region, enabling high data locality and processor cache efficiency.

In addition to thread spawning, this package enables app logic to observe what processor the current thread is executing on and in which memory region this processor is located, even if the thread is not bound to a specific processor. This can be a building block for efficiency improvements even outside directly controlled work scheduling.

Other packages from the Folo project build upon this hardware-awareness functionality to provide higher-level primitives such as thread pools, work schedulers, region-local cells and more.

§Quick start

The simplest scenario is when you want to start a thread on every processor in the default processor set:

// examples/spawn_on_all_processors.rs
let threads = SystemHardware::current()
    .processors()
    .spawn_threads(|processor| {
        println!("Spawned thread on processor {}", processor.id());

        // In a real service, you would start some work handler here, e.g. to read
        // and process messages from a channel or to spawn a web handler.
    });

If no operating-system-enforced constraints are active, the default processor set includes all processors.

§Selection criteria

Depending on the specific circumstances, you may want to filter the set of processors. For example, you may want to use only two processors but ensure that they are high-performance processors that are connected to the same physical memory modules so they can cooperatively perform some processing on a shared data set:

// examples/spawn_on_selected_processors.rs
let hardware = SystemHardware::current();

let selected_processors = hardware
    .processors()
    .to_builder()
    .same_memory_region()
    .performance_processors_only()
    .take(nz!(2))
    // If we do not have what we want, we fall back to the default set.
    .unwrap_or_else(|| hardware.processors());

let threads = selected_processors.spawn_threads(|processor| {
    println!("Spawned thread on processor {}", processor.id());

    // In a real service, you would start some work handler here, e.g. to read
    // and process messages from a channel or to spawn a web handler.
});

§Inspecting the hardware environment

Functions are provided to easily inspect the current hardware environment:

// examples/observe_processor.rs
let hardware = SystemHardware::current();

let max_processors = hardware.max_processor_count();
let max_memory_regions = hardware.max_memory_region_count();
println!(
    "This system can support up to {max_processors} processors in {max_memory_regions} memory regions"
);

loop {
    let current_processor_id = hardware.current_processor_id();
    let current_memory_region_id = hardware.current_memory_region_id();

    println!(
        "Thread executing on processor {current_processor_id} in memory region {current_memory_region_id}"
    );

    thread::sleep(Duration::from_secs(1));
}

Note that the current processor may change at any time if you are not using threads pinned to specific processors (such as those spawned via ProcessorSet::spawn_threads()). Example output:

This system can support up to 32 processors in 1 memory regions
Thread executing on processor 4 in memory region 0
Thread executing on processor 4 in memory region 0
Thread executing on processor 12 in memory region 0
Thread executing on processor 2 in memory region 0
Thread executing on processor 12 in memory region 0
Thread executing on processor 0 in memory region 0
Thread executing on processor 4 in memory region 0
Thread executing on processor 4 in memory region 0

§External constraints

The operating system may define constraints that prohibit the application from using all the available processors (e.g. when the app is containerized and provided limited hardware resources).

This package treats platform constraints as follows:

  • Hard limits on which processors are allowed are respected - forbidden processors are mostly ignored by this package and cannot be used to spawn threads, though such processors are still accounted for when inspecting hardware information such as “max processor ID”. The mechanisms for defining such limits are cgroups on Linux and job objects on Windows. See examples/obey_job_affinity_limits_windows.rs for a Windows-specific example.
  • Soft limits on which processors are allowed are ignored by default - specifying a processor affinity via taskset on Linux, start.exe /affinity 0xff on Windows or similar mechanisms does not affect the set of processors this package will use by default, though you can opt in to this via .where_available_for_current_thread().
  • Any operating-system-enforced processor time quota is taken as the upper bound for the processor count of the processor set returned by SystemHardware::processors().
  • Any other processor set can be opt-in quota-limited when building the processor set. For example, by calling SystemHardware::current().all_processors().to_builder().enforce_resource_quota().take_all().

See examples/obey_job_resource_quota_limits_windows.rs for a Windows-specific example of processor time quota enforcement.
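To make the quota-to-processor-count relationship concrete, here is a hedged, std-only sketch (illustrative only, not this crate's implementation) of how an upper-bound processor count can be derived from a cgroups v2 cpu.max value of the form `<quota_us> <period_us>` (or `max <period_us>` when no limit is set):

```rust
// Illustrative sketch: derive an upper-bound processor count from a
// cgroups v2 `cpu.max` string. A quota of 150000us per 100000us period
// means 1.5 processors' worth of time, so at most 2 whole processors
// can be kept busy without exceeding the quota.
fn processors_from_cpu_max(cpu_max: &str, total_processors: usize) -> usize {
    let mut parts = cpu_max.split_whitespace();
    match (parts.next(), parts.next()) {
        // "max <period>" means no quota is enforced.
        (Some("max"), _) => total_processors,
        (Some(quota), Some(period)) => {
            let quota: f64 = quota.parse().unwrap_or(0.0);
            let period: f64 = period.parse().unwrap_or(1.0);
            ((quota / period).ceil() as usize).clamp(1, total_processors)
        }
        _ => total_processors,
    }
}

fn main() {
    assert_eq!(processors_from_cpu_max("max 100000", 32), 32);
    assert_eq!(processors_from_cpu_max("200000 100000", 32), 2);
    assert_eq!(processors_from_cpu_max("150000 100000", 32), 2);
    println!("ok");
}
```

The crate performs this accounting for you; the sketch only shows why a time quota maps to a processor-count upper bound.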

§Avoiding operating system quota penalties

If a process exceeds the processor time limit, the operating system will delay executing the process further until the “debt is paid off”. This is undesirable for most workloads because:

  1. There will be random latency spikes whenever the operating system decides to apply a delay.
  2. The delay may not be evenly applied across all threads of the process, leading to unbalanced load between worker threads.

For predictable behavior that does not suffer from delay side-effects, it is important that the process does not exceed the processor time limit. To keep out of trouble, follow these guidelines:

  • Ensure that all your concurrently executing thread pools are derived from the same processor set, so there is a single set of processors (up to the resource quota) on which all work of the process will be executed. Any new processor sets you create should be subsets of this set, thereby ensuring that all worker threads combined do not exceed the quota.
  • Ensure that the original processor set is constructed while obeying the resource quota (which is enabled by default).

If your resource constraints are already applied on process startup, you can use SystemHardware::current().processors() as the master set from which all other processor sets are derived using either .take() or .to_builder(). This will ensure the processor time quota is obeyed because processors() is size-limited to the quota.

let hw = SystemHardware::current();

// By taking both senders and receivers from the same original processor set, we
// guarantee that all worker threads combined cannot exceed the processor time quota.
let mail_senders = hw.processors()
    .take(nz!(2))
    .expect("need at least 2 processors for mail workers")
    .spawn_threads(|_| send_mail());

let mail_receivers = hw.processors()
    .take(nz!(2))
    .expect("need at least 2 processors for mail workers")
    .spawn_threads(|_| receive_mail());

§Changes at runtime

It is possible that a system will have processors added or removed at runtime, or for constraints enforced by the operating system to change over time. Such changes will not be represented in an existing processor set - once created, a processor set is static.

Changes to resource quotas can be applied by creating a new processor set (e.g. if the processor time quota is lowered, building a new set will by default use the new quota).

This package will not detect more fundamental changes such as added/removed processors. Operations attempted on removed processors may fail with an error or panic or silently misbehave (e.g. threads never starting). Added processors will not be considered a member of any set.

§Inheriting soft limits on allowed processors

While the package does not by default obey soft limits, you can opt in to these limits by inheriting the allowed processor set in the main() entrypoint thread:

// examples/spawn_on_inherited_processors.rs
let hardware = SystemHardware::current();

// The set of processors used here can be adjusted via OS mechanisms.
//
// For example, to select only processors 0 and 1:
// Linux: `taskset 0x3 target/debug/examples/spawn_on_inherited_processors`
// Windows: `start /affinity 0x3 target/debug/examples/spawn_on_inherited_processors.exe`
let inherited_processors = hardware
    .processors()
    .to_builder()
    // This causes soft limits on processor affinity to be respected.
    .where_available_for_current_thread()
    .take_all()
    .expect(
        "found no processors usable by the current thread; \
        this is impossible because the thread is currently running on one",
    );

println!(
    "After applying soft limits, we are allowed to use {} processors.",
    inherited_processors.len()
);

let threads = inherited_processors.spawn_threads(|processor| {
    println!("Spawned thread on processor {}", processor.id());

    // In a real service, you would start some work handler here, e.g. to read
    // and process messages from a channel or to spawn a web handler.
});

§Testing with fake hardware

The many_cpus package provides a fake hardware capability for testing code that depends on hardware configuration. This is available when the test-util Cargo feature is enabled.

To make your code testable with fake hardware, accept SystemHardware as a value (typically as a function parameter or struct field) instead of always calling SystemHardware::current(). This allows tests to substitute fake hardware while production code uses real hardware.
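As a sketch of this pattern (using a hypothetical stand-in type rather than the real SystemHardware, so the snippet is self-contained):

```rust
// Sketch of the dependency-injection pattern described above. `FakeHardware`
// is a hypothetical stand-in; in real code you would accept the crate's
// SystemHardware value here instead.
struct FakeHardware {
    max_processors: usize,
}

impl FakeHardware {
    fn max_processor_count(&self) -> usize {
        self.max_processors
    }
}

// Accepts the hardware handle as a parameter instead of constructing it
// internally, so tests can substitute fake hardware for the real thing.
fn plan_worker_count(hw: &FakeHardware) -> usize {
    // Arbitrary example policy: cap workers at 8.
    hw.max_processor_count().min(8)
}

fn main() {
    assert_eq!(plan_worker_count(&FakeHardware { max_processors: 96 }), 8);
    assert_eq!(plan_worker_count(&FakeHardware { max_processors: 4 }), 4);
    println!("ok");
}
```

In production code, the caller would pass SystemHardware::current(); in tests, a fake hardware value from the [fake] module.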

See the [fake] module for detailed examples and API documentation.

§Operating system compatibility

This package is tested on the following operating systems:

  • Windows 11 and newer
  • Windows Server 2022 and newer
  • Ubuntu 24.04 and newer

The functionality may also work on other operating systems if they offer compatible platform APIs, but this is not actively tested.

§Unsupported platforms

On operating systems without native support (such as macOS, BSD variants, etc.), this package provides a fallback implementation that allows code to compile and run with graceful degradation:

  • Processor count is determined via std::thread::available_parallelism()
  • All processors are simulated as being in a single memory region (region 0)
  • All processors are marked as Performance class
  • Thread pinning operations succeed but do not actually pin threads to processors
  • Current processor tracking uses stable thread-local IDs derived from thread IDs

While this fallback behavior maintains API compatibility and allows applications to function correctly, it does not provide the performance benefits of actual processor pinning and topology awareness; applications running on unsupported platforms will see no performance improvement from using this package.
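For instance, the fallback derives its processor count from the standard library, which you can also query directly:

```rust
use std::thread;

fn main() {
    // The fallback implementation determines the processor count via
    // std::thread::available_parallelism(); 1 is a safe floor if the
    // platform cannot report a value.
    let count = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("fallback processor count: {count}");
}
```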

Structs§

Processor
A processor present on the system and available to the current process.
ProcessorSet
One or more processors present on the system and available for use.
ProcessorSetBuilder
Builds a ProcessorSet based on specified criteria.
ResourceQuota
The resource quota that the operating system enforces for the current process.
SystemHardware
Handle to system hardware, providing access to hardware information and tracking.

Enums§

EfficiencyClass
Differentiates processors on the performance-efficiency axis.

Type Aliases§

MemoryRegionId
Identifies a specific memory region.
ProcessorId
Identifies a specific processor.