Working on many-processor systems with 100+ logical processors can require you to pay extra attention to the specifics of the hardware to make optimal use of available compute capacity and extract the most performance out of the system.
This crate is part of the Folo project, which provides mechanisms for high-performance hardware-aware programming in Rust.
§Why should one care?
Modern operating systems try to distribute work fairly between all processors. Typical Rust sync and async task runtimes like Rayon and Tokio likewise try to be efficient in occupying all processors with work, even moving work between processors if one risks becoming idle. This is fine but we can do better.
Taking direct control over the placement of work on specific processors can yield superior performance by taking advantage of factors under the service author’s control, which are not known to general-purpose tasking runtimes:
- A key insight we can use is that most service apps exist to process requests or execute jobs - each unit of work being done is related to a specific data set. By processing all data associated with a specific HTTP/gRPC request on a single processor, we achieve optimal data locality: the data related to the request is likely to be in the caches of that processor, speeding up all operations related to that request by avoiding expensive memory accesses.
- Even when data is intentionally shared across processors (e.g. because one processor is not capable enough to do the work and parallelization is required), performance differences exist between different pairs of processors because different processors can be connected to different physical memory modules. Access to non-cached data is optimal when that data is in the same memory region as the current processor (i.e. on the physical memory modules directly wired to the current processor).
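The first point can be illustrated with plain standard-library threads (this sketch uses no many_cpus APIs; the function and names are illustrative): route every unit of work for a given request to the same worker thread, so that thread exclusively owns the request's state. With many_cpus, each such worker could additionally be pinned to a single processor.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Illustrative sketch: each worker owns its requests' state exclusively,
// so a given request's data is only ever touched from one thread.
fn route_and_sum(requests: u64, workers: usize) -> u64 {
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..workers {
        // Each message is (request_id, payload).
        let (tx, rx) = mpsc::channel::<(u64, u64)>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // State for "our" requests lives only on this thread.
            let mut state: HashMap<u64, u64> = HashMap::new();
            for (request_id, payload) in rx {
                *state.entry(request_id).or_insert(0) += payload;
            }
            state.values().sum::<u64>()
        }));
    }

    // Route all work for a given request to the same worker, keeping
    // that request's data local to one thread (and, when pinned, one processor).
    for request_id in 0..requests {
        let worker = (request_id % workers as u64) as usize;
        senders[worker].send((request_id, 1)).unwrap();
    }
    drop(senders); // close the channels so the workers finish

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("processed total = {}", route_and_sum(100, 2));
}
```

The routing key (here, request ID modulo worker count) is the design decision that creates locality: the same request always lands on the same thread, so no cross-thread sharing of request state is ever needed.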
§How does this crate help?
The many_cpus crate provides mechanisms to schedule threads on specific processors and in specific
memory regions, ensuring that work assigned to those threads remains on the same hardware and that
data shared between threads is local to the same memory region, enabling you to achieve high data
locality and processor cache efficiency.
In addition to thread spawning, this crate enables app logic to observe what processor the current thread is executing on and in which memory region this processor is located, even if the thread is not bound to a specific processor. This can be a building block for efficiency improvements even outside directly controlled work scheduling.
Other packages from the Folo project build upon this hardware-awareness functionality to provide higher-level primitives such as thread pools, work schedulers, region-local cells and more.
§Operating system compatibility
This crate is tested on the following operating systems:
- Windows 11 and newer
- Windows Server 2022 and newer
- Ubuntu 24.04 and newer
The functionality may also work on other operating systems if they offer compatible platform APIs, but this is not actively tested.
§Quick start
The simplest scenario is when you want to start a thread on every processor in the default processor set:
// examples/spawn_on_all_processors.rs
let threads = ProcessorSet::default().spawn_threads(|processor| {
println!("Spawned thread on processor {}", processor.id());
// In a real service, you would start some work handler here, e.g. to read
// and process messages from a channel or to spawn a web handler.
});

If no operating-system-enforced constraints are active, the default processor set includes all processors.
§Selection criteria
Depending on the specific circumstances, you may want to filter the set of processors. For example, you may want to use only two processors but ensure that they are high-performance processors that are connected to the same physical memory modules so they can cooperatively perform some processing on a shared data set:
// examples/spawn_on_selected_processors.rs
const PROCESSOR_COUNT: NonZero<usize> = NonZero::new(2).unwrap();
let Some(selected_processors) = ProcessorSet::builder()
.same_memory_region()
.performance_processors_only()
.take(PROCESSOR_COUNT)
else {
println!("Not enough processors available for this example");
return;
};
let threads = selected_processors.spawn_threads(|processor| {
println!("Spawned thread on processor {}", processor.id());
// In a real service, you would start some work handler here, e.g. to read
// and process messages from a channel or to spawn a web handler.
});

§Inspecting the hardware environment
Functions are provided to easily inspect the current hardware environment:
// examples/observe_processor.rs
let max_processors = HardwareInfo::max_processor_count();
let max_memory_regions = HardwareInfo::max_memory_region_count();
println!(
"This system can support up to {max_processors} processors in {max_memory_regions} memory regions"
);
loop {
let current_processor_id = HardwareTracker::current_processor_id();
let current_memory_region_id = HardwareTracker::current_memory_region_id();
println!(
"Thread executing on processor {current_processor_id} in memory region {current_memory_region_id}"
);
thread::sleep(Duration::from_secs(1));
}

Note that the current processor may change at any time if you are not using threads pinned to
specific processors (such as those spawned via ProcessorSet::spawn_threads()). Example output:
This system can support up to 32 processors in 1 memory regions
Thread executing on processor 4 in memory region 0
Thread executing on processor 4 in memory region 0
Thread executing on processor 12 in memory region 0
Thread executing on processor 2 in memory region 0
Thread executing on processor 12 in memory region 0
Thread executing on processor 0 in memory region 0
Thread executing on processor 4 in memory region 0
Thread executing on processor 4 in memory region 0

§External constraints
The operating system may define constraints that prohibit the application from using all the available processors (e.g. when the app is containerized and provided limited hardware resources).
This crate treats platform constraints as follows:
- Hard limits on which processors are allowed are respected - forbidden processors are mostly ignored by this crate and cannot be used to spawn threads, though such processors are still accounted for when inspecting hardware information such as “max processor ID”. The mechanisms for defining such limits are cgroups on Linux and job objects on Windows. See examples/obey_job_affinity_limits_windows.rs for a Windows-specific example.
- Soft limits on which processors are allowed are ignored by default - specifying a processor affinity via taskset on Linux, start.exe /affinity 0xff on Windows, or similar mechanisms does not affect the set of processors this crate will use by default, though you can opt in to this via .where_available_for_current_thread().
- Limits on processor time are considered an upper bound on the number of processors that can be included in a processor set. For example, if you configure a processor time limit of 10 seconds per second of real time on a 20-processor system, then the builder may return up to 10 of the processors in the resulting processor set (though it may be a different 10 every time you create a new processor set from scratch). This limit is optional and may be disabled by using .ignoring_resource_quota(). See examples/obey_job_resource_quota_limits_windows.rs for a Windows-specific example.
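The processor-time bound above is plain arithmetic, which can be sketched as follows. This is not a many_cpus API; the function name is illustrative, and the rounding of fractional quotas (floor, but never below one processor) is an assumption rather than documented crate behavior.

```rust
// How a processor-time quota translates to a processor-count cap:
// a limit of N seconds of CPU time per second of real time can keep at
// most N processors fully busy, so a builder honoring the quota would
// return no more than that many processors (and no more than exist).
fn max_processors_under_quota(total_processors: usize, quota_seconds_per_second: f64) -> usize {
    // Assumption: fractional quotas round down, but at least one
    // processor is always usable.
    let cap = (quota_seconds_per_second.floor() as usize).max(1);
    total_processors.min(cap)
}

fn main() {
    // The example above: 10 s of CPU time per second on a 20-processor system.
    println!("{}", max_processors_under_quota(20, 10.0));
}
```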
§Working with processor time constraints
If a process exceeds the processor time limit, the operating system will delay executing the process further until the “debt is paid off”. This is undesirable for most workloads because:
- There will be random latency spikes from when the operating system decides to apply a delay.
- The delay may not be evenly applied across all threads of the process, leading to unbalanced load between worker threads.
For predictable behavior that does not suffer from delay side-effects, it is important that the process does not exceed the processor time limit. To keep out of trouble, follow these guidelines:
- Ensure that all your concurrently executing thread pools are derived from the same processor set, so there is a single set of processors (up to the resource quota) that all work of the process will be executed on. Any new processor sets you create should be subsets of this set, thereby ensuring that all worker threads combined do not exceed the quota.
- Ensure that the original processor set is constructed while obeying the resource quota (which is enabled by default).
If your resource constraints are already applied on process startup, you can use
ProcessorSet::default() as the master set from which all other processor sets are derived using
ProcessorSet::default().to_builder(). This will ensure the processor time quota is always obeyed
because ProcessorSet::default() is guaranteed to obey the resource quota.
// MAIL_WORKER_COUNT is an illustrative constant, defined as in the earlier examples:
const MAIL_WORKER_COUNT: NonZero<usize> = NonZero::new(4).unwrap();
let mail_senders = ProcessorSet::default().to_builder().take(MAIL_WORKER_COUNT).unwrap();

§Changes at runtime
It is possible that a system will have processors added or removed at runtime, or for constraints enforced by the operating system to change over time. Such changes will not be represented in an existing processor set - once created, a processor set is static.
Changes to resource quotas can be applied by creating a new processor set (e.g. if the processor time quota is lowered, building a new set will by default use the new quota).
This crate will not detect more fundamental changes such as added/removed processors. Operations attempted on removed processors may fail with an error or panic, or may silently misbehave (e.g. threads never starting). Added processors will not be considered a member of any set.
§Inheriting soft limits on allowed processors
While the crate does not by default obey soft limits, you can opt in to these limits by
inheriting the allowed processor set in the main() entrypoint thread:
// examples/spawn_on_inherited_processors.rs
// The set of processors used here can be adjusted via OS mechanisms.
//
// For example, to select only processors 0 and 1:
// Linux: `taskset 0x3 target/debug/examples/spawn_on_inherited_processors`
// Windows: `start /affinity 0x3 target/debug/examples/spawn_on_inherited_processors.exe`
let inherited_processors = ProcessorSet::builder()
// This causes soft limits on processor affinity to be respected.
.where_available_for_current_thread()
.take_all()
.expect("found no processors usable by the current thread - impossible because the thread is currently running on one");
println!(
"After applying soft limits, we are allowed to use {} processors.",
inherited_processors.len()
);
let threads = inherited_processors.spawn_threads(|processor| {
println!("Spawned thread on processor {}", processor.id());
// In a real service, you would start some work handler here, e.g. to read
// and process messages from a channel or to spawn a web handler.
});

Structs§
- HardwareInfo - Reports non-changing information about the system hardware.
- HardwareTracker - Tracks and provides access to changing hardware information over time.
- Processor - A processor present on the system and available to the current process.
- ProcessorSet - One or more processors present on the system and available for use.
- ProcessorSetBuilder - Builds a ProcessorSet based on specified criteria. The default criteria include all available processors, with the maximum count determined by the process resource quota.
- ResourceQuota - Information about the resource quota that the operating system enforces for the current process.

Enums§
- EfficiencyClass - Differentiates processors by their efficiency class, allowing work requiring high performance to be placed on the most performant processors at the expense of energy usage.

Type Aliases§
- MemoryRegionId - A memory region identifier, used to differentiate memory regions in the system. This will match the numeric identifier used by standard tooling of the operating system.
- ProcessorId - A processor identifier, used to differentiate processors in the system. This will match the numeric identifier used by standard tooling of the operating system.