# loom-rs

*Weaving multiple threads together.*

A Rust crate providing a bespoke thread pool runtime combining tokio and rayon with CPU pinning capabilities.
## Features
- Hybrid Runtime: Combines tokio for async I/O with rayon for CPU-bound parallel work
- CPU Pinning: Automatically pins threads to specific CPUs for consistent performance
- Flexible Configuration: Configure via files (TOML/YAML/JSON), environment variables, or code
- CLI Integration: Built-in clap support for command-line overrides
- CUDA NUMA Awareness: Optional feature for selecting CPUs local to a GPU (Linux only)
## Platform Support
| Platform | Status | Notes |
|---|---|---|
| Linux | Full support | All features including CPU pinning and CUDA |
| macOS | Partial | Compiles and runs, but CPU pinning may silently fail |
| Windows | Partial | Compiles and runs, but CPU pinning may silently fail |
Note: CPU affinity (thread pinning) is a Linux-focused feature. On macOS and Windows, pinning calls may return failure or have no effect. The library remains functional for development and testing, but production deployments targeting performance should use Linux.
## Installation

Add to your Cargo.toml:

```toml
[dependencies]
loom-rs = "0.1"
```

For CUDA support (Linux only):

```toml
[dependencies]
loom-rs = { version = "0.1", features = ["cuda"] }
```
## Quick Start

```rust
use loom_rs::LoomBuilder;
```
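A minimal end-to-end sketch of typical usage. The builder and method names come from this README, but exact signatures, return types, and the error type are assumptions:

```rust
use loom_rs::LoomBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the hybrid runtime (tokio + rayon, CPU pinning applied on Linux).
    let runtime = LoomBuilder::new().build()?;

    // Drive async work from the main thread.
    runtime.block_on(async {
        // CPU-bound work is bridged onto the rayon pool and awaited here.
        let sum = runtime.spawn_compute(|| (1u64..=1_000).sum::<u64>()).await;
        println!("sum = {sum}");
    });

    Ok(())
}
```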
## Configuration

Configuration sources are merged in order (later sources override earlier; see the sketch after this list):

1. Default values
2. Config files (via `.file()`)
3. Environment variables (via `.env_prefix()`)
4. Programmatic overrides
5. CLI arguments (via `.with_cli_args()`)
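A sketch of layering these sources on the builder. Only the method names appear in this README; the file name is illustrative and the exact signatures are assumptions:

```rust
use loom_rs::LoomBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = LoomBuilder::new()
        .file("loom.toml")        // 2. config file (illustrative path)
        .env_prefix("LOOM")       // 3. LOOM_* environment variables
        .compute_pool_size(128)   // 4. a programmatic override
        .with_cli_args()          // 5. CLI flags win last
        .build()?;
    let _ = runtime;
    Ok(())
}
```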
### Config File (TOML)

```toml
# Key names are assumed to mirror the --loom-* CLI options below.
prefix = "myapp"
cpuset = "0-7,16-23"
tokio_threads = 2
rayon_threads = 14
```
### Environment Variables

With `.env_prefix("LOOM")`:
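Variable names are assumed to be the prefix plus the upper-cased config keys, for example:

```bash
export LOOM_PREFIX=myapp
export LOOM_CPUSET=0-7,16-23
export LOOM_TOKIO_THREADS=2
export LOOM_RAYON_THREADS=14
```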
### CLI Arguments

```rust
use clap::Parser;
use loom_rs::LoomBuilder;
```
Available CLI arguments:

- `--loom-prefix`: Thread name prefix
- `--loom-cpuset`: CPU set (e.g., `"0-7,16-23"`)
- `--loom-tokio-threads`: Number of tokio threads
- `--loom-rayon-threads`: Number of rayon threads
- `--loom-cuda-device`: CUDA device ID or UUID (requires the `cuda` feature)
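A sketch of the clap wiring. The struct name `LoomArgs` is a hypothetical placeholder and the `with_cli_args` signature is an assumption; this shows only how the pieces are intended to fit together:

```rust
use clap::Parser;
use loom_rs::LoomBuilder;

#[derive(Parser)]
struct Cli {
    // Hypothetical: assume loom-rs exposes a clap args struct providing the
    // --loom-* flags above, flattened into your own CLI surface.
    #[command(flatten)]
    loom: loom_rs::LoomArgs, // placeholder name
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let runtime = LoomBuilder::new()
        .with_cli_args(cli.loom) // signature assumed
        .build()?;
    let _ = runtime;
    Ok(())
}
```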
### CPU Set Format

The `cpuset` option accepts a string in Linux `taskset`/`numactl` format:

- Single CPUs: `"0"`, `"5"`
- Ranges: `"0-7"`, `"16-23"`
- Mixed: `"0-3,8-11"`, `"0,2,4,6-8"`
## CUDA Support

With the `cuda` feature enabled (Linux only), configure the runtime to use CPUs local to a specific GPU:
```rust
let runtime = LoomBuilder::new()
    .cuda_device_id(0) // Use CPUs near GPU 0
    .build()?;

// Or by UUID (placeholder UUID shown)
let runtime = LoomBuilder::new()
    .cuda_device_uuid("GPU-00000000-0000-0000-0000-000000000000")
    .build()?;
```
This is useful for GPU-accelerated workloads where data needs to be transferred between CPU and GPU memory, as it minimizes NUMA-related latency.
## Thread Naming

Threads are named with the configured prefix:

- Tokio threads: `{prefix}-tokio-0000`, `{prefix}-tokio-0001`, ...
- Rayon threads: `{prefix}-rayon-0000`, `{prefix}-rayon-0001`, ...
## API Reference

### Task Spawning

| Method | Use Case | Overhead | Tracked |
|---|---|---|---|
| `spawn_async()` | I/O-bound async tasks | ~10ns | Yes |
| `spawn_compute()` | CPU-bound work (await from async) | ~100-500ns | Yes |
| `install()` | Zero-overhead parallel iterators | ~0ns | No |
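A sketch contrasting the three entry points. Argument shapes and return types are assumptions inferred from the table (e.g. that `spawn_compute` takes a closure and `install` returns the closure's result):

```rust
use loom_rs::LoomBuilder;
use rayon::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let runtime = LoomBuilder::new().build()?;
    runtime.block_on(async {
        // Tracked async task on the tokio side.
        runtime.spawn_async(async { /* I/O-bound work */ }).await;

        // Tracked CPU-bound task, bridged onto rayon and awaitable from async.
        let n = runtime.spawn_compute(|| (0u64..1_000_000).sum::<u64>()).await;

        // Untracked parallel iterators; note that install() blocks the caller.
        let total = runtime.install(|| (0u64..1_000_000).into_par_iter().sum::<u64>());

        assert_eq!(n, total);
    });
    Ok(())
}
```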
### Shutdown

```rust
// Option 1: Simple shutdown from main thread
runtime.block_until_idle();

// Option 2: Manual control from async context
runtime.block_on(async {
    // ... await any outstanding work here
});

// Option 3: Check status without blocking
if runtime.is_idle() {
    // safe to shut down
}
```
### Direct Access (Untracked)

For advanced use cases requiring untracked access:

```rust
// Direct tokio handle
let handle = runtime.tokio_handle();
handle.spawn(async { /* untracked async task */ });

// Direct rayon pool
let pool = runtime.rayon_pool();
pool.spawn(|| { /* untracked CPU-bound work */ });
```
### Ergonomic Access

Use `current_runtime()` or `spawn_compute()` from anywhere in the runtime:

```rust
use loom_rs::LoomBuilder;
let runtime = LoomBuilder::new().build()?;
runtime.block_on(async {
    // current_runtime() is available from any task running on this runtime
});
```
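For example, a helper deep in the call stack can reach the runtime without a handle being threaded through as a parameter. A sketch; the import path and the return type of `current_runtime()` (and hence the chained call) are assumptions:

```rust
use loom_rs::current_runtime;

// Hypothetical helper: no &runtime parameter needed.
async fn checksum(data: Vec<u8>) -> u64 {
    current_runtime()
        .spawn_compute(move || {
            // CPU-heavy work runs on the rayon pool.
            data.iter().map(|&b| b as u64).sum()
        })
        .await
}
```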
## Performance

loom-rs is designed for zero unnecessary overhead:

- Thread pinning: One-time cost at runtime creation only
- Zero allocation after warmup: `spawn_compute()` uses per-type object pools
- Custom async-rayon bridge: Uses atomic wakers (~32 bytes) instead of channels (~80 bytes)
- Main thread is separate: Not part of worker pools
### spawn_compute Performance
| State | Allocations | Overhead |
|---|---|---|
| Pool hit | 0 bytes | ~100-500ns |
| Pool miss | ~32 bytes | ~100-500ns |
| First call per type | Pool + state | ~1µs |
Configure the pool size for high-concurrency workloads:

```rust
let runtime = LoomBuilder::new()
    .compute_pool_size(256) // example value; the default is 64
    .build()?;
```
## Patterns to Avoid

### 1. Nested spawn_compute (Deadlock Risk)

```rust
// BAD: Can deadlock if all rayon threads are waiting
runtime.spawn_compute(|| {
    // ...blocking on another spawn_compute() from inside the rayon pool
}).await;

// GOOD: Use install() for nested parallelism
runtime.spawn_compute(|| {
    // ...parallel iterators / rayon::join run inline here without re-spawning
}).await;
```
### 2. Blocking I/O in spawn_compute

```rust
// BAD: Blocks rayon thread
runtime.spawn_compute(|| {
    // ...std::fs reads or network calls here tie up a rayon worker
}).await;

// GOOD: I/O in async, compute in rayon
let data = tokio::fs::read_to_string("input.txt").await?; // illustrative path
runtime.spawn_compute(move || {
    // ...CPU-bound processing of `data`
}).await;
```
### 3. spawn_compute in Tight Loops

```rust
// OK (auto-pooling): Each call reuses pooled state
for item in items {
    runtime.spawn_compute(move || process(item)).await; // `process` is illustrative
}

// STILL BETTER for batch: Single cross-thread trip
let results = runtime.install(|| items.par_iter().map(process).collect::<Vec<_>>());
```
### 4. Holding Locks Across spawn_compute

```rust
// BAD: Lock held during async gap
let guard = mutex.lock();
runtime.spawn_compute(|| { /* ... */ }).await;

// GOOD: Clone data, release lock
let data = mutex.lock().clone();
runtime.spawn_compute(move || { /* work on the cloned `data` */ }).await;
```
### 5. install() Blocks the Thread

```rust
// CAUTION in async context: blocks tokio worker
runtime.spawn_async(async {
    // ...calling runtime.install(...) here blocks this tokio worker thread
}).await;

// BETTER: spawn_compute for async-safe bridge
runtime.spawn_async(async {
    // ...await runtime.spawn_compute(...) here; the tokio thread stays free
}).await;
```
## License
MIT