Binary element-wise operations for the CPU execution backend.
These operations are memory-bandwidth-bound: each output element needs only two reads, one write, and minimal arithmetic. Parallelism via Rayon is deliberately NOT used because the memory bus, not the CPU, is the bottleneck. The added thread wake-up overhead (~20µs) would hurt performance at all practical tensor sizes, and the single-threaded vectorized loop already saturates the memory bus on modern CPUs.
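As a rough illustration of the single-threaded loop described above, here is a minimal sketch. The names `binary_elementwise` and `add`, the `f32` element type, and the slice-based signatures are assumptions made for this example; they are not taken from `CpuBackend`'s actual API.

```rust
/// Hypothetical helper (not the crate's real API): applies `op` to each
/// pair of corresponding elements in `lhs` and `rhs`, writing into `out`.
pub fn binary_elementwise<F>(lhs: &[f32], rhs: &[f32], out: &mut [f32], op: F)
where
    F: Fn(f32, f32) -> f32,
{
    // Check lengths once, up front, instead of per element.
    assert_eq!(lhs.len(), rhs.len(), "input lengths must match");
    assert_eq!(lhs.len(), out.len(), "output length must match inputs");

    // Plain single-threaded loop over zipped slices. The iterator form
    // avoids indexing, which gives the compiler a chance to auto-vectorize.
    // Per element: two loads, one cheap operation, one store.
    for ((o, &a), &b) in out.iter_mut().zip(lhs).zip(rhs) {
        *o = op(a, b);
    }
}

/// Hypothetical convenience wrapper: element-wise addition into a new Vec.
pub fn add(lhs: &[f32], rhs: &[f32]) -> Vec<f32> {
    let mut out = vec![0.0; lhs.len()];
    binary_elementwise(lhs, rhs, &mut out, |a, b| a + b);
    out
}
```

Calling `add(&[1.0, 2.0], &[3.0, 4.0])` under these assumptions yields a new vector holding the pairwise sums. The loop does so little work per element that its speed is governed by how fast memory can be streamed, which is the reasoning given above for not dispatching to a thread pool.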
Structs§
- CpuBackend
- The execution driver for standard host CPU memory.