Module basic_optimized

Ultra-optimized SIMD operations with aggressive performance optimizations

This module provides highly optimized versions of core SIMD operations. They achieve 1.4x to 4.5x speedups over the standard implementations by combining the optimization techniques listed below.

§Optimization Techniques

  1. Multiple Accumulators (4-8): Eliminates dependency chains for instruction-level parallelism
  2. Aggressive Loop Unrolling: 4-8 way unrolling reduces loop overhead
  3. Pre-allocated Memory: Single allocation with unsafe set_len() eliminates reallocation
  4. Pointer Arithmetic: Direct memory access bypasses bounds checking
  5. Memory Prefetching: Hides memory latency with 256-512 byte prefetch distance
  6. Alignment Detection: Uses faster aligned loads/stores when possible
  7. FMA Instructions: Single-instruction multiply-add for dot products
  8. Compiler Hints: #[inline(always)] and #[target_feature] for maximum optimization
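
To make techniques 1 to 4 concrete, here is a minimal scalar sketch. It is not this module's implementation: the function names `dot_unrolled` and `add_preallocated` are hypothetical, and the code uses plain slices instead of SIMD intrinsics so that it compiles on any target. The first function shows four independent accumulators with 4-way unrolling; the second shows a single allocation that is filled through a raw pointer and then finalized with `set_len()`.

```rust
/// Dot product with four independent accumulators and 4-way unrolling.
/// Hypothetical helper for illustration; not part of this module's API.
pub fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");

    // Four accumulators: each lane's additions are independent of the
    // others, so the CPU can overlap them instead of waiting on one chain.
    let mut acc = [0.0f32; 4];
    let mut a_chunks = a.chunks_exact(4);
    let mut b_chunks = b.chunks_exact(4);

    // One loop iteration handles four elements (4-way unrolling).
    for (x, y) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        acc[0] += x[0] * y[0];
        acc[1] += x[1] * y[1];
        acc[2] += x[2] * y[2];
        acc[3] += x[3] * y[3];
    }

    // Combine the partial sums, then handle the 0 to 3 leftover elements.
    let mut total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (x, y) in a_chunks.remainder().iter().zip(b_chunks.remainder()) {
        total += x * y;
    }
    total
}

/// Addition into a pre-allocated buffer, finalized with `set_len()`.
/// Hypothetical helper for illustration; not part of this module's API.
pub fn add_preallocated(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let n = a.len();

    // Single allocation up front; the Vec never reallocates afterwards.
    let mut out: Vec<f32> = Vec::with_capacity(n);
    let dst = out.as_mut_ptr();

    for i in 0..n {
        // SAFETY: i < n, both inputs have length n, and `out` has capacity n,
        // so every read and write stays in bounds.
        unsafe {
            dst.add(i).write(*a.get_unchecked(i) + *b.get_unchecked(i));
        }
    }

    // SAFETY: all n elements were initialized by the loop above.
    unsafe { out.set_len(n) };
    out
}
```

Because the accumulators are summed in a different order than a straightforward loop, the floating-point result can differ slightly from a sequential sum; that trade-off is inherent to the multiple-accumulator technique.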

§Performance Benchmarks (macOS ARM64)

Operation        Size     Speedup   Improvement
Addition         10,000   3.38x     238.2%
Multiplication   10,000   3.01x     201.2%
Dot Product      10,000   3.93x     292.9%
Sum Reduction    10,000   4.04x     304.1%
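
The two rightmost columns appear to be related by improvement = (speedup - 1) × 100. That relation is inferred from the figures, not stated by the module; the small differences (for example 238.2% against 238.0%) would be explained by the speedup being rounded to two decimals. A minimal sketch, with `improvement_percent` as a hypothetical helper name:

```rust
/// Converts a speedup factor (e.g. 3.38 for "3.38x") into a percentage
/// improvement over the baseline. Hypothetical helper for illustration.
pub fn improvement_percent(speedup: f64) -> f64 {
    (speedup - 1.0) * 100.0
}
```

For the addition benchmark, `improvement_percent(3.38)` gives roughly 238, in line with the reported 238.2%.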

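§Available Functions

  • simd_add_f32_ultra_optimized: addition for f32
  • simd_mul_f32_ultra_optimized: multiplication for f32
  • simd_dot_f32_ultra_optimized: dot product for f32
  • simd_sum_f32_ultra_optimized: sum reduction for f32
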
§Architecture Support

  • x86_64: AVX-512, AVX2, SSE2 with runtime detection
  • aarch64: NEON
  • Fallback: Optimized scalar code for other architectures
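
Runtime detection of this kind is commonly written with the standard library's feature-detection macros. The sketch below only illustrates the selection order and is an assumption about how such dispatch could look, not a copy of this module's code; `select_backend` and the returned labels are hypothetical.

```rust
/// Returns a label for the widest instruction set available at runtime.
/// Hypothetical helper for illustration; not part of this module's API.
pub fn select_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Check the widest instruction set first, then fall back.
        if std::arch::is_x86_feature_detected!("avx512f") {
            return "avx512";
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return "sse2";
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }

    // Any other architecture, or no SIMD feature detected.
    "scalar"
}
```

A real dispatcher would call a kernel compiled with the matching `#[target_feature]` attribute instead of returning a string; the label keeps the example portable.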

§When to Use

Use these ultra-optimized functions for:

  • Large arrays (>1000 elements) where performance is critical
  • Hot paths in numerical computing
  • Batch processing operations

For small arrays (<100 elements), standard SIMD functions may be more appropriate due to lower overhead.
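
The size guidance above can be captured in a small dispatch helper. This is a sketch under the thresholds quoted in this section (100 and 1000 elements); `KernelChoice` and `choose_kernel` are hypothetical names, and the middle range is left as "measure" because the text makes no recommendation for it.

```rust
/// Which implementation to prefer for a given input length.
/// Hypothetical type for illustration; not part of this module's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelChoice {
    /// Fewer than 100 elements: the standard SIMD functions may have lower overhead.
    Standard,
    /// 100 to 1000 elements: no guidance given, so benchmark both.
    Measure,
    /// More than 1000 elements: the ultra-optimized functions are recommended.
    UltraOptimized,
}

/// Maps an array length to a suggested kernel using the documented thresholds.
pub fn choose_kernel(len: usize) -> KernelChoice {
    if len > 1000 {
        KernelChoice::UltraOptimized
    } else if len < 100 {
        KernelChoice::Standard
    } else {
        KernelChoice::Measure
    }
}
```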

§Example

use scirs2_core::ndarray::Array1;
use scirs2_core::simd::simd_add_f32_ultra_optimized;

let a = Array1::from_elem(10000, 2.0f32);
let b = Array1::from_elem(10000, 3.0f32);

// About 3.38x faster than the standard implementation for 10K elements
// (macOS ARM64 benchmark; results vary by hardware)
let result = simd_add_f32_ultra_optimized(&a.view(), &b.view());

Functions§

simd_add_f32_ultra_optimized
Ultra-optimized SIMD addition for f32 with aggressive optimizations
simd_dot_f32_ultra_optimized
Ultra-optimized SIMD dot product for f32 with aggressive optimizations
simd_mul_f32_ultra_optimized
Ultra-optimized SIMD multiplication for f32 with aggressive optimizations
simd_sum_f32_ultra_optimized
Ultra-optimized SIMD sum reduction for f32 with aggressive optimizations