Module simd

SIMD-accelerated operations for SciRS2

This module provides highly optimized SIMD implementations of numerical operations, organized into focused sub-modules for maintainability:

§Module Organization

§Foundation (Layer 1)

  • traits: Core SIMD trait definitions
  • detect: CPU feature detection and capability management

§Core Operations (Layer 2)

  • basic: Basic arithmetic (add, min, max)
  • arithmetic: Advanced arithmetic (mul, div, sub, scalar ops)
  • dot: Dot product and FMA operations

§Reductions & Statistics (Layer 3)

  • reductions: Statistical reductions (sum, mean, variance, std, min, max)

§Vector Computations (Layer 4)

  • norms: Vector norms (L1, L2, Linf)
  • distances: Distance metrics (Euclidean, Manhattan, Chebyshev)
  • similarity: Similarity metrics (cosine)
  • weighted: Weighted operations

§Specialized Operations (Layer 5)

  • indexing: Indexing operations (argmin, argmax, clip)
  • activation: Activation functions (ReLU, softmax, log_sum_exp)
  • cumulative: Cumulative operations (cumsum, cumprod, diff)
  • normalization: Batch/layer normalization (Phase 79)
  • preprocessing: Data preprocessing (normalize, standardize)
  • rounding: Rounding operations (floor, ceil, round, trunc)
  • transcendental: Transcendental functions (exp, sin, cos, ln, activations) (Phases 75-78)
  • transpose: Cache-optimized blocked transpose
  • unary: Unary operations (abs, sqrt, sign)
  • unary_powi: Integer exponentiation

§Performance

The SIMD implementations in this module achieve significant speedups over scalar code:

  • Overall: 32.48x average speedup vs NumPy
  • Preprocessing: 2.81x average (clip: 1.58x-3.16x faster than NumPy)
  • Reductions: 470.03x average
  • Element-wise: 1.47x average

§Architecture Support

  • x86_64: AVX-512, AVX2, SSE2 with runtime detection
  • aarch64: NEON with runtime detection
  • Fallback: Scalar implementations for unsupported architectures

Re-exports§

pub use detect::detect_simd_capabilities;
pub use detect::get_cpu_features;
pub use detect::CpuFeatures;
pub use detect::SimdCapabilities;
pub use traits::SimdOps;
pub use basic::simd_add_aligned_ultra;
pub use basic::simd_add_f32;
pub use basic::simd_add_f32_fast;
pub use basic::simd_add_f32_optimized;
pub use basic::simd_add_f32_ultra;
pub use basic::simd_add_f64;
pub use basic::simd_maximum_f32;
pub use basic::simd_maximum_f64;
pub use basic::simd_minimum_f32;
pub use basic::simd_minimum_f64;
pub use basic_optimized::simd_add_f32_ultra_optimized;
pub use basic_optimized::simd_dot_f32_ultra_optimized;
pub use basic_optimized::simd_mul_f32_ultra_optimized;
pub use basic_optimized::simd_sum_f32_ultra_optimized;
pub use arithmetic::simd_scalar_mul_f32;
pub use arithmetic::simd_scalar_mul_f64;
pub use dot::simd_div_f32;
pub use dot::simd_div_f64;
pub use dot::simd_dot_f32;
pub use dot::simd_dot_f32_adaptive;
pub use dot::simd_dot_f32_ultra;
pub use dot::simd_dot_f64;
pub use dot::simd_fma_f32_ultra;
pub use dot::simd_mul_f32;
pub use dot::simd_mul_f32_fast;
pub use dot::simd_mul_f64;
pub use dot::simd_sub_f32;
pub use dot::simd_sub_f64;
pub use reductions::simd_max_f32;
pub use reductions::simd_max_f64;
pub use reductions::simd_mean_f32;
pub use reductions::simd_mean_f64;
pub use reductions::simd_min_f32;
pub use reductions::simd_min_f64;
pub use reductions::simd_std_f32;
pub use reductions::simd_std_f64;
pub use reductions::simd_sum_f32;
pub use reductions::simd_sum_f64;
pub use reductions::simd_variance_f32;
pub use reductions::simd_variance_f64;
pub use norms::simd_norm_l1_f32;
pub use norms::simd_norm_l1_f64;
pub use norms::simd_norm_l2_f32;
pub use norms::simd_norm_l2_f64;
pub use norms::simd_norm_linf_f32;
pub use norms::simd_norm_linf_f64;
pub use distances::simd_distance_chebyshev_f32;
pub use distances::simd_distance_chebyshev_f64;
pub use distances::simd_distance_euclidean_f32;
pub use distances::simd_distance_euclidean_f64;
pub use distances::simd_distance_manhattan_f32;
pub use distances::simd_distance_manhattan_f64;
pub use distances::simd_distance_squared_euclidean_f32;
pub use distances::simd_distance_squared_euclidean_f64;
pub use similarity::simd_cosine_similarity_f32;
pub use similarity::simd_cosine_similarity_f64;
pub use similarity::simd_distance_cosine_f32;
pub use similarity::simd_distance_cosine_f64;
pub use weighted::simd_weighted_mean_f32;
pub use weighted::simd_weighted_mean_f64;
pub use weighted::simd_weighted_sum_f32;
pub use weighted::simd_weighted_sum_f64;
pub use preprocessing::simd_normalize_f32;
pub use preprocessing::simd_normalize_f64;
pub use preprocessing::simd_standardize_f32;
pub use preprocessing::simd_standardize_f64;
pub use indexing::simd_argmax_f32;
pub use indexing::simd_argmax_f64;
pub use indexing::simd_argmin_f32;
pub use indexing::simd_argmin_f64;
pub use indexing::simd_clip_f32;
pub use indexing::simd_clip_f64;
pub use activation::simd_leaky_relu_f32;
pub use activation::simd_leaky_relu_f64;
pub use activation::simd_log_sum_exp_f32;
pub use activation::simd_log_sum_exp_f64;
pub use activation::simd_relu_f32;
pub use activation::simd_relu_f64;
pub use activation::simd_softmax_f32;
pub use activation::simd_softmax_f64;
pub use cumulative::simd_cumprod_f32;
pub use cumulative::simd_cumprod_f64;
pub use cumulative::simd_cumsum_f32;
pub use cumulative::simd_cumsum_f64;
pub use cumulative::simd_diff_f32;
pub use cumulative::simd_diff_f64;
pub use unary::simd_abs_f32;
pub use unary::simd_abs_f64;
pub use unary::simd_sign_f32;
pub use unary::simd_sign_f64;
pub use unary::simd_sqrt_f32;
pub use unary::simd_sqrt_f64;
pub use unary_powi::simd_powi_f32;
pub use unary_powi::simd_powi_f64;
pub use transpose::simd_transpose_blocked_f32;
pub use transpose::simd_transpose_blocked_f64;
pub use rounding::simd_ceil_f32;
pub use rounding::simd_ceil_f64;
pub use rounding::simd_floor_f32;
pub use rounding::simd_floor_f64;
pub use rounding::simd_round_f32;
pub use rounding::simd_round_f64;
pub use rounding::simd_trunc_f32;
pub use rounding::simd_trunc_f64;
pub use transcendental::simd_cos_f32;
pub use transcendental::simd_cos_f64;
pub use transcendental::simd_exp_f32;
pub use transcendental::simd_exp_f64;
pub use transcendental::simd_exp_fast_f32;
pub use transcendental::simd_gelu_f32;
pub use transcendental::simd_gelu_f64;
pub use transcendental::simd_ln_f32;
pub use transcendental::simd_ln_f64;
pub use transcendental::simd_log10_f32;
pub use transcendental::simd_log10_f64;
pub use transcendental::simd_log2_f32;
pub use transcendental::simd_log2_f64;
pub use transcendental::simd_mish_f32;
pub use transcendental::simd_mish_f64;
pub use transcendental::simd_sigmoid_f32;
pub use transcendental::simd_sigmoid_f64;
pub use transcendental::simd_sin_f32;
pub use transcendental::simd_sin_f64;
pub use transcendental::simd_softplus_f32;
pub use transcendental::simd_softplus_f64;
pub use transcendental::simd_swish_f32;
pub use transcendental::simd_swish_f64;
pub use transcendental::simd_tanh_f32;
pub use transcendental::simd_tanh_f64;
pub use normalization::simd_batch_norm_f32;
pub use normalization::simd_batch_norm_f64;
pub use normalization::simd_layer_norm_f32;
pub use normalization::simd_layer_norm_f64;

Modules§

activation
Activation functions with SIMD acceleration
arithmetic
Arithmetic operations with SIMD acceleration
basic
Basic arithmetic operations with SIMD acceleration
basic_optimized
Ultra-optimized SIMD operations with aggressive performance optimizations
cumulative
Cumulative operations with SIMD acceleration
detect
CPU feature detection and SIMD capability management
distances
Distance metric operations with SIMD acceleration
dot
Dot product and FMA operations with SIMD acceleration
indexing
Indexing operations with SIMD acceleration
normalization
SIMD-accelerated normalization operations for neural networks
norms
Vector norm operations with SIMD acceleration
preprocessing
Data preprocessing operations with SIMD acceleration
reductions
Statistical reduction operations with SIMD acceleration
rounding
SIMD-accelerated rounding operations
similarity
Similarity metric operations with SIMD acceleration
traits
SIMD trait definitions for type constraints
transcendental
SIMD-accelerated transcendental functions
transpose
Cache-optimized blocked matrix transpose
unary
Unary operations with SIMD acceleration
unary_powi
Integer exponentiation with SIMD acceleration (Phase 25)
weighted
Weighted operations with SIMD acceleration

Functions§

simd_adaptive_add_f32
Adaptive SIMD operation selector using unified interface
simd_adaptive_add_f64
Adaptive SIMD operation selector for f64 using unified interface
simd_add_auto
Automatically select the best SIMD operation based on detected capabilities
simd_add_cache_optimized_f32
Cache-optimized SIMD addition for f32 using unified interface
simd_add_cache_optimized_f64
Cache-optimized SIMD addition for f64 using unified interface
simd_add_f32_adaptive
Adaptive addition selector
simd_binary_op
Apply element-wise operation on arrays using unified SIMD operations
simd_fma_advanced_optimized_f32
Advanced-optimized fused multiply-add for f32 using unified interface
simd_fma_advanced_optimized_f64
Advanced-optimized fused multiply-add for f64 using unified interface
simd_fused_multiply_add_f32
Fused multiply-add for f32 arrays using unified interface
simd_fused_multiply_add_f64
Fused multiply-add for f64 arrays using unified interface
simd_gemv_cache_optimized_f32
Cache-aware matrix-vector multiplication (GEMV) using unified interface
simd_mul_f32_adaptive
Adaptive multiplication with runtime kernel selection (legacy)
simd_mul_f32_avx512
AVX-512 multiplication (16 f32 lanes per instruction)
simd_mul_f32_bandwidth_saturated
Memory-bandwidth-oriented multiplication for large arrays
simd_mul_f32_blazing
Streamlined multiplication maximizing instruction-level parallelism (ILP)
simd_mul_f32_branchfree
Branch-free multiplication with no conditional branches in hot paths
simd_mul_f32_cache_optimized
Cache-line-aware multiplication
simd_mul_f32_cacheline
Cache-line-aware multiplication with non-temporal stores. Processes exactly 64 bytes (16 f32) at a time for optimal cache usage, and uses non-temporal stores to bypass the cache for streaming workloads.
simd_mul_f32_hyperoptimized
Hyperoptimized multiplication variant
simd_mul_f32_lightweight
Minimal-overhead SIMD multiplication
simd_mul_f32_pipelined
Software-pipelined multiplication with register blocking. Overlaps memory loads with computation using multiple accumulators, utilizing all 16 YMM registers for maximum throughput.
simd_mul_f32_tlb_optimized
TLB-optimized memory access patterns. Processes data in 2 MB chunks to minimize TLB misses, with huge-page-aware iteration.
simd_mul_f32_ultimate
Adaptive SIMD multiplication with runtime performance-based kernel selection
simd_mul_f32_ultra
High-performance SIMD multiplication with prefetching (Phase 3)