Expand description
SIMD-accelerated operations for SciRS2
This module provides highly optimized SIMD implementations for numerical operations. The module is organized into focused sub-modules for better maintainability:
§Module Organization
§Foundation (Layer 1)
§Core Operations (Layer 2)
basic: Basic arithmetic (add, min, max)arithmetic: Advanced arithmetic (mul, div, sub, scalar ops)dot: Dot product and FMA operations
§Reductions & Statistics (Layer 3)
reductions: Statistical reductions (sum, mean, variance, std, min, max)
§Vector Computations (Layer 4)
norms: Vector norms (L1, L2, Linf)distances: Distance metrics (Euclidean, Manhattan, Chebyshev)similarity: Similarity metrics (cosine)weighted: Weighted operations
§Specialized Operations (Layer 5)
indexing: Indexing operations (argmin, argmax, clip)activation: Activation functions (ReLU, softmax, log_sum_exp)cumulative: Cumulative operations (cumsum, cumprod, diff)normalization: Batch/layer normalization (Phase 79)preprocessing: Data preprocessing (normalize, standardize)rounding: Rounding operations (floor, ceil, round, trunc)transcendental: Transcendental functions (exp, sin, cos, ln, activations) (Phases 75-78)transpose: Cache-optimized blocked transposeunary: Unary operations (abs, sqrt, sign)unary_powi: Integer exponentiation
§Performance
The SIMD implementations in this module achieve significant speedups over scalar code:
- Overall: 32.48x average speedup vs NumPy
- Preprocessing: 2.81x average (clip: 1.58x-3.16x faster than NumPy!)
- Reductions: 470.03x average
- Element-wise: 1.47x average
§Architecture Support
- x86_64: AVX-512, AVX2, SSE2 with runtime detection
- aarch64: NEON with runtime detection
- Fallback: Scalar implementations for unsupported architectures
Re-exports§
pub use detect::detect_simd_capabilities;pub use detect::get_cpu_features;pub use detect::CpuFeatures;pub use detect::SimdCapabilities;pub use traits::SimdOps;pub use basic::simd_add_aligned_ultra;pub use basic::simd_add_f32;pub use basic::simd_add_f32_fast;pub use basic::simd_add_f32_optimized;pub use basic::simd_add_f32_ultra;pub use basic::simd_add_f64;pub use basic::simd_maximum_f32;pub use basic::simd_maximum_f64;pub use basic::simd_minimum_f32;pub use basic::simd_minimum_f64;pub use basic_optimized::simd_add_f32_ultra_optimized;pub use basic_optimized::simd_dot_f32_ultra_optimized;pub use basic_optimized::simd_mul_f32_ultra_optimized;pub use basic_optimized::simd_sum_f32_ultra_optimized;pub use arithmetic::simd_scalar_mul_f32;pub use arithmetic::simd_scalar_mul_f64;pub use dot::simd_div_f32;pub use dot::simd_div_f64;pub use dot::simd_dot_f32;pub use dot::simd_dot_f32_adaptive;pub use dot::simd_dot_f32_ultra;pub use dot::simd_dot_f64;pub use dot::simd_fma_f32_ultra;pub use dot::simd_mul_f32;pub use dot::simd_mul_f32_fast;pub use dot::simd_mul_f64;pub use dot::simd_sub_f32;pub use dot::simd_sub_f64;pub use reductions::simd_max_f32;pub use reductions::simd_max_f64;pub use reductions::simd_mean_f32;pub use reductions::simd_mean_f64;pub use reductions::simd_min_f32;pub use reductions::simd_min_f64;pub use reductions::simd_std_f32;pub use reductions::simd_std_f64;pub use reductions::simd_sum_f32;pub use reductions::simd_sum_f64;pub use reductions::simd_variance_f32;pub use reductions::simd_variance_f64;pub use norms::simd_norm_l1_f32;pub use norms::simd_norm_l1_f64;pub use norms::simd_norm_l2_f32;pub use norms::simd_norm_l2_f64;pub use norms::simd_norm_linf_f32;pub use norms::simd_norm_linf_f64;pub use distances::simd_distance_chebyshev_f32;pub use distances::simd_distance_chebyshev_f64;pub use distances::simd_distance_euclidean_f32;pub use distances::simd_distance_euclidean_f64;pub use distances::simd_distance_manhattan_f32;pub use distances::simd_distance_manhattan_f64;pub use distances::simd_distance_squared_euclidean_f32;pub use distances::simd_distance_squared_euclidean_f64;pub use similarity::simd_cosine_similarity_f32;pub use similarity::simd_cosine_similarity_f64;pub use similarity::simd_distance_cosine_f32;pub use similarity::simd_distance_cosine_f64;pub use weighted::simd_weighted_mean_f32;pub use weighted::simd_weighted_mean_f64;pub use weighted::simd_weighted_sum_f32;pub use weighted::simd_weighted_sum_f64;pub use preprocessing::simd_normalize_f32;pub use preprocessing::simd_normalize_f64;pub use preprocessing::simd_standardize_f32;pub use preprocessing::simd_standardize_f64;pub use indexing::simd_argmax_f32;pub use indexing::simd_argmax_f64;pub use indexing::simd_argmin_f32;pub use indexing::simd_argmin_f64;pub use indexing::simd_clip_f32;pub use indexing::simd_clip_f64;pub use activation::simd_leaky_relu_f32;pub use activation::simd_leaky_relu_f64;pub use activation::simd_log_sum_exp_f32;pub use activation::simd_log_sum_exp_f64;pub use activation::simd_relu_f32;pub use activation::simd_relu_f64;pub use activation::simd_softmax_f32;pub use activation::simd_softmax_f64;pub use cumulative::simd_cumprod_f32;pub use cumulative::simd_cumprod_f64;pub use cumulative::simd_cumsum_f32;pub use cumulative::simd_cumsum_f64;pub use cumulative::simd_diff_f32;pub use cumulative::simd_diff_f64;pub use unary::simd_abs_f32;pub use unary::simd_abs_f64;pub use unary::simd_sign_f32;pub use unary::simd_sign_f64;pub use unary::simd_sqrt_f32;pub use unary::simd_sqrt_f64;pub use unary_powi::simd_powi_f32;pub use unary_powi::simd_powi_f64;pub use transpose::simd_transpose_blocked_f32;pub use transpose::simd_transpose_blocked_f64;pub use rounding::simd_ceil_f32;pub use rounding::simd_ceil_f64;pub use rounding::simd_floor_f32;pub use rounding::simd_floor_f64;pub use rounding::simd_round_f32;pub use rounding::simd_round_f64;pub use rounding::simd_trunc_f32;pub use rounding::simd_trunc_f64;pub use transcendental::simd_cos_f32;pub use transcendental::simd_cos_f64;pub use transcendental::simd_exp_f32;pub use transcendental::simd_exp_f64;pub use transcendental::simd_exp_fast_f32;pub use transcendental::simd_gelu_f32;pub use transcendental::simd_gelu_f64;pub use transcendental::simd_ln_f32;pub use transcendental::simd_ln_f64;pub use transcendental::simd_log10_f32;pub use transcendental::simd_log10_f64;pub use transcendental::simd_log2_f32;pub use transcendental::simd_log2_f64;pub use transcendental::simd_mish_f32;pub use transcendental::simd_mish_f64;pub use transcendental::simd_sigmoid_f32;pub use transcendental::simd_sigmoid_f64;pub use transcendental::simd_sin_f32;pub use transcendental::simd_sin_f64;pub use transcendental::simd_softplus_f32;pub use transcendental::simd_softplus_f64;pub use transcendental::simd_swish_f32;pub use transcendental::simd_swish_f64;pub use transcendental::simd_tanh_f32;pub use transcendental::simd_tanh_f64;pub use normalization::simd_batch_norm_f32;pub use normalization::simd_batch_norm_f64;pub use normalization::simd_layer_norm_f32;pub use normalization::simd_layer_norm_f64;
Modules§
- activation
- Activation functions with SIMD acceleration
- arithmetic
- Arithmetic operations with SIMD acceleration
- basic
- Basic arithmetic operations with SIMD acceleration
- basic_
optimized - Ultra-optimized SIMD operations with aggressive performance optimizations
- cumulative
- Cumulative operations with SIMD acceleration
- detect
- CPU feature detection and SIMD capability management
- distances
- Distance metric operations with SIMD acceleration
- dot
- Dot product and FMA operations with SIMD acceleration
- indexing
- Indexing operations with SIMD acceleration
- normalization
- SIMD-accelerated normalization operations for neural networks
- norms
- Vector norm operations with SIMD acceleration
- preprocessing
- Data preprocessing operations with SIMD acceleration
- reductions
- Statistical reduction operations with SIMD acceleration
- rounding
- SIMD-accelerated rounding operations
- similarity
- Similarity metric operations with SIMD acceleration
- traits
- SIMD trait definitions for type constraints
- transcendental
- SIMD-accelerated transcendental functions
- transpose
- Cache-optimized blocked matrix transpose
- unary
- Unary operations with SIMD acceleration
- unary_
powi - Integer exponentiation with SIMD acceleration (Phase 25)
- weighted
- Weighted operations with SIMD acceleration
Functions§
- simd_
adaptive_ add_ f32 - Adaptive SIMD operation selector using unified interface
- simd_
adaptive_ add_ f64 - Adaptive SIMD operation selector for f64 using unified interface
- simd_
add_ auto - Automatically select the best SIMD operation based on detected capabilities
- simd_
add_ cache_ optimized_ f32 - Cache-optimized SIMD addition for f32 using unified interface
- simd_
add_ cache_ optimized_ f64 - Cache-optimized SIMD addition for f64 using unified interface
- simd_
add_ f32_ adaptive - Adaptive addition selector
- simd_
binary_ op - Apply element-wise operation on arrays using unified SIMD operations
- simd_
fma_ advanced_ optimized_ f32 - Advanced-optimized fused multiply-add for f32 using unified interface
- simd_
fma_ advanced_ optimized_ f64 - Advanced-optimized fused multiply-add for f64 using unified interface
- simd_
fused_ multiply_ add_ f32 - Fused multiply-add for f32 arrays using unified interface
- simd_
fused_ multiply_ add_ f64 - Fused multiply-add for f64 arrays using unified interface
- simd_
gemv_ cache_ optimized_ f32 - Cache-aware matrix-vector multiplication (GEMV) using unified interface
- simd_
mul_ f32_ adaptive - ADAPTIVE: Intelligent multiplication with optimal selection (legacy)
- simd_
mul_ f32_ avx512 - CUTTING-EDGE: AVX-512 ultra-high-speed multiplication (16 f32 per instruction)
- simd_
mul_ f32_ bandwidth_ saturated - BANDWIDTH-SATURATED: Memory bandwidth optimization for large arrays
- simd_
mul_ f32_ blazing - ULTRA: Streamlined ultra-fast multiplication with maximum ILP
- simd_
mul_ f32_ branchfree - BRANCH-FREE: Elimination of all conditional branches in hot paths
- simd_
mul_ f32_ cache_ optimized - CACHE-OPTIMIZED: Cache-line aware ultra-fast multiplication
- simd_
mul_ f32_ cacheline - ULTRA-OPTIMIZED: Cache-line aware with non-temporal stores Processes exactly 64 bytes (16 floats) at a time for optimal cache usage Uses non-temporal stores to bypass cache for streaming workloads
- simd_
mul_ f32_ hyperoptimized - Hyperoptimized multiplication variant
- simd_
mul_ f32_ lightweight - LIGHTWEIGHT: Minimal overhead SIMD multiplication
- simd_
mul_ f32_ pipelined - ULTRA-OPTIMIZED: Software pipelined with register blocking Overlaps memory loads with computation using multiple accumulators Utilizes all 16 YMM registers for maximum throughput
- simd_
mul_ f32_ tlb_ optimized - ULTRA-OPTIMIZED: TLB-optimized memory access patterns Processes data in 2MB chunks to minimize TLB misses Uses huge page-aware iteration for maximum efficiency
- simd_
mul_ f32_ ultimate - ULTIMATE: Next-generation adaptive SIMD with breakthrough performance selection
- simd_
mul_ f32_ ultra - PHASE 3: High-performance SIMD multiplication with prefetching