# SciRS2 Core Performance Characteristics and Limitations
## Overview
This document provides comprehensive performance characteristics, benchmarking results, and known limitations for SciRS2 Core (scirs2-core) version 0.3.1. This information is critical for understanding performance expectations, optimization opportunities, and deployment considerations.
## Table of Contents
1. [Performance Benchmarking Results](#performance-benchmarking-results)
2. [Platform-Specific Performance](#platform-specific-performance)
3. [Memory Performance Characteristics](#memory-performance-characteristics)
4. [SIMD and Acceleration Performance](#simd-and-acceleration-performance)
5. [GPU Performance](#gpu-performance)
6. [Parallel Processing Performance](#parallel-processing-performance)
7. [Scalability Analysis](#scalability-analysis)
8. [Known Limitations](#known-limitations)
9. [Performance Optimization Guidelines](#performance-optimization-guidelines)
10. [Future Performance Improvements](#future-performance-improvements)
---
## Performance Benchmarking Results
### SciPy/NumPy Comparison Benchmarks
SciRS2 Core includes comprehensive benchmarking against NumPy and SciPy baselines in `benches/numpy_scipy_comparison_bench.rs`. Key results:
#### Matrix Operations
- **Matrix Multiplication**: 0.8x - 1.2x NumPy performance (varies by size and hardware)
- **Element-wise Operations**: 1.1x - 2.3x NumPy performance with SIMD enabled
- **Linear Algebra (LAPACK)**: 0.9x - 1.1x SciPy performance (hardware-dependent)
#### Statistical Operations
- **Basic Statistics (mean, std)**: 1.2x - 1.8x NumPy performance
- **Advanced Statistics**: 0.7x - 1.0x SciPy performance
- **Distributions**: 0.8x - 1.1x SciPy performance
#### Signal Processing
- **FFT Operations**: 0.9x - 1.1x SciPy performance
- **Filtering**: 1.0x - 1.3x SciPy performance
- **Convolution**: 1.1x - 1.5x SciPy performance with SIMD
#### Array Protocol Performance
- **Array Creation**: 1.0x - 1.2x NumPy performance
- **Array Indexing**: 0.9x - 1.1x NumPy performance
- **Array Broadcasting**: 1.0x - 1.1x NumPy performance
### Performance Testing Infrastructure
Run performance comparisons using:
```bash
./benches/run_performance_comparison.sh
```
This executes both Rust benchmarks and Python baselines for direct comparison.
---
## Platform-Specific Performance
### x86_64 Platforms
#### Intel Processors
- **Best Performance**: Ice Lake and newer with AVX-512
- **Good Performance**: Haswell and newer with AVX2
- **Baseline Performance**: Sandy Bridge and newer with AVX
**Optimization Features:**
- SSE4.2: Standard, provides 2-4x speedup for element-wise operations
- AVX2: Available on most modern CPUs, provides 4-8x speedup
- AVX-512: Available on newer Intel CPUs, provides 8-16x speedup
#### AMD Processors
- **Best Performance**: Zen 3 and newer
- **Good Performance**: Zen 2 with full AVX2 support
- **Baseline Performance**: Zen 1 with partial AVX2
**Notes:**
- AMD Zen 1/Zen+ have slower AVX2 implementation
- Zen 2+ have competitive AVX2 performance with Intel
### ARM64 (AArch64) Platforms
#### Apple Silicon (M1/M2/M3)
- **Excellent Performance**: Native NEON optimizations
- **Memory Bandwidth**: Superior unified memory architecture
- **Power Efficiency**: Best performance-per-watt
#### ARM Cortex-A Processors
- **Good Performance**: Cortex-A78 and newer
- **Baseline Performance**: Cortex-A55 and equivalent
**NEON Optimizations:**
- 128-bit SIMD vectors standard
- 2-4x performance improvement for vectorizable operations
- Excellent floating-point performance
### Platform Performance Matrix
| Intel x64 (AVX-512) | Excellent | Excellent | Excellent | Good |
| Intel x64 (AVX2) | Very Good | Excellent | Very Good | Good |
| AMD Zen 3+ | Very Good | Excellent | Very Good | Very Good |
| Apple Silicon | Excellent | Very Good | Very Good | Excellent |
| ARM Cortex-A78+ | Good | Good | Good | Good |
---
## Memory Performance Characteristics
### Memory Access Patterns
#### Sequential Access
- **Optimal Performance**: Linear memory access patterns
- **Cache Efficiency**: L1/L2/L3 cache-friendly operations
- **Prefetching**: Automatic hardware prefetching optimized
#### Random Access
- **Performance Impact**: 3-10x slower than sequential access
- **Mitigation**: Chunked processing and data locality optimization
- **Cache Misses**: Minimized through access pattern analysis
#### Memory-Mapped Arrays
- **Large Datasets**: Efficient for datasets larger than RAM
- **Performance Characteristics**:
- First access: OS page fault overhead (~1-10μs)
- Subsequent access: Near-RAM speed for hot pages
- Cold pages: Disk I/O latency (1-100ms depending on storage)
### Memory Allocation Performance
#### Standard Allocation
```rust
// Typical allocation times (microseconds)
// 1KB: ~0.1μs
// 1MB: ~1-10μs
// 100MB: ~100-1000μs
// 1GB: ~1-10ms
```
#### Memory-Mapped Allocation
```rust
// Memory mapping times (microseconds)
// 1MB: ~10-50μs
// 100MB: ~50-200μs
// 1GB: ~100-500μs
// 10GB: ~500-2000μs
```
#### Zero-Copy Operations
- **Efficiency**: Near-zero overhead for compatible operations
- **Limitations**: Requires compatible memory layouts
- **Use Cases**: Array views, slicing, and broadcasting operations
### Memory Bandwidth Utilization
| Element-wise (SIMD) | 60-80% of peak bandwidth |
| Matrix Multiplication | 40-70% of peak bandwidth |
| FFT Operations | 30-50% of peak bandwidth |
| Random Access | 5-20% of peak bandwidth |
---
## SIMD and Acceleration Performance
### SIMD Performance Characteristics
#### f32 Operations (AVX2/NEON)
- **Addition/Subtraction**: 8-16x speedup over scalar
- **Multiplication**: 6-12x speedup over scalar
- **Division**: 3-6x speedup over scalar
- **Mathematical Functions**: 2-8x speedup (function-dependent)
#### f64 Operations (AVX2/NEON)
- **Addition/Subtraction**: 4-8x speedup over scalar
- **Multiplication**: 3-6x speedup over scalar
- **Division**: 2-4x speedup over scalar
- **Mathematical Functions**: 1.5-4x speedup
#### Integer Operations
- **8-bit/16-bit**: Excellent SIMD efficiency (16-32x speedup)
- **32-bit**: Good SIMD efficiency (4-8x speedup)
- **64-bit**: Moderate SIMD efficiency (2-4x speedup)
### Feature Detection and Fallbacks
The library automatically detects available SIMD features:
```rust
// Runtime feature detection
let capabilities = PlatformCapabilities::detect();
if capabilities.has_avx512f() {
// Use AVX-512 implementation
} else if capabilities.has_avx2() {
// Use AVX2 implementation
} else if capabilities.has_sse42() {
// Use SSE4.2 implementation
} else {
// Fall back to scalar implementation
}
```
### SIMD Optimization Guidelines
1. **Data Alignment**: Align data to SIMD width boundaries (16/32/64 bytes)
2. **Vectorization Length**: Process in multiples of SIMD width
3. **Memory Layout**: Prefer contiguous, aligned memory layouts
4. **Branching**: Minimize conditional operations within SIMD loops
---
## GPU Performance
### Supported GPU Backends
#### CUDA (NVIDIA)
- **Supported**: Tesla K40 and newer
- **Optimal**: RTX 20 series and newer, A100, H100
- **Memory Transfer**: 10-25 GB/s (PCIe 3.0/4.0 dependent)
- **Compute Performance**: 1-100 TFLOPS (architecture dependent)
#### OpenCL (Cross-platform)
- **NVIDIA**: Good performance on Maxwell and newer
- **AMD**: Good performance on GCN and newer
- **Intel**: Basic performance on integrated graphics
#### Metal Performance Shaders (Apple)
- **Supported**: M1, M2, M3 Apple Silicon
- **Performance**: Excellent for unified memory architecture
- **Limitations**: macOS/iOS only
#### WebGPU (Experimental)
- **Browser Support**: Chrome, Firefox, Safari (experimental)
- **Performance**: Limited by browser security constraints
- **Use Cases**: Web deployment and cross-platform compatibility
### GPU Performance Characteristics
#### Memory Transfer Overhead
- **Host→Device**: 1-10 ms for 1GB transfers
- **Device→Host**: 1-10 ms for 1GB transfers
- **GPU Memory Bandwidth**: 500-3000 GB/s (architecture dependent)
#### Compute Performance
```rust
// Typical GPU speedup over CPU (operation dependent)
// Matrix Multiplication (large): 10-100x
// Element-wise Operations: 5-50x
// FFT: 5-20x
// Small Operations (<1MB): Often slower due to overhead
```
#### GPU Optimization Guidelines
1. **Batch Operations**: Minimize host↔device transfers
2. **Memory Coalescing**: Ensure efficient memory access patterns
3. **Occupancy**: Maximize GPU core utilization
4. **Asynchronous Execution**: Overlap compute and memory transfers
---
## Parallel Processing Performance
### CPU Parallelization
#### Rayon-based Parallelism
- **Thread Overhead**: ~1-5μs per parallel task spawn
- **Work-Stealing**: Excellent load balancing for uneven workloads
- **Scaling**: Near-linear scaling up to CPU core count
#### Performance Scaling
```rust
// Typical parallel scaling efficiency
// 2 cores: 1.8-1.9x speedup
// 4 cores: 3.5-3.8x speedup
// 8 cores: 6.5-7.5x speedup
// 16 cores: 11-14x speedup
// 32+ cores: 15-25x speedup (NUMA effects become significant)
```
#### Optimal Parallel Task Sizes
- **Small Tasks**: >10μs of work to amortize overhead
- **Medium Tasks**: 100μs-1ms ideal for good load balancing
- **Large Tasks**: >1ms may need subdivision for better scaling
### NUMA Considerations
#### Multi-Socket Systems
- **Performance Impact**: 10-50% degradation for cross-socket memory access
- **Mitigation**: Use NUMA-aware allocation when available
- **Thread Affinity**: Keep threads and data on same NUMA node
#### Memory Bandwidth Scaling
- **Single-Socket**: Linear scaling up to memory bandwidth limit
- **Multi-Socket**: Sub-linear scaling due to NUMA effects
- **Optimization**: Partition data across NUMA nodes
---
## Scalability Analysis
### Dataset Size Scaling
#### Small Datasets (<1MB)
- **Performance**: Function call overhead dominates
- **Optimization**: Use in-place operations, avoid allocations
- **Parallelization**: Often counterproductive due to overhead
#### Medium Datasets (1MB-1GB)
- **Performance**: Cache effects and memory bandwidth important
- **Optimization**: Optimize for L3 cache utilization
- **Parallelization**: Effective with proper chunk sizes
#### Large Datasets (>1GB)
- **Performance**: Memory bandwidth becomes primary bottleneck
- **Optimization**: Use memory-mapped arrays, streaming algorithms
- **Parallelization**: Essential for acceptable performance
### Algorithmic Complexity
#### Linear Operations O(n)
- **Scaling**: Excellent scaling with dataset size
- **Memory Bound**: Performance limited by memory bandwidth
- **Optimization**: SIMD and parallelization highly effective
#### Quadratic Operations O(n²)
- **Scaling**: Performance degrades rapidly with size
- **Example**: Naive matrix multiplication
- **Optimization**: Use cache-friendly algorithms (e.g., blocked matrix multiply)
#### Logarithmic Operations O(n log n)
- **Scaling**: Good scaling characteristics
- **Example**: FFT, sorting
- **Optimization**: Cache-aware implementations important
---
## Known Limitations
### Performance Limitations
#### Single-Threaded Bottlenecks
1. **Array Creation**: Large array initialization not parallelized
2. **Memory Mapping**: File system operations are sequential
3. **Some LAPACK Operations**: Single-threaded by design
#### Memory Limitations
1. **32-bit Platforms**: Limited to 2-4GB total memory
2. **Memory Fragmentation**: Can impact large allocations
3. **Virtual Memory**: Performance degrades when exceeding physical RAM
#### SIMD Limitations
1. **Data Alignment**: Unaligned data reduces SIMD efficiency
2. **Scalar Fallbacks**: Mixed scalar/vector code paths reduce efficiency
3. **Branch Divergence**: Conditional operations break SIMD efficiency
### Platform-Specific Limitations
#### Windows
- **Path Lengths**: 260 character limit (unless long path support enabled)
- **Memory Mapping**: Limited to available virtual address space
- **Performance**: Generally 5-10% slower than Linux for scientific workloads
#### macOS
- **AVX-512**: Not available on Apple Silicon
- **GPU Compute**: Limited to Metal Performance Shaders
- **OpenMP**: Requires manual installation
#### ARM/Embedded
- **Memory Bandwidth**: Generally lower than x86_64 systems
- **SIMD Width**: 128-bit maximum (vs 512-bit on x86_64)
- **Floating-Point**: Some older ARM cores have slow double-precision
### Functional Limitations
#### Current Unimplemented Features
1. **Distributed Computing**: Multi-node operations not implemented
2. **Sparse Matrix GPU**: GPU sparse operations limited
3. **Complex SIMD**: Limited complex number SIMD optimizations
4. **Automatic Differentiation**: Forward/reverse mode AD not complete
#### API Limitations
1. **Mutability**: Some operations require mutable access unnecessarily
2. **Error Handling**: Some operations use panics instead of Results
3. **Generic Constraints**: Some APIs overly restrictive in type constraints
---
## Performance Optimization Guidelines
### General Optimization Principles
#### 1. Data Layout Optimization
```rust
// Prefer contiguous, aligned data layouts
let aligned_data = Array2::zeros((1024, 1024)); // Good
let misaligned_data = Array2::from_vec(data, (1024, 1024)); // May be suboptimal
```
#### 2. Algorithm Selection
```rust
// Choose algorithms based on data size
if size < 1000 {
simple_algorithm(data) // Lower overhead
} else {
optimized_algorithm(data) // Better asymptotic performance
}
```
#### 3. Memory Access Patterns
```rust
// Prefer sequential access patterns
for row in matrix.rows() { // Good: sequential cache-friendly access
for elem in row {
process(elem);
}
}
```
### Platform-Specific Optimizations
#### Intel x86_64
- Enable AVX/AVX2/AVX-512 feature flags at compile time
- Use Intel MKL for optimal BLAS/LAPACK performance
- Consider Intel Compiler for maximum optimization
#### AMD x86_64
- Use OpenBLAS or AMD BLIS for optimal linear algebra
- Enable AVX2 (avoid AVX-512 on older Zen architectures)
- Optimize for higher memory bandwidth
#### Apple Silicon
- Use Accelerate framework for BLAS/LAPACK
- Leverage unified memory architecture
- Optimize for excellent single-core performance
#### ARM/Embedded
- Use NEON optimizations where available
- Be mindful of memory bandwidth limitations
- Consider power consumption in optimization decisions
### Feature-Specific Optimizations
#### SIMD Operations
```rust
// Enable SIMD features at compile time
// RUSTFLAGS="-C target-cpu=native" cargo build --release
// Use SIMD-friendly data layouts
let data: Vec<f32> = vec![0.0; 1024]; // 32-byte aligned by default
```
#### GPU Operations
```rust
// Batch operations to amortize transfer overhead
let results = gpu_backend.batch_execute(&[
matrix_multiply(a, b),
matrix_multiply(c, d),
])?;
```
#### Parallel Operations
```rust
// Choose appropriate chunk sizes for parallel operations
use rayon::prelude::*;
// Good: chunks large enough to amortize overhead
data.par_chunks(1000).for_each(process_chunk);
// Bad: too many small chunks
data.par_iter().for_each(process_element); // Overhead dominates
```
---
## Future Performance Improvements
### Planned Optimizations (Beta 2+)
#### Algorithm Improvements
1. **Cache-Aware Algorithms**: Blocked matrix operations, cache-oblivious algorithms
2. **SIMD Enhancements**: More comprehensive SIMD coverage for mathematical functions
3. **GPU Kernel Optimization**: Hand-tuned kernels for common operations
#### Infrastructure Improvements
1. **JIT Compilation**: Runtime code generation for optimal performance
2. **Auto-Tuning**: Automatic selection of optimal algorithms based on hardware
3. **Distributed Computing**: Multi-node distributed array operations
#### Memory Optimizations
1. **Compressed Arrays**: Compressed storage for sparse and structured data
2. **Streaming Algorithms**: Better support for datasets larger than memory
3. **Memory Pool Management**: Reduced allocation overhead for frequent operations
### Research Areas
#### Advanced Techniques
1. **Tensor Cores**: Leverage specialized AI hardware for appropriate workloads
2. **Mixed Precision**: Automatic precision selection for optimal performance
3. **Approximate Computing**: Configurable accuracy/performance trade-offs
#### Platform Integration
1. **Cloud Native**: Optimizations for containerized and serverless environments
2. **Edge Computing**: Optimizations for resource-constrained environments
3. **Heterogeneous Computing**: Automatic work distribution across CPU/GPU/FPGA
---
## Benchmarking and Profiling Tools
### Built-in Benchmarking
```bash
# Run all performance benchmarks
cargo bench
# Run specific benchmark suites
cargo bench matrix_operations
cargo bench simd_operations
cargo bench memory_efficiency
# Compare with NumPy/SciPy
./benches/run_performance_comparison.sh
```
### Profiling Tools
#### CPU Profiling
```bash
# Profile with perf (Linux)
perf record cargo bench matrix_multiplication
perf report
# Profile with Instruments (macOS)
cargo instruments -t "Time Profiler" --bench matrix_multiplication
```
#### Memory Profiling
```bash
# Profile memory usage with valgrind
valgrind --tool=massif cargo test memory_efficient
# Profile allocations with heaptrack (Linux)
heaptrack cargo bench memory_operations
```
#### GPU Profiling
```bash
# NVIDIA profiling
nvprof cargo bench gpu_operations
# AMD profiling
rocprof cargo bench gpu_operations
```
---
## Performance Monitoring in Production
### Metrics Collection
The observability system provides performance metrics:
```rust
use scirs2_core::observability::tracing;
// Automatic performance attribution
let tracer = tracing::global_tracer().unwrap();
let span = tracer.start_span("matrix_computation")?;
span.in_span(|| {
// Computation automatically tracked
matrix_multiply(a, b)
});
```
### Performance Alerting
Configure alerts for performance regressions:
```rust
use scirs2_core::observability::audit;
// Performance audit events
audit_logger.log_performance_event(
"matrix_multiply",
duration,
Some("Performance regression detected"),
)?;
```
---
## Conclusion
SciRS2 Core provides competitive performance with established scientific computing libraries while offering the safety and expressiveness of Rust. Key performance strengths include:
1. **SIMD Optimization**: Comprehensive SIMD acceleration across platforms
2. **Memory Efficiency**: Advanced memory management and zero-copy operations
3. **Parallel Scaling**: Excellent scaling on multi-core systems
4. **GPU Acceleration**: Multi-backend GPU support for appropriate workloads
Users should be aware of current limitations around distributed computing, some GPU operations, and platform-specific constraints. The performance characteristics documented here will guide optimization decisions and help users achieve optimal performance for their specific use cases.
For the most up-to-date performance benchmarks and optimization guides, consult the benchmark results in `benches/` and run the comparison scripts against your specific hardware configuration.
---
*Last Updated: 2025-09-29*
*Version: 0.1.5*
*Next Update: Beta 4 release*