scirs2-core 0.4.2

Core utilities and common functionality for SciRS2 (scirs2-core)
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
# SciRS2 Core Performance Characteristics and Limitations

## Overview

This document provides comprehensive performance characteristics, benchmarking results, and known limitations for SciRS2 Core (scirs2-core) version 0.3.1. This information is critical for understanding performance expectations, optimization opportunities, and deployment considerations.

## Table of Contents

1. [Performance Benchmarking Results]#performance-benchmarking-results
2. [Platform-Specific Performance]#platform-specific-performance
3. [Memory Performance Characteristics]#memory-performance-characteristics
4. [SIMD and Acceleration Performance]#simd-and-acceleration-performance
5. [GPU Performance]#gpu-performance
6. [Parallel Processing Performance]#parallel-processing-performance
7. [Scalability Analysis]#scalability-analysis
8. [Known Limitations]#known-limitations
9. [Performance Optimization Guidelines]#performance-optimization-guidelines
10. [Future Performance Improvements]#future-performance-improvements

---

## Performance Benchmarking Results

### SciPy/NumPy Comparison Benchmarks

SciRS2 Core includes comprehensive benchmarking against NumPy and SciPy baselines in `benches/numpy_scipy_comparison_bench.rs`. Key results:

#### Matrix Operations
- **Matrix Multiplication**: 0.8x - 1.2x NumPy performance (varies by size and hardware)
- **Element-wise Operations**: 1.1x - 2.3x NumPy performance with SIMD enabled
- **Linear Algebra (LAPACK)**: 0.9x - 1.1x SciPy performance (hardware-dependent)

#### Statistical Operations
- **Basic Statistics (mean, std)**: 1.2x - 1.8x NumPy performance
- **Advanced Statistics**: 0.7x - 1.0x SciPy performance
- **Distributions**: 0.8x - 1.1x SciPy performance

#### Signal Processing
- **FFT Operations**: 0.9x - 1.1x SciPy performance
- **Filtering**: 1.0x - 1.3x SciPy performance
- **Convolution**: 1.1x - 1.5x SciPy performance with SIMD

#### Array Protocol Performance
- **Array Creation**: 1.0x - 1.2x NumPy performance
- **Array Indexing**: 0.9x - 1.1x NumPy performance
- **Array Broadcasting**: 1.0x - 1.1x NumPy performance

### Performance Testing Infrastructure

Run performance comparisons using:
```bash
./benches/run_performance_comparison.sh
```

This executes both Rust benchmarks and Python baselines for direct comparison.

---

## Platform-Specific Performance

### x86_64 Platforms

#### Intel Processors
- **Best Performance**: Ice Lake and newer with AVX-512
- **Good Performance**: Haswell and newer with AVX2
- **Baseline Performance**: Sandy Bridge and newer with AVX

**Optimization Features:**
- SSE4.2: Standard, provides 2-4x speedup for element-wise operations
- AVX2: Available on most modern CPUs, provides 4-8x speedup
- AVX-512: Available on newer Intel CPUs, provides 8-16x speedup

#### AMD Processors
- **Best Performance**: Zen 3 and newer
- **Good Performance**: Zen 2 with full AVX2 support
- **Baseline Performance**: Zen 1 with partial AVX2

**Notes:**
- AMD Zen 1/Zen+ have slower AVX2 implementation
- Zen 2+ have competitive AVX2 performance with Intel

### ARM64 (AArch64) Platforms

#### Apple Silicon (M1/M2/M3)
- **Excellent Performance**: Native NEON optimizations
- **Memory Bandwidth**: Superior unified memory architecture
- **Power Efficiency**: Best performance-per-watt

#### ARM Cortex-A Processors
- **Good Performance**: Cortex-A78 and newer
- **Baseline Performance**: Cortex-A55 and equivalent

**NEON Optimizations:**
- 128-bit SIMD vectors standard
- 2-4x performance improvement for vectorizable operations
- Excellent floating-point performance

### Platform Performance Matrix

| Platform | Single-Core | Multi-Core | SIMD Efficiency | Memory Bandwidth |
|----------|-------------|------------|-----------------|------------------|
| Intel x64 (AVX-512) | Excellent | Excellent | Excellent | Good |
| Intel x64 (AVX2) | Very Good | Excellent | Very Good | Good |
| AMD Zen 3+ | Very Good | Excellent | Very Good | Very Good |
| Apple Silicon | Excellent | Very Good | Very Good | Excellent |
| ARM Cortex-A78+ | Good | Good | Good | Good |

---

## Memory Performance Characteristics

### Memory Access Patterns

#### Sequential Access
- **Optimal Performance**: Linear memory access patterns
- **Cache Efficiency**: L1/L2/L3 cache-friendly operations
- **Prefetching**: Automatic hardware prefetching optimized

#### Random Access
- **Performance Impact**: 3-10x slower than sequential access
- **Mitigation**: Chunked processing and data locality optimization
- **Cache Misses**: Minimized through access pattern analysis

#### Memory-Mapped Arrays
- **Large Datasets**: Efficient for datasets larger than RAM
- **Performance Characteristics**:
  - First access: OS page fault overhead (~1-10μs)
  - Subsequent access: Near-RAM speed for hot pages
  - Cold pages: Disk I/O latency (1-100ms depending on storage)

### Memory Allocation Performance

#### Standard Allocation
```rust
// Typical allocation times (microseconds)
// 1KB:     ~0.1μs
// 1MB:     ~1-10μs  
// 100MB:   ~100-1000μs
// 1GB:     ~1-10ms
```

#### Memory-Mapped Allocation
```rust
// Memory mapping times (microseconds)
// 1MB:     ~10-50μs
// 100MB:   ~50-200μs
// 1GB:     ~100-500μs
// 10GB:    ~500-2000μs
```

#### Zero-Copy Operations
- **Efficiency**: Near-zero overhead for compatible operations
- **Limitations**: Requires compatible memory layouts
- **Use Cases**: Array views, slicing, and broadcasting operations

### Memory Bandwidth Utilization

| Operation Type | Memory Bandwidth Utilization |
|----------------|------------------------------|
| Element-wise (SIMD) | 60-80% of peak bandwidth |
| Matrix Multiplication | 40-70% of peak bandwidth |
| FFT Operations | 30-50% of peak bandwidth |
| Random Access | 5-20% of peak bandwidth |

---

## SIMD and Acceleration Performance

### SIMD Performance Characteristics

#### f32 Operations (AVX2/NEON)
- **Addition/Subtraction**: 8-16x speedup over scalar
- **Multiplication**: 6-12x speedup over scalar  
- **Division**: 3-6x speedup over scalar
- **Mathematical Functions**: 2-8x speedup (function-dependent)

#### f64 Operations (AVX2/NEON)
- **Addition/Subtraction**: 4-8x speedup over scalar
- **Multiplication**: 3-6x speedup over scalar
- **Division**: 2-4x speedup over scalar
- **Mathematical Functions**: 1.5-4x speedup

#### Integer Operations
- **8-bit/16-bit**: Excellent SIMD efficiency (16-32x speedup)
- **32-bit**: Good SIMD efficiency (4-8x speedup)
- **64-bit**: Moderate SIMD efficiency (2-4x speedup)

### Feature Detection and Fallbacks

The library automatically detects available SIMD features:

```rust
// Runtime feature detection
let capabilities = PlatformCapabilities::detect();
if capabilities.has_avx512f() {
    // Use AVX-512 implementation
} else if capabilities.has_avx2() {
    // Use AVX2 implementation  
} else if capabilities.has_sse42() {
    // Use SSE4.2 implementation
} else {
    // Fall back to scalar implementation
}
```

### SIMD Optimization Guidelines

1. **Data Alignment**: Align data to SIMD width boundaries (16/32/64 bytes)
2. **Vectorization Length**: Process in multiples of SIMD width
3. **Memory Layout**: Prefer contiguous, aligned memory layouts
4. **Branching**: Minimize conditional operations within SIMD loops

---

## GPU Performance

### Supported GPU Backends

#### CUDA (NVIDIA)
- **Supported**: Tesla K40 and newer
- **Optimal**: RTX 20 series and newer, A100, H100
- **Memory Transfer**: 10-25 GB/s (PCIe 3.0/4.0 dependent)
- **Compute Performance**: 1-100 TFLOPS (architecture dependent)

#### OpenCL (Cross-platform)
- **NVIDIA**: Good performance on Maxwell and newer
- **AMD**: Good performance on GCN and newer
- **Intel**: Basic performance on integrated graphics

#### Metal Performance Shaders (Apple)
- **Supported**: M1, M2, M3 Apple Silicon
- **Performance**: Excellent for unified memory architecture
- **Limitations**: macOS/iOS only

#### WebGPU (Experimental)
- **Browser Support**: Chrome, Firefox, Safari (experimental)
- **Performance**: Limited by browser security constraints
- **Use Cases**: Web deployment and cross-platform compatibility

### GPU Performance Characteristics

#### Memory Transfer Overhead
- **Host→Device**: 1-10 ms for 1GB transfers
- **Device→Host**: 1-10 ms for 1GB transfers
- **GPU Memory Bandwidth**: 500-3000 GB/s (architecture dependent)

#### Compute Performance
```rust
// Typical GPU speedup over CPU (operation dependent)
// Matrix Multiplication (large): 10-100x
// Element-wise Operations: 5-50x
// FFT: 5-20x
// Small Operations (<1MB): Often slower due to overhead
```

#### GPU Optimization Guidelines
1. **Batch Operations**: Minimize host↔device transfers
2. **Memory Coalescing**: Ensure efficient memory access patterns
3. **Occupancy**: Maximize GPU core utilization
4. **Asynchronous Execution**: Overlap compute and memory transfers

---

## Parallel Processing Performance

### CPU Parallelization

#### Rayon-based Parallelism
- **Thread Overhead**: ~1-5μs per parallel task spawn
- **Work-Stealing**: Excellent load balancing for uneven workloads
- **Scaling**: Near-linear scaling up to CPU core count

#### Performance Scaling
```rust
// Typical parallel scaling efficiency
// 2 cores:  1.8-1.9x speedup
// 4 cores:  3.5-3.8x speedup  
// 8 cores:  6.5-7.5x speedup
// 16 cores: 11-14x speedup
// 32+ cores: 15-25x speedup (NUMA effects become significant)
```

#### Optimal Parallel Task Sizes
- **Small Tasks**: >10μs of work to amortize overhead
- **Medium Tasks**: 100μs-1ms ideal for good load balancing
- **Large Tasks**: >1ms may need subdivision for better scaling

### NUMA Considerations

#### Multi-Socket Systems
- **Performance Impact**: 10-50% degradation for cross-socket memory access
- **Mitigation**: Use NUMA-aware allocation when available
- **Thread Affinity**: Keep threads and data on same NUMA node

#### Memory Bandwidth Scaling
- **Single-Socket**: Linear scaling up to memory bandwidth limit
- **Multi-Socket**: Sub-linear scaling due to NUMA effects
- **Optimization**: Partition data across NUMA nodes

---

## Scalability Analysis

### Dataset Size Scaling

#### Small Datasets (<1MB)
- **Performance**: Function call overhead dominates
- **Optimization**: Use in-place operations, avoid allocations
- **Parallelization**: Often counterproductive due to overhead

#### Medium Datasets (1MB-1GB)
- **Performance**: Cache effects and memory bandwidth important
- **Optimization**: Optimize for L3 cache utilization
- **Parallelization**: Effective with proper chunk sizes

#### Large Datasets (>1GB)
- **Performance**: Memory bandwidth becomes primary bottleneck
- **Optimization**: Use memory-mapped arrays, streaming algorithms
- **Parallelization**: Essential for acceptable performance

### Algorithmic Complexity

#### Linear Operations O(n)
- **Scaling**: Excellent scaling with dataset size
- **Memory Bound**: Performance limited by memory bandwidth
- **Optimization**: SIMD and parallelization highly effective

#### Quadratic Operations O(n²)
- **Scaling**: Performance degrades rapidly with size
- **Example**: Naive matrix multiplication
- **Optimization**: Use cache-friendly algorithms (e.g., blocked matrix multiply)

#### Logarithmic Operations O(n log n)
- **Scaling**: Good scaling characteristics
- **Example**: FFT, sorting
- **Optimization**: Cache-aware implementations important

---

## Known Limitations

### Performance Limitations

#### Single-Threaded Bottlenecks
1. **Array Creation**: Large array initialization not parallelized
2. **Memory Mapping**: File system operations are sequential
3. **Some LAPACK Operations**: Single-threaded by design

#### Memory Limitations
1. **32-bit Platforms**: Limited to 2-4GB total memory
2. **Memory Fragmentation**: Can impact large allocations
3. **Virtual Memory**: Performance degrades when exceeding physical RAM

#### SIMD Limitations
1. **Data Alignment**: Unaligned data reduces SIMD efficiency
2. **Scalar Fallbacks**: Mixed scalar/vector code paths reduce efficiency
3. **Branch Divergence**: Conditional operations break SIMD efficiency

### Platform-Specific Limitations

#### Windows
- **Path Lengths**: 260 character limit (unless long path support enabled)
- **Memory Mapping**: Limited to available virtual address space
- **Performance**: Generally 5-10% slower than Linux for scientific workloads

#### macOS
- **AVX-512**: Not available on Apple Silicon
- **GPU Compute**: Limited to Metal Performance Shaders
- **OpenMP**: Requires manual installation

#### ARM/Embedded
- **Memory Bandwidth**: Generally lower than x86_64 systems
- **SIMD Width**: 128-bit maximum (vs 512-bit on x86_64)
- **Floating-Point**: Some older ARM cores have slow double-precision

### Functional Limitations

#### Current Unimplemented Features
1. **Distributed Computing**: Multi-node operations not implemented
2. **Sparse Matrix GPU**: GPU sparse operations limited
3. **Complex SIMD**: Limited complex number SIMD optimizations
4. **Automatic Differentiation**: Forward/reverse mode AD not complete

#### API Limitations
1. **Mutability**: Some operations require mutable access unnecessarily
2. **Error Handling**: Some operations use panics instead of Results
3. **Generic Constraints**: Some APIs overly restrictive in type constraints

---

## Performance Optimization Guidelines

### General Optimization Principles

#### 1. Data Layout Optimization
```rust
// Prefer contiguous, aligned data layouts
let aligned_data = Array2::zeros((1024, 1024));  // Good
let misaligned_data = Array2::from_vec(data, (1024, 1024));  // May be suboptimal
```

#### 2. Algorithm Selection
```rust
// Choose algorithms based on data size
if size < 1000 {
    simple_algorithm(data)  // Lower overhead
} else {
    optimized_algorithm(data)  // Better asymptotic performance
}
```

#### 3. Memory Access Patterns
```rust
// Prefer sequential access patterns
for row in matrix.rows() {  // Good: sequential cache-friendly access
    for elem in row {
        process(elem);
    }
}
```

### Platform-Specific Optimizations

#### Intel x86_64
- Enable AVX/AVX2/AVX-512 feature flags at compile time
- Use Intel MKL for optimal BLAS/LAPACK performance
- Consider Intel Compiler for maximum optimization

#### AMD x86_64
- Use OpenBLAS or AMD BLIS for optimal linear algebra
- Enable AVX2 (avoid AVX-512 on older Zen architectures)
- Optimize for higher memory bandwidth

#### Apple Silicon
- Use Accelerate framework for BLAS/LAPACK
- Leverage unified memory architecture
- Optimize for excellent single-core performance

#### ARM/Embedded
- Use NEON optimizations where available
- Be mindful of memory bandwidth limitations
- Consider power consumption in optimization decisions

### Feature-Specific Optimizations

#### SIMD Operations
```rust
// Enable SIMD features at compile time
// RUSTFLAGS="-C target-cpu=native" cargo build --release

// Use SIMD-friendly data layouts
let data: Vec<f32> = vec![0.0; 1024];  // 32-byte aligned by default
```

#### GPU Operations
```rust
// Batch operations to amortize transfer overhead
let results = gpu_backend.batch_execute(&[
    matrix_multiply(a, b),
    matrix_multiply(c, d),
])?;
```

#### Parallel Operations
```rust
// Choose appropriate chunk sizes for parallel operations
use rayon::prelude::*;

// Good: chunks large enough to amortize overhead
data.par_chunks(1000).for_each(process_chunk);

// Bad: too many small chunks
data.par_iter().for_each(process_element);  // Overhead dominates
```

---

## Future Performance Improvements

### Planned Optimizations (Beta 2+)

#### Algorithm Improvements
1. **Cache-Aware Algorithms**: Blocked matrix operations, cache-oblivious algorithms
2. **SIMD Enhancements**: More comprehensive SIMD coverage for mathematical functions
3. **GPU Kernel Optimization**: Hand-tuned kernels for common operations

#### Infrastructure Improvements
1. **JIT Compilation**: Runtime code generation for optimal performance
2. **Auto-Tuning**: Automatic selection of optimal algorithms based on hardware
3. **Distributed Computing**: Multi-node distributed array operations

#### Memory Optimizations
1. **Compressed Arrays**: Compressed storage for sparse and structured data
2. **Streaming Algorithms**: Better support for datasets larger than memory
3. **Memory Pool Management**: Reduced allocation overhead for frequent operations

### Research Areas

#### Advanced Techniques
1. **Tensor Cores**: Leverage specialized AI hardware for appropriate workloads
2. **Mixed Precision**: Automatic precision selection for optimal performance
3. **Approximate Computing**: Configurable accuracy/performance trade-offs

#### Platform Integration
1. **Cloud Native**: Optimizations for containerized and serverless environments
2. **Edge Computing**: Optimizations for resource-constrained environments
3. **Heterogeneous Computing**: Automatic work distribution across CPU/GPU/FPGA

---

## Benchmarking and Profiling Tools

### Built-in Benchmarking
```bash
# Run all performance benchmarks
cargo bench

# Run specific benchmark suites
cargo bench matrix_operations
cargo bench simd_operations
cargo bench memory_efficiency

# Compare with NumPy/SciPy
./benches/run_performance_comparison.sh
```

### Profiling Tools

#### CPU Profiling
```bash
# Profile with perf (Linux)
perf record cargo bench matrix_multiplication
perf report

# Profile with Instruments (macOS)
cargo instruments -t "Time Profiler" --bench matrix_multiplication
```

#### Memory Profiling
```bash
# Profile memory usage with valgrind
valgrind --tool=massif cargo test memory_efficient

# Profile allocations with heaptrack (Linux)
heaptrack cargo bench memory_operations
```

#### GPU Profiling
```bash
# NVIDIA profiling
nvprof cargo bench gpu_operations

# AMD profiling  
rocprof cargo bench gpu_operations
```

---

## Performance Monitoring in Production

### Metrics Collection
The observability system provides performance metrics:

```rust
use scirs2_core::observability::tracing;

// Automatic performance attribution
let tracer = tracing::global_tracer().unwrap();
let span = tracer.start_span("matrix_computation")?;
span.in_span(|| {
    // Computation automatically tracked
    matrix_multiply(a, b)
});
```

### Performance Alerting
Configure alerts for performance regressions:

```rust
use scirs2_core::observability::audit;

// Performance audit events
audit_logger.log_performance_event(
    "matrix_multiply",
    duration,
    Some("Performance regression detected"),
)?;
```

---

## Conclusion

SciRS2 Core provides competitive performance with established scientific computing libraries while offering the safety and expressiveness of Rust. Key performance strengths include:

1. **SIMD Optimization**: Comprehensive SIMD acceleration across platforms
2. **Memory Efficiency**: Advanced memory management and zero-copy operations  
3. **Parallel Scaling**: Excellent scaling on multi-core systems
4. **GPU Acceleration**: Multi-backend GPU support for appropriate workloads

Users should be aware of current limitations around distributed computing, some GPU operations, and platform-specific constraints. The performance characteristics documented here will guide optimization decisions and help users achieve optimal performance for their specific use cases.

For the most up-to-date performance benchmarks and optimization guides, consult the benchmark results in `benches/` and run the comparison scripts against your specific hardware configuration.

---

*Last Updated: 2025-09-29*  
*Version: 0.1.5*  
*Next Update: Beta 4 release*