numrs2 0.3.3

A Rust implementation inspired by NumPy for numerical computing (NumRS2)
# NumRS2 Benchmarking Guide

Comprehensive guide for running, interpreting, and using benchmarks in NumRS2 v1.0.0 (0.2.0 release).

## Table of Contents

1. [Overview](#overview)
2. [Benchmark Suite](#benchmark-suite)
3. [Running Benchmarks](#running-benchmarks)
4. [Interpreting Results](#interpreting-results)
5. [Performance Optimization Tips](#performance-optimization-tips)
6. [Hardware Requirements](#hardware-requirements)
7. [Comparison with NumPy](#comparison-with-numpy)
8. [Troubleshooting](#troubleshooting)

## Overview

NumRS2 includes a comprehensive benchmark suite built with the [Criterion.rs](https://github.com/bheisler/criterion.rs) library. The benchmarks cover all major operations and are designed to:

- Track performance across releases
- Identify performance regressions
- Compare SIMD vs scalar performance
- Evaluate parallel processing efficiency
- Measure memory bandwidth utilization
- Compare with NumPy where applicable

## Benchmark Suite

### 1. Linear Algebra Benchmarks (`linalg_benchmarks`)

**File:** `bench/linalg_benchmarks.rs`

**Operations tested:**
- Matrix multiplication (10x10 to 1000x1000)
- Matrix-vector multiplication
- Matrix transpose (square and rectangular)
- Matrix inverse (10x10 to 200x200)
- Determinant calculation
- Matrix norms (Frobenius, infinity, 1-norm)
- QR decomposition
- Cholesky decomposition
- SVD (Singular Value Decomposition)
- LU decomposition
- Eigenvalue decomposition
- Linear system solving
- Matrix rank and condition number
- Matrix trace
- Outer and inner products
- Cross product (3D)
- Kronecker product

**Run:**
```bash
cargo bench --bench linalg_benchmarks
```

### 2. Statistics Benchmarks (`stats_benchmarks`)

**File:** `bench/stats_benchmarks.rs`

**Operations tested:**
- Basic statistics (mean, variance, std, median)
- Quantiles and percentiles
- Correlation and covariance (vectors and matrices)
- Histogram computation (10, 50, 100 bins)
- Distribution sampling:
  - Normal (standard and custom)
  - Uniform
  - Exponential
  - Gamma
  - Beta
  - Chi-squared
  - Student's t
  - F distribution
  - Poisson
  - Binomial
- Cumulative statistics (cumsum, cumprod)
- Statistical moments (skewness, kurtosis)
- Random sampling and shuffling

**Run:**
```bash
cargo bench --bench stats_benchmarks
```

### 3. FFT Benchmarks (`fft_benchmarks`)

**File:** `bench/fft_benchmarks.rs`

**Operations tested:**
- 1D FFT/IFFT (64 to 16384 points)
- Real FFT/IRFFT
- 2D FFT/IFFT (8x8 to 256x256)
- 2D Real FFT/IRFFT
- Window functions (rectangular, Hann, Hamming, Blackman)
- FFT shift operations
- Frequency axis generation
- Power spectrum calculation
- FFT on different signal types (random, sine, square, impulse)
- End-to-end FFT workflow
- Implementation comparison
- Data type comparison (f32 vs f64)

**Run:**
```bash
cargo bench --bench fft_benchmarks
```

### 4. Array Operations Benchmarks (`array_ops_benchmarks`)

**File:** `bench/array_ops_benchmarks.rs`

**Operations tested:**
- Element-wise operations (add, sub, mul, div, pow)
- Broadcasting (scalar to array, vector to matrix)
- Reduction operations (sum, prod, min, max, argmin, argmax)
- Array indexing (element access)
- Array slicing (1D and 2D)
- Array reshaping (1D to 2D, flattening)
- Array transposition (square and rectangular)
- Array concatenation (1D and 2D, different axes)
- Array stacking (vstack, hstack)
- Array splitting
- Array tiling and repetition

**Run:**
```bash
cargo bench --bench array_ops_benchmarks
```

### 5. Optimization Benchmarks (`optimization_benchmarks`)

**File:** `bench/optimization_benchmarks.rs`

**Operations tested:**
- BFGS optimization (2D to 10D)
- L-BFGS optimization (2D to 20D)
- Conjugate gradient methods:
  - Fletcher-Reeves
  - Polak-Ribiere
  - Hestenes-Stiefel
- Trust region methods
- Genetic algorithms
- Particle swarm optimization
- Simulated annealing
- Differential evolution
- Algorithm comparison

**Test functions:**
- Rosenbrock
- Sphere
- Rastrigin
- Ackley

**Run:**
```bash
cargo bench --bench optimization_benchmarks
```

### 6. SIMD Comparison Benchmarks (`simd_comparison_benchmark`)

**File:** `bench/simd_comparison_benchmark.rs`

**Operations tested:**
- SIMD vs scalar addition
- SIMD vs scalar multiplication
- SIMD vs scalar dot product
- SIMD vs scalar sum reduction
- Threshold analysis (2 to 256 elements)
- Data type comparison (f32 vs f64)
- Alignment effects
- Complex operations (FMA, norm)
- Strided data access
- Memory bandwidth with SIMD

**Run:**
```bash
cargo bench --bench simd_comparison_benchmark
```

### 7. Parallel Benchmarks (`parallel_benchmarks`)

**File:** `bench/parallel_benchmarks.rs`

**Operations tested:**
- Parallel vs sequential sum
- Parallel vs sequential matrix multiplication
- Parallel reduction operations
- Parallel map operations
- Thread scaling analysis
- Parallel overhead for different array sizes
- Load balancing efficiency
- Parallel matrix operations
- Parallel statistics
- Parallel FFT

**Run:**
```bash
cargo bench --bench parallel_benchmarks
```

### 8. Memory Benchmarks (`memory_benchmarks`)

**File:** `bench/memory_benchmarks.rs`

**Operations tested:**
- Memory allocation patterns (1D, 2D, zeros, ones)
- Cache efficiency (row-major vs column-major)
- Memory bandwidth utilization (read, write, copy, triad)
- Copy vs view operations
- In-place vs allocating operations
- Memory access patterns (sequential, strided, random)
- Cache line effects
- Allocation size effects (small, medium, large)
- Contiguous vs non-contiguous memory
- Prefetching effects

**Run:**
```bash
cargo bench --bench memory_benchmarks
```

## Running Benchmarks

### Run All Benchmarks

```bash
cargo bench
```

### Run Specific Benchmark Suite

```bash
cargo bench --bench linalg_benchmarks
cargo bench --bench stats_benchmarks
cargo bench --bench fft_benchmarks
cargo bench --bench array_ops_benchmarks
cargo bench --bench optimization_benchmarks
cargo bench --bench simd_comparison_benchmark
cargo bench --bench parallel_benchmarks
cargo bench --bench memory_benchmarks
```

### Run Specific Benchmark Function

```bash
# Run only matrix multiplication benchmarks
cargo bench --bench linalg_benchmarks -- matrix_multiplication

# Run only FFT 1D benchmarks
cargo bench --bench fft_benchmarks -- fft_1d

# Run only SIMD threshold analysis
cargo bench --bench simd_comparison_benchmark -- threshold_analysis
```

### Save Results for Comparison

```bash
# Save baseline
cargo bench -- --save-baseline main

# Make changes...

# Compare with baseline
cargo bench -- --baseline main
```

### Generate HTML Reports

Criterion automatically generates HTML reports in `target/criterion/`. Open them with:

```bash
# macOS
open target/criterion/report/index.html

# Linux
xdg-open target/criterion/report/index.html

# Windows
start target/criterion/report/index.html
```

## Interpreting Results

### Understanding Criterion Output

```
matrix_multiplication/square_matmul/100
                        time:   [1.2345 ms 1.2567 ms 1.2789 ms]
                        change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)
                        Performance has improved.
```

**Components:**
- **time**: Confidence interval for the time per iteration (95% confidence level by default)
  - Lower bound: 1.2345 ms
  - Point estimate: 1.2567 ms
  - Upper bound: 1.2789 ms
- **change**: Relative change from previous run
  - Negative = faster (improvement)
  - Positive = slower (regression)
- **p-value**: Statistical significance (< 0.05 = significant)
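
When results are post-processed outside Criterion (for example in a script that tracks numbers across releases), the three bracketed values have to be pulled out of the text and brought to a common unit. The following is a minimal sketch using only the standard library; `TimeEstimate`, `to_ns`, and `parse_time_line` are hypothetical names, and the set of unit suffixes is an assumption about what appears in the output.

```rust
/// Lower bound, point estimate, and upper bound, all converted to nanoseconds.
#[derive(Debug, PartialEq)]
pub struct TimeEstimate {
    pub lower_ns: f64,
    pub estimate_ns: f64,
    pub upper_ns: f64,
}

/// Convert a value with a unit suffix to nanoseconds.
/// The accepted suffixes are an assumption, not an exhaustive list.
fn to_ns(value: f64, unit: &str) -> Option<f64> {
    let factor = match unit {
        "ps" => 1e-3,
        "ns" => 1.0,
        "us" | "µs" => 1e3,
        "ms" => 1e6,
        "s" => 1e9,
        _ => return None,
    };
    Some(value * factor)
}

/// Parse a line such as `time:   [1.2345 ms 1.2567 ms 1.2789 ms]`.
/// Returns `None` if the line does not contain three value/unit pairs.
pub fn parse_time_line(line: &str) -> Option<TimeEstimate> {
    let start = line.find('[')?;
    let end = line.find(']')?;
    let tokens: Vec<&str> = line.get(start + 1..end)?.split_whitespace().collect();
    if tokens.len() != 6 {
        return None;
    }
    let mut ns = [0.0_f64; 3];
    for i in 0..3 {
        let value: f64 = tokens[2 * i].parse().ok()?;
        ns[i] = to_ns(value, tokens[2 * i + 1])?;
    }
    Some(TimeEstimate {
        lower_ns: ns[0],
        estimate_ns: ns[1],
        upper_ns: ns[2],
    })
}
```

Converting everything to nanoseconds makes it possible to compare runs whose output switched units (for example from µs to ms) as the problem size grew.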

### Performance Metrics

1. **Throughput**: Operations per second
   - Higher is better
   - Compare with theoretical peak performance

2. **Latency**: Time per operation
   - Lower is better
   - Important for real-time applications

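3. **Scaling**: Performance vs problem size
   - O(n), O(n²), O(n³) complexity
   - SIMD speedup: typically 2-4x for f64, 4-8x for f32, depending on register width and memory bandwidth
   - Parallel speedup: near-linear in thread count for compute-bound work; memory-bound operations saturate earlier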

4. **Efficiency**: Actual vs theoretical performance
   - Memory bandwidth utilization
   - Cache hit rates
   - SIMD utilization
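
These metrics can be derived from a measured time with a few lines of arithmetic. The sketch below assumes the conventional count of 2n³ floating-point operations for a dense n x n matrix product; all function names are hypothetical helpers, not part of NumRS2.

```rust
/// Conventional operation count for a dense n x n matrix product:
/// about n^3 multiplications plus n^3 additions.
pub fn matmul_flops(n: u64) -> f64 {
    2.0 * (n as f64).powi(3)
}

/// Throughput in GFLOPS for `flops` operations completed in `seconds`.
pub fn gflops(flops: f64, seconds: f64) -> f64 {
    flops / seconds / 1e9
}

/// Speedup of an optimized variant (SIMD, parallel, ...) over a baseline.
/// Values above 1.0 mean the optimized variant is faster.
pub fn speedup(baseline_seconds: f64, optimized_seconds: f64) -> f64 {
    baseline_seconds / optimized_seconds
}

/// Parallel efficiency: 1.0 corresponds to perfectly linear scaling.
pub fn parallel_efficiency(speedup: f64, threads: usize) -> f64 {
    speedup / threads as f64
}
```

For example, a 1000x1000 product that takes 20 ms corresponds to `gflops(matmul_flops(1000), 0.02)`, roughly 100 GFLOPS, and a 4x speedup on 8 threads gives an efficiency of 0.5.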

## Performance Optimization Tips

### 1. Choose Appropriate Data Types

- **f32 vs f64**: Use f32 when precision allows (2x SIMD lanes)
- **Integer types**: Use smallest type that fits your range
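
The "2x SIMD lanes" figure follows directly from element size: a register of fixed width holds twice as many 4-byte values as 8-byte values. A minimal sketch, where `simd_lanes` is a hypothetical helper:

```rust
use std::mem::size_of;

/// Number of elements of type `T` that fit in one SIMD register of
/// `register_bits` bits (128 for SSE/NEON, 256 for AVX2, 512 for AVX-512).
pub fn simd_lanes<T>(register_bits: usize) -> usize {
    assert!(size_of::<T>() > 0, "zero-sized types have no lane count");
    register_bits / (8 * size_of::<T>())
}
```

With a 256-bit register this gives 8 lanes for `f32` and 4 for `f64`. The lane count is an upper bound on the SIMD speedup; measured gains are usually lower once memory bandwidth becomes the limit.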

### 2. Memory Layout

- **Contiguous arrays**: Fastest access pattern
- **Row-major order**: Default in NumRS2 (same as NumPy)
- **Avoid unnecessary transposes**: Cache-unfriendly

### 3. SIMD Optimization

- **Minimum size**: SIMD benefits start at ~64 elements
- **Alignment**: Aligned data is faster (handled automatically)
- **Contiguous data**: SIMD requires contiguous memory

### 4. Parallel Processing

- **Minimum size**: Parallel benefits start at ~10,000 elements
- **Thread count**: Optimal = number of physical cores
- **Overhead**: Consider serial for small arrays
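
A size threshold of this kind can be expressed as a simple dispatch. The following sketch uses scoped threads from the standard library rather than the parallel backend NumRS2 uses internally; `PARALLEL_THRESHOLD` and `sum_adaptive` are hypothetical names, and the cutoff is taken from the ~10,000-element guideline above rather than measured.

```rust
use std::thread;

/// Hypothetical cutoff below which spawning threads is assumed not to pay off.
pub const PARALLEL_THRESHOLD: usize = 10_000;

/// Sum a slice, splitting the work across `threads` workers only when the
/// input is large enough to amortize the thread start-up cost.
pub fn sum_adaptive(data: &[f64], threads: usize) -> f64 {
    let threads = threads.max(1);
    if data.len() < PARALLEL_THRESHOLD || threads == 1 {
        return data.iter().sum();
    }
    // Ceiling division so every element lands in exactly one chunk.
    let chunk_len = (data.len() + threads - 1) / threads;
    thread::scope(|scope| {
        // Spawn all workers first, then join them and combine partial sums.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<f64>()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<f64>()
    })
}
```

Note that `std::thread::available_parallelism()` reports logical CPUs, so on machines with SMT the physical core count may be lower than the value it returns. Because the partial sums are combined in a different order, the parallel result can differ from the sequential one in the last bits for general floating-point data.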

### 5. Cache Optimization

- **Locality**: Access nearby elements together
- **Blocking**: Use cache-sized blocks for large matrices
- **Prefetching**: Sequential access enables hardware prefetch
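
As an illustration of blocking, the sketch below transposes a row-major matrix one tile at a time so that both the source and destination tiles stay in cache while they are being processed. It operates on flat slices, `transpose_blocked` is a hypothetical name, and the best `block` size is hardware-dependent and should be found by benchmarking.

```rust
/// Transpose a `rows` x `cols` row-major matrix in `src` into `dst`
/// (`cols` x `rows`, row-major), processing `block` x `block` tiles.
pub fn transpose_blocked(
    src: &[f64],
    dst: &mut [f64],
    rows: usize,
    cols: usize,
    block: usize,
) {
    assert!(block > 0, "block size must be positive");
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    for r0 in (0..rows).step_by(block) {
        for c0 in (0..cols).step_by(block) {
            // Clamp the tile at the matrix edge for sizes that are not
            // multiples of `block`.
            let r_end = (r0 + block).min(rows);
            let c_end = (c0 + block).min(cols);
            for r in r0..r_end {
                for c in c0..c_end {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
}
```

A tile of 32x32 `f64` values occupies 8 KiB, which is a plausible starting point for an L1-sized block on many CPUs.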

### 6. Algorithm Selection

- **Matrix multiplication**: O(n³) - consider size limits
- **SVD**: Expensive - use only when needed
- **Iterative solvers**: Better for large sparse systems

## Hardware Requirements

### Minimum Requirements

- **CPU**: x86_64 or ARM64 with SIMD support
- **RAM**: 4 GB (8 GB recommended)
- **Disk**: 1 GB for build artifacts

### Recommended Hardware

- **CPU**: Modern multi-core processor (4+ cores)
  - x86_64: AVX2 or AVX-512 support
  - ARM64: NEON support
- **RAM**: 16 GB or more
- **Disk**: SSD for faster compilation

### Performance Expectations

**CPU-bound operations:**
- Matrix multiplication: ~100 GFLOPS on modern CPUs
- FFT: ~1-10 GB/s throughput
- Element-wise operations: Memory bandwidth limited

**Memory-bound operations:**
- Stream bandwidth: 10-100 GB/s (DDR4)
- L1 cache: ~1 TB/s
- L2 cache: ~200 GB/s
- L3 cache: ~100 GB/s
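
To relate a measured time to these bandwidth figures, count the bytes a kernel moves and divide by the elapsed time. The sketch below does this for a STREAM-style triad, `a[i] = b[i] + scalar * c[i]`, assuming two reads and one write of an `f64` per element and ignoring write-allocate traffic. All names are hypothetical helpers.

```rust
use std::mem::size_of;

/// STREAM-style triad kernel: a[i] = b[i] + scalar * c[i].
pub fn triad(a: &mut [f64], b: &[f64], c: &[f64], scalar: f64) {
    assert!(a.len() == b.len() && b.len() == c.len());
    for ((ai, bi), ci) in a.iter_mut().zip(b).zip(c) {
        *ai = bi + scalar * ci;
    }
}

/// Bytes moved by one triad pass over `n` elements (2 reads + 1 write each).
pub fn triad_bytes(n: usize) -> f64 {
    3.0 * n as f64 * size_of::<f64>() as f64
}

/// Achieved bandwidth in GB/s (1 GB = 1e9 bytes).
pub fn bandwidth_gb_per_s(bytes: f64, seconds: f64) -> f64 {
    bytes / seconds / 1e9
}

/// Fraction of a reference peak (e.g. the platform's stream bandwidth).
pub fn fraction_of_peak(achieved: f64, peak: f64) -> f64 {
    achieved / peak
}
```

For example, a triad over one million elements that completes in 1 ms moves 24 MB and therefore achieves about 24 GB/s.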

## Comparison with NumPy

### NumRS2 Advantages

1. **Zero-copy operations**: Views don't allocate
2. **Pure Rust**: No C/Fortran dependencies
3. **Type safety**: Compile-time error checking
4. **Memory safety**: No segfaults or undefined behavior

### Performance Comparison

**Expected relative performance:**

| Operation | NumRS2 vs NumPy |
|-----------|-----------------|
| Matrix multiplication (small) | 0.8-1.2x |
| Matrix multiplication (large) | 0.9-1.1x |
| Element-wise operations | 0.9-1.2x |
| FFT | 0.8-1.0x |
| Statistics | 1.0-1.5x |
| Memory allocation | 1.0-1.3x |

**Notes:**
- NumPy uses MKL/OpenBLAS (highly optimized C/Fortran)
- NumRS2 uses OxiBLAS (pure Rust, actively improving)
- Performance varies by operation and hardware

### Running Comparison Benchmarks

```bash
# NumRS2 benchmarks
cargo bench

# NumPy benchmarks (requires Python setup)
cd bench
python numpy_benchmark.py
```

## Troubleshooting

### Issue: Benchmarks Take Too Long

**Solution 1**: Run a subset of the benchmarks
```bash
cargo bench --bench linalg_benchmarks -- matrix_multiplication
```

**Solution 2**: Reduce sample size (in benchmark code)
```rust
group.sample_size(10);  // Default is 100
```

**Solution 3**: Use quick benchmark mode
```bash
cargo bench -- --quick
```

### Issue: Inconsistent Results

**Possible causes:**
- System load (close other applications)
- CPU frequency scaling (disable for benchmarking)
- Thermal throttling (ensure adequate cooling)
- Background processes (disable antivirus, etc.)

**Solutions:**
```bash
# Linux: Disable CPU frequency scaling
sudo cpupower frequency-set --governor performance

# Check system load
top
htop
```
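
A quick way to quantify how noisy a machine is: time the same operation several times and look at the relative spread. The sketch below computes the coefficient of variation (sample standard deviation divided by the mean). `coefficient_of_variation` is a hypothetical helper, and what counts as "too noisy" is a judgment call rather than a fixed rule.

```rust
/// Coefficient of variation of a set of timings: sample standard
/// deviation divided by the mean. Returns `None` for fewer than two
/// samples or a zero mean, where the ratio is undefined.
pub fn coefficient_of_variation(samples: &[f64]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    if mean == 0.0 {
        return None;
    }
    // Unbiased sample variance (divides by n - 1).
    let variance = samples
        .iter()
        .map(|x| (x - mean).powi(2))
        .sum::<f64>()
        / (n - 1.0);
    Some(variance.sqrt() / mean)
}
```

Timings of 9, 10, and 11 ms give a value of 0.1, i.e. a 10% spread, which would make a 3% change hard to trust without many more samples.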

### Issue: Out of Memory

**Solution 1**: Run smaller benchmarks
```bash
cargo bench --bench memory_benchmarks -- small
```

**Solution 2**: Increase swap space

**Solution 3**: Skip large problem sizes
- Edit benchmark files to reduce maximum sizes

### Issue: Compilation Errors

**Current known issue**: `src/optimize/simulated_annealing.rs` fails to compile because it references a `NumRs2Error::Other` variant that does not exist. This must be fixed before the benchmarks can run.

**Solution**: Fix the error enum issues first:
```bash
# Check error definition
cat src/error/legacy.rs

# Fix simulated_annealing.rs to use correct error variant
```

### Issue: Performance Lower Than Expected

**Check:**
1. **Build mode**: Build with `--release`, or run through `cargo bench`, which uses the optimized `bench` profile
2. **CPU frequency**: Check for thermal throttling
3. **SIMD support**: Verify CPU features
4. **Thread count**: Check `RAYON_NUM_THREADS`
5. **Memory**: Ensure no swapping

**Verify:**
```bash
# Check if release mode
cargo bench --verbose

# Check CPU features
lscpu | grep Flags  # Linux
sysctl -a | grep cpu  # macOS

# Check memory usage
free -h  # Linux
vm_stat  # macOS
```
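
Some of these checks can also be made from inside a Rust program, which is convenient for logging the environment next to the results. The sketch below relies only on the standard library; the function names are hypothetical, and the list of probed CPU features is an illustrative selection rather than the set NumRS2 actually dispatches on.

```rust
use std::env;
use std::thread;

/// True when debug assertions are off, which is the default for the
/// `release` and `bench` profiles unless overridden in Cargo.toml.
pub fn is_optimized_build() -> bool {
    !cfg!(debug_assertions)
}

/// Number of logical CPUs available to this process (falls back to 1).
pub fn logical_threads() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Value of `RAYON_NUM_THREADS`, if it is set in the environment.
pub fn rayon_thread_override() -> Option<String> {
    env::var("RAYON_NUM_THREADS").ok()
}

/// SIMD features detected at runtime on the current CPU.
#[allow(unused_mut)]
pub fn detected_simd_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            found.push("sse2");
        }
        if is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
        if is_x86_feature_detected!("fma") {
            found.push("fma");
        }
        if is_x86_feature_detected!("avx512f") {
            found.push("avx512f");
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            found.push("neon");
        }
    }
    found
}
```

Printing these four values at the start of a benchmark run gives a record of the build mode, thread budget, and SIMD capabilities that produced a given set of numbers.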

## Best Practices

### Before Benchmarking

1. **Close unnecessary applications**
2. **Disable CPU frequency scaling**
3. **Ensure adequate cooling**
4. **Use AC power (laptops)**
5. **Wait for system to stabilize**

### During Benchmarking

1. **Don't use the system**
2. **Monitor temperature**
3. **Save baselines regularly**
4. **Document system configuration**

### After Benchmarking

1. **Archive results**
2. **Compare with previous runs**
3. **Generate reports**
4. **Document findings**

## Performance Regression Testing

### Automated Testing

```bash
# Save baseline before changes
cargo bench -- --save-baseline before

# Make changes...

# Compare with baseline
cargo bench -- --baseline before

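# Criterion reports regressions in its output but does not change the exit
# code for them, so search the captured output instead
cargo bench -- --baseline before 2>&1 | tee bench.log
if grep -q "Performance has regressed" bench.log; then echo "Performance regression detected!"; fi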
```
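
A script that gates on regressions needs a rule for turning the reported change interval into a verdict. The following is a simplified sketch of one such rule, not Criterion's exact logic: it treats a change as real only when the whole confidence interval lies outside a noise band. `Verdict`, `classify_change`, and the noise threshold are hypothetical.

```rust
/// Outcome of comparing a run against its baseline.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Improved,
    Regressed,
    NoChange,
}

/// Classify a change interval given in percent, e.g. (-5.2341, -1.0123).
/// `noise_pct` is the band around zero that is treated as measurement noise.
pub fn classify_change(lower_pct: f64, upper_pct: f64, noise_pct: f64) -> Verdict {
    if lower_pct > noise_pct {
        // Even the optimistic end of the interval is slower than baseline.
        Verdict::Regressed
    } else if upper_pct < -noise_pct {
        // Even the pessimistic end of the interval is faster than baseline.
        Verdict::Improved
    } else {
        Verdict::NoChange
    }
}
```

With a 1% noise band, the interval from the sample output earlier in this guide, `[-5.2341% ... -1.0123%]`, is classified as an improvement, while an interval such as `[-0.5% ... +0.8%]` is treated as no change.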

### CI/CD Integration

Example GitHub Actions workflow:
```yaml
name: Benchmark

on: [push, pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Run benchmarks
        run: cargo bench --no-fail-fast
      - name: Archive results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: target/criterion/
```

## Additional Resources

- [Criterion.rs Documentation](https://bheisler.github.io/criterion.rs/book/)
- [NumRS2 Documentation](https://docs.rs/numrs2)
- [SciRS2 Performance Guide](https://github.com/cool-japan/scirs2/docs/PERFORMANCE.md)
- [Rust Performance Book](https://nnethercote.github.io/perf-book/)

## Contributing

To add new benchmarks:

1. Create benchmark file in `bench/`
2. Add entry in `Cargo.toml`
3. Follow existing patterns
4. Document in this guide
5. Test thoroughly
6. Submit pull request

## License

NumRS2 is licensed under Apache-2.0.

Copyright © 2025 COOLJAPAN OU (Team KitaSan)