1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
use Location;
use crate::;
/// Inserts a dimension of size 1 before the specified index in a shape.
///
/// The `yield_one_before` function takes an existing shape (a slice of `i64` values) and inserts
/// a new dimension of size 1 before the specified index `idx`. This is useful in tensor operations
/// where you need to expand the dimensions of a tensor by adding singleton dimensions, which can
/// facilitate broadcasting or other dimension-specific operations.
///
/// # Parameters
///
/// - `shape`: A slice of `i64` representing the original shape of the tensor.
/// - `idx`: The index before which a new dimension of size 1 will be inserted.
///
/// # Returns
///
/// - A `Vec<i64>` representing the new shape with the inserted dimension of size 1.
///
/// # Examples
///
/// ```rust
/// // Example 1: Insert before the first dimension
/// let shape = vec![3, 4, 5];
/// let idx = 0;
/// let new_shape = yield_one_before(&shape, idx);
/// assert_eq!(new_shape, vec![1, 3, 4, 5]);
///
/// // Example 2: Insert before a middle dimension
/// let idx = 2;
/// let new_shape = yield_one_before(&shape, idx);
/// assert_eq!(new_shape, vec![3, 4, 1, 5]);
///
/// // Example 3: Insert before the last dimension
/// let idx = 2;
/// let new_shape = yield_one_before(&shape, idx);
/// assert_eq!(new_shape, vec![3, 4, 1, 5]);
///
/// // Example 4: Index out of bounds (appends 1 at the end)
/// let idx = 5;
/// let new_shape = yield_one_before(&shape, idx);
/// assert_eq!(new_shape, vec![3, 4, 5, 1]);
/// ```
///
/// # Notes
///
/// - **Index Bounds**: If `idx` is greater than the length of `shape`, the function will append a
/// dimension of size 1 at the end of the shape.
/// - **Use Cases**: Adding a singleton dimension is often used to adjust the shape of a tensor for
/// broadcasting in element-wise operations or to match required input dimensions for certain
/// functions.
/// - **Immutability**: The original `shape` slice is not modified; a new `Vec<i64>` is returned.
///
/// # Implementation Details
///
/// The function works by iterating over the original shape and copying each dimension into a new
/// vector. When the current index matches `idx`, it inserts a `1` before copying the next dimension.
///
/// # See Also
///
/// ```rust
/// fn yield_one_after(shape: &[i64], idx: usize) -> Vec<i64>
/// ```
/// Inserts a `1` into a shape vector immediately after a specified index.
///
/// The `yield_one_after` function takes a slice representing the shape of a tensor and an index,
/// and returns a new shape vector where the value `1` is inserted immediately after the specified index.
/// This is useful for reshaping tensors, especially when you need to add a singleton dimension
/// for broadcasting or other tensor operations.
///
/// # Parameters
///
/// - `shape`: A slice of `i64` representing the original shape of the tensor.
/// - `idx`: A `usize` index after which the value `1` will be inserted into the shape.
///
/// # Returns
///
/// - A `Vec<i64>` representing the new shape with the value `1` inserted after the specified index.
///
/// # Examples
///
/// ```rust
/// // Example 1: Inserting after the first dimension
/// let shape = vec![2, 3, 4];
/// let idx = 0;
/// let new_shape = yield_one_after(&shape, idx);
/// assert_eq!(new_shape, vec![2, 1, 3, 4]);
///
/// // Example 2: Inserting after the second dimension
/// let shape = vec![5, 6, 7];
/// let idx = 1;
/// let new_shape = yield_one_after(&shape, idx);
/// assert_eq!(new_shape, vec![5, 6, 1, 7]);
///
/// // Example 3: Inserting after the last dimension
/// let shape = vec![8, 9];
/// let idx = 1;
/// let new_shape = yield_one_after(&shape, idx);
/// assert_eq!(new_shape, vec![8, 9, 1]);
/// ```
///
/// # Notes
///
/// - **Index Bounds**: The `idx` parameter must be less than or equal to `shape.len() - 1`.
/// - If `idx` is equal to `shape.len() - 1`, the `1` will be appended at the end of the shape vector.
/// - If `idx` is greater than `shape.len() - 1`, the function will panic due to an out-of-bounds index.
/// - **Non-mutating**: The function does not modify the original `shape` slice; it returns a new `Vec<i64>`.
///
/// # Use Cases
///
/// - **Adding a Dimension**: Useful when you need to add a singleton dimension to a tensor for operations like broadcasting.
/// - **Reshaping Tensors**: Helps in reshaping tensors to match required dimensions for certain mathematical operations.
///
/// # Edge Cases
///
/// - **Empty Shape**: If the `shape` slice is empty, the function will panic if `idx` is not zero.
/// ```rust
/// let shape: Vec<i64> = vec![];
/// let idx = 0;
/// let new_shape = yield_one_after(&shape, idx);
/// assert_eq!(new_shape, vec![1]); // Inserts `1` at position 0
/// ```
///
/// # Panics
///
/// - The function will panic if `idx` is greater than `shape.len()`.
///
/// # See Also
///
/// ```rust
/// fn yield_one_before(shape: &[i64], idx: usize) -> Vec<i64>
/// ```
/// Pads a shape with ones on the left to reach a specified length.
///
/// The `try_pad_shape` function takes an existing shape (a slice of `i64` values) and pads it with
/// ones on the left side to ensure the shape has the desired length. If the existing shape's length
/// is already equal to or greater than the desired length, the function returns the shape as is.
///
/// This is particularly useful in tensor operations where broadcasting rules require shapes to have
/// the same number of dimensions.
///
/// # Parameters
///
/// - `shape`: A slice of `i64` representing the original shape of the tensor.
/// - `length`: The desired length of the shape after padding.
///
/// # Returns
///
/// - A `Vec<i64>` representing the new shape, padded with ones on the left if necessary.
///
/// # Examples
///
/// ```rust
/// // Example 1: Padding is needed
/// let shape = vec![3, 4];
/// let padded_shape = try_pad_shape(&shape, 4);
/// assert_eq!(padded_shape, vec![1, 1, 3, 4]);
///
/// // Example 2: No padding is needed
/// let shape = vec![2, 3, 4];
/// let padded_shape = try_pad_shape(&shape, 2);
/// assert_eq!(padded_shape, vec![2, 3, 4]); // Shape is returned as is
/// ```
///
/// # Notes
///
/// - **Left Padding**: The function pads the shape with ones on the left side (i.e., it adds new
/// dimensions to the beginning of the shape).
/// - **Use Case**: This is useful for aligning shapes in operations that require input tensors to have
/// the same number of dimensions, such as broadcasting in tensor computations.
///
/// # Implementation Details
///
/// - **Length Check**: The function first checks if the desired `length` is less than or equal to the
/// current length of `shape`. If so, it returns a copy of `shape` as is.
/// - **Padding Logic**: If padding is needed, it creates a new vector filled with ones of size `length`.
/// It then copies the original shape's elements into the rightmost positions of this new vector,
/// effectively padding the left side with ones.
///
/// # Edge Cases
///
/// - If `length` is zero, the function returns an empty vector.
/// - If `shape` is empty and `length` is greater than zero, the function returns a vector of ones
/// with the specified `length`.
///
/// # See Also
///
/// - Functions that handle shape manipulation and broadcasting in tensor operations.
///
/// # Example Usage in Context
///
/// ```rust
/// // Assume we have two tensors with shapes [3, 4] and [4].
/// // To perform element-wise operations, we need to align their shapes.
/// let a_shape = vec![3, 4];
/// let b_shape = vec![4];
///
/// // Pad the smaller shape to match the number of dimensions.
/// let padded_b_shape = try_pad_shape(&b_shape, a_shape.len());
/// assert_eq!(padded_b_shape, vec![1, 4]);
///
/// // Now both shapes have the same number of dimensions and can be broadcast together.
/// ```
/// pad shape to the shortter one, this is used for prepareing for matmul broadcast.
///
/// possibly we can make it works in more generic cases not only matmul
/// pad shape and strides to the shortter one, this is used for prepareing for matmul broadcast.
///
/// possibly we can make it works in more generic cases not only matmul
/// Predicts the broadcasted shape resulting from broadcasting two arrays.
///
/// The `predict_broadcast_shape` function computes the resulting shape when two arrays with shapes
/// `a_shape` and `b_shape` are broadcast together. Broadcasting is a technique that allows arrays of
/// different shapes to be used together in arithmetic operations by "stretching" one or both arrays
/// so that they have compatible shapes.
///
/// # Parameters
///
/// - `a_shape`: A slice of `i64` representing the shape of the first array.
/// - `b_shape`: A slice of `i64` representing the shape of the second array.
///
/// # Returns
///
/// - `Ok(Shape)`: The resulting broadcasted shape as a `Shape` object if broadcasting is possible.
/// - `Err(anyhow::Error)`: An error if the shapes cannot be broadcast together.
///
/// # Broadcasting Rules
///
/// The broadcasting rules determine how two arrays of different shapes can be broadcast together:
///
/// 1. **Alignment**: The shapes are right-aligned, meaning that the last dimensions are compared first.
/// If one shape has fewer dimensions, it is left-padded with ones to match the other shape's length.
///
/// 2. **Dimension Compatibility**: For each dimension from the last to the first:
/// - If the dimensions are equal, they are compatible.
/// - If one of the dimensions is 1, the array in that dimension can be broadcast to match the other dimension.
/// - If the dimensions are not equal and neither is 1, broadcasting is not possible.
///
/// # Example
///
/// ```rust
/// // Assuming Shape and the necessary imports are defined appropriately.
///
/// let a_shape = &[8, 1, 6, 1];
/// let b_shape = &[7, 1, 5];
///
/// match predict_broadcast_shape(a_shape, b_shape) {
/// Ok(result_shape) => {
/// assert_eq!(result_shape, Shape::from(vec![8, 7, 6, 5]));
/// println!("Broadcasted shape: {:?}", result_shape);
/// },
/// Err(e) => {
/// println!("Error: {}", e);
/// },
/// }
/// ```
///
/// In this example:
///
/// - `a_shape` has shape `[8, 1, 6, 1]`.
/// - `b_shape` has shape `[7, 1, 5]`.
/// - After padding `b_shape` to `[1, 7, 1, 5]`, the shapes are compared element-wise from the last dimension.
/// - The resulting broadcasted shape is `[8, 7, 6, 5]`.
///
/// # Notes
///
/// - The function assumes that shapes are represented as slices of `i64`.
/// - The function uses a helper function `try_pad_shape` to pad the shorter shape with ones on the left.
/// - If broadcasting is not possible, the function returns an error indicating the dimension at which the incompatibility occurs.
///
/// # Errors
///
/// - Returns an error if at any dimension the sizes differ and neither is 1, indicating that broadcasting cannot be performed.
///
/// # Implementation Details
///
/// - The function first determines which of the two shapes is longer and which is shorter.
/// - The shorter shape is padded on the left with ones to match the length of the longer shape.
/// - It then iterates over the dimensions, comparing corresponding dimensions from each shape:
/// - If the dimensions are equal or one of them is 1, the resulting dimension is set to the maximum of the two.
/// - If neither condition is met, an error is returned.
/// Determines the axes along which broadcasting is required to match a desired result shape.
///
/// The `get_broadcast_axes_from` function computes the indices of axes along which the input array `a`
/// needs to be broadcasted to match the target shape `res_shape`. Broadcasting is a method used in
/// tensor operations to allow arrays of different shapes to be used together in arithmetic operations.
///
/// **Note**: This function is adapted from NumPy's broadcasting rules and implementation.
///
/// # Parameters
///
/// - `a_shape`: A slice of `i64` representing the shape of the input array `a`.
/// - `res_shape`: A slice of `i64` representing the desired result shape after broadcasting.
/// - `location`: A `Location` object indicating the source code location for error reporting.
///
/// # Returns
///
/// - `Ok(Vec<usize>)`: A vector containing the indices of the axes along which broadcasting occurs.
/// - `Err(anyhow::Error)`: An error if broadcasting is not possible due to incompatible shapes.
///
/// # Broadcasting Rules
///
/// Broadcasting follows specific rules to align arrays of different shapes:
///
/// 1. **Left Padding**: If the input array `a_shape` has fewer dimensions than `res_shape`, it is left-padded
/// with ones to match the number of dimensions of `res_shape`.
///
/// 2. **Dimension Compatibility**: For each dimension from the most significant (leftmost) to the least significant
/// (rightmost):
/// - If the dimension sizes are equal, no broadcasting is needed for that axis.
/// - If the dimension size in `a_shape` is 1 and in `res_shape` is greater than 1, broadcasting occurs along that axis.
/// - If the dimension size in `res_shape` is 1 and in `a_shape` is greater than 1, broadcasting is not possible,
/// and an error is returned.
///
/// 3. **Collecting Broadcast Axes**: The axes where broadcasting occurs are collected and returned.
///
/// # Example
///
/// ```rust
/// use anyhow::Result;
/// // Assuming `get_broadcast_axes_from` and `Location` are defined appropriately
///
/// fn main() -> Result<()> {
/// let a_shape = &[3, 1];
/// let res_shape = &[3, 4];
/// let location = Location::new("module_name", "function_name");
///
/// let axes = get_broadcast_axes_from(a_shape, res_shape, location)?;
/// assert_eq!(axes, vec![1]);
///
/// println!("Broadcast axes: {:?}", axes);
/// Ok(())
/// }
/// ```
///
/// In this example:
///
/// - The input array has shape `[3, 1]`.
/// - The desired result shape is `[3, 4]`.
/// - Broadcasting occurs along axis `1`, so the function returns `vec![1]`.
///
/// # Notes
///
/// - **Padding Shapes**: If `a_shape` has fewer dimensions than `res_shape`, it is padded on the left with ones
/// to align the dimensions.
///
/// - **Axes Indices**: The axes indices are zero-based and correspond to the dimensions of the padded `a_shape`.
///
/// - **Error Handling**: If broadcasting is not possible due to incompatible dimensions, the function returns an error
/// using `ErrHandler::BroadcastError`, providing detailed information about the mismatch.
///
/// - **Implementation Details**:
/// - The function first calculates the difference in the number of dimensions and pads `a_shape` accordingly.
/// - It then iterates over the dimensions to identify axes where broadcasting is needed or not possible.
///
/// # Errors
///
/// - Returns an error if any dimension in `res_shape` is `1` while the corresponding dimension in `a_shape` is
/// greater than `1`, as broadcasting cannot be performed in this case.
// This file contains code translated from NumPy (https://github.com/numpy/numpy)
// Original work Copyright (c) 2005-2025, NumPy Developers
// Modified work Copyright (c) 2025 hpt Contributors
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following
// disclaimer in the documentation and/or other materials provided
// with the distribution.
// * Neither the name of the NumPy Developers nor the names of any
// contributors may be used to endorse or promote products derived
// from this software without specific prior written permission.
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// This Rust port is additionally licensed under Apache-2.0 OR MIT
// See repository root for details
/// Attempt to reshape an array without copying data.
/// Translated from NumPy's _attempt_nocopy_reshape function.
/// Generates intervals for multi-threaded processing by dividing the outer loop into chunks.
///
/// The `mt_intervals` function divides a large outer loop into multiple smaller intervals to be
/// processed by multiple threads. The function aims to distribute the workload as evenly as possible
/// among the available threads, handling cases where the total number of iterations is not perfectly
/// divisible by the number of threads.
///
/// # Parameters
///
/// - `outer_loop_size`: The total number of iterations in the outer loop.
/// - `num_threads`: The number of threads to divide the work among.
///
/// # Returns
///
/// A `Vec` of tuples `(usize, usize)`, where each tuple represents the start (inclusive) and end
/// (exclusive) indices of the interval assigned to each thread.
///
/// # Algorithm Overview
///
/// 1. **Calculate Base Workload**: Each thread is assigned at least `outer_loop_size / num_threads` iterations.
/// 2. **Distribute Remainder**: If `outer_loop_size` is not divisible by `num_threads`, the remaining iterations
/// (`outer_loop_size % num_threads`) are distributed one by one to the first few threads.
/// 3. **Calculate Start and End Indices**:
/// - The `start_index` for each thread `i` is calculated as:
/// ```
/// i * (outer_loop_size / num_threads) + min(i, outer_loop_size % num_threads)
/// ```
/// - The `end_index` is then calculated by adding the base workload and an extra iteration if the thread
/// received an extra iteration from the remainder.
///
/// # Examples
///
/// ```rust
/// fn main() {
/// let outer_loop_size = 10;
/// let num_threads = 3;
///
/// let intervals = mt_intervals(outer_loop_size, num_threads);
///
/// for (i, (start, end)) in intervals.iter().enumerate() {
/// println!("Thread {}: Processing indices [{}..{})", i, start, end);
/// }
/// }
/// ```
///
/// Output:
///
/// ```text
/// Thread 0: Processing indices [0..4)
/// Thread 1: Processing indices [4..7)
/// Thread 2: Processing indices [7..10)
/// ```
///
/// In this example:
/// - The total number of iterations is 10.
/// - The number of threads is 3.
/// - Each thread gets at least `10 / 3 = 3` iterations.
/// - The remainder is `10 % 3 = 1`. So, the first thread gets one extra iteration.
///
/// # Notes
///
/// - **Workload Balance**: The function ensures that the workload is distributed as evenly as possible.
/// - **Integer Division**: Since integer division truncates towards zero, the remainder is used to distribute
/// the extra iterations.
/// - **Index Calculation**: The calculation uses `std::cmp::min` to ensure that only the first `remainder` threads
/// receive the extra iteration.
///
/// # Function Definition
///
/// ```rust
/// pub fn mt_intervals(outer_loop_size: usize, num_threads: usize) -> Vec<(usize, usize)> {
/// let mut intervals = Vec::with_capacity(num_threads);
/// for i in 0..num_threads {
/// let start_index = i * (outer_loop_size / num_threads)
/// + std::cmp::min(i, outer_loop_size % num_threads);
/// let end_index = start_index
/// + outer_loop_size / num_threads
/// + ((i < outer_loop_size % num_threads) as usize);
/// intervals.push((start_index, end_index));
/// }
/// intervals
/// }
/// ```
///
/// # Unit Tests
///
/// Here are some unit tests to verify the correctness of the function:
///
/// ```rust
/// #[cfg(test)]
/// mod tests {
/// use super::*;
///
/// #[test]
/// fn test_even_division() {
/// let intervals = mt_intervals(100, 4);
/// assert_eq!(intervals.len(), 4);
/// assert_eq!(intervals[0], (0, 25));
/// assert_eq!(intervals[1], (25, 50));
/// assert_eq!(intervals[2], (50, 75));
/// assert_eq!(intervals[3], (75, 100));
/// }
///
/// #[test]
/// fn test_uneven_division() {
/// let intervals = mt_intervals(10, 3);
/// assert_eq!(intervals.len(), 3);
/// assert_eq!(intervals[0], (0, 4));
/// assert_eq!(intervals[1], (4, 7));
/// assert_eq!(intervals[2], (7, 10));
/// }
///
/// #[test]
/// fn test_more_threads_than_work() {
/// let intervals = mt_intervals(5, 10);
/// assert_eq!(intervals.len(), 10);
/// assert_eq!(intervals[0], (0, 1));
/// assert_eq!(intervals[1], (1, 2));
/// assert_eq!(intervals[2], (2, 3));
/// assert_eq!(intervals[3], (3, 4));
/// assert_eq!(intervals[4], (4, 5));
/// for i in 5..10 {
/// assert_eq!(intervals[i], (5, 5));
/// }
/// }
///
/// #[test]
/// fn test_zero_iterations() {
/// let intervals = mt_intervals(0, 4);
/// assert_eq!(intervals.len(), 4);
/// for &(start, end) in &intervals {
/// assert_eq!(start, 0);
/// assert_eq!(end, 0);
/// }
/// }
///
/// #[test]
/// fn test_zero_threads() {
/// let intervals = mt_intervals(10, 0);
/// assert_eq!(intervals.len(), 0);
/// }
/// }
/// ```
///
/// # Caveats
///
/// - If `num_threads` is zero, the function will return an empty vector.
/// - If `outer_loop_size` is zero, all intervals will have start and end indices of zero.
///
/// # Performance Considerations
///
/// - **Allocation**: The function pre-allocates the vector with capacity `num_threads`.
/// - **Integer Operations**: The function uses integer division and modulo operations, which are efficient.
///
/// # Conclusion
///
/// The `mt_intervals` function is useful for dividing work among multiple threads in a balanced way, ensuring that
/// each thread gets a fair share of the workload, even when the total number of iterations is not perfectly divisible
/// by the number of threads.
/// Generates intervals for multi-threaded SIMD processing by dividing the outer loop into chunks.
///
/// The `mt_intervals_simd` function divides a large outer loop into multiple smaller intervals
/// to be processed by multiple threads. Each interval is aligned with the SIMD vector size to
/// optimize performance. This ensures that each thread processes a chunk of data that is a
/// multiple of the SIMD vector size, which is beneficial for vectorized operations.
///
/// # Parameters
///
/// - `outer_loop_size`: The total size of the outer loop (number of iterations).
/// - `num_threads`: The desired number of threads to use for processing.
/// - `vec_size`: The size of the SIMD vector (number of elements processed in one SIMD operation).
///
/// # Returns
///
/// A `Vec` of tuples `(usize, usize)`, where each tuple represents the start (inclusive) and
/// end (exclusive) indices of the interval assigned to a thread.
///
/// # Algorithm Overview
///
/// 1. **Determine Maximum Threads**: Calculate `max_threads` as `outer_loop_size / vec_size` to
/// ensure each thread has at least one full SIMD vector's worth of work.
/// 2. **Adjust Thread Count**: Set `actual_threads` to the minimum of `num_threads` and
/// `max_threads` to avoid creating more threads than necessary.
/// 3. **Calculate Base Block Count and Remainder**:
/// - `base_block_count` is the number of full blocks each thread will process.
/// - `remainder` is the number of remaining blocks that couldn't be evenly divided.
/// 4. **Assign Intervals to Threads**:
/// - Distribute the extra blocks from the remainder among the first `remainder` threads.
/// - Calculate `start_index` and `end_index` for each thread accordingly.
///
/// # Examples
///
/// ```rust
/// fn main() {
/// let outer_loop_size = 1000;
/// let num_threads = 4;
/// let vec_size = 8;
///
/// let intervals = mt_intervals_simd(outer_loop_size, num_threads, vec_size);
///
/// for (i, (start, end)) in intervals.iter().enumerate() {
/// println!("Thread {}: Processing indices [{}..{})", i, start, end);
/// }
/// }
/// ```
///
/// Output might be:
///
/// ```text
/// Thread 0: Processing indices [0..200)
/// Thread 1: Processing indices [200..400)
/// Thread 2: Processing indices [400..600)
/// Thread 3: Processing indices [600..800)
/// ```
///
/// # Notes
///
/// - **Data Alignment**: The function ensures that each interval's size is a multiple of `vec_size`
/// to maintain data alignment for SIMD operations.
/// - **Load Balancing**: Extra iterations resulting from the remainder are distributed among the
/// first few threads to balance the workload.
///
/// # Panics
///
/// The function does not explicitly panic, but providing a `vec_size` of zero will result in a
/// division by zero error.
///
/// # See Also
///
/// - SIMD (Single Instruction, Multiple Data) processing.
/// - Multi-threading in Rust.
///
/// # Caveats
///
/// - Ensure that `vec_size` is not zero to avoid division by zero errors.
/// - The function assumes that `outer_loop_size`, `num_threads`, and `vec_size` are positive integers.
///
/// # Performance Considerations
///
/// - **Thread Overhead**: Creating too many threads may introduce overhead. The function limits the
/// number of threads to the maximum useful amount based on `outer_loop_size` and `vec_size`.
/// - **SIMD Efficiency**: Aligning intervals to `vec_size` improves SIMD efficiency by preventing
/// partial vector loads and stores.
///
/// # Conclusion
///
/// The `mt_intervals_simd` function is useful for parallelizing loops in applications that benefit
/// from both multi-threading and SIMD vectorization. By carefully dividing the work into appropriately
/// sized intervals, it helps maximize performance on modern CPUs.