# CubeCL Linear Algebra Library
The crate contains common linear algebra algorithms.
## Algorithms
- [X] Tiling 2D Matrix Multiplication.
The kernel is flexible and can run on virtually any hardware (see the tiling sketch after this list).
- [X] Cooperative Matrix Multiplication.
The kernel uses Automatic Mixed Precision (AMP) to leverage cooperative matrix-multiply-and-accumulate instructions.
For `f32` tensors, the inputs are cast to `f16`, but the accumulation is still performed in `f32`.
This trades a small loss of precision for much faster execution (see the mixed-precision sketch after this list).
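As a rough illustration of the tiling strategy, here is a minimal CPU-side sketch in plain Rust. It is not the GPU kernel from this crate: it only shows the blocking idea, where the matrices are processed in `BLOCK x BLOCK` tiles so each tile is reused while it is hot in cache, mirroring how a GPU kernel reuses tiles from shared memory. The names (`tiled_matmul`, `BLOCK`) are illustrative and not part of the crate's API.

```rust
/// Illustrative CPU sketch of tiled (blocked) matrix multiplication.
const BLOCK: usize = 32;

/// Computes `c += a * b` for row-major `m x k` and `k x n` matrices.
fn tiled_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i0 in (0..m).step_by(BLOCK) {
        for j0 in (0..n).step_by(BLOCK) {
            for p0 in (0..k).step_by(BLOCK) {
                // Multiply one BLOCK x BLOCK tile pair and accumulate into `c`.
                for i in i0..(i0 + BLOCK).min(m) {
                    for j in j0..(j0 + BLOCK).min(n) {
                        let mut acc = 0.0f32;
                        for p in p0..(p0 + BLOCK).min(k) {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] += acc;
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, k, n) = (64, 64, 64);
    let a = vec![1.0f32; m * k];
    let b = vec![2.0f32; k * n];
    let mut c = vec![0.0f32; m * n];
    tiled_matmul(&a, &b, &mut c, m, k, n);
    // Each output element is a dot product of k ones with k twos.
    assert!((c[0] - 2.0 * k as f32).abs() < 1e-3);
    println!("c[0] = {}", c[0]);
}
```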
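To make the precision trade-off concrete, here is a small CPU-side sketch of the AMP scheme, assuming the `half` crate for an `f16` type (it is not a dependency of this crate): the operands are rounded to `f16`, while the running sum stays in `f32`. The function names are illustrative, not part of the crate's API.

```rust
// Illustrative only: a dot product mimicking the AMP scheme of the CMMA kernel,
// with `f16` inputs and `f32` accumulation. Assumes the `half` crate.
use half::f16;

/// Dot product with operands rounded to `f16`, accumulated in `f32`.
fn dot_amp(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            // Cast each operand down to half precision, as the CMMA path
            // does for `f32` tensors.
            let xh = f16::from_f32(x);
            let yh = f16::from_f32(y);
            // Multiply and accumulate in `f32`, so the running sum does not
            // lose further precision beyond the input rounding.
            xh.to_f32() * yh.to_f32()
        })
        .sum()
}

/// Reference dot product entirely in `f32`.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

fn main() {
    let a: Vec<f32> = (0..1024).map(|i| 1.0 + i as f32 * 1e-4).collect();
    let b: Vec<f32> = (0..1024).map(|i| 2.0 - i as f32 * 1e-4).collect();
    let exact = dot_f32(&a, &b);
    let amp = dot_amp(&a, &b);
    // The two results differ slightly: the error comes from rounding the
    // inputs to `f16`, not from the accumulation.
    println!("f32: {exact}");
    println!("amp: {amp} (relative error {:.2e})", ((amp - exact) / exact).abs());
}
```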
## Benchmarks
You can run the benchmarks from the workspace with the following:
```bash
cargo bench --bench matmul --features wgpu # for wgpu
cargo bench --bench matmul --features cuda # for cuda
```
On an RTX 3070, we get the following results:
```
matmul-wgpu-f32-tiling2d
―――――――― Result ―――――――――
Samples 100
Mean 13.289ms
Variance 28.000ns
Median 13.271ms
Min 12.582ms
Max 13.768ms
―――――――――――――――――――――――――
matmul-cuda-f32-tiling2d
―――――――― Result ―――――――――
Samples 100
Mean 12.754ms
Variance 93.000ns
Median 12.647ms
Min 12.393ms
Max 14.501ms
―――――――――――――――――――――――――
matmul-cuda-f32-cmma
―――――――― Result ―――――――――
Samples 100
Mean 4.996ms
Variance 35.000ns
Median 5.084ms
Min 4.304ms
Max 5.155ms
―――――――――――――――――――――――――
```