# CubeCL Linear Algebra Library
The crate contains common linear algebra algorithms.
## Algorithms
- [X] Tiling 2D Matrix Multiplication.
The kernel is flexible and can run on virtually any hardware (see the tiling sketch after this list).
- [X] Cooperative Matrix Multiplication.
The kernel uses Automatic Mixed Precision (AMP) to leverage cooperative matrix-multiply-and-accumulate instructions.
For `f32` tensors, the inputs are cast to `f16`, but the accumulation is still performed in `f32`.
This trades a small loss of precision for much faster execution (see the mixed-precision sketch after this list).
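As a rough illustration of the tiling strategy, here is a minimal CPU-side sketch in plain Rust. It is not the GPU kernel from this crate: it only shows the blocking idea, where the matrices are processed in `BLOCK x BLOCK` tiles so each tile is reused while it is hot in cache, mirroring how a GPU kernel reuses tiles from shared memory. The names (`tiled_matmul`, `BLOCK`) are illustrative and not part of the crate's API.

```rust
/// Illustrative CPU sketch of tiled (blocked) matrix multiplication.
const BLOCK: usize = 32;

/// Computes `c += a * b` for row-major `m x k` and `k x n` matrices.
fn tiled_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i0 in (0..m).step_by(BLOCK) {
        for j0 in (0..n).step_by(BLOCK) {
            for p0 in (0..k).step_by(BLOCK) {
                // Multiply one BLOCK x BLOCK tile pair and accumulate into `c`.
                for i in i0..(i0 + BLOCK).min(m) {
                    for j in j0..(j0 + BLOCK).min(n) {
                        let mut acc = 0.0f32;
                        for p in p0..(p0 + BLOCK).min(k) {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] += acc;
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, k, n) = (64, 64, 64);
    let a = vec![1.0f32; m * k];
    let b = vec![2.0f32; k * n];
    let mut c = vec![0.0f32; m * n];
    tiled_matmul(&a, &b, &mut c, m, k, n);
    // Each output element is a dot product of k ones with k twos.
    assert!((c[0] - 2.0 * k as f32).abs() < 1e-3);
    println!("c[0] = {}", c[0]);
}
```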
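To make the precision trade-off concrete, here is a small CPU-side sketch of the AMP scheme, assuming the `half` crate for an `f16` type (it is not a dependency of this crate): the operands are rounded to `f16`, while the running sum stays in `f32`. The function names are illustrative, not part of the crate's API.

```rust
// Illustrative only: a dot product mimicking the AMP scheme of the CMMA kernel,
// with `f16` inputs and `f32` accumulation. Assumes the `half` crate.
use half::f16;

/// Dot product with operands rounded to `f16`, accumulated in `f32`.
fn dot_amp(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            // Cast each operand down to half precision, as the CMMA path
            // does for `f32` tensors.
            let xh = f16::from_f32(x);
            let yh = f16::from_f32(y);
            // Multiply and accumulate in `f32`, so the running sum does not
            // lose further precision beyond the input rounding.
            xh.to_f32() * yh.to_f32()
        })
        .sum()
}

/// Reference dot product entirely in `f32`.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

fn main() {
    let a: Vec<f32> = (0..1024).map(|i| 1.0 + i as f32 * 1e-4).collect();
    let b: Vec<f32> = (0..1024).map(|i| 2.0 - i as f32 * 1e-4).collect();
    let exact = dot_f32(&a, &b);
    let amp = dot_amp(&a, &b);
    // The two results differ slightly: the error comes from rounding the
    // inputs to `f16`, not from the accumulation.
    println!("f32: {exact}");
    println!("amp: {amp} (relative error {:.2e})", ((amp - exact) / exact).abs());
}
```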
## Benchmarks
You can run the benchmarks from the workspace with the following:
```bash
cargo bench --bench matmul --features wgpu # for wgpu
cargo bench --bench matmul --features cuda # for cuda
```
On an RTX 3070, we get the following results:
```
matmul-wgpu-f32-tiling2d
―――――――― Result ―――――――――
Samples 100
Mean 13.289ms
Variance 28.000ns
Median 13.271ms
Min 12.582ms
Max 13.768ms
―――――――――――――――――――――――――
matmul-cuda-f32-tiling2d
―――――――― Result ―――――――――
Samples 100
Mean 12.754ms
Variance 93.000ns
Median 12.647ms
Min 12.393ms
Max 14.501ms
―――――――――――――――――――――――――
matmul-cuda-f32-cmma
―――――――― Result ―――――――――
Samples 100
Mean 4.996ms
Variance 35.000ns
Median 5.084ms
Min 4.304ms
Max 5.155ms
―――――――――――――――――――――――――
```