CubeK: high-performance multi-platform kernels in CubeCL
Algorithms
| Algorithms | Variants |
|---|---|
| Random | bernoulli normal uniform |
| Quantization | symmetric per-block per-tensor q2 q4 q8 fp4 |
| Reduction | mean sum prod max min arg[max\|min] per-cube per-plane |
| Matmul | mma unit tma multi-stage specialization ordered multi-rows |
| Convolution | mma unit tma multi-stage im2col |
| Attention | mma unit multi-rows |
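
To make the "symmetric per-tensor q8" entry in the table concrete, here is a minimal CPU reference sketch in plain Rust. It is not CubeK's kernel or API: the function names `quantize_symmetric_q8` and `dequantize_q8` are hypothetical, and the choice of mapping the largest magnitude to 127 is one common convention assumed for illustration.

```rust
/// Symmetric per-tensor quantization to 8-bit integers (illustrative sketch,
/// not the CubeK implementation). Returns the quantized values and the scale.
fn quantize_symmetric_q8(values: &[f32]) -> (Vec<i8>, f32) {
    // One scale for the whole tensor: the largest magnitude maps to 127.
    let max_abs = values.iter().fold(0.0f32, |acc, v| acc.max(v.abs()));
    // Avoid dividing by zero when the tensor is all zeros.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let quantized = values
        .iter()
        // Symmetric: no zero-point, so 0.0 always maps to 0.
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Inverse mapping: multiply each quantized value by the shared scale.
fn dequantize_q8(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```

Calling `quantize_symmetric_q8(&[1.0, -1.0, 0.0])` yields the quantized values `[127, -127, 0]`. A per-block variant would apply the same computation independently to fixed-size chunks of the tensor, storing one scale per chunk.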
Contributing
If you want to contribute new kernels, please read GUIDE.md.