CubeK: high-performance multi-platform kernels in CubeCL

Algorithms

| Algorithms | Variants |
| --- | --- |
| Random | `bernoulli` `normal` `uniform` |
| Quantization | `symmetric` `per-block` `per-tensor` `q2` `q4` `q8` `fp4` |
| Reduction | `mean` `sum` `prod` `max` `min` `arg[max\|min]` `per-cube` `per-plane` |
| Matmul | `mma` `unit` `tma` `multi-stage` `specialization` `ordered` `multi-rows` |
| Convolution | `mma` `unit` `tma` `multi-stage` `im2col` |
| Attention | `mma` `unit` `multi-rows` |

Contributing

If you want to contribute new kernels, please read GUIDE.md.

Running tests

Note: This applies to most kernels, but reduce currently works slightly differently; see its README.

Test suites

Four test suites are available:

Light test suite: a tractable subset of representative tests that runs on CI.
Basic test suite: extends the light suite with tests that are considered basic but may hang on CI (they are slow on CPU).
Extended test suite: usually auto-generated combinatorial tests covering many configurations. Good to run when developing kernels. Normally kept tractable.
Full test suite: all generable test combinations; may be too large to compile or run practically.

Run tests with

# Replace <runtime> with cpu, cuda, rocm, wgpu, vulkan or metal

# Basic test suite (light test suite on cpu)
cargo test-<runtime>

# Extended test suite
cargo test-<runtime>-extended

# Full test suite
cargo test-<runtime>-full
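
The alias names above follow a regular pattern: `test-<runtime>`, optionally followed by `-extended` or `-full`. As a minimal sketch of that pattern, the hypothetical helper below (`cube_test_cmd` is not part of CubeK) prints the command for a given runtime and suite. It only echoes the command and does not run it.

```shell
# Hypothetical helper (an assumption, not part of CubeK): prints the cargo
# alias for a runtime and an optional suite, following the naming above.
cube_test_cmd() {
  runtime="$1"
  suite="${2:-}"   # empty (basic), "extended", or "full"

  # Accept only the runtimes listed in this README.
  case "$runtime" in
    cpu|cuda|rocm|wgpu|vulkan|metal) ;;
    *) echo "unknown runtime: $runtime" >&2; return 1 ;;
  esac

  # No suite argument selects the basic command; otherwise append the suffix.
  case "$suite" in
    "") echo "cargo test-$runtime" ;;
    extended|full) echo "cargo test-$runtime-$suite" ;;
    *) echo "unknown suite: $suite" >&2; return 1 ;;
  esac
}

cube_test_cmd cuda            # prints: cargo test-cuda
cube_test_cmd wgpu extended   # prints: cargo test-wgpu-extended
```

Rejecting unknown runtimes and suites up front avoids passing a mistyped alias name to cargo.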

Cube test mode

You can control test behavior by setting the CUBE_TEST_MODE environment variable.
For more details, see Test Mode.
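
One way to set the variable for a single test run without changing the rest of your shell session is a small wrapper around `env`. This is a sketch only: `with_test_mode` is a hypothetical helper, not part of CubeK, and `<mode>` in the usage comment is a placeholder for one of the values described under Modes.

```shell
# Hypothetical wrapper (an assumption, not part of CubeK): runs a command
# with CUBE_TEST_MODE set for that command only, so the variable is not
# exported into the calling shell.
with_test_mode() {
  mode="$1"
  shift
  env CUBE_TEST_MODE="$mode" "$@"
}

# Usage (replace <mode> with a value from the Modes section):
#   with_test_mode <mode> cargo test-cuda
```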

Modes