# mistralrs-quant
An advanced and highly diverse set of quantization techniques. This crate supports both quantization and optimized inference.
It has grown beyond simple quantization and is used by mistral.rs to power:
- ISQ (in situ quantization; see the sketch after this list)
- Imatrix collection
- General quantization features
- CUDA- and Metal-specific features
- cuBLASLt integration
Currently supported (each format implements a common linear-layer interface; see the sketch after this list):
- GGUF: `GgufMatMul` (2-8 bit quantization, with imatrix)
- GPTQ/AWQ: `GptqAwqLayer` (with the CUDA Marlin kernel)
- HQQ: `HqqLayer` (4, 8 bit quantization)
- FP8: `FP8Linear` (optimized on CUDA)
- Unquantized (used for ISQ): `UnquantLinear`
- bitsandbytes: `BnbLinear` (int8, fp4, nf4)
Some kernels are copied from or based on implementations in: