# mistralrs-quant
An advanced and highly diverse set of quantization techniques. This crate supports both quantization and optimized inference.
It has grown beyond simple quantization and is used by mistral.rs to power:
- ISQ (in situ quantization; see the sketch after this list)
- Imatrix collection
- General quantization features
- CUDA- and Metal-specific features
- cuBLASLt integration
Currently supported (each format implements a common linear-layer interface; see the sketch after this list):
- GGUF: `GgufMatMul` (2-8 bit quantization, with imatrix)
- GPTQ/AWQ: `GptqAwqLayer` (with the CUDA Marlin kernel)
- HQQ: `HqqLayer` (4, 8 bit quantization)
- FP8: `FP8Linear` (optimized on CUDA)
- Unquantized (used for ISQ): `UnquantLinear`
- bitsandbytes: `BnbLinear` (int8, fp4, nf4)
Some kernels are copied from or based on implementations in: