Quantization types and parameters for converting models to lower-bit precisions.
§Quick start
use llama_cpp_4::quantize::{LlamaFtype, QuantizeParams};

let params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)
    .with_nthread(8)
    .with_quantize_output_tensor(true);
llama_cpp_4::model_quantize("model-f16.gguf", "model-q4km.gguf", &params).unwrap();

§TurboQuant – attention rotation (PR #21038)
llama.cpp applies a Hadamard rotation to Q/K/V tensors before writing them into the KV cache.
This significantly improves KV-cache quantization quality at near-zero cost, and is enabled by
default for every model whose head dimension is a power of two. You can opt out per-context
with LlamaContextParams::with_attn_rot_disabled or globally with
set_attn_rot_disabled.
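To see why the rotation helps, here is a minimal, self-contained sketch of a normalized Walsh–Hadamard rotation on a single head vector. This is an illustration of the underlying technique, not the crate's implementation: an orthonormal Hadamard rotation preserves the vector's norm while spreading outlier values across all components, which makes the rotated Q/K/V values easier to quantize to low-bit KV-cache types. The butterfly structure also shows why a power-of-two head dimension is required.

```rust
/// In-place fast Walsh–Hadamard transform, scaled by 1/sqrt(n) so the
/// rotation is orthonormal (norm-preserving). Requires a power-of-two
/// length, which is why the feature only applies to models whose head
/// dimension is a power of two.
fn hadamard_rotate(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two(), "head dimension must be a power of two");
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;     // butterfly: sum
                v[j + h] = a - b; // butterfly: difference
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}

fn main() {
    let mut q = vec![1.0_f32, -2.0, 0.5, 3.0, -1.5, 0.25, 2.0, -0.75];
    let norm_before: f32 = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    hadamard_rotate(&mut q);
    let norm_after: f32 = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Orthonormal rotation: the norm is unchanged, but values are "flattened",
    // reducing outliers before low-bit KV-cache quantization.
    println!("norm before = {norm_before:.4}, after = {norm_after:.4}");
}
```

Because the normalized Hadamard matrix is symmetric and orthogonal, applying the same rotation again recovers the original vector, so the transform can be undone (or folded into adjacent weights) without any extra stored state.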
Structs§
- Imatrix - A collection of importance matrix entries (one per quantized tensor).
- ImatrixEntry - A single per-tensor importance matrix entry, as loaded from a .imatrix file.
- KvOverride - A single GGUF metadata key-value override.
- QuantizeParams - Parameters for quantizing a model.
- TensorTypeOverride - Override the quantization type of every tensor whose name matches a glob pattern.
Enums§
- GgmlType - GGML tensor storage type (maps to ggml_type).
- KvOverrideValue - A value in a GGUF key-value metadata override.
- LlamaFtype - The quantization type used for the bulk of a model file (maps to llama_ftype).
Functions§
- attn_rot_disabled - Returns true if TurboQuant attention rotation is currently disabled.
- set_attn_rot_disabled - Control the TurboQuant attention-rotation feature globally.
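A global enable/disable pair like attn_rot_disabled / set_attn_rot_disabled is typically backed by a process-wide atomic flag. The sketch below is a hypothetical re-implementation of that pattern for illustration only; the function names mirror the module's, but this is not the crate's actual code, and the default-enabled behavior is taken from the description above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch: a process-wide flag, defaulting to "rotation enabled"
// (i.e. not disabled), matching the documented default.
static ATTN_ROT_DISABLED: AtomicBool = AtomicBool::new(false);

/// Globally disable (or re-enable) the attention rotation.
pub fn set_attn_rot_disabled(disabled: bool) {
    ATTN_ROT_DISABLED.store(disabled, Ordering::Relaxed);
}

/// Returns true if the attention rotation is currently disabled.
pub fn attn_rot_disabled() -> bool {
    ATTN_ROT_DISABLED.load(Ordering::Relaxed)
}

fn main() {
    assert!(!attn_rot_disabled()); // enabled by default
    set_attn_rot_disabled(true);   // opt out globally
    assert!(attn_rot_disabled());
}
```

An AtomicBool with Relaxed ordering is sufficient here because the flag is a standalone toggle: no other memory accesses need to be ordered relative to it, and contexts simply read the current value when they are created.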