Module quantize


Quantization types and parameters for converting models to lower-bit precisions.

§Quick start

use llama_cpp_4::quantize::{LlamaFtype, QuantizeParams};

// Quantize the bulk of the model to Q4_K_M, running on 8 threads and
// also quantizing the output tensor.
let params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)
    .with_nthread(8)
    .with_quantize_output_tensor(true);

// Read the f16 source model and write the quantized copy.
llama_cpp_4::model_quantize("model-f16.gguf", "model-q4km.gguf", &params).unwrap();

§TurboQuant – attention rotation (PR #21038)

llama.cpp applies a Hadamard rotation to the Q/K/V tensors before they are written into the KV cache. This significantly improves KV-cache quantization quality at near-zero runtime cost, and it is enabled by default for every model whose head dimension is a power of two. You can opt out per context with LlamaContextParams::with_attn_rot_disabled, or globally with set_attn_rot_disabled.

Structs§

Imatrix – A collection of importance matrix entries (one per quantized tensor).
ImatrixEntry – A single per-tensor importance matrix entry, as loaded from a .imatrix file.
KvOverride – A single GGUF metadata key-value override.
QuantizeParams – Parameters for quantizing a model.
TensorTypeOverride – Overrides the quantization type of every tensor whose name matches a glob pattern.

Enums§

GgmlType – GGML tensor storage type (maps to ggml_type).
KvOverrideValue – A value in a GGUF key-value metadata override.
LlamaFtype – The quantization type used for the bulk of a model file (maps to llama_ftype).

Functions§

attn_rot_disabled – Returns true if TurboQuant attention rotation is currently disabled.
set_attn_rot_disabled – Enables or disables the TurboQuant attention-rotation feature globally.