Quantization types and parameters for converting models to lower-bit precisions.
§Quick start
use llama_cpp_4::quantize::{LlamaFtype, QuantizeParams};

let params = QuantizeParams::new(LlamaFtype::MostlyQ4KM)
    .with_nthread(8)
    .with_quantize_output_tensor(true);
llama_cpp_4::model_quantize("model-f16.gguf", "model-q4km.gguf", &params).unwrap();

§TurboQuant – attention rotation (PR #21038)
llama.cpp applies a Hadamard rotation to Q/K/V tensors before writing them into the KV cache.
This significantly improves KV-cache quantization quality at near-zero cost, and is enabled by
default for every model whose head dimension is a power of two. You can opt out per-context
with LlamaContextParams::with_attn_rot_disabled or globally with
set_attn_rot_disabled.
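To see why the rotation helps, here is a minimal, self-contained sketch of a normalized Walsh–Hadamard rotation on a single head vector. This is an illustration of the underlying technique, not the crate's implementation: an orthonormal Hadamard rotation preserves the vector's norm while spreading outlier values across all components, which makes the rotated Q/K/V values easier to quantize to low-bit KV-cache types. The butterfly structure also shows why a power-of-two head dimension is required.

```rust
/// In-place fast Walsh–Hadamard transform, scaled by 1/sqrt(n) so the
/// rotation is orthonormal (norm-preserving). Requires a power-of-two
/// length, which is why the feature only applies to models whose head
/// dimension is a power of two.
fn hadamard_rotate(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two(), "head dimension must be a power of two");
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;     // butterfly: sum
                v[j + h] = a - b; // butterfly: difference
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}

fn main() {
    let mut q = vec![1.0_f32, -2.0, 0.5, 3.0, -1.5, 0.25, 2.0, -0.75];
    let norm_before: f32 = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    hadamard_rotate(&mut q);
    let norm_after: f32 = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Orthonormal rotation: the norm is unchanged, but values are "flattened",
    // reducing outliers before low-bit KV-cache quantization.
    println!("norm before = {norm_before:.4}, after = {norm_after:.4}");
}
```

Because the normalized Hadamard matrix is symmetric and orthogonal, applying the same rotation again recovers the original vector, so the transform can be undone (or folded into adjacent weights) without any extra stored state.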
Structs§
- Imatrix - A collection of importance matrix entries (one per quantized tensor).
- ImatrixEntry - A single per-tensor importance matrix entry, as loaded from a .imatrix file.
- KvOverride - A single GGUF metadata key-value override.
- QuantizeParams - Parameters for quantizing a model.
- TensorTypeOverride - Override the quantization type of every tensor whose name matches a glob pattern.
Enums§
- GgmlType - GGML tensor storage type (maps to ggml_type).
- KvOverrideValue - A value in a GGUF key-value metadata override.
- LlamaFtype - The quantization type used for the bulk of a model file (maps to llama_ftype).
Functions§
- attn_rot_disabled - Returns true if TurboQuant attention rotation is currently disabled.
- set_attn_rot_disabled - Control the TurboQuant attention-rotation feature globally.
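A global enable/disable pair like attn_rot_disabled / set_attn_rot_disabled is typically backed by a process-wide atomic flag. The sketch below is a hypothetical re-implementation of that pattern for illustration only; the function names mirror the module's, but this is not the crate's actual code, and the default-enabled behavior is taken from the description above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch: a process-wide flag, defaulting to "rotation enabled"
// (i.e. not disabled), matching the documented default.
static ATTN_ROT_DISABLED: AtomicBool = AtomicBool::new(false);

/// Globally disable (or re-enable) the attention rotation.
pub fn set_attn_rot_disabled(disabled: bool) {
    ATTN_ROT_DISABLED.store(disabled, Ordering::Relaxed);
}

/// Returns true if the attention rotation is currently disabled.
pub fn attn_rot_disabled() -> bool {
    ATTN_ROT_DISABLED.load(Ordering::Relaxed)
}

fn main() {
    assert!(!attn_rot_disabled()); // enabled by default
    set_attn_rot_disabled(true);   // opt out globally
    assert!(attn_rot_disabled());
}
```

An AtomicBool with Relaxed ordering is sufficient here because the flag is a standalone toggle: no other memory accesses need to be ordered relative to it, and contexts simply read the current value when they are created.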