Crate rlx_gguf

GGUF (GGML Universal Format) parser, dequantizer, quantization encoder, and file writer.

Standalone: no rlx-* dependencies. Higher-level WeightLoader / HF name mapping lives in the separate model-builders repo (see root README).

Supports GGUF v1, v2, and v3 (the live formats). Tensor dtypes decoded today: F32, F16, BF16, Q8_0, Q4_0, Q4_1, Q5_0, Q5_1, and the full K-quant family Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K. The encoder side covers every dtype the decoder accepts; see quantize and writer::GgufWriter. A tensor of any other dtype still parses, but dequant_f32 returns an error for it, so callers know exactly which tensor is unreadable; adding a dtype is a one-arm match.

Endianness: little-endian assumed (the only flavor that ships in practice). The GGUF spec reserves a flag for big-endian; we don’t parse it.
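As an illustration of the little-endian assumption, the fixed-size prefix of a GGUF v2/v3 file can be read as below. This is a minimal std-only sketch and not this crate's parser: `HeaderSketch`, `parse_header_sketch` and the two `read_*` helpers are hypothetical names, and the field layout (4-byte magic, u32 version, u64 tensor count, u64 metadata count) follows the public GGUF spec for v2 and v3 only.

```rust
/// Fixed-size prefix of a GGUF v2/v3 file (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
struct HeaderSketch {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read a little-endian u32 at byte offset `at`.
fn read_u32_le(b: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([b[at], b[at + 1], b[at + 2], b[at + 3]])
}

/// Read a little-endian u64 at byte offset `at`.
fn read_u64_le(b: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&b[at..at + 8]);
    u64::from_le_bytes(buf)
}

/// Returns None if the buffer is too short, the magic is wrong,
/// or the version is not one this sketch covers (2 or 3).
fn parse_header_sketch(bytes: &[u8]) -> Option<HeaderSketch> {
    if bytes.len() < 24 || !bytes.starts_with(b"GGUF") {
        return None;
    }
    let version = read_u32_le(bytes, 4);
    if version != 2 && version != 3 {
        return None;
    }
    Some(HeaderSketch {
        version,
        tensor_count: read_u64_le(bytes, 8),
        metadata_kv_count: read_u64_le(bytes, 16),
    })
}
```

Every multi-byte field goes through `from_le_bytes`, so the sketch behaves the same on any host; a big-endian file would simply fail the version check.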

§Reading a GGUF file

use rlx_gguf::GgufFile;

let f = GgufFile::from_path("model.gguf")?;
let (data, shape) = f.dequant_f32("token_embd.weight")?;

§Writing a GGUF file with mixed quant schemes

use rlx_gguf::{GgmlType, GgufWriter, MetaValue, quantize};

let w_floats: Vec<f32> = /* … */;
let bias_floats: Vec<f32> = /* … */;

let mut w = GgufWriter::new();
w.set_arch("llama");
w.set_meta("general.name", MetaValue::String("my-model".into()));

// Big projection → 4-bit K-quant. Small bias → F16: at 4096 values the
// size saving from quantizing is negligible, so keep the precision.
w.add_tensor_bytes("w", vec![4096, 4096], GgmlType::Q4K,
    quantize(&w_floats, GgmlType::Q4K)?)?;
w.add_tensor_bytes("b", vec![4096], GgmlType::F16,
    quantize(&bias_floats, GgmlType::F16)?)?;
w.write_to_path("out.gguf")?;

For end-to-end safetensors / ONNX → GGUF conversion with per-tensor scheme rules see the companion rlx-gguf-convert crate.

Re-exports§

pub use quantize::quantize;
pub use writer::GgufWriter;
pub use writer::TensorPayload;

Modules§

quantize: Float → GGML quant encoders. Mirrors quantize_row_* from llama.cpp’s ggml-quants.c. Output is byte-identical to upstream for the legacy schemes (Q4_0/Q4_1/Q5_0/Q5_1/Q8_0). K-quants use a simpler per-sub-block min/max search than upstream’s iterative make_qx_quants, so quality is slightly below llama-quantize, but the output is fully valid GGUF that round-trips through the dequant kernels in the parent module (the crate root).
writer: GGUF v3 file writer. Serializes a sequence of metadata key/value pairs and tensors. Layout matches the live spec (see lib.rs parser): little-endian, alignment padding before the data segment, tensor data appended in declaration order.
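The alignment padding mentioned for the writer comes down to rounding the current file offset up to the next multiple of the alignment. The sketch below shows that arithmetic only. `ALIGNMENT_SKETCH`, `align_up` and `padding_len` are hypothetical names, and the value 32 is the GGUF spec's default when `general.alignment` is absent; whether DEFAULT_ALIGNMENT in this crate has the same value is an assumption.

```rust
/// Assumption: the GGUF spec default alignment (not read from this crate).
const ALIGNMENT_SKETCH: u64 = 32;

/// Round `offset` up to the next multiple of `alignment` (must be non-zero).
fn align_up(offset: u64, alignment: u64) -> u64 {
    debug_assert!(alignment > 0);
    let rem = offset % alignment;
    if rem == 0 {
        offset
    } else {
        offset + (alignment - rem)
    }
}

/// Number of zero bytes to emit after `offset` so the data segment
/// starts on an aligned boundary.
fn padding_len(offset: u64, alignment: u64) -> u64 {
    align_up(offset, alignment) - offset
}
```

For example, if the header, metadata and tensor descriptors end at byte 100, `padding_len(100, ALIGNMENT_SKETCH)` gives 28, so tensor data would begin at byte 128.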

Structs§

GgufFile
GgufTensor

Enums§

GgmlType
MetaValue

Constants§

DEFAULT_ALIGNMENT
GGUF_MAGIC
K_SCALE_SIZE: Byte size of the packed scales+mins region in block_q4_K / block_q5_K — 8 sub-blocks × 12 bits (6 bits scale + 6 bits min) = 96 bits = 12 bytes. Same layout in both formats.
QK4_0: Legacy Q4_0 block size (32 elements).
QK8_0: Legacy Q8_0 block size (32 elements).
QK_K: Super-block size shared by every K-quant format (256 elements), per llama.cpp’s ggml-quants.h. Tensors quantized with any K-quant type (Q2_K through Q8_K) must have an element count divisible by 256.
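The 12-byte region described under K_SCALE_SIZE packs eight 6-bit scales and eight 6-bit mins. The sketch below unpacks sub-block `j` the way llama.cpp's `get_scale_min_k4` helper does; that this crate uses the same bit layout internally is an assumption, and the function name here is borrowed from upstream rather than taken from this crate's API.

```rust
/// Packed scales+mins region size for Q4_K / Q5_K (from the constant above).
const K_SCALE_SIZE: usize = 12;

/// Unpack the 6-bit (scale, min) pair for sub-block `j` (0..8).
/// Sub-blocks 0..4 keep their low 6 bits in bytes 0..8; sub-blocks 4..8
/// keep 4 low bits in bytes 8..12 and borrow 2 high bits from the top
/// of bytes 0..8.
fn get_scale_min_k4(j: usize, q: &[u8; K_SCALE_SIZE]) -> (u8, u8) {
    assert!(j < 8, "a super-block has 8 sub-blocks");
    if j < 4 {
        (q[j] & 63, q[j + 4] & 63)
    } else {
        let scale = (q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4);
        let min = (q[j + 4] >> 4) | ((q[j] >> 6) << 4);
        (scale, min)
    }
}
```

Each returned value fits in 6 bits, which matches the 8 × (6 + 6) = 96-bit budget of the region.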

Functions§

bytes_for_public: Bytes a tensor of n elements occupies in storage for dtype. Returns None if n is not a multiple of the scheme’s block size.
dequant_q2_k
dequant_q2_k_block: Dequantize one Q2_K super-block (84 bytes) into out.
dequant_q3_k
dequant_q3_k_block: Dequantize one Q3_K super-block (110 bytes) into out.
dequant_q4_0: Full-tensor Q4_0 dequant (element count must be a multiple of QK4_0).
dequant_q4_0_block: Dequant one Q4_0 block (2 + QK4_0/2 bytes → QK4_0 f32 values).
dequant_q4_k: Q4_K block: 144 bytes / 256 elements (4.5 bits/element). Layout: f16 d + f16 dmin + 12-byte packed scales/mins + 128 nibbles.
dequant_q4_k_block: Dequantize one Q4_K super-block (144 bytes) into out (256 f32s).
dequant_q5_k: Q5_K block: 176 bytes / 256 elements (5.5 bits/element). Layout: f16 d + f16 dmin + 12-byte packed scales/mins + 32-byte high-bits + 128 nibbles. Each element’s 5th bit lives in qh indexed by position-within-super-block.
dequant_q5_k_block: Dequantize one Q5_K super-block (176 bytes) into out.
dequant_q6_k: Q6_K block: 210 bytes / 256 elements (6.5625 bits/element). The highest-quality K-quant; common in *-Q6_K.gguf model dumps. Layout: 128 low-nibble bytes + 64 high-2-bit bytes + 16 i8 scales + f16 d.
dequant_q6_k_block: Dequantize one Q6_K super-block (210 bytes) into out.
dequant_q8_0: Full-tensor Q8_0 dequant (element count must be a multiple of QK8_0).
dequant_q8_0_block: Dequant one Q8_0 block (2 + QK8_0 bytes → QK8_0 f32 values).
dequant_q8_k: Q8_K block: 292 bytes / 256 elements (f32 d + 256 i8 quants + 16 i16 bsums). Mostly an intermediate format used inside llama.cpp’s matmul kernels, but some dumps do store it directly. We only need to materialize the i8 quants × the f32 super-block scale; bsums (partial sums over each group of 16 elements) is metadata we can safely ignore for plain dequant.
dequant_q8_k_block: Dequantize one Q8_K super-block (292 bytes) into out.
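To make the legacy block layout concrete, here is a minimal std-only sketch of a Q4_0 block dequant: a 2-byte f16 scale followed by QK4_0/2 bytes of packed nibbles, giving QK4_0 f32 values. It is not this crate's dequant_q4_0_block. The names `f16_to_f32` and `dequant_q4_0_block_sketch` are hypothetical, and the nibble order (low nibbles fill the first 16 outputs, high nibbles the last 16, each offset by -8) follows llama.cpp's `dequantize_row_q4_0`, which this crate is assumed to match.

```rust
const QK4_0: usize = 32;
const Q4_0_BLOCK_BYTES: usize = 2 + QK4_0 / 2; // 18

/// Convert IEEE 754 half-precision bits to f32 (std has no stable f16 type).
fn f16_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    if exp == 0 {
        // Zero or subnormal: value = frac * 2^-24.
        let v = frac as f32 * 2f32.powi(-24);
        return if sign == 1 { -v } else { v };
    }
    let bits = if exp == 0x1f {
        // Infinity or NaN.
        (sign << 31) | 0x7f80_0000 | (frac << 13)
    } else {
        // Normal number: rebias the exponent from 15 to 127.
        (sign << 31) | ((exp + 112) << 23) | (frac << 13)
    };
    f32::from_bits(bits)
}

/// Dequantize one 18-byte Q4_0 block into 32 f32 values.
fn dequant_q4_0_block_sketch(block: &[u8; Q4_0_BLOCK_BYTES], out: &mut [f32; QK4_0]) {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    for j in 0..QK4_0 / 2 {
        let q = block[2 + j];
        let lo = (q & 0x0F) as i32 - 8; // element j
        let hi = (q >> 4) as i32 - 8; // element j + 16
        out[j] = lo as f32 * d;
        out[j + QK4_0 / 2] = hi as f32 * d;
    }
}
```

A full-tensor routine would call this once per 18-byte chunk, which is why the element count has to be a multiple of QK4_0.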
