Module weight

Weight loading from safetensors files into Metal GPU buffers.

This module provides utilities for loading quantized model weights from the safetensors file format into Metal StorageModeShared buffers for GPU inference.

§Architecture

The loading pipeline is:

1. Memory-map the safetensors file(s) via memmap2 (no full read into RAM).
2. Parse the header to discover tensor names, shapes, dtypes, and byte offsets.
3. For each tensor, create a Metal StorageModeShared buffer and copy the raw bytes from the mmap region into it.
4. Attach quantization metadata (bits, group_size) from the quantization_config.json file.
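
The header-parsing step can be sketched with the standard library alone. This is an illustration of the safetensors layout (an 8-byte little-endian header length, a JSON header, then the tensor data region), not this module's actual code. The names split_safetensors and tensor_bytes are hypothetical, and decoding the JSON header (for example with serde_json) is left out.

```rust
/// Hypothetical helper: splits a safetensors byte image (for example an mmap
/// region) into its JSON header text and the tensor data region after it.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("file shorter than the 8-byte length prefix".to_string());
    }
    // The first 8 bytes hold the header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(len_bytes);
    if header_len > (bytes.len() - 8) as u64 {
        return Err("header length exceeds file size".to_string());
    }
    let header_end = 8 + header_len as usize;
    let header = std::str::from_utf8(&bytes[8..header_end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))?;
    Ok((header, &bytes[header_end..]))
}

/// Hypothetical helper: returns one tensor's raw bytes. The offsets come from
/// the header's `data_offsets` and are relative to the data region, not to
/// the start of the file. Returns None if the range is out of bounds.
fn tensor_bytes(data: &[u8], begin: usize, end: usize) -> Option<&[u8]> {
    data.get(begin..end)
}
```

Both helpers only borrow from the input slice, so nothing is copied until a tensor's bytes are moved into a Metal buffer.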

§Zero-Copy Consideration

On Apple Silicon, Metal shared-mode buffers reside in unified memory, so a Metal buffer could in principle wrap the mmap pointer directly. That is unsafe here: the buffer would be valid only while the file mapping is alive and would dangle once the mapping is dropped. Instead, we copy the tensor bytes into a fresh Metal buffer. This costs a single memcpy within unified memory, and the buffer owns its data, so it remains valid after the file mapping is dropped.
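
The ownership argument can be illustrated without Metal. In the sketch below, OwnedBuffer is a hypothetical stand-in for a StorageModeShared buffer and a Vec<u8> stands in for the mmap region; neither is this module's real type. It only shows that a copied buffer stays valid once its source is gone.

```rust
/// Hypothetical stand-in for a Metal StorageModeShared buffer: it owns its bytes.
struct OwnedBuffer {
    bytes: Vec<u8>,
}

impl OwnedBuffer {
    /// Copies `src` into freshly allocated storage (the single memcpy).
    fn copy_from(src: &[u8]) -> OwnedBuffer {
        OwnedBuffer { bytes: src.to_vec() }
    }

    /// Read-only view of the owned bytes.
    fn as_slice(&self) -> &[u8] {
        &self.bytes
    }
}

/// Copies one tensor's byte range out of `mapping`, then drops the mapping.
/// `mapping` stands in for the mmap region. Returns None if the range is
/// out of bounds.
fn load_then_unmap(mapping: Vec<u8>, begin: usize, end: usize) -> Option<OwnedBuffer> {
    let buffer = mapping.get(begin..end).map(OwnedBuffer::copy_from);
    // The source is released here; the copy does not borrow from it.
    drop(mapping);
    buffer
}
```

A wrapper that kept a borrowed slice instead of a copy would not compile in this shape, because the borrow could not outlive `mapping`. With a raw mmap pointer handed to the GPU, the compiler cannot enforce that constraint, which is the hazard the copy avoids.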

Structs§

QuantizationConfig: Per-tensor quantization configuration from quantization_config.json.
QuantizedWeight: A quantized weight tensor loaded into Metal GPU buffers.
SafetensorsFile: A memory-mapped safetensors file ready for tensor extraction.
TensorQuantConfig: Quantization parameters for an individual tensor.

Functions§

load_quantized_weights: Load quantized weights from a directory containing safetensors file(s) and a quantization_config.json.
safetensors_to_metal_buffer: Copy raw bytes from a safetensors tensor view into a new Metal StorageModeShared buffer.
