Module weight

Weight loading from safetensors files into Metal GPU buffers.

This module provides utilities for loading quantized model weights from the safetensors file format into Metal StorageModeShared buffers for GPU inference.

§Architecture

The loading pipeline is:

  1. Memory-map the safetensors file(s) via memmap2 (no full read into RAM).
  2. Parse the header to discover tensor names, shapes, dtypes, and byte offsets.
  3. For each tensor, create a Metal StorageModeShared buffer and copy the raw bytes from the mmap region into it.
  4. Attach quantization metadata (bits, group_size) from the quantization_config.json file.
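
Steps 2 and 3 can be sketched with the standard library alone. This is a minimal illustration, not this module's implementation: it assumes the published safetensors layout (an 8-byte little-endian header length, then that many bytes of UTF-8 JSON, then the tensor data, with `data_offsets` measured from the start of the data region). The names `split_safetensors` and `copy_tensor` are hypothetical, a plain byte slice stands in for the mmap region, and a `Vec<u8>` stands in for the Metal buffer.

```rust
use std::ops::Range;

/// Splits raw safetensors bytes into (header JSON, tensor data region).
/// Hypothetical helper; `bytes` stands in for the memory-mapped file.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    // The first 8 bytes hold the header length as a little-endian u64.
    let prefix = bytes.get(..8).ok_or("missing 8-byte header length prefix")?;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(prefix);
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    // The JSON header follows the prefix; everything after it is tensor data.
    let header_end = 8usize
        .checked_add(header_len)
        .ok_or("header length overflows")?;
    let header = bytes
        .get(8..header_end)
        .ok_or("header extends past end of file")?;
    let json = std::str::from_utf8(header).map_err(|e| e.to_string())?;
    Ok((json, &bytes[header_end..]))
}

/// Copies one tensor's bytes out of the data region into an owned buffer.
/// `offsets` is the tensor's `data_offsets` pair taken from the header.
/// The returned `Vec<u8>` is a stand-in for a Metal shared buffer.
fn copy_tensor(data: &[u8], offsets: Range<usize>) -> Option<Vec<u8>> {
    data.get(offsets).map(|view| view.to_vec())
}
```

Parsing the JSON header itself (tensor names, shapes, dtypes) is left out here; a real loader would hand the header string to a JSON parser before calling something like `copy_tensor` for each entry.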

§Zero-Copy Consideration

On Apple Silicon, Metal shared-mode buffers reside in unified memory, so a Metal buffer could in principle wrap the mmap pointer directly. That would be unsafe: the buffer would stay valid only while the file mapping is alive, and nothing ties the two lifetimes together. Instead we copy the tensor bytes into a fresh Metal buffer. The copy is a single memcpy on unified memory and guarantees that the buffer outlives the file mapping.
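
The ownership argument can be illustrated without Metal or mmap. In the sketch below, every type and function is a hypothetical stand-in: `Mapping` plays the role of the memory-mapped file and `OwnedBuffer` plays the role of the Metal shared buffer. It only shows why a copied buffer remains usable after the mapping is dropped.

```rust
/// Stand-in for a memory-mapped file; tensor views borrow from it.
struct Mapping {
    bytes: Vec<u8>,
}

impl Mapping {
    /// Returns a borrowed view, valid only while `self` is alive.
    fn tensor(&self, start: usize, end: usize) -> &[u8] {
        &self.bytes[start..end]
    }
}

/// Stand-in for a Metal shared buffer; it owns its own copy of the bytes.
struct OwnedBuffer {
    bytes: Vec<u8>,
}

/// Copies a borrowed view into a buffer that owns its storage.
fn copy_into_buffer(view: &[u8]) -> OwnedBuffer {
    OwnedBuffer { bytes: view.to_vec() }
}

/// Loads one tensor, then releases the mapping before returning.
fn load_then_unmap() -> OwnedBuffer {
    let mapping = Mapping { bytes: vec![10, 20, 30, 40] };
    let buffer = copy_into_buffer(mapping.tensor(1, 3));
    drop(mapping); // the "file mapping" is gone from here on
    buffer // still valid, because it never pointed into the mapping
}
```

Returning the borrowed `&[u8]` from `load_then_unmap` instead would be rejected by the borrow checker. A GPU buffer that wraps a raw pointer gets no such compile-time check, which is the hazard the copy avoids.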

Structs§

QuantizationConfig
Per-tensor quantization configuration from quantization_config.json.
QuantizedWeight
A quantized weight tensor loaded into Metal GPU buffers.
SafetensorsFile
A memory-mapped safetensors file ready for tensor extraction.
TensorQuantConfig
Quantization parameters for an individual tensor.
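
To make the two quantization parameters concrete, here is a small arithmetic sketch. `QuantParams` is a hypothetical mirror of the per-tensor metadata (bits, group_size); the real field names and types in `TensorQuantConfig` may differ. It also assumes the common group-wise scheme in which values are bit-packed and each group of `group_size` elements shares its own parameters, which this page does not confirm.

```rust
/// Hypothetical per-tensor quantization parameters.
#[derive(Debug, Clone, Copy, PartialEq)]
struct QuantParams {
    bits: u32,       // bits per quantized value
    group_size: u32, // elements per quantization group
}

impl QuantParams {
    /// Bytes needed to store `n` bit-packed values, rounded up.
    fn packed_bytes(&self, n: u64) -> u64 {
        (n * self.bits as u64 + 7) / 8
    }

    /// Number of groups covering `n` elements, rounded up.
    fn num_groups(&self, n: u64) -> u64 {
        let g = self.group_size as u64;
        (n + g - 1) / g
    }
}
```

Under these assumptions, a 4096-element row at 4 bits with a group size of 64 packs into 2048 bytes and spans 64 groups.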

Functions§

load_quantized_weights
Load quantized weights from a directory containing safetensors file(s) and a quantization_config.json.
safetensors_to_metal_buffer
Copy raw bytes from a safetensors tensor view into a new Metal StorageModeShared buffer.