Weight loading from safetensors files into Metal GPU buffers.
This module provides utilities for loading quantized model weights from the safetensors file format into Metal StorageModeShared buffers for GPU inference.
§Architecture
The loading pipeline is:
- Memory-map the safetensors file(s) via memmap2 (no full read into RAM).
- Parse the header to discover tensor names, shapes, dtypes, and byte offsets.
- For each tensor, create a Metal StorageModeShared buffer and copy the raw bytes from the mmap region into it.
- Attach quantization metadata (bits, group_size) from the quantization_config.json file.
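The header-parsing step can be sketched with the standard library alone. This is a minimal illustration, not this module's implementation: it relies on the published safetensors layout (an 8-byte little-endian header length, then a UTF-8 JSON header, then the tensor data, with each tensor's data_offsets measured from the start of the data region). The function names split_safetensors and tensor_bytes are hypothetical, and decoding the JSON itself (for example with a JSON crate) is left out.

```rust
/// Split raw safetensors bytes into (JSON header text, tensor data region).
///
/// Assumed layout: 8-byte little-endian u64 header length N, then N bytes
/// of UTF-8 JSON, then the concatenated tensor data.
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("file shorter than the 8-byte length prefix".to_string());
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(len_bytes);

    // Reject headers that claim to extend past the end of the file.
    let remaining = (bytes.len() - 8) as u64;
    if header_len > remaining {
        return Err("header length exceeds file size".to_string());
    }
    let header_end = 8 + header_len as usize;

    let header = std::str::from_utf8(&bytes[8..header_end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))?;
    Ok((header, &bytes[header_end..]))
}

/// Resolve a tensor's `data_offsets` pair (relative to the data region)
/// to a byte slice. Returns `None` when the range is out of bounds.
fn tensor_bytes(data: &[u8], begin: usize, end: usize) -> Option<&[u8]> {
    data.get(begin..end)
}
```

In a real loader the input slice would be the memory-mapped file, so both returned slices borrow from the mapping and no tensor data is copied at this stage.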
§Zero-Copy Consideration
On Apple Silicon, Metal shared-mode buffers reside in unified memory. We could create a Metal buffer that wraps the mmap pointer directly, but this is unsafe because such a buffer would be valid only while the file mapping stays alive. Instead we copy the tensor bytes into a fresh Metal buffer, which costs a single memcpy in unified memory and guarantees that the buffer outlives the file mapping.
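The lifetime argument can be illustrated without any GPU dependency. In the sketch below, a Vec<u8> stands in for both the mmap region and the Metal buffer; OwnedBuffer, copy_into_owned, and load_and_release are hypothetical names introduced for illustration, and real code would allocate a Metal shared-mode buffer instead of a Vec.

```rust
/// Stand-in for a Metal shared-mode buffer: it owns its bytes.
/// (Hypothetical type; the real buffer would be allocated through Metal.)
struct OwnedBuffer {
    bytes: Vec<u8>,
}

/// Copy a borrowed tensor view into a buffer that owns its storage.
/// This is the single memcpy the copying strategy pays for.
fn copy_into_owned(view: &[u8]) -> OwnedBuffer {
    OwnedBuffer { bytes: view.to_vec() }
}

/// Load one tensor from a short-lived mapping. The returned buffer stays
/// valid after the mapping is dropped because it holds its own copy.
fn load_and_release(begin: usize, end: usize) -> Option<OwnedBuffer> {
    // Stand-in for the memory-mapped file contents: bytes 0..=15.
    let mapping: Vec<u8> = (0u8..16).collect();
    let buffer = copy_into_owned(mapping.get(begin..end)?);
    drop(mapping); // the "file mapping" goes away here
    Some(buffer)
}
```

Had the buffer merely wrapped a pointer into the mapping, nothing would stop the mapping from being released first, leaving the buffer pointing at unmapped memory. The copy removes that coupling at the cost of one pass over the tensor bytes.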
Structs§
- QuantizationConfig - Per-tensor quantization configuration from quantization_config.json.
- QuantizedWeight - A quantized weight tensor loaded into Metal GPU buffers.
- SafetensorsFile - A memory-mapped safetensors file ready for tensor extraction.
- TensorQuantConfig - Quantization parameters for an individual tensor.
Functions§
- load_quantized_weights - Load quantized weights from a directory containing safetensors file(s) and a quantization_config.json.
- safetensors_to_metal_buffer - Copy raw bytes from a safetensors tensor view into a new Metal StorageModeShared buffer.