QuantizedBrick Implementation (PMAT-013)
Implements quantized weight support for ComputeBricks per cbtop spec S17.
§Supported Formats
| Format | Bits/Weight | Memory (vs FP16) | Perplexity Delta |
|---|---|---|---|
| Q4_0 | 4.0 | 25% | ~0.5% |
| Q4_K | 4.5 | 28% | ~0.3% |
| Q5_K | 5.5 | 34% | ~0.1% |
| Q8_0 | 8.0 | 50% | ~0.01% |
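The memory column follows from bits per weight relative to a 16-bit FP16 baseline; a small sketch of that arithmetic (the helper name is illustrative, not part of this crate's API):

```rust
// Sketch: memory footprint relative to an FP16 baseline, assuming the table's
// "Memory" column is bits_per_weight / 16 (an assumption, not crate API).
fn memory_fraction(bits_per_weight: f64) -> f64 {
    bits_per_weight / 16.0
}

fn main() {
    for (name, bits) in [("Q4_0", 4.0), ("Q4_K", 4.5), ("Q5_K", 5.5), ("Q8_0", 8.0)] {
        println!("{name}: {:.0}% of FP16", memory_fraction(bits) * 100.0);
    }
}
```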
§Citations
- [Dettmers et al. 2022] “LLM.int8(): 8-bit Matrix Multiplication” NeurIPS
- [Frantar et al. 2023] “GPTQ: Accurate Post-Training Quantization” ICLR
- [Lin et al. 2023] “AWQ: Activation-aware Weight Quantization” MLSys
Structs§
- GgufHeader - GGUF file header (simplified parsing).
- GgufLoader - GGUF file loader (basic implementation).
- GgufTensorInfo - GGUF tensor info.
- LayerQuantStats - Per-layer quantization statistics.
- QuantStats - Quantization statistics for a model or layer.
- QuantizedBrick - QuantizedBrick wraps compute operations with quantized weights.
- QuantizedWeights - Quantized weight storage for a single layer.
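As background for what QuantizedWeights stores, here is a minimal sketch of dequantizing one Q4_0 block in the standard GGML layout (32 weights per block, one shared scale, two 4-bit quants packed per byte, weight = scale * (q - 8)). The function name and f32 scale are illustrative, not this crate's API (GGML stores the scale as f16):

```rust
// Illustrative Q4_0 block dequantization (assumption: GGML layout, where the
// low nibbles hold elements 0..16 and the high nibbles hold elements 16..32).
fn dequant_q4_0_block(scale: f32, packed: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // low nibble, centered at zero
        let hi = (byte >> 4) as i32 - 8;   // high nibble, centered at zero
        out[i] = scale * lo as f32;
        out[i + 16] = scale * hi as f32;
    }
    out
}

fn main() {
    // All-zero quants decode to -8 * scale for every weight.
    let block = dequant_q4_0_block(2.0, &[0u8; 16]);
    println!("{} {}", block[0], block[16]);
}
```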
Enums§
- DequantStrategy - Dequantization strategy.
- GgufError - GGUF parsing errors.
- GgufValue - GGUF metadata value types.
- QuantFormat - Supported quantization formats for ComputeBricks.
Functions§
- ggml_type_to_format - GGML tensor type to QuantFormat mapping.
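A hypothetical sketch of what such a mapping might look like, assuming the standard GGML tensor type codes (Q4_0 = 2, Q8_0 = 8, Q4_K = 12, Q5_K = 13); the enum shape and return type here are assumptions, not this crate's actual signatures:

```rust
// Illustrative GGML-type-code-to-format mapping; codes follow the ggml enum,
// and unknown or unsupported codes map to None.
#[allow(non_camel_case_types)]
#[derive(Debug, PartialEq)]
enum QuantFormat {
    Q4_0,
    Q4_K,
    Q5_K,
    Q8_0,
}

fn ggml_type_to_format(ggml_type: u32) -> Option<QuantFormat> {
    match ggml_type {
        2 => Some(QuantFormat::Q4_0),
        8 => Some(QuantFormat::Q8_0),
        12 => Some(QuantFormat::Q4_K),
        13 => Some(QuantFormat::Q5_K),
        _ => None, // e.g. F32/F16 or unsupported quant types
    }
}

fn main() {
    println!("{:?}", ggml_type_to_format(2));
}
```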
Type Aliases§
- GgufResult - Result type for GGUF operations.
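To make the GGUF loading pieces above concrete, a minimal header check in the spirit of GgufHeader/GgufError, assuming the GGUF spec's little-endian layout (magic `GGUF`, u32 version, u64 tensor count, u64 metadata KV count); the struct fields and error variants here are illustrative, not the crate's actual definitions:

```rust
// Hedged sketch of simplified GGUF header parsing (v2+ layout assumed).
#[derive(Debug)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

#[derive(Debug)]
enum GgufError {
    BadMagic,
    Truncated,
}

fn parse_header(bytes: &[u8]) -> Result<GgufHeader, GgufError> {
    if bytes.len() < 24 {
        return Err(GgufError::Truncated);
    }
    if &bytes[0..4] != b"GGUF" {
        return Err(GgufError::BadMagic);
    }
    Ok(GgufHeader {
        version: u32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        tensor_count: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
        metadata_kv_count: u64::from_le_bytes(bytes[16..24].try_into().unwrap()),
    })
}

fn main() {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GGUF");
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(&1u64.to_le_bytes());
    buf.extend_from_slice(&2u64.to_le_bytes());
    println!("{:?}", parse_header(&buf));
}
```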