anamnesis 0.3.0

Parse any tensor format, recover any precision — framework-agnostic FP8/GPTQ/AWQ/BnB dequantization and NPZ parsing for Rust

ἀνάμνησις: Parse any format, recover any precision.

⚠️ This crate is under active development. See ROADMAP.md for the plan and CHANGELOG.md for current progress.


Tested Models

FP8 Dequantization

Cross-validated against PyTorch on 7 real FP8 models from 5 quantization tools. Bit-exact output (0 ULP difference). Auto-vectorized: SSE2 on any x86-64, AVX2 with target-cpu=native.

| Model | Quantizer | Scheme | Scales | vs PyTorch (AVX2) |
|---|---|---|---|---|
| EXAONE-4.0-1.2B-FP8 | LG AI | Fine-grained | BF16 | 6.0x faster |
| Qwen3-1.7B-FP8 | Qwen | Fine-grained | BF16 | 3.9x faster |
| Qwen3-4B-Instruct-2507-FP8 | Qwen | Fine-grained | F16 | 3.0x faster |
| Ministral-3-3B-Instruct-2512 | Mistral | Per-tensor | BF16 | 9.7x faster |
| Llama-3.2-1B-Instruct-FP8 | RedHat | Per-tensor | BF16 | 3.9x faster |
| Llama-3.2-1B-Instruct-FP8-dynamic | RedHat | Per-channel | BF16 | 2.7x faster |
| Llama-3.1-8B-Instruct-FP8 | NVIDIA | Per-tensor | F32 | 6.3x faster |
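
For intuition, here is a minimal scalar sketch of what per-tensor FP8 dequantization computes. This is illustrative only, not the crate's API: the function names are hypothetical, and the real hot loop is the auto-vectorized one described above. Each OCP FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7, no infinities) is decoded to `f32` and multiplied by the tensor's scale.

```rust
// Hypothetical sketch — decode one OCP FP8 E4M3 byte to f32.
fn e4m3_to_f32(byte: u8) -> f32 {
    let sign = if (byte & 0x80) != 0 { -1.0f32 } else { 1.0 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man = (byte & 0x07) as i32;
    if exp == 0x0F && man == 0x07 {
        return f32::NAN; // E4M3 has no infinities; 0x7F/0xFF encode NaN
    }
    if exp == 0 {
        sign * (man as f32 / 8.0) * 2f32.powi(-6) // subnormal
    } else {
        sign * (1.0 + man as f32 / 8.0) * 2f32.powi(exp - 7) // normal
    }
}

// Per-tensor scheme: one scale for the whole tensor.
fn dequant_per_tensor(bytes: &[u8], scale: f32) -> Vec<f32> {
    bytes.iter().map(|&b| e4m3_to_f32(b) * scale).collect()
}

fn main() {
    assert_eq!(e4m3_to_f32(0x38), 1.0); // exp 7, mantissa 0
    assert_eq!(e4m3_to_f32(0x40), 2.0); // exp 8, mantissa 0
    assert!(e4m3_to_f32(0x7F).is_nan());
    assert_eq!(dequant_per_tensor(&[0x38, 0x40], 0.5), vec![0.5, 1.0]);
    println!("ok");
}
```

Per-channel and fine-grained schemes differ only in scale granularity: one scale per output channel, or per block of elements, instead of one per tensor.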

GPTQ Dequantization

Cross-validated against PyTorch on 4 real GPTQ models from 2 quantizers (AutoGPTQ, GPTQModel). Bit-exact output (0 ULP difference). Loop fission for full AVX2 vectorization.

| Model | Quantizer | Bits | vs PyTorch (AVX2) |
|---|---|---|---|
| Falcon3-1B-Instruct-GPTQ-Int4 | AutoGPTQ | 4 | 6.5x faster |
| Llama-3.2-1B-Instruct-GPTQ | AutoGPTQ | 4 | 12.2x faster |
| Falcon3-1B-Instruct-GPTQ-Int8 | AutoGPTQ | 8 | 7.0x faster |
| Llama-3.2-1B-gptqmodel-8bit | GPTQModel | 8 | 7.9x faster |
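
The loop-fission idea can be sketched as two passes. This is illustrative only, not the crate's API: a nibble-unpack pass into an integer buffer, then a separate `(q - zero) * scale` pass that the compiler can auto-vectorize. The sketch assumes a single quantization group; real GPTQ tensors carry per-group scales and zero points.

```rust
// Hypothetical sketch — pass 1: unpack eight 4-bit values per u32
// (low nibble first) into a plain integer buffer.
fn unpack_int4(packed: &[u32], out: &mut Vec<u8>) {
    out.clear();
    for &word in packed {
        for i in 0..8 {
            out.push(((word >> (4 * i)) & 0xF) as u8);
        }
    }
}

// Pass 2: the arithmetic-only loop, easy for the compiler to vectorize.
fn dequant(q: &[u8], scale: f32, zero: u8) -> Vec<f32> {
    q.iter().map(|&v| (v as i32 - zero as i32) as f32 * scale).collect()
}

fn main() {
    let mut q = Vec::new();
    unpack_int4(&[0x7654_3210], &mut q);
    assert_eq!(q, vec![0, 1, 2, 3, 4, 5, 6, 7]);
    assert_eq!(
        dequant(&q, 0.5, 4),
        vec![-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
    );
    println!("ok");
}
```

Fusing both steps into one loop mixes bit manipulation with float math and tends to defeat auto-vectorization; splitting them is the point of the fission.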

AWQ Dequantization

Cross-validated against PyTorch on 2 real AWQ models (AutoAWQ GEMM). Bit-exact output (0 ULP difference). Loop fission for full AVX2 vectorization.

| Model | Quantizer | Bits | vs PyTorch (AVX2) |
|---|---|---|---|
| llama-3.2-1b-instruct-awq | AutoAWQ | 4 | 5.7x faster |
| Falcon3-1B-Instruct-AWQ | AutoAWQ | 4 | 4.7x faster |
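
The AWQ GEMM layout differs from GPTQ's mainly in nibble order: the eight 4-bit values in each u32 are interleaved rather than sequential, so the unpack pass must de-interleave before the usual `(q - zero) * scale` step. A sketch, not the crate's API; the `[0, 2, 4, 6, 1, 3, 5, 7]` pack order is an assumption taken from AutoAWQ's packing code:

```rust
// Assumed AutoAWQ pack order: nibble slot `i` holds logical element ORDER[i].
const AWQ_PACK_ORDER: [usize; 8] = [0, 2, 4, 6, 1, 3, 5, 7];

// Hypothetical sketch — de-interleave one packed u32 into logical order.
fn unpack_awq(word: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for slot in 0..8 {
        out[AWQ_PACK_ORDER[slot]] = ((word >> (4 * slot)) & 0xF) as u8;
    }
    out
}

fn main() {
    // Pack logical values 0..8 with the same order, then round-trip.
    let mut word = 0u32;
    for slot in 0..8 {
        word |= (AWQ_PACK_ORDER[slot] as u32) << (4 * slot);
    }
    assert_eq!(unpack_awq(word), [0, 1, 2, 3, 4, 5, 6, 7]);
    println!("ok");
}
```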

BitsAndBytes Dequantization

Cross-validated against PyTorch on 4 real BitsAndBytes models (NF4, FP4, double-quant, INT8). Bit-exact output (0 ULP difference). Loop fission for AVX2 on NF4/FP4; single-pass AVX2 on INT8 (`vpmovsxbd` → `vcvtdq2ps` → `vmulps`).

| Model | Format | Elements | vs PyTorch (AVX2) |
|---|---|---|---|
| Llama-3.2-1B-Instruct-bnb-nf4 | NF4 | 4,096 | 21.8x faster |
| Llama-3.2-1B-BNB-FP4 | FP4 | 4,096 | 18.0x faster |
| Llama-3.2-1B-Instruct-bnb-nf4-double-quant | NF4 double-quant | 4,096 | 54.0x faster |
| Llama-3.2-1B-BNB-INT8 | INT8 | 65,536 | 1.2x faster |

Note: INT8 speedup is modest because the operation is trivially simple (i8→f32→multiply). Both PyTorch and anamnesis are near memory bandwidth limits at ~0.7–0.8 ns/element. The AVX2 hot loop is fully vectorized — the 1.2× reflects the inherent ceiling, not a missed optimization.
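
For reference, NF4 dequantization is a codebook lookup plus a per-block scale: each nibble indexes a fixed 16-entry table of normal-distribution quantiles, and the result is multiplied by that block's absmax. A sketch, not the crate's API; the codebook constants are the bitsandbytes NF4 quantiles rounded to `f32`, and the high-nibble-first packing is an assumption:

```rust
// Assumed bitsandbytes NF4 codebook (normal-distribution quantiles).
const NF4: [f32; 16] = [
    -1.0, -0.696_192_8, -0.525_073_05, -0.394_917_5,
    -0.284_441_38, -0.184_773_43, -0.091_050_036, 0.0,
    0.079_580_3, 0.160_930_2, 0.246_112_3, 0.337_915_24,
    0.440_709_83, 0.562_617, 0.722_956_84, 1.0,
];

// Hypothetical sketch — two NF4 values per byte, scaled by the block absmax.
fn dequant_nf4(packed: &[u8], absmax: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &b in packed {
        out.push(NF4[(b >> 4) as usize] * absmax); // high nibble first (assumed)
        out.push(NF4[(b & 0x0F) as usize] * absmax);
    }
    out
}

fn main() {
    // 0xF0: high nibble 15 → +1.0, low nibble 0 → -1.0
    assert_eq!(dequant_nf4(&[0xF0], 2.0), vec![2.0, -2.0]);
    // Nibble 7 encodes exactly zero regardless of absmax.
    assert_eq!(dequant_nf4(&[0x77], 3.5), vec![0.0, 0.0]);
    println!("ok");
}
```

Double-quant adds one more layer: the absmax values are themselves quantized, so each block's scale must be dequantized before use. FP4 is identical in shape, only with a different 16-entry codebook.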

NPZ/NPY Parsing

Gated behind the `npz` feature. Custom NPY header parser with a single bulk `read_exact` — zero per-element deserialization for little-endian data on little-endian machines. Cross-validated byte-exact against NumPy on Gemma Scope 2B SAE weights.

| Metric | Value |
|---|---|
| Throughput (302 MB Gemma Scope, F32) | 3,586 MB/s |
| Overhead vs raw I/O | 1.3x |
| vs npyz crate | 17.7x faster |
| Supported dtypes | F16, BF16, F32, F64, Bool, U8–U64, I8–I64 |

BF16 is supported via the JAX V2 void-dtype convention; big-endian NPY files are handled with an in-place byte-swap.
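
The parsing strategy can be sketched as follows. This is illustrative only, not the crate's API: parse the fixed NPY preamble, read the header dict, then pull the entire payload in with bulk reads instead of deserializing element by element. Real NPY headers are padded so the data starts on an aligned boundary; this sketch ignores padding and handles only little-endian `f32`.

```rust
use std::io::Read;

// Hypothetical sketch — minimal NPY v1.0 parse for little-endian f32 data.
fn parse_npy_f32(mut r: impl Read) -> std::io::Result<(String, Vec<f32>)> {
    let mut magic = [0u8; 8];
    r.read_exact(&mut magic)?; // \x93NUMPY plus major/minor version bytes
    assert_eq!(&magic[..6], b"\x93NUMPY");
    let mut len = [0u8; 2];
    r.read_exact(&mut len)?; // v1.0 header length: little-endian u16
    let mut header = vec![0u8; u16::from_le_bytes(len) as usize];
    r.read_exact(&mut header)?; // dict literal: descr, fortran_order, shape
    let header = String::from_utf8_lossy(&header).into_owned();
    let mut raw = Vec::new();
    r.read_to_end(&mut raw)?; // one bulk read; no per-element deserialization
    let data = raw
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok((header, data))
}

fn main() {
    // Build a tiny NPY v1.0 file in memory (header padding omitted for brevity).
    let dict = b"{'descr': '<f4', 'fortran_order': False, 'shape': (2,), }";
    let mut file = Vec::new();
    file.extend_from_slice(b"\x93NUMPY\x01\x00");
    file.extend_from_slice(&(dict.len() as u16).to_le_bytes());
    file.extend_from_slice(dict);
    file.extend_from_slice(&1.5f32.to_le_bytes());
    file.extend_from_slice(&(-2.0f32).to_le_bytes());
    let (header, data) = parse_npy_f32(&file[..]).unwrap();
    assert!(header.contains("'<f4'"));
    assert_eq!(data, vec![1.5, -2.0]);
    println!("ok");
}
```

When the file's endianness matches the machine's, the payload bytes are already the in-memory representation, which is why the bulk-read approach sits so close to raw I/O throughput.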

Development