anamnesis
ἀνάμνησις — Parse any format, recover any precision.
⚠️ This crate is under active development. See ROADMAP.md for the plan and CHANGELOG.md for current progress.
Table of Contents
- Install
- CLI Commands
- Tested Models
- Safetensors Header Inspection
- NPZ/NPY Parsing
- GGUF Inspection
- PyTorch .pth Parsing
- PyTorch .pth Inspection
- Used by
- License
- Development
Install
Installs both anamnesis and amn (short alias). Feature flags: gptq, awq, bnb, npz, pth, gguf, indicatif (progress bars).
CLI Commands
| Command | Description |
|---|---|
| amn parse <file> | Parse and summarize a model file (.safetensors, .pth, .npz, .gguf) |
| amn inspect <file> | Show format, tensor counts, size estimates, and byte order |
| amn remember <file> | Dequantize to BF16 (safetensors) or convert .pth/.gguf → .safetensors |
Aliases: amn info = amn inspect, amn dequantize = amn remember.
Format detection is automatic: .safetensors files go through the dequantization pipeline, .pth/.pt files go through the pickle parser, .npz files go through the header-only NPZ inspector, .gguf files go through the GGUF parser. .bin files are probed for ZIP/GGUF magic to distinguish PyTorch, GGUF, and safetensors.
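The magic-byte probe for .bin files can be pictured with a minimal sketch. This is an illustration only, not the crate's detection code: the `Probe` enum and `probe_magic` function are hypothetical names, and the safetensors check (a `{` right after the 8-byte length prefix) is a heuristic assumed here for the example.

```rust
/// Container kinds distinguished by this sketch (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum Probe {
    /// PyTorch archives are ZIP files.
    Zip,
    Gguf,
    Safetensors,
    Unknown,
}

/// Classify a file from its first bytes.
pub fn probe_magic(head: &[u8]) -> Probe {
    // ZIP local-file-header signature.
    if head.starts_with(b"PK\x03\x04") {
        return Probe::Zip;
    }
    // GGUF files start with the ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return Probe::Gguf;
    }
    // safetensors: 8-byte little-endian header length, then a JSON object.
    if head.len() > 8 && head[8] == b'{' {
        return Probe::Safetensors;
    }
    Probe::Unknown
}
```

Only the first few bytes of the file are needed, so the probe costs one small read regardless of model size.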
$ amn parse model.pth
Parsed model.pth (PyTorch state_dict)
Tensors: 3
Total size: 1.7 KB
Dtypes: F32
Byte order: little-endian
rnn.weight_ih_l0 F32 [16, 1] 64 B
rnn.weight_hh_l0 F32 [16, 16] 1.0 KB
linear.weight F32 [10, 16] 640 B
$ amn inspect weights.npz
Format: NPZ archive
Tensors: 5
Total size: 160 B
Dtypes: F32
$ amn remember model.pth
Converting model.pth → model.safetensors
3 tensors, 1.7 KB
Done.
Tested Models
FP8 Dequantization
Cross-validated against PyTorch on 7 real FP8 models from 5 quantization tools. Bit-exact output (0 ULP difference). Auto-vectorized: SSE2 on any x86-64, AVX2 with target-cpu=native.
| Model | Quantizer | Scheme | Scales | vs PyTorch (AVX2) |
|---|---|---|---|---|
| EXAONE-4.0-1.2B-FP8 | LG AI | Fine-grained | BF16 | 6.0x faster |
| Qwen3-1.7B-FP8 | Qwen | Fine-grained | BF16 | 3.9x faster |
| Qwen3-4B-Instruct-2507-FP8 | Qwen | Fine-grained | F16 | 3.0x faster |
| Ministral-3-3B-Instruct-2512 | Mistral | Per-tensor | BF16 | 9.7x faster |
| Llama-3.2-1B-Instruct-FP8 | RedHat | Per-tensor | BF16 | 3.9x faster |
| Llama-3.2-1B-Instruct-FP8-dynamic | RedHat | Per-channel | BF16 | 2.7x faster |
| Llama-3.1-8B-Instruct-FP8 | NVIDIA | Per-tensor | F32 | 6.3x faster |
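As a rough illustration of what a per-tensor FP8 dequantization computes, the sketch below decodes one byte and multiplies by a shared scale. It assumes the E4M3 "FN" encoding (4 exponent bits with bias 7, 3 mantissa bits, no infinities) and returns f32 rather than BF16 for simplicity. The function names are hypothetical and the crate's kernels may be organised quite differently.

```rust
/// Decode one FP8 E4M3 byte (FN variant: no infinities, NaN at 0x7F / 0xFF) to f32.
pub fn fp8_e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man = (byte & 0x07) as f32;
    if exp == 0x0F && (byte & 0x07) == 0x07 {
        return f32::NAN;
    }
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, fixed exponent 2^(1 - 7).
        (man / 8.0) * 2f32.powi(-6)
    } else {
        // Normal: implicit leading 1, exponent bias 7.
        (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Per-tensor scheme: every element shares one scale.
pub fn dequantize_per_tensor(weights: &[u8], scale: f32) -> Vec<f32> {
    weights.iter().map(|&b| fp8_e4m3_to_f32(b) * scale).collect()
}
```

Fine-grained and per-channel schemes differ only in which scale is selected for each element; the per-byte decode is the same.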
GPTQ Dequantization
Cross-validated against PyTorch on 4 real GPTQ models from 2 quantizers (AutoGPTQ, GPTQModel). Bit-exact output (0 ULP difference). Loop fission for full AVX2 vectorization.
| Model | Quantizer | Bits | vs PyTorch (AVX2) |
|---|---|---|---|
| Falcon3-1B-Instruct-GPTQ-Int4 | AutoGPTQ | 4 | 6.5x faster |
| Llama-3.2-1B-Instruct-GPTQ | AutoGPTQ | 4 | 12.2x faster |
| Falcon3-1B-Instruct-GPTQ-Int8 | AutoGPTQ | 8 | 7.0x faster |
| Llama-3.2-1B-gptqmodel-8bit | GPTQModel | 8 | 7.9x faster |
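Loop fission, as mentioned above, means splitting one mixed loop into several simple ones so that each can be auto-vectorised on its own. A minimal sketch follows, assuming eight 4-bit values packed per u32 with the lowest nibble first and a single scale and zero point. Real GPTQ checkpoints use per-group scales and zero points, and `dequantize_int4` is a hypothetical name, not the crate's API.

```rust
/// Dequantize 4-bit values packed eight per u32, lowest nibble first.
/// The work is split into two passes ("loop fission").
pub fn dequantize_int4(packed: &[u32], scale: f32, zero_point: i32) -> Vec<f32> {
    // Pass 1: pure integer unpacking (shifts and masks only).
    let mut unpacked = Vec::with_capacity(packed.len() * 8);
    for &word in packed {
        for i in 0..8 {
            unpacked.push(((word >> (4 * i)) & 0xF) as i32);
        }
    }
    // Pass 2: pure float arithmetic (subtract zero point, multiply by scale).
    unpacked
        .iter()
        .map(|&q| (q - zero_point) as f32 * scale)
        .collect()
}
```

Keeping the integer and floating-point work in separate loops is the design choice being illustrated; whether the compiler actually vectorises each pass depends on the target and optimisation flags.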
AWQ Dequantization
Cross-validated against PyTorch on 2 real AWQ models (AutoAWQ GEMM). Bit-exact output (0 ULP difference). Loop fission for full AVX2 vectorization.
| Model | Quantizer | Bits | vs PyTorch (AVX2) |
|---|---|---|---|
| llama-3.2-1b-instruct-awq | AutoAWQ | 4 | 5.7x faster |
| Falcon3-1B-Instruct-AWQ | AutoAWQ | 4 | 4.7x faster |
BitsAndBytes Dequantization
Cross-validated against PyTorch on 4 real BitsAndBytes models (NF4, FP4, double-quant, INT8). Bit-exact output (0 ULP difference). Loop fission for AVX2 on NF4/FP4; single-pass AVX2 on INT8 (vpmovsxbd → vcvtdq2ps → vmulps).
| Model | Format | Elements | vs PyTorch (AVX2) |
|---|---|---|---|
| Llama-3.2-1B-Instruct-bnb-nf4 | NF4 | 4,096 | 21.8x faster |
| Llama-3.2-1B-BNB-FP4 | FP4 | 4,096 | 18.0x faster |
| Llama-3.2-1B-Instruct-bnb-nf4-double-quant | NF4 double-quant | 4,096 | 54.0x faster |
| Llama-3.2-1B-BNB-INT8 | INT8 | 65,536 | 1.2x faster |
Note: INT8 speedup is modest because the operation is trivially simple (i8→f32→multiply). Both PyTorch and anamnesis are near memory bandwidth limits at ~0.7–0.8 ns/element. The AVX2 hot loop is fully vectorized; the 1.2× reflects the inherent ceiling, not a missed optimization.
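The i8→f32→multiply chain is short enough to write out in full. The sketch below assumes one absmax value per row and a scale of absmax / 127, which is an assumption about the INT8 layout made for illustration; `dequantize_int8_row` is a hypothetical name.

```rust
/// Dequantize one row of INT8 weights that share a single absmax.
pub fn dequantize_int8_row(quantized: &[i8], absmax: f32) -> Vec<f32> {
    // One scale per row (assumed convention: absmax maps to the code 127).
    let scale = absmax / 127.0;
    // i8 -> f32 conversion followed by one multiply per element.
    quantized.iter().map(|&q| f32::from(q) * scale).collect()
}
```

With so little arithmetic per byte, the loop spends most of its time moving data, which is consistent with the bandwidth-bound behaviour described in the note.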
BitsAndBytes Quantization (Lethe — Phase 5)
The inverse direction. Phase 5 ships the lethe namespace alongside remember: encode_bnb4 / encode_bnb4_double_quant / encode_bnb_int8 plus the bit-exact round_trip validation harness. Cross-validated against PyTorch bitsandbytes on 7 fixtures across 4 architecture families (Llama 3.2 / Qwen3 / Qwen2.5 / Phi-3.5): every fixture round-trips byte-exact (0 byte diffs) against the original PyTorch-quantised bytes.
| Fixture | Format | Elements | Byte-exact round-trip | vs PyTorch quantize (CPU) |
|---|---|---|---|---|
| Llama-3.2-1B-Instruct-bnb-nf4 | NF4 plain | 4,096 | ✓ 0 / 2048 | 0.22× (slower) |
| Llama-3.2-1B-BNB-FP4 | FP4 plain | 4,096 | ✓ 0 / 2048 | 0.24× (slower) |
| Llama-3.2-1B-Instruct-bnb-nf4-double-quant | NF4 double-quant | 4,096 | ✓ 0 / 2048 | 0.22× (slower) |
| Llama-3.2-1B-BNB-INT8 | INT8 | 65,536 | ✓ 0 / 65536 | 0.03× (32× slower) |
| ema1234/qwen_mcqa_bnb_fp4 | FP4 plain (Qwen3) | 4,096 | ✓ 0 / 2048 | 0.20× (slower) |
| unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit | NF4 double-quant (Qwen2.5) | 4,096 | ✓ 0 / 2048 | 0.18× (slower) |
| unsloth/Phi-3.5-mini-instruct-bnb-4bit | NF4 double-quant (Phi-3.5) | 4,096 | ✓ 0 / 2048 | 0.18× (slower) |
Sign-of-zero preservation finding (FP4): The on-disk bitsandbytes Python FP4 quant_map stores +0.0 at both index 0 and index 8, collapsing the ±0 pair. A naive decode → encode round-trip would be mathematically impossible under that codebook. Phase 5 introduces a narrow, principled tweak in dequantize_bnb4_to_bf16: when a codebook entry is exactly +0.0 AND the nibble has its high bit set (nibble & 0x8 != 0), the emitted BF16 is -0.0. This recovers the sign information that bitsandbytes' Python decode discards. The tweak is arithmetically invisible (both are IEEE 754 zero), affects 0.2 % of FP4 elements, and is a no-op for NF4. The encoder mirrors the rule with apply_sign_magnitude_encode_correction. Confirmed to generalise: the Qwen3 FP4 fixture shows the same +0.0/+0.0 codebook collapse and round-trips byte-exact under the rule.
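The decode-side rule described above can be sketched in a few lines. This is an illustration of the stated condition only: it works on f32 instead of BF16, takes the codebook as a parameter, and uses the hypothetical name `decode_nibble` rather than the crate's dequantize_bnb4_to_bf16.

```rust
/// Decode one 4-bit index through a 16-entry codebook, applying the
/// sign-of-zero rule: a +0.0 entry selected by a nibble whose high bit
/// is set is emitted as -0.0.
pub fn decode_nibble(codebook: &[f32; 16], nibble: u8) -> f32 {
    let value = codebook[(nibble & 0x0F) as usize];
    // to_bits() == 0 holds only for +0.0; the bit pattern of -0.0 is 0x8000_0000.
    if value.to_bits() == 0 && (nibble & 0x8) != 0 {
        -0.0
    } else {
        value
    }
}
```

Because the check compares bit patterns, a codebook that already stores -0.0 at the high-bit index, or that has no +0.0 entry there, passes through unchanged.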
Ecosystem finding (NF4 double-quant): hf-fm inspect HTTP-range probes during cross-architecture candidate selection revealed that every non-Llama BnB-NF4 model checked uses double-quant, bitsandbytes' default. Plain NF4 is effectively a Llama-fixture-only phenomenon. Without encode_bnb4_double_quant (Step 1c), anamnesis would only encode a tiny corner of real-world BnB-4bit models. Promoted from deferred polish to required Step 1c gate on v0.5.0.
On the "slower than PyTorch" column: The encode kernels are 4–6× slower than PyTorch's broadcast-vectorised quantize on BnB4, 32× slower on INT8. This is expected: PyTorch encode uses a single broadcast tensor op ((blocks.unsqueeze(-1) - codebook).abs().argmin(dim=-1)) that vectorises across the whole tensor; the Rust encode loop is currently scalar per element. Phase 9 (CPU SIMD pass) is the natural target, and this table makes the gap visible. The same loop-fission + target-cpu=native infrastructure that gave the decode path its 18–54× wins is the candidate retrofit on the encode side.
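A scalar nearest-codebook search, the per-element counterpart of the argmin expression quoted above, might look like the following sketch. The names `nearest_code` and `encode_block` are hypothetical, and the absmax normalisation and tie-breaking (first minimum wins) are assumptions made for illustration; byte-exact agreement with bitsandbytes depends on details this sketch does not attempt to reproduce.

```rust
/// Index of the codebook entry closest to `x` (first entry wins on ties).
pub fn nearest_code(codebook: &[f32], x: f32) -> usize {
    let mut best = 0;
    let mut best_dist = f32::INFINITY;
    for (i, &c) in codebook.iter().enumerate() {
        let dist = (x - c).abs();
        if dist < best_dist {
            best = i;
            best_dist = dist;
        }
    }
    best
}

/// Encode one block: normalise by the block's absmax, then pick the
/// nearest code for each element, one element at a time.
pub fn encode_block(codebook: &[f32], block: &[f32]) -> (f32, Vec<u8>) {
    let absmax = block.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let inv = if absmax > 0.0 { 1.0 / absmax } else { 0.0 };
    let codes: Vec<u8> = block
        .iter()
        .map(|&v| nearest_code(codebook, v * inv) as u8)
        .collect();
    (absmax, codes)
}
```

The inner search is a data-dependent loop per element, which gives some intuition for why a scalar encode trails a whole-tensor broadcast operation.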
GGUF Block-Quant Dequantization
Cross-validated against the gguf Python package (ggml-org reference, mirrors ggml-quants.c) on 22 block-quant kernels from 4 real models (bartowski SmolLM2-135M-Instruct, TheBloke TinyLlama-1.1B-Chat, bartowski Mistral-7B-Instruct-v0.3, bartowski Qwen2.5-0.5B-Instruct) plus 3 synthetic fixtures (TQ1_0 / TQ2_0 / MXFP4 — only ~15 BitNet-derivative GGUFs ship the TQ* types on HuggingFace, and mainstream MXFP4 only ships inside the 11 GB gpt-oss-20b upload, so a deterministic random tensor is the practical fixture source). Bit-exact output (0 ULP difference). All 22 of 22 GGUF block types now supported — Phase 4.5 closed in step 6 (MXFP4). Feature-gated behind gguf.
| Kernel | Model | vs gguf Python (AVX2) |
|---|---|---|
| Q4_0 | SmolLM2-135M | 6.9x faster |
| Q4_1 | SmolLM2-135M | 6.3x faster |
| Q5_0 | TinyLlama-1.1B | 31.3x faster |
| Q5_1 | SmolLM2-135M | 11.4x faster |
| Q8_0 | SmolLM2-135M | 6.3x faster |
| IQ4_NL | SmolLM2-135M | 12.2x faster |
| Q2_K | TinyLlama-1.1B | 6.7x faster |
| Q3_K | SmolLM2-135M | 10.9x faster |
| Q4_K | SmolLM2-135M | 8.1x faster |
| Q5_K | SmolLM2-135M | 11.6x faster |
| Q6_K | SmolLM2-135M | 26.6x faster |
| IQ4_XS | SmolLM2-135M | 12.6x faster |
| IQ2_XXS | Mistral-7B-v0.3 | 3.45x faster |
| IQ2_XS | Mistral-7B-v0.3 | 2.84x faster |
| IQ2_S | Qwen2.5-0.5B | 4.10x faster |
| IQ3_XXS | Mistral-7B-v0.3 | 3.32x faster |
| IQ3_S | Mistral-7B-v0.3 | 4.37x faster |
| IQ1_S | Mistral-7B-v0.3 | 15.00x faster |
| IQ1_M | Mistral-7B-v0.3 | 7.85x faster |
| TQ1_0 | synthetic | 35.59x faster |
| TQ2_0 | synthetic | 26.31x faster |
| MXFP4 | synthetic | 30.14x faster |
Note: Q8_1 and Q8_K are internal llama.cpp activation quant types, not shipped as model weights; they are covered by unit tests only. Speedup measured on 65,536 elements (release build, target-cpu=native, best-of-5 per kernel). The IQ2_* and IQ3_* kernels land in the 2.8×–4.4× range rather than the 6×–31× range of the pure-arithmetic Q* kernels because their pass 1 involves a codebook LUT gather and a per-element sign branch, neither of which the auto-vectoriser can eliminate. The IQ1_* kernels are notably faster (7.9×–15.0×) because their inner loop replaces the per-element sign branch with a single scalar ±delta per 8-element group, and the codebook gather is a plain [u64; 2048] table lookup. The ternary TQ* kernels are the fastest in the crate (26×–36×): no codebook lookup at all, just bit shifts (TQ2_0) or a base-3 multiplication trick (TQ1_0) decoding directly to {-d, 0, +d}. MXFP4 lands at 30×; it is structurally identical to IQ4_NL (12.2×) but with a tighter 17 B/block layout (1 B E8M0 exponent vs 2 B f16) and a smaller codebook (16 entries × 4-bit nibble lookup) that the auto-vectoriser handles cleanly. Phase 9 (CPU SIMD pass) will further address the IQ2/IQ3 case with hand-written AVX2 intrinsics.
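To show what a "pure-arithmetic" kernel looks like, here is a minimal sketch of a single Q4_0 block. It assumes the commonly documented layout (32 elements per block, low nibbles for the first 16 elements, high nibbles for the last 16, values offset by 8) and takes the block scale `d` as an already-converted f32, whereas the file format stores it as f16. `dequantize_q4_0_block` is a hypothetical name, not the crate's kernel.

```rust
/// Dequantize one Q4_0 block: 16 bytes hold 32 four-bit values sharing one scale `d`.
/// Low nibbles map to elements 0..16, high nibbles to elements 16..32.
pub fn dequantize_q4_0_block(d: f32, quants: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &byte) in quants.iter().enumerate() {
        // Each nibble is an unsigned value 0..=15, recentred to -8..=7.
        let lo = i32::from(byte & 0x0F) - 8;
        let hi = i32::from(byte >> 4) - 8;
        out[i] = lo as f32 * d;
        out[i + 16] = hi as f32 * d;
    }
    out
}
```

There is no table lookup and no branch in the loop body, only masks, shifts, a subtraction and a multiply, which is the property the note contrasts with the codebook-based IQ2 and IQ3 kernels.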
Limitations (peak heap): Whole-model dequantisation via ParsedModel::remember or amn remember model.gguf -o out.safetensors retains every dequantised tensor in heap memory simultaneously until the underlying safetensors::serialize_to_file call returns. Peak heap is O(total_BF16_output_size) ≈ 2 × n_parameters bytes: comfortable for ≤7 B models on a 32 GB system, tight at 13 B, OOMs at 70 B+. The single-tensor kernel dequantize_gguf_blocks_to_bf16 is already streaming (O(one block)); the orchestrator-level streaming output path is planned for Phase 10 (see ROADMAP.md). Phase 9 (SIMD) and Phase 10 (streaming) are independent; this perf table will be unaffected by Phase 10 because the per-tensor kernel timings stay the same.
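The peak-heap estimate is simple arithmetic: BF16 uses two bytes per parameter. A worked sketch, with the hypothetical helper `peak_bf16_heap_bytes` and round parameter counts chosen for illustration:

```rust
/// Approximate peak heap, in bytes, when every BF16 output tensor is held at once.
pub fn peak_bf16_heap_bytes(n_parameters: u64) -> u64 {
    // BF16 is 2 bytes per element; saturate rather than overflow on absurd inputs.
    n_parameters.saturating_mul(2)
}
```

Under this estimate a 7 B model needs about 14 GB, a 13 B model about 26 GB, and a 70 B model about 140 GB, which lines up with the comfortable, tight and out-of-memory cases named above for a 32 GB system.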
Safetensors Header Inspection
Header-only safetensors parsing ships in three forms. The path-based parse(path) memory-maps the file and returns a ParsedModel (header + mmap-backed buffer) ready for inspect() or remember(); the slice-based parse_safetensors_header(&[u8]) operates on a buffer that already contains the prefix and JSON; and parse_safetensors_header_from_reader<R: Read>(reader) accepts any Read substrate (in-memory Cursor, HTTP-range-backed adapter, custom transport) and reads only the 8-byte length prefix plus the JSON header. Total transfer ≈ header size (~1 MiB on a multi-GB shard) instead of the full file.
Read — not Read + Seek — is sufficient because the safetensors layout is purely prefix-then-JSON: two contiguous reads in order, never seek-back. This keeps the simplest possible HTTP-range adapter (one connection, two range fetches) viable. Anamnesis itself takes on no network or TLS dependency — downstream crates plug in their own adapter when remote inspection is needed. See the rustdoc on parse_safetensors_header_from_reader for the access pattern.
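The prefix-then-JSON access pattern can be sketched with nothing but std::io::Read. This is not the crate's parse_safetensors_header_from_reader: the name `read_header_json` and the `max_len` guard are hypothetical, and the sketch returns the raw JSON text without parsing it.

```rust
use std::io::{self, Read};

/// Read the 8-byte little-endian length prefix, then exactly that many
/// bytes of JSON. Nothing past the header is touched.
pub fn read_header_json<R: Read>(mut reader: R, max_len: u64) -> io::Result<String> {
    // First read: the fixed-size length prefix.
    let mut prefix = [0u8; 8];
    reader.read_exact(&mut prefix)?;
    let len = u64::from_le_bytes(prefix);
    // Hypothetical guard against absurd or hostile length prefixes.
    if len > max_len {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "header too large"));
    }
    // Second read: the JSON header itself, immediately after the prefix.
    let mut json = vec![0u8; len as usize];
    reader.read_exact(&mut json)?;
    String::from_utf8(json).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Both reads move strictly forward, so any `Read` implementation works, including one that translates each read into an HTTP range request.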
NPZ/NPY Parsing
Feature-gated behind npz. Custom NPY header parser with bulk read_exact — zero per-element deserialization for little-endian data on little-endian machines. Cross-validated byte-exact against NumPy on Gemma Scope 2B SAE weights.
| Metric | Value |
|---|---|
| Throughput (302 MB Gemma Scope, F32) | 3,586 MB/s |
| Overhead vs raw I/O | 1.3x |
| vs npyz crate | 17.7x faster |
| Supported dtypes | F16, BF16, F32, F64, Bool, U8–U64, I8–I64 |
BF16 support via JAX V2 void-dtype convention. Big-endian NPY files handled with in-place byte-swap.
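For orientation, the fixed part of an NPY file and the byte-swap step can be sketched as follows. The layout used here (6-byte magic, two version bytes, then a 2-byte header length for version 1.0 or a 4-byte length for versions 2.0 and 3.0, all little-endian) follows the published NPY format. The function names are hypothetical and the crate's parser may differ.

```rust
/// Parse the fixed NPY prelude and return
/// (major version, offset of the header dict, length of the header dict).
pub fn parse_npy_prelude(bytes: &[u8]) -> Option<(u8, usize, usize)> {
    if bytes.len() < 10 || !bytes.starts_with(b"\x93NUMPY") {
        return None;
    }
    let major = bytes[6];
    match major {
        // Version 1.0: 2-byte little-endian header length.
        1 => Some((major, 10, u16::from_le_bytes([bytes[8], bytes[9]]) as usize)),
        // Versions 2.0 and 3.0: 4-byte little-endian header length.
        2 | 3 => {
            let len = bytes.get(8..12)?;
            Some((major, 12, u32::from_le_bytes([len[0], len[1], len[2], len[3]]) as usize))
        }
        _ => None,
    }
}

/// Reverse the bytes of every `elem_size`-byte element in place,
/// converting big-endian data to little-endian (or back).
pub fn byteswap_in_place(data: &mut [u8], elem_size: usize) {
    if elem_size < 2 {
        return; // single-byte dtypes have no byte order
    }
    for chunk in data.chunks_exact_mut(elem_size) {
        chunk.reverse();
    }
}
```

Swapping in place avoids allocating a second buffer, and little-endian files on little-endian machines skip the step entirely.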
Header-only inspection ships in two forms: inspect_npz(path) for files on disk and inspect_npz_from_reader<R: Read + Seek>(reader) for any other substrate (in-memory Cursor, HTTP-range-backed adapter, custom transport). Anamnesis itself takes on no network or TLS dependency — downstream crates plug in their own Read + Seek adapter when remote inspection is needed. See the rustdoc on inspect_npz_from_reader for the access pattern an HTTP-range adapter must satisfy.
GGUF Inspection
Feature-gated behind gguf. The path-based parse_gguf(path) memory-maps the file and returns a ParsedGguf with zero-copy Cow::Borrowed tensor views into the mapping; the reader-generic inspect_gguf_from_reader<R: Read + Seek>(reader) accepts any positional substrate (in-memory Cursor, HTTP-range-backed adapter, custom transport) and returns just the GgufInspectInfo summary (version, architecture, tensor count, total size, dtypes, alignment) without materialising the data segment.
Read + Seek — not just Read — is required because GGUF's parser computes the absolute tensor-data offset by combining the relative offsets in the tensor-info table with the post-tensor-info data_section_start anchor, then validates each offset against the captured stream length. The simplest correct refactor preserves this positional access pattern via Seek. A pure-Read reformulation would require restructuring the parser into a strict forward pass and is out of scope. A 2 GiB quantised GGUF is inspectable in two or three small range requests covering a few MiB of front-loaded metadata — no weight data downloaded. Anamnesis itself takes on no network or TLS dependency; downstream crates plug in their own adapter when remote inspection is needed. See the rustdoc on inspect_gguf_from_reader for the access pattern.
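The offset arithmetic described above, relative offset plus data-section anchor, checked against the stream length, can be sketched as a small pure function. `absolute_tensor_range` and its parameter names are hypothetical and stand in for whatever the crate's parser does internally.

```rust
/// Resolve a tensor's absolute byte range, or None if the arithmetic
/// overflows or the range extends past the end of the stream.
pub fn absolute_tensor_range(
    data_section_start: u64,
    relative_offset: u64,
    byte_len: u64,
    stream_len: u64,
) -> Option<(u64, u64)> {
    // Offsets in the tensor-info table are relative to the data section.
    let start = data_section_start.checked_add(relative_offset)?;
    let end = start.checked_add(byte_len)?;
    // Validate against the captured stream length before trusting the range.
    if end <= stream_len {
        Some((start, end))
    } else {
        None
    }
}
```

Checked arithmetic matters here because the offsets come from an untrusted file; a wrapped addition could otherwise pass the bounds check.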
PyTorch .pth Parsing
Feature-gated behind pth. Minimal pickle VM (~36 opcodes) with security allowlist. Memory-mapped I/O with zero-copy tensor access (Cow::Borrowed from mmap). Cross-validated byte-exact against PyTorch torch.load() on 3 AlgZoo models (MIT-0 license).
| Model | Size | Tensors | vs torch.load |
|---|---|---|---|
| torchvision ResNet-18 | 45 MB | 102 | 11.2x faster |
| torchvision ResNet-50 | 98 MB | 267 | 12.7x faster |
| torchvision ViT-B/16 | 330 MB | 152 | 30.8x faster |
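The security allowlist mentioned at the top of this section can be pictured as a fixed table of globals that the pickle VM agrees to resolve. The entries below are names that commonly appear in PyTorch state_dict pickles and are chosen for illustration; they are an assumption, not the crate's actual list, and `is_allowed_global` is a hypothetical name.

```rust
/// Hypothetical allowlist of (module, name) globals a state_dict pickle may reference.
const ALLOWED_GLOBALS: &[(&str, &str)] = &[
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
    ("torch", "FloatStorage"),
    ("torch", "HalfStorage"),
    ("torch", "BFloat16Storage"),
];

/// Return true only for globals on the allowlist; everything else is rejected.
pub fn is_allowed_global(module: &str, name: &str) -> bool {
    ALLOWED_GLOBALS
        .iter()
        .any(|&(m, n)| m == module && n == name)
}
```

A deny-by-default check of this kind means an unexpected global such as os.system stops parsing instead of being resolved.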
Lossless .pth → .safetensors conversion preserving original dtypes (F16, BF16, F32, F64, I8–I64, U8, Bool). The conversion pipeline writes directly from mmap slices to the output file — zero intermediate data copies.
Handles both newer (archive/ prefix) and older ({model_name}/ prefix) PyTorch ZIP conventions. Legacy (pre-1.6) raw-pickle files are rejected with a clear error.
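One way to handle both ZIP naming conventions is to locate the data.pkl entry and treat whatever precedes it as the archive prefix. The sketch below works on a plain list of entry names; `archive_prefix` is a hypothetical helper, not the crate's implementation.

```rust
/// Find the directory prefix (including the trailing '/') of the `data.pkl` entry.
/// Works for both "archive/data.pkl" and "{model_name}/data.pkl".
pub fn archive_prefix<'a>(entry_names: &[&'a str]) -> Option<&'a str> {
    entry_names.iter().copied().find_map(|name| {
        name.strip_suffix("data.pkl")
            // Accept only a whole path component, not e.g. "extra_data.pkl".
            .filter(|prefix| prefix.is_empty() || prefix.ends_with('/'))
    })
}
```

Once the prefix is known, tensor payloads can be looked up as prefix + "data/0", prefix + "data/1", and so on, regardless of which convention wrote the file.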
PyTorch .pth Inspection
Feature-gated behind pth. The path-based parse_pth(path) memory-maps the file and returns a ParsedPth with zero-copy tensors() views into the mapping; the reader-generic inspect_pth_from_reader<R: Read + Seek>(reader) accepts any positional substrate (in-memory Cursor, HTTP-range-backed adapter, custom transport) and returns just the PthInspectInfo summary (tensor_count, total_bytes, dtypes, big_endian) without materialising any of the tensor-data files inside the archive (data/0, data/1, …).
Only the ZIP central directory and the data.pkl entry — typically <100 KiB even on torchvision-class 300 MB models — are fetched. A 300 MB torchvision .pth is inspectable through an HTTP-range adapter in well under 100 KiB of network transfer, instead of 300 MB.
Measured across the full 6 960-file AlgZoo corpus (the algzoo_weights/ set imported for candle-mi v0.1.9's stoicheia module; best-of-5 release-mode median per file, target-cpu=native, PyTorch 2.10.0+cu130):
| Substrate | Median per file | vs torch.load |
|---|---|---|
| parse_pth(path).inspect() (mmap) | 124.0 µs | 4.07x faster |
| inspect_pth_from_reader(File) (reader) | 168.7 µs | 2.99x faster |
| torch.load(weights_only=True) (PyTorch) | 504.3 µs | baseline |
PyTorch has no separate inspect-only primitive — torch.load(weights_only=True) is the closest comparable; it fully materialises every tensor before the caller can iterate the state_dict for summary stats, so the speedup is a lower bound that grows by orders of magnitude on larger models (the reader path stays bounded by data.pkl size while torch.load scales linearly in total tensor-data size). Per-family breakdown and the full method are in docs/perf-experiments.md Experiment 6.
Read + Seek (not just Read) is required because the ZIP format keeps its central directory at the end of the file, then seeks back to each local-file header to read entry payloads. zip::ZipArchive::new already requires Read + Seek for that reason, and inspect_pth_from_reader inherits the constraint verbatim. The pickle interpreter itself runs over an owned Vec<u8> (the materialised data.pkl) — same security allowlist as the path-based parse_pth, shared by construction so the two entry points cannot diverge. Anamnesis itself takes on no network or TLS dependency; downstream crates plug in their own adapter when remote inspection is needed. See the rustdoc on inspect_pth_from_reader for the full access pattern.
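The reason Seek is unavoidable becomes concrete when locating the central directory by hand. The sketch below scans the tail of the stream for the End Of Central Directory record and reads the directory's offset and size from it. It ignores ZIP64 and is only an illustration of the access pattern; the crate relies on zip::ZipArchive for this, and `find_central_directory` is a hypothetical name.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Locate the ZIP End Of Central Directory (EOCD) record and return
/// (central directory offset, central directory size). ZIP64 is not handled.
pub fn find_central_directory<R: Read + Seek>(reader: &mut R) -> io::Result<(u64, u64)> {
    const EOCD_MIN: u64 = 22; // fixed part of the EOCD record
    const MAX_COMMENT: u64 = 65_535; // trailing comment length is a u16

    // Seek to the end to learn the stream length, then read only the tail.
    let file_len = reader.seek(SeekFrom::End(0))?;
    let tail_len = file_len.min(EOCD_MIN + MAX_COMMENT);
    reader.seek(SeekFrom::Start(file_len - tail_len))?;
    let mut tail = vec![0u8; tail_len as usize];
    reader.read_exact(&mut tail)?;

    // Scan backwards for the EOCD signature "PK\x05\x06".
    let pos = (0..tail.len().saturating_sub(EOCD_MIN as usize - 1))
        .rev()
        .find(|&i| &tail[i..i + 4] == b"PK\x05\x06")
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "no EOCD record"))?;

    // Central directory size lives at byte 12 of the record, its offset at byte 16.
    let size = u32::from_le_bytes([tail[pos + 12], tail[pos + 13], tail[pos + 14], tail[pos + 15]]);
    let offset = u32::from_le_bytes([tail[pos + 16], tail[pos + 17], tail[pos + 18], tail[pos + 19]]);
    Ok((u64::from(offset), u64::from(size)))
}
```

Over an HTTP-range adapter this amounts to one small request near the end of the file, followed by requests for the central directory and the data.pkl entry, which matches the transfer budget described above.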
Used by
- candle-mi — Mechanistic interpretability toolkit for language models
License
Licensed under either of Apache License, Version 2.0 or MIT License at your option.
Development
- Exclusively developed with Claude Code (dev) and Augment Code (review)
- Git workflow managed with Fork
- All code follows CONVENTIONS.md, derived from Amphigraphic-Strict's Grit — a strict Rust subset designed to improve AI coding accuracy.