anamnesis 0.5.0

# anamnesis

[![CI](https://github.com/PCfVW/anamnesis/actions/workflows/ci.yml/badge.svg)](https://github.com/PCfVW/anamnesis/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/anamnesis.svg)](https://crates.io/crates/anamnesis)
[![docs.rs](https://docs.rs/anamnesis/badge.svg)](https://docs.rs/anamnesis)
[![MSRV](https://img.shields.io/badge/MSRV-1.88-blue.svg)](https://www.rust-lang.org)
[![license](https://img.shields.io/crates/l/anamnesis.svg)](https://github.com/PCfVW/anamnesis#license)
[![unsafe: deny](https://img.shields.io/badge/unsafe-deny_(mmap_only)-blue.svg)](https://github.com/rust-secure-code/safety-dance/)

**ἀνάμνησις** — *Parse any format, recover any precision.*

> ⚠️ **This crate is under active development.** See [ROADMAP.md](ROADMAP.md) for the plan and [CHANGELOG.md](CHANGELOG.md) for current progress.

## Table of Contents

- [Install](#install)
- [CLI Commands](#cli-commands)
- [Tested Models](#tested-models)
  - [FP8 Dequantization](#fp8-dequantization)
  - [GPTQ Dequantization](#gptq-dequantization)
  - [AWQ Dequantization](#awq-dequantization)
  - [BitsAndBytes Dequantization](#bitsandbytes-dequantization)
  - [BitsAndBytes Quantization (Lethe — Phase 5)](#bitsandbytes-quantization-lethe--phase-5)
  - [GGUF Block-Quant Dequantization](#gguf-block-quant-dequantization)
- [Safetensors Header Inspection](#safetensors-header-inspection)
- [NPZ/NPY Parsing](#npznpy-parsing)
- [GGUF Inspection](#gguf-inspection)
- [PyTorch `.pth` Parsing](#pytorch-pth-parsing)
- [PyTorch `.pth` Inspection](#pytorch-pth-inspection)
- [Used by](#used-by)
- [License](#license)
- [Development](#development)

## Install

```sh
cargo install anamnesis --features cli,pth,gguf
```

Installs both `anamnesis` and `amn` (short alias). Feature flags: `gptq`, `awq`, `bnb`, `npz`, `pth`, `gguf`, `indicatif` (progress bars).

## CLI Commands

| Command | Description |
|---------|---|
| `amn parse <file>` | Parse and summarize a model file (`.safetensors`, `.pth`, `.npz`, `.gguf`) |
| `amn inspect <file>` | Show format, tensor counts, size estimates, and byte order |
| `amn remember <file>` | Dequantize to BF16 (safetensors) or convert `.pth`/`.gguf` → `.safetensors` |

Aliases: `amn info` = `amn inspect`, `amn dequantize` = `amn remember`.

Format detection is automatic: `.safetensors` files go through the dequantization pipeline, `.pth`/`.pt` files through the pickle parser, `.npz` files through the header-only NPZ inspector, and `.gguf` files through the GGUF parser. `.bin` files are probed for ZIP/GGUF magic to distinguish PyTorch, GGUF, and safetensors.

```
$ amn parse model.pth
Parsed model.pth (PyTorch state_dict)
  Tensors:    3
  Total size: 1.7 KB
  Dtypes:     F32
  Byte order: little-endian

  rnn.weight_ih_l0               F32 [16, 1]         64 B
  rnn.weight_hh_l0               F32 [16, 16]        1.0 KB
  linear.weight                  F32 [10, 16]        640 B

$ amn inspect weights.npz
Format:      NPZ archive
Tensors:     5
Total size:  160 B
Dtypes:      F32

$ amn remember model.pth
Converting model.pth → model.safetensors
  3 tensors, 1.7 KB
  Done.
```
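
The magic-byte probing used for `.bin` files can be pictured with a minimal sketch. This is not the crate's own code: `Probed` and `probe_magic` are hypothetical names, and the safetensors check is only a plausibility heuristic (safetensors has no magic number, just an 8-byte little-endian header length followed by JSON).

```rust
/// Container kinds a prefix probe can tell apart (illustrative enum, not the crate's API).
#[derive(Debug, PartialEq, Eq)]
pub enum Probed {
    /// ZIP local-file-header magic `PK\x03\x04`. PyTorch ZIP checkpoints start this way
    /// (so do `.npz` archives, which is why the extension is still consulted first).
    Zip,
    /// GGUF magic: the ASCII bytes `GGUF`.
    Gguf,
    /// No magic, but the prefix looks like a length-prefixed JSON header.
    MaybeSafetensors,
    /// None of the above.
    Unknown,
}

/// Classify a file from its first bytes.
pub fn probe_magic(bytes: &[u8]) -> Probed {
    if bytes.starts_with(b"PK\x03\x04") {
        return Probed::Zip;
    }
    if bytes.starts_with(b"GGUF") {
        return Probed::Gguf;
    }
    // safetensors: u64 little-endian header length, then a JSON object starting with '{'.
    if bytes.len() >= 9 {
        let mut prefix = [0u8; 8];
        prefix.copy_from_slice(&bytes[..8]);
        let header_len = u64::from_le_bytes(prefix);
        if header_len >= 2 && bytes[8] == b'{' {
            return Probed::MaybeSafetensors;
        }
    }
    Probed::Unknown
}
```

Checking the two real magics before the safetensors heuristic keeps the weakest test last, so a ZIP or GGUF file is never misread as a length-prefixed JSON header.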

## Tested Models

### FP8 Dequantization

Cross-validated against PyTorch on 7 real FP8 models from 5 quantization tools. Bit-exact output (0 ULP difference). Auto-vectorized: SSE2 on any x86-64, AVX2 with `target-cpu=native`.

| Model | Quantizer | Scheme | Scales | vs PyTorch (AVX2) |
|---|---|---|---|---|
| EXAONE-4.0-1.2B-FP8 | LG AI | Fine-grained | BF16 | 6.0x faster |
| Qwen3-1.7B-FP8 | Qwen | Fine-grained | BF16 | 3.9x faster |
| Qwen3-4B-Instruct-2507-FP8 | Qwen | Fine-grained | F16 | 3.0x faster |
| Ministral-3-3B-Instruct-2512 | Mistral | Per-tensor | BF16 | 9.7x faster |
| Llama-3.2-1B-Instruct-FP8 | RedHat | Per-tensor | BF16 | 3.9x faster |
| Llama-3.2-1B-Instruct-FP8-dynamic | RedHat | Per-channel | BF16 | 2.7x faster |
| Llama-3.1-8B-Instruct-FP8 | NVIDIA | Per-tensor | F32 | 6.3x faster |

### GPTQ Dequantization

Cross-validated against PyTorch on 4 real GPTQ models from 2 quantizers (AutoGPTQ, GPTQModel). Bit-exact output (0 ULP difference). Loop fission for full AVX2 vectorization.

| Model | Quantizer | Bits | vs PyTorch (AVX2) |
|---|---|---|---|
| Falcon3-1B-Instruct-GPTQ-Int4 | AutoGPTQ | 4 | 6.5x faster |
| Llama-3.2-1B-Instruct-GPTQ | AutoGPTQ | 4 | 12.2x faster |
| Falcon3-1B-Instruct-GPTQ-Int8 | AutoGPTQ | 8 | 7.0x faster |
| Llama-3.2-1B-gptqmodel-8bit | GPTQModel | 8 | 7.9x faster |

### AWQ Dequantization

Cross-validated against PyTorch on 2 real AWQ models (AutoAWQ GEMM). Bit-exact output (0 ULP difference). Loop fission for full AVX2 vectorization.

| Model | Quantizer | Bits | vs PyTorch (AVX2) |
|---|---|---|---|
| llama-3.2-1b-instruct-awq | AutoAWQ | 4 | 5.7x faster |
| Falcon3-1B-Instruct-AWQ | AutoAWQ | 4 | 4.7x faster |

### BitsAndBytes Dequantization

Cross-validated against PyTorch on 4 real BitsAndBytes models (NF4, FP4, double-quant, INT8). Bit-exact output (0 ULP difference). Loop fission for AVX2 on NF4/FP4; single-pass AVX2 on INT8 (`vpmovsxbd` → `vcvtdq2ps` → `vmulps`).

| Model | Format | Elements | vs PyTorch (AVX2) |
|---|---|---|---|
| Llama-3.2-1B-Instruct-bnb-nf4 | NF4 | 4,096 | 21.8x faster |
| Llama-3.2-1B-BNB-FP4 | FP4 | 4,096 | 18.0x faster |
| Llama-3.2-1B-Instruct-bnb-nf4-double-quant | NF4 double-quant | 4,096 | 54.0x faster |
| Llama-3.2-1B-BNB-INT8 | INT8 | 65,536 | 1.2x faster |

> **Note:** INT8 speedup is modest because the operation is trivially simple (`i8→f32→multiply`). Both PyTorch and anamnesis are near memory bandwidth limits at ~0.7–0.8 ns/element. The AVX2 hot loop is fully vectorized — the 1.2× reflects the inherent ceiling, not a missed optimization.
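
To make the `i8→f32→multiply` shape concrete, here is a minimal sketch of such a loop. The function name and the single per-row `scale` parameter are assumptions for illustration; the crate's real kernel, its scale layout, and the final narrowing to BF16 are not shown.

```rust
/// Dequantize one row of INT8 values: widen each `i8` to `f32`, then multiply by a scale.
/// Hypothetical helper: the per-row `scale` and the `f32` output are simplifying assumptions.
pub fn dequantize_int8_row(quantized: &[i8], scale: f32, out: &mut [f32]) {
    assert_eq!(quantized.len(), out.len(), "input and output lengths must match");
    for (dst, &q) in out.iter_mut().zip(quantized) {
        // Widening i8 -> f32 is lossless; the only arithmetic is one multiply per element.
        *dst = f32::from(q) * scale;
    }
}
```

With so little work per element, a loop like this spends most of its time moving bytes, which is consistent with the bandwidth-bound behaviour described in the note.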

### BitsAndBytes Quantization (Lethe — Phase 5)

The inverse direction. Phase 5 ships the `lethe` namespace alongside `remember`: `encode_bnb4` / `encode_bnb4_double_quant` / `encode_bnb_int8` plus the bit-exact `round_trip` validation harness. Cross-validated against PyTorch `bitsandbytes` on **7 fixtures across 4 architecture families** (Llama 3.2 / Qwen3 / Qwen2.5 / Phi-3.5): every fixture round-trips **byte-exact** (0 byte diffs) against the original PyTorch-quantised bytes.

| Fixture | Format | Elements | Byte-exact round-trip | vs PyTorch quantize (CPU) |
|---|---|---|---|---|
| Llama-3.2-1B-Instruct-bnb-nf4 | NF4 plain | 4,096 | ✓ 0 / 2048 | 0.22× (slower) |
| Llama-3.2-1B-BNB-FP4 | FP4 plain | 4,096 | ✓ 0 / 2048 | 0.24× (slower) |
| Llama-3.2-1B-Instruct-bnb-nf4-double-quant | NF4 double-quant | 4,096 | ✓ 0 / 2048 | 0.22× (slower) |
| Llama-3.2-1B-BNB-INT8 | INT8 | 65,536 | ✓ 0 / 65536 | 0.03× (32× slower) |
| ema1234/qwen_mcqa_bnb_fp4 | FP4 plain (Qwen3) | 4,096 | ✓ 0 / 2048 | 0.20× (slower) |
| unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit | NF4 double-quant (Qwen2.5) | 4,096 | ✓ 0 / 2048 | 0.18× (slower) |
| unsloth/Phi-3.5-mini-instruct-bnb-4bit | NF4 double-quant (Phi-3.5) | 4,096 | ✓ 0 / 2048 | 0.18× (slower) |

> **Sign-of-zero preservation finding (FP4):** The on-disk `bitsandbytes` Python `FP4` `quant_map` stores `+0.0` at *both* index 0 *and* index 8, collapsing the `±0` pair. Nibbles `0` and `8` therefore decode to the same value, so a naive `decode → encode` round-trip cannot be byte-exact under that codebook. Phase 5 introduces a narrow, principled tweak in `dequantize_bnb4_to_bf16`: when a codebook entry is exactly `+0.0` AND the nibble has its high bit set (`nibble & 0x8 != 0`), the emitted `BF16` is `-0.0`. This recovers the sign information that `bitsandbytes`' Python decode discards. The tweak is arithmetically invisible (both values are IEEE 754 zeros), affects `0.2 %` of `FP4` elements, and is a no-op for `NF4`. The encoder mirrors the rule with `apply_sign_magnitude_encode_correction`. The finding generalises: the Qwen3 FP4 fixture shows the same `+0.0` / `+0.0` codebook collapse and round-trips byte-exact under the rule.
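
The sign-of-zero rule can be sketched in a few lines. The two functions below are hypothetical stand-ins working on `f32` rather than `BF16`, and they are not the crate's `dequantize_bnb4_to_bf16` or `apply_sign_magnitude_encode_correction`; they only illustrate the rule as stated, assuming a 16-entry codebook indexed by nibble.

```rust
/// Decode one 4-bit code, emitting `-0.0` when the codebook entry is exactly `+0.0`
/// and the nibble's high bit is set. Illustrative `f32` version of the rule.
pub fn decode_nibble_with_signed_zero(codebook: &[f32; 16], nibble: u8) -> f32 {
    let value = codebook[usize::from(nibble & 0x0F)];
    // `to_bits() == 0` is true only for +0.0 (a plain `== 0.0` would also match -0.0).
    if value.to_bits() == 0 && (nibble & 0x8) != 0 {
        -0.0
    } else {
        value
    }
}

/// Encoder-side mirror for zeros only: pick nibble 8 for `-0.0` and nibble 0 for `+0.0`.
/// Returns `None` for non-zero inputs, which would go through the normal codebook search.
pub fn encode_zero_nibble(value: f32) -> Option<u8> {
    if value == 0.0 {
        Some(if value.is_sign_negative() { 8 } else { 0 })
    } else {
        None
    }
}
```

Because the decoder keeps the high bit alive in the sign of zero, the encoder can recover which of the two colliding nibbles produced a given zero.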

> **Ecosystem finding (NF4 double-quant):** `hf-fm inspect` HTTP-range probes during cross-architecture candidate selection revealed that **every** non-Llama BnB-NF4 model checked uses double-quant, which is bitsandbytes' default. Plain NF4 is effectively a Llama-fixture-only phenomenon. Without `encode_bnb4_double_quant` (Step 1c), anamnesis would only encode a tiny corner of real-world BnB-4bit models, so double-quant encoding was promoted from deferred polish to a required Step 1c gate on `v0.5.0`.

> **On the "slower than PyTorch" column:** The encode kernels are 4–6× slower than PyTorch's broadcast-vectorised quantize on `BnB4`, 32× slower on `INT8`. This is expected — PyTorch encode uses a single broadcast tensor op (`(blocks.unsqueeze(-1) - codebook).abs().argmin(dim=-1)`) that vectorises across the whole tensor; the Rust encode loop is currently scalar per element. **Phase 9 (CPU SIMD pass) is the natural target** — this table makes the gap visible. The same loop-fission + `target-cpu=native` infrastructure that gave the decode path its 18–54× wins is the candidate retrofit on the encode side.
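
For readers unfamiliar with the pattern, a scalar per-element nearest-codebook search looks roughly like the sketch below. It is a simplified analogue of the broadcast `argmin` expression quoted above, not the crate's `encode_bnb4`: the function names are hypothetical, the output is one code per element (no nibble packing), and bit-exact agreement with `bitsandbytes` is not claimed.

```rust
/// Return the index of the codebook entry closest to `normalized`.
/// The strict `<` keeps the lowest index when two entries are equally close.
pub fn nearest_code(codebook: &[f32; 16], normalized: f32) -> u8 {
    let mut best_index = 0u8;
    let mut best_distance = f32::INFINITY;
    for (index, &entry) in codebook.iter().enumerate() {
        let distance = (normalized - entry).abs();
        if distance < best_distance {
            best_distance = distance;
            best_index = index as u8; // index < 16, so the cast cannot truncate
        }
    }
    best_index
}

/// Encode one block: scale by the block's absolute maximum, then search the
/// codebook once per element. Returns the absmax and one 4-bit code per element.
pub fn encode_block(codebook: &[f32; 16], block: &[f32]) -> (f32, Vec<u8>) {
    let absmax = block.iter().fold(0.0_f32, |max, x| max.max(x.abs()));
    let codes = block
        .iter()
        .map(|&x| {
            let normalized = if absmax > 0.0 { x / absmax } else { 0.0 };
            nearest_code(codebook, normalized)
        })
        .collect();
    (absmax, codes)
}
```

Each element walks all 16 entries with a data-dependent update of the running minimum, which is the kind of loop an auto-vectoriser tends to leave scalar.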

### GGUF Block-Quant Dequantization

Cross-validated against the `gguf` Python package (`ggml-org` reference, mirrors `ggml-quants.c`) on **22 block-quant kernels** from 4 real models (bartowski SmolLM2-135M-Instruct, TheBloke TinyLlama-1.1B-Chat, bartowski Mistral-7B-Instruct-v0.3, bartowski Qwen2.5-0.5B-Instruct) plus 3 synthetic fixtures (`TQ1_0` / `TQ2_0` / `MXFP4` — only ~15 BitNet-derivative GGUFs ship the `TQ*` types on HuggingFace, and mainstream `MXFP4` only ships inside the 11 GB `gpt-oss-20b` upload, so a deterministic random tensor is the practical fixture source). Bit-exact output (0 ULP difference). **All 22 of 22 GGUF block types now supported** — Phase 4.5 closed in step 6 (MXFP4). Feature-gated behind `gguf`.

| Kernel | Model | vs `gguf` Python (AVX2) |
|---|---|---|
| Q4_0 | SmolLM2-135M | 6.9x faster |
| Q4_1 | SmolLM2-135M | 6.3x faster |
| Q5_0 | TinyLlama-1.1B | 31.3x faster |
| Q5_1 | SmolLM2-135M | 11.4x faster |
| Q8_0 | SmolLM2-135M | 6.3x faster |
| IQ4_NL | SmolLM2-135M | 12.2x faster |
| Q2_K | TinyLlama-1.1B | 6.7x faster |
| Q3_K | SmolLM2-135M | 10.9x faster |
| Q4_K | SmolLM2-135M | 8.1x faster |
| Q5_K | SmolLM2-135M | 11.6x faster |
| Q6_K | SmolLM2-135M | 26.6x faster |
| IQ4_XS | SmolLM2-135M | 12.6x faster |
| IQ2_XXS | Mistral-7B-v0.3 | 3.45x faster |
| IQ2_XS | Mistral-7B-v0.3 | 2.84x faster |
| IQ2_S | Qwen2.5-0.5B | 4.10x faster |
| IQ3_XXS | Mistral-7B-v0.3 | 3.32x faster |
| IQ3_S | Mistral-7B-v0.3 | 4.37x faster |
| IQ1_S | Mistral-7B-v0.3 | 15.00x faster |
| IQ1_M | Mistral-7B-v0.3 | 7.85x faster |
| TQ1_0 | synthetic | 35.59x faster |
| TQ2_0 | synthetic | 26.31x faster |
| MXFP4 | synthetic | 30.14x faster |

> **Note:** `Q8_1` and `Q8_K` are internal `llama.cpp` activation quant types, not shipped as model weights — they are covered by unit tests only. Speedup measured on 65,536 elements (release build, `target-cpu=native`, best-of-5 per kernel). The `IQ2_*` and `IQ3_*` kernels land in the 2.8×–4.4× range rather than the 6×–31× range of the pure-arithmetic `Q*` kernels because their pass 1 involves a codebook LUT gather and a per-element sign branch — neither of which the auto-vectoriser can eliminate. The `IQ1_*` kernels are notably faster (7.9×–15.0×) because their inner loop replaces the per-element sign branch with a single scalar `±delta` per 8-element group, and the codebook gather is a plain `[u64; 2048]` table lookup. The ternary `TQ*` kernels are the **fastest in the crate** (26×–36×) — no codebook lookup at all, just bit shifts (`TQ2_0`) or a base-3 multiplication trick (`TQ1_0`) decoding directly to `{-d, 0, +d}`. `MXFP4` lands at 30× — structurally identical to `IQ4_NL` (12.2×) but with a tighter 17 B/block layout (1 B `E8M0` exponent vs 2 B `f16`) and a smaller codebook (16 entries × 4-bit nibble lookup) that the auto-vectoriser handles cleanly. Phase 9 (CPU SIMD pass) will further address the IQ2/IQ3 case with hand-written AVX2 intrinsics.

> **Limitations (peak heap):** Whole-model dequantisation via `ParsedModel::remember` or `amn remember model.gguf -o out.safetensors` retains every dequantised tensor in heap memory simultaneously until the underlying `safetensors::serialize_to_file` call returns. Peak heap is `O(total_BF16_output_size)` ≈ `2 × n_parameters` bytes — comfortable for **≤7 B** models on a 32 GB system, **tight at 13 B**, **OOMs at 70 B+**. The single-tensor kernel `dequantize_gguf_blocks_to_bf16` is already streaming (O(one block)); the orchestrator-level streaming output path is planned for Phase 10 — see [ROADMAP.md](ROADMAP.md). Phase 9 (SIMD) and Phase 10 (streaming) are independent; this perf table will be unaffected by Phase 10 because the per-tensor kernel timings stay the same.
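
The `2 × n_parameters` estimate is easy to check by hand. The sketch below is only a worked version of that arithmetic (the helper name is hypothetical); it ignores allocator overhead and anything else the process holds.

```rust
/// Bytes per BF16 element.
pub const BF16_BYTES: u64 = 2;

/// Estimated peak heap for whole-model dequantisation to BF16:
/// every output tensor is resident at once, so the total is 2 bytes per parameter.
pub fn peak_heap_bytes(n_parameters: u64) -> u64 {
    n_parameters * BF16_BYTES
}

// Worked values (decimal GB):
//    7 B parameters ->  14 GB  (fits comfortably in 32 GB)
//   13 B parameters ->  26 GB  (tight on 32 GB)
//   70 B parameters -> 140 GB  (exceeds 32 GB)
```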

## Safetensors Header Inspection

Header-only safetensors parsing ships in three forms. The path-based `parse(path)` memory-maps the file and returns a `ParsedModel` (header + mmap-backed buffer) ready for `inspect()` or `remember()`; the slice-based `parse_safetensors_header(&[u8])` operates on a buffer that already contains the prefix and JSON; and `parse_safetensors_header_from_reader<R: Read>(reader)` accepts any `Read` substrate (in-memory `Cursor`, HTTP-range-backed adapter, custom transport) and reads only the 8-byte length prefix plus the JSON header. Total transfer ≈ header size (~1 MiB on a multi-GB shard) instead of the full file.

`Read` — not `Read + Seek` — is sufficient because the safetensors layout is purely prefix-then-JSON: two contiguous reads in order, never seek-back. This keeps the simplest possible HTTP-range adapter (one connection, two range fetches) viable. Anamnesis itself takes on no network or TLS dependency — downstream crates plug in their own adapter when remote inspection is needed. See the rustdoc on `parse_safetensors_header_from_reader` for the access pattern.
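
The prefix-then-JSON access pattern can be sketched with the standard library alone. This is not the crate's `parse_safetensors_header_from_reader`: `read_header_json` and its `max_len` cap are hypothetical, and the sketch returns the raw JSON bytes without parsing them.

```rust
use std::io::{self, Read};

/// Read the safetensors JSON header from any `Read` source: one 8-byte
/// little-endian length prefix, then exactly that many bytes of JSON.
/// `max_len` is an illustrative safety cap against hostile length prefixes.
pub fn read_header_json<R: Read>(mut reader: R, max_len: usize) -> io::Result<Vec<u8>> {
    // First read: the u64 little-endian header length.
    let mut prefix = [0u8; 8];
    reader.read_exact(&mut prefix)?;
    let header_len = u64::from_le_bytes(prefix);

    if header_len > max_len as u64 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "safetensors header length exceeds the configured cap",
        ));
    }

    // Second read: the JSON bytes themselves. The cast is lossless because
    // header_len <= max_len, and max_len is a usize.
    let mut json = vec![0u8; header_len as usize];
    reader.read_exact(&mut json)?;
    Ok(json)
}
```

Both reads move strictly forward, so the function never needs `Seek`, and a source that fetches bytes lazily only has to serve the first `8 + header_len` bytes.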

## NPZ/NPY Parsing

Feature-gated behind `npz`. Custom NPY header parser with bulk `read_exact` — zero per-element deserialization for little-endian data on little-endian machines. Cross-validated byte-exact against NumPy on Gemma Scope 2B SAE weights.

| Metric | Value |
|---|---|
| Throughput (302 MB Gemma Scope, F32) | **3,586 MB/s** |
| Overhead vs raw I/O | 1.3x |
| vs `npyz` crate | **17.7x faster** |
| Supported dtypes | F16, BF16, F32, F64, Bool, U8–U64, I8–I64 |

BF16 support via JAX `V2` void-dtype convention. Big-endian NPY files handled with in-place byte-swap.
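
An in-place byte-swap of a big-endian payload can be as small as the following sketch. The function name is hypothetical and the crate's actual routine may differ (for example by specialising on element width); the sketch only shows the idea of reversing each element's bytes inside the buffer that `read_exact` filled.

```rust
/// Reverse the byte order of every `item_size`-byte element in `data`, in place.
/// Turns a big-endian payload into little-endian (or vice versa) without a second buffer.
pub fn byteswap_in_place(data: &mut [u8], item_size: usize) {
    assert!(item_size > 0, "item size must be non-zero");
    assert!(data.len() % item_size == 0, "buffer must hold whole elements");
    if item_size == 1 {
        return; // single-byte dtypes (U8, I8, Bool) have no byte order
    }
    for element in data.chunks_exact_mut(item_size) {
        element.reverse();
    }
}
```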

Header-only inspection ships in two forms: `inspect_npz(path)` for files on disk and `inspect_npz_from_reader<R: Read + Seek>(reader)` for any other substrate (in-memory `Cursor`, HTTP-range-backed adapter, custom transport). Anamnesis itself takes on no network or TLS dependency — downstream crates plug in their own `Read + Seek` adapter when remote inspection is needed. See the rustdoc on `inspect_npz_from_reader` for the access pattern an HTTP-range adapter must satisfy.

## GGUF Inspection

Feature-gated behind `gguf`. The path-based `parse_gguf(path)` memory-maps the file and returns a `ParsedGguf` with zero-copy `Cow::Borrowed` tensor views into the mapping; the reader-generic `inspect_gguf_from_reader<R: Read + Seek>(reader)` accepts any positional substrate (in-memory `Cursor`, HTTP-range-backed adapter, custom transport) and returns just the `GgufInspectInfo` summary (version, architecture, tensor count, total size, dtypes, alignment) without materialising the data segment.

`Read + Seek` — not just `Read` — is required because `GGUF`'s parser computes the absolute tensor-data offset by combining the relative offsets in the tensor-info table with the post-tensor-info `data_section_start` anchor, then validates each offset against the captured stream length. The simplest correct refactor preserves this positional access pattern via `Seek`. A pure-`Read` reformulation would require restructuring the parser into a strict forward pass and is out of scope. A 2 GiB quantised `GGUF` is inspectable in two or three small range requests covering a few MiB of front-loaded metadata — no weight data downloaded. Anamnesis itself takes on no network or TLS dependency; downstream crates plug in their own adapter when remote inspection is needed. See the rustdoc on `inspect_gguf_from_reader` for the access pattern.
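
The offset arithmetic described above can be sketched as follows. Both helpers are hypothetical, not the crate's internals; the sketch assumes the data section begins at the first multiple of the file's alignment after the tensor-info table, and that each tensor's relative offset is measured from that anchor.

```rust
/// Round `offset` up to the next multiple of `alignment`.
/// Returns `None` for a zero alignment or on arithmetic overflow.
pub fn align_up(offset: u64, alignment: u64) -> Option<u64> {
    if alignment == 0 {
        return None;
    }
    let bumped = offset.checked_add(alignment - 1)?;
    Some(bumped / alignment * alignment)
}

/// Turn a tensor's relative offset into an absolute byte range and validate it
/// against the total stream length. Returns `None` if the range overflows or
/// runs past the end of the stream.
pub fn absolute_tensor_range(
    data_section_start: u64,
    relative_offset: u64,
    byte_len: u64,
    stream_len: u64,
) -> Option<(u64, u64)> {
    let start = data_section_start.checked_add(relative_offset)?;
    let end = start.checked_add(byte_len)?;
    (end <= stream_len).then_some((start, end))
}
```

The validation needs `stream_len`, which a reader obtains by seeking to the end of the stream; that is one concrete place where positional access enters the picture.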

## PyTorch `.pth` Parsing

Feature-gated behind `pth`. Minimal pickle VM (~36 opcodes) with security allowlist. Memory-mapped I/O with zero-copy tensor access (`Cow::Borrowed` from mmap). Cross-validated byte-exact against PyTorch `torch.load()` on 3 [AlgZoo](https://github.com/alignment-research-center/alg-zoo) models (MIT-0 license).

| Model | Size | Tensors | vs `torch.load` |
|---|---|---|---|
| torchvision ResNet-18 | 45 MB | 102 | **11.2x faster** |
| torchvision ResNet-50 | 98 MB | 267 | **12.7x faster** |
| torchvision ViT-B/16 | 330 MB | 152 | **30.8x faster** |

Lossless `.pth` → `.safetensors` conversion preserving original dtypes (F16, BF16, F32, F64, I8–I64, U8, Bool). The conversion pipeline writes directly from mmap slices to the output file — zero intermediate data copies.

Handles both newer (`archive/` prefix) and older (`{model_name}/` prefix) PyTorch ZIP conventions. Legacy (pre-1.6) raw-pickle files are rejected with a clear error.
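
The idea behind a pickle security allowlist can be illustrated with a short sketch. The entries below are globals that commonly appear in PyTorch `state_dict` pickles, but the list, the constant name, and `check_global` are assumptions for illustration; the crate's actual allowlist is not reproduced here.

```rust
/// Illustrative allowlist of `(module, name)` pairs a pickle may reference.
/// Hypothetical subset: a real allowlist would cover every storage type it supports.
const ALLOWED_GLOBALS: &[(&str, &str)] = &[
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
    ("torch", "FloatStorage"),
    ("torch", "HalfStorage"),
    ("torch", "BFloat16Storage"),
];

/// Called when the pickle stream asks to resolve a global. Anything not on the
/// allowlist is rejected instead of being resolved.
pub fn check_global(module: &str, name: &str) -> Result<(), String> {
    if ALLOWED_GLOBALS.contains(&(module, name)) {
        Ok(())
    } else {
        Err(format!("pickle global `{module}.{name}` is not on the allowlist"))
    }
}
```

Rejecting by default means an unexpected global such as `os.system` stops the parse with an error rather than being interpreted.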

## PyTorch `.pth` Inspection

Feature-gated behind `pth`. The path-based `parse_pth(path)` memory-maps the file and returns a `ParsedPth` with zero-copy `tensors()` views into the mapping; the reader-generic `inspect_pth_from_reader<R: Read + Seek>(reader)` accepts any positional substrate (in-memory `Cursor`, HTTP-range-backed adapter, custom transport) and returns just the `PthInspectInfo` summary (`tensor_count`, `total_bytes`, `dtypes`, `big_endian`) without materialising any of the tensor-data files inside the archive (`data/0`, `data/1`, …).

Only the ZIP central directory and the `data.pkl` entry are fetched, typically <100 KiB in total even on torchvision-class 300 MB models. Such a `.pth` is therefore inspectable through an HTTP-range adapter with well under 100 KiB of network transfer instead of 300 MB.

Measured across the **full 6 960-file [AlgZoo](https://github.com/alignment-research-center/alg-zoo) corpus** (the `algzoo_weights/` set imported for `candle-mi` v0.1.9's `stoicheia` module; best-of-5 release-mode median per file, `target-cpu=native`, PyTorch 2.10.0+cu130):

| Substrate                                    | Median per file | vs `torch.load` |
|---|---:|---:|
| `parse_pth(path).inspect()` (mmap)           | 124.0 µs |  **4.07x faster** |
| `inspect_pth_from_reader(File)` (reader)     | 168.7 µs |  **2.99x faster** |
| `torch.load(weights_only=True)` (PyTorch)    | 504.3 µs |        baseline   |

PyTorch has no separate inspect-only primitive — `torch.load(weights_only=True)` is the closest comparable; it fully materialises every tensor before the caller can iterate the `state_dict` for summary stats, so the speedup is a **lower bound** that grows by orders of magnitude on larger models (the reader path stays bounded by `data.pkl` size while `torch.load` scales linearly in total tensor-data size). Per-family breakdown and the full method are in [`docs/perf-experiments.md`](docs/perf-experiments.md) Experiment 6.

`Read + Seek` (not just `Read`) is required because the ZIP format keeps its central directory at the end of the file: a reader must first locate the directory there, then seek back to each local-file header to read entry payloads. `zip::ZipArchive::new` already requires `Read + Seek` for that reason, and `inspect_pth_from_reader` inherits the constraint verbatim. The pickle interpreter itself runs over an owned `Vec<u8>` (the materialised `data.pkl`) with the same security allowlist as the path-based `parse_pth`, shared by construction so the two entry points cannot diverge. Anamnesis itself takes on no network or TLS dependency; downstream crates plug in their own adapter when remote inspection is needed. See the rustdoc on `inspect_pth_from_reader` for the full access pattern.
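
Why a ZIP reader has to start from the end of the file can be seen in a standard-library-only sketch that locates the end-of-central-directory (EOCD) record. This is not how the `zip` crate or anamnesis is implemented: `Eocd` and `find_eocd` are hypothetical, and ZIP64 archives are deliberately ignored.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// The fields of the EOCD record needed to find the central directory.
#[derive(Debug)]
pub struct Eocd {
    pub total_entries: u16,
    pub cd_size: u32,
    pub cd_offset: u32,
}

/// Locate the EOCD record by scanning backward from the end of the stream.
pub fn find_eocd<R: Read + Seek>(reader: &mut R) -> io::Result<Eocd> {
    const SIGNATURE: [u8; 4] = *b"PK\x05\x06";
    const RECORD_LEN: u64 = 22; // fixed part of the record
    const MAX_TAIL: u64 = RECORD_LEN + 65_535; // plus the largest possible comment

    // Seeking to the end is the only way to learn where the tail starts.
    let stream_len = reader.seek(SeekFrom::End(0))?;
    if stream_len < RECORD_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "too short for a ZIP"));
    }
    let tail_len = stream_len.min(MAX_TAIL);
    reader.seek(SeekFrom::Start(stream_len - tail_len))?;
    let mut tail = vec![0u8; tail_len as usize]; // at most 65,557 bytes
    reader.read_exact(&mut tail)?;

    // Scan from the last possible record position toward the front.
    let last_start = tail.len() - RECORD_LEN as usize;
    for start in (0..=last_start).rev() {
        if tail[start..start + 4] == SIGNATURE {
            let rec = &tail[start..start + RECORD_LEN as usize];
            return Ok(Eocd {
                total_entries: u16::from_le_bytes([rec[10], rec[11]]),
                cd_size: u32::from_le_bytes([rec[12], rec[13], rec[14], rec[15]]),
                cd_offset: u32::from_le_bytes([rec[16], rec[17], rec[18], rec[19]]),
            });
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "EOCD record not found"))
}
```

Over an HTTP-range adapter, the tail read maps to one small range request at the end of the file, and `cd_offset` / `cd_size` give the range for the next one.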

## Used by

- [candle-mi](https://github.com/PCfVW/candle-mi) — Mechanistic interpretability toolkit for language models

## License

Licensed under either of [Apache License, Version 2.0](LICENSE-APACHE)
or [MIT License](LICENSE-MIT) at your option.

## Development

- Exclusively developed with [Claude Code](https://claude.com/product/claude-code) (dev) and [Augment Code](https://www.augmentcode.com/) (review)
- Git workflow managed with [Fork](https://fork.dev/)
- All code follows [CONVENTIONS.md](CONVENTIONS.md), derived from [Amphigraphic-Strict](https://github.com/PCfVW/Amphigraphic-Strict)'s [Grit](https://github.com/PCfVW/Amphigraphic-Strict/tree/master/Grit) — a strict Rust subset designed to improve AI coding accuracy.