ἀνάμνησις — parse any tensor format, recover any precision.
anamnesis is a framework-agnostic Rust library for dequantizing
quantized model weights and parsing tensor archives. It handles
.safetensors (memory-mapped, classify, dequantize to BF16),
.npz (bulk extraction at near-I/O speed), and PyTorch .pth
(zero-copy mmap with lossless safetensors conversion, 11–31× faster
than torch.load()).
§Supported Quantization Schemes (decode — remember)
| Scheme | Feature gate | Speedup vs PyTorch CPU (AVX2) |
|---|---|---|
| FP8 E4M3 (fine-grained, per-channel, per-tensor) | (always on) | 2.7–9.7× |
| GPTQ (INT4/INT8, group-wise, g_idx) | gptq | 6.5–12.2× |
| AWQ (INT4, per-group, activation-aware) | awq | 4.7–5.7× |
| BitsAndBytes NF4/FP4 (lookup + per-block absmax) | bnb | 18–54× |
| BitsAndBytes INT8 (LLM.int8(), per-row absmax) | bnb | 1.2× |
All schemes produce bit-exact output (0 ULP difference) against the
canonical quantization libraries’ own dequantization code —
bitsandbytes (dequantize_4bit / int8_vectorwise_dequant), AutoAWQ
(unpack_awq + reverse_awq_order), GPTQModel (dequantize_weight
plus its v1→v2 zero-point conversion), and PyTorch’s native fp8
cast — verified on real-model fixtures. Hand-rolled reference
reimplementations are banned from the fixture generators (v0.6.4 rule;
see docs/dogfooding-feedbacks/ for the circular-validation incident
that motivated it).
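As an illustration of what an FP8 E4M3 decode involves, here is a minimal stdlib-only sketch. It is not the crate's kernel: the function names are hypothetical, it assumes the E4M3 "fn" variant used by PyTorch (no infinities, NaN at 0x7F/0xFF), and it assumes the scale is applied in f32 before rounding to BF16 with round-to-nearest-even.

```rust
/// Exact power of two as f32, valid for exponents in -126..=127.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
fn fp8_e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let mant = (byte & 0x07) as f32;
    if exp == 0x0F && (byte & 0x07) == 0x07 {
        return f32::NAN; // the only NaN encoding in this variant
    }
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, fixed exponent 2^(1 - 7).
        (mant / 8.0) * pow2(-6)
    } else {
        (1.0 + mant / 8.0) * pow2(exp - 7)
    };
    sign * magnitude
}

/// Round an f32 to BF16 bits with round-to-nearest-even.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040; // force a quiet NaN
    }
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    ((bits + rounding) >> 16) as u16
}

/// Per-tensor dequantisation: one f32 scale shared by every element.
fn dequantize_per_tensor(fp8: &[u8], scale: f32) -> Vec<u16> {
    fp8.iter()
        .map(|&b| f32_to_bf16_bits(fp8_e4m3_to_f32(b) * scale))
        .collect()
}
```

Every finite E4M3 value fits in BF16 exactly, so rounding only matters once a scale is multiplied in; that is where a bit-exact comparison against a reference is most sensitive.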
§Supported Quantization Schemes (encode — lethe)
Phase 5 introduces the encode side as the architectural inverse of
remember. Each kernel here takes the BF16 bytes that
remember produces and writes the corresponding quantised bytes
plus the per-block / per-row metadata (absmax, SCB, …), so
encode(decode(q, scale)) == q holds bit-exactly for every
codebook-LUT family (NF4, FP4), and within i8 representation error
plus the documented clamp at ±127 for INT8.
| Scheme | Feature gate | Cross-validation contract |
|---|---|---|
| BitsAndBytes NF4/FP4 encode | bnb | byte-exact vs bitsandbytes’ on-disk bytes on every fixture |
| BitsAndBytes INT8 encode | bnb | byte-exact vs bitsandbytes’ on-disk bytes on every fixture |
Subsequent encode-kernel families (FP8, GGUF legacy / K-quants
/ IQ / TQ / MXFP4) land in Phase 7.5 and reuse the
lethe::round_trip harness introduced here.
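The round-trip property stated above for codebook-LUT families can be sketched without the crate. The block below is an assumption-laden illustration, not the lethe kernel: `NF4_LEVELS` holds the sixteen NF4 levels as published by bitsandbytes, the first element of each pair is assumed to sit in the high nibble, and a plain nearest-level search stands in for whatever decision procedure the real encoder uses. All function names are hypothetical.

```rust
/// The sixteen NF4 levels (values as published by bitsandbytes; assumed here).
const NF4_LEVELS: [f32; 16] = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
];

/// Index of the level closest to a value already normalised into [-1, 1].
fn nearest_code(normalized: f32) -> u8 {
    let mut best = 0usize;
    for (i, &level) in NF4_LEVELS.iter().enumerate() {
        if (normalized - level).abs() < (normalized - NF4_LEVELS[best]).abs() {
            best = i;
        }
    }
    best as u8
}

/// Quantise one block against a known absmax, packing two 4-bit codes per byte.
fn encode_block(values: &[f32], absmax: f32) -> Vec<u8> {
    let inv = if absmax > 0.0 { 1.0 / absmax } else { 0.0 };
    values
        .chunks(2)
        .map(|pair| {
            let hi = nearest_code(pair[0] * inv);
            // An odd trailing element is padded with 0.0 (assumption).
            let lo = nearest_code(pair.get(1).copied().unwrap_or(0.0) * inv);
            (hi << 4) | lo
        })
        .collect()
}

/// Look each nibble up in the codebook and rescale by the block's absmax.
fn decode_block(packed: &[u8], absmax: f32) -> Vec<f32> {
    packed
        .iter()
        .flat_map(|&byte| {
            let hi = NF4_LEVELS[(byte >> 4) as usize] * absmax;
            let lo = NF4_LEVELS[(byte & 0x0F) as usize] * absmax;
            [hi, lo]
        })
        .collect()
}
```

Because every decoded value lands essentially on a codebook level, re-encoding with the same absmax selects the same index; this is why a lookup-table family can round-trip exactly while INT8 can only promise a bounded error.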
§Format Conversion Pipeline (Phase 6, v0.6.0)
amn convert <input> --to <target> routes any v0.6.0-available
format pair through a single CLI dispatch. The same pipeline is
exposed as a library through three new helper families:
- write_gguf / write_gguf_to_writer / GgufWriteTensor: the format-symmetric inverse of parse_gguf. Phase 6 emits scalar dtypes only (F32, F16, BF16, F64, I8–I64); quantised emit (Q*, IQ*, TQ*, MXFP4) lands in Phase 7.5 through the same writer scaffold. Behind the gguf feature.
- npz_to_safetensors / npz_to_safetensors_bytes: lossless NPZ → safetensors conversion. Every NpzDtype variant maps directly to its safetensors::Dtype counterpart. Behind the npz feature.
- write_bnb_nf4_safetensors / write_bnb_nf4_safetensors_bytes / BnbWriteInput / classify_inputs / is_eligible_for_nf4: end-to-end BF16 → BnB-NF4 safetensors file path. Wraps the encode_bnb4_compute_absmax kernel into the four-tensor on-disk companion layout (weight, weight.absmax, weight.quant_map, weight.quant_state.bitsandbytes__nf4). 2-D tensors only; 1-D biases / norms / embeddings pass through unchanged in BF16. Behind the bnb feature.
| Conversion | anamnesis (CPU) | Python baseline (CPU) | Ratio |
|---|---|---|---|
| NPZ → safetensors (4096×4096 F32) | 11.2 ms | 75.7 ms (numpy + safetensors-py) | 6.75× |
| PTH → safetensors (4096×4096 BF16) | 5.7 ms | 29.6 ms (torch.load + safetensors.torch) | 5.18× |
| safetensors-BF16 → GGUF (4096×4096 BF16) | 13.6 ms | 15.1 ms (gguf-py) | 1.11× |
| safetensors-BF16 → BnB-NF4 (4096×4096 BF16) | 141 ms | 376.8 ms (bitsandbytes CPU) | 2.67× |
Headline numbers measured by t14_perf_vs_python_size_matched in
tests/cross_validation_convert.rs at target-cpu=native, release,
best-of-5 median. Full table including PyTorch-CPU equivalents for
the two non-PyTorch paths is in the project README.
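For readers unfamiliar with what a GGUF writer has to emit, the sketch below shows the fixed header and the alignment arithmetic from the public GGUF specification (version 3, little-endian, default `general.alignment` of 32). It is a stand-alone illustration with hypothetical function names and does not reflect how write_gguf is implemented internally.

```rust
use std::io::{self, Write};

const GGUF_MAGIC: &[u8; 4] = b"GGUF";
const GGUF_VERSION: u32 = 3;

/// Fixed 24-byte header: magic, version, tensor count, metadata KV count.
fn write_header<W: Write>(w: &mut W, tensor_count: u64, kv_count: u64) -> io::Result<()> {
    w.write_all(GGUF_MAGIC)?;
    w.write_all(&GGUF_VERSION.to_le_bytes())?;
    w.write_all(&tensor_count.to_le_bytes())?;
    w.write_all(&kv_count.to_le_bytes())?;
    Ok(())
}

/// GGUF strings are a u64 byte length followed by UTF-8 bytes, no terminator.
fn write_gguf_string<W: Write>(w: &mut W, s: &str) -> io::Result<()> {
    w.write_all(&(s.len() as u64).to_le_bytes())?;
    w.write_all(s.as_bytes())
}

/// Round an offset up to the next multiple of `alignment`.
fn align_up(offset: u64, alignment: u64) -> u64 {
    (offset + alignment - 1) / alignment * alignment
}

/// Zero bytes needed before the tensor-data section can start.
fn padding_for(offset: u64, alignment: u64) -> u64 {
    align_up(offset, alignment) - offset
}
```

The tensor-data section starts at `align_up(end_of_tensor_infos, alignment)`, and each tensor's offset inside it is aligned the same way, which is presumably why a writer needs `general.alignment` to be known before laying out data.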
§Ollama integration (Phase 6.5)
Feature-gated behind ollama (implies gguf). Adds no
third-party dependency — pure stdlib + serde_json (already a
runtime dep). Exposes one function:
- resolve_ollama_model("llama3.2:1b") -> PathBuf: reads the Ollama manifest at ~/.ollama/models/manifests/registry.ollama.ai/library/<name>/<tag> and returns the GGUF blob path (~/.ollama/models/blobs/sha256-<hash>). The OLLAMA_MODELS env var overrides the cache root. Accepts the ollama:name:tag URL-scheme form for amn CLI integration.
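The path arithmetic described above can be sketched with the standard library alone. This is an illustration rather than the body of resolve_ollama_model: every function name is hypothetical, the `latest` default for a missing tag and the `sha256:<hash>` digest spelling inside the manifest are assumptions based on Ollama's conventions, and reading the manifest JSON itself is left out.

```rust
use std::path::{Path, PathBuf};

/// Split "ollama:name:tag" or "name:tag" into (name, tag).
/// Falling back to "latest" when no tag is given is an assumption.
fn split_model_ref(reference: &str) -> (&str, &str) {
    let rest = reference.strip_prefix("ollama:").unwrap_or(reference);
    match rest.split_once(':') {
        Some((name, tag)) => (name, tag),
        None => (rest, "latest"),
    }
}

/// Cache root: OLLAMA_MODELS if set, otherwise ~/.ollama/models.
fn cache_root() -> Option<PathBuf> {
    if let Some(dir) = std::env::var_os("OLLAMA_MODELS") {
        return Some(PathBuf::from(dir));
    }
    std::env::var_os("HOME").map(|home| PathBuf::from(home).join(".ollama").join("models"))
}

/// Manifest location for a library model under the given cache root.
fn manifest_path(root: &Path, reference: &str) -> PathBuf {
    let (name, tag) = split_model_ref(reference);
    root.join("manifests")
        .join("registry.ollama.ai")
        .join("library")
        .join(name)
        .join(tag)
}

/// Blob location for a digest such as "sha256:<hash>" (stored as "sha256-<hash>").
fn blob_path(root: &Path, digest: &str) -> PathBuf {
    root.join("blobs").join(digest.replace(':', "-"))
}
```

A caller would combine these as `manifest_path(&cache_root()?, "llama3.2:1b")`, parse the manifest to find the model layer's digest, and pass that digest to `blob_path`.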
The amn CLI’s parse / inspect / remember / convert
subcommands recognise the ollama: URL scheme prefix and resolve
transparently:
amn inspect ollama:llama3.2:1b
amn remember ollama:gemma2:2b --to bf16 -o gemma2.safetensors
amn convert ollama:qwen2.5-coder:7b-instruct --to safetensors
cross_validation_ollama.rs cross-validates anamnesis’s GGUF
dequant byte-exactly against the gguf-py reference on a slice
pulled from a real Ollama-cached blob — the same kernel anamnesis
already validates against bartowski / TheBloke quantisations,
now also validated on the dominant local-LLM distribution channel.
§Validation infrastructure (Phase 6.5, dev-only)
Phase 6.5 ships three dev-only validation tracks. None of them
affect the published crate (benches/, tests/peak_heap_*.rs,
and the dhat / criterion dev-dependencies are excluded from
the published tarball by Cargo’s defaults).
- Criterion runtime benchmarks (benches/dequant.rs, benches/parsing.rs): throughput baselines per kernel family plus a real-world bench on the Ollama-cached llama3.2:1b Q8_0 slice. Run via cargo bench --features gptq,awq,bnb,gguf,npz,pth. See benches/README.md for run commands + machine-spec baselines.
- dhat-rs peak-heap assertions (tests/peak_heap_gptq.rs, tests/peak_heap_awq.rs, tests/peak_heap_bnb_dq.rs): three #[ignore]d test binaries that wrap the global allocator and assert observed peak heap stays within the documented output_size + O(out_features) (GPTQ/AWQ) or output_size + num_blocks × 4 + block_size × 4 (BnB double-quant) ceiling. Each kernel’s scratch matches the documented # Memory claim to the byte on the reference machine. See tests/peak_heap_README.md for calibration and failure-interpretation guidance.
- Ollama-fixture cross-validation (tests/cross_validation_ollama.rs): bit-exact Q8_0 dequant against gguf-py on a real Ollama blob (see the Ollama integration section above).
§NPZ/NPY Parsing
Feature-gated behind npz. Custom NPY header parser with bulk
read_exact — zero per-element deserialization for LE data on LE
machines. Supports F16, BF16, F32, F64, all integer types,
and Bool. Measured throughput is 3,586 MB/s on a 302 MB file (1.3× raw I/O overhead).
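To make the header-then-bulk-read idea concrete, here is a small stdlib-only sketch of reading one NPY member. It follows the public NPY layout (magic `\x93NUMPY`, version bytes, little-endian header length, ASCII dict) but is not the crate's parser: the names and the 1 MiB header cap are hypothetical, and the final byte-to-f32 step converts per element for portability, whereas the crate describes avoiding that on little-endian machines.

```rust
use std::io::{self, Read};

/// Hypothetical cap on the header dict, to bound allocation on crafted input.
const MAX_HEADER_LEN: usize = 1 << 20;

struct NpyHeader {
    major: u8,
    minor: u8,
    /// The Python-literal dict, e.g. "{'descr': '<f4', 'shape': (2,), ...}".
    dict: String,
}

fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

/// Read the magic, version, and header dict; leaves the reader at the data start.
fn read_npy_header<R: Read>(r: &mut R) -> io::Result<NpyHeader> {
    let mut preamble = [0u8; 8];
    r.read_exact(&mut preamble)?;
    if !preamble.starts_with(b"\x93NUMPY") {
        return Err(invalid("not an NPY file"));
    }
    let (major, minor) = (preamble[6], preamble[7]);
    let header_len = match major {
        1 => {
            let mut b = [0u8; 2];
            r.read_exact(&mut b)?;
            u16::from_le_bytes(b) as usize
        }
        2 | 3 => {
            let mut b = [0u8; 4];
            r.read_exact(&mut b)?;
            u32::from_le_bytes(b) as usize
        }
        _ => return Err(invalid("unsupported NPY version")),
    };
    if header_len > MAX_HEADER_LEN {
        return Err(invalid("NPY header too large"));
    }
    let mut dict = vec![0u8; header_len];
    r.read_exact(&mut dict)?;
    Ok(NpyHeader { major, minor, dict: String::from_utf8_lossy(&dict).into_owned() })
}

/// One bulk read_exact for the payload, then a portable little-endian decode.
fn read_f32_le<R: Read>(r: &mut R, count: usize) -> io::Result<Vec<f32>> {
    let mut raw = vec![0u8; count * 4];
    r.read_exact(&mut raw)?;
    Ok(raw
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```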
§PyTorch .pth Parsing
Feature-gated behind pth. Minimal pickle VM (~36 opcodes) with a
security allowlist (only PyTorch tensor-reconstruction globals are
permitted — the VM never invokes callables, so there is no code-execution
path) and working-set governance: the value stack, memo clones, and
nesting depth are charged against permanent floors
(MAX_PICKLE_WORKING_SET, MAX_PICKLE_VM_DEPTH) and the caller’s
ParseLimits, so a crafted .pth cannot drive multi-GiB heap or a
recursive-Drop stack overflow even on the cheap inspect_pth_from_reader
pre-filter. Memory-mapped I/O with zero-copy Cow::Borrowed tensor data.
Lossless .pth → .safetensors conversion. 11–31× faster than
torch.load() on torchvision models.
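The two safety mechanisms described above, an allowlist for globals and a budget for working set and depth, can be sketched in a few lines. Everything here is hypothetical: the allowlist entries are typical PyTorch checkpoint globals chosen for illustration, and the `Budget` type and its limits are stand-ins for the crate's MAX_PICKLE_WORKING_SET, MAX_PICKLE_VM_DEPTH, and ParseLimits, whose actual values and shapes are not shown in this document.

```rust
/// Illustrative allowlist; the crate's real list may differ.
const ALLOWED_GLOBALS: &[(&str, &str)] = &[
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
    ("torch", "FloatStorage"),
    ("torch", "HalfStorage"),
    ("torch", "BFloat16Storage"),
];

#[derive(Debug, PartialEq)]
enum VmError {
    ForbiddenGlobal(String),
    WorkingSetExceeded,
    DepthExceeded,
}

/// Resolve a GLOBAL opcode to an inert tag. Nothing is ever called:
/// the VM only records which known constructor the pickle referred to.
fn resolve_global(module: &str, name: &str) -> Result<(&'static str, &'static str), VmError> {
    ALLOWED_GLOBALS
        .iter()
        .copied()
        .find(|&(m, n)| m == module && n == name)
        .ok_or_else(|| VmError::ForbiddenGlobal(format!("{}.{}", module, name)))
}

/// Tracks how much heap and nesting the VM may still consume.
struct Budget {
    bytes_left: usize,
    depth: usize,
    max_depth: usize,
}

impl Budget {
    fn new(max_bytes: usize, max_depth: usize) -> Self {
        Budget { bytes_left: max_bytes, depth: 0, max_depth }
    }

    /// Charge a stack push or memo clone before performing it.
    fn charge(&mut self, bytes: usize) -> Result<(), VmError> {
        self.bytes_left = self
            .bytes_left
            .checked_sub(bytes)
            .ok_or(VmError::WorkingSetExceeded)?;
        Ok(())
    }

    /// Enter a nested container (list, tuple, dict).
    fn enter(&mut self) -> Result<(), VmError> {
        if self.depth >= self.max_depth {
            return Err(VmError::DepthExceeded);
        }
        self.depth += 1;
        Ok(())
    }

    /// Leave a nested container.
    fn leave(&mut self) {
        self.depth = self.depth.saturating_sub(1);
    }
}
```

Charging before allocating is the design point: a crafted pickle is rejected at the opcode that would exceed the limit, not after the memory has already been committed.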
§Quick Start
Path-based dequantisation (FP8 → BF16):
use anamnesis::{parse, TargetDtype};
let model = parse("model-fp8.safetensors")?;
let info = model.inspect();
println!("{info}");
model.remember("model-bf16.safetensors", TargetDtype::BF16)?;
Reader-generic inspection over any Read + Seek substrate (in-memory
Cursor, HTTP-range-backed adapter, custom transport). The example
below uses a std::fs::File; an HTTP-range adapter from a
downstream crate (e.g. hf-fm’s HttpRangeReader) plugs in
identically — anamnesis itself stays HTTP-free. Four reader-generic
entry points cover the supported tensor formats:
use anamnesis::{
    inspect_gguf_from_reader, inspect_npz_from_reader,
    inspect_pth_from_reader, parse_safetensors_header_from_reader,
};
let st_header = parse_safetensors_header_from_reader(
    std::fs::File::open("shard.safetensors")?,
)?;
let npz_info = inspect_npz_from_reader(std::fs::File::open("weights.npz")?)?;
let gguf_info = inspect_gguf_from_reader(std::fs::File::open("model.gguf")?)?;
let pth_info = inspect_pth_from_reader(std::fs::File::open("model.pth")?)?;
§Architecture
- parse(): memory-map a .safetensors file into a ParsedModel. Inspect-only workflows touch only the header (~1 MiB) regardless of file size; full dequantisation pages tensor bytes in lazily.
- ParsedModel::inspect: derive format, tensor counts, and size estimates from the parsed header (zero further I/O)
- ParsedModel::remember: dequantize all quantized tensors to BF16 and write a standard .safetensors file
- ParsedModel::remember_to_bytes: the same dequant, returning the .safetensors bytes in memory instead of writing a file (no disk round-trip for an embedder)
- parse_safetensors_header / parse_safetensors_header_from_reader: header-only safetensors parsing. The reader-generic variant accepts any Read substrate (in-memory Cursor, HTTP-range-backed adapter, …) and reads only the 8-byte length prefix plus the JSON header, so a multi-GB shard’s metadata can be inspected with a single ~1 MiB sequential fetch.
- parse_npz(): read an .npz archive into a HashMap<String, NpzTensor> (requires npz feature)
- inspect_npz() / inspect_npz_from_reader(): header-only NPZ inspection. The reader-generic variant accepts any Read + Seek substrate (in-memory Cursor, HTTP-range-backed adapter, …) so callers can extract tensor metadata without materialising the data segment (requires npz feature)
- parse_gguf() / inspect_gguf_from_reader(): GGUF parsing / inspection. The path-based variant memory-maps the file and returns a ParsedGguf with zero-copy tensor views; the reader-generic variant accepts any Read + Seek substrate and returns just the GgufInspectInfo summary, so a multi-GB quantised GGUF’s metadata can be inspected in a few range fetches over the front-loaded header without downloading the data section (requires gguf feature)
- parse_pth() / inspect_pth_from_reader(): PyTorch .pth parsing / inspection. The path-based variant memory-maps the file and returns a ParsedPth with zero-copy tensors(); the reader-generic variant accepts any Read + Seek substrate and returns just the PthInspectInfo summary, so a torchvision-class .pth is inspectable in a single <100 KiB range fetch over the ZIP central directory and data.pkl entry; no tensor-data files inside the archive are read (requires pth feature)
- pth_to_safetensors() / pth_to_safetensors_bytes(): lossless .pth → .safetensors conversion (requires pth feature)
- npz_to_safetensors() / npz_to_safetensors_bytes(): lossless .npz → .safetensors conversion (requires npz feature; Phase 6)
- write_gguf() / write_gguf_to_writer(): emit a .gguf file from scalar-dtype tensors plus a metadata KV table; the format-symmetric inverse of parse_gguf (requires gguf feature; Phase 6)
- write_bnb_nf4_safetensors() / write_bnb_nf4_safetensors_bytes(): end-to-end BF16 → BnB-NF4 safetensors path with the four-tensor companion layout (weight, weight.absmax, weight.quant_map, weight.quant_state.bitsandbytes__nf4) (requires bnb feature; Phase 6)
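The header-only safetensors path in the list above relies on the format's simple framing: an 8-byte little-endian length followed by that many bytes of JSON. A minimal stdlib-only sketch of that read is shown below; the function name and the caller-supplied `max_header_len` cap are hypothetical and only loosely mirror what the crate's ParseLimits presumably enforces.

```rust
use std::io::{self, Read};

/// Read just the JSON header of a safetensors stream.
/// Only `Read` is required: the header is a strict prefix of the file.
fn read_header_json<R: Read>(mut reader: R, max_header_len: u64) -> io::Result<String> {
    // 8-byte little-endian length prefix.
    let mut prefix = [0u8; 8];
    reader.read_exact(&mut prefix)?;
    let len = u64::from_le_bytes(prefix);

    // Refuse absurd lengths before allocating anything.
    if len > max_header_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "safetensors header exceeds the configured limit",
        ));
    }

    // The JSON header itself; tensor bytes that follow are never touched.
    let mut json = vec![0u8; len as usize];
    reader.read_exact(&mut json)?;
    String::from_utf8(json).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Because the function consumes exactly `8 + len` bytes from the front of the stream, an HTTP-range adapter can serve it with one sequential fetch, which matches the single-fetch behaviour described for the reader-generic entry point.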
The remember module contains one submodule per quantization family
(remember::fp8 always-on; remember::gptq, remember::awq,
remember::bnb feature-gated independently under gptq / awq /
bnb).
The lethe module mirrors that layout on the encode side. Phase 5
ships lethe::bnb (feature-gated behind bnb) plus the
always-on lethe::round_trip validation harness. Encoding a fresh
BF16 source into BnB-NF4:
use anamnesis::{encode_bnb4_compute_absmax, NF4_CODEBOOK};
// 64 BF16 elements arranged as one 64-element block.
let bf16_bytes: Vec<u8> = vec![0u8; 64 * 2];
let codebook_bytes: Vec<u8> =
    NF4_CODEBOOK.iter().flat_map(|v| v.to_le_bytes()).collect();
let (weight, absmax) =
    encode_bnb4_compute_absmax(&bf16_bytes, &codebook_bytes, 64, 64)?;
assert_eq!(weight.len(), 32); // 64 elements / 2 nibbles per byte
assert_eq!(absmax.len(), 4); // 1 block × F32 LE
Writing a GGUF file (Phase 6, scalar dtypes only):
use std::collections::HashMap;
use anamnesis::{write_gguf, GgufType, GgufWriteTensor};
// Two BF16 tensors. `shape` is most-significant-first, matching
// `parse_gguf` on the read side.
let w_data: Vec<u8> = vec![0u8; 8 * 2];
let b_data: Vec<u8> = vec![0u8; 4 * 2];
let tensors = [
    GgufWriteTensor { name: "w", shape: &[4, 2], dtype: GgufType::BF16, data: &w_data },
    GgufWriteTensor { name: "b", shape: &[4], dtype: GgufType::BF16, data: &b_data },
];
// Metadata is optional; `general.alignment` is injected if absent.
write_gguf("out.gguf", &tensors, &HashMap::new())?;
§Re-exports
pub use error::AnamnesisError;
pub use error::Result;
pub use inspect::format_bytes;
pub use inspect::InspectInfo;
pub use lethe::classify_inputs;
pub use lethe::encode_bnb4;
pub use lethe::encode_bnb4_compute_absmax;
pub use lethe::encode_bnb4_double_quant;
pub use lethe::encode_bnb_int8;
pub use lethe::encode_bnb_int8_compute_scb;
pub use lethe::is_eligible_for_nf4;
pub use lethe::write_bnb_nf4_safetensors;
pub use lethe::write_bnb_nf4_safetensors_bytes;
pub use lethe::BnbNf4WriteStats;
pub use lethe::BnbWriteInput;
pub use lethe::FP4_CODEBOOK;
pub use lethe::NF4_BLOCK_SIZE;
pub use lethe::NF4_CODEBOOK;
pub use limits::ParseLimits;
pub use model::parse;
pub use model::parse_with_limits;
pub use model::ParsedModel;
pub use model::TargetDtype;
pub use parse::resolve_ollama_model;
pub use parse::inspect_gguf_from_reader;
pub use parse::parse_gguf;
pub use parse::parse_gguf_with_limits;
pub use parse::write_gguf;
pub use parse::write_gguf_to_writer;
pub use parse::GgufInspectInfo;
pub use parse::GgufMetadataArray;
pub use parse::GgufMetadataValue;
pub use parse::GgufTensor;
pub use parse::GgufTensorInfo;
pub use parse::GgufType;
pub use parse::GgufWriteTensor;
pub use parse::ParsedGguf;
pub use parse::inspect_npz;
pub use parse::inspect_npz_from_reader;
pub use parse::parse_npz;
pub use parse::parse_npz_with_limits;
pub use parse::NpzDtype;
pub use parse::NpzInspectInfo;
pub use parse::NpzTensor;
pub use parse::NpzTensorInfo;
pub use parse::inspect_pth_from_reader;
pub use parse::parse_pth;
pub use parse::parse_pth_with_limits;
pub use parse::ParsedPth;
pub use parse::PthDtype;
pub use parse::PthInspectInfo;
pub use parse::PthTensor;
pub use parse::PthTensorInfo;
pub use parse::parse_safetensors_header;
pub use parse::parse_safetensors_header_from_reader;
pub use parse::parse_safetensors_header_from_reader_with_limits;
pub use parse::parse_safetensors_header_with_limits;
pub use parse::AwqCompanions;
pub use parse::AwqConfig;
pub use parse::Bnb4Companions;
pub use parse::BnbConfig;
pub use parse::Dtype;
pub use parse::GptqCompanions;
pub use parse::GptqConfig;
pub use parse::QuantScheme;
pub use parse::SafetensorsHeader;
pub use parse::TensorEntry;
pub use parse::TensorRole;
pub use remember::dequantize_awq_to_bf16;
pub use remember::dequantize_gptq_to_bf16;
pub use remember::dequantize_bnb4_to_bf16;
pub use remember::dequantize_bnb_int8_to_bf16;
pub use remember::dequantize_fp8_to_bf16;
pub use remember::dequantize_per_channel_fp8_to_bf16;
pub use remember::dequantize_per_tensor_fp8_to_bf16;
pub use remember::dequantize_gguf_blocks_to_bf16;
pub use remember::dequantize_gguf_to_bf16;
pub use remember::npz_to_safetensors;
pub use remember::npz_to_safetensors_bytes;
pub use remember::pth_to_safetensors;
pub use remember::pth_to_safetensors_bytes;