ἀνάμνησις — parse any tensor format, recover any precision.
anamnesis is a framework-agnostic Rust library for dequantizing
quantized model weights and parsing tensor archives. It handles
.safetensors (memory-mapped, classify, dequantize to BF16),
.npz (bulk extraction at near-I/O speed), and PyTorch .pth
(zero-copy mmap with lossless safetensors conversion, 11–31× faster
than torch.load()).
§Supported Quantization Schemes (decode — remember)
| Scheme | Feature gate | Speedup vs PyTorch CPU (AVX2) |
|---|---|---|
| FP8 E4M3 (fine-grained, per-channel, per-tensor) | (always on) | 2.7–9.7× |
| GPTQ (INT4/INT8, group-wise, g_idx) | gptq | 6.5–12.2× |
| AWQ (INT4, per-group, activation-aware) | awq | 4.7–5.7× |
| BitsAndBytes NF4/FP4 (lookup + per-block absmax) | bnb | 18–54× |
| BitsAndBytes INT8 (LLM.int8(), per-row absmax) | bnb | 1.2× |
All schemes produce bit-exact output (0 ULP difference) against
PyTorch reference implementations, verified on real models.
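To make the FP8 row of the table concrete, here is a minimal scalar sketch of per-tensor FP8 E4M3 to BF16 dequantisation. It is not the crate's kernel: the function names are hypothetical, the "FN" flavour of E4M3 (no infinities, one NaN mantissa pattern per sign) is assumed, and the BF16 conversion uses round-to-nearest-even, which is also an assumption.

```rust
/// Decode one FP8 E4M3 byte into an f32.
/// Assumes the "FN" variant: no infinities, NaN is exponent 1111 + mantissa 111.
fn e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let mant = (byte & 0x07) as f32;
    if exp == 0x0F && (byte & 0x07) == 0x07 {
        return f32::NAN;
    }
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading one, fixed exponent of 2^-6.
        (mant / 8.0) * 2.0f32.powi(-6)
    } else {
        // Normal: implicit leading one, exponent bias of 7.
        (1.0 + mant / 8.0) * 2.0f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Round an f32 to BF16 (its upper 16 bits) with round-to-nearest-even.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Force a mantissa bit on so the truncated value stays a NaN.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let round = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(round) >> 16) as u16
}

/// Dequantise a per-tensor-scaled FP8 buffer into little-endian BF16 bytes.
fn dequantize_fp8_sketch(weights: &[u8], scale: f32) -> Vec<u8> {
    weights
        .iter()
        .flat_map(|&b| f32_to_bf16_bits(e4m3_to_f32(b) * scale).to_le_bytes())
        .collect()
}
```

Under these assumptions, e4m3_to_f32(0x38) is 1.0 and e4m3_to_f32(0x7E) is 448.0, the largest finite E4M3 value.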
§Supported Quantization Schemes (encode — lethe)
Phase 5 introduces the encode side as the architectural inverse of
remember. Each kernel here takes the BF16 bytes that
remember produces and writes the corresponding quantised bytes
plus the per-block / per-row metadata (absmax, SCB, …), so
encode(decode(q, scale)) == q holds bit-exactly for every
codebook-LUT family (NF4, FP4) and within i8 representation error
plus the documented clamp at ±127 for INT8.
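The INT8 half of that contract can be illustrated with a small per-row absmax quantiser. This is a sketch in f32 rather than BF16, with hypothetical function names; it assumes the usual LLM.int8() formula q = round(x · 127 / absmax) and uses Rust's round (ties away from zero), which may differ from the rounding mode of the real kernels.

```rust
/// Quantise one row to i8 with a per-row absmax scale.
/// Returns the quantised row and the scale (the SCB value for that row).
fn encode_int8_row(row: &[f32]) -> (Vec<i8>, f32) {
    let absmax = row.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if absmax == 0.0 {
        // An all-zero row quantises to zeros with a zero scale.
        return (vec![0; row.len()], 0.0);
    }
    let q: Vec<i8> = row
        .iter()
        // Clamp to ±127 so that -128 is never produced.
        .map(|v| (v * 127.0 / absmax).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, absmax)
}

/// Inverse direction: rescale the i8 values back to f32.
fn decode_int8_row(q: &[i8], absmax: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * absmax / 127.0).collect()
}
```

Because the row maximum always maps to ±127, re-encoding a decoded row recovers the same scale, which is what makes a round-trip check meaningful.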
| Scheme | Feature gate | Cross-validation contract |
|---|---|---|
| BitsAndBytes NF4/FP4 encode | bnb | 0-ULP bit-exact round trip on every fixture |
| BitsAndBytes INT8 encode | bnb | 0-ULP bit-exact round trip on every fixture |
Subsequent encode-kernel families (FP8, GGUF legacy / K-quants
/ IQ / TQ / MXFP4) land in Phase 7.5 and reuse the
lethe::round_trip harness introduced here.
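The round-trip property for codebook-LUT schemes can be sketched in a few lines. The code below is illustrative only and does not use the crate's API: the function names are hypothetical, the codebook is passed in as a plain [f32; 16] (no assumption is made about the values or element type of NF4_CODEBOOK), arithmetic is done in f32 rather than BF16, and the nibble order (first element in the high nibble) is an assumption.

```rust
/// Index of the codebook entry closest to `x` (ties go to the lower index).
fn nearest(codebook: &[f32; 16], x: f32) -> u8 {
    let mut best = 0usize;
    for (i, &c) in codebook.iter().enumerate() {
        if (x - c).abs() < (x - codebook[best]).abs() {
            best = i;
        }
    }
    best as u8
}

/// Encode one block: normalise by absmax, pick the nearest codebook entry,
/// and pack two 4-bit indices per byte.
fn encode_block(values: &[f32], codebook: &[f32; 16]) -> (Vec<u8>, f32) {
    assert!(values.len() % 2 == 0, "sketch expects an even element count");
    let absmax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let inv = if absmax == 0.0 { 0.0 } else { 1.0 / absmax };
    let packed: Vec<u8> = values
        .chunks_exact(2)
        .map(|pair| {
            let hi = nearest(codebook, pair[0] * inv);
            let lo = nearest(codebook, pair[1] * inv);
            (hi << 4) | lo
        })
        .collect();
    (packed, absmax)
}

/// Decode one block: unpack the nibbles, look them up, rescale by absmax.
fn decode_block(packed: &[u8], absmax: f32, codebook: &[f32; 16]) -> Vec<f32> {
    packed
        .iter()
        .flat_map(|&b| [b >> 4, b & 0x0F])
        .map(|idx| codebook[idx as usize] * absmax)
        .collect()
}
```

A round-trip check then decodes a packed block, re-encodes the result, and compares the bytes. The comparison only holds when the block contains an index that maps to ±1, since otherwise the recomputed absmax differs from the stored one.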
§NPZ/NPY Parsing
Feature-gated behind npz. Custom NPY header parser with bulk
read_exact — zero per-element deserialization for LE data on LE
machines. Supports F16, BF16, F32, F64, all integer types,
and Bool. 3,586 MB/s on a 302 MB file (1.3× raw I/O overhead).
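As a rough illustration of the header-then-bulk-read approach, the sketch below parses the NPY preamble with only the standard library and then pulls the payload in one read_exact call. The function names are hypothetical and this is not the crate's parser; it follows the published NPY layout (6-byte magic, two version bytes, a u16 or u32 little-endian header length, then an ASCII dictionary).

```rust
use std::io::{self, Read};

/// Read the NPY preamble and return the header dictionary as text.
/// After this call the reader is positioned at the first data byte.
fn read_npy_header<R: Read>(reader: &mut R) -> io::Result<String> {
    // 6 magic bytes + major version + minor version.
    let mut preamble = [0u8; 8];
    reader.read_exact(&mut preamble)?;
    if &preamble[..6] != b"\x93NUMPY" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not an NPY file"));
    }
    // Version 1.x stores the header length as u16 LE; 2.x and 3.x use u32 LE.
    let header_len = match preamble[6] {
        1 => {
            let mut len = [0u8; 2];
            reader.read_exact(&mut len)?;
            u16::from_le_bytes(len) as usize
        }
        2 | 3 => {
            let mut len = [0u8; 4];
            reader.read_exact(&mut len)?;
            u32::from_le_bytes(len) as usize
        }
        _ => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "unsupported NPY version",
            ))
        }
    };
    let mut header = vec![0u8; header_len];
    reader.read_exact(&mut header)?;
    String::from_utf8(header).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Bulk-read the payload in one call. On a little-endian host, LE data
/// needs no per-element conversion after this point.
fn read_payload<R: Read>(reader: &mut R, n_bytes: usize) -> io::Result<Vec<u8>> {
    let mut data = vec![0u8; n_bytes];
    reader.read_exact(&mut data)?;
    Ok(data)
}
```

The payload size would normally be derived from the shape and descr fields of the dictionary; that parsing step is left out of the sketch.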
§PyTorch .pth Parsing
Feature-gated behind pth. Minimal pickle VM (~36 opcodes) with
security allowlist. Memory-mapped I/O with zero-copy Cow::Borrowed
tensor data. Lossless .pth → .safetensors conversion.
11–31× faster than torch.load() on torchvision models.
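The zero-copy idea behind Cow::Borrowed can be shown without any mmap machinery. In the sketch below, a byte slice stands in for the mapped file, and the function only allocates when the bytes need to be rearranged. The function name, the big_endian_source flag, and the byte-swap fallback are all hypothetical; the crate's actual conditions for borrowing versus copying are not documented here.

```rust
use std::borrow::Cow;

/// Return tensor bytes in little-endian element order.
/// Borrows from `raw` when no conversion is needed, copies otherwise.
fn tensor_bytes<'a>(raw: &'a [u8], elem_size: usize, big_endian_source: bool) -> Cow<'a, [u8]> {
    if !big_endian_source || elem_size <= 1 {
        // Already in the desired layout: hand back a view, no copy.
        return Cow::Borrowed(raw);
    }
    // Otherwise reverse each element's bytes in an owned buffer.
    let mut owned = raw.to_vec();
    for elem in owned.chunks_exact_mut(elem_size) {
        elem.reverse();
    }
    Cow::Owned(owned)
}
```

Callers treat both cases uniformly through Deref to [u8], so the common little-endian path never pays for a copy.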
§Quick Start
Path-based dequantisation (FP8 → BF16):

    use anamnesis::{parse, TargetDtype};

    let model = parse("model-fp8.safetensors")?;
    let info = model.inspect();
    println!("{info}");
    model.remember("model-bf16.safetensors", TargetDtype::BF16)?;

Reader-generic inspection over any Read + Seek substrate (in-memory
Cursor, HTTP-range-backed adapter, custom transport). The example
below uses a std::fs::File; an HTTP-range adapter from a
downstream crate (e.g. hf-fm’s HttpRangeReader) plugs in
identically — anamnesis itself stays HTTP-free. Four reader-generic
entry points cover the supported tensor formats:

    use anamnesis::{
        inspect_gguf_from_reader, inspect_npz_from_reader,
        inspect_pth_from_reader, parse_safetensors_header_from_reader,
    };

    let st_header = parse_safetensors_header_from_reader(
        std::fs::File::open("shard.safetensors")?,
    )?;
    let npz_info = inspect_npz_from_reader(std::fs::File::open("weights.npz")?)?;
    let gguf_info = inspect_gguf_from_reader(std::fs::File::open("model.gguf")?)?;
    let pth_info = inspect_pth_from_reader(std::fs::File::open("model.pth")?)?;

§Architecture
- parse(): memory-map a .safetensors file into a ParsedModel. Inspect-only workflows touch only the header (~1 MiB) regardless of file size; full dequantisation pages tensor bytes in lazily.
- ParsedModel::inspect: derive format, tensor counts, and size estimates from the parsed header (zero further I/O).
- ParsedModel::remember: dequantize all quantized tensors to BF16 and write a standard .safetensors file.
- parse_safetensors_header / parse_safetensors_header_from_reader: header-only safetensors parsing. The reader-generic variant accepts any Read substrate (in-memory Cursor, HTTP-range-backed adapter, …) and reads only the 8-byte length prefix plus the JSON header, so a multi-GB shard's metadata can be inspected with a single ~1 MiB sequential fetch.
- parse_npz(): read an .npz archive into a HashMap<String, NpzTensor> (requires npz feature).
- inspect_npz() / inspect_npz_from_reader(): header-only NPZ inspection. The reader-generic variant accepts any Read + Seek substrate (in-memory Cursor, HTTP-range-backed adapter, …) so callers can extract tensor metadata without materialising the data segment (requires npz feature).
- parse_gguf() / inspect_gguf_from_reader(): GGUF parsing / inspection. The path-based variant memory-maps the file and returns a ParsedGguf with zero-copy tensor views; the reader-generic variant accepts any Read + Seek substrate and returns just the GgufInspectInfo summary, so a multi-GB quantised GGUF's metadata can be inspected in a few range fetches over the front-loaded header without downloading the data section (requires gguf feature).
- parse_pth() / inspect_pth_from_reader(): PyTorch .pth parsing / inspection. The path-based variant memory-maps the file and returns a ParsedPth with zero-copy tensors(); the reader-generic variant accepts any Read + Seek substrate and returns just the PthInspectInfo summary, so a torchvision-class .pth is inspectable in a single <100 KiB range fetch over the ZIP central directory and data.pkl entry, and no tensor-data files inside the archive are read (requires pth feature).
- pth_to_safetensors(): lossless .pth → .safetensors conversion (requires pth feature).
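The header-only safetensors path described in the list above relies on the file layout: an 8-byte little-endian length prefix followed by that many bytes of JSON. A minimal standard-library sketch of that read is shown below; the function name and the size guard are hypothetical, and the crate's parse_safetensors_header_from_reader returns a parsed SafetensorsHeader rather than raw JSON text.

```rust
use std::io::{self, Read};

/// Read only the safetensors header as JSON text.
/// Tensor data after the header is never touched.
fn read_safetensors_header_json<R: Read>(mut reader: R) -> io::Result<String> {
    // 8-byte little-endian length prefix.
    let mut prefix = [0u8; 8];
    reader.read_exact(&mut prefix)?;
    let len = u64::from_le_bytes(prefix);

    // Hypothetical guard for this sketch: refuse absurd header sizes
    // before allocating a buffer for them.
    const MAX_HEADER: u64 = 100 * 1024 * 1024;
    if len > MAX_HEADER {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "header too large"));
    }

    // The JSON header itself, read in one sequential fetch.
    let mut json = vec![0u8; len as usize];
    reader.read_exact(&mut json)?;
    String::from_utf8(json).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Because the function needs only Read, the same code works over a File, an in-memory Cursor, or a network-backed reader that fetches the first bytes of a remote shard.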
The remember module contains one submodule per quantization family
(remember::fp8 always-on; remember::gptq, remember::awq,
remember::bnb feature-gated independently under gptq / awq /
bnb).
The lethe module mirrors that layout on the encode side. Phase 5
ships lethe::bnb (feature-gated behind bnb) plus the
always-on lethe::round_trip validation harness. Encoding a fresh
BF16 source into BnB-NF4:

    use anamnesis::{encode_bnb4_compute_absmax, NF4_CODEBOOK};

    // 64 BF16 elements arranged as one 64-element block.
    let bf16_bytes: Vec<u8> = vec![0u8; 64 * 2];
    let codebook_bytes: Vec<u8> =
        NF4_CODEBOOK.iter().flat_map(|v| v.to_le_bytes()).collect();
    let (weight, absmax) =
        encode_bnb4_compute_absmax(&bf16_bytes, &codebook_bytes, 64, 64)?;
    assert_eq!(weight.len(), 32); // 64 elements / 2 nibbles per byte
    assert_eq!(absmax.len(), 4);  // 1 block × F32 LE

§Re-exports
pub use error::AnamnesisError;
pub use error::Result;
pub use inspect::format_bytes;
pub use inspect::InspectInfo;
pub use lethe::encode_bnb4;
pub use lethe::encode_bnb4_compute_absmax;
pub use lethe::encode_bnb4_double_quant;
pub use lethe::encode_bnb_int8;
pub use lethe::encode_bnb_int8_compute_scb;
pub use lethe::FP4_CODEBOOK;
pub use lethe::NF4_CODEBOOK;
pub use model::parse;
pub use model::ParsedModel;
pub use model::TargetDtype;
pub use parse::inspect_gguf_from_reader;
pub use parse::parse_gguf;
pub use parse::GgufInspectInfo;
pub use parse::GgufMetadataArray;
pub use parse::GgufMetadataValue;
pub use parse::GgufTensor;
pub use parse::GgufTensorInfo;
pub use parse::GgufType;
pub use parse::ParsedGguf;
pub use parse::inspect_npz;
pub use parse::inspect_npz_from_reader;
pub use parse::parse_npz;
pub use parse::NpzDtype;
pub use parse::NpzInspectInfo;
pub use parse::NpzTensor;
pub use parse::NpzTensorInfo;
pub use parse::inspect_pth_from_reader;
pub use parse::parse_pth;
pub use parse::ParsedPth;
pub use parse::PthDtype;
pub use parse::PthInspectInfo;
pub use parse::PthTensor;
pub use parse::PthTensorInfo;
pub use parse::parse_safetensors_header;
pub use parse::parse_safetensors_header_from_reader;
pub use parse::AwqCompanions;
pub use parse::AwqConfig;
pub use parse::Bnb4Companions;
pub use parse::BnbConfig;
pub use parse::Dtype;
pub use parse::GptqCompanions;
pub use parse::GptqConfig;
pub use parse::QuantScheme;
pub use parse::SafetensorsHeader;
pub use parse::TensorEntry;
pub use parse::TensorRole;
pub use remember::dequantize_awq_to_bf16;
pub use remember::dequantize_gptq_to_bf16;
pub use remember::dequantize_bnb4_to_bf16;
pub use remember::dequantize_bnb_int8_to_bf16;
pub use remember::dequantize_fp8_to_bf16;
pub use remember::dequantize_per_channel_fp8_to_bf16;
pub use remember::dequantize_per_tensor_fp8_to_bf16;
pub use remember::dequantize_gguf_blocks_to_bf16;
pub use remember::dequantize_gguf_to_bf16;
pub use remember::pth_to_safetensors;
pub use remember::pth_to_safetensors_bytes;