§fast-md5
A small MD5 implementation with hand-written assembly cores for
x86_64 and aarch64, plus a portable Rust fallback for every other
target. The assembly was ported from animetosho/md5-optimisation
(released into the public domain by the author; see discussion #4).
On Apple Silicon and modern x86_64 the per-block compression is within ~1 % of AWS-LC’s hand-tuned C. End-to-end in HMAC-MD5 workloads (e.g. RADIUS Message-Authenticator), the all-Rust call path beats AWS-LC by ~18 % thanks to cross-call inlining and a precomputed-state HMAC structure (see below).
§Security warning
MD5 is cryptographically broken — it is trivially vulnerable to
collision attacks and must not be used for digital signatures,
certificate fingerprints, or any other security-sensitive integrity
check. HMAC-MD5 is also discouraged for new designs; this crate
exposes HmacMd5 solely to support legacy protocols (RADIUS,
CHAP, certain SASL/SIP digests) and non-cryptographic uses such as
deduplication and checksumming.
§Quick start
```rust
let digest = fast_md5::digest(b"The quick brown fox jumps over the lazy dog");
assert_eq!(
    digest,
    [
        0x9e, 0x10, 0x7d, 0x9d, 0x37, 0x2b, 0xb6, 0x82,
        0x6b, 0xd8, 0x1d, 0x35, 0x42, 0xa4, 0x19, 0xd6,
    ],
);
```

Streaming MD5:
```rust
let mut h = fast_md5::Md5::new();
h.update(b"The quick brown fox ");
h.update(b"jumps over the lazy dog");
assert_eq!(
    h.finalize(),
    [
        0x9e, 0x10, 0x7d, 0x9d, 0x37, 0x2b, 0xb6, 0x82,
        0x6b, 0xd8, 0x1d, 0x35, 0x42, 0xa4, 0x19, 0xd6,
    ],
);
```

Streaming HMAC-MD5 (RFC 2104):
```rust
let mut h = fast_md5::HmacMd5::new(b"Jefe");
h.update(b"what do ya want ");
h.update(b"for nothing?");
assert_eq!(
    h.finalize(),
    [
        0x75, 0x0c, 0x78, 0x3e, 0x6a, 0xb0, 0xb5, 0x03,
        0xea, 0xa8, 0x6e, 0x31, 0x0a, 0x5d, 0xb7, 0x38,
    ],
);
```

§no_std
The crate is #![no_std] and performs no heap allocation. All
buffers (the 64-byte block buffer, the 4-word state, the HMAC
ipad/opad scratch) live inline in the user’s value or on the
caller’s stack.
§Cargo features
- `force-fallback`: disable the architecture-specific assembly backends and route `transform` through the portable Rust `fallback` implementation on every target. Intended for CI coverage of the fallback on assembly hosts and for downstream debugging; not recommended for production use on `x86_64`/`aarch64` (the fallback is correct but materially slower).
§Design notes
The crate is small enough to read end-to-end, but a few choices are worth flagging because they’re load-bearing for performance and were arrived at empirically.
§Architecture dispatch
`transform` is a `cfg`-dispatched shim that routes to one of three backends, all with identical semantics:

- `x86_64`: a single monolithic `asm!` block per 64-byte block, ported from animetosho's "NoLEA" sequence. It uses `add` chains instead of `lea` to keep the critical path on the integer ALUs, which is faster on every x86_64 microarchitecture from Haswell onward.
- `aarch64`: per-round Rust expressions with a one-line `asm!` block for the rotate. LLVM produces nearly the same code as a monolithic asm block here, because AArch64's three-operand ALU (no destination clobber) and free shifts give the register allocator more freedom; the inline `ror` exists only to pin a concrete 32-bit register at each round boundary, which guides scheduling.
- `fallback`: portable safe Rust used on all other targets, and compiled (but not linked) under `cfg(test)` everywhere, so its `transform` can be cross-checked against the assembly backends on every supported host.
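The portable backend is essentially the textbook MD5 compression function. A minimal sketch of what such a fallback `transform` looks like (names are illustrative, not fast-md5's internals; the round constants are derived from `sin()` at runtime only to keep the sketch short, where a real `no_std` implementation would use a precomputed table):

```rust
/// Per-round left-rotate amounts from RFC 1321.
const S: [u32; 64] = [
    7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
    5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20,
    4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
    6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21,
];

fn transform(state: &mut [u32; 4], block: &[u8; 64]) {
    // Decode the block into sixteen little-endian message words.
    let mut m = [0u32; 16];
    for (i, w) in m.iter_mut().enumerate() {
        *w = u32::from_le_bytes(block[4 * i..4 * i + 4].try_into().unwrap());
    }
    let (mut a, mut b, mut c, mut d) = (state[0], state[1], state[2], state[3]);
    for i in 0..64 {
        // Round function and message-word index for each 16-step round.
        let (f, g) = match i / 16 {
            0 => ((b & c) | (!b & d), i),
            1 => ((d & b) | (!d & c), (5 * i + 1) % 16),
            2 => (b ^ c ^ d, (3 * i + 5) % 16),
            _ => (c ^ (b | !d), (7 * i) % 16),
        };
        // K[i] = floor(abs(sin(i + 1)) * 2^32), per RFC 1321.
        let k = ((i as f64 + 1.0).sin().abs() * 4294967296.0) as u32;
        let rotated = a
            .wrapping_add(f)
            .wrapping_add(k)
            .wrapping_add(m[g])
            .rotate_left(S[i]);
        (a, d, c, b) = (d, c, b, b.wrapping_add(rotated));
    }
    // Davies-Meyer feed-forward into the chaining state.
    state[0] = state[0].wrapping_add(a);
    state[1] = state[1].wrapping_add(b);
    state[2] = state[2].wrapping_add(c);
    state[3] = state[3].wrapping_add(d);
}
```

With RFC 1321 padding applied by the caller, this reproduces standard MD5 digests, which is exactly the property that lets a `cfg(test)` build cross-check the fallback against the assembly backends.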
§Inlining policy
- The architecture-specific `transform` is `#[inline(always)]`: it must fuse with `Md5::update`'s per-block loop, or LLVM leaves the state in memory between blocks.
- `Md5::update` and `Md5::finalize` are plain `#[inline]`: their bodies are large once `transform` inlines (~700 instructions on aarch64), and forcing inlining at every call site measurably regresses I-cache-bound workloads (HMAC chains, etc.). Plain `#[inline]` exposes MIR for cross-crate inlining and lets LLVM's size heuristic decide.
- `Md5::new` and `digest` are `#[inline(always)]`: they are trivial wrappers, and forcing inlining lets the IV propagate as register immediates and lets known-length one-shots collapse `finalize`'s padding into constants.
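A toy illustration of this attribute split (function names and bodies are invented placeholders, not fast-md5's code):

```rust
// Hypothetical stand-ins showing the three-tier inlining policy.
#[inline(always)] // must fuse with the caller's per-block loop
fn transform_stub(state: &mut u32, block: u32) {
    *state = state.wrapping_add(block).rotate_left(7);
}

#[inline] // exposes MIR cross-crate; LLVM's size heuristic decides
fn update_stub(state: &mut u32, blocks: &[u32]) {
    for &b in blocks {
        transform_stub(state, b);
    }
}

#[inline(always)] // trivial wrapper: the IV can propagate as an immediate
fn digest_stub(blocks: &[u32]) -> u32 {
    let mut state = 0x6745_2301u32;
    update_stub(&mut state, blocks);
    state
}
```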
§Block buffer uses MaybeUninit
Md5’s 64-byte partial-block buffer is [MaybeUninit<u8>; 64]
rather than [u8; 64]. Construction is therefore a no-op (no
64-byte zero fill on every Md5::new()), and bytes are only read
after they have been written. This matters when the caller does
many short hashes per second — RADIUS again — because the cost of
initialising an Md5 becomes negligible relative to the
compression itself.
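The idea can be sketched as follows (struct and method names are invented for illustration, not the crate's actual layout):

```rust
use core::mem::MaybeUninit;

// Illustrative sketch of an uninitialised partial-block buffer.
// Constructing the [MaybeUninit<u8>; 64] array emits no 64-byte
// zero fill, and bytes are only read back after being written.
struct PartialBlock {
    buf: [MaybeUninit<u8>; 64],
    len: usize,
}

impl PartialBlock {
    fn new() -> Self {
        // No initialisation of `buf` happens here.
        PartialBlock { buf: [MaybeUninit::uninit(); 64], len: 0 }
    }

    fn push(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.buf[self.len] = MaybeUninit::new(byte);
            self.len += 1;
        }
    }

    /// Read back only the written prefix; `assume_init` is sound
    /// because `push` wrote every byte below `self.len`.
    fn written(&self) -> Vec<u8> {
        self.buf[..self.len]
            .iter()
            .map(|b| unsafe { b.assume_init() })
            .collect()
    }
}
```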
§HmacMd5 precomputes ipad/opad states
HmacMd5::new runs the ipad and opad block compressions
immediately and stores only the resulting [u32; 4] states
(32 bytes total). Steady-state, update is
pure delegation to the inner Md5, and
finalize costs exactly two extra
compressions on top of the message work (inner tail + outer
tail). This is the same shape as AWS-LC’s HMAC_CTX and is what
gives the ~18 % end-to-end win in HMAC-bound workloads.
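The key schedule behind that precomputation can be sketched as follows (an illustrative free function, not the crate's API; keys longer than one block would first be hashed down to 16 bytes, which is omitted here):

```rust
const BLOCK_SIZE: usize = 64;

// Illustrative sketch of the RFC 2104 key schedule: pad the key to
// one block and XOR with the ipad/opad constants. The structure
// described above compresses each padded block once at construction
// and stores only the two resulting [u32; 4] chaining states.
fn ipad_opad_blocks(key: &[u8]) -> ([u8; BLOCK_SIZE], [u8; BLOCK_SIZE]) {
    assert!(key.len() <= BLOCK_SIZE, "longer keys are pre-hashed to 16 bytes");
    // XOR-with-zero-padding means the tails are just the constants.
    let mut ipad = [0x36u8; BLOCK_SIZE];
    let mut opad = [0x5cu8; BLOCK_SIZE];
    for (i, &k) in key.iter().enumerate() {
        ipad[i] ^= k;
        opad[i] ^= k;
    }
    (ipad, opad)
}
```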
§No volatile key zeroization
Neither Md5 nor HmacMd5 performs write_volatile-style
scrubbing of stack scratch on drop. The transient ipad/opad XOR
blocks inside HmacMd5::new are recoverable only during the
lifetime of that call, and the persistent struct holds digests of
the key (one-way) rather than the key itself. Callers with
stricter threat models (FIPS, processes that emit core dumps,
protocols where memory-disclosure bugs are realistic) should
wrap their key in zeroize::Zeroizing
at the call site; this crate cannot meaningfully protect a key
that the caller is already holding in long-lived memory.
Skipping the volatile writes also keeps the hot path free of optimization barriers, which is part of why the all-Rust HMAC path beats the FFI’d C implementations.
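For callers who do want scrubbing, the write_volatile pattern this section refers to looks roughly like the following (a sketch of the general technique, not something this crate provides):

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

// Scrub a caller-held secret with volatile stores; an alternative is
// the zeroize crate, which wraps the same idea in a safe API.
fn scrub(secret: &mut [u8]) {
    for byte in secret.iter_mut() {
        // Volatile stores cannot be elided as dead writes by LLVM.
        unsafe { ptr::write_volatile(byte, 0) };
    }
    // Keep the stores from being reordered past later accesses.
    compiler_fence(Ordering::SeqCst);
}
```

This is exactly the optimization barrier the crate keeps off its hot path.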
§Constants
- `BLOCK_SIZE`: MD5 compresses 64-byte blocks.
- `DIGEST_LENGTH`: MD5 produces a 16-byte digest.