fast_md5/lib.rs
//! # fast-md5
//!
//! A small MD5 implementation with hand-written assembly cores for
//! `x86_64` and `aarch64`, plus a portable Rust fallback for every other
//! target. The assembly was ported from
//! [animetosho/md5-optimisation](https://github.com/animetosho/md5-optimisation)
//! (released into the public domain by the author; see
//! [discussion #4](https://github.com/animetosho/md5-optimisation/discussions/4)).
//!
//! On Apple Silicon and modern x86_64, the per-block compression is
//! within ~1 % of AWS-LC's hand-tuned C. End-to-end in HMAC-MD5
//! workloads (e.g. RADIUS Message-Authenticator), the all-Rust call
//! path beats AWS-LC by ~18 % thanks to cross-call inlining and a
//! precomputed-state HMAC structure (see below).
//!
//! ## Security warning
//!
//! MD5 is **cryptographically broken** — it is trivially vulnerable to
//! collision attacks and must not be used for digital signatures,
//! certificate fingerprints, or any other security-sensitive integrity
//! check. HMAC-MD5 is also discouraged for new designs; this crate
//! exposes [`HmacMd5`] solely to support legacy protocols (RADIUS,
//! CHAP, certain SASL/SIP digests) and non-cryptographic uses such as
//! deduplication and checksumming.
//!
//! ## Quick start
//!
//! ```
//! let digest = fast_md5::digest(b"The quick brown fox jumps over the lazy dog");
//! assert_eq!(
//!     digest,
//!     [
//!         0x9e, 0x10, 0x7d, 0x9d, 0x37, 0x2b, 0xb6, 0x82,
//!         0x6b, 0xd8, 0x1d, 0x35, 0x42, 0xa4, 0x19, 0xd6,
//!     ],
//! );
//! ```
//!
//! Streaming MD5:
//!
//! ```
//! let mut h = fast_md5::Md5::new();
//! h.update(b"The quick brown fox ");
//! h.update(b"jumps over the lazy dog");
//! assert_eq!(
//!     h.finalize(),
//!     [
//!         0x9e, 0x10, 0x7d, 0x9d, 0x37, 0x2b, 0xb6, 0x82,
//!         0x6b, 0xd8, 0x1d, 0x35, 0x42, 0xa4, 0x19, 0xd6,
//!     ],
//! );
//! ```
//!
//! Streaming HMAC-MD5 (RFC 2104):
//!
//! ```
//! let mut h = fast_md5::HmacMd5::new(b"Jefe");
//! h.update(b"what do ya want ");
//! h.update(b"for nothing?");
//! assert_eq!(
//!     h.finalize(),
//!     [
//!         0x75, 0x0c, 0x78, 0x3e, 0x6a, 0xb0, 0xb5, 0x03,
//!         0xea, 0xa8, 0x6e, 0x31, 0x0a, 0x5d, 0xb7, 0x38,
//!     ],
//! );
//! ```
//!
//! ## `no_std`
//!
//! The crate is `#![no_std]` and performs no heap allocation. All
//! buffers (the 64-byte block buffer, the 4-word state, the HMAC
//! ipad/opad scratch) live inline in the user's value or on the
//! caller's stack.
//!
//! ## Cargo features
//!
//! - **`force-fallback`** — disable the architecture-specific assembly
//!   backends and route [`transform`] through the portable Rust
//!   `fallback` implementation on every target. Intended for CI
//!   coverage of the fallback on assembly hosts and for downstream
//!   debugging; not recommended for production use on `x86_64` /
//!   `aarch64` (the fallback is correct but materially slower).
//!
//! ## Design notes
//!
//! The crate is small enough to read end-to-end, but a few choices are
//! worth flagging because they're load-bearing for performance and
//! were arrived at empirically.
//!
//! ### Architecture dispatch
//!
//! [`transform`] is a `cfg`-dispatched shim that routes to one of
//! three backends, all with identical semantics:
//!
//! - **`x86_64`**: a single monolithic `asm!` block per 64-byte block,
//!   ported from animetosho's "NoLEA" sequence. It uses `add` chains
//!   instead of `lea` to keep the critical path on the integer ALUs,
//!   which is faster on every x86_64 microarchitecture from Haswell
//!   onward.
//! - **`aarch64`**: per-round Rust expressions with a one-line `asm!`
//!   block for the rotate (see the sketch after this list). LLVM
//!   produces nearly the same code as a monolithic asm block here
//!   because AArch64's three-operand ALU (no destination clobber) and
//!   free shifts give the register allocator more freedom; the inline
//!   `ror` exists only to pin a concrete 32-bit register at each round
//!   boundary, which guides scheduling.
//! - **fallback**: portable safe Rust used on all other targets, *and*
//!   compiled under `cfg(test)` on every target (even when an assembly
//!   backend is active), so its `transform` can be cross-checked
//!   against the assembly backends on every supported host.
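//!
//! A minimal sketch of the shape of that rotate helper (illustrative
//! names; the crate's actual per-round code may bake the rotate amount
//! in as an immediate instead of a register operand):
//!
//! ```ignore
//! #[inline(always)]
//! fn rotate_left(x: u32, r: u32) -> u32 {
//!     let out: u32;
//!     // AArch64 only has rotate-right, so a left rotate by `r` is a
//!     // right rotate by `32 - r`; the `asm!` forces the result into a
//!     // concrete 32-bit register at the round boundary.
//!     unsafe {
//!         core::arch::asm!(
//!             "ror {out:w}, {x:w}, {r:w}",
//!             out = lateout(reg) out,
//!             x = in(reg) x,
//!             r = in(reg) 32 - r,
//!             options(pure, nomem, nostack),
//!         );
//!     }
//!     out
//! }
//! ```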
//!
//! ### Inlining policy
//!
//! - The architecture-specific `transform` is `#[inline(always)]` —
//!   it must fuse with [`Md5::update`]'s per-block loop or LLVM
//!   leaves the state in memory between blocks.
//! - [`Md5::update`] and [`Md5::finalize`] are plain `#[inline]` —
//!   their bodies are large once `transform` inlines (~700
//!   instructions on aarch64), and forcing inline at every call site
//!   measurably regresses I-cache-bound workloads (HMAC chains, etc.).
//!   Plain `#[inline]` exposes MIR for cross-crate inlining and lets
//!   LLVM's size heuristic decide.
//! - [`Md5::new`] and [`digest`] are `#[inline(always)]` — trivial
//!   wrappers; forcing inline lets the IV propagate as register
//!   immediates and lets known-length one-shots collapse `finalize`'s
//!   padding into constants.
//!
//! ### Block buffer uses `MaybeUninit`
//!
//! [`Md5`]'s 64-byte partial-block buffer is `[MaybeUninit<u8>; 64]`
//! rather than `[u8; 64]`. Construction is therefore a no-op (no
//! 64-byte zero fill on every `Md5::new()`), and bytes are only read
//! after they have been written. This matters when the caller does
//! many short hashes per second — RADIUS again — because the cost of
//! initialising an `Md5` becomes negligible relative to the
//! compression itself.
//!
//! ### `HmacMd5` precomputes ipad/opad states
//!
//! [`HmacMd5::new`] runs the ipad and opad block compressions
//! immediately and stores **only** the resulting `[u32; 4]` states
//! (32 bytes total). Steady-state, [`update`](HmacMd5::update) is
//! pure delegation to the inner [`Md5`], and
//! [`finalize`](HmacMd5::finalize) costs exactly **two** extra
//! compressions on top of the message work (inner tail + outer
//! tail). This is the same shape as AWS-LC's `HMAC_CTX` and is what
//! gives the ~18 % end-to-end win in HMAC-bound workloads.
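//!
//! A sketch of the construction, with illustrative field and helper
//! names (the real code lives in the private `hmac` module and may
//! differ in detail):
//!
//! ```ignore
//! struct HmacMd5 {
//!     inner: Md5,           // resumed from the post-ipad state
//!     opad_state: [u32; 4], // post-opad state, replayed at finalize
//! }
//!
//! impl HmacMd5 {
//!     fn new(key: &[u8]) -> Self {
//!         // Keys longer than one block are hashed down to 16 bytes
//!         // first (omitted here).
//!         let mut block = [0u8; BLOCK_SIZE];
//!         block[..key.len()].copy_from_slice(key);
//!         let (mut ipad, mut opad) = (IV, IV);
//!         transform(&mut ipad, &block.map(|b| b ^ 0x36)); // once, at setup
//!         transform(&mut opad, &block.map(|b| b ^ 0x5c)); // once, at setup
//!         Self {
//!             inner: Md5::from_parts(ipad, BLOCK_SIZE as u64),
//!             opad_state: opad,
//!         }
//!     }
//!
//!     fn finalize(self) -> [u8; DIGEST_LENGTH] {
//!         let inner_digest = self.inner.finalize(); // inner tail
//!         let mut outer = Md5::from_parts(self.opad_state, BLOCK_SIZE as u64);
//!         outer.update(&inner_digest);
//!         outer.finalize() // outer tail
//!     }
//! }
//! ```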
//!
//! ### No volatile key zeroization
//!
//! Neither [`Md5`] nor [`HmacMd5`] performs `write_volatile`-style
//! scrubbing of stack scratch on drop. The transient ipad/opad XOR
//! blocks inside [`HmacMd5::new`] are recoverable only during the
//! lifetime of that call, and the persistent struct holds digests of
//! the key (one-way) rather than the key itself. Callers with
//! stricter threat models (FIPS, processes that emit core dumps,
//! protocols where memory-disclosure bugs are realistic) should
//! wrap their key in [`zeroize::Zeroizing`](https://docs.rs/zeroize)
//! at the call site; this crate cannot meaningfully protect a key
//! that the caller is already holding in long-lived memory.
//!
//! Skipping the volatile writes also keeps the hot path free of
//! optimization barriers, which is part of why the all-Rust HMAC
//! path beats the FFI'd C implementations.

#![no_std]
#![deny(missing_docs)]

/// MD5 compresses 64-byte blocks.
pub const BLOCK_SIZE: usize = 64;

/// MD5 produces a 16-byte digest.
pub const DIGEST_LENGTH: usize = 16;

const STATE_WORDS: usize = 4;

/// Standard MD5 initialization vector (RFC 1321 §3.3).
const IV: [u32; STATE_WORDS] = [0x6745_2301, 0xefcd_ab89, 0x98ba_dcfe, 0x1032_5476];

#[cfg(all(target_arch = "aarch64", not(feature = "force-fallback")))]
mod aarch64;
// The fallback module is compiled when:
// - the target has no assembly backend, or
// - the `force-fallback` feature is enabled (CI / debugging), or
// - we are running tests (so its in-module RFC 1321 vectors run on
//   every host, regardless of the active backend).
#[cfg(any(
    test,
    feature = "force-fallback",
    not(any(target_arch = "x86_64", target_arch = "aarch64"))
))]
mod fallback;
#[cfg(all(target_arch = "x86_64", not(feature = "force-fallback")))]
mod x86_64;

mod hmac;
pub use hmac::HmacMd5;

/// Streaming MD5 hasher.
///
/// Buffers partial blocks internally and dispatches to the
/// architecture-specific compression function on every full 64-byte
/// block. Call [`Md5::finalize`] to consume the hasher and obtain the
/// 16-byte digest.
pub struct Md5 {
    state: [u32; STATE_WORDS],
    /// Number of bytes processed so far (used for final length encoding).
    count: u64,
    /// Partial block buffer. Only the first `buf_len` bytes are
    /// initialized at any given moment; the tail remains
    /// uninitialized.
    buf: [core::mem::MaybeUninit<u8>; BLOCK_SIZE],
    /// Number of bytes currently in `buf`.
    buf_len: usize,
}

#[allow(clippy::new_without_default)]
impl Md5 {
    /// Construct a new hasher initialized with the standard MD5 IV.
    #[inline(always)]
    pub fn new() -> Self {
        Self {
            state: IV,
            count: 0,
            // Buffer is left uninitialized; `update`/`finalize` only
            // read bytes after writing them.
            buf: [const { core::mem::MaybeUninit::<u8>::uninit() }; BLOCK_SIZE],
            buf_len: 0,
        }
    }

    /// Construct a hasher resuming from a precomputed `state` that has
    /// already absorbed `count_bytes` bytes (which must be a multiple
    /// of [`BLOCK_SIZE`]). Used by [`HmacMd5`] to install the
    /// post-ipad / post-opad inner states without rerunning the
    /// compression on the key block at every operation.
    #[inline(always)]
    pub(crate) fn from_parts(state: [u32; STATE_WORDS], count_bytes: u64) -> Self {
        debug_assert!(count_bytes % (BLOCK_SIZE as u64) == 0);
        Self {
            state,
            count: count_bytes,
            buf: [const { core::mem::MaybeUninit::<u8>::uninit() }; BLOCK_SIZE],
            buf_len: 0,
        }
    }

    /// Absorb additional input. May be called any number of times.
    ///
    /// Annotated `#[inline]` (not `inline(always)`): the body is large
    /// once the architecture-specific `transform` inlines into it, and
    /// forcing inline at every HMAC call site causes I-cache pressure
    /// that measurably regresses real-world workloads. `#[inline]`
    /// still exposes MIR for cross-crate inlining when LLVM's size
    /// heuristic decides the win is worth it.
    #[inline]
    pub fn update(&mut self, mut data: &[u8]) {
        self.count = self.count.wrapping_add(data.len() as u64);

        // Fill partial buffer first.
        if self.buf_len > 0 {
            let need = BLOCK_SIZE - self.buf_len;
            let take = need.min(data.len());
            // SAFETY: `take <= need` and `buf_len + need == BLOCK_SIZE`,
            // so the destination range is inside `buf`. `MaybeUninit<u8>`
            // has the same layout as `u8`, so we can write into it via a
            // byte pointer.
            unsafe {
                core::ptr::copy_nonoverlapping(
                    data.as_ptr(),
                    self.buf.as_mut_ptr().add(self.buf_len).cast::<u8>(),
                    take,
                );
            }
            self.buf_len += take;
            data = &data[take..];
            if self.buf_len == BLOCK_SIZE {
                // SAFETY: all 64 bytes of `buf` are now initialized.
                let block: &[u8; BLOCK_SIZE] =
                    unsafe { &*(self.buf.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
                transform(&mut self.state, block);
                self.buf_len = 0;
            }
        }

        // Process full blocks directly from `data`.
        while data.len() >= BLOCK_SIZE {
            // SAFETY: `data.len() >= BLOCK_SIZE`, so the first 64 bytes
            // of `data` are in bounds, and `[u8; 64]` has alignment 1.
            let block: &[u8; BLOCK_SIZE] =
                unsafe { &*(data.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
            transform(&mut self.state, block);
            data = &data[BLOCK_SIZE..];
        }

        // Save remainder.
        if !data.is_empty() {
            // SAFETY: `data.len() < BLOCK_SIZE` (loop above drained
            // full blocks), so this write stays inside `buf`.
            unsafe {
                core::ptr::copy_nonoverlapping(
                    data.as_ptr(),
                    self.buf.as_mut_ptr().cast::<u8>(),
                    data.len(),
                );
            }
            self.buf_len = data.len();
        }
    }

    /// Consume the hasher and return the 16-byte digest.
    ///
    /// See [`Md5::update`] for the rationale behind plain `#[inline]`.
    #[inline]
    pub fn finalize(mut self) -> [u8; DIGEST_LENGTH] {
        // Append bit '1' (0x80 byte), then zero-pad to 56 mod 64, then
        // the 64-bit LE bit count.
        let bit_count = self.count.wrapping_mul(8);
        // SAFETY: `buf_len <= BLOCK_SIZE - 1` on entry to `finalize`
        // (a full block would have been flushed by `update`), so this
        // write stays inside `buf`.
        unsafe {
            self.buf
                .as_mut_ptr()
                .add(self.buf_len)
                .cast::<u8>()
                .write(0x80);
        }
        self.buf_len += 1;

        if self.buf_len > 56 {
            // Need an extra block: zero-pad the rest of this one and
            // process it; the length encoding goes into the next block.
            // SAFETY: writes `BLOCK_SIZE - buf_len` zero bytes starting
            // at offset `buf_len`, staying inside `buf`.
            unsafe {
                core::ptr::write_bytes(
                    self.buf.as_mut_ptr().add(self.buf_len).cast::<u8>(),
                    0,
                    BLOCK_SIZE - self.buf_len,
                );
            }
            // SAFETY: all 64 bytes of `buf` are initialized.
            let block: &[u8; BLOCK_SIZE] =
                unsafe { &*(self.buf.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
            transform(&mut self.state, block);
            self.buf_len = 0;
        }

        // Zero-pad up to byte 56, then write the bit count.
        // SAFETY: `buf_len <= 56`, so the zero fill and length write
        // jointly cover bytes `buf_len..64` inside `buf`.
        unsafe {
            core::ptr::write_bytes(
                self.buf.as_mut_ptr().add(self.buf_len).cast::<u8>(),
                0,
                56 - self.buf_len,
            );
            core::ptr::copy_nonoverlapping(
                bit_count.to_le_bytes().as_ptr(),
                self.buf.as_mut_ptr().add(56).cast::<u8>(),
                8,
            );
        }
        // SAFETY: all 64 bytes of `buf` are initialized.
        let block: &[u8; BLOCK_SIZE] =
            unsafe { &*(self.buf.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
        transform(&mut self.state, block);

        let mut digest = [0u8; DIGEST_LENGTH];
        for (i, word) in self.state.iter().enumerate() {
            digest[i * 4..i * 4 + 4].copy_from_slice(&word.to_le_bytes());
        }
        digest
    }
}

/// Compress one 64-byte block into the running MD5 state.
///
/// This is the architecture-dispatch entry point used by [`Md5::update`]
/// and [`Md5::finalize`]. It is exposed for callers that handle the
/// message-length padding themselves (for example, custom streaming
/// wrappers). Most users should prefer [`digest`] or the [`Md5`] type.
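///
/// # Example
///
/// Compressing the canonical padding block for the empty message
/// reproduces `digest(b"")`:
///
/// ```
/// // MD5 IV (RFC 1321 §3.3).
/// let mut state = [0x6745_2301u32, 0xefcd_ab89, 0x98ba_dcfe, 0x1032_5476];
/// // Pre-padded block for the empty message: the 0x80 terminator, then
/// // zeros (including the eight zero bytes of the 64-bit bit count).
/// let mut block = [0u8; 64];
/// block[0] = 0x80;
/// fast_md5::transform(&mut state, &block);
///
/// // Serialize the state words little-endian, exactly as `finalize` does.
/// let mut digest = [0u8; 16];
/// for (i, w) in state.iter().enumerate() {
///     digest[i * 4..i * 4 + 4].copy_from_slice(&w.to_le_bytes());
/// }
/// assert_eq!(digest, fast_md5::digest(b""));
/// ```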
#[inline]
pub fn transform(state: &mut [u32; STATE_WORDS], block: &[u8; BLOCK_SIZE]) {
    #[cfg(all(target_arch = "x86_64", not(feature = "force-fallback")))]
    {
        x86_64::transform(state, block);
    }
    #[cfg(all(target_arch = "aarch64", not(feature = "force-fallback")))]
    {
        aarch64::transform(state, block);
    }
    #[cfg(any(
        feature = "force-fallback",
        not(any(target_arch = "x86_64", target_arch = "aarch64"))
    ))]
    {
        fallback::transform(state, block);
    }
}

/// One-shot convenience: hash `data` and return the digest.
#[inline(always)]
pub fn digest(data: &[u8]) -> [u8; DIGEST_LENGTH] {
    let mut m = Md5::new();
    m.update(data);
    m.finalize()
}

#[cfg(test)]
mod tests {
    use super::*;

    // (input, expected hex digest) — the RFC 1321 §A.5 test suite; the
    // longer entries exercise the multi-block path.
    const VECTORS: &[(&[u8], &str)] = &[
        (b"", "d41d8cd98f00b204e9800998ecf8427e"),
        (b"a", "0cc175b9c0f1b6a831c399e269772661"),
        (b"abc", "900150983cd24fb0d6963f7d28e17f72"),
        (b"message digest", "f96b697d7cb7938d525a2f31aaf161d0"),
        (
            b"abcdefghijklmnopqrstuvwxyz",
            "c3fcd3d76192e4007dfb496cca67e13b",
        ),
        (
            b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
            "d174ab98d277d9f5a5611c2c9f419d9f",
        ),
        (
            b"12345678901234567890123456789012345678901234567890123456789012345678901234567890",
            "57edf4a22be3c955ac49da2e2107b67a",
        ),
    ];

    fn hex(bytes: &[u8]) -> [u8; 32] {
        const HEX: &[u8; 16] = b"0123456789abcdef";
        let mut out = [0u8; 32];
        for (i, b) in bytes.iter().enumerate() {
            out[i * 2] = HEX[(b >> 4) as usize];
            out[i * 2 + 1] = HEX[(b & 0x0f) as usize];
        }
        out
    }

    #[test]
    fn rfc1321_vectors_oneshot() {
        for (input, want) in VECTORS {
            let got = digest(input);
            assert_eq!(
                core::str::from_utf8(&hex(&got)).unwrap(),
                *want,
                "input len {}",
                input.len()
            );
        }
    }

    #[test]
    fn rfc1321_vectors_streaming_byte_by_byte() {
        for (input, want) in VECTORS {
            let mut h = Md5::new();
            for chunk in input.chunks(1) {
                h.update(chunk);
            }
            assert_eq!(
                core::str::from_utf8(&hex(&h.finalize())).unwrap(),
                *want,
                "streaming input len {}",
                input.len()
            );
        }
    }

    #[test]
    fn rfc1321_vectors_streaming_odd_chunks() {
        // Use a chunk size that doesn't divide BLOCK_SIZE evenly to
        // exercise the buffer / boundary code paths.
        for (input, want) in VECTORS {
            let mut h = Md5::new();
            for chunk in input.chunks(13) {
                h.update(chunk);
            }
            assert_eq!(core::str::from_utf8(&hex(&h.finalize())).unwrap(), *want);
        }
    }

    #[test]
    fn million_a_test_vector() {
        // MD5 of one million 'a' bytes (a common extended vector; not
        // part of RFC 1321's own test suite):
        // MD5("a" * 1_000_000) = 7707d6ae4e027c70eea2a935c2296f21.
        let mut h = Md5::new();
        let chunk = [b'a'; 1000];
        for _ in 0..1000 {
            h.update(&chunk);
        }
        let got = h.finalize();
        assert_eq!(
            core::str::from_utf8(&hex(&got)).unwrap(),
            "7707d6ae4e027c70eea2a935c2296f21",
        );
    }

    // The active `transform` is the architecture-specific one when built
    // on x86_64 or aarch64; on other targets it is the fallback. Either
    // way, this test compares it block-for-block against the always-
    // available portable fallback implementation, giving us cross-
    // implementation checking on the assembly hosts.
    #[test]
    fn transform_matches_fallback_on_random_blocks() {
        let seed: u64 = 0x1234_5678_9abc_def0;
        let mut rng_state = seed;
        let mut next = || -> u64 {
            // xorshift64*
            rng_state ^= rng_state << 13;
            rng_state ^= rng_state >> 7;
            rng_state ^= rng_state << 17;
            rng_state.wrapping_mul(0x2545_F491_4F6C_DD1D)
        };

        for _ in 0..256 {
            let mut block = [0u8; BLOCK_SIZE];
            for chunk in block.chunks_mut(8) {
                chunk.copy_from_slice(&next().to_le_bytes());
            }
            let init_state = [next() as u32, next() as u32, next() as u32, next() as u32];

            let mut s_active = init_state;
            transform(&mut s_active, &block);

            let mut s_ref = init_state;
            fallback::transform(&mut s_ref, &block);

            assert_eq!(s_active, s_ref);
        }
    }
}
533}