fast_md5/lib.rs
//! # fast-md5
//!
//! A small MD5 implementation with hand-written assembly cores for
//! `x86_64` and `aarch64`, plus a portable Rust fallback for every other
//! target. The assembly was ported from
//! [animetosho/md5-optimisation](https://github.com/animetosho/md5-optimisation)
//! (released into the public domain by the author; see
//! [discussion #4](https://github.com/animetosho/md5-optimisation/discussions/4)).
//!
//! On Apple Silicon and modern x86_64, the per-block compression is
//! within ~1 % of AWS-LC's hand-tuned C. End-to-end in HMAC-MD5
//! workloads (e.g. RADIUS Message-Authenticator), the all-Rust call
//! path beats AWS-LC by ~18 % thanks to cross-call inlining and a
//! precomputed-state HMAC structure (see below).
//!
//! ## Security warning
//!
//! MD5 is **cryptographically broken** — it is trivially vulnerable to
//! collision attacks and must not be used for digital signatures,
//! certificate fingerprints, or any other security-sensitive integrity
//! check. HMAC-MD5 is also discouraged for new designs; this crate
//! exposes [`HmacMd5`] solely to support legacy protocols (RADIUS,
//! CHAP, certain SASL/SIP digests) and non-cryptographic uses such as
//! deduplication and checksumming.
//!
//! ## Quick start
//!
//! ```
//! let digest = fast_md5::digest(b"The quick brown fox jumps over the lazy dog");
//! assert_eq!(
//!     digest,
//!     [
//!         0x9e, 0x10, 0x7d, 0x9d, 0x37, 0x2b, 0xb6, 0x82,
//!         0x6b, 0xd8, 0x1d, 0x35, 0x42, 0xa4, 0x19, 0xd6,
//!     ],
//! );
//! ```
//!
//! Streaming MD5:
//!
//! ```
//! let mut h = fast_md5::Md5::new();
//! h.update(b"The quick brown fox ");
//! h.update(b"jumps over the lazy dog");
//! assert_eq!(
//!     h.finalize(),
//!     [
//!         0x9e, 0x10, 0x7d, 0x9d, 0x37, 0x2b, 0xb6, 0x82,
//!         0x6b, 0xd8, 0x1d, 0x35, 0x42, 0xa4, 0x19, 0xd6,
//!     ],
//! );
//! ```
//!
//! Streaming HMAC-MD5 (RFC 2104):
//!
//! ```
//! let mut h = fast_md5::HmacMd5::new(b"Jefe");
//! h.update(b"what do ya want ");
//! h.update(b"for nothing?");
//! assert_eq!(
//!     h.finalize(),
//!     [
//!         0x75, 0x0c, 0x78, 0x3e, 0x6a, 0xb0, 0xb5, 0x03,
//!         0xea, 0xa8, 0x6e, 0x31, 0x0a, 0x5d, 0xb7, 0x38,
//!     ],
//! );
//! ```
//!
//! ## `no_std`
//!
//! The crate is `#![no_std]` and performs no heap allocation. All
//! buffers (the 64-byte block buffer, the 4-word state, the HMAC
//! ipad/opad scratch) live inline in the user's value or on the
//! caller's stack.
//!
//! ## Cargo features
//!
//! - **`force-fallback`** — disable the architecture-specific assembly
//!   backends and route [`transform`] through the portable Rust
//!   `fallback` implementation on every target. Intended for CI
//!   coverage of the fallback on assembly hosts and for downstream
//!   debugging; not recommended for production use on `x86_64` /
//!   `aarch64` (the fallback is correct but materially slower).
//!
//! ## Design notes
//!
//! The crate is small enough to read end-to-end, but a few choices are
//! worth flagging because they're load-bearing for performance and
//! were arrived at empirically.
//!
//! ### Architecture dispatch
//!
//! [`transform`] is a `cfg`-dispatched shim that routes to one of
//! three backends, all with identical semantics:
//!
//! - **`x86_64`**: a single monolithic `asm!` block per 64-byte block,
//!   ported from animetosho's "NoLEA" sequence. It uses `add` chains
//!   instead of `lea` to keep the critical path on the integer ALUs,
//!   which is faster on every x86_64 microarchitecture from Haswell
//!   onward.
//! - **`aarch64`**: per-round Rust expressions with a one-line `asm!`
//!   block for the rotate (see the sketch after this list). LLVM
//!   produces nearly the same code as a monolithic asm block here
//!   because AArch64's three-operand ALU (no destination clobber) and
//!   free shifts give the register allocator more freedom; the inline
//!   `ror` exists only to pin a concrete 32-bit register at each round
//!   boundary, which guides scheduling.
//! - **fallback**: portable safe Rust used on all other targets, *and*
//!   compiled under `cfg(test)` on every target (even when an assembly
//!   backend is active), so its `transform` can be cross-checked
//!   against the assembly backends on every supported host.
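//!
//! A minimal sketch of the shape of that rotate helper (illustrative
//! names; the crate's actual per-round code may bake the rotate amount
//! in as an immediate instead of a register operand):
//!
//! ```ignore
//! #[inline(always)]
//! fn rotate_left(x: u32, r: u32) -> u32 {
//!     let out: u32;
//!     // AArch64 only has rotate-right, so a left rotate by `r` is a
//!     // right rotate by `32 - r`; the `asm!` forces the result into a
//!     // concrete 32-bit register at the round boundary.
//!     unsafe {
//!         core::arch::asm!(
//!             "ror {out:w}, {x:w}, {r:w}",
//!             out = lateout(reg) out,
//!             x = in(reg) x,
//!             r = in(reg) 32 - r,
//!             options(pure, nomem, nostack),
//!         );
//!     }
//!     out
//! }
//! ```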
//!
//! ### Inlining policy
//!
//! - The architecture-specific `transform` is `#[inline(always)]` —
//!   it must fuse with [`Md5::update`]'s per-block loop or LLVM
//!   leaves the state in memory between blocks.
//! - [`Md5::update`] and [`Md5::finalize`] are plain `#[inline]` —
//!   their bodies are large once `transform` inlines (~700
//!   instructions on aarch64), and forcing inline at every call site
//!   measurably regresses I-cache-bound workloads (HMAC chains, etc.).
//!   Plain `#[inline]` exposes MIR for cross-crate inlining and lets
//!   LLVM's size heuristic decide.
//! - [`Md5::new`] and [`digest`] are `#[inline(always)]` — trivial
//!   wrappers; forcing inline lets the IV propagate as register
//!   immediates and lets known-length one-shots collapse `finalize`'s
//!   padding into constants.
//!
//! ### Block buffer uses `MaybeUninit`
//!
//! [`Md5`]'s 64-byte partial-block buffer is `[MaybeUninit<u8>; 64]`
//! rather than `[u8; 64]`. Construction is therefore a no-op (no
//! 64-byte zero fill on every `Md5::new()`), and bytes are only read
//! after they have been written. This matters when the caller does
//! many short hashes per second — RADIUS again — because the cost of
//! initialising an `Md5` becomes negligible relative to the
//! compression itself.
//!
//! ### `HmacMd5` precomputes ipad/opad states
//!
//! [`HmacMd5::new`] runs the ipad and opad block compressions
//! immediately and stores **only** the resulting `[u32; 4]` states
//! (32 bytes total). Steady-state, [`update`](HmacMd5::update) is
//! pure delegation to the inner [`Md5`], and
//! [`finalize`](HmacMd5::finalize) costs exactly **two** extra
//! compressions on top of the message work (inner tail + outer
//! tail). This is the same shape as AWS-LC's `HMAC_CTX` and is what
//! gives the ~18 % end-to-end win in HMAC-bound workloads.
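//!
//! A sketch of the construction, with illustrative field and helper
//! names (the real code lives in the private `hmac` module and may
//! differ in detail):
//!
//! ```ignore
//! struct HmacMd5 {
//!     inner: Md5,           // resumed from the post-ipad state
//!     opad_state: [u32; 4], // post-opad state, replayed at finalize
//! }
//!
//! impl HmacMd5 {
//!     fn new(key: &[u8]) -> Self {
//!         // Keys longer than one block are hashed down to 16 bytes
//!         // first (omitted here).
//!         let mut block = [0u8; BLOCK_SIZE];
//!         block[..key.len()].copy_from_slice(key);
//!         let (mut ipad, mut opad) = (IV, IV);
//!         transform(&mut ipad, &block.map(|b| b ^ 0x36)); // once, at setup
//!         transform(&mut opad, &block.map(|b| b ^ 0x5c)); // once, at setup
//!         Self {
//!             inner: Md5::from_parts(ipad, BLOCK_SIZE as u64),
//!             opad_state: opad,
//!         }
//!     }
//!
//!     fn finalize(self) -> [u8; DIGEST_LENGTH] {
//!         let inner_digest = self.inner.finalize(); // inner tail
//!         let mut outer = Md5::from_parts(self.opad_state, BLOCK_SIZE as u64);
//!         outer.update(&inner_digest);
//!         outer.finalize() // outer tail
//!     }
//! }
//! ```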
//!
//! ### No volatile key zeroization
//!
//! Neither [`Md5`] nor [`HmacMd5`] performs `write_volatile`-style
//! scrubbing of stack scratch on drop. The transient ipad/opad XOR
//! blocks inside [`HmacMd5::new`] are recoverable only during the
//! lifetime of that call, and the persistent struct holds digests of
//! the key (one-way) rather than the key itself. Callers with
//! stricter threat models (FIPS, processes that emit core dumps,
//! protocols where memory-disclosure bugs are realistic) should
//! wrap their key in [`zeroize::Zeroizing`](https://docs.rs/zeroize)
//! at the call site; this crate cannot meaningfully protect a key
//! that the caller is already holding in long-lived memory.
//!
//! Skipping the volatile writes also keeps the hot path free of
//! optimization barriers, which is part of why the all-Rust HMAC
//! path beats the FFI'd C implementations.

#![no_std]
#![deny(missing_docs)]

/// MD5 compresses 64-byte blocks.
pub const BLOCK_SIZE: usize = 64;

/// MD5 produces a 16-byte digest.
pub const DIGEST_LENGTH: usize = 16;

const STATE_WORDS: usize = 4;

/// Standard MD5 initialization vector (RFC 1321 §3.3).
const IV: [u32; STATE_WORDS] = [0x6745_2301, 0xefcd_ab89, 0x98ba_dcfe, 0x1032_5476];

#[cfg(all(target_arch = "aarch64", not(feature = "force-fallback")))]
mod aarch64;
// The fallback module is compiled when:
// - the target has no assembly backend, or
// - the `force-fallback` feature is enabled (CI / debugging), or
// - we are running tests (so its in-module RFC 1321 vectors run on
//   every host, regardless of the active backend).
#[cfg(any(
    test,
    feature = "force-fallback",
    not(any(target_arch = "x86_64", target_arch = "aarch64"))
))]
mod fallback;
#[cfg(all(target_arch = "x86_64", not(feature = "force-fallback")))]
mod x86_64;

mod hmac;
pub use hmac::HmacMd5;

/// Streaming MD5 hasher.
///
/// Buffers partial blocks internally and dispatches to the
/// architecture-specific compression function on every full 64-byte
/// block. Call [`Md5::finalize`] to consume the hasher and obtain the
/// 16-byte digest.
pub struct Md5 {
    state: [u32; STATE_WORDS],
    /// Number of bytes processed so far (used for final length encoding).
    count: u64,
    /// Partial block buffer. Only the first `buf_len` bytes are
    /// initialized at any given moment; the tail remains
    /// uninitialized.
    buf: [core::mem::MaybeUninit<u8>; BLOCK_SIZE],
    /// Number of bytes currently in `buf`.
    buf_len: usize,
}

#[allow(clippy::new_without_default)]
impl Md5 {
    /// Construct a new hasher initialized with the standard MD5 IV.
    #[inline(always)]
    pub fn new() -> Self {
        Self {
            state: IV,
            count: 0,
            // Buffer is left uninitialized; `update`/`finalize` only
            // read bytes after writing them.
            buf: [const { core::mem::MaybeUninit::<u8>::uninit() }; BLOCK_SIZE],
            buf_len: 0,
        }
    }

    /// Construct a hasher resuming from a precomputed `state` that has
    /// already absorbed `count_bytes` bytes (which must be a multiple
    /// of [`BLOCK_SIZE`]). Used by [`HmacMd5`] to install the
    /// post-ipad / post-opad inner states without rerunning the
    /// compression on the key block at every operation.
    #[inline(always)]
    pub(crate) fn from_parts(state: [u32; STATE_WORDS], count_bytes: u64) -> Self {
        debug_assert!(count_bytes % (BLOCK_SIZE as u64) == 0);
        Self {
            state,
            count: count_bytes,
            buf: [const { core::mem::MaybeUninit::<u8>::uninit() }; BLOCK_SIZE],
            buf_len: 0,
        }
    }

    /// Absorb additional input. May be called any number of times.
    ///
    /// Annotated `#[inline]` (not `inline(always)`): the body is large
    /// once the architecture-specific `transform` inlines into it, and
    /// forcing inline at every HMAC call site causes I-cache pressure
    /// that measurably regresses real-world workloads. `#[inline]`
    /// still exposes MIR for cross-crate inlining when LLVM's size
    /// heuristic decides the win is worth it.
    #[inline]
    pub fn update(&mut self, mut data: &[u8]) {
        self.count = self.count.wrapping_add(data.len() as u64);

        // Fill partial buffer first.
        if self.buf_len > 0 {
            let need = BLOCK_SIZE - self.buf_len;
            let take = need.min(data.len());
            // SAFETY: `take <= need` and `buf_len + need == BLOCK_SIZE`,
            // so the destination range is inside `buf`. `MaybeUninit<u8>`
            // has the same layout as `u8`, so we can write into it via a
            // byte pointer.
            unsafe {
                core::ptr::copy_nonoverlapping(
                    data.as_ptr(),
                    self.buf.as_mut_ptr().add(self.buf_len).cast::<u8>(),
                    take,
                );
            }
            self.buf_len += take;
            data = &data[take..];
            if self.buf_len == BLOCK_SIZE {
                // SAFETY: all 64 bytes of `buf` are now initialized.
                let block: &[u8; BLOCK_SIZE] =
                    unsafe { &*(self.buf.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
                transform(&mut self.state, block);
                self.buf_len = 0;
            }
        }

        // Process full blocks directly from `data`.
        while data.len() >= BLOCK_SIZE {
            // SAFETY: `data.len() >= BLOCK_SIZE`, so the first 64 bytes
            // of `data` are in bounds, and `[u8; 64]` has alignment 1.
            let block: &[u8; BLOCK_SIZE] =
                unsafe { &*(data.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
            transform(&mut self.state, block);
            data = &data[BLOCK_SIZE..];
        }

        // Save remainder.
        if !data.is_empty() {
            // SAFETY: `data.len() < BLOCK_SIZE` (loop above drained
            // full blocks), so this write stays inside `buf`.
            unsafe {
                core::ptr::copy_nonoverlapping(
                    data.as_ptr(),
                    self.buf.as_mut_ptr().cast::<u8>(),
                    data.len(),
                );
            }
            self.buf_len = data.len();
        }
    }

    /// Consume the hasher and return the 16-byte digest.
    ///
    /// See [`Md5::update`] for the rationale behind plain `#[inline]`.
    #[inline]
    pub fn finalize(mut self) -> [u8; DIGEST_LENGTH] {
        // Append bit '1' (0x80 byte), then zero-pad to 56 mod 64, then
        // the 64-bit LE bit count.
        let bit_count = self.count.wrapping_mul(8);
        // SAFETY: `buf_len <= BLOCK_SIZE - 1` on entry to `finalize`
        // (a full block would have been flushed by `update`), so this
        // write stays inside `buf`.
        unsafe {
            self.buf
                .as_mut_ptr()
                .add(self.buf_len)
                .cast::<u8>()
                .write(0x80);
        }
        self.buf_len += 1;

        if self.buf_len > 56 {
            // Need an extra block: zero-pad the rest of this one and
            // process it; the length encoding goes into the next block.
            // SAFETY: writes `BLOCK_SIZE - buf_len` zero bytes starting
            // at offset `buf_len`, staying inside `buf`.
            unsafe {
                core::ptr::write_bytes(
                    self.buf.as_mut_ptr().add(self.buf_len).cast::<u8>(),
                    0,
                    BLOCK_SIZE - self.buf_len,
                );
            }
            // SAFETY: all 64 bytes of `buf` are initialized.
            let block: &[u8; BLOCK_SIZE] =
                unsafe { &*(self.buf.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
            transform(&mut self.state, block);
            self.buf_len = 0;
        }

        // Zero-pad up to byte 56, then write the bit count.
        // SAFETY: `buf_len <= 56`, so the zero fill and length write
        // jointly cover bytes `buf_len..64` inside `buf`.
        unsafe {
            core::ptr::write_bytes(
                self.buf.as_mut_ptr().add(self.buf_len).cast::<u8>(),
                0,
                56 - self.buf_len,
            );
            core::ptr::copy_nonoverlapping(
                bit_count.to_le_bytes().as_ptr(),
                self.buf.as_mut_ptr().add(56).cast::<u8>(),
                8,
            );
        }
        // SAFETY: all 64 bytes of `buf` are initialized.
        let block: &[u8; BLOCK_SIZE] =
            unsafe { &*(self.buf.as_ptr().cast::<[u8; BLOCK_SIZE]>()) };
        transform(&mut self.state, block);

        let mut digest = [0u8; DIGEST_LENGTH];
        for (i, word) in self.state.iter().enumerate() {
            digest[i * 4..i * 4 + 4].copy_from_slice(&word.to_le_bytes());
        }
        digest
    }
}

/// Compress one 64-byte block into the running MD5 state.
///
/// This is the architecture-dispatch entry point used by [`Md5::update`]
/// and [`Md5::finalize`]. It is exposed for callers that handle the
/// message-length padding themselves (for example, custom streaming
/// wrappers). Most users should prefer [`digest`] or the [`Md5`] type.
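///
/// # Example
///
/// Compressing the canonical padding block for the empty message
/// reproduces `digest(b"")`:
///
/// ```
/// // MD5 IV (RFC 1321 §3.3).
/// let mut state = [0x6745_2301u32, 0xefcd_ab89, 0x98ba_dcfe, 0x1032_5476];
/// // Pre-padded block for the empty message: the 0x80 terminator, then
/// // zeros (including the eight zero bytes of the 64-bit bit count).
/// let mut block = [0u8; 64];
/// block[0] = 0x80;
/// fast_md5::transform(&mut state, &block);
///
/// // Serialize the state words little-endian, exactly as `finalize` does.
/// let mut digest = [0u8; 16];
/// for (i, w) in state.iter().enumerate() {
///     digest[i * 4..i * 4 + 4].copy_from_slice(&w.to_le_bytes());
/// }
/// assert_eq!(digest, fast_md5::digest(b""));
/// ```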
#[inline]
pub fn transform(state: &mut [u32; STATE_WORDS], block: &[u8; BLOCK_SIZE]) {
    #[cfg(all(target_arch = "x86_64", not(feature = "force-fallback")))]
    {
        x86_64::transform(state, block);
    }
    #[cfg(all(target_arch = "aarch64", not(feature = "force-fallback")))]
    {
        aarch64::transform(state, block);
    }
    #[cfg(any(
        feature = "force-fallback",
        not(any(target_arch = "x86_64", target_arch = "aarch64"))
    ))]
    {
        fallback::transform(state, block);
    }
}

/// One-shot convenience: hash `data` and return the digest.
#[inline(always)]
pub fn digest(data: &[u8]) -> [u8; DIGEST_LENGTH] {
    let mut m = Md5::new();
    m.update(data);
    m.finalize()
}

#[cfg(test)]
mod tests {
    use super::*;

    // (input, expected hex digest) — the RFC 1321 §A.5 test suite; the
    // longer entries exercise the multi-block path.
    const VECTORS: &[(&[u8], &str)] = &[
        (b"", "d41d8cd98f00b204e9800998ecf8427e"),
        (b"a", "0cc175b9c0f1b6a831c399e269772661"),
        (b"abc", "900150983cd24fb0d6963f7d28e17f72"),
        (b"message digest", "f96b697d7cb7938d525a2f31aaf161d0"),
        (
            b"abcdefghijklmnopqrstuvwxyz",
            "c3fcd3d76192e4007dfb496cca67e13b",
        ),
        (
            b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
            "d174ab98d277d9f5a5611c2c9f419d9f",
        ),
        (
            b"12345678901234567890123456789012345678901234567890123456789012345678901234567890",
            "57edf4a22be3c955ac49da2e2107b67a",
        ),
    ];

    fn hex(bytes: &[u8]) -> [u8; 32] {
        const HEX: &[u8; 16] = b"0123456789abcdef";
        let mut out = [0u8; 32];
        for (i, b) in bytes.iter().enumerate() {
            out[i * 2] = HEX[(b >> 4) as usize];
            out[i * 2 + 1] = HEX[(b & 0x0f) as usize];
        }
        out
    }

    #[test]
    fn rfc1321_vectors_oneshot() {
        for (input, want) in VECTORS {
            let got = digest(input);
            assert_eq!(
                core::str::from_utf8(&hex(&got)).unwrap(),
                *want,
                "input len {}",
                input.len()
            );
        }
    }

    #[test]
    fn rfc1321_vectors_streaming_byte_by_byte() {
        for (input, want) in VECTORS {
            let mut h = Md5::new();
            for chunk in input.chunks(1) {
                h.update(chunk);
            }
            assert_eq!(
                core::str::from_utf8(&hex(&h.finalize())).unwrap(),
                *want,
                "streaming input len {}",
                input.len()
            );
        }
    }

    #[test]
    fn rfc1321_vectors_streaming_odd_chunks() {
        // Use a chunk size that doesn't divide BLOCK_SIZE evenly to
        // exercise the buffer / boundary code paths.
        for (input, want) in VECTORS {
            let mut h = Md5::new();
            for chunk in input.chunks(13) {
                h.update(chunk);
            }
            assert_eq!(core::str::from_utf8(&hex(&h.finalize())).unwrap(), *want);
        }
    }

    #[test]
    fn million_a_test_vector() {
        // MD5 of one million 'a' bytes (a common extended vector; not
        // part of RFC 1321's own test suite):
        // MD5("a" * 1_000_000) = 7707d6ae4e027c70eea2a935c2296f21.
        let mut h = Md5::new();
        let chunk = [b'a'; 1000];
        for _ in 0..1000 {
            h.update(&chunk);
        }
        let got = h.finalize();
        assert_eq!(
            core::str::from_utf8(&hex(&got)).unwrap(),
            "7707d6ae4e027c70eea2a935c2296f21",
        );
    }

    // The active `transform` is the architecture-specific one when built
    // on x86_64 or aarch64; on other targets it is the fallback. Either
    // way, this test compares it block-for-block against the always-
    // available portable fallback implementation, giving us cross-
    // implementation checking on the assembly hosts.
    #[test]
    fn transform_matches_fallback_on_random_blocks() {
        let seed: u64 = 0x1234_5678_9abc_def0;
        let mut rng_state = seed;
        let mut next = || -> u64 {
            // xorshift64*
            rng_state ^= rng_state << 13;
            rng_state ^= rng_state >> 7;
            rng_state ^= rng_state << 17;
            rng_state.wrapping_mul(0x2545_F491_4F6C_DD1D)
        };

        for _ in 0..256 {
            let mut block = [0u8; BLOCK_SIZE];
            for chunk in block.chunks_mut(8) {
                chunk.copy_from_slice(&next().to_le_bytes());
            }
            let init_state = [next() as u32, next() as u32, next() as u32, next() as u32];

            let mut s_active = init_state;
            transform(&mut s_active, &block);

            let mut s_ref = init_state;
            fallback::transform(&mut s_ref, &block);

            assert_eq!(s_active, s_ref);
        }
    }
}
533}