1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
//! SeaHash: A blazingly fast, portable hash function with proven statistical guarantees.
//!
//! SeaHash is a hash function with performance better than (around 3-20% improvement) xxHash and
//! MetroHash. Furthermore, SeaHash has mathematically provable statistical guarantees.
//!
//! SeaHash is a portable hash function, meaning that the output is not dependent on the hosting
//! architecture, and makes no assumptions on endianness or the alike. This stable layout allows it
//! to be used for on-disk/permanent storage (e.g. checksums).
//!
//! # Design, advantages, and features
//!
//! - **High quality**: It beats most other general purpose hash functions because it provides full
//!   avalanche inbetween state updates.
//! - **Performance**: SeaHash beats every high-quality (grading 10/10 in smhasher) hash function
//!    that I know of.
//! - **Provable quality guarantees**: Contrary to most other non-cryptographic hash function,
//!   SeaHash can be proved to satisfy the avalanche criterion as well as BIC.
//! - **Parallelizable**: Consists of multiple, independent states to take advantage of ILP and/or
//!   software threads.
//! - **Bulk reads**: Reads 8 or 4 bytes a time.
//! - **Stable and portable**: Does not depend on the target architecture, and produces a stable
//!   value, which is only changed in major version bumps.
//! - **Keyed**: Designed to not leak the seed/key. Note that it has not gone through
//!   cryptoanalysis yet, so the keyed version shouldn't be relied on when security is needed.
//! - **Hardware accelerateable**: SeaHash is designed such that ASICs can implement it with really
//!   high performance.
//!
//! # A word of warning!
//!
//! This is **not** a cryptographic function, and it certainly should not be used as one. If you
//! want a good cryptograhic hash function, you should use SHA-3 (Keccak) or BLAKE2.
//!
//! It is not secure, nor does it aim to be. It aims to have high quality pseudorandom output and
//! few collisions, as well as being fast.
//!
//! # Benchmark
//!
//! On normal hardware, it is expected to run with a rate around 5.9-6.7 GB/S on a 2.5 GHz CPU.
//! Further improvement can be seen when hashing very big buffers in parallel.
//!
//! | Function    | Quality       | Cycles per byte (lower is better) | Author
//! |-------------|---------------|-----------------------------------|-------------------
//! | **SeaHash** | **Excellent** | **0.24**                          | **Ticki**
//! | xxHash      | Excellent     | 0.31                              | Collet
//! | MetroHash   | Excellent     | 0.35                              | Rogers
//! | Murmur      | Excellent     | 0.64                              | Appleby
//! | Rabin       | Medium        | 1.51                              | Rabin
//! | CityHash    | Excellent     | 1.62                              | Pike, Alakuijala
//! | LoseLose    | Terrible      | 2.01                              | Kernighan, Ritchie
//! | FNV         | Poor          | 3.12                              | Fowler, Noll, Vo
//! | SipHash     | Pseudorandom  | 3.21                              | Aumasson, Bernstein
//! | CRC         | Good          | 3.91                              | Peterson
//! | DJB2        | Poor          | 4.13                              | Bernstein
//!
//! ## Ideal architecture
//!
//! SeaHash is designed and optimized for the most common architecture in use:
//!
//! - Little-endian
//! - 64-bit
//! - 64 or more bytes cache lines
//! - 4 or more instruction pipelines
//! - 4 or more 64-bit registers
//!
//! Anything that does not hold the above requirements will perform worse by up to 30-40%. Note that
//! this means it is still faster than CityHash (~1 GB/S), MurMurHash (~2.6 GB/S), FNV (~0.5 GB/S),
//! etc.
//!
//! # Achieving the performance
//!
//! Like any good general-purpose hash function, SeaHash reads 8 bytes at once effectively reducing
//! the running time by an order of ~5.
//!
//! Secondly, SeaHash achieves the performance by heavily exploiting Instruction-Level Parallelism.
//! In particular, it fetches 4 integers in every round and independently diffuses them. This
//! yields four different states, which are finally combined.
//!
//! # Statistical guarantees
//!
//! SeaHash comes with certain proven guarantees about the statistical properties of the output:
//!
//! 1. Pick some _n_-byte sequence, _s_. The number of _n_-byte sequence colliding with _s_ is
//!    independent of the choice of _s_ (all equivalence class have equal size).
//! 2. If you flip any bit in the input, the probability for any bit in the output to be flipped is
//!    0.5.
//! 3. The hash value of a sequence of uniformly distributed bytes is itself uniformly distributed.
//!
//! The first guarantee can be derived through deduction, by proving that the diffusion function is
//! bijective (reverse the XORs and find the congruence inverses to the primes).
//!
//! The second guarantee requires more complex calculations: Construct a matrix of probabilities
//! and set one to certain (1), then apply transformations through the respective operations. The
//! proof is a bit long, but relatively simple.
//!
//! The third guarantee requires proving that the hash value is a tree, such that:
//! - Leafs represents the input values.
//! - Single-child nodes reduce to the diffusion of the child.
//! - Multiple-child nodes reduce to the sum of the children.
//!
//! Then simply show that each of these reductions transform uniformly distributed variables to
//! uniformly distributed variables.
//!
//! # Inner workings
//!
//! In technical terms, SeaHash follows a alternating 4-state length-padded Merkle–Damgård
//! construction with an XOR-diffuse compression function (click to enlargen):
//!
//! [![A diagram.](http://ticki.github.io/img/seahash_construction_diagram.svg)]
//! (http://ticki.github.io/img/seahash_construction_diagram.svg)
//!
//! It starts with 4 initial states, then it alternates between them (increment, wrap on 4) and
//! does XOR with the respective block. When a state has been visited the diffusion function (f) is
//! applied. The very last block is padded with zeros.
//!
//! After all the blocks have been gone over, all the states are XOR'd to the number of bytes
//! written. The sum is then passed through the diffusion function, which produces the final hash
//! value.
//!
//! The diffusion function is drawn below.
//!
//! ```notest
//! x ← px
//! x ← x ⊕ ((x ≫ 32) ≫ (x ≫ 60))
//! x ← px
//! ```
//!
//! The advantage of having four completely segregated (note that there is no mix round, so they're
//! entirely independent) states is that fast parallelism is possible. For example, if I were to
//! hash 1 TB, I can spawn up four threads which can run independently without _any_
//! intercommunication or syncronization before the last round.
//!
//! If the diffusion function (f) was cryptographically secure, it would pass cryptoanalysis
//! trivially. This might seem irrelavant, as it clearly isn't cryptographically secure, but it
//! tells us something about the inner semantics. In particular, any diffusion function with
//! sufficient statistical quality will make up a good hash function in this construction.
//!
//! Read [the blog post](http://ticki.github.io/blog/seahash-explained/) for more details.
//!
//! # ASIC version
//!
//! SeaHash is specifically designed such that it can be efficiently implemented in the form of
//! ASIC while only using very few transistors.
//!
//! # Specification
//!
//! See the [`reference`](./reference) module.
//!
//! # Credits
//!
//! Aside for myself (@ticki), there are couple of other people who have helped creating this.
//! Joshua Landau suggested using the [PCG family of diffusions](http://www.pcg-random.org/),
//! created by Melissa E. O'Neill. Sokolov Yura spotted multiple bugs in SeaHash.

#![no_std]
#![warn(missing_docs)]

pub use buffer::{hash, hash_seeded};
pub use stream::SeaHasher;

pub mod reference;
mod buffer;
mod stream;

/// The diffusion function.
///
/// This is a bijective function emitting chaotic behavior. Such functions are used as building
/// blocks for hash functions.
fn diffuse(mut x: u64) -> u64 {
    // These are derived from the PCG RNG's round. Thanks to @Veedrac for proposing this. The basic
    // idea is that we use dynamic shifts, which are determined by the input itself. The shift is
    // chosen by the higher bits, which means that changing those flips the lower bits, which
    // scatters upwards because of the multiplication.

    x = x.wrapping_mul(0x6eed0e9da4d94a4f);
    let a = x >> 32;
    let b = x >> 60;
    x ^= a >> b;
    x = x.wrapping_mul(0x6eed0e9da4d94a4f);

    x
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn values() {
        assert_eq!(diffuse(94203824938), 17289265692384716055);
        assert_eq!(diffuse(0xDEADBEEF), 12110756357096144265);
        assert_eq!(diffuse(0), 0);
        assert_eq!(diffuse(1), 15197155197312260123);
        assert_eq!(diffuse(2), 1571904453004118546);
        assert_eq!(diffuse(3), 16467633989910088880);
    }
}