1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
/**
* # UMASH: a non-cryptographic hash function with collision bounds
*
* SPDX-License-Identifier: MIT
* Copyright 2020-2022 Backtrace I/O, Inc.
* Copyright 2022 Paul Khuong
*
* UMASH is a fast (9-22 ns latency for inputs of 1-64 bytes and 22
* GB/s peak throughput, on a 2.5 GHz Intel 8175M) 64-bit hash
* function with mathematically proven collision bounds: it is
* [ceil(s / 4096) * 2^{-55}]-almost-universal for inputs of s or
* fewer bytes.
*
* When that's not enough, UMASH can also generate a pair of 64-bit
* hashes in a single traversal. The resulting fingerprint reduces
* the collision probability to less than [ceil(s / 2^{26})^2 * 2^{-83}];
* the probability that two distinct inputs receive the same
* fingerprint is less 2^{-83} for inputs up to 64 MB, and less than
* 2^{-70} as long as the inputs are shorter than 5 GB each. This
* expectation is taken over the randomly generated `umash_params`.
* If an attacker can infer the contents of these parameters, the
* bounds do not apply.
*
* ## Initialisation
*
* In order to use `UMASH`, one must first generate a `struct
* umash_params`; each such param defines a distinct `UMASH` function
* (a pair of such functions, in fact). Ideally, one would fill
* a struct with random bytes and call`umash_params_prepare`.
*
* - `umash_params_prepare`: attempts to convert the contents of
* randomly filled `struct umash_params` into a valid UMASH
* parameter struct (key). When the input consists of uniformly
* generated random bytes, the probability of failure is
* astronomically small.
*
* - `umash_params_derive`: deterministically constructs a `struct
* umash_params` from a 64-bit seed and an optional 32-byte secret.
* The seed and secret are expanded into random bytes with Salsa20;
* the resulting `umash_params` should be practically random, as
* long the seed or secret are unknown.
*
* ## Batch hashing and fingerprinting
*
* Once we have a `struct umash_params`, we can use `umash_full` or
* `umash_fprint` like regular hash functions.
*
* - `umash_full` can compute either of the two UMASH functions
* described by a `struct umash_params`. Its `seed` argument will
* change the output, but is not associated with any collision
* bound.
*
* - `umash_fprint` computes both `UMASH` functions described by a
* `struct umash_params`. `umash_fp::hash[0]` corresponds to
* calling `umash_full` with the same arguments and `which = 0`;
* `umash_fp::hash[1]` corresponds to `which = 1`.
*
* ## Incremental hashing and fingerprinting
*
* We can also compute UMASH values by feeding bytes incrementally.
* The result is guaranteed to the same as if we had buffered all the
* bytes and called `umash_full` or `umash_fprint`.
*
* - `umash_init` initialises a `struct umash_state` with the same
* parameters one would pass to `umash_full`.
*
* - `umash_digest` computes the value `umash_full` would return
* were it passed the arguments that were given to `umash_init`,
* and the bytes "fed" into the `umash_state`.
*
* - `umash_fp_init` initialises a `struct umash_fp_state` with the
* same parameters one would pass to `umash_fprint`.
*
* - `umash_fp_digest` computes the value `umash_fprint` would return
* for the bytes "fed" into the `umash_fp_state`.
*
* In both cases, one passes a pointer to `struct umash_state::sink`
* or `struct umash_fp_state::sink` to callees that wish to feed bytes
* into the `umash_state` or `umash_fp_state`.
*
* - `umash_sink_update` feeds a byte range to the `umash_sink`
* initialised by calling `umash_init` or `umash_fp_init`. The sink
* does not take ownership of anything and the input bytes may be
* overwritten or freed as soon as `umash_sink_update` returns.
*/
extern "C" __cplusplus
}
/* !UMASH_H */