# Hashers

This crate is a collection of hashing-related functionality, for use
with Rust's `std::collections::HashMap`

, `HashSet`

, and so forth.

Additionally, there are benchmarks of the hash functions and a couple of statistical tests for hash quality.

## Disclaimer

**None** of this is *cryptographically secure*. Attempting to use this
for cryptographic purposes is not recommended. I am not a cryptographer;
I don't even play one on TV.

Many (most? all?) of these functions are not designed to prevent collision-based denial of service attacks. Rust's default hash function is SipHash (1-3?), which is designed for that. Many of these functions are intended to be used for performance purposes where that form of security is not needed.

## What's a Hasher?

A hash function, for our purposes here, is a function that takes as input another, general, value, and returns a number that is ideally unique to that value. This number can be used to store the value in an array and then locate it again later without searching the array; in other words, in O(1) time. More or less: there are a lot of other details. For more information, see Rust's HashMap and various information sources online.

In Rust specifically, std::hash::Hasher is a trait:

```
pub trait Hasher {
fn finish(&self) -> u64;
fn write(&mut self, bytes: &[u8]);
fn write_u8(&mut self, i: u8) { ... }
fn write_u16(&mut self, i: u16) { ... }
...
}
```

Hasher has two required methods, `finish`

and `write`

, and default implementations of other
useful methods like `write_u8`

and `write_u16`

, implemented by calling `write`

. In use, an
implementation of Hasher is created, data is fed to it using the various `write`

methods, then
the result is returned using the `finish`

method to get the hash number out.

## Using a custom hash function in Rust

Using a custom hash function with Rust's HashMap or HashSet has long been regarded as a deep mystery. Now, I will strip back the curtains of ignorance and reveal the secrets in all their unholy glory!

Or something like that. It's not really a big deal.

There are two ways to create a HashMap that uses a custom Hasher implementation: setting the hasher on a hash-map instance, and type-level hackery.

### Explicitly telling a HashMap what Hasher to use

Everytime a value needs to be hashed, when inserting or querying the HashMap for example, a new Hasher instance is created. (Remember that the only methods in the Hasher trait update its state or return the final value.)

As a result, instead of passing in a Hasher, we have to pass an instance of another trait,
`std::hash::BuildHash`

. Rust's standard library currently has two implementations of that
trait:

`std::collections::hash_map::RandomState`

, which creates instances of DefaultHasher, Rust's implementation of SIP-something using cryptographic keys to prevent denial-of-service attacks.`std::hash::BuildHasherDefault`

, which can create instances of any Hasher implementation that also implements the Default trait.

All of the Hashers in this collection also implement Default.

```
use HashMap;
use BuildHasherDefault;
use FxHasher;
// BuildHasherDefault also implements Default---it's not really interesting.
let mut map =
with_hasher;
map.insert;
assert_eq!;
```

### Using types to specify what Hasher to use

As an alternative, HashMap has three type-level parameters: the type of keys, the type of
values, and a type implementing `std::hash::BuildHash`

. By default, the latter is
`RandomState`

, which securely creates DefaultHashers. By replacing RandomState, the Hashers
used by the map can be determined by the HashMap's concrete type.
`std::hash::BuildHasherDefault`

is useful here, as well.

```
use HashMap;
use BuildHasherDefault;
use FNV1aHasher64;
// This could be more complicated.
let mut map = gimmie_a_map;
map.insert;
assert_eq!;
```

A more complicated example is the anagrams-hashmap.rs example program included with this module.

## About this crate

This collection of Hashers is based on:

- http://www.cse.yorku.ca/~oz/hash.html Oz's Hash functions. (oz)
- http://www.burtleburtle.net/bob/hash/doobs.html Bob Jenkins' (updated) 1997 Dr. Dobbs article. (jenkins)
- http://burtleburtle.net/bob/hash/spooky.html Jenkin's SpookyHash. (jenkins::spooky_hash)
- Rust's builtin DefaultHasher (SIP 1-3?) (default)
- https://github.com/cbreeden/fxhash A fast, non-secure, hashing algorithm derived from an internal hasher in FireFox. (fx_hash)
- http://www.isthe.com/chongo/tech/comp/fnv/ The Fowler/Noll/Vo hash algorithm. (fnv)
- https://hbfs.wordpress.com/2015/11/17/and-a-good-one-hash-functions-part vi/ Steven Pigeon's Bricolage hash algorithm.
- Two "null" hashers: NullHasher returns 0 for all inputs and PassThroughHasher returns the last 8 bytes of the data.

Each sub-module implements one or more Hashers plus a minimal testing module. As well, the module has a benchmarking module for comparing the Hashers and some example programs using statistical tests to prod the various Hashers.

## Example programs

### chi2

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. -- Chi-squared test

This program attempts to compute the hash values for one of a number of data sets, then simulates using those values in a 128-bucket hash table (a 2^7 - 1 mask) and tries to determine if the hash buckets are uniformly distributed. I think. I'm not a statistician and apparently not much of a programmer any more. Sorry.

Anyway, it shows what may be the chi2 test of the lower bits of the hash values for a number of samples and for each Hasher. Numbers closer to 0 are better, and between 3.0 and -3.0 are apparently "ok." Maybe.

The samples are:

- 1000 uniformly distributed 6-byte binary values.
- 1000 uniformly distributed 6-byte alphanumeric (ASCII) values.
- 1000 generated identifiers of the form 'annnnn'.
- The words from data/words.txt

### kolmogorov-smirnov

The Kolmogorovâ€“Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution. -- Kolmogorovâ€“Smirnov test.

Ok, this one I have a bit more confidence in. It hashes the same samples as the chi2 program, then attempts to determine how far from uniformly distributed the 64-bit hash values are, reporting values between 0.0 and 1.0. Lower values are better. 32-bit hashes like DJB2 trivially fail this test, though, although they may be fine for HashMaps with much less than 2^32 entries.

### anagrams-hashmap

This program finds the number of words that can be made from the letters "asdwtribnowplfglewhqagnbe", based on the anagrams dictionary in data/anadict.txt. (There are 7440 of them.) It uses implementations of HashMap and HashSet parameterized by Hashers, and reports the time taken by each hasher as well as a comparison with DefaultHasher.

For more information, check out my ancient series of blog posts: