# arrow-digest
[Crates.io](https://crates.io/crates/arrow-digest)
[CI](https://github.com/sergiimk/arrow-digest/actions)
[Dependencies](https://deps.rs/repo/github/sergiimk/arrow-digest)
Unofficial Apache Arrow crate that aims to standardize stable hashing of structured data.
## Motivation
Today, structured data formats like Parquet are binary-unstable / non-reproducible - writing the same logical data may result in different files at the binary level depending on which writer implementation you use, and the output may also vary with each writer version.
This crate provides a method and implementation for computing **stable hashes of structured data** (logical hash) based on Apache Arrow in-memory format.
Benefits:
- Fast way to check for equality / equivalence of large datasets
- Two parties can compare data without needing to transfer it or reveal its contents
- A step towards **content addressability** of structured data (e.g. when storing dataset chunks in DHTs like IPFS)
## Use
```rust
use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow_digest::{ArrayDigestV0, RecordsDigestV0};
use sha3::Sha3_256;

// Hash a single array
let array = Int32Array::from(vec![1, 2, 3]);
let digest = ArrayDigestV0::<Sha3_256>::digest(&array);
println!("{:x}", digest);

// Alternatively: Use `.update(&array)` to hash multiple arrays of the same type

// Hash record batches
let schema = Arc::new(Schema::new(vec![
    Field::new("a", DataType::Int32, false),
    Field::new("b", DataType::Utf8, false),
]));

// Note: `schema` is already an `Arc<Schema>`, so it is passed directly
let record_batch = RecordBatch::try_new(schema, vec![
    Arc::new(Int32Array::from(vec![1, 2, 3])),
    Arc::new(StringArray::from(vec!["a", "b", "c"])),
]).unwrap();

let digest = RecordsDigestV0::<Sha3_256>::digest(&record_batch);
println!("{:x}", digest);

// Alternatively: Use `.update(&batch)` to hash multiple batches with the same schema
```
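The `.update()` flow mentioned in the comments can be sketched as follows. This is a sketch only: it assumes `ArrayDigestV0` follows the usual `new` / `update` / `finalize` hasher convention, and the exact constructor signature may differ (for example, it may need the data type up front, since the type descriptor is hashed once per session):

```rust
use arrow::array::Int32Array;
use arrow_digest::ArrayDigestV0;
use sha3::Sha3_256;

// Hypothetical incremental use: feed several chunks of the same
// logical column into one hasher and finalize once at the end.
let mut hasher = ArrayDigestV0::<Sha3_256>::new();
hasher.update(&Int32Array::from(vec![1, 2, 3]));
hasher.update(&Int32Array::from(vec![4, 5, 6]));
println!("{:x}", hasher.finalize());
```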
## Status
While we're working towards `v1`, we reserve the right to break hash stability. Please create an issue if you're planning to use this crate.
- [x] Schema hashing
- [x] Fixed size types
- [x] Nullability: primitive types
- [x] Binaries
- [x] Utf8 variants
- [x] Structs
- [x] Nested structs
- [x] Nullability: nested structs
- [x] Lists
- [ ] Lists of structs
- [ ] Dictionaries
- [ ] Intervals
- [ ] Unions
- [ ] Maps
- [ ] Metadata endianness check
- [ ] Better test coverage + fuzzing
- [ ] Performance: Benchmarks
- [ ] Performance: Parallelism
- [ ] Performance: Code optimization
## Design Goals
- Be reasonably fast
- Same hash no matter how many batches the input was split into (see the sketch after this list)
- Same hash no matter if dictionary encoding is used
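A sketch of what the first goal implies, using the hypothetical incremental API from above (the `new(&schema)` constructor is an assumption, not the crate's confirmed signature): hashing one six-row batch and hashing the same rows split into two batches should yield the same digest.

```rust
use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow_digest::RecordsDigestV0;
use sha3::Sha3_256;

let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));

// All six rows in a single batch...
let whole = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5, 6]))],
).unwrap();

// ...and the same rows split across two batches.
let first = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
).unwrap();
let second = RecordBatch::try_new(
    schema.clone(),
    vec![Arc::new(Int32Array::from(vec![4, 5, 6]))],
).unwrap();

// Hypothetical constructor - the real API may differ.
let mut split_hasher = RecordsDigestV0::<Sha3_256>::new(&schema);
split_hasher.update(&first);
split_hasher.update(&second);

// Split-invariance: both ways of feeding the data agree.
assert_eq!(
    format!("{:x}", RecordsDigestV0::<Sha3_256>::digest(&whole)),
    format!("{:x}", split_hasher.finalize()),
);
```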
## Drawbacks
- Logical hashing stops short of perfect content addressability
  - Logical hashing would need to be supported by `IPFS` and the likes, but this is a stretch, as it is not a general-purpose hashing algorithm
  - A fully deterministic binary encoding with Parquet compatibility may be a better approach
- The proposed method is order-dependent - it will produce different hashes if records are reordered
- Boolean hashing could be more efficient
## Hashing Process
Starting from primitives and building up:
- **Endianness** - always assume little endian
- **Fixed Size Types**
  - `Int, FloatingPoint, Decimal, Date, Time, Timestamp` - hashed using their in-memory binary representation
  - `Bool` - hash the individual values as byte-sized values: `1` for `false` and `2` for `true`
- **Variable Size Types**
  - `Binary, LargeBinary, FixedSizeBinary, Utf8, LargeUtf8` - hash the length (as `u64`) followed by the in-memory representation of the value
  - `List, LargeList, FixedSizeList` - hash the length of the list (as `u64`) followed by the hash of the sub-array list according to its data type
- **Nullability** - every null value is represented by a `0` (zero) byte (see the worked example after this list)
  - Arrays without a validity bitmap hash the same as arrays that have one in which every item is valid
- **Array Data**
  - *(once per hashing session)* Hash the data type according to the table below
  - Hash items sequentially using the above rules
- **Record Batch Data**
  - *(once per hashing session)* For every field, hash `field_name as utf8` and `nesting_level (zero-based) as u64`, recursively traversing the schema in **depth-first** order
  - For every leaf column:
    - Produce a **combined nullability bitmap** from the nullability of every parent
    - Update the corresponding column's hasher using the above rules
  - *(final step)* Digests of every array are fed into the combined hasher to produce the final digest
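To make the per-value rules above concrete, here is an illustrative hand-encoding of a nullable `Utf8` column holding `["ab", null, "c"]`, fed straight into SHA3-256. This sketches the rules only; the crate's actual byte stream also includes the type descriptor and schema hashing described above, so the resulting digest will not match the crate's output:

```rust
use sha3::{Digest, Sha3_256};

let mut hasher = Sha3_256::new();

// "ab": length as little-endian u64, then the raw bytes.
hasher.update(2u64.to_le_bytes());
hasher.update(b"ab");

// null: a single zero byte.
hasher.update([0u8]);

// "c": length prefix, then the raw bytes.
hasher.update(1u64.to_le_bytes());
hasher.update(b"c");

println!("{:x}", hasher.finalize());
```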
Data types are hashed as follows:

| Type            | TypeID | Followed by                                           |
| --------------- | ------ | ----------------------------------------------------- |
| Null            | 0      |                                                       |
| Int             | 1      | `unsigned/signed (0/1) as u8`, `bitwidth as u64`      |
| FloatingPoint   | 2      | `bitwidth as u64`                                     |
| Binary          | 3      |                                                       |
| Utf8            | 4      |                                                       |
| Bool            | 5      |                                                       |
| Decimal         | 6      | `bitwidth as u64`, `precision as u64`, `scale as u64` |
| Date            | 7      | `bitwidth as u64`, `DateUnitID`                       |
| Time            | 8      | `bitwidth as u64`, `TimeUnitID`                       |
| Timestamp       | 9      | `TimeUnitID`, `timeZone as nullable Utf8`             |
| Interval        | 10     |                                                       |
| List            | 11     | `items data type`                                     |
| Struct          | 12     |                                                       |
| Union           | 13     |                                                       |
| FixedSizeBinary | 3      |                                                       |
| FixedSizeList   | 11     | `items data type`                                     |
| Map             | 16     |                                                       |
| Duration        | 17     |                                                       |
| LargeBinary     | 3      |                                                       |
| LargeUtf8       | 4      |                                                       |
| LargeList       | 11     | `items data type`                                     |
Note that some types (`Utf8` and `LargeUtf8`; `Binary`, `FixedSizeBinary`, and `LargeBinary`; `List`, `FixedSizeList`, and `LargeList`) are represented identically in the hash, as the difference between them is purely an encoding concern (see the sketch after the tables below).
Where `DateUnitID` is:

| DateUnit    | DateUnitID |
| ----------- | ---------- |
| DAY         | 0          |
| MILLISECOND | 1          |

And `TimeUnitID` is:

| TimeUnit    | TimeUnitID |
| ----------- | ---------- |
| SECOND      | 0          |
| MILLISECOND | 1          |
| MICROSECOND | 2          |
| NANOSECOND  | 3          |
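For example, per the note above, a `StringArray` (`Utf8`) and a `LargeStringArray` (`LargeUtf8`) holding the same values should produce identical digests. A minimal sketch:

```rust
use arrow::array::{LargeStringArray, StringArray};
use arrow_digest::ArrayDigestV0;
use sha3::Sha3_256;

// Utf8 and LargeUtf8 differ only in offset width (i32 vs i64) -
// an encoding concern - so their logical hashes should be equal.
let utf8 = StringArray::from(vec!["a", "b", "c"]);
let large_utf8 = LargeStringArray::from(vec!["a", "b", "c"]);

assert_eq!(
    format!("{:x}", ArrayDigestV0::<Sha3_256>::digest(&utf8)),
    format!("{:x}", ArrayDigestV0::<Sha3_256>::digest(&large_utf8)),
);
```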
## References
- [Arrow memory layout](https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout)
- [Arrow Flatbuffers schema](https://github.com/apache/arrow/blob/master/format/Schema.fbs)