compact-dict 0.1.0

A highly customizable, open-addressing dictionary in Rust.
Documentation

compact-dict

compact-dict is a highly customizable, open-addressing dictionary in Rust. It features linear probing and continuous memory layout for string keys, heavily inspired by Mojo's Dict.

Features

  • Customizable Layout: Generic integer types for key-count (KC) and key-offset (KO) sizes. This flexibility allows you to perfectly balance memory usage (e.g. u16 vs u32 indices) depending on your scale requirements.
  • Cache Friendly: Linear probing design with optional cached hashes (CACHING_HASHES).
  • SIMD Preparation: Includes preparations for SIMD-accelerated probing (via portable_simd).
  • Configurable Mutability: Optional destructive operations via the DESTRUCTIVE constant generic parameter, which toggles bitsets to mask deleted operations cleanly payload.
  • Performant Hash Function: By default, it uses high-performance internal hasher variants built upon aHash (like MojoAHashStrHash).

Requirements

  • Rust Nightly: The crate strictly uses #![feature(portable_simd)] and relies on the std::simd features, meaning you must compile it with the nightly Rust toolchain.

Usage

Below is a standard usage example:

use compact_dict::dict::Dict;

fn main() {
    // Creating a new Dict: Values are of type `u32`
    // Internally uses default Configs: MojoAHashStrHash, u32 for indices
    let mut map: Dict<u32> = Dict::new(8);

    // Insert elements
    map.put("hello", 42);
    map.put("world", 100);

    // Retrieval
    assert_eq!(map.get_or("hello", 0), 42);
    assert_eq!(map.get_or("world", 0), 100);

    // Missing key returns default
    assert_eq!(map.get_or("missing", 0), 0);

    // Metadata checks
    assert_eq!(map.contains("hello"), true);
    assert_eq!(map.len(), 2);
    
    // Reset the map
    map.clear();
    assert_eq!(map.len(), 0);
}

Running Tests

To test the package, run cargo test utilizing the nightly toolchain.

cargo +nightly test

Benchmarking:

There are two benchmarks in the repository: workload_bench (simulating real-world string ingestion and mutations) and dict_bench (using Criterion for pure primitive operations like get).

RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --bench workload_bench

Workload Benchmark Results (workload_bench - Insertions + Mutations):

compact_dict_fx: 0.152 s ๐Ÿš€๐Ÿ”ฅ
fxhash: 0.182 s
std_hashmap: 0.267 s
hashbrown: 0.272 s

Pure Lookup Results (dict_bench - 10k random get operations):

RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --bench dict_bench
hashbrown: ~74 ยตs ๐Ÿš€๐Ÿ”ฅ
compact_dict: ~143 ยตs

Note: For the most accurate comparisons without allocation overhead skewing metrics, ensure to run tests natively with LTO enabled, as seen in the bench profile.

Performance Analysis & Honest Comparison

compact-dict is exceptionally fast for very specific workloads, beating out highly optimized SwissTable implementations like hashbrown and standard std::collections::HashMap. However, it makes severe trade-offs to achieve this speed. Here is an honest guide on when to use what:

โœ… Where compact-dict Wins (Strengths)

  1. String-Heavy Initialization & Iteration: Instead of individually heap-allocating every String key like standard HashMaps do, compact-dict copies all strings into a single densely-packed, continuous Vec<u8> memory buffer (KeysContainer).
  2. SIMD Vectorization: Lookups leverage #![feature(portable_simd)] to compare up to 16 cached u32 hashes in exactly a single hardware instruction cycle.
  3. Data Analysis & Short-lived Workloads: If you need to ingest millions of unique strings rapidly, perform some math/updates on their values, and then drop the map, compact-dict will significantly outpace its competitors.

โŒ Where compact-dict Loses (Weaknesses / Use hashbrown instead)

  1. Continuous Server Deletions: Deleting elements in compact-dict only marks a bit field as deleted (tombstoning). It never compacts or frees the physical string memory from the keys buffer. If you implement a long-running web server that constantly adds and removes strings, compact-dict will act as a memory leak until the entire dictionary is dropped.
  2. Generic Keys: The dictionary is hardcoded around KeysContainer offsets, heavily specializing in &str. You cannot drop in HashMap<Uuid, usize> or custom structs as keys easily. Standard map implementations are completely generic.
  3. Ecosystem Stability: It relies on Nightly Rust explicitly for std::simd. hashbrown has zero unstable dependencies and runs practically everywhere perfectly optimized for all target architectures.
  4. Pure Lookup Speed: In pure read-heavy workloads (e.g., retrieving 10,000 strings without inserting or scanning new ones), highly optimized SwissTables like hashbrown still outperform compact-dict. As seen in the pure-get microbenchmark, hashbrown can be about 2x faster than compact-dict for isolated random-access lookups. The performance strength of compact-dict revolves around the combined speed of contiguous string ingestion, sequential iteration cache-locality, and mutation operations.