flash-map 0.1.0

GPU-native concurrent hash map with bulk-only API. SoA layout, CUDA kernels, CPU fallback. Designed for blockchain state, HFT, and batch-parallel workloads.
Documentation

FlashMap

GPU-native concurrent hash map for bulk operations. Designed for workloads where you need to look up, insert, or remove millions of keys in a single call — not one at a time.

Why FlashMap?

Traditional hash maps (std HashMap, DashMap) process keys sequentially or with CPU thread-level parallelism. FlashMap launches a CUDA kernel where each GPU thread handles one key — achieving massive parallelism on modern GPUs.

Operation DashMap (16-core) FlashMap (H100)
1M inserts ~500ms ~60ms
1M lookups ~300ms ~35ms

Quick Start

use flash_map::FlashMap;

// Create a map with capacity for 1M entries
let mut map: FlashMap<[u8; 32], [u8; 128]> =
    FlashMap::with_capacity(1_000_000).unwrap();

// Insert 100K key-value pairs in one GPU kernel launch
let pairs: Vec<([u8; 32], [u8; 128])> = generate_pairs();
map.bulk_insert(&pairs).unwrap();

// Look up all keys at once
let keys: Vec<[u8; 32]> = pairs.iter().map(|(k, _)| *k).collect();
let results: Vec<Option<[u8; 128]>> = map.bulk_get(&keys).unwrap();

// Remove a batch of keys
map.bulk_remove(&keys[..1000]).unwrap();

Features

  • cuda — GPU backend via CUDA (requires NVIDIA GPU + CUDA 12+ toolkit)
  • cpu-fallback — CPU backend for development/testing (enabled by default)
# CPU only (default, works everywhere)
flash-map = "0.1"

# GPU acceleration
flash-map = { version = "0.1", features = ["cuda"] }

# Both (tries GPU first, falls back to CPU)
flash-map = { version = "0.1", features = ["cuda", "cpu-fallback"] }

Applications

Blockchain State Storage

Store account state (pubkey → account data) with bulk commit after block execution. FlashMap's fixed-size key/value constraint maps directly to blockchain account models where keys are 32-byte public keys and values are fixed-size account structs.

use flash_map::FlashMap;

type Pubkey = [u8; 32];
type AccountData = [u8; 128];

let mut state: FlashMap<Pubkey, AccountData> =
    FlashMap::with_capacity(10_000_000).unwrap();

// After executing a block of 100K transactions,
// commit all state changes in one GPU call
let changes: Vec<(Pubkey, AccountData)> = execute_block(&txs);
state.bulk_insert(&changes).unwrap();

High-Frequency Trading / Order Books

Batch-update order book entries. Price levels and order quantities change in bursts — FlashMap processes an entire tick's worth of updates in a single kernel launch instead of one-by-one mutex-locked inserts.

use flash_map::FlashMap;

// OrderId (u64) → OrderEntry (price + qty + side + timestamp)
let mut book: FlashMap<u64, [u8; 32]> = FlashMap::with_capacity(1_000_000).unwrap();

// Process all order updates from a single market data tick
let updates: Vec<(u64, [u8; 32])> = parse_tick_updates(&market_data);
book.bulk_insert(&updates).unwrap();

// Cancel batch of orders
let cancels: Vec<u64> = parse_cancellations(&market_data);
book.bulk_remove(&cancels).unwrap();

Network Packet Deduplication

Deduplicate packets by hash in high-throughput network pipelines. At 100Gbps+, per-packet hash table lookups on CPU become a bottleneck — FlashMap processes an entire batch of packet hashes on GPU.

use flash_map::FlashMap;

type PacketHash = [u8; 32];
type Marker = u64; // timestamp or sequence number

let mut seen: FlashMap<PacketHash, Marker> =
    FlashMap::with_capacity(10_000_000).unwrap();

// Check which packets in this batch are duplicates
let hashes: Vec<PacketHash> = batch.iter().map(|p| hash(p)).collect();
let results = seen.bulk_get(&hashes).unwrap();

// Insert new (non-duplicate) packets
let new_packets: Vec<(PacketHash, Marker)> = hashes.iter()
    .zip(results.iter())
    .filter(|(_, r)| r.is_none())
    .map(|(h, _)| (*h, current_timestamp()))
    .collect();
seen.bulk_insert(&new_packets).unwrap();

GPU-Accelerated Databases

Use as a GPU-resident index for in-memory databases. Traditional B-tree or hash indexes live in CPU memory — FlashMap keeps the index on GPU, eliminating PCIe round-trips for query-heavy workloads.

use flash_map::FlashMap;

type RowId = u64;
type IndexKey = [u8; 32]; // hashed column value

let mut index: FlashMap<IndexKey, RowId> =
    FlashMap::with_capacity(50_000_000).unwrap();

// Bulk-load index from table scan
let entries: Vec<(IndexKey, RowId)> = table.iter()
    .map(|row| (hash_column(&row.indexed_col), row.id))
    .collect();
index.bulk_insert(&entries).unwrap();

// Batch point lookups (e.g., JOIN probe side)
let probe_keys: Vec<IndexKey> = probe_table.iter()
    .map(|row| hash_column(&row.join_col))
    .collect();
let matches = index.bulk_get(&probe_keys).unwrap();

Genomics / Bioinformatics

k-mer counting and sequence matching. Genomic analysis involves billions of short DNA subsequences (k-mers) that need to be counted or looked up — a naturally batch-parallel workload.

use flash_map::FlashMap;

// 32-mer encoded as 8 bytes (2 bits per nucleotide)
type Kmer = u64;
type Count = u64;

let mut kmer_counts: FlashMap<Kmer, Count> =
    FlashMap::with_capacity(100_000_000).unwrap();

// Insert k-mers from a batch of sequence reads
let kmers: Vec<(Kmer, Count)> = extract_kmers(&reads);
kmer_counts.bulk_insert(&kmers).unwrap();

// Query which k-mers from a target sequence exist
let query_kmers: Vec<Kmer> = extract_query_kmers(&target);
let hits = kmer_counts.bulk_get(&query_kmers).unwrap();

Design

Architecture

                    FlashMap<K, V>
                         │
              ┌──────────┴──────────┐
              │                     │
        GPU Backend           CPU Backend
        (cuda feature)     (cpu-fallback feature)
              │
    ┌─────────┼─────────┐
    │         │         │
  d_keys   d_flags   d_values     ← SoA layout on GPU
  [u8]     [u32]     [u8]

SoA (Struct of Arrays): Keys, flags, and values are stored in separate contiguous GPU buffers. This gives coalesced memory access when all threads read flags simultaneously, then keys, then values — instead of strided access through interleaved AoS records.

Linear probing with power-of-2 capacity and bitmask modulo. Cache-friendly on both CPU and GPU.

Identity hash (default): Interprets the first 8 bytes of the key as a u64. Zero compute overhead for keys that are already well-distributed (SHA256 digests, Ed25519 public keys, UUIDs).

MurmurHash3: Available via the builder for keys with poor distribution (sequential integers, low-entropy prefixes).

Constraints

  • Fixed-size keys and values: Both K and V must implement bytemuck::Pod (plain old data — Copy, fixed layout, any bit pattern valid). No String, Vec, or heap-allocated types.
  • Bulk-only API: No single-key get/insert/remove. Wrap in a 1-element slice if needed.
  • 50% max load factor: The table must have at least 2x the capacity of your data. This keeps probe chains short for GPU performance.
  • No duplicate keys per batch: If the same key appears twice in a single bulk_insert call, behavior is non-deterministic on GPU (one will win).

Configuration

use flash_map::{FlashMap, HashStrategy};

let map: FlashMap<[u8; 32], [u8; 64]> = FlashMap::builder(1_000_000)
    .hash_strategy(HashStrategy::Murmur3)  // default: Identity
    .device_id(1)                           // default: 0 (first GPU)
    .force_cpu()                            // skip GPU even if available
    .build()
    .unwrap();

Benchmarks

Run on your hardware:

# CPU fallback
cargo bench

# GPU (requires CUDA 12+)
cargo bench --no-default-features --features cuda

License

Dual-licensed under MIT or Apache-2.0.