Hightower KV
Overview
Hightower KV is a lightweight, embedded key-value store designed for nodes in a custom container orchestration platform. The initial milestone targets single-node deployments with strong consistency and a clear upgrade path to replicated clusters.
Goals
- Fast, append-friendly writes with predictable resource usage.
- Tunable memory footprint and IO characteristics suitable for shared worker nodes.
- Deterministic state machine interface so consensus can be layered on later.
- Integrated authentication helper that remains isolated from core get/put APIs.
Quick Start
Add hightower-kv as a dependency and spin up a SingleNodeEngine:
use ;
Prefer the original single-threaded behaviour? Set worker_threads = 0 and all
submissions will run inline on the caller.
Authentication helper
use ;
Additional runnable samples live under examples/:
- cargo run --example basic_kv demonstrates simple put/get/delete calls.
- cargo run --example auth_flow wires in AuthService, hashes passwords, and issues API keys.
- cargo run --example prefix_queries shows efficient prefix-based key retrieval.
into_argon2_hasher_aes_gcm_auth_service splits the engine into a shared
Arc<SingleNodeEngine> and an AuthService preloaded with the default
Argon2/AES-GCM crypto helpers, so applications can keep using the engine while
serving auth flows without juggling ownership.
Core Decisions
Storage Engine
- Log-structured segments with sequential append-only writes and periodic compaction.
- In-memory hash index for hot keys backed by sparse on-disk segment metadata (including Bloom filters) for cold lookups.
- Copy-on-write rebuild of the index during compaction to avoid blocking reads/writes.
- Snapshot support (serialize_snapshot/restore_snapshot) so recovery avoids replaying the entire log.
- Single-node deployments schedule compaction opportunistically after batches and can be forced via run_compaction_now, automatically emitting snapshots when enabled.
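The append-only write path and copy-on-write compaction described above can be sketched in a few lines. This is a toy model, not the crate's storage code: ToyLog, Location, and the in-memory Vec<u8> standing in for an on-disk segment are all hypothetical names chosen for illustration, and deletes here only drop the index entry instead of appending a tombstone.

```rust
use std::collections::HashMap;

/// Where the newest value for a key lives inside the log buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Location {
    offset: usize,
    length: usize,
}

/// Toy append-only log (hypothetical): writes are only ever appended,
/// and the hash index points at the newest copy of each key.
#[derive(Default)]
struct ToyLog {
    data: Vec<u8>,                     // stands in for an on-disk segment
    index: HashMap<Vec<u8>, Location>, // in-memory hash index
}

impl ToyLog {
    /// Append the value and repoint the index; older copies become garbage.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        let offset = self.data.len();
        self.data.extend_from_slice(value);
        self.index.insert(key.to_vec(), Location { offset, length: value.len() });
    }

    /// Simplification: only the index entry is removed, no tombstone is written.
    fn delete(&mut self, key: &[u8]) {
        self.index.remove(key);
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let loc = self.index.get(key)?;
        Some(&self.data[loc.offset..loc.offset + loc.length])
    }

    /// Copy-on-write compaction: build a fresh log and index that hold
    /// only live values, then swap them in as a single assignment.
    fn compact(&mut self) {
        let mut fresh = ToyLog::default();
        for (key, loc) in &self.index {
            fresh.put(key, &self.data[loc.offset..loc.offset + loc.length]);
        }
        *self = fresh;
    }
}
```

Overwriting a key twice and then calling compact() shrinks the buffer to the live values only, while get keeps returning the newest value before and after the swap.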
Command & State Machine
- Command enum captures set/delete/batch operations. Each command applies deterministically to the KvState state machine.
- KvEngine facade routes commands to storage and provides read contexts; today it invokes storage directly, later it can submit via consensus.
- Writes support batching and configurable fsync cadence to balance durability vs. latency.
- SingleNodeEngine exposes submit_batch for grouped writes and read_with for consistent read snapshots built on the in-memory state.
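To make the determinism requirement concrete, here is a minimal sketch of a command enum applied to a state machine. The type names Command and KvState come from the text above, but the variant shapes, the applied counter, and the replay helper are assumptions made for illustration and may not match the crate's actual definitions.

```rust
use std::collections::BTreeMap;

/// Assumed shape of the command enum: set, delete, and a nested batch.
#[derive(Debug, Clone, PartialEq)]
enum Command {
    Set { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
    Batch(Vec<Command>),
}

/// Assumed shape of the state machine: an ordered map plus a counter
/// of applied leaf commands (the counter is hypothetical).
#[derive(Debug, Default, Clone, PartialEq)]
struct KvState {
    entries: BTreeMap<Vec<u8>, Vec<u8>>,
    applied: u64,
}

impl KvState {
    /// Applying a command depends only on the current state and the
    /// command itself: no clocks, randomness, or IO.
    fn apply(&mut self, command: &Command) {
        match command {
            Command::Set { key, value } => {
                self.entries.insert(key.clone(), value.clone());
                self.applied += 1;
            }
            Command::Delete { key } => {
                self.entries.remove(key);
                self.applied += 1;
            }
            Command::Batch(commands) => {
                for inner in commands {
                    self.apply(inner);
                }
            }
        }
    }
}

/// Hypothetical helper: rebuild a state by replaying a command log in order.
fn replay(commands: &[Command]) -> KvState {
    let mut state = KvState::default();
    for command in commands {
        state.apply(command);
    }
    state
}
```

Because apply has no hidden inputs, replaying the same command log twice yields equal states, which is the property a consensus layer relies on when every replica applies the same log.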
Replication Readiness
- Define neutral traits (CommandSubmitter, SnapshotProvider) under replication.rs that the single-node engine fulfills trivially.
- Plan to swap in a Raft (or similar) implementation without changing storage or higher layers by conforming to these traits.
- Reads currently execute locally; future leader-leases/read-index checks can plug into the existing read context abstraction.
- LocalReplication wraps the single-node engine today, forwarding submit/batch calls and providing snapshot state for downstream consensus modules.
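The trait boundary might look roughly like the sketch below. Only the names CommandSubmitter, SnapshotProvider, and LocalReplication appear in this document; the method signatures, the Op tuple, the log-index return value, and the write_through helper are assumptions, so treat this as an illustration of the pattern, not the contents of replication.rs.

```rust
use std::collections::BTreeMap;

/// Hypothetical operation type: Some(value) sets the key, None deletes it.
type Op = (Vec<u8>, Option<Vec<u8>>);

/// Assumed signature: submit an operation and get back its log index.
trait CommandSubmitter {
    fn submit(&mut self, op: Op) -> u64;
}

/// Assumed signature: export the full state as sorted key/value pairs.
trait SnapshotProvider {
    fn snapshot(&self) -> Vec<(Vec<u8>, Vec<u8>)>;
}

/// Single-node stand-in: applies every operation immediately, with no quorum.
#[derive(Default)]
struct LocalReplication {
    state: BTreeMap<Vec<u8>, Vec<u8>>,
    last_index: u64,
}

impl CommandSubmitter for LocalReplication {
    fn submit(&mut self, op: Op) -> u64 {
        match op {
            (key, Some(value)) => {
                self.state.insert(key, value);
            }
            (key, None) => {
                self.state.remove(&key);
            }
        }
        self.last_index += 1;
        self.last_index
    }
}

impl SnapshotProvider for LocalReplication {
    fn snapshot(&self) -> Vec<(Vec<u8>, Vec<u8>)> {
        self.state.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }
}

/// Higher layers stay generic over the submitter, so a consensus-backed
/// implementation could replace LocalReplication without touching them.
fn write_through<S: CommandSubmitter>(submitter: &mut S, key: &str, value: &str) -> u64 {
    submitter.submit((key.as_bytes().to_vec(), Some(value.as_bytes().to_vec())))
}
```

The design choice worth noting is that write_through never names LocalReplication: callers depend on the trait, which is what would let a Raft-backed submitter slot in later.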
Indexing Strategy
- Primary in-memory map: key -> (segment_id, offset, length, version) with configurable load factor.
- Radix trie-based prefix index maintained alongside the primary map for efficient prefix queries with O(prefix_length + results) lookups.
- Segment-level sparse index + Bloom filter to accelerate cold lookups and enable promotion of entries back into the hot map.
- Background compactor regenerates segment metadata and rebuilds both indexes atomically.
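The prefix lookup can be illustrated with a simplified trie. Note the substitution: this sketch uses an uncompressed byte-per-node trie, not the path-compressed radix trie the crate describes, and PrefixIndex, TrieNode, and keys_with_prefix are hypothetical names. The lookup shape is the same: walk the prefix first, then enumerate only the subtree beneath it.

```rust
use std::collections::BTreeMap;

/// One node per key byte (a real radix trie would merge single-child chains).
#[derive(Default)]
struct TrieNode {
    children: BTreeMap<u8, TrieNode>,
    terminal: bool, // true if a stored key ends at this node
}

#[derive(Default)]
struct PrefixIndex {
    root: TrieNode,
}

impl PrefixIndex {
    fn insert(&mut self, key: &[u8]) {
        let mut node = &mut self.root;
        for &byte in key {
            node = node.children.entry(byte).or_default();
        }
        node.terminal = true;
    }

    /// Walk down the prefix, then collect every key in that subtree.
    /// Keys outside the prefix are never visited.
    fn keys_with_prefix(&self, prefix: &[u8]) -> Vec<Vec<u8>> {
        let mut node = &self.root;
        for byte in prefix {
            match node.children.get(byte) {
                Some(child) => node = child,
                None => return Vec::new(),
            }
        }
        let mut out = Vec::new();
        let mut path = prefix.to_vec();
        collect(node, &mut path, &mut out);
        out
    }
}

/// Depth-first traversal; BTreeMap ordering makes the output sorted.
fn collect(node: &TrieNode, path: &mut Vec<u8>, out: &mut Vec<Vec<u8>>) {
    if node.terminal {
        out.push(path.clone());
    }
    for (&byte, child) in &node.children {
        path.push(byte);
        collect(child, path, out);
        path.pop();
    }
}
```

With keys such as auth/user/1, auth/user/2, and app/config inserted, a query for auth/user/ descends ten nodes and then touches only the two matching leaves, which is where the advantage over scanning the whole key set comes from.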
Authentication Layer
- Separate AuthService module that depends only on the KvEngine plus crypto traits.
- Key space namespaced under auth/* (e.g., auth/user/<id>, auth/apikey/<id>) with secondary indexes for lookups (auth/user_by_name/<username>).
- Passwords/API keys stored as hashes using configurable SecretHasher (Argon2/Bcrypt). Encrypted metadata blobs handled by EnvelopeEncryptor (e.g., AES-GCM).
- API produces helpers like create_user, verify_password, create_api_key, each maintaining indexes via batched writes.
- Metadata passed to create_user_with_metadata/create_api_key_with_metadata is envelope-encrypted and stored alongside the record (UserRecord.metadata, ApiKeyRecord.metadata), ensuring sensitive attributes never hit disk in plaintext; callers can recover it through the service-level decrypt helpers without touching cipher primitives.
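The key layout and the "maintain indexes via batched writes" idea can be sketched as follows. The key patterns come from the list above; everything else is assumed for illustration: the helper names, the string-encoded record, and the BTreeMap standing in for the engine. No real hashing happens here; the password hash is an opaque placeholder the caller supplies.

```rust
use std::collections::BTreeMap;

/// Primary record key: auth/user/<id>.
fn user_key(id: &str) -> String {
    format!("auth/user/{}", id)
}

/// Secondary index key: auth/user_by_name/<username>.
fn user_by_name_key(username: &str) -> String {
    format!("auth/user_by_name/{}", username)
}

/// Hypothetical helper: build the writes for a new user as one batch so
/// the record and its index entry are applied together.
/// The "username:hash" record encoding is a placeholder, not the crate's format.
fn create_user_batch(id: &str, username: &str, password_hash: &str) -> Vec<(String, String)> {
    vec![
        (user_key(id), format!("{}:{}", username, password_hash)),
        (user_by_name_key(username), id.to_string()),
    ]
}

/// Stand-in for a batched engine write.
fn apply_batch(store: &mut BTreeMap<String, String>, batch: Vec<(String, String)>) {
    for (key, value) in batch {
        store.insert(key, value);
    }
}

/// Two-step lookup: resolve the username to an id via the secondary
/// index, then fetch the primary record.
fn find_user_by_name<'a>(
    store: &'a BTreeMap<String, String>,
    username: &str,
) -> Option<&'a String> {
    let id = store.get(&user_by_name_key(username))?;
    store.get(&user_key(id))
}
```

Writing both entries in a single batch is what keeps the secondary index from pointing at a record that was never stored, or the reverse, if a write fails partway.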
Configuration & Telemetry
- Central config.rs defines storage paths, compaction thresholds, flush cadence, auth crypto settings.
- StoreConfig::emit_snapshot_after_compaction controls whether scheduled compaction writes a fresh snapshot.
- metrics.rs provides hooks for counters/timers so operational visibility stays consistent when clustering arrives.
Tooling & Guides
- htkvctl (in src/bin/htkvctl.rs) offers basic maintenance commands for manual compaction, segment inspection, and listing auth users.
- DEVELOPERS.md documents storage/auth/replication extension points for contributors building on the engine.
Benchmark Summary
Criterion benchmarks (cargo bench) measured on a single core, using tempdir
storage with snapshot emission enabled. Host machine specs:
- Intel Core Ultra 7 265F (20 cores / 20 threads, up to 6.5 GHz boost)
- 128 GiB system memory
- Linux 6.16.8-arch3-1 (Arch Linux)
- Bench data directory on local NVMe SSD
Results:
| Scenario | Payload | Throughput / time |
|---|---|---|
| engine_writes/put_1k/small | 16 B keys / 64 B values | ~0.26 M ops/s |
| engine_writes/put_1k/medium | 16 B keys / 512 B values | ~0.14 M ops/s |
| engine_writes/put_1k/large | 32 B keys / 4 KB values | ~0.04 M ops/s |
| engine_reads/get_4k/hot | 4 096 cached gets | ~2.9 M ops/s |
| engine_reads/get_4k/cold | 4 096 gets after reopening | ~3.2 M ops/s |
| compaction/run_compaction_now | 20 K writes + deletes | ~118 ms/run |
| prefix_queries/get_prefix/1000 | 1 000 matches from 10 K keys | ~88 K ops/s |
| prefix_vs_scan/prefix_query | Prefix query (1 K of 10 K) | 11.3 ms |
| prefix_vs_scan/manual_scan | Manual iteration (1 K of 10 K) | 13.1 ms |
Prefix queries are ~14% faster than manual scanning with negligible write overhead.
Multi-threaded write scaling (4096 puts, 32 B keys / 256 B values):
| Worker mode | Throughput |
|---|---|
| Inline (worker_threads = 0) | ~0.40 M ops/s |
| 1 worker | ~0.22 M ops/s |
| 2 workers | ~0.33 M ops/s |
| 4 workers | ~0.51 M ops/s |
| 8 workers | ~0.47 M ops/s |
Full reports (histograms, regression analysis) live under target/criterion/.
File Layout
Each logical component lives in its own file in a flat module structure:
- lib.rs – module declarations and re-exports.
- config.rs – runtime configuration knobs.
- error.rs – shared error types and Result alias.
- command.rs – command definitions and serialization.
- state.rs – deterministic KV state machine + snapshot helpers.
- engine.rs – KvEngine trait, single-node engine, batching, read contexts.
- storage.rs – storage facade orchestrating log/index/compaction.
- log_segment.rs – segment IO, sparse index, Bloom filters.
- index.rs – in-memory index management and rebuild logic.
- prefix_index.rs – radix trie for efficient prefix queries.
- compactor.rs – background compaction pipeline and atomically swapping state.
- snapshot.rs – checkpoint format and persistence helpers.
- replication.rs – future-facing consensus traits with single-node stubs.
- id_generator.rs – ID/token generation utilities.
- crypto.rs – SecretHasher/EnvelopeEncryptor abstractions + implementations.
- auth_types.rs – user and API-key record structs, serialization helpers.
- auth_service.rs – standalone authentication interface backed by KV storage.
- metrics.rs – instrumentation helpers.
- tests.rs – integration smoke tests for engine and auth flows.
Testing Philosophy
- Favor small, single-purpose functions that are easy to reason about and validate in isolation.
- Every function should come with direct unit coverage; each component/module should also expose higher-level tests that exercise its public surface.
- Prefer deterministic tests by injecting traits or mocks for IO, crypto, and timing dependencies.
- Maintain fast test execution to encourage frequent runs; heavier integration scenarios live alongside the crate’s integration suite.
Future Work
- Integrate a consensus module (likely Raft) implementing CommandSubmitter to enable clustered deployments.
- Leader-read optimizations once consensus is in place (read index, leases).
- Optional range-scan support via sorted segment variants if workloads demand it.
- Extended auth features (permissions, audit trails) once use cases are clarified.
- Operational tooling: benchmarks, chaos testing hooks, and CLI inspector.