fast-cache 0.1.0

Embedded-first thread-per-core in-memory cache with optional Redis-compatible server

fast-cache

fast-cache is an embedded-first, in-memory key-value database. The default crate build exposes a Rust API for direct in-process use. The optional server feature builds fast-cache-server, a Redis-compatible TCP server with WAL and snapshot persistence.

The default build uses conservative safe memory paths. Reviewed lower-overhead paths are available only when the unsafe feature is enabled.

Embedded Use

Use [storage::EmbeddedStore] when several threads should share one cache handle. Keys and values are byte vectors. TTL arguments are in milliseconds; pass None for values that should never expire.

use fast_cache::storage::EmbeddedStore;

let cache = EmbeddedStore::new(16);

cache.set(b"user:42".to_vec(), b"ready".to_vec(), None);
assert_eq!(cache.get(b"user:42"), Some(b"ready".to_vec()));

cache.delete(b"user:42");
assert!(!cache.exists(b"user:42"));

Batch APIs group routing work and return results in the same order as the supplied keys:

use fast_cache::storage::EmbeddedStore;

let cache = EmbeddedStore::new(4);
cache.batch_set(
    vec![
        (b"alpha".to_vec(), b"one".to_vec()),
        (b"beta".to_vec(), b"two".to_vec()),
    ],
    None,
);

let values = cache.batch_get(vec![b"alpha".to_vec(), b"missing".to_vec()]);
assert_eq!(values[0], Some(b"one".to_vec()));
assert_eq!(values[1], None);

Use [storage::LocalEmbeddedStore] when each worker owns its shards and calls the cache through an exclusive &mut handle. This is the lowest-overhead Rust API and is enabled by the default sharded feature.

This ownership model is intentionally different from shared reference-counted caches such as DashMap. Shared caches let every worker clone one map handle and access any key through internal synchronization. fast-cache instead routes each key to a shard and gives the owning worker exclusive access to that shard's cache slab. Use EmbeddedStore to build and route the cache, then split it with into_local_stores when workers are pinned.

Use [storage::SharedEmbeddedStore] when every worker needs to clone one handle and reach every key. It is cache-padded and lock-striped like DashMap, while still sharing the same embedded shard implementation. EmbeddedStore, SharedEmbeddedStore, and LocalEmbeddedStore all require power-of-two shard counts so routing can use shift-based striping.
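
The power-of-two requirement can be illustrated with a minimal routing sketch. This is not the crate's routing code: the helper names are hypothetical, std's DefaultHasher stands in for whatever hash fast-cache uses, and taking the top bits of the hash is an assumption about what "shift-based" means here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical stand-in hash; the crate's real hash function is not shown in this README.
fn hash_key(key: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(key);
    hasher.finish()
}

/// Maps a 64-bit hash to a shard index using a single shift.
/// With 2^n shards, the top n bits of the hash select the shard,
/// so no division or modulo is needed.
fn shard_for_hash(hash: u64, shard_count: usize) -> usize {
    assert!(
        shard_count.is_power_of_two(),
        "shard count must be a power of two"
    );
    let bits = shard_count.trailing_zeros();
    if bits == 0 {
        // A single shard: shifting a u64 by 64 would overflow, so short-circuit.
        return 0;
    }
    (hash >> (64 - bits)) as usize
}

/// Convenience wrapper: hash the key, then route it.
fn shard_for_key(key: &[u8], shard_count: usize) -> usize {
    shard_for_hash(hash_key(key), shard_count)
}
```

A shard count that is not a power of two would need a modulo or a range reduction instead of one shift, which is presumably why all three store types reject it.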

use fast_cache::storage::{
    EmbeddedRouteMode, EmbeddedStore, LocalEmbeddedStoreBootstrap,
};

let shared = EmbeddedStore::with_route_mode(4, EmbeddedRouteMode::FullKey);
let mut stores = LocalEmbeddedStoreBootstrap::from_embedded(shared, 1).into_stores();
let mut local = stores.pop().expect("one worker store");

local.set(b"local-key".to_vec(), b"value".to_vec(), None);
assert_eq!(local.get(b"local-key"), Some(b"value".to_vec()));
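
The owner-local model can be pictured without the crate at all. The sketch below uses only std: each worker thread owns a plain HashMap and mutates it through &mut, so no lock or atomic is involved on the access path. LocalShard and run_workers are hypothetical names for illustration and do not reflect how LocalEmbeddedStore is implemented.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical worker-owned shard: a plain map with no internal synchronization.
struct LocalShard {
    entries: HashMap<Vec<u8>, Vec<u8>>,
}

impl LocalShard {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Exclusive access: the borrow checker guarantees no other thread touches this shard.
    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.entries.insert(key, value);
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.entries.get(key).map(Vec::as_slice)
    }
}

/// Spawns `worker_count` threads; each one creates and owns its shard,
/// writes 100 keys into it, and reports how many entries it holds.
fn run_workers(worker_count: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..worker_count)
        .map(|worker| {
            thread::spawn(move || {
                let mut shard = LocalShard::new();
                for i in 0..100u32 {
                    let key = format!("w{worker}:k{i}").into_bytes();
                    shard.set(key, i.to_le_bytes().to_vec());
                }
                shard.entries.len()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker panicked"))
        .collect()
}
```

The trade-off is that a key is only reachable from the worker that owns its shard, which is why the shared store types exist alongside this one.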

Session APIs keep related KV-cache chunks on the same route and can pack values into contiguous buffers:

use fast_cache::storage::{EmbeddedStore, PackedSessionWrite};

let cache = EmbeddedStore::new(4);
let mut write = PackedSessionWrite::with_capacity(b"session:1".to_vec(), 2, 16);

write.push_owned_record(b"session:1:layer:0".to_vec(), b"kv0".to_vec());
write.push_owned_record(b"session:1:layer:1".to_vec(), b"kv1".to_vec());
cache.batch_set_session_packed_no_ttl(write);

let keys = vec![
    b"session:1:layer:0".to_vec(),
    b"session:1:layer:1".to_vec(),
];
let batch = cache.batch_get_session_packed(b"session:1", &keys);

assert!(batch.all_hit());
assert_eq!(batch.total_bytes(), 6);
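
To show what "packing values into contiguous buffers" can look like, here is a minimal std-only sketch. PackedValues is a hypothetical type: it stores every value back to back in one byte buffer and keeps an (offset, length) span per record. The crate's PackedBatch layout is not documented here and may differ.

```rust
/// Hypothetical packed result: one contiguous buffer plus a span per requested key.
/// A `None` span marks a miss.
struct PackedValues {
    buffer: Vec<u8>,
    spans: Vec<Option<(usize, usize)>>,
}

impl PackedValues {
    fn with_capacity(records: usize, bytes: usize) -> Self {
        Self {
            buffer: Vec::with_capacity(bytes),
            spans: Vec::with_capacity(records),
        }
    }

    /// Appends a value to the end of the buffer and records where it lives.
    fn push_hit(&mut self, value: &[u8]) {
        let offset = self.buffer.len();
        self.buffer.extend_from_slice(value);
        self.spans.push(Some((offset, value.len())));
    }

    /// Records a miss without touching the buffer.
    fn push_miss(&mut self) {
        self.spans.push(None);
    }

    /// Borrows the value at `index` directly out of the shared buffer.
    fn get(&self, index: usize) -> Option<&[u8]> {
        let (offset, len) = (*self.spans.get(index)?)?;
        Some(&self.buffer[offset..offset + len])
    }

    fn all_hit(&self) -> bool {
        self.spans.iter().all(Option::is_some)
    }

    fn total_bytes(&self) -> usize {
        self.buffer.len()
    }
}
```

One allocation holds the whole batch, so a reader can hand the buffer to a single copy or transfer call instead of walking many small vectors.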

Embedded Benchmark Highlights

The embedded API is optimized for thread-per-core deployments. The lowest overhead path is [storage::LocalEmbeddedStore], where each worker owns its shards and calls the cache through an exclusive &mut handle. For applications that need a cloneable shared handle, [storage::SharedEmbeddedStore] provides lock-striped access with the same embedded shard implementation.

Current Linux release benchmarks use pinned workers, 100k keys for small values, 10s measured runs, and latency sampling disabled for max-throughput rows. The fc-embed rows below are direct owner-local embedded stores with no TTL and no eviction:

Value  Mix    fast-cache direct  DashMap        Moka
64B    GET    422.82M ops/s      54.84M ops/s   2.65M ops/s
64B    SET    114.85M ops/s      35.40M ops/s   1.51M ops/s
64B    80/20  253.88M ops/s      45.23M ops/s   4.52M ops/s
4KiB   80/20  21.03M ops/s       6.90M ops/s    3.77M ops/s

Capacity-bounded rows stress a different path. With LRU enabled, 64B, read-only, 16 workers, and 25% resident capacity, fc-embed reaches 473.7M ops/s. On large write-heavy LRU workloads, Moka can be faster because value materialization and eviction bookkeeping, rather than the small-value shard hot path, dominate the cost. For example, a 64KiB write-only LRU row measured fast-cache at around 19.1 GB/s and Moka at around 33.3 GB/s.

Treat these as workload-specific reference points, not universal constants. The full embedded matrix, LRU/TTL rows, CSV artifact paths, and reproduction commands live in benchmarks/FAST_CACHE_EMBEDDED_RELEASE.md.

API Map

The most commonly used Rust APIs live in [storage]:

  • [storage::EmbeddedStore]: shared, sharded store for byte-string keys.
  • [storage::SharedEmbeddedStore]: cloneable, lock-striped embedded store for cross-worker shared handles.
  • [storage::LocalEmbeddedStore]: worker-local store for thread-per-core workers.
  • [storage::EmbeddedRouteMode], [storage::EmbeddedKeyRoute], and [storage::PreparedPointKey]: routing and precomputed lookup helpers.
  • [storage::PackedBatch] and [storage::PackedSessionWrite]: contiguous batch read/write payloads.
  • [storage::TierStatsSnapshot], [storage::ShardStatsSnapshot], and [storage::WalStatsSnapshot]: runtime statistics.

Core key/value methods include set, set_value_bytes, batch_set, get, get_view, batch_get, batch_get_view, batch_get_packed, delete, exists, ttl_seconds, pttl_millis, expire, persist, len, key_snapshot, stored_bytes, stats_snapshot, and process_maintenance.

Session-oriented methods include batch_set_session_owned_no_ttl, batch_set_session_packed_no_ttl, batch_get_session, batch_get_session_view, batch_get_session_packed, prepare_point_key, and their routed or prehashed variants.

Redis object helpers are exposed on [storage::EmbeddedStore] for hashes, lists, sets, and sorted sets. They use Redis-style wrong-type behavior through [storage::RedisObjectResult]. The public method families are hset/hget and related hash methods, lpush/rpush/lrange and related list methods, sadd/srem/smembers and related set methods, and zadd/zrange/zscore and related sorted-set methods.
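
"Redis-style wrong-type behavior" means that an operation for one object type fails when the key already holds a different type, rather than overwriting it. The sketch below illustrates that rule with hypothetical types (Value, ObjectError, rpush over a HashMap); it is not the crate's RedisObjectResult or its method signatures.

```rust
use std::collections::HashMap;

/// Hypothetical stored value: either a plain byte string or a list.
enum Value {
    Bytes(Vec<u8>),
    List(Vec<Vec<u8>>),
}

/// Hypothetical error mirroring Redis's WRONGTYPE reply.
#[derive(Debug, PartialEq)]
enum ObjectError {
    WrongType,
}

/// Appends `item` to the list at `key`, creating the list if the key is absent.
/// Returns the new list length, or `WrongType` if the key holds a non-list value.
fn rpush(
    db: &mut HashMap<Vec<u8>, Value>,
    key: &[u8],
    item: Vec<u8>,
) -> Result<usize, ObjectError> {
    match db
        .entry(key.to_vec())
        .or_insert_with(|| Value::List(Vec::new()))
    {
        Value::List(items) => {
            items.push(item);
            Ok(items.len())
        }
        // The existing value is left untouched when the type does not match.
        _ => Err(ObjectError::WrongType),
    }
}
```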

Other modules:

  • [config]: FastCacheConfig, EvictionPolicy, tier sizing, persistence configuration, and TOML load/store helpers.
  • [protocol]: RESP and native fast protocol codecs.
  • [persistence]: snapshot loading/writing and WAL runtime support.
  • [cuda]: GPU-facing configuration and transfer descriptors.
  • [server]: TCP listener and connection handling, available with the server feature.

Server Use

Install and run the optional server binary:

cargo install fast-cache --features server --locked
fast-cache-server --data-dir ./var/fast-cache

From a checkout:

cargo run -p fast-cache --features server --bin fast-cache-server -- --data-dir ./var/fast-cache

The server listens on 127.0.0.1:6380 by default and accepts RESP clients:

printf '*1\r\n$4\r\nPING\r\n' | nc 127.0.0.1 6380
printf '*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n' | nc 127.0.0.1 6380
printf '*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n' | nc 127.0.0.1 6380
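
The printf payloads above are RESP arrays of bulk strings: "*" plus the argument count, then for each argument "$" plus its byte length, the bytes, and a trailing CRLF. A small encoder makes the framing explicit. encode_resp_command is a hypothetical helper written for this README, not a function exported by the crate's protocol module.

```rust
/// Encodes one command as a RESP array of bulk strings.
/// Each argument is length-prefixed, so binary-safe keys and values work unchanged.
fn encode_resp_command(args: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    // Array header: number of arguments, including the command name.
    out.extend_from_slice(format!("*{}\r\n", args.len()).as_bytes());
    for arg in args {
        // Bulk string header: byte length of this argument.
        out.extend_from_slice(format!("${}\r\n", arg.len()).as_bytes());
        out.extend_from_slice(arg);
        out.extend_from_slice(b"\r\n");
    }
    out
}
```

Calling it with the arguments SET, foo, and bar yields the same bytes as the second printf line, which could then be written to a TcpStream connected to 127.0.0.1:6380.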

The current server command catalog implements the redesigned string-key hot path for GET and SET. Additional RESP commands should be added through the per-command module pattern in src/commands/README.md so parser, storage, and direct-server behavior stay local to the command.

Configuration

Load configuration from TOML with [config::FastCacheConfig]:

use std::path::Path;

use fast_cache::config::FastCacheConfig;

fn main() -> fast_cache::Result<()> {
    let config = FastCacheConfig::load_from_path(Path::new("fast-cache.toml"))?;
    config.validate()?;
    Ok(())
}

The repository includes fast-cache.toml.example with the supported fields.

WAL TCP Export

The server can stream live WAL frames in addition to writing segments to disk. Two modes are supported:

  • connect: fast-cache connects to one downstream collector.
  • listen: fast-cache binds a subscription port and fans out live frames to authenticated subscribers.

[persistence.tcp_export]
enabled = true
mode = "listen"
addr = "127.0.0.1:7630"
auth_token = "replace-me"
channel_capacity = 16384
max_subscribers = 64
backpressure_on_full = false

The stream uses the same framed WAL bytes as disk segments: FCW2 magic, flags, payload length, payload, and CRC. This is a live export path, not a catch-up or replay API; disk WAL segments and snapshots remain the recovery source. With backpressure_on_full = false, a slow TCP exporter can drop live export frames while disk WAL append continues. Set it to true only when the export consumer is allowed to backpressure writes. Auth tokens are plaintext inside the TCP stream, so use localhost, a private network, or a TLS tunnel across trust boundaries.
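
The effect of backpressure_on_full can be sketched with a bounded std channel standing in for the export queue. This is an illustration of the two policies only: offer_frame, export_channel, and ExportOutcome are hypothetical names, and the server's real queue type is not described in this README.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// What happened to one live frame offered to the export queue.
#[derive(Debug, PartialEq)]
enum ExportOutcome {
    Queued,
    Dropped,
    Disconnected,
}

/// Creates a bounded queue, analogous in spirit to `channel_capacity`.
fn export_channel(capacity: usize) -> (SyncSender<Vec<u8>>, Receiver<Vec<u8>>) {
    sync_channel(capacity)
}

/// Offers a frame to the exporter.
/// With backpressure enabled the caller blocks until a slot frees up;
/// without it, a full queue drops the frame and the caller continues.
fn offer_frame(
    tx: &SyncSender<Vec<u8>>,
    frame: Vec<u8>,
    backpressure_on_full: bool,
) -> ExportOutcome {
    if backpressure_on_full {
        match tx.send(frame) {
            Ok(()) => ExportOutcome::Queued,
            Err(_) => ExportOutcome::Disconnected,
        }
    } else {
        match tx.try_send(frame) {
            Ok(()) => ExportOutcome::Queued,
            Err(TrySendError::Full(_)) => ExportOutcome::Dropped,
            Err(TrySendError::Disconnected(_)) => ExportOutcome::Disconnected,
        }
    }
}
```

In the non-blocking branch the write path never waits on the consumer, which matches the statement that disk WAL append continues while live frames may be lost.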

Native Replication

Native replication is separate from WAL and Redis PSYNC. It ships storage-level mutation batches for async read replicas and service subscribers:

[replication]
enabled = true
role = "primary"
bind_addr = "127.0.0.1:7631"
auth_token = "replace-me"
compression = "none"
zstd_level = 3
send_policy = "batch"
batch_max_records = 64
batch_max_bytes = 262144
batch_max_delay_us = 250
backlog_bytes = 67108864
snapshot_chunk_bytes = 1048576

send_policy = "immediate" flushes every write as a one-record mutation batch. send_policy = "batch" flushes by record count, byte size, or delay, whichever limit is reached first.

Replication defaults to compression = "none" because realistic write-sync payloads usually compress poorly enough that zstd costs more CPU than it saves on the hot path. Use compression = "zstd" only when bandwidth is the limiting resource and benchmark data for the payload shape justifies it.

Primary export runs through shard-local queues and batchers. This keeps write sync aligned with fast-cache's owned-shard architecture: a saturated replication lane can backpressure its shard without centralizing all writes behind one global replication queue. Subscribers still receive one FCRP frame stream. Replicas track per-shard sequence watermarks so they can apply mutations idempotently and catch up from backlog or snapshot-plus-delta.
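
The three flush triggers of send_policy = "batch" (batch_max_records, batch_max_bytes, batch_max_delay_us) can be sketched as a small std-only batcher. MutationBatcher and BatchLimits are hypothetical names, records are modeled as opaque byte vectors, and the caller passes the current time explicitly so the logic is easy to test. The crate's shard-local batchers may be structured differently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical limits mirroring the three batch_max_* settings.
struct BatchLimits {
    max_records: usize,
    max_bytes: usize,
    max_delay: Duration,
}

/// Accumulates records until one of the limits is reached.
struct MutationBatcher {
    limits: BatchLimits,
    records: Vec<Vec<u8>>,
    bytes: usize,
    oldest: Option<Instant>,
}

impl MutationBatcher {
    fn new(limits: BatchLimits) -> Self {
        Self { limits, records: Vec::new(), bytes: 0, oldest: None }
    }

    /// Adds one record; returns the batch if the count or byte limit is now reached.
    fn push(&mut self, record: Vec<u8>, now: Instant) -> Option<Vec<Vec<u8>>> {
        if self.oldest.is_none() {
            // Remember when the first pending record arrived.
            self.oldest = Some(now);
        }
        self.bytes += record.len();
        self.records.push(record);
        if self.records.len() >= self.limits.max_records
            || self.bytes >= self.limits.max_bytes
        {
            return Some(self.take());
        }
        None
    }

    /// Timer tick: flushes if the oldest pending record has waited at least max_delay.
    fn poll(&mut self, now: Instant) -> Option<Vec<Vec<u8>>> {
        match self.oldest {
            Some(start) if now.duration_since(start) >= self.limits.max_delay => {
                Some(self.take())
            }
            _ => None,
        }
    }

    /// Hands out the pending records and resets the counters.
    fn take(&mut self) -> Vec<Vec<u8>> {
        self.bytes = 0;
        self.oldest = None;
        std::mem::take(&mut self.records)
    }
}
```

In this model, send_policy = "immediate" behaves like max_records set to 1: every push returns a one-record batch.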

Feature Flags

  • embedded: default embedded Rust database API.
  • sharded: default sharded storage and owner-local embedded API.
  • server: builds the Redis-compatible fast-cache-server binary.
  • monoio: enables the Linux-only server runtime selected with FAST_CACHE_USE_MONOIO=1. The server still uses bytes-handoff for connection read buffering, using its monoio adapter on Linux. With FAST_CACHE_DIRECT_SHARD_PORTS=1, the server also binds one listener per shard, starting at FAST_CACHE_DIRECT_SHARD_BASE_PORT or the fanout port + 1, so direct clients can route while fanout RESP/FCNP stays available. Monoio writer experiments are selected with FAST_CACHE_MONOIO_SAFE_WRITER=inline|split|writev; Tokio remains the portable default runtime.
  • telemetry: integrates with fast-telemetry.
  • cuda: exposes GPU-facing configuration and transfer descriptors.
  • fast-point-map: enables the experimental point-map storage path.
  • unsafe: opts into reviewed unsafe hot paths for lower overhead.

Safety

The unsafe feature keeps the same public API while enabling reviewed hot paths for server I/O, protocol codecs, flat-map indexing, and owner-local read views. See SAFETY.md for the unsafe inventory, invariants, and safe fallbacks.

License

Apache-2.0. See the repository LICENSE file.