
Kismet implements multiprocess, lock-free,[1] crash-safe, and (roughly) bounded persistent caches stored in filesystem directories, with a Second Chance eviction strategy. The maintenance logic is batched and invoked at periodic jittered intervals to make sure accesses amortise to a constant number of filesystem system calls and logarithmic (in the number of cached files) time complexity. That’s good for performance, and enables lock-freedom,[2] but does mean that caches are expected to temporarily grow past their capacity limits, although rarely by more than a factor of 2 or 3.

In addition to constant per-cache space overhead, each Kismet cache maintains a variable-length std::path::PathBuf for the directory, one byte of lock-free metadata per shard, and no other non-heap resource (i.e., Kismet caches do not hold on to file objects). This holds for individual cache directories; when stacking multiple caches in a Cache or ReadOnlyCache, the read-write cache and all constituent read-only caches will each have their own PathBuf and per-shard metadata.

When a Kismet cache triggers second chance evictions, it will allocate temporary data. That data’s size is proportional to the number of files in the cache shard subdirectory undergoing eviction (or the whole directory for a plain unsharded cache), and includes a copy of the basename (without the path prefix) for each cached file in the subdirectory (or plain cache directory). This eviction process is linearithmic-time in the number of files in the cache subdirectory (directory), and is invoked periodically, so as to amortise the maintenance overhead to logarithmic (in the total number of files in the subdirectory) time per write to a cache subdirectory, and constant file operations per write.

Kismet does not pre-allocate any long-lived file object, so it may need to temporarily open file objects. Each call nevertheless bounds the number of concurrently allocated file objects; the current logic never allocates more than two concurrent file objects.

The load (number of files) in each cache may exceed the cache’s capacity because there is no centralised accounting, except for what filesystems provide natively. This design choice forces Kismet to amortise maintenance calls with randomisation, but also means that any number of threads or processes may safely access the same cache directories without any explicit synchronisation.

Filesystems can’t be trusted to provide much; Kismet only relies on file modification times (mtime), and on file access times (atime) that are either less than or equal to the mtime, or greater than the mtime (i.e., relatime is acceptable). This implies that cached files should not be linked in multiple Kismet cache directories. It is however safe to hardlink cached files in multiple places, as long as the files are not modified, or their mtime otherwise updated, through these non-Kismet links.

Plain and sharded caches

Kismet cache directories are plain (unsharded) or sharded.

Plain Kismet caches are simply directories where the cache entry for “key” is the file named “key.” These are most effective for read-only access to cache directories managed by some other process, or for small caches of up to ~100 cached files.
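
As a concrete illustration of that layout, here is a minimal sketch of a non-Kismet reader, assuming a hypothetical plain cache directory at /tmp/plain_cache: the entry for a key is just the file with that name.

use std::fs::File;
use std::io::{Read, Result};

// Read the cached value for `key` directly from a plain cache
// directory.  Opening the file also refreshes its atime, which is
// the read-tracking signal second chance eviction relies on.
fn read_plain_entry(key: &str) -> Result<Vec<u8>> {
    let mut file = File::open(format!("/tmp/plain_cache/{}", key))?;
    let mut contents = Vec::new();
    file.read_to_end(&mut contents)?;
    Ok(contents)
}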

Sharded caches scale to higher capacities, by indexing into one of a constant number of shard subdirectories with a hash, and letting each shard manage fewer files (ideally 10-100 files). They are also much less likely to grow to multiples of the target capacity than plain (unsharded) cache directories.

Simple usage should be covered by the ReadOnlyCache or Cache structs, which wrap plain::Cache and sharded::Cache in a convenient type-erased interface.

While the cache code syncs cached data files by default, it does not sync the parent cache directories: we assume that it’s safe, if unfortunate, for caches to lose data or revert to an older state after kernel or hardware crashes. In general, the code attempts to be robust against direct manipulation of the cache directories. It’s always safe to delete cache files from kismet directories (ideally not recently created files in .kismet_temp subdirectories), and even adding files should mostly do what one expects: they will be picked up if they’re in the correct place (in a plain unsharded cache directory or in the correct shard subdirectory), and eventually evicted if useless or in the wrong shard.

It is however essential to only publish files atomically to the cache directories, and it probably never makes sense to modify cached file objects in place. In fact, Kismet always sets files read-only before publishing them to the cache and always returns read-only std::fs::File objects for cached data.
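
For external programs that write directly to a cache directory, the documented protocol (write to a temporary file, mark it read-only, atomically rename into place) can be sketched as follows; this is only an illustrative sketch under those assumptions, not Kismet’s actual insertion code:

use std::fs;
use std::io::{Result, Write};
use std::path::Path;

// Atomically publish `bytes` as the cache entry for `key` in a
// plain cache directory.  A real writer would pick a unique
// temporary file name to avoid colliding with concurrent writers.
fn publish(cache_dir: &Path, key: &str, bytes: &[u8]) -> Result<()> {
    let temp = cache_dir.join(".kismet_temp").join(key);

    // Write the full contents to a temporary file first...
    let mut file = fs::File::create(&temp)?;
    file.write_all(bytes)?;
    file.sync_all()?;

    // ...mark it read-only, since cached files must never be
    // modified in place...
    let mut permissions = file.metadata()?.permissions();
    permissions.set_readonly(true);
    fs::set_permissions(&temp, permissions)?;

    // ...and only then rename it to its final name: readers see
    // either the complete read-only file, or no file at all.
    fs::rename(&temp, cache_dir.join(key))
}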

Sample usage

One could access a list of read-only caches with a ReadOnlyCache.

const NUM_SHARDS: usize = 10;

let read_only = kismet_cache::ReadOnlyCacheBuilder::new()
    .plain("/tmp/plain_cache")  // Read first here
    .sharded("/tmp/sharded_cache", NUM_SHARDS)  // Then try there.
    .take()
    .build();

// Attempt to read the file for key "foo", with primary hash 1
// and secondary hash 2, first from `/tmp/plain_cache`, and then
// from `/tmp/sharded_cache`.  In practice, the hashes should
// probably be populated by implementing the `From<&'a T>`
// trait, and passing a `&T` to the cache methods.
read_only.get(kismet_cache::Key::new("foo", 1, 2));

Read-write accesses should use a Cache:

struct CacheKey {
  // ...
}

fn get_contents(key: &CacheKey) -> Vec<u8> {
  // ...
}

impl<'a> From<&'a CacheKey> for kismet_cache::Key<'a> {
  fn from(key: &CacheKey) -> kismet_cache::Key {
    // ...
  }
}


// It's easier to increase the capacity than the number of shards,
// so, when in doubt, prefer a few too many shards with a lower
// capacity.  It's not incorrect to increase the number of shards,
// but will result in lost cached data (eventually deleted), since
// Kismet does not assign shards with a consistent hash.
const NUM_SHARDS: usize = 100;
const CAPACITY: usize = 1000;

use std::io::Read;
use std::io::Write;

let cache = kismet_cache::CacheBuilder::new()
    .sharded_writer("/tmp/root_cache", NUM_SHARDS, CAPACITY)
    .plain_reader("/tmp/extra_cache")  // Try to fill cache misses here
    .take()
    .build();

let key: CacheKey = // ...
    ;

// Fetches the current cached value for `key`, or populates it with
// the closure argument if missing.
let mut cached_file = cache
    .ensure(&key, |file| file.write_all(&get_contents(&key)))?;
let mut contents = Vec::new();
cached_file.read_to_end(&mut contents)?;
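
A lookup that should not populate the cache on a miss can instead go through Cache::get. A small sketch, assuming get mirrors the read-only interface by returning Ok(None) on a miss and a read-only std::fs::File on a hit:

// Hypothetical follow-up read: on a miss, `get` simply reports
// `None` instead of invoking any population logic.
if let Some(mut hit) = cache.get(&key)? {
    let mut bytes = Vec::new();
    hit.read_to_end(&mut bytes)?;
}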

Cache directory structure

Plain (unsharded) cache directories simply store the value for each key under a file named key. They also have a single .kismet_temp subdirectory, for temporary files.

The second chance algorithm relies on mtime / atime (relatime updates suffice), so merely opening a file automatically updates the relevant read tracking metadata.

Sharded cache directories store the values for each key under one of two shard subdirectories. The primary and secondary potential shards are respectively determined by multiplying Key::hash and Key::secondary_hash by different odd integers before mapping the result to the shard universe with a fixed point scaling.
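
That mapping is the classic multiply-then-scale trick. A minimal sketch of this kind of fixed-point scaling, with an illustrative odd constant rather than Kismet’s actual multipliers:

// Map a 64-bit hash to `0..num_shards`: multiply by an odd
// constant to scramble the bits, then treat the product as a
// fixed-point fraction in [0, 1) and scale it by `num_shards`.
const EXAMPLE_ODD_MULTIPLIER: u64 = 0x9E37_79B9_7F4A_7C15;

fn shard_index(hash: u64, num_shards: usize) -> usize {
    let scrambled = hash.wrapping_mul(EXAMPLE_ODD_MULTIPLIER);
    ((scrambled as u128 * num_shards as u128) >> 64) as usize
}

The primary and secondary shards simply use different odd multipliers, applied to Key::hash and Key::secondary_hash respectively.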

Each subdirectory is named .kismet_$HEX_SHARD_ID, and contains cached files with name equal to the cache key, and a .kismet_temp subsubdirectory, just like plain unsharded caches. In fact, each such shard is managed exactly like a plain cache.

Sharded caches balance load between the two potential shards for each cache key, so that all shards tend to grow at roughly the same rate. Once all the shards have reached their capacity, the sharded cache will slowly revert to storing cache keys in their primary shards.

This scheme lets plain cache directories easily interoperate with other programs that are not aware of Kismet, and also lets an application use the same directory to back both a plain and a sharded cache (concurrently or not) without any possibility of collision between cached files and Kismet’s internal directories.

Kismet will always store its internal data in files or directories whose names start with a .kismet prefix, and cached data lives in files with names equal to their keys. Since Kismet sanitises cache keys to forbid them from starting with ., /, or \, it is always safe for an application to store additional data in files or directories that start with a ., as long as they do not collide with the .kismet prefix.
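
A sketch of that restriction as a standalone check; the helper below is hypothetical, since Kismet performs its own sanitisation internally:

// Hypothetical check mirroring the documented rule: cache keys
// may not start with '.', '/', or '\', so dot-prefixed names
// remain available to the application.
fn is_valid_key_name(name: &str) -> bool {
    !name.is_empty()
        && !name.starts_with('.')
        && !name.starts_with('/')
        && !name.starts_with('\\')
}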


[1] Inasmuch as anything that makes a lot of syscalls can be “lock-free.” The cache access algorithms implement a protocol that makes a bounded number of file open, rename, or link syscalls; in other words, reads and writes are as wait-free as these syscalls. The batched second-chance maintenance algorithm, on the other hand, is merely lock-free: it could theoretically get stuck if writers kept adding new files to a cache (sub)directory. Again, this guarantee is in terms of file access and directory enumeration syscalls, and maintenance is only as lock-free as the underlying syscalls. However, we can assume the kernel is reasonably well designed, and doesn’t let any sequence of syscalls hold kernel locks forever.

[2] This design choice is different from, e.g., ccache’s, which attempts to maintain statistics per shard with locked files. Under high load, the lock to update ccache statistics becomes a bottleneck. Yet, despite taking this hit, ccache batches evictions like Kismet, because cleaning up a directory is slow; up-to-date access statistics aren’t enough to enforce tight cache capacity limits.

Modules

plain: A crate::plain::Cache stores all cached files in a single directory (there may also be a .kismet_temp subdirectory for temporary files), and periodically scans for evictions with a second chance strategy. This implementation does not scale up to more than a few hundred files per cache directory (a crate::sharded::Cache can go higher), but interoperates seamlessly with other file-based programs that store cache files in flat directories.

raw_cache: The raw cache module manages directories of read-only files subject to a (batched) Second Chance eviction policy. Calling prune deletes files to make sure a cache directory does not exceed its capacity, in file count. The deletions will obey a Second Chance policy as long as insertions and updates go through insert_or_update or insert_or_touch, in order to update the cached files’ modification times correctly. Opening the cached file will automatically update its metadata to take that access into account, but a path can also be touched explicitly.

second_chance: The Second Chance or Clock page replacement policy is a simple approximation of the Least Recently Used policy. Kismet uses the second chance policy because it can be easily implemented on top of the usual file modification and access times that we can trust operating systems to update for us.

sharded: A crate::sharded::Cache uses the same basic file-based second chance strategy as a crate::plain::Cache. However, while the simple plain cache is well suited to small caches (down to 2-3 files, and up to maybe one hundred), this sharded version can scale nearly arbitrarily high: each shard should have fewer than one hundred or so files, but there may be arbitrarily many shards (up to filesystem limits, since each shard is a subdirectory).

Structs

Cache: A Cache wraps up to one plain or sharded read-write cache in a convenient interface, and may optionally fulfill read operations by deferring to a list of read-only caches when the read-write cache misses.

CacheBuilder: Construct a Cache with this builder. The resulting cache will always first access its write-side cache (if defined), and, on misses, will attempt to service Cache::get and Cache::touch calls by iterating over the read-only caches.

Key: Cache keys consist of a filename and two hash values. The two hashes should ideally be computed by distinct functions of the key’s name, but Kismet will function correctly if the hash and secondary_hash are the same. Each hash function must be identical for all processes that access the same sharded cache directory.

ReadOnlyCache: A ReadOnlyCache wraps an arbitrary number of crate::plain::Cache and crate::sharded::Cache, and attempts to satisfy ReadOnlyCache::get and ReadOnlyCache::touch requests by hitting each constituent cache in order. This interface hides the difference between plain and sharded cache directories, and should be the first resort for read-only uses.

ReadOnlyCacheBuilder: Construct a ReadOnlyCache with this builder. The resulting cache will access each constituent cache directory in the order they were added.

Enums

CacheHit: Where does a cache hit come from: the primary read-write cache, or one of the secondary read-only caches?

CacheHitAction: What to do with a cache hit in a Cache::get_or_update call?

Constants

Kismet cache directories put temporary files under this subdirectory in each cache or cache shard directory.

Functions

Compares the contents of two files, and returns Ok(()) iff their contents are identical.

Compares the contents of two files, and panics unless their contents are identical. Returns Ok(()) on equality.