//! Kismet implements multiprocess, lock-free,[^lock-free-fs]
//! crash-safe, and (roughly) bounded persistent caches stored
//! in filesystem directories, with a
//! [Second Chance](https://en.wikipedia.org/wiki/Page_replacement_algorithm#Second-chance)
//! eviction strategy. The maintenance logic is batched and invoked
//! at periodic jittered intervals to make sure accesses amortise to a
//! constant number of filesystem system calls and logarithmic (in the
//! number of cached files) time complexity. That's good for performance,
//! and enables lock-freedom,[^unlike-ccache] but does mean that
//! caches are expected to temporarily grow past their capacity
//! limits, although rarely by more than a factor of 2 or 3.
//!
//! [^lock-free-fs]: Inasmuch as anything that makes a lot of syscalls
//! can be "lock-free." The cache access algorithms implement a
//! protocol that makes a bounded number of file open, rename, or link
//! syscalls; in other words, reads and writes are as wait-free as
//! these syscalls. The batched second-chance maintenance algorithm,
//! on the other hand, is merely lock-free: it could theoretically get
//! stuck if writers kept adding new files to a cache (sub)directory.
//! Again, this guarantee is in terms of file access and directory
//! enumeration syscalls, and maintenance is only as lock-free as the
//! underlying syscalls. However, we can assume the kernel is
//! reasonably well designed, and doesn't let any sequence of syscalls
//! hold on to kernel locks forever.
//!
//! [^unlike-ccache]: This design choice is different from, e.g.,
//! [ccache](https://ccache.dev/)'s, which attempts to maintain
//! statistics per shard with locked files. Under high load, the lock
//! to update ccache statistics becomes a bottleneck. Yet, despite
//! taking this hit, ccache batches evictions like Kismet, because
//! cleaning up a directory is slow; up-to-date access statistics
//! aren't enough to enforce tight cache capacity limits.
//!
//! In addition to constant per-cache space overhead, each Kismet
//! cache maintains a variable-length [`std::path::PathBuf`] for the
//! directory, one byte of lock-free metadata per shard, and no other
//! non-heap resource (i.e., Kismet caches do not hold on to file
//! objects). This holds for individual cache directories; when
//! stacking multiple caches in a [`Cache`] or [`ReadOnlyCache`], the
//! read-write cache and all constituent read-only caches will each
//! have their own `PathBuf` and per-shard metadata.
//!
//! When a Kismet cache triggers second-chance evictions, it will
//! allocate temporary data. That data's size is proportional to the
//! number of files in the cache shard subdirectory undergoing
//! eviction (or the whole directory for a plain unsharded cache), and
//! includes a copy of the basename (without the path prefix) for each
//! cached file in the subdirectory (or plain cache directory). This
//! eviction process takes linearithmic time in the number of files in
//! the cache subdirectory (or directory), and is invoked periodically,
//! so as to amortise the maintenance overhead to logarithmic (in the
//! total number of files in the subdirectory) time per write to a
//! cache subdirectory, and constant file operations per write.
//!
//! Kismet does not pre-allocate any long-lived file object, so it may
//! need to temporarily open file objects. Each call nevertheless
//! bounds the number of concurrently allocated file objects; the
//! current logic never allocates more than two concurrent file
//! objects.
//!
//! The load (number of files) in each cache may exceed the cache's
//! capacity because there is no centralised accounting, except for
//! what filesystems provide natively. This design choice forces
//! Kismet to amortise maintenance calls with randomisation, but also
//! means that any number of threads or processes may safely access
//! the same cache directories without any explicit synchronisation.
//!
//! Filesystems can't be trusted to provide much; Kismet only relies
//! on file modification times (`mtime`), and on file access times
//! (`atime`) that are either less than or equal to the `mtime`, or
//! greater than the `mtime` (i.e., `relatime` is acceptable). This
//! implies that cached files should not be linked in multiple Kismet
//! cache directories. It is however safe to hardlink cached files in
//! multiple places, as long as the files are not modified, or their
//! `mtime` otherwise updated, through these non-Kismet links.
//!
//! # Plain and sharded caches
//!
//! Kismet cache directories are plain (unsharded) or sharded.
//!
//! Plain Kismet caches are simply directories where the cache entry for
//! "key" is the file named "key." These are most effective for
//! read-only access to cache directories managed by some other
//! process, or for small caches of up to ~100 cached files.
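//!
//! For example, since a plain cache is just a directory in which the
//! entry for "key" is the file named "key", a process that knows
//! nothing about Kismet can read a cached entry with the standard
//! library alone (the path and key here are hypothetical):
//!
//! ```no_run
//! # fn main() -> std::io::Result<()> {
//! // Read the cached value for key "config": the entry, if present,
//! // is simply the file with that name in the cache directory.
//! let contents = std::fs::read("/tmp/plain_cache/config")?;
//! # Ok(())
//! # }
//! ```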
//!
//! Sharded caches scale to higher capacities, by indexing into one of
//! a constant number of shard subdirectories with a hash, and letting
//! each shard manage fewer files (ideally 10-100 files). They are
//! also much less likely to grow to multiples of the target capacity
//! than plain (unsharded) cache directories.
//!
//! Simple usage should be covered by the [`ReadOnlyCache`] or
//! [`Cache`] structs, which wrap [`plain::Cache`] and
//! [`sharded::Cache`] in a convenient type-erased interface.
//!
//! While the cache code syncs cached data files by default, it does
//! not sync the parent cache directories: we assume that it's safe,
//! if unfortunate, for caches to lose data or revert to an older
//! state after kernel or hardware crashes. In general, the code
//! attempts to be robust against direct manipulation of the cache
//! directories. It's always safe to delete cache files from Kismet
//! directories (ideally not recently created files in `.kismet_temp`
//! subdirectories), and even *adding* files should mostly do what one
//! expects: they will be picked up if they're in the correct place
//! (in a plain unsharded cache directory or in the correct shard
//! subdirectory), and eventually evicted if useless or in the wrong
//! shard.
//!
//! It is however essential to only publish files atomically to the
//! cache directories, and it probably never makes sense to modify
//! cached file objects in place. In fact, Kismet always sets files
//! read-only before publishing them to the cache and always returns
//! read-only [`std::fs::File`] objects for cached data.
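//!
//! The following sketch shows the usual write-to-temporary-then-rename
//! pattern for code that wants to add files to a cache directory
//! without going through Kismet. This is the general pattern, not
//! Kismet's exact internal sequence, and the paths and key name are
//! hypothetical:
//!
//! ```no_run
//! # fn main() -> std::io::Result<()> {
//! use std::io::Write;
//!
//! // Write the new value to a temporary file in the reserved
//! // `.kismet_temp` subdirectory, on the same filesystem as the
//! // final location.
//! let temp_path = "/tmp/plain_cache/.kismet_temp/my_key.tmp";
//! let mut temp = std::fs::File::create(temp_path)?;
//! temp.write_all(b"cached bytes")?;
//! temp.sync_all()?; // Make the contents durable before publication.
//!
//! // Mark the file read-only: cached file objects should never be
//! // modified in place.
//! let mut permissions = temp.metadata()?.permissions();
//! permissions.set_readonly(true);
//! temp.set_permissions(permissions)?;
//! drop(temp);
//!
//! // Atomically publish the value under the key's name: concurrent
//! // readers see either the old value or the new one, never a
//! // partial write.
//! std::fs::rename(temp_path, "/tmp/plain_cache/my_key")?;
//! # Ok(())
//! # }
//! ```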
//!
//! # Sample usage
//!
//! One could access a list of read-only caches with a [`ReadOnlyCache`].
//!
//! ```no_run
//! const NUM_SHARDS: usize = 10;
//!
//! let read_only = kismet_cache::ReadOnlyCacheBuilder::new()
//!     .plain("/tmp/plain_cache")                 // Read first here.
//!     .sharded("/tmp/sharded_cache", NUM_SHARDS) // Then try there.
//!     .build();
//!
//! // Attempt to read the file for key "foo", with primary hash 1
//! // and secondary hash 2, first from `/tmp/plain_cache`, and then
//! // from `/tmp/sharded_cache`. In practice, the hashes should
//! // probably be populated by implementing the `From<&'a T>`
//! // trait, and passing a `&T` to the cache methods.
//! read_only.get(kismet_cache::Key::new("foo", 1, 2));
//! ```
//!
//! Read-write accesses should use a [`Cache`]:
//!
//! ```no_run
//! struct CacheKey {
//!     // ...
//! }
//!
//! fn get_contents(key: &CacheKey) -> Vec<u8> {
//!     // ...
//! #   unreachable!()
//! }
//!
//! impl<'a> From<&'a CacheKey> for kismet_cache::Key<'a> {
//!     fn from(key: &CacheKey) -> kismet_cache::Key {
//!         // ...
//! #       unreachable!()
//!     }
//! }
//!
//! // It's easier to increase the capacity than the number of shards,
//! // so, when in doubt, prefer a few too many shards with a lower
//! // capacity. It's not incorrect to increase the number of shards,
//! // but doing so will result in lost cached data (eventually
//! // deleted), since Kismet does not assign shards with a consistent
//! // hash.
//! const NUM_SHARDS: usize = 100;
//! const CAPACITY: usize = 1000;
//!
//! # fn main() -> std::io::Result<()> {
//! use std::io::Read;
//! use std::io::Write;
//!
//! let cache = kismet_cache::CacheBuilder::new()
//!     .sharded_writer("/tmp/root_cache", NUM_SHARDS, CAPACITY)
//!     .plain_reader("/tmp/extra_cache") // Try to fill cache misses here.
//!     .build();
//!
//! let key: CacheKey = // ...
//! #    CacheKey {}
//!     ;
//!
//! // Fetches the current cached value for `key`, or populates it with
//! // the closure argument if missing.
//! let mut cached_file = cache
//!     .ensure(&key, |file| file.write_all(&get_contents(&key)))?;
//! let mut contents = Vec::new();
//! cached_file.read_to_end(&mut contents)?;
//! # Ok(())
//! # }
//! ```
//!
//! # Cache directory structure
//!
//! Plain (unsharded) cache directories simply store the value for
//! each `key` under a file named `key`. They also have a single
//! `.kismet_temp` subdirectory, for temporary files.
//!
//! The second-chance algorithm relies on mtime / atime (`relatime`
//! updates suffice), so merely opening a file automatically updates
//! the relevant read tracking metadata.
//!
//! Sharded cache directories store the values for each `key` under
//! one of two shard subdirectories. The primary and secondary
//! potential shards are determined, respectively, by multiplying
//! `Key::hash` and `Key::secondary_hash` by different odd integers,
//! then mapping each result to the shard universe with fixed-point
//! scaling.
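//!
//! Concretely, the mapping from hash to shard index looks like the
//! following sketch (the multiplier below is a made-up odd constant,
//! not one of the constants Kismet actually uses):
//!
//! ```
//! // Hypothetical odd multiplier; Kismet uses different constants
//! // for the primary and secondary hashes.
//! const ODD_MULTIPLIER: u64 = 0x9E37_79B9_7F4A_7C15;
//!
//! // Map `hash` to a shard index in `0..num_shards`: multiplicative
//! // hashing, then fixed-point scaling that keeps the high 64 bits
//! // of `mixed * num_shards / 2^64`.
//! fn shard_index(hash: u64, num_shards: usize) -> usize {
//!     let mixed = hash.wrapping_mul(ODD_MULTIPLIER);
//!     ((mixed as u128 * num_shards as u128) >> 64) as usize
//! }
//!
//! assert!(shard_index(42, 10) < 10);
//! ```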
//!
//! Each subdirectory is named `.kismet_$HEX_SHARD_ID`, and contains
//! cached files with name equal to the cache key, and a
//! `.kismet_temp` subsubdirectory, just like plain unsharded caches.
//! In fact, each such shard is managed exactly like a plain cache.
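//!
//! For example, a two-shard cache directory could look like the
//! following (shard and key names are illustrative):
//!
//! ```text
//! sharded_cache/
//! ├── .kismet_0/
//! │   ├── .kismet_temp/
//! │   └── some_key       # cached value, named after its key
//! └── .kismet_1/
//!     ├── .kismet_temp/
//!     └── other_key
//! ```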
//!
//! Sharded caches attempt to balance load between the two potential
//! shards for each cache key, so that all shards grow at
//! roughly the same rate. Once all the shards have reached their
//! capacity, the sharded cache will slowly revert to storing cache
//! keys in their primary shards.
//!
//! This scheme lets plain cache directories easily interoperate with
//! other programs that are not aware of Kismet, and also lets an
//! application use the same directory to back both a plain and a
//! sharded cache (concurrently or not) without any possibility of
//! collision between cached files and Kismet's internal directories.
//!
//! Kismet will always store its internal data in files or directories
//! whose names start with a `.kismet` prefix, and cached data lives in
//! files with names equal to their keys. Since Kismet sanitises
//! cache keys to forbid them from starting with `.`, `/`, or `\`, it
//! is always safe for an application to store additional data in
//! files or directories that start with a `.`, as long as they do not
//! collide with the `.kismet` prefix.
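//!
//! A minimal sketch of the name check described above (the helper
//! itself is hypothetical; only the rules it encodes come from
//! Kismet):
//!
//! ```
//! // Returns whether `name` could be a valid Kismet cache key name:
//! // non-empty, and not starting with `.`, `/`, or `\`.
//! fn name_looks_valid(name: &str) -> bool {
//!     !name.is_empty() && !name.starts_with(&['.', '/', '\\'][..])
//! }
//!
//! assert!(name_looks_valid("foo"));
//! assert!(!name_looks_valid(".kismet_temp")); // Reserved dot prefix.
//! assert!(!name_looks_valid("/etc/passwd")); // Absolute paths rejected.
//! ```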
mod cache_dir;
mod multiplicative_hash;
pub mod plain;
pub mod raw_cache;
mod readonly;
pub mod second_chance;
pub mod sharded;
mod stack;
mod trigger;
pub use readonly::ReadOnlyCache;
pub use readonly::ReadOnlyCacheBuilder;
pub use stack::Cache;
pub use stack::CacheBuilder;
pub use stack::CacheHit;
pub use stack::CacheHitAction;
/// Kismet cache directories put temporary files under this
/// subdirectory in each cache or cache shard directory.
pub const KISMET_TEMPORARY_SUBDIRECTORY: &str = ".kismet_temp";
/// Cache keys consist of a filename and two hash values. The two
/// hashes should ideally be computed by distinct functions of the
/// key's name, but Kismet will function correctly if the `hash` and
/// `secondary_hash` are the same. Each hash function **must** be
/// identical for all processes that access the same sharded cache
/// directory.
///
/// The `name` should not be empty, nor start with a dot, a forward
/// slash, or a backslash: caches will reject any operation on such
/// names with an `ErrorKind::InvalidInput` error.
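///
/// For example, one could derive the two hashes from the name itself.
/// This is only a sketch: `DefaultHasher::new()` is deterministic
/// within one build of one program, but is not guaranteed stable
/// across Rust releases, so applications sharing cache directories
/// across binaries should pick hash functions with explicit stability
/// guarantees.
///
/// ```
/// use std::collections::hash_map::DefaultHasher;
/// use std::hash::{Hash, Hasher};
///
/// // Hash `name` with a seed, to derive two different hash values
/// // from the same hash function.
/// fn seeded_hash(name: &str, seed: u64) -> u64 {
///     let mut hasher = DefaultHasher::new();
///     seed.hash(&mut hasher);
///     name.hash(&mut hasher);
///     hasher.finish()
/// }
///
/// let name = "foo";
/// let key = kismet_cache::Key::new(name, seeded_hash(name, 0), seeded_hash(name, 1));
/// ```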
#[derive(Clone, Copy, Debug)]
pub struct Key<'a> {
    pub name: &'a str,
    pub hash: u64,
    pub secondary_hash: u64,
}
impl<'a> Key<'a> {
    /// Returns a new `Key` for this `name`, `hash`, and `secondary_hash`.
    pub fn new(name: &str, hash: u64, secondary_hash: u64) -> Key {
        Key {
            name,
            hash,
            secondary_hash,
        }
    }
}