Crate redhac

Expand description

redhac is derived from Rust Embedded Distributed Highly Available Cache

The keywords embedded and distributed are a bit strange at the same time.
The idea of redhac is to provide a caching library which can be embedded into any Rust application, while still providing the ability to build a distributed HA caching layer.

It can be used as a cache for a single instance too, of course, but then it will not be the most performant, since it needs clones of most values to be able to send them over the network concurrently. If you need a distributed cache however, you might give redhac a try.

The underlying system works with quorum and will elect a leader dynamically on startup. Each node will start a server and a client part for bi-directional gRPC streaming. Each client will connect to each of the other servers. This is the reason why the performance will degrade, if you scale up replicas without further optimization. You can go higher, but the optimal amount of HA nodes is 3.

§Single Instance

// These 3 are needed for the `cache_get` macro
use redhac::{cache_get, cache_get_from, cache_get_value};
use redhac::{cache_put, CacheConfig, SizedCache};

#[tokio::main]
async fn main() {
    let (_, mut cache_config) = CacheConfig::new();

    // The cache name is used to reference the cache later on
    let cache_name = "my_cache";
    // We need to spawn a global handler for each cache instance.
    // Communication is done over channels.
    cache_config.spawn_cache(cache_name.to_string(), SizedCache::with_size(16), None);

    // Cache keys can only be `String`s at the time of writing.
    let key = "myKey";
    // The value you want to cache must implement `serde::Serialize`.
    // The serialization of the values is done with `bincode`.
    let value = "myCacheValue".to_string();

    // At this point, we need cloned values to make everything work nicely with networked
    // connections. If you only ever need a local cache, you might be better off with using the
    // `cached` crate directly and use references whenever possible.
    cache_put(cache_name.to_string(), key.to_string(), &cache_config, &value)
        .await
        .unwrap();

    let res = cache_get!(
        // The type of the value we want to deserialize the value into
        String,
        // The cache name from above. We can start as many for our application as we like
        cache_name.to_string(),
        // For retrieving values, the same as above is true - we need real `String`s
        key.to_string(),
        // All our caches have the necessary information added to the cache config. Here a
        // reference is totally fine.
        &cache_config,
        // This does not really apply to this single instance example. If we would have started
        // a HA cache layer we could do remote lookups for a value we cannot find locally, if
        // this would be set to `true`
        false
    )
        .await
        .unwrap();

    assert!(res.is_some());
    assert_eq!(res.unwrap(), value);
}

§High Availability

The High Availability (HA) works in a way, that each cache member connects to each other. When eventually quorum is reached, a leader will be elected, which then is responsible for all cache modifications to prevent collisions (if you do not decide against it with a direct cache_put). Since each node connects to each other, it means that you cannot just scale up the cache layer infinitely. The ideal number of nodes is 3. You can scale this number up for instance to 5 or 7 if you like, but this has not been tested in greater detail so far. Write performance will degrade the more nodes you add to the cluster, since you simply need to wait for more Ack’s from the other members. Read performance however should stay the same. Each node will keep a local copy of each value inside the cache (if it has not lost connection or joined the cluster at some later point), which means in most cases reads do not require any remote network access.

§Configuration

The way to configure the HA_MODE is optimized for a Kubernetes deployment but may seem a bit odd at the same time, if you deploy somewhere else. You can either provide the .env file and use it as a config file, or just set these variables for the environment directly. You need to set the following values:

§`HA_MODE`

The first one is easy, just set HA_MODE=true

§`HA_HOSTS`

NOTE:
In a few examples down below, the name for deployments may be rauthy. The reason is, that this crate was written originally to complement another project of mine, Rauthy (link will follow), which is an OIDC Provider and Single Sign-On Solution written in Rust.

The HA_HOSTS is working in a way, that it is really easy inside Kubernetes to configure it, as long as a StatefulSet is used for the deployment.

The way a cache node finds its members is by the HA_HOSTS and its own HOSTNAME. In the HA_HOSTS, add every cache member. For instance, if you want to use 3 replicas in HA mode which are running and deployed as a StatefulSet with the name rauthy again:

HA_HOSTS="http://rauthy-0:8000, http://rauthy-1:8000 ,http://rauthy-2:8000"

The way it works:

A node gets its own hostname from the OS
This is the reason, why you use a StatefulSet for the deployment, even without any volumes attached. For a StatefulSet called rauthy, the replicas will always have the names rauthy-0, rauthy-1, …, which are at the same time the hostnames inside the pod.
Find “me” inside the HA_HOSTS variable
If the hostname cannot be found in the HA_HOSTS, the application will panic and exit because of a misconfiguration.
Use the port from the “me”-Entry that was found for the server part
This means you do not need to specify the port in another variable which eliminates the risk of having inconsistencies or a bad config in that case.
Extract “me” from the HA_HOSTS
then take the leftover nodes as all cache members and connect to them
Once a quorum has been reached, a leader will be elected
From that point on, the cache will start accepting requests
If the leader is lost - elect a new one - No values will be lost
If quorum is lost, the cache will be invalidated
This happens for security reasons to provide cache inconsistencies. Better invalidate the cache and fetch the values fresh from the DB or other cache members than working with possibly invalid values, which is especially true in an authn / authz situation.

NOTE:
If you are in an environment where the described mechanism with extracting the hostname would not work, you can set the HOSTNAME_OVERWRITE for each instance to match one of the HA_HOSTS entries, or you can overwrite the name when using the redhac::start_cluster.

§`CACHE_AUTH_TOKEN`

You need to set a secret for the CACHE_AUTH_TOKEN, which is then used for authenticating cache members.

§TLS

For the sake of this example, we will not dig into TLS and disable it in the example, which can be done with CACHE_TLS=false.

You can add your TLS certificates in PEM format and an optional Root CA. This is true for the Server and the Client part separately. This means you can configure the cache layer to use mTLS connections.

§Reference Config

The following variables are the ones you can use to configure redhac via env vars.
At the time of writing, the configuration can only be done via the env.

# If the cache should start in HA mode or standalone
# accepts 'true|false', defaults to 'false'
HA_MODE=true

# The connection strings (with hostnames) of the HA instances as a CSV
# Format: 'scheme://hostname:port'
HA_HOSTS="http://redhac.redhac:8080, http://redhac.redhac:8180 ,http://redhac.redhac:8280"

# This can overwrite the hostname which is used to identify each cache member.
# Useful in scenarios, where all members are on the same host or for testing.
# You need to add the port, since `redhac` will do an exact match to find "me".
#HOSTNAME_OVERWRITE="127.0.0.1:8080"

# Enable / disable TLS for the cache communication (default: true)
CACHE_TLS=true

# The path to the server TLS certificate PEM file (default: tls/redhac.cert-chain.pem)
CACHE_TLS_SERVER_CERT=tls/redhac.cert-chain.pem
# The path to the server TLS key PEM file (default: tls/redhac.key.pem)
CACHE_TLS_SERVER_KEY=tls/redhac.key.pem

# The path to the client mTLS certificate PEM file. This is optional.
CACHE_TLS_CLIENT_CERT=tls/redhac.local.cert.pem
# The path to the client mTLS key PEM file. This is optional.
CACHE_TLS_CLIENT_KEY=tls/redhac.local.key.pem

# If not empty, the PEM file from the specified location will be added as the CA certificate chain for validating
# the servers TLS certificate. This is optional.
CACHE_TLS_CA_SERVER=tls/ca-chain.cert.pem
# If not empty, the PEM file from the specified location will be added as the CA certificate chain for validating
# the clients mTLS certificate. This is optional.
CACHE_TLS_CA_CLIENT=tls/ca-chain.cert.pem

# The domain / CN the client should validate the certificate against. This domain MUST be inside the
# 'X509v3 Subject Alternative Name' when you take a look at the servers certificate with the openssl tool.
# default: redhac.local
CACHE_TLS_CLIENT_VALIDATE_DOMAIN=redhac.local

# Can be used if you need to overwrite the SNI when the client connects to the server, for
# instance if you are behind a loadbalancer which combines multiple certificates. (default: "")
#CACHE_TLS_SNI_OVERWRITE=

# Define different buffer sizes for channels between the components
# Buffer for client request on the incoming stream - server side (default: 128)
# Makes sense to have the CACHE_BUF_SERVER roughly set to:
# `(number of total HA cache hosts - 1) * CACHE_BUF_CLIENT`
CACHE_BUF_SERVER=128
# Buffer for client requests to remote servers for all cache operations (default: 64)
CACHE_BUF_CLIENT=64

# Secret token, which is used to authenticate the cache members
CACHE_AUTH_TOKEN=SuperSafeSecretToken1337

# Connections Timeouts
# The Server sends out keepalive pings with configured timeouts

# The keepalive ping interval in seconds (default: 5)
CACHE_KEEPALIVE_INTERVAL=5

# The keepalive ping timeout in seconds (default: 5)
CACHE_KEEPALIVE_TIMEOUT=5

# The timeout for the leader election. If a newly saved leader request has not reached quorum
# after the timeout, the leader will be reset and a new request will be sent out.
# CAUTION: This should not be below CACHE_RECONNECT_TIMEOUT_UPPER, since cold starts and
# elections will be problematic in that case.
# value in seconds, default: 2
CACHE_ELECTION_TIMEOUT=2

# These 2 values define the reconnect timeout for the HA Cache Clients.
# The values are in ms and a random between these 2 will be chosen each time to avoid conflicts
# and race conditions (default: 500)
CACHE_RECONNECT_TIMEOUT_LOWER=500
# (default: 2000)
CACHE_RECONNECT_TIMEOUT_UPPER=2000

§Example

use std::env;
use redhac::*;
use redhac::quorum::{AckLevel, QuorumState};
use std::time::Duration;
use tokio::time;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // `redhac` is configured via env variables.
    // For the sake of this example, we set the env vars directly inside the code. Usually you
    // would want to configure them on the outside of course.

    // Enable the HA_MODE
    env::set_var("HA_MODE", "true");

    // Configure the HA Cache members. You need an uneven number for quorum.
    // For this example, we will have all of them on the same host on different ports to make
    // it work.
    env::set_var(
        "HA_HOSTS",
        "http://127.0.0.1:7001, http://127.0.0.1:7002, http://127.0.0.1:7003",
     );

    // Disable TLS for this example
    env::set_var("CACHE_TLS", "false");

    // Configure a cache
    let (tx_health_1, mut cache_config_1) = CacheConfig::new();
    let cache_name = "my_cache";
    let cache = SizedCache::with_size(16);
    cache_config_1.spawn_cache(cache_name.to_string(), cache.clone(), None);

    // start server
    start_cluster(
        tx_health_1,
        &mut cache_config_1,
        // optional notification channel: `Option<mpsc::Sender<CacheNotify>>`
        None,
        // We need to overwrite the hostname, so we can start all nodes on the same host for this
        // example. Usually, this will be set to `None`
        Some("127.0.0.1:7001".to_string()),
    )
    .await?;
    time::sleep(Duration::from_millis(100)).await;
    println!("First cache node started");

    // Mimic the other 2 cache members. This should usually not be done in the same code - only
    // for this example to make it work.
    let (tx_health_2, mut cache_config_2) = CacheConfig::new();
    cache_config_2.spawn_cache(cache_name.to_string(), cache.clone(), None);
    start_cluster(
        tx_health_2,
        &mut cache_config_2,
        None,
        Some("127.0.0.1:7002".to_string()),
    )
    .await?;
    time::sleep(Duration::from_millis(100)).await;
    println!("2nd cache node started");
    // Now after the 2nd cache member has been started, we would already have quorum and a
    // working cache layer. As long as there is no leader and / or quorum, the cache will not
    // save any values to avoid inconsistencies.

    let (tx_health_3, mut cache_config_3) = CacheConfig::new();
    cache_config_3.spawn_cache(cache_name.to_string(), cache.clone(), None);
    start_cluster(
        tx_health_3,
        &mut cache_config_3,
        None,
        Some("127.0.0.1:7003".to_string()),
    )
    .await?;
    time::sleep(Duration::from_millis(100)).await;
    println!("3rd cache node started");

    // For the sake of this example again, we need to wait until the cache is in a healthy
    // state, before we can actually insert a value
    loop {
        let health_borrow = cache_config_1.rx_health_state.borrow();
        let health = health_borrow.as_ref().unwrap();
        if health.state == QuorumState::Leader || health.state == QuorumState::Follower {
            break;
        }
        time::sleep(Duration::from_secs(1)).await;
    }

    println!("Cache 1 State: {:?}", cache_config_1.rx_health_state.borrow());
    println!("Cache 2 State: {:?}", cache_config_2.rx_health_state.borrow());
    println!("Cache 3 State: {:?}", cache_config_3.rx_health_state.borrow());

    println!("We are ready to go - insert and get a value");

    let entry = "myKey".to_string();
    // `cache_insert` is the HA version of `cache_put` from the single instance example.
    //
    // `cache_put` does a direct push in HA situations, ignoring the leader. This is way more
    // performant, but may lead to collisions. However, if you are inserting a completely new
    // value with a unique new key, for instance a newly generated UUID, it should be favored,
    // since the same UUID will not be inserted from someone else most likely.
    //
    // `cache_insert` can be used if you want to avoid conflicts. With the `AckLevel`, you can
    // specify the level of safety you want. For instance, if we use `AckLevel::Quorum`, the
    // Result will not be `Ok(())` until the leader has at least the ack for the insert from at
    // least "quorum" amount of nodes.
    // Different levels provide different levels of safety and performance - it is always a
    // tradeoff.
    cache_insert(
        cache_name.to_string(),
        entry.clone(),
        &cache_config_1,
        // This is a reference to the value, which must implement `serde::Serialize`
        &1337i32,
        // The `AckLevel` defines the level of safety we want
        AckLevel::Quorum
    )
        .await?;
    // check the value
    let one = cache_get!(
        i32,
        cache_name.to_string(),
        entry,
        &cache_config_1,
        // Nn a HA situation, we could set this to `true` if we wanted to do a lookup on another
        // cache member in case we do not find the value locally, for instance if a node joined
        // at a later time or when there was a network partition.
        false
    )
        .await?;
    assert_eq!(one, Some(1337));

    Ok(())
}

Re-exports§

pub use crate::quorum::AckLevel;
pub use crate::quorum::QuorumHealth;
pub use crate::quorum::QuorumHealthState;
pub use crate::quorum::QuorumState;

Modules§

quorum

Macros§

cache_get: This is a simple macro to get values from the cache and deserialize them properly at the same time.

Structs§

CacheConfig: The CacheConfig needs to be initialized at the very beginning of the application and must be given to the start_cluster function, when running with HA_MODE == true.
CacheError: The general CacheError which is returned by most of the cache wrapper functions.
CacheNotify: The CacheNotify will be sent over the optional tx_notify for the server, so clients are able to subscribe to updates inside their cache.
SizedCache: Least Recently Used / Sized Cache
TimedCache: Cache store bound by time
TimedSizedCache: Timed LRU Cache

Enums§

CacheMethod: These CacheMethods should be self explainatory. However, Insert and Remove are only executed over the leader and routed internally, whereas Put and Del are direct methods, which ignore the leader and trade conflict safety for faster execution speed.
CacheReq: The enum which will be sent to the “real” cache handler loop in the end.

Functions§

cache_del: Deletes a value from the cache. Will return immediately with Ok(()) if quorum is Bad in HA_MODE. Does ignore the quorum health state, if the cache is running in HA mode.
cache_get_from: This is the deserializer function for cached values. If you do not need deserialization, you can skip using the cache_get! macro and only use cache_get_value to not deserialize.
cache_get_value: Gets values out of the cache.
cache_insert: The HA pendant to cache_put - defaults to cache_put in non-HA mode
cache_put: Put values into the cache.
cache_remove: The HA pendant to cache_del - defaults to cache_del in non-HA mode
clear_caches: Clears all cache entries of each existing cache. This full reset is local and will not be pushed to possible remote HA cache members.
start_cluster: The main function to start the whole backend when HA_MODE == true. It cares about starting the server and one client for each remote server configured in HA_HOSTS, as well as the internal ‘quorum_handler’.

Crate redhacCopy item path