walcraft 0.3.0

A light-weight Write Ahead Log (WAL) solution with garbage collection

Walcraft

Walcraft is a Write Ahead Log (WAL) solution for concurrent environments. The library provides high performance by using an in-memory buffer and append-only logs. The logs are stored in multiple files, and older files are deleted to save space.

Features

  • Awesome crate name
  • Simple to use and customize
  • Configurable storage limit
  • Configurable page size
  • CRC32 checksum for data integrity
  • fsync support
  • High write throughput
  • Built for concurrent and parallel environments
  • Prevents write amplification for high frequency writes
  • Automatically syncs logs with the disk (default: every 100ms)
  • Bring your own serialization format

Initialization

Builder Pattern (Recommended)

The builder pattern allows for complete customization of the WAL instance.

use walcraft::{Size, WalBuilder, Wal};

fn main() {
    // create a wal with 4 KB page size, 10 GB storage and autosync every 50ms
    let wal: Wal = WalBuilder::new()
        .location("/tmp/logs/wal")
        .page_size(Size::Kb(4))
        .storage_size(Size::Gb(10))
        .sync_interval(50)
        .build()
        .unwrap();

    // create a wal with 16 KB page size, enable fsync, use 250 MB of storage and disable autosync
    let wal2: Wal = WalBuilder::new()
        .location("/tmp/logs/wal")
        .storage_size(Size::Mb(250))
        .page_size(Size::Kb(16))
        .sync_interval(0)
        .enable_fsync()
        .build()
        .unwrap();
}

Direct Initialization

This method only allows you to set the location and the storage size (in MB). The buffer size is set to 4 KB by default and fsync is disabled.

use walcraft::Wal;

fn main() {
    // Create a wal instance with 200 MB of storage
    let wal = Wal::new("/tmp/logs/wal", Some(200)).unwrap();
}

Usage

Writing logs

use serde::{Deserialize, Serialize};
use walcraft::Wal;

// Log to write
#[derive(Serialize, Deserialize, Clone)]
struct Log {
    id: usize,
    value: f64
}

fn main() {
    let log = Log { id: 1, value: 5.6234 };

    // initiate wal and add a log
    let wal = Wal::new("./tmp/", None).unwrap();
    // write a struct
    wal.append_struct(log).unwrap();
    // write raw bytes
    wal.append(b"raw binary data").unwrap();

    // write a log in another thread
    let wal2 = wal.clone();
    let handle = std::thread::spawn(move || {
        let log = Log { id: 2, value: 0.45 };
        wal2.append_struct(log).unwrap();
    });

    // keep writing logs in current thread
    let log = Log { id: 3, value: 123.59 };
    wal.append_struct(log).unwrap();

    // wait for the other thread so its log is written before the final flush
    handle.join().unwrap();

    // Flush the logs to the disk manually.
    // This also happens automatically after some time. However, it's advised to
    // run this method before terminating the program to ensure that no logs are lost.
    wal.flush().unwrap();
}

Reading logs

use serde::{Deserialize, Serialize};
use walcraft::Wal;

// Log to read
#[derive(Serialize, Deserialize, Debug)]
struct Log {
    id: usize,
    value: f64
}

fn main() {
    let wal = Wal::new("./tmp/", None).unwrap();
    let iterator = wal.iter().unwrap();

    for entry in iterator {
        let raw_log = entry.data(); // read raw bytes
        let log: Log = entry.to_struct().unwrap(); // convert raw bytes to struct
        println!("Log: {:?}", log);
    }
}

Limiting the size of logs

The Wal::new method accepts two arguments. The first is the directory where logs will be stored. The second, optional argument is the storage limit for the logs, in MB.

Once the storage occupied by log files exceeds the provided limit, the older logs are deleted in chunks to free up some space.

use walcraft::Wal;

fn main() {
    // Unlimited log storage
    let wal = Wal::new("/tmp/logz", None).unwrap();

    // 500 MB of log storage
    let wal = Wal::new("/tmp/logz", Some(500)).unwrap();

    // 20 GB of log storage
    let wal = Wal::new("/tmp/logz", Some(20_000)).unwrap();
}
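
The pruning rule described above can be sketched without the library. This is a minimal illustration and not walcraft's actual implementation: the helper name `files_to_delete`, the (name, size) tuple representation, and the oldest-first ordering are all assumptions made for the example. It follows the "exceeds the limit" wording, so nothing is deleted while the total is exactly at the limit.

```rust
/// Hypothetical helper (not part of walcraft): given log files ordered
/// oldest-first as (name, size in bytes), return the names to delete so
/// that the remaining total no longer exceeds `limit_bytes`.
fn files_to_delete(files: &[(String, u64)], limit_bytes: u64) -> Vec<String> {
    let mut total: u64 = files.iter().map(|(_, size)| *size).sum();
    let mut doomed = Vec::new();
    for (name, size) in files {
        // Stop as soon as the remaining files fit within the limit.
        if total <= limit_bytes {
            break;
        }
        total -= *size;
        doomed.push(name.clone());
    }
    doomed
}
```

With three 40-byte files and a 100-byte limit, only the oldest file is selected, because removing it brings the total down to 80 bytes.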

Breaking Changes in version 0.3

We have introduced significant changes to the WAL library in version 0.3 that are not backward compatible with the previous versions. Because of these breaking changes, any logs created by older versions of the WAL library will not be readable by the new version. The key differences are:

Paged File Layout

The WAL file is now divided into fixed-size pages, and data is managed on a per-page basis rather than as a continuous byte file. This layout improves consistency checks, corruption detection and recovery, and internal organization, but it makes files generated by older versions unreadable under the new scheme.

Binary Data Only

All data is now handled strictly as binary blobs (&[u8]). Both the append and read APIs expect and return raw bytes. This change helps streamline performance and reduce dependency overhead. It reduces the tight coupling with the serde library and offers more flexibility in how data is serialized and deserialized.
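
Because the log only stores bytes, any encoding can be used. The sketch below hand-rolls a fixed 16-byte layout using only the standard library. The `encode` and `decode` helpers, the field types, and the layout itself are assumptions made for illustration; they are not part of walcraft.

```rust
/// Example record, similar to the `Log` struct used elsewhere in this README.
#[derive(Debug, PartialEq)]
struct Log {
    id: u64,
    value: f64,
}

/// Encode as 16 bytes: little-endian `id` followed by little-endian `value`.
fn encode(log: &Log) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(16);
    bytes.extend_from_slice(&log.id.to_le_bytes());
    bytes.extend_from_slice(&log.value.to_le_bytes());
    bytes
}

/// Decode the layout produced by `encode`; returns `None` on a length mismatch.
fn decode(bytes: &[u8]) -> Option<Log> {
    if bytes.len() != 16 {
        return None;
    }
    let mut id_bytes = [0u8; 8];
    let mut value_bytes = [0u8; 8];
    id_bytes.copy_from_slice(&bytes[..8]);
    value_bytes.copy_from_slice(&bytes[8..]);
    Some(Log {
        id: u64::from_le_bytes(id_bytes),
        value: f64::from_le_bytes(value_bytes),
    })
}
```

The output of `encode` is the kind of byte slice that could be handed to the append method, and `decode` could be applied to the bytes returned by data() when reading.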

Mandatory CRC32 Checksums

Every page now includes a CRC32 checksum that is computed on write and validated on read. Pages that fail validation are automatically skipped during iteration. This is always enabled and cannot be turned off.
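
To illustrate the compute-on-write, validate-on-read idea, here is a minimal standard-library sketch. It assumes the common CRC-32 variant (IEEE 802.3, reflected polynomial 0xEDB88320) and a hypothetical page made of a payload followed by a 4-byte little-endian checksum. The actual on-disk layout and CRC variant used by walcraft are not specified here and may differ.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the lowest bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical page: the payload followed by its little-endian checksum.
fn seal_page(payload: &[u8]) -> Vec<u8> {
    let mut page = payload.to_vec();
    page.extend_from_slice(&crc32(payload).to_le_bytes());
    page
}

/// Returns the payload if the stored checksum matches, `None` otherwise.
fn open_page(page: &[u8]) -> Option<&[u8]> {
    if page.len() < 4 {
        return None;
    }
    let (payload, stored) = page.split_at(page.len() - 4);
    let mut buf = [0u8; 4];
    buf.copy_from_slice(stored);
    if crc32(payload) == u32::from_le_bytes(buf) {
        Some(payload)
    } else {
        None
    }
}
```

A reader built this way would skip any page for which `open_page` returns `None`, which mirrors the skip-on-failure behaviour described above.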

Convenient Struct Interface

A convenience method, append_struct<T: Serialize>(item: T), is provided to serialize your structs automatically. On reading, the library returns a LogEntry object, which can be converted to a struct using to_struct::<T>(), where T implements Deserialize. Alternatively, you can use the data() method to get the underlying binary data.

Useful tips

  • Storage size: The storage size can be adjusted to limit the amount of space the logs can occupy. Once the limit is reached or exceeded, the older logs are deleted to free up space.
  • Page size: The default page size is 4 KB, and the maximum size of a single log is PAGE_SIZE - 8 bytes. You can set the page size to any multiple of 4 KB, but it is recommended to keep it as small as possible, between 4 KB and 1 MB. The page size is the size of the buffer used to write the logs: the larger the page, the more data can be written at once, at the cost of higher write amplification and memory usage.
  • Fsync: By default, fsync is disabled. You can enable it using the builder pattern. With fsync enabled, data is written to the disk before the write operation returns, so it is not lost in case of a power failure. However, this significantly reduces the number of disk I/O operations per second.
  • Recovery: The library provides a way to recover the logs at startup: the .iter() method returns an iterator over the stored logs. Calling this method after writing has started results in a panic.
  • Flush: The library automatically flushes the logs to the disk once the page is filled and periodically as specified by sync_interval. However, it's advised to run the .flush() method before terminating the program to ensure that no logs are lost.
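
The page size rules in the tips above reduce to simple arithmetic. The sketch below assumes 1 KB = 1024 bytes; the helper names `is_valid_page_size` and `max_log_size` are hypothetical and exist only for this example.

```rust
/// Assumption: 1 KB is 1024 bytes.
const KB: usize = 1024;
/// Per-page overhead implied by the "PAGE_SIZE - 8 bytes" rule.
const PAGE_OVERHEAD: usize = 8;

/// A page size is accepted here if it is a non-zero multiple of 4 KB.
fn is_valid_page_size(page_size: usize) -> bool {
    page_size != 0 && page_size % (4 * KB) == 0
}

/// Largest single log entry that fits in one page of the given size.
fn max_log_size(page_size: usize) -> usize {
    page_size.saturating_sub(PAGE_OVERHEAD)
}
```

Under these assumptions, the default 4 KB page leaves room for a single log of at most 4088 bytes.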

Handling Log Versioning

The WAL library does not support versioning of logs. If you need to handle different versions of logs, you must implement your own versioning mechanism. One way to do this is to use an enum to represent the different versions of the logs and let the serde library serialize and deserialize them. The following example demonstrates this approach:

use walcraft::Wal;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
enum Log {
    V1 { id: usize, name: String },
    V2 { id: usize, name: String, age: u8 },
}


fn main() {
    // write logs
    let wal = Wal::new("/tmp/walcraft", Some(100)).unwrap();
    wal.append_struct(Log::V1 {
        id: 1,
        name: "Alice".to_string(),
    })
    .unwrap();
    wal.append_struct(Log::V2 {
        id: 2,
        name: "John".to_string(),
        age: 30,
    })
    .unwrap();
    wal.flush().unwrap();
    drop(wal);

    // read logs
    let wal = Wal::new("/tmp/walcraft", Some(100)).unwrap();
    let iterator = wal.iter().unwrap();
    let logs: Vec<Log> = iterator
        .into_iter()
        .map(|entry| entry.to_struct::<Log>().unwrap())
        .collect();
    assert_eq!(logs.len(), 2);
    assert_eq!(logs[0], Log::V1 { id: 1, name: "Alice".to_string() });
    assert_eq!(logs[1], Log::V2 { id: 2, name: "John".to_string(), age: 30 });
}

Quirks

The WAL can only be in read mode or write mode, not both at the same time.

  • Idle: When created, the WAL is in an idle mode.
  • Read: Calling .iter() method switches the WAL to read mode. In this mode, you cannot write data; any write attempts will be ignored. Once the reading finishes, the WAL automatically reverts to idle mode.
  • Write: When you start writing to the WAL, it switches to write mode and cannot switch back to idle or read mode.

This design prevents conflicts between reading and writing. Ideally, you should read the data at startup, as part of the recovery process, before beginning to write.

use serde::{Deserialize, Serialize};
use walcraft::Wal;

// Log to write
#[derive(Serialize, Deserialize, Clone, Debug)]
struct Log {
    id: usize,
    value: f64
}

fn main() {
    // create an instance of WAL
    let wal = Wal::new("/tmp/logz", Some(2000)).unwrap();

    // recovery: Option A (read all data at once)
    // This reads every log into memory, so use it only if all the logs
    // can fit in memory (this depends on the storage size).
    let all_logs = wal.iter().unwrap().collect::<Vec<_>>();

    // recovery: Option B (read entry by entry)
    // This reads data in chunks of the page size (default: 4 KB).
    // It is memory efficient and ideal when you have a large number of logs.
    for entry in wal.iter().unwrap() {
        let log: Log = entry.to_struct().unwrap();
        dbg!(log);
    }

    // start writing
    wal.append_struct(Log { id: 1, value: 3.14 }).unwrap();
    wal.append(b"raw binary data").unwrap();
}
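
The idle/read/write rules from this section can be modelled as a small state machine. This is only a sketch of the documented behaviour, not the library's internals: the `Mode` and `ModeTracker` types are hypothetical, reads after writing are reported as an error here (the library itself panics in that case), and rejecting a second read while one is in progress is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Idle,
    Read,
    Write,
}

/// Hypothetical tracker for the current mode of a WAL instance.
struct ModeTracker {
    mode: Mode,
}

impl ModeTracker {
    fn new() -> Self {
        ModeTracker { mode: Mode::Idle }
    }

    /// Start reading; only allowed from idle mode.
    fn begin_read(&mut self) -> Result<(), &'static str> {
        match self.mode {
            Mode::Idle => {
                self.mode = Mode::Read;
                Ok(())
            }
            // Assumption: overlapping reads are rejected.
            Mode::Read => Err("a read is already in progress"),
            Mode::Write => Err("cannot read after writing has started"),
        }
    }

    /// Reading finished; revert to idle mode.
    fn end_read(&mut self) {
        if self.mode == Mode::Read {
            self.mode = Mode::Idle;
        }
    }

    /// Attempt a write. Returns false if it is ignored because a read is active.
    fn try_write(&mut self) -> bool {
        match self.mode {
            Mode::Read => false,
            Mode::Idle | Mode::Write => {
                self.mode = Mode::Write;
                true
            }
        }
    }
}
```

The one-way transition into write mode is the reason recovery should run first: once `try_write` succeeds, `begin_read` can never succeed again on the same tracker.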

Known issues

  • None at the moment.