RingAl - Efficient Ring Allocator for Short-lived Buffers

Overview

RingAl is a highly efficient ring allocator designed specifically for the allocation of short-lived buffers. The allocator operates in a circular manner, which allows for fast and inexpensive buffer allocations, provided they are ephemeral in nature. It is crucial that these allocations are short-lived; otherwise, the allocator may become clogged with long-lived allocations, rendering it inefficient.

Primary Use Case:

Preallocation: Establish a backing store of size N bytes.
Small Buffer Allocation: Allocate buffers from the backing store, where M < N.
Buffer Utilization: Use the allocated buffer across threads if necessary. Buffers can be cloned efficiently, akin to using an Arc.
Timely Deallocation: Ensure buffers are dropped before the allocator cycles back to the same memory region.
Recycled Storage: Upon buffer deallocation, the backing store becomes available for subsequent allocations.

Design Philosophy

RingAl library focuses on a robust and versatile memory management system, through the use of dynamic and self descriptive backing store. This system is engineered to accommodate a wide range of allocation requirements through the use of guard sequences. These guard sequences are capable of adjusting dynamically to different allocation conditions by being created, modified, and removed as necessary. The system effectively manages these memory guards using a single usize head pointer.

The design is structured to ensure both safe and efficient multithreaded buffer operations. It grants exclusive write access to one thread while allowing another thread to read simultaneously. Although this design does inevitably lead to race conditions, particularly when the writing thread releases the buffer, without proper synchronization with reading or allocating thread, these issues are addressed through a method of optimistic availability checks. If the allocating thread finds a buffer in use, it returns None, indicating to the caller to retry the operation. This method avoids the need for expensive atomic synchronization by relying on eventual consistency, which is suitable for the use cases of this allocator.

It is important to note that this allocator is not marked as Sync, which restricts its concurrent use across multiple threads. All allocation actions require &mut self, inherently preventing the allocator from being enclosed within an Arc. Using locks around the allocator is not recommended as it can greatly reduce performance; instead, it is advisable to use thread-local storage.

Guard Insights:

Each guard encodes:

A flag indicating whether the guarded memory region is in use.
The address of the next guard in the backing store.

Allocation Scenarios:

When an allocation request is made, RingAl assesses the current guard, which can result in one of four scenarios:

Exact Fit: The requested size matches the guarded region. The guard is marked as occupied, the pointer shifts to the next guard, and the buffer is returned.
Oversized Guard: The guarded region exceeds the requested size. The region is split, a new guard is established for the remainder, and the buffer of the requested size is returned. This can lead to fragmentation.
Undersized Guard: The guarded region is smaller than required. The allocator proceeds to merge subsequent regions until the requested size is met, effectively defragmenting the storage. Only the initial guard persists.
Insufficient Capacity: Even after merging, the accumulated buffer is insufficient. The allocation fails, returning None.

   allocator
      |
      v
-----------------------------------------------------------------------------------
| head canary | N bytes | guard1 | L bytes | guard2 | M bytes | ... | tail canary |
-----------------------------------------------------------------------------------
     |                    ^   |              ^   |              ^ |          ^ 
     |                    |   |              |   |              | |          |
     ----------------------   ----------------   ---------------- ------------
     ^                                                                       |
     |                                                                       |
     -------------------------------------------------------------------------

Note: Head and Tail canaries are standard guard sequences that persist, with the Tail canary perpetually pointing to the Head, forming a circular (ring) structure.

Features

Dynamic Fragmentation and Defragmentation: Facilitates variable-size allocations through adaptive backing store management.
Extendable Buffers: Allow dynamic reallocations akin to Vec<u8>, typically inexpensive due to minimal pointer arithmetic and no data copy. Such reallocations may fail if capacity limits are reached.
Fixed-Size Buffers: Unexpandable but more efficient due to simpler design, with safe cross-thread transportation. They make storage available upon deallocation.
Read-Only Buffers: Fixed-size buffers that are easily cloneable and distributable across multiple threads. These involve an additional heap allocation for a reference counter and should be avoided unless necessary to prevent overhead.

For more details, visit the RingAl Documentation.

Optional Crate Features (Cargo)

tls (Thread-Local Storage): This feature enables advanced functionalities related to thread-local storage within the allocator. By activating tls, developers can initiate allocation requests from any point in the codebase, thereby eliminating the cumbersome need to pass the allocator instance explicitly. This enhancement streamlines code ergonomics, albeit with a slight performance trade-off due to the utilization of RefCell for managing thread-local data.
drop (Allocator Deallocation): Typically, the allocator is designed to remain active for the duration of the application's execution. However, in scenarios where early deallocation of the allocator and its associated resources is required, activating the drop feature is essential. This feature implements a tailored Drop mechanism that blocks (by busy wating) the executing thread until all associated allocations are conclusively released, subsequently deallocating the underlying storage. It is critical to ensure allocations do not extend significantly beyond the intended drop point. Failure to enable this feature will result in a memory leak upon attempting to drop the allocator.

Usage examples

Extendable buffer

let mut allocator = RingAl::new(1024); // Create an allocator with initial size
let mut buffer = allocator.extendable(64).unwrap();
// the slice length exceeds preallocated capacity of 64
let msg = b"hello world, this message is longer than allocated capacity, but buffer will
grow as needed during the write, provided that allocator still has necessary capacity";
// but we're still able to write the entire message, as the buffer grows dynamically
let size = buffer.write(msg).unwrap();
// until the ExtBuf is finalized or dropped no further allocations are possible
let fixed = buffer.finalize();
assert_eq!(fixed.as_ref(), msg);
assert_eq!(fixed.len(), size);

Fixed buffer

let mut allocator = RingAl::new(1024); // Create an allocator with initial size
let mut buffer = allocator.fixed(256).unwrap();
let size = buffer.write(b"hello world, this message is relatively short").unwrap();
// we have written some some bytes 
assert_eq!(buffer.len(), size);
// but we still have some capacity left for more writes if necessary
assert_eq!(buffer.spare(), 256 - size);

Multi-threaded environment

let mut allocator = RingAl::new(1024); // Create an allocator with initial size
let (tx, rx) = channel();
let mut buffer = allocator.fixed(64).unwrap();
let _ = buffer.write(b"this a message to other thread").unwrap();
// send the buffer to another thread
let handle = std::thread::spawn(move || {
    let buffer = rx.recv().unwrap();
    // from another thread, freeze the buffer, making it readonly
    let readonly = buffer.freeze();
    let mut handles = Vec::with_capacity(16);
    for i in 0..16 {
        let (tx, rx) = channel();
        // send the clones (cheap) of readonly buffer to more threads
        let h = std::thread::spawn(move || {
            let msg = rx.recv().unwrap();
            let msg = std::str::from_utf8(&msg[..]).unwrap();
            println!("{i}. {msg}");
        });
        tx.send(readonly.clone());
        handles.push(h);
    }
    for h in handles {
        h.join();
    }
});
tx.send(buffer);
handle.join();

Thread Local Storage

ringal!(@init, 1024);
// allocate fixed buffer
let mut fixed = ringal!(@fixed, 64).unwrap();
let _ = fixed.write(b"hello world!").unwrap();
// allocate extendable buffer and write some data to it
ringal!{@ext, 64, |extendable| {
    let _ = extendable.write(b"hello world!").unwrap();
    extendable.finalize()
}};

Benchmarks

Benchmark comparisons are made between two buffer allocator implementations: RingAl and System allocator used by Vec<u8>. The benchmarks measure performance across different scenarios with varying parameters.

Parameters Explanation

Iterations: Number of operations performed
Buffer Size: Maximum number of bytes that could be written to buffer, the actual number is random
Max Buffers: Maximum number of buffers that are kept around after allocation

Fixed Preallocated Buffer Results

Scenario:

Allocate the buffer with respective allocator, the buffer capacity is equal to or greater than the size of data to be written
Write random number of bytes via Write trait implementation, upper limit is capped at Buffer Size
Send the buffer to another thread via unbounded channel, to prevent immediate deallocation
After enough buffers are collected on other thread, drop them all in one batch
Perform this sequence Iterations times

Iterations	Buffer Size	Max Buffers	ringal (ms)	vec (ms)	Speed Improvement
10,000,000	64	64	696.7	998.0	1.43x
10,000,000	1,024	64	923.4	2,062.0	2.23x
10,000,000	1,024	1,024	913.7	1,622.0	1.77x
1,000,000	65,536	64	932.6	1,718.0	1.84x
1,000,000	131,072	1,024	1,709.0	3,960.0	2.32x

Extendable Buffer Results (Chunked Writes)

Scenario:

Create an empty buffer with respective allocator, the buffer capacity is 0
Write random number of bytes via Write trait implementation, upper limit is capped at Buffer Size, the writes are performed in chunks of 64 bytes, forcing the buffers to grow dynamically and keep reallocating.
Send the buffer to another thread via unbounded channel, to prevent immediate deallocation
After enough buffers are collected on other thread, drop them all in one batch
Perform this sequence Iterations times

Iterations	Buffer Size	Max Buffers	ringal (ms)	vec (ms)	Speed Improvement
10,000,000	64	64	834.0	1,120.0	1.34x
10,000,000	1,024	64	1,164.0	3,228.0	2.77x
10,000,000	1,024	1,024	1,205.0	3,044.0	2.53x
1,000,000	65,536	64	1,958.0	3,646.0	1.86x
1,000,000	131,072	1,024	3,588.0	9,008.0	2.51x

Key Findings

The ringal implementation consistently outperforms the vec implementation across all test scenarios
Performance improvements range from 1.34x to 2.77x faster
Largest performance gains are observed with larger buffer sizes
Both fixed preallocated and extendable buffer scenarios show significant improvements

System Information

All benchmarks were run using hyperfine with 10 runs per test case. Times shown are mean values.

This setup was used to run all benchmarks mentioned in this document:

Device: Apple MacBook Air
Processor: Apple M2
Memory: 16 GB RAM

Dependencies

The crate is designed without any external dependencies, and only relies on standard library

Planned features

Allocation of buffers with generic types

Safety

This library is the epitome of cautious engineering! Well, that's what we'd love to claim, but the truth is it's peppered with unsafe blocks. At times, it seems like the code is channeling its inner C spirit, with raw pointer operations lurking around every corner. But in all seriousness, considerable effort has been devoted to ensuring that the safe API exposed by this crate is truly safe for users and doesn't invite any unwelcome Undefined Behaviors or other nefarious calamities. Proceed with confidence...

ringal 0.2.0

RingAl - Efficient Ring Allocator for Short-lived Buffers

Overview

Primary Use Case:

Design Philosophy

Guard Insights:

Allocation Scenarios:

Features

Optional Crate Features (Cargo)

Usage examples

Extendable buffer

Fixed buffer

Multi-threaded environment

Thread Local Storage

Benchmarks

Parameters Explanation

Fixed Preallocated Buffer Results

Extendable Buffer Results (Chunked Writes)

Key Findings

System Information

Dependencies

Planned features

Safety