⚡️ UltraSlayer – DRAM Refresh‑Stall Killer
UltraSlayer is a lock‑free, hardware‑aware memory slab that eliminates the “DRAM refresh stall” (tREFI) tail‑latency that destroys nanosecond‑level determinism in High‑Frequency‑Trading (HFT) and other ultra‑low‑latency workloads.
It mirrors every hot‑path object across a configurable number of physical DRAM channels and lets a dedicated Slayer Core race the reads in parallel, so a request stalls only when every channel is refreshing at the same moment.
⚠️ WARNING – UltraSlayer uses unsafe, volatile loads/stores and a core that spins 100 % of the time. Use it only for the critical hot‑path of a latency‑sensitive application.
🎯 The Problem – DRAM “Tail”
| Situation | Latency |
|---|---|
| Normal DRAM read | ≈ 60 ns |
| Read that hits a refresh (tREFI) | ≈ 200 ns + (spike) |
A single 200 ns stall can be the difference between a profitable trade and a missed opportunity.
🚀 The Solution – Hardware Hedging
| Step | What UltraSlayer does |
|---|---|
| Mirroring | Stores each hot object on N distinct DRAM channels (different DIMMs / banks). |
| Slayer Core | A dedicated thread, pinned to a physical core, issues N parallel reads at the pipeline level. |
| Race‑to‑first | The first response that arrives is returned; the other reads are discarded. |
| Deterministic latency | If one channel is mid‑refresh with probability $p$, all N channels are refreshing at once with probability about $p^N$ (assuming independent refresh timing) → tail is dramatically reduced. |
The Slayer Core spins continuously so that its CPU never drops into a deep C‑state, whose exit latency would re‑introduce jitter.
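As a rough sanity check on the hedging argument, the sketch below estimates how often every mirror would be mid‑refresh at once. It assumes each channel refreshes independently and is busy for a fraction tRFC / tREFI of the time. The function names are hypothetical, and the 350 ns and 7 800 ns figures used in the note afterwards are typical DDR4 datasheet values chosen for illustration, not measurements from UltraSlayer.

```rust
/// Fraction of time a single channel spends refreshing: tRFC / tREFI.
/// Both arguments are in nanoseconds.
fn refresh_duty_cycle(t_rfc_ns: f64, t_refi_ns: f64) -> f64 {
    t_rfc_ns / t_refi_ns
}

/// Probability that all `channels` mirrors are refreshing at the same
/// instant, ASSUMING their refresh schedules are independent.
fn all_channels_stalled(p_single: f64, channels: u32) -> f64 {
    p_single.powi(channels as i32)
}
```

With the assumed numbers, `refresh_duty_cycle(350.0, 7800.0)` is about 4.5 %, so two mirrors would stall together roughly 0.2 % of the time and four mirrors roughly 0.0004 % of the time. If the memory controller synchronises refresh across channels, the independence assumption fails and the benefit shrinks.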
❤️ Inspiration – Laurie Wired’s TailSlayer
UltraSlayer is a Rust port of the original TailSlayer implementation created by Laurie Wired.
- TailSlayer (C++ version) – https://github.com/LaurieWired/tailslayer
- Video explanation (Laurie Wired) – https://www.youtube.com/watch?v=KKbgulTp3FE
Laurie Wired’s work introduced the concept of hardware‑level hedging to eliminate DRAM refresh‑stall tail latency. UltraSlayer adapts that concept to safe‑ish Rust while preserving the same deterministic guarantees.
🛠️ New Features (v0.2)
| Feature | Description |
|---|---|
| Configurable channel count | --channels N (2‑8 mirrors). |
| Cross-Platform Memory | Native mmap (Linux) and VirtualAlloc (Windows) support. |
| Huge‑Page support | Uses MAP_HUGETLB on Linux $\rightarrow$ far fewer TLB misses. |
| Spin policies | busy (full spin), hybrid (spin → yield), sleep (periodic pause). |
| Side‑car (sidecar feature) | Builds a cdylib with a tiny C‑FFI (ul_init, ul_read_u64, …). |
| POSIX Shared‑Memory wrapper (src/shm.rs) | ShmSlab<T> lets multiple processes map the same slab via /dev/shm (Linux only). |
| Criterion benchmark harness (benchmark feature) | benches/read_latency.rs measures nanosecond read latency for 2/4/8 channels. |
| CLI demo binary (cli feature) | examples/ultraslayer_cli.rs parses flags, creates the slab, starts the core, and idles. |
| Zero‑copy slice view (slice feature) | src/slice.rs exposes a raw‑pointer slice for bulk reads without copying. |
| Full LTO + thin‑LTO options | Optimised release builds for the smallest, fastest binary. |
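The three spin policies can be pictured as different back‑off steps inside a polling loop. The sketch below is one plausible reading of busy, hybrid and sleep, not the crate's actual implementation: the `SpinPolicy` type, the 1 000‑iteration threshold and the 50 µs pause are all assumptions made for illustration.

```rust
use std::hint;
use std::thread;
use std::time::Duration;

/// Hypothetical mirror of the `--spin` flag values.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SpinPolicy {
    Busy,
    Hybrid,
    Sleep,
}

impl SpinPolicy {
    /// Parse the flag text; returns `None` for an unknown policy.
    fn parse(text: &str) -> Option<Self> {
        match text {
            "busy" => Some(SpinPolicy::Busy),
            "hybrid" => Some(SpinPolicy::Hybrid),
            "sleep" => Some(SpinPolicy::Sleep),
            _ => None,
        }
    }

    /// One back-off step, called when a poll found no work.
    /// `idle_iters` counts consecutive empty polls.
    fn wait(self, idle_iters: u32) {
        match self {
            // Full spin: only emit the CPU's spin-wait hint.
            SpinPolicy::Busy => hint::spin_loop(),
            // Spin first, then start yielding to the scheduler
            // (the threshold is an assumed value).
            SpinPolicy::Hybrid if idle_iters < 1_000 => hint::spin_loop(),
            SpinPolicy::Hybrid => thread::yield_now(),
            // Periodic pause (the duration is an assumed value).
            SpinPolicy::Sleep => thread::sleep(Duration::from_micros(50)),
        }
    }
}
```

Only the busy policy keeps the core permanently hot; the other two trade some determinism for lower CPU usage.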
📋 System Requirements
| Requirement | Linux | Windows |
|---|---|---|
| Kernel/OS | Kernel $\ge$ 5.10 | Windows 10 / 11 |
| Huge Pages | sudo sysctl -w vm.nr_hugepages=2048 | Standard Virtual Memory (Automatic) |
| DRAM Channels | $\ge$ 2 physical channels | $\ge$ 2 physical channels |
| CPU Affinity | taskset / chrt | SetThreadAffinityMask (via core_affinity) |
| Permissions | Root/Sudo for Huge Pages/RT priority | Administrator for certain memory flags |
📦 Getting Started – Build & Install
1️⃣ Clone the repository
2️⃣ Build the core library (default)
cargo build --release
3️⃣ Optional builds
| Goal | Cargo command | What you get |
|---|---|---|
| CLI demo (examples/ultraslayer_cli.rs) | cargo build --release --features cli | target/release/examples/ultraslayer_cli |
| C‑FFI side‑car (libultraslayer.so) | cargo build --release --features sidecar | target/release/libultraslayer.so |
| Zero‑copy slices (src/slice.rs) | cargo build --release --features slice | Enables the .slice() API in UltraSlayer |
| Benchmark harness (Criterion) | cargo bench --features benchmark | Runs benches/read_latency.rs and prints latency tables |
| All features | cargo build --release --features "cli sidecar slice benchmark" | Everything compiled together |
The release profile already uses full LTO, opt-level = 3, panic = "abort" and a single codegen unit for maximum inlining.
▶️ Running UltraSlayer (demo binary)
Linux (with Real‑Time priority)
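# Use the binary located in the examples directory
# (RT priority 99 and core 2 are example values; adjust for your machine)
sudo chrt -f 99 taskset -c 2 ./target/release/examples/ultraslayer_cli --channels 4 --size 2GiB --spin busy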
Windows
# Use the binary located in the examples directory
.\target\release\examples\ultraslayer_cli.exe --channels 4 --size 2GiB --spin busy
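Alternatively, run directly via Cargo:
cargo run --release --features cli --example ultraslayer_cli -- --channels 4 --size 2GiB --spin busy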
Flags
| Flag | Meaning |
|---|---|
| --channels N | Number of DRAM mirrors (default 2). |
| --size <bytes> | Total slab size per channel (e.g. 2GiB, 512MiB). |
| --spin <policy> | busy, hybrid, or sleep (default busy). |
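The `--size` examples use binary suffixes (2GiB, 512MiB). A minimal sketch of how such a value could be turned into a byte count follows. `parse_size` is a hypothetical helper, and the set of accepted suffixes is an assumption; the real CLI may accept more or fewer forms.

```rust
/// Parse a size such as "2GiB" or "512MiB" into a number of bytes.
/// Returns `None` for a missing number, an unknown suffix, or overflow.
fn parse_size(text: &str) -> Option<u64> {
    let text = text.trim();
    // Split at the first non-digit character: "512MiB" -> ("512", "MiB").
    let split = text
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(text.len());
    let (digits, unit) = text.split_at(split);
    let value: u64 = digits.parse().ok()?;
    let multiplier: u64 = match unit {
        "" | "B" => 1,
        "KiB" => 1 << 10,
        "MiB" => 1 << 20,
        "GiB" => 1 << 30,
        _ => return None,
    };
    value.checked_mul(multiplier)
}
```

Because the size is per channel, the total memory reserved is this value multiplied by `--channels`.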
📊 Benchmarking
UltraSlayer ships with two ways to benchmark latency.
1️⃣ Criterion read‑latency benchmark
cargo bench --features benchmark
The benchmark creates slabs with 2, 4, and 8 channels, fills them with deterministic data, then performs 1 000 000 random reads per configuration while measuring nanosecond‑resolution latency.
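The measurement idea (random reads, per‑read timing, then tail percentiles) can be sketched with the standard library alone. This is not the code in benches/read_latency.rs: `measure_reads`, `percentile` and the xorshift generator are hypothetical names, and the sketch reads from an ordinary slice rather than a mirrored slab.

```rust
use std::time::Instant;

/// Tiny xorshift PRNG so the sketch needs no external crates.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Time `ops` random reads from `data` (must be non-empty) and return
/// the per-read latencies in nanoseconds, sorted ascending.
fn measure_reads(data: &[u64], ops: usize) -> Vec<u64> {
    let mut rng = 0x9E37_79B9_7F4A_7C15u64;
    let mut samples = Vec::with_capacity(ops);
    for _ in 0..ops {
        let idx = (xorshift64(&mut rng) % data.len() as u64) as usize;
        let start = Instant::now();
        // Volatile read so the compiler cannot elide the load.
        let value = unsafe { std::ptr::read_volatile(&data[idx]) };
        let elapsed = start.elapsed().as_nanos() as u64;
        std::hint::black_box(value);
        samples.push(elapsed);
    }
    samples.sort_unstable();
    samples
}

/// Nearest-rank percentile of a non-empty, ascending-sorted sample set.
/// `pct` is in the range 0.0..=100.0.
fn percentile(sorted: &[u64], pct: f64) -> u64 {
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}
```

Comparing `percentile(&samples, 50.0)` with `percentile(&samples, 99.9)` exposes the tail. Note that a call to `Instant::now()` itself costs tens of nanoseconds on most systems, so a serious harness would time batches or read a cycle counter directly.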
2️⃣ Stand‑alone micro‑benchmark binary
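Via Cargo:
cargo run --release --features benchmark --example benchmark -- --channels 4 --size 2GiB --ops 1_000_000 --spin busy
Via binary (for Real-Time priority on Linux):
# First, build the example
cargo build --release --features benchmark --example benchmark
# Then run the resulting binary from the examples folder (RT priority 99 is an example value)
sudo chrt -f 99 ./target/release/examples/benchmark --channels 4 --size 2GiB --ops 1_000_000 --spin busy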
Windows execution:
.\target\release\examples\benchmark.exe --channels 4 --size 2GiB --ops 1_000_000 --spin busy
🔌 Integration Guide
UltraSlayer is designed as the "Hot Storage" layer for your most critical data. Because the Slayer Core must spin at 100% CPU to maintain determinism, it should be treated as a dedicated hardware service rather than a standard library.
🏗️ The Sidecar Architecture (For C++, Python, Node.js)
If your strategy or risk engine is not written in Rust, use the Sidecar Model. In this pattern, UltraSlayer runs as a native shared library (.so or .dll), managing the hardware mirroring and the spinning core, while your application interacts with it via a lean C-FFI.
1. Build the Sidecar:
# Produces target/release/libultraslayer.so (Linux) or .dll (Windows)
cargo build --release --features sidecar
2. The Integration Workflow: The Sidecar uses an opaque handle pattern. You initialize the slab, start the hardware engine, and then perform volatile reads/writes using that handle.
| Step | C-API Function | Purpose |
|---|---|---|
| Init | ul_init(channels, size) | Allocates mirrored DRAM and returns a handle. |
| Ignite | ul_start_core(handle) | Spawns the spinning Slayer Core on a physical CPU. |
| Access | ul_read_u64(handle, idx) | Performs a hedged, stall-immune read. |
| Update | ul_write_u64(handle, idx, val) | Mirrored write to all DRAM channels. |
| Tear Down | ul_destroy(handle) | Stops the core and frees the slab. |
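To make the opaque‑handle pattern concrete, here is a minimal sketch of how the Rust side of such an API could be shaped. The function names come from the table above and the argument types follow the ffi-napi declarations in the Node.js section; everything else is an assumption: the `Slab` struct, the use of atomics, returning null on bad input, and returning 0 for an out‑of‑range index. `ul_start_core` is omitted, and the read consults only the first mirror rather than racing all of them.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the real slab: one Vec of atomics per mirror.
pub struct Slab {
    mirrors: Vec<Vec<AtomicU64>>,
}

// A real cdylib would also mark these functions #[no_mangle]
// (#[unsafe(no_mangle)] on the 2024 edition) so the symbols are exported.

/// Allocate `channels` mirrors of `size` bytes each and return an opaque
/// handle, or null if either argument is too small.
pub extern "C" fn ul_init(channels: u32, size: usize) -> *mut Slab {
    let slots = size / std::mem::size_of::<u64>();
    if channels == 0 || slots == 0 {
        return std::ptr::null_mut();
    }
    let mirrors: Vec<Vec<AtomicU64>> = (0..channels)
        .map(|_| (0..slots).map(|_| AtomicU64::new(0)).collect::<Vec<_>>())
        .collect();
    // Ownership moves to the caller until ul_destroy is called.
    Box::into_raw(Box::new(Slab { mirrors }))
}

/// Mirrored write: store `val` at `idx` in every copy.
/// A null handle or out-of-range index is ignored.
pub unsafe extern "C" fn ul_write_u64(handle: *mut Slab, idx: usize, val: u64) {
    if let Some(slab) = unsafe { handle.as_ref() } {
        for mirror in &slab.mirrors {
            if let Some(slot) = mirror.get(idx) {
                slot.store(val, Ordering::Release);
            }
        }
    }
}

/// Read slot `idx` from the first mirror (no hedging in this sketch).
/// Returns 0 for a null handle or out-of-range index.
pub unsafe extern "C" fn ul_read_u64(handle: *mut Slab, idx: usize) -> u64 {
    let slab = match unsafe { handle.as_ref() } {
        Some(slab) => slab,
        None => return 0,
    };
    slab.mirrors[0]
        .get(idx)
        .map_or(0, |slot| slot.load(Ordering::Acquire))
}

/// Reclaim the slab. The handle must not be used afterwards.
pub unsafe extern "C" fn ul_destroy(handle: *mut Slab) {
    if !handle.is_null() {
        drop(unsafe { Box::from_raw(handle) });
    }
}
```

The caller never sees the layout of `Slab`; it only passes the pointer back, which is what lets Python, Node.js or C++ hold the handle without knowing anything about Rust types.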
🐍 Python Integration
Using ctypes, Python can treat UltraSlayer as a high-performance backend. Since the FFI uses uint64, we use the struct module to handle floating-point prices.
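import ctypes
import struct
# Load the shared library
lib = ctypes.CDLL("./libultraslayer.so")
# Define function signatures
lib.ul_init.argtypes = [ctypes.c_uint32, ctypes.c_size_t]
lib.ul_init.restype = ctypes.c_void_p
lib.ul_start_core.argtypes = [ctypes.c_void_p]
lib.ul_read_u64.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
lib.ul_read_u64.restype = ctypes.c_uint64
lib.ul_write_u64.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_uint64]
# 1. Setup: 4 channels, 1GiB slab
handle = lib.ul_init(4, 1024 * 1024 * 1024)
lib.ul_start_core(handle)
# 2. Hot-Path: Write a float price as u64 bits
price = 1250.50
bits = struct.unpack("<Q", struct.pack("<d", price))[0]
lib.ul_write_u64(handle, 0, bits)
# 3. Hot-Path: Read with DRAM-stall immunity
raw = lib.ul_read_u64(handle, 0)
value = struct.unpack("<d", struct.pack("<Q", raw))[0]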
🟢 Node.js Integration
Using ffi-napi, Node.js can interface with the slab. Note that ffi-napi (through ref-napi) returns a C uint64 as a Number when the value fits in the safe-integer range and as a decimal string otherwise, and it expects a Number or a string, not a BigInt, for uint64 arguments.
const ffi = require('ffi-napi');
const lib = ffi.Library('./libultraslayer.so', {
'ul_init': ['pointer', ['uint32', 'size_t']],
'ul_start_core': ['int', ['pointer']],
'ul_read_u64': ['uint64', ['pointer', 'size_t']],
'ul_write_u64': ['void', ['pointer', 'size_t', 'uint64']],
'ul_destroy': ['void', ['pointer']],
});
// 1. Setup: 4 channels, 1GiB slab
const handle = lib.ul_init(4, 1024 * 1024 * 1024);
lib.ul_start_core(handle);
// 2. Hot-Path: Read a price (a Number, or a string above 2^53 - 1)
const price = lib.ul_read_u64(handle, 0);
console.log(`DRAM-hedged price: ${price}`);
// 3. Hot-Path: Update price
lib.ul_write_u64(handle, 0, 999888777);
lib.ul_destroy(handle);
🦀 Pure Rust Integration
For Rust-native engines, UltraSlayer<T> provides a type-safe wrapper. The key is to initialize the slab once and share it via Arc across your strategy threads.
use std::sync::Arc;
use ultraslayer::UltraSlayer; // crate path assumed from the project layout
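A minimal sketch of the initialise‑once, share‑via‑Arc pattern follows. `MirroredSlab` is a hypothetical stand‑in written for this example, not the crate's `UltraSlayer<T>` type, and its `new`, `write` and `read` methods are assumed names; the read here consults a single mirror instead of racing them.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for `UltraSlayer<u64>`: N mirrors of a fixed-size slab.
struct MirroredSlab {
    mirrors: Vec<Vec<AtomicU64>>,
}

impl MirroredSlab {
    fn new(channels: usize, slots: usize) -> Self {
        let mirrors = (0..channels)
            .map(|_| (0..slots).map(|_| AtomicU64::new(0)).collect::<Vec<_>>())
            .collect();
        Self { mirrors }
    }

    /// Mirrored write: update the slot in every copy.
    fn write(&self, idx: usize, val: u64) {
        for mirror in &self.mirrors {
            mirror[idx].store(val, Ordering::Release);
        }
    }

    /// Read from the first mirror; the real crate would race all of them.
    fn read(&self, idx: usize) -> u64 {
        self.mirrors[0][idx].load(Ordering::Acquire)
    }
}

/// Initialise the slab once, then hand a cheap `Arc` clone to each
/// strategy thread. Returns the raw bits each thread observed in slot 0.
fn run_strategies(threads: usize) -> Vec<u64> {
    let slab = Arc::new(MirroredSlab::new(4, 1024));
    // Prices travel as u64 bit patterns, as in the FFI examples.
    slab.write(0, 1250.50f64.to_bits());

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let slab = Arc::clone(&slab);
            thread::spawn(move || slab.read(0))
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Cloning the `Arc` only bumps a reference count, so every thread reads the same mirrored memory; `f64::from_bits` turns the returned bits back into a price.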
🤝 Inter-Process Communication (IPC)
If your Market Data Feed and your Trading Strategy live in different processes, use the ShmSlab wrapper. This uses POSIX shared memory to map the mirrored slab into multiple address spaces.
Process A (The Feed):
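use ultraslayer::shm::ShmSlab; // module path assumed from src/shm.rs
// Create the shared mirrored slab
let shm = ShmSlab::create(…)?;
shm.write(…);
Process B (The Strategy):
use ultraslayer::shm::ShmSlab; // module path assumed from src/shm.rs
// Open the existing mirrored slab
let shm = ShmSlab::open(…)?;
let price = shm.read(…); // Deterministic read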
⚙️ Integration Summary
| Goal | Pattern | Best For... |
|---|---|---|
| Max Performance | Pure Rust | Native HFT engines. |
| Polyglot Stack | Sidecar (.so) | Python/Node.js strategies with a Rust driver. |
| Multi-Process | ShmSlab | Separating Feed, Risk, and Strategy into different PIDs. |
| Bulk Processing | Slice feature | Dumping the entire slab to a network buffer/log. |
📁 Project Layout
ultraslayer/
├─ src/
│ ├─ lib.rs ← public UltraSlayer API
│ ├─ slab.rs ← low‑level mirroring & volatile ops
│ ├─ arch.rs ← CPU‑affinity helpers
│ ├─ reader.rs ← internal read‑path logic
│ ├─ main.rs ← optional entry point
│ ├─ shm.rs ← POSIX shared‑memory wrapper (Linux)
│ ├─ ffi.rs ← C‑FFI side‑car (feature = "sidecar")
│ └─ slice.rs ← zero‑copy slice view
├─ benches/
│ └─ read_latency.rs ← Criterion read‑latency benchmark
├─ examples/
│ ├─ ultraslayer_cli.rs ← CLI demo binary (feature = "cli")
│ └─ benchmark.rs ← micro‑benchmark binary (feature = "benchmark")
├─ Cargo.toml
└─ README.md ← this file
📜 License
UltraSlayer is released under the Apache License, Version 2.0.
TL;DR – Quick start for a typical HFT node
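# 1️⃣ Reserve huge pages (Linux only)
sudo sysctl -w vm.nr_hugepages=2048
# 2️⃣ Build with the CLI demo + side‑car + slice view
cargo build --release --features "cli sidecar slice"
# 3️⃣ Run the demo (core 2, 4 channels, 2 GiB per channel)
taskset -c 2 ./target/release/examples/ultraslayer_cli --channels 4 --size 2GiB --spin busy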
UltraSlayer – the practical, Rust‑native answer to Laurie Wired’s TailSlayer concept. 🚀