⚡️ UltraSlayer – DRAM Refresh‑Stall Killer
UltraSlayer is a lock‑free, hardware‑aware memory slab that targets the DRAM refresh‑stall tail latency (a read that collides with a refresh cycle, which recurs every tREFI) that destroys nanosecond‑level determinism in high‑frequency trading (HFT) and other ultra‑low‑latency workloads.
It mirrors every hot‑path object across a configurable number of physical DRAM channels and lets a dedicated Slayer Core race the reads in parallel, so a refresh on one channel rarely delays the request: another channel is very likely to answer first.
⚠️ WARNING – UltraSlayer uses unsafe, volatile loads/stores and a core that spins 100 % of the time. Use it only for the critical hot‑path of a latency‑sensitive application.
🎯 The Problem – DRAM “Tail”
| Situation | Latency |
|---|---|
| Normal DRAM read | ≈ 60 ns |
| Read that collides with a refresh (recurs every tREFI) | ≈ 200 ns or more (spike) |
A single 200 ns jitter can be the difference between a profitable trade and a missed opportunity.
🚀 The Solution – Hardware Hedging
| Step | What UltraSlayer does |
|---|---|
| Mirroring | Stores each hot object on N distinct DRAM channels (different DIMMs / banks). |
| Slayer Core | A dedicated thread, pinned to a physical core, issues N parallel reads at the pipeline level. |
| Race‑to‑first | The first response that arrives is returned; the other reads are discarded. |
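| Deterministic latency | A read stalls only if every mirror is mid‑refresh at the same moment. With independent refresh timing and a per‑channel stall probability $p$, that happens with probability $p^N$ instead of $p$, so the tail is dramatically reduced. |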
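To put a rough number on the tail reduction, here is a back‑of‑the‑envelope model. It assumes the channels refresh independently of each other (a controller that refreshes channels in lockstep would break this), and the timings are illustrative DDR4‑class values, not measurements from UltraSlayer:

$$
p \approx \frac{t_{\mathrm{RFC}}}{t_{\mathrm{REFI}}}, \qquad
P(\text{all } N \text{ channels stalled}) = p^{N}
$$

With $t_{\mathrm{RFC}} = 350$ ns and $t_{\mathrm{REFI}} = 7.8$ µs, $p \approx 0.045$. Two mirrors give $p^2 \approx 2.0 \times 10^{-3}$ and four mirrors give $p^4 \approx 4.1 \times 10^{-6}$.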
The Slayer Core spins continuously so that its physical core stays hot and never pays a C‑state exit, which would re‑introduce jitter.
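The mirroring and race‑to‑first steps can be illustrated with a minimal, std‑only sketch. This is not UltraSlayer's implementation: `MirroredSlab` and `hedged_read` are hypothetical names, the replicas are ordinary heap vectors (plain Rust cannot place them on specific DRAM channels), and a thread is spawned per read purely for clarity, where a real hot path would keep pinned, spinning readers.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Hypothetical mirrored slab: `replicas[c][i]` is the copy of slot `i`
/// that would live on DRAM channel `c`.
struct MirroredSlab {
    replicas: Vec<Vec<AtomicU64>>,
}

impl MirroredSlab {
    fn new(channels: usize, slots: usize) -> Self {
        let replicas: Vec<Vec<AtomicU64>> = (0..channels)
            .map(|_| (0..slots).map(|_| AtomicU64::new(0)).collect::<Vec<_>>())
            .collect();
        Self { replicas }
    }

    /// Write the value to every replica so any of them can serve a read.
    fn write(&self, idx: usize, value: u64) {
        for replica in &self.replicas {
            replica[idx].store(value, Ordering::Release);
        }
    }

    /// Race one reader per replica; only the first finisher publishes.
    /// Note: `thread::scope` still joins every reader before returning.
    fn hedged_read(&self, idx: usize) -> u64 {
        let done = AtomicBool::new(false);
        let result = AtomicU64::new(0);
        thread::scope(|s| {
            for replica in &self.replicas {
                let (done, result) = (&done, &result);
                s.spawn(move || {
                    let v = replica[idx].load(Ordering::Acquire);
                    // The first reader to flip the flag wins the race.
                    if done
                        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                        .is_ok()
                    {
                        result.store(v, Ordering::Release);
                    }
                });
            }
        });
        result.load(Ordering::Acquire)
    }
}
```

Because every replica holds the same value, it does not matter which reader wins; the other results are simply dropped, which is the "discard" step in the table above.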
❤️ Inspiration – Laurie Wired’s TailSlayer
UltraSlayer is a Rust port of the original TailSlayer implementation created by Laurie Wired.
- TailSlayer (C++ version) – https://github.com/LaurieWired/tailslayer
- Video explanation (Laurie Wired) – https://www.youtube.com/watch?v=KKbgulTp3FE
Laurie Wired’s work introduced the concept of hardware‑level hedging to eliminate DRAM refresh‑stall tail latency. UltraSlayer adapts that concept to safe‑ish Rust while preserving the same deterministic guarantees.
🛠️ New Features (v0.2)
| Feature | Description |
|---|---|
| Configurable channel count | --channels N (2‑8 mirrors). |
| Cross-Platform Memory | Native mmap (Linux) and VirtualAlloc (Windows) support. |
| Huge‑Page support | Uses MAP_HUGETLB on Linux to cut TLB misses on the slab. |
| Spin policies | busy (full spin), hybrid (spin → yield), sleep (periodic pause). |
| Side‑car (sidecar feature) | Builds a cdylib with a tiny C‑FFI (ul_init, ul_read_u64, …). |
| POSIX Shared‑Memory wrapper (src/shm.rs) | ShmSlab<T> lets multiple processes map the same slab via /dev/shm (Linux only). |
| Criterion benchmark harness (benchmark feature) | benches/read_latency.rs measures nanosecond read latency for 2/4/8 channels. |
| CLI demo binary (cli feature) | examples/ultraslayer_cli.rs parses flags, creates the slab, starts the core, and idles. |
| Zero‑copy slice view (slice feature) | src/slice.rs exposes a raw‑pointer slice for bulk reads without copying. |
| Full LTO + thin‑LTO options | Optimised release builds for the smallest, fastest binary. |
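The three spin policies in the table can be pictured with a small std‑only sketch. The `SpinPolicy` enum, the `wait_for` helper, the 1 000‑poll yield threshold, and the 50 µs sleep are all assumptions made for illustration; the crate's real types and tuning may differ.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical mirror of the `busy` / `hybrid` / `sleep` policies.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SpinPolicy {
    Busy,
    Hybrid,
    Sleep,
}

impl SpinPolicy {
    /// Parse the policy names accepted by the `--spin` flag.
    fn parse(s: &str) -> Option<Self> {
        match s {
            "busy" => Some(Self::Busy),
            "hybrid" => Some(Self::Hybrid),
            "sleep" => Some(Self::Sleep),
            _ => None,
        }
    }
}

/// Poll `flag` until it is set, backing off according to `policy`.
/// Returns how many times the flag was observed unset.
fn wait_for(flag: &AtomicBool, policy: SpinPolicy) -> u64 {
    const SPINS_BEFORE_YIELD: u64 = 1_000; // assumed threshold
    let mut polls = 0u64;
    while !flag.load(Ordering::Acquire) {
        polls += 1;
        match policy {
            // Full spin: never leave the core.
            SpinPolicy::Busy => hint::spin_loop(),
            // Spin for a while, then yield once every SPINS_BEFORE_YIELD polls.
            SpinPolicy::Hybrid if polls % SPINS_BEFORE_YIELD != 0 => hint::spin_loop(),
            SpinPolicy::Hybrid => thread::yield_now(),
            // Periodic pause: lowest CPU cost, highest wake-up jitter.
            SpinPolicy::Sleep => thread::sleep(Duration::from_micros(50)),
        }
    }
    polls
}
```

The trade‑off is the usual one: `Busy` gives the lowest wake‑up latency at the cost of a fully occupied core, while `Sleep` frees the core but hands scheduling jitter back to the OS.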
📋 System Requirements
| Requirement | Linux | Windows |
|---|---|---|
| Kernel/OS | Kernel $\ge$ 5.10 | Windows 10 / 11 |
| Huge Pages | sudo sysctl -w vm.nr_hugepages=2048 | Standard Virtual Memory (Automatic) |
| DRAM Channels | $\ge$ 2 physical channels | $\ge$ 2 physical channels |
| CPU Affinity | taskset / chrt | SetThreadAffinityMask (via core_affinity) |
| Permissions | Root/Sudo for Huge Pages/RT priority | Administrator for certain memory flags |
📦 Getting Started – Build & Install
1️⃣ Clone the repository
2️⃣ Build the core library (default): cargo build --release
3️⃣ Optional builds
| Goal | Cargo command | What you get |
|---|---|---|
| CLI demo (examples/ultraslayer_cli.rs) | cargo build --release --features cli | target/release/examples/ultraslayer_cli |
| C‑FFI side‑car (libultraslayer.so) | cargo build --release --features sidecar | target/release/libultraslayer.so |
| Zero‑copy slices (src/slice.rs) | cargo build --release --features slice | Enables the .slice() API in UltraSlayer |
| Benchmark harness (Criterion) | cargo bench --features benchmark | Runs benches/read_latency.rs and prints latency tables |
| All features | cargo build --release --features "cli sidecar slice benchmark" | Everything compiled together |
The release profile already uses full LTO, opt-level = 3, panic = "abort" and a single codegen unit for maximum inlining.
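For reference, a profile with those settings would look roughly like the fragment below. This is an assumed reconstruction from the sentence above, not a copy of the project's Cargo.toml, so check the actual file before relying on it.

```toml
# Assumed shape of the release profile described above
[profile.release]
opt-level = 3        # maximum optimisation
lto = "fat"          # full LTO across all crates
codegen-units = 1    # single codegen unit for maximum inlining
panic = "abort"      # abort on panic instead of unwinding
```

Switching `lto` to `"thin"` selects thin‑LTO, which usually links faster at a small cost in cross‑crate optimisation.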
▶️ Running UltraSlayer (demo binary)
Linux (with Real‑Time priority)
# Use the binary located in the examples directory
Windows
# Use the binary located in the examples directory
.\target\release\examples\ultraslayer_cli.exe --channels 4 --size 2GiB --spin busy
Alternatively, run directly via Cargo:
Flags
| Flag | Meaning |
|---|---|
| --channels N | Number of DRAM mirrors (default 2). |
| --size <bytes> | Total slab size per channel (e.g. 2GiB, 512MiB). |
| --spin <policy> | busy, hybrid, or sleep (default busy). |
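The `--size` values shown above use binary suffixes. A minimal sketch of how such a value could be turned into a byte count follows; `parse_size` is a hypothetical helper, and both the `KiB` suffix and the bare‑number fallback are assumptions, not documented behaviour of the CLI.

```rust
/// Parse a size such as "2GiB" or "512MiB" into bytes.
/// Hypothetical helper: a bare number is treated as a byte count.
fn parse_size(s: &str) -> Option<u64> {
    let s = s.trim();
    let (digits, multiplier) = if let Some(n) = s.strip_suffix("GiB") {
        (n, 1u64 << 30)
    } else if let Some(n) = s.strip_suffix("MiB") {
        (n, 1u64 << 20)
    } else if let Some(n) = s.strip_suffix("KiB") {
        (n, 1u64 << 10)
    } else {
        (s, 1)
    };
    // Reject non-numeric input and sizes that overflow u64.
    digits.trim().parse::<u64>().ok()?.checked_mul(multiplier)
}
```

Since the size is per channel, the total footprint is the parsed value multiplied by the `--channels` count, which is worth checking against the number of reserved huge pages.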
📊 Benchmarking
UltraSlayer ships with two ways to benchmark latency.
1️⃣ Criterion read‑latency benchmark
The benchmark creates slabs with 2, 4, and 8 channels, fills them with deterministic data, then performs 1 000 000 random reads per configuration while measuring nanosecond‑resolution latency.
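The shape of such a measurement loop can be sketched without Criterion. The code below is not the contents of benches/read_latency.rs: `measure_reads` and `percentile` are hypothetical helpers, the reads hit an ordinary slice instead of a mirrored slab, and `Instant` itself costs tens of nanoseconds, so the absolute numbers are only useful for comparing configurations.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Nearest-rank percentile over an already sorted slice (`q` in 0.0..=1.0).
fn percentile(sorted: &[u64], q: f64) -> u64 {
    assert!(!sorted.is_empty());
    let rank = (q * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Time `ops` pseudo-random reads of `data`; returns sorted latencies in ns.
fn measure_reads(data: &[u64], ops: usize) -> Vec<u64> {
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // arbitrary non-zero seed
    let mut samples = Vec::with_capacity(ops);
    for _ in 0..ops {
        // xorshift64: a cheap, deterministic index generator.
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let idx = (state % data.len() as u64) as usize;
        let start = Instant::now();
        black_box(data[idx]); // keep the read from being optimised away
        samples.push(start.elapsed().as_nanos() as u64);
    }
    samples.sort_unstable();
    samples
}
```

For tail‑latency work the interesting figures are the high percentiles (for example `percentile(&samples, 0.999)`) and the maximum, not the mean.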
2️⃣ Stand‑alone micro‑benchmark binary
Via Cargo:
Via binary (for Real-Time priority on Linux):
# First, build the example
# Then run the resulting binary from the examples folder
Windows execution:
.\target\release\examples\benchmark.exe --channels 4 --size 2GiB --ops 1_000_000 --spin busy
🔌 Integration Guide
A️ Pure Rust Engine
use std::sync::Arc;
use ultraslayer::UltraSlayer;
All public methods (read, write, slice, stats, set_spin_policy, pin_to_core) are exposed through the crate root (ultraslayer::UltraSlayer).
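Of the methods listed above, `slice` is the one whose mechanics are least obvious. A raw‑pointer view that exposes a range without copying might be structured like this sketch; `SliceView` and its methods are hypothetical names, and the real src/slice.rs may look quite different.

```rust
use std::marker::PhantomData;
use std::slice;

/// Hypothetical borrowed view over part of one replica; no bytes are copied.
struct SliceView<'a, T> {
    ptr: *const T,
    len: usize,
    // Ties the raw pointer to the lifetime of the backing memory.
    _owner: PhantomData<&'a [T]>,
}

impl<'a, T> SliceView<'a, T> {
    /// View `len` elements of `backing` starting at `offset`, if in bounds.
    fn new(backing: &'a [T], offset: usize, len: usize) -> Option<Self> {
        let end = offset.checked_add(len)?;
        if end > backing.len() {
            return None;
        }
        Some(Self {
            ptr: backing[offset..].as_ptr(),
            len,
            _owner: PhantomData,
        })
    }

    /// Reinterpret the pointer range as a slice for bulk reads.
    fn as_slice(&self) -> &'a [T] {
        // SAFETY: `new` checked that `ptr..ptr + len` lies inside `backing`,
        // and the `'a` borrow keeps that memory alive and unmodified.
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}
```

The bounds check happens once at construction, so bulk reads through `as_slice` pay no per‑element cost beyond the loads themselves.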
B️ Non‑Rust Languages (C / Node / Python) – Side‑car
The exported C API (in src/ffi.rs) provides ul_init, ul_start_core, ul_read_u64, ul_write_u64, and ul_destroy.
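To show the handle‑based shape such a C API could take, here is a self‑contained sketch. Only the function names come from the list above; every signature, the `UlHandle` type, and the 0 / -1 return convention are assumptions, `ul_start_core` is left out, and the backing store is a plain vector, not a mirrored slab. In a real cdylib each function would also carry the `no_mangle` attribute so that C can link against the symbol.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Opaque handle handed to C callers (hypothetical layout).
pub struct UlHandle {
    slots: Vec<AtomicU64>,
}

/// Allocate a slab with `slots` u64 cells; returns null when `slots` is 0.
pub extern "C" fn ul_init(slots: usize) -> *mut UlHandle {
    if slots == 0 {
        return std::ptr::null_mut();
    }
    let slots = (0..slots).map(|_| AtomicU64::new(0)).collect();
    Box::into_raw(Box::new(UlHandle { slots }))
}

/// Store `value` at `idx`; returns 0 on success, -1 on a bad handle or index.
pub unsafe extern "C" fn ul_write_u64(h: *mut UlHandle, idx: usize, value: u64) -> i32 {
    let slot = unsafe { h.as_ref() }.and_then(|h| h.slots.get(idx));
    match slot {
        Some(slot) => {
            slot.store(value, Ordering::Release);
            0
        }
        None => -1,
    }
}

/// Load the value at `idx` into `*out`; returns 0 on success, -1 on error.
pub unsafe extern "C" fn ul_read_u64(h: *const UlHandle, idx: usize, out: *mut u64) -> i32 {
    let slot = unsafe { h.as_ref() }.and_then(|h| h.slots.get(idx));
    match (slot, unsafe { out.as_mut() }) {
        (Some(slot), Some(out)) => {
            *out = slot.load(Ordering::Acquire);
            0
        }
        _ => -1,
    }
}

/// Free a handle returned by `ul_init`; a null handle is ignored.
pub unsafe extern "C" fn ul_destroy(h: *mut UlHandle) {
    if !h.is_null() {
        drop(unsafe { Box::from_raw(h) });
    }
}
```

Returning status codes and writing results through an out‑pointer keeps panics and Rust types from crossing the FFI boundary, which matters here because the release profile aborts on panic.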
C️ Multiple Processes – POSIX Shared‑Memory (Linux Only)
use ultraslayer::shm::ShmSlab;
// Process A – creates the slab
let shm = ShmSlab::create(/* … */)?;
let slayer = shm.into_ultraslayer();
📁 Project Layout
ultraslayer/
├─ src/
│ ├─ lib.rs ← public UltraSlayer API
│ ├─ slab.rs ← low‑level mirroring & volatile ops
│ ├─ arch.rs ← CPU‑affinity helpers
│ ├─ reader.rs ← internal read‑path logic
│ ├─ main.rs ← optional entry point
│ ├─ shm.rs ← POSIX shared‑memory wrapper (Linux)
│ ├─ ffi.rs ← C‑FFI side‑car (feature = "sidecar")
│ └─ slice.rs ← zero‑copy slice view
├─ benches/
│ └─ read_latency.rs ← Criterion read‑latency benchmark
├─ examples/
│ ├─ ultraslayer_cli.rs ← CLI demo binary (feature = "cli")
│ └─ benchmark.rs ← micro‑benchmark binary (feature = "benchmark")
├─ Cargo.toml
└─ README.md ← this file
📜 License
UltraSlayer is released under the Apache License, Version 2.0.
TL;DR – Quick start for a typical HFT node
# 1️⃣ Reserve huge pages (Linux only)
sudo sysctl -w vm.nr_hugepages=2048
# 2️⃣ Build with the CLI demo + side‑car + slice view
cargo build --release --features "cli sidecar slice"
# 3️⃣ Run the demo (core 2, 4 channels, 2 GiB per channel)
UltraSlayer – the practical, Rust‑native answer to Laurie Wired’s TailSlayer concept. 🚀