scx_cake: CAKE DRR++ Adapted for CPU Scheduling
ABSTRACT:
scx_cakeis an experimental BPF CPU scheduler that adapts the network CAKE algorithm's DRR++ (Deficit Round Robin++) for CPU scheduling. Designed for gaming workloads on modern AMD and Intel hardware.
- 4-Class System — Tasks classified as GAME / NORMAL / HOG / BG by PELT utilization and game family detection
- Zero Global Atomics — Per-CPU BSS arrays with MESI-guarded writes eliminate bus locking
- 3-Gate select_cpu — prev_cpu idle → performance-ordered scan → kernel fallback → tunnel
- Per-LLC DSQ Sharding — Eliminates cross-CCD lock contention on multi-chiplet CPUs
- EEVDF-Inspired Weighting — Virtual runtime with sleep lag credit, nice scaling, and tiered DSQ ordering
[!WARNING] EXPERIMENTAL SOFTWARE This scheduler is experimental and intended for use with
sched_exton Linux Kernel 6.12+. Performance may vary depending on hardware and user configuration.
[!NOTE] AI TRANSPARENCY Large Language Models were used for optimization pattern matching and design exploration. All implementation details have been human-verified, benchmarked on real gaming workloads, and validated for correctness.
Navigation
- 1. Quick Start
- 2. Philosophy
- 3. 4-Class System
- 4. Architecture
- 5. Configuration
- 6. Build Modes
- 7. TUI Guide
- 8. Performance
- 9. Vocabulary
1. Quick Start
# Prerequisites: Linux Kernel 6.12+ with sched_ext, Rust toolchain
# Clone and build
&&
# Run (requires root)
# Run with live stats TUI
2. Philosophy
Traditional schedulers (CFS, EEVDF) optimize for fairness — if a game and a compiler both run, each gets 50% CPU time. For gaming, this creates two problems:
- Latency inversion: A 50µs input handler waits behind a 50ms compile job
- Frame jitter: Game render threads get preempted mid-frame by background work
scx_cake's answer: Detect the game process family automatically (Steam, Wine/Proton, native games) and give it scheduling priority. Non-game tasks are classified by PELT CPU utilization into NORMAL, HOG, or BG classes with progressively lower priority. The system self-tunes — no manual configuration needed.
This is the same insight behind network CAKE: short flows (DNS, gaming packets) should not be delayed by bulk flows (downloads). scx_cake applies this to CPU time.
3. 4-Class System
scx_cake classifies every task into one of four classes. Classification uses PELT (Per-Entity Load Tracking) utilization from the kernel and automatic game family detection via process tree analysis.
Class Hierarchy
| Class | DSQ Weight Range | Typical Workload |
|---|---|---|
| GAME | [0, 5120] | Game process tree + audio daemons + compositor (during GAMING state) |
| NORMAL | [8192, 13312] | Default — interactive desktop tasks |
| HOG | [16384, 21504] | High PELT utilization (≥78% CPU) non-game tasks |
| BG | [49152, 54272] | Low PELT utilization non-game tasks during GAMING |
[!TIP] Lower weight = dispatches first. Non-overlapping weight ranges guarantee class ordering: all GAME tasks dispatch before any NORMAL task, all NORMAL before any HOG, etc.
How Classification Works
- Game detection: Two-phase detection scans for Steam environment variables and Wine
.exeprocesses. Detected game TGIDs and their parent PID are written to BPF BSS. The entire process family (game + wineserver + audio + compositor) is promoted to GAME class. - PELT-based classification: Every 64th stop, the scheduler reads the kernel's
util_avgfor each task. Tasks with ≥78% CPU utilization are classified as HOG; lower-utilization non-game tasks become BG during GAMING state, or NORMAL otherwise. - Audio/Compositor protection: PipeWire daemons and Wayland compositors are detected at startup and baked into RODATA. During GAMING state, they receive GAME-level priority for latency parity with game threads.
- Waker-boost chain: Tasks woken by GAME threads inherit GAME priority for one scheduling cycle. This automatically promotes game pipeline threads (sim→render→present) without explicit classification.
Game Detection Deep Dive
Game detection runs in the Rust TUI polling loop (userspace, every refresh interval) and writes results to BPF BSS. The BPF side never scans /proc — it only reads the pre-resolved game_tgid, game_ppid, and sched_state from BSS.
Detection Pipeline (priority order):
- Phase 1 — Steam: For each PPID group with ≥
GAME_MIN_THREADSthreads, read/proc/<ppid>/environlooking forSteamGameId=orSTEAM_GAME=. If found, validate that the group contains a real game binary (not juststeam,steamwebhelper, orpressure-vessel). Confidence: 100 (instant lock). - Phase 2 — Wine/Proton .exe: For remaining PPIDs with ≥
GAME_MIN_THREADS, read/proc/<ppid>/cmdlinelooking for any argument ending in.exe. Covers Heroic Launcher, Lutris, manual Wine launches. Confidence: 90 (5-second holdoff).
TGID Resolution: Once a winning PPID is found, resolve the actual game TGID by:
- Building per-TGID max
pelt_utilacross all threads in the PPID group - Sorting TGIDs by descending PELT (the game's render loop consumes ms, infra exes consume µs)
- Filtering through a Windows infrastructure blocklist:
services.exe,winedevice.exe,pluginhost.exe,svchost.exe,explorer.exe,wineboot.exe,crashhandler.exe, etc. - Extracting the game name from
/proc/<tgid>/cmdline(for.exe) or/proc/<tgid>/comm(for native)
Hysteresis State Machine:
- Holdoff timer: Lower-confidence candidates wait before locking (Steam=instant, .exe=5s). The same candidate must persist across multiple polls to lock.
- Sticky incumbent: Once locked, a game stays locked until its
/proc/<tgid>disappears (process exits). A challenger can only displace it with strictly higher confidence. - Exit detection: Every poll checks if
/proc/<game_tgid>still exists. On exit →tracked_game_tgid = 0,sched_statetransitions from GAMING → IDLE.
BSS Propagation: The resolved state is written to BPF BSS every refresh:
bss.game_tgid = tracked_game_tgid // for existence checks + display
bss.game_ppid = tracked_game_ppid // primary family signal for BPF classification
bss.game_confidence = game_confidence // 0/90/100
bss.sched_state = IDLE | COMPILATION | GAMING
The BPF classification engine in cake_init_task and cake_stopping reads game_ppid to decide if a task belongs to the game family. All scheduling behavior changes (class assignment, DSQ weights, quantum caps, CPUPERF boost) flow from sched_state.
DRR++ Deficit Tracking
Adapted from network CAKE's flow fairness:
- Each task starts with a deficit (quantum + new-flow bonus ≈ 10ms credit)
- Each execution bout consumes deficit proportional to runtime
- When deficit exhausts → new-flow bonus removed → task competes normally
- GAME tasks skip deficit drain entirely — their new-flow bonus persists forever
EEVDF-Inspired Weighting
scx_cake uses a virtual runtime system inspired by EEVDF:
- Sleep lag credit: Tasks that yield voluntarily (game threads at vsync, audio callbacks) accumulate credit that reduces their DSQ weight on the next wakeup — dispatching them ahead of continuous consumers.
- Nice scaling: Per-task
nice_shift(0-12) scales runtime cost. Highnicepriority → less vruntime cost → dispatches sooner. Computed once per 64 stops fromp->scx.weight. - Capacity scaling: On heterogeneous CPUs (P/E cores), E-core runtime is scaled by CPU capacity so tasks running on slower cores accumulate proportionally less vruntime.
- CPUPERF steering: During GAMING state, GAME tasks signal max CPU frequency boost (1024); non-GAME tasks use reduced boost (768). Check-before-write avoids redundant kfunc calls.
4. Architecture
Overview
flowchart TD
subgraph HOT["BPF Hot Path"]
SELECT["cake_select_cpu<br/>3-gate + tunnel"] --> |GATE 1| PREV["prev_cpu idle?<br/>scx_bpf_test_and_clear_cpu_idle"]
SELECT --> |GATE 2| SCAN["Perf-ordered scan<br/>cpus_fast_to_slow / cpus_slow_to_fast<br/>(only when big_core_phys_mask != 0)"]
SELECT --> |GATE 3| KERNEL["scx_bpf_select_cpu_dfl<br/>kernel authoritative idle scan"]
SELECT --> |TUNNEL| ENQ["cake_enqueue<br/>all CPUs busy"]
PREV --> LOCALON["SCX_DSQ_LOCAL_ON<br/>direct to CPU"]
SCAN --> LOCALON
KERNEL --> LOCALON
ENQ --> |"weighted vtime"| LLCDSQ["Per-LLC DSQ"]
LLCDSQ --> DISPATCH["cake_dispatch"]
DISPATCH --> |"1. Local LLC<br/>scx_bpf_dsq_move_to_local"| LOCAL["Run task"]
DISPATCH --> |"2. Cross-LLC steal<br/>(nr_llcs > 1, victim queued > 1)"| STEAL["Steal from other LLC"]
end
subgraph CLASSIFY["Classification Engine (cake_stopping)"]
EVERY["Every stop"] --> DEFICIT["DRR++ deficit drain"]
EVERY --> WFREQ["Wake frequency EWMA"]
EVERY --> VTIME["Vtime staging<br/>(staged_vtime_bits)"]
GATE64["Every 64th stop<br/>(confidence gate)"] --> RECLASS["PELT reclassify<br/>GAME / NORMAL / HOG / BG"]
GATE64 --> NICE["Nice shift recompute"]
RUNNING["cake_running<br/>(every context switch)"] --> STAMP["BSS: run_start, tick_slice,<br/>is_yielder, running_class,<br/>wake_freq, game_cpu_mask"]
end
Source Files
| File | Lines | Purpose |
|---|---|---|
cake.bpf.c |
~3,300 | All BPF ops + classification engine + BenchLab |
intf.h |
~690 | Shared structs, constants, telemetry definitions |
bpf_compat.h |
~38 | Relaxed atomics compatibility shim |
main.rs |
~750 | Rust loader, CLI, profiles, topology, audio/compositor detection |
topology.rs |
~270 | CPU topology detection (CCDs, P/E cores, V-Cache, SMT) |
calibrate.rs |
~305 | ETD inter-core latency measurement (CAS ping-pong) |
tui.rs |
~4,500 | Terminal UI: debug view, live matrix, BenchLab, topology |
Ops Callbacks
| Callback | Role | Hot/Cold |
|---|---|---|
cake_select_cpu |
3-gate idle CPU selection + kfunc tunneling | Hot |
cake_enqueue |
Weighted vtime insert into per-LLC DSQ | Hot |
cake_dispatch |
Local LLC → cross-LLC steal | Hot |
cake_running |
BSS staging: run_start, is_yielder, game_cpu_mask, cpuperf | Hot (minimal) |
cake_stopping |
Confidence-gated reclassification + DRR++ + vtime staging | Warm |
cake_yield |
Yield count telemetry (stats-gated) | Cold |
cake_runnable |
Preempt count + wakeup source telemetry (stats-gated) | Cold |
cake_set_cpumask |
Event-driven affinity update (replaces polling) | Cold |
cake_init_task |
Arena + task_storage allocation, initial classification | Cold (once) |
cake_exit_task |
Arena deallocation | Cold (once) |
cake_init / cake_exit |
DSQ creation, arena init, UEI | Cold (once) |
Data Structures
Dual-storage architecture:
cake_task_hot(BPF task_storage, ~10ns lookup) — CL0 scheduling-critical fields used every stop:task_class,deficit_u16,packed_info,warm_cpus,staged_vtime_bits,nice_shift,sleep_lag,cached_cpumaskcake_task_ctx(BPF Arena, ~29ns TLB walk) — Telemetry-only fields, gated behindCAKE_STATS_ACTIVE. Dead in release builds.cake_cpu_bss(BSS array, L1-cached) — Per-CPU hot fields:run_start,tick_slice,is_yielder,cached_now,idle_hint,waker_boost,cached_perf
Per-CPU arena (cake_per_cpu, conditional sizing):
- Release: 64B/CPU (CL0 only, 1 page total)
- Debug: 128B/CPU (CL0 + CL1 telemetry, 2 pages total)
DSQ Architecture
flowchart LR
subgraph SINGLE["Single-CCD (9800X3D)"]
DSQ0["LLC_DSQ_BASE + 0<br/>vtime ordered<br/>nr_llcs = 1<br/>stealing skipped"]
end
subgraph MULTI["Multi-CCD (9950X)"]
DSQ1["LLC_DSQ_BASE + 0<br/>CCD 0 cores"] <-->|"cross-LLC steal<br/>when local empty"| DSQ2["LLC_DSQ_BASE + 1<br/>CCD 1 cores"]
end
- Vtime encoding:
now_cached + dsq_weight— class weight ranges guarantee ordering (GAME always before NORMAL) - RODATA gate:
if (nr_llcs <= 1) return;skips all cross-LLC stealing on single-CCD systems
Zero Global State
| Anti-pattern | scx_cake |
|---|---|
| Global atomics | 0 (except game_cpu_mask, transition-only) |
| Volatile variables | 0 |
| Division in hot path | 0 (shift-based µs conversion: >> 10) |
| Global vtime writes | 0 (per-task only) |
| RCU lock/unlock in hot path | 0 |
Kfunc Tunneling
select_cpu caches scx_bpf_now() in per-CPU BSS (cpu_bss[cpu].cached_now). enqueue reuses this value, saving ~15ns (1 kfunc trampoline entry) on the all-busy path.
VPROT: Preemption Protection
When a GAME task enters the DSQ during GAMING state and all CPUs are busy, cake_enqueue actively preempts a non-GAME task:
- O(1) victim finding:
__builtin_ctzll(~game_cpu_mask)— singletzcntinstruction (1 cycle on Zen 4). Bits set in~game_cpu_maskcorrespond to CPUs running non-GAME tasks. - VPROT guard: Before preempting, check if the victim has run long enough to justify interruption. The protection threshold is computed as:
- Base:
tick_slice >> 4, clamped to [125µs, 500µs] - Per-class scaling:
- NORMAL: 75% (×3>>2) — useful interactive work, strong protection
- BG: 50% (>>1) — background tasks, moderate protection
- HOG: 25% (>>2) — bulk CPU consumers, minimal protection
- Base:
- Preempt decision: If
elapsed >= vprot_ns→scx_bpf_kick_cpu(victim, SCX_KICK_PREEMPT). Otherwise → suppressed (counted asnr_vprot_suppressedin stats).
This ensures GAME tasks never wait in the DSQ for a natural context switch while still protecting tasks from micro-slicing. The per-class scaling means HOG tasks (compilers, render farms) get preempted quickly while NORMAL desktop tasks get reasonable protection.
Starvation Guard
The tiered DSQ weight system intrinsically prevents starvation:
- Non-overlapping weight ranges guarantee that within each class, tasks compete fairly on vtime (runtime cost determines ordering)
- New-flow bonus gives newly-woken tasks a vtime advantage, preventing permanent queue-back
- Deficit drain ensures long-running tasks lose their new-flow bonus and compete normally
- Sleep lag credit rewards voluntary yielders, preventing inversion where a yielding task falls behind a continuous consumer
The clock domain fix in cake_running (using scx_bpf_now() instead of p->se.exec_start) prevents a subtle starvation bug: after ~22 minutes, accumulated IRQ time drift in exec_start would exceed the u32 wrap boundary, corrupting elapsed-time checks and causing unconditional preemption (priority inversion).
Scheduler States
The userspace TUI drives state machine transitions written to BPF BSS:
| State | Value | Trigger | Effect |
|---|---|---|---|
| IDLE | 0 | No game or compiler detected | Baseline — NORMAL/HOG classes only |
| COMPILATION | 1 | ≥2 compiler processes at ≥78% PELT | Cluster co-scheduling for build locality |
| GAMING | 2 | Game detected (Steam/.exe/family) | Full priority system: GAME/HOG/BG active |
Loader Intelligence
The Rust loader (main.rs) performs significant one-time work at startup, baking results into BPF RODATA (immutable after load):
Prefcore Ranking → Core Steering Arrays:
- Reads
/sys/devices/system/cpu/cpu*/cpufreq/amd_pstate_prefcore_rankingfor each CPU - Sorts by descending rank (fastest first), grouping SMT siblings together:
[best_phys, best_smt, second_phys, second_smt, ...] - Populates
cpus_fast_to_slow(GAME scan order) andcpus_slow_to_fast(non-GAME scan order) in RODATA - 0xFF sentinel terminates the array on CPUs without prefcore rankings
Audio Stack Detection (2-phase):
- Phase 1 — Comm scan: Searches
/proc/*/commfor known audio daemons:pipewire,wireplumber,pipewire-pulse,pulseaudio,jackd,jackdbus - Phase 2 — PipeWire socket scan: Reads
/proc/net/unixto find the PipeWire socket inode (/run/user/<uid>/pipewire-0), then scans/proc/*/fdfor processes holding a file descriptor to that inode. This catches audio mixer daemons (goxlr-daemon,easyeffects, etc.) without brittle comm lists.
- Up to 8 audio TGIDs baked into
audio_tgids[]RODATA. During GAMING, BPF promotes matching tasks to GAME class.
Compositor Detection:
- Scans
/proc/*/commfor known Wayland/X11 compositors:kwin_wayland,kwin_x11,mutter,gnome-shell,sway,Hyprland,weston,labwc,wayfire,river,gamescope - Up to 4 compositor TGIDs baked into
compositor_tgids[]RODATA. During GAMING, compositors receive GAME-level priority — essential for frame presentation latency parity.
Topology Arrays:
cpu_sibling_map[]— SMT sibling pairs (Gate 2 class-mismatch filter)cpu_llc_id[]— Per-CPU LLC assignment (DSQ sharding)llc_cpu_mask[],core_cpu_mask[]— Bitmask sets for LLC/core groupingbig_core_phys_mask,little_core_mask,vcache_llc_mask— Intel hybrid + V-Cache topology
5. Configuration
Profiles (--profile, -p)
| Profile | Quantum | Starvation | Use Case |
|---|---|---|---|
| gaming | 2ms | 100ms | (Default) Balanced for most games |
| esports | 1ms | 50ms | Competitive FPS, ultra-low latency |
| legacy | 4ms | 200ms | Older CPUs, reduced overhead |
| battery | 4ms | 200ms | Power-efficient for handhelds/laptops |
| default | 2ms | 100ms | Alias for gaming |
CLI Arguments
| Argument | Default | Description |
|---|---|---|
--profile, -p <PROFILE> |
gaming |
Select preset profile |
--quantum <µs> |
profile | Base time slice in microseconds |
--new-flow-bonus <µs> |
profile | Extra deficit for newly woken tasks |
--starvation <µs> |
profile | Max run time before forced preemption |
--verbose, -v |
false |
Enable live TUI stats display |
--interval <secs> |
1 |
TUI refresh interval |
--testing |
false |
Automated benchmarking mode (see below) |
Testing Mode
--testing runs an automated benchmark for CI and regression testing:
- Warmup: 1 second pause for the scheduler to stabilize
- Collection: 10 seconds of operation, sampling per-CPU BSS dispatch counters
- Output: Single-line JSON to stdout:
The scheduler exits automatically after printing. Requires a debug build (release builds silently ignore --testing). Useful for comparing scheduling throughput across code changes.
Yield-Gated Quantum
Instead of per-tier multipliers, scx_cake uses a yield-gated quantum system:
- Yielders (cooperative tasks that voluntarily yield): Get full quantum ceiling (up to 2ms default)
- Non-yielders (bulk consumers): Get PELT-scaled slice, capped per class
- GAME during GAMING: 2x quantum ceiling (tasks yield at vsync, so they'll never consume it all)
- HOG/BG during GAMING: Halved caps (forces more preemption points for GAME tasks)
[!NOTE] Higher weight = dispatches later. GAME [0-5120] dispatches before NORMAL [8192-13312] dispatches before HOG [16384-21504] dispatches before BG [49152-54272]. Within each class, PELT utilization and runtime cost provide fine-grained ordering.
Examples
# Default gaming profile
# Competitive gaming
# Gaming with custom quantum and live stats
# Battery-friendly for laptop gaming
6. Build Modes
scx_cake has two build modes that control whether telemetry instrumentation is compiled into the BPF code.
Release Mode (cargo build --release)
build.rspasses-DCAKE_RELEASE=1to Clang- All
#ifndef CAKE_RELEASEblocks are dead-code eliminated — arena telemetry, per-task counters, gate hit tracking, BenchLab, and the entirecake_task_ctxarena struct are compiled out CAKE_STATS_ENABLEDis a compile-time constant0— Clang eliminates all telemetry branches at BPF compile time- Per-CPU arena blocks shrink from 128B (debug) to 64B (release)
- TUI still works but only shows aggregate BSS stats (gate latencies, dispatch counts). Per-task arena fields (gate hit %, callback durations, quantum breakdown) are unavailable
[!TIP] Use
--releasefor production gaming. The telemetry overhead is small (~2-5%), but eliminating it gives the tightest possible scheduling latency.
Debug Mode (cargo build)
CAKE_RELEASEis not defined — all telemetry code is compiled inCAKE_STATS_ENABLEDbecomes a volatile BSS read ofenable_stats, controllable at runtimeCAKE_STATS_ACTIVE=CAKE_STATS_ENABLED && !bench_active— telemetry is suppressed during BenchLab runs to avoid measuring measurement overhead- Full per-task arena telemetry: gate hit counters, callback duration timers, quantum utilization, waker chain, LLC placement, dispatch gap tracking
- Required for: full TUI live matrix, BenchLab benchmarks, per-task gate analysis
# Production (release) — minimal overhead, limited TUI
# Development (debug) — full telemetry, complete TUI
7. TUI Guide
The TUI is activated with --verbose / -v and provides real-time visibility into every scheduling decision. It requires a debug build for full per-task telemetry.
Tabs
Navigate between tabs with Tab / → (next) and Shift-Tab / ← (previous).
| Tab | Content |
|---|---|
| Dashboard | Aggregate stats header + live task matrix with per-task scheduling data |
| Topology | CPU topology map: CCDs, P/E cores, V-Cache, SMT siblings, core-to-core latency heatmap |
| BenchLab | In-kernel kfunc microbenchmarks: 60+ operations with ns-precision timing |
| Reference Guide | Quick reference for column meanings, keybindings, and terminology |
Dashboard: Aggregate Stats
The top section shows system-wide scheduling statistics aggregated from per-CPU BSS counters:
- Gate hit rates: Gate 1 (prev_cpu idle) %, Gate 2 (perf-ordered scan) %, Gate 3 (kernel fallback) %, Tunnel %
- Dispatch stats: DSQ queued, consumed, local dispatches, cross-LLC steals, dispatch misses
- Flow stats: New flow dispatches, old flow dispatches, DRR++ deficit activity
- Game state: Detected game name, TGID, PPID, thread count, sched_state (IDLE/COMPILATION/GAMING)
- EEVDF stats: Sleep lag applications, vprot suppression count, nice shift distribution
Dashboard: Live Task Matrix
The scrollable table shows one row per task with these columns:
| Column | Meaning |
|---|---|
| CPU | Last CPU the task ran on |
| PID | Process ID |
| ST | Liveness: ●LIVE (BPF-tracked + running), ○IDLE (alive, no BPF data), ✗DEAD (exited) |
| COMM | Task command name (from /proc/PID/comm) |
| CLS | Task class: GAME 🎮, NORM, HOG 🔥, BG |
| VCSW | Voluntary context switches (delta per interval) |
| AVGRT | Average runtime per execution bout (µs) |
| MAXRT | Maximum single-run duration (µs) |
| GAP | Average dispatch gap — time between consecutive runs (µs) |
| JITTER | Scheduling jitter — variance in dispatch timing (µs) |
| WAIT | Average wait time in DSQ before dispatch (µs) |
| RUNS/s | Scheduling frequency — how often this task runs per second |
| CPU | Current/last CPU placement (core ID) |
| SEL | select_cpu callback duration (ns) |
| ENQ | enqueue callback duration (ns) |
| STOP | stopping callback duration (ns) |
| RUN | running callback duration (ns) |
| G1 | Gate 1 (prev_cpu idle) hit percentage for this task |
| G3 | Gate 3 (kernel scan) hit percentage for this task |
| DSQ | DSQ tunnel (all busy) hit percentage |
| MIGR/s | CPU migrations per second |
| TGID | Thread Group ID (process leader PID) |
| Q%F | Quantum full — % of runs where task consumed entire time slice |
| Q%Y | Quantum yield — % of runs where task yielded voluntarily |
| Q%P | Quantum preempt — % of runs where task was preempted |
| WAKER | PID of the last task that woke this task |
| NICE | Nice shift value (0-12) — EEVDF vruntime scaling factor |
Tasks are grouped by TGID (process) with collapsible headers. The detected game family is highlighted.
Topology Tab
Displays the detected CPU topology in a visual format:
- CCD map: Which cores belong to which LLC/CCD
- P/E core identification: Big cores vs Little cores (Intel hybrid)
- V-Cache detection: Asymmetric LLC sizes across CCDs
- SMT siblings: Logical CPU pairs sharing a physical core
- Core-to-core latency matrix: Press
bto run an ETD (Empirical Topology Discovery) benchmark measuring CAS ping-pong latency between every core pair. Results display as a color-coded heatmap.
BenchLab Tab
In-kernel microbenchmark suite measuring the real cost of scheduling primitives:
- 60+ benchmarked operations: kfunc trampolines, BSS reads, arena lookups, RODATA constants, data structure accesses, idle cascades, classification paths
- Each operation shows: ID, Name, Category (Data Read, Synchronization, Idle Selection, etc.), Type (K=kfunc, C=cake-internal), and measured latency in nanoseconds
- Press
bto trigger a benchmark run (runs in-kernel, suppresses telemetry during measurement) - Results persist across runs; press
cto copy to clipboard for external analysis
Keybindings
| Key | Action |
|---|---|
q / Esc |
Quit |
Tab / → |
Next tab |
Shift-Tab / ← |
Previous tab |
↑ / ↓ |
Scroll table / bench results |
t |
Jump to top of table |
s |
Cycle sort column (PID → PELT → Runs/s → CPU → Gate1% → ...) |
S |
Toggle sort direction (ascending ↔ descending) |
f |
Toggle filter: BPF-tracked only ↔ all system tasks |
Enter |
Collapse/expand PPID group (Dashboard) |
Space |
Collapse/expand TGID group (Dashboard) |
x |
Fold/unfold all PPID groups (Dashboard) |
c |
Copy current tab data to clipboard |
d |
Dump full snapshot to timestamped file (tui_dump_<epoch>.txt) |
r |
Reset all BSS stats counters to zero |
b |
Run benchmark (BenchLab on most tabs, core latency on Topology tab) |
+ / = |
Faster refresh rate (halve interval, min 250ms) |
- |
Slower refresh rate (double interval, max 5000ms) |
Data Export
Two export methods are available from any tab:
- Clipboard (
c): Copies the current tab's data as formatted text. On Dashboard, includes aggregate stats + full task matrix. On BenchLab, includes all benchmark results with categories. - File dump (
d): Writes a complete snapshot totui_dump_<epoch>.txtin the current working directory. Includes all stats, task matrix, and metadata. Useful for before/after comparisons or sharing with developers.
8. Performance
Target Hardware
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 9800X3D (1 CCD, 8C/16T) |
| Kernel | Linux 6.12+ with sched_ext |
Design Targets
- Sub-microsecond scheduling decisions — Select CPU + enqueue under 100ns typical
- Zero bus lock contention — No global atomics means no scaling regression under load
- Consistent 1% lows — Tiered weight system prevents frame time spikes from background work
- Automatic game detection — Two-phase Steam/Wine detection with holdoff hysteresis
Benchmarks
- schbench — Scheduler latency microbenchmark
- Arc Raiders — AAA game stress testing (frame rates, 1% lows)
- Splitgate 2 — Competitive FPS latency testing
[!NOTE] Throughput workloads (compilers, render farms) will perform worse than CFS/EEVDF. This scheduler explicitly trades throughput for latency — the same tradeoff network CAKE makes for packets.
Comparison with Other sched_ext Schedulers
| Feature | scx_cake | scx_bpfland | scx_lavd | scx_cosmos |
|---|---|---|---|---|
| Primary goal | Gaming latency | General-purpose interactive | Latency-sensitive workloads | General-purpose + gaming |
| Task classification | 4-class (GAME/NORMAL/HOG/BG) | Interactive vs batch (2-class) | Urgency score (continuous) | Latency criticality (multi-level) |
| Game detection | Automatic (Steam/Wine/process tree) | None | Behavioral (latency patterns) | None |
| DSQ structure | Per-LLC vtime-ordered | Per-LLC + shared | Global ordered | Per-LLC |
| Idle CPU selection | 3-gate custom + kernel fallback | Kernel default | Custom idle scan | Kernel default + LLC preference |
| EEVDF features | Sleep lag, nice scaling, vprot | None | Urgency scoring | None |
| Core steering | Prefcore-aware (P/E, V-Cache) | LLC-aware | LLC-aware | LLC-aware |
| Global atomics | 0 (MESI-guarded BSS) | Minimal | Some | Minimal |
| When to choose | Gaming-first, multi-game families | Desktop daily driver | Mixed interactive + throughput | Desktop with gaming needs |
[!NOTE] scx_cake is not a general-purpose scheduler. It makes explicit tradeoffs that benefit gaming at the cost of throughput workloads. For daily desktop use without gaming,
scx_bpflandorscx_cosmosmay be better choices.
9. Vocabulary
Core Concepts
| Term | Definition |
|---|---|
| CAKE | Common Applications Kept Enhanced. Network AQM algorithm this scheduler adapts. |
| DRR++ | Deficit Round Robin++. Network algorithm balancing fair queuing with strict priority. |
| PELT | Per-Entity Load Tracking. Kernel mechanism providing exponentially-decayed CPU utilization per task (util_avg). |
| Class | Classification level (GAME/NORMAL/HOG/BG). Controls DSQ weight, quantum cap, and preemption policy. |
| Deficit | Per-task credit from DRR++. New tasks get bonus credit; GAME tasks skip deficit drain entirely. |
| Quantum | Base time slice a task is allotted before a scheduling decision. |
| Sleep Lag | EEVDF credit for voluntary yields. Reduces DSQ weight on next wakeup, so yielders dispatch first. |
| Waker Boost | Transitive priority inheritance: tasks woken by GAME threads get GAME-level priority for one cycle. |
| Staged Bits | Pre-computed scheduling state packed into a single u64 by cake_stopping, consumed by cake_enqueue to avoid redundant computation. |
| Jitter | Variance in scheduling latency between consecutive events. Low jitter = consistent frame delivery. |
Architecture
| Term | Definition |
|---|---|
| Task Storage | BPF local storage attached to each task. 10ns lookup; holds CL0 hot scheduling fields. |
| Arena | BPF memory region for per-task telemetry data. 29ns TLB walk; dead in release builds. |
| BSS | Zero-initialized BPF global data. Per-CPU arrays (cpu_bss) provide L1-cached hot fields. |
| Kfunc Tunneling | Caching kfunc return values in per-CPU BSS to avoid redundant trampoline calls. |
| MESI Guard | Read-before-write pattern: skip store if value unchanged, preventing cache invalidation. |
| RODATA Gate | Compile-time constant that eliminates entire code paths (e.g., single-CCD skips stealing). |
| Confidence Gate | Reclassification runs every 64th stop. 63/64 stops reuse cached task_class from task_storage. |
| BenchLab | In-kernel microbenchmark suite measuring kfunc costs, data access patterns, and gate cascades. |
| Bitfield Coalescing | Packing fields written together into adjacent bits for fused clear/set ops. |
Hardware
| Term | Definition |
|---|---|
| CCD | Core Complex Die. Physical chiplet containing cores (9800X3D: 1 CCD, 9950X: 2 CCDs). |
| LLC | Last Level Cache (L3). Cores in same LLC communicate ~3-5x faster than cross-LLC. |
| SMT | Simultaneous Multi-Threading. Two logical CPUs per physical core. |
| P/E Cores | Intel hybrid architecture: Performance cores (fast) and Efficiency cores (power-saving). |
| V-Cache | AMD 3D V-Cache. Asymmetric LLC sizes across CCDs (e.g., 96MB vs 32MB on 9800X3D). |
| ETD | Empirical Topology Discovery. Measures inter-core CAS latency at startup for display. |
| Cache Line | 64-byte block of memory. Smallest unit the CPU loads from RAM. Foundation of data layout. |
| MESI Protocol | Cache coherency protocol (Modified/Exclusive/Shared/Invalid). Unnecessary writes trigger invalidations. |
Performance
| Term | Definition |
|---|---|
| Hot Path | Code on every scheduling decision: select_cpu → enqueue → dispatch. |
| Cold Path | Infrequent code: task init, reclassification (every 64th stop). |
| Direct Dispatch | SCX_DSQ_LOCAL_ON — task goes directly to a CPU's local queue, bypassing the DSQ path. |
| 1% Lows | Average framerate of the slowest 1% of frames. Key metric for stutter. |
| Branchless | Code avoiding if/else to prevent CPU pipeline stalls from branch misprediction. |
Anti-Patterns
| Term | Definition |
|---|---|
| False Sharing | Performance penalty when CPUs write to different data on the same 64-byte cache line. |
| Cache Invalidation | Forcing other cores to discard cached data via unnecessary writes. Causes bus locking. |
| Micro-slicing | Preempting tasks too frequently. Queued interruptions degrade throughput and increase jitter. |
| Volatile | Compiler hint preventing optimization. Clogs LSU, breaks ILP/MLP parallelism. Avoid in BPF. |
License: GPL-2.0 Maintainer: RitzDaCat