
scx_cake: CAKE DRR++ Adapted for CPU Scheduling

ABSTRACT: scx_cake is an experimental BPF CPU scheduler that adapts DRR++ (Deficit Round Robin++), the flow-fairness algorithm used by the network CAKE queue discipline, to CPU scheduling. It is designed for gaming workloads on modern AMD and Intel hardware.

4-Class System — Tasks classified as GAME / NORMAL / HOG / BG by PELT utilization and game family detection

Zero Global Atomics — Per-CPU BSS arrays with MESI-guarded writes eliminate bus locking

3-Gate select_cpu — prev_cpu idle → performance-ordered scan → kernel fallback → tunnel

Per-LLC DSQ Sharding — Eliminates cross-CCD lock contention on multi-chiplet CPUs

EEVDF-Inspired Weighting — Virtual runtime with sleep lag credit, nice scaling, and tiered DSQ ordering

[!WARNING] EXPERIMENTAL SOFTWARE This scheduler is experimental and intended for use with sched_ext on Linux Kernel 6.12+. Performance may vary depending on hardware and user configuration.

[!NOTE] AI TRANSPARENCY Large Language Models were used for optimization pattern matching and design exploration. All implementation details have been human-verified, benchmarked on real gaming workloads, and validated for correctness.

1. Quick Start

# Prerequisites: Linux Kernel 6.12+ with sched_ext, Rust toolchain

# Clone and build
git clone https://github.com/sched-ext/scx.git
cd scx && cargo build --release -p scx_cake

# Run (requires root)
sudo ./target/release/scx_cake

# Run with live stats TUI
sudo ./target/release/scx_cake -v

2. Philosophy

Traditional schedulers (CFS, EEVDF) optimize for fairness: if a game thread and a compiler job compete for the same CPU, each gets roughly 50% of its time. For gaming, this creates two problems:

Latency inversion: A 50µs input handler waits behind a 50ms compile job
Frame jitter: Game render threads get preempted mid-frame by background work

scx_cake's answer: Detect the game process family automatically (Steam, Wine/Proton, native games) and give it scheduling priority. Non-game tasks are classified by PELT CPU utilization into NORMAL, HOG, or BG classes with progressively lower priority. The system self-tunes — no manual configuration needed.

This is the same insight behind network CAKE: short flows (DNS, gaming packets) should not be delayed by bulk flows (downloads). scx_cake applies this to CPU time.

3. 4-Class System

scx_cake classifies every task into one of four classes. Classification uses PELT (Per-Entity Load Tracking) utilization from the kernel and automatic game family detection via process tree analysis.

Class Hierarchy

Class	DSQ Weight Range	Typical Workload
GAME	[0, 5120]	Game process tree + audio daemons + compositor (during GAMING state)
NORMAL	[8192, 13312]	Default — interactive desktop tasks
HOG	[16384, 21504]	High PELT utilization (≥78% CPU) non-game tasks
BG	[49152, 54272]	Low PELT utilization non-game tasks during GAMING

[!TIP] Lower weight = dispatches first. Non-overlapping weight ranges guarantee class ordering: all GAME tasks dispatch before any NORMAL task, all NORMAL before any HOG, etc.
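The band arithmetic behind this guarantee can be sketched in a few lines of Rust. This is an illustrative model, not the BPF implementation: the `Class` enum, `BAND_WIDTH`, and `dsq_weight` are hypothetical names, and only the numeric ranges come from the table above.

```rust
/// The four task classes from the class hierarchy table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Game,
    Normal,
    Hog,
    Bg,
}

/// Every band in the table spans 5120 weight units (e.g. 8192..=13312).
pub const BAND_WIDTH: u64 = 5120;

impl Class {
    /// Lowest DSQ weight of the class band, taken from the table.
    pub fn base_weight(self) -> u64 {
        match self {
            Class::Game => 0,
            Class::Normal => 8192,
            Class::Hog => 16384,
            Class::Bg => 49152,
        }
    }
}

/// Hypothetical helper: place a per-task cost inside its class band.
/// Clamping the offset to the band width is what keeps the bands disjoint.
pub fn dsq_weight(class: Class, offset: u64) -> u64 {
    class.base_weight() + offset.min(BAND_WIDTH)
}
```

Because the offset is clamped, the worst GAME weight (5120) is still below the best NORMAL weight (8192), and likewise for each following pair of classes.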

How Classification Works

Game detection: Two-phase detection scans for Steam environment variables and Wine .exe processes. Detected game TGIDs and their parent PID are written to BPF BSS. The entire process family (game + wineserver + audio + compositor) is promoted to GAME class.
PELT-based classification: Every 64th stop, the scheduler reads the kernel's util_avg for each task. Tasks with ≥78% CPU utilization are classified as HOG; lower-utilization non-game tasks become BG during GAMING state, or NORMAL otherwise.
Audio/Compositor protection: PipeWire daemons and Wayland compositors are detected at startup and baked into RODATA. During GAMING state, they receive GAME-level priority for latency parity with game threads.
Waker-boost chain: Tasks woken by GAME threads inherit GAME priority for one scheduling cycle. This automatically promotes game pipeline threads (sim→render→present) without explicit classification.
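A minimal Rust sketch of the classification rules listed above, written under stated assumptions: `util_avg` is taken to be on the kernel's 0..=1024 PELT scale, so 78% is approximated as 799; the function and type names are hypothetical; and audio/compositor promotion and the waker boost are left out for brevity. The real logic lives in BPF C.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchedState {
    Idle,
    Compilation,
    Gaming,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Game,
    Normal,
    Hog,
    Bg,
}

/// Assumption: 78% of the 1024-unit PELT scale, rounded up.
pub const HOG_UTIL_THRESHOLD: u32 = 799;

/// Reclassification runs on every 64th stop; the other 63 stops reuse
/// the cached class.
pub fn should_reclassify(stop_count: u32) -> bool {
    stop_count & 63 == 0
}

/// Apply the rules in priority order: game family, then HOG by
/// utilization, then BG (while gaming) or NORMAL (otherwise).
pub fn classify(in_game_family: bool, util_avg: u32, state: SchedState) -> Class {
    if in_game_family && state == SchedState::Gaming {
        return Class::Game;
    }
    if util_avg >= HOG_UTIL_THRESHOLD {
        return Class::Hog;
    }
    if state == SchedState::Gaming {
        Class::Bg
    } else {
        Class::Normal
    }
}
```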

Game Detection Deep Dive

Game detection runs in the Rust TUI polling loop (userspace, every refresh interval) and writes results to BPF BSS. The BPF side never scans /proc — it only reads the pre-resolved game_tgid, game_ppid, and sched_state from BSS.

Detection Pipeline (priority order):

Phase 1 — Steam: For each PPID group with ≥GAME_MIN_THREADS threads, read /proc/<ppid>/environ looking for SteamGameId= or STEAM_GAME=. If found, validate that the group contains a real game binary (not just steam, steamwebhelper, or pressure-vessel). Confidence: 100 (instant lock).
Phase 2 — Wine/Proton .exe: For remaining PPIDs with ≥GAME_MIN_THREADS, read /proc/<ppid>/cmdline looking for any argument ending in .exe. Covers Heroic Launcher, Lutris, manual Wine launches. Confidence: 90 (5-second holdoff).
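Both phases reduce to scanning NUL-separated buffers, since `/proc/<pid>/environ` and `/proc/<pid>/cmdline` separate entries with `\0`. The sketch below shows that scan on in-memory bytes. The function names are hypothetical, the case-insensitive `.exe` match is an assumption, and the thread-count gate and the "real game binary" validation are omitted.

```rust
/// Split a NUL-separated /proc buffer (environ or cmdline) into entries.
fn nul_entries<'a>(buf: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
    buf.split(|&b| b == 0).filter(|e| !e.is_empty())
}

/// Phase 1: does the environment carry one of the Steam markers?
pub fn has_steam_marker(environ: &[u8]) -> bool {
    nul_entries(environ)
        .any(|e| e.starts_with(b"SteamGameId=") || e.starts_with(b"STEAM_GAME="))
}

/// Phase 2: does any command-line argument end in ".exe"?
/// Matching case-insensitively is an assumption of this sketch.
pub fn has_exe_arg(cmdline: &[u8]) -> bool {
    nul_entries(cmdline).any(|arg| arg.to_ascii_lowercase().ends_with(b".exe"))
}

/// Map the two phases to the confidence values quoted above.
pub fn detect_confidence(environ: &[u8], cmdline: &[u8]) -> u8 {
    if has_steam_marker(environ) {
        100
    } else if has_exe_arg(cmdline) {
        90
    } else {
        0
    }
}
```

In a real loader the byte buffers would come from something like `std::fs::read(format!("/proc/{ppid}/environ"))`, which can fail for processes owned by other users.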

TGID Resolution: Once a winning PPID is found, resolve the actual game TGID by:

Building per-TGID max pelt_util across all threads in the PPID group
Sorting TGIDs by descending PELT (the game's render loop consumes ms, infra exes consume µs)
Filtering through a Windows infrastructure blocklist: services.exe, winedevice.exe, pluginhost.exe, svchost.exe, explorer.exe, wineboot.exe, crashhandler.exe, etc.
Extracting the game name from /proc/<tgid>/cmdline (for .exe) or /proc/<tgid>/comm (for native)
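The first three resolution steps can be sketched as follows. `ThreadSample` and `resolve_game_tgid` are hypothetical names, the blocklist holds only the entries named above (the source list ends in "etc."), and name extraction from `/proc` is left out.

```rust
use std::collections::HashMap;

/// Infrastructure executables named in the blocklist above.
const INFRA_BLOCKLIST: &[&str] = &[
    "services.exe",
    "winedevice.exe",
    "pluginhost.exe",
    "svchost.exe",
    "explorer.exe",
    "wineboot.exe",
    "crashhandler.exe",
];

/// One thread observed in the winning PPID group (hypothetical shape).
pub struct ThreadSample {
    pub tgid: u32,
    pub pelt_util: u32,
    pub exe_name: String,
}

pub fn resolve_game_tgid(threads: &[ThreadSample]) -> Option<u32> {
    // Step 1: per-TGID maximum PELT utilization across its threads.
    let mut best: HashMap<u32, (u32, &str)> = HashMap::new();
    for t in threads {
        let entry = best.entry(t.tgid).or_insert((0, t.exe_name.as_str()));
        entry.0 = entry.0.max(t.pelt_util);
    }
    // Step 2: sort by descending PELT (ties broken by TGID for determinism).
    let mut ranked: Vec<(u32, u32, &str)> = best
        .into_iter()
        .map(|(tgid, (util, name))| (tgid, util, name))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    // Step 3: the first TGID that is not Windows infrastructure wins.
    ranked
        .into_iter()
        .find(|(_, _, name)| !INFRA_BLOCKLIST.iter().any(|b| b.eq_ignore_ascii_case(name)))
        .map(|(tgid, _, _)| tgid)
}
```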

Hysteresis State Machine:

Holdoff timer: Lower-confidence candidates wait before locking (Steam=instant, .exe=5s). The same candidate must persist across multiple polls to lock.
Sticky incumbent: Once locked, a game stays locked until its /proc/<tgid> disappears (process exits). A challenger can only displace it with strictly higher confidence.
Exit detection: Every poll checks if /proc/<game_tgid> still exists. On exit → tracked_game_tgid = 0, sched_state transitions from GAMING → IDLE.
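One way to read the three rules above is as a small tracker that is fed one observation per poll. The sketch below is an interpretation, not the loader's code: `Candidate`, `GameTracker`, and `poll` are hypothetical names, and the caller is assumed to supply the current time and whether `/proc/<tgid>` for the locked game still exists.

```rust
/// A detection result from one poll (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Candidate {
    pub tgid: u32,
    pub confidence: u8,
}

#[derive(Debug, Default)]
pub struct GameTracker {
    pub locked: Option<Candidate>,
    /// Candidate waiting out its holdoff, with the time it was first seen.
    pending: Option<(Candidate, u64)>,
}

/// Holdoff per confidence level: Steam (100) is instant, .exe (90) waits 5 s.
fn holdoff_ms(confidence: u8) -> u64 {
    if confidence >= 100 { 0 } else { 5_000 }
}

impl GameTracker {
    /// Returns the locked game TGID after this poll, if any.
    pub fn poll(&mut self, now_ms: u64, seen: Option<Candidate>, locked_alive: bool) -> Option<u32> {
        // Exit detection: drop the lock once the process is gone.
        if self.locked.is_some() && !locked_alive {
            self.locked = None;
        }
        match seen {
            None => self.pending = None,
            Some(c) => {
                // Sticky incumbent: only strictly higher confidence may displace it.
                let may_lock = match self.locked {
                    None => true,
                    Some(incumbent) => c.confidence > incumbent.confidence,
                };
                if may_lock {
                    // The same candidate must persist for the whole holdoff.
                    let first_seen = match self.pending {
                        Some((p, t)) if p == c => t,
                        _ => now_ms,
                    };
                    if now_ms - first_seen >= holdoff_ms(c.confidence) {
                        self.locked = Some(c);
                        self.pending = None;
                    } else {
                        self.pending = Some((c, first_seen));
                    }
                } else {
                    self.pending = None;
                }
            }
        }
        self.locked.map(|l| l.tgid)
    }
}
```

In this model a `None` result corresponds to `tracked_game_tgid = 0`, at which point the caller would move `sched_state` out of GAMING.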

BSS Propagation: The resolved state is written to BPF BSS every refresh:

bss.game_tgid       = tracked_game_tgid           // for existence checks + display
bss.game_ppid       = tracked_game_ppid           // primary family signal for BPF classification
bss.game_confidence = game_confidence             // 0/90/100
bss.sched_state     = IDLE | COMPILATION | GAMING // exactly one of the three states

The BPF classification engine in cake_init_task and cake_stopping reads game_ppid to decide if a task belongs to the game family. All scheduling behavior changes (class assignment, DSQ weights, quantum caps, CPUPERF boost) flow from sched_state.

DRR++ Deficit Tracking

Adapted from network CAKE's flow fairness:

Each task starts with a deficit (quantum + new-flow bonus ≈ 10ms credit)
Each execution bout consumes deficit proportional to runtime
When deficit exhausts → new-flow bonus removed → task competes normally
GAME tasks skip deficit drain entirely — their new-flow bonus persists forever
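The deficit lifecycle above can be modeled with a saturating counter. This is a sketch under assumptions: the deficit is kept in microseconds in a `u16` (suggested by the `deficit_u16` field name), and the 8 ms bonus in the usage note is chosen only so that quantum plus bonus matches the "≈ 10ms" figure with the default 2 ms quantum.

```rust
/// Per-task DRR++ state (hypothetical shape).
pub struct TaskDeficit {
    pub deficit_us: u16,
    pub new_flow: bool,
}

impl TaskDeficit {
    /// A new task starts with quantum + new-flow bonus as credit.
    pub fn new(quantum_us: u16, new_flow_bonus_us: u16) -> Self {
        TaskDeficit {
            deficit_us: quantum_us.saturating_add(new_flow_bonus_us),
            new_flow: true,
        }
    }

    /// Charge one execution bout against the deficit.
    /// GAME tasks skip the drain, so their new-flow bonus never expires.
    pub fn charge(&mut self, runtime_us: u32, is_game: bool) {
        if is_game {
            return;
        }
        let cost = runtime_us.min(u16::MAX as u32) as u16;
        self.deficit_us = self.deficit_us.saturating_sub(cost);
        if self.deficit_us == 0 {
            // Deficit exhausted: the task now competes as an old flow.
            self.new_flow = false;
        }
    }
}
```

For example, `TaskDeficit::new(2_000, 8_000)` starts with 10,000 µs of credit; after bouts of 4 ms and 7 ms the credit is gone and `new_flow` becomes `false`.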

EEVDF-Inspired Weighting

scx_cake uses a virtual runtime system inspired by EEVDF:

Sleep lag credit: Tasks that yield voluntarily (game threads at vsync, audio callbacks) accumulate credit that reduces their DSQ weight on the next wakeup — dispatching them ahead of continuous consumers.
Nice scaling: Per-task nice_shift (0-12) scales runtime cost. Higher-priority tasks (lower nice values) accrue less vruntime per unit of runtime and therefore dispatch sooner. Recomputed every 64th stop from p->scx.weight.
Capacity scaling: On heterogeneous CPUs (P/E cores), E-core runtime is scaled by CPU capacity so tasks running on slower cores accumulate proportionally less vruntime.
CPUPERF steering: During GAMING state, GAME tasks signal max CPU frequency boost (1024); non-GAME tasks use reduced boost (768). Check-before-write avoids redundant kfunc calls.
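The check-before-write pattern used for CPUPERF steering is simple enough to show in isolation. In the sketch below, `CpuBss` and `cpuperf_update` are hypothetical names; the function returns the value the caller would pass to the cpuperf kfunc (presumably `scx_bpf_cpuperf_set` on the BPF side), or `None` when the cached value already matches and the call can be skipped.

```rust
/// Performance targets quoted above for the GAMING state.
pub const PERF_GAME: u32 = 1024;
pub const PERF_OTHER: u32 = 768;

/// Per-CPU cache of the last performance level requested.
pub struct CpuBss {
    pub cached_perf: u32,
}

/// Returns Some(target) only when the level actually changes.
pub fn cpuperf_update(bss: &mut CpuBss, is_game: bool) -> Option<u32> {
    let target = if is_game { PERF_GAME } else { PERF_OTHER };
    if bss.cached_perf == target {
        // Unchanged: skip both the store and the kfunc call.
        return None;
    }
    bss.cached_perf = target;
    Some(target)
}
```

The same read-then-conditionally-write shape is what the document calls a MESI guard: when the value is unchanged, the cache line is never dirtied.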

4. Architecture

Overview

flowchart TD
    subgraph HOT["BPF Hot Path"]
        SELECT["cake_select_cpu<br/>3-gate + tunnel"] --> |GATE 1| PREV["prev_cpu idle?<br/>scx_bpf_test_and_clear_cpu_idle"]
        SELECT --> |GATE 2| SCAN["Perf-ordered scan<br/>cpus_fast_to_slow / cpus_slow_to_fast<br/>(only when big_core_phys_mask != 0)"]
        SELECT --> |GATE 3| KERNEL["scx_bpf_select_cpu_dfl<br/>kernel authoritative idle scan"]
        SELECT --> |TUNNEL| ENQ["cake_enqueue<br/>all CPUs busy"]

        PREV --> LOCALON["SCX_DSQ_LOCAL_ON<br/>direct to CPU"]
        SCAN --> LOCALON
        KERNEL --> LOCALON

        ENQ --> |"weighted vtime"| LLCDSQ["Per-LLC DSQ"]
        LLCDSQ --> DISPATCH["cake_dispatch"]
        DISPATCH --> |"1. Local LLC<br/>scx_bpf_dsq_move_to_local"| LOCAL["Run task"]
        DISPATCH --> |"2. Cross-LLC steal<br/>(nr_llcs > 1, victim queued > 1)"| STEAL["Steal from other LLC"]
    end

    subgraph CLASSIFY["Classification Engine (cake_stopping)"]
        EVERY["Every stop"] --> DEFICIT["DRR++ deficit drain"]
        EVERY --> WFREQ["Wake frequency EWMA"]
        EVERY --> VTIME["Vtime staging<br/>(staged_vtime_bits)"]
        GATE64["Every 64th stop<br/>(confidence gate)"] --> RECLASS["PELT reclassify<br/>GAME / NORMAL / HOG / BG"]
        GATE64 --> NICE["Nice shift recompute"]
        RUNNING["cake_running<br/>(every context switch)"] --> STAMP["BSS: run_start, tick_slice,<br/>is_yielder, running_class,<br/>wake_freq, game_cpu_mask"]
    end

Source Files

File	Lines	Purpose
`cake.bpf.c`	~3,300	All BPF ops + classification engine + BenchLab
`intf.h`	~690	Shared structs, constants, telemetry definitions
`bpf_compat.h`	~38	Relaxed atomics compatibility shim
`main.rs`	~750	Rust loader, CLI, profiles, topology, audio/compositor detection
`topology.rs`	~270	CPU topology detection (CCDs, P/E cores, V-Cache, SMT)
`calibrate.rs`	~305	ETD inter-core latency measurement (CAS ping-pong)
`tui.rs`	~4,500	Terminal UI: debug view, live matrix, BenchLab, topology

Ops Callbacks

Callback	Role	Hot/Cold
`cake_select_cpu`	3-gate idle CPU selection + kfunc tunneling	Hot
`cake_enqueue`	Weighted vtime insert into per-LLC DSQ	Hot
`cake_dispatch`	Local LLC → cross-LLC steal	Hot
`cake_running`	BSS staging: run_start, is_yielder, game_cpu_mask, cpuperf	Hot (minimal)
`cake_stopping`	Confidence-gated reclassification + DRR++ + vtime staging	Warm
`cake_yield`	Yield count telemetry (stats-gated)	Cold
`cake_runnable`	Preempt count + wakeup source telemetry (stats-gated)	Cold
`cake_set_cpumask`	Event-driven affinity update (replaces polling)	Cold
`cake_init_task`	Arena + task_storage allocation, initial classification	Cold (once)
`cake_exit_task`	Arena deallocation	Cold (once)
`cake_init` / `cake_exit`	DSQ creation, arena init, UEI	Cold (once)

Data Structures

Dual-storage architecture:

cake_task_hot (BPF task_storage, ~10ns lookup) — CL0 scheduling-critical fields used every stop: task_class, deficit_u16, packed_info, warm_cpus, staged_vtime_bits, nice_shift, sleep_lag, cached_cpumask
cake_task_ctx (BPF Arena, ~29ns TLB walk) — Telemetry-only fields, gated behind CAKE_STATS_ACTIVE. Dead in release builds.
cake_cpu_bss (BSS array, L1-cached) — Per-CPU hot fields: run_start, tick_slice, is_yielder, cached_now, idle_hint, waker_boost, cached_perf

Per-CPU arena (cake_per_cpu, conditional sizing):

Release: 64B/CPU (CL0 only, 1 page total)
Debug: 128B/CPU (CL0 + CL1 telemetry, 2 pages total)

DSQ Architecture

flowchart LR
    subgraph SINGLE["Single-CCD (9800X3D)"]
        DSQ0["LLC_DSQ_BASE + 0<br/>vtime ordered<br/>nr_llcs = 1<br/>stealing skipped"]
    end

    subgraph MULTI["Multi-CCD (9950X)"]
        DSQ1["LLC_DSQ_BASE + 0<br/>CCD 0 cores"] <-->|"cross-LLC steal<br/>when local empty"| DSQ2["LLC_DSQ_BASE + 1<br/>CCD 1 cores"]
    end

Vtime encoding: now_cached + dsq_weight — class weight ranges guarantee ordering (GAME always before NORMAL)
RODATA gate: if (nr_llcs <= 1) return; skips all cross-LLC stealing on single-CCD systems
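The dispatch order from the diagrams (local LLC first, then a cross-LLC steal gated on `nr_llcs > 1` and a victim with more than one queued task) can be modeled with plain queues. This is a hypothetical userspace model: each `VecDeque` stands in for one per-LLC DSQ that is already in vtime order, and the round-robin victim search is an assumption.

```rust
use std::collections::VecDeque;

/// Pick the next task for a CPU in `my_llc`, or None if nothing is runnable.
pub fn dispatch(dsqs: &mut [VecDeque<u32>], my_llc: usize) -> Option<u32> {
    // 1. Local LLC first.
    if let Some(task) = dsqs[my_llc].pop_front() {
        return Some(task);
    }
    // RODATA gate: single-LLC systems never attempt a steal.
    let nr_llcs = dsqs.len();
    if nr_llcs <= 1 {
        return None;
    }
    // 2. Cross-LLC steal, only from a victim with more than one task queued.
    for i in 1..nr_llcs {
        let victim = (my_llc + i) % nr_llcs;
        if dsqs[victim].len() > 1 {
            return dsqs[victim].pop_front();
        }
    }
    None
}
```

Leaving a victim's last queued task alone means a steal never empties the queue of the LLC that is about to run it.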

Zero Global State

Anti-pattern	scx_cake
Global atomics	0 (except game_cpu_mask, transition-only)
Volatile variables	0
Division in hot path	0 (shift-based µs conversion: `>> 10`)
Global vtime writes	0 (per-task only)
RCU lock/unlock in hot path	0
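The shift-based conversion in the table trades exactness for speed: `>> 10` divides by 1024 rather than 1000, so the result reads about 2.3% low. A one-line sketch (the function name is hypothetical):

```rust
/// Approximate nanoseconds -> microseconds without a divide instruction.
/// Divides by 1024 instead of 1000, so results are about 2.3% low.
pub fn ns_to_us_approx(ns: u64) -> u64 {
    ns >> 10
}
```

For instance, 1,000,000 ns converts to 976 rather than 1000, which is typically acceptable for thresholds and telemetry but not for exact accounting.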

Kfunc Tunneling

select_cpu caches scx_bpf_now() in per-CPU BSS (cpu_bss[cpu].cached_now). enqueue reuses this value, saving ~15ns (1 kfunc trampoline entry) on the all-busy path.

VPROT: Preemption Protection

When a GAME task enters the DSQ during GAMING state and all CPUs are busy, cake_enqueue actively preempts a non-GAME task:

O(1) victim finding: __builtin_ctzll(~game_cpu_mask) — single tzcnt instruction (1 cycle on Zen 4). Bits set in ~game_cpu_mask correspond to CPUs running non-GAME tasks.
VPROT guard: Before preempting, check if the victim has run long enough to justify interruption. The protection threshold is computed as:
- Base: tick_slice >> 4, clamped to [125µs, 500µs]
- Per-class scaling:
  - NORMAL: 75% (×3>>2) — useful interactive work, strong protection
  - BG: 50% (>>1) — background tasks, moderate protection
  - HOG: 25% (>>2) — bulk CPU consumers, minimal protection
Preempt decision: If elapsed >= vprot_ns → scx_bpf_kick_cpu(victim, SCX_KICK_PREEMPT). Otherwise → suppressed (counted as nr_vprot_suppressed in stats).

This keeps GAME tasks from waiting in the DSQ for a natural context switch (apart from the short VPROT window) while still protecting other tasks from micro-slicing. The per-class scaling means HOG tasks (compilers, render farms) get preempted quickly while NORMAL desktop tasks get reasonable protection.
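The victim search and the VPROT threshold are plain integer arithmetic, sketched here in Rust. The constants and shifts come from the description above; the function names, the `nr_cpus` bound (at most 64 CPUs in one mask word), and treating a GAME victim as never preemptible are assumptions of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    Game,
    Normal,
    Hog,
    Bg,
}

const VPROT_MIN_NS: u64 = 125_000; // 125 µs
const VPROT_MAX_NS: u64 = 500_000; // 500 µs

/// First CPU whose bit is clear in game_cpu_mask, i.e. one running a
/// non-GAME task. trailing_zeros() plays the role of __builtin_ctzll.
pub fn find_victim(game_cpu_mask: u64, nr_cpus: u32) -> Option<u32> {
    let cpu = (!game_cpu_mask).trailing_zeros(); // 64 when every bit is set
    if cpu < nr_cpus { Some(cpu) } else { None }
}

/// Protection window for a victim of the given class.
pub fn vprot_ns(tick_slice_ns: u64, victim: Class) -> u64 {
    let base = (tick_slice_ns >> 4).clamp(VPROT_MIN_NS, VPROT_MAX_NS);
    match victim {
        Class::Normal => (base * 3) >> 2, // 75%
        Class::Bg => base >> 1,           // 50%
        Class::Hog => base >> 2,          // 25%
        Class::Game => u64::MAX,          // assumption: never preempted here
    }
}

/// Preempt only once the victim has run past its protection window.
pub fn should_preempt(elapsed_ns: u64, tick_slice_ns: u64, victim: Class) -> bool {
    elapsed_ns >= vprot_ns(tick_slice_ns, victim)
}
```

With a 2 ms tick slice the base is 125 µs, so a HOG victim is protected for about 31 µs, a BG victim for about 62 µs, and a NORMAL victim for about 94 µs.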

Starvation Guard

The tiered DSQ weight system intrinsically prevents starvation:

Non-overlapping weight ranges guarantee that within each class, tasks compete fairly on vtime (runtime cost determines ordering)
New-flow bonus gives newly-woken tasks a vtime advantage, preventing permanent queue-back
Deficit drain ensures long-running tasks lose their new-flow bonus and compete normally
Sleep lag credit rewards voluntary yielders, preventing inversion where a yielding task falls behind a continuous consumer

The clock domain fix in cake_running (using scx_bpf_now() instead of p->se.exec_start) prevents a subtle starvation bug: after ~22 minutes, accumulated IRQ time drift in exec_start would exceed the u32 wrap boundary, corrupting elapsed-time checks and causing unconditional preemption (priority inversion).

Scheduler States

The userspace TUI drives state machine transitions written to BPF BSS:

State	Value	Trigger	Effect
IDLE	0	No game or compiler detected	Baseline — NORMAL/HOG classes only
COMPILATION	1	≥2 compiler processes at ≥78% PELT	Cluster co-scheduling for build locality
GAMING	2	Game detected (Steam/.exe/family)	Full priority system: GAME/HOG/BG active

Loader Intelligence

The Rust loader (main.rs) performs significant one-time work at startup, baking results into BPF RODATA (immutable after load):

Prefcore Ranking → Core Steering Arrays:

Reads /sys/devices/system/cpu/cpu*/cpufreq/amd_pstate_prefcore_ranking for each CPU
Sorts by descending rank (fastest first), grouping SMT siblings together: [best_phys, best_smt, second_phys, second_smt, ...]
Populates cpus_fast_to_slow (GAME scan order) and cpus_slow_to_fast (non-GAME scan order) in RODATA
0xFF sentinel terminates the array on CPUs without prefcore rankings
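A sketch of how the `cpus_fast_to_slow` ordering could be built from per-core rankings. The `Core` struct and `fast_to_slow` function are hypothetical, reading the sysfs files is omitted, and filling unused slots with `0xFF` is one plausible reading of the sentinel rule.

```rust
pub const SENTINEL: u8 = 0xFF;

/// One physical core: primary CPU, optional SMT sibling, prefcore rank.
pub struct Core {
    pub phys: u8,
    pub smt: Option<u8>,
    pub rank: u32,
}

/// Order CPUs fastest-first, keeping each SMT sibling next to its core,
/// and pad the fixed-size array with the 0xFF sentinel.
pub fn fast_to_slow(cores: &[Core], len: usize) -> Vec<u8> {
    let mut order: Vec<&Core> = cores.iter().collect();
    // Descending rank; ties broken by CPU id so the result is deterministic.
    order.sort_by(|a, b| b.rank.cmp(&a.rank).then(a.phys.cmp(&b.phys)));
    let mut out = Vec::with_capacity(len);
    for core in order {
        out.push(core.phys);
        if let Some(sibling) = core.smt {
            out.push(sibling);
        }
    }
    out.truncate(len);
    out.resize(len, SENTINEL);
    out
}
```

A scan over the result would stop at the first `0xFF` entry instead of needing a separate length field.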

Audio Stack Detection (2-phase):

Phase 1 — Comm scan: Searches /proc/*/comm for known audio daemons: pipewire, wireplumber, pipewire-pulse, pulseaudio, jackd, jackdbus
Phase 2 — PipeWire socket scan: Reads /proc/net/unix to find the PipeWire socket inode (/run/user/<uid>/pipewire-0), then scans /proc/*/fd for processes holding a file descriptor to that inode. This catches audio mixer daemons (goxlr-daemon, easyeffects, etc.) without brittle comm lists.

Up to 8 audio TGIDs baked into audio_tgids[] RODATA. During GAMING, BPF promotes matching tasks to GAME class.
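Phase 2 relies on two text formats: `/proc/net/unix` lists one socket per line with the inode in the seventh column and the bound path in the eighth, and a socket file descriptor under `/proc/<pid>/fd` resolves to a link of the form `socket:[<inode>]`. The sketch below parses both from strings; the function names are hypothetical and it returns only the first matching line.

```rust
/// Find the inode of a Unix socket by its path in the text of /proc/net/unix.
pub fn socket_inode(proc_net_unix: &str, socket_path: &str) -> Option<u64> {
    proc_net_unix.lines().skip(1).find_map(|line| {
        // Columns: Num RefCount Protocol Flags Type St Inode Path
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() >= 8 && fields[7] == socket_path {
            fields[6].parse().ok()
        } else {
            None
        }
    })
}

/// Does a /proc/<pid>/fd link target refer to the given socket inode?
pub fn fd_points_to(link_target: &str, inode: u64) -> bool {
    link_target == format!("socket:[{}]", inode)
}
```

A loader would obtain `link_target` with `std::fs::read_link` on each entry of `/proc/<pid>/fd`, skipping processes it has no permission to inspect.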

Compositor Detection:

Scans /proc/*/comm for known Wayland/X11 compositors: kwin_wayland, kwin_x11, mutter, gnome-shell, sway, Hyprland, weston, labwc, wayfire, river, gamescope
Up to 4 compositor TGIDs baked into compositor_tgids[] RODATA. During GAMING, compositors receive GAME-level priority — essential for frame presentation latency parity.

Topology Arrays:

cpu_sibling_map[] — SMT sibling pairs (Gate 2 class-mismatch filter)
cpu_llc_id[] — Per-CPU LLC assignment (DSQ sharding)
llc_cpu_mask[], core_cpu_mask[] — Bitmask sets for LLC/core grouping
big_core_phys_mask, little_core_mask, vcache_llc_mask — Intel hybrid + V-Cache topology

5. Configuration

Profiles (`--profile, -p`)

Profile	Quantum	Starvation	Use Case
gaming	2ms	100ms	(Default) Balanced for most games
esports	1ms	50ms	Competitive FPS, ultra-low latency
legacy	4ms	200ms	Older CPUs, reduced overhead
battery	4ms	200ms	Power-efficient for handhelds/laptops
default	2ms	100ms	Alias for gaming

CLI Arguments

Argument	Default	Description
`--profile, -p <PROFILE>`	`gaming`	Select preset profile
`--quantum <µs>`	profile	Base time slice in microseconds
`--new-flow-bonus <µs>`	profile	Extra deficit for newly woken tasks
`--starvation <µs>`	profile	Max run time before forced preemption
`--verbose, -v`	`false`	Enable live TUI stats display
`--interval <secs>`	`1`	TUI refresh interval
`--testing`	`false`	Automated benchmarking mode (see below)

Testing Mode

--testing runs an automated benchmark for CI and regression testing:

Warmup: 1 second pause for the scheduler to stabilize
Collection: 10 seconds of operation, sampling per-CPU BSS dispatch counters
Output: A single JSON object on stdout (pretty-printed here for readability):

{
  "duration_sec": 10.0,
  "total_dispatches": 1847263,
  "dispatches_per_sec": 184726.3
}

The scheduler exits automatically after printing. Requires a debug build (release builds silently ignore --testing). Useful for comparing scheduling throughput across code changes.
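The three output fields are related by simple arithmetic: the total is the sum of per-CPU counter deltas over the collection window, and the rate is that total divided by the duration. A sketch with a hypothetical `summarize` function and hand-rolled formatting (the real loader may well use a JSON library):

```rust
/// Build the summary from per-CPU dispatch counters sampled before and
/// after the collection window.
pub fn summarize(before: &[u64], after: &[u64], duration_sec: f64) -> String {
    let total: u64 = before
        .iter()
        .zip(after)
        .map(|(b, a)| a.saturating_sub(*b))
        .sum();
    format!(
        "{{\"duration_sec\": {:.1}, \"total_dispatches\": {}, \"dispatches_per_sec\": {:.1}}}",
        duration_sec,
        total,
        total as f64 / duration_sec
    )
}
```

With the sample numbers above, 1,847,263 dispatches over 10 seconds gives 184,726.3 dispatches per second.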

Yield-Gated Quantum

Instead of per-tier multipliers, scx_cake uses a yield-gated quantum system:

Yielders (cooperative tasks that voluntarily yield): Get full quantum ceiling (up to 2ms default)
Non-yielders (bulk consumers): Get PELT-scaled slice, capped per class
GAME during GAMING: 2x quantum ceiling (tasks yield at vsync, so they'll never consume it all)
HOG/BG during GAMING: Halved caps (forces more preemption points for GAME tasks)

[!NOTE] Higher weight = dispatches later. GAME [0-5120] dispatches before NORMAL [8192-13312] dispatches before HOG [16384-21504] dispatches before BG [49152-54272]. Within each class, PELT utilization and runtime cost provide fine-grained ordering.

Examples

# Default gaming profile
sudo scx_cake

# Competitive gaming
sudo scx_cake -p esports

# Gaming with custom quantum and live stats
sudo scx_cake --quantum 1500 -v

# Battery-friendly for laptop gaming
sudo scx_cake -p battery

6. Build Modes

scx_cake has two build modes that control whether telemetry instrumentation is compiled into the BPF code.

Release Mode (`cargo build --release`)

build.rs passes -DCAKE_RELEASE=1 to Clang
All #ifndef CAKE_RELEASE blocks are dead-code eliminated — arena telemetry, per-task counters, gate hit tracking, BenchLab, and the entire cake_task_ctx arena struct are compiled out
CAKE_STATS_ENABLED is a compile-time constant 0 — Clang eliminates all telemetry branches at BPF compile time
Per-CPU arena blocks shrink from 128B (debug) to 64B (release)
TUI still works but only shows aggregate BSS stats (gate latencies, dispatch counts). Per-task arena fields (gate hit %, callback durations, quantum breakdown) are unavailable

[!TIP] Use --release for production gaming. The telemetry overhead is small (~2-5%), but eliminating it gives the tightest possible scheduling latency.

Debug Mode (`cargo build`)

CAKE_RELEASE is not defined — all telemetry code is compiled in
CAKE_STATS_ENABLED becomes a volatile BSS read of enable_stats, controllable at runtime
CAKE_STATS_ACTIVE = CAKE_STATS_ENABLED && !bench_active — telemetry is suppressed during BenchLab runs to avoid measuring measurement overhead
Full per-task arena telemetry: gate hit counters, callback duration timers, quantum utilization, waker chain, LLC placement, dispatch gap tracking
Required for: full TUI live matrix, BenchLab benchmarks, per-task gate analysis

# Production (release) — minimal overhead, limited TUI
cargo build --release -p scx_cake
sudo ./target/release/scx_cake -v

# Development (debug) — full telemetry, complete TUI
cargo build -p scx_cake
sudo ./target/debug/scx_cake -v

7. TUI Guide

The TUI is activated with --verbose / -v and provides real-time visibility into every scheduling decision. It requires a debug build for full per-task telemetry.

Tabs

Navigate between tabs with Tab / → (next) and Shift-Tab / ← (previous).

Tab	Content
Dashboard	Aggregate stats header + live task matrix with per-task scheduling data
Topology	CPU topology map: CCDs, P/E cores, V-Cache, SMT siblings, core-to-core latency heatmap
BenchLab	In-kernel kfunc microbenchmarks: 60+ operations with ns-precision timing
Reference Guide	Quick reference for column meanings, keybindings, and terminology

Dashboard: Aggregate Stats

The top section shows system-wide scheduling statistics aggregated from per-CPU BSS counters:

Gate hit rates: Gate 1 (prev_cpu idle) %, Gate 2 (perf-ordered scan) %, Gate 3 (kernel fallback) %, Tunnel %
Dispatch stats: DSQ queued, consumed, local dispatches, cross-LLC steals, dispatch misses
Flow stats: New flow dispatches, old flow dispatches, DRR++ deficit activity
Game state: Detected game name, TGID, PPID, thread count, sched_state (IDLE/COMPILATION/GAMING)
EEVDF stats: Sleep lag applications, vprot suppression count, nice shift distribution

Dashboard: Live Task Matrix

The scrollable table shows one row per task with these columns:

Column	Meaning
CPU	Last CPU the task ran on
PID	Process ID
ST	Liveness: ●LIVE (BPF-tracked + running), ○IDLE (alive, no BPF data), ✗DEAD (exited)
COMM	Task command name (from `/proc/PID/comm`)
CLS	Task class: GAME 🎮, NORM, HOG 🔥, BG
VCSW	Voluntary context switches (delta per interval)
AVGRT	Average runtime per execution bout (µs)
MAXRT	Maximum single-run duration (µs)
GAP	Average dispatch gap — time between consecutive runs (µs)
JITTER	Scheduling jitter — variance in dispatch timing (µs)
WAIT	Average wait time in DSQ before dispatch (µs)
RUNS/s	Scheduling frequency — how often this task runs per second
CPU	Current/last CPU placement (core ID)
SEL	`select_cpu` callback duration (ns)
ENQ	`enqueue` callback duration (ns)
STOP	`stopping` callback duration (ns)
RUN	`running` callback duration (ns)
G1	Gate 1 (prev_cpu idle) hit percentage for this task
G3	Gate 3 (kernel scan) hit percentage for this task
DSQ	DSQ tunnel (all busy) hit percentage
MIGR/s	CPU migrations per second
TGID	Thread Group ID (process leader PID)
Q%F	Quantum full — % of runs where task consumed entire time slice
Q%Y	Quantum yield — % of runs where task yielded voluntarily
Q%P	Quantum preempt — % of runs where task was preempted
WAKER	PID of the last task that woke this task
NICE	Nice shift value (0-12) — EEVDF vruntime scaling factor

Tasks are grouped by TGID (process) with collapsible headers. The detected game family is highlighted.

Topology Tab

Displays the detected CPU topology in a visual format:

CCD map: Which cores belong to which LLC/CCD
P/E core identification: Big cores vs Little cores (Intel hybrid)
V-Cache detection: Asymmetric LLC sizes across CCDs
SMT siblings: Logical CPU pairs sharing a physical core
Core-to-core latency matrix: Press b to run an ETD (Empirical Topology Discovery) benchmark measuring CAS ping-pong latency between every core pair. Results display as a color-coded heatmap.

BenchLab Tab

In-kernel microbenchmark suite measuring the real cost of scheduling primitives:

60+ benchmarked operations: kfunc trampolines, BSS reads, arena lookups, RODATA constants, data structure accesses, idle cascades, classification paths
Each operation shows: ID, Name, Category (Data Read, Synchronization, Idle Selection, etc.), Type (K=kfunc, C=cake-internal), and measured latency in nanoseconds
Press b to trigger a benchmark run (runs in-kernel, suppresses telemetry during measurement)
Results persist across runs; press c to copy to clipboard for external analysis

Keybindings

Key	Action
`q` / `Esc`	Quit
`Tab` / `→`	Next tab
`Shift-Tab` / `←`	Previous tab
`↑` / `↓`	Scroll table / bench results
`t`	Jump to top of table
`s`	Cycle sort column (PID → PELT → Runs/s → CPU → Gate1% → ...)
`S`	Toggle sort direction (ascending ↔ descending)
`f`	Toggle filter: BPF-tracked only ↔ all system tasks
`Enter`	Collapse/expand PPID group (Dashboard)
`Space`	Collapse/expand TGID group (Dashboard)
`x`	Fold/unfold all PPID groups (Dashboard)
`c`	Copy current tab data to clipboard
`d`	Dump full snapshot to timestamped file (`tui_dump_<epoch>.txt`)
`r`	Reset all BSS stats counters to zero
`b`	Run benchmark (BenchLab on most tabs, core latency on Topology tab)
`+` / `=`	Faster refresh rate (halve interval, min 250ms)
`-`	Slower refresh rate (double interval, max 5000ms)
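The refresh-rate keys in the table amount to halving or doubling the interval with clamping at both ends. A minimal sketch with hypothetical function names:

```rust
pub const MIN_INTERVAL_MS: u64 = 250;
pub const MAX_INTERVAL_MS: u64 = 5_000;

/// `+` / `=`: halve the interval, but never go below 250 ms.
pub fn faster(interval_ms: u64) -> u64 {
    (interval_ms / 2).max(MIN_INTERVAL_MS)
}

/// `-`: double the interval, but never go above 5000 ms.
pub fn slower(interval_ms: u64) -> u64 {
    interval_ms.saturating_mul(2).min(MAX_INTERVAL_MS)
}
```

Starting from the default 1-second interval, two presses of `+` reach the 250 ms floor and three presses of `-` reach the 5000 ms ceiling.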

Data Export

Two export methods are available from any tab:

Clipboard (c): Copies the current tab's data as formatted text. On Dashboard, includes aggregate stats + full task matrix. On BenchLab, includes all benchmark results with categories.
File dump (d): Writes a complete snapshot to tui_dump_<epoch>.txt in the current working directory. Includes all stats, task matrix, and metadata. Useful for before/after comparisons or sharing with developers.

8. Performance

Target Hardware

Component	Specification
CPU	AMD Ryzen 7 9800X3D (1 CCD, 8C/16T)
Kernel	Linux 6.12+ with sched_ext

Design Targets

Sub-microsecond scheduling decisions — Select CPU + enqueue under 100ns typical
Zero bus lock contention — No global atomics means no scaling regression under load
Consistent 1% lows — Tiered weight system prevents frame time spikes from background work
Automatic game detection — Two-phase Steam/Wine detection with holdoff hysteresis

Benchmarks

schbench — Scheduler latency microbenchmark
Arc Raiders — AAA game stress testing (frame rates, 1% lows)
Splitgate 2 — Competitive FPS latency testing

[!NOTE] Throughput workloads (compilers, render farms) will perform worse than CFS/EEVDF. This scheduler explicitly trades throughput for latency — the same tradeoff network CAKE makes for packets.

Comparison with Other sched_ext Schedulers

Feature	scx_cake	scx_bpfland	scx_lavd	scx_cosmos
Primary goal	Gaming latency	General-purpose interactive	Latency-sensitive workloads	General-purpose + gaming
Task classification	4-class (GAME/NORMAL/HOG/BG)	Interactive vs batch (2-class)	Urgency score (continuous)	Latency criticality (multi-level)
Game detection	Automatic (Steam/Wine/process tree)	None	Behavioral (latency patterns)	None
DSQ structure	Per-LLC vtime-ordered	Per-LLC + shared	Global ordered	Per-LLC
Idle CPU selection	3-gate custom + kernel fallback	Kernel default	Custom idle scan	Kernel default + LLC preference
EEVDF features	Sleep lag, nice scaling, vprot	None	Urgency scoring	None
Core steering	Prefcore-aware (P/E, V-Cache)	LLC-aware	LLC-aware	LLC-aware
Global atomics	0 (MESI-guarded BSS)	Minimal	Some	Minimal
When to choose	Gaming-first, multi-game families	Desktop daily driver	Mixed interactive + throughput	Desktop with gaming needs

[!NOTE] scx_cake is not a general-purpose scheduler. It makes explicit tradeoffs that benefit gaming at the cost of throughput workloads. For daily desktop use without gaming, scx_bpfland or scx_cosmos may be better choices.

9. Vocabulary

Core Concepts

Term	Definition
CAKE	Common Applications Kept Enhanced. Network AQM algorithm this scheduler adapts.
DRR++	Deficit Round Robin++. Network algorithm balancing fair queuing with strict priority.
PELT	Per-Entity Load Tracking. Kernel mechanism providing exponentially-decayed CPU utilization per task (`util_avg`).
Class	Classification level (GAME/NORMAL/HOG/BG). Controls DSQ weight, quantum cap, and preemption policy.
Deficit	Per-task credit from DRR++. New tasks get bonus credit; GAME tasks skip deficit drain entirely.
Quantum	Base time slice a task is allotted before a scheduling decision.
Sleep Lag	EEVDF credit for voluntary yields. Reduces DSQ weight on next wakeup, so yielders dispatch first.
Waker Boost	Transitive priority inheritance: tasks woken by GAME threads get GAME-level priority for one cycle.
Staged Bits	Pre-computed scheduling state packed into a single u64 by `cake_stopping`, consumed by `cake_enqueue` to avoid redundant computation.
Jitter	Variance in scheduling latency between consecutive events. Low jitter = consistent frame delivery.

Architecture

Term	Definition
Task Storage	BPF local storage attached to each task. 10ns lookup; holds CL0 hot scheduling fields.
Arena	BPF memory region for per-task telemetry data. 29ns TLB walk; dead in release builds.
BSS	Zero-initialized BPF global data. Per-CPU arrays (`cpu_bss`) provide L1-cached hot fields.
Kfunc Tunneling	Caching kfunc return values in per-CPU BSS to avoid redundant trampoline calls.
MESI Guard	Read-before-write pattern: skip store if value unchanged, preventing cache invalidation.
RODATA Gate	Compile-time constant that eliminates entire code paths (e.g., single-CCD skips stealing).
Confidence Gate	Reclassification runs every 64th stop. 63/64 stops reuse cached task_class from task_storage.
BenchLab	In-kernel microbenchmark suite measuring kfunc costs, data access patterns, and gate cascades.
Bitfield Coalescing	Packing fields written together into adjacent bits for fused clear/set ops.

Hardware

Term	Definition
CCD	Core Complex Die. Physical chiplet containing cores (9800X3D: 1 CCD, 9950X: 2 CCDs).
LLC	Last Level Cache (L3). Cores in same LLC communicate ~3-5x faster than cross-LLC.
SMT	Simultaneous Multi-Threading. Two logical CPUs per physical core.
P/E Cores	Intel hybrid architecture: Performance cores (fast) and Efficiency cores (power-saving).
V-Cache	AMD 3D V-Cache. Extra L3 stacked on a CCD; on dual-CCD parts this makes LLC sizes asymmetric across CCDs (e.g., 96MB vs 32MB on the 7950X3D).
ETD	Empirical Topology Discovery. Measures inter-core CAS latency at startup for display.
Cache Line	64-byte block of memory. Smallest unit the CPU loads from RAM. Foundation of data layout.
MESI Protocol	Cache coherency protocol (Modified/Exclusive/Shared/Invalid). Unnecessary writes trigger invalidations.

Performance

Term	Definition
Hot Path	Code on every scheduling decision: select_cpu → enqueue → dispatch.
Cold Path	Infrequent code: task init, reclassification (every 64th stop).
Direct Dispatch	`SCX_DSQ_LOCAL_ON` — task goes directly to a CPU's local queue, bypassing the DSQ path.
1% Lows	Average framerate of the slowest 1% of frames. Key metric for stutter.
Branchless	Code avoiding `if/else` to prevent CPU pipeline stalls from branch misprediction.

Anti-Patterns

Term	Definition
False Sharing	Performance penalty when CPUs write to different data on the same 64-byte cache line.
Cache Invalidation	Forcing other cores to discard cached data via unnecessary writes. Causes bus locking.
Micro-slicing	Preempting tasks too frequently. Queued interruptions degrade throughput and increase jitter.
Volatile	Compiler hint preventing optimization. Clogs LSU, breaks ILP/MLP parallelism. Avoid in BPF.

License: GPL-2.0
Maintainer: RitzDaCat

scx_cake 1.1.1