Module bench_gpu_scale

dsfb-gpu-debug bench-gpu-scale: the R.7 money-table headline benchmark and the R.8 bottleneck profiler.

Two distinct entry modes share this subcommand:

Default mode (R.7) drives the panel-locked scale sweep that produces the headline reports/money_table.txt. Each row pairs one GPU dispatch path (Layer A device evidence fabric / Layer B throughput verdict summary / Layer C full audit court) against the same fixture’s CPU Layer B baseline so the speedup column is reproducible.
--detail-stage mode (R.8) skips the money-table sweep and instead runs the per-stage bottleneck profiler at three K=1 scale points: canonical 16×128, mid-scale 64×512, and full-scale 256×4096. Each scale point gets its own reports/r8_bottleneck_<grid>_K1.txt containing a table of (stage, median µs, % of wall) and the top 3 stages by absolute time. Honest scope note: per-stage timings for K>1 batches need a separate _timed batched FFI, which is deferred. R.8 therefore uses the K=1 percent breakdown as a proxy for the K=64 row in R.7, because the same kernels run with blockIdx.z = K at batched scale.
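The per-stage table described above can be sketched as follows. This is a minimal illustration, not the profiler's actual code: the names StageRow, median_us, build_table, and top3 are hypothetical, and wall time is assumed here to be the sum of the stage medians, which may differ from how the real profiler measures it.

```rust
/// One row of the per-stage table: (stage, median µs, % of wall).
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct StageRow {
    pub stage: String,
    pub median_us: f64,
    pub pct_of_wall: f64,
}

/// Median of the timing samples, or None when there are no samples.
pub fn median_us(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Build the table from per-stage samples. Stages with no samples are
/// skipped rather than given a made-up value.
/// Assumption: wall time is the sum of the stage medians.
pub fn build_table(stages: &[(&str, Vec<f64>)]) -> Vec<StageRow> {
    let medians: Vec<(&str, f64)> = stages
        .iter()
        .filter_map(|(name, samples)| median_us(samples).map(|m| (*name, m)))
        .collect();
    let wall: f64 = medians.iter().map(|(_, m)| m).sum();
    medians
        .into_iter()
        .map(|(name, m)| StageRow {
            stage: name.to_string(),
            median_us: m,
            pct_of_wall: if wall > 0.0 { 100.0 * m / wall } else { 0.0 },
        })
        .collect()
}

/// The top 3 stages by absolute median time, largest first.
pub fn top3(rows: &[StageRow]) -> Vec<&StageRow> {
    let mut refs: Vec<&StageRow> = rows.iter().collect();
    refs.sort_by(|a, b| b.median_us.total_cmp(&a.median_us));
    refs.truncate(3);
    refs
}
```

Ranking by median_us rather than by percentage keeps the top-3 list in absolute time, matching the description of the report.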

R.7 rows (panel-locked):

Canonical 16×128, K=32: Layer A, Layer B, Layer C CPU, Layer C GPU
Scale-large 256×4096, K ∈ {1, 16, 64, 128}: CPU Layer B, GPU Layer A, GPU Layer B, and Layer C if feasible. K=128 only runs if the BatchedGpuWorkspace allocation succeeds; otherwise the row is marked “not run: alloc refused” and the rest of the sweep continues.
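The "skip the row, keep the sweep going" behavior for K=128 might look roughly like the sketch below. RowOutcome and run_sweep are hypothetical names, and the two closures stand in for the BatchedGpuWorkspace allocation and the timed dispatch, whose real signatures are not shown on this page.

```rust
/// Outcome of one sweep row. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum RowOutcome {
    /// The row ran and produced a measured time.
    Ran { k: usize, gpu_us: f64 },
    /// The row did not run; the reason is reported instead of a number.
    NotRun { k: usize, reason: String },
}

impl RowOutcome {
    /// Text for the report row when nothing was measured.
    pub fn label(&self) -> Option<String> {
        match self {
            RowOutcome::Ran { .. } => None,
            RowOutcome::NotRun { reason, .. } => Some(format!("not run: {reason}")),
        }
    }
}

/// Run every K in order. A refused allocation marks that row as not run
/// and the sweep moves on to the next K instead of aborting.
/// `try_alloc` and `bench` are stand-ins for the real allocation and
/// dispatch calls (assumed, not the crate's API).
pub fn run_sweep<A, B>(ks: &[usize], mut try_alloc: A, mut bench: B) -> Vec<RowOutcome>
where
    A: FnMut(usize) -> Result<(), String>,
    B: FnMut(usize) -> f64,
{
    ks.iter()
        .map(|&k| match try_alloc(k) {
            Ok(()) => RowOutcome::Ran { k, gpu_us: bench(k) },
            Err(_) => RowOutcome::NotRun {
                k,
                reason: "alloc refused".to_string(),
            },
        })
        .collect()
}
```

Modeling the refused row as its own enum variant means a missing measurement can never be confused with a zero or default timing.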

Session-level fields recorded once at the top of the R.7 report:

graph_status — outcome of an opt-in build_gpu_throughput_graph_or_demote call at canonical scale. Either captured or demoted with a short reason. The graph itself does not drive the bench rows (the rows go through the pre-existing layer dispatch paths); the status is recorded so the case file’s launch-plan provenance can be audited later.
graph_plan_hash — the captured topology’s canonical hash, when capture succeeds. Reported as 64 hex chars; absent on demoted.
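One way to model these two session-level fields is sketched below. GraphStatus, is_plan_hash, and session_fields are hypothetical names, and the exact text printed for a demoted graph is an assumption; only the field names and the 64-hex-character hash format come from the description above.

```rust
/// Outcome of the opt-in graph capture. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum GraphStatus {
    Captured { plan_hash: String },
    Demoted { reason: String },
}

/// True when `s` looks like a plan hash: exactly 64 hex characters.
pub fn is_plan_hash(s: &str) -> bool {
    s.len() == 64 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Session-level (field, value) pairs for the top of the report.
/// graph_plan_hash is emitted only when capture succeeded; on a demoted
/// graph the field is absent instead of carrying a placeholder.
/// The "demoted (reason)" wording is assumed, not taken from the report.
pub fn session_fields(status: &GraphStatus) -> Vec<(String, String)> {
    match status {
        GraphStatus::Captured { plan_hash } => vec![
            ("graph_status".to_string(), "captured".to_string()),
            ("graph_plan_hash".to_string(), plan_hash.clone()),
        ],
        GraphStatus::Demoted { reason } => vec![(
            "graph_status".to_string(),
            format!("demoted ({reason})"),
        )],
    }
}
```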

Output:

Console: R.7 prints a === R.7 Money Table === block per row plus a final summary table; R.8 prints a === R.8 Bottleneck Profile === block per scale point.
Files: R.7 writes reports/money_table.txt; R.8 writes reports/r8_bottleneck_<grid>_K<K>.txt per scale point.

Honest reporting: every number printed is measured. Rows that fail to run print n/a in the speedup column and a short reason in the same row. The R doctrine forbids fabricated numbers; this file enforces that by writing measurements only for rows the bench actually completed.
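A minimal sketch of the speedup column under this rule, assuming the speedup is the CPU baseline time divided by the GPU time. speedup_cell is a hypothetical helper, and the two-decimal "x" format is an assumption about the report layout.

```rust
/// Format the speedup cell for one row.
/// A ratio is printed only when both timings were actually measured;
/// anything else yields "n/a" rather than a guessed value.
/// Assumption: speedup = CPU baseline time / GPU time.
pub fn speedup_cell(cpu_us: Option<f64>, gpu_us: Option<f64>) -> String {
    match (cpu_us, gpu_us) {
        (Some(c), Some(g)) if c > 0.0 && g > 0.0 => format!("{:.2}x", c / g),
        _ => "n/a".to_string(),
    }
}
```

Carrying timings as Option<f64> makes the unmeasured case explicit at the type level, so a failed row cannot silently fall through to a numeric default.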

Functions

parse_and_run: Run R.7 with the user-supplied CLI flags. Supported flags:
