dsfb-gpu-debug bench — measure pipeline wall-clock on CPU and GPU.
Times only the pipeline functions (no fixture I/O, no canonical JSON emission, no hash-chain construction outside the bare pipeline). The goal is to surface the cost of the deterministic-inference chain itself, separated from the surrounding orchestration.
Flags:
- --iters N (default 100): measured iterations.
- --warmup N (default 10): warmup iterations excluded from stats. Excluding the warmup from stats is what keeps first-call CUDA context creation out of the published numbers (the user’s step 6 in the optimization roadmap).
- --backend cpu|gpu|both (default both).
- --detail (no value): also report per-stage CUDA-event timings for the GPU path (alloc / H2D / kernel 1..5 / D2H / free / total). Implies --backend gpu if not otherwise set.
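The flag defaults and the --detail rule above can be sketched as a small hand-rolled parser. This is an illustration only: the names BenchArgs, Backend, parse_args and parse_count are hypothetical and are not taken from the dsfb-gpu-debug source, which may parse its flags differently.

```rust
// Hypothetical sketch of the documented flags; not the crate's actual parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Gpu,
    Both,
}

#[derive(Debug, PartialEq, Eq)]
struct BenchArgs {
    iters: usize,
    warmup: usize,
    backend: Backend,
    detail: bool,
}

// Parse the value that follows a numeric flag such as --iters.
fn parse_count<S: AsRef<str>>(value: Option<S>, flag: &str) -> Result<usize, String> {
    let raw = value.ok_or_else(|| format!("{flag} needs a value"))?;
    raw.as_ref()
        .parse::<usize>()
        .map_err(|e| format!("{flag}: {e}"))
}

fn parse_args<I, S>(args: I) -> Result<BenchArgs, String>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    // Documented defaults: 100 measured iterations, 10 warmup iterations.
    let mut iters: usize = 100;
    let mut warmup: usize = 10;
    // None means "the user did not pass --backend".
    let mut backend: Option<Backend> = None;
    let mut detail = false;

    let mut it = args.into_iter();
    while let Some(arg) = it.next() {
        match arg.as_ref() {
            "--iters" => iters = parse_count(it.next(), "--iters")?,
            "--warmup" => warmup = parse_count(it.next(), "--warmup")?,
            "--backend" => {
                let value = it.next().ok_or("--backend needs a value")?;
                backend = Some(match value.as_ref() {
                    "cpu" => Backend::Cpu,
                    "gpu" => Backend::Gpu,
                    "both" => Backend::Both,
                    other => return Err(format!("unknown backend: {other}")),
                });
            }
            "--detail" => detail = true,
            other => return Err(format!("unknown flag: {other}")),
        }
    }

    // --detail implies the GPU backend only when --backend was not given.
    let backend = backend.unwrap_or(if detail { Backend::Gpu } else { Backend::Both });
    Ok(BenchArgs { iters, warmup, backend, detail })
}
```

Keeping the backend as an Option until the end is what lets the sketch distinguish "not set" from an explicit --backend both, so an explicit choice always wins over the --detail default.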
Numbers are reported as min / median / mean / max in microseconds. No formal benchmarking framework is used; this is a transparency tool, not a deployment performance claim. The v0 launch geometry is 1 thread per entity (16 threads), which dramatically under-utilizes the hardware; the bench reports what the architecture actually does today and makes that posture observable.
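A minimal sketch of how warmup exclusion and the min / median / mean / max report could fit together, using only std::time::Instant. The names time_iterations, summarize and Summary are hypothetical, and averaging the two middle samples for an even count is an assumption; the source does not say how the bench defines its median.

```rust
use std::time::Instant;

// Hypothetical summary type; field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Summary {
    min_us: f64,
    median_us: f64,
    mean_us: f64,
    max_us: f64,
}

// Run `f` warmup + iters times, but record wall-clock samples (in
// microseconds) only for the measured iterations.
fn time_iterations<F: FnMut()>(warmup: usize, iters: usize, mut f: F) -> Vec<f64> {
    // Warmup runs absorb one-time costs and are never timed.
    for _ in 0..warmup {
        f();
    }
    let mut samples_us = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        f();
        samples_us.push(start.elapsed().as_secs_f64() * 1e6);
    }
    samples_us
}

// Reduce samples to min / median / mean / max; None for an empty slice.
fn summarize(samples_us: &[f64]) -> Option<Summary> {
    if samples_us.is_empty() {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_by(f64::total_cmp);
    let n = sorted.len();
    // Assumption: even-length median is the mean of the two middle samples.
    let median_us = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean_us = sorted.iter().sum::<f64>() / n as f64;
    Some(Summary {
        min_us: sorted[0],
        median_us,
        mean_us,
        max_us: sorted[n - 1],
    })
}
```

Reporting the median alongside the mean is useful here because a single slow iteration moves the mean and max but leaves the median largely unchanged.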
.expect() is used inside the GPU bench so a CUDA error aborts
loudly rather than silently producing meaningless timings.
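The fail-loudly choice can be illustrated with a stand-in for a fallible GPU stage. Everything here is hypothetical: GpuError, launch_stage and timed_stage_us are invented for the sketch and do not correspond to any function in the crate or in a CUDA binding.

```rust
// Hypothetical error type standing in for whatever the CUDA layer returns.
#[derive(Debug)]
struct GpuError(String);

// Stand-in for a GPU stage that returns its elapsed time in microseconds,
// or an error if the (simulated) CUDA call failed.
fn launch_stage(fail: bool) -> Result<f64, GpuError> {
    if fail {
        Err(GpuError("simulated launch failure".to_string()))
    } else {
        Ok(12.5)
    }
}

// Panics on error instead of substituting a placeholder sample.
// Writing `.unwrap_or(0.0)` here would keep the bench running, but it would
// feed a bogus 0 us sample into the statistics.
fn timed_stage_us(fail: bool) -> f64 {
    launch_stage(fail).expect("CUDA stage failed; timings would be meaningless")
}
```

With this shape, a failed stage stops the run with the expect message and the Debug form of the error, so no partial statistics are printed.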