Module bench


dsfb-gpu-debug bench — measure pipeline wall-clock on CPU and GPU.

Times only the pipeline functions (no fixture I/O, no canonical JSON emission, no hash-chain construction outside the bare pipeline). The goal is to surface the cost of the deterministic-inference chain itself, separated from the surrounding orchestration.

Flags:

  • --iters N (default 100): measured iterations.
  • --warmup N (default 10): warmup iterations excluded from stats. Excluding them keeps first-call CUDA context creation out of the published numbers (step 6 in the user’s optimization roadmap).
  • --backend cpu|gpu|both (default both).
  • --detail (no value): also report per-stage CUDA-event timings for the GPU path (alloc / H2D / kernel 1..5 / D2H / free / total). Implies --backend gpu if not otherwise set.
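
The flag semantics above can be sketched as a small hand-rolled parser. This is an illustration only, not the module's actual code: `Backend`, `BenchArgs`, `parse_args`, and `parse_count` are hypothetical names, and the sketch assumes that `--detail` switches the backend to `gpu` only when `--backend` was not given explicitly.

```rust
// Hypothetical types and names for illustration; the real module may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Gpu,
    Both,
}

#[derive(Debug, PartialEq, Eq)]
struct BenchArgs {
    iters: usize,
    warmup: usize,
    backend: Backend,
    detail: bool,
}

// Parse a count-valued flag such as `--iters 100`.
fn parse_count(flag: &str, value: Option<&str>) -> Result<usize, String> {
    value
        .ok_or_else(|| format!("{flag} expects a value"))?
        .parse::<usize>()
        .map_err(|e| format!("{flag}: {e}"))
}

fn parse_args(args: &[&str]) -> Result<BenchArgs, String> {
    // Defaults taken from the flag list: 100 measured, 10 warmup iterations.
    let mut iters: usize = 100;
    let mut warmup: usize = 10;
    // `None` means "not set by the user", so --detail can still pick gpu.
    let mut backend: Option<Backend> = None;
    let mut detail = false;

    let mut it = args.iter();
    while let Some(&flag) = it.next() {
        match flag {
            "--iters" => iters = parse_count(flag, it.next().copied())?,
            "--warmup" => warmup = parse_count(flag, it.next().copied())?,
            "--backend" => {
                backend = Some(match it.next().copied() {
                    Some("cpu") => Backend::Cpu,
                    Some("gpu") => Backend::Gpu,
                    Some("both") => Backend::Both,
                    other => {
                        return Err(format!(
                            "--backend expects cpu|gpu|both, got {other:?}"
                        ))
                    }
                });
            }
            "--detail" => detail = true,
            other => return Err(format!("unknown flag: {other}")),
        }
    }

    // Assumed resolution order: explicit --backend wins, then --detail
    // implies gpu, otherwise fall back to the documented default (both).
    let backend = backend.unwrap_or(if detail { Backend::Gpu } else { Backend::Both });
    Ok(BenchArgs { iters, warmup, backend, detail })
}
```

Keeping the backend as an `Option` until the end is what lets the sketch distinguish "defaulted to both" from "explicitly asked for both", which the `--detail` rule depends on.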

Numbers are reported as min / median / mean / max in microseconds. This is not a formal benchmarking framework: it is a transparency tool, not a deployment performance claim. The v0 launch geometry is 1 thread per entity (16 threads), which dramatically under-utilizes the hardware; the bench reports what the architecture actually does today and makes that posture observable.
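
As a rough sketch of the measurement scheme described here (warmup iterations discarded, then min / median / mean / max in microseconds), something like the following would work on the CPU side. The names `measure`, `summarize`, and `Summary` are hypothetical, and the median convention for an even sample count (average of the two middle values) is an assumption; the module may compute it differently.

```rust
use std::time::{Duration, Instant};

// Hypothetical summary type; all values are in microseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Summary {
    min_us: f64,
    median_us: f64,
    mean_us: f64,
    max_us: f64,
}

// Run `f` `warmup` times without timing it, then time `iters` further calls.
// Only the timed calls are returned, so one-off startup costs paid during
// warmup never enter the statistics.
fn measure<F: FnMut()>(warmup: usize, iters: usize, mut f: F) -> Vec<Duration> {
    for _ in 0..warmup {
        f();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        f();
        samples.push(start.elapsed());
    }
    samples
}

// Reduce the samples to min / median / mean / max; `None` if there are none.
fn summarize(samples: &[Duration]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    // Convert nanoseconds to microseconds as f64.
    let mut us: Vec<f64> = samples
        .iter()
        .map(|d| d.as_nanos() as f64 / 1_000.0)
        .collect();
    us.sort_by(|a, b| a.total_cmp(b));

    let n = us.len();
    // Assumed convention: average the two middle values when n is even.
    let median_us = if n % 2 == 1 {
        us[n / 2]
    } else {
        (us[n / 2 - 1] + us[n / 2]) / 2.0
    };
    let mean_us = us.iter().sum::<f64>() / n as f64;

    Some(Summary {
        min_us: us[0],
        median_us,
        mean_us,
        max_us: us[n - 1],
    })
}
```

Reporting the median alongside the mean is useful in a harness like this because a single slow iteration shifts the mean and max noticeably while leaving the median almost unchanged.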

.expect() is used inside the GPU bench so a CUDA error aborts loudly rather than silently producing meaningless timings.

Functions

parse_and_run