//! Per-phase timing types for the WCOJ triangle dispatch.
//!
//! Compiled in only when the `wcoj-phase-timing` Cargo feature is
//! enabled. Builds without the feature, including production builds,
//! omit this module entirely, so the hot path pays no timing overhead.
//!
//! Uses CUDA events to record four GPU stream-time phases inside the
//! count+scan+materialize entries:
//! * `count_ms`: the count kernel plus the
//!   `count_buf → offsets_buf` device-to-device copy.
//! * `scan_ms`: `multiblock_scan_u32_inplace_on_stream`.
//! * `total_ms`: the `wcoj_compute_total` kernel.
//! * `materialize_ms`: the materialize kernel.
//!
//! Each value is the stream time elapsed between two
//! `cudarc::driver::CudaEvent` records, returned as `f32`
//! milliseconds (CUDA's native event-elapsed precision). The
//! sum of the four buckets is `triangle_gpu_total_ms`. Host-side
//! overhead between phases (the scalar D2H of `total_rows`, output
//! buffer allocation, the H2D of `out_d_num_rows`, and recorder
//! setup/commit) is not part of these buckets; it lands in the
//! caller's residual bucket.
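//!
//! As a rough illustration of the bucket arithmetic described above, the
//! sketch below sums four phase values and derives a host-side residual.
//! `PhaseTimingsSketch`, `gpu_total_ms`, `residual_ms`, and the
//! caller-measured `wall_ms` input are hypothetical names used only for
//! this example. They are assumptions, not the types this module defines,
//! and computing the residual as wall time minus GPU total is likewise an
//! assumption about how a caller might do it.
//!
//! ```rust
//! /// Hypothetical stand-in for a per-phase timing record (values in ms).
//! #[derive(Debug, Clone, Copy, Default, PartialEq)]
//! struct PhaseTimingsSketch {
//!     count_ms: f32,
//!     scan_ms: f32,
//!     total_ms: f32,
//!     materialize_ms: f32,
//! }
//!
//! impl PhaseTimingsSketch {
//!     /// Sum of the four GPU buckets (the `triangle_gpu_total_ms` quantity).
//!     fn gpu_total_ms(&self) -> f32 {
//!         self.count_ms + self.scan_ms + self.total_ms + self.materialize_ms
//!     }
//!
//!     /// Assumed residual: caller-measured wall time minus the GPU total.
//!     fn residual_ms(&self, wall_ms: f32) -> f32 {
//!         wall_ms - self.gpu_total_ms()
//!     }
//! }
//! ```
//!
//! For instance, phase values of 1.5, 0.25, 0.125, and 2.0 ms sum to
//! 3.875 ms; against 5.0 ms of wall-clock time, this sketch would
//! attribute the remaining 1.125 ms to the residual bucket.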