# GPU Feasibility Notes
This repo does not have a production GPU backend. The first useful test is a
standalone feasibility probe for the part of bounded counting most likely to
benefit from the GPU: sort/reduce of packed canonical short k-mers.
## Probe
Use:
```bash
scripts/cuda_kmer_sort_reduce_probe.py \
--extractor rust \
--stream \
--reads 500000 \
--outdir tmp/cuda_probe_500k_$(date +%Y%m%d) \
--force-rebuild
```
The probe:
- Extracts canonical `k=31` short k-mers from paired FASTQ/FASTQ.GZ into a
binary `u64` stream. The default `--extractor rust` path uses the crate's
real parser and k-mer generation code; `--extractor python` is retained as a
simple control.
- Compiles a tiny CUDA/CUB program with `nvcc`.
- Compares CPU `std::sort` plus adjacent reduce against GPU
host-to-device copy, CUB radix sort, CUB run-length encode, and device-to-host
copy.
- Supports `--stream`, which pipes Rust extractor stdout directly into the CUDA
helper instead of requiring an intermediate `.u64` file.
- Checks that CPU and GPU reduced `(key, count)` streams match by unique count
and checksum.
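The extraction and CPU reference steps above can be sketched in a few lines of Python. This is a minimal illustration, not the crate's code: the 2-bit base encoding (`A=0, C=1, G=2, T=3`), the handling of non-ACGT bases, and the function names `canonical_kmers` and `sort_reduce` are all assumptions made for the example.

```python
from itertools import groupby

K = 31
# Assumed 2-bit encoding; the crate's real encoding may differ.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def canonical_kmers(seq, k=K):
    """Yield min(forward, reverse-complement) packed k-mers as integers.

    Windows containing a non-ACGT base are skipped (an assumption).
    """
    mask = (1 << (2 * k)) - 1
    fwd = rev = 0
    valid = 0  # consecutive valid bases seen so far
    for base in seq.upper():
        code = ENC.get(base)
        if code is None:
            fwd = rev = valid = 0
            continue
        # Forward strand: append the new base at the low end.
        fwd = ((fwd << 2) | code) & mask
        # Reverse complement: prepend the complement at the high end.
        rev = (rev >> 2) | ((3 - code) << (2 * (k - 1)))
        valid += 1
        if valid >= k:
            yield min(fwd, rev)


def sort_reduce(keys):
    """Sort, then collapse adjacent equal keys into (key, count) runs."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted(keys))]


if __name__ == "__main__":
    kmers = list(canonical_kmers("ACGT", k=2))
    print(kmers, sort_reduce(kmers))
```

With `k=31` every packed key uses 62 bits, so it fits in one `u64`, which is what makes a plain radix sort over 64-bit keys applicable.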
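Extraction time is reported separately from GPU timing. The Rust extractor
still runs as a separate helper process rather than inside the production
pipeline, but because it uses the crate's real parser and k-mer generation
code, its output is close enough to the production shape to make the result
useful.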
## Local Results
Machine:
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- VRAM: 8151 MiB
- Driver: 595.71.05
- CUDA: 13.2
Input:
- `tmp/human_benchmark_8threads/human_GRCh38_500k_R1.fq.gz`
- `tmp/human_benchmark_8threads/human_GRCh38_500k_R2.fq.gz`
- `k=31`
CUB helper with Rust extractor:
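| Reads | k-mers | Extraction | CPU sort/reduce | GPU total (with copies) | GPU sort/reduce only | Match |
|---:|---:|---:|---:|---:|---:|:---:|
| 1,000 | 240,000 | 0.009 s | 17.536 ms | 0.697 ms | 0.185 ms | true |
| 50,000 | 12,000,000 | 0.269 s | 688.053 ms | 24.977 ms | 6.462 ms | true |
| 500,000 | 120,000,000 | 2.462 s | 7890.760 ms | 242.999 ms | 64.580 ms | true |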
CUB streaming mode after correctness was established, with
`--skip-cpu-reference`:
| 500,000 | 120,000,000 | 2.898 s | 3.833 s | 243.430 ms | 64.571 ms | not checked |
The streaming checksum `5943999645108915372` matches the checksum of the
verified CUB 500k run above; `--skip-cpu-reference` only skips the redundant
helper-side CPU sort/reduce.
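For readers unfamiliar with the stream format, the consumer side of a binary `u64` pipe can be sketched as below. This is an illustrative Python reader, not the CUDA helper itself: the function name `read_u64_batches`, the batch size, and the assumption that values are written in the machine's native byte order are all hypothetical.

```python
import io
from array import array


def read_u64_batches(stream, batch_keys=1 << 20):
    """Yield array('Q') batches from a binary stream of native-endian u64s.

    `stream` is any buffered binary file object, for example
    sys.stdin.buffer when the extractor's stdout is piped in.
    """
    item = array("Q").itemsize  # 8 bytes on common platforms
    while True:
        chunk = stream.read(batch_keys * item)
        if not chunk:
            return
        if len(chunk) % item:
            raise ValueError("stream ended in the middle of a u64 value")
        batch = array("Q")
        batch.frombytes(chunk)
        yield batch


if __name__ == "__main__":
    # Stand-in for the pipe: three keys in an in-memory buffer.
    fake_pipe = io.BytesIO(array("Q", [1, 2, 3]).tobytes())
    for batch in read_u64_batches(fake_pipe, batch_keys=2):
        print(list(batch))
```

Reading fixed-size batches keeps host memory bounded and gives a natural unit for host-to-device copies, at the cost of one extra pass to merge runs that span batch boundaries if batches are reduced independently.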
Earlier Thrust helper with Rust extractor:
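| Reads | k-mers | Extraction | CPU sort/reduce | GPU total (with copies) | GPU sort/reduce only | Match |
|---:|---:|---:|---:|---:|---:|:---:|
| 1,000 | 240,000 | 0.008 s | 12.677 ms | 273.973 ms | 273.394 ms | true |
| 50,000 | 12,000,000 | 0.250 s | 682.257 ms | 27.748 ms | 8.846 ms | true |
| 500,000 | 120,000,000 | 2.367 s | 7660.800 ms | 235.400 ms | 66.522 ms | true |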
Earlier Python extractor control:
| Reads | k-mers | CPU sort/reduce | GPU total (with copies) | GPU sort/reduce only | Match |
|---:|---:|---:|---:|---:|:---:|
| 1,000 | 240,000 | 16.803 ms | 337.756 ms | 337.156 ms | true |
| 50,000 | 12,000,000 | 733.955 ms | 321.658 ms | 302.764 ms | true |
| 100,000 | 24,000,000 | 1451.000 ms | 321.488 ms | 284.518 ms | true |
| 500,000 | 120,000,000 | 7917.670 ms | 517.186 ms | 338.936 ms | true |
The 500k Rust-extractor slice shows that GPU sort/reduce is feasible at the
same scale as the current publish benchmark: 120M canonical k-mers fit in VRAM
and reduce correctly. CUB removes most of the small-run GPU fixed cost (the
1,000-read GPU time drops from 273.973 ms with Thrust to 0.697 ms with CUB),
but a production path should still threshold on batch size because extraction,
transfer, allocation, and replay overhead dominate small inputs.
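A rough back-of-the-envelope check of the "fits in VRAM" claim, together with a batch-size gate, might look like the sketch below. The memory model is an assumption, not a measurement: it budgets two key-sized buffers for the sort plus worst-case run-length outputs (one unique key and one 4-byte count per input key) and ignores CUB temporary storage. The names `sort_reduce_footprint_mib` and `use_gpu` and the `min_gpu_keys` threshold are hypothetical.

```python
MIB = 1024 * 1024


def sort_reduce_footprint_mib(n_keys, key_bytes=8, count_bytes=4):
    """Estimate device memory for sort + run-length encode, in MiB.

    Assumed model: input keys, one alternate sort buffer, and worst-case
    outputs where every key is unique.
    """
    sort_buffers = 2 * n_keys * key_bytes
    unique_out = n_keys * key_bytes
    counts_out = n_keys * count_bytes
    return (sort_buffers + unique_out + counts_out) / MIB


def use_gpu(n_keys, vram_mib, min_gpu_keys):
    """Pick the GPU path only for batches that are large enough and fit."""
    return n_keys >= min_gpu_keys and sort_reduce_footprint_mib(n_keys) <= vram_mib


if __name__ == "__main__":
    n = 120_000_000
    print(f"keys alone: {n * 8 / MIB:.1f} MiB")  # about 915.5 MiB
    print(f"estimated footprint: {sort_reduce_footprint_mib(n):.1f} MiB")
    # min_gpu_keys is a placeholder; a real value would come from measurement.
    print(use_gpu(n, vram_mib=8151, min_gpu_keys=1_000_000))
```

Under this model the 500k slice needs roughly 3.2 GiB, well inside the 8151 MiB reported for the test machine. The threshold value itself would have to be tuned against end-to-end timings rather than the sort/reduce columns alone.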
## Interpretation
The promising path is not direct GPU count-min atomics. It is:
1. CPU/Rust extracts packed canonical k-mers in large batches.
2. GPU sorts and reduces them into `(key, count)` runs.
3. Rust replays reduced counts into the existing deterministic conservative
sketch path, preserving the parity-safe update semantics.
This preserves the semantics that made deterministic sorted replay publishable,
while moving the expensive key ordering/reduction stage to the GPU.
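The replay step can be illustrated with a toy conservative-update count-min sketch. Everything here is a stand-in for the existing Rust path: the class `ConservativeSketch`, the splitmix64-style hash, and the row layout are assumptions chosen for the example. The point it shows is that adding a reduced `(key, count)` run in one step gives the same counters as `count` consecutive single increments of that key, which is why sorted replay of reduced runs can preserve the sketch contents.

```python
from itertools import groupby

M64 = (1 << 64) - 1


def mix64(x):
    """splitmix64 finalizer, used here only as a stand-in hash."""
    x = (x + 0x9E3779B97F4A7C15) & M64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & M64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & M64
    return x ^ (x >> 31)


class ConservativeSketch:
    """Toy count-min sketch with conservative update (hypothetical layout)."""

    def __init__(self, width, depth):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        return [mix64(key ^ mix64(r)) % self.width for r in range(len(self.rows))]

    def add(self, key, count=1):
        slots = self._slots(key)
        # Conservative update: raise counters only up to (current min + count).
        target = min(row[s] for row, s in zip(self.rows, slots)) + count
        for row, s in zip(self.rows, slots):
            if row[s] < target:
                row[s] = target

    def estimate(self, key):
        return min(row[s] for row, s in zip(self.rows, self._slots(key)))


def replay_reduced(sketch, keys):
    """Sort, reduce to (key, count) runs, and replay each run in one update."""
    for key, run in groupby(sorted(keys)):
        sketch.add(key, sum(1 for _ in run))


if __name__ == "__main__":
    keys = [5, 3, 5, 9, 3, 5]
    reduced = ConservativeSketch(width=8, depth=3)
    replay_reduced(reduced, keys)
    single = ConservativeSketch(width=8, depth=3)
    for key in sorted(keys):
        single.add(key)
    print(reduced.rows == single.rows)
```

In the proposed pipeline the `sorted` and `groupby` calls are the part that would move to the GPU; the `add` loop stays on the CPU in key order.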
The next production-grade experiment should drop the extractor helper-process
pipe and send k-mer batches directly from the Rust side to either:
- a CUDA helper process that consumes binary `u64` batches, or
- a feature-gated CUDA FFI path.
The helper-process route is the lower-risk first integration because it keeps
CUDA build requirements out of the default crates.io package.