oxicuda-graphalg 0.3.0

OxiCUDA: Classical graph algorithms (BFS/DFS, shortest paths, MST, max-flow, matching, SCC, centrality, community, TSP, coloring, isomorphism)
Documentation
# oxicuda-graphalg TODO

GPU-accelerated Classical Graph Algorithms, serving as a pure Rust replacement for
NetworkX / SNAP / igraph-style toolkits. Part of [OxiCUDA](https://github.com/cool-japan/oxicuda) (Vol.59).

(C) 2026 COOLJAPAN OU (Team KitaSan) -- Pure Rust, no C/Fortran, no CUDA SDK, no nvcc.

## Implementation Status

- **Actual SLoC:** 11,913 (98 files, tokei measurement)
- **Total lines (incl. comments+blanks):** 7,043
- **Tests:** 351 passing
- **Vol.59 scope:** Complete classical graph-algorithm coverage (traversal, shortest
  paths, MST, max-flow, matching, connectivity, centrality, community detection,
  graph coloring, TSP, isomorphism). Complements oxicuda-graph (GNN-oriented) by
  providing the foundational discrete-algorithm layer.

### Completed

#### Core Infrastructure
- [x] `error.rs` -- `GraphalgError` enum (`InvalidGraph`, `NegativeWeight`,
  `NegativeCycle`, `SourceOutOfRange`, `DisconnectedGraph`, `NotABipartiteGraph`,
  `NotADag`, `InvalidParameter`, `NumericalInstability`, `UnsupportedSmVersion`,
  `IndexOutOfBounds`, `EmptyInput`, `NotImplemented`, ...) + `GraphalgResult<T>`
- [x] `handle.rs` -- `SmVersion`, `LcgRng` (MMIX LCG, bit-32 boolean, Box-Muller),
  `GraphalgHandle`
- [x] `ptx_kernels.rs` -- 7 kernels x 6 SM versions: `bfs_level`, `dijkstra_relax`,
  `pagerank_step`, `fw_inner`, `triangle_count`, `csr_spmv_bool`, `community_label`
  (string-concatenation PTX, no nvcc)

#### Graph Representations
- [x] `repr/adjacency_list.rs` -- Vec-of-Vec adjacency lists
- [x] `repr/adjacency_matrix.rs` -- Dense 0/1 adjacency matrix
- [x] `repr/edge_list.rs` -- Flat list of (u, v[, w]) edges
- [x] `repr/csr_graph.rs` -- Compressed Sparse Row representation for GPU SpMV
- [x] `repr/weighted_graph.rs` -- Weighted-edge wrappers and conversions

#### Traversal
- [x] `traversal/bfs.rs` -- BFS with queue and visited set, distance and parent vectors
- [x] `traversal/dfs.rs` -- DFS recursive + iterative (explicit stack) with pre/post order
- [x] `traversal/iddfs.rs` -- Iterative Deepening DFS (memory-bounded search)
- [x] `traversal/bidirectional_bfs.rs` -- Meet-in-the-middle bidirectional BFS

#### Topological Sort
- [x] `topological/kahn.rs` -- In-degree queue-based Kahn's algorithm with cycle detection
- [x] `topological/dfs_topo.rs` -- DFS post-order reverse with cycle detection

#### Shortest Path
- [x] `shortest_path/dijkstra.rs` -- Binary-heap Dijkstra (rejects negative weights)
- [x] `shortest_path/bellman_ford.rs` -- O(VE) Bellman-Ford with negative-cycle detection
- [x] `shortest_path/spfa.rs` -- Queue-based SPFA (Shortest Path Faster Algorithm)
- [x] `shortest_path/floyd_warshall.rs` -- O(V^3) all-pairs Floyd-Warshall
- [x] `shortest_path/johnson.rs` -- Johnson reweighting (Bellman-Ford + Dijkstra-per-source)
- [x] `shortest_path/a_star.rs` -- A* with admissible heuristic, falls back to Dijkstra
- [x] `shortest_path/yen_k_shortest.rs` -- Yen's K-shortest paths via deviation method
- [x] `shortest_path/bidijkstra.rs` -- Bidirectional Dijkstra (meet-in-the-middle)
- [x] `shortest_path/transitive_closure.rs` -- Boolean Floyd-Warshall O(V^3) +
  BFS-per-source O(V(V+E)) reachability with optional reflexive diagonal
- [x] `shortest_path/transitive_reduction.rs` -- Unique DAG transitive reduction
  (covering relation) via closure + cycle check (`NotADag` on cyclic input)

#### Minimum Spanning Tree
- [x] `mst/prim.rs` -- Prim's algorithm with priority queue from a source
- [x] `mst/kruskal.rs` -- Kruskal with sorted edges and Union-Find
- [x] `mst/boruvka.rs` -- Boruvka's contraction algorithm
- [x] `mst/union_find.rs` -- Union-Find with union-by-rank and path compression

#### Maximum Flow / Min Cut
- [x] `max_flow/edmonds_karp.rs` -- BFS-augmenting Edmonds-Karp O(VE^2)
- [x] `max_flow/dinic.rs` -- Dinic level-graph + blocking-flow DFS O(V^2 E)
- [x] `max_flow/push_relabel.rs` -- Goldberg-Tarjan push-relabel
- [x] `max_flow/min_cut.rs` -- Min-cut via reachability in residual graph

#### Matching
- [x] `matching/hopcroft_karp.rs` -- Bipartite matching O(E sqrt(V))
- [x] `matching/hungarian_munkres.rs` -- Kuhn-Munkres assignment O(n^3)
- [x] `matching/blossom_v_simple.rs` -- Simplified Blossom for small general graphs

#### Connectivity
- [x] `connectivity/scc_tarjan.rs` -- Tarjan low-link strongly-connected components
- [x] `connectivity/scc_kosaraju.rs` -- Kosaraju two-pass DFS SCC
- [x] `connectivity/scc_gabow.rs` -- Gabow two-stack SCC
- [x] `connectivity/bridges_tarjan.rs` -- Bridges via low-link
- [x] `connectivity/articulation_points.rs` -- Articulation points (cut vertices)
- [x] `connectivity/biconnected.rs` -- Biconnected components via edge stack
- [x] `connectivity/k_core.rs` -- k-core decomposition via Batagelj-Zaversnik
  O(V+E) bucket peeling: core numbers, degeneracy, degeneracy ordering, k-core
  subgraph extraction (undirected; symmetric-adjacency on directed input)

#### Centrality
- [x] `centrality/degree_centrality.rs` -- Indegree / outdegree / undirected degree
- [x] `centrality/betweenness_brandes.rs` -- Brandes O(VE) BFS + dependency backprop
- [x] `centrality/closeness.rs` -- `1 / sum d(v, u)` closeness
- [x] `centrality/eigenvector.rs` -- Eigenvector centrality via power iteration
- [x] `centrality/pagerank.rs` -- Power-iteration PageRank with damping factor
- [x] `centrality/katz.rs` -- Katz centrality `(I - alpha A)^{-1} beta`

#### Community Detection
- [x] `community/louvain.rs` -- Blondel two-phase modularity optimisation
- [x] `community/label_propagation.rs` -- Iterative most-frequent-neighbour labels
- [x] `community/girvan_newman.rs` -- Edge-betweenness divisive clustering

#### Specialised Algorithms
- [x] `arborescence/chu_liu_edmonds.rs` -- Minimum spanning arborescence for digraphs
- [x] `isomorphism/vf2.rs` -- VF2 state-space search with backtracking + feasibility
- [x] `coloring/greedy_coloring.rs` -- Greedy graph coloring by vertex index
- [x] `coloring/dsatur.rs` -- DSATUR (degree of saturation)
- [x] `coloring/welsh_powell.rs` -- Welsh-Powell degree-descending
- [x] `tsp/christofides_approx.rs` -- Christofides 1.5-approximation (MST + odd-degree
  perfect matching + Eulerian circuit)
- [x] `tsp/nearest_neighbor.rs` -- Greedy nearest-neighbour TSP heuristic
- [x] `tsp/two_opt.rs` -- 2-opt local search
- [x] `eulerian/hierholzer.rs` -- O(E) Hierholzer Eulerian circuit
- [x] `hamiltonian/held_karp_dp.rs` -- Held-Karp O(n^2 2^n) exact TSP DP

#### Diagnostics & Tests
- [x] `metrics/metrics.rs` -- diameter, radius, density, global clustering coefficient
  `3 * triangles / triplets`, transitivity
- [x] `e2e_tests.rs` -- 33 cross-module integration tests (BFS line-graph distances;
  DFS tree visits all; Dijkstra 4-node; Bellman-Ford negative-cycle detection;
  Floyd-Warshall = Dijkstra-all-pairs on positive; Prim = Kruskal weight; Edmonds-Karp
  = min-cut; Tarjan SCC on `[0->1,1->2,2->0,3->1]` = `{0,1,2},{3}`; PageRank sums to 1;
  Louvain modularity >= 0; A* zero-heuristic = Dijkstra; VF2 K_3 = K_3; Hopcroft-Karp
  3x3 perfect; Hungarian = brute-force on 4x4; triangle count K_4 = 4; PTX x 6 SM;
  transitive-reduction preserves closure; k-core degeneracy; closure = BFS reach)
- [x] `benches/graphalg_ops.rs` -- Criterion: 7 PTX kernels x all SM versions plus
  Dijkstra / BFS / PageRank / Floyd-Warshall / Edmonds-Karp / Louvain algo benches

### Future Enhancements

#### P0 -- Verification Gaps
- [ ] GPU hardware verification on Linux + NVIDIA driver 525+ for all 7 PTX kernels
  across SM 75 / 80 / 86 / 89 / 90 / 100
- [ ] Reference cross-validation against NetworkX / SNAP outputs for PageRank, Louvain,
  Brandes betweenness, and Hopcroft-Karp on standard SNAP datasets

#### P1 -- Performance Tuning
- [ ] Per-SM tuned thread-block sizes for `bfs_level` and `dijkstra_relax` (currently
  fixed at portable defaults)
- [ ] Frontier-based GPU BFS (direction-optimising push/pull) for large-diameter graphs
- [ ] CSR `csr_spmv_bool` warp-per-row vs scalar dispatch heuristic based on row-length
  distribution
- [ ] PageRank topology-aware partitioning and async pull/push hybrid for sm_90 NVLink

#### P2 -- Algorithmic Extensions
- [x] Weighted general matching (`matching/weighted_general.rs`) — Edmonds/Galil O(n³) primal-dual weighted blossom (Van Rantwijk formulation): vertex+blossom potentials, alternating trees, contraction/expansion; `WeightedGeneralMatching` (+ `solve_max_cardinality`). Validated against subset-DP oracle on random graphs.
- [x] Parametric max-flow (`max_flow/parametric_maxflow.rs`) — Gallo-Grigoriadis-Tarjan / Hochbaum: affine GGT-monotone arc capacities, nested min-cut source sets, divide-and-conquer breakpoint slicing; `ParametricMaxFlow` (`solve_at`/`solve_grid`/`find_breakpoints`).
- [x] Planar graph separator (`separation/planar_separator.rs`) — Lipton-Tarjan BFS-level separator + tree-centroid specialisation with exact subset-sum balancing; `PlanarSeparator` (O(√n) on grids/trees/paths).
- [x] Harmonic centrality (`centrality/harmonic.rs`) — Marchiori-Latora 2000: harmonic mean of inverse distances `Σ 1/d(v,u)` as the harmonic centrality, well-defined for disconnected graphs unlike closeness; `HarmonicCentrality`
- [ ] Multi-GPU partitioned BFS / Dijkstra with edge-cut overlap
- [x] Delta-stepping shortest paths as an alternative to heap-based Dijkstra for GPU
- [x] Streaming dynamic graph updates (add/remove edge) for incremental SCC / PageRank (`dynamic/`): `DynamicGraph` (mutable digraph with reverse index + edge multiplicity), `IncrementalPageRank` (warm-started power iteration resuming from the prior stationary vector → same ranks as a cold solve, fewer iterations), `IncrementalScc` (merge components on cycle-closing inserts via condensation forward/backward reachability; re-split only the affected component on deletes). All validated against from-scratch Tarjan / static PageRank, incl. a randomised 40×60-step update-stream test.
- [ ] Persistent kernel path for `triangle_count` on repeated subgraph queries

## Dependencies

| Dependency | Purpose | Pure Rust? |
|------------|---------|------------|
| oxicuda-driver | CUDA Driver API wrapper (libloading runtime FFI) | Yes |
| oxicuda-memory | Device / host memory management | Yes |
| oxicuda-launch | Type-safe kernel launch | Yes |
| oxicuda-ptx | PTX code generation DSL | Yes |
| thiserror | Error derive macros | Yes |

No external graph libraries: all representations, traversals, and algorithms are
implemented natively.

## Quality Status

- Warnings: 0 (clippy clean, `#![forbid(unsafe_code)]`)
- Tests: 351 passing (unit + 33 e2e cross-module)
- `unwrap()` / `expect()` calls in production code: 0
- Refactoring policy: all files under 2000 lines
- GPU tests behind `#[cfg(feature = "gpu-tests")]`; macOS returns
  `UnsupportedPlatform` at runtime, compiles cleanly

## Performance Targets

Representative graph workloads (graphs are stored CSR on device; runtimes are
aspirational CPU-side dispatch + PTX generation pending Linux + NVIDIA validation).

| Algorithm | Graph Size | Target |
|-----------|------------|--------|
| BFS | |V| = 10^6, |E| = 10^7 (Kronecker / R-MAT) | < 50 ms on sm_80 |
| Dijkstra (binary heap) | |V| = 10^5, |E| = 10^6, positive weights | < 100 ms on sm_80 |
| PageRank (power iter) | |V| = 10^6, |E| = 10^7, 50 iterations | < 200 ms on sm_80 |
| Floyd-Warshall | |V| = 1024 (dense distance matrix) | < 500 ms on sm_80 |
| Edmonds-Karp / Dinic | |V| = 10^3, |E| = 10^4 | < 100 ms on sm_80 |
| Triangle count | |V| = 10^5, |E| = 10^6 | < 50 ms on sm_80 |
| Louvain (1 pass) | |V| = 10^5 | < 1 s on sm_80 |

## Notes

- Vol.59 covers discrete classical graph algorithms only; spectral / GNN / sparse-matrix
  operations live in oxicuda-graph and oxicuda-sparse respectively.
- All algorithms operate over `usize` vertex IDs. Weighted edges are `f64`. Multigraph
  and self-loop handling follows standard textbook conventions documented per module.
- The PTX kernels are intentionally narrow (frontier propagation, edge relaxation,
  rank step, SpMV-bool, label update); higher-level algorithm outer loops execute
  on the host while inner kernels run on device.

## Architecture-Specific Deepening

Tile / thread-block configurations for the 7 PTX kernels by SM version. Per-SM
tuning is currently uniform; targeted tuning is tracked under Future Enhancements P1.

| SM Version | Default Block (`bfs_level`) | Pipeline | Notes |
|------------|------------------------------|----------|-------|
| sm_75 (Turing) | 128 x 1 | 1 stage | baseline scalar |
| sm_80 / sm_86 (Ampere) | 256 x 1 | 2 stages | `cp.async` ready |
| sm_89 (Ada) | 256 x 1 | 2 stages | -- |
| sm_90 (Hopper) | 512 x 1 | 3 stages | TMA candidate for `pagerank_step` |
| sm_100 (Blackwell) | 512 x 1 | 3 stages | -- |

### Deepening Opportunities
- [ ] Ampere: warp-aggregated `bfs_level` frontier compaction via `ballot_sync`
- [ ] Hopper: TMA loads of CSR column indices for `csr_spmv_bool` and `pagerank_step`
- [ ] All SMs: cooperative-group push-relabel for `max_flow` on dense bipartite graphs
- [ ] Ada / Hopper: dynamic parallelism for nested BFS in bidirectional Dijkstra

## Estimation vs Actual

| Metric | Estimated (estimation.md Vol.59) | Actual |
|--------|----------------------------------|--------|
| SLoC | 70K-130K (median ~100K) | 11,913 |
| Files | ~30-50 algorithm modules | 98 |
| Tests | algorithm-grade coverage | 351 |

The gap to the median estimate reflects the estimation targeting full
NetworkX-/SNAP-grade production parity including streaming dynamic graph algorithms,
GPU-resident frontier engines, and end-to-end SNAP-dataset benchmark harnesses.
The current implementation delivers a clean classical-algorithm surface with verified
CPU correctness and PTX generation for all 7 device kernels.