vyre-driver-cuda 0.6.1

# vyre-driver-cuda  -  architecture

CUDA backend. Implements `VyreBackend` against the NVIDIA CUDA
runtime + driver APIs.

## Modules

### `backend.rs` (OFF-LIMITS  -  submodular eviction just shipped)
Backend-trait implementation. Owns the device handle, the stream
pool, and the dispatch hot path. Currently mid-edit; do not
co-edit with this turn's work.

### `binding.rs`
Buffer-binding pass that maps `BufferDecl` records onto CUDA
stream-attached buffer slots.
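As a rough illustration of what a binding pass of this kind does, the sketch below assigns each declaration a dense slot index in declaration order and rejects duplicate names. The `BufferDecl` here is a hypothetical stand-in: the real record's fields are not shown in this document, and `name`, `size_bytes`, and `assign_slots` are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `BufferDecl`; the real fields may differ.
#[derive(Debug, Clone)]
pub struct BufferDecl {
    pub name: String,
    pub size_bytes: usize,
}

/// Assigns each declaration a dense slot index in declaration order.
/// Returns an error if two declarations share a name.
pub fn assign_slots(decls: &[BufferDecl]) -> Result<HashMap<String, usize>, String> {
    let mut slots = HashMap::with_capacity(decls.len());
    for (slot, decl) in decls.iter().enumerate() {
        // `insert` returns the previous value if the key was already present.
        if slots.insert(decl.name.clone(), slot).is_some() {
            return Err(format!("duplicate buffer declaration: {}", decl.name));
        }
    }
    Ok(slots)
}
```

A real pass would also have to account for buffer sizes, alignment, and which stream each slot is attached to; none of that is modelled here.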

### `codegen.rs`
PTX/CUBIN emission. Lowers the Program's typed IR into PTX via
NVRTC (NVIDIA's runtime compilation library, which is separate
from the driver API); caches the resulting `cubin` blob keyed on
the conformance certificate.
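The caching behaviour described above can be sketched as a compile-on-miss map. This is a minimal sketch, not the crate's implementation: `CubinCache`, `CertificateDigest`, and `get_or_compile` are hypothetical names, and it assumes the conformance certificate can be reduced to a fixed-size digest usable as a hash key.

```rust
use std::collections::HashMap;

/// Hypothetical key type: assumes the certificate reduces to a 32-byte digest.
pub type CertificateDigest = [u8; 32];

/// Compile-on-miss cache of compiled blobs (illustrative only).
#[derive(Default)]
pub struct CubinCache {
    entries: HashMap<CertificateDigest, Vec<u8>>,
    /// Number of times the compile closure actually ran.
    pub compiles: usize,
}

impl CubinCache {
    /// Returns the cached blob for `key`, running `compile` only on a miss.
    pub fn get_or_compile<F>(&mut self, key: CertificateDigest, compile: F) -> &[u8]
    where
        F: FnOnce() -> Vec<u8>,
    {
        if !self.entries.contains_key(&key) {
            self.compiles += 1;
            self.entries.insert(key, compile());
        }
        &self.entries[&key]
    }
}
```

Keying on the certificate rather than on the source text means two programs with the same certificate share one blob, which is only sound if the certificate pins down everything that affects code generation.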

### `device.rs`
Device discovery and capability probing (compute capability,
maximum shared memory per block, register file size).

### `pipeline.rs`
Kernel-launch parameter computation: workgroup count (in CUDA
terms, the number of thread blocks in the grid), dynamic shared
memory request, argument packing.
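The workgroup-count and shared-memory parts of this computation can be sketched for the one-dimensional case as follows. `LaunchConfig` and `launch_config` are hypothetical names, and the per-thread shared-memory model is an assumption; the limits used (1024 threads per block, 2^31 - 1 blocks along x) are the documented CUDA limits for current compute capabilities, while the shared-memory ceiling is passed in because it varies by device.

```rust
/// Hypothetical 1-D launch description (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LaunchConfig {
    pub grid_x: u32,
    pub block_x: u32,
    pub dynamic_shared_bytes: u32,
}

/// Computes a 1-D launch covering `elements` items, one thread per item.
/// Returns `None` when there is nothing to launch or a limit would be exceeded.
pub fn launch_config(
    elements: u64,
    block_x: u32,
    shared_bytes_per_thread: u32,
    max_shared_per_block: u32,
) -> Option<LaunchConfig> {
    if elements == 0 || block_x == 0 || block_x > 1024 {
        return None;
    }
    // Dynamic shared memory requested for the whole block.
    let dynamic_shared_bytes = block_x.checked_mul(shared_bytes_per_thread)?;
    if dynamic_shared_bytes > max_shared_per_block {
        return None;
    }
    // Ceiling division without risking overflow on `elements + block - 1`.
    let block = u64::from(block_x);
    let blocks = elements / block + u64::from(elements % block != 0);
    if blocks > 0x7FFF_FFFF {
        return None;
    }
    Some(LaunchConfig {
        grid_x: blocks as u32,
        block_x,
        dynamic_shared_bytes,
    })
}
```

Because the grid is rounded up, the last block may contain threads past the end of the data, so a kernel launched this way needs its own bounds check.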

### `stream.rs`
Pool of per-stream handles; lets the dispatcher overlap
host-to-device (H2D) copies with kernel execution.
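One simple way to hand out streams so that copies and kernels land on different streams is round-robin selection. The sketch below is generic over the handle type so it runs without a GPU; `RoundRobinPool` and `next_stream` are hypothetical names, and the crate's actual `StreamPool` may use a different policy.

```rust
/// Hypothetical round-robin pool over opaque stream handles (illustrative only).
pub struct RoundRobinPool<S> {
    streams: Vec<S>,
    next: usize,
}

impl<S> RoundRobinPool<S> {
    /// Builds a pool; returns `None` if no streams were supplied.
    pub fn new(streams: Vec<S>) -> Option<Self> {
        if streams.is_empty() {
            None
        } else {
            Some(Self { streams, next: 0 })
        }
    }

    /// Returns the next stream in rotation, wrapping at the end.
    pub fn next_stream(&mut self) -> &S {
        let index = self.next;
        self.next = (self.next + 1) % self.streams.len();
        &self.streams[index]
    }
}
```

With at least two streams, consecutive submissions go to different streams, which is what allows a copy queued on one to overlap a kernel running on another, provided the hardware has a free copy engine and the host memory is pinned.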

## Public types

- **`CudaBackend`**  -  backend-trait implementation. Acquired via
  `CudaBackend::acquire()`, which probes for a CUDA-capable
  device.
- **`StreamPool`**  -  internal; not exposed across the trait
  boundary.

## Integration points

- Plugs into `vyre-driver`'s registration via inventory.
- Cooperates with `vyre-runtime megakernel` when the program is a
  megakernel (PTX is emitted with the persistent-loop body the
  scaling layer asked for).