1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
//! GPU-optimized WFST data structures and algorithms.
//!
//! This module provides data structures and algorithms optimized for GPU execution,
//! based on techniques from high-performance WFST decoders.
//!
//! ## Overview
//!
//! GPU-accelerated WFST decoding achieves large speedups over single-core CPU
//! by exploiting massive parallelism. Key techniques include:
//!
//! 1. **CSR Representation**: Compressed Sparse Row format for cache-efficient access
//! 2. **Token Recombination**: uint64 packing for atomic operations without precision loss
//! 3. **Dynamic Load Balancing**: Cooperative groups with dispatcher pattern
//! 4. **K-Vector Reduction**: Reduce atomic contention with multiple vectors
//! 5. **Channels/Lanes**: Batched streaming for high throughput
//! 6. **Soft Pruning**: Mark-and-compact instead of immediate deallocation
//!
//! ## Architecture
//!
//! The module provides CPU implementations of all data structures that are compatible
//! with GPU execution patterns. Actual GPU kernels can be added via:
//!
//! - `cudarc` crate for CUDA runtime bindings
//! - `rust-cuda` for native CUDA kernel compilation
//! - `wgpu` for portable compute shaders (WebGPU/Vulkan/Metal/DX12)
//!
//! ## Memory Layout
//!
//! GPU-optimized memory layout follows these principles:
//!
//! - **Coalesced access**: Adjacent threads access adjacent memory
//! - **Minimal state**: Decoder state independent of graph size
//! - **Bounded memory**: Predictable memory usage for batched decoding
//!
//! ```text
//! FST Memory: M_fst = 12|Q| + 8|E| + 4|E_E|
//! Decoder State: M_state = 64α·n_c + 544α·n_l + 1024·n_l
//! ```
//!
//! Where:
//! - |Q| = number of states
//! - |E| = number of transitions
//! - |E_E| = number of emitting transitions
//! - α = max active tokens after pruning
//! - n_c = number of channels
//! - n_l = number of lanes
//!
//! ## Performance (reported in the literature)
//!
//! Braun et al. (2020) report up to a 240× speedup over single-core CPU decoding for the
//! GPU Viterbi exact-lattice decoder these structures model. This crate provides the
//! GPU-ready CPU-side data structures only; that figure is theirs, not an independent benchmark.
//!
//! ## References
//!
//! - Braun et al., "GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline
//! Speech Recognition" (ICASSP 2020, arXiv:1910.10032)
//! - Chen et al., "A GPU-based WFST Decoder with Exact Lattice Generation" (Interspeech 2018)
// CSR representation for memory-efficient WFST storage
pub use ;
// Token recombination with uint64 packing
pub use ;
// Dynamic load balancing
pub use ;
// K-vector atomic reduction
pub use ;
// Channels/Lanes for batched streaming
pub use ;
// Soft pruning
pub use ;