# PRISM-Q Architecture
## Goals
- **Primary**: Fastest practical quantum circuit simulation in Rust.
- Correct simulation of supported gate sets across multiple backend strategies.
- Clean backend plugin model — add new simulation strategies without touching the core.
## Non-goals
- Full OpenQASM 3.0 compliance (supports a practical subset).
- GUI or notebook integration (library-first).
- Hardware backend / QPU connectivity.
## Layered design
```text
┌─────────────────────────────────────┐
│ User / Application │
├─────────────────────────────────────┤
│ run_qasm(&str, seed) │ ← Public API (src/lib.rs)
├─────────────────────────────────────┤
│ OpenQASM 3.0 Parser │ ← src/circuit/openqasm.rs
│ &str → Circuit IR │
├─────────────────────────────────────┤
│ Circuit IR │ ← src/circuit/mod.rs
│ (backend-agnostic instruction │
│ sequence: gates, measures, │
│ barriers, conditionals) │
├─────────────────────────────────────┤
│ Fusion Pipeline │ ← src/circuit/fusion.rs
│ (cancel, fuse, reorder, batch) │
├─────────────────────────────────────┤
│ Simulation Engine │ ← src/sim/mod.rs
│ (dispatch, decompose, execute) │
├──────────┬──────────────────────────┤
│ Backends │ Compiled Samplers │ ← src/sim/compiled/, noise.rs
│ │ (shot-based simulation) │ homological.rs, unified_pauli.rs
├──────┬───┴──┬──────┬──────┬──────┬──────┬──────┤
│ SV │ TN │ MPS │Sparse│ Prod │ Stab │ Fact │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘
```
## Parser
Hand-rolled line-by-line parser targeting a practical OpenQASM 3.0 subset. Converts `&str` directly to `Circuit` IR with no intermediate AST.
**Supported**: `qubit`/`bit` declarations, standard gates (x, y, z, h, s, sdg, t, tdg, rx, ry, rz, cx, cz, swap, sx, sxdg, p, cy, ch, crx, cry, crz, csx, ccx/toffoli, ccz, cswap/fredkin, rzz, rxx, ryy, ecr, iswap, dcx, u1, u2, u3/u), gate modifiers (`ctrl @`, `inv @`, `pow(k) @`), user-defined `gate` blocks, classical `if` conditionals, multi-register broadcast, measure, barrier, expression evaluator with math functions. OpenQASM 2.0 backward compatibility (`qreg`/`creg`, `measure q -> c` syntax).
**Unsupported**: `for`/`while` loops, subroutines, classical expressions beyond `if`.
## Circuit IR
`Circuit` holds `num_qubits`, `num_classical_bits`, and `Vec<Instruction>`. Instructions are an enum:
| `Gate` | `gate`, `targets` | Gate application |
| `Measure` | `qubit`, `classical_bit` | Destructive measurement |
| `Barrier` | `qubits` | Synchronization barrier |
| `Conditional` | `condition`, `gate`, `targets` | Classical-controlled gate |
Targets use `SmallVec<[usize; 4]>` — inline storage for up to 4 qubits, no heap allocation for typical gates.
### Gate enum
`Gate` is a `Clone` enum kept at **16 bytes**. Simple variants carry parameters inline. Composite variants use `Box` to stay within the 16-byte budget for cache-friendly dispatch.
| `Id`, `X`, `Y`, `Z`, `H`, `S`, `Sdg`, `T`, `Tdg`, `SX`, `SXdg` | None | 16B |
| `Rx(f64)`, `Ry(f64)`, `Rz(f64)`, `P(f64)`, `Rzz(f64)` | Inline f64 | 16B |
| `Cx`, `Cz`, `Swap` | None | 16B |
| `Cu(Box<[[Complex64; 2]; 2]>)` | Boxed 2×2 | 16B |
| `Mcu(Box<McuData>)` | Boxed matrix + control count | 16B |
| `Fused(Box<[[Complex64; 2]; 2]>)` | Boxed pre-fused 1q matrix | 16B |
| `Fused2q(Box<[[Complex64; 4]; 4]>)` | Boxed pre-fused 2q matrix | 16B |
| `MultiFused(Box<MultiFusedData>)` | Batched 1q gates for tiled pass | 16B |
| `Multi2q(Box<Multi2qData>)` | Batched 2q gates for tiled pass | 16B |
| `BatchPhase(Box<BatchPhaseData>)` | Batched cphase with shared control | 16B |
| `BatchRzz(Box<BatchRzzData>)` | Batched ZZ rotations | 16B |
| `DiagonalBatch(Box<DiagonalBatchData>)` | Mixed diagonal 1q/2q batch | 16B |
## Fusion pipeline
Thirteen-pass gate optimization before execution, gated by qubit count thresholds. Every pass returns `Cow<Circuit>` — `Borrowed` when no optimization applies, zero overhead for circuits that don't benefit.
```text
Input Circuit
│
├─ pass0: cancel_self_inverse_pairs (always)
├─ pass0r: fuse_rzz (always) CX·Rz·CX → Rzz
├─ pass0b: fuse_batch_rzz (≥16q) N×Rzz → BatchRzz
│
│ ─── MIN_QUBITS_FOR_FUSION (10) gate ───
│
├─ pass1: fuse_single_qubit_gates (≥10q) consecutive 1q → Fused
├─ pass1r: reorder_1q_gates (≥10q) commutation-aware reorder
├─ pass1c: cancel_self_inverse_pairs (≥10q) re-cancel after reorder
├─ pass1f: fuse_single_qubit_gates (≥10q) re-fuse after reorder
│
├─ pass_2q: fuse_2q_gates (≥20q) CX + adjacent 1q → Fused2q
├─ pass2: fuse_multi_1q_gates (≥14q) 1q batch → MultiFused
├─ pass_m2q: fuse_multi_2q_gates (≥20q) 2q batch → Multi2q
├─ pass_cp: fuse_controlled_phases (≥16q) cphase batch → BatchPhase
├─ pass_db: fuse_diagonal_batch (≥16q) mixed diagonal → DiagonalBatch
└─ pass_pp: batch_post_phase_1q (≥18q) re-batch 1q after cphase
│
Output Circuit
```
**Threshold constants:**
| `MIN_QUBITS_FOR_FUSION` | 10 | Below this, clone cost exceeds simulation savings |
| `MIN_QUBITS_FOR_MULTI_FUSION` | 14 | MultiFused tiling overhead vs benefit |
| `MIN_QUBITS_FOR_DIAG_BATCH` | 16 | Diagonal batch, cphase, and Rzz batching |
| `MIN_QUBITS_FOR_POST_PHASE_BATCH` | 18 | Post-phase 1q re-batching |
| `MIN_QUBITS_FOR_2Q_FUSION` | 20 | Generic 4×4 kernel only beats specialized at DRAM-bound sizes |
| `MIN_QUBITS_FOR_MULTI_2Q_FUSION` | 20 | Same as 2q fusion |
## Backend trait
```rust
pub trait Backend {
fn name(&self) -> &'static str;
fn init(&mut self, num_qubits: usize, num_classical_bits: usize) -> Result<()>;
fn apply(&mut self, instruction: &Instruction) -> Result<()>;
fn classical_results(&self) -> &[bool];
fn probabilities(&self) -> Result<Vec<f64>>;
fn num_qubits(&self) -> usize;
// Optional overrides:
fn apply_instructions(&mut self, instructions: &[Instruction]) -> Result<()>; // batch apply
fn supports_fused_gates(&self) -> bool; // false for symbolic backends (stabilizer)
fn export_statevector(&self) -> Result<Vec<Complex64>>; // for backend transitions
}
```
Contract: `init` before `apply`. Instructions arrive in circuit order. Measurement is destructive. Deterministic given same RNG seed.
## Simulation engine
Orchestration layer in `src/sim/mod.rs`. Entry points:
| `run(circuit, seed)` | Auto-dispatch, full output |
| `run_with(kind, circuit, seed)` | Explicit backend selection |
| `run_on(backend, circuit)` | Pre-constructed backend |
| `run_qasm(qasm, seed)` | Parse + simulate |
| `run_shots(circuit, shots, seed)` | Multi-shot sampling |
| `run_shots_with(kind, circuit, shots, seed)` | Multi-shot with backend selection |
| `run_shots_with_noise(kind, circuit, noise, shots, seed)` | Noisy multi-shot |
| `run_counts(kind, circuit, shots, seed)` | Frequency histogram |
| `run_marginals(kind, circuit, seed)` | Per-qubit marginal probabilities |
### Auto-dispatch decision tree
```text
BackendKind::Auto
├─ No entangling gates → ProductState (O(n))
├─ All Clifford → Stabilizer (O(n²))
├─ Above memory limit:
│ ├─ Sparse-friendly → Sparse (O(k))
│ └─ Otherwise → MPS (bond dim 256)
├─ Partial independence → Factored (dynamic split-state)
└─ Otherwise → Statevector (exact)
```
Memory limit is dynamically computed from available system RAM (50% budget, capped at 33 qubits). Overridable via `PRISM_MAX_SV_QUBITS` environment variable. Falls back to 28 qubits (4 GB) when detection unavailable.
### Subsystem decomposition
Union-find detects independent qubit groups in O(n·α(n)). Each block runs separately (with per-block Auto dispatch), results merge lazily via `Probabilities::Factored` — a Kronecker product computed on-demand per-element in O(K), avoiding the O(2^N) dense materialization unless explicitly requested.
Block-level Rayon parallelism when all blocks are <14 qubits (avoids oversubscription with block-internal parallelism).
### Temporal Clifford decomposition
For Clifford+T circuits: Clifford prefix runs on the Stabilizer backend, state is exported to Statevector for the non-Clifford tail. Saves exponential memory for circuits with a long Clifford preamble.
## Backends
### Statevector
Full-state simulation in a flat `Vec<Complex64>` of 2^n amplitudes. The primary backend for circuits up to ~28 qubits.
Gate kernels use enum dispatch with specialized routines for CX, CZ, SWAP, Cu, MCU, Rzz, BatchRzz, BatchPhase, DiagonalBatch, and MultiFused. Single-qubit gates go through `PreparedGate1q` with FMA-vectorized SIMD. MultiFused gates use a three-tier tiled kernel (L2 16K / L3 131K / individual passes) for cache locality. MultiFused batches where all gates are diagonal dispatch to a dedicated fast path (1 complex multiply/element vs 4+2 for full 2×2).
Rayon parallelism at ≥14 qubits with `par_chunks_mut` and `MIN_PAR_ELEMS = 4096` per task. BMI2 `_pext_u64` accelerates BatchPhase, BatchRzz, and DiagonalBatch LUT indexing.
Deferred measurement normalization: `pending_norm` accumulates normalization factors without full-state scaling passes. Zero-cost for circuits without measurements.
### Stabilizer
Aaronson-Gottesman bit-packed tableau for Clifford circuits. O(n²) time and space — scales to thousands of qubits. Gate kernels use wordwise bitwise ops and `popcount` for phase computation. Supports H, S, Sdg, SX, SXdg, CX, CZ, SWAP, and measurement.
Word-group batching fuses multiple 1q gate flushes into single tableau passes. Type-grouped masks apply all gates of the same Pauli type with one wordwise op instead of per-gate dispatch. Sparse Generator Indexing (SGI) tracks per-qubit active generator lists, enabling targeted row operations instead of full-tableau scans. Lazy destabilizer materialization defers destabilizer rows until probabilities are requested.
Probability extraction uses coset-based enumeration with GF(2) Gaussian elimination — O(2^k) where k is the number of non-diagonal generators, rather than O(2^n).
**Filtered Stabilizer** (`FilteredStabilizerBackend`): Per-cluster tableaux with dynamic merging. Starts with one qubit per cluster. Cross-cluster 2q gates merge tableaux. Independent subsystems never merge, giving O(block_size²) per gate vs O(n²).
### Sparse
### MPS (Matrix Product State)
Chain of rank-3 tensors with adaptive bond dimension (default max 256). O(n·χ²) memory. Single-qubit gates absorb via FMA-vectorized SIMD over bond-dimension slices. Two-qubit gates contract adjacent sites, apply the gate, then SVD-truncate back. Non-adjacent gates route through SWAP chains.
Hybrid SVD dispatch: faer (bidiag+D&C) for matrices with m×n ≥ 256, hand-rolled Jacobi for small matrices.
### Product State
Per-qubit `[Complex64; 2]` storage. O(n) memory, O(1) per single-qubit gate. Rejects entangling gates. Selected automatically for circuits with no 2q gates.
### Tensor Network
Deferred contraction with a greedy min-size heuristic. Gates append tensors; contraction happens lazily at measurement or probability extraction. `MAX_PROB_QUBITS = 25` guards against exponential blowup.
### Factored
Dynamic split-state simulation. Starts with n independent 1-qubit states, merges via tensor product only when 2q gates bridge groups. Parallel kernels match statevector patterns for sub-states ≥14 qubits. Selected when subsystem decomposition detects partial independence.
## Compiled samplers
For multi-shot sampling without materializing the full statevector on every shot.
### Noiseless compiled sampler (`src/sim/compiled/`)
**Backward path** (`compile_measurements`): Propagates Pauli Z observables backward through the circuit. Each measurement qubit becomes a row in a GF(2) parity matrix M. Clifford gates conjugate Pauli strings in O(1). The resulting M encodes which input qubits each measurement depends on.
**Forward path** (`compile_forward`): Tracks stabilizer generator dependencies forward through the circuit. Produces the same parity matrix via dependency tracking.
**Sampling**: Random bits for independent generators, then XOR-cascade through the parity matrix. Multiple dispatch tiers:
| `FlipLut` | Small rank | 256-entry XOR lookup table |
| `SparseParity` | Sparse rows | Only flip non-zero columns |
| `XorDag` | General | Optimal XOR-reduction DAG |
| `ParityBlocks` | Blocked structure | Per-block independent sampling |
**ShotAccumulator trait**: Pluggable result collection.
| `HistogramAccumulator` | Bitstring → count map | Standard shot output |
| `MarginalsAccumulator` | Per-qubit P(1) | Marginal probabilities |
| `PauliExpectationAccumulator` | ⟨P⟩ for Pauli observables | VQE/QAOA |
| `CorrelatorAccumulator` | ⟨Z_i Z_j⟩ correlations | Entanglement analysis |
| `NullAccumulator` | Nothing | Benchmarking raw sampling speed |
### Noisy compiled sampler (`src/sim/noise.rs`)
Backward Pauli propagation through circuit + noise sensitivity analysis. Each noise location gets an X-flip and Z-flip sensitivity row. During sampling, Bernoulli coin flips determine which noise channels fire, then XOR the sensitivity rows into the sample.
`NoiseModel`: Per-instruction depolarizing noise. `NoiseOp { qubit, px, py, pz }`.
### Homological sampler (`src/sim/homological.rs`)
`ErrorChainComplex`: GF(2) chain complex over the circuit's noise locations. Computes the kernel (null space) of the boundary map to identify error cycles that are undetectable by syndrome measurements. `HomologicalSampler` uses this for sampling with topological error correction awareness.
`noisy_marginals_analytical`: Closed-form marginal computation using the parity matrix and noise rates — avoids Monte Carlo sampling entirely.
## Clifford+T simulation
Three strategies for circuits mixing Clifford and T gates:
### Stabilizer rank (`src/sim/stabilizer_rank.rs`)
Maintains a weighted sum of stabilizer states. Each T gate doubles the term count via T = α·I + β·Z decomposition. Clifford gates are O(n²) per term. Accumulates weighted amplitudes for exact probabilities.
- `run_stabilizer_rank`: Exact probabilities (t ≤ 20, n ≤ 25)
- `run_stabilizer_rank_approx`: Approximate with Monte Carlo (higher t counts)
- `run_stabilizer_rank_shots`: Shot-based sampling
- `stabilizer_overlap_sq`: Inner product between stabilizer states
### Stochastic Pauli Propagation (`src/sim/unified_pauli.rs`)
Backward-propagates measurement observables as Pauli strings. Clifford gates conjugate in O(1). T gates branch stochastically into two Pauli paths with appropriate weights. Per-path cost O(d×n/64), independent of T-gate count. Returns marginal probabilities via Monte Carlo estimation.
`run_spp(circuit, num_samples, seed) → SppResult`
### Deterministic Sparse Pauli Dynamics (`src/sim/unified_pauli.rs`)
Backward-propagates as a weighted sum of Pauli strings stored in a HashMap. T gates deterministically branch X/Y terms. Identical strings auto-merge. Optional ε-truncation for approximate mode. Exact for small T-counts, approximate with bounded error for larger ones.
`run_spd(circuit, epsilon, max_terms) → SpdResult`
## Memory layout
| Statevector | `Vec<Complex64>` (2^n) | O(2^n) | Strided pair iteration |
| Stabilizer | Bit-packed `Vec<u64>` tableau | O(n²/8) bytes | Sequential row iteration |
| Sparse | `HashMap<usize, Complex64>` | O(k), k = nonzero | Hash-based random access |
| MPS | Chain of rank-3 tensors | O(n·χ²) | Sequential site access |
| Product | `Vec<[Complex64; 2]>` | O(n) | Per-qubit independent |
| Tensor Network | Network of dense tensors | O(gates × local dim) | Contraction-order dependent |
| Factored | `Vec<Option<SubState>>` | O(2^n) worst case | Per-sub-state dispatch |
## Threading and SIMD
### Threading
Gate kernels have `_par` variants using `par_chunks_mut` for safe Rayon parallelism (behind the `parallel` feature flag):
- **<14 qubits**: Single-threaded. Thread-pool overhead exceeds computation.
- **≥14 qubits**: Rayon parallel iterators with `MIN_PAR_ELEMS = 4096` (64KB per task).
Thread pool defaults to all logical cores (HT helps at 24q+ by hiding memory latency). Overridable via `RAYON_NUM_THREADS`.
### SIMD
`Complex64` maps to 128-bit SIMD naturally. Single-qubit gate kernels use `PreparedGate1q` with runtime CPU detection and tiered dispatch:
1. **AVX2+FMA** (256-bit): 2 complex pairs per iteration. Gated by `MAX_AVX2_STATE` for full-state passes (Skylake frequency throttling), but used freely within MultiFused L2 tiles where data is cache-resident.
2. **FMA** (128-bit): Default for larger states. 3-op complex multiply (permute + mul + fmaddsub).
3. **BMI2**: `_pext_u64` for BatchPhase, BatchRzz, and DiagonalBatch LUT indexing. Single-instruction bit extraction replaces multi-iteration shift-and-or loops.
4. **Scalar fallback**: No intrinsics. All SIMD functions have a `#[cfg(not(target_arch = "x86_64"))]` fallback.
Two key SIMD structs hoist matrix broadcast at construction time, avoiding per-element dispatch:
- **`PreparedGate1q`**: Broadcasts 2×2 matrix into SIMD registers. Methods: `apply_full_sequential` (full state), `apply_tiled` (cache-resident tile, no AVX2 throttle guard), `apply_slice_pairs` (MPS bond-dimension slices), `apply_pair_ptr` (Cu/Mcu parallel).
- **`PreparedGate2q`**: Broadcasts 4×4 matrix. Methods: `apply_full` (mask-based iteration), `apply_group_ptr` (4 scattered indices).
### Determinism
Same circuit + same seed = same result, regardless of thread count. Parallel backends use deterministic work partitioning.
## Backend dispatch
All `BackendKind` variants:
| `Auto` | Decision tree (see above) | Default |
| `Statevector` | Full state-vector | Explicit |
| `Stabilizer` | Aaronson-Gottesman tableau | Explicit or auto (all Clifford) |
| `FilteredStabilizer` | Per-cluster tableaux | Explicit |
| `Sparse` | HashMap state | Explicit or auto (above memory limit, sparse-friendly) |
| `Mps { max_bond_dim }` | Matrix Product State | Explicit or auto (above memory limit) |
| `ProductState` | Per-qubit product | Explicit or auto (no entangling) |
| `TensorNetwork` | Deferred contraction | Explicit |
| `Factored` | Dynamic split-state | Explicit or auto (partial independence) |
| `StabilizerRank` | Weighted stabilizer sum | Explicit |
| `StochasticPauli { num_samples }` | SPP | Explicit |
| `DeterministicPauli { epsilon, max_terms }` | SPD | Explicit |
## Circuit builders
Pre-built circuits for benchmarking and testing (`src/circuits.rs`):
| `qft_circuit(n)` | Quantum Fourier Transform |
| `random_circuit(n, depth, seed)` | Random gates at given depth |
| `hardware_efficient_ansatz(n, layers, seed)` | HEA with Ry/Rz + CX |
| `clifford_heavy_circuit(n, depth, seed)` | Random Clifford (adjacent CX) |
| `clifford_random_pairs(n, depth, seed)` | Random Clifford (random pair CX) |
| `ghz_circuit(n)` | GHZ state (H + CX chain) |
| `qaoa_circuit(n, layers, seed)` | QAOA MaxCut |
| `single_qubit_rotation_circuit(n, depth, seed)` | 1q rotations only |
| `clifford_t_circuit(n, depth, t_fraction, seed)` | Clifford+T with tunable T ratio |
| `w_state_circuit(n)` | W state preparation |
| `quantum_volume_circuit(n, depth, seed)` | Quantum volume (random SU(4)) |
| `cz_chain_circuit(n, depth, seed)` | CZ chains |
| `phase_estimation_circuit(n)` | Quantum phase estimation |
| `independent_bell_pairs(n_pairs)` | Independent Bell pairs |
| `independent_random_blocks(blocks, size, depth, seed)` | Independent random blocks |
## Error model
All public APIs return `Result<T, PrismError>`. Error variants:
| `Parse` | Parsing | OpenQASM parse error with line number |
| `UnsupportedConstruct` | Parsing | Valid OpenQASM not supported by PRISM-Q |
| `UndefinedRegister` | Parsing | Reference to undeclared register |
| `InvalidQubit` | Validation | Qubit index exceeds register size |
| `InvalidClassicalBit` | Validation | Classical bit index exceeds register |
| `GateArity` | Validation | Wrong number of qubits for gate |
| `InvalidParameter` | Validation | Invalid gate parameter (NaN, etc.) |
| `BackendUnsupported` | Runtime | Backend can't perform requested operation |
| `IncompatibleBackend` | Runtime | Backend incompatible with circuit |
No panics on user input. `debug_assert!` for internal invariants only.
## Public API surface
Top-level re-exports from `src/lib.rs`:
**Simulation:**
`run`, `run_with`, `run_on`, `run_qasm`, `run_shots`, `run_shots_with`, `run_shots_with_noise`, `run_counts`, `run_marginals`, `bitstring`
**Compiled sampling:**
`compile_measurements`, `compile_forward`, `compile_noisy`, `run_shots_compiled`, `run_shots_noisy`, `run_shots_homological`, `noisy_marginals_analytical`
**Clifford+T:**
`run_stabilizer_rank`, `run_stabilizer_rank_approx`, `stabilizer_overlap_sq`, `run_spp`, `run_spd`, `spp_to_probabilities`, `spd_to_probabilities`
**Types:**
`Circuit`, `CircuitBuilder`, `Instruction`, `ClassicalCondition`, `Gate`, `BackendKind`, `SimulationResult`, `Probabilities`, `FactoredBlock`, `ShotsResult`, `PrismError`, `Result`
**Backends:**
`StatevectorBackend`, `StabilizerBackend`, `SparseBackend`, `MpsBackend`, `ProductStateBackend`, `TensorNetworkBackend`, `FactoredBackend`
**Accumulators:**
`ShotAccumulator`, `HistogramAccumulator`, `MarginalsAccumulator`, `PauliExpectationAccumulator`, `CorrelatorAccumulator`, `NullAccumulator`, `PackedShots`, `ShotLayout`
**Data types:**
`CompiledSampler`, `NoisyCompiledSampler`, `HomologicalSampler`, `ErrorChainComplex`, `NoiseModel`, `NoiseOp`, `StabRankResult`, `SppResult`, `SpdResult`, `SparseParity`, `ParityStats`, `PauliVec`, `MultiFusedData`, `BatchPhaseData`, `McuData`, `Multi2qData`