# PC-RL-Core
[CI](https://github.com/BolivarTech/PC-RL-Core/actions/workflows/ci.yml) · [Rust](https://www.rust-lang.org) · [MIT License](LICENSE-MIT)
A **Deliberative Predictive Coding (DPC)** reinforcement learning agent that learns to play Tic-Tac-Toe from scratch, implemented entirely in Rust with zero ML framework dependencies.
The actor **deliberates before acting** by running an iterative free energy minimization loop (predictive coding inference), and a **residual echo of that deliberation** feeds back into weight updates as a structured micro-regularizer. These two mechanisms form a coupled system: deliberation generates the signal, the signal improves learning, and better learning improves future deliberation. The agent trains via REINFORCE with baseline against minimax opponents with curriculum learning.
The core library (`pc-rl-core`) is **backend-agnostic**: all linear algebra operations are abstracted behind a `LinAlg` trait, enabling future GPU backends (wgpu, CUDA) without changing the RL logic.
## Results
The agent reaches **minimax depth 9** (near-perfect play) in **40% of seeds** with a 3-layer [27,27,18] architecture and ultra-low PC error (`local_lambda=0.9999`).
At depth 9, the agent achieves **~99% draws** against a near-perfect minimax opponent -- essentially optimal play for Tic-Tac-Toe.
### Statistical Validation (N=35 seeds, 19 phases, ~3,200 runs)
| Architecture | `local_lambda` | Activation | Residual | Episodes | Mean Depth | D=9 Rate |
|---|---|---|---|---|---|---|
| **[27,27,18]** | **0.9999** | **softsign** | **yes (proj)** | **200k** | **7.69** | **40%** |
| 1×27 | 0.99 | tanh | no | 50k | 7.94 | 37% |
| 1×27 | 0.99 | softsign | no | 50k | 7.89 | 31% |
| [27,27,18] | 0.999 | softsign | yes (proj) | 50k | 7.20 | 20% |
See the [full experiment analysis](docs/experiment_analysis.md) for details across all 19 experimental phases.
### Parameter Efficiency
The PC actor achieves near-optimal play with only **~550-1,000 parameters** -- 3-330x smaller than typical published architectures for the same task (which range from ~2,700 to ~183,000 parameters). The PC inference loop trades compute for parameters: 5 iterative inference steps extract more representational capacity per parameter than a single feedforward pass through a larger network.
## Architecture
```
Input (9) ──> [H1 27, Softsign] ──> [H2 27, Softsign] ──> [H3 18, Softsign] ──> [Output 9, Linear] ──> Softmax ──> Action
                  ^    |              ↕ skip               ↕ skip+proj
                  |    v
             PC Inference Loop (top-down / bottom-up)
                       |
                       v
             Latent Concat (27+27+18 = 72)
                       |
     [Board State (9)] ++ [Latent (72)] = Critic Input (81)
                       |
                       v
             [Critic Hidden 36, Softsign] ──> V(s)
```
All core structs are generic over `L: LinAlg` (default `CpuLinAlg`), enabling future GPU backends. The library is **GA-ready** with CCA-based crossover operators for evolving network populations.
**Predictive Coding Loop**: Instead of a single feedforward pass, the actor runs an iterative inference loop where higher layers generate top-down predictions of lower layer states. The prediction error (surprise) between layers drives hidden state updates. This process converges to a stable internal representation before action selection.
**Curriculum Learning**: The agent starts against a weak opponent (minimax depth 1) and advances when it achieves a >95% non-loss rate over a 1000-game window. Metrics reset on each advancement so that a single strong window cannot cascade the agent through multiple depth levels on stale statistics.
## Project Structure
```
PC-RL-Core/
├── pc-rl-core/ # Reusable RL library (v1.0.0)
│ └── src/
│ ├── linalg/
│ │ ├── mod.rs # LinAlg trait (32 methods, backend-agnostic)
│ │ └── cpu.rs # CpuLinAlg (Vec<f64> + Matrix, Jacobi SVD)
│ ├── activation.rs # Tanh, ReLU, Sigmoid, ELU, Softsign, Linear
│ ├── error.rs # PcError crate-wide error type
│ ├── matrix.rs # Dense matrix, softmax, CCA alignment, Hungarian assignment
│ ├── layer.rs # Layer<L: LinAlg> with PC top-down support
│ ├── pc_actor.rs # PcActor<L> with inference loop, residual, crossover
│ ├── mlp_critic.rs # MlpCritic<L> value function, crossover
│ ├── pc_actor_critic.rs # PcActorCritic<L> agent, ActivationCache, crossover
│ └── serializer.rs # JSON persistence (CPU concrete bridge)
├── pc_tictactoe/ # Game binary
│ ├── config.toml # Training configuration
│ └── src/
│ ├── env/ # TicTacToe + Minimax opponent
│ ├── training/ # Episodic + continuous + experiment runners
│ ├── ui/ # CLI: train, play, evaluate, experiment, seed-test, init
│ └── utils/ # Config, logger, metrics
```
## Quick Start
```bash
# Build
cargo build --release
# Train (uses pc_tictactoe/config.toml)
cargo run --release -- train -c pc_tictactoe/config.toml
# Play against the trained agent
cargo run --release -- play --model model.json
# Play as first player
cargo run --release -- play --model model.json --first
# Evaluate against minimax
cargo run --release -- evaluate --model model.json --games 100 --depth 9
```
## Configuration
All hyperparameters are configured via TOML. See [`pc_tictactoe/config.toml`](pc_tictactoe/config.toml) for the full configuration with the optimal parameters.
Key parameters:
| Parameter | Value | Notes |
|---|---|---|
| `output_activation` | `linear` | Unbounded logits for softmax (tanh prevents learning) |
| `alpha` | `0.03` | PC inference loop update rate |
| `lr_weights` | `0.005` | Actor learning rate |
| `hidden_layers` | `[27,27,18] softsign` | 3-layer with dimensionality reduction |
| `residual` | `true` | Skip connections with ReZero + projection |
| `rezero_init` | `0.1` | ReZero initial scaling factor |
| `gamma` | `0.99` | Discount factor |
| `entropy_coeff` | `0.0` | No entropy regularization |
| `local_lambda` | `0.9999` | Ultra-low PC error for deep networks (0.99 for 1-layer) |
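For orientation, a fragment in the shape of that file might look like the following; the section names here are assumptions for illustration, so defer to the actual [`pc_tictactoe/config.toml`](pc_tictactoe/config.toml) for the authoritative layout:

```toml
# Illustrative fragment only -- section names are assumed, values are
# the optimal parameters from the table above.
[actor]
hidden_layers = [27, 27, 18]
hidden_activation = "softsign"
output_activation = "linear"
residual = true
rezero_init = 0.1
alpha = 0.03
lr_weights = 0.005
local_lambda = 0.9999

[training]
gamma = 0.99
entropy_coeff = 0.0
```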
## Key Findings
- **3-layer [27,27,18] with lambda=0.9999 achieves 40% D=9** -- best configuration, surpasses single-layer (20% D=9)
- **Depth-Lambda Scaling Law: `lambda = 1 - 10^(-(L+1))`** -- PC error must decrease exponentially with network depth
- **Lambda and training budget interact** -- lambda=0.9999 needs 200k episodes (6% D=9 at 50k, 40% at 200k)
- **Deliberation is the primary advantage** -- PC inference loop adds +2-3 depth levels over MLP
- **Softsign + residual + projection cooperate** -- three mechanisms enable gradient flow in deep networks
- **Output activation must be Linear** -- Tanh bounds logits to [-1,1], preventing policy learning
- **Bounded activations required for PC** -- ReLU dies, ELU explodes; tanh and softsign work
- **Backend-agnostic architecture** -- `LinAlg` trait enables CPU/GPU swap with zero logic changes
- **GA-ready crossover** -- CCA neuron alignment solves the permutation problem; Hungarian optimal assignment; supports topology mutation (dimension/layer count changes)
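The depth-lambda scaling law above can be sanity-checked numerically (a throwaway snippet, not project code; `L` is the number of hidden layers):

```rust
// lambda = 1 - 10^(-(L+1)), where L = number of hidden layers.
fn lambda_for_depth(l: u32) -> f64 {
    1.0 - 10f64.powi(-(l as i32 + 1))
}

fn main() {
    // L = 1 hidden layer  -> 0.99   (the single-layer configs above)
    // L = 3 hidden layers -> 0.9999 (the [27,27,18] network)
    println!("{} {}", lambda_for_depth(1), lambda_for_depth(3));
}
```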
Validated through 19 experimental phases and ~3,200 training runs across multiple architectural configurations.
For the complete experimental methodology and statistical analysis, see [docs/experiment_analysis.md](docs/experiment_analysis.md). For the full architecture description, lessons learned, and applicability to other PC projects, see [docs/pc_actor_critic_paper.md](docs/pc_actor_critic_paper.md).
## Dependencies
The `pc-rl-core` library uses only:
- `serde` / `serde_json` -- Serialization
- `rand` -- Random number generation
- `chrono` -- Timestamps
The `pc_tictactoe` binary adds:
- `toml` -- Configuration parsing
- `clap` -- CLI argument parsing
- `ctrlc` -- Graceful shutdown
No PyTorch, TensorFlow, or any ML framework. Pure Rust from scratch.
## Testing
425 unit tests + 13 doctests covering all modules:
```bash
# Run all tests
cargo nextest run --workspace
# Run specific crate
cargo nextest run -p pc-rl-core
cargo nextest run -p pc_tictactoe
# Lint
cargo clippy --workspace --tests -- -D warnings
```
## License
Licensed under either of
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>)
- MIT License ([LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>)
at your option.