# PC-RL-Core
A Deliberative Predictive Coding (DPC) reinforcement learning agent that learns to play Tic-Tac-Toe from scratch, implemented entirely in Rust with zero ML framework dependencies.
The actor deliberates before acting by running an iterative free energy minimization loop (predictive coding inference), and a residual echo of that deliberation feeds back into weight updates as a structured micro-regularizer. These two mechanisms form a coupled system: deliberation generates the signal, the signal improves learning, and better learning improves future deliberation. The agent trains via REINFORCE with baseline against minimax opponents with curriculum learning.
The core library (pc-rl-core) is backend-agnostic: all linear algebra operations are abstracted behind a LinAlg trait, enabling future GPU backends (wgpu, CUDA) without changing the RL logic.
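To illustrate the pattern, here is a minimal sketch of backend abstraction via a linear-algebra trait. The real `LinAlg` trait has 32 methods; the two methods, their signatures, and the `forward` helper below are illustrative assumptions, not the crate's actual API.

```rust
// Hypothetical two-method slice of a LinAlg-style trait: RL code is
// written once, generic over the backend, and a GPU backend would only
// need to implement the same trait.
trait LinAlg {
    fn matvec(w: &[Vec<f64>], x: &[f64]) -> Vec<f64>;
    fn axpy(alpha: f64, x: &[f64], y: &mut [f64]);
}

struct CpuLinAlg;

impl LinAlg for CpuLinAlg {
    fn matvec(w: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
        // Dense matrix-vector product, row by row.
        w.iter()
            .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
            .collect()
    }
    fn axpy(alpha: f64, x: &[f64], y: &mut [f64]) {
        // y <- y + alpha * x
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * xi;
        }
    }
}

/// RL logic stays generic: swapping CPU for GPU changes only the type parameter.
fn forward<L: LinAlg>(w: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    L::matvec(w, x)
}

fn main() {
    let w = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let y = forward::<CpuLinAlg>(&w, &[1.0, 1.0]);
    assert_eq!(y, vec![3.0, 7.0]);
    println!("forward output: {y:?}");
}
```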
## Results
The agent reaches minimax depth 9 (near-perfect play) in 40% of seeds with a 3-layer [27,27,18] architecture and ultra-low PC error (`local_lambda = 0.9999`).
At depth 9, the agent achieves ~99% draws against a near-perfect minimax opponent -- essentially optimal play for Tic-Tac-Toe.
### Statistical Validation (N=35 seeds, 19 phases, ~3,200 runs)
| Topology | Lambda | Activation | Residual | Episodes | Mean | D=9 |
|---|---|---|---|---|---|---|
| [27,27,18] | 0.9999 | softsign | yes (proj) | 200k | 7.69 | 40% |
| 1×27 | 0.99 | tanh | no | 50k | 7.94 | 37% |
| 1×27 | 0.99 | softsign | no | 50k | 7.89 | 31% |
| [27,27,18] | 0.999 | softsign | yes (proj) | 50k | 7.20 | 20% |
See the full experiment analysis for details across all 19 experimental phases.
### Parameter Efficiency
The PC actor achieves near-optimal play with only ~550-1,000 parameters -- 3-330x smaller than typical published architectures for the same task (which range from ~2,700 to ~183,000 parameters). The PC inference loop trades compute for parameters: 5 iterative inference steps extract more representational capacity per parameter than a single feedforward pass through a larger network.
## Architecture
```
Input (9) ──> [H1 27, Softsign] ──> [H2 27, Softsign] ──> [H3 18, Softsign] ──> [Output 9, Linear] ──> Softmax ──> Action
                     ↕ skip              ↕ skip+proj
                     │
          PC Inference Loop (top-down / bottom-up)
                     │
                     ▼
          Latent Concat (27 + 27 + 18 = 72)
                     │
                     ▼
[Board State (9)] ++ [Latent (72)] = Critic Input (81)
                     │
                     ▼
          [Critic Hidden 36, Softsign] ──> V(s)
```
All core structs are generic over L: LinAlg (default CpuLinAlg), enabling future GPU backends. The library is GA-ready with CCA-based crossover operators for evolving network populations.
**Predictive Coding Loop:** Instead of a single feedforward pass, the actor runs an iterative inference loop in which higher layers generate top-down predictions of lower-layer states. The prediction error (surprise) between layers drives hidden state updates. This process converges to a stable internal representation before action selection.
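The update step can be sketched as follows. This is a deliberately simplified, single-layer illustration of the inference rule described above, not the crate's `PcActor` implementation: the struct and field names are made up, and weight matrices are omitted so the loop stays self-contained.

```rust
// Illustrative single-layer PC inference state (hypothetical names).
struct InferenceState {
    hidden: Vec<f64>,   // current latent estimate for this layer
    top_down: Vec<f64>, // prediction of this layer from the layer above
}

/// One inference iteration: nudge each hidden unit toward the top-down
/// prediction in proportion to the prediction error (surprise).
/// Returns the squared error as a free-energy proxy.
fn pc_step(state: &mut InferenceState, alpha: f64) -> f64 {
    let mut total_err = 0.0;
    for (h, p) in state.hidden.iter_mut().zip(&state.top_down) {
        let err = *p - *h;  // prediction error between layers
        *h += alpha * err;  // local gradient step driven by the error
        total_err += err * err;
    }
    total_err
}

fn main() {
    let mut s = InferenceState {
        hidden: vec![0.0; 4],
        top_down: vec![1.0; 4],
    };
    // Repeated steps shrink the prediction error, converging toward
    // a stable representation before the action is selected.
    for i in 0..5 {
        let e = pc_step(&mut s, 0.03);
        println!("step {i}: free-energy proxy = {e:.4}");
    }
}
```

Each call reduces the residual error, which is the convergence behavior the inference loop relies on.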
**Curriculum Learning:** The agent starts against a weak opponent (minimax depth 1) and advances when it achieves a >95% non-loss rate over a 1000-game window. Metrics reset on each advancement to prevent cascading.
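A minimal sketch of that advancement rule, with the window size and threshold taken from the text; the `Curriculum` struct and `record` method are illustrative names, not the project's training-runner API.

```rust
use std::collections::VecDeque;

// Hypothetical curriculum tracker: advances the minimax opponent depth
// when >95% of the last 1000 games are non-losses.
struct Curriculum {
    depth: u32,             // current minimax opponent depth (1..=9)
    window: VecDeque<bool>, // non-loss flags for the most recent games
}

impl Curriculum {
    fn record(&mut self, non_loss: bool) {
        self.window.push_back(non_loss);
        if self.window.len() > 1000 {
            self.window.pop_front(); // keep a sliding 1000-game window
        }
        if self.window.len() == 1000 {
            let rate =
                self.window.iter().filter(|&&x| x).count() as f64 / 1000.0;
            if rate > 0.95 && self.depth < 9 {
                self.depth += 1;
                // Metrics reset on advancement so earlier results
                // cannot cascade into the next stage.
                self.window.clear();
            }
        }
    }
}

fn main() {
    let mut c = Curriculum { depth: 1, window: VecDeque::new() };
    for _ in 0..1000 {
        c.record(true); // 1000 straight non-losses
    }
    assert_eq!(c.depth, 2);
    assert!(c.window.is_empty());
    println!("advanced to depth {}", c.depth);
}
```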
## Project Structure
```
PC-RL-Core/
├── pc-rl-core/              # Reusable RL library (v1.0.0)
│   └── src/
│       ├── linalg/
│       │   ├── mod.rs       # LinAlg trait (32 methods, backend-agnostic)
│       │   └── cpu.rs       # CpuLinAlg (Vec<f64> + Matrix, Jacobi SVD)
│       ├── activation.rs    # Tanh, ReLU, Sigmoid, ELU, Softsign, Linear
│       ├── error.rs         # PcError crate-wide error type
│       ├── matrix.rs        # Dense matrix, softmax, CCA alignment, Hungarian assignment
│       ├── layer.rs         # Layer<L: LinAlg> with PC top-down support
│       ├── pc_actor.rs      # PcActor<L> with inference loop, residual, crossover
│       ├── mlp_critic.rs    # MlpCritic<L> value function, crossover
│       ├── pc_actor_critic.rs # PcActorCritic<L> agent, ActivationCache, crossover
│       └── serializer.rs    # JSON persistence (CPU concrete bridge)
├── pc_tictactoe/            # Game binary
│   ├── config.toml          # Training configuration
│   └── src/
│       ├── env/             # TicTacToe + Minimax opponent
│       ├── training/        # Episodic + continuous + experiment runners
│       ├── ui/              # CLI: train, play, evaluate, experiment, seed-test, init
│       └── utils/           # Config, logger, metrics
```
## Quick Start
```sh
# Build

# Train (uses pc_tictactoe/config.toml)

# Play against the trained agent

# Play as first player

# Evaluate against minimax
```
## Configuration
All hyperparameters are configured via TOML. See pc_tictactoe/config.toml for the full configuration with the optimal parameters.
Key parameters:
| Parameter | Value | Description |
|---|---|---|
| `output_activation` | `linear` | Unbounded logits for softmax (tanh prevents learning) |
| `alpha` | 0.03 | PC inference loop update rate |
| `lr_weights` | 0.005 | Actor learning rate |
| `hidden_layers` | `[27,27,18]` softsign | 3-layer with dimensionality reduction |
| `residual` | `true` | Skip connections with ReZero + projection |
| `rezero_init` | 0.1 | ReZero initial scaling factor |
| `gamma` | 0.99 | Discount factor |
| `entropy_coeff` | 0.0 | No entropy regularization |
| `local_lambda` | 0.9999 | Ultra-low PC error for deep networks (0.99 for 1-layer) |
## Key Findings
- 3-layer [27,27,18] with lambda=0.9999 achieves 40% D=9 -- best configuration, surpasses single-layer (20% D=9)
- Depth-Lambda Scaling Law -- `lambda = 1 - 10^(-(L+1))`: PC error must decrease exponentially with network depth
- Lambda and training budget interact -- lambda=0.9999 needs 200k episodes (6% D=9 at 50k, 40% at 200k)
- Deliberation is the primary advantage -- PC inference loop adds +2-3 depth levels over MLP
- Softsign + residual + projection cooperate -- three mechanisms enable gradient flow in deep networks
- Output activation must be Linear -- Tanh bounds logits to [-1,1], preventing policy learning
- Bounded activations required for PC -- ReLU dies, ELU explodes; tanh and softsign work
- Backend-agnostic architecture -- `LinAlg` trait enables CPU/GPU swap with zero logic changes
- GA-ready crossover -- CCA neuron alignment solves the permutation problem; Hungarian optimal assignment; supports topology mutation (dimension/layer count changes)
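The scaling law above can be written as a one-line function; the function name is illustrative. It reproduces both configurations from the results table: `local_lambda = 0.99` for the 1-layer network and `0.9999` for the 3-layer network.

```rust
/// Depth-Lambda scaling law from the findings above:
/// lambda = 1 - 10^(-(L+1)), where L is the number of hidden layers.
fn lambda_for_depth(layers: i32) -> f64 {
    1.0 - 10f64.powi(-(layers + 1))
}

fn main() {
    // 1 hidden layer -> 0.99, 3 hidden layers -> 0.9999,
    // matching the statistical-validation table.
    assert!((lambda_for_depth(1) - 0.99).abs() < 1e-12);
    assert!((lambda_for_depth(3) - 0.9999).abs() < 1e-12);
    println!("1-layer: {}, 3-layer: {}", lambda_for_depth(1), lambda_for_depth(3));
}
```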
Validated through 19 experimental phases, ~3,200 training runs across multiple architectural configurations.
For the complete experimental methodology and statistical analysis, see docs/experiment_analysis.md. For the full architecture description, lessons learned, and applicability to other PC projects, see docs/pc_actor_critic_paper.md.
## Dependencies
The pc-rl-core library uses only:
- `serde` / `serde_json` -- Serialization
- `rand` -- Random number generation
- `chrono` -- Timestamps
The pc_tictactoe binary adds:
- `toml` -- Configuration parsing
- `clap` -- CLI argument parsing
- `ctrlc` -- Graceful shutdown
No PyTorch, TensorFlow, or any ML framework. Pure Rust from scratch.
## Testing
425 unit tests + 13 doctests covering all modules:
```sh
# Run all tests

# Run specific crate

# Lint
```
## License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.