PC-RL-Core


A Deliberative Predictive Coding (DPC) reinforcement learning agent that learns to play Tic-Tac-Toe from scratch, implemented entirely in Rust with zero ML framework dependencies.

The actor deliberates before acting by running an iterative free energy minimization loop (predictive coding inference), and a residual echo of that deliberation feeds back into weight updates as a structured micro-regularizer. These two mechanisms form a coupled system: deliberation generates the signal, the signal improves learning, and better learning improves future deliberation. The agent trains via REINFORCE with baseline against minimax opponents with curriculum learning.
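The policy-gradient side is standard REINFORCE with a learned baseline: each action's log-probability gradient is scaled by the advantage G_t - V(s_t), where the critic supplies V(s_t). A minimal sketch of that advantage computation, with hypothetical names rather than the crate's API:

// Illustrative REINFORCE-with-baseline advantage computation; the
// function name and layout are assumptions, not pc-rl-core API.
fn advantages(rewards: &[f64], values: &[f64], gamma: f64) -> Vec<f64> {
    let mut g = 0.0;
    let mut adv = vec![0.0; rewards.len()];
    for t in (0..rewards.len()).rev() {
        g = rewards[t] + gamma * g; // discounted return G_t
        adv[t] = g - values[t];     // critic value V(s_t) as the baseline
    }
    adv
}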

The core library (pc-rl-core) is backend-agnostic: all linear algebra operations are abstracted behind a LinAlg trait, enabling future GPU backends (wgpu, CUDA) without changing the RL logic.
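A hedged sketch of what that abstraction can look like; the real trait exposes 32 operations, and the two signatures below are illustrative assumptions:

// Illustrative only -- the actual LinAlg trait in pc-rl-core has 32
// methods; these two signatures are assumptions chosen to show the shape.
pub trait LinAlg {
    type Vector;
    type Matrix;
    fn matvec(m: &Self::Matrix, x: &Self::Vector) -> Self::Vector;
    fn add_scaled(y: &mut Self::Vector, alpha: f64, x: &Self::Vector);
}

RL code is written once against the trait; CpuLinAlg implements it over Vec<f64>, and a GPU backend would implement the same trait over device buffers.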

Results

The agent reaches minimax depth 9 (near-perfect play) in 40% of seeds with a 3-layer [27,27,18] architecture and ultra-low PC error (local_lambda=0.9999):

At depth 9, the agent achieves ~99% draws against a near-perfect minimax opponent -- essentially optimal play for Tic-Tac-Toe.

Statistical Validation (N=35 seeds, 19 phases, ~3,200 runs)

Topology     Lambda   Activation  Residual    Episodes  Mean Depth  D=9 Rate
[27,27,18]   0.9999   softsign    yes (proj)  200k      7.69        40%
1×27         0.99     tanh        no          50k       7.94        37%
1×27         0.99     softsign    no          50k       7.89        31%
[27,27,18]   0.999    softsign    yes (proj)  50k       7.20        20%

See docs/experiment_analysis.md for details across all 19 experimental phases.

Parameter Efficiency

The PC actor achieves near-optimal play with only ~550-1,000 parameters -- 3-330x smaller than typical published architectures for the same task (which range from ~2,700 to ~183,000 parameters). The PC inference loop trades compute for parameters: 5 iterative inference steps extract more representational capacity per parameter than a single feedforward pass through a larger network.

Architecture

Input (9) ──> [H1 27, Softsign] ──> [H2 27, Softsign] ──> [H3 18, Softsign] ──> [Output 9, Linear] ──> Softmax ──> Action
                  ^    |     ↕ skip        ↕ skip+proj
                  |    v
              PC Inference Loop (top-down / bottom-up)
                  |
                  v
            Latent Concat (27+27+18 = 72)
                  |
         [Board State (9)] ++ [Latent (72)] = Critic Input (81)
                  |
                  v
         [Critic Hidden 36, Softsign] ──> V(s)

All core structs are generic over L: LinAlg (default CpuLinAlg), enabling future GPU backends. The library is GA-ready with CCA-based crossover operators for evolving network populations.
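In practice this is the generic-with-default pattern; a rough sketch (the fields shown are assumptions, not the crate's actual layout):

// Sketch of the backend-generic pattern with a CPU default; the fields
// are illustrative, not the crate's actual struct layout.
pub struct PcActor<L: LinAlg = CpuLinAlg> {
    weights: Vec<L::Matrix>, // per-layer generative weights
    states: Vec<L::Vector>,  // per-layer hidden states
}

Code using the default keeps compiling unchanged; a GPU backend would be selected explicitly, e.g. PcActor<WgpuLinAlg> (hypothetical).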

Predictive Coding Loop: Instead of a single feedforward pass, the actor runs an iterative inference loop where higher layers generate top-down predictions of lower layer states. The prediction error (surprise) between layers drives hidden state updates. This process converges to a stable internal representation before action selection.
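A minimal sketch of one inference step for a single pair of layers, with hypothetical names (the real loop in pc_actor.rs runs several such steps, is generic over LinAlg, and also handles residual paths):

// Illustrative PC inference step; names and layout are hypothetical.
// w has one row per lower-layer unit; w[i][j] maps h_hi[j] -> pred[i].
fn pc_step(h_lo: &mut [f64], h_hi: &mut [f64], w: &[Vec<f64>], alpha: f64) {
    // Top-down prediction of the lower layer from the higher layer.
    let pred: Vec<f64> = w
        .iter()
        .map(|row| row.iter().zip(h_hi.iter()).map(|(wij, h)| wij * h).sum())
        .collect();
    // Prediction error ("surprise") between actual and predicted state.
    let err: Vec<f64> = h_lo.iter().zip(&pred).map(|(h, p)| h - p).collect();
    // The lower state relaxes toward the prediction...
    for (h, e) in h_lo.iter_mut().zip(&err) {
        *h -= alpha * e;
    }
    // ...while the higher state moves to explain the residual error
    // (error projected back through the transposed weights).
    for j in 0..h_hi.len() {
        let g: f64 = err.iter().zip(w.iter()).map(|(e, row)| e * row[j]).sum();
        h_hi[j] += alpha * g;
    }
}

Repeating this step (5 iterations in this project, at alpha = 0.03) settles the hidden states before the action is sampled.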

Curriculum Learning: The agent starts against a weak opponent (minimax depth 1) and advances when it achieves >95% non-loss rate over a 1000-game window. Metrics reset on each advancement to prevent cascading.
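A hedged sketch of that gate (names are illustrative, not the crate's API):

use std::collections::VecDeque;

// Illustrative curriculum gate; names are hypothetical.
struct Curriculum {
    depth: u32,              // current minimax opponent depth
    results: VecDeque<bool>, // sliding window of non-loss outcomes
}

impl Curriculum {
    // Record a game (true = win or draw); advance the opponent depth
    // once the 1000-game non-loss rate exceeds 95%, then reset metrics.
    fn record(&mut self, non_loss: bool) {
        self.results.push_back(non_loss);
        if self.results.len() > 1000 {
            self.results.pop_front();
        }
        if self.results.len() == 1000 && self.depth < 9 {
            let wins = self.results.iter().filter(|&&x| x).count();
            if wins as f64 / 1000.0 > 0.95 {
                self.depth += 1;
                self.results.clear(); // prevents cascading advancements
            }
        }
    }
}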

Project Structure

PC-RL-Core/
├── pc-rl-core/                    # Reusable RL library (v1.2.0)
│   └── src/
│       ├── linalg/
│       │   ├── mod.rs          # LinAlg trait (32 methods, backend-agnostic)
│       │   └── cpu.rs          # CpuLinAlg (Vec<f64> + Matrix, Jacobi SVD)
│       ├── activation.rs       # Tanh, ReLU, Sigmoid, ELU, Softsign, Linear
│       ├── error.rs            # PcError crate-wide error type
│       ├── matrix.rs           # Dense matrix, softmax, CCA alignment, Hungarian assignment
│       ├── layer.rs            # Layer<L: LinAlg> with PC top-down support
│       ├── pc_actor.rs         # PcActor<L> with inference loop, residual, crossover
│       ├── mlp_critic.rs       # MlpCritic<L> value function, crossover
│       ├── pc_actor_critic.rs  # PcActorCritic<L> agent, ActivationCache, crossover
│       └── serializer.rs       # JSON persistence (CPU concrete bridge)
├── pc_tictactoe/               # Game binary
│   ├── config.toml             # Training configuration
│   └── src/
│       ├── env/                # TicTacToe + Minimax opponent
│       ├── training/           # Episodic + continuous + experiment runners
│       ├── ui/                 # CLI: train, play, evaluate, experiment, seed-test, init
│       └── utils/              # Config, logger, metrics

Quick Start

# Build
cargo build --release

# Train (uses pc_tictactoe/config.toml)
cargo run --release -- train -c pc_tictactoe/config.toml

# Play against the trained agent
cargo run --release -- play --model model.json

# Play as first player
cargo run --release -- play --model model.json --first

# Evaluate against minimax
cargo run --release -- evaluate --model model.json --games 100 --depth 9

Configuration

All hyperparameters are configured via TOML. See pc_tictactoe/config.toml for the full configuration with the optimal parameters.

Key parameters:

Parameter           Value                 Description
output_activation   linear                Unbounded logits for softmax (tanh prevents learning)
alpha               0.03                  PC inference loop update rate
lr_weights          0.005                 Actor learning rate
hidden_layers       [27,27,18] softsign   3-layer with dimensionality reduction
residual            true                  Skip connections with ReZero + projection
rezero_init         0.1                   ReZero initial scaling factor
gamma               0.99                  Discount factor
entropy_coeff       0.0                   No entropy regularization
local_lambda        0.9999                Ultra-low PC error for deep networks (0.99 for 1-layer)
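For illustration, these keys could be deserialized with serde and the toml crate roughly as follows (the struct below is an assumption, not the actual schema in pc_tictactoe/src/utils/):

use serde::Deserialize;

// Hypothetical sketch: fields mirror the table above, not the crate's
// actual config schema.
#[derive(Deserialize)]
struct TrainingConfig {
    output_activation: String, // "linear"
    alpha: f64,                // PC inference update rate
    lr_weights: f64,           // actor learning rate
    hidden_layers: Vec<usize>, // e.g. [27, 27, 18]
    residual: bool,
    rezero_init: f64,
    gamma: f64,
    entropy_coeff: f64,
    local_lambda: f64,
}

fn load(path: &str) -> Result<TrainingConfig, Box<dyn std::error::Error>> {
    Ok(toml::from_str(&std::fs::read_to_string(path)?)?)
}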

Key Findings

  • 3-layer [27,27,18] with lambda=0.9999 achieves 40% D=9 -- best configuration, surpasses single-layer (20% D=9)
  • Depth-Lambda Scaling Law: lambda = 1 - 10^(-(L+1)) for L hidden layers -- PC error must decrease exponentially with network depth (see the sanity check after this list)
  • Lambda and training budget interact -- lambda=0.9999 needs 200k episodes (6% D=9 at 50k, 40% at 200k)
  • Deliberation is the primary advantage -- PC inference loop adds +2-3 depth levels over MLP
  • Softsign + residual + projection cooperate -- three mechanisms enable gradient flow in deep networks
  • Output activation must be Linear -- Tanh bounds logits to [-1,1], preventing policy learning
  • Bounded activations required for PC -- ReLU dies, ELU explodes; tanh and softsign work
  • Backend-agnostic architecture -- LinAlg trait enables CPU/GPU swap with zero logic changes
  • GA-ready crossover -- CCA neuron alignment solves the permutation problem; Hungarian optimal assignment; supports topology mutation (dimension/layer count changes)

Validated through 19 experimental phases, ~3,200 training runs across multiple architectural configurations.
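As a sanity check, the scaling law reproduces the lambdas used above: L=1 gives 0.99 (the single-layer runs) and L=3 gives 0.9999 (the [27,27,18] runs). Illustrative code, not part of the crate:

// Depth-lambda scaling law: lambda = 1 - 10^-(L+1), L = hidden layers.
fn lambda_for_depth(l: i32) -> f64 {
    1.0 - 10f64.powi(-(l + 1))
}

fn main() {
    assert!((lambda_for_depth(1) - 0.99).abs() < 1e-9);   // single-layer nets
    assert!((lambda_for_depth(3) - 0.9999).abs() < 1e-9); // [27,27,18]
}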

For the complete experimental methodology and statistical analysis, see docs/experiment_analysis.md. For the full architecture description, lessons learned, and applicability to other PC projects, see docs/pc_actor_critic_paper.md.

Dependencies

The pc-rl-core library uses only:

  • serde / serde_json -- Serialization
  • rand -- Random number generation
  • chrono -- Timestamps

The pc_tictactoe binary adds:

  • toml -- Configuration parsing
  • clap -- CLI argument parsing
  • ctrlc -- Graceful shutdown

No PyTorch, TensorFlow, or any ML framework. Pure Rust from scratch.

Testing

425 unit tests + 13 doctests covering all modules:

# Run all tests
cargo nextest run --workspace

# Run tests for a specific crate
cargo nextest run -p pc-rl-core
cargo nextest run -p pc_tictactoe

# Lint
cargo clippy --workspace --tests -- -D warnings

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.