PC-RL-Core


A Deliberative Predictive Coding (DPC) reinforcement learning framework implemented entirely in Rust with zero ML framework dependencies.

The actor deliberates before acting by running an iterative free energy minimization loop (predictive coding inference), and a residual echo of that deliberation feeds back into weight updates as a structured micro-regularizer. These two mechanisms form a coupled system: deliberation generates the signal, the signal improves learning, and better learning improves future deliberation.

The library is backend-agnostic: all linear algebra operations are abstracted behind a LinAlg trait, enabling future GPU backends (wgpu, CUDA) without changing the RL logic.

Installation

[dependencies]
pc-rl-core = "1.2.3"

Quick Start

use pc_rl_core::{
    PcActorCriticCpu, PcActorCriticConfig, PcActorConfig, MlpCriticConfig,
    Activation, LayerDef, SelectionMode,
};

// Configure the agent
let actor_config = PcActorConfig {
    input_size: 9,
    output_size: 9,
    hidden_layers: vec![LayerDef { size: 27, activation: Activation::Softsign }],
    output_activation: Activation::Linear,
    alpha: 0.03,
    tol: 0.01,
    min_steps: 1,
    max_steps: 5,
    lr_weights: 0.005,
    synchronous: true,
    temperature: 1.0,
    local_lambda: 0.99,
    residual: false,
    rezero_init: 0.001,
};

let critic_config = MlpCriticConfig {
    input_size: 36,  // state_dim + latent_dim
    hidden_layers: vec![LayerDef { size: 36, activation: Activation::Softsign }],
    output_activation: Activation::Linear,
    lr: 0.005,
};

let config = PcActorCriticConfig {
    actor: actor_config,
    critic: critic_config,
    gamma: 0.99,
    surprise_low: 0.02,
    surprise_high: 0.15,
    adaptive_surprise: true,
    surprise_buffer_size: 400,
    entropy_coeff: 0.0,
};

let mut agent = PcActorCriticCpu::new(config, 42)?;  // seed 42; `?` assumes an enclosing fn returning Result<_, PcError>

// Training loop: act, collect trajectory steps, learn
let (action, infer_result) = agent.act(&state, &valid_actions, SelectionMode::Training);
// ... execute action in environment, collect TrajectoryStep per timestep ...
let avg_loss = agent.learn(&trajectory);

// Evaluation (deterministic)
let (action, _) = agent.act(&state, &valid_actions, SelectionMode::Play);

Architecture

Core Components

  • PcActor<L: LinAlg> -- Policy network with predictive coding inference loop, residual skip connections, surprise scoring, and CCA crossover
  • MlpCritic<L: LinAlg> -- Standard MLP value function with MSE loss backpropagation and CCA crossover
  • PcActorCritic<L: LinAlg> -- Integrated agent combining actor and critic with surprise-based learning rate scheduling
  • Layer<L: LinAlg> -- Dense layer with forward, transpose (PC top-down), and backward passes
  • LinAlg trait -- Backend-agnostic linear algebra interface (32 methods). Default implementation: CpuLinAlg
  • GolubKahanSvd -- O(n^3) SVD via bidiagonalization, used for CCA neuron alignment

Key Mechanisms

Predictive Coding Inference: Instead of a single feedforward pass, the actor runs an iterative inference loop where higher layers generate top-down predictions of lower layer states. The prediction error (surprise) between layers drives hidden state updates until convergence.
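The loop can be sketched in miniature. This is illustrative only, not the crate's API: `pc_infer`, its single-unit setup, and the scalar generative weight are simplifications, but the structure (step size `alpha`, tolerance `tol`, step budget, error-driven state updates) mirrors the config fields above.

```rust
/// Illustrative predictive-coding inference for one hidden unit.
/// `w_gen` generates a top-down prediction of the input; the hidden
/// state is refined until the update falls below `tol` or the step
/// budget runs out.
fn pc_infer(input: f64, w_gen: f64, alpha: f64, tol: f64, max_steps: usize) -> (f64, usize) {
    let mut h = 0.0; // hidden state, refined by deliberation
    for step in 0..max_steps {
        let error = input - w_gen * h;     // bottom-up prediction error ("surprise")
        let delta = alpha * w_gen * error; // descend the energy 0.5 * error^2 w.r.t. h
        h += delta;
        if delta.abs() < tol {
            return (h, step + 1); // converged before exhausting the budget
        }
    }
    (h, max_steps)
}

fn main() {
    // With w_gen = 1 the loop drives h toward the input value.
    let (h, steps) = pc_infer(1.0, 1.0, 0.5, 1e-6, 100);
    println!("h = {h:.6} after {steps} steps");
}
```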

Residual Echo (local_lambda): A small fraction of prediction errors from deliberation is blended into backpropagation gradients: delta = lambda * backprop_grad + (1-lambda) * pc_error. This couples inference and learning into a synergistic system.
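A minimal sketch of that blend (the function name is illustrative, not part of the crate's API):

```rust
/// Residual echo blend, matching the formula above: a (1 - lambda)
/// fraction of the PC inference error is folded into the learning signal.
fn blend_delta(backprop_grad: f64, pc_error: f64, local_lambda: f64) -> f64 {
    local_lambda * backprop_grad + (1.0 - local_lambda) * pc_error
}

fn main() {
    // With local_lambda = 0.99 (as in the Quick Start), 1% of the
    // signal comes from the deliberation residual.
    let delta = blend_delta(0.2, -0.5, 0.99);
    println!("blended delta = {delta}");
}
```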

Adaptive Surprise Scheduling: A circular buffer of recent surprise scores dynamically calibrates learning rate thresholds. Low surprise reduces LR (familiar states), high surprise boosts LR (novel states). Buffer-mediated damping protects learned representations during environment transitions.
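One plausible shape for this mechanism, as a sketch (the struct, method names, and fixed multipliers are hypothetical, not the crate's API; the crate calibrates thresholds adaptively, while this toy uses fixed `low`/`high` cutoffs on the buffer mean):

```rust
use std::collections::VecDeque;

/// Toy surprise-based LR scheduler: a bounded buffer of recent surprise
/// scores is averaged, and the mean picks an LR multiplier.
struct SurpriseScheduler {
    buffer: VecDeque<f64>,
    capacity: usize, // cf. surprise_buffer_size in the Quick Start
}

impl SurpriseScheduler {
    fn new(capacity: usize) -> Self {
        Self { buffer: VecDeque::with_capacity(capacity), capacity }
    }

    /// Record a surprise score and return an LR multiplier:
    /// below `low` -> damp (familiar states), above `high` -> boost (novel states).
    fn lr_scale(&mut self, surprise: f64, low: f64, high: f64) -> f64 {
        if self.buffer.len() == self.capacity {
            self.buffer.pop_front(); // circular buffer: drop the oldest score
        }
        self.buffer.push_back(surprise);
        // Averaging over the buffer damps single-step spikes, so a brief
        // environment transition does not immediately rewrite the weights.
        let mean = self.buffer.iter().sum::<f64>() / self.buffer.len() as f64;
        if mean < low { 0.5 } else if mean > high { 2.0 } else { 1.0 }
    }
}

fn main() {
    let mut sched = SurpriseScheduler::new(400);
    let scale = sched.lr_scale(0.3, 0.02, 0.15); // novel state -> boosted LR
    println!("lr multiplier = {scale}");
}
```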

CCA Crossover: GA-ready crossover operator using Canonical Correlation Analysis to align neurons functionally before blending weights, solving the permutation problem. Supports dimension mismatches, layer count differences, and residual components.
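To see why alignment matters, consider blending two networks that compute the same function with their hidden neurons permuted: naive weight averaging destroys both. The toy below uses greedy dot-product matching as a stand-in for the crate's CCA + Hungarian alignment (which matches neurons by activation statistics, not raw weights):

```rust
/// Toy neuron alignment before crossover: match each row of `a` to its
/// most similar unused row of `b` by dot product, then average matched
/// pairs. Illustrative only; the crate's operator also handles dimension
/// mismatches, layer count differences, and residual components.
fn aligned_blend(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut used = vec![false; b.len()];
    a.iter()
        .map(|row_a| {
            // Greedily pick the unused row of `b` with the largest dot product.
            let (j, _) = b
                .iter()
                .enumerate()
                .filter(|(j, _)| !used[*j])
                .map(|(j, row_b)| {
                    let dot: f64 = row_a.iter().zip(row_b).map(|(x, y)| x * y).sum();
                    (j, dot)
                })
                .max_by(|x, y| x.1.partial_cmp(&y.1).unwrap())
                .unwrap();
            used[j] = true;
            // Blend the functionally matched pair 50/50.
            row_a.iter().zip(&b[j]).map(|(x, y)| 0.5 * (x + y)).collect()
        })
        .collect()
}

fn main() {
    let a = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let b = vec![vec![0.0, 1.0], vec![1.0, 0.0]]; // same rows, permuted
    println!("{:?}", aligned_blend(&a, &b));
}
```

Blending the permuted copy with its original returns the original weights, whereas positional averaging would have collapsed both rows to [0.5, 0.5].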

Type Aliases

type PcActorCpu = PcActor<CpuLinAlg>;
type MlpCriticCpu = MlpCritic<CpuLinAlg>;
type PcActorCriticCpu = PcActorCritic<CpuLinAlg>;
type LayerCpu = Layer<CpuLinAlg>;

Project Structure

PC-RL-Core/
├── src/
│   ├── linalg/
│   │   ├── mod.rs                  # LinAlg trait (32 methods, backend-agnostic)
│   │   ├── cpu.rs                  # CpuLinAlg (Vec<f64> + Matrix)
│   │   └── golub_kahan.rs          # Golub-Kahan SVD (O(n^3))
│   ├── activation.rs               # Tanh, ReLU, Sigmoid, ELU, Softsign, Linear
│   ├── error.rs                    # PcError crate-wide error type
│   ├── matrix.rs                   # Dense matrix, softmax, CCA alignment, Hungarian assignment
│   ├── layer.rs                    # Layer<L: LinAlg> with PC top-down support
│   ├── pc_actor.rs                 # PcActor<L> with inference loop, residual, crossover
│   ├── mlp_critic.rs               # MlpCritic<L> value function, crossover
│   ├── pc_actor_critic.rs          # PcActorCritic<L> agent, ActivationCache, crossover
│   └── serializer.rs               # JSON persistence (CPU concrete bridge)
├── docs/
│   ├── experiment_analysis.md      # 20 experimental phases, ~3,800 runs
│   └── pc_actor_critic_paper.md    # DPC architecture paper
└── Cargo.toml

Research Findings

Validated through 20 experimental phases (~3,800 training runs) on Tic-Tac-Toe (PC-TicTacToe):

  • Deliberation is the primary advantage -- the PC inference loop adds 2-3 depth levels over an equivalent MLP
  • Residual echo breaks performance ceilings -- 1% PC error blend (lambda=0.99) is statistically significant (p<0.034)
  • Depth-Lambda Scaling Law: lambda = 1 - 10^(-(L+1)) -- PC error must decrease exponentially with network depth
  • Lambda and training budget interact -- ultra-low PC error needs more episodes to accumulate its regularization effect
  • Adaptive surprise eliminates catastrophic forgetting -- buffer-mediated transition damping protects learned representations during curriculum transitions
  • Optimal buffer ratio: 0.3-0.4 x environment transition window -- too small resonates, too large over-damps
  • Bounded activations required for PC -- ReLU dies, ELU explodes; tanh and softsign work
  • Softsign + residual + projection cooperate -- three mechanisms enable gradient flow in deep networks
  • Parameter efficiency -- ~550 actor parameters match the performance of networks 4-330x larger through iterative inference
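The depth-lambda scaling law above can be tabulated directly. Interpreting L as the hidden-layer count (an assumption, though it matches the lambda = 0.99 used for the single-hidden-layer Quick Start network):

```rust
/// Depth-lambda scaling law from the findings above: lambda = 1 - 10^-(L+1),
/// so the PC error fraction (1 - lambda) shrinks tenfold per added layer.
fn depth_lambda(hidden_layers: u32) -> f64 {
    1.0 - 10f64.powi(-(hidden_layers as i32 + 1))
}

fn main() {
    // L = 1 -> 0.99, L = 2 -> 0.999, L = 3 -> 0.9999, ...
    for l in 1..=4 {
        println!("L = {l}: lambda = {}", depth_lambda(l));
    }
}
```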

For the complete experimental methodology and statistical analysis, see docs/experiment_analysis.md. For the full architecture description and lessons learned, see docs/pc_actor_critic_paper.md.

Dependencies

  • serde / serde_json -- Serialization
  • rand -- Random number generation
  • chrono -- Timestamps

No PyTorch, TensorFlow, or any other ML framework. Pure Rust, written from scratch.

Testing

384 unit tests + 20 doctests:

cargo nextest run
cargo test --doc
cargo clippy --tests -- -D warnings

License

Licensed under either of the licenses provided in the repository, at your option.