# About ReinforceX
ReinforceX (ReX) is an early-stage deep reinforcement learning framework built
in Rust. It is designed as a Rust-first playground for implementing,
experimenting with, and eventually productionizing reinforcement learning
agents without making Python the core runtime.
The project currently focuses on:
- a small, readable core for value-based, policy-based, and actor-critic
algorithms;
- neural-network policies and Q-functions backed by `tch` / libtorch;
- replay and on-policy buffers that can be shared across training workers;
- sample Gymnasium environments exposed through a simple HTTP server;
- an optional C ABI for embedding agents from C, C++, C#, Unity, or other
runtimes.
Advantages of Rust for this project:
- ownership and RAII make long-running training jobs easier to reason about;
- `Send` / `Sync` boundaries make parallel training explicit;
- native binaries are a good fit for simulators, games, robotics, and embedded
integrations;
- Rust can still use libtorch through `tch`, so the project can combine systems
programming ergonomics with modern tensor operations.
ReinforceX is not yet a stable 1.0 API. Contributions are welcome, especially
around algorithms, documentation, benchmark environments, test coverage, and
safe public API design.
# Package
crates.io: https://crates.io/crates/reinforcex
```sh
cargo add reinforcex
```
The default `cpu` feature enables `torch-sys` with `download-libtorch`.
```toml
[dependencies]
reinforcex = "0.0.4"
```
For CUDA experiments, build with the `cuda` feature and make sure your local
libtorch / CUDA runtime is visible to `tch`. On Windows, `load_cuda_dlls()` also
checks `TORCH_CUDA_DLL` when the `cuda` feature is enabled.
# Algorithms
Implemented agents:
- DQN: Double-DQN style target network, n-step replay, epsilon-greedy
exploration, optional reward-based selector, shared replay buffer support.
- PPO: clipped policy objective, GAE, value clipping, entropy regularization,
discrete, multi-branch discrete, and Gaussian policies.
- SAC: continuous and discrete Soft Actor-Critic, twin critics, soft target
updates, automatic temperature updates for discrete policies, and component
checkpointing.
Core building blocks:
- Models: `FCQNetwork`, `FCSoftmaxPolicy`, `FCSoftmaxPolicyWithValue`,
`FCGaussianPolicy`, `FCGaussianPolicyWithValue`.
- Distributions: `SoftmaxDistribution`, `MultiSoftmaxDistribution`,
`GaussianDistribution`.
- Memory: `ReplayBuffer` with n-step transitions, `OnPolicyBuffer`.
- Exploration and selection: `EpsilonGreedy`, `RewardBasedSelector`.
- FFI: DQN and PPO can be created and trained through a C-compatible API.
# API
Instantiate a DQN agent.
```rust
use reinforcex::agents::{BaseAgent, DQN};
use reinforcex::explorers::EpsilonGreedy;
use reinforcex::memory::ReplayBuffer;
use reinforcex::models::FCQNetwork;
use std::sync::Arc;
use tch::{nn, nn::OptimizerConfig, Device};
let device = Device::cuda_if_available();
let vs = nn::VarStore::new(device);
let optimizer = nn::Adam::default().build(&vs, 3e-4).unwrap();
let n_input_channels = 4;
let action_size = 2;
let n_hidden_layers = 2;
let n_hidden_channels = 128;
let model = Box::new(FCQNetwork::new(
vs,
n_input_channels,
action_size,
n_hidden_layers,
n_hidden_channels,
));
let gamma = 0.97;
let n_steps = 3;
let batch_size = 16;
let update_interval = 8;
let target_update_interval = 100;
let replay_buffer_capacity = 2_000;
let explorer = EpsilonGreedy::new(0.5, 0.1, 50_000);
let transition_buffer = Arc::new(ReplayBuffer::new(replay_buffer_capacity, n_steps));
let mut agent = DQN::new(
model,
transition_buffer,
optimizer,
action_size as usize,
batch_size,
update_interval,
target_update_interval,
Box::new(explorer),
None,
gamma,
Some("models/dqn_latest.ot".to_string()),
None,
);
```
Common agent methods are provided by `BaseAgent`.
```rust
fn act(&self, obs: &Tensor) -> Tensor;
fn act_and_train(&mut self, obs: &Tensor, reward: f64) -> Tensor;
fn stop_episode_and_train(&mut self, obs: &Tensor, reward: f64);
fn get_statistics(&self) -> Vec<(String, f64)>;
fn save(&self);
fn load(&mut self);
```
Pseudo code for training:
```rust
for episode in 0..max_episode {
let mut reward = 0.0;
for step in 0..max_step {
let action = agent.act_and_train(&obs, reward);
let (next_obs, next_reward, done) = env.step(action);
obs = next_obs;
reward = next_reward;
if done {
agent.stop_episode_and_train(&obs, reward);
break;
}
}
}
```
Pseudo code for parallel learning:
```rust
use rayon::prelude::*;
use std::sync::Arc;
let buffer = Arc::new(ReplayBuffer::new(1_000, 1));
let mut agent = DQN::new(
model,
Arc::clone(&buffer),
optimizer,
action_size,
batch_size,
update_interval,
target_update_interval,
Box::new(explorer),
None,
gamma,
Some(format!("models/dqn_{agent_id}.ot")),
None,
);
for episode in 0..max_episode {
// Run the same training loop as above.
}
});
```
`build_agent_components()` is a placeholder for creating a separate model,
optimizer, and explorer per worker. Share only the replay buffer or other
explicitly thread-safe state.
# Sample experiments
The sample experiments call Gymnasium environments through FastAPI servers.
Docker Compose starts ten environment servers on ports `8001` to `8010`.
```sh
docker compose -f sample_env/docker-compose.yml up -d --build
```
Run CartPole with DQN:
```sh
cargo run -p reinforcex --features cpu -- --env cartpole --algo dqn
```
Run CartPole with PPO:
```sh
cargo run -p reinforcex --features cpu -- --env cartpole --algo ppo
```
Run CartPole with discrete SAC using four parallel environment servers:
```sh
cargo run -p reinforcex --features cpu -- --env cartpole --algo sac --parallel 4
```
Run LunarLanderContinuous with continuous SAC:
```sh
cargo run -p reinforcex --features cpu -- --env lunar --algo sac --parallel 4
```
Run Ant with PPO:
```sh
cargo run -p reinforcex --features cpu -- --env ant --algo ppo
```
Use `--save-path` and `--load-path` to persist models. Multi-agent samples can
include `{agent_id}` in the path.
```sh
cargo run -p reinforcex --features cpu -- \
--env cartpole \
--algo dqn \
--save-path "models/cartpole_dqn_{agent_id}.ot" \
--load-path "models/cartpole_dqn_{agent_id}.ot"
```
For SAC, a single save path expands into component checkpoints such as actor,
critic1, critic2, and temperature files.
Stop the sample environment servers:
```sh
docker compose -f sample_env/docker-compose.yml down
```
<img width="597" alt="CartPole training sample" src="https://github.com/user-attachments/assets/b8c0606b-ec11-4b5a-b7fc-3070ad327d72" />
# Unit test
Run all Rust unit tests from the workspace root:
```sh
cargo test --workspace
```
The core unit tests exercise agents, models, probability distributions, memory
buffers, selectors, and the FFI wrapper. The Docker-based Gymnasium server is
only required for the sample experiments above.
# FFI
ReinforceX also provides a small Foreign Function Interface (FFI) crate for
embedding agents from external runtimes such as C, C++, C#, or Unity.
Build the dynamic library:
```sh
cargo build -p reinforcex_ffi --release
```
The generated library is named `reinforcex` with the platform-specific dynamic
library extension, for example `reinforcex.dll`, `libreinforcex.so`, or
`libreinforcex.dylib`.
## Overview
- All agents are managed internally and referenced through a `u64` ID.
- The public FFI functions catch panics and return silently on invalid inputs.
- All sizes use `u64` for ABI-friendly boundaries.
- The caller owns input and output buffer allocation.
- `agent_type = 0` creates DQN; any other value creates PPO.
## Data Structures
### AgentConfig
```c
typedef struct {
uint32_t agent_type;
uint64_t obs_size;
uint64_t action_size;
double learning_rate;
double gamma;
uint64_t batch_size;
uint64_t buffer_size;
double epsilon_start;
double epsilon_end;
uint64_t epsilon_decay;
double lambda;
uint64_t update_interval;
uint64_t epoch;
uint64_t minibatch_size;
double clip_eps;
} AgentConfig;
```
| `agent_type` | `0 = DQN`, otherwise PPO |
| `obs_size` | Observation vector size |
| `action_size` | Action space size |
| `learning_rate` | Optimizer learning rate |
| `gamma` | Discount factor |
| `batch_size` | DQN batch size |
| `buffer_size` | DQN replay buffer size |
| `epsilon_start` | Initial epsilon for DQN |
| `epsilon_end` | Final epsilon for DQN |
| `epsilon_decay` | Epsilon decay steps for DQN |
| `lambda` | PPO GAE lambda |
| `update_interval` | PPO update interval |
| `epoch` | PPO training epochs |
| `minibatch_size` | PPO minibatch size |
| `clip_eps` | PPO clipping epsilon |
## Functions
### rx_agent_create
```c
uint64_t rx_agent_create(const AgentConfig* config);
```
Creates a new agent and returns its ID. Returns `0` on failure.
### rx_agent_act_and_train
```c
void rx_agent_act_and_train(
uint64_t id,
const float* obs,
uint64_t obs_len,
float reward,
float* out,
uint64_t out_len
);
```
Performs action selection and one training step. DQN writes one scalar action.
PPO writes a vector action and truncates to `out_len` if the output buffer is
smaller than the action tensor.
### rx_agent_stop_episode
```c
void rx_agent_stop_episode(
uint64_t id,
const float* obs,
uint64_t obs_len,
float reward
);
```
Signals the end of an episode and performs the final training step.
### rx_agent_destroy
```c
void rx_agent_destroy(uint64_t id);
```
Destroys the agent for the given ID. Calling it with an unknown ID is a no-op.
# Contributing
ReinforceX is a good place to contribute if you are interested in Rust,
reinforcement learning, libtorch bindings, simulator integration, or FFI.
Useful contribution areas:
- algorithm implementations and correctness tests;
- benchmark scripts and reproducible training results;
- safer public APIs around tensor shapes, device placement, and errors;
- documentation for model construction and environment integration;
- CI for Rust tests, formatting, and platform-specific FFI builds.
Before opening a pull request, please run:
```sh
cargo fmt --all -- --check
cargo test --workspace
```
# License
MIT License (https://github.com/kakky-hacker/reinforcex/blob/master/LICENSE)