rlx-rl

Flow-map generative policies with Flow Map Q-Guidance (FMQ) and Q-Guided Beam Search (QGBS) on RLX (arxiv:2605.12416).

Reference implementation (JAX): /Users/Shared/q-guided-flow-map-policies — see REFERENCE.md for a module-by-module map.

Design

| Principle | Implementation |
| --- | --- |
| MLP actor/critic | `rlx-ir` graphs in `graph/`, not `rlx-flow` |
| CPU + autodiff | `Session::new(Device::Cpu)` + `legalize_broadcast` → `grad_with_loss` |
| No sim bindings | Implement `RlEnv`; store `Transition` in `ReplayBuffer` |
| Optional QGBS at eval | `EvalConfig::with_qgbs` → Algorithm 2 over `CompiledFlowMapAgent` |
| Offline ESD + curriculum | `flow_curriculum` + `distillation` (`mf` / `lsd` / `psd`) |

Plug in your environment

use rlx_rl::{
    buffer::Transition, dataset::OfflineDataset, env::RlEnv, policy::EvalConfig,
    spec::RlSpec, FmqTrainer, QgbsConfig,
};

#[derive(Default)] // needed for the `MyEnv::default()` call below
struct MyEnv { /* your state */ }

impl RlEnv for MyEnv {
    fn reset(&mut self) -> Vec<f32> {
        // return the initial state
        todo!()
    }
    fn step(&mut self, action: &[f32]) -> Transition {
        // fill state, action, reward, next_state, done
        todo!()
    }
}

let spec = RlSpec { state_dim: 12, action_dim: 7, batch: 32, hidden: vec![256, 256], ..RlSpec::toy(32) };
let mut trainer = FmqTrainer::new(spec);

// Offline CFM from demonstrations
// (`offline_dataset` is assumed to be an `OfflineDataset` you have already built)
trainer.offline_pretrain(&offline_dataset, 10_000);

// Online FMQ (no simulator inside RLX)
let mut env = MyEnv::default();
trainer.online_finetune(&mut env, 50_000);

// Eval: one-step (default)
let r0 = trainer.eval_rollout(&mut env, &EvalConfig::one_step());

// Eval: optional QGBS
let eval = EvalConfig::with_qgbs(QgbsConfig::default());
let r1 = trainer.eval_rollout(&mut env, &eval);

Custom online loop without RlEnv:

let tr: Transition = todo!("build the transition from your own stack");
trainer.online_step_from_transition(&tr);
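
For a concrete picture of what an environment might look like, here is a minimal self-contained sketch. It does not depend on `rlx-rl`: the `Transition` struct and `RlEnv` trait below are local stand-ins whose field names follow the comment in the skeleton above, and whose field types are assumptions rather than the crate's actual definitions. `PointEnv`, `collect_episode`, and the proportional controller are hypothetical names introduced only for illustration.

```rust
// Local stand-ins so the sketch compiles on its own.
// ASSUMPTION: field names mirror the skeleton's comment; types are guesses,
// not the definitions in `rlx_rl::buffer` / `rlx_rl::env`.
#[derive(Debug, Clone)]
pub struct Transition {
    pub state: Vec<f32>,
    pub action: Vec<f32>,
    pub reward: f32,
    pub next_state: Vec<f32>,
    pub done: bool,
}

pub trait RlEnv {
    fn reset(&mut self) -> Vec<f32>;
    fn step(&mut self, action: &[f32]) -> Transition;
}

/// Hypothetical 1-D environment: the agent moves along a line toward x = 0.
#[derive(Default)]
pub struct PointEnv {
    x: f32,
    steps: usize,
}

impl PointEnv {
    const MAX_STEPS: usize = 50;
    const START: f32 = 1.0;
}

impl RlEnv for PointEnv {
    fn reset(&mut self) -> Vec<f32> {
        self.x = Self::START;
        self.steps = 0;
        vec![self.x]
    }

    fn step(&mut self, action: &[f32]) -> Transition {
        let state = vec![self.x];
        // Clamp the action so one step moves the point by at most 0.1.
        let a = action[0].clamp(-1.0, 1.0);
        self.x += 0.1 * a;
        self.steps += 1;
        let reached = self.x.abs() < 0.05;
        Transition {
            state,
            action: vec![a],
            reward: -self.x.abs(), // closer to the origin is better
            next_state: vec![self.x],
            done: reached || self.steps >= Self::MAX_STEPS,
        }
    }
}

/// Roll out a fixed proportional controller and collect the transitions.
pub fn collect_episode(env: &mut impl RlEnv) -> Vec<Transition> {
    let mut state = env.reset();
    let mut out = Vec::new();
    loop {
        let action = [-5.0 * state[0]]; // push toward the origin
        let tr = env.step(&action);
        state = tr.next_state.clone();
        let done = tr.done;
        out.push(tr);
        if done {
            break;
        }
    }
    out
}

fn main() {
    let mut env = PointEnv::default();
    let episode = collect_episode(&mut env);
    println!("collected {} transitions", episode.len());
}
```

With the real crate, transitions gathered this way would presumably be fed to the trainer one at a time via `online_step_from_transition`, or the environment handed directly to `online_finetune`.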

Toy example (feature `toy`)

cargo run -p rlx-rl --example fmq_toy --features "compile,toy"
cargo test -p rlx-rl --features "compile,toy"

Flow map + FMQ

\[ X_{r,t}(a_r \mid s) = a_r + (t-r)\, u_{r,t}(a_r \mid s), \quad a_1 = X_{0,1}(a_0 \mid s) \]

Online FMQ: project \(a_1\) along \(\nabla_a Q\) inside a trust region to obtain \(a_1^*\), then regress \(u_{0,1}\) toward \(a_1^* - a_0\).
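
The update described above can be sketched in a few lines of plain Rust. This is an illustration under stated assumptions, not the crate's implementation: the trust region is taken to be an L2 ball of radius `radius` around \(a_1\), the projection is a single gradient-ascent step of size `eta` rescaled to stay inside that ball, and the critic is a toy quadratic. All function names (`flow_map`, `project_with_q`, `velocity_target`, `toy_grad_q`, `mse`) are hypothetical.

```rust
/// Flow map X_{r,t}(a_r | s) = a_r + (t - r) * u_{r,t}(a_r | s),
/// where `u` is the already-evaluated velocity output.
fn flow_map(a_r: &[f32], u: &[f32], r: f32, t: f32) -> Vec<f32> {
    a_r.iter().zip(u).map(|(a, v)| a + (t - r) * v).collect()
}

/// ASSUMED trust-region step: move along grad_a Q with step size `eta`,
/// rescaling the update so its L2 norm never exceeds `radius`.
fn project_with_q(a1: &[f32], grad_q: &[f32], eta: f32, radius: f32) -> Vec<f32> {
    let norm = grad_q.iter().map(|g| (eta * g).powi(2)).sum::<f32>().sqrt();
    let scale = if norm > radius { radius / norm } else { 1.0 };
    a1.iter().zip(grad_q).map(|(a, g)| a + scale * eta * g).collect()
}

/// Regression target for u_{0,1}: a_1^* - a_0.
fn velocity_target(a1_star: &[f32], a0: &[f32]) -> Vec<f32> {
    a1_star.iter().zip(a0).map(|(x, y)| x - y).collect()
}

/// Toy critic Q(a) = -||a - goal||^2, so grad_a Q = -2 (a - goal).
/// ASSUMPTION: stands in for a learned critic, used only for this sketch.
fn toy_grad_q(a: &[f32], goal: &[f32]) -> Vec<f32> {
    a.iter().zip(goal).map(|(x, g)| -2.0 * (x - g)).collect()
}

/// Mean squared error between the predicted velocity and its target.
fn mse(pred: &[f32], target: &[f32]) -> f32 {
    let sum = pred.iter().zip(target).map(|(p, t)| (p - t).powi(2)).sum::<f32>();
    sum / pred.len() as f32
}

fn main() {
    let a0 = [0.0_f32, 0.0];
    let u = [0.5_f32, -0.5]; // pretend network output u_{0,1}(a_0 | s)
    let goal = [1.0_f32, 0.0];

    // One-step action: a_1 = X_{0,1}(a_0 | s).
    let a1 = flow_map(&a0, &u, 0.0, 1.0);
    // Q-guided projection inside the trust region.
    let grad = toy_grad_q(&a1, &goal);
    let a1_star = project_with_q(&a1, &grad, 0.5, 0.25);
    // Regress u_{0,1} toward a_1^* - a_0.
    let target = velocity_target(&a1_star, &a0);
    let loss = mse(&u, &target);

    println!("a1 = {:?}, a1* = {:?}, loss = {}", a1, a1_star, loss);
}
```

Capping the step norm keeps \(a_1^*\) close to the action the policy already produces, so the regression target changes gradually even when the critic's gradient is large.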

License

GPL-3.0-only.