rlx-rl 0.2.0

rlx-rl

Flow-map generative policies with Flow Map Q-Guidance (FMQ) and Q-Guided Beam Search (QGBS) on RLX (arxiv:2605.12416).

Reference implementation (JAX): /Users/Shared/q-guided-flow-map-policies — see REFERENCE.md for a module-by-module map.

Design

| Principle                | Implementation                                                   |
|--------------------------|------------------------------------------------------------------|
| MLP actor/critic         | rlx-ir graphs in graph/ (not rlx-flow)                           |
| CPU + autodiff           | Session::new(Device::Cpu) + legalize_broadcastgrad_with_loss     |
| No sim bindings          | Implement RlEnv; store Transition in ReplayBuffer                |
| Optional QGBS at eval    | EvalConfig::with_qgbs → Algorithm 2 over CompiledFlowMapAgent    |
| Offline ESD + curriculum | flow_curriculum + distillation (mf / lsd / psd)                  |

Plug in your environment

use rlx_rl::{
    buffer::Transition, dataset::OfflineDataset, env::RlEnv, policy::EvalConfig,
    spec::RlSpec, FmqTrainer, QgbsConfig,
};

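// Default is derived so that MyEnv::default() below compiles.
#[derive(Default)]
struct MyEnv { /* your state */ }

impl RlEnv for MyEnv {
    fn reset(&mut self) -> Vec<f32> {
        // return the initial state vector
        todo!()
    }
    fn step(&mut self, action: &[f32]) -> Transition {
        // fill state, action, reward, next_state, done
        todo!()
    }
}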

let spec = RlSpec { state_dim: 12, action_dim: 7, batch: 32, hidden: vec![256, 256], ..RlSpec::toy(32) };
let mut trainer = FmqTrainer::new(spec);

// Offline CFM from demonstrations
let offline_dataset: OfflineDataset = /* your demonstrations */;
trainer.offline_pretrain(&offline_dataset, 10_000);

// Online FMQ (no simulator inside RLX)
let mut env = MyEnv::default();
trainer.online_finetune(&mut env, 50_000);

// Eval: one-step (default)
let r0 = trainer.eval_rollout(&mut env, &EvalConfig::one_step());

// Eval: optional QGBS
let eval = EvalConfig::with_qgbs(QgbsConfig::default());
let r1 = trainer.eval_rollout(&mut env, &eval);

Custom online loop without RlEnv:

let tr: Transition = /* from your stack */;
trainer.online_step_from_transition(&tr);
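
To make the environment contract concrete, here is a minimal sketch of a hand-written environment and rollout loop. It does not depend on rlx-rl: the `Transition` struct and `RlEnv` trait below are local stand-ins that mirror the names used above, with field types assumed from the comment "state, action, reward, next_state, done". The `PointMass` environment and the proportional controller are hypothetical and exist only for illustration; the real crate's definitions may differ.

```rust
// Local stand-ins (assumptions): the real types live in rlx_rl::buffer and
// rlx_rl::env and may use different field types.
#[derive(Debug, Clone)]
struct Transition {
    state: Vec<f32>,
    action: Vec<f32>,
    reward: f32,
    next_state: Vec<f32>,
    done: bool,
}

trait RlEnv {
    fn reset(&mut self) -> Vec<f32>;
    fn step(&mut self, action: &[f32]) -> Transition;
}

/// Hypothetical 1-D point mass: state = [position], action = [velocity command].
#[derive(Default)]
struct PointMass {
    pos: f32,
    steps: usize,
}

impl RlEnv for PointMass {
    fn reset(&mut self) -> Vec<f32> {
        self.pos = 1.0;
        self.steps = 0;
        vec![self.pos]
    }

    fn step(&mut self, action: &[f32]) -> Transition {
        let state = vec![self.pos];
        // Clamp the command, then integrate with a fixed step of 0.1.
        let a = action[0].clamp(-1.0, 1.0);
        self.pos += 0.1 * a;
        self.steps += 1;
        // Episode ends near the origin or after 50 steps.
        let done = self.pos.abs() < 0.05 || self.steps >= 50;
        Transition {
            state,
            action: vec![a],
            reward: -self.pos.abs(),
            next_state: vec![self.pos],
            done,
        }
    }
}

fn main() {
    let mut env = PointMass::default();
    let mut state = env.reset();
    let mut episode_return = 0.0_f32;
    loop {
        // Proportional controller standing in for a learned policy.
        let action = [-state[0]];
        let tr = env.step(&action);
        episode_return += tr.reward;
        state = tr.next_state.clone();
        if tr.done {
            println!("last transition: {:?}", tr);
            break;
        }
    }
    println!("episode return: {episode_return}");
}
```

Each `Transition` produced this way is the kind of value that `online_step_from_transition` above consumes when you drive the loop yourself.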

Toy example (feature toy)

cargo run -p rlx-rl --example fmq_toy --features "compile,toy"
cargo test -p rlx-rl --features "compile,toy"

Flow map + FMQ

\[ X_{r,t}(a_r \mid s) = a_r + (t-r)\, u_{r,t}(a_r \mid s), \quad a_1 = X_{0,1}(a_0 \mid s) \]
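
The flow-map formula can be sketched in plain Rust, independent of rlx-rl. The `velocity` closure stands in for the learned network u_{r,t}; `straight_line_velocity` is a hypothetical closed-form velocity (it ignores the state and assumes r < 1) chosen only so that the one-step map and the two-step composition can be checked by hand.

```rust
/// Apply X_{r,t}(a_r | s) = a_r + (t - r) * u_{r,t}(a_r | s).
/// `velocity` plays the role of the learned average velocity u_{r,t}.
fn flow_map<F>(a_r: &[f32], state: &[f32], r: f32, t: f32, velocity: F) -> Vec<f32>
where
    F: Fn(&[f32], &[f32], f32, f32) -> Vec<f32>,
{
    let u = velocity(a_r, state, r, t);
    a_r.iter()
        .zip(u.iter())
        .map(|(a, du)| a + (t - r) * du)
        .collect()
}

/// Hypothetical velocity for illustration: a straight-line flow that reaches
/// `target` at time 1, so u_{r,t}(a_r) = (target - a_r) / (1 - r). Requires r < 1.
fn straight_line_velocity(target: Vec<f32>) -> impl Fn(&[f32], &[f32], f32, f32) -> Vec<f32> {
    move |a_r: &[f32], _state: &[f32], r: f32, _t: f32| {
        a_r.iter()
            .zip(target.iter())
            .map(|(a, g)| (g - a) / (1.0 - r))
            .collect::<Vec<f32>>()
    }
}

fn main() {
    let state = vec![0.0_f32; 3];
    let a0 = vec![1.0_f32, -1.0];
    let vel = straight_line_velocity(vec![0.5, 0.25]);

    // One-step sampling: a_1 = X_{0,1}(a_0 | s).
    let a1 = flow_map(&a0, &state, 0.0, 1.0, &vel);

    // Two half steps: X_{0.5,1}(X_{0,0.5}(a_0 | s) | s).
    let a_half = flow_map(&a0, &state, 0.0, 0.5, &vel);
    let a1_two_step = flow_map(&a_half, &state, 0.5, 1.0, &vel);

    println!("one step:  {:?}", a1);
    println!("two steps: {:?}", a1_two_step);
}
```

For this toy velocity, the one-step and two-step results coincide, which is the composition property a flow map is expected to satisfy.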

Online FMQ: project \(a_1\) with \(\nabla_a Q\) inside a trust region to obtain \(a_1^*\), then regress \(u_{0,1}\) toward \(a_1^* - a_0\).
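
The online update described above can be sketched as follows. This is one plausible reading, not the crate's implementation: the projection is assumed to be a gradient-ascent step on Q whose displacement is clipped to an L2 ball of radius `delta`, and `step`, `delta`, and all function names are hypothetical. The quadratic toy critic in `main` is used only to obtain a gradient by hand.

```rust
/// Assumed trust-region projection: move a_1 along grad_a Q by `step`,
/// rescaling so the displacement never exceeds `delta` in L2 norm.
fn project_with_q_grad(a1: &[f32], grad_q: &[f32], step: f32, delta: f32) -> Vec<f32> {
    let move_norm = step * grad_q.iter().map(|g| g * g).sum::<f32>().sqrt();
    let scale = if move_norm > delta { step * delta / move_norm } else { step };
    a1.iter().zip(grad_q.iter()).map(|(a, g)| a + scale * g).collect()
}

/// Regression target for u_{0,1}: a_1^* - a_0.
fn velocity_target(a1_star: &[f32], a0: &[f32]) -> Vec<f32> {
    a1_star.iter().zip(a0.iter()).map(|(x, y)| x - y).collect()
}

/// Mean squared error between a predicted u_{0,1} and its target.
fn regression_loss(u_pred: &[f32], target: &[f32]) -> f32 {
    let sum: f32 = u_pred
        .iter()
        .zip(target.iter())
        .map(|(p, t)| (p - t) * (p - t))
        .sum();
    sum / u_pred.len() as f32
}

fn main() {
    let a0 = vec![0.0_f32, 0.0];
    let a1 = vec![1.0_f32, 0.0]; // a_1 = X_{0,1}(a_0 | s) from the current policy

    // Toy critic Q(s, a) = -||a - a_opt||^2, so grad_a Q = -2 (a - a_opt).
    let a_opt = [4.0_f32, 4.0];
    let grad_q: Vec<f32> = a1.iter().zip(a_opt.iter()).map(|(a, o)| -2.0 * (a - o)).collect();

    let a1_star = project_with_q_grad(&a1, &grad_q, 0.1, 0.5);
    let target = velocity_target(&a1_star, &a0);

    // The current prediction is u_{0,1} = a_1 - a_0; the loss pulls it toward the target.
    let u_pred = velocity_target(&a1, &a0);
    println!("a1* = {:?}", a1_star);
    println!("target = {:?}", target);
    println!("loss = {}", regression_loss(&u_pred, &target));
}
```

In a real trainer the gradient would come from autodiff through the critic, and the loss would be minimised with respect to the actor parameters rather than evaluated once.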