pub struct Trainer { /* private fields */ }

Manages the training loop and related objects.
§Training loop
Given an agent implementing Agent and a recorder implementing Recorder, the training loop looks like the following:

1. Initialize the objects used in the training loop, involving instances of Env, StepProcessor, and Sampler:
   - Reset the counter of environment steps: env_steps = 0
   - Reset the counter of optimization steps: opt_steps = 0
   - Reset the objects used for computing optimization steps per second (OSPS):
     - a timer: timer = SystemTime::now()
     - a counter: opt_steps_ops = 0
2. Reset Env.
3. Do an environment step and push a transition to the replay buffer, which implements ReplayBufferBase, then increment the counter: env_steps += 1
4. If env_steps % opt_interval == 0:
   - Do an optimization step for the agent with transition batches sampled from the replay buffer.
     - NOTE: the agent can skip an optimization step for some reason, for example during a warmup period for the replay buffer. In that case, the following sub-steps are skipped as well.
   - opt_steps += 1, opt_steps_ops += 1
   - If opt_steps % eval_interval == 0:
     - Evaluate the agent and add the evaluation result to the recorder as "eval_reward".
     - Reset timer and opt_steps_ops.
     - If the evaluation result is the best so far, the agent's model parameters are saved in the directory (model_dir)/best.
   - If opt_steps % record_interval == 0, compute OSPS as opt_steps_ops / timer.elapsed()?.as_secs_f32() and add it to the recorder as "opt_steps_per_sec".
   - If opt_steps % save_interval == 0, the agent's model parameters are saved in the directory (model_dir)/(opt_steps).
   - If opt_steps == max_opts, finish the training loop.
5. Go back to step 3.
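As a rough, self-contained sketch of the counter and interval bookkeeping described above (the functions env_step, opt_step, and evaluate, as well as training_loop itself, are hypothetical stand-ins rather than this crate's API; environment reset and recorder interaction are omitted):

use std::time::SystemTime;

// Hypothetical stand-ins, not part of this crate's API.
fn env_step() { /* interact with the environment and push a transition */ }
fn opt_step() -> bool { true /* may return false, e.g. while the replay buffer warms up */ }
fn evaluate() -> f32 { 0.0 }

fn training_loop(max_opts: usize, opt_interval: usize, eval_interval: usize, record_interval: usize) {
    // Step 1: counters and the OSPS timer.
    let (mut env_steps, mut opt_steps, mut opt_steps_ops) = (0usize, 0usize, 0usize);
    let mut timer = SystemTime::now();

    loop {
        // Step 3: environment step.
        env_step();
        env_steps += 1;

        // Step 4: optimization every opt_interval environment steps.
        if env_steps % opt_interval != 0 {
            continue;
        }
        if !opt_step() {
            continue; // the agent skipped this optimization step (e.g. warmup)
        }
        opt_steps += 1;
        opt_steps_ops += 1;

        if opt_steps % eval_interval == 0 {
            let _eval_reward = evaluate(); // recorded as "eval_reward"
            opt_steps_ops = 0;
            timer = SystemTime::now();
        }
        if opt_steps % record_interval == 0 {
            let secs = timer.elapsed().map(|d| d.as_secs_f32()).unwrap_or(f32::NAN);
            let _osps = opt_steps_ops as f32 / secs; // recorded as "opt_steps_per_sec"
        }
        if opt_steps == max_opts {
            break;
        }
        // Step 5: go back to step 3 (top of the loop).
    }
}

fn main() {
    training_loop(100, 1, 50, 10);
}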
§Interaction of objects
In the Trainer::train() method, objects interact as shown below:
graph LR
A[Agent]-->|Env::Act|B[Env]
B -->|Env::Obs|A
B -->|"Step<E: Env>"|C[StepProcessor]
C -->|ReplayBufferBase::PushedItem|D[ReplayBufferBase]
D -->|BatchBase|A
- First, Agent emits an Env::Act a_t based on the Env::Obs o_t received from Env. Given a_t, Env changes its state and creates the observation at the next step, o_t+1. This step of interaction between Agent and Env is referred to as an environment step.
- Next, a Step<E: Env> is created with the next observation o_t+1, the reward r_t, and a_t.
- The Step<E: Env> object is processed by StepProcessor to create a ReplayBufferBase::Item, typically representing a transition (o_t, a_t, o_t+1, r_t), where o_t is kept in the StepProcessor while the other items come from the given Step<E: Env>.
- Finally, the transitions pushed to the ReplayBufferBase are used to create batches, each of which implements BatchBase. These batches are used in optimization steps, where the agent updates its parameters using the sampled experiences in the batches.
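To make the data flow concrete, here is a minimal, self-contained sketch of how a step processor can turn a step into a transition; the Step, Transition, and Processor types below are illustrative assumptions, not this crate's Step, StepProcessor, or item types:

use std::mem;

// Illustrative stand-ins only; not the crate's actual types.
struct Step { obs_next: Vec<f32>, act: Vec<f32>, reward: f32 }
struct Transition { obs: Vec<f32>, act: Vec<f32>, obs_next: Vec<f32>, reward: f32 }

// Keeps o_t between environment steps, mirroring how the step processor
// pairs the previous observation with the incoming step.
struct Processor { prev_obs: Vec<f32> }

impl Processor {
    fn process(&mut self, step: Step) -> Transition {
        // o_t comes from the processor's state; o_t+1, a_t, r_t come from the step.
        let obs = mem::replace(&mut self.prev_obs, step.obs_next.clone());
        Transition { obs, act: step.act, obs_next: step.obs_next, reward: step.reward }
    }
}

fn main() {
    let mut proc = Processor { prev_obs: vec![0.0] };
    let step = Step { obs_next: vec![1.0], act: vec![0.5], reward: 1.0 };
    let tr = proc.process(step);
    assert_eq!(tr.obs, vec![0.0]);      // o_t came from the processor's state
    assert_eq!(tr.obs_next, vec![1.0]); // o_t+1 came from the step
    assert_eq!(tr.act, vec![0.5]);
    assert_eq!(tr.reward, 1.0);
}

The point of the sketch is that o_t lives in the processor's state, while a_t, r_t, and o_t+1 arrive with each new step.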
Implementations§
impl Trainer

pub fn build(config: TrainerConfig) -> Self
Constructs a trainer.
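A minimal construction sketch, assuming Trainer and TrainerConfig are in scope; how the TrainerConfig is actually populated (builder methods, defaults, or loading from a file) depends on your setup, so it is left as a placeholder here:

// Sketch only: obtaining the TrainerConfig is left as a placeholder.
let config: TrainerConfig = todo!("construct or load a TrainerConfig for your setup");
let mut trainer = Trainer::build(config);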
pub fn train_step<E, A, R>(
    &mut self,
    agent: &mut A,
    buffer: &mut R,
) -> Result<(Record, bool)>
Performs a training step.
First, it performs an environment step once and pushes a transition into the given buffer with the Sampler. Then, if the number of environment steps reaches the optimization interval opt_interval, it performs an optimization step.

The second value of the returned tuple indicates whether an optimization step was performed (true).
pub fn train<E, A, P, R, D>(
    &mut self,
    env: E,
    step_proc: P,
    agent: &mut A,
    buffer: &mut R,
    recorder: &mut Box<dyn AggregateRecorder>,
    evaluator: &mut D,
) -> Result<()>
where
    E: Env,
    A: Agent<E, R>,
    P: StepProcessor<E>,
    R: ExperienceBufferBase<Item = P::Output> + ReplayBufferBase,
    D: Evaluator<E, A>,
Trains the agent online.
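A sketch of how the pieces can be wired together, written as a thin generic helper whose bounds mirror the signature above; it assumes the crate's Trainer, TrainerConfig, Env, Agent, StepProcessor, ExperienceBufferBase, ReplayBufferBase, Evaluator, AggregateRecorder, and Result are in scope (exact import paths depend on your setup):

// A thin wrapper mirroring the bounds of Trainer::train; the concrete types
// for E, A, P, R, and D come from your own environment/agent setup.
fn run_online_training<E, A, P, R, D>(
    config: TrainerConfig,
    env: E,
    step_proc: P,
    agent: &mut A,
    buffer: &mut R,
    recorder: &mut Box<dyn AggregateRecorder>,
    evaluator: &mut D,
) -> Result<()>
where
    E: Env,
    A: Agent<E, R>,
    P: StepProcessor<E>,
    R: ExperienceBufferBase<Item = P::Output> + ReplayBufferBase,
    D: Evaluator<E, A>,
{
    // Build the trainer from its configuration, then run the training loop
    // described in the "Training loop" section above.
    let mut trainer = Trainer::build(config);
    trainer.train(env, step_proc, agent, buffer, recorder, evaluator)
}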
pub fn train_offline<E, A, R, D>(
    &mut self,
    agent: &mut A,
    buffer: &mut R,
    recorder: &mut Box<dyn AggregateRecorder>,
    evaluator: &mut D,
) -> Result<()>
Trains the agent offline.
Auto Trait Implementations§
impl Freeze for Trainer
impl RefUnwindSafe for Trainer
impl Send for Trainer
impl Sync for Trainer
impl Unpin for Trainer
impl UnwindSafe for Trainer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.