Crate relearn
A reinforcement learning library.
Overview
This library defines a set of environments and learning agents and simulates their interaction.
This library makes heavy use of the builder pattern to construct agents, environments, and other objects. A trait BuildFoo defines an interface for constructing objects implementing trait Foo, and an object Bar can often be built from BarConfig. Configurations are often defined compositionally in terms of other configurations. Using these configuration objects creates a concise, serializable representation that uniquely identifies an agent, environment, experiment, etc.
Two important builder traits are BuildEnv and BuildAgent.
Glossary
Environments
A reinforcement learning Environment is a stateful structure representing the world with which agents interact. The fundamental operation is to take a step from a state given some action, producing a successor state, a reward value, and a flag indicating whether the current episode is done. The Step structure stores a description of the observable parts of an environment step.
Episode
A sequence of environment steps, each following from the successor state of the previous step. The initial state is set by calling Environment::reset. An episode ends when Environment::step sets the episode_done flag in its return value.
An episode may end on a terminal state, in which case all future rewards are assumed to be zero. If instead the final state is non-terminal then there might have been non-zero future rewards had the episode continued.
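The reset/step/episode_done cycle can be sketched as follows. The environment type and method signatures here are hypothetical stand-ins (this crate's actual Environment trait may differ); the sketch only illustrates the episode loop described above.

```rust
/// Hypothetical environment: counts down from 3 to 0, one unit of reward per step.
struct CountdownEnv {
    remaining: u32,
}

/// Observable parts of a step (in the spirit of the `Step` structure).
struct EnvStep {
    observation: Option<u32>, // None on a terminal state
    reward: f64,
    episode_done: bool,
}

impl CountdownEnv {
    /// Set the initial state and return the initial observation.
    fn reset(&mut self) -> u32 {
        self.remaining = 3;
        self.remaining
    }

    /// Take one step; the action type is trivial in this toy example.
    fn step(&mut self, _action: ()) -> EnvStep {
        self.remaining -= 1;
        let done = self.remaining == 0;
        EnvStep {
            observation: if done { None } else { Some(self.remaining) },
            reward: 1.0,
            episode_done: done,
        }
    }
}

/// Run one episode to completion and return the total (undiscounted) reward.
fn run_episode(env: &mut CountdownEnv) -> f64 {
    let _observation = env.reset();
    let mut total_reward = 0.0;
    loop {
        let step = env.step(());
        total_reward += step.reward;
        if step.episode_done {
            break;
        }
    }
    total_reward
}

fn main() {
    let mut env = CountdownEnv { remaining: 0 };
    assert_eq!(run_episode(&mut env), 3.0);
}
```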
Terminal State
An environment state that immediately ends the episode with 0 future reward. In the MDP formalism (in which all episodes are infinitely long), this is a state from which every step, regardless of action, has 0 reward and leads to another terminal state.
Return
The discounted sum of future rewards: return = sum_i { reward_i * discount_factor ** i }. May refer to the rewards of an entire episode or to the future rewards starting from a particular step.
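The return can be computed in a single backward pass over a reward sequence (a generic sketch, not a function from this crate):

```rust
/// Discounted return: sum_i reward_i * discount_factor ** i.
/// Folding from the back uses the recurrence G_t = r_t + discount * G_{t+1}.
fn discounted_return(rewards: &[f64], discount_factor: f64) -> f64 {
    rewards
        .iter()
        .rev()
        .fold(0.0, |acc, &r| r + discount_factor * acc)
}

fn main() {
    // 1.0 + 0.5 * 2.0 + 0.25 * 4.0 = 3.0
    assert_eq!(discounted_return(&[1.0, 2.0, 4.0], 0.5), 3.0);
}
```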
Space
A space is a mathematical set with some added structure, used here to define the sets of possible actions and observations of a reinforcement learning environment. The core interface is the Space trait, with additional functionality provided by other traits in spaces. The elements of a space have type Space::Element.
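In that spirit, a set-with-structure interface can be sketched as a trait with an associated element type and a membership test. The trait and types below are hypothetical illustrations, not this crate's actual Space trait:

```rust
/// Hypothetical space interface: a set with an associated element type.
trait SpaceLike {
    type Element;
    /// Membership test: is `element` in this set?
    fn contains(&self, element: &Self::Element) -> bool;
}

/// The finite set {0, 1, ..., size - 1}, a common action space shape.
struct IndexSpace {
    size: usize,
}

impl SpaceLike for IndexSpace {
    type Element = usize;
    fn contains(&self, element: &usize) -> bool {
        *element < self.size
    }
}

fn main() {
    let actions = IndexSpace { size: 4 };
    assert!(actions.contains(&3));
    assert!(!actions.contains(&4));
}
```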
Action Space
A set (EnvStructure::ActionSpace) containing all possible actions for the environment. The action space is independent of the environment state, so every action in the space is allowed in every state. An invalid action may be simulated by providing a low reward and ending the episode.
Observation Space
A set (EnvStructure::ObservationSpace) containing all possible observations an environment might produce. It may also contain elements that can never be produced as observations. To be more precise, the set of possible observations is Option&lt;ObservationSpace&gt;, where None represents any terminal state.
Agents
An Actor interacts with an environment. An Agent is an Actor with the ability to persistently update. An Actor may “learn” within an episode by conditioning on the observed episode history, but only Agent::update allows learning across episodes.
Policy
A policy maps a sequence of episode history features to parameters of an action distribution for the current state. A policy may use the past within an episode but not across episodes and not from the future.
Critic
A critic assigns a value to each step in an episode. It does so retroactively with full access to the episode future. It may also depend on the past history within an episode. It may not depend on information between episodes. The value is not necessarily the (expected) return from a given state but should be correlated with expected return such that higher values indicate better states and actions.
The critic is used for generating training targets when updating the policy. Examples include the empirical return and Generalized Advantage Estimation.
This usage is possibly non-standard. It is not clear to me whether the standard use of “critic” refers exclusively to value estimates that use only past history, or whether retroactive value estimates are also included.
Value Function
A function approximator that maps a sequence of episode history features to estimates of the expected future return of each observation or observation-action pair. May only use the past history within an episode, not from the future or across episodes. Some critics use value functions to improve their value estimates.
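As an illustration, Generalized Advantage Estimation can be sketched as a retroactive critic built on top of a value function. The function below is a generic sketch of the standard GAE recurrence, not this crate's implementation; values[t] is assumed to hold the value estimate for step t, with one extra trailing entry for the final state (0.0 if that state is terminal).

```rust
/// Generalized Advantage Estimation.
/// A_t = delta_t + gamma * lambda * A_{t+1},
/// where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
fn gae_advantages(rewards: &[f64], values: &[f64], gamma: f64, lambda: f64) -> Vec<f64> {
    assert_eq!(values.len(), rewards.len() + 1);
    let mut advantages = vec![0.0; rewards.len()];
    let mut acc = 0.0;
    // Work backwards through the episode with full access to the future.
    for t in (0..rewards.len()).rev() {
        let delta = rewards[t] + gamma * values[t + 1] - values[t];
        acc = delta + gamma * lambda * acc;
        advantages[t] = acc;
    }
    advantages
}

fn main() {
    // With lambda = 1 and a zero value function, the advantage at each
    // step reduces to the empirical discounted return from that step.
    let adv = gae_advantages(&[1.0, 1.0, 1.0], &[0.0; 4], 0.5, 1.0);
    assert_eq!(adv, vec![1.75, 1.5, 1.0]);
}
```

This shows how a critic can combine a value function (the forward-looking estimator) with observed future rewards (retroactive information) to produce training targets for the policy.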
Re-exports
pub use agents::Actor;
pub use agents::Agent;
pub use agents::BuildAgent;
pub use agents::Step;
pub use defs::AgentDef;
pub use defs::EnvDef;
pub use defs::MultiThreadAgentDef;
pub use envs::EnvStructure;
pub use envs::Environment;
pub use envs::StructuredEnvironment;
pub use simulation::Simulator;
Modules
Reinforcement learning agents
Command-line interface
Definition structures
Reinforcement learning environments
Logging statistics from simulation runs
Simulating agent-environment interaction
Spaces: runtime-defined types
Torch components
General-purpose utilities
Enums
Torch optimizer definition
Sequence module definition
Traits
Build an Environment.