Crate relearn

A reinforcement learning library.

Overview

This library defines a set of environments and learning agents and simulates their interaction.

It uses PyTorch via tch.

This library makes heavy use of the builder pattern to construct agents, environments, and other objects. A trait BuildFoo defines an interface for constructing objects that implement trait Foo, and an object Bar can often be built from a BarConfig. Configurations are often defined compositionally in terms of other configurations. Using these configuration objects creates a concise, serializable representation that uniquely identifies an agent, environment, experiment, etc. Two important builder traits are BuildEnv and BuildAgent.
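
As an illustration of this pattern, here is a minimal sketch using hypothetical Foo / BuildFoo / FooConfig / SimpleFoo names; the crate's real builder traits, such as BuildEnv and BuildAgent, differ in their details.

```rust
// A minimal sketch of the builder-trait pattern described above, using
// hypothetical `Foo` / `BuildFoo` / `FooConfig` names rather than any of
// this crate's actual types.

/// Some capability.
pub trait Foo {
    fn value(&self) -> u32;
}

/// Build an object implementing `Foo`.
pub trait BuildFoo {
    type Foo: Foo;
    fn build_foo(&self) -> Self::Foo;
}

/// A concise configuration that uniquely identifies a `Foo`.
/// A real configuration would also derive `Serialize` / `Deserialize`.
#[derive(Debug, Clone, PartialEq)]
pub struct FooConfig {
    pub value: u32,
}

/// The object built from `FooConfig`.
pub struct SimpleFoo {
    value: u32,
}

impl Foo for SimpleFoo {
    fn value(&self) -> u32 {
        self.value
    }
}

impl BuildFoo for FooConfig {
    type Foo = SimpleFoo;
    fn build_foo(&self) -> Self::Foo {
        SimpleFoo { value: self.value }
    }
}

fn main() {
    let config = FooConfig { value: 42 };
    let foo = config.build_foo();
    assert_eq!(foo.value(), 42);
}
```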

Glossary

Environments

A reinforcement learning Environment is a structure with internal state. The fundamental operation is to take a step from the current state given some action, resulting in a successor state, a reward value, and a flag indicating whether the current episode is done. The Step structure stores a description of the observable parts of an environment step.
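
The sketch below shows the shape of this interface with simplified, hypothetical signatures; the crate's actual Environment trait and Step struct are richer than this.

```rust
// A simplified sketch of the environment interface described above; the
// crate's actual `Environment` trait and `Step` struct differ in detail.

/// A stateful environment.
pub trait Environment {
    type Observation;
    type Action;

    /// Reset to an initial state and return the initial observation.
    fn reset(&mut self) -> Self::Observation;

    /// Take one step using `action`.
    ///
    /// Returns the successor observation (`None` if the successor is a
    /// terminal state), the reward, and whether the episode is done.
    fn step(&mut self, action: &Self::Action) -> (Option<Self::Observation>, f64, bool);
}

/// Description of the observable parts of one environment step.
pub struct Step<O, A> {
    pub observation: O,
    pub action: A,
    pub reward: f64,
    /// Successor observation; `None` if the successor state is terminal.
    pub next_observation: Option<O>,
    pub episode_done: bool,
}
```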

Episode

A sequence of environment steps, each following from the successor state of the previous step. The initial state is set by calling Environment::reset. An episode ends when Environment::step sets the episode_done flag in its return value. An episode may end on a terminal state, in which case all future rewards are assumed to be zero. If instead the final state is non-terminal, then there might have been non-zero future rewards had the episode continued.
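
Continuing the sketch above, an episode rollout might look like the following, with a closure standing in for an agent; run_episode is purely illustrative and not part of this crate's API.

```rust
// A sketch of running one episode with the simplified `Environment` interface
// sketched above, collecting the observed steps.
fn run_episode<E, F>(env: &mut E, mut choose_action: F) -> Vec<Step<E::Observation, E::Action>>
where
    E: Environment,
    E::Observation: Clone,
    F: FnMut(&E::Observation) -> E::Action,
{
    let mut steps = Vec::new();
    let mut observation = env.reset();
    loop {
        let action = choose_action(&observation);
        let (next_observation, reward, episode_done) = env.step(&action);
        steps.push(Step {
            observation,
            action,
            reward,
            next_observation: next_observation.clone(),
            episode_done,
        });
        if episode_done {
            return steps;
        }
        // The episode is not done, so the successor must be a real
        // (non-terminal) observation in this sketch.
        observation = next_observation.expect("non-terminal successor");
    }
}
```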

Terminal State

An environment state that immediately ends the episode with 0 future reward. In the MDP formalism (in which all episodes are infinitely long), it is a state from which every step, regardless of the action, yields 0 reward and leads to another terminal state.

Return

The discounted sum of future rewards (return = sum_i { reward_i * discount_factor ** i }). May refer to the rewards of an entire episode or the future rewards starting from a particular step.
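
For example, the empirical return of a reward sequence can be computed with a backwards fold (Horner's rule); this helper is purely illustrative and not part of the crate.

```rust
/// Discounted sum of a reward sequence:
/// rewards[0] + discount * rewards[1] + discount^2 * rewards[2] + ...
fn discounted_return(rewards: &[f64], discount_factor: f64) -> f64 {
    // Evaluate with Horner's rule, accumulating from the last reward backwards.
    rewards
        .iter()
        .rev()
        .fold(0.0, |future_return, &reward| reward + discount_factor * future_return)
}

fn main() {
    // 1.0 + 0.9 * 0.0 + 0.9^2 * 1.0 = 1.81
    let value = discounted_return(&[1.0, 0.0, 1.0], 0.9);
    assert!((value - 1.81).abs() < 1e-12);
}
```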

Space

A space is a mathematical set with some added structure, used here for defining the set of possible actions and observations of a reinforcement learning environment.

The core interface is the Space trait, with additional functionality provided by other traits in the spaces module. The actual elements of a space have type Space::Element.
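
A hypothetical sketch of this interface is given below, with an index set as a typical discrete space; the crate's actual Space trait and the traits in spaces provide more functionality than this.

```rust
// A sketch of the kind of interface a space provides; illustrative only.

/// A set with some added structure.
pub trait Space {
    /// The type of the elements of this space.
    type Element;

    /// Check whether `element` is a member of this space.
    fn contains(&self, element: &Self::Element) -> bool;
}

/// The finite set `{0, 1, ..., size - 1}`, a typical discrete space.
pub struct IndexSpace {
    pub size: usize,
}

impl Space for IndexSpace {
    type Element = usize;

    fn contains(&self, element: &Self::Element) -> bool {
        *element < self.size
    }
}
```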

Action Space

A set (EnvStructure::ActionSpace) containing all possible actions for the environment. The action space is independent of the environment state, so every action in the space is allowed in any state. An invalid action may be simulated by giving a low reward and ending the episode.

Observation Space

A set (EnvStructure::ObservationSpace) containing all possible observations an environment might produce. It may contain elements that are never actually produced as observations. To be more precise, the set of possible observations is Option<ObservationSpace> where None represents any terminal state.

Agents

An Actor interacts with an environment. An Agent is an Actor with the ability to make persistent updates. An Actor may “learn” within an episode by conditioning on the observed episode history, but only Agent::update allows learning across episodes.
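
The hypothetical signatures below, reusing the Step struct from the environment sketch above, illustrate this distinction; the crate's real Actor and Agent traits differ in their details.

```rust
// Illustrative signatures for the Actor / Agent distinction; not this
// crate's actual interfaces.
pub trait Actor<Observation, Action> {
    /// Choose an action for the current observation.
    ///
    /// `new_episode` marks the first step of an episode. An actor may
    /// condition on the history it has seen within the current episode,
    /// but nothing it learns here persists across episodes.
    fn act(&mut self, observation: &Observation, new_episode: bool) -> Action;
}

pub trait Agent<Observation, Action>: Actor<Observation, Action> {
    /// Update from an observed environment step.
    ///
    /// This is the only interface through which learning persists
    /// across episodes.
    fn update(&mut self, step: Step<Observation, Action>);
}
```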

Policy

A policy maps a sequence of episode history features to the parameters of an action distribution for the current state. A policy may use past history from within the current episode, but not history from other episodes and not information from the future.

Critic

A critic assigns a value to each step in an episode. It does so retroactively, with full access to the episode's future. It may also depend on past history within the episode, but not on information from other episodes. The value is not necessarily the (expected) return from a given state, but it should be correlated with the expected return such that higher values indicate better states and actions.

The critic is used for generating training targets when updating the policy. Examples include the empirical return and Generalized Advantage Estimation.

This usage is possibly non-standard. It is unclear to me whether the standard meaning of “critic” refers exclusively to value estimates that use only past history or whether retroactive value estimates are also included.
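
As one example of a retroactive critic, the sketch below computes Generalized Advantage Estimation over a completed episode. It assumes the episode ended in a terminal state, so the value after the final step is zero; it is illustrative rather than this crate's implementation.

```rust
/// Generalized Advantage Estimation over one completed episode.
///
/// `rewards[i]` is the reward of step `i` and `values[i]` is a value-function
/// estimate for the observation at step `i`. Assumes the episode ended in a
/// terminal state, so the value after the final step is zero.
fn gae_advantages(rewards: &[f64], values: &[f64], discount: f64, lambda: f64) -> Vec<f64> {
    assert_eq!(rewards.len(), values.len());
    let mut advantages = vec![0.0; rewards.len()];
    let mut next_value = 0.0; // value after the final (terminal) step
    let mut next_advantage = 0.0;
    for i in (0..rewards.len()).rev() {
        // TD residual: delta_i = r_i + discount * V(s_{i+1}) - V(s_i)
        let delta = rewards[i] + discount * next_value - values[i];
        // A_i = delta_i + discount * lambda * A_{i+1}
        next_advantage = delta + discount * lambda * next_advantage;
        advantages[i] = next_advantage;
        next_value = values[i];
    }
    advantages
}
```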

Value Function

A function approximator that maps a sequence of episode history features to estimates of the expected future return of each observation or observation-action pair. It may only use past history from within the episode, not information from the future or from other episodes. Some critics use value functions to improve their value estimates.

Re-exports

pub use agents::Actor;
pub use agents::Agent;
pub use agents::BuildAgent;
pub use agents::Step;
pub use defs::AgentDef;
pub use defs::EnvDef;
pub use defs::MultiThreadAgentDef;
pub use envs::EnvStructure;
pub use envs::Environment;
pub use envs::StructuredEnvironment;
pub use simulation::Simulator;

Modules

Reinforcement learning agents
Command-line interface
Definition structures
Reinforcement learning environments
Logging statistics from simulation runs
Simulating agent-environment interaction
Spaces: runtime-defined types
Torch components
General-purpose utilities

Enums

Torch optimizer definition
Sequence module definition

Traits

Functions

Run an actor-environment simulation without reward feedback.
Run an agent-environment simulation.