Reinforcement Learning: An Overview

Paper · arXiv 2412.05265 · Published December 6, 2024
Reinforcement Learning

Introduction. Reinforcement learning or RL is a class of methods for solving various kinds of sequential decision making tasks. In such tasks, we want to design an agent that interacts with an external environment. The agent maintains an internal state zt, which it passes to its policy π to choose an action at = π(zt). The environment responds by sending back an observation ot+1, which the agent uses to update its internal state using the state-update function zt+1 = SU(zt, at, ot+1). See Figure 1.1 for an illustration. Note that we often assume that the observation ot corresponds to the true environment or world state wt; in this case, we denote the internal agent state and external environment state by the same letter, namely st. We discuss this issue in more detail in Section 1.1.3.

Discussion / Conclusion. 7.8 General RL, AIXI and universal AGI where Pr(p) is the prior probability of p, and we assume the likelihood is 1 if p can generate the observations given the actions, and is 0 otherwise. One important question is: what is a reasonable prior over programs? In [Hut05], Marcus Hutter proposed to apply the idea of Solomonoff induction [Sol64] to the case of an online decision making agent. This amounts to using the prior Pr(p) = 2−l(p), where l(p) is the length of program p. This prior favors shorter programs, and the likelihood filters out programs that cannot explain the data. The resulting agent is known as AIXI, where “AI” stands for “Artificial Intelligence” and “XI” referring to the Greek letter ξ used in Solomonoff induction. The AIXI agent has been called the “most intelligent general-purpose agent possible” [HQC24], and can be viewed as the theoretical foundation of (universal) artificial general intelligence or AGI. Unfortunately, the AIXI agent is intractable to compute, for two main reasons: (1) it relies on Solomonoff induction and Kolmogorov complexity, both of which are intractable; and (2) the expectimax computation is intractable. Fortunately, various tractable approximations have been devised.