
A Distributional Perspective on Reinforcement Learning


Marc G. Bellemare*, Will Dabney*, Rémi Munos

Abstract

In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning, which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter.

We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.

1. Introduction

One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility Q, or value (Sutton & Barto, 1998).

Bellman's equation succinctly describes this value in terms of the expected reward and the expected outcome of the random transition (x, a) → (X', A'):

    Q(x,a) = \mathbb{E}\, R(x,a) + \gamma\, \mathbb{E}\, Q(X', A').

In this paper, we aim to go beyond the notion of value and argue in favour of a distributional perspective on reinforcement learning.* Specifically, the main object of our study is the random return Z whose expectation is the value Q. This random return is also described by a recursive equation, but one of a distributional nature:

    Z(x,a) \overset{D}{=} R(x,a) + \gamma\, Z(X', A').

The distributional Bellman equation states that the distribution of Z is characterized by the interaction of three random variables: the reward R, the next state-action pair (X', A'), and its random return Z(X', A').

* Equal contribution. 1 DeepMind, London, UK. Correspondence to: Marc G. Bellemare. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
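To make the recursion concrete, the following sketch (not code from the paper; the sample-based representation and the helper names sample_transition and z_samples are assumptions for illustration) approximates the distribution of Z(x,a) by drawing Monte Carlo samples of R(x,a) + γ Z(X',A'):

```python
import numpy as np

# Minimal sketch (not code from the paper): a sample-based reading of the
# distributional Bellman equation Z(x,a) =_D R(x,a) + gamma * Z(X',A').
# `sample_transition` and `z_samples` are hypothetical interfaces.
def distributional_backup(sample_transition, z_samples, gamma, n_samples=1000):
    """Draw samples of R(x,a) + gamma * Z(X',A') for one state-action pair.

    sample_transition() returns (reward, next_state, next_action), drawn from
    the reward distribution, the transition kernel P, and the policy pi.
    z_samples[(x', a')] is an array of return samples approximating Z(x', a').
    """
    rng = np.random.default_rng()
    backed_up = np.empty(n_samples)
    for i in range(n_samples):
        r, x_next, a_next = sample_transition()
        z_next = rng.choice(z_samples[(x_next, a_next)])  # one sample of Z(X',A')
        backed_up[i] = r + gamma * z_next
    return backed_up  # empirical approximation of the distribution of Z(x,a)
```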

By analogy with the well-known case, we call this quantity the value distribution.

Although the distributional perspective is almost as old as Bellman's equation itself (Jaquette, 1973; Sobel, 1982; White, 1988), in reinforcement learning it has thus far been subordinated to specific purposes: to model parametric uncertainty (Dearden et al., 1998), to design risk-sensitive algorithms (Morimura et al., 2010b;a), or for theoretical analysis (Azar et al., 2012; Lattimore & Hutter, 2012). By contrast, we believe the value distribution has a central role to play in reinforcement learning.

Contraction of the policy evaluation Bellman operator. Basing ourselves on results by Rösler (1992), we show that, for a fixed policy, the Bellman operator over value distributions is a contraction in a maximal form of the Wasserstein (also called Kantorovich or Mallows) metric.
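For intuition about the metric itself, here is a minimal sketch, assuming equal-sized empirical samples, of the 1-Wasserstein distance between two return distributions; it is illustrative only and does not implement the maximal form used in the paper's analysis:

```python
import numpy as np

# Illustrative only (not the paper's code): for two equal-sized empirical
# distributions, the 1-Wasserstein distance is the mean absolute difference
# between their sorted samples. The paper's result concerns a maximal
# (supremum over state-action pairs) form of the p-Wasserstein metric.
def wasserstein_1(samples_p, samples_q):
    p, q = np.sort(np.asarray(samples_p)), np.sort(np.asarray(samples_q))
    assert p.shape == q.shape, "sketch assumes equal sample counts"
    return np.mean(np.abs(p - q))

# Example: shifting a return distribution by 0.5 gives W1 close to 0.5.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=10_000)
print(wasserstein_1(z, z + 0.5))
```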

Our particular choice of metric matters: the same operator is not a contraction in total variation, Kullback-Leibler divergence, or Kolmogorov distance.

Instability in the control setting. We will demonstrate an instability in the distributional version of Bellman's optimality equation, in contrast to the policy evaluation case. Specifically, although the optimality operator is a contraction in expected value (matching the usual optimality result), it is not a contraction in any metric over distributions. These results provide evidence in favour of learning algorithms that model the effects of nonstationary policies.

Better approximations. From an algorithmic standpoint, there are many benefits to learning an approximate distribution rather than its approximate expectation. The distributional Bellman operator preserves multimodality in value distributions, which we believe leads to more stable learning.

Approximating the full distribution also mitigates the effects of learning from a nonstationary policy. As a whole, we argue that this approach makes approximate reinforcement learning significantly better behaved.

We will illustrate the practical benefits of the distributional perspective in the context of the Arcade Learning Environment (Bellemare et al., 2013). By modelling the value distribution within a DQN agent (Mnih et al., 2015), we obtain considerably increased performance across the gamut of benchmark Atari 2600 games, and in fact achieve state-of-the-art performance on a number of games. Our results echo those of Veness et al. (2015), who obtained extremely fast learning by predicting Monte Carlo returns.

From a supervised learning perspective, learning the full value distribution might seem obvious: why restrict ourselves to the mean?

The main distinction, of course, is that in our setting there are no given targets. Instead, we use Bellman's equation to make the learning process tractable; we must, as Sutton & Barto (1998) put it, "learn a guess from a guess". It is our belief that this guesswork ultimately carries more benefits than costs.

2. Setting

We consider an agent interacting with an environment in the standard fashion: at each step, the agent selects an action based on its current state, to which the environment responds with a reward and the next state. We model this interaction as a time-homogeneous Markov Decision Process (X, A, R, P, γ). As usual, X and A are respectively the state and action spaces, P is the transition kernel P(· | x, a), γ ∈ [0, 1] is the discount factor, and R is the reward function, which in this work we explicitly treat as a random variable.
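As a purely illustrative reading of this setting, one might encode the tuple as follows; the field names are hypothetical, and the reward appears as a sampling function to reflect its treatment as a random variable:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

# Sketch of the tuple (X, A, R, P, gamma) described above. Field names are
# illustrative assumptions, not notation from the paper.
@dataclass
class MDP:
    states: Sequence[Hashable]    # X
    actions: Sequence[Hashable]   # A
    sample_reward: Callable[[Hashable, Hashable], float]        # (x, a) -> sample of R(x, a)
    sample_next_state: Callable[[Hashable, Hashable], Hashable] # (x, a) -> x' ~ P(. | x, a)
    gamma: float                  # discount factor in [0, 1]
```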

A stationary policy π maps each state x ∈ X to a probability distribution over the action space A.

2.1. Bellman's Equations

The return Z^π is the sum of discounted rewards along the agent's trajectory of interactions with the environment. The value function Q^π of a policy π describes the expected return from taking action a ∈ A from state x ∈ X, then acting according to π:

    Q^\pi(x,a) := \mathbb{E}\, Z^\pi(x,a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \right],    (1)

    x_t \sim P(\cdot \mid x_{t-1}, a_{t-1}), \quad a_t \sim \pi(\cdot \mid x_t), \quad x_0 = x, \; a_0 = a.

Fundamental to reinforcement learning is the use of Bellman's equation (Bellman, 1957) to describe the value function:

    Q^\pi(x,a) = \mathbb{E}\, R(x,a) + \gamma\, \mathbb{E}_{P,\pi}\, Q^\pi(x', a').

In reinforcement learning we are typically interested in acting so as to maximize the return.

[Figure 1: A distributional Bellman operator with a deterministic reward function: (a) Next state distribution under policy π, (b) Discounting shrinks the distribution towards 0, (c) The reward shifts it, and (d) Projection step (Section 4).]
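Read operationally, Eq. (1) can be estimated by Monte Carlo rollouts. The sketch below is illustrative only, reuses the hypothetical MDP interface sketched above, and truncates the infinite sum at a fixed horizon:

```python
import numpy as np

# Hedged sketch of Eq. (1): estimate Q^pi(x, a) by averaging truncated
# discounted returns over sampled trajectories. The empirical returns also
# approximate Z^pi(x, a). `mdp` and `policy` use the hypothetical interfaces
# from the MDP sketch above.
def monte_carlo_returns(mdp, policy, x, a, horizon=200, n_rollouts=100):
    returns = np.empty(n_rollouts)
    for k in range(n_rollouts):
        state, action, g, discount = x, a, 0.0, 1.0
        for _ in range(horizon):                            # truncate the infinite sum
            g += discount * mdp.sample_reward(state, action)
            discount *= mdp.gamma
            state = mdp.sample_next_state(state, action)    # x_t ~ P(. | x_{t-1}, a_{t-1})
            action = policy(state)                          # a_t ~ pi(. | x_t)
        returns[k] = g
    return returns.mean(), returns   # Q^pi estimate, samples approximating Z^pi
```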

The most common approach for doing so involves the optimality equation

    Q^*(x,a) = \mathbb{E}\, R(x,a) + \gamma\, \mathbb{E}_P \max_{a' \in A} Q^*(x', a').

This equation has a unique fixed point Q^*, the optimal value function, corresponding to the set of optimal policies Π^* (π^* is optimal if \mathbb{E}_{a \sim \pi^*} Q^*(x,a) = \max_a Q^*(x,a)).

We view value functions as vectors in \mathbb{R}^{X \times A}, and the expected reward function as one such vector. In this context, the Bellman operator T^π and optimality operator T are

    T^\pi Q(x,a) := \mathbb{E}\, R(x,a) + \gamma\, \mathbb{E}_{P,\pi}\, Q(x', a'),    (2)

    T Q(x,a) := \mathbb{E}\, R(x,a) + \gamma\, \mathbb{E}_P \max_{a' \in A} Q(x', a').    (3)

These operators are useful as they describe the expected behaviour of popular learning algorithms such as SARSA and Q-learning. In particular, they are both contraction mappings, and their repeated application to some initial Q_0 converges exponentially to Q^π or Q^*, respectively (Bertsekas & Tsitsiklis, 1996).
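To see this convergence in action, the following sketch applies the optimality operator of Eq. (3) repeatedly to a tabular Q with known expected rewards and transition probabilities; this is ordinary Q-value iteration, included only to illustrate the contraction property, not a method proposed in the paper:

```python
import numpy as np

# Sketch of repeated application of the optimality operator T in Eq. (3) on a
# tabular Q, assuming known expected rewards r[x, a] and transition
# probabilities p[x, a, x'] (array shapes are illustrative assumptions).
def q_value_iteration(r, p, gamma, n_iters=500):
    q = np.zeros_like(r)            # initial Q_0
    for _ in range(n_iters):
        v = q.max(axis=1)           # max_{a'} Q(x', a')
        q = r + gamma * (p @ v)     # (T Q)(x, a), Eq. (3)
    return q                        # approaches Q* as n_iters grows (contraction)
```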

3. The Distributional Bellman Operators

In this paper we take away the expectations inside Bellman's equations and consider instead the full distribution of the random variable Z^π. From here on, we will view Z^π as a mapping from state-action pairs to distributions over returns, and call it the value distribution.

Our first aim is to gain an understanding of the theoretical behaviour of the distributional analogues of the Bellman operators, in particular in the less well-understood control setting. The reader strictly interested in the algorithmic contribution may choose to skip this section.

3.1. Distributional Equations

It will sometimes be convenient to make use of the probability space (Ω, F, Pr). The reader unfamiliar with measure theory may think of Ω as the space of all possible outcomes of an experiment (Billingsley, 1995).
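Before turning to the operators themselves, a minimal illustration of the preceding view of Z^π as a mapping from state-action pairs to return distributions (the sample-based representation is an assumption for illustration, not the parametrization used later in the paper):

```python
import numpy as np

# Illustration only: the value distribution Z^pi as a mapping from
# state-action pairs to return distributions, here represented by sample
# arrays. Taking means recovers the ordinary value function Q^pi.
value_distribution = {
    ("x0", "a0"): np.array([1.0, 1.0, 3.0, 5.0]),   # a multimodal return distribution
    ("x0", "a1"): np.array([2.0, 2.1, 1.9, 2.0]),
}
q_function = {sa: z.mean() for sa, z in value_distribution.items()}
```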

