
Rainbow: Combining Improvements in Deep Reinforcement Learning



Transcription of Rainbow: Combining Improvements in Deep Reinforcement Learning

Rainbow: Combining Improvements in Deep Reinforcement Learning

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver (DeepMind)

Abstract

The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.

Introduction

The many recent successes in scaling reinforcement learning (RL) to complex sequential decision-making problems were kick-started by the Deep Q-Networks algorithm (DQN; Mnih et al. 2013, 2015). Its combination of Q-learning with convolutional neural networks and experience replay enabled it to learn, from raw pixels, how to play many Atari games at human-level performance. Since then, many extensions have been proposed that enhance its speed or stability.

Double DQN (DDQN; van Hasselt, Guez, and Silver 2016) addresses an overestimation bias of Q-learning (van Hasselt 2010) by decoupling the selection and evaluation of the bootstrap action.
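To make the decoupling concrete, the following minimal numpy sketch (hypothetical helpers, not the paper's implementation) contrasts the standard Q-learning bootstrap with the Double Q-learning bootstrap, given the online and target networks' action-value vectors at the next state:

import numpy as np

def q_learning_target(r, gamma, q_next_target):
    # Standard DQN: the target network both selects and evaluates the
    # bootstrap action, which tends to overestimate action values.
    return r + gamma * np.max(q_next_target)

def double_q_learning_target(r, gamma, q_next_online, q_next_target):
    # Double DQN: the online network selects the bootstrap action,
    # while the target network evaluates it.
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]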

Prioritized experience replay (Schaul et al. 2015) improves data efficiency by replaying more often the transitions from which there is more to learn. The dueling network architecture (Wang et al. 2016) helps to generalize across actions by separately representing state values and action advantages. Learning from multi-step bootstrap targets (Sutton 1988; Sutton and Barto 1998), as used in A3C (Mnih et al. 2016), shifts the bias-variance trade-off and helps to propagate newly observed rewards faster to earlier visited states. Distributional Q-learning (Bellemare, Dabney, and Munos 2017) learns a categorical distribution of discounted returns, instead of estimating the mean. Noisy DQN (Fortunato et al. 2017) uses stochastic network layers for exploration.
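As one concrete example of these extensions, here is a minimal sketch of the dueling aggregator (an illustrative helper, assuming a scalar value estimate and a vector of advantage estimates produced by the two streams):

import numpy as np

def dueling_aggregator(value, advantages):
    # Dueling networks (Wang et al. 2016) merge a state-value stream and an
    # advantage stream, subtracting the mean advantage so the decomposition
    # is identifiable: q(s, a) = v(s) + adv(s, a) - mean_a' adv(s, a').
    advantages = np.asarray(advantages, dtype=float)
    return float(value) + advantages - advantages.mean()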

This list is, of course, far from exhaustive. Each of these algorithms enables substantial performance improvements in isolation. Since they do so by addressing different issues, and since they build on a shared framework, they could plausibly be combined. In some cases this has been done: Prioritized DDQN and Dueling DDQN both use double Q-learning, and Dueling DDQN was also combined with prioritized experience replay. In this paper we propose to study an agent that combines all the aforementioned ingredients. We show how these different ideas can be integrated, and that they are indeed largely complementary. In fact, their combination results in new state-of-the-art results on the benchmark suite of 57 Atari 2600 games from the Arcade Learning Environment (Bellemare et al. 2013), both in terms of data efficiency and of final performance.

[Figure 1: Median human-normalized performance across 57 Atari games, plotted against millions of frames. We compare our integrated agent (rainbow-colored) to DQN (grey) and six published baselines (DDQN, Prioritized DDQN, Dueling DDQN, A3C, Distributional DQN, Noisy DQN). Note that we match DQN's best performance after 7M frames, surpass any baseline within 44M frames, and reach substantially improved final performance. Curves are smoothed with a moving average over 5 points.]

Finally, we show results from ablation studies to help understand the contributions of the different components.

Background

Reinforcement learning addresses the problem of an agent learning to act in an environment in order to maximize a scalar reward signal. No direct supervision is provided to the agent; for instance, it is never directly told the best action.

Agents and environments. At each discrete time step $t = 0, 1, 2, \ldots$, the environment provides the agent with an observation $S_t$, the agent responds by selecting an action $A_t$, and then the environment provides the next reward $R_{t+1}$, discount $\gamma_{t+1}$, and state $S_{t+1}$. This interaction is formalized as a Markov Decision Process, or MDP, which is a tuple $\langle \mathcal{S}, \mathcal{A}, T, r, \gamma \rangle$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $T(s, a, s') = P[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the (stochastic) transition function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ is the reward function, and $\gamma \in [0, 1]$ is a discount factor.
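A minimal sketch of this interaction protocol, assuming hypothetical `env` and `agent` objects whose interfaces mirror the quantities above (they are not part of the paper):

def run_episode(env, agent):
    # One episode of agent-environment interaction: at each step the agent
    # picks A_t given S_t, and the environment returns R_{t+1},
    # gamma_{t+1} (0 on termination) and S_{t+1}.
    s = env.reset()
    gamma = 1.0
    while gamma > 0.0:
        a = agent.act(s)
        r, gamma, s_next = env.step(a)
        agent.observe(s, a, r, gamma, s_next)
        s = s_next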

In our experiments MDPs will be episodic with a constant discount $\gamma_t = \gamma$, except on episode termination where $\gamma_t = 0$, but the algorithms are expressed in the general form. On the agent side, action selection is given by a policy $\pi$ that defines a probability distribution over actions for each state. From the state $S_t$ encountered at time $t$, we define the discounted return $G_t = \sum_{k=0}^{\infty} \gamma_t^{(k)} R_{t+k+1}$ as the discounted sum of future rewards collected by the agent, where the discount for a reward $k$ steps in the future is given by the product of discounts before that time, $\gamma_t^{(k)} = \prod_{i=1}^{k} \gamma_{t+i}$. An agent aims to maximize the expected discounted return by finding a good policy. The policy may be learned directly, or it may be constructed as a function of some other learned quantities.
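A small sketch of this computation (an illustrative helper, not from the paper), where `rewards[k]` holds $R_{t+k+1}$ and `discounts[k]` holds $\gamma_{t+k+1}$; in the episodic setting the final discount is 0, which truncates the sum:

def discounted_return(rewards, discounts):
    # G_t = sum_k gamma_t^(k) * R_{t+k+1}, with gamma_t^(k) the product of
    # the discounts gamma_{t+1} ... gamma_{t+k} (an empty product, 1, for k = 0).
    g, prod = 0.0, 1.0
    for r, gamma in zip(rewards, discounts):
        g += prod * r
        prod *= gamma
    return g

# For example, rewards [1, 0, 2] with discounts [0.99, 0.99, 0.0]
# give G_t = 1 + 0 + 0.99 * 0.99 * 2 = 2.9602.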

In value-based reinforcement learning, the agent learns an estimate of the expected discounted return, or value, when following a policy $\pi$ starting from a given state, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, or state-action pair, $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. A common way of deriving a new policy from a state-action value function is to act $\epsilon$-greedily with respect to the action values. This corresponds to taking the action with the highest value (the greedy action) with probability $1 - \epsilon$, and otherwise acting uniformly at random with probability $\epsilon$. Policies of this kind are used to introduce a form of exploration: by randomly selecting actions that are sub-optimal according to its current estimates, the agent can discover and correct its estimates when appropriate.
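A minimal sketch of $\epsilon$-greedy action selection over a vector of estimated action values (the helper name and RNG choice are illustrative):

import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    # With probability epsilon pick an action uniformly at random
    # (exploration); otherwise pick the greedy action (exploitation).
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# e.g. epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.05) usually returns 1.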

The main limitation is that it is difficult to discover alternative courses of action that extend far into the future; this has motivated research on more directed forms of exploration.

Deep reinforcement learning and DQN. Large state and/or action spaces make it intractable to learn Q value estimates for each state and action pair independently. In deep reinforcement learning, we represent the various components of agents, such as policies $\pi(s, a)$ or values $q(s, a)$, with deep (i.e., multi-layer) neural networks. The parameters of these networks are trained by gradient descent to minimize some suitable loss function.

In DQN (Mnih et al. 2015) deep networks and reinforcement learning were successfully combined by using a convolutional neural net to approximate the action values for a given state $S_t$ (which is fed as input to the network in the form of a stack of raw pixel frames).
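A minimal PyTorch sketch of such a convolutional action-value network, taking four stacked 84x84 grayscale frames as input, with layer sizes in the spirit of Mnih et al. (2015); the class name and exact sizes are illustrative, not taken from the paper:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Convolutional action-value network q_theta(s, .): maps a stack of
    # frames to one value estimate per action.
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames):        # frames: [batch, 4, 84, 84], scaled to [0, 1]
        return self.net(frames)       # returns [batch, num_actions]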

At each step, based on the current state, the agent selects an action $\epsilon$-greedily with respect to the action values, and adds a transition $(S_t, A_t, R_{t+1}, \gamma_{t+1}, S_{t+1})$ to a replay memory buffer (Lin 1992), which holds the last million transitions. The parameters of the neural network are optimized by using stochastic gradient descent to minimize the loss

$(R_{t+1} + \gamma_{t+1} \max_{a'} q_{\bar{\theta}}(S_{t+1}, a') - q_\theta(S_t, A_t))^2, \quad (1)$

where $t$ is a time step randomly picked from the replay memory. The gradient of the loss is back-propagated only into the parameters $\theta$ of the online network (which is also used to select actions); the term $\bar{\theta}$ represents the parameters of a target network, a periodic copy of the online network which is not directly optimized.
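A sketch of this loss and the target-network bookkeeping in PyTorch, assuming `online_net` and `target_net` are action-value networks like the one above and `batch` holds tensors sampled from the replay memory (all names are illustrative):

import torch

def dqn_loss(online_net, target_net, batch):
    # Squared TD error of equation (1), averaged over a sampled minibatch.
    # s, s_next: stacked frames; a: int64 action indices; r, gamma: floats.
    s, a, r, gamma, s_next = batch
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q_theta(S_t, A_t)
    with torch.no_grad():                                       # no gradient into the target net
        target = r + gamma * target_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()

# Periodically, the target network is refreshed as a copy of the online one:
# target_net.load_state_dict(online_net.state_dict())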

