
Dueling Network Architectures for Deep Reinforcement Learning

Transcription of Dueling Network Architectures for Deep Reinforcement Learning

Dueling Network Architectures for Deep Reinforcement Learning
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas
DeepMind, London, UK

Abstract

In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions.

Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.

1. Introduction

Over the past years, deep learning has contributed to dramatic advances in scalability and performance of machine learning (LeCun et al., 2015). One exciting application is the sequential decision-making setting of reinforcement learning (RL) and control. Notable examples include deep Q-learning (Mnih et al., 2015), deep visuomotor policies (Levine et al., 2015), attention with recurrent networks (Ba et al., 2015), and model predictive control with embeddings (Watter et al., 2015). Other recent successes include massively parallel frameworks (Nair et al., 2015) and expert move prediction in the game of Go (Maddison et al., 2015),

which produced policies matching those of Monte Carlo tree search programs, and squarely beat a professional player when combined with search (Silver et al., 2016).

In spite of this, most of the approaches for RL use standard neural networks, such as convolutional networks, MLPs, LSTMs and autoencoders. The focus in these recent advances has been on designing improved control and RL algorithms, or simply on incorporating existing neural network architectures into RL methods. Here, we take an alternative but complementary approach of focusing primarily on innovating a neural network architecture that is better suited for model-free RL.

This approach has the benefit that the new network can be easily combined with existing and future algorithms for RL. That is, this paper advances a new network (Figure 1), but uses already published algorithms.

The proposed network architecture, which we name the dueling architecture, explicitly separates the representation of state values and (state-dependent) action advantages. The dueling architecture consists of two streams that represent the value and advantage functions, while sharing a common convolutional feature learning module.

Figure 1. A popular single stream Q-network (top) and the dueling Q-network (bottom). The dueling network has two streams to separately estimate (scalar) state-value and the advantages for each action; the green output module implements equation (9) to combine them. Both networks output Q-values for each action.

The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q, as shown in Figure 1. This dueling network should be understood as a single Q-network with two streams that replaces the popular single-stream Q-network in existing algorithms such as Deep Q-Networks (DQN; Mnih et al., 2015). The dueling network automatically produces separate estimates of the state value function and advantage function, without any extra supervision.

Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state. This is particularly useful in states where its actions do not affect the environment in any relevant way. To illustrate this, consider the saliency maps shown in Figure 2.
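To make the two-stream structure concrete, here is a minimal PyTorch sketch of a dueling Q-network. It is not the paper's exact Atari model: the class name DuelingQNetwork, the fully-connected feature module, and the layer sizes are illustrative assumptions. It only shows the shared feature module, the separate value and advantage streams, and an aggregation of the form Q = V + (A - mean(A)), the mean-subtracted combination used to keep the two streams identifiable (the role played by equation (9) in the figure caption).

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Sketch of a dueling Q-network: shared features feeding value and advantage streams.

    The fully-connected feature extractor and layer sizes are illustrative;
    the paper uses a shared convolutional module for Atari frames.
    """

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        # Shared feature module (stands in for the shared convolutional layers).
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Value stream: a single scalar V(s) per state.
        self.value = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        # Advantage stream: one value A(s, a) per action.
        self.advantage = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        phi = self.features(state)
        v = self.value(phi)        # shape: (batch, 1)
        a = self.advantage(phi)    # shape: (batch, num_actions)
        # Aggregating layer: Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')).
        # Subtracting the mean keeps the value and advantage estimates identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```

Because the module still outputs one Q-value per action, it can be dropped into a DQN-style training loop unchanged, which is the "no change to the underlying RL algorithm" property claimed in the abstract.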

These maps were generated by computing the Jacobians of the trained value and advantage streams with respect to the input video, following the method proposed by Simonyan et al. (2013). (The experimental section describes this methodology in more detail.) The figure shows the value and advantage saliency maps for two different time steps. In one time step (leftmost pair of images), we see that the value stream pays attention to the road and in particular to the horizon, where new cars appear. It also pays attention to the score. The advantage stream, on the other hand, does not pay much attention to the visual input because its action choice is practically irrelevant when there are no cars in front. However, in the second time step (rightmost pair of images) the advantage stream pays attention as there is a car immediately in front, making its choice of action very relevant.

In the experiments, we demonstrate that the dueling architecture can more quickly identify the correct action during policy evaluation as redundant or similar actions are added to the learning problem. We also evaluate the gains brought in by the dueling architecture on the challenging Atari 2600 testbed.
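The gradient-based saliency idea can be outlined in a few lines. The helper below, stream_saliency, is a hypothetical name and assumes a network that exposes features, value, and advantage sub-modules as in the sketch above; it simply differentiates a scalar summary of one stream with respect to the input, in the spirit of Simonyan et al. (2013), and is not the paper's exact visualization code.

```python
import torch

def stream_saliency(net: torch.nn.Module, state: torch.Tensor,
                    stream: str = "value") -> torch.Tensor:
    """Input-gradient saliency for one stream of a dueling network (illustrative)."""
    x = state.clone().detach().requires_grad_(True)
    phi = net.features(x)
    if stream == "value":
        # Differentiate the scalar state value V(s).
        out = net.value(phi).sum()
    else:
        # For the advantage stream, use the maximal advantage as the scalar summary.
        out = net.advantage(phi).max(dim=1).values.sum()
    out.backward()
    # The absolute input gradient serves as the saliency map.
    return x.grad.abs()
```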

Here, an RL agent with the same structure and hyper-parameters must be able to play 57 different games by observing image pixels and game scores only. The results illustrate vast improvements over the single-stream baselines of Mnih et al. (2015) and van Hasselt et al. (2015). The combination of prioritized replay (Schaul et al., 2016) with the proposed dueling network results in the new state-of-the-art for this popular domain.

Figure 2. See, attend and drive: Value and advantage saliency maps (red-tinted overlay) on the Atari game Enduro, for a trained dueling architecture. The value stream learns to pay attention to the road. The advantage stream learns to pay attention only when there are cars immediately in front, so as to avoid collisions.

2. Related Work

The notion of maintaining separate value and advantage functions goes back to Baird (1993).

In Baird's original advantage updating algorithm, the shared Bellman residual update equation is decomposed into two updates: one for a state value function, and one for its associated advantage function. Advantage updating was shown to converge faster than Q-learning in simple continuous time domains in (Harmon et al., 1995). Its successor, the advantage learning algorithm, represents only a single advantage function (Harmon & Baird, 1996).

The dueling architecture represents both the value V(s) and advantage A(s,a) functions with a single deep model whose output combines the two to produce a state-action value Q(s,a). Unlike in advantage updating, the representation and algorithm are decoupled by construction. Consequently, the dueling architecture can be used in combination with a myriad of model-free RL algorithms.

There is a long history of advantage functions in policy gradients, starting with (Sutton et al., 2000).

As a recent example of this line of work, Schulman et al. (2015) estimate advantage values online to reduce the variance of policy gradient algorithms.

There have been several attempts at playing Atari with deep reinforcement learning, including Mnih et al. (2015); Guo et al. (2014); Stadie et al. (2015); Nair et al. (2015); van Hasselt et al. (2015); Bellemare et al. (2016) and Schaul et al. (2016). The results of Schaul et al. (2016) are the current published state-of-the-art.

3. Background

We consider a sequential decision making setup, in which an agent interacts with an environment $\mathcal{E}$ over discrete time steps; see Sutton & Barto (1998) for an introduction. In the Atari domain, for example, the agent perceives a video $s_t$ consisting of $M$ image frames, $s_t = (x_{t-M+1}, \ldots, x_t) \in \mathcal{S}$, at time step $t$.

The agent then chooses an action from a discrete set $a_t \in \mathcal{A} = \{1, \ldots, |\mathcal{A}|\}$ and observes a reward signal $r_t$ produced by the game emulator.

The agent seeks to maximize the expected discounted return, where we define the discounted return as $R_t = \sum_{\tau = t}^{\infty} \gamma^{\tau - t} r_\tau$. In this formulation, $\gamma \in [0, 1]$ is a discount factor that trades off the importance of immediate and future rewards.

For an agent behaving according to a stochastic policy $\pi$, the values of the state-action pair $(s, a)$ and the state $s$ are defined as follows:

$$Q^{\pi}(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right], \quad \text{and} \quad V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s, a) \right]. \tag{1}$$

The preceding state-action value function ($Q$ function for short) can be computed recursively with dynamic programming:

$$Q^{\pi}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma\, \mathbb{E}_{a' \sim \pi(s')}\left[ Q^{\pi}(s', a') \right] \,\middle|\, s, a, \pi \right].$$

We define the optimal $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$.
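As a concrete illustration of these definitions, the sketch below evaluates $Q^{\pi}$ and $V^{\pi}$ for a tiny two-state, two-action MDP by iterating the dynamic-programming recursion above until convergence. The transition table, rewards, uniform policy, and $\gamma = 0.9$ are all invented for illustration; they are not from the paper.

```python
import numpy as np

# Toy MDP (invented for illustration): 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],   # pi[s, a]: a uniformly random policy
               [0.5, 0.5]])
gamma = 0.9

# Iterate Q(s, a) <- E_{s'}[ r + gamma * E_{a' ~ pi(s')}[ Q(s', a') ] ].
Q = np.zeros_like(R)
for _ in range(1000):
    V = (pi * Q).sum(axis=1)   # V(s) = E_{a ~ pi(s)}[ Q(s, a) ], as in equation (1)
    Q = R + gamma * (P @ V)    # expectation over next states s'

print("Q^pi:\n", Q)
print("V^pi:", (pi * Q).sum(axis=1))
```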

