
Asynchronous Methods for Deep Reinforcement Learning




Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
1 Google DeepMind
2 Montreal Institute for Learning Algorithms (MILA), University of Montreal

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU.

Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

1. Introduction

Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated.

By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms. Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy. In this paper we provide a very different paradigm for deep reinforcement learning.
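For contrast with the asynchronous approach introduced next, here is a minimal sketch of the experience replay pattern described above; the class and method names are illustrative, not from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay memory: store transitions, sample random mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Sampling across random time-steps decorrelates the updates, but the
        # sampled data may come from an older policy, hence off-policy learning.
        return random.sample(list(self.buffer), batch_size)
```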

Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.
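The following is a minimal, self-contained sketch of this parallel actor-learner idea (an illustration, not the paper's implementation): several threads each interact with their own environment instance and apply asynchronous, Hogwild!-style updates to a shared parameter vector. The toy environment, policy, and update rule are all placeholders.

```python
import threading
import numpy as np

class ToyEnv:
    """Tiny stand-in environment so the sketch is self-contained."""
    n_actions = 2
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def reset(self):
        return self.rng.normal(size=4)
    def step(self, action):
        next_state = self.rng.normal(size=4)
        reward = float(action == 1)              # dummy reward signal
        done = self.rng.random() < 0.05          # random episode termination
        return next_state, reward, done

shared_theta = np.zeros(4)                       # parameters shared by all threads

def actor_learner(env, n_steps=1000, lr=1e-3):
    """One actor-learner thread: acts in its own environment copy and
    asynchronously applies updates to the shared parameters."""
    state = env.reset()
    for _ in range(n_steps):
        action = env.rng.integers(env.n_actions)  # placeholder for a policy using shared_theta
        next_state, reward, done = env.step(action)
        shared_theta += lr * reward * state       # placeholder lock-free (Hogwild!-style) update
        state = env.reset() if done else next_state

# One environment instance per thread, so at any time the threads see different states,
# which is what decorrelates the training data.
envs = [ToyEnv(seed=i) for i in range(4)]
threads = [threading.Thread(target=actor_learner, args=(e,)) for e in envs]
for t in threads:
    t.start()
for t in threads:
    t.join()
```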

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far less resource than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.

2. Related Work

The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting.

In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN.

We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015). In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication. (Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting.

These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

3. Reinforcement Learning Background

We consider the standard reinforcement learning setting where an agent interacts with an environment $\mathcal{E}$ over a number of discrete time steps.

At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from some set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. In return, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$. The process continues until the agent reaches a terminal state, after which the process restarts. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated return from time step $t$ with discount factor $\gamma \in (0, 1]$. The goal of the agent is to maximize the expected return from each state $s_t$. The action value $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a]$ is the expected return for selecting action $a$ in state $s$ and following policy $\pi$. The optimal value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ gives the maximum action value for state $s$ and action $a$ achievable by any policy. Similarly, the value of state $s$ under policy $\pi$ is defined as $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ and is simply the expected return for following policy $\pi$ from state $s$.
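As a small worked example of the return defined above (an illustration, not from the paper), the following computes $R_t$ for every step of a short reward sequence with $\gamma = 0.99$, using the recursion $R_t = r_t + \gamma R_{t+1}$:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k}, computed backwards over an episode.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 0.0, 2.0]))
# R_3 = 2.0, R_2 = 0.99 * 2.0 = 1.98, R_1 = 1.9602, R_0 = 1.0 + 0.99 * 1.9602 = 2.940598
```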

In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. The updates to $\theta$ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: $Q^{*}(s, a) \approx Q(s, a; \theta)$. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$-th loss function is defined as

$$L_i(\theta_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right],$$

where $s'$ is the state encountered after state $s$.

We refer to the above method as one-step Q-learning because it updates the action value $Q(s, a)$ toward the one-step return $r + \gamma \max_{a'} Q(s', a'; \theta)$. One drawback of using one-step methods is that obtaining a reward $r$ only directly affects the value of the state-action pair $s, a$ that led to the reward.
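To make the one-step target concrete, here is a minimal sketch of the update with a linear action-value approximator $Q(s, a; \theta) = \theta_a^\top s$; this only illustrates the loss above, it is not the paper's DQN setup, and the sizes and names are placeholders.

```python
import numpy as np

n_actions, state_dim = 2, 4
theta = np.zeros((n_actions, state_dim))      # one weight vector per discrete action

def q_values(state):
    """Q(s, a; theta) for every action a under the linear approximator."""
    return theta @ state

def one_step_q_update(s, a, r, s_next, done, gamma=0.99, lr=0.1):
    """Move Q(s, a; theta) toward the one-step target r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]        # the quantity that is squared in the loss L_i
    theta[a] += lr * td_error * s             # gradient step on the squared error w.r.t. theta_a

# Example transition: state s, action 0, reward 1.0, next state s'.
s, s_next = np.ones(state_dim), np.zeros(state_dim)
one_step_q_update(s, a=0, r=1.0, s_next=s_next, done=False)
```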

