Example: barber

Deterministic Policy Gradient Algorithms

Deterministic Policy Gradient AlgorithmsDavid Technologies, London, UKGuy College London, UKNicolas Heess, Thomas Degris, Daan Wierstra, Martin Technologies, London, UKAbstractIn this paper we considerdeterministicpolicygradient Algorithms for reinforcement learningwith continuous actions. The Deterministic pol-icy Gradient has a particularly appealing form: itis the expected Gradient of the action-value func-tion. This simple form means that the deter-ministic Policy Gradient can be estimated muchmore efficiently than the usual stochastic pol-icy Gradient . To ensure adequate exploration,we introduce an off- Policy actor-critic algorithmthat learns a Deterministic target Policy from anexploratory behaviour Policy . We demonstratethat Deterministic Policy Gradient Algorithms cansignificantly outperform their stochastic counter-parts in high-dimensional action IntroductionPolicy Gradient Algorithms are widely used in reinforce-ment learning problems with continuous action spaces.

(ajs) = P[ajs; ] that stochastically selects action ain state saccording to parameter vector . Policy gradient algorithms typically proceed by sampling this stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.

Tags:

  Deterministic

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Deterministic Policy Gradient Algorithms

1 Deterministic Policy Gradient AlgorithmsDavid Technologies, London, UKGuy College London, UKNicolas Heess, Thomas Degris, Daan Wierstra, Martin Technologies, London, UKAbstractIn this paper we considerdeterministicpolicygradient Algorithms for reinforcement learningwith continuous actions. The Deterministic pol-icy Gradient has a particularly appealing form: itis the expected Gradient of the action-value func-tion. This simple form means that the deter-ministic Policy Gradient can be estimated muchmore efficiently than the usual stochastic pol-icy Gradient . To ensure adequate exploration,we introduce an off- Policy actor-critic algorithmthat learns a Deterministic target Policy from anexploratory behaviour Policy . We demonstratethat Deterministic Policy Gradient Algorithms cansignificantly outperform their stochastic counter-parts in high-dimensional action IntroductionPolicy Gradient Algorithms are widely used in reinforce-ment learning problems with continuous action spaces.

2 Thebasic idea is to represent the Policy by a parametric prob-ability distribution (a|s) =P[a|s; ]that stochasticallyselects actionain statesaccording to parameter vector . Policy Gradient Algorithms typically proceed by samplingthis stochastic Policy and adjusting the Policy parametersin the direction of greater cumulative this paper we instead considerdeterministicpoliciesa= (s). It is natural to wonder whether the same ap-proach can be followed as for stochastic policies: adjustingthe Policy parameters in the direction of the Policy gradi-ent. It was previously believed that the Deterministic pol-icy Gradient did not exist, or could only be obtained whenusing a model (Peters, 2010). However, we show that thedeterministic Policy Gradient does indeed exist, and further-more it has a simple model-free form that simply followsthe Gradient of the action-value function.

3 In addition, weshow that the Deterministic Policy Gradient is the limitingProceedings of the31stInternational Conference on MachineLearning, Beijing, China, 2014. JMLR: W&CP volume 32. Copy-right 2014 by the author(s).case, as Policy variance tends to zero, of the stochastic pol-icy a practical viewpoint, there is a crucial difference be-tween the stochastic and Deterministic Policy gradients. Inthe stochastic case, the Policy Gradient integrates over bothstate and action spaces, whereas in the Deterministic case itonly integrates over the state space. As a result, computingthe stochastic Policy Gradient may require more samples,especially if the action space has many order to explore the full state and action space, a stochas-tic Policy is often necessary. To ensure that our determinis-tic Policy Gradient Algorithms continue to explore satisfac-torily, we introduce an off- Policy learning algorithm.

4 Thebasic idea is to choose actions according to a stochasticbehaviour Policy (to ensure adequate exploration), but tolearn about a Deterministic target Policy (exploiting the ef-ficiency of the Deterministic Policy Gradient ). We use thedeterministic Policy Gradient to derive an off- Policy actor-critic algorithm that estimates the action-value function us-ing a differentiable function approximator, and then up-dates the Policy parameters in the direction of the approx-imate action-value Gradient . We also introduce a notion ofcompatiblefunction approximation for Deterministic policygradients, to ensure that the approximation does not biasthe Policy apply our Deterministic actor-critic Algorithms to sev-eral benchmark problems: a high-dimensional bandit; sev-eral standard benchmark reinforcement learning tasks withlow dimensional action spaces; and a high-dimensionaltask for controlling an octopus arm.

5 Our results demon-strate a significant performance advantage to using deter-ministic Policy gradients over stochastic Policy gradients,particularly in high dimensional tasks. Furthermore, ouralgorithms require no more computation than prior meth-ods: the computational cost of each update is linear in theaction dimensionality and the number of Policy , there are many applications (for example inrobotics) where a differentiable control Policy is provided,but where there is no functionality to inject noise into thecontroller. In these cases, the stochastic Policy Gradient isinapplicable, whereas our methods may still be Policy Gradient Algorithms2. PreliminariesWe study reinforcement learning and control problems inwhich an agent acts in a stochastic environment by sequen-tially choosing actions over a sequence of time steps, inorder to maximise a cumulative reward.

6 We model theproblem as aMarkov decision process(MDP) which com-prises: astate spaceS, anaction spaceA, aninitial statedistributionwith densityp1(s1), a stationarytransition dy-namics distributionwith conditional densityp(st+1|st,at)satisfying the Markov propertyp(st+1|s1,a1,..,st,at) =p(st+1|st,at), for any trajectorys1,a1,s2,a2,..,sT,aTin state-action space, and areward functionr:S A used to select actions in the MDP. In generalthe Policy is stochastic and denoted by :S P(A),whereP(A)is the set of probability measures onAand Rnis a vector ofnparameters, and (at|st)isthe conditional probability density atatassociated withthe Policy . The agent uses its Policy to interact with theMDP to give a trajectory of states, actions and rewards,h1:T=s1,a1, ,sT,aT,rToverS A R.

7 Thereturnr tis the total discounted reward from time-steptonwards,r t= k=t k tr(sk,ak)where0< < functions are defined to be the expected total dis-counted reward,V (s) =E[r 1|S1=s; ]andQ (s,a) =E[r 1|S1=s,A1=a; ].1 The agent s goal is to obtain apolicy which maximises the cumulative discounted rewardfrom the start state, denoted by the performance objectiveJ( ) =E[r 1| ].We denote the density at states after transitioning forttime steps from statesbyp(s s ,t, ). We also denotethe (improper) discounted state distribution by (s ) := S t=1 t 1p1(s)p(s s ,t, )ds. We can then writethe performance objective as an expectation,J( ) = S (s) A (s,a)r(s,a)dads=Es ,a [r(s,a)](1)whereEs [ ]denotes the (improper) expected value withrespect to discounted state distribution (s).

8 2In the re-mainder of the paper we suppose for simplicity thatA=Rmand thatSis a compact subset Stochastic Policy Gradient TheoremPolicy gradientalgorithms are perhaps the most popularclass of continuous action reinforcement learning algo-rithms. The basic idea behind these Algorithms is to adjust1To simplify notation, we frequently drop the random vari-able in the conditional density and writep(st+1|st,at) =p(st+1|St=st,At=at); furthermore we superscript valuefunctions by rather than .2 The results in this paper may be extended to an average re-ward performance objective by choosing (s)to be the stationarydistribution of an ergodic parameters of the Policy in the direction of the perfor-mance Gradient J( ). The fundamental result underly-ing these Algorithms is thepolicy Gradient theorem(Suttonet al.)

9 , 1999), J( ) = S (s) A (a|s)Q (s,a)dads=Es ,a [ log (a|s)Q (s,a)](2)The Policy Gradient is surprisingly simple. In particular,despite the fact that the state distribution (s)depends onthe Policy parameters, the Policy Gradient does not dependon the Gradient of the state theorem has important practical value, because it re-duces the computation of the performance Gradient to asimple expectation. The Policy Gradient theorem has beenused to derive a variety of Policy Gradient Algorithms (De-gris et al., 2012a), by forming a sample-based estimate ofthis expectation. One issue that these Algorithms must ad-dress is how to estimate the action-value functionQ (s,a).Perhaps the simplest approach is to use a sample returnr tto estimate the value ofQ (st,at), which leads to a variantof the REINFORCE algorithm (Williams, 1992).

10 Stochastic Actor-Critic AlgorithmsTheactor-criticis a widely used architecture based on thepolicy Gradient theorem (Sutton et al., 1999; Peters et al.,2005; Bhatnagar et al., 2007; Degris et al., 2012a). Theactor-critic consists of two eponymous components. Anac-toradjusts the parameters of the stochastic Policy (s)by stochastic Gradient ascent of Equation 2. Instead of theunknown true action-value functionQ (s,a)in Equation2, an action-value functionQw(s,a)is used, with param-eter vectorw. Acriticestimates the action-value functionQw(s,a) Q (s,a)using an appropriate Policy evalua-tion algorithm such as temporal-difference general, substituting a function approximatorQw(s,a)for the true action-value functionQ (s,a)may introducebias. However, if the function approximator iscompati-blesuch that i)Qw(s,a) = log (a|s)>wand ii) theparametersware chosen to minimise the mean-squared er-ror 2(w) =Es ,a [(Qw(s,a) Q (s,a))2], thenthere is no bias (Sutton et al.


Related search queries