Example: marketing

Trust Region Policy Optimization

Trust Region Policy OptimizationJohn of California, Berkeley, Department of Electrical Engineering and Computer SciencesAbstractIn this article, we describe a method for optimiz-ing control policies, with guaranteed monotonicimprovement. By making several approxima-tions to the theoretically-justified scheme, we de-velop a practical algorithm, called Trust RegionPolicy Optimization (TRPO). This algorithm iseffective for optimizing large nonlinear poli-cies such as neural networks. Our experimentsdemonstrate its robust performance on a wide va-riety of tasks: learning simulated robotic swim-ming, hopping, and walking gaits; and playingAtari games using images of the screen as its approximations that deviate from thetheory, TRPO tends to give monotonic improve-ment, with little tuning of algorithms for Policy Optimization can be classifiedinto three broad categories: Policy iteration methods, whichalternate between estimating the value function under thecurrent Policy and improving the Policy (Bertsekas, 2005); Policy gradient methods, which use an estimator of the gra-dient of the expected cost obtained from sample trajec-tories (Peters & Schaal, 2008a) (and which, as we laterdiscuss, have a close connection to Policy iteration); andderiva

Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copy-right 2015 by the author(s). namic programming (ADP) methods, stochastic optimiza-tion methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control poli-

Tags:

  Trust, Into, Regions, Optimization, Lille, Optimiza, Trust region, Optimiza tion

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Trust Region Policy Optimization

1 Trust Region Policy OptimizationJohn of California, Berkeley, Department of Electrical Engineering and Computer SciencesAbstractIn this article, we describe a method for optimiz-ing control policies, with guaranteed monotonicimprovement. By making several approxima-tions to the theoretically-justified scheme, we de-velop a practical algorithm, called Trust RegionPolicy Optimization (TRPO). This algorithm iseffective for optimizing large nonlinear poli-cies such as neural networks. Our experimentsdemonstrate its robust performance on a wide va-riety of tasks: learning simulated robotic swim-ming, hopping, and walking gaits; and playingAtari games using images of the screen as its approximations that deviate from thetheory, TRPO tends to give monotonic improve-ment, with little tuning of algorithms for Policy Optimization can be classifiedinto three broad categories: Policy iteration methods, whichalternate between estimating the value function under thecurrent Policy and improving the Policy (Bertsekas, 2005); Policy gradient methods, which use an estimator of the gra-dient of the expected cost obtained from sample trajec-tories (Peters & Schaal, 2008a) (and which, as we laterdiscuss, have a close connection to Policy iteration).

2 Andderivative-free Optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation(CMA), which treat the cost as a black box function to beoptimized in terms of the Policy parameters (Fu et al., 2005;Szita & L orincz, 2006).General derivative-free stochastic Optimization methodssuch as CEM and CMA are preferred on many prob-lems, because they achieve good results while being sim-ple to understand and implement. For example, whileTetris is a classic benchmark problem for approximate dy-Proceedings of the31stInternational Conference on MachineLearning, lille , France, 2015. JMLR: W&CP volume 37. Copy-right 2015 by the author(s).namic programming (ADP) methods, stochastic optimiza -tion methods are difficult to beat on this task (Gabillonet al., 2013). For continuous control problems, methodslike CMA have been successful at learning control poli-cies for challenging tasks like locomotion when providedwith hand-engineered Policy classes with low-dimensionalparameterizations (Wampler & Popovi c, 2009).

3 The in-ability of ADP and gradient-based methods to consistentlybeat gradient-free random search is unsatisfying, sincegradient-based Optimization algorithms enjoy much bettersample complexity guarantees than gradient-free methods(Nemirovski, 2005). Continuous gradient-based optimiza -tion has been very successful at learning function approxi-mators for supervised learning tasks with huge numbers ofparameters, and extending their success to reinforcementlearning would allow for efficient training of complex andpowerful this article, we first prove that minimizing a certain sur-rogate loss function guarantees Policy improvement withnon-trivial step sizes. Then we make a series of approxi-mations to the theoretically-justified algorithm, yielding apractical algorithm, which we call Trust Region Policy op-timization (TRPO). We describe two variants of this algo-rithm: first, thesingle-pathmethod, which can be appliedin the model-free setting; second, thevinemethod, whichrequires the system to be restored to particular states, whichis typically only possible in simulation.

4 These algorithmsare scalable and can optimize nonlinear policies with tensof thousands of parameters, which have previously posed amajor challenge for model-free Policy search (Deisenrothet al., 2013). In our experiments, we show that the sameTRPO methods can learn complex policies for swimming,hopping, and walking, as well as playing Atari games di-rectly from raw an infinite-horizon discounted Markov decisionprocess (MDP), defined by the tuple(S,A,P,c, 0, ),whereSis a finite set of states,Ais a finite set of actions,P:S A S!Ris the transition probability distri-bution,c:S!Ris the cost function, 0:S!RisTrust Region Policy Optimizationthe distribution of the initial states0, and 2(0,1)is thediscount denote a stochastic Policy :S A![0,1], andlet ( )denote its expected discounted cost: ( )=Es0,a0,.."1Xt=0 tc(st)#,wheres0 0(s0),at (at|st),st+1 P(st+1|st,at).

5 We will use the following standard definitions of the state-action value functionQ , the value functionV , and theadvantage functionA :Q (st,at)=Est+1,at+1,.."1Xl=0 lc(st+l)#,V (st)=Eat,st+1,.."1Xl=0 lc(st+l)#,A (s, a)=Q (s, a) V (s),whereat (at|st),st+1 P(st+1|st,at)fort following useful identity expresses the expected costof another Policy in terms of the advantage over , accu-mulated over timesteps (see Kakade & Langford (2002) forthe proof, which we also reprise in Appendix A using thenotation in this paper): ( )= ( )+Es0,a0,s1,a1,.."1Xt=0 tA (st,at)#, wheres0 0(s0),at (at|st),st+1 P(st+1|st,at).(1)Let be the (unnormalized) discounted visitation fre-quencies (s)=(P(s0=s)+ P(s1=s)+ 2P(s2=s)+..),wheres0 0and the actions are chosen according to . Rearranging Equation (1) to sum over states instead oftimesteps, we obtain ( )= ( )+Xs (s)Xa (a|s)A (s, a).

6 (2)This equation implies that any Policy update ! thathas a non-positive expected advantage ateverystates, ,Pa (a|s)A (s, a) 0, is guaranteed to reduce , orleave it constant in the case that the expected advantageis zero everywhere. This implies the classic result that theupdate performed by exact Policy iteration, which uses thedeterministic Policy (s) = arg minaA (s, a), improvesthe Policy if there is at least one state-action pair with anegative advantage value and nonzero state visitation prob-ability (otherwise it has converged). However, in the ap-proximate setting, it will typically be unavoidable, due toestimation and approximation error, that there will be somestatessfor which the expected advantage is positive ( ,bad), that is,Pa (a|s)A (s, a)>0. The complex de-pendency of (s)on makes Equation (2) difficult to op-timize directly.

7 Instead, we introduce the following localapproximation to :L ( )= ( )+Xs (s)Xa (a|s)A (s, a).(3)Note thatL uses the visitation frequency rather than , ignoring changes in state visitation density due tochanges in the Policy . However, if we have a parameter-ized Policy , where (a|s)is a differentiable functionof the parameter vector , thenL matches to first order(see Kakade & Langford (2002)). That is, for any parame-ter value 0,L 0( 0)= ( 0),r L 0( ) = 0=r ( ) = 0.(4)Equation (4) implies that a sufficiently small step 0! that improvesL oldwill also improve , but does not giveus any guidance on how big of a step to take. To addressthis issue, Kakade & Langford (2002) proposed a policyupdating scheme called conservative Policy iteration, forwhich they could provide explicit lower bounds on the im-provement of .To define the conservative Policy iteration update, let olddenote the current Policy , and assume that we can solve 0= arg min 0L old( 0).

8 The new Policy newis taken tobe the following mixture Policy : new(a|s)=(1 ) old(a|s)+ 0(a|s)(5)Kakade and Langford proved the following result for thisupdate: ( new) L old( new)+2 (1 (1 ))(1 ) 2,(6)where is the maximum advantage (positive or negative)of 0relative to : = maxs|Ea 0(a|s)[A (s, a)]|(7)Since , 2[0,1], Equation (6) implies the following sim-pler bound, which we refer to in the next section: ( new) L old( new)+2 (1 )2 2.(8)This bound is only slightly weaker when 1, whichis typically the case in the conservative Policy iterationmethod of Kakade & Langford (2002). Note, however, thatso far this bound only applies to mixture policies gener-ated by Equation (5). This Policy class is unwieldy andrestrictive in practice, and it is desirable for a practical pol-icy update scheme to be applicable to all general stochasticpolicy Region Policy Optimization3 Monotonic Improvement Guarantee forGeneral Stochastic PoliciesEquation (6), which applies to conservative Policy itera-tion, implies that a Policy update that improves the right-hand side is guaranteed to improve the true expected costobjective.

9 Our principal theoretical result is that the pol-icy improvement bound in Equation (6) can be extendedto general stochastic policies, rather than just mixture po-lices, by replacing with a distance measure between and . Since mixture policies are rarely used in practice,this result is crucial for extending the improvement guaran-tee to practical problems. The particular distance measurewe use is the total variation divergence, which is definedbyDTV(pkq)=12Pi|pi qi|for discrete probabilitydistributionsp, ( , )asDmaxTV( , ) = maxsDTV( ( |s)k ( |s)).(9)Theorem =DmaxTV( old, new), and let =maxs|Ea 0(a|s)[A (s, a)]|. Then Equation(8) provide two proofs in the appendix. The first proof ex-tends Kakade and Langford s result using the fact that therandom variables from two distributions with total varia-tion divergence less than can be coupled, so that they areequal with probability1.

10 The second proof uses pertur-bation theory to prove a slightly stronger version of Equa-tion (8), with a more favorable definition of that dependson .Next, we note the following relationship between the to-tal variation divergence and the KL divergence (Pollard(2000), Ch. 3):DTV(pkq)2 DKL(pkq). LetDmaxKL( , ) = maxsDKL( ( |s)k ( |s)). The follow-ing bound then follows directly from Equation (8): ( ) L ( )+CDmaxKL( , ),whereC=2 (1 )2.(10)Algorithm 1 describes an approximate Policy iterationscheme based on the Policy improvement bound in Equa-tion (10). Note that for now, we assume exact evaluation ofthe advantage valuesA .It follows from Equation (10) that Algorithm 1 is guaran-teed to generate a sequence of monotonically improvingpolicies ( 0) ( 1) ( 2) .. To see this, letMi( )=L i( )+CDmaxKL( i, ). Then ( i+1) Mi( i+1)by Equation (10) ( i)=Mi( i),therefore, ( i+1) ( i) Mi( i+1) M( i).


Related search queries