Trust Region Policy Optimization

Trust Region Policy OptimizationJohn of California, Berkeley, Department of Electrical Engineering and Computer SciencesAbstractIn this article, we describe a method for optimiz-ing control policies, with guaranteed monotonicimprovement. By making several approxima-tions to the theoretically-justified scheme, we de-velop a practical algorithm, called Trust RegionPolicy Optimization (TRPO). This algorithm iseffective for optimizing large nonlinear poli-cies such as neural networks. Our experimentsdemonstrate its robust performance on a wide va-riety of tasks: learning simulated robotic swim-ming, hopping, and walking gaits; and playingAtari games using images of the screen as its approximations that deviate from thetheory, TRPO tends to give monotonic improve-ment, with little tuning of algorithms for Policy Optimization can be classifiedinto three broad categories: Policy iteration methods, whichalternate between estimating the value function under thecurrent Policy and improving the Policy (Bertsekas, 2005); Policy gradient methods, which use an estimator of the gra-dient of the expected cost obtained from sample trajec-tories (Peters & Schaal, 2008a) (and which, as we laterdiscuss, have a close connection to Policy iteration).

Andderivative-free Optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation(CMA), which treat the cost as a black box function to beoptimized in terms of the Policy parameters (Fu et al., 2005;Szita & L orincz, 2006).General derivative-free stochastic Optimization methodssuch as CEM and CMA are preferred on many prob-lems, because they achieve good results while being sim-ple to understand and implement. For example, whileTetris is a classic benchmark problem for approximate dy-Proceedings of the31stInternational Conference on MachineLearning, lille , France, 2015. JMLR: W&CP volume 37. Copy-right 2015 by the author(s).namic programming (ADP) methods, stochastic optimiza -tion methods are difficult to beat on this task (Gabillonet al., 2013). For continuous control problems, methodslike CMA have been successful at learning control poli-cies for challenging tasks like locomotion when providedwith hand-engineered Policy classes with low-dimensionalparameterizations (Wampler & Popovi c, 2009).

The in-ability of ADP and gradient-based methods to consistentlybeat gradient-free random search is unsatisfying, sincegradient-based Optimization algorithms enjoy much bettersample complexity guarantees than gradient-free methods(Nemirovski, 2005). Continuous gradient-based optimiza -tion has been very successful at learning function approxi-mators for supervised learning tasks with huge numbers ofparameters, and extending their success to reinforcementlearning would allow for efficient training of complex andpowerful this article, we first prove that minimizing a certain sur-rogate loss function guarantees Policy improvement withnon-trivial step sizes. Then we make a series of approxi-mations to the theoretically-justified algorithm, yielding apractical algorithm, which we call Trust Region Policy op-timization (TRPO). We describe two variants of this algo-rithm: first, thesingle-pathmethod, which can be appliedin the model-free setting; second, thevinemethod, whichrequires the system to be restored to particular states, whichis typically only possible in simulation.

These algorithmsare scalable and can optimize nonlinear policies with tensof thousands of parameters, which have previously posed amajor challenge for model-free Policy search (Deisenrothet al., 2013). In our experiments, we show that the sameTRPO methods can learn complex policies for swimming,hopping, and walking, as well as playing Atari games di-rectly from raw an infinite-horizon discounted Markov decisionprocess (MDP), defined by the tuple(S,A,P,c, 0, ),whereSis a finite set of states,Ais a finite set of actions,P:S A S!Ris the transition probability distri-bution,c:S!Ris the cost function, 0:S!RisTrust Region Policy Optimizationthe distribution of the initial states0, and 2(0,1)is thediscount denote a stochastic Policy :S A![0,1], andlet ( )denote its expected discounted cost: ( )=Es0,a0,.."1Xt=0 tc(st)#,wheres0 0(s0),at (at|st),st+1 P(st+1|st,at).

We will use the following standard definitions of the state-action value functionQ , the value functionV , and theadvantage functionA :Q (st,at)=Est+1,at+1,.."1Xl=0 lc(st+l)#,V (st)=Eat,st+1,.."1Xl=0 lc(st+l)#,A (s, a)=Q (s, a) V (s),whereat (at|st),st+1 P(st+1|st,at)fort following useful identity expresses the expected costof another Policy in terms of the advantage over , accu-mulated over timesteps (see Kakade & Langford (2002) forthe proof, which we also reprise in Appendix A using thenotation in this paper): ( )= ( )+Es0,a0,s1,a1,.."1Xt=0 tA (st,at)#, wheres0 0(s0),at (at|st),st+1 P(st+1|st,at).(1)Let be the (unnormalized) discounted visitation fre-quencies (s)=(P(s0=s)+ P(s1=s)+ 2P(s2=s)+..),wheres0 0and the actions are chosen according to . Rearranging Equation (1) to sum over states instead oftimesteps, we obtain ( )= ( )+Xs (s)Xa (a|s)A (s, a).

(2)This equation implies that any Policy update ! thathas a non-positive expected advantage ateverystates, ,Pa (a|s)A (s, a) 0, is guaranteed to reduce , orleave it constant in the case that the expected advantageis zero everywhere. This implies the classic result that theupdate performed by exact Policy iteration, which uses thedeterministic Policy (s) = arg minaA (s, a), improvesthe Policy if there is at least one state-action pair with anegative advantage value and nonzero state visitation prob-ability (otherwise it has converged). However, in the ap-proximate setting, it will typically be unavoidable, due toestimation and approximation error, that there will be somestatessfor which the expected advantage is positive ( ,bad), that is,Pa (a|s)A (s, a)>0. The complex de-pendency of (s)on makes Equation (2) difficult to op-timize directly.

Instead, we introduce the following localapproximation to :L ( )= ( )+Xs (s)Xa (a|s)A (s, a).(3)Note thatL uses the visitation frequency rather than , ignoring changes in state visitation density due tochanges in the Policy . However, if we have a parameter-ized Policy , where (a|s)is a differentiable functionof the parameter vector , thenL matches to first order(see Kakade & Langford (2002)). That is, for any parame-ter value 0,L 0( 0)= ( 0),r L 0( ) = 0=r ( ) = 0.(4)Equation (4) implies that a sufficiently small step 0! that improvesL oldwill also improve , but does not giveus any guidance on how big of a step to take. To addressthis issue, Kakade & Langford (2002) proposed a policyupdating scheme called conservative Policy iteration, forwhich they could provide explicit lower bounds on the im-provement of .To define the conservative Policy iteration update, let olddenote the current Policy , and assume that we can solve 0= arg min 0L old( 0).

The new Policy newis taken tobe the following mixture Policy : new(a|s)=(1 ) old(a|s)+ 0(a|s)(5)Kakade and Langford proved the following result for thisupdate: ( new) L old( new)+2 (1 (1 ))(1 ) 2,(6)where is the maximum advantage (positive or negative)of 0relative to : = maxs|Ea 0(a|s)[A (s, a)]|(7)Since , 2[0,1], Equation (6) implies the following sim-pler bound, which we refer to in the next section: ( new) L old( new)+2 (1 )2 2.(8)This bound is only slightly weaker when 1, whichis typically the case in the conservative Policy iterationmethod of Kakade & Langford (2002). Note, however, thatso far this bound only applies to mixture policies gener-ated by Equation (5). This Policy class is unwieldy andrestrictive in practice, and it is desirable for a practical pol-icy update scheme to be applicable to all general stochasticpolicy Region Policy Optimization3 Monotonic Improvement Guarantee forGeneral Stochastic PoliciesEquation (6), which applies to conservative Policy itera-tion, implies that a Policy update that improves the right-hand side is guaranteed to improve the true expected costobjective.

Our principal theoretical result is that the pol-icy improvement bound in Equation (6) can be extendedto general stochastic policies, rather than just mixture po-lices, by replacing with a distance measure between and . Since mixture policies are rarely used in practice,this result is crucial for extending the improvement guaran-tee to practical problems. The particular distance measurewe use is the total variation divergence, which is definedbyDTV(pkq)=12Pi|pi qi|for discrete probabilitydistributionsp, ( , )asDmaxTV( , ) = maxsDTV( ( |s)k ( |s)).(9)Theorem =DmaxTV( old, new), and let =maxs|Ea 0(a|s)[A (s, a)]|. Then Equation(8) provide two proofs in the appendix. The first proof ex-tends Kakade and Langford s result using the fact that therandom variables from two distributions with total varia-tion divergence less than can be coupled, so that they areequal with probability1.

The second proof uses pertur-bation theory to prove a slightly stronger version of Equa-tion (8), with a more favorable definition of that dependson .Next, we note the following relationship between the to-tal variation divergence and the KL divergence (Pollard(2000), Ch. 3):DTV(pkq)2 DKL(pkq). LetDmaxKL( , ) = maxsDKL( ( |s)k ( |s)). The follow-ing bound then follows directly from Equation (8): ( ) L ( )+CDmaxKL( , ),whereC=2 (1 )2.(10)Algorithm 1 describes an approximate Policy iterationscheme based on the Policy improvement bound in Equa-tion (10). Note that for now, we assume exact evaluation ofthe advantage valuesA .It follows from Equation (10) that Algorithm 1 is guaran-teed to generate a sequence of monotonically improvingpolicies ( 0) ( 1) ( 2) .. To see this, letMi( )=L i( )+CDmaxKL( i, ). Then ( i+1) Mi( i+1)by Equation (10) ( i)=Mi( i),therefore, ( i+1) ( i) Mi( i+1) M( i).

Trust Region Policy Optimization

Tags:

Information

Advertisement

Transcription of Trust Region Policy Optimization

Related search queries

Trust Region Policy Optimization

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries