Generative Adversarial Imitation Learning

Transcription of Generative Adversarial Imitation Learning

Generative Adversarial Imitation Learning
Jonathan Ho, Stefano Ermon (Stanford University)

Abstract. Consider learning a policy from example expert behavior, without interaction with the expert or access to a reinforcement signal. One approach is to recover the expert's cost function with inverse reinforcement learning, then extract a policy from that cost function with reinforcement learning. This approach is indirect and can be slow. We propose a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning. We show that a certain instantiation of our framework draws an analogy between imitation learning and generative adversarial networks, from which we derive a model-free imitation learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments.

1 Introduction

We are interested in a specific setting of imitation learning (the problem of learning to perform a task from expert demonstrations) in which the learner is given only samples of trajectories from the expert, is not allowed to query the expert for more data while training, and is not provided a reinforcement signal of any kind.

There are two main approaches suitable for this setting: behavioral cloning [18], which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and inverse reinforcement learning [23, 16], which finds a cost function under which the expert is uniquely optimal. Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift [21, 22]. Inverse reinforcement learning (IRL), on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error, a problem for methods that fit single-timestep decisions, is not an issue.
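Behavioral cloning is simple enough to sketch in a few lines. The snippet below is an illustrative sketch, not the paper's implementation: it assumes the expert demonstrations are given as arrays of states and discrete expert actions, and fits a small policy network by maximizing the likelihood of the expert's actions (cross-entropy over state-action pairs); the network size, optimizer, and hyperparameters are arbitrary choices made for this example.

```python
# Minimal behavioral-cloning sketch (illustrative; not the paper's code).
# Assumes expert_states: (N, state_dim) float array, expert_actions: (N,) int array.
import torch
import torch.nn as nn

def behavioral_cloning(expert_states, expert_actions, state_dim, n_actions,
                       epochs=50, lr=1e-3):
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.Tanh(),
        nn.Linear(64, n_actions),  # logits over discrete actions
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    states = torch.as_tensor(expert_states, dtype=torch.float32)
    actions = torch.as_tensor(expert_actions, dtype=torch.long)
    for _ in range(epochs):
        # Supervised learning over state-action pairs: maximize log pi(a_E | s_E).
        loss = nn.functional.cross_entropy(policy(states), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy  # pi(.|s) is the softmax over the returned logits
```

Such a policy only ever sees states visited by the expert, which is exactly why small prediction errors compound once the learner drifts into unfamiliar states.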

Accordingly, IRL has succeeded in a wide range of problems, from predicting behaviors of taxi drivers [29] to planning footsteps for quadruped robots [20]. Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work [6, 13]. Fundamentally, however, IRL learns a cost function, which explains expert behavior but does not directly tell the learner how to act. Given that the learner's true goal often is to take actions imitating the expert (indeed, many IRL algorithms are evaluated on the quality of the optimal actions of the costs they learn), why, then, must we learn a cost function, if doing so possibly incurs significant computational expense yet fails to directly yield actions?

We desire an algorithm that tells us explicitly how to act by directly learning a policy. To develop such an algorithm, we begin in Section 3, where we characterize the policy given by running reinforcement learning on a cost function learned by maximum causal entropy IRL [29, 30]. Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step. Then, we instantiate our framework in Sections 4 and 5 with a new model-free imitation learning algorithm. We show that our resulting algorithm is intimately connected to generative adversarial networks [8], a technique from the deep learning community that has led to recent successes in modeling distributions of natural images: our algorithm harnesses generative adversarial training to fit distributions of states and actions defining expert behavior.
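The adversarial instantiation itself is developed later in the paper; purely as an illustration of what "fitting distributions of states and actions" with adversarial training can look like, the sketch below trains a binary discriminator D(s, a) to separate expert state-action pairs from pairs generated by the current policy. The architecture, optimizer, and training step here are generic GAN-style choices made for this example, not the paper's exact algorithm.

```python
# Illustrative GAN-style discriminator over state-action pairs (not the paper's exact algorithm).
# Assumes expert_sa and policy_sa are (N, state_dim + action_dim) float arrays.
import torch
import torch.nn as nn

def make_discriminator(sa_dim):
    # Maps a state-action pair to a single logit; high means "looks like the expert".
    return nn.Sequential(nn.Linear(sa_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def discriminator_step(D, opt, expert_sa, policy_sa):
    expert_sa = torch.as_tensor(expert_sa, dtype=torch.float32)
    policy_sa = torch.as_tensor(policy_sa, dtype=torch.float32)
    logits = torch.cat([D(expert_sa), D(policy_sa)]).squeeze(-1)
    # Label expert pairs 1 and policy pairs 0; D learns to tell the two apart.
    labels = torch.cat([torch.ones(len(expert_sa)), torch.zeros(len(policy_sa))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In a full adversarial setup, the policy would in turn be updated to make its state-action distribution harder to distinguish from the expert's.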

We test our algorithm in Section 6, where we find that it outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.

2 Background

Preliminaries. $\bar{\mathbb{R}}$ will denote the extended real numbers $\mathbb{R} \cup \{\infty\}$. Section 3 will work with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, but our algorithms and experiments later in the paper will run in high-dimensional continuous environments. $\Pi$ is the set of all stationary stochastic policies that take actions in $\mathcal{A}$ given states in $\mathcal{S}$; successor states are drawn from the dynamics model $P(s' \mid s, a)$. We work in the $\gamma$-discounted infinite horizon setting, and we will use an expectation with respect to a policy $\pi \in \Pi$ to denote an expectation with respect to the trajectory it generates: $\mathbb{E}_\pi[c(s,a)] \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\right]$, where $s_0 \sim p_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for $t \geq 0$. We will use $\hat{\mathbb{E}}_\tau$ to denote an empirical expectation with respect to trajectory samples $\tau$, and we will always use $\pi_E$ to refer to the expert policy.

Inverse reinforcement learning. Suppose we are given an expert policy $\pi_E$ that we wish to rationalize with IRL.
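For concreteness, the empirical counterpart $\hat{\mathbb{E}}_\tau[c(s,a)]$ of the discounted expectation above can be computed from sampled trajectories as in the small helper below; this is written for this summary, not taken from the paper, and assumes each trajectory is a finite list of (state, action) pairs.

```python
# Empirical estimate of E_pi[c(s,a)] = E[sum_t gamma^t c(s_t, a_t)] from sampled trajectories.
# Each trajectory is a finite list of (state, action) pairs; c(s, a) returns a float cost.
def discounted_empirical_expectation(trajectories, c, gamma=0.99):
    totals = []
    for traj in trajectories:
        total, discount = 0.0, 1.0
        for s, a in traj:
            total += discount * c(s, a)
            discount *= gamma
        totals.append(total)
    return sum(totals) / len(totals)  # average discounted cost over sampled trajectories
```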

For the remainder of this paper, we will adopt and assume the existence of solutions of maximum causal entropy IRL [29, 30], which fits a cost function from a family of functions $\mathcal{C}$ with the optimization problem

$$\underset{c \in \mathcal{C}}{\text{maximize}} \;\Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (1)$$

where $H(\pi) \triangleq \mathbb{E}_\pi[-\log \pi(a \mid s)]$ is the $\gamma$-discounted causal entropy [3] of the policy $\pi$. In practice, $\pi_E$ will only be provided as a set of trajectories sampled by executing $\pi_E$ in the environment, so the expected cost of $\pi_E$ in Eq. (1) is estimated using these samples. Maximum causal entropy IRL looks for a cost function $c \in \mathcal{C}$ that assigns low cost to the expert policy and high cost to other policies, thereby allowing the expert policy to be found via a certain reinforcement learning procedure:

$$\text{RL}(c) = \arg\min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)] \qquad (2)$$

which maps a cost function to high-entropy policies that minimize the expected cumulative cost.

3 Characterizing the induced optimal policy

To begin our search for an imitation learning algorithm that both bypasses an intermediate IRL step and is suitable for large environments, we will study policies found by reinforcement learning on costs learned by IRL on the largest possible set of cost functions $\mathcal{C}$ in Eq. (1): all functions $\mathbb{R}^{\mathcal{S} \times \mathcal{A}} = \{c : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\}$. Using expressive cost function classes, like Gaussian processes [14] and neural networks [6], is crucial to properly explain complex expert behavior without meticulously hand-crafted features.
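As a concrete aside on the RL primitive of Eq. (2): in a small finite MDP it can be solved exactly with soft value iteration, since entropy-regularized cost minimization is ordinary maximum-entropy RL on the reward $r = -c$, and its optimal policy is a softmax of soft Q-values. The sketch below is an illustrative tabular helper written for this summary, not the procedure used in the paper's experiments.

```python
# Tabular soft value iteration for RL(c) = argmin_pi -H(pi) + E_pi[c]  (illustrative sketch).
# P: (S, A, S) transition probabilities, c: (S, A) costs, gamma: discount factor in (0, 1).
import numpy as np

def soft_value_iteration(P, c, gamma=0.99, iters=500):
    r = -c                          # entropy-regularized cost minimization = max-ent RL on r = -c
    V = np.zeros(c.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)     # Q(s,a) = r(s,a) + gamma * E_{s'|s,a}[V(s')]
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).squeeze(-1)  # log-sum-exp backup
    pi = np.exp(Q - V[:, None])     # pi(a|s) proportional to exp(Q(s,a)); rows sum to 1
    return pi
```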

Here, we investigate the best IRL can do with respect to expressiveness by examining its capabilities with $\mathcal{C} = \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. Of course, with such a large $\mathcal{C}$, IRL can easily overfit when provided a finite dataset. Therefore, we will incorporate a (closed, proper) convex cost function regularizer $\psi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \bar{\mathbb{R}}$ into our study. Note that convexity is not a particularly restrictive requirement: $\psi$ must be convex as a function defined on all of $\mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, not as a function defined on a small parameter space; indeed, the cost regularizers of Finn et al. [6], effective for a range of robotic manipulation tasks, satisfy this requirement. Interestingly, $\psi$ will play a central role in our discussion and will not serve as a nuisance in our analysis. Let us define an IRL primitive procedure, which finds a cost function such that the expert performs better than all other policies, with the cost regularized by $\psi$:

$$\text{IRL}_\psi(\pi_E) = \underset{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}}{\arg\max} \; -\psi(c) + \Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (3)$$
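One way to see that Eq. (3) generalizes Eq. (1) (noted here as a brief aside, under the conventions above): taking $\psi$ to be the indicator of the cost class $\mathcal{C}$, i.e. zero on $\mathcal{C}$ and $+\infty$ elsewhere, turns the regularization term into a hard constraint,

$$\psi(c) = \delta_{\mathcal{C}}(c) \triangleq \begin{cases} 0 & \text{if } c \in \mathcal{C} \\ +\infty & \text{otherwise} \end{cases} \quad\Longrightarrow\quad \text{IRL}_\psi(\pi_E) = \underset{c \in \mathcal{C}}{\arg\max} \; \Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)],$$

which is exactly the maximum causal entropy IRL problem of Eq. (1).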

Let $\tilde{c} \in \text{IRL}_\psi(\pi_E)$. We are interested in a policy given by $\text{RL}(\tilde{c})$; this is the policy given by running reinforcement learning on the output of IRL. To characterize $\text{RL}(\tilde{c})$, let us first define for a policy $\pi \in \Pi$ its occupancy measure $\rho_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ as $\rho_\pi(s,a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. The occupancy measure can be interpreted as the unnormalized distribution of state-action pairs that an agent encounters when navigating the environment with the policy $\pi$, and it allows us to write $\mathbb{E}_\pi[c(s,a)] = \sum_{s,a} \rho_\pi(s,a)\, c(s,a)$ for any cost function $c$. We will also need the concept of a convex conjugate: for a function $f : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \bar{\mathbb{R}}$, its convex conjugate $f^* : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \bar{\mathbb{R}}$ is given by $f^*(x) = \sup_{y \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} x^\top y - f(y)$.
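In the finite-MDP setting of Section 3, the occupancy measure has a simple closed form: the discounted state visitation distribution solves a linear system, and $\rho_\pi$ is that distribution reweighted by the policy. The helper below is an illustrative sketch written for this summary (not from the paper); it also makes it easy to check the identity $\mathbb{E}_\pi[c(s,a)] = \sum_{s,a} \rho_\pi(s,a)\, c(s,a)$ numerically.

```python
# Occupancy measure rho_pi(s,a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi)  (illustrative sketch).
# P: (S, A, S) transitions, pi: (S, A) stochastic policy, p0: (S,) initial state distribution.
import numpy as np

def occupancy_measure(P, pi, p0, gamma=0.99):
    S = p0.shape[0]
    # State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) * P(s'|s,a).
    P_pi = np.einsum('sa,sat->st', pi, P)
    # Discounted visitation d(s) = sum_t gamma^t P(s_t = s | pi) solves d = p0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, p0)
    return pi * d[:, None]          # rho_pi(s, a) = pi(a|s) * d(s)

# Check the identity: (occupancy_measure(P, pi, p0) * c).sum() should match the
# discounted expected cost of running pi from p0, for a cost matrix c of shape (S, A).
```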

Now, we are prepared to characterize $\text{RL}(\tilde{c})$, the policy learned by RL on the cost recovered by IRL.

Proposition 3.1. $\text{RL} \circ \text{IRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})$  (4)

The proof of Proposition 3.1 can be found in the appendix. It relies on the observation that the optimal cost function and policy form a saddle point of a certain function: IRL finds one coordinate of this saddle point, and running RL on the output of IRL reveals the other coordinate.

Proposition 3.1 tells us that $\psi$-regularized inverse reinforcement learning implicitly seeks a policy whose occupancy measure is close to the expert's, as measured by the convex function $\psi^*$. Enticingly, this suggests that various settings of $\psi$ lead to various imitation learning algorithms that directly solve the optimization problem given by Proposition 3.1. We explore such algorithms in Sections 4 and 5, where we show that certain settings of $\psi$ lead to both existing algorithms and a novel one. The special case when $\psi$ is a constant function is particularly illuminating, so we state and show it directly using concepts from convex analysis.
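As a quick informal illustration of why the constant case is illuminating (a sketch of the idea, not the paper's formal statement): when $\psi$ is a constant, its convex conjugate is $+\infty$ everywhere except at zero, so the objective in Eq. (4) is finite only when the occupancy measures match exactly,

$$\psi \equiv k \;\Longrightarrow\; \psi^*(x) = \sup_{y} \, x^\top y - k = \begin{cases} -k & \text{if } x = 0 \\ +\infty & \text{otherwise} \end{cases} \;\Longrightarrow\; \text{RL} \circ \text{IRL}_\psi(\pi_E) = \underset{\pi \in \Pi \,:\, \rho_\pi = \rho_{\pi_E}}{\arg\min} \; -H(\pi).$$

In other words, unregularized IRL followed by RL recovers the maximum-entropy policy whose occupancy measure exactly matches the expert's.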

