Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine
Berkeley Artificial Intelligence Research, University of California, Berkeley, USA

Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework.

In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.

1. Introduction

Model-free deep reinforcement learning (RL) algorithms have been applied in a range of challenging domains, from games (Mnih et al., 2013; Silver et al., 2016) to robotic control (Schulman et al., 2015). The combination of RL and high-capacity function approximators such as neural networks holds the promise of automating a wide range of decision making and control tasks, but widespread adoption of these methods in real-world domains has been hampered by two major challenges. First, model-free deep RL methods are notoriously expensive in terms of their sample complexity. Even relatively simple tasks can require millions of steps of data collection, and complex behaviors with high-dimensional observations might need substantially more. Second, these methods are often brittle with respect to their hyperparameters: learning rates, exploration constants, and other settings must be set carefully for different problem settings to achieve good results. Both of these challenges severely limit the applicability of model-free deep RL to real-world tasks.

One cause for the poor sample efficiency of deep RL methods is on-policy learning: some of the most commonly used deep RL algorithms, such as TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017b), or A3C (Mnih et al., 2016), require new samples to be collected for each gradient step. This quickly becomes extravagantly expensive, as the number of gradient steps and samples per step needed to learn an effective policy increases with task complexity. Off-policy algorithms aim to reuse past experience. This is not directly feasible with conventional policy gradient formulations, but is relatively straightforward for Q-learning based methods (Mnih et al., 2015). Unfortunately, the combination of off-policy learning and high-dimensional, nonlinear function approximation with neural networks presents a major challenge for stability and convergence (Bhatnagar et al., 2009). This challenge is further exacerbated in continuous state and action spaces, where a separate actor network is often used to perform the maximization in Q-learning. A commonly used algorithm in such settings, deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), provides for sample-efficient learning but is notoriously challenging to use due to its extreme brittleness and hyperparameter sensitivity (Duan et al., 2016; Henderson et al., 2017).
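To make the idea of reusing past experience concrete, the sketch below shows a minimal experience replay buffer of the kind that off-policy methods such as Q-learning and DDPG build on: transitions gathered under any past behavior policy are stored and repeatedly resampled for gradient updates. This is an illustrative sketch with names of our own choosing, not code from the paper.

    import random
    from collections import deque

    class ReplayBuffer:
        """Store (state, action, reward, next_state, done) transitions from any
        behavior policy and sample them uniformly for off-policy updates."""

        def __init__(self, capacity=1_000_000):
            # Oldest transitions are evicted once capacity is reached.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=256):
            # Uniform sampling over everything collected so far.
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.buffer)

An off-policy learner can take many gradient steps per environment interaction by repeatedly calling sample() on old data; an on-policy method, by contrast, must discard these transitions after a single update, which is the source of the expense described above.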

We explore how to design an efficient and stable model-free deep RL algorithm for continuous state and action spaces. To that end, we draw on the maximum entropy framework, which augments the standard maximum reward reinforcement learning objective with an entropy maximization term (Ziebart et al., 2008; Toussaint, 2009; Rawlik et al., 2012; Fox et al., 2016; Haarnoja et al., 2017). Maximum entropy reinforcement learning alters the RL objective, though the original objective can be recovered using a temperature parameter (Haarnoja et al., 2017).
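One standard way to write the entropy-augmented objective referred to here (the paper formalizes its version in later sections; s_t and a_t denote states and actions, \rho_\pi the state-action marginals induced by the policy \pi, and \alpha the temperature) is

    J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big].

As \alpha \to 0, the entropy term vanishes and the conventional expected-return objective is recovered; larger \alpha favors more stochastic policies.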

More importantly, the maximum entropy formulation provides a substantial improvement in exploration and robustness: as discussed by Ziebart (2010), maximum entropy policies are robust in the face of model and estimation errors, and, as demonstrated by Haarnoja et al. (2017), they improve exploration by acquiring diverse behaviors. Prior work has proposed model-free deep RL algorithms that perform on-policy learning with entropy maximization (O'Donoghue et al., 2016), as well as off-policy methods based on soft Q-learning and its variants (Schulman et al., 2017a; Nachum et al., 2017a; Haarnoja et al., 2017). However, the on-policy variants suffer from poor sample complexity for the reasons discussed above, while the off-policy variants require complex approximate inference procedures in continuous action spaces.

In this paper, we demonstrate that we can devise an off-policy maximum entropy actor-critic algorithm, which we call soft actor-critic (SAC), that provides for both sample-efficient learning and stability. This algorithm extends readily to very complex, high-dimensional tasks, such as the Humanoid benchmark (Duan et al., 2016) with 21 action dimensions, where off-policy methods such as DDPG typically struggle to obtain good results (Gu et al., 2016). SAC also avoids the complexity and potential instability associated with approximate inference in prior off-policy maximum entropy algorithms based on soft Q-learning (Haarnoja et al., 2017).

We present a convergence proof for policy iteration in the maximum entropy framework, and then introduce a new algorithm based on an approximation to this procedure that can be practically implemented with deep neural networks, which we call soft actor-critic. We present empirical results showing that soft actor-critic attains a substantial improvement in both performance and sample efficiency over both off-policy and on-policy prior methods. We also compare to the twin delayed deep deterministic (TD3) policy gradient algorithm (Fujimoto et al., 2018), a concurrent work that proposes a deterministic algorithm that substantially improves on DDPG.
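As a preview of the structure referenced here (the formal statements and assumptions appear in the paper's later sections; the explicit \alpha scaling shown below is one common convention, since the paper folds the temperature into the reward scale), soft policy iteration alternates two steps. Soft policy evaluation repeatedly applies a soft Bellman backup operator

    \mathcal{T}^\pi Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[ V(s_{t+1}) \big],
    \qquad V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big],

and soft policy improvement projects the exponentiated soft Q-function back onto the set of tractable policies \Pi:

    \pi_{\text{new}} = \arg\min_{\pi' \in \Pi} \; D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \;\middle\|\; \frac{\exp\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\text{old}}}(s_t)} \right).

The practical algorithm replaces these exact steps with stochastic gradient updates on neural network approximators of Q and \pi.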

2. Related Work

Our soft actor-critic algorithm incorporates three key ingredients: an actor-critic architecture with separate policy and value function networks, an off-policy formulation that enables reuse of previously collected data for efficiency, and entropy maximization to enable stability and exploration. We review prior works that draw on some of these ideas in this section. Actor-critic algorithms are typically derived starting from policy iteration, which alternates between policy evaluation (computing the value function for a policy) and policy improvement (using the value function to obtain a better policy) (Barto et al.

