
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Transcription of Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine
Berkeley Artificial Intelligence Research, University of California, Berkeley, USA. Correspondence to: Tuomas Haarnoja. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
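As background for the abstract's claim that the actor maximizes entropy, i.e. succeeds at the task while acting as randomly as possible: the entropy of a stochastic policy at a state is the expected negative log-probability of its own actions. In standard notation (a textbook definition, not notation taken from this transcription):

    \mathcal{H}\big(\pi(\cdot \mid s_t)\big) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\big[ -\log \pi(a_t \mid s_t) \big]

A policy that spreads probability mass over many actions has high entropy, while a nearly deterministic policy has low entropy.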

1. Introduction

Model-free deep reinforcement learning (RL) algorithms have been applied in a range of challenging domains, from games (Mnih et al., 2013; Silver et al., 2016) to robotic control (Schulman et al., 2015). The combination of RL and high-capacity function approximators such as neural networks holds the promise of automating a wide range of decision making and control tasks, but widespread adoption of these methods in real-world domains has been hampered by two major challenges. First, model-free deep RL methods are notoriously expensive in terms of their sample complexity. Even relatively simple tasks can require millions of steps of data collection, and complex behaviors with high-dimensional observations might need substantially more. Second, these methods are often brittle with respect to their hyperparameters: learning rates, exploration constants, and other settings must be set carefully for different problem settings to achieve good results. Both of these challenges severely limit the applicability of model-free deep RL to real-world tasks.

One cause for the poor sample efficiency of deep RL methods is on-policy learning: some of the most commonly used deep RL algorithms, such as TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017b) or A3C (Mnih et al., 2016), require new samples to be collected for each gradient step. This quickly becomes extravagantly expensive, as the number of gradient steps and samples per step needed to learn an effective policy increases with task complexity. Off-policy algorithms aim to reuse past experience. This is not directly feasible with conventional policy gradient formulations, but is relatively straightforward for Q-learning based methods (Mnih et al., 2015). Unfortunately, the combination of off-policy learning and high-dimensional, nonlinear function approximation with neural networks presents a major challenge for stability and convergence (Bhatnagar et al., 2009). This challenge is further exacerbated in continuous state and action spaces, where a separate actor network is often used to perform the maximization in Q-learning. A commonly used algorithm in such settings, deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), provides for sample-efficient learning but is notoriously challenging to use due to its extreme brittleness and hyperparameter sensitivity (Duan et al., 2016; Henderson et al., 2017).
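To make the on-policy versus off-policy contrast concrete, the following minimal Python sketch shows the kind of replay buffer that lets an off-policy learner reuse past transitions across many gradient steps; the class and method names are hypothetical illustrations, not code from the paper.

    import random

    class ReplayBuffer:
        """Minimal experience replay buffer (illustrative sketch only)."""

        def __init__(self, capacity=1_000_000):
            self.capacity = capacity
            self.storage = []

        def add(self, state, action, reward, next_state, done):
            # Each stored transition can be reused for many later gradient steps.
            if len(self.storage) >= self.capacity:
                self.storage.pop(0)  # drop the oldest transition
            self.storage.append((state, action, reward, next_state, done))

        def sample(self, batch_size=256):
            # Off-policy methods draw random minibatches of old experience.
            batch = random.sample(self.storage, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return states, actions, rewards, next_states, dones

An on-policy algorithm such as TRPO or PPO would instead need to collect new samples for each gradient step, which is the sample-cost issue described above.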

We explore how to design an efficient and stable model-free deep RL algorithm for continuous state and action spaces. To that end, we draw on the maximum entropy framework, which augments the standard maximum reward reinforcement learning objective with an entropy maximization term (Ziebart et al., 2008; Toussaint, 2009; Rawlik et al., 2012; Fox et al., 2016; Haarnoja et al., 2017). Maximum entropy reinforcement learning alters the RL objective, though the original objective can be recovered using a temperature parameter (Haarnoja et al., 2017). More importantly, the maximum entropy formulation provides a substantial improvement in exploration and robustness: as discussed by Ziebart (2010), maximum entropy policies are robust in the face of model and estimation errors, and as demonstrated by Haarnoja et al. (2017), they improve exploration by acquiring diverse behaviors. Prior work has proposed model-free deep RL algorithms that perform on-policy learning with entropy maximization (O'Donoghue et al., 2016), as well as off-policy methods based on soft Q-learning and its variants (Schulman et al., 2017a; Nachum et al., 2017a; Haarnoja et al., 2017). However, the on-policy variants suffer from poor sample complexity for the reasons discussed above, while the off-policy variants require complex approximate inference procedures in continuous action spaces.
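The entropy-augmented objective referred to here is conventionally written with a temperature parameter \alpha that weighs the entropy term against the reward; the notation below follows common usage in this line of work rather than being reproduced from the transcription:

    J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]

As \alpha \to 0 the entropy term vanishes and the conventional maximum expected reward objective is recovered, which is the sense in which the original objective can be recovered using a temperature parameter.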

In this paper, we demonstrate that we can devise an off-policy maximum entropy actor-critic algorithm, which we call soft actor-critic (SAC), which provides for both sample-efficient learning and stability. This algorithm extends readily to very complex, high-dimensional tasks, such as the Humanoid benchmark (Duan et al., 2016) [...]

From the paper's discussion of prior actor-critic methods: [...] starting from policy iteration, which alternates between policy evaluation (computing the value function for a policy) and policy improvement (using the value function to obtain a better policy) (Barto et al., 1983; Sutton & Barto, 1998). In large-scale reinforcement learning problems, it is typically impractical to run either of these steps to convergence, and instead the value function and policy are optimized jointly. In this case, the policy is referred to as the actor, and the value function as the critic. Many actor-critic algorithms build on the standard, on-policy policy gradient formulation to update the actor (Peters & Schaal, 2008), and many of them also consider the entropy of the policy, but instead of maximizing the entropy, they use it as a regularizer (Schulman et al., 2017b; 2015; Mnih et al., 2016; Gruslys et al., 2017). On-policy training tends to improve stability but results in poor sample complexity. There have been efforts to increase the sample efficiency while retaining robustness by incorporating off-policy samples and by using higher order variance reduction techniques (O'Donoghue et al., 2016; Gu et al., 2016). However, fully off-policy algorithms still attain better efficiency. A particularly popular off-policy actor-critic method, DDPG (Lillicrap et al., 2015), which is a deep variant of the deterministic policy gradient (Silver et al., 2014) algorithm, uses a Q-function estimator to enable off-policy learning [...]
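For reference, the policy iteration scheme mentioned in the excerpt above alternates between two steps, written here in their standard textbook form (following e.g. Sutton & Barto, 1998; the notation is not taken from the transcription):

    \text{Policy evaluation:}\quad Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\big[ Q^{\pi}(s', a') \big]

    \text{Policy improvement:}\quad \pi'(s) = \arg\max_{a} Q^{\pi}(s, a)

Actor-critic methods run approximate versions of these two steps jointly rather than to convergence, which is the actor (policy) and critic (value function) split described above.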

