
A Survey on Deep Reinforcement Learning (深度强化学习综述) - ict.ac.cn


CHINESE JOURNAL OF COMPUTERS, Vol. 40, 2017, Online Publishing

A Survey on Deep Reinforcement Learning

LIU Quan, ZHAI Jian-Wei, ZHANG Zong-Zhang, ZHONG Shan, ZHOU Qian, ZHANG Peng, XU Jin

(Supported by grants 61472262, 61303108, 61373094, 61502323, 61502329, SYG201422, and SYG201308)

Citation: LIU Quan, ZHAI Jian-Wei, ZHANG Zong-Zhang, ZHONG Shan, ZHOU Qian, ZHANG Peng, XU Jin. A Survey on deep reinforcement learning. Chinese Journal of Computers, 2017, Online Publishing.

1) School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006
2) Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000

Abstract Deep reinforcement learning (DRL) is a new research hotspot in the artificial intelligence community. By using a general-purpose form, DRL integrates the advantages of the perception of deep learning (DL) and the decision making of reinforcement learning (RL), and obtains control outputs directly from raw inputs through an end-to-end learning process. Since it was proposed, DRL has made substantial breakthroughs in a variety of tasks that require both rich perception of high-dimensional raw inputs and policy control.

In this paper, we systematically describe three main categories of DRL methods. First, we summarize value-based DRL methods. The core idea behind them is to approximate the value function with deep neural networks, which have a strong perception ability. We introduce the epoch-making value-based DRL method, the deep Q-Network (DQN), and its variants. These variants fall into two categories: improvements to the training algorithm and improvements to the model architecture. The first category includes the deep Double Q-Network (DDQN), DQN based on the advantage learning technique, and DDQN with proportional prioritization. The second includes the deep Recurrent Q-Network (DRQN) and a method based on the Dueling Network architecture.
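To make the difference between DQN and DDQN concrete, the following minimal NumPy sketch contrasts the two target computations. The array names `next_q_online` and `next_q_target`, standing for the next-state Q-values produced by the online and target networks, are hypothetical placeholders; the surveyed methods use full deep networks, which are not shown here.

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates
    # the next action, which tends to overestimate action values.
    return reward + gamma * np.max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    # Double DQN (DDQN): the online network selects the action and
    # the target network evaluates it, reducing overestimation bias.
    best_action = int(np.argmax(next_q_online))
    return reward + gamma * next_q_target[best_action]
```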

In general, value-based DRL methods are good at dealing with large-scale problems that have discrete action spaces. We then summarize policy-based DRL methods. Their core idea is to use deep neural networks to parameterize the policies and to improve the policies with optimization methods. In this part, we first highlight some pure policy gradient methods, then focus on a series of policy-based DRL algorithms that use the actor-critic framework, such as the deep Deterministic Policy Gradient (DDPG), followed by an effective method named Asynchronous Advantage Actor-Critic (A3C), which dramatically reduces training time. Compared to value-based methods, policy-based DRL methods have a wider range of successful applications in complex problems with continuous action spaces.
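As a minimal illustration of the pure policy gradient idea, the sketch below performs one REINFORCE update on a softmax policy with linear features. All names, shapes, and the feature representation `phi` are assumptions made for this example, not the survey's algorithms, which parameterize the policy with deep networks.

```python
import numpy as np

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    # theta:   weight matrix of shape (num_features, num_actions)
    # episode: list of (phi, action, reward) tuples, where phi is the
    #          feature vector of the visited state
    # Compute the discounted return G_t for every time step.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    # Gradient ascent: theta += alpha * G_t * grad log pi(a_t | s_t).
    for (phi, a, _), g in zip(episode, returns):
        logits = phi @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad_log_pi = -np.outer(phi, probs)  # -phi * pi(a'|s) for each a'
        grad_log_pi[:, a] += phi             # +phi for the taken action
        theta = theta + alpha * g * grad_log_pi
    return theta
```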

We lastly introduce a DRL method based on search and supervision, known as AlphaGo. Its core idea is to improve the efficiency of policy optimization by introducing extra supervision and policy search techniques. This paper then summarizes some cutting-edge research directions of DRL, including hierarchical DRL methods, which decompose an ultimate RL goal into sub-goals; multi-task and transfer DRL methods, which take full advantage of the correlations between multiple tasks and transfer useful information to new tasks; multi-agent DRL methods, which enable cooperation and communication between multiple agents; DRL based on memory and reasoning, which can be applied to high-level cognitive tasks; and methods that balance exploration and exploitation.
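The simplest baseline for the exploration-exploitation balance mentioned above is epsilon-greedy action selection. The sketch below is illustrative only; the function and parameter names are not from the paper, which surveys more sophisticated exploration strategies.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    # With probability epsilon, explore: pick a uniformly random action.
    # Otherwise, exploit: pick the action with the highest estimated value.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```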

Next, we summarize some successful applications of DRL in different fields, such as games, robotics, computer vision, natural language processing, and parameter optimization. Finally, we discuss some potential trends in DRL's future development.

Keywords artificial intelligence; deep learning; reinforcement learning; deep reinforcement learning

1 Introduction

Deep learning (DL) [1] has made remarkable progress on a wide range of perception tasks [2-9]. Reinforcement learning (RL) [10] addresses sequential decision-making problems [11-16], in which an agent learns a control policy through trial-and-error interaction with its environment [17].

Combining the perception ability of DL with the decision-making ability of RL, DeepMind proposed deep reinforcement learning (DRL), in which an agent is trained end-to-end, mapping raw inputs directly to control outputs. DRL has since achieved substantial success across a variety of domains [18-27] and is widely regarded as an important step toward Artificial General Intelligence (AGI). The remainder of this paper surveys the three main categories of DRL methods (value-based, policy-based, and search-and-supervision-based), frontier research directions, and applications, and closes with a discussion of open problems.

2 Deep Learning

DL grew out of research on the Artificial Neural Network (ANN), notably the Multi-Layer Perceptron (MLP) [28]. In 2006, Hinton et al. [29] introduced an effective layer-wise training scheme for deep architectures, which triggered the current wave of DL research [30-31].

Representative DL models include the Stacked Auto-Encoder (SAE) [32-33], the Restricted Boltzmann Machine (RBM) [33-34], the deep Belief Network (DBN) [35-36], the Recurrent Neural Network (RNN), and the Convolutional Neural Network (CNN). In 2012, Krizhevsky et al. [2] proposed AlexNet, which sharply reduced the top-5 error rate on the ImageNet benchmark. In 2014, Simonyan et al. [37] of the Visual Geometry Group (VGG) proposed VGG-Net, and He et al. [38] improved performance further. Lin et al. [39] proposed Network in Network (NIN), and Szegedy et al. [40] combined the Inception module with ideas from NIN to build GoogleNet, the winner of the 2014 ILSVRC. He et al. [41] then proposed the deep Residual Network (DRN), which won the 2015 ILSVRC; Szegedy et al. [42] merged Inception with DRN to obtain the Inception Residual Network (IRN), and He et al. [43] refined residual learning with the Identity Mapping Residual Network (IMRN). For sequential data, the RNN [44] and the Attention Mechanism (AM) [45] are widely used.
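The residual idea behind DRN [41] can be summarized in a few lines. The following is a minimal NumPy sketch: the weight matrices W1 and W2 are hypothetical and assumed shape-compatible, and real residual blocks use convolutions and batch normalization rather than dense layers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # The stacked layers learn only the residual F(x); the identity
    # shortcut "out + x" lets gradients flow unchanged through the
    # block, which is what makes very deep networks trainable.
    out = relu(x @ W1)    # first transformation
    out = out @ W2        # second transformation
    return relu(out + x)  # identity shortcut, then nonlinearity
```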

3 Reinforcement Learning

In RL, an agent learns through trial-and-error interaction with its environment [17]. The interaction is usually modeled as a Markov Decision Process (MDP), defined by a tuple $\langle S, A, R, f \rangle$:

1) $S$ is the state space, where $s_t \in S$ is the state of the agent at time step $t$;

2) $A$ is the action space, where $a_t \in A$ is the action taken by the agent at time step $t$;

3) $R: S \times A \to \mathbb{R}$ is the reward function, where $r_t \sim R(s_t, a_t)$ is the reward the agent receives for taking action $a_t$ in state $s_t$;

4) $f: S \times A \times S \to [0, 1]$ is the state transition function, where $s_{t+1} \sim f(s_t, a_t)$ is the state the agent reaches after taking action $a_t$ in state $s_t$.

A policy $\pi: S \to A$ maps states to actions. The return from time step $t$ is the discounted sum of future rewards,

$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}, \quad \gamma \in [0,1], \tag{1}$$

where $\gamma$ is the discount factor and $T$ is the terminal time step. The action-value function $Q(s,a)$ is the expected return after taking action $a$ in state $s$,

$$Q(s,a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right], \tag{2}$$

and the optimal action-value function is the maximum over all policies,

$$Q^*(s,a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right], \tag{3}$$

which satisfies the Bellman optimality equation

$$Q^*(s,a) = \mathbb{E}_{s' \sim S}\left[r + \gamma \max_{a'} Q^*(s',a') \mid s, a\right]. \tag{4}$$

Many RL algorithms estimate $Q^*$ by value iteration,

$$Q_{i+1}(s,a) = \mathbb{E}_{s' \sim S}\left[r + \gamma \max_{a'} Q_i(s',a') \mid s, a\right], \tag{5}$$

which converges to the optimum, $Q_i \to Q^*$ as $i \to \infty$. The optimal policy then acts greedily: $a^* = \arg\max_{a \in A} Q^*(s,a)$.
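For a small MDP whose transition probabilities and rewards are known, Eq. (5) can be run to convergence directly. Below is a minimal tabular sketch; the inputs P and R are hypothetical arrays, with P[s, a, s'] = f(s, a, s') and R[s, a] the expected immediate reward.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P[s, a, s_next]: transition probabilities f(s, a, s_next)
    # R[s, a]:         expected immediate reward for the pair (s, a)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Eq. (5): Q_{i+1}(s,a) = E_{s'}[ r + gamma * max_{a'} Q_i(s',a') ]
        Q_next = R + gamma * (P @ Q.max(axis=1))
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next, Q_next.argmax(axis=1)  # Q* and greedy policy
        Q = Q_next
```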

In small problems, $Q(s,a)$ can be stored in a table, but tabular RL does not scale; for large state spaces, the value function must be approximated [46], which is the motivation behind DRL. Several precursors combined neural networks with RL. Riedmiller [47] proposed Neural Fitted Q Iteration (NFQ), which fits the Q-function with a neural network. Lange et al. [48] combined DL with RL via the deep Auto-Encoder (DAE); Abtahi et al. [49] used deep networks as function approximators for the RL agent; Lange et al. [50] proposed deep Fitted Q-learning (DFQ); and Koutnik et al. [51] combined Neural Evolution (NE) with RL.

4 Value-Based Deep Reinforcement Learning

Mnih et al. [18-19] combined RL with a deep convolutional network that approximates the Q-function of Q-learning [52], yielding the deep Q-Network (DQN), the landmark value-based DRL method.
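DQN's stability rests on two tricks: a separate target network (compare the DDQN sketch earlier) and experience replay. A minimal replay buffer might look like the following sketch; the class and method names are illustrative, not from the DQN paper.

```python
import random
from collections import deque

class ReplayBuffer:
    # Fixed-size store of (s, a, r, s_next, done) transitions. Sampling
    # minibatches uniformly at random breaks the strong temporal
    # correlation between consecutive transitions, stabilizing training.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)  # one tuple per column
        return s, a, r, s_next, done
```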

