
Reinforcement Learning: An Introduction



Transcription of Reinforcement Learning: An Introduction

Reinforcement Learning: An Introduction
Second edition, in progress
Richard S. Sutton and Andrew G. Barto
© 2014, 2015
A Bradford Book, The MIT Press, Cambridge, Massachusetts; London, England

In memory of A. Harry Klopf

Contents

Preface
Series Foreword
Summary of Notation

1 The Reinforcement Learning Problem
  Reinforcement Learning; Examples; Elements of Reinforcement Learning; Limitations and Scope; An Extended Example: Tic-Tac-Toe; Summary; History of Reinforcement Learning; Bibliographical Remarks

I Tabular Solution Methods

2 Multi-arm Bandits
  An n-Armed Bandit Problem; Action-Value Methods; Incremental Implementation; Tracking a Nonstationary Problem; Optimistic Initial Values; Upper-Confidence-Bound Action Selection; Gradient Bandits; Associative Search (Contextual Bandits); Summary

3 Finite Markov Decision Processes
  The Agent-Environment Interface; Goals and Rewards; Returns; Unified Notation for Episodic and Continuing Tasks; The Markov Property; Markov Decision Processes; Value Functions; Optimal Value Functions; Optimality and Approximation; Summary

4 Dynamic Programming
  Policy Evaluation; Policy Improvement; Policy Iteration; Value Iteration; Asynchronous Dynamic Programming; Generalized Policy Iteration; Efficiency of Dynamic Programming; Summary

5 Monte Carlo Methods
  Monte Carlo Prediction; Monte Carlo Estimation of Action Values; Monte Carlo Control; Monte Carlo Control without Exploring Starts; Off-Policy Prediction via Importance Sampling; Incremental Implementation; Off-Policy Monte Carlo Control; Importance Sampling on Truncated Returns; Summary

6 Temporal-Difference Learning
  TD Prediction; Advantages of TD Prediction Methods; Optimality of TD(0); Sarsa: On-Policy TD Control; Q-Learning: Off-Policy TD Control; Games, Afterstates, and Other Special Cases; Summary

7 Eligibility Traces
  TD Prediction; The Forward View of TD(λ); The Backward View of TD(λ); Equivalences of Forward and Backward Views; Sarsa(λ); Watkins's Q(λ); Off-Policy Eligibility Traces using Importance Sampling; Implementation Issues; Variable λ; Conclusions

8 Planning and Learning with Tabular Methods
  Models and Planning; Integrating Planning, Acting, and Learning; When the Model Is Wrong; Prioritized Sweeping; Full vs. Sample Backups; Trajectory Sampling; Heuristic Search; Monte Carlo Tree Search; Summary

II Approximate Solution Methods

9 On-Policy Approximation of Action Values
  Value Prediction with Function Approximation; Gradient-Descent Methods; Linear Methods; Control with Function Approximation; Should We Bootstrap?; Summary

10 Off-Policy Approximation of Action Values

11 Policy Approximation
  Actor-Critic Methods; Eligibility Traces for Actor-Critic Methods; R-Learning and the Average-Reward Setting

III Frontiers

12 Psychology

13 Neuroscience

14 Applications and Case Studies
  TD-Gammon; Samuel's Checkers Player; The Acrobot; Elevator Dispatching; Dynamic Channel Allocation; Job-Shop Scheduling

15
  The Unified View; State Estimation; Temporal Abstraction; Predictive Representations; Other Frontier Dimensions

References

Index

Preface

We first came to focus on what is now known as reinforcement learning in late 1979. We were both at the University of Massachusetts, working on one of the earliest projects to revive the idea that networks of neuronlike adaptive elements might prove to be a promising approach to artificial adaptive intelligence.

The project explored the heterostatic theory of adaptive systems developed by A. Harry Klopf. Harry's work was a rich source of ideas, and we were permitted to explore them critically and compare them with the long history of prior work in adaptive systems. Our task became one of teasing the ideas apart and understanding their relationships and relative importance. This continues today, but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behavior in order to maximize a special signal from its environment.

This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning. Like others, we had a sense that reinforcement learning had been thoroughly explored in the early days of cybernetics and artificial intelligence. On closer inspection, though, we found that it had been explored only slightly. While reinforcement learning had clearly motivated some of the earliest computational studies of learning, most of these researchers had gone on to other things, such as pattern classification, supervised learning, and adaptive control, or they had abandoned the study of learning altogether. As a result, the special issues involved in learning how to get something from the environment received relatively little attention.
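As an illustration for the reader, a minimal Python sketch of such a reward-maximizing learner is given below. It is not taken from the book: the three-armed bandit environment, the reward means, the step count, and the epsilon value are all illustrative assumptions, chosen only to show an agent that adapts its behavior to maximize a scalar reward signal from its environment.

```python
import random

def pull(arm):
    """Hypothetical environment: each arm pays a noisy scalar reward."""
    true_means = [0.1, 0.5, 0.8]          # made-up values for illustration
    return random.gauss(true_means[arm], 1.0)

def run(steps=1000, epsilon=0.1, n_arms=3):
    value_estimates = [0.0] * n_arms       # the agent's estimate of each action's reward
    counts = [0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        # Mostly exploit the action that currently looks best,
        # occasionally explore a random one.
        if random.random() < epsilon:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: value_estimates[a])
        reward = pull(arm)
        counts[arm] += 1
        # Incremental average: the estimate moves toward the rewards observed.
        value_estimates[arm] += (reward - value_estimates[arm]) / counts[arm]
        total_reward += reward
    return value_estimates, total_reward

if __name__ == "__main__":
    estimates, total = run()
    print("learned action values:", estimates)
    print("total reward:", total)
```

The agent "wants" reward in exactly the sense described above: it keeps estimates of how much reward each action yields and gradually shifts its behavior toward the actions that have paid off.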

In retrospect, focusing on this idea was the critical step that set this branch of research in motion. Little progress could be made in the computational study of reinforcement learning until it was recognized that such a fundamental idea had not yet been thoroughly explored. The field has come a long way since then, evolving and maturing in several directions. Reinforcement learning has gradually become one of the most active research areas in machine learning, artificial intelligence, and neural network research. The field has developed strong mathematical foundations and impressive applications. The computational study of reinforcement learning is now a large field, with hundreds of active researchers around the world in diverse disciplines such as psychology, control theory, artificial intelligence, and neuroscience.

Particularly important have been the contributions establishing and developing the relationships to the theory of optimal control and dynamic programming. The overall problem of learning from interaction to achieve goals is still far from being solved, but our understanding of it has improved significantly. We can now place component ideas, such as temporal-difference learning, dynamic programming, and function approximation, within a coherent perspective with respect to the overall problem. Our goal in writing this book was to provide a clear and simple account of the key ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible to readers in all of the related disciplines, but we could not cover all of these perspectives in detail.

