
Dynamic Programming and Optimal Control 3rd Edition, …


Transcription of Dynamic Programming and Optimal Control 3rd Edition, …

Dynamic Programming and Optimal Control
3rd Edition, Volume II
by Dimitri P. Bertsekas
Massachusetts Institute of Technology

Chapter 6
Approximate Dynamic Programming

This is an updated version of the research-oriented Chapter 6 on Approximate Dynamic Programming. It will be periodically updated as new research becomes available, and will replace the current Chapter 6 in the book's next printing. In addition to editorial revisions, rearrangements, and new exercises, the chapter includes an account of new research, which is collected mostly in later sections. Furthermore, a lot of new material has been added, such as an account of post-decision state simplifications, regression-based TD methods, feature scaling, policy oscillations, λ-policy iteration and exploration-enhanced TD methods, aggregation methods, new Q-learning algorithms, and Monte Carlo linear algebra.

This chapter represents work in progress. It more than likely contains errors (hopefully not serious ones). Furthermore, its references to the literature are incomplete. Your comments and suggestions to the author are welcome. The date of the last revision is 2011.

6 Approximate Dynamic Programming

Contents

  General Issues of Cost Approximation
  Approximation Architectures
  Approximate Policy Iteration
  Direct and Indirect Approximation
  Simplifications
  Monte Carlo Simulation
  Contraction Mappings and Simulation
  Direct Policy Evaluation - Gradient Methods
  Projected Equation Methods
  The Projected Bellman Equation
  Projected Value Iteration - Other Iterative Methods
  Simulation-Based Methods
  LSTD, LSPE, and TD(0) Methods
  Optimistic Versions
  Multistep Simulation-Based Methods
  Policy Iteration Issues - Exploration
  Policy Oscillations - Chattering
  λ-Policy Iteration
  A Synopsis
  Aggregation Methods
  Cost Approximation via the Aggregate Problem
  Cost Approximation via the Enlarged Problem
  Q-Learning
  Convergence Properties of Q-Learning
  Q-Learning and Approximate Policy Iteration
  Q-Learning for Optimal Stopping Problems
  Finite Horizon Q-Learning
  Stochastic Shortest Path Problems
  Average Cost Problems
  Approximate Policy Evaluation
  Approximate Policy Iteration
  Q-Learning for Average Cost Problems
  Simulation-Based Solution of Large Systems
  Projected Equations - Simulation-Based Versions
  Matrix Inversion and Regression-Type Methods
  Iterative/LSPE-Type Methods
  Multistep Methods
  Extension of Q-Learning for Optimal Stopping
  Bellman Equation Error-Type Methods
  Oblique Projections
  Generalized Aggregation by Simulation
  Approximation in Policy Space
  The Gradient Formula
  Computing the Gradient by Simulation
  Essential Features of Critics
  Approximations in Policy and Value Space
  Notes, Sources, and Exercises
  References

In this chapter we consider approximation methods for challenging, computationally intensive DP problems. We discussed a number of such methods in Chapter 6 of Vol. I and Chapter 1 of the present volume, such as, for example, rollout and other one-step lookahead approaches.

Here our focus will be on algorithms that are mostly patterned after two principal methods of infinite horizon DP: policy and value iteration. These algorithms form the core of a methodology known by various names, such as approximate dynamic programming or neuro-dynamic programming. A principal aim of the methods of this chapter is to address problems with a very large number of states n. In such problems, ordinary linear algebra operations, such as n-dimensional inner products, are prohibitively time-consuming, and indeed it may be impossible to even store an n-vector in a computer memory.
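
As the next paragraph explains, the methods of this chapter work with quantities of much smaller dimension than n. As a toy illustration of why this is possible, consider a linear feature-based representation of the kind introduced later in the chapter: an approximation to an n-vector of costs is carried by a low-dimensional weight vector and evaluated one component at a time, so the full n-vector never has to be formed. The feature map and weights below are arbitrary choices made for the example, not taken from the text.

    import numpy as np

    n = 10**9    # number of states: far too many to store an n-vector of costs

    def phi(i):
        # Illustrative 4-dimensional feature vector of state i.
        return np.array([1.0, i / n, (i / n) ** 2, np.sin(2 * np.pi * i / n)])

    # The approximation is carried by 4 weights rather than n cost values.
    r = np.array([1.0, -0.5, 0.25, 0.1])    # e.g. obtained by training

    def J_tilde(i):
        # Approximate cost of state i, computed on demand as phi(i)'r.
        return float(phi(i) @ r)

    print(J_tilde(123456789))    # one component of the never-stored n-vector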

Our methods will involve linear algebra operations of dimension much smaller than n, and require only that the components of n-vectors are generated when needed rather than stored. Another aim of the methods of this chapter is to address model-free situations, i.e., problems where a mathematical model is unavailable or hard to construct. Instead, the system and cost structure may be simulated (think, for example, of a queueing network with complicated but well-defined service disciplines at the queues). The assumption here is that there is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities pij(u), and also generates a corresponding transition cost g(i, u, j).
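
To make the simulator assumption concrete, here is a minimal Python sketch of what such a program might look like. The class name, the two-state transition data, and the cost function are illustrative assumptions, not anything specified in the text; the only point is the step interface, which draws a successor state j according to pij(u) and reports the cost g(i, u, j).

    import random

    class MarkovChainSimulator:
        # Hypothetical simulator: given state i and control u, draw a
        # successor state j with probability p_ij(u) and return it
        # together with the transition cost g(i, u, j).
        def __init__(self, transition_probs, cost):
            # transition_probs[u][i] is a list of (j, p_ij(u)) pairs
            self.transition_probs = transition_probs
            self.cost = cost    # callable: cost(i, u, j) -> float

        def step(self, i, u):
            successors, probs = zip(*self.transition_probs[u][i])
            j = random.choices(successors, weights=probs, k=1)[0]
            return j, self.cost(i, u, j)

    # Illustrative two-state, single-control example with made-up numbers.
    sim = MarkovChainSimulator(
        transition_probs={0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]}},
        cost=lambda i, u, j: 1.0 if j == 1 else 0.0,
    )
    print(sim.step(0, 0))    # e.g. (1, 1.0)

In a genuinely model-free setting the step method would instead wrap an existing simulation program; the methods discussed below need only this interface and never use the transition probabilities directly.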

Given a simulator, it may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging, and then to apply the methods discussed in earlier chapters. The methods of this chapter, however, are geared towards an alternative possibility, which is much more attractive when one is faced with a large and complex system and one contemplates approximations. Rather than estimate explicitly the transition probabilities and costs, we will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs, and by using some form of least squares fit.
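
As a rough illustration of this idea (and only that; the chapter develops more refined simulation-based methods than plain Monte Carlo), the sketch below fits a linear-in-features approximation of the cost of a fixed policy by a least squares fit to sampled trajectory costs. The function names, the discount factor, and the feature map phi are assumptions made for the example.

    import numpy as np

    def sampled_cost(sim, policy, i0, horizon, alpha):
        # Discounted cost accumulated along one simulated trajectory
        # started at state i0 and controlled by the given policy.
        i, total, scale = i0, 0.0, 1.0
        for _ in range(horizon):
            u = policy(i)
            i, g = sim.step(i, u)
            total += scale * g
            scale *= alpha
        return total

    def fit_linear_cost(sim, policy, phi, states, horizon=200, alpha=0.95, reps=20):
        # Least squares fit of J_mu(i) ~ phi(i)'r from simulated costs,
        # where phi(i) is a short feature vector (numpy array).
        rows, targets = [], []
        for i in states:
            for _ in range(reps):
                rows.append(phi(i))
                targets.append(sampled_cost(sim, policy, i, horizon, alpha))
        r, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
        return r    # weight vector of the approximation phi(i)'r

The dimension of r equals the number of features, which can be far smaller than the number of states; this is the sense in which the linear algebra involved remains low-dimensional.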

Implicit in the rationale of methods based on cost function approximation is of course the hypothesis that a more accurate cost-to-go approximation will yield a better one-step or multistep lookahead policy. This is a reasonable but by no means self-evident conjecture, and may in fact not even be true in a given problem. In another type of method, which we will discuss somewhat briefly, we use simulation in conjunction with a gradient or other method to approximate directly an optimal policy with a policy of a given parametric form.
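
For orientation, here is one way such a simulation-based policy-space method can look in code: a randomized policy of softmax form with parameter vector theta, updated using a likelihood-ratio (score-function) gradient estimate computed from simulated trajectories. This is a generic sketch under assumed choices (the softmax parameterization, the state-control feature map psi, the finite horizon, and the step size); it is not the specific gradient formula developed later in the chapter.

    import numpy as np

    def action_probs(theta, psi, i, controls):
        # Softmax policy: P(u | i) is proportional to exp(theta . psi(i, u)).
        prefs = np.array([theta @ psi(i, u) for u in controls])
        e = np.exp(prefs - prefs.max())
        return e / e.sum()

    def policy_gradient_step(theta, sim, psi, controls, i0,
                             horizon=100, episodes=50, step=0.01, seed=0):
        # One gradient step on the expected finite-horizon cost, using a
        # score-function estimate averaged over simulated trajectories.
        rng = np.random.default_rng(seed)
        grad = np.zeros_like(theta)
        for _ in range(episodes):
            i, cost, score = i0, 0.0, np.zeros_like(theta)
            for _ in range(horizon):
                p = action_probs(theta, psi, i, controls)
                k = rng.choice(len(controls), p=p)
                u = controls[k]
                # Gradient of log P(u | i) under the softmax parameterization.
                score += psi(i, u) - sum(pk * psi(i, v) for pk, v in zip(p, controls))
                i, g = sim.step(i, u)
                cost += g
            grad += (cost / episodes) * score
        return theta - step * grad    # move downhill, since cost is minimized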

