Example: confidence

Bellman Equations and Dynamic Programming

Part 6: Core Theory II: Bellman Equations and Dynamic ProgrammingIntroduction to Reinforcement LearningBellman Equations Recursive relationships among values that can be used to compute valuesThe tree of transition dynamicsa path, or trajectorystateactionpossible pathThe web of transition dynamicsa path, or trajectorystateactionpossible pathThe web of transition dynamicsbackup diagramstateactionpossible path4 Bellman -equation backup diagrams representing recursive relationships among valuesstate valuesaction valuespredictioncontrolmaxmaxmaxstateact ionpossible pathR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction10 Bellman Equation for a Policy Gt=Rt+1+ Rt+2+ 2Rt+3+ 3Rt+4L=Rt+1+ Rt+2+ Rt+3+ 2Rt+4L()=Rt+1+ Gt+1 The basic idea: So: v (s)=E GtSt=s{}=E Rt+1+ v St+1()St=s{}Or, without the expectation operator.

Programming Introduction to Reinforcement Learning. Bellman Equations Recursive relationships among values that can be used to compute values. The tree of transition dynamics a path, or trajectory state action ... The basic idea: So: v

Tags:

  Programming, Basics

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Bellman Equations and Dynamic Programming

1 Part 6: Core Theory II: Bellman Equations and Dynamic ProgrammingIntroduction to Reinforcement LearningBellman Equations Recursive relationships among values that can be used to compute valuesThe tree of transition dynamicsa path, or trajectorystateactionpossible pathThe web of transition dynamicsa path, or trajectorystateactionpossible pathThe web of transition dynamicsbackup diagramstateactionpossible path4 Bellman -equation backup diagrams representing recursive relationships among valuesstate valuesaction valuespredictioncontrolmaxmaxmaxstateact ionpossible pathR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction10 Bellman Equation for a Policy Gt=Rt+1+ Rt+2+ 2Rt+3+ 3Rt+4L=Rt+1+ Rt+2+ Rt+3+ 2Rt+4L()=Rt+1+ Gt+1 The basic idea: So: v (s)=E GtSt=s{}=E Rt+1+ v St+1()St=s{}Or, without the expectation operator.

2 +..+v (s)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)iv (s)=E Rt+1+ Rt+2+ 2Rt+3+ St=s =E [Rt+1+ v (St+1)|St=s](1)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)i,(2)iR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction11 More on the Bellman EquationThis is a set of Equations (in fact, linear), one for each value function for is its unique diagrams:for v for q v (s)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)iv (s)=E Rt+1+ Rt+2+ 2Rt+3+ St=s =E [Rt+1+ v (St+1)|St=s](1)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)i,(2)iR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction12 Gridworld Actions: north, south, east, west; deterministic. If would take agent off the grid: no move but reward = 1 Other actions produce reward = 0, except actions that move agent out of special states A and B as function for equiprobable random policy; = S.

3 Sutton and A. G. Barto: Reinforcement Learning: An Introduction13 Bellman Optimality Equation for q*The relevant backup diagram: is the unique solution of this system of nonlinear *s,asas'ra's'r(b)(a)maxmax68 CHAPTER 3. THE REINFORCEMENT LEARNING PROBLEMq (s,driver). These are the values of each state if we first play a stroke withthe driver and afterward select either the driver or the putter, whichever isbetter. The driver enables us to hit the ball farther, but with less can reach the hole in one shot using the driver only if we are already veryclose; thus the 1contourforq (s,driver)coversonlyasmallportionofthe green. If we have two strokes, however, then we can reach the hole frommuch farther away, as shown by the 2contour. Inthiscasewedon thaveto drive all the way to within the small 1contour,butonlytoanywhereon the green; from there we can use the putter.

4 The optimal action-valuefunction gives the values after committing to a particularfirstaction, in thiscase, to the driver, but afterward using whichever actions are best. The 3contour is still farther out and includes the starting tee. From the tee, the bestsequence of actions is two drives and one putt, sinking the ball in three is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values ( ).Because it is the optimal value function, however,v s consistency conditioncan be written in a special form without reference to any specific policy. Thisis the Bellman equation forv ,ortheBellman optimality equation. Intuitively,the Bellman optimality equation expresses the fact that the value of a stateunder an optimal policy must equal the expected return for the best actionfrom that state:v (s)= maxa2A(s)q (s,a)=maxaE [Gt|St=s,At=a]=maxaE "1Xk=0 kRt+k+1 St=s,At=a#=maxaE "Rt+1+ 1Xk=0 kRt+k+2 St=s,At=a#=maxaE[Rt+1+ v (St+1)|St=s,At=a]( )=maxa2A(s)Xs0,rp(s0,r|s,a) r+ v (s0).

5 ( )The last two Equations are two forms of the Bellman optimality equation forv . The Bellman optimality equation forq isq (s,a)=EhRt+1+ maxa0q (St+1,a0) St=s,At=ai=Xs0,rp(s0,r|s,a)hr+ maxa0q (s0,a0) Programming Using Bellman Equations to compute values and optimal policies (thus a form of planning)R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction3 Iterative Methodsa sweep A sweep consists of applying a backup operation to each full policy-evaluation backup:v0!v1! !vk!vk+1! !v v (s)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)ivk+1(s)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ vk(s0)i8s2Sv (s)=E [Gt|St=s]=E "1Xk=0 kRt+k+1 St=s#v (s)=E Rt+1+ Rt+2+ 2Rt+3+ St=s =E [Rt+1+ v (St+1)|St=s](1)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)i,(2)v (s) = maxaq (s,a)= maxaE[Rt+1+ v (St+1)|St=s,At=a](3)= maxaXs0,rp(s0,r|s,a) r+ v (s0).

6 (4)iv0!v1! !vk!vk+1! !v v (s)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)ivk+1(s)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ vk(s0)i8s2Sv (s)=E [Gt|St=s]=E "1Xk=0 kRt+k+1 St=s#v (s)=E Rt+1+ Rt+2+ 2Rt+3+ St=s =E [Rt+1+ v (St+1)|St=s](1)=Xa (a|s)Xs0,rp(s0,r|s,a)hr+ v (s0)i,(2)v (s) = maxaq (s,a)= maxaE[Rt+1+ v (St+1)|St=s,At=a](3)= maxaXs0,rp(s0,r|s,a) r+ v (s0) .(4)iR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction5A Small Gridworld An undiscounted episodic task Nonterminal states: 1, 2, .., 14; One terminal state (shown twice as shaded squares) Actions that would take agent off the grid leave state unchanged Reward is 1 until the terminal state is reachedR = 1R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction6 Iterative Policy Eval for the Small Gridworld = equiprobable random action choices R = 1 An undiscounted episodic task Nonterminal states: 1, 2.

7 , 14; One terminal state (shown twice as shaded squares) Actions that would take agent off the grid leave state unchanged Reward is 1 until the terminal state is reachedR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction4 Iterative Policy Evaluation One array version86 CHAPTER 4. Dynamic PROGRAMMINGI nput , the policy to be evaluatedInitialize an arrayV(s)=0,foralls2S+Repeat 0 For eachs2S:v V(s)V(s) Pa (a|s)Ps0,rp(s0,r|s,a) r+ V(s0) max( ,|v V(s)|)until < (a small positive number)OutputV v Figure : Iterative policy implementation point concerns the termination of the , iterative policy evaluation converges only in the limit, but in practiceit must be halted short of this. A typical stopping condition for iterative policyevaluation is to test the quantity maxs2S|vk+1(s) vk(s)|after each sweep andstop when it is su ciently small.

8 Figure gives a complete algorithm foriterative policy evaluation with this stopping the 4 = !1on all transitions1234567891011121314 RThe nonterminal states areS={1,2,..,14}. There are four actions pos-sible in each state,A={up,down,right,left}, which deterministicallycause the corresponding state transitions, except that actions that would takethe agent o the grid in fact leave the state unchanged. Thus, for instance,p(6|5,right)=1,p(10|5,right)=0, andp(7|7,right)= , episodic task. The reward is 1 on all transitions until the terminalstate is reached. The terminal state is shaded in the figure (although it isshown in two places, it is formally one state). The expected reward function isthusr(s,a,s0)= 1forallstatess,s0and actionsa. Suppose the agent followsthe equiprobable random policy (all actions equally likely).

9 The left side ofFigure shows the sequence of value functions{vk}computed by iterativepolicy evaluation. The final estimate is in factv , which in this case gives foreach state the negation of the expected number of steps from that state untilR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction6 Iterative Policy Eval for the Small Gridworld An undiscounted episodic task Nonterminal states: 1, 2, .., 14; One terminal state (shown twice as shaded squares) Actions that would take agent off the grid leave state unchanged Reward is 1 until the terminal state is reached = equiprobable random action choicesR = 1R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction17 Gambler s Problem Gambler can repeatedly bet $ on a coin flip Heads he wins his stake, tails he loses it Initial capital {$1, $2.}

10 $99} Gambler wins if his capital becomes $100 loses if it becomes $0 Coin is unfair!Heads (gambler wins) with probability p = .4 States, Actions, Rewards?R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction18 Gambler s Problem SolutionR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction19 Gambler s Problem SolutionR. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction22 Generalized Policy IterationGeneralized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their geometric metaphor forconvergence of GPI: 100 CHAPTER 4. Dynamic Programming vevaluationv v improvement greedy(v)v* *Figure : Generalized policy iteration: Value and policy functions interactuntil they are optimal and thus consistent with each driven toward the value function for the policy.


Related search queries