Example: barber

# Infinite-Horizon Discounted Markov Decision …

Infinite-Horizon Discounted Markov Decision processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1. Outline The expected total Discounted reward Policy evaluation Optimality equations Value iteration Policy iteration Linear Programming Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 2. Expected Total Reward Criterion Let = (d1 , d2 , .. ) HR. Starting at a state s, using policy leads to a sequence of state-action pairs {Xt , Yt }. The sequence of rewards is given by {Rt rt (Xt , Yt ) : t = 1, 2, .. }. Let [0, 1) be the discount factor The expected total rewards from policy starting in state s is given by " N #. X. v (s) lim E s t 1 r (Xt , Yt ) . N . t=1. The limit above exists when r ( ) is bounded; , sups S,a As |r (s, a)| = M <.]

In nite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 In nite Horizon Discounted MDP 1

### Information

Domain:

Source:

Please notify us if you found a problem with this document:

### Transcription of Infinite-Horizon Discounted Markov Decision …

1 Infinite-Horizon Discounted Markov Decision processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1. Outline The expected total Discounted reward Policy evaluation Optimality equations Value iteration Policy iteration Linear Programming Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 2. Expected Total Reward Criterion Let = (d1 , d2 , .. ) HR. Starting at a state s, using policy leads to a sequence of state-action pairs {Xt , Yt }. The sequence of rewards is given by {Rt rt (Xt , Yt ) : t = 1, 2, .. }. Let [0, 1) be the discount factor The expected total rewards from policy starting in state s is given by " N #. X. v (s) lim E s t 1 r (Xt , Yt ) . N . t=1. The limit above exists when r ( ) is bounded; , sups S,a As |r (s, a)| = M <.]

2 Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 3. Expected Total Reward Criterion Under suitable conditions (such as the boundedness of r ( )), we have " N # " #. X X. v (s) lim E s t 1 r (Xt , Yt ) = E s t 1 r (Xt , Yt ) . N . t=1 t=1. Let . " #. X.. v (s) E s r (Xt , Yt ) . t=1. We have v (s) = lim 1 v (s) whenever v (s) exists. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 4. Optimality Criteria A policy is discount optimal for [0, 1) if . v (s) v (s), s S, HR . The value of a Discounted MDP is defined by v (s) sup v (s), s S. HR.. Let be a discount optimal policy. Then v (s) = v (s) for all s S. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 5. Vector Notation for Markov Decision processes Let V denote the set of bounded real valued functions on S.]

3 With componentwise partial order and norm ||v || sups S |v (s)|. The corresponding matrix norm is given by X. ||H|| sup |H(j|s)|, s S j S. where H(j|s) denotes the (s, j)-th component of H. Let e V denote the function with all components equal to 1; that is, e(s) = 1 for all s S. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 6. Vector Notation for Markov Decision processes For d D MD , let rd (s) r (s, d(s)) and pd (j|s) p(j|s, d(s)). Similarly, for d D MR , let X X. rd (s) qd(s) (a)r (s, a), pd (j|s) qd(s) (a)p(j|s, a). a As a As Let rd denote the |S|-vector, with the s-th component rd (s). and Pd the |S| |S| matrix with (s, j)-th entry pd (j|s). We refer to rd as the reward vector and Pd as the transition probability matrix. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 7.

4 Vector Notation for Markov Decision processes = (d1 , d2 , .. ) MR . The (s, j) component of the t-step transition probability matrix P t (j|s) satisfies P t (j|s) = [Pd1 .. Pdt 1 Pdt ](j|s) = P (Xt+1 = j|X1 = s). For v V , X. Es [v (Xt )] = P t 1 (j|s)v (j). j S. We also have . X. v = t 1 P t 1 rdt . t=1. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 8. Assumptions Stationary rewards and transition probabilities: r (s, a) and p(j|s, a) do not vary with time Bounded rewards: |r (s, a)| M < . Discounting: [0, 1). Discrete state space: S is finite or countable Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 9. Policy Evaluation Theorem Let = (d1 , d2 , .. ) HR . Then for each s S, there exists a policy 0 = (d10 , d20 , .. ) MR , satisfying 0. P (Xt = j, Yt = a|X1 = s) = P (Xt = j, Yt = a|X1 = s), t.]

5 = Suppose HR , then for each s S, there exists a policy 0. 0 MR such that v (s) = v (s). = It suffices to consider MR . v (s) = sup v (s) = sup v (s). HR MR. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 10. Policy Evaluation Let = (d1 , d2 , .. ) MR . Then " #. X. v (s) = Es t 1 r (Xt , Yt ) . t=1. In vector notation, we have . X. v = t 1 P t 1 rdt t=1. = rd1 + P 1 rd2 + 2 P 2 rd3 + .. = rd1 + Pd1 rd2 + 2 Pd1 Pd2 rd3 + .. = rd1 + Pd1 (rd2 + Pd2 rd3 + .. ). 0. = rd1 + Pd1 v , where 0 = (d2 , d3 , .. ). Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 11. Policy Evaluation When is stationary, = (d, d, .. ) d and 0 = .. It follows that v d satisfies . v d = rd1 + Pd v d Ld v d , where Ld : V V is a linear transformation. Theorem Suppose [0, 1). Then for any stationary policy d with d.]

6 D MR , v d is a solution in V of v = rd + Pd v .. Furthermore, v d may be written as . v d = (I Pd ) 1 rd . Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 12. Optimality Equations For any fixed n, the finite horizon optimality equation is given by . X. vn (s) = sup r (s, a) + P(j|s, a)vn+1 (j) . a As j S. Taking limits on both sides leads to . X. v (s) = sup r (s, a) + P(j|s, a)v (j) . a As j S. The equations above for all s S are the optimality equations. For v V , let Lv sup [rd + Pd v ], d D MD. Lv max [rd + Pd v ]. d D MD. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 13. Optimality Equations Proposition For all v V and [0, 1), sup [rd + Pd v ] = sup [rd + Pd v ]. d D MD d D MR. Replacing D MD with D, the optimality equation can be written as v = Lv.]

7 In case supremum can be attained above for all v V , v = Lv . Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 14. Solutions of the Optimality Equations Theorem Suppose v V . (i) If v Lv , then v v ;. (ii) If v Lv , then v v ;. (iii) If v = Lv , then v is the only element of V with this property and v = v . Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 15. Solutions of the Optimality Equations Let U be a Banach space (complete normed linear space). Special case: space of bounded measurable real-valued functions An operator T : U U is a contraction mapping if there exists a [0, 1) such that ||Tv Tu|| ||v u|| for all u and v in U. Theorem [Banach Fixed-Point Theorem]. Suppose U is a Banach space and T : U U is a contraction mapping. Then (i) There exists a unique v in U such that Tv = v.]

8 (ii) For arbitrary v 0 U, the sequence {v n } defined by v n+1 = Tv n = T n+1 v 0 converges to v . Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 16. Solutions of the Optimality Equations Proposition Suppose [0, 1). Then L and L are contraction mappings on V . Theorem Suppose [0, 1), S is finite or countable, and r (s, a) is bounded. The following results hold. (i) There exits a v V satisfying Lv = v (Lv = v ). Furthermore, v is the only element of V with this property and equals v ;. (ii) For each d D MR , there exists a unique v V satisfying . Ld v = v . Furthermore, v = v d . Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 17. Existence of Stationary Optimal Policies A Decision rule is d is conserving if d argmax{rd + Pd v }. d D. Theorem Suppose there exists a conserving Decision rule or an optimal policy, then there exists a deterministic stationary policy which is optimal.]]

9 Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 18. Value Iteration 1 Select v 0 V , specify > 0, and set n = 0. 2 For each s S, compute v n+1 (s) by . X. v n+1 (s) = max r (s, a) + p(j|s, a)v n (j) . a As j S. 3 If (1 ). ||v n+1 v n || < , 2 . go to step 4. Otherwise, increment n by 1 and return to step 2. 4 For each s S, choose . X. d (s) argmax r (s, a) + p(j|s, a)v n+1 (j) . a As j S. and stop. Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 19. Policy Iteration 1 Set n = 0 and select an arbitrary Decision rule d0 D. 2 (Policy evaluation) Obtain v n by solving (I Pdn )v = rdn . 3 (Policy improvement) Choose dn+1 satisfy dn+1 argmax[rd + Pd v n ]. d D. Setting dn+1 = dn if possible. 4 If dn+1 = dn , stop and set d = dn . Otherwise increment n by 1 and return to step 2.

10 Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 20. Linear Programming P. Let (s) be positive scalars such that s S (s) = 1. Primal linear program is given by X. min (j)v (j). v j S. X. v (s) P(j|s, a)v (j) r (s, a), s S, a As . j S. Dual linear program is given by XX. max r (s, a)x(s, a). x s S a As X XX. x(j, a) p(j|s, a)x(s, a) = (j), j S, a Aj s S a As x(s, a) 0, s S, a As . Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 21.