### Transcription of Infinite-Horizon Discounted Markov Decision Processes (Lecture Slides)

**Infinite-Horizon Discounted Markov Decision Processes**
Dan Zhang, Leeds School of Business, University of Colorado at Boulder (Spring 2012)

#### Outline

- The expected total discounted reward
- Policy evaluation
- Optimality equations
- Value iteration
- Policy iteration
- Linear programming

#### Expected Total Reward Criterion

Let $\pi = (d_1, d_2, \ldots) \in \Pi^{HR}$. Starting at a state $s$, using policy $\pi$ leads to a sequence of state-action pairs $\{(X_t, Y_t)\}$. The sequence of rewards is given by $\{R_t \equiv r_t(X_t, Y_t) : t = 1, 2, \ldots\}$. Let $\lambda \in [0, 1)$ be the discount factor.

The expected total discounted reward of policy $\pi$ starting in state $s$ is given by

$$v_\lambda^\pi(s) \equiv \lim_{N \to \infty} E_s^\pi \left[ \sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t) \right].$$

The limit above exists when $r(\cdot)$ is bounded; that is, $\sup_{s \in S,\, a \in A_s} |r(s, a)| = M < \infty$.
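To make the definition concrete, here is a minimal simulation sketch: it estimates $v_\lambda^\pi(s)$ for a stationary policy by averaging truncated discounted reward sums over sample paths. The two-state MDP (`P`, `r`), the policy, and all numbers are invented for illustration; only the discounted-sum formula comes from the slide above.

```python
import numpy as np

# Invented toy MDP: estimate v_lambda^pi(s) by Monte Carlo simulation.
rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, j] = p(j | s, a)
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 2.0],                  # r[s, a]
              [0.0, 3.0]])
policy = [0, 1]                            # stationary deterministic rule d(s)
lam = 0.9                                  # discount factor lambda

def discounted_return(s, horizon=200):
    """One sample of sum_{t=1}^{horizon} lam^{t-1} r(X_t, Y_t) from X_1 = s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * r[s, a]
        s = rng.choice(2, p=P[a, s])       # draw X_{t+1} ~ p(. | s, a)
        discount *= lam
    return total

estimate = np.mean([discounted_return(s=0) for _ in range(2000)])
print(f"Monte Carlo estimate of v_lambda^pi(0): {estimate:.3f}")
```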

#### Expected Total Reward Criterion (continued)

Under suitable conditions (such as the boundedness of $r(\cdot)$), we have

$$\lim_{N \to \infty} E_s^\pi \left[ \sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t) \right] = E_s^\pi \left[ \sum_{t=1}^{\infty} \lambda^{t-1} r(X_t, Y_t) \right].$$

Let

$$v^\pi(s) \equiv E_s^\pi \left[ \sum_{t=1}^{\infty} r(X_t, Y_t) \right].$$

We have $v^\pi(s) = \lim_{\lambda \uparrow 1} v_\lambda^\pi(s)$ whenever $v^\pi(s)$ exists.

#### Optimality Criteria

A policy $\pi^*$ is discount optimal for $\lambda \in [0, 1)$ if

$$v_\lambda^{\pi^*}(s) \ge v_\lambda^{\pi}(s), \quad \forall\, s \in S,\ \pi \in \Pi^{HR}.$$

The value of a discounted MDP is defined by

$$v_\lambda^*(s) \equiv \sup_{\pi \in \Pi^{HR}} v_\lambda^{\pi}(s), \quad s \in S.$$

Let $\pi^*$ be a discount optimal policy. Then $v_\lambda^{\pi^*}(s) = v_\lambda^*(s)$ for all $s \in S$.

#### Vector Notation for Markov Decision Processes

Let $V$ denote the set of bounded real-valued functions on $S$, with componentwise partial order and norm $\|v\| \equiv \sup_{s \in S} |v(s)|$. The corresponding matrix norm is given by

$$\|H\| \equiv \sup_{s \in S} \sum_{j \in S} |H(j|s)|,$$

where $H(j|s)$ denotes the $(s, j)$-th component of $H$. Let $e \in V$ denote the function with all components equal to 1; that is, $e(s) = 1$ for all $s \in S$.

For $d \in D^{MD}$, let $r_d(s) \equiv r(s, d(s))$ and $p_d(j|s) \equiv p(j|s, d(s))$. Similarly, for $d \in D^{MR}$, let

$$r_d(s) \equiv \sum_{a \in A_s} q_{d(s)}(a)\, r(s, a), \qquad p_d(j|s) \equiv \sum_{a \in A_s} q_{d(s)}(a)\, p(j|s, a).$$

Let $r_d$ denote the $|S|$-vector with $s$-th component $r_d(s)$, and $P_d$ the $|S| \times |S|$ matrix with $(s, j)$-th entry $p_d(j|s)$. We refer to $r_d$ as the reward vector and $P_d$ as the transition probability matrix.
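The vector quantities above are straightforward to build with arrays. A sketch, using a hypothetical 2-state, 2-action MDP whose numbers are invented for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
r = np.array([[1.0, 2.0],                  # r[s, a]
              [0.0, 3.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, j] = p(j | s, a)
              [[0.5, 0.5], [0.6, 0.4]]])

# Deterministic rule d in D^MD, stored as an action index per state.
d = np.array([0, 1])
r_d = r[np.arange(2), d]                   # r_d(s) = r(s, d(s))
P_d = P[d, np.arange(2), :]                # p_d(j|s) = p(j | s, d(s))

# Randomized rule in D^MR: q[s, a] = q_{d(s)}(a).
q = np.array([[0.3, 0.7],
              [0.5, 0.5]])
r_q = (q * r).sum(axis=1)                  # r_d(s) = sum_a q(a) r(s, a)
P_q = np.einsum("sa,asj->sj", q, P)        # p_d(j|s) = sum_a q(a) p(j|s, a)

print("r_d =", r_d, "\nP_d =\n", P_d, "\nr_q =", r_q, "\nP_q =\n", P_q)
```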

#### Vector Notation for Markov Decision Processes (continued)

Let $\pi = (d_1, d_2, \ldots) \in \Pi^{MR}$. The $(s, j)$ component of the $t$-step transition probability matrix $P_\pi^t$ satisfies

$$P_\pi^t(j|s) = [P_{d_1} \cdots P_{d_{t-1}} P_{d_t}](j|s) = P^\pi(X_{t+1} = j \mid X_1 = s).$$

For $v \in V$,

$$E_s^\pi[v(X_t)] = \sum_{j \in S} P_\pi^{t-1}(j|s)\, v(j).$$

We also have

$$v_\lambda^\pi = \sum_{t=1}^{\infty} \lambda^{t-1} P_\pi^{t-1} r_{d_t}.$$

(A numerical sketch of these formulas follows the assumptions below.)

#### Assumptions

- Stationary rewards and transition probabilities: $r(s, a)$ and $p(j|s, a)$ do not vary with time.
- Bounded rewards: $|r(s, a)| \le M < \infty$.
- Discounting: $\lambda \in [0, 1)$.
- Discrete state space: $S$ is finite or countable.
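As promised above, a small numerical sketch of the $t$-step formulas for a stationary policy, where $P_\pi^t$ reduces to the matrix power $P_d^t$; the matrix `P_d`, reward vector `r_d`, and test function `v` are invented toy data:

```python
import numpy as np

# t-step formulas for a stationary rule d, where P_pi^t = P_d^t.
lam = 0.9
P_d = np.array([[0.9, 0.1],
                [0.2, 0.8]])
r_d = np.array([1.0, 3.0])
v = np.array([10.0, 5.0])

t = 4
P_prev = np.linalg.matrix_power(P_d, t - 1)
# E_s^pi[v(X_t)] = sum_j P^{t-1}(j|s) v(j), one entry per starting state s
print("E[v(X_t)] by starting state:", P_prev @ v)

# Truncating the series v_lambda^pi = sum_t lam^{t-1} P^{t-1} r_d at T terms:
T = 500
v_approx = sum(lam ** (t - 1) * np.linalg.matrix_power(P_d, t - 1) @ r_d
               for t in range(1, T + 1))
print("Truncated series value:", v_approx)
```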

#### Policy Evaluation

**Theorem.** Let $\pi = (d_1, d_2, \ldots) \in \Pi^{HR}$. Then for each $s \in S$, there exists a policy $\pi' = (d_1', d_2', \ldots) \in \Pi^{MR}$ satisfying

$$P^{\pi'}(X_t = j, Y_t = a \mid X_1 = s) = P^{\pi}(X_t = j, Y_t = a \mid X_1 = s) \quad \text{for all } t.$$

Two consequences:

- Suppose $\pi \in \Pi^{HR}$. Then for each $s \in S$, there exists a policy $\pi' \in \Pi^{MR}$ such that $v_\lambda^{\pi'}(s) = v_\lambda^{\pi}(s)$.
- It suffices to consider $\Pi^{MR}$:
$$v_\lambda^*(s) = \sup_{\pi \in \Pi^{HR}} v_\lambda^{\pi}(s) = \sup_{\pi \in \Pi^{MR}} v_\lambda^{\pi}(s).$$

Let $\pi = (d_1, d_2, \ldots) \in \Pi^{MR}$. Then

$$v_\lambda^\pi(s) = E_s^\pi \left[ \sum_{t=1}^{\infty} \lambda^{t-1} r(X_t, Y_t) \right].$$

In vector notation, we have

$$\begin{aligned}
v_\lambda^\pi &= \sum_{t=1}^{\infty} \lambda^{t-1} P_\pi^{t-1} r_{d_t} \\
&= r_{d_1} + \lambda P_\pi^1 r_{d_2} + \lambda^2 P_\pi^2 r_{d_3} + \cdots \\
&= r_{d_1} + \lambda P_{d_1} r_{d_2} + \lambda^2 P_{d_1} P_{d_2} r_{d_3} + \cdots \\
&= r_{d_1} + \lambda P_{d_1} (r_{d_2} + \lambda P_{d_2} r_{d_3} + \cdots) \\
&= r_{d_1} + \lambda P_{d_1} v_\lambda^{\pi'},
\end{aligned}$$

where $\pi' = (d_2, d_3, \ldots)$.

When $\pi$ is stationary, $\pi = (d, d, \ldots) \equiv d^\infty$ and $\pi' = \pi$. It follows that $v_\lambda^{d^\infty}$ satisfies

$$v_\lambda^{d^\infty} = r_d + \lambda P_d v_\lambda^{d^\infty} \equiv L_d v_\lambda^{d^\infty},$$

where $L_d : V \to V$ is a linear transformation.

**Theorem.** Suppose $\lambda \in [0, 1)$. Then for any stationary policy $d^\infty$ with $d \in D^{MR}$, $v_\lambda^{d^\infty}$ is a solution in $V$ of

$$v = r_d + \lambda P_d v.$$

Furthermore, $v_\lambda^{d^\infty}$ may be written as

$$v_\lambda^{d^\infty} = (I - \lambda P_d)^{-1} r_d.$$
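For a finite state space this theorem is directly computable: $v_\lambda^{d^\infty}$ is the solution of a single linear system. A sketch with the same invented toy data:

```python
import numpy as np

# Policy evaluation as a linear solve: (I - lam * P_d) v = r_d.
lam = 0.9
P_d = np.array([[0.9, 0.1],
                [0.2, 0.8]])
r_d = np.array([1.0, 3.0])

v = np.linalg.solve(np.eye(2) - lam * P_d, r_d)
print("v_lambda^d =", v)

# Sanity check: v is a fixed point of L_d, i.e. v = r_d + lam * P_d v.
assert np.allclose(v, r_d + lam * P_d @ v)
```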

#### Optimality Equations

For any fixed $n$, the finite-horizon optimality equation is given by

$$v_n(s) = \sup_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda\, p(j|s, a)\, v_{n+1}(j) \right\}.$$

Taking limits on both sides leads to

$$v(s) = \sup_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda\, p(j|s, a)\, v(j) \right\}.$$

The equations above for all $s \in S$ are the optimality equations. For $v \in V$, let

$$\mathcal{L}v \equiv \sup_{d \in D^{MD}} \{ r_d + \lambda P_d v \}, \qquad Lv \equiv \max_{d \in D^{MD}} \{ r_d + \lambda P_d v \}.$$

**Proposition.** For all $v \in V$ and $\lambda \in [0, 1)$,

$$\sup_{d \in D^{MD}} \{ r_d + \lambda P_d v \} = \sup_{d \in D^{MR}} \{ r_d + \lambda P_d v \}.$$

Replacing $D^{MD}$ with $D$, the optimality equation can be written as $v = \mathcal{L}v$. In case the supremum is attained for all $v \in V$, it can be written as $v = Lv$.

#### Solutions of the Optimality Equations

**Theorem.** Suppose $v \in V$.

(i) If $v \ge \mathcal{L}v$, then $v \ge v_\lambda^*$;
(ii) If $v \le \mathcal{L}v$, then $v \le v_\lambda^*$;
(iii) If $v = \mathcal{L}v$, then $v$ is the only element of $V$ with this property and $v = v_\lambda^*$.

Let $U$ be a Banach space (a complete normed linear space); a special case is the space of bounded measurable real-valued functions. An operator $T : U \to U$ is a contraction mapping if there exists a $\lambda \in [0, 1)$ such that $\|Tv - Tu\| \le \lambda \|v - u\|$ for all $u$ and $v$ in $U$.

**Theorem (Banach Fixed-Point Theorem).** Suppose $U$ is a Banach space and $T : U \to U$ is a contraction mapping. Then

(i) There exists a unique $v^*$ in $U$ such that $Tv^* = v^*$;
(ii) For arbitrary $v^0 \in U$, the sequence $\{v^n\}$ defined by $v^{n+1} = Tv^n = T^{n+1} v^0$ converges to $v^*$.
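Part (ii) is easy to observe numerically. A sketch iterating the affine map $Tv = r_d + \lambda P_d v$ (the operator $L_d$ from the policy evaluation slides, a contraction with modulus $\lambda$), on the same invented toy data; the error shrinks by at least a factor $\lambda$ per step:

```python
import numpy as np

# Fixed-point iteration of the contraction T v = r_d + lam * P_d v.
lam = 0.9
P_d = np.array([[0.9, 0.1],
                [0.2, 0.8]])
r_d = np.array([1.0, 3.0])
v_star = np.linalg.solve(np.eye(2) - lam * P_d, r_d)  # the unique fixed point

v = np.zeros(2)                                       # arbitrary v^0
for n in range(6):
    print(f"n={n}  ||v^n - v*|| = {np.max(np.abs(v - v_star)):.4f}")
    v = r_d + lam * P_d @ v                           # v^{n+1} = T v^n
# Each iteration shrinks the error by at least a factor lam = 0.9.
```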

#### Solutions of the Optimality Equations (continued)

**Proposition.** Suppose $\lambda \in [0, 1)$. Then $\mathcal{L}$ and $L$ are contraction mappings on $V$.

**Theorem.** Suppose $\lambda \in [0, 1)$, $S$ is finite or countable, and $r(s, a)$ is bounded. The following results hold.

(i) There exists a $v^* \in V$ satisfying $\mathcal{L}v^* = v^*$ ($Lv^* = v^*$). Furthermore, $v^*$ is the only element of $V$ with this property and equals $v_\lambda^*$;
(ii) For each $d \in D^{MR}$, there exists a unique $v \in V$ satisfying $L_d v = v$. Furthermore, $v = v_\lambda^{d^\infty}$.

#### Existence of Stationary Optimal Policies

A decision rule $d^*$ is conserving if

$$d^* \in \arg\max_{d \in D} \{ r_d + \lambda P_d v_\lambda^* \}.$$

**Theorem.** Suppose there exists a conserving decision rule or an optimal policy. Then there exists a deterministic stationary policy which is optimal.
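For a finite MDP the suprema are attained, so $\mathcal{L}$ and $L$ coincide, and a conserving rule is simply a greedy rule at $v_\lambda^*$. A sketch, again on the invented toy data:

```python
import numpy as np

# The operator L (sup attained) and a greedy/conserving decision rule.
lam = 0.9
r = np.array([[1.0, 2.0],
              [0.0, 3.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])

def L(v):
    """Return (L v, greedy rule at v)."""
    Q = r + lam * np.einsum("asj,j->sa", P, v)  # Q[s, a] = r + lam * sum_j p v
    return Q.max(axis=1), Q.argmax(axis=1)

# Iterate L to (approximately) v*; the greedy rule at v* is conserving,
# and the stationary policy it induces is optimal.
v = np.zeros(2)
for _ in range(500):
    v, d = L(v)
print("v* ~", v, " conserving rule d* =", d)
```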

#### Value Iteration

1. Select $v^0 \in V$, specify $\varepsilon > 0$, and set $n = 0$.
2. For each $s \in S$, compute $v^{n+1}(s)$ by
$$v^{n+1}(s) = \max_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda\, p(j|s, a)\, v^n(j) \right\}.$$
3. If
$$\|v^{n+1} - v^n\| < \frac{\varepsilon (1 - \lambda)}{2 \lambda},$$
go to step 4. Otherwise, increment $n$ by 1 and return to step 2.
4. For each $s \in S$, choose
$$d_\varepsilon(s) \in \arg\max_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda\, p(j|s, a)\, v^{n+1}(j) \right\}$$
and stop.

(A code sketch of this algorithm and of policy iteration follows the steps below.)

#### Policy Iteration

1. Set $n = 0$ and select an arbitrary decision rule $d_0 \in D$.
2. (Policy evaluation) Obtain $v^n$ by solving
$$(I - \lambda P_{d_n}) v = r_{d_n}.$$
3. (Policy improvement) Choose $d_{n+1}$ to satisfy
$$d_{n+1} \in \arg\max_{d \in D} \{ r_d + \lambda P_d v^n \},$$
setting $d_{n+1} = d_n$ if possible.
4. If $d_{n+1} = d_n$, stop and set $d^* = d_n$. Otherwise, increment $n$ by 1 and return to step 2.
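Minimal sketches of both algorithms for a finite MDP, on the same invented toy data; the step comments map back to the numbered steps above. A consistent tie-breaking rule (numpy's `argmax` picks the lowest index) serves the same purpose as "setting $d_{n+1} = d_n$ if possible": it prevents cycling among equally good rules.

```python
import numpy as np

# Sketches of value iteration and policy iteration on invented toy data.
lam = 0.9
r = np.array([[1.0, 2.0],                    # r[s, a]
              [0.0, 3.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[a, s, j] = p(j | s, a)
              [[0.5, 0.5], [0.6, 0.4]]])
n_states = r.shape[0]

def q_values(v):
    """Q[s, a] = r(s, a) + lam * sum_j p(j|s, a) v(j)."""
    return r + lam * np.einsum("asj,j->sa", P, v)

def value_iteration(eps=1e-6):
    v = np.zeros(n_states)                        # step 1: v^0, eps, n = 0
    while True:
        v_new = q_values(v).max(axis=1)           # step 2: v^{n+1}(s)
        if np.max(np.abs(v_new - v)) < eps * (1 - lam) / (2 * lam):
            break                                 # step 3: stopping rule
        v = v_new
    return v_new, q_values(v_new).argmax(axis=1)  # step 4: eps-optimal rule

def policy_iteration():
    d = np.zeros(n_states, dtype=int)             # step 1: arbitrary d_0
    while True:
        P_d = P[d, np.arange(n_states), :]
        r_d = r[np.arange(n_states), d]
        v = np.linalg.solve(np.eye(n_states) - lam * P_d, r_d)  # step 2
        d_new = q_values(v).argmax(axis=1)        # step 3: improvement
        if np.array_equal(d_new, d):              # step 4: stable => optimal
            return v, d
        d = d_new

print("value iteration :", *value_iteration())
print("policy iteration:", *policy_iteration())
```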

#### Linear Programming

Let $\alpha(s)$ be positive scalars such that $\sum_{s \in S} \alpha(s) = 1$. The primal linear program is given by

$$\begin{aligned}
\min_{v} \quad & \sum_{j \in S} \alpha(j)\, v(j) \\
\text{s.t.} \quad & v(s) - \sum_{j \in S} \lambda\, p(j|s, a)\, v(j) \ge r(s, a), \quad s \in S,\ a \in A_s.
\end{aligned}$$

The dual linear program is given by

$$\begin{aligned}
\max_{x} \quad & \sum_{s \in S} \sum_{a \in A_s} r(s, a)\, x(s, a) \\
\text{s.t.} \quad & \sum_{a \in A_j} x(j, a) - \sum_{s \in S} \sum_{a \in A_s} \lambda\, p(j|s, a)\, x(s, a) = \alpha(j), \quad j \in S, \\
& x(s, a) \ge 0, \quad s \in S,\ a \in A_s.
\end{aligned}$$
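A sketch of the primal using `scipy.optimize.linprog`, with the same invented toy data. `linprog` minimizes $c^\top v$ subject to $A_{ub} v \le b_{ub}$, so each $\ge$ constraint is negated to fit that form:

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP on invented toy data: each constraint
#     v(s) - lam * sum_j p(j|s,a) v(j) >= r(s,a)
# is negated into the <= form expected by linprog; v is unbounded in sign.
lam = 0.9
r = np.array([[1.0, 2.0],
              [0.0, 3.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
n_states, n_actions = r.shape

alpha = np.full(n_states, 1.0 / n_states)   # positive weights summing to 1

A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = lam * P[a, s, :].copy()       # lam * p(. | s, a)
        row[s] -= 1.0                       # row @ v = lam * sum_j p v - v(s)
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("LP value function v* =", res.x)
```

In the dual, $x(s, a)$ can be interpreted as a discounted state-action frequency; a decision rule that, in each state, selects an action with $x(s, a) > 0$ is optimal.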