
Lecture 2: Markov Decision Processes - David Silver

Transcription of Lecture 2: Markov Decision Processes - David Silver

Lecture 2: Markov Decision Processes
David Silver

Outline
1 Markov Processes
2 Markov Reward Processes
3 Markov Decision Processes
4 Extensions to MDPs

Markov Processes

Introduction to MDPs

Markov decision processes formally describe an environment for reinforcement learning, where the environment is fully observable, i.e. the current state completely characterises the process. Almost all RL problems can be formalised as MDPs:
- Optimal control primarily deals with continuous MDPs
- Partially observable problems can be converted into MDPs
- Bandits are MDPs with one state

Markov Property

"The future is independent of the past given the present."

Definition. A state S_t is Markov if and only if

    P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

The state captures all relevant information from the history. Once the state is known, the history may be thrown away: the state is a sufficient statistic of the future.

State Transition Matrix

For a Markov state s and successor state s', the state transition probability is defined by

    P_{ss'} = P[S_{t+1} = s' | S_t = s]

The state transition matrix P defines transition probabilities from all states s to all successor states s' (rows index the current state, columns the successor state):

    P = [ P_11 ... P_1n ]
        [  .         .  ]
        [ P_n1 ... P_nn ]

where each row of the matrix sums to 1.

Markov Chains

A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.

Definition. A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩:
- S is a (finite) set of states
- P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]

Example: Student Markov Chain
[Figure: state diagram of the Student Markov Chain (Class 1, Class 2, Class 3, Pass, Pub, Facebook, Sleep).]
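
As an illustration of the ⟨S, P⟩ definition above, here is a minimal Python sketch (my addition, not part of the lecture) that stores a transition matrix as a NumPy array and samples episodes of the kind shown in the next example. The specific transition probabilities are my recollection of the Student Markov Chain diagram, which is not reproduced in this transcription, so treat them as illustrative assumptions.

    import numpy as np

    # States of the Student Markov Chain, named as in the sample episodes below.
    states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

    # P[i, j] = P[S_{t+1} = states[j] | S_t = states[i]].
    # NOTE: probabilities recalled from the slide's diagram -- an assumption.
    P = np.array([
        # C1   C2   C3   Pass Pub  FB   Sleep
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
    ])
    assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

    rng = np.random.default_rng(0)

    def sample_episode(start="C1", terminal="Sleep"):
        """Sample one episode S_1, S_2, ..., S_T of the Markov chain."""
        episode = [start]
        while episode[-1] != terminal:
            i = states.index(episode[-1])
            episode.append(states[rng.choice(len(states), p=P[i])])
        return episode

    print(sample_episode())  # one random episode ending in 'Sleep'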

Example: Student Markov Chain Episodes

Sample episodes for the Student Markov Chain starting from S_1 = C1:

    S_1, S_2, ..., S_T

- C1 C2 C3 Pass Sleep
- C1 FB FB C1 C2 Sleep
- C1 C2 C3 Pub C2 C3 Pass Sleep
- C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

Example: Student Markov Chain Transition Matrix
[Figure: the Student Markov Chain shown alongside its state transition matrix.]

Markov Reward Processes

A Markov reward process is a Markov chain with values.

Definition. A Markov Reward Process is a tuple ⟨S, P, R, γ⟩:
- S is a finite set of states
- P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]
- R is a reward function, R_s = E[R_{t+1} | S_t = s]
- γ is a discount factor, γ ∈ [0, 1]

Example: Student MRP
[Figure: the Student Markov Chain annotated with rewards R = -2 for each class, R = -1 for Facebook, R = +1 for Pub, R = +10 for Pass and R = 0 for Sleep.]

Return

Definition. The return G_t is the total discounted reward from time-step t:

    G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
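
A tiny sketch (my addition) of the return definition for a finite reward sequence R_{t+1}, R_{t+2}, ...; later rewards are weighted by increasing powers of γ:

    def discounted_return(rewards, gamma):
        """G_t = sum_{k>=0} gamma**k * rewards[k] for a finite reward sequence."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62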

The discount γ ∈ [0, 1] is the present value of future rewards: the value of receiving reward R after k + 1 time-steps is γ^k R. This values immediate reward above delayed reward:
- γ close to 0 leads to "myopic" evaluation
- γ close to 1 leads to "far-sighted" evaluation

Why discount?

Most Markov reward and decision processes are discounted. Why?
- Mathematically convenient to discount rewards
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate reward
- It is sometimes possible to use undiscounted Markov reward processes (γ = 1), e.g. if all sequences terminate

Value Function

The value function v(s) gives the long-term value of state s.

Definition. The state value function v(s) of an MRP is the expected return starting from state s:

    v(s) = E[G_t | S_t = s]
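
Since v(s) is an expectation over returns, it can be approximated by averaging sampled returns. The sketch below (my addition) does this for a small two-state MRP whose numbers are purely illustrative; the Student MRP would work the same way.

    import numpy as np

    # Toy MRP (illustrative numbers): state 1 is absorbing with zero reward.
    P = np.array([[0.9, 0.1],
                  [0.0, 1.0]])
    R = np.array([1.0, 0.0])       # R_s = E[R_{t+1} | S_t = s]
    gamma = 0.9
    rng = np.random.default_rng(0)

    def sample_return(s, max_steps=1000):
        """One sampled return G_t = sum_k gamma^k R_{t+k+1} starting in state s."""
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            g += discount * R[s]
            discount *= gamma
            s = rng.choice(2, p=P[s])
        return g

    # Monte-Carlo estimate of v(0) = E[G_t | S_t = 0].
    print(np.mean([sample_return(0) for _ in range(2000)]))  # ~1 / (1 - 0.81) = 5.26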

Example: Student MRP Returns

Sample returns for the Student MRP, starting from S_1 = C1 with γ = 1/2:

    G_1 = R_2 + γ R_3 + ... + γ^{T-2} R_T

- C1 C2 C3 Pass Sleep:
      v_1 = -2 - 2·(1/2) - 2·(1/4) + 10·(1/8) = -2.25
- C1 FB FB C1 C2 Sleep:
      v_1 = -2 - 1·(1/2) - 1·(1/4) - 2·(1/8) - 2·(1/16) = -3.125
- C1 C2 C3 Pub C2 C3 Pass Sleep:
      v_1 = -2 - 2·(1/2) - 2·(1/4) + 1·(1/8) - 2·(1/16) - ...
- C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep:
      v_1 = -2 - 1·(1/2) - 1·(1/4) - 2·(1/8) - 2·(1/16) - ...
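
The two fully worked returns above can be checked mechanically; this short script (my addition) recomputes them from the per-state rewards of the Student MRP with γ = 1/2 (Sleep contributes reward 0 and is omitted):

    def discounted_return(rewards, gamma=0.5):
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([-2, -2, -2, 10]))      # C1 C2 C3 Pass  -> -2.25
    print(discounted_return([-2, -1, -1, -2, -2]))  # C1 FB FB C1 C2 -> -3.125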

Example: State-Value Function for Student MRP
[Figures: v(s) marked on the Student MRP diagram for γ = 0, for an intermediate γ, and for γ = 1.]

Bellman Equation for MRPs

The value function can be decomposed into two parts:
- immediate reward R_{t+1}
- discounted value of successor state γ v(S_{t+1})

    v(s) = E[G_t | S_t = s]
         = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
         = E[R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ...) | S_t = s]
         = E[R_{t+1} + γ G_{t+1} | S_t = s]
         = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

Bellman Equation for MRPs (2)

    v(s) = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

[Figure: one-step look-ahead (backup) diagram from state s to its successor states s'.]

    v(s) = R_s + γ Σ_{s'∈S} P_{ss'} v(s')

Example: Bellman Equation for Student MRP
[Figure: the Student MRP with state values marked, verifying v(s) = R_s + γ Σ_{s'} P_{ss'} v(s') at one state.]

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

    v = R + γ P v

where v is a column vector with one entry per state:

    [ v(1) ]   [ R_1 ]       [ P_11 ... P_1n ] [ v(1) ]
    [  ... ] = [ ... ]  + γ  [  .          . ] [  ... ]
    [ v(n) ]   [ R_n ]       [ P_n1 ... P_nn ] [ v(n) ]

Solving the Bellman Equation

The Bellman equation is a linear equation. It can be solved directly:

    v = R + γ P v
    (I - γ P) v = R
    v = (I - γ P)^{-1} R

- Computational complexity is O(n^3) for n states
- Direct solution only possible for small MRPs
- There are many iterative methods for large MRPs, e.g. dynamic programming, Monte-Carlo evaluation, temporal-difference learning
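
A minimal sketch (my addition) of the direct solution, using a linear solve rather than an explicit inverse. The rewards come from the Student MRP example above; the transition probabilities are again my recollection of the diagram and should be treated as assumptions, and γ = 0.9 is an illustrative choice (γ = 1 would make I - P singular here because Sleep is absorbing).

    import numpy as np

    states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
    P = np.array([                # recalled from the diagram -- an assumption
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    ])
    R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])  # rewards from the example
    gamma = 0.9

    # v = (I - gamma P)^{-1} R, solved as a linear system in O(n^3).
    v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
    print(dict(zip(states, np.round(v, 2))))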

Markov Decision Processes

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition. A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix, P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
- γ is a discount factor, γ ∈ [0, 1]
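
One possible encoding of the tuple ⟨S, A, P, R, γ⟩ as arrays (a sketch of mine, not from the lecture), with shapes chosen so that P[a, s, s'] and R[s, a] mirror P^a_{ss'} and R^a_s; the later sketches assume this layout.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class FiniteMDP:
        """<S, A, P, R, gamma> with n_states states and n_actions actions."""
        P: np.ndarray    # (n_actions, n_states, n_states); each P[a] is row-stochastic
        R: np.ndarray    # (n_states, n_actions); R[s, a] = E[R_{t+1} | S_t=s, A_t=a]
        gamma: float

        def __post_init__(self):
            assert np.allclose(self.P.sum(axis=2), 1.0), "each P[a] row must sum to 1"
            assert 0.0 <= self.gamma <= 1.0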

Example: Student MDP
[Figure: the Student MDP, with rewards R = +10, +1, 0, -1, -2 attached to the available actions.]

Policies (1)

Definition. A policy π is a distribution over actions given states,

    π(a|s) = P[A_t = a | S_t = s]

- A policy fully defines the behaviour of an agent
- MDP policies depend on the current state (not the history)
- i.e. policies are stationary (time-independent), A_t ~ π(·|S_t), ∀t > 0

Policies (2)

Given an MDP M = ⟨S, A, P, R, γ⟩ and a policy π:
- The state sequence S_1, S_2, ... is a Markov process ⟨S, P^π⟩
- The state and reward sequence S_1, R_2, S_2, ... is a Markov reward process ⟨S, P^π, R^π, γ⟩, where

    P^π_{s,s'} = Σ_{a∈A} π(a|s) P^a_{ss'}
    R^π_s = Σ_{a∈A} π(a|s) R^a_s

Value Functions

Definition. The state-value function v_π(s) of an MDP is the expected return starting from state s, and then following policy π:

    v_π(s) = E_π[G_t | S_t = s]

Definition. The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

Example: State-Value Function for Student MDP
[Figure: v_π(s) marked on the Student MDP for the uniform random policy, γ = 1.]
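
The two formulas for P^π and R^π translate directly into array operations. Here is a short sketch (my own) that collapses an MDP and a policy into the induced MRP ⟨S, P^π, R^π, γ⟩, using the P[a, s, s'] / R[s, a] layout assumed above:

    import numpy as np

    def induced_mrp(P, R, pi):
        """Collapse an MDP and a policy into the induced MRP <S, P^pi, R^pi>.

        P  : (A, S, S) array, P[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
        R  : (S, A) array,    R[s, a]     = E[R_{t+1} | S_t=s, A_t=a]
        pi : (S, A) array,    pi[s, a]    = pi(a|s)
        """
        P_pi = np.einsum("sa,ast->st", pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
        R_pi = np.einsum("sa,sa->s", pi, R)     # R^pi_s     = sum_a pi(a|s) R^a_s
        return P_pi, R_pi

Feeding P_pi and R_pi into the MRP machinery from earlier (for example the direct Bellman solve) then evaluates the policy; the matrix form of exactly this appears at the end of the section.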

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state,

    v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The action-value function can similarly be decomposed,

    q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

Bellman Expectation Equation for V^π
[Figure: one-step backup diagram from v_π(s) to q_π(s, a) over the actions available in s.]

    v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)

Bellman Expectation Equation for Q^π
[Figure: one-step backup diagram from q_π(s, a) to v_π(s') over the possible successor states s'.]

    q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s')
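
The two one-step relationships above can be written as two short functions (a sketch of mine, again assuming the (A, S, S) / (S, A) array layout). Composing them, v_from_q(pi, q_from_v(P, R, gamma, v)) applies one Bellman expectation backup to v.

    import numpy as np

    def q_from_v(P, R, gamma, v):
        """q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')."""
        return R + gamma * np.einsum("ast,t->sa", P, v)

    def v_from_q(pi, q):
        """v_pi(s) = sum_a pi(a|s) q_pi(s, a)."""
        return np.einsum("sa,sa->s", pi, q)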

Bellman Expectation Equation for v_π (2)
[Figure: two-step backup diagram from v_π(s), through the actions a, to the successor states s'.]

    v_π(s) = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

Bellman Expectation Equation for q_π (2)
[Figure: two-step backup diagram from q_π(s, a), through the successor states s', to the actions a'.]

    q_π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} π(a'|s') q_π(s', a')

Example: Bellman Expectation Equation in Student MDP
[Figure: the Student MDP with v_π(s) marked at each state, checking the equation for v_π above at one state.]

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP,

    v_π = R^π + γ P^π v_π

with direct solution

    v_π = (I - γ P^π)^{-1} R^π
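
Putting the pieces together, policy evaluation in matrix form is a single linear solve on the induced MRP. A minimal sketch (my addition), using a toy two-state, two-action MDP whose numbers are purely illustrative:

    import numpy as np

    P = np.array([                  # P[a, s, s']: transitions for each action
        [[0.7, 0.3], [0.2, 0.8]],   # action 0
        [[0.4, 0.6], [0.9, 0.1]],   # action 1
    ])
    R = np.array([[1.0, 0.0],       # R[s, a]
                  [0.0, 2.0]])
    gamma = 0.9
    pi = np.full((2, 2), 0.5)       # uniform random policy, pi(a|s) = 0.5

    # Induced MRP: P^pi and R^pi as defined in the Policies section.
    P_pi = np.einsum("sa,ast->st", pi, P)
    R_pi = np.einsum("sa,sa->s", pi, R)

    # v_pi = (I - gamma P^pi)^{-1} R^pi
    v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    print(v_pi)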

