
An Introduction to Markov Decision Processes

Transcription of An Introduction to Markov Decision Processes

MDP Tutorial - 1: An Introduction to Markov Decision Processes
Bob Givan (Purdue University) and Ron Parr (Duke University)

MDP Tutorial - 2: Outline
• Markov Decision Processes defined (Bob)
  - Objective functions
  - Policies
• Finding Optimal Solutions (Ron)
  - Dynamic programming
  - Linear programming
• Refinements to the basic model (Bob)
  - Partial observability
  - Factored representations

MDP Tutorial - 3, 4: Stochastic Automata with Utilities
A Markov Decision Process (MDP) model contains:
• A set of possible world states S
• A set of possible actions A
• A real-valued reward function R(s, a)
• A description T of each action's effects in each state
Markov Property: the effects of an action taken in a state depend only on that state and not on the prior history.
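To make the four components concrete, here is a minimal container sketched in Python; the class and field names are my own and not part of the tutorial.

from typing import Callable, Dict, NamedTuple, Set, Tuple

class MDP(NamedTuple):
    """Illustrative container for the four components of an MDP model."""
    states: Set[str]                                      # S: possible world states
    actions: Set[str]                                     # A: possible actions
    reward: Callable[[str, str], float]                   # R(s, a): real-valued reward
    transition: Dict[Tuple[str, str], Dict[str, float]]   # T: P(s' | s, a)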

MDP Tutorial - 5, 6: Representing Actions
• Deterministic actions: T: S × A → S. For each state and action we specify a new state.
• Stochastic actions: T: S × A → Prob(S). For each state and action we specify a probability distribution over next states, i.e. the distribution P(s' | s, a).
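The two flavours of T can be illustrated with a small Python sketch on a made-up two-state battery example; the state names, action names, and probabilities are all invented for illustration.

# Deterministic T: S x A -> S, one successor state per (state, action).
T_det = {
    ("low-battery", "charge"): "full-battery",
    ("full-battery", "work"):  "low-battery",
}

# Stochastic T: S x A -> Prob(S), a distribution P(s' | s, a) per (state, action).
T_stoch = {
    ("low-battery", "charge"): {"full-battery": 0.9, "low-battery": 0.1},
    ("full-battery", "work"):  {"low-battery": 0.7, "full-battery": 0.3},
}

# Each conditional distribution should sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in T_stoch.values())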

MDP Tutorial - 7: Representing Solutions
A policy π is a mapping from S to A.

MDP Tutorial - 8: Following a Policy
Following a policy π:
1. Determine the current state s
2. Execute action π(s)
3. Go to step 1
Assumes full observability: the new state resulting from executing an action will be known to the system.

MDP Tutorial - 9: Evaluating a Policy
How good is a policy π in a state s?
• For deterministic actions, just total the rewards obtained, but the result may be infinite.
• For stochastic actions, instead use the expected total reward obtained; again this typically yields infinite value.
How do we compare policies of infinite value?
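The three-step loop above can be sketched as a tiny simulator, assuming full observability as the slide does; the transition model, policy, and state names below are hypothetical.

import random

# Hypothetical stochastic transition model P(s' | s, a) and policy pi: S -> A.
T = {
    ("low",  "charge"): {"full": 0.9, "low": 0.1},
    ("full", "work"):   {"full": 0.3, "low": 0.7},
}
pi = {"low": "charge", "full": "work"}

def step(state, action):
    """Sample the next state from P(. | state, action)."""
    dist = T[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

state = "full"
for _ in range(5):               # follow the policy for a few steps
    action = pi[state]           # 2. execute action pi(s)
    state = step(state, action)  # full observability: the new state is known
    # 3. go back to step 1 (determine the current state) and repeat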

MDP Tutorial - 10: Objective Functions
An objective function maps infinite sequences of rewards to single real numbers (representing utility).
Options:
1. Set a finite horizon and just total the rewards
2. Discounting to prefer earlier rewards
3. Average reward rate in the limit
Discounting is perhaps the most analytically tractable and most widely studied approach.

MDP Tutorial - 11: Discounting
A reward n steps away is discounted by γ^n, for discount rate γ with 0 < γ < 1.
• models mortality: you may die at any moment
• models preference for shorter solutions
• a smoothed-out version of limited-horizon lookahead
We use cumulative discounted reward as our objective.
(If each reward is at most M, the maximum value is ≤ M + γM + γ^2 M + ... = M / (1 - γ).)

MDP Tutorial - 12: Value Functions
A value function V_π : S → ℝ represents the expected objective value obtained by following policy π from each state in S.
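A short worked check of the discounting arithmetic in Python, with an invented per-step reward bound M and discount rate γ:

gamma, M = 0.9, 10.0                        # discount rate and per-step reward bound (made up)

# Cumulative discounted reward of a (finite prefix of a) reward sequence.
rewards = [10.0, 0.0, 5.0, 10.0]
value = sum(gamma**n * r for n, r in enumerate(rewards))

# Geometric-series bound from slide 11: M + gamma*M + gamma^2*M + ... = M / (1 - gamma).
bound = M / (1 - gamma)
assert value <= bound + 1e-9
print(value, bound)                         # 21.34 and 100.0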

Value functions partially order the policies, but at least one optimal policy exists, and all optimal policies have the same value function, V*.

MDP Tutorial - 13: Bellman Equations
Bellman equations relate the value function to itself via the problem dynamics. For the discounted objective function there is one equation per state in S:
  V_π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V_π(s')
  V*(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V*(s') ]

MDP Tutorial - 14: Finite-horizon Bellman Equations
Finite-horizon values at adjacent horizons are related by the action dynamics:
  V_{π,0}(s) = R(s, π(s))
  V_{π,n}(s) = R(s, π(s)) + Σ_{s' ∈ S} T(s, π(s), s') V_{π,n-1}(s')
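Both equations can be solved by simple fixed-point iteration, which is one standard approach (Ron Parr's solution-methods portion is not transcribed here). The toy MDP below, with its states, rewards, and probabilities, is invented purely to show the shape of the computation.

# Toy MDP: two states, two actions; R(s, a) and T(s, a, s') are invented for illustration.
S = ["low", "full"]
A = ["charge", "work"]
R = {("low", "charge"): 0.0, ("low", "work"): 1.0,
     ("full", "charge"): 0.0, ("full", "work"): 5.0}
T = {("low", "charge"):  {"full": 0.9, "low": 0.1},
     ("low", "work"):    {"low": 1.0},
     ("full", "charge"): {"full": 1.0},
     ("full", "work"):   {"full": 0.3, "low": 0.7}}
gamma = 0.9

def evaluate(pi, iters=500):
    """Iterate V_pi(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') * V_pi(s')."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: R[(s, pi[s])]
                + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
             for s in S}
    return V

def value_iteration(iters=500):
    """Iterate V*(s) = max_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V*(s') ]."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: max(R[(s, a)]
                    + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in A)
             for s in S}
    return V

print(evaluate({"low": "charge", "full": "work"}))   # V_pi for one fixed policy
print(value_iteration())                             # V*, the optimal value function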

MDP Tutorial - 15: Relation to Model Checking
Some thoughts on the relationship:
• MDP solution methods focus critically on expected value.
• Contrast safety properties, which focus on the worst case.
• This contrast allows MDP methods to exploit sampling and approximation more aggressively.

MDP Tutorial - 16
At this point, Ron Parr spoke on solution methods for about 1/2 an hour, and then I continued.

MDP Tutorial - 17: Large State Spaces
In AI problems, the state space is typically:
• astronomically large
• described implicitly, not enumerated
• decomposed into factors, or aspects of state
Issues raised:
• How can we represent reward and action behaviors in such MDPs?
• How can we find solutions in such MDPs?

MDP Tutorial - 18: A Factored MDP Representation
• State space S: assignments to state variables (On-Mars?, Need-Power?, Daytime?, ...)
• Partitions of the state space: each block is a DNF formula (or BDD, etc.), e.g.
    Block 1: not On-Mars?
    Block 2: On-Mars? and Need-Power?
    Block 3: On-Mars? and not Need-Power?
• Reward function R: a labelled state-space partition, e.g.
    Block 1: not On-Mars? ... Reward = 0
    Block 2: On-Mars? and Need-Power? ... Reward = 4
    Block 3: On-Mars? and not Need-Power? ... Reward = 5
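One way to read the labelled-partition idea in code: a factored state is an assignment to boolean variables, and the reward comes from the first block whose formula the state satisfies. The blocks and reward values follow slide 18; the helper itself is my own sketch.

# A factored state is an assignment to boolean state variables.
state = {"On-Mars?": True, "Need-Power?": False, "Daytime?": True}

# Reward function R as a labelled partition: (block formula, reward) pairs,
# mirroring the three blocks on slide 18.
reward_partition = [
    (lambda x: not x["On-Mars?"],                       0),  # Block 1
    (lambda x: x["On-Mars?"] and x["Need-Power?"],      4),  # Block 2
    (lambda x: x["On-Mars?"] and not x["Need-Power?"],  5),  # Block 3
]

def reward(x):
    """Return the reward of the block whose formula x satisfies."""
    for block, r in reward_partition:
        if block(x):
            return r
    raise ValueError("partition does not cover this state")

print(reward(state))   # 5: On-Mars? and not Need-Power?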

MDP Tutorial - 19: Factored Representations of Actions
• Assume actions affect state variables independently, e.g.
  Pr(Need-Power? ∧ On-Mars? | x, a) = Pr(Need-Power? | x, a) × Pr(On-Mars? | x, a)
  (Note 1: this assumption can be relaxed.)
• Represent the effect on each state variable as a labelled partition, e.g. Pr(Need-Power? | Block n):
    Block 1: not On-Mars? ...
    Block 2: On-Mars? and Need-Power? ...
    Block 3: On-Mars? and not Need-Power? ...
  (Effects of action Charge-Battery on the variable Need-Power?)

MDP Tutorial - 20: Representing Blocks
• Identifying "irrelevant" state variables
• Decision trees
• DNF formulas
• Binary/Algebraic Decision Diagrams

MDP Tutorial - 21: Partial Observability
• System state cannot always be determined: a Partially Observable MDP (POMDP)
• Action outcomes are not fully observable
• Add a set of observations O to the model
• Add an observation distribution U(s, o) for each state s
• Add an initial state distribution I
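A sketch of the independence assumption from slide 19: each state variable gets its own conditional probability of being true after the action, and the joint next-state distribution is the product of the per-variable terms. The Charge-Battery numbers below are invented, since the transcription does not preserve them.

from itertools import product

# Per-variable effects of a hypothetical Charge-Battery action:
# Pr(variable' = True | current factored state x).  All numbers are made up.
def pr_need_power(x):   # the battery is usually charged afterwards
    return 0.1 if x["On-Mars?"] else 0.05

def pr_on_mars(x):      # charging does not move the rover
    return 1.0 if x["On-Mars?"] else 0.0

effects = {"Need-Power?": pr_need_power, "On-Mars?": pr_on_mars}

def next_state_distribution(x):
    """Joint Pr(x' | x, Charge-Battery) as a product of per-variable terms."""
    variables = list(effects)
    dist = {}
    for values in product([True, False], repeat=len(variables)):
        p = 1.0
        for var, val in zip(variables, values):
            p_true = effects[var](x)
            p *= p_true if val else 1.0 - p_true
        dist[values] = p
    return dist

x = {"On-Mars?": True, "Need-Power?": True}
print(next_state_distribution(x))   # distribution over (Need-Power?', On-Mars?') assignments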

Key notion: the belief state, a distribution over system states representing "where I think I am."

MDP Tutorial - 22: POMDP to MDP Conversion
The belief state Pr(s) can be updated to Pr(s' | o) using Bayes' rule:
  Pr(s' | s, o) = Pr(o | s, s') Pr(s' | s) / Pr(o | s) = U(s', o) T(s, a, s'), normalized
  Pr(s' | o) = Σ_s Pr(s' | s, o) Pr(s)
A POMDP is Markovian and fully observable relative to the belief state, so a POMDP can be treated as a continuous-state MDP.

MDP Tutorial - 23: Belief State Approximation
Problem: when the MDP state space is astronomical, belief states cannot be explicitly represented, so the MDP conversion of a POMDP is impractical.
Solution: represent the belief state approximately
• typically exploiting the factored state representation
• typically exploiting (near) conditional independence properties of the belief state factors
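The conversion above relies on a Bayes-rule belief update; a minimal sketch is below, computing b'(s') proportional to U(s', o) * Σ_s T(s, a, s') * b(s) and then normalizing. The two-state POMDP, its observation model, and all probabilities are invented for illustration.

# Hypothetical POMDP pieces: transition T(s, a, s') and observation model U(s, o).
S = ["low", "full"]
T = {("low", "charge"):  {"full": 0.9, "low": 0.1},
     ("full", "charge"): {"full": 1.0}}
U = {("low", "beep"): 0.8, ("low", "quiet"): 0.2,
     ("full", "beep"): 0.1, ("full", "quiet"): 0.9}

def update_belief(b, a, o):
    """Bayes-rule update: b'(s') proportional to U(s', o) * sum_s T(s, a, s') * b(s)."""
    unnormalized = {s2: U[(s2, o)] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in S)
                    for s2 in S}
    z = sum(unnormalized.values())          # Pr(o | b, a), the normalizing constant
    return {s2: p / z for s2, p in unnormalized.items()}

b = {"low": 0.5, "full": 0.5}               # initial state distribution I (invented)
print(update_belief(b, "charge", "quiet"))  # belief shifts toward "full"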