Example: barber

Introduction to Reinforcement Learning

Introduction to Reinforcement LearningJ. Zico KolterCarnegie Mellon University1 Agent interaction with environmentAgentEnvironmentActionaReward rStates2Of course, an oversimplification3 Review: Markov decision processRecall a (discounted) Markov decision process = , , , , : set of states : set of actions [0,1]: transition probability distribution , : rewards function, ( )is reward for state : discount factorThe RL twist: we don t know or , or they are too big to enumerate (only have the ability to act in the MDP, observe states and actions)4 Some important quantities in MDPs(Deterministic) policy : : mapping from states to actions(Stochastic) policy.

Introduction to Reinforcement Learning J. Zico Kolter Carnegie Mellon University 1. Agent interaction with environment Agent Environment States Rewardr Actiona 2. Of course, an oversimplification 3. Review: Markov decision process Recall a (discounted) Markov decision process ℳ=",#,$,%,&

Tags:

  Introduction, Learning, Reinforcement, Reinforcement learning

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Introduction to Reinforcement Learning

1 Introduction to Reinforcement LearningJ. Zico KolterCarnegie Mellon University1 Agent interaction with environmentAgentEnvironmentActionaReward rStates2Of course, an oversimplification3 Review: Markov decision processRecall a (discounted) Markov decision process = , , , , : set of states : set of actions [0,1]: transition probability distribution , : rewards function, ( )is reward for state : discount factorThe RL twist: we don t know or , or they are too big to enumerate (only have the ability to act in the MDP, observe states and actions)4 Some important quantities in MDPs(Deterministic) policy : : mapping from states to actions(Stochastic) policy.

2 0,1: distribution over actions for each stateValue of a policy , expected discounted reward if starting in some state and following policy (expressed via Bellman equation) 0 = + , 0( ) 2 4 Optimal value function (value function of optimal policy , policy with highest value) = + max6 7 , ( ) 2 95 Solving an MDPP olicy evaluation: To find the value of a policy , start with any 0( ) and repeat: : 0 + , 0( ) 2 4(alternatively, can solve above linear equation to determine 0 directly)Value iteration: To find optimal value function, start with any ( ) and repeat: : + max6 7 , ( ) 2 4 But, how do we compute these quantities when and are unknown?

3 6 Overview of RL7 Reinforcement learningModel-based methodsModel-free methodsValue-based methodsPolicy-based methodsImportant note: the term Reinforcement Learning has also been co-opted to mean essentially any kind of sequential decision-making problem involving some element of machine Learning , including many domains different from above (imitation Learning , Learning control, inverse RL, etc), but we re going to focus on the above outlineImportant note regarding domain sizeFor the purposes of this lecture (except for the last section), we re going to assume a discrete state / discrete action setting where we can enumerate all statesLast part of lecture we will talk about the case of large/continuous state and action spacesThink.

4 Grid-world, not Atari (yet)8000000-1000001 Overview of RL9 Reinforcement learningModel-based methodsModel-free methodsValue-based methodsPolicy-based methodsModel-based RLA simple approach: if we don t know the MDP, just estimate it from dataAgent acts in the world (according to some policy), and observes state, actions, reward sequence 1, 1, 1, 2, 2, 2,.., @, @, @Form the empirical estimate of the MDP via the counts , = 1 B= , B= , B+1= @ 1B=1 1 B= , B= @ 1B=1 = 1 B= B@ 1B=1 1 B= @ 1B=1 Then solve MDP =( , , , , )via , value iteration10 Model-based RLWill converge to the correct MDP (and hence correct optimal value function / optimal policy) given enough samples of each stateHow do we ensure that we get the right samples?

5 (a challenging problem for all the methods we present here, stay tuned)Advantages (informally) of model-based RL: makes efficient use of dataDisadvantages: requires we build the actual MDP models, solve for optimal policy (not much help if state space is too large)11 Performance of model-based RL32100316100031621000031623cost/trial (log scale)0102030405060708090100trialModel-B ased RLDirect RL: run 3 Direct RL: run 2 Direct RL: run 1 Direct RL: average of 50 runs32100316100031621000031623cost/trial (log scale)90100110120130140150160170180190tr ialModel-Based RL: pretrainedDirect RL: pretrainedDirect RL: not pretrained12(Atkesonand Santamar a, 96)Model-free methodsValue-based methods: based upon temporal difference Learning (TD, SARSA, Q- Learning ), learn value function 0or (or a slight generalization called the Q-function, that we will discuss shortly)Policy-based method.

6 Directly learn optimal policy (or try to approximate optimal policy, if true optimal policy is not attainable)13 Overview of RL14 Reinforcement learningModel-based methodsModel-free methodsValue-based methodsPolicy-based methodsTemporal difference (TD) methodsLet s consider the task of evaluating the value of a policy, finding 0, which we can do by repeating the iteration : 0 + , 0( ) 2 4 Now suppose we are in some state F, receive reward F, take action F= Fand end up in state F+1We can t perform the update above for all , but can we update just for F? F F+ F, F 2 4 0( )15 Temporal difference (TD) , because we still don t know F, F for all But, F+1 is a sample from the distribution F, F, so we could perform the update 0 F F+ 0 F+1 However, this is too harsh an assignment, assumes that F+1 is the only possible next state.

7 Instead smooth the update using some <1 0 F 1 0 F+ F+ 0 F+1 This is the temporal difference (TD) algorithm16 The TD algorithm, more fullyThe TD algorithm is essentially a stochastic version of policy evaluation iterationAlgorithm 0=TD , , // Estimate value function 0 Initialize 0 0, RepeatObserve state and reward Take action = ( )and observe next state Update: 0 1 0 + + 0 Return 0 Will converge to 0 0( )(for all visited frequently enough) 17TD experiments Grid world domain:Run TD on grid world for 1000 episode (each episode consists of 10 steps, sampled according to a policy, starting at a random state), initialized with = 18000000-1000001TD Progress19 Value estimation error of TD for different Issues with traditional TD algorithmsTD lets us learn a value function of a policy directly, without ever constructing the MDPBut is this really helpful?

8 Consider trying to execute the greedy policy estimated 0 =argmax6 7 , 0 2 4We need a model the Q functionQ functions (for MDPs in general) are like value functions but defined over state-action pairs 0 , = + , 0( , ) 2 4 , = + , max6 7 ( , ) 2 4 = + , 2 , Q function is the value of starting at state , taking action , and then acting according to (or optimally, for )We can easily construct analogues of value iteration or policy evaluation to construct Q functions directly, given an MDP21 SARSA and Q-learningQ function formulation leads to new TD-like methodsAs with TD, observe state , reward , take action (not necessarily = , more on this shortly), observe next state SARSA: estimate 0 , 0 , 1 0 , + + 0 , Q- Learning .

9 Estimate ( , ) , 1 , + + max6 7 , Will converge to 0, respectively given enough samples of each state-action pair22 Benefits of Q function learningThe advantage of Learning the Q function is that we can now select actions withouta model of the , in Q Learning , the optimal policy is given by =argmax6 7 ( , )so we can learn the fulloptimal policy without a model of the MDP23 Aside: on-policy vs. off-policy learningThere is an important distinction in RL methods about on-policy vs. off-policy Learning algorithmIn the on-policy setting ( the traditional TD and SARSA algorithms), you are Learning the value of the policy that you are followingIn the off-policysetting ( Q- Learning algorithm) you are Learning the value of a differentpolicy than the one you are followingVirtually all Learning guarantees apply to the on-policy case.

10 There exist some convergence proofs for off-policy case, but only for very limited cases My take: Q- Learning probably shouldn t work, amazing that it does24Q- Learning experimentRun Q- Learning on grid world domain for 20000 epsides(each episode 10 steps), initializing with , = Policy: with probability , act according to current optimal policy =argmax6 7 ( , ), act randomly with probability progress26 Optimal Q function estimation error of Q- Learning for different Overview of RL27 Reinforcement learningModel-based methodsModel-free methodsValue-based methodsPolicy-based methodsPolicy-based methods (more briefly)A final strategy for Reinforcement Learning is to avoid forming both the MDP and the value/Q function, and instead try to directly learn the (optimal) policyAn example in MDPs.


Related search queries