Example: barber

Chapter 3: The Reinforcement Learning Problem An …

CSE 190: Reinforcement Learning :An IntroductionAcknowledgment:Acknowledgmen t:A good number of these slidesA good number of these slides are cribbed from Rich Suttonare cribbed from Rich Sutton22 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Chapter 3: The ReinforcementLearning ProblemObjectives of what I will talk about from this Chapter : describe the RL Problem we will be studying for theremainder of the course present idealized form of the RL Problem for which wehave precise theoretical results; introduce key components of the mathematics: valuefunctions and Bellman equations; describe trade-offs between applicability and 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Chapter 3: The ReinforcementLearning ProblemObjectives of what I will talk about from this Chapter : describe the RL Problem we will be studying for theremainder of the course present idealized form of the RL Problem for which wehave precise theoretical results; introduce key components of the mathematics: valuefunctions and Bellman equations; describe trade-offs between applicability and 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The Agent-Environment Interface Agent and environment interact at discrete time steps: t=0,1,2,K Agent observes state at step t: st!

CSE 190: Reinforcement Learning, Lecture25 Policy at step t,! t: a mapping from states to action probabilities! t(s,a)= probability that a t=a when s t=s The Agent Learns a Policy •Reinforcement learning methods specify how the agent changes its policy as a result of experience.

Tags:

  Learning, Reinforcement, Reinforcement learning

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Advertisement

Transcription of Chapter 3: The Reinforcement Learning Problem An …

1 CSE 190: Reinforcement Learning :An IntroductionAcknowledgment:Acknowledgmen t:A good number of these slidesA good number of these slides are cribbed from Rich Suttonare cribbed from Rich Sutton22 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Chapter 3: The ReinforcementLearning ProblemObjectives of what I will talk about from this Chapter : describe the RL Problem we will be studying for theremainder of the course present idealized form of the RL Problem for which wehave precise theoretical results; introduce key components of the mathematics: valuefunctions and Bellman equations; describe trade-offs between applicability and 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Chapter 3: The ReinforcementLearning ProblemObjectives of what I will talk about from this Chapter : describe the RL Problem we will be studying for theremainder of the course present idealized form of the RL Problem for which wehave precise theoretical results; introduce key components of the mathematics: valuefunctions and Bellman equations; describe trade-offs between applicability and 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The Agent-Environment Interface Agent and environment interact at discrete time steps: t=0,1,2,K Agent observes state at step t: st!

2 S produces action at step t: at!A(st) gets resulting reward: rt+1!" and resulting next state: st+1t..start +1st +1t +1art +2st +2t +2art +3st +3..t +3a55 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Policy at step t,!t: a mapping from states to action probabilities !t(s,a)= probability that at=a when st=sThe Agent Learns a Policy Reinforcement Learning methods specify how theagent changes its policy as a result of experience. Roughly, the agent s goal is to get as much reward asit can over the long 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Getting the Degree of Abstraction Right Time steps need not refer to fixed intervals of real time. Actions can be low level ( , voltages to motors), or highlevel ( , accept a job offer), mental ( , shift in focus ofattention), etc. States can low-level sensations , or they can be abstract,symbolic, based on memory, or subjective ( , the state ofbeing surprised or lost ).

3 An RL agent is not necessarily like a whole animal or robot. Rewards are in the agent s environment because the agentcannot change it arbitrarily - otherwise, it could simplyreward itself and call it a day! The environment is not necessarily unknown to the agent,only incompletely 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Goals and Rewards Is a scalar reward signal an adequate notion of agoal? maybe not, but it is surprisingly flexible. A goal should specify whatwhat we want to achieve, nothowhow we want to achieve it. A goal must be outside the agent s direct control thusoutside the agent. The agent must be able to measure success: explicitly; frequently during its 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The reward hypothesis The reward hypothesis:All of what we mean by goals and purposes can be thoughtof as the maximization of the cumulative sum of areceived scalar signal (reward) A sort of null hypothesis.

4 Probably ultimately wrong, but so simple we have todisprove it before considering anything morecomplicated99 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Returns Suppose the sequence of rewards after step t is: rt+1,rt+2,rt+3,..What do we want to maximize?In general, we want to maximize the expected return, ERt{}, for each step tasksEpisodic tasks: interaction breaks naturally intoepisodes, , plays of a game, trips through a maze. Rt=rt+1+rt+2+!+rT,where T is a final time step at which a terminal stateterminal state is reached,ending an 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Returns for Continuing TasksContinuing tasks: interaction does not have natural episodes. Instead, we use the Discounted return: Discounted return: Rt=rt+1+!rt+2+!2rt+3+!=!krt+k+1,k=0"#whe re !,0$!$1, is the discount ensures that the expected reward 0!

5 "#1 farsighted1111 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22An ExampleAvoid failure: the pole falling beyonda critical angle or the cart hitting end =+1 for each step before failure! return = number of steps before failureAs an episodic task where episode ends upon failure:As a continuing task with discounted return:reward =!1 upon failure; 0 otherwise" return = !#k, for k steps before failureIn either case, return is maximized by avoiding failure foras long as 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Another ExampleGet to the top of the hillas quickly as possible. reward =!1 for each step where not at top of hill" return = !number of steps before reaching top of hillReturn is maximized by minimizing number of stepsto reach the top of the 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22A Unified Notation In episodic tasks, we number the time steps of eachepisode starting from zero.

6 We usually do not have to distinguish betweenepisodes, so we write instead of for the stateat step t of episode j. Think of each episode as ending in an absorbing statethat always produces reward of zero: We can cover all cases by writingstst,j Rt=!krt+k+1,k=0"#where ! can be 1 only if a zero reward absorbing state is always 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The Markov Property By the state at step t, the book means whateverinformation is available to the agent at step t about itsenvironment. The state can include immediate sensations, highlyprocessed sensations, and structures built up over timefrom sequences of sensations. Ideally, a state should summarize past sensations so as toretain all essential information, , it should have theMarkov Property: Prst+1=!s,rt+1=rst,at,rt,st"1,at"1,..,r1 ,s0,a0{}= Prst+1=!

7 S,rt+1=rst,at{}for all !s,r,and histories st,at,rt,st"1,at"1,..,r1,s0,a0. 1515 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Markov Decision Processes If a Reinforcement Learning task has the MarkovProperty, it is basically a Markov Decision Process(MDP). If state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: state and action sets one-step dynamics defined by transition probabilities: reward probabilities:Ps!sa=Prst+1=!sst=s,at=a{} for all s,!s"S,a"A(s).Rs!sa=Ert+1st=s,at=a,st+1= !s{} for all s,!s"S,a"A(s).1616 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Recycling Robot An Example Finite MDP At each step, robot has to decide whether it should (1) activelysearch for a can, (2) wait for someone to bring it a can, or (3) go tohome base and recharge. Searching is better but runs down the battery; if runs out of powerwhile searching, has to be rescued (which is bad).

8 Decisions made on basis of current energy level: high, low. Reward = number of cans collected1717 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Recycling Robot MDP S=high,low{}A(high)=search,wait{}A(low)= search,wait,recharge{} Rsearch= expected no. of cans while searchingRwait= expected no. of cans while waiting Rsearch>Rwait1818 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The value of a state is the expected return starting fromthat state; depends on the agent s policy: The value of taking an action in a state under policy! is the expected return starting from that state, takingthat action, and thereafter following ! :Value FunctionsState-value function for policy !:V!(s)=E!Rtst=s{}=E!"krt+k+1st=sk=0#$%& '()*Action-value function for policy !:Q!(s,a)=E!Rtst=s,at=a{}=E!"krt+k+1st=s ,at=ak=0#$% & ' ( ) * 1919 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Bellman Equation for a Policy !

9 Rt=rt+1+!rt+2+!2rt+3+!3rt+4!=rt+1+!rt+2+ !rt+3+!2rt+4!()=rt+1+!Rt+1 The basic idea: So: V!(s)=E!Rtst=s{}=E!rt+1+"Vst+1()st=s{}Or , without the expectation operator: V!(s)=!(s,a)Ps"saRs"sa+#V!("s)$%&'"s(a(2 020 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 More on the Bellman EquationV!(s)=!(s,a)Ps"saRs"sa+#V!("s)$% &'"s(a(This is a setset of equations (in fact, linear), one for each value function for ! is its unique diagrams:for V!for Q!2121 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Gridworld Actions: north, south, east, west; deterministic. If would take agent off the grid: no move but reward = 1 Other actions produce reward = 0, except actions that move agent outof special states A and B as function for equiprobable random policy;! = 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Golf State is ball location Reward of 1 for each strokeuntil the ball is in the hole Value of a state?))))

10 Actions: putt (use putter) driver (use driver) putt succeeds anywhere onthe green2323 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 For finite MDPs, policies can be partially ordered: There are always one or more policies that are better than orequal to all the others. These are the optimal policies. Wedenote them all ! *. Optimal policies share the same optimal state-valuefunction: Optimal policies also share the same optimal action-valuefunction:!"#! if and only if V!(s)"V#!(s) for all s$SOptimal Value FunctionsV!(s)=max"V"(s) for all s#SQ!(s,a)=max"Q"(s,a) for all s#S and a#A(s)This is the expected return for taking action a in state sand thereafter following an optimal 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Optimal Value Function for Golf We can hit the ball farther with driver than withputter, but with less accuracy Q*(s,driver) gives the value of using driver first,then using whichever actions are best2525 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Bellman Optimality Equation for V*V!


Related search queries