Chapter 3: The Reinforcement Learning Problem An …

CSE 190: Reinforcement Learning :An IntroductionAcknowledgment:Acknowledgmen t:A good number of these slidesA good number of these slides are cribbed from Rich Suttonare cribbed from Rich Sutton22 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Chapter 3: The ReinforcementLearning ProblemObjectives of what I will talk about from this Chapter : describe the RL Problem we will be studying for theremainder of the course present idealized form of the RL Problem for which wehave precise theoretical results; introduce key components of the mathematics: valuefunctions and Bellman equations; describe trade-offs between applicability and 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Chapter 3: The ReinforcementLearning ProblemObjectives of what I will talk about from this Chapter : describe the RL Problem we will be studying for theremainder of the course present idealized form of the RL Problem for which wehave precise theoretical results; introduce key components of the mathematics: valuefunctions and Bellman equations; describe trade-offs between applicability and 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The Agent-Environment Interface Agent and environment interact at discrete time steps: t=0,1,2,K Agent observes state at step t: st!

S produces action at step t: at!A(st) gets resulting reward: rt+1!" and resulting next state: st+1t..start +1st +1t +1art +2st +2t +2art +3st +3..t +3a55 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Policy at step t,!t: a mapping from states to action probabilities !t(s,a)= probability that at=a when st=sThe Agent Learns a Policy Reinforcement Learning methods specify how theagent changes its policy as a result of experience. Roughly, the agent s goal is to get as much reward asit can over the long 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Getting the Degree of Abstraction Right Time steps need not refer to fixed intervals of real time. Actions can be low level ( , voltages to motors), or highlevel ( , accept a job offer), mental ( , shift in focus ofattention), etc. States can low-level sensations , or they can be abstract,symbolic, based on memory, or subjective ( , the state ofbeing surprised or lost ).

An RL agent is not necessarily like a whole animal or robot. Rewards are in the agent s environment because the agentcannot change it arbitrarily - otherwise, it could simplyreward itself and call it a day! The environment is not necessarily unknown to the agent,only incompletely 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Goals and Rewards Is a scalar reward signal an adequate notion of agoal? maybe not, but it is surprisingly flexible. A goal should specify whatwhat we want to achieve, nothowhow we want to achieve it. A goal must be outside the agent s direct control thusoutside the agent. The agent must be able to measure success: explicitly; frequently during its 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The reward hypothesis The reward hypothesis:All of what we mean by goals and purposes can be thoughtof as the maximization of the cumulative sum of areceived scalar signal (reward) A sort of null hypothesis.

Probably ultimately wrong, but so simple we have todisprove it before considering anything morecomplicated99 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Returns Suppose the sequence of rewards after step t is: rt+1,rt+2,rt+3,..What do we want to maximize?In general, we want to maximize the expected return, ERt{}, for each step tasksEpisodic tasks: interaction breaks naturally intoepisodes, , plays of a game, trips through a maze. Rt=rt+1+rt+2+!+rT,where T is a final time step at which a terminal stateterminal state is reached,ending an 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Returns for Continuing TasksContinuing tasks: interaction does not have natural episodes. Instead, we use the Discounted return: Discounted return: Rt=rt+1+!rt+2+!2rt+3+!=!krt+k+1,k=0"#whe re !,0$!$1, is the discount ensures that the expected reward 0!

"#1 farsighted1111 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22An ExampleAvoid failure: the pole falling beyonda critical angle or the cart hitting end =+1 for each step before failure! return = number of steps before failureAs an episodic task where episode ends upon failure:As a continuing task with discounted return:reward =!1 upon failure; 0 otherwise" return = !#k, for k steps before failureIn either case, return is maximized by avoiding failure foras long as 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Another ExampleGet to the top of the hillas quickly as possible. reward =!1 for each step where not at top of hill" return = !number of steps before reaching top of hillReturn is maximized by minimizing number of stepsto reach the top of the 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22A Unified Notation In episodic tasks, we number the time steps of eachepisode starting from zero.

We usually do not have to distinguish betweenepisodes, so we write instead of for the stateat step t of episode j. Think of each episode as ending in an absorbing statethat always produces reward of zero: We can cover all cases by writingstst,j Rt=!krt+k+1,k=0"#where ! can be 1 only if a zero reward absorbing state is always 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The Markov Property By the state at step t, the book means whateverinformation is available to the agent at step t about itsenvironment. The state can include immediate sensations, highlyprocessed sensations, and structures built up over timefrom sequences of sensations. Ideally, a state should summarize past sensations so as toretain all essential information, , it should have theMarkov Property: Prst+1=!s,rt+1=rst,at,rt,st"1,at"1,..,r1 ,s0,a0{}= Prst+1=!

S,rt+1=rst,at{}for all !s,r,and histories st,at,rt,st"1,at"1,..,r1,s0,a0. 1515 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Markov Decision Processes If a Reinforcement Learning task has the MarkovProperty, it is basically a Markov Decision Process(MDP). If state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: state and action sets one-step dynamics defined by transition probabilities: reward probabilities:Ps!sa=Prst+1=!sst=s,at=a{} for all s,!s"S,a"A(s).Rs!sa=Ert+1st=s,at=a,st+1= !s{} for all s,!s"S,a"A(s).1616 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Recycling Robot An Example Finite MDP At each step, robot has to decide whether it should (1) activelysearch for a can, (2) wait for someone to bring it a can, or (3) go tohome base and recharge. Searching is better but runs down the battery; if runs out of powerwhile searching, has to be rescued (which is bad).

Decisions made on basis of current energy level: high, low. Reward = number of cans collected1717 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Recycling Robot MDP S=high,low{}A(high)=search,wait{}A(low)= search,wait,recharge{} Rsearch= expected no. of cans while searchingRwait= expected no. of cans while waiting Rsearch>Rwait1818 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 The value of a state is the expected return starting fromthat state; depends on the agent s policy: The value of taking an action in a state under policy! is the expected return starting from that state, takingthat action, and thereafter following ! :Value FunctionsState-value function for policy !:V!(s)=E!Rtst=s{}=E!"krt+k+1st=sk=0#$%& '()*Action-value function for policy !:Q!(s,a)=E!Rtst=s,at=a{}=E!"krt+k+1st=s ,at=ak=0#$% & ' ( ) * 1919 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Bellman Equation for a Policy !

Rt=rt+1+!rt+2+!2rt+3+!3rt+4!=rt+1+!rt+2+ !rt+3+!2rt+4!()=rt+1+!Rt+1 The basic idea: So: V!(s)=E!Rtst=s{}=E!rt+1+"Vst+1()st=s{}Or , without the expectation operator: V!(s)=!(s,a)Ps"saRs"sa+#V!("s)$%&'"s(a(2 020 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 More on the Bellman EquationV!(s)=!(s,a)Ps"saRs"sa+#V!("s)$% &'"s(a(This is a setset of equations (in fact, linear), one for each value function for ! is its unique diagrams:for V!for Q!2121 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Gridworld Actions: north, south, east, west; deterministic. If would take agent off the grid: no move but reward = 1 Other actions produce reward = 0, except actions that move agent outof special states A and B as function for equiprobable random policy;! = 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Golf State is ball location Reward of 1 for each strokeuntil the ball is in the hole Value of a state?))))

Actions: putt (use putter) driver (use driver) putt succeeds anywhere onthe green2323 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 For finite MDPs, policies can be partially ordered: There are always one or more policies that are better than orequal to all the others. These are the optimal policies. Wedenote them all ! *. Optimal policies share the same optimal state-valuefunction: Optimal policies also share the same optimal action-valuefunction:!"#! if and only if V!(s)"V#!(s) for all s$SOptimal Value FunctionsV!(s)=max"V"(s) for all s#SQ!(s,a)=max"Q"(s,a) for all s#S and a#A(s)This is the expected return for taking action a in state sand thereafter following an optimal 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Optimal Value Function for Golf We can hit the ball farther with driver than withputter, but with less accuracy Q*(s,driver) gives the value of using driver first,then using whichever actions are best2525 CSE 190: Reinforcement Learning , LectureCSE 190: Reinforcement Learning , Lecture 22 Bellman Optimality Equation for V*V!

Chapter 3: The Reinforcement Learning Problem An …

Tags:

Information

Advertisement

Transcription of Chapter 3: The Reinforcement Learning Problem An …

Related search queries

Chapter 3: The Reinforcement Learning Problem An …

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries