CSE 190: Reinforcement Learning: An Introduction (Lecture 1)

Transcription of the Lecture 1 slides:

Slide 1: CSE 190: Reinforcement Learning: An Introduction

Slide 2: Course basics
- The website for the class is linked off my homepage.
- Grades will be based on programming assignments, homeworks, and class participation.
- Homeworks will be turned in, but not graded, as we will discuss the answers in class in small groups. Turning it in means I can see that you are holding up your end of the conversation (this is a major part of class participation!)
- Programming assignments will be graded.
- Any email sent to me about the course should have CSE 190 in the subject line!

Slide 3: Course goals
After taking this course you should:
- Understand what is unique about reinforcement learning
- Understand the tradeoff between exploration and exploitation
- Be conversant in Markov Decision Problems (MDPs)
- Know the various solution methods for solving the RL problem:
  - Dynamic programming (value iteration, policy iteration, etc.)
  - Monte Carlo
  - TD learning
- Know what an eligibility trace is
- Be aware of several well-known applications of RL
- Be able to read papers in the field and understand 75% of each

Slide 4: Introduction and Motivation
Four types of learning:
- Supervised: the agent is told exactly what to do
- Unsupervised: data modeling
- Imitation: follow the leader
- Reinforcement: learning by trying things in an unknown environment
"Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces, 'You lose.' This is reinforcement learning in a nutshell." (Russell & Norvig)
We will start with very simple examples that are not the complete reinforcement learning problem.

Slide 5: Introduction and Motivation
Examples of reinforcement learning problems:
- Making breakfast
- Pole-balancing
- Gridworld
- Learning to play checkers
- Learning to play backgammon
- Learning to get food into your mouth
- Learning to fly a helicopter
However, we will start with very simple examples that are not the complete reinforcement learning problem.

Slide 6: The Basic Idea
- RL is learning to map states to actions to maximize reward.
- The results of actions are learned from experience: by trying them out.
- Often, the reward is delayed. E.g., in a game, you only learn whether you win or lose at the end.
- S&B's characterization is that RL is the definition of a problem, rather than a learning method: any method that solves the problem is reinforcement learning.
- The executive summary: the RL problem is that faced by a learning agent interacting with its environment to achieve a goal. It requires sensation, action, and some idea of a goal.
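To make "a learning agent interacting with its environment" concrete, here is a minimal sketch of that interaction loop in Python. The toy environment, its states, and the random stand-in policy are illustrative assumptions, not part of the slides; the point is only the sense/act/observe cycle and the fact that reward can be delayed until the end of an episode.

import random

def environment_step(state, action):
    """Toy environment: walk left/right along positions 0..4.
    Reward arrives only at the goal, i.e. it is delayed until the end."""
    next_state = max(0, min(4, state + (1 if action == 'right' else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    done = (next_state == 4)
    return next_state, reward, done

def agent_act(state):
    """Stand-in policy: sensation (the state) in, action out."""
    return random.choice(['left', 'right'])

state, total_reward = 0, 0.0
for t in range(100):                      # interact: sense, act, observe reward
    action = agent_act(state)
    state, reward, done = environment_step(state, action)
    total_reward += reward
    if done:
        break
print('episode finished at step', t, 'with total reward', total_reward)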

Slide 7: An essential difference (from other forms of learning)
The exploration/exploitation tradeoff: an agent learning by interacting with the environment must:
- Exploit its knowledge to maximize reward
- Explore the environment to ensure that its knowledge is correct
The agent must try everything while favoring, over time, the most rewarding actions. In a stochastic environment, actions must be tried a sufficient number of times to obtain good estimates of the probabilities (a small sketch of this idea follows below).
Consider: is there any corresponding issue in supervised learning?
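Here is a small illustrative sketch of that tradeoff. The three actions and their hidden payoff probabilities are made up; the agent only sees sampled rewards, keeps a running sample-average estimate for each action, and mostly exploits the best-looking action while exploring at random a small fraction epsilon of the time (the same epsilon-greedy rule reappears in the tic-tac-toe example later in the lecture).

import random

true_payoff = {'a': 0.2, 'b': 0.5, 'c': 0.8}    # hidden from the agent

estimate = {a: 0.0 for a in true_payoff}        # running value estimates
count = {a: 0 for a in true_payoff}             # how often each action was tried
epsilon = 0.1

for step in range(10000):
    if random.random() < epsilon:                       # explore: keep estimates honest
        action = random.choice(list(true_payoff))
    else:                                               # exploit: favor the best-looking action
        action = max(estimate, key=estimate.get)
    reward = 1.0 if random.random() < true_payoff[action] else 0.0
    count[action] += 1
    estimate[action] += (reward - estimate[action]) / count[action]  # sample average

print({a: round(v, 2) for a, v in estimate.items()}, count)

With enough trials the estimates approach the true payoff probabilities, but only because the epsilon fraction of exploratory moves keeps every action being sampled.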

Slides 8-9: Examples (again)
- Making breakfast
- Pole-balancing
- Gridworld
- A gazelle struggles to its feet minutes after being born. 20 minutes later, it is running at 20 MPH.
- A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.
- Learning to play chess
- Learning to play backgammon
- Learning to get food into your mouth
- Learning to fly a helicopter
What unifies these?
- Interaction with an environment
- To achieve a goal
- With considerable uncertainty about the effects of actions - but the actions affect what choices there will be in the future!

Slides 10-16: Elements of RL
The elements of RL are: a policy, a reward function, a value function, and, optionally, a model of the environment.

A policy: a mapping from states to actions, written π(s) = a. The mapping could be stochastic, meaning the policy provides a probability distribution over actions: π(s, a) = P(a | s), the probability of taking action a in state s.
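As a purely illustrative reading of this notation, assuming a tiny made-up set of states and actions, a deterministic policy can be stored as a table from states to actions and a stochastic policy as a table of per-state action distributions:

import random

states = ['s0', 's1', 's2']          # hypothetical states
actions = ['left', 'right']          # hypothetical actions

# Deterministic policy: pi(s) = a
pi_det = {'s0': 'right', 's1': 'right', 's2': 'left'}

# Stochastic policy: pi(s, a) = P(a | s), one distribution per state
pi_stoch = {
    's0': {'left': 0.1, 'right': 0.9},
    's1': {'left': 0.5, 'right': 0.5},
    's2': {'left': 0.8, 'right': 0.2},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy's distribution for `state`."""
    dist = policy[state]
    return random.choices(list(dist), weights=dist.values())[0]

print(pi_det['s0'], sample_action(pi_stoch, 's0'))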

A reward function: specified in the environment, not in the agent! It is usually a scalar value at each state. It can be zero until the end, or negative until the end (which encourages speed!). Perhaps surprisingly, reward functions can flexibly specify goals.

A value function: a mapping from states to the expected total reward from that state onward if we follow policy π, written V^π(s). (Later we will see how to compute/estimate this.) Note the major difference between the reward and the value function: the value is a prediction of future rewards, while rewards themselves are simply given to us by the environment.
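To make the reward/value distinction concrete, here is a tiny illustrative example; the chain of states and its rewards are made up, not from the slides. Under a fixed policy the agent passes deterministically through s0, s1, s2, s3 and the episode ends; the reward is delayed until s3, but the value of every earlier state already predicts it.

# Immediate rewards, given by the environment (nonzero only at the end).
reward = {'s0': 0.0, 's1': 0.0, 's2': 0.0, 's3': 1.0}
chain = ['s0', 's1', 's2', 's3']        # states visited under the fixed policy

# V^pi(s): expected total reward collected from s onward.  The chain is
# deterministic, so the expectation is just the remaining sum of rewards.
V = {}
remaining = 0.0
for s in reversed(chain):
    remaining += reward[s]
    V[s] = remaining

print(reward)   # {'s0': 0.0, 's1': 0.0, 's2': 0.0, 's3': 1.0}
print(V)        # every state's value already "sees" the delayed reward of 1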

A model of the environment (optional): something that tells us what to expect if we take an action in a given state. E.g., the probability of getting to state s' from state s if we take action a, written P^a_{ss'}. A model of the environment supports planning through simulating the future ("if I do this, then he'll do ..."). In general, RL agents can span the gamut from reactive to deliberative.
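A minimal sketch of what such a model might look like in code, assuming a made-up two-state, two-action world (none of these numbers come from the slides): the table P[s][a] gives the distribution over next states s', i.e. the quantity written P^a_{ss'} above, and planning amounts to rolling this table forward instead of acting in the real environment.

import random

P = {
    's0': {'right': {'s1': 0.8, 's0': 0.2},   # the action sometimes fails
           'left':  {'s0': 1.0}},
    's1': {'right': {'s2': 1.0},
           'left':  {'s0': 1.0}},
}

def simulate_step(state, action):
    """Use the model to imagine one step ahead, without touching the real world."""
    dist = P[state][action]
    return random.choices(list(dist), weights=dist.values())[0]

# "If I do this, then ..." : sample an imagined next state from the model.
print(simulate_step('s0', 'right'))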

Slide 17: Example 1: Tic-tac-toe
- Everyone knows this one, right?
- Since one can always play to a draw, let's assume an imperfect opponent.
- Surprisingly, standard AI game-playing methods (minimax search, dynamic programming) don't work here (why not?)
- Reinforcement learning approach: set up a table V[si], i = 1, ..., n, where n is the number of possible states of the board and si is a state. Each entry of V[] is an estimate of the probability of a win from that state: the value of that state.
- Assume we always play X's. Initialize V as:
  V[si = a state with three X's in a row] = ?
  V[si = a state with three O's in a row] = ?
  V[si = all other states] = ?

Slide 18: Example 1: Tic-tac-toe
- Reinforcement learning approach: set up the table V[si] as above, then play many games against our imperfect opponent to learn the values of states.
- How do we play? We need a policy. Let's use one called ε-greedy:
  - For each move, ε of the time we pick a move uniformly at random from the possible moves.
  - Otherwise, we pick the move that gets us to the state with the highest value V[si] (greedy).

Slide 19: Example 1: Tic-tac-toe
- ε-greedy policy (details): the ε case is exploration.
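Below is a sketch of how the value table and the ε-greedy policy could be set up in Python. The board representation and helper names are my own, and the initial values (1 for a position where X has already won, 0 where O has won, 0.5 otherwise, as in Sutton & Barto's treatment of this example) are offered only as one possible answer to the question marks on slide 17. Learning itself, i.e. adjusting V from the outcomes of many games, is not shown here.

import random

EPSILON = 0.1          # fraction of moves made at random (exploration)
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def initial_value(board):
    """One possible initialization (the slides pose these values as a question)."""
    w = winner(board)
    return 1.0 if w == 'X' else 0.0 if w == 'O' else 0.5

V = {}                                  # the value table, filled in lazily

def value(board):
    return V.setdefault(board, initial_value(board))

def epsilon_greedy_move(board):
    """Pick a move for X: explore with probability EPSILON, otherwise move
    to the successor state with the highest current value (greedy)."""
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if random.random() < EPSILON:
        return random.choice(moves)                       # explore
    def after(m):                                         # board after X plays m
        return board[:m] + ('X',) + board[m + 1:]
    return max(moves, key=lambda m: value(after(m)))      # exploit

empty_board = (' ',) * 9                # board as a tuple of 9 cells
print(epsilon_greedy_move(empty_board))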

