
Lecture 1: Introduction to Reinforcement Learning




Transcription of Lecture 1: Introduction to Reinforcement Learning

Lecture 1: Introduction to Reinforcement Learning
David Silver

Outline
- Admin
- About Reinforcement Learning
- The Reinforcement Learning Problem
- Inside an RL Agent
- Problems within Reinforcement Learning

Admin: Class Information
- Thursdays, 9:30 to 11:00am
- Website:
- Contact:

Admin: Assessment
- Assessment will be 50% coursework, 50% exam
- Coursework: Assignment A (RL problem), Assignment B (kernels problem); assessment = max(assignment1, assignment2)
- Examination: Part A, 3 RL questions; Part B, 3 kernels questions; answer any 3 questions

Admin: Textbooks
- An Introduction to Reinforcement Learning, Sutton and Barto, MIT Press, 1998 (40 pounds; available free online)
- Algorithms for Reinforcement Learning, Szepesvari, Morgan and Claypool, 2010 (20 pounds; available free online)

About RL: Many Faces of Reinforcement Learning
Reinforcement learning sits at the intersection of many fields, each with its own name for the same underlying problem:
- Computer Science: Machine Learning
- Engineering: Optimal Control
- Neuroscience: Reward System
- Psychology: Classical/Operant Conditioning
- Mathematics: Operations Research
- Economics: Bounded Rationality

About RL: Branches of Machine Learning
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

About RL: Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

About RL: Examples of Reinforcement Learning
- Fly stunt manoeuvres in a helicopter
- Defeat the world champion at Backgammon
- Manage an investment portfolio
- Control a power station
- Make a humanoid robot walk
- Play many different Atari games better than humans

About RL: Helicopter Manoeuvres, Bipedal Robots, Atari (video demonstration slides)

The RL Problem: Rewards
- A reward R_t is a scalar feedback signal
- It indicates how well the agent is doing at step t
- The agent's job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis.
Definition (Reward Hypothesis): all goals can be described by the maximisation of expected cumulative reward.
Do you agree with this statement?

The RL Problem: Examples of Rewards
- Fly stunt manoeuvres in a helicopter: +ve reward for following the desired trajectory, -ve reward for crashing
- Defeat the world champion at Backgammon: +/-ve reward for winning/losing a game
- Manage an investment portfolio: +ve reward for each $ in the bank
- Control a power station: +ve reward for producing power, -ve reward for exceeding safety thresholds
- Make a humanoid robot walk: +ve reward for forward motion, -ve reward for falling over
- Play many different Atari games better than humans: +/-ve reward for increasing/decreasing score

The RL Problem: Sequential Decision Making
Goal:

Select actions to maximise total future reward.
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
- A financial investment (may take months to mature)
- Refuelling a helicopter (might prevent a crash in several hours)
- Blocking opponent moves (might help winning chances many moves from now)

The RL Problem: Agent and Environment
At each step t the agent:
- Executes action A_t
- Receives observation O_t
- Receives scalar reward R_t
The environment:
- Receives action A_t
- Emits observation O_{t+1}
- Emits scalar reward R_{t+1}
t increments at each environment step.

The RL Problem: History and State
The history is the sequence of observations, actions, rewards: H_t = O_1, R_1, A_1, ...
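The agent-environment loop above (actions A_t, observations O_t, rewards R_t, accumulating into the history) can be sketched in a few lines of Python. The environment and agent below are invented stand-ins for illustration, not anything from the lecture:

```python
import random

random.seed(0)  # make the sketch reproducible

class ToyEnv:
    """Hypothetical 1-D environment: positions 0..4, reward on reaching 4."""
    def __init__(self):
        self.pos = 0
    def step(self, action):
        # The environment receives action A_t ...
        self.pos = max(0, min(4, self.pos + action))
        # ... and emits observation O_{t+1} and scalar reward R_{t+1}.
        reward = 1.0 if self.pos == 4 else 0.0
        return self.pos, reward

def random_agent(observation):
    """Hypothetical agent: picks action A_t given observation O_t."""
    return random.choice([-1, +1])

env = ToyEnv()
obs, total_reward = 0, 0.0
for t in range(20):  # t increments at each environment step
    action = random_agent(obs)
    obs, reward = env.step(action)
    total_reward += reward
print(total_reward)
```

The agent's job, maximising the cumulative reward collected by this loop, is what a less random agent than this one would have to learn.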

H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- i.e. all observable variables up to time t
- i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
- The agent selects actions
- The environment selects observations/rewards
State is the information used to determine what happens next. Formally, state is a function of the history:
S_t = f(H_t)

The RL Problem: Environment State
The environment state S^e_t is the environment's private representation:
- i.e. whatever data the environment uses to pick the next observation/reward
- The environment state is not usually visible to the agent
- Even if S^e_t is visible, it may contain irrelevant information

The RL Problem: Agent State
The agent state S^a_t is the agent's internal representation:
- i.e. whatever information the agent uses to pick the next action
- i.e. the information used by reinforcement learning algorithms
- It can be any function of the history: S^a_t = f(H_t)

The RL Problem: Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition: a state S_t is Markov if and only if
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
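Because the agent state is just some function f(H_t) of the history, different choices of f give different agents. A hypothetical sketch of two such choices (the observation names and both functions are invented for illustration):

```python
from collections import Counter

# A hypothetical history of (observation, reward) pairs; actions omitted for brevity.
history = [("light", 0), ("bell", 0), ("lever", 0), ("bell", 1)]

def last_k_state(history, k=3):
    """Agent state S^a_t = the last k observations in the sequence."""
    return tuple(obs for obs, _ in history[-k:])

def count_state(history):
    """Agent state S^a_t = counts of each observation type."""
    return Counter(obs for obs, _ in history)

print(last_k_state(history))       # ('bell', 'lever', 'bell')
print(dict(count_state(history)))  # {'light': 1, 'bell': 2, 'lever': 1}
```

Whether a given choice of f is adequate depends on whether the resulting state is Markov for the environment at hand.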

The future is independent of the past given the present:
H_{1:t} -> S_t -> H_{t+1:infinity}
- Once the state is known, the history may be thrown away
- The state is a sufficient statistic of the future
- The environment state S^e_t is Markov
- The history H_t is Markov

The RL Problem: Rat Example (figure slide)
- What if agent state = last 3 items in sequence?
- What if agent state = counts for lights, bells and levers?
- What if agent state = complete sequence?

The RL Problem: Fully Observable Environments
Full observability: the agent directly observes the environment state:
O_t = S^a_t = S^e_t
- Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
- (Next lecture, and the majority of this course)

The RL Problem: Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
- A robot with camera vision isn't told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards
Now agent state ≠ environment state. Formally this is a partially observable Markov decision process (POMDP). The agent must construct its own state representation S^a_t, e.g.:
- Complete history: S^a_t = H_t
- Beliefs of environment state: S^a_t = (P[S^e_t = s^1], ..., P[S^e_t = s^n])
- Recurrent neural network: S^a_t = σ(S^a_{t-1} W_s + O_t W_o)
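The belief-state construction can be sketched as a Bayes filter. The two-state environment and its observation probabilities below are invented for illustration (the lecture does not specify any), and for simplicity the environment state is assumed not to change between observations; a full POMDP filter would also apply a transition model:

```python
# Hypothetical observation model P[o | s] for a two-state environment.
obs_prob = {
    "s1": {"hot": 0.8, "cold": 0.2},
    "s2": {"hot": 0.3, "cold": 0.7},
}

def update_belief(belief, observation):
    """Bayes update of the belief S^a_t = (P[S^e_t = s1], ..., P[S^e_t = sn])."""
    unnorm = {s: p * obs_prob[s][observation] for s, p in belief.items()}
    z = sum(unnorm.values())  # normalising constant
    return {s: p / z for s, p in unnorm.items()}

belief = {"s1": 0.5, "s2": 0.5}   # uniform prior over environment states
for o in ["hot", "hot", "cold"]:  # a hypothetical observation sequence
    belief = update_belief(belief, o)
print(belief)  # probabilities still sum to 1
```

The belief vector itself is then a Markov state for the agent, even though the underlying environment state is hidden.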

Inside an RL Agent: Major Components of an RL Agent
An RL agent may include one or more of these components:
- Policy: the agent's behaviour function
- Value function: how good is each state and/or action
- Model: the agent's representation of the environment

Inside an RL Agent: Policy
A policy is the agent's behaviour. It is a map from state to action, e.g.:
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]

Inside an RL Agent: Value Function
- The value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- And therefore to select between actions, e.g.:
v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]

Inside an RL Agent: Example: Value Function in Atari (figure slide)

Inside an RL Agent: Model
A model predicts what the environment will do next:
- P predicts the next state
- R predicts the next (immediate) reward, e.g.:
P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
R^a_s = E[R_{t+1} | S_t = s, A_t = a]

Inside an RL Agent: Maze Example (figure: maze with Start and Goal)
- Rewards: -1 per time-step
- Actions: N, E, S, W
- States: agent's location

Inside an RL Agent: Maze Example: Policy
Arrows represent the policy π(s) for each state s (figure slide).

Inside an RL Agent: Maze Example: Value Function
Numbers represent the value v_π(s) of each state s (figure slide).
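The expectation in the value function above is over discounted returns G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... A minimal sketch of computing one such return (the reward values are hypothetical):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):  # fold from the final reward backwards
        g = r + gamma * g
    return g

# v_pi(s) is the expectation of such returns when starting in s and following pi.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Folding backwards avoids recomputing powers of γ: each step multiplies the tail of the return by γ once, which is the same recursion G_t = R_{t+1} + γ G_{t+1}.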

Inside an RL Agent: Maze Example: Model
- The agent may have an internal model of the environment
- Dynamics: how actions change the state
- Rewards: how much reward from each state
- The model may be imperfect
The grid layout represents the transition model P^a_{ss'}; numbers represent the immediate reward R^a_s from each state s, the same for all actions a (figure slide).

Inside an RL Agent: Categorizing RL Agents (1)
- Value based: no policy (implicit), value function
- Policy based: policy, no value function
- Actor critic: policy, value function

Inside an RL Agent: Categorizing RL Agents (2)
- Model free: policy and/or value function, no model
- Model based: policy and/or value function, model

Inside an RL Agent: RL Agent Taxonomy
(Venn diagram: model, value function, policy; value-based, policy-based, actor-critic; model-free, model-based)

Problems within RL: Learning and Planning
Two fundamental problems in sequential decision making:
- Reinforcement learning: the environment is initially unknown; the agent interacts with the environment; the agent improves its policy
- Planning: a model of the environment is known; the agent performs computations with its model (without any external interaction); the agent improves its policy, a.k.a. deliberation, reasoning, introspection, pondering, thought, search
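In the learning setting the environment is initially unknown, so a model-based agent must estimate the components P^a_{ss'} and R^a_s from experience. A hypothetical sketch using simple transition counts (the states, actions and rewards below are invented):

```python
from collections import defaultdict

# Hypothetical experience: observed (s, a, r, s') transitions.
transitions = [
    ("s0", "E", -1.0, "s1"),
    ("s0", "E", -1.0, "s1"),
    ("s0", "N", -1.0, "s0"),
    ("s1", "E",  0.0, "goal"),
]

next_counts = defaultdict(int)   # (s, a, s') -> count
reward_sum = defaultdict(float)  # (s, a) -> summed reward
visits = defaultdict(int)        # (s, a) -> count
for s, a, r, s2 in transitions:
    next_counts[(s, a, s2)] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def P(s, a, s2):
    """Estimate of P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]."""
    return next_counts[(s, a, s2)] / visits[(s, a)]

def R(s, a):
    """Estimate of R^a_s = E[R_{t+1} | S_t = s, A_t = a]."""
    return reward_sum[(s, a)] / visits[(s, a)]

print(P("s0", "E", "s1"), R("s0", "E"))  # 1.0 -1.0
```

Once such a model is estimated, the agent can plan with it, computing inside its own head rather than interacting further, which is exactly the learning/planning distinction above.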

Problems within RL: Atari Example: Reinforcement Learning
- Rules of the game are unknown
- Learn directly from interactive game-play
- Pick actions on the joystick, see pixels and scores

Problems within RL: Atari Example: Planning
- Rules of the game are known
- Can query the emulator: a perfect model inside the agent's brain
- If I take action a from state s: what would the next state be? What would the score be?
- Plan ahead to find the optimal policy, e.g. tree search (figure: search tree over left/right actions)

Problems within RL: Exploration and Exploitation (1)
- Reinforcement learning is like trial-and-error learning
- The agent should discover a good policy
- From its experiences of the environment
- Without losing too much reward along the way

Problems within RL: Exploration and Exploitation (2)
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward
- It is usually important to explore as well as exploit
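A standard way to balance the two (a common technique, not one the lecture prescribes) is epsilon-greedy action selection. The option names and estimated values below are hypothetical:

```python
import random

random.seed(0)  # reproducible sketch

# Hypothetical estimated values of two known options.
q = {"favourite": 0.8, "new_place": 0.5}

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the best known option."""
    if random.random() < epsilon:
        return random.choice(list(q))  # exploration: gather more information
    return max(q, key=q.get)           # exploitation: maximise known reward

choices = [epsilon_greedy(q) for _ in range(1000)]
print(choices.count("favourite") / 1000)
```

Most of the time the agent exploits its current best estimate, but the occasional random choice keeps gathering the information needed to correct that estimate.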

Problems within RL: Examples
- Restaurant selection. Exploitation: go to your favourite restaurant. Exploration: try a new restaurant.
- Online banner advertisements. Exploitation: show the most successful advert. Exploration: show a different advert.
- Oil drilling. Exploitation: drill at the best known location. Exploration: drill at a new location.
- Game playing. Exploitation: play the move you believe is best. Exploration: play an experimental move.

Problems within RL: Prediction and Control
- Prediction: evaluate the future, given a policy
- Control: optimise the future, find the best policy

Problems within RL: Gridworld Example: Prediction
What is the value function for the uniform random policy? (figure: gridworld where special states give rewards +10 and +5)

Problems within RL: Gridworld Example: Control
What is the optimal value function v* over all possible policies? What is the optimal policy π*? (figure: gridworld, v*, π*)
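The prediction question can be answered by iterative policy evaluation. Below is a minimal sketch on a hypothetical 4-state chain rather than the lecture's gridworld (the dynamics and rewards are invented): a uniform random policy over left/right actions, -1 reward per step, and a terminal state at position 3.

```python
# Iterative policy evaluation for the uniform random policy on a
# hypothetical 4-state chain: actions left/right, reward -1 per step,
# state 3 terminal, gamma = 1 (undiscounted, episodic).
GAMMA = 1.0
TERMINAL = 3

def step(s, a):
    """Deterministic dynamics with walls at both ends of the chain."""
    return max(0, min(3, s + a)), -1.0

v = [0.0, 0.0, 0.0, 0.0]
for _ in range(1000):  # sweep the Bellman expectation backup to convergence
    v = [0.0 if s == TERMINAL else
         sum(0.5 * (r + GAMMA * v[s2]) for s2, r in (step(s, -1), step(s, +1)))
         for s in range(4)]
print([round(x, 1) for x in v])  # [-12.0, -10.0, -6.0, 0.0]
```

Control would replace the 0.5-weighted average over actions with a max over actions, which is how the course later moves from evaluating a policy to finding the best one.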

