Lecture 1: Introduction to Reinforcement Learning

Transcription of Lecture 1: Introduction to Reinforcement Learning

Lecture 1: Introduction to Reinforcement Learning
David Silver

Outline
1. Admin
2. About Reinforcement Learning
3. The Reinforcement Learning Problem
4. Inside An RL Agent
5. Problems within Reinforcement Learning

Admin: Class Information
- Thursdays, 9:30 to 11:00am
- Website:
- Email:

Admin: Assessment
- Assessment will be 50% coursework, 50% exam
- Coursework: Assignment A (an RL problem) and Assignment B (a kernels problem); coursework mark = max(assignment1, assignment2)
- Examination: Part A has 3 RL questions, Part B has 3 kernels questions; answer any 3 questions

Admin: Textbooks
- An Introduction to Reinforcement Learning, Sutton and Barto, MIT Press, 1998. 40 pounds; available free online!

- Algorithms for Reinforcement Learning, Szepesvári, Morgan and Claypool, 2010. 20 pounds; available free online!

About RL: Many Faces of Reinforcement Learning
Reinforcement learning sits at the intersection of many fields: machine learning (computer science), optimal control (engineering), the reward system (neuroscience), classical/operant conditioning (psychology), bounded rationality (economics), and operations research (mathematics).

About RL: Branches of Machine Learning
Machine learning has three main branches: supervised learning, unsupervised learning, and reinforcement learning.

About RL: Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?

- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

About RL: Examples of Reinforcement Learning
- Fly stunt manoeuvres in a helicopter
- Defeat the world champion at Backgammon
- Manage an investment portfolio
- Control a power station
- Make a humanoid robot walk
- Play many different Atari games better than humans

(Demo slides: Helicopter Manoeuvres, Bipedal Robots, Atari.)

The RL Problem: Rewards
- A reward R_t is a scalar feedback signal
- It indicates how well the agent is doing at step t
- The agent's job is to maximise cumulative reward
- Reinforcement learning is based on the reward hypothesis

Definition (Reward Hypothesis): All goals can be described by the maximisation of expected cumulative reward.

Do you agree with this statement?

The RL Problem: Examples of Rewards
- Fly stunt manoeuvres in a helicopter: +ve reward for following the desired trajectory; -ve reward for crashing
- Defeat the world champion at Backgammon: +/-ve reward for winning/losing a game
- Manage an investment portfolio: +ve reward for each $ in the bank
- Control a power station: +ve reward for producing power; -ve reward for exceeding safety thresholds
- Make a humanoid robot walk: +ve reward for forward motion; -ve reward for falling over
- Play many different Atari games better than humans: +/-ve reward for increasing/decreasing the score

The RL Problem: Sequential Decision Making
- Goal: select actions to maximise total future reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (see the toy sketch below)
- Examples: a financial investment (may take months to mature); refuelling a helicopter (might prevent a crash in several hours); blocking opponent moves (might help winning chances many moves from now)
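To make the trade-off concrete, here is a toy numerical sketch (my own illustration, not from the slides): an agent that always grabs the immediate reward can end up with less total reward than one that waits.

```python
# Toy illustration (not from the lecture): sacrificing immediate reward
# can yield more total future reward.
reward_if_greedy  = [1, 0, 0, 0, 0]    # +1 now, nothing afterwards
reward_if_patient = [0, 0, 0, 0, 10]   # nothing now, +10 later

print(sum(reward_if_greedy))    # 1
print(sum(reward_if_patient))   # 10 -> waiting wins on total reward
```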

The RL Problem: Agent and Environment
At each step t the agent:
- Executes action A_t
- Receives observation O_t
- Receives scalar reward R_t
The environment:
- Receives action A_t
- Emits observation O_{t+1}
- Emits scalar reward R_{t+1}
t increments at the environment step. (A minimal code sketch of this loop follows the next section.)

The RL Problem: History and State
- The history is the sequence of observations, actions and rewards: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- i.e. all observable variables up to time t: the sensorimotor stream of a robot or embodied agent
- What happens next depends on the history: the agent selects actions; the environment selects observations/rewards
- State is the information used to determine what happens next
- Formally, state is a function of the history: S_t = f(H_t)
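The loop above is easy to write down in code. Below is a minimal sketch, assuming a hypothetical Env class with a step(action) method; all names are illustrative and not from any particular library.

```python
import random

class Env:
    """A stand-in environment: emits a random observation and a reward."""
    def step(self, action):
        observation = random.random()           # O_{t+1}
        reward = 1.0 if action == 1 else 0.0    # R_{t+1}
        return observation, reward

env = Env()
history = []                         # H_t = O_1, R_1, A_1, ..., O_t, R_t
obs, reward = 0.0, 0.0               # placeholder initial O_1, R_1
for t in range(5):
    history += [obs, reward]         # record O_t, R_t
    state = tuple(history[-6:])      # S_t = f(H_t): here, a recent window
    action = random.choice([0, 1])   # agent picks A_t (random policy here)
    history.append(action)           # record A_t
    obs, reward = env.step(action)   # environment emits O_{t+1}, R_{t+1}
```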

The RL Problem: Environment State
- The environment state S^e_t is the environment's private representation
- i.e. whatever data the environment uses to pick the next observation/reward
- The environment state is not usually visible to the agent
- Even if S^e_t is visible, it may contain irrelevant information

The RL Problem: Agent State
- The agent state S^a_t is the agent's internal representation
- i.e. whatever information the agent uses to pick the next action
- i.e. it is the information used by reinforcement learning algorithms
- It can be any function of history: S^a_t = f(H_t)

The RL Problem: Information State
An information state (a.k.a. Markov state) contains all useful information from the history.

Definition: A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
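As a rough empirical illustration (my own, not from the lecture), one can compare P[next | last symbol] against P[next | last two symbols] on a toy process. If "last symbol" were a Markov state, conditioning on one more symbol would not change the distribution; for the second-order process below it clearly does.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Toy process: the next symbol depends on the last TWO symbols, so
# "last symbol only" is NOT a Markov state for it.
seq = [0, 1]
for _ in range(100_000):
    nxt = (seq[-2] + seq[-1]) % 3 if random.random() < 0.9 else random.randrange(3)
    seq.append(nxt)

p_one = defaultdict(Counter)   # estimates P[next | last symbol]
p_two = defaultdict(Counter)   # estimates P[next | last two symbols]
for a, b, c in zip(seq, seq[1:], seq[2:]):
    p_one[b][c] += 1
    p_two[a, b][c] += 1

# Probability that the next symbol is 0, under each conditioning.
print({k: round(v[0] / sum(v.values()), 2) for k, v in p_one.items()})
print({k: round(v[0] / sum(v.values()), 2) for k, v in p_two.items()})
```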

- The future is independent of the past given the present: H_{1:t} -> S_t -> H_{t+1:∞}
- Once the state is known, the history may be thrown away
- The state is a sufficient statistic of the future
- The environment state S^e_t is Markov
- The history H_t is Markov

The RL Problem: Rat Example
What if the agent state is:
- the last 3 items in the sequence?
- counts for lights, bells and levers?
- the complete sequence?
(A sketch of these three candidate states follows.)
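The three candidate agent states from the rat example are easy to compute from a history of events; a minimal sketch (event names are illustrative):

```python
from collections import Counter

history = ["light", "light", "lever", "bell", "lever", "light"]

last_three = history[-3:]     # agent state = last 3 items
counts = Counter(history)     # agent state = counts of lights/bells/levers
full = tuple(history)         # agent state = the complete sequence

print(last_three)   # ['bell', 'lever', 'light']
print(counts)       # Counter({'light': 3, 'lever': 2, 'bell': 1})
```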

The RL Problem: Fully Observable Environments
- Full observability: the agent directly observes the environment state: O_t = S^a_t = S^e_t
- Agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
- (Next lecture, and the majority of this course)

The RL Problem: Partially Observable Environments
- Partial observability: the agent indirectly observes the environment:
  - A robot with camera vision isn't told its absolute location
  - A trading agent only observes current prices
  - A poker-playing agent only observes public cards
- Now agent state ≠ environment state
- Formally, this is a partially observable Markov decision process (POMDP)
- The agent must construct its own state representation S^a_t, e.g.:
  - Complete history: S^a_t = H_t
  - Beliefs of environment state: S^a_t = (P[S^e_t = s^1], ..., P[S^e_t = s^n])
  - Recurrent neural network: S^a_t = σ(S^a_{t-1} W_s + O_t W_o) (see the numpy sketch below)
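The recurrent update S^a_t = σ(S^a_{t-1} W_s + O_t W_o) can be sketched directly in numpy; the weights below are random placeholders, not trained parameters, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_obs = 4, 3
W_s = rng.normal(size=(n_state, n_state))   # recurrent weights
W_o = rng.normal(size=(n_obs, n_state))     # observation weights

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))         # logistic nonlinearity

s = np.zeros(n_state)                       # agent state S^a_0
for t in range(10):
    o = rng.normal(size=n_obs)              # observation O_t
    s = sigma(s @ W_s + o @ W_o)            # S^a_t = sigma(S^a_{t-1} W_s + O_t W_o)
```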

Inside An RL Agent: Major Components of an RL Agent
An RL agent may include one or more of these components:
- Policy: the agent's behaviour function
- Value function: how good is each state and/or action
- Model: the agent's representation of the environment

Inside An RL Agent: Policy
- A policy is the agent's behaviour
- It is a map from state to action, e.g.:
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]

Inside An RL Agent: Value Function
- The value function is a prediction of future reward
- It is used to evaluate the goodness/badness of states
- And therefore to select between actions, e.g.: v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]

(Figure slide: Value Function in Atari.)

Inside An RL Agent: Model
- A model predicts what the environment will do next
- P predicts the next state
- R predicts the next (immediate) reward, e.g.:
- P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
- R^a_s = E[R_{t+1} | S_t = s, A_t = a]
(A toy sketch of both definitions follows.)
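A minimal sketch of both definitions, with toy numbers and γ assumed to be 0.9: a Monte-Carlo estimate of v_π(s) from sampled reward sequences, and a tabular model (P, R) held in dictionaries.

```python
gamma = 0.9

def discounted_return(rewards):
    g = 0.0
    for r in reversed(rewards):       # G = R_{t+1} + gamma * (rest)
        g = r + gamma * g
    return g

# Monte-Carlo estimate of v_pi(s): average discounted return over
# reward sequences sampled starting from s (toy data).
episodes_from_s = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
v_hat = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)

# A tabular model: P[(s, a)] is a distribution over next states,
# R[(s, a)] is the expected immediate reward (toy entries).
P = {("s0", "a"): {"s1": 0.8, "s0": 0.2}}
R = {("s0", "a"): -1.0}
```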

Inside An RL Agent: Maze Example
- Rewards: -1 per time-step
- Actions: N, E, S, W
- States: the agent's location
(Figure: a maze with Start and Goal cells.)

Inside An RL Agent: Maze Example: Policy
(Figure: arrows represent the policy π(s) for each state s.)

Inside An RL Agent: Maze Example: Value Function
(Figure: numbers represent the value v_π(s) of each state s.)

Inside An RL Agent: Maze Example: Model
- The agent may have an internal model of the environment
- Dynamics: how actions change the state
- Rewards: how much reward comes from each state
- The model may be imperfect
(Figure: the grid layout represents the transition model P^a_{ss'}; the numbers represent the immediate reward R^a_s from each state s, the same for all actions a.)
(A value-iteration sketch of this maze follows.)
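A compact sketch of the maze example, assuming a stand-in 2x3 grid rather than the lecture's actual maze layout: value iteration with -1 per step recovers values equal to minus the number of steps to the goal, matching the pattern on the slide (the policy shown there is optimal, so v_π coincides with the optimal values).

```python
ROWS, COLS = 2, 3
GOAL = (0, 2)
WALLS = set()                            # add (row, col) cells to block
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def step(s, a):
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALLS:
        return s                         # bumping a wall leaves you in place
    return nxt

v = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for _ in range(50):                      # value iteration, undiscounted
    for s in v:
        if s != GOAL:
            v[s] = max(-1 + v[step(s, a)] for a in MOVES)

print(v[(1, 0)])   # -3: three steps from the goal
```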

