
Introduction to Reinforcement Learning

Transcription of Introduction to Reinforcement Learning

Bayesian Methods in Reinforcement Learning, ICML 2007

Sequential decision making under uncertainty. How can I ...
- move around in the physical world (e.g., driving, navigation)?
- play and win a game?
- retrieve information over the web?
- do medical diagnosis and treatment?
- maximize the throughput of a factory?
- optimize the performance of a rescue team?

Reinforcement Learning

RL is a class of learning problems in which an agent interacts with an unfamiliar, dynamic, and stochastic environment. The goal is to learn a policy that maximizes some measure of long-term reward. The interaction is modeled as an MDP or a POMDP: at each step the agent observes a state, takes an action, and receives a reward from the environment.

Markov Decision Processes

An MDP is defined as a 5-tuple $(X, A, p, q, p_0)$:
- $X$: state space of the process
- $A$: action space of the process
- $p(\cdot \mid x, a)$: probability distribution over the next state, $x_{t+1} \sim p(\cdot \mid x_t, a_t)$
- $q(\cdot \mid x, a)$: probability distribution over rewards, $R(x_t, a_t) \sim q(\cdot \mid x_t, a_t)$
- $p_0$: initial state distribution

A policy is a mapping from states to actions, or to distributions over actions: $\pi(x) \in A$ or $\pi(\cdot \mid x) \in \Pr(A)$.
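The 5-tuple above maps directly onto arrays for a finite MDP. Below is a minimal Python sketch (not part of the original slides; the class name FiniteMDP and its fields are illustrative choices, and the reward distribution $q$ is reduced to its expectation for simplicity):

```python
import numpy as np

class FiniteMDP:
    """A minimal sketch of a finite MDP (X, A, p, q, p0) as plain arrays."""

    def __init__(self, p, r, p0, gamma=0.95):
        # p[x, a, x'] = Pr(x_{t+1} = x' | x_t = x, a_t = a)
        # r[x, a]     = expected reward R(x, a), i.e. the mean of q(.|x, a)
        # p0[x]       = initial state distribution
        self.p, self.r, self.p0, self.gamma = p, r, p0, gamma
        self.n_states, self.n_actions = r.shape

    def reset(self, rng):
        # Sample an initial state x_0 ~ p0
        return rng.choice(self.n_states, p=self.p0)

    def step(self, x, a, rng):
        # Sample x_{t+1} ~ p(.|x, a) and return the expected reward
        x_next = rng.choice(self.n_states, p=self.p[x, a])
        return x_next, self.r[x, a]
```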

Example: Backgammon

- States: board configurations (about $10^{20}$)
- Actions: permissible moves
- Rewards: win +1, lose -1, else 0

RL Applications

- Backgammon (Tesauro, 1994)
- Inventory management (Van Roy, Bertsekas, Lee, & Tsitsiklis, 1996)
- Dynamic channel allocation (Singh & Bertsekas, 1997)
- Elevator scheduling (Crites & Barto, 1998)
- RoboCup soccer (Stone & Veloso, 1999)
- Many robots (navigation, bipedal walking, grasping, switching between skills, ...)
- Helicopter control (Ng, 2003; Abbeel & Ng, 2006)
- ... and more applications

Value Function

State value function (with discount factor $\gamma \in [0, 1)$):
$$V^\pi(x) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R\big(x_t, \pi(x_t)\big) \,\Big|\, x_0 = x \Big]$$

State-action value function:
$$Q^\pi(x, a) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \,\Big|\, x_0 = x,\ a_0 = a \Big]$$

Policy Evaluation

Finding the value function of a policy. Bellman equations:
$$V^\pi(x) = \sum_{a \in A} \pi(a \mid x) \Big[ R(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a)\, V^\pi(x') \Big]$$
$$Q^\pi(x, a) = R(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a) \sum_{a' \in A} \pi(a' \mid x')\, Q^\pi(x', a')$$

Policy Optimization

Finding a policy $\pi^*$ maximizing $V^\pi(x)$ for all $x \in X$.
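For a finite MDP, the Bellman equations for $V^\pi$ are linear in the values, so policy evaluation can be done exactly by solving a linear system. A sketch, assuming the array layout of the FiniteMDP sketch above (the function name evaluate_policy is illustrative):

```python
import numpy as np

def evaluate_policy(p, r, pi, gamma=0.95):
    """Solve the Bellman equations for V^pi exactly.

    p[x, a, x'] : transition probabilities, r[x, a] : expected rewards,
    pi[x, a]    : probability of taking action a in state x.
    """
    # P_pi[x, x'] = sum_a pi(a|x) p(x'|x, a);  R_pi[x] = sum_a pi(a|x) R(x, a)
    p_pi = np.einsum('xa,xay->xy', pi, p)
    r_pi = np.einsum('xa,xa->x', pi, r)
    # The Bellman equation V = R_pi + gamma * P_pi V rearranges to
    # (I - gamma * P_pi) V = R_pi, a standard linear solve.
    n = r_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * p_pi, r_pi)
```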

Note: if $Q^*$ is available, where $Q^*(x, a) = Q^{\pi^*}(x, a)$, then an optimal action for state $x$ is given by any $a \in \arg\max_a Q^*(x, a)$.

Bellman optimality equations:
$$V^*(x) = \max_{a \in A} \Big[ R(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a)\, V^*(x') \Big]$$
$$Q^*(x, a) = R(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a) \max_{a' \in A} Q^*(x', a')$$

Policy Optimization: Value Iteration

$$V_0(x) = 0, \qquad V_{t+1}(x) = \max_{a \in A} \Big[ R(x, a) + \gamma \sum_{x' \in X} p(x' \mid x, a)\, V_t(x') \Big]$$

But what if the system dynamics are unknown?

Reinforcement Learning (RL)

RL problem: solve the MDP when the transition and/or reward models are unknown. Basic idea: use samples obtained from the agent's interaction with the environment to solve the MDP.
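The value iteration update above translates almost line-for-line into code. A minimal sketch under the same assumed array layout, iterating until the values stop changing:

```python
import numpy as np

def value_iteration(p, r, gamma=0.95, tol=1e-8):
    """Iterate V_{t+1}(x) = max_a [ R(x,a) + gamma * sum_x' p(x'|x,a) V_t(x') ]."""
    n_states, n_actions = r.shape
    v = np.zeros(n_states)                 # V_0(x) = 0
    while True:
        q = r + gamma * p @ v              # q[x, a]; p @ v contracts over x'
        v_new = q.max(axis=1)              # max over actions
        if np.max(np.abs(v_new - v)) < tol:
            # Return the values and a greedy (near-optimal) policy
            return v_new, q.argmax(axis=1)
        v = v_new
```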

Model-Based vs. Model-Free RL

What is a model? The state transition distribution and the reward distribution.
- Model-based RL: the model is not available, but it is explicitly learned.
- Model-free RL: the model is not available and is not explicitly learned.

(Diagram: experience feeds model learning, which yields a model used for planning in model-based RL; experience feeds direct, model-free RL; both routes produce a value function or policy used for acting.)

Reinforcement Learning Solutions

- Value function algorithms: Value Iteration, SARSA, Q-learning
- Actor-critic algorithms
- Policy search algorithms: PEGASUS, genetic algorithms
- Policy gradient algorithms (Sutton et al., 2000; Konda & Tsitsiklis, 2000; Peters et al., 2005; Bhatnagar, Ghavamzadeh, & Sutton, 2007)
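As one concrete instance of the model-free value function algorithms listed above, here is a tabular Q-learning sketch. The environment interface (reset/step returning a done flag) is an assumption for illustration, not something the slides specify:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning: estimate Q* from samples, with no model.

    `env` is an assumed interface: reset(rng) -> x and
    step(x, a, rng) -> (x_next, reward, done).
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x, done = env.reset(rng), False
        while not done:
            # epsilon-greedy exploration
            a = rng.integers(n_actions) if rng.random() < eps else q[x].argmax()
            x_next, reward, done = env.step(x, a, rng)
            # Move Q(x, a) toward the Bellman optimality target
            target = reward + gamma * (0.0 if done else q[x_next].max())
            q[x, a] += alpha * (target - q[x, a])
            x = x_next
    return q
```

Because the update bootstraps from max_a Q(x', a) rather than from the action actually taken, Q-learning learns about the greedy policy while following an exploratory one; SARSA differs only in bootstrapping from the action the agent actually executes next.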

Learning Modes

- Offline learning: learning while interacting with a simulator.
- Online learning: learning while interacting with the environment.

Offline Learning

- The agent interacts with a simulator.
- Rewards/costs do not matter, so there is no exploration/exploitation tradeoff.
- Computation time between actions is not critical.
- The simulator can produce as much data as we wish.
- Main challenge: how to minimize the time to converge to the optimal policy.

Online Learning

- No simulator: direct interaction with the environment.
- The agent receives a reward/cost for each action.
- Main challenges: the exploration/exploitation tradeoff (should actions be picked to maximize immediate reward, or to maximize information gain and so improve the policy?); real-time execution of actions; and the limited amount of data, since interaction with the environment is required.

Bayesian Learning

The Bayesian approach: $Z$ is a hidden process, $Y$ is observable.
- Goal: infer $Z$ from measurements of $Y$.
- Known: the statistical dependence $P(Y \mid Z)$ between $Z$ and $Y$.
- Place a prior $P(Z)$ over $Z$, reflecting our uncertainty.
- Observe $Y = y$.
- Compute the posterior of $Z$:
$$P(Z \mid Y = y) = \frac{P(y \mid Z)\, P(Z)}{\int P(y \mid Z')\, P(Z')\, dZ'}$$
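The integral in the denominator is what usually makes the posterior intractable; with a conjugate prior it disappears. A toy illustration (not from the slides): a Beta prior over the success probability of a Bernoulli process, where the posterior stays Beta in closed form:

```python
# Z is the unknown success probability of a Bernoulli process,
# Y are observed 0/1 outcomes. With a Beta(alpha, beta) prior, the
# posterior is Beta(alpha + successes, beta + failures), so the
# normalizing integral never has to be computed explicitly.

def beta_bernoulli_posterior(alpha, beta, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

# e.g., a uniform Beta(1, 1) prior and observations y = [1, 1, 0, 1]
# yield a Beta(4, 2) posterior, with posterior mean 4 / (4 + 2).
```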

Bayesian Learning: Pros and Cons

Pros:
- Principled treatment of uncertainty
- Conceptually simple
- Immune to overfitting (the prior serves as a regularizer)
- Facilitates encoding of domain knowledge (the prior)

Cons:
- Mathematically and computationally demanding: the posterior may not have a closed form
- How do we pick the prior?

Bayesian RL

A systematic method for the inclusion and update of prior knowledge and domain assumptions. Encode uncertainty about the transition function, reward function, value function, policy, etc., with a probability distribution (a belief).

- Update the belief based on evidence (e.g., state, action, reward).
- Appropriately reconcile exploration with exploitation: select actions based on the belief.
- Provide a full distribution, not just point estimates: a measure of uncertainty for performance predictions (e.g., of the value function or the policy gradient).

Bayesian RL

- Model-based Bayesian RL: distribution over the transition probabilities
- Model-free Bayesian RL: distribution over the value function, policy, or policy gradient
- Bayesian inverse RL: distribution over the reward
- Bayesian multi-agent RL: distribution over other agents' policies
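For model-based Bayesian RL on a finite MDP, one common concrete choice of belief is an independent Dirichlet distribution over next states for each state-action pair, updated conjugately from observed transitions. A hypothetical sketch (class and parameter names are illustrative, not from the tutorial):

```python
import numpy as np

class DirichletTransitionBelief:
    """Belief over p(.|x, a): one Dirichlet per state-action pair."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        # alpha[x, a, x'] are Dirichlet parameters; prior_count encodes
        # the prior (e.g., 1.0 gives a uniform prior over next states).
        self.alpha = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, x, a, x_next):
        # Conjugate update: observing (x, a, x') increments one count.
        self.alpha[x, a, x_next] += 1.0

    def mean(self):
        # Posterior mean transition model, usable for planning.
        return self.alpha / self.alpha.sum(axis=-1, keepdims=True)

    def sample(self, rng):
        # One model drawn from the belief, as used when sampling an MDP
        # and acting greedily in it (posterior / Thompson sampling).
        return np.array([[rng.dirichlet(row) for row in sa]
                         for sa in self.alpha])
```

Acting on a sampled model rather than the posterior mean is one way such a belief reconciles exploration with exploitation: poorly known state-action pairs produce high-variance samples and therefore get tried.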

