Part XIII Reinforcement Learning and Control

CS229 Lecture notesAndrew NgPart XIIIR einforcement Learning andControlWe now begin our study of Reinforcement Learning and adaptive supervised Learning , we saw algorithms that tried to make their outputsmimic the labelsygiven in the training set. In that setting, the labels gavean unambiguous right answer for each of the inputsx. In contrast, formany sequential decision making and Control problems, it is very difficult toprovide this type of explicit supervision to a Learning algorithm. For example,if we have just built a four-legged robot and are trying to program itto walk,then initially we have no idea what the correct actions to take are to makeit walk, and so do not know how to provide explicit supervision for a learningalgorithm to try to the Reinforcement Learning framework, we will instead provide our al-gorithms only a reward function, which indicates to the Learning agent whenit is doing well, and when it is doing poorly.

In the four-legged walking ex-ample, the reward function might give the robot positive rewards for movingforwards, and negative rewards for either moving backwards or falling will then be the Learning algorithm s job to figure out how to chooseactionsover time so as to obtain large Learning has been successful in applications as diverse asautonomous helicopter flight, robot legged locomotion, cell-phone networkrouting, marketing strategy selection, factory Control , and efficient web-pageindexing. Our study of Reinforcement Learning will begin with a definition oftheMarkov decision processes (MDP), which provides the formalism inwhich RL problems are usually Markov decision processesA Markov decision process is a tuple (S,A,{Psa}, ,R), where: Sis a set ofstates. (For example, in autonomous helicopter flight,Smight be the set of all possible positions and orientations of the heli-copter.)

Ais a set ofactions. (For example, the set of all possible directions inwhich you can push the helicopter s Control sticks.) Psaare the state transition probabilities. For each states Sandactiona A,Psais a distribution over the state space. We ll say moreabout this later, but briefly,Psagives the distribution over what stateswe will transition to if we take actionain states. [0,1) is called thediscount factor. R:S A7 Ris thereward function. (Rewards are sometimes alsowritten as a function of a stateSonly, in which case we would haveR:S7 R).The dynamics of an MDP proceeds as follows: We start in some states0,and get to choose some actiona0 Ato take in the MDP. As a result of ourchoice, the state of the MDP randomly transitions to some successor states1, drawn according tos1 Ps0a0. Then, we get to pick another a result of this action, the state transitions again, now to somes2 then picka2, and so on.]

Pictorially, we can represent this process asfollows:s0a0 s1a1 s2a2 s3a3 ..Upon visiting the sequence of statess0,s1,..with actionsa0,a1,.., ourtotal payoff is given byR(s0,a0) + R(s1,a1) + 2R(s2,a2) + .Or, when we are writing rewards as a function of the states only, this becomesR(s0) + R(s1) + 2R(s2) + .For most of our development, we will use the simpler state-rewardsR(s),though the generalization to state-action rewardsR(s,a) offers no goal in Reinforcement Learning is to choose actions over time so as tomaximize the expected value of the total payoff:E[R(s0) + R(s1) + 2R(s2) + ]Note that the reward at timesteptisdiscountedby a factor of t. Thus, tomake this expectation large, we would like to accrue positive rewardsas soonas possible (and postpone negative rewards as long as possible).

Ineconomicapplications whereR( ) is the amount of money made, also has a naturalinterpretation in terms of the interest rate (where a dollar today isworthmore than a dollar tomorrow).Apolicyis any function :S7 Amapping from the states to theactions. We say that we areexecutingsome policy if, whenever we arein states, we take actiona= (s). We also define thevalue functionfora policy according toV (s) = E[R(s0) + R(s1) + 2R(s2) + s0=s, ].V (s) is simply the expected sum of discounted rewards upon starting instates, and taking actions according to .1 Given a fixed policy , its value functionV satisfies theBellman equa-tions:V (s) =R(s) + s SPs (s)(s )V (s ).This says that the expected sum of discounted rewardsV (s) for startinginsconsists of two terms: First, theimmediate rewardR(s) that we getrightaway simply for starting in states, and second, the expected sum offuture discounted rewards.

Examining the second term in more detail, wesee that the summation term above can be rewritten Es Ps (s)[V (s )]. Thisis the expected sum of discounted rewards for starting in states , wheres is distributed accordingPs (s), which is the distribution over where we willend up after taking the first action (s) in the MDP from states. Thus, thesecond term above gives the expected sum of discounted rewardsobtainedafterthe first step in the s equations can be used to efficiently solve forV . Specifically,in a finite-state MDP (|S|< ), we can write down one such equation forV (s) for every states. This gives us a set of|S|linear equations in|S|variables (the unknownV (s) s, one for each state), which can be efficientlysolved for theV (s) notation in which we condition on isn t technically correct because isn t arandom variable, but this is quite standard in the also define theoptimal value functionaccording toV (s) = max V (s).

(1)In other words, this is the best possible expected sum of discounted rewardsthat can be attained using any policy. There is also a version of Bellman sequations for the optimal value function:V (s) =R(s) + maxa A s SPsa(s )V (s ).(2)The first term above is the immediate reward as before. The secondtermis the maximum over all actionsaof the expected future sum of discountedrewards we ll get upon after actiona. You should make sure you understandthis equation and see why it makes also define a policy :S7 Aas follows: (s) = arg maxa A s SPsa(s )V (s ).(3)Note that (s) gives the actionathat attains the maximum in the max in Equation (2).It is a fact that for every statesand every policy , we haveV (s) =V (s) V (s).The first equality says that theV , the value function for , is equal to theoptimal value functionV for every states.

Further, the inequality abovesays that s value is at least a large as the value of any other other other words, as defined in Equation (3) is the optimal that has the interesting property that it is the optimal policyforallstatess. Specifically, it is not the case that if we were starting insome statesthen there d be some optimal policy for that state, and if wewere starting in some other states then there d be some other policy that soptimal policy fors . Specifically, the same policy attains the maximumin Equation (1) forallstatess. This means that we can use the same policy no matter what the initial state of our MDP Value iteration and policy iterationWe now describe two efficient algorithms for solving finite-state MDPs. Fornow, we will consider only MDPs with finite state and action spaces (|S|< ,|A|< ).

The first algorithm,value iteration, is as follows:51. For each states, initializeV(s) := Repeat until convergence{For every state, updateV(s) :=R(s) + maxa A s Psa(s )V(s ).}This algorithm can be thought of as repeatedly trying to update theesti-mated value function using Bellman Equations (2).There are two possible ways of performing the updates in the inner loop ofthe algorithm. In the first, we can first compute the new values forV(s) forevery states, and then overwrite all the old values with the new values. Thisis called asynchronousupdate. In this case, the algorithm can be viewed asimplementing a Bellman backup operator that takes a current estimate ofthe value function, and maps it to a new estimate. (See homework problemfor details.) Alternatively, we can also , we would loop over the states (in some order), updating the values oneat a either synchronous or asynchronous updates, it can be shown thatvalue iteration will causeVto converge toV.

Having foundV , we canthen use Equation (3) to find the optimal from value iteration, there is a second standard algorithm for find-ing an optimal policy for an MDP. Thepolicy iterationalgorithm proceedsas follows:1. Initialize Repeat until convergence{(a) LetV:=V .(b) For each states, let (s) := arg maxa A s Psa(s )V(s ).}Thus, the inner-loop repeatedly computes the value function for the currentpolicy, and then updates the policy using the current value function. (Thepolicy found in step (b) is also called the policy that isgreedy with re-spect toV.) Note that step (a) can be done via solving Bellman s equationsas described earlier, which in the case of a fixed policy, is just a set of|S|linear equations in|S| at most a finite number of iterations of this algorithm,Vwill con-verge toV , and will converge to.

Part XIII Reinforcement Learning and Control

Tags:

Information

Advertisement

Transcription of Part XIII Reinforcement Learning and Control

Related search queries

Part XIII Reinforcement Learning and Control

Tags:

Information

Advertisement

Documents from same domain

Related documents

Related search queries