Optimal Control Theory - University of Washington

To appear in Bayesian Brain, Doya, K. (ed), MIT Press (2006)

Optimal Control Theory

Emanuel Todorov, University of California San Diego

Optimal control theory is a mature mathematical discipline with numerous applications in both science and engineering. It is emerging as the computational framework of choice for studying the neural control of movement, in much the same way that probabilistic inference is emerging as the computational framework of choice for studying sensory information processing. Despite the growing popularity of optimal control models, however, the elaborate mathematical machinery behind them is rarely exposed and the big picture is hard to grasp without reading a few technical books on the subject. While this chapter cannot replace such books, it aims to provide a self-contained mathematical introduction to optimal control theory that is sufficiently broad and yet sufficiently detailed when it comes to key concepts.

The text is not tailored to the field of motor control (apart from the last section, and the overall emphasis on systems with continuous state) so it will hopefully be of interest to a wider audience. Of special interest in the context of this book is the material on the duality of optimal control and probabilistic inference; such duality suggests that neural information processing in sensory and motor areas may be more similar than currently thought. The chapter is organized in the following sections:

1. Dynamic programming, Bellman equations, optimal value functions, value and policy iteration, shortest paths, Markov decision processes.
2. Hamilton-Jacobi-Bellman equations, approximation methods, finite and infinite horizon formulations, basics of stochastic calculus.
3. Pontryagin's maximum principle, ODE and gradient descent methods, relationship to classical mechanics.
4. Linear-quadratic-Gaussian control, Riccati equations, iterative linear approximations to nonlinear problems.
5. Optimal recursive estimation, Kalman filter, Zakai equation.
6. Duality of optimal control and optimal estimation (including new results).
7. Optimality models in motor control, promising research directions.

Discrete control: Bellman equations

Let $x \in \mathcal{X}$ denote the state of an agent's environment, and $u \in \mathcal{U}(x)$ the action (or control) which the agent chooses while at state $x$. For now both $\mathcal{X}$ and $\mathcal{U}(x)$ are finite sets. Let $next(x, u) \in \mathcal{X}$ denote the state which results from applying action $u$ in state $x$, and $cost(x, u) \geq 0$ the cost of applying action $u$ in state $x$. As an example, $x$ may be the city where we are now, $u$ the flight we choose to take, $next(x, u)$ the city where that flight lands, and $cost(x, u)$ the price of the ticket. We can now pose a simple yet practical optimal control problem: find the cheapest way to fly to your destination. This problem can be formalized as follows: find an action sequence $(u_0, u_1, \ldots, u_{n-1})$ and corresponding state sequence $(x_0, x_1, \ldots, x_n)$ minimizing the total cost
$$J(x_\cdot, u_\cdot) = \sum_{k=0}^{n-1} cost(x_k, u_k)$$
where $x_{k+1} = next(x_k, u_k)$ and $u_k \in \mathcal{U}(x_k)$. The initial state $x_0 = x_{init}$ and destination state $x_n = x_{dest}$ are given.
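To make the notation concrete, here is a minimal Python sketch of this setup for the flight example; the cities, routes, and prices below are invented for illustration and are not from the chapter.

```python
# Hypothetical flight network: states are cities, actions are flights.
# next_state[(x, u)] is the landing city, cost[(x, u)] is the ticket price.
next_state = {
    ("SEA", "to_SFO"): "SFO", ("SEA", "to_DEN"): "DEN",
    ("SFO", "to_JFK"): "JFK", ("DEN", "to_JFK"): "JFK",
}
cost = {
    ("SEA", "to_SFO"): 120.0, ("SEA", "to_DEN"): 90.0,
    ("SFO", "to_JFK"): 300.0, ("DEN", "to_JFK"): 250.0,
}

def total_cost(x_init, actions):
    """Total cost J = sum of cost(x_k, u_k) along the induced state sequence."""
    x, J = x_init, 0.0
    for u in actions:
        J += cost[(x, u)]
        x = next_state[(x, u)]
    return J, x  # the final state should equal x_dest

J, x_final = total_cost("SEA", ["to_DEN", "to_JFK"])  # J = 340.0, x_final = "JFK"
```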

We can visualize this setting with a directed graph where the states are nodes and the actions are arrows connecting the nodes. If $cost(x, u) = 1$ for all $(x, u)$ the problem reduces to finding the shortest path from $x_{init}$ to $x_{dest}$ in the graph.

Dynamic programming

Optimization problems such as the one stated above are efficiently solved via dynamic programming (DP). DP relies on the following obvious fact: if a given state-action sequence is optimal, and we were to remove the first state and action, the remaining sequence is also optimal (with the second state of the original sequence now acting as initial state). This is the Bellman optimality principle. Note the close resemblance to the Markov property of stochastic processes (a process is Markov if its future is conditionally independent of the past given the present state). The optimality principle can be reworded in similar language: the choice of optimal actions in the future is independent of the past actions which led to the present state.

Thus optimal state-action sequences can be constructed by starting at the final state and extending backwards. Key to this procedure is the optimal value function (or optimal cost-to-go function)
$$v(x) = \text{"minimal total cost for completing the task starting from state } x\text{"}$$
This function captures the long-term cost for starting from a given state, and makes it possible to find optimal actions through the following algorithm: consider every action available at the current state, add its immediate cost to the optimal value of the resulting next state, and choose an action for which the sum is minimal.

The above algorithm is "greedy" in the sense that actions are chosen based on local information, without explicit consideration of all future scenarios. And yet the resulting actions are optimal. This is possible because the optimal value function contains all information about future scenarios that is relevant to the present choice of action.
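The greedy rule can be written in a couple of lines; this sketch assumes the optimal value function v is already available as a dictionary, and the container names (actions, cost, next_state) follow the hypothetical setup above.

```python
def greedy_action(x, actions, cost, next_state, v):
    """Choose u minimizing cost(x, u) + v(next(x, u)), i.e. the greedy rule above."""
    return min(actions[x], key=lambda u: cost[(x, u)] + v[next_state[(x, u)]])
```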

Thus the optimal value function is an extremely useful quantity, and indeed its calculation is at the heart of many methods for optimal control.

The above algorithm yields an optimal action $u = \pi(x) \in \mathcal{U}(x)$ for every state $x$. A mapping $\pi$ from states to actions is called a control law or control policy. Once we have a control law $\pi : \mathcal{X} \to \mathcal{U}(\mathcal{X})$ we can start at any state $x_0$, generate action $u_0 = \pi(x_0)$, transition to state $x_1 = next(x_0, u_0)$, generate action $u_1 = \pi(x_1)$, and keep going until we reach $x_{dest}$. Formally, an optimal control law satisfies
$$\pi(x) = \arg\min_{u \in \mathcal{U}(x)} \{cost(x, u) + v(next(x, u))\} \quad (1)$$
The minimum in (1) may be achieved for multiple actions in the set $\mathcal{U}(x)$, which is why $\pi$ may not be unique. However the optimal value function $v$ is always uniquely defined, and satisfies
$$v(x) = \min_{u \in \mathcal{U}(x)} \{cost(x, u) + v(next(x, u))\} \quad (2)$$
Equations (1) and (2) are the Bellman equations. If for some $x$ we already know $v(next(x, u))$ for all $u \in \mathcal{U}(x)$, then we can apply the Bellman equations directly and compute $\pi(x)$ and $v(x)$. Thus dynamic programming is particularly simple in acyclic graphs, where we can start from $x_{dest}$ with $v(x_{dest}) = 0$ and perform a backward pass in which every state is visited after all its successor states have been visited.
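A sketch of this backward pass in Python, assuming the states are supplied in an order where every state appears after all of its successors (a reverse topological order) and every non-destination state has at least one available action; the helper names are illustrative.

```python
def backward_pass(states_reverse_topological, actions, cost, next_state, x_dest):
    """Apply the Bellman equations (1)-(2) once per state, visiting each state
    only after all of its successor states have been visited."""
    v = {x_dest: 0.0}   # v(x_dest) = 0
    pi = {}
    for x in states_reverse_topological:
        if x == x_dest:
            continue
        # cost(x, u) + v(next(x, u)) for every available action u.
        candidates = {u: cost[(x, u)] + v[next_state[(x, u)]] for u in actions[x]}
        pi[x] = min(candidates, key=candidates.get)
        v[x] = candidates[pi[x]]
    return v, pi
```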

It is straightforward to extend the algorithm to the case where we are given non-zero final costs for a number of destination states (or absorbing states).

Value iteration and policy iteration

The situation is more complex in graphs with cycles. Here the Bellman equations are still valid, but we cannot apply them in a single pass. This is because the presence of cycles makes it impossible to visit each state only after all its successors have been visited. Instead the Bellman equations are treated as consistency conditions and used to design iterative relaxation schemes, much like partial differential equations (PDEs) are treated as consistency conditions and solved with corresponding relaxation schemes. By "relaxation scheme" we mean guessing the solution, and iteratively improving the guess so as to make it more compatible with the consistency conditions. The two main relaxation schemes are value iteration and policy iteration. Value iteration uses only (2).

We start with a guess $v^{(0)}$ of the optimal value function, and construct a sequence of improved guesses:
$$v^{(i+1)}(x) = \min_{u \in \mathcal{U}(x)} \left\{ cost(x, u) + v^{(i)}(next(x, u)) \right\} \quad (3)$$
This process is guaranteed to converge to the optimal value function $v$ in a finite number of iterations. The proof relies on the important idea of contraction mappings: one defines the approximation error $e\left(v^{(i)}\right) = \max_x \left| v^{(i)}(x) - v(x) \right|$, and shows that the iteration (3) causes $e\left(v^{(i)}\right)$ to decrease as $i$ increases. In other words, the mapping $v^{(i)} \to v^{(i+1)}$ given by (3) contracts the "size" of $v^{(i)}$ as measured by the error norm $e\left(v^{(i)}\right)$.

Policy iteration uses both (1) and (2). It starts with a guess $\pi^{(0)}$ of the optimal control law, and constructs a sequence of improved guesses:
$$v^{\pi^{(i)}}(x) = cost\left(x, \pi^{(i)}(x)\right) + v^{\pi^{(i)}}\left(next\left(x, \pi^{(i)}(x)\right)\right) \quad (4)$$
$$\pi^{(i+1)}(x) = \arg\min_{u \in \mathcal{U}(x)} \left\{ cost(x, u) + v^{\pi^{(i)}}(next(x, u)) \right\}$$
The first line of (4) requires a separate relaxation to compute the value function $v^{\pi^{(i)}}$ for the control law $\pi^{(i)}$. This function is defined as the total cost for starting at state $x$ and acting according to $\pi^{(i)}$ thereafter. Policy iteration can also be proven to converge in a finite number of iterations. It is not obvious which algorithm is better, because each of the two nested relaxations in policy iteration converges faster than the single relaxation in value iteration. In practice both algorithms are used depending on the problem at hand.
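The two schemes can be sketched compactly in Python for the deterministic case. This is a minimal illustration of equations (3) and (4), not code from the chapter; it assumes every state (including the destination, via a zero-cost "stay" action) has at least one available action, and the iteration counts are fixed rather than checked for convergence.

```python
def value_iteration(states, actions, cost, next_state, num_iters=100):
    """Repeatedly apply the Bellman backup (3), starting from the guess v = 0."""
    v = {x: 0.0 for x in states}
    for _ in range(num_iters):
        v = {x: min(cost[(x, u)] + v[next_state[(x, u)]] for u in actions[x])
             for x in states}
    return v

def policy_iteration(states, actions, cost, next_state, num_iters=20, eval_iters=100):
    """Alternate policy evaluation (first line of (4)) and greedy improvement."""
    pi = {x: next(iter(actions[x])) for x in states}   # arbitrary initial control law
    for _ in range(num_iters):
        # Policy evaluation: relax v_pi(x) = cost(x, pi(x)) + v_pi(next(x, pi(x))).
        v = {x: 0.0 for x in states}
        for _ in range(eval_iters):
            v = {x: cost[(x, pi[x])] + v[next_state[(x, pi[x])]] for x in states}
        # Policy improvement: greedy with respect to v_pi, as in (1).
        pi = {x: min(actions[x], key=lambda u: cost[(x, u)] + v[next_state[(x, u)]])
              for x in states}
    return pi, v
```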

Markov decision processes

The problems considered thus far are deterministic, in the sense that applying action $u$ at state $x$ always yields the same next state $next(x, u)$. Dynamic programming easily generalizes to the stochastic case where we have a probability distribution over possible next states:
$$p(y|x, u) = \text{"probability that } next(x, u) = y\text{"}$$
In order to qualify as a probability distribution the function $p$ must satisfy
$$\sum_{y \in \mathcal{X}} p(y|x, u) = 1, \qquad p(y|x, u) \geq 0$$
In the stochastic case the value function equation (2) becomes
$$v(x) = \min_{u \in \mathcal{U}(x)} \{cost(x, u) + E[v(next(x, u))]\} \quad (5)$$
where $E$ denotes expectation over $next(x, u)$, and is computed as
$$E[v(next(x, u))] = \sum_{y \in \mathcal{X}} p(y|x, u)\, v(y)$$
Equations (1, 3, 4) generalize to the stochastic case in the same way as equation (2) does. An optimal control problem with discrete states and actions and probabilistic state transitions is called a Markov decision process (MDP).
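As a sketch of how the backup (5) changes the code, the deterministic value iteration above can be adapted by replacing the successor value with an expectation; the transition model is assumed to be stored as p[(x, u)] = {y: probability}, and the names remain illustrative.

```python
def stochastic_value_iteration(states, actions, cost, p, num_iters=100):
    """Value iteration for an MDP: v(x) = min_u cost(x, u) + sum_y p(y|x, u) v(y)."""
    v = {x: 0.0 for x in states}
    for _ in range(num_iters):
        v = {x: min(cost[(x, u)] + sum(prob * v[y] for y, prob in p[(x, u)].items())
                    for u in actions[x])
             for x in states}
    return v
```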

MDPs are extensively studied in reinforcement learning, which is a sub-field of machine learning focusing on optimal control problems with discrete state. In contrast, optimal control theory focuses on problems with continuous state and exploits their rich differential structure.

Continuous control: Hamilton-Jacobi-Bellman equations

We now turn to optimal control problems where the state $x \in \mathbb{R}^{n_x}$ and control $u \in \mathcal{U}(x) \subseteq \mathbb{R}^{n_u}$ are real-valued vectors. To simplify notation we will use the shortcut $\min_u$ instead of $\min_{u \in \mathcal{U}(x)}$, although the latter is implied unless noted otherwise. Consider the stochastic differential equation
$$dx = f(x, u)\, dt + F(x, u)\, dw \quad (6)$$
where $dw$ is $n_w$-dimensional Brownian motion. This is sometimes called a controlled Ito diffusion, with $f(x, u)$ being the drift and $F(x, u)$ the diffusion coefficient. In the absence of noise, when $F(x, u) = 0$, we can simply write $\dot{x} = f(x, u)$. However in the stochastic case this would be meaningless because the sample paths of Brownian motion are not differentiable (the term $dw/dt$ is infinite).
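Although the chapter does not give code, a diffusion of the form (6) can be simulated numerically with an Euler-Maruyama discretization, replacing dw by sqrt(dt) times a standard normal sample. The dynamics, noise model, and feedback controller below are placeholders chosen for illustration.

```python
import numpy as np

def simulate(f, F, policy, x0, dt=1e-3, T=1.0, rng=None):
    """Euler-Maruyama simulation of dx = f(x, u) dt + F(x, u) dw from equation (6)."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(int(T / dt)):
        u = policy(x)
        dw = np.sqrt(dt) * rng.standard_normal(F(x, u).shape[1])  # Brownian increment
        x = x + f(x, u) * dt + F(x, u) @ dw
        trajectory.append(x.copy())
    return np.array(trajectory)

# Placeholder 1-D example: drift pulls the state toward the control, constant noise.
f = lambda x, u: u - x                    # drift f(x, u)
F = lambda x, u: np.array([[0.1]])        # diffusion coefficient F(x, u)
policy = lambda x: -0.5 * x               # an arbitrary feedback control law
trajectory = simulate(f, F, policy, x0=[1.0])
```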

