Introduction to Stochastic Dynamic Programming

Introduction to Stochastic Dynamic Programming Sheldon Ross University of California Berkeley, California ACADEMIC PRESS A Subsidiary of H ar court Brace Jovanovich, Publishers New York London Paris San Diego San Francisco S o Paulo Sydney Tokyo Toronto COPYRIGHT 1983, BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED. NO PART OF THIS PUBLICATION MAY BE REPRODUCED OR TRANSMITTED IN ANY FORM OR BY ANY MEANS, ELECTRONIC OR MECHANICAL, INCLUDING PHOTOCOPY, RECORDING, OR ANY INFORMATION STORAGE AND RETRIEVAL SYSTEM, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.

ACADEMIC PRESS, INC. Ill Fifth Avenue, New York, New York 10003 United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD. 24/28 Oval Road, London NW1 7DX Library of Congress Cataloging in Publication Data Ross, Sheldon M. Introduction to Stochastic Dynamic Programming . (Probability and mathematical statistics) Includes bibliographies and index. 1. Dynamic Programming . 2. Stochastic Programming . I. Title. II. Series. 1982 '03 82-18163 ISBN 0-12-598*420-0 PRINTED IN THE UNITED STATES OF AMERICA 83 84 85 86 9 8 7 6 5 4 3 2 1 To Celine Preface This text presents the basic theory and examines the scope of applications of Stochastic Dynamic Programming .

Chapter I is a study of a variety of finite-stage models, illustrating the wide range of applications of Stochastic Dynamic Programming . Later chapters study infinite-stage models: dis-counting future returns in Chapter II, minimizing nonnegative costs in Chapter III, maximizing nonnegative returns in Chapter IV, and maximiz-ing the long-run average return in Chapter V. Each of these chapters first considers whether an optimal policy need exist presenting counterexam-ples where appropriate and then presents methods for obtaining such policies when they do. In addition, general areas of application are pre-sented; for example, optimal stopping problems are considered in Chapter III and a variety of gambling models in Chapter IV.

The final two chapters are concerned with more specialized models. Chapter VI presents a variety of Stochastic scheduling models, and Chapter VII examines a type of process known as a multiproject bandit. The mathematical prerequisites for this text are relatively few. No prior knowledge of Dynamic Programming is assumed and only a moderate familiarity with probability including the use of conditional expecta-tion is necessary. I have attempted to present all proofs in as intuitive a manner as possible. An appendix dealing with Stochastic order relations, which is needed primarily for the final two chapters, is included.

Through-out the text I use the terms increasing and nondecreasing interchangeably. I Finite-Stage Models 1. Introduction A problem typical of those with which we are concerned involves a process that is observed at the beginning of a discrete time period to be in a particular state. After observation of the state, an action must be chosen ; and based only on the state at that time and the action chosen, an expected reward is earned and the probability distribution for the next state is determined. The problem of interest is to choose a policy that maximizes the expected value of the sum of the rewards earned over a given finite time span of length n.

We present a technique, known as Dynamic Programming , that enables such problems to be solved recursively in n. To be specific, suppose that the states are the integers, and let A, a finite set, be the set of all possible actions. When the state is i and the action a e A is chosen, suppose that the reward earned is R(i, a) and the next state is j with probability P0(a). Let V (i) denote the maximum expected return for an -stage problem that starts in state i. When n = 1, we clearly have ( ) = maxK(i,a). ( ) aeA Also, the one-stage optimal policy is to choose, when in state i, an action that maximizes the right side of Eq.

( ). Now consider the -stage problem that starts in state i. If action a is initially chosen, then reward R(i, a) is received and the next state will be ; with probability Pij(a). If the next state isj, we then face a problem equivalent to one that 1 2 I. FINITE-STAGE MODELS starts in 7 and has n 1 time periods to go. Hence the best we can do (in the sense of expected return) if we initially choose action a is j Because Vn(i) is the best we can do without any restriction on the initial action a, we see that Vn(i) = max[K(i, a) + ( ) _,(])]. ( ) j Equation ( ), known as the optimality equation, gives us a technique for recursively solving for Vn(i).

First we obtain { ) from Eq. ( ). Letting n = 2 in Eq. ( ), we can now solve for V2(i), and so on. In addition, the optimal policy is as follows : when there are n time periods to go and the process is in state i, then an action a that maximizes the right side of Eq. ( ) should be chosen. (That such a policy is optimal, achieving the expected return Vn(i) when the initial state is i and the problem is over n stages, is easily seen by induction on n) It often takes a great deal of computation to solve for the optimal policy explicitly. However, we can occasionally use Eq.}

( ) to solve for Vn explicitly or to obtain structural results about it (or about the optimal policy), which can result in reducing the amount of necessary computations. In this chapter we shall present and obtain some results about a variety of finite-stage sequential-decision models. 2. A Gambling Model At each play of the game a gambler can bet any nonnegative amount up to his present fortune and will either win or lose that amount with probabilities p and q = \ p, respectively. The gambler is allowed to make n bets, and his objective is to maximize the expectation of the logarithm of his final fortune.

What strategy achieves this end? 2. A GAMBLING MODEL 3 Let Vn(x) denote the maximal expected return if the gambler has a present fortune of x and is allowed n more gambles. We shall take as our action the fraction of the gambler's fortune that he bets. Thus we have the optimality equation VJx) = max \ (x + ax) + qVn_1(x - ax)], ( ) 0< <1 with the boundary condition V0(x) = log x. We leave it as an exercise for the reader to show that when p < \, Vn(x) = log x and the optimal strategy is always to bet 0.

Introduction to Stochastic Dynamic Programming

Tags:

Information

Advertisement

Transcription of Introduction to Stochastic Dynamic Programming

Related search queries

Introduction to Stochastic Dynamic Programming

Tags:

Information

Advertisement

Related documents

Related search queries