Transcription of DoubleQ-learning - NeurIPS
1 Double Q-learningHado van HasseltMulti-agent and Adaptive Computation GroupCentrum Wiskunde & InformaticaAbstractIn some stochastic environments the well-known reinforcement learning algo-rithm Q- learning performs very poorly. This poor performance is caused by largeoverestimations of action values. These overestimations result from a positivebias that is introduced because Q- learning uses the maximumaction value as anapproximation for the maximum expected action value. We introduce an alter-native way to approximate the maximum expected value for anyset of randomvariables.
2 The obtained double estimator method is shown tosometimes under-estimate rather than overestimate the maximum expected value. We apply thedouble estimator to Q- learning to construct Double Q- learning , a new off-policyreinforcement learning algorithm. We show the new algorithm converges to theoptimal policy and that it performs well in some settings in which Q- learning per-forms poorly due to its IntroductionQ- learning is a popular reinforcement learning algorithm that was proposed by Watkins [1] and canbe used to optimally solve markov Decision processes (MDPs)[2].
3 We show that Q- learning sperformance can be poor in stochastic MDPs because of large overestimations of the action val-ues. We discuss why this occurs and propose an algorithm called Double Q- learning to avoid thisoverestimation. The update of Q- learning isQt+1(st, at) =Qt(st, at) + t(st, at)(rt+ maxaQt(st+1, a) Qt(st, at)).(1)In this equation,Qt(s, a)gives the value of the actionain statesat timet. The rewardrtis drawnfrom a fixed reward distributionR:S A S R, whereE{rt|(s, a, s ) = (st, at, st+1)}=Rs next statest+1is determined by a fixed state transition distributionP:S A S [0,1],wherePs sagives the probability of ending up in states after performingains, and s Ps sa= learning rate t(s, a) [0,1]ensures that the update averages over possible randomness in therewards and transitions in order to converge in the limit to the optimal action value function.
4 Thisoptimal value function is the solution to the following set of equations [3]: s, a:Q (s, a) = s Ps sa(Rs sa+ maxaQ (s , a)).(2)The discount factor [0,1)has two interpretations. First, it can be seen as a property of the prob-lem that is to be solved, weighing immediate rewards more heavily than later rewards. Second, innon-episodic tasks, the discount factor makes sure that every action value is finite and therefore well-defined. It has been proven that Q- learning reaches the optimal value functionQ with probabilityone in the limit under some mild conditions on the learning rates and exploration policy [4 6].]
5 Q- learning has been used to find solutions on many problems [7 9] and was an inspiration to similaralgorithms, such as Delayed Q- learning [10], Phased Q- learning [11] and Fitted Q-iteration [12],to name some. These variations have mostly been proposed in order to speed up convergence rates1compared to the original algorithm. The convergence rate ofQ- learning can be exponential in thenumber of experiences [13], although this is dependent on the learning rates and with a proper choiceof learning rates convergence in polynomial time can be obtained [14].
6 The variants named abovecan also claim polynomial time important aspect of the Q- learning algorithm has been overlooked in previouswork: the use of the max operator to determine the value of thenext state can cause large over-estimations of the action values. We show that Q- learning can suffer a large performance penaltybecause of a positive bias that results from using the maximum value as approximation for the max-imum expected value. We propose an alternative double estimator method to find an estimate forthe maximum value of a set of stochastic values and we show that this sometimes underestimatesrather than overestimates the maximum expected value.
7 We use this to construct the new DoubleQ-learning paper is organized as follows. In the second section, we analyze two methods to approximatethe maximum expected value of a set of random variables. In Section 3 we present the DoubleQ-learning algorithm that extends our analysis in Section 2and avoids overestimations. The newalgorithm is proven to converge to the optimal solution in the limit. In Section 4 we show the resultson some experiments to compare these algorithms. Some general discussion is presented in Section5 and Section 6 concludes the paper with some pointers to future Estimating the Maximum Expected ValueIn this section, we analyze two methods to find an approximation for the maximum expected valueof a set of random variables.
8 The single estimator method uses the maximum of a set of estimatorsas an approximation. This approach to approximate the valueof the maximum expected value ispositively biased, as discussed in previous work in economics [15] and decision making [16]. It is abias related to the Winner s Curse in auctions [17, 18] and itcan be shown to follow from Jensen sinequality [19]. The double estimator method uses two estimates for each variable and uncouplesthe selection of an estimator and its value. We are unaware ofprevious work that discusses it. Weanalyze this method and show that it can have a negative a set ofMrandom variablesX={X1.}
9 , XM}. In many problems, one is interested inthe maximum expected value of the variables in such a set:maxiE{Xi}.(3)Without knowledge of the functional form and parameters of the underlying distributions of thevariables inX, it is impossible to determine (3) exactly. Most often, thisvalue is approximatedby constructing approximations forE{Xi}for alli. LetS= Mi=1 Sidenote a set of samples,whereSiis the subset containing samples for the variableXi. We assume that the samples inSiare independent and identically distributed (iid). Unbiased estimates for the expected values canbe obtained by computing the sample average for each variable:E{Xi}=E{ i} i(S)def=1|Si| s Sis, where iis an estimator for variableXi.
10 This approximation is unbiased since everysamples Siis an unbiased estimate for the value ofE{Xi}. The error in the approximation thusconsists solely of the variance in the estimator and decreases when we obtain more use the following notations:fidenotes the probability density function (PDF) of theithvariableXiandFi(x) = x fi(x)dxis the cumulative distribution function (CDF) of this PDF. Similarly,the PDF and CDF of theithestimator are denotedf iandF i. The maximum expected value canbe expressed in terms of the underlying PDFs asmaxiE{Xi}= maxi x fi(x) The Single EstimatorAn obvious way to approximate the value in (3) is to use the value of the maximal estimator:maxiE{Xi}= maxiE{ i} maxi i(S).