Bayesian Optimization - Washington University in St. Louis

Bayesian Optimization

Suppose we have a function $f\colon \mathcal{X} \to \mathbb{R}$ that we wish to minimize on some domain $X \subseteq \mathcal{X}$. That is, we wish to find
$$x^* = \arg\min_{x \in X} f(x).$$
In numerical analysis, this problem is typically called (global) optimization and has been the subject of decades of study. We draw a distinction between global optimization, where we seek the absolute optimum in $X$, and local optimization, where we seek a local optimum in the neighborhood of a given initial point. A common approach to optimization problems is to make some assumptions about $f$.

For example, when the objective function $f$ is known to be convex and the domain $X$ is also convex, the problem is known as convex optimization and has been widely studied. Convex optimization is a common tool used across machine learning. When an exact functional form for $f$ is not available (that is, $f$ behaves as a "black box"), what can we do? Bayesian optimization proceeds by maintaining a probabilistic belief about $f$ and designing a so-called acquisition function to determine where to evaluate the function next. Bayesian optimization is particularly well-suited to global optimization problems where $f$ is an expensive black-box function; for example, evaluating $f$ might require running an expensive simulation.

Bayesian optimization has recently become popular for training expensive machine-learning models whose behavior depends in a complicated way on their parameters (e.g., convolutional neural networks). This is an example of the AutoML paradigm. Although not strictly required, Bayesian optimization almost always reasons about $f$ by choosing an appropriate Gaussian process prior:
$$p(f) = \mathcal{GP}(f; \mu, K).$$
Given observations $\mathcal{D} = (\mathbf{X}, \mathbf{f})$,[1] we can condition our distribution on $\mathcal{D}$ as usual:
$$p(f \mid \mathcal{D}) = \mathcal{GP}(f; \mu_{f \mid \mathcal{D}}, K_{f \mid \mathcal{D}}).$$
Given this set of observations, how do we select where to observe the function next?
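The conditioning step above can be sketched concretely. Below is a minimal NumPy illustration of noiseless GP conditioning; the squared-exponential kernel and all function names are our illustrative choices, not prescribed by the notes:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance k(a, b) = exp(-|a - b|^2 / (2 l^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

def gp_posterior(X_obs, f_obs, X_query, kernel):
    """Condition a zero-mean GP prior on noiseless observations (X_obs, f_obs).

    Returns the posterior mean and covariance at X_query:
        mu_post = K(X_query, X_obs) K(X_obs, X_obs)^{-1} f_obs
        K_post  = K(X_query, X_query)
                  - K(X_query, X_obs) K(X_obs, X_obs)^{-1} K(X_obs, X_query)
    """
    K_oo = kernel(X_obs, X_obs)
    K_qo = kernel(X_query, X_obs)
    K_qq = kernel(X_query, X_query)
    # Solve linear systems instead of forming an explicit inverse,
    # which is cheaper and numerically more stable.
    mu_post = K_qo @ np.linalg.solve(K_oo, f_obs)
    K_post = K_qq - K_qo @ np.linalg.solve(K_oo, K_qo.T)
    return mu_post, K_post
```

With noiseless observations the posterior mean interpolates the data exactly and the posterior variance vanishes at observed points, which is a quick sanity check on any implementation.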

The meta-approach in Bayesian optimization is to design an acquisition function $a(x)$. The acquisition function is typically an inexpensive function that can be evaluated at a given point and that is commensurate with how desirable evaluating $f$ at $x$ is expected to be for the minimization problem. We then optimize the acquisition function to select the location of the next observation. Of course, we have merely replaced our original optimization problem with another optimization problem, but on a much-cheaper function $a(x)$.

Acquisition Functions

Many acquisition functions can be interpreted in the framework of Bayesian decision theory as evaluating an expected loss associated with evaluating $f$ at a point $x$.
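The meta-approach itself is short to state in code. In this sketch the inner optimization of the acquisition function is brute-force search over a finite candidate grid, a simplifying assumption that is often adequate precisely because $a(x)$ is cheap to evaluate; the helper name is illustrative:

```python
import numpy as np

def propose_next(candidates, acquisition):
    """Optimize the (cheap) acquisition function a(x) over a candidate set
    and return the point at which to evaluate the expensive f next.

    Here the inner optimization is exhaustive search over a finite grid;
    in practice one would use a continuous optimizer with restarts.
    """
    scores = np.array([acquisition(x) for x in candidates])
    return candidates[int(np.argmax(scores))]
```

In a full loop one would evaluate $f$ at the proposed point, append the result to $\mathcal{D}$, re-condition the GP, and repeat.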

We then select the point with the lowest expected loss as the next observation.

[1] Below we drop the $f \mid \mathcal{D}$ subscripts on the mean $\mu$ and covariance $K$ functions. We will assume these observations to be noiseless here, but we could extend the methods to the noisy case without difficulty.

Probability of improvement

Perhaps the first acquisition function designed for Bayesian optimization was probability of improvement. Suppose $f^* = \min \mathbf{f}$ is the minimal value of $f$ observed so far. Probability of improvement evaluates $f$ at the point most likely to improve upon this value. This corresponds to the following utility function[2] associated with evaluating $f$ at a given point $x$:
$$u(x) = \begin{cases} 0 & f(x) > f^* \\ 1 & f(x) \leq f^*. \end{cases}$$

That is, we receive a unit reward if $f(x)$ turns out to be less than $f^*$, and no reward otherwise. The probability of improvement acquisition function is then the expected utility as a function of $x$:
$$a_{\mathrm{PI}}(x) = \mathbb{E}[u(x) \mid x, \mathcal{D}] = \int_{-\infty}^{f^*} \mathcal{N}(f; \mu(x), K(x, x)) \, \mathrm{d}f = \Phi(f^*; \mu(x), K(x, x)).$$
The point with the highest probability of improvement (the maximal expected utility) is selected; this is the Bayes action under this loss.

Expected improvement

The loss function associated with probability of improvement is somewhat odd: we get a reward for improving upon the current minimum independent of the size of the improvement!
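Because the posterior at a single point is Gaussian, $a_{\mathrm{PI}}$ reduces to the standard normal CDF of the standardized improvement $(f^* - \mu(x))/\sigma(x)$. A dependency-free sketch (function names are ours), using `math.erf` rather than a stats library:

```python
import math

def probability_of_improvement(mu, var, f_best):
    """a_PI(x) = Phi((f* - mu(x)) / sigma(x)): the posterior probability,
    under the Gaussian belief N(mu, var) about f(x), that f(x) < f*."""
    sigma = math.sqrt(var)
    z = (f_best - mu) / sigma
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

When $\mu(x) = f^*$ the probability of improvement is exactly $1/2$, and it increases as the posterior mean drops below the incumbent.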

This can sometimes lead to odd behavior, and in practice it can get stuck in local optima. An alternative acquisition function that does account for the size of the improvement is expected improvement. Again suppose that $f^*$ is the minimal value of $f$ observed so far. Expected improvement evaluates $f$ at the point that, in expectation, improves upon $f^*$ the most. This corresponds to the following utility function:
$$u(x) = \max(0, f^* - f(x)).$$
That is, we receive a reward equal to the improvement $f^* - f(x)$ if $f(x)$ turns out to be less than $f^*$, and no reward otherwise. The expected improvement acquisition function is then the expected utility as a function of $x$:
$$a_{\mathrm{EI}}(x) = \mathbb{E}[u(x) \mid x, \mathcal{D}] = \int_{-\infty}^{f^*} (f^* - f) \, \mathcal{N}(f; \mu(x), K(x, x)) \, \mathrm{d}f = (f^* - \mu(x)) \, \Phi(f^*; \mu(x), K(x, x)) + K(x, x) \, \mathcal{N}(f^*; \mu(x), K(x, x)).$$
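Writing $z = (f^* - \mu(x))/\sigma(x)$, the closed form above is equivalent to $(f^* - \mu)\,\Phi(z) + \sigma\,\phi(z)$, where $\Phi$ and $\phi$ are the standard normal CDF and density (note $K(x,x)\,\mathcal{N}(f^*; \mu, K) = \sigma\,\phi(z)$). A small self-contained sketch (names are our illustrative choices):

```python
import math

def _normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, var, f_best):
    """a_EI(x) = (f* - mu) Phi(z) + sigma phi(z), z = (f* - mu) / sigma,
    under the Gaussian posterior belief N(mu, var) about f(x)."""
    sigma = math.sqrt(var)
    z = (f_best - mu) / sigma
    return (f_best - mu) * _normal_cdf(z) + sigma * _normal_pdf(z)
```

At $\mu(x) = f^*$ with unit variance this evaluates to $\phi(0) = 1/\sqrt{2\pi}$, a handy spot check.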

The point with the highest expected improvement (the maximal expected utility) is selected. The expected improvement has two components. The first can be increased by reducing the mean function $\mu(x)$. The second can be increased by increasing the variance $K(x, x)$. These two terms can be interpreted as explicitly encoding a tradeoff between exploitation (evaluating at points with low mean) and exploration (evaluating at points with high uncertainty). The exploitation-exploration tradeoff is a classic consideration in such problems, and the expected improvement criterion automatically captures both as a result of the Bayesian decision-theoretic treatment.

[2] Here a utility function is simply a negative loss.

Entropy search

A third alternative is entropy search. Here, we seek to minimize the uncertainty we have in the location of the optimal value $x^* = \arg\min_{x \in X} f(x)$.

Notice that our belief over $f$ induces a distribution over $x^*$, $p(x^* \mid \mathcal{D})$. Unfortunately, there is no closed-form expression for this distribution. Entropy search seeks to evaluate points so as to minimize the entropy of the induced distribution $p(x^* \mid \mathcal{D})$. Here the utility function is the reduction in this entropy given a new measurement at $x$, $(x, f(x))$:
$$u(x) = H[x^* \mid \mathcal{D}] - H[x^* \mid \mathcal{D}, x, f(x)].$$
As in probability of improvement and expected improvement, we may build an acquisition function by evaluating the expected utility provided by evaluating $f$ at a point $x$. Due to the nature of the distribution $p(x^* \mid \mathcal{D})$, this is somewhat complicated, and a series of approximations must be made.

Upper confidence bound

A final alternative acquisition function is typically known as GP-UCB, where UCB stands for upper confidence bound. It is typically described in terms of maximizing $f$ rather than minimizing $f$; however, in the context of minimization, the acquisition function would take the form
$$a_{\mathrm{UCB}}(x; \beta) = \mu(x) - \beta \sigma(x),$$
where $\beta > 0$ is a tradeoff parameter and $\sigma(x) = \sqrt{K(x, x)}$ is the marginal standard deviation of $f(x)$.
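In this minimization form, the next observation is the candidate that minimizes $\mu(x) - \beta\sigma(x)$. A one-line sketch (the function name is our illustrative choice):

```python
import math

def ucb_score(mu, var, beta):
    """a_UCB(x; beta) = mu(x) - beta * sigma(x). In the minimization
    setting one selects the candidate with the *lowest* score: a small
    score means either a low posterior mean (exploitation) or a large
    posterior standard deviation (exploration)."""
    return mu - beta * math.sqrt(var)
```

Larger values of $\beta$ weight the uncertainty term more heavily and therefore favor exploration.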

Again, the GP-UCB acquisition function contains explicit exploitation ($\mu(x)$) and exploration ($\sigma(x)$) terms. Interestingly, the acquisition function cannot be interpreted as computing a natural expected utility function. Nonetheless, strong theoretical results are known for GP-UCB; namely, that under certain conditions, the iterative application of this acquisition function will converge to the true global minimum of $f$.[3]

[3] In the context of minimization, this is better described as a lower confidence bound, but UCB is ingrained in the literature as a standard name.

