Example: quiz answers

A Tutorial on Bayesian Optimization - arxiv.org

A Tutorial on Bayesian OptimizationPeter I. FrazierJuly 10, 2018 AbstractBayesian Optimization is an approach to optimizing objective functions that take a long time (min-utes or hours) to evaluate. It is best-suited for Optimization over continuous domains of less than 20dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objectiveand quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussianprocess regression, and then uses an acquisition function defined from this surrogate to decide where tosample. In this Tutorial , we describe how Bayesian Optimization works, including Gaussian process re-gression and three common acquisition functions: expected improvement, entropy search, and knowledgegradient. We then discuss more advanced techniques, including running multiple function evaluationsin parallel, multi-fidelity and multi-information source Optimization , expensive-to-evaluate constraints,random environmental conditions, multi-task Bayesian Optimization , and the inclusion of derivative infor-mation.

A Tutorial on Bayesian Optimization Peter I. Frazier July 10, 2018 Abstract Bayesian optimization is an approach to optimizing objective functions that take a long time (min-

Tags:

  Optimization

Information

Domain:

Source:

Link to this page:

Please notify us if you found a problem with this document:

Other abuse

Transcription of A Tutorial on Bayesian Optimization - arxiv.org

1 A Tutorial on Bayesian OptimizationPeter I. FrazierJuly 10, 2018 AbstractBayesian Optimization is an approach to optimizing objective functions that take a long time (min-utes or hours) to evaluate. It is best-suited for Optimization over continuous domains of less than 20dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objectiveand quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussianprocess regression, and then uses an acquisition function defined from this surrogate to decide where tosample. In this Tutorial , we describe how Bayesian Optimization works, including Gaussian process re-gression and three common acquisition functions: expected improvement, entropy search, and knowledgegradient. We then discuss more advanced techniques, including running multiple function evaluationsin parallel, multi-fidelity and multi-information source Optimization , expensive-to-evaluate constraints,random environmental conditions, multi-task Bayesian Optimization , and the inclusion of derivative infor-mation.

2 We conclude with a discussion of Bayesian Optimization software and future research directionsin the field. Within our Tutorial material we provide a generalization of expected improvement to noisyevaluations, beyond the noise-free setting where it is more commonly applied. This generalization isjustified by a formal decision-theoretic argument, standing in contrast to previous ad hoc IntroductionBayesian Optimization (BayesOpt) is a class of machine-learning-based Optimization methods focused onsolving the problemmaxx Af(x),(1)where the feasible set and objective function typically have the following properties: The inputxis inRdfor a value ofdthat is not too large. Typicallyd 20 in most successfulapplications of BayesOpt. The feasible setAis a simple set, in which it is easy to assess membership. TypicallyAis ahyper-rectangle{x Rd:ai xi bi}or thed-dimensional simplex{x Rd: ixi= 1}.

3 Later(Section 5) we will relax this assumption. The objective functionfis continuous. This will typically be required to modelfusing Gaussianprocess regression. fis expensive to evaluate in the sense that the number of evaluations that may be performed islimited, typically to a few hundred. This limitation typically arises because each evaluation takesa substantial amount of time (typically hours), but may also occur because each evaluation bearsa monetary cost ( , from purchasing cloud computing power, or buying laboratory materials),or an opportunity cost ( , if evaluatingfrequires asking a human subject questions who willtolerate only a limited number). flacks known special structure like concavity or linearity that would make it easy to optimize usingtechniques that leverage such structure to improve efficiency. We summarize this by sayingfis a black box. When we evaluatef, we observe onlyf(x) and no first- or second-order derivatives.

4 This preventsthe application of first- and second-order methods like gradient descent, Newton s method, or quasi-Newton methods. We refer to problems with this property as derivative-free . Through most of the article, we will assumef(x) is observed without noise. Later (Section 5) wewill allowf(x) to be obscured by stochastic noise. In almost all work on Bayesian Optimization ,noise is assumed independent across evaluations and Gaussian with constant [ ] 8 Jul 2018 Our focus is on finding aglobalrather than local summarize these problem characteristics by saying that BayesOpt is designed for black-box derivative-free global ability to optimize expensive black-box derivative-free functions makes BayesOpt extremely ver-satile. Recently it has become extremely popular for tuning hyperparameters in machine learning al-gorithms, especially deep neural networks (Snoek et al., 2012). Over a longer period, since the 1960s,BayesOpt has been used extensively for designing engineering systems (Mo ckus, 1989; Jones et al.)

5 , 1998;Forrester et al., 2008). BayesOpt has also been used to choose laboratory experiments in materials anddrug design (Negoescu et al., 2011; Frazier and Wang, 2016; Packwood, 2017), in calibration of envi-ronmental models (Shoemaker et al., 2007), and in reinforcement learning (Brochu et al., 2009; Lizotte,2008; Lizotte et al., 2007).BayesOpt originated with the work of Kushner (Kushner, 1964), Zilinskas ( Zilinskas, 1975; Mo ckuset al., 1978), and Mo ckus (Mo ckus, 1975; Mo ckus, 1989), but received substantially more attentionafter that work was popularized by Jones et al. (1998) and their work on the Efficient Global Opti-mization (EGO) algorithm. Following Jones et al. (1998), innovations developed in that same literatureinclude multi-fidelity Optimization (Huang et al., 2006; S obester et al., 2004), multi-objective optimiza-tion (Keane, 2006; Knowles, 2006; Mo ckus and Mo ckus, 1991), and a study of convergence rates (Calvin,1997; Calvin and Zilinskas, 2000; Calvin and Zilinskas, 2005; Calvin and Zilinskas, 1999).

6 The observa-tion made by Snoek et al. (2012) that BayesOpt is useful for training deep neural networks sparked asurge of interest within machine learning, with complementary innovations from that literature includingmulti-task Optimization (Swersky et al., 2013; Toscano-Palmerin and Frazier, 2018), multi-fidelity opti-mization specifically aimed at training deep neural networks (Klein et al., 2016), and parallel methods(Ginsbourger et al., 2007, 2010; Wang et al., 2016a; Wu and Frazier, 2016). Gaussian process regression,its close cousin kriging, and BayesOpt have also been studied recently in the simulation literature (Klei-jnen et al., 2008; Salemi et al., 2014; Mehdad and Kleijnen, 2018) for modeling and optimizing systemssimulated using discrete event are other techniques outside of BayesOpt that can be used to optimize expensive derivative-free black-box functions. While we do not review methods from this literature here in detail, many ofthem have a similar flavor to BayesOpt methods: they maintain a surrogate that models the objectivefunction, which they use to choose where to evaluate (Booker et al.)

7 , 1999; Regis and Shoemaker, 2007b,a,2005). This more general class of methods is often called surrogate methods. Bayesian optimizationdistinguishes itself from other surrogate methods by using surrogates developed using Bayesian statistics,and in deciding where to evaluate the objective using a Bayesian interpretation of these first introduce the typical form that Bayesian Optimization algorithms take in Section 2. Thisform involves two primary components: a method for statistical inference, typically Gaussian process(GP) regression; and an acquisition function for deciding where to sample, which is often expectedimprovement. We describe these two components in detail in Sections 3 and We then describethree alternate acquisition functions: knowledge-gradient (Section ), entropy search, and predictiveentropy search (Section ). These alternate acquisition functions are particularly useful in problemsfalling outside the strict set of assumptions above, which we call exotic Bayesian Optimization problemsand we discuss in Section 5.

8 These exotic Bayesian Optimization problems include those with parallelevaluations, constraints, multi-fidelity evaluations, multiple information sources, random environmentalconditions, multi-task objectives, and derivative observations. We then discuss Bayesian optimizationand Gaussian process regression software in Section 6 and conclude with a discussion of future researchdirections in Section tutorials and surveys on Bayesian Optimization include Shahriari et al. (2016); Brochu et al.(2009); Sasena (2002); Frazier and Wang (2016). This Tutorial differs from these others in its coverageof non-standard or exotic Bayesian Optimization problems. It also differs in its substantial emphasison acquisition functions, with less emphasis on GP regression. Finally, it includes what we believe is anovel analysis of expected improvement for noisy measurements, and argues that the acquisition functionpreviously proposed by Scott et al.

9 (2011) is the most natural way to apply the expected improvementacquisition function when measurements are Overview of BayesOptBayesOpt consists of two main components: a Bayesian statistical model for modeling the objectivefunction, and an acquisition function for deciding where to sample next. After evaluating the objectiveaccording to an initial space-filling experimental design, often consisting of points chosen uniformly at2random, they are used iteratively to allocate the remainder of a budget ofNfunction evaluations, asshown in Algorithm 1 Basic pseudo-code for Bayesian optimizationPlace a Gaussian process prior onfObservefatn0points according to an initial space-filling experimental design. Setn= NdoUpdate the posterior probability distribution onfusing all available dataLetxnbe a maximizer of the acquisition function overx, where the acquisition function is computed usingthe current posterior (xn).

10 Incrementnend whileReturn a solution: either the point evaluated with the largestf(x), or the point with the largest statistical model, which is invariably a Gaussian process, provides a Bayesian posterior probabilitydistribution that describes potential values forf(x) at a candidate pointx. Each time we observefat anew point, this posterior distribution is updated. We discuss Bayesian statistical modeling using Gaussianprocesses in detail in Section 3. The acquisition function measures the value that would be generated byevaluation of the objective function at a new pointx, based on the current posterior distribution overf. We discuss expected improvement, the most commonly used acquisition function, in Section , andthen discuss other acquisition functions in Section and iteration of BayesOpt from Algorithm 1 using GP regression and expected improvement isillustrated in Figure 1. The top panel shows noise-free observations of the objective function with bluecircles at three points.


Related search queries