Transcription of Density estimation in R - vita.had.co.nz
1 Density estimation in RHenry Deng and Hadley WickhamSeptember 2011 AbstractDensity estimation is an important statistical tool, and within R there are over 20packages that implement it: so many that it is often difficult to know which to paper presents a brief outline of the theory underlying each package, as well as anoverview of the code and comparison of speed and accuracy. We focus on univariatemethods, but include pointers to other more specialised , we foundASHandKernSmoothto be excellent: they are both fast, accu-rate, and MotivationThere are over 20 packages that perform Density estimation in R, varying in both the-oretical approach and computational performance. Users and developers who requiredensity estimation tools have different needs, and some methods of Density estimationmay be more appropriate than others.
2 This paper aims to summarise the existingapproaches to make it easier to pick the right package for the begin in Section 2 with a brief review of the underlying theory behind themain approaches for Density estimation , providing links to the relevant literature. InSection 3, we describe the R packages that implement each approach, highlighting thebasic code needed to run their Density estimation function and listing differences infeatures (dimensionality, bounds, bandwidth selection, etc).Section 4 compares the performance of each package with calculation speed, lookingat Density estimation computations from 10 to 10 million observations. The accuracy ofthe Density estimations generated is also important.
3 Section 5 compares the accuracyof the Density estimates using three distributions with varying degrees of challenge:the uniform, normal and claw distributions. Section 6 investigates the relationshipbetween calculation time and accuracy, and we conclude in Section 7 with our findingsand Theoretical approachesDensity estimation builds an estimate of some underlying probability Density functionusing an observed data sample. Density estimation can either be parametric, where the1data is from a known family, or nonparametric, which attempts to flexibly estimatean unknown distribution. We begin with a brief overview of the underlying theory,focusing on nonparametric methods because of their generality.
4 Common methodsinclude histograms, Section , kernel methods, Section , and penalized approaches,Section We attempt to give the flavor of edge method, without going into too muchdetail. For a more in-depth treatment, we recommend Scott (1992a) and Silverman(1986).We will assume that we haveniid data points,X1,X2,..,Xn, and we are interestedin an estimate, f(x), of the true Density ,f(x), at new HistogramThe histogram (Silverman, 1986) is the oldest (dating to the 1840 s (Friendly, 2005))and least sophisticated method of Density estimation . The main advantages are itsextreme simplicity and speed of computation. A histogram is piecewise constant (hencenot at all smooth) and can be extremely sensitive to the choice of bin simple enhancement to the histogram is the average shifted histogram (ASH): itis smoother than the histogram and avoids sensitivity to the choice of origin, but isstill computationally efficient.
5 The premise of this approach (Scott, 1992b) is to takemhistograms, f1, f2,.., fm, of bin widthhwith origins ofto= 0,hm,2hm,..,(m 1)hm. Asthe name suggests, the na ve ASH is simply fash(x) =1mm i=1 fi(x)There arek= m nbins across all histograms, each spanning [khm,(k+ 1)hm]with center (k+ )hm. The ASH can be made somewhat more general by using all binsto estimate the Density at each point, weighting bins closer to the data more general form of the weighted ASH is: fash(x) =1mm n k=1w(lk x) ck(x)wherewis a weighting function,lkis the center of bink, and ckis the number ofpoints in that the univariate ASH is piecewise constant, it can be computed by takinga histogram withm nbins and computing a rolling sum overmadjacent bins.
6 Thismakes the ASH extremely fast to Kernel Density estimationThe kernel Density estimation approach overcomes the discreteness of the histogramapproaches by centering a smooth kernel function at each data point then summing toget a Density estimate. The basic kernel estimator can be expressed as fkde(x) =1nn i=1K(x xih)2whereKis thekernelandhis thebandwidth(Scott, 1992b). The kernel is asymmetric, usually positive function that integrates to one. Common kernel functionsare uniform, triangle, Epanechnikov, quartic (biweight), tricube (triweight), Gaussian(normal), and cosine. The bandwidth,h, is a smoothing parameter: large bandwidthsproduce very smooth estimates, small values produce wiggly estimates.
7 It influencesestimates much more than the kernel, and so choosing a good bandwidth is critical toget a good the bandwidth is selected by minimisingL2risk, aka the mean integratedsquare error:MISE( f) = E[ ( f(x) f(x))2dx]This optimisation problem is not easy becausef(x) is unknown, and leads to manytheoretical approaches. General consensus is that plug-in selectors and cross validationselectors are the most useful over a wide range of main challenge to the kde approach is varying data Density : regions of highdata Density could have small bandwidths, but regions with sparse data need largebandwidths. Extensions to the basic kde approach overcome this problem by allowingthe bandwidth to vary (Terrell and Scott, 1992) Penalized likelihood approachesAn alternative approach is to estimate the Density as a mixture ofm basis functions(densities), compromising between quality of fit and complexity.
8 This has the advan-tage of allowing the Density to be wigglier where the underlying data supports it. Ingeneral, the penalized approach (Kauermann and Schellhase, 2009) approximates thedensity ofxas a mixture ofmdensities: fpen(x) =m i=1ci i(x)where kis a Density andciare weights picked to ensure that fpenintegrates toone. Typically the basis functions are equally weighed and differ only by a locationparameter, i, so we can simplify the definition to more closely resemble the kernelapproach: fpen(x) =1mm i=1K(x ih)Compared to the KDE, the bases/densities are no longer constrained to be centeredon the data iare often called knots, and the key problem of penalized Density estimationis finding the most appropriate number and location of knots.
9 Typically this is doneby placing a large number of knots at equally spaced locations along the domain ofthe data, and then minimizing a penalized likelihood to remove knots that contributelittle to the overall TODO Should we include this section (including the part that s commented out?The importance of lies in its role in controlling the amount of smoothness in thedensity estimation . A conventional method of selecting this penalty parameter is bythe Akaike Information Criterion (AIC), in which we minimize:AIC( ) = l( ) +df( )wheredf( ) =tr(J 1p( , )Jp( , = 0)). However, the penalized spline smooth-ing approach illustrated by Kauermann and Schellhase uses mixed models through aBayesian Density estimation packagesWe now shift from our overview of broad theoretical approaches to specific R 1 summarizes the 15 Density estimation packages that we reviewed, giving basicinformation on what they do, their theoretical foundation, and the basic R code to usethem.)
10 The remainder of this section describes each package in more detail. For eachpackage, we summarize the input, output, and special ::histallows users to generate a histogram (Venables and Ripley, 2002)of the datax. Thebreaksargument specifies the desired number of bins, or theborders of each bin (for irregular binning), or a function that estimates the numberof bins automatically. By default, the function creates a plot, but this behavior canbe suppressed withplot = F. The function returns a list giving bin boundaries, mid-points, counts and densities (standardized by bin width). (Scott, 2009) estimates ASHs using two the data vectorx, the interval for the estimation ,atob, and the desirednumber of binsnbin, and counts the number of points in each theoutput frombin1and computes the generalized ASH, with weights defined produces an equally spaced mesh of predictions, centered on each provides 2d average shifted histograms Kernel Density estimationCRAN packagesGenKern,kerdiest,KernSmooth,ks,n p,plugdensity, andsmall usethe kernel Density approach, as doesstats:: Density .