
Principles of Parametric Inference

Moulinath Banerjee
University of Michigan
September 11, 2012



The object of statistical inference is to glean information about an underlying population based on a sample collected from it. The actual population is assumed to be described by some probability distribution, and statistical inference is concerned with learning about that distribution, or at least some characteristics of the distribution that are of scientific interest. In parametric statistical inference, which we will be primarily concerned with in this course, the underlying distribution of the population is taken to be parametrized by a Euclidean parameter. In other words, there exists a subset $\Theta$ of $k$-dimensional Euclidean space such that the class of distributions $\mathcal{P}$ of the underlying population can be written as $\{P_\theta : \theta \in \Theta\}$.

You can think of the $\theta$'s as labels for the class of distributions under consideration. More precisely, this will be our set-up: our data $X_1, X_2, \ldots, X_n$ are i.i.d. from the distribution $P_\theta$, where $\theta \in \Theta$, the parameter space. We assume identifiability of the parameter, i.e., $\theta_1 \neq \theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$. In general, we will also assume that $X_1$ has a density $f(x, \theta)$ (this can either be a probability mass function or an ordinary probability density function). Here $x$ is a typical value assumed by the random variable. Thus, $f(x, \theta)$ for a discrete random variable $X_1$ just gives us the probability that $X_1$ assumes the value $x$ when the underlying parameter is indeed $\theta$. For a continuous random variable, $f(x, \theta)$ gives us the density function of the random variable $X_1$ at the point $x$ when $\theta$ is the underlying parameter. Thus $f(x, \theta)\,dx$, where $dx$ is a very small number, is approximately the probability that $X_1$ lives in the interval $[x, x + dx]$ under parameter value $\theta$.
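As a quick numerical illustration of that last approximation, the sketch below checks that $f(x, \theta)\,dx$ is close to the exact probability of the interval $[x, x + dx]$. It uses an exponential density purely as a stand-in (the exponential model is introduced formally in Example 2 below), and the particular values of $\theta$, $x$ and $dx$ are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy import stats

theta = 2.0          # illustrative parameter value
x, dx = 0.5, 1e-4    # a point and a small increment

# Density approximation: f(x, theta) * dx with f(x, theta) = theta * exp(-theta * x)
density_approx = theta * np.exp(-theta * x) * dx

# Exact probability P(x <= X1 <= x + dx) computed from the CDF
exact = stats.expon.cdf(x + dx, scale=1 / theta) - stats.expon.cdf(x, scale=1 / theta)

print(density_approx, exact)   # the two numbers agree to several decimal places
```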

We will be interested in estimating $\theta$, or more generally, a function of $\theta$, say $g(\theta)$. Let us consider a few examples that will enable us to understand these notions better.

(1) Let $X_1, X_2, \ldots, X_n$ be the outcomes of $n$ independent flips of the same coin. Here, we code $X_i = 1$ if the $i$-th toss produces $H$ and 0 otherwise. The parameter of interest is $\theta$, the probability of $H$ turning up in a single toss. This can be any number between 0 and 1. The $X_i$'s are i.i.d. and the common distribution $P_\theta$ is the Bernoulli($\theta$) distribution, which has probability mass function
$f(x, \theta) = \theta^x (1 - \theta)^{1 - x}, \quad x \in \{0, 1\}.$
Check that this is indeed a valid expression for the probability mass function. Here the parameter space $\Theta$, the set of all possible values for $\theta$, is the closed interval $[0, 1]$.

(2) Let $X_1, X_2, \ldots, X_n$ denote the failure times of $n$ different bulbs. We can think of the $X_i$'s as independent and identically distributed random variables from an exponential distribution with an unknown parameter $\theta$ which we want to estimate.

If $F(x, \theta)$ denotes the distribution function of $X_1$ under parameter value $\theta$, then
$F(x, \theta) = P_{\theta \text{ is true parameter}}(X_1 \le x) = 1 - e^{-\theta x}.$
The common density function is given by
$f(x, \theta) = \theta e^{-\theta x}.$
Here the parameter space for $\theta$ is $(0, \infty)$. Note that $\theta$ is very naturally related to the mean of the distribution: we have $E_\theta(X_1) = 1/\theta$. The expression $E_\theta(X_1)$ should be read as the expected value of $X_1$ when the true parameter is $\theta$. In general, whenever I write an expression with $\theta$ as a subscript, interpret that as under the scenario that the true underlying parameter is $\theta$.

(3) Let $X_1, X_2, \ldots, X_n$ be the number of customers that arrive at $n$ different identical counters in unit time. Then the $X_i$'s can be thought of as i.i.d. random variables with a (common) Poisson distribution with mean $\theta$. Once again $\theta$, which is also the parameter that completely specifies the Poisson distribution, varies in the set $(0, \infty)$, which therefore is the parameter space.

The probability mass function is
$f(x, \theta) = \dfrac{e^{-\theta}\,\theta^x}{x!}.$

(4) Let $X_1, X_2, \ldots, X_n$ be observations from a Normal distribution with mean $\mu$ and variance $\sigma^2$. The mean and the variance completely specify the normal distribution. We can take the parameter $\theta = (\mu, \sigma^2)$. Thus, we have a two-dimensional parameter and $\Theta$, the set of all possible values of $\theta$, is the set in $\mathbb{R}^2$ given by $(-\infty, \infty) \times (0, \infty)$. The density function $f(x, \theta)$ is then given by
$f(x, \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{(x - \mu)^2}{2\sigma^2}\right].$
Note that each different value of $\theta$ gives you a different normal curve. If you fix $\mu$, the first component of $\theta$, and vary $\sigma^2$, you get a family of normal (density) curves all centered at $\mu$ but with varying spread. A smaller value of $\sigma^2$ corresponds to a curve that is more peaked about $\mu$ and also more tightly packed around it. If you fix $\sigma^2$ and vary $\mu$, you get a family of curves that are all translates of a fixed curve (say, the one centered at 0).
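All four families above are available in scipy.stats, so the densities and mass functions written out here can be checked directly against library implementations. The following sketch does exactly that; the parameter values are arbitrary illustrations, not taken from the notes.

```python
import numpy as np
from math import factorial
from scipy import stats

# Example 1: Bernoulli(theta), f(x, theta) = theta^x (1 - theta)^(1 - x), x in {0, 1}
theta = 0.3
print(stats.bernoulli.pmf([0, 1], theta), [1 - theta, theta])

# Example 2: exponential with rate theta, f(x, theta) = theta * exp(-theta * x)
theta, x = 2.0, 1.5
print(stats.expon.pdf(x, scale=1 / theta), theta * np.exp(-theta * x))

# Example 3: Poisson with mean theta, f(x, theta) = exp(-theta) * theta^x / x!
theta, x = 4.0, 3
print(stats.poisson.pmf(x, theta), np.exp(-theta) * theta**x / factorial(x))

# Example 4: Normal(mu, sigma^2)
mu, sigma, x = 1.0, 0.5, 0.8
print(stats.norm.pdf(x, loc=mu, scale=sigma),
      np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))
```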

Consider now the problem of estimating $g(\theta)$, where $g$ is some function of $\theta$. In many cases $g(\theta) = \theta$ itself; for example, we could be interested in estimating $\theta$, the probability of $H$ in Example 1 above. Generally $g(\theta)$ will describe some important aspect of the distribution $P_\theta$. In Example 1, $g(\theta) = \theta$ describes the probability of the coin landing heads; in Example 2, $g(\theta) = 1/\theta$ is the expected value of the lifetime of a bulb. Our estimate of $g(\theta)$ will be some function of our observed data $X = (X_1, X_2, \ldots, X_n)$. We will generically denote an estimate of $g(\theta)$ by $T_n(X_1, X_2, \ldots, X_n)$ and will write $T_n$ for brevity. Thus $T_n$ is some function of the observed data and is therefore a random variable itself. Let's quickly look at an example. In Example 1, a natural estimate of $\theta$, as we have discussed before, is $\bar{X}_n$, the mean of the $X_i$'s. This is simply the sample proportion of heads in $n$ tosses of the coin.

Thus
$T_n(X_1, X_2, \ldots, X_n) = \dfrac{X_1 + X_2 + \cdots + X_n}{n}.$
By the WLLN, $\bar{X}_n$ converges in probability to $\theta$ and is therefore a reasonable estimator, at least in this sense. Of course, this is not the only estimator of $\theta$ that one can propose (but this is indeed the best estimator in more ways than one). One could also propose the proportion of heads in the first $m$ tosses of the coin as an estimator, $m$ being the floor of $n/2$. This will also converge in probability to $\theta$ as $n \to \infty$, but its variance will always be larger than that of $\bar{X}_n$. In general there will be several different estimators of $g(\theta)$ which may all seem reasonable from different perspectives; the question then becomes one of finding the most optimal one. This requires an objective measure of the performance of the estimator. If $T_n$ estimates $g(\theta)$, a criterion that naturally suggests itself is the distance of $T_n$ from $g(\theta)$. Good estimators are those for which $|T_n - g(\theta)|$ is generally small.
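A small simulation makes the comparison of the two estimators concrete. The sketch below (with arbitrary choices of $\theta$, $n$ and the number of replications, none of which come from the notes) shows that both the full-sample proportion and the proportion based on only the first $\lfloor n/2 \rfloor$ tosses center on $\theta$, but the latter has roughly twice the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 200, 10_000            # illustrative choices

X = rng.binomial(1, theta, size=(reps, n))   # reps independent experiments of n tosses each
full = X.mean(axis=1)                        # X_bar_n: proportion of heads in all n tosses
half = X[:, : n // 2].mean(axis=1)           # proportion of heads in the first floor(n/2) tosses

print("mean of full-sample estimator:", full.mean())      # close to theta
print("mean of half-sample estimator:", half.mean())      # also close to theta
print("variance of full-sample estimator:", full.var())   # about theta*(1-theta)/n
print("variance of half-sample estimator:", half.var())   # about twice as large
```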

Since $T_n$ is a random variable, no deterministic statement can be made about the absolute deviation; however, what we can expect of a good estimator is a high chance of remaining close to $g(\theta)$. Also, as $n$, the sample size, increases, we get hold of more information and hence expect to be able to do a better job of estimating $g(\theta)$. These notions, when coupled together, give rise to the consistency requirement for a sequence of estimators $T_n$: as $n$ increases, $T_n$ ought to converge in probability to $g(\theta)$ (under the probability distribution $P_\theta$). In other words, for any $\epsilon > 0$,
$P_\theta(|T_n - g(\theta)| > \epsilon) \to 0.$
Consistency as defined above is clearly a large sample property; what it says is that with probability increasing to 1 (as the sample size grows), $T_n$ estimates $g(\theta)$ to any pre-determined level of accuracy. However, the consistency condition alone does not tell us anything about how well we are performing for any particular sample size, or the rate at which the above probability is going to 0. For a fixed sample size $n$, how do we measure the performance of an estimator $T_n$?
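The consistency statement can be visualised by Monte Carlo. For the coin-flip example, the fraction of simulated experiments in which $|\bar{X}_n - \theta|$ exceeds a fixed $\epsilon$ shrinks as $n$ grows; the values of $\theta$, $\epsilon$ and the sample sizes below are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, eps, reps = 0.3, 0.05, 2_000          # illustrative choices

for n in [10, 50, 200, 1000, 5000]:
    X = rng.binomial(1, theta, size=(reps, n))
    Tn = X.mean(axis=1)                       # the sample proportion X_bar_n
    prob = np.mean(np.abs(Tn - theta) > eps)  # Monte Carlo estimate of P(|Tn - theta| > eps)
    print(n, prob)                            # tends to 0 as n increases
```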

We have seen that $|T_n - g(\theta)|$ is itself random and therefore cannot even be computed as a function of $\theta$ before the experiment is carried out. A way out of this difficulty is to obtain an average measure of the error, or in other words, to average out $|T_n - g(\theta)|$ over all possible realizations of $T_n$. The resulting quantity is then still a function of $\theta$ but no longer random. It is called the mean absolute error and can be written compactly as
$E_\theta\,|T_n - g(\theta)|.$
However, it is more common to avoid absolute deviations and work with the square of the deviation, integrated out as before over the distribution of $T_n$. This is called the mean squared error (MSE) and is given by
$\mathrm{MSE}(T_n, g(\theta)) = E_\theta\,(T_n - g(\theta))^2.$
Of course, this is meaningful only if the above quantity is finite for all $\theta$. Good estimators are those for which the MSE is generally not too high, whatever be the value of $\theta$.
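The MSE is easy to approximate by averaging the squared deviation over many simulated data sets. For the sample proportion in the Bernoulli example the exact MSE is $\theta(1 - \theta)/n$ (the estimator is unbiased, so its MSE equals its variance), and the sketch below (with illustrative $\theta$ and $n$) reproduces that value up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 100, 100_000      # illustrative choices

X = rng.binomial(1, theta, size=(reps, n))
Tn = X.mean(axis=1)                     # the sample proportion X_bar_n

mse_mc = np.mean((Tn - theta) ** 2)     # Monte Carlo approximation of E_theta (Tn - g(theta))^2
mse_exact = theta * (1 - theta) / n     # exact MSE of the sample proportion
print(mse_mc, mse_exact)
```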

There is a standard decomposition of the MSE that helps us understand its components. We have
\[
\begin{aligned}
\mathrm{MSE}(T_n, g(\theta)) &= E_\theta\,(T_n - g(\theta))^2 \\
&= E_\theta\,\big(T_n - E_\theta(T_n) + E_\theta(T_n) - g(\theta)\big)^2 \\
&= E_\theta\,(T_n - E_\theta(T_n))^2 + \big(E_\theta(T_n) - g(\theta)\big)^2 + 2\,E_\theta\big[(T_n - E_\theta(T_n))\,(E_\theta(T_n) - g(\theta))\big] \\
&= \mathrm{Var}_\theta(T_n) + b(T_n, g(\theta))^2,
\end{aligned}
\]
where $b(T_n, g(\theta)) = E_\theta(T_n) - g(\theta)$ is the bias of $T_n$ as an estimator of $g(\theta)$ (the cross-product term in the above display vanishes since $E_\theta(T_n) - g(\theta)$ is a constant and $E_\theta(T_n - E_\theta(T_n)) = 0$). The bias measures, on an average, by how much $T_n$ overestimates or underestimates $g(\theta)$. If we think of the expectation $E_\theta(T_n)$ as the center of the distribution of $T_n$, then the bias measures by how much the center deviates from the target. The variance of $T_n$, of course, measures how closely $T_n$ is clustered around its center. Ideally one would like to minimize both simultaneously, but unfortunately this is rarely possible.
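The decomposition can also be verified numerically. The estimator used in the sketch below, $(\sum X_i + 1)/(n + 2)$, is a deliberately biased estimator of $\theta$ chosen purely as a convenient example (it does not appear in the notes); the simulation confirms that its Monte Carlo MSE matches variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 50, 200_000       # illustrative choices

X = rng.binomial(1, theta, size=(reps, n))
Tn = (X.sum(axis=1) + 1) / (n + 2)      # a deliberately biased estimator of theta

mse = np.mean((Tn - theta) ** 2)        # E_theta (Tn - theta)^2
var = Tn.var()                          # Var_theta(Tn)
bias = Tn.mean() - theta                # b(Tn, theta) = E_theta(Tn) - theta

print(mse, var + bias ** 2)             # the two agree up to Monte Carlo error
```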

