
Lecture Notes 9 Asymptotic Theory (Chapter 9)


In these notes we look at the large sample properties of estimators, especially the maximum likelihood estimator.

Some Notation: Recall that
$$E_\theta(g(X)) = \int g(x)\, p(x;\theta)\, dx.$$

1 Review of o, O, etc.

1. $a_n = o(1)$ means $a_n \to 0$ as $n \to \infty$.
2. A random sequence $A_n$ is $o_P(1)$ if $A_n \xrightarrow{P} 0$ as $n \to \infty$.
3. A random sequence $A_n$ is $o_P(b_n)$ if $A_n/b_n \xrightarrow{P} 0$ as $n \to \infty$.
4. $n^b\, o_P(1) = o_P(n^b)$, so $\sqrt{n}\, o_P(1/\sqrt{n}) = o_P(1) \xrightarrow{P} 0$.
5. $o_P(1)\, o_P(1) = o_P(1)$.
6. $a_n = O(1)$ if $|a_n|$ is bounded by a constant as $n \to \infty$.
7. A random sequence $Y_n$ is $O_P(1)$ if for every $\epsilon > 0$ there exists a constant $M$ such that $\limsup_{n\to\infty} P(|Y_n| > M) < \epsilon$.
8. A random sequence $Y_n$ is $O_P(b_n)$ if $Y_n/b_n$ is $O_P(1)$.
9. If $Y_n \rightsquigarrow Y$, then $Y_n$ is $O_P(1)$.
10. If $\sqrt{n}(Y_n - c) \rightsquigarrow Y$, then $Y_n - c = O_P(1/\sqrt{n})$.
11. $O_P(1)\, O_P(1) = O_P(1)$.
12. $o_P(1)\, O_P(1) = o_P(1)$.
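The following small simulation (my addition, not part of the original notes; it assumes NumPy is available) illustrates facts 2 and 10 above for the Bernoulli sample mean: $\hat p - p$ is $o_P(1)$, while $\sqrt{n}(\hat p - p)$ stays bounded in probability, i.e. $\hat p - p = O_P(1/\sqrt{n})$.

```python
# Illustrative sketch, not from the notes: the Bernoulli sample mean satisfies
# p_hat - p = o_P(1) while sqrt(n)*(p_hat - p) stays bounded in probability,
# i.e. p_hat - p = O_P(1/sqrt(n)).
import numpy as np

rng = np.random.default_rng(0)
p, reps = 0.3, 2000
for n in [100, 10_000, 1_000_000]:
    p_hat = rng.binomial(n, p, size=reps) / n
    err = np.abs(p_hat - p)
    # 95th percentile of |p_hat - p| shrinks to 0; that of sqrt(n)|p_hat - p| stabilizes
    print(n, np.quantile(err, 0.95), np.quantile(np.sqrt(n) * err, 0.95))
```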

2 Distances Between Probability Distributions

Let $P$ and $Q$ be distributions with densities $p$ and $q$. We will use the following distances between $P$ and $Q$:

1. Total variation distance: $\mathrm{TV}(P,Q) = \sup_A |P(A) - Q(A)|$.
2. $L_1$ distance: $d_1(P,Q) = \int |p - q|$.
3. Hellinger distance: $h(P,Q) = \sqrt{\int (\sqrt{p} - \sqrt{q})^2}$.
4. Kullback-Leibler distance: $K(P,Q) = \int p \log(p/q)$.
5. $L_2$ distance: $d_2(P,Q) = \int (p - q)^2$.

Here are some properties of these distances:

1. $\mathrm{TV}(P,Q) = \frac{1}{2}\, d_1(P,Q)$. (Prove this!)
2. $h^2(P,Q) = 2\left(1 - \int \sqrt{pq}\right)$.
3. $\mathrm{TV}(P,Q) \le h(P,Q) \le \sqrt{2\,\mathrm{TV}(P,Q)}$.
4. $h^2(P,Q) \le K(P,Q)$.
5. $\mathrm{TV}(P,Q) \le h(P,Q) \le \sqrt{K(P,Q)}$.
6. $\mathrm{TV}(P,Q) \le \sqrt{K(P,Q)/2}$.
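Here is a short numerical check of properties 3, 4 and 6 (my addition, not in the notes; it assumes NumPy), computing TV, Hellinger and Kullback-Leibler directly from their definitions for two small discrete distributions.

```python
# Illustrative sketch, not from the notes: verify TV <= h <= sqrt(2 TV), h^2 <= K and
# Pinsker's inequality TV <= sqrt(K/2) for two discrete distributions P and Q.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

tv = 0.5 * np.sum(np.abs(p - q))                     # TV = (1/2) d_1
h = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))  # Hellinger distance
kl = np.sum(p * np.log(p / q))                       # Kullback-Leibler distance

print(tv, h, kl)
print(tv <= h <= np.sqrt(2 * tv))   # property 3
print(h ** 2 <= kl)                 # property 4
print(tv <= np.sqrt(kl / 2))        # property 6
```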

3 Consistency

An estimator $\hat\theta_n = g(X_1, \dots, X_n)$ is consistent for $\theta$ if $\hat\theta_n \xrightarrow{P} \theta$ as $n \to \infty$. In other words, $\hat\theta_n - \theta = o_P(1)$. Here are two common ways to prove that $\hat\theta_n$ is consistent.

Method 1: Show that, for all $\epsilon > 0$, $P(|\hat\theta_n - \theta| \ge \epsilon) \to 0$.

Method 2: Prove convergence in quadratic mean:
$$\mathrm{MSE}(\hat\theta_n) = \mathrm{Bias}^2(\hat\theta_n) + \mathrm{Var}(\hat\theta_n) \to 0.$$
If the bias $\to 0$ and the variance $\to 0$ then $\hat\theta_n \xrightarrow{qm} \theta$, which implies that $\hat\theta_n \xrightarrow{P} \theta$.

Example 1. $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$. The mle $\hat p = \sum_i X_i / n$ has bias 0 and variance $p(1-p)/n \to 0$. So $\hat p \xrightarrow{P} p$ and is consistent.

Now let $\psi = \log(p/(1-p))$. Then $\hat\psi = \log(\hat p/(1-\hat p))$, that is, $\hat\psi = g(\hat p)$ where $g(p) = \log(p/(1-p))$. By the continuous mapping theorem, $\hat\psi \xrightarrow{P} \psi$, so this estimator is consistent.

Now consider
$$\tilde p = \frac{\sum_i X_i + 1}{n+1}.$$
Then
$$\mathrm{Bias} = E(\tilde p) - p = \frac{np + 1}{n+1} - p = \frac{1-p}{n+1} \to 0$$
and
$$\mathrm{Var}(\tilde p) = \frac{n p(1-p)}{(n+1)^2} \to 0,$$
so this estimator is consistent as well.

Example 2. $X_1, \dots, X_n \sim \mathrm{Uniform}(0, \theta)$. Let $\hat\theta_n = X_{(n)}$. By direct proof (we did it earlier) we have $\hat\theta_n \xrightarrow{P} \theta$.

Method of moments estimators are typically consistent. Consider one parameter and let $\alpha(\theta) = E_\theta(X)$. Recall that $\hat\theta$ solves $\alpha(\hat\theta) = m$ where $m = n^{-1}\sum_{i=1}^n X_i$, that is, $\hat\theta = \alpha^{-1}(m)$. Assume that $\alpha^{-1}$ exists and is continuous. By the WLLN, $m \xrightarrow{P} \alpha(\theta)$. So, by the continuous mapping theorem,
$$\hat\theta_n = \alpha^{-1}(m) \xrightarrow{P} \alpha^{-1}(\alpha(\theta)) = \theta.$$
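The following simulation (my addition, not in the notes; it assumes NumPy) checks Example 2 and the method-of-moments claim for the Uniform(0, θ) model, where $\alpha(\theta) = \theta/2$ so the moment estimator is $2m$.

```python
# Illustrative sketch, not from the notes: for Uniform(0, theta), both the mle X_(n) and
# the method-of-moments estimator 2*Xbar concentrate around theta as n grows.
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 5.0, 2000
for n in [10, 100, 10_000]:
    x = rng.uniform(0, theta, size=(reps, n))
    mle = x.max(axis=1)        # theta_hat = X_(n)
    mom = 2 * x.mean(axis=1)   # alpha(theta) = theta/2, so theta_hat = alpha^{-1}(m) = 2m
    print(n,
          np.mean(np.abs(mle - theta) > 0.1),   # P(|mle - theta| > 0.1), tends to 0
          np.mean(np.abs(mom - theta) > 0.1))   # P(|mom - theta| > 0.1), tends to 0
```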

4 Consistency of the MLE

Under regularity conditions, the mle is consistent. Let us prove this in a special case. This will also reveal a connection between the mle and Hellinger distance. Suppose that the model consists of finitely many distinct densities $\mathcal{P} = \{p_0, p_1, \dots, p_N\}$. The likelihood function is
$$L(p_j) = \prod_{i=1}^n p_j(X_i).$$
The mle $\hat p$ is the density $p_j$ that maximizes $L(p_j)$. Without loss of generality, assume that the true density is $p_0$.

Theorem 3. $P(\hat p \ne p_0) \to 0$ as $n \to \infty$.

Proof. Let us begin by first proving an inequality. Let $\epsilon_j = h(p_0, p_j)$. Then, for $j \ne 0$,
$$P\left(\frac{L(p_j)}{L(p_0)} > e^{-n\epsilon_j^2/2}\right) = P\left(\prod_{i=1}^n \frac{p_j(X_i)}{p_0(X_i)} > e^{-n\epsilon_j^2/2}\right) = P\left(\prod_{i=1}^n \sqrt{\frac{p_j(X_i)}{p_0(X_i)}} > e^{-n\epsilon_j^2/4}\right)$$
$$\le e^{n\epsilon_j^2/4}\, E\left(\prod_{i=1}^n \sqrt{\frac{p_j(X_i)}{p_0(X_i)}}\right) = e^{n\epsilon_j^2/4} \prod_{i=1}^n E\left(\sqrt{\frac{p_j(X_i)}{p_0(X_i)}}\right)$$
$$= e^{n\epsilon_j^2/4}\left(\int \sqrt{p_j\, p_0}\right)^{n} = e^{n\epsilon_j^2/4}\left(1 - \frac{h^2(p_0,p_j)}{2}\right)^{n} = e^{n\epsilon_j^2/4}\left(1 - \frac{\epsilon_j^2}{2}\right)^{n}$$
$$= e^{n\epsilon_j^2/4}\exp\left\{n\log\left(1 - \frac{\epsilon_j^2}{2}\right)\right\} \le e^{n\epsilon_j^2/4}\, e^{-n\epsilon_j^2/2} = e^{-n\epsilon_j^2/4}.$$
Here we used Markov's inequality, the fact that $h^2(p_0,p_j) = 2 - 2\int\sqrt{p_0\, p_j}$, and the fact that $\log(1-x) \le -x$ for $x > 0$.

Let $\epsilon = \min\{\epsilon_1, \dots, \epsilon_N\} > 0$. If $\hat p \ne p_0$ then $L(p_j)/L(p_0) \ge 1 > e^{-n\epsilon_j^2/2}$ for some $j \ne 0$. Therefore
$$P(\hat p \ne p_0) \le P\left(\frac{L(p_j)}{L(p_0)} > e^{-n\epsilon_j^2/2} \text{ for some } j\right) \le \sum_{j=1}^N P\left(\frac{L(p_j)}{L(p_0)} > e^{-n\epsilon_j^2/2}\right) \le \sum_{j=1}^N e^{-n\epsilon_j^2/4} \le N e^{-n\epsilon^2/4} \to 0. \quad \square$$
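A Monte Carlo illustration of Theorem 3 (my addition, not in the notes; it assumes NumPy and SciPy). The candidate model here is a hypothetical finite family of three normal densities with unit variance; the true density is the first one, and the empirical error rate is compared with the bound $N e^{-n\epsilon^2/4}$ obtained in the proof.

```python
# Illustrative sketch, not from the notes: estimate P(p_hat != p0) for a finite model of
# three N(mu_j, 1) densities and compare with the bound N * exp(-n * eps^2 / 4), where
# eps is the smallest Hellinger distance between p0 and the other candidates.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

means = [0.0, 1.0, -1.5]   # hypothetical candidates; the true density is N(0, 1)

def hellinger(m0, m1):
    f = lambda x: (np.sqrt(norm.pdf(x, m0)) - np.sqrt(norm.pdf(x, m1))) ** 2
    return np.sqrt(quad(f, -np.inf, np.inf)[0])

eps = min(hellinger(means[0], m) for m in means[1:])
rng = np.random.default_rng(2)
for n in [5, 20, 80]:
    x = rng.normal(means[0], 1.0, size=(5000, n))
    loglik = np.stack([norm.logpdf(x, m).sum(axis=1) for m in means])  # one row per candidate
    err_rate = np.mean(loglik.argmax(axis=0) != 0)   # how often the mle picks p_j with j != 0
    bound = (len(means) - 1) * np.exp(-n * eps ** 2 / 4)
    print(n, err_rate, bound)
```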

We can prove a similar result using the Kullback-Leibler distance as follows. Let $X_1, X_2, \dots$ be iid with density $p(x;\theta)$. Let $\theta_0$ be the true value of $\theta$ and let $\theta$ be some other value. We will show that $L(\theta_0)/L(\theta) > 1$ with probability tending to 1. We assume that the model is identifiable; this means that $\theta_1 \ne \theta_2$ implies $K(\theta_1, \theta_2) > 0$, where $K$ is the Kullback-Leibler distance.

Theorem 4. Suppose the model is identifiable. Let $\theta_0$ be the true value of the parameter. For any $\theta \ne \theta_0$,
$$P\left(\frac{L(\theta_0)}{L(\theta)} > 1\right) \to 1$$
as $n \to \infty$.

Proof. We have
$$\frac{1}{n}\left(\ell(\theta_0) - \ell(\theta)\right) = \frac{1}{n}\sum_{i=1}^n \log p(X_i;\theta_0) - \frac{1}{n}\sum_{i=1}^n \log p(X_i;\theta)$$
$$\xrightarrow{P}\ E\left(\log p(X;\theta_0)\right) - E\left(\log p(X;\theta)\right)$$
$$= \int \left(\log p(x;\theta_0)\right) p(x;\theta_0)\, dx - \int \left(\log p(x;\theta)\right) p(x;\theta_0)\, dx$$
$$= \int \log\left(\frac{p(x;\theta_0)}{p(x;\theta)}\right) p(x;\theta_0)\, dx = K(\theta_0, \theta) > 0.$$
So
$$P\left(\frac{L(\theta_0)}{L(\theta)} > 1\right) = P\left(\ell(\theta_0) - \ell(\theta) > 0\right) = P\left(\frac{1}{n}\left(\ell(\theta_0) - \ell(\theta)\right) > 0\right) \to 1. \quad \square$$
This is not quite enough to show that $\hat\theta_n \xrightarrow{P} \theta_0$.
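A quick simulation of Theorem 4 (my addition, not in the notes; it assumes NumPy), using Bernoulli data with true parameter $p_0 = 0.5$ and the fixed alternative $p = 0.6$: the probability that $L(p_0) > L(p)$ tends to 1.

```python
# Illustrative sketch, not from the notes: P(L(p0) > L(p)) -> 1 for Bernoulli data with
# true p0 = 0.5 and a fixed alternative p = 0.6.
import numpy as np

rng = np.random.default_rng(3)
p0, p_alt, reps = 0.5, 0.6, 5000
for n in [10, 100, 1000]:
    s = rng.binomial(1, p0, size=(reps, n)).sum(axis=1)
    ll0 = s * np.log(p0) + (n - s) * np.log(1 - p0)        # log L(p0)
    ll1 = s * np.log(p_alt) + (n - s) * np.log(1 - p_alt)  # log L(p)
    print(n, np.mean(ll0 > ll1))   # tends to 1
```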

Example 5 (Inconsistency of an mle). In all examples so far $n \to \infty$, but the number of parameters is fixed. What if the number of parameters also goes to $\infty$? Let
$$Y_{11}, Y_{12} \sim N(\mu_1, \sigma^2)$$
$$Y_{21}, Y_{22} \sim N(\mu_2, \sigma^2)$$
$$\vdots$$
$$Y_{n1}, Y_{n2} \sim N(\mu_n, \sigma^2).$$
Some calculations show that the mle of $\sigma^2$ is
$$\hat\sigma^2 = \frac{\sum_{i=1}^n \sum_{j=1}^2 \left(Y_{ij} - \overline{Y}_i\right)^2}{2n}.$$
It is easy to show (good test question) that
$$\hat\sigma^2 \xrightarrow{P} \frac{\sigma^2}{2}.$$
Note that the modified estimator $2\hat\sigma^2$ is consistent. The reason why consistency fails here is that the dimension of the parameter space is increasing with $n$.

Theorem 6. Under regularity conditions on the model $\{p(x;\theta): \theta \in \Theta\}$, the mle is consistent.

The regularity conditions are technical. Basically, we need to assume that (i) the dimension of the parameter space does not change with $n$ and (ii) $p(x;\theta)$ is a smooth function of $\theta$.
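The inconsistency in Example 5 is easy to see by simulation (my addition, not in the notes; it assumes NumPy): the mle of $\sigma^2$ stabilizes at $\sigma^2/2$, while $2\hat\sigma^2$ is consistent.

```python
# Illustrative sketch, not from the notes: with two observations per mean mu_i, the mle of
# sigma^2 converges to sigma^2 / 2 (here sigma^2 = 4), while 2 * sigma2_mle is consistent.
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 4.0
for n in [100, 10_000, 1_000_000]:
    mu = rng.normal(0.0, 10.0, size=n)                        # arbitrary means mu_1, ..., mu_n
    y = rng.normal(mu[:, None], np.sqrt(sigma2), size=(n, 2))
    ybar = y.mean(axis=1, keepdims=True)
    sigma2_mle = np.sum((y - ybar) ** 2) / (2 * n)
    print(n, sigma2_mle, 2 * sigma2_mle)
```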

5 Score and Fisher Information

The score function and Fisher information are the key quantities in many aspects of statistical inference. Suppose for now that $\theta \in \mathbb{R}$. The score function is
$$S_n(\theta) \equiv S_n(\theta, X_1, \dots, X_n) = \ell_n'(\theta) = \frac{\partial \log p(X_1,\dots,X_n;\theta)}{\partial\theta} \overset{\mathrm{iid}}{=} \sum_i \frac{\partial \log p(X_i;\theta)}{\partial\theta}.$$
The Fisher information is defined to be
$$I_n(\theta) = \mathrm{Var}_\theta\left(S_n(\theta)\right),$$
that is, the variance of the score function. Later we will see that, for the mle, $\mathrm{Var}(\hat\theta) \approx 1/I_n(\theta)$. That is why $I_n(\theta)$ is called the information.

Theorem 7. Under regularity conditions, $E_\theta[S_n(\theta)] = 0$. In other words,
$$\int \cdots \int \left(\frac{\partial \log p(x_1,\dots,x_n;\theta)}{\partial\theta}\right) p(x_1,\dots,x_n;\theta)\, dx_1 \cdots dx_n = 0.$$
That is, if the expected value is taken at the same $\theta$ at which we evaluate $S_n(\theta)$, then the expectation is 0. This does not hold when the $\theta$'s mismatch: in general $E_{\theta_0}[S_n(\theta_1)] \ne 0$. We'll see later that this property is very important.

Proof.
$$E_\theta[S_n(\theta)] = \int \cdots \int \frac{\partial \log p(x_1,\dots,x_n;\theta)}{\partial\theta}\, p(x_1,\dots,x_n;\theta)\, dx_1 \cdots dx_n$$
$$= \int \cdots \int \frac{\partial p(x_1,\dots,x_n;\theta)/\partial\theta}{p(x_1,\dots,x_n;\theta)}\, p(x_1,\dots,x_n;\theta)\, dx_1 \cdots dx_n$$
$$= \frac{\partial}{\partial\theta}\underbrace{\int \cdots \int p(x_1,\dots,x_n;\theta)\, dx_1 \cdots dx_n}_{=\,1} = 0. \quad \square$$

Example 8. Let $X_1, \dots, X_n \sim N(\theta, 1)$. Then
$$S_n(\theta) = \sum_{i=1}^n (X_i - \theta)$$
and clearly $E_\theta[S_n(\theta)] = 0$.

Warning: If the support of $p$ depends on $\theta$, then $\int$ and $\partial/\partial\theta$ cannot be switched.
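A numerical check of Theorem 7 in the setting of Example 8 (my addition, not in the notes; it assumes NumPy): averaging the score over many samples gives roughly 0 when it is evaluated at the true $\theta$, but not when the $\theta$'s mismatch.

```python
# Illustrative sketch, not from the notes: for N(theta, 1), S_n(theta) = sum_i (X_i - theta).
# E_theta[S_n(theta)] = 0, but evaluating the score at a wrong theta gives a nonzero mean.
import numpy as np

rng = np.random.default_rng(5)
theta0, n, reps = 2.0, 50, 100_000
x = rng.normal(theta0, 1.0, size=(reps, n))
score_at_true = (x - theta0).sum(axis=1)
score_at_wrong = (x - 2.5).sum(axis=1)
print(score_at_true.mean())    # approximately 0
print(score_at_wrong.mean())   # approximately n * (theta0 - 2.5) = -25
```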

Now we discuss some properties of the Fisher information. Recall that the Fisher information is defined to be $I_n(\theta) = \mathrm{Var}_\theta(S_n(\theta))$. Since the mean of the score is 0, we have that $I_n(\theta) = E_\theta[S_n^2(\theta)]$.

Lemma 9. For the iid case, we have $I_n(\theta) = n I(\theta)$, where $I(\theta)$ is the Fisher information for $n = 1$.

Proof. This follows since the log-likelihood, and hence the score, is the sum of $n$ independent terms. $\square$

The next result gives a very simple formula for calculating the Fisher information.

Lemma 10. Under regularity conditions,
$$I_n(\theta) = -E_\theta\left[\frac{\partial^2 \ell_n(\theta)}{\partial\theta^2}\right].$$

Proof. For simplicity take $n = 1$, write $p = p(x;\theta)$, and let $'$ denote $\partial/\partial\theta$. First note that
$$\int p = 1 \implies \int p' = 0 \implies \int p'' = 0 \implies E\left(\frac{p''}{p}\right) = \int p'' = 0.$$
Let $\ell = \log p$ and $S = \ell' = p'/p$. Then $\ell'' = \frac{p''}{p} - \left(\frac{p'}{p}\right)^2$ and
$$\mathrm{Var}(S) = E(S^2) - (E(S))^2 = E(S^2) = E\left[\left(\frac{p'}{p}\right)^2\right] = E\left(\frac{p''}{p}\right) - E\left[\frac{p''}{p} - \left(\frac{p'}{p}\right)^2\right] = -E(\ell''). \quad \square$$
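A Monte Carlo check of Lemma 10 (my addition, not in the notes; it assumes NumPy) for a single Poisson($\lambda$) observation, where $\ell(\lambda) = x\log\lambda - \lambda - \log x!$: both $\mathrm{Var}(S)$ and $-E(\ell'')$ should be close to the Fisher information $1/\lambda$.

```python
# Illustrative sketch, not from the notes: for one Poisson(lam) observation,
# S = x/lam - 1 and l'' = -x/lam^2, so Var(S) and -E[l''] both equal 1/lam.
import numpy as np

rng = np.random.default_rng(6)
lam, reps = 3.0, 1_000_000
x = rng.poisson(lam, size=reps)
score = x / lam - 1.0
ell_pp = -x / lam ** 2
print(score.var(), -ell_pp.mean(), 1 / lam)   # all approximately 1/3
```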

The Vector Case. Let $\theta = (\theta_1, \dots, \theta_k)$. $L_n(\theta)$ and $\ell_n(\theta)$ are defined as before. The score function $S_n(\theta)$ is now a vector of length $k$ whose $j$th component is $\partial \ell_n(\theta)/\partial\theta_j$. The Fisher information $I_n(\theta)$ is now a $k \times k$ matrix; it is the variance-covariance matrix of the score. We have the identity
$$I_n(\theta)(r,s) = -E_\theta\left[\frac{\partial^2 \ell(\theta)}{\partial\theta_r\, \partial\theta_s}\right].$$

Example 11. Suppose that $X_1, \dots, X_n \sim N(\mu, \sigma^2)$. Then, dropping constants,
$$L_n(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x_i - \mu)^2\right\} = \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\right\},$$
$$\ell_n(\mu,\sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2,$$
$$S_n(\mu,\sigma) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_i (x_i - \mu) \\ -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_i (x_i - \mu)^2 \end{pmatrix},$$
$$I_n(\mu,\sigma) = -E\begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{2}{\sigma^3}\sum_i (x_i - \mu) \\ -\frac{2}{\sigma^3}\sum_i (x_i - \mu) & \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_i (x_i - \mu)^2 \end{pmatrix} = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{2n}{\sigma^2} \end{pmatrix}.$$
You can check that $E_\theta(S) = (0, 0)^T$.
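A simulation check of Example 11 (my addition, not in the notes; it assumes NumPy): the sample covariance matrix of the score vector $S_n(\mu, \sigma)$ over many datasets should be close to $\mathrm{diag}(n/\sigma^2,\, 2n/\sigma^2)$.

```python
# Illustrative sketch, not from the notes: the covariance matrix of the score vector for
# N(mu, sigma^2) data matches I_n(mu, sigma) = diag(n/sigma^2, 2n/sigma^2).
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 1.0, 2.0, 30, 200_000
x = rng.normal(mu, sigma, size=(reps, n))
s_mu = (x - mu).sum(axis=1) / sigma ** 2                         # d l / d mu
s_sigma = -n / sigma + ((x - mu) ** 2).sum(axis=1) / sigma ** 3  # d l / d sigma
print(np.cov(np.stack([s_mu, s_sigma])))              # Monte Carlo estimate of I_n(mu, sigma)
print(np.diag([n / sigma ** 2, 2 * n / sigma ** 2]))  # exact Fisher information matrix
```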

6 Asymptotic Normality of the MLE

In this section we prove that the mle satisfies
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \rightsquigarrow N\left(0, I^{-1}(\theta)\right).$$
In other words,
$$\hat\theta_n \approx N\left(\theta, \frac{1}{n I(\theta)}\right).$$
In fact we will show that
$$\hat\theta = \theta + \frac{1}{n}\sum_{i=1}^n \psi(X_i) + o_P(n^{-1/2}) \qquad (1)$$
where
$$\psi(x) = \frac{S(\theta, x)}{I(\theta)}$$
is called the influence function. In the next section we shall see that any well-behaved estimator $\tilde\theta$ can also be written in the form (1) for some $\psi$, and that $\mathrm{Var}(\hat\theta) \le \mathrm{Var}(\tilde\theta)$.

The regularity conditions we need to prove the asymptotic normality of the mle are stronger than those needed for consistency. We need to assume that (i) the dimension of the parameter space does not change with $n$, (ii) $p(x;\theta)$ is a smooth function of $\theta$, (iii) we can interchange differentiation and integration over $x$, and (iv) the range of $X$ does not depend on $\theta$.

Theorem 12.
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \rightsquigarrow N\left(0, \frac{1}{I(\theta)}\right).$$
Hence, $\hat\theta_n = \theta + O_P\left(\frac{1}{\sqrt{n}}\right)$.

Proof. By Taylor's theorem,
$$0 = \ell'(\hat\theta) \approx \ell'(\theta) + (\hat\theta - \theta)\, \ell''(\theta).$$
Hence
$$\sqrt{n}\left(\hat\theta - \theta\right) \approx \frac{\frac{1}{\sqrt{n}}\ell'(\theta)}{-\frac{1}{n}\ell''(\theta)} \equiv \frac{A_n}{B_n}.$$
Now
$$A_n = \frac{1}{\sqrt{n}}\ell'(\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n S(\theta, X_i) = \sqrt{n}\left(\overline{S} - 0\right),$$
where $S(\theta, X_i)$ is the score function based on $X_i$ alone. Recall that $E(S(\theta, X_i)) = 0$ and $\mathrm{Var}(S(\theta, X_i)) = I(\theta)$. By the central limit theorem,
$$A_n \rightsquigarrow N(0, I(\theta)) \overset{d}{=} \sqrt{I(\theta)}\, Z, \qquad Z \sim N(0,1).$$
By the WLLN,
$$B_n \xrightarrow{P} -E(\ell'') = I(\theta).$$
By Slutsky's theorem,
$$\frac{A_n}{B_n} \rightsquigarrow \frac{\sqrt{I(\theta)}\, Z}{I(\theta)} = \frac{Z}{\sqrt{I(\theta)}} \sim N\left(0, \frac{1}{I(\theta)}\right).$$
So
$$\sqrt{n}\left(\hat\theta - \theta\right) \rightsquigarrow N\left(0, \frac{1}{I(\theta)}\right). \quad \square$$

A small modification of the above proof yields:

Theorem 13. We have
$$\hat\theta = \theta + \frac{1}{n}\sum_{i=1}^n \psi(X_i) + o_P(n^{-1/2}) \qquad (2)$$
where
$$\psi(x) = \frac{S(\theta, x)}{I(\theta)}.$$

Now suppose we want to estimate $\tau = g(\theta)$. By the delta method we have:

Theorem 14. Let $g$ be a smooth function of $\theta$. Then
$$\sqrt{n}\left(g(\hat\theta_n) - g(\theta)\right) \rightsquigarrow N\left(0, \frac{(g'(\theta))^2}{I(\theta)}\right).$$

From all the above, we see that the approximate standard error of $\hat\theta$ is
$$\mathrm{se} = \sqrt{\frac{1}{n I(\theta)}} = \sqrt{\frac{1}{I_n(\theta)}}$$
and the estimated standard error is
$$\widehat{\mathrm{se}} = \sqrt{\frac{1}{I_n(\hat\theta)}}.$$
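Finally, a simulation (my addition, not in the notes; it assumes NumPy) of Theorems 12 and 14 for the Bernoulli model, where $I(p) = 1/(p(1-p))$: the variance of $\sqrt{n}(\hat p - p)$ should be close to $1/I(p) = p(1-p)$, and for $g(p) = \log(p/(1-p))$ the delta-method variance $(g'(p))^2/I(p)$ equals $1/(p(1-p))$.

```python
# Illustrative sketch, not from the notes: asymptotic variance of the Bernoulli mle and of
# the plug-in estimator of g(p) = log(p/(1-p)) via the delta method.
import numpy as np

rng = np.random.default_rng(8)
p, n, reps = 0.3, 2000, 100_000
p_hat = rng.binomial(n, p, size=reps) / n
g_hat = np.log(p_hat / (1 - p_hat))
g_true = np.log(p / (1 - p))
print(np.var(np.sqrt(n) * (p_hat - p)), p * (1 - p))             # ~ 1/I(p)
print(np.var(np.sqrt(n) * (g_hat - g_true)), 1 / (p * (1 - p)))  # ~ (g'(p))^2 / I(p)
```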

