Lecture 1: Entropy and mutual information

Tufts University, Electrical and Computer Engineering, EE194 Network Information Theory, Prof. Mai Vu

1 Introduction

Imagine two people, Alice and Bob, living in Toronto and Boston respectively. Alice (Toronto) goes jogging whenever it is not snowing heavily. Bob (Boston) doesn't ever go jogging. Notice that Alice's actions give information about the weather in Toronto, while Bob's actions give no information. This is because Alice's actions are random and correlated with the weather in Toronto, whereas Bob's actions are deterministic. How can we quantify the notion of information?

2 Entropy

Definition: The entropy of a discrete random variable X with pmf p_X(x) is

    H(X) = -\sum_x p(x) \log p(x) = E[-\log p(x)]    (1)

The entropy measures the expected uncertainty in X. We also say that H(X) is approximately equal to how much information we learn on average from one instance of the random variable X.

Note that the base of the logarithm is not important, since changing the base only changes the value of the entropy by a multiplicative constant:

    H_b(X) = -\sum_x p(x) \log_b p(x) = \log_b(a) \left[ -\sum_x p(x) \log_a p(x) \right] = \log_b(a) H_a(X).

Customarily, we use base 2 for the calculation of entropy.
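For concreteness, a minimal Python sketch (assuming NumPy; the pmf values and the helper name entropy are arbitrary choices for illustration) that evaluates (1) and checks the base-change relation H_b(X) = log_b(a) H_a(X):

    import numpy as np

    def entropy(p, base=2.0):
        """H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p)) / np.log(base)

    p = [0.5, 0.25, 0.125, 0.125]        # example pmf (arbitrary)
    H2 = entropy(p, base=2)              # entropy in bits: 1.75
    He = entropy(p, base=np.e)           # entropy in nats
    print(H2)
    print(He, H2 * np.log(2))            # equal: H_e(X) = log_e(2) * H_2(X)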

Example: Suppose you have a random variable X such that

    X = 0 with probability p,  X = 1 with probability 1 - p;    (2)

then the entropy of X is given by

    H(X) = -p \log p - (1-p) \log(1-p) =: H(p).    (3)

Note that the entropy does not depend on the values that the random variable takes (0 and 1 in this case), but only on the probability distribution p(x).

2.1 Two variables

Consider now two random variables X, Y jointly distributed according to the pmf p(x, y). We define the following two quantities.

Definition: The joint entropy is given by

    H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y).    (4)

The joint entropy measures how much uncertainty there is in the two random variables X and Y taken together.

Definition: The conditional entropy of X given Y is

    H(X|Y) = -\sum_{x,y} p(x, y) \log p(x|y) = E[-\log p(x|y)].    (5)

The conditional entropy is a measure of how much uncertainty remains about the random variable X when we know the value of Y.
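A small numerical sketch of definitions (4) and (5), using an arbitrary 2x2 joint pmf chosen only for illustration:

    import numpy as np

    # Example joint pmf p(x, y) on {0,1} x {0,1}; rows index x, columns index y.
    pxy = np.array([[0.4, 0.1],
                    [0.2, 0.3]])

    py = pxy.sum(axis=0)                        # marginal p(y)
    H_XY = -np.sum(pxy * np.log2(pxy))          # joint entropy, eq. (4)

    # Conditional entropy H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), eq. (5)
    p_x_given_y = pxy / py                      # p(x|y): each column normalized by p(y)
    H_X_given_Y = -np.sum(pxy * np.log2(p_x_given_y))

    print(H_XY, H_X_given_Y)                    # in bits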

2.2 Properties

The entropic quantities defined above have the following properties:

- Non-negativity: H(X) \ge 0; entropy is always non-negative, and H(X) = 0 iff X is deterministic.

- Chain rule: We can decompose the joint entropy as follows:

    H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X^{i-1}),    (6)

  where we use the notation X^{i-1} = {X_1, X_2, ..., X_{i-1}}. For two variables, the chain rule becomes

    H(X, Y) = H(X|Y) + H(Y)    (7)
            = H(Y|X) + H(X).    (8)

  Note that in general H(X|Y) \neq H(Y|X).

- Monotonicity: Conditioning always reduces entropy:

    H(X|Y) \le H(X).    (9)

  In other words, "information never hurts".

- Maximum entropy: Let \mathcal{X} be the set from which the random variable X takes its values (sometimes called the alphabet); then

    H(X) \le \log |\mathcal{X}|.    (10)

  The above bound is achieved when X is uniformly distributed.

- Non-increasing under functions: Let X be a random variable and let g(X) be some deterministic function of X. We have that

    H(X) \ge H(g(X)),    (11)

  with equality iff g is invertible.

  Proof: We expand the joint entropy H(X, g(X)) with the chain rule in its two different orders:

    H(X, g(X)) = H(X, g(X))    (12)
    H(X) + H(g(X)|X) = H(g(X)) + H(X|g(X)).    (13)

  Since g(X) is determined by X, we have H(g(X)|X) = 0, so

    H(X) - H(g(X)) = H(X|g(X)) \ge 0,    (14)

  with equality if and only if we can deterministically guess X given g(X), which is only the case if g is invertible.
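The chain rule (7) and the conditioning bound (9) are easy to check numerically; a minimal sketch on the same arbitrary joint pmf as above:

    import numpy as np

    pxy = np.array([[0.4, 0.1],
                    [0.2, 0.3]])                 # arbitrary example joint pmf
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)    # marginals p(x), p(y)

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    H_XY = H(pxy)
    H_X, H_Y = H(px), H(py)
    H_X_given_Y = -np.sum(pxy * np.log2(pxy / py))

    print(np.isclose(H_XY, H_X_given_Y + H_Y))   # chain rule (7): True
    print(H_X_given_Y <= H_X)                    # conditioning reduces entropy (9): True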

3 Continuous random variables

Similarly to the discrete case, we can define entropic quantities for continuous random variables.

Definition: The differential entropy of a continuous random variable X with pdf f(x) is

    h(X) = -\int f(x) \log f(x) \, dx = E[-\log f(x)].    (15)

Definition: Consider a pair of continuous random variables (X, Y) distributed according to the joint pdf f(x, y). The joint entropy is given by

    h(X, Y) = -\iint f(x, y) \log f(x, y) \, dx \, dy,    (16)

while the conditional entropy is

    h(X|Y) = -\iint f(x, y) \log f(x|y) \, dx \, dy.    (17)
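Since h(X) = E[-log f(X)], it can be estimated by a sample average. A minimal Monte Carlo sketch for a standard Gaussian, whose differential entropy is (1/2) log(2πe) nats (sample size and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200_000)              # samples of X ~ N(0, 1)

    def log_pdf_normal(x):
        return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

    h_estimate = -np.mean(log_pdf_normal(x))      # Monte Carlo estimate of -E[log f(X)]
    h_exact = 0.5 * np.log(2 * np.pi * np.e)      # closed form, about 1.4189 nats
    print(h_estimate, h_exact)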

3.1 Properties

Some of the properties of the discrete case carry over to the continuous case, but some do not. Let us go through the list again.

- Non-negativity doesn't hold: h(X) can be negative. Example: Consider the uniform distribution on the interval [a, b]. The entropy is given by

    h(X) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a} \, dx = \log(b-a),    (18)

  which is negative if b - a is less than 1.

- Chain rule holds for continuous variables:

    h(X, Y) = h(X|Y) + h(Y)    (19)
            = h(Y|X) + h(X).    (20)

- Monotonicity:

    h(X|Y) \le h(X).    (21)

  The proof follows from the non-negativity of mutual information (later).

- Maximum entropy: We do not have a bound for general pdfs f(x), but we do have one for power-limited distributions. Consider a pdf f(x) such that

    E[X^2] = \int x^2 f(x) \, dx \le P;    (22)

  then

    \max h(X) = \frac{1}{2} \log(2 \pi e P),    (23)

  and the maximum is achieved by X \sim N(0, P). To verify this claim one can use standard Lagrange multiplier techniques from calculus to solve the problem \max h(f) = -\int f \log f \, dx, subject to E[X^2] = \int x^2 f \, dx \le P.

- Non-increasing under functions: Doesn't necessarily hold, since we cannot guarantee h(X|g(X)) \ge 0.

4 Mutual information

Definition: The mutual information between two discrete random variables X, Y jointly distributed according to p(x, y) is given by

    I(X;Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (24)
           = H(X) - H(X|Y)
           = H(Y) - H(Y|X)
           = H(X) + H(Y) - H(X, Y).    (25)
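A short sketch evaluating definition (24) on the earlier arbitrary joint pmf and checking the identity I(X;Y) = H(X) + H(Y) - H(X,Y):

    import numpy as np

    pxy = np.array([[0.4, 0.1],
                    [0.2, 0.3]])                  # arbitrary example joint pmf
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    I_def = np.sum(pxy * np.log2(pxy / np.outer(px, py)))   # definition (24)
    I_id  = H(px) + H(py) - H(pxy)                           # H(X) + H(Y) - H(X,Y)
    print(I_def, I_id, np.isclose(I_def, I_id))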

We can also define the analogous quantity for continuous variables.

Definition: The mutual information between two continuous random variables X, Y with joint pdf f(x, y) is given by

    I(X;Y) = \iint f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \, dx \, dy.    (26)

For two variables it is possible to represent the different entropic quantities with an analogy to set theory. In Figure 1 we see the different quantities, and how the mutual information is the uncertainty that is common to both X and Y.

[Figure 1: Graphical representation of the conditional entropy and the mutual information, showing H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y) as overlapping regions.]

4.1 Non-negativity of mutual information

In this section we will show that

    I(X;Y) \ge 0,    (27)

and this is true for both the discrete and continuous cases. Before we get to the proof, we have to introduce some preliminary concepts: Jensen's inequality and the relative entropy.

Jensen's inequality tells us something about the expected value of a random variable after applying a convex function to it. We say a function f is convex on the interval [a, b] if, for all x_1, x_2 \in [a, b] and all \lambda \in [0, 1], we have

    f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).    (28)

Another way of stating the above is to say that the function always lies below the line joining the points (x_1, f(x_1)) and (x_2, f(x_2)). For a twice-differentiable function f(x), convexity is equivalent to the condition f''(x) \ge 0 for all x \in [a, b].
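A quick numerical check of the convexity condition (28) for f(x) = -log x, the function that drives the proof below (the sampling range and number of trials are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):
        return -np.log(x)                          # -log is convex on (0, infinity)

    ok = True
    for _ in range(10_000):
        x1, x2 = rng.uniform(0.01, 10.0, size=2)
        lam = rng.uniform()
        lhs = f(lam * x1 + (1 - lam) * x2)         # f(lambda x1 + (1-lambda) x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)      # lambda f(x1) + (1-lambda) f(x2)
        ok &= bool(lhs <= rhs + 1e-12)
    print(ok)                                      # True: inequality (28) holds in every trial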

Lemma (Jensen's inequality): For any convex function f(x), we have

    E[f(X)] \ge f(E[X]).    (29)

The proof can be found in [Cover & Thomas]. Note that an analogue of Jensen's inequality exists for concave functions, where the inequality simply changes direction.

Relative entropy: A very natural way to measure the distance between two probability distributions is the relative entropy, also sometimes called the Kullback-Leibler divergence.

Definition: The relative entropy between two probability distributions p(x) and q(x) is given by

    D(p(x) \| q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}.    (30)

The reason why we are interested in the relative entropy in this section is that it is related to the mutual information in the following way:

    I(X;Y) = D(p(x, y) \| p(x) p(y)).    (31)
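A minimal sketch of (30) and the relation (31), reusing the arbitrary joint pmf from the earlier examples (the helper name kl is just an illustrative choice):

    import numpy as np

    def kl(p, q):
        """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
        p = np.asarray(p, dtype=float).ravel()
        q = np.asarray(q, dtype=float).ravel()
        mask = p > 0
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    pxy = np.array([[0.4, 0.1],
                    [0.2, 0.3]])                  # arbitrary example joint pmf
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    I = kl(pxy, np.outer(px, py))                 # I(X;Y) = D(p(x,y) || p(x)p(y)), eq. (31)
    print(I, I >= 0)                              # non-negative, as shown next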

Thus, if we can show that the relative entropy is a non-negative quantity, we will have shown that the mutual information is also non-negative.

Proof of non-negativity of relative entropy: Let p(x) and q(x) be two arbitrary probability distributions. We calculate the relative entropy as follows:

    D(p(x) \| q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}
                    = -\sum_x p(x) \log \frac{q(x)}{p(x)}
                    = -E\left[ \log \frac{q(x)}{p(x)} \right]
                    \ge -\log E\left[ \frac{q(x)}{p(x)} \right]    (by Jensen's inequality for the concave function \log)
                    = -\log \left( \sum_x p(x) \frac{q(x)}{p(x)} \right)
                    = -\log \left( \sum_x q(x) \right)
                    = 0.

4.2 Conditional mutual information

Definition: Let X, Y, Z be jointly distributed according to some pmf p(x, y, z). The conditional mutual information between X and Y given Z is

    I(X;Y|Z) = \sum_{x,y,z} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z) p(y|z)}    (32)
             = H(X|Z) - H(X|Y,Z)
             = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).

The conditional mutual information is a measure of how much uncertainty is shared by X and Y, but not by Z.
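A minimal sketch computing I(X;Y|Z) through the entropy identity in (32); the 2x2x2 joint pmf is an arbitrary example:

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Arbitrary joint pmf p(x, y, z), with axes ordered (x, y, z).
    pxyz = np.array([[[0.10, 0.15], [0.05, 0.10]],
                     [[0.20, 0.05], [0.15, 0.20]]])
    assert np.isclose(pxyz.sum(), 1.0)

    pxz = pxyz.sum(axis=1)        # p(x, z)
    pyz = pxyz.sum(axis=0)        # p(y, z)
    pz  = pxyz.sum(axis=(0, 1))   # p(z)

    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), from (32)
    I_XY_given_Z = H(pxz) + H(pyz) - H(pxyz) - H(pz)
    print(I_XY_given_Z)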

4.3 Properties

- Chain rule: We have the following chain rule:

    I(X; Y_1 Y_2 \cdots Y_n) = \sum_{i=1}^n I(X; Y_i | Y^{i-1}),    (33)

  where we have used again the shorthand notation Y^{i-1} = {Y_1, Y_2, ..., Y_{i-1}}.

- No monotonicity: Conditioning can either increase or decrease the mutual information between two variables; both

    I(X;Y|Z) \ge I(X;Y)  and  I(X;Y|Z) \le I(X;Y)    (34)

  are possible.

To illustrate the last point, consider the following two examples where conditioning has different effects. In both cases we will make use of the chain rule applied to I(X; Y,Z) in its two different orders:

    I(X; Y,Z) = I(X; Y,Z)
    I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z).    (35)

Increasing example: If we have some X, Y, Z such that I(X;Z) = 0 (which means X and Z are independent variables), then equation (35) becomes

    I(X;Y) + I(X;Z|Y) = I(X;Y|Z),    (36)

so I(X;Y|Z) - I(X;Y) = I(X;Z|Y) \ge 0, which implies

    I(X;Y|Z) \ge I(X;Y).    (37)

Decreasing example: On the other hand, if we have a situation in which I(X;Z|Y) = 0, equation (35) becomes

    I(X;Y) = I(X;Z) + I(X;Y|Z),    (38)

which implies that I(X;Y|Z) \le I(X;Y).

So we see that conditioning can either increase or decrease the mutual information, depending on the situation.
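The increasing case has a classic concrete instance: let X and Z be independent fair bits and Y = X xor Z. Then I(X;Y) = 0 while I(X;Y|Z) = 1 bit, as the sketch below confirms:

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Joint pmf of (X, Y, Z) with X, Z independent fair bits and Y = X xor Z.
    pxyz = np.zeros((2, 2, 2))
    for x in (0, 1):
        for z in (0, 1):
            pxyz[x, x ^ z, z] = 0.25

    pxy = pxyz.sum(axis=2); px = pxy.sum(axis=1); py = pxy.sum(axis=0)
    pxz = pxyz.sum(axis=1); pyz = pxyz.sum(axis=0); pz = pxz.sum(axis=0)

    I_XY = H(px) + H(py) - H(pxy)                        # = 0: Y alone says nothing about X
    I_XY_given_Z = H(pxz) + H(pyz) - H(pxyz) - H(pz)     # = 1 bit: given Z, Y reveals X
    print(I_XY, I_XY_given_Z)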

5 Data processing inequality

For three variables X, Y, Z, one situation of particular interest is when they form a Markov chain: X → Y → Z. This relation implies that the conditional distribution factors as p(x, z|y) = p(x|y) p(z|y), which in turn implies that I(X;Z|Y) = 0, as in the decreasing example above.

This situation often occurs when we have some input X that gets transformed by a channel to give an output Y, and we then apply some processing to obtain a signal Z:

    X → Channel → Y → Processing → Z

In this case we have the data processing inequality:

    I(X;Z) \le I(X;Y).    (39)

In other words, processing cannot increase the information contained in a signal.
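A numerical sanity check of (39) for one assumed Markov chain: X a uniform bit, Y the output of a binary symmetric channel with crossover 0.1, and Z the result of passing Y through a second, independent binary symmetric channel with crossover 0.2 (all parameters arbitrary):

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def mutual_information(pab):
        pa, pb = pab.sum(axis=1), pab.sum(axis=0)
        return H(pa) + H(pb) - H(pab)

    def bsc(eps):
        """Transition matrix of a binary symmetric channel with crossover eps."""
        return np.array([[1 - eps, eps], [eps, 1 - eps]])

    px = np.array([0.5, 0.5])                 # input bit X
    p_y_given_x = bsc(0.1)                    # channel X -> Y
    p_z_given_y = bsc(0.2)                    # processing Y -> Z (independent noise)

    pxy = px[:, None] * p_y_given_x           # p(x, y) = p(x) p(y|x)
    pxz = pxy @ p_z_given_y                   # p(x, z) = sum_y p(x, y) p(z|y)

    print(mutual_information(pxz) <= mutual_information(pxy))   # (39) holds: True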

