### Transcription of Entropy and Information Theory - Stanford EE

**Entropy and Information Theory**, First Edition, Corrected (March 3, 2013). Robert M. Gray, Information Systems Laboratory, Electrical Engineering Department, Stanford University. Springer-Verlag, New York. © 1990 by Springer-Verlag; revised 2000, 2007, 2008, 2009, and 2013 by Robert M. Gray.

To Tim, Lori, Julia, Peter, Gus, Amy Elizabeth, and Alice, and in memory of Tino.

#### Contents

- Prologue
- 1 Information Sources: Introduction; Probability Spaces and Random Variables; Random Processes and Dynamical Systems; Distributions; Standard Alphabets; Expectation; Asymptotic Mean Stationarity; Ergodic Properties
- 2 Entropy and Information: Introduction; Entropy and Entropy Rate; Basic Properties of Entropy; Entropy Rate; Conditional Entropy and Information; Entropy Rate Revisited; Relative Entropy Densities
- 3 The Entropy Ergodic Theorem: Introduction; Stationary Ergodic Sources; Stationary Nonergodic Sources; AMS Sources; The Asymptotic Equipartition Property
- 4 Information Rates I: Introduction; Stationary Codes and Approximation; Information Rate of Finite Alphabet Processes
- 5 Relative Entropy: Introduction; Divergence; Conditional Relative Entropy; Limiting Entropy Densities; Information for General Alphabets; Some Convergence Results
- 6 Information Rates II: Introduction; Information Rates for General Alphabets; A Mean Ergodic Theorem for Densities; Information Rates of Stationary Processes
- 7 Relative Entropy Rates: Introduction; Relative Entropy Densities and Rates; Markov Dominating Measures; Stationary Processes; Mean Ergodic Theorems
- 8 Ergodic Theorems for Densities: Introduction; Stationary Ergodic Sources; Stationary Nonergodic Sources; AMS Sources; Ergodic Theorems for Information Densities
- 9 Channels and Codes: Introduction; Channels; Stationarity Properties of Channels; Examples of Channels; The Rohlin-Kakutani Theorem
- 10 Distortion: Introduction; Distortion and Fidelity Criteria; Performance; The rho-bar Distortion; d-bar Continuous Channels; The Distortion-Rate Function
- 11 Source Coding Theorems: Source Coding and Channel Coding; Block Source Codes for AMS Sources; Block Coding Stationary Sources; Block Coding AMS Ergodic Sources; Subadditive Fidelity Criteria; Asynchronous Block Codes; Sliding Block Source Codes; A Geometric Interpretation of Operational DRFs
- 12 Coding for Noisy Channels: Noisy Channels; Feinstein's Lemma; Feinstein's Theorem; Channel Capacity; Robust Block Codes; Block Coding Theorems for Noisy Channels; Joint Source and Channel Block Codes; Synchronizing Block Channel Codes; Sliding Block Source and Channel Coding
- Bibliography
- Index

#### Prologue

This book is devoted to the theory of probabilistic information measures and their application to coding theorems for information sources and noisy channels. The eventual goal is a general development of Shannon's mathematical theory of communication, but much of the space is devoted to the tools and methods required to prove the Shannon coding theorems.

These tools form an area common to ergodic theory and information theory and comprise several quantitative notions of the information in random variables, random processes, and dynamical systems. Examples are entropy, mutual information, conditional entropy, conditional information, and relative entropy (discrimination, Kullback-Leibler information), along with the limiting normalized versions of these quantities such as entropy rate and information rate. When considering multiple random objects, in addition to information we will be concerned with the distance or distortion between the random objects, that is, the accuracy of the representation of one random object by another. Much of the book is concerned with the properties of these quantities, especially the long term asymptotic behavior of average information and distortion, where both sample averages and probabilistic averages are of interest.
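For finite alphabets, the information measures just listed reduce to simple sums, and their basic identities can be checked numerically. A minimal sketch (the toy joint distribution and all names here are illustrative, not from the book) computing entropy, relative entropy, and mutual information, and verifying that I(X;Y) equals both D(p_XY || p_X p_Y) and H(X) + H(Y) - H(X,Y):

```python
import math

# Toy joint distribution p(x, y) over two binary alphabets
# (values chosen purely for illustration).
p_xy = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

def entropy(p):
    """Entropy H(p) = -sum_k p(k) log2 p(k), p given as {outcome: prob}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler information) D(p || q) in bits."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def marginal(p_xy, axis):
    """Marginal of a joint distribution over pairs, axis 0 for X, 1 for Y."""
    m = {}
    for pair, v in p_xy.items():
        m[pair[axis]] = m.get(pair[axis], 0.0) + v
    return m

p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)

# Mutual information I(X;Y) as the relative entropy between the joint
# distribution and the product of its marginals.
prod = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
i_xy = relative_entropy(p_xy, prod)

# The same quantity via the entropy identity H(X) + H(Y) - H(X,Y).
assert abs(i_xy - (entropy(p_x) + entropy(p_y) - entropy(p_xy))) < 1e-12
```

The agreement of the two computations is exactly the finite-alphabet identity relating mutual information to entropies; the rest of the book develops the analogues of these quantities for general standard alphabets.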

The book has been strongly influenced by M. S. Pinsker's classic *Information and Information Stability of Random Variables and Processes* and by the seminal work of A. N. Kolmogorov, I. M. Gelfand, A. M. Yaglom, and R. L. Dobrushin on information measures for abstract alphabets and their convergence properties. Many of the results herein are extensions of their generalizations of Shannon's original results. The mathematical models of this treatment are more general than traditional treatments in that nonstationary and nonergodic information processes are treated. The models are somewhat less general than those of the Soviet school of information theory in the sense that standard alphabets rather than completely abstract alphabets are considered. This restriction, however, permits many stronger results as well as the extension to nonergodic processes.

In addition, the assumption of standard spaces simplifies many proofs, and such spaces include virtually all examples of engineering interest. The information convergence results are combined with ergodic theorems to prove general Shannon coding theorems for sources and channels. The results are not the most general known and the converses are not the strongest available, but they are sufficiently general to cover most systems encountered in applications and they provide an introduction to recent extensions requiring significant additional mathematical machinery. Several of the generalizations have not previously been treated in book form. Examples of novel topics for an information theory text include asymptotic mean stationary sources, one-sided sources as well as two-sided sources, nonergodic sources, d-bar continuous channels, and sliding block or stationary codes.

Another novel aspect is the use of recent proofs of general Shannon-McMillan-Breiman theorems which do not use martingale theory: a coding proof of Ornstein and Weiss [118] is used to prove the almost everywhere convergence of sample entropy for discrete alphabet processes, and a variation on the sandwich approach of Algoet and Cover [7] is used to prove the convergence of relative entropy densities for general standard alphabet processes. Both results are proved for asymptotically mean stationary processes which need not be ergodic. This material can be considered as a sequel to my book *Probability, Random Processes, and Ergodic Properties* [51], wherein the prerequisite results on probability, standard spaces, and ordinary ergodic properties may be found. This book is self-contained with the exception of common (and a few less common) results which may be found in the first book.

It is my hope that the book will interest engineers in some of the mathematical aspects and general models of the theory, and mathematicians in some of the important engineering applications of performance bounds and code design for communication systems. Information theory, the mathematical theory of communication, has two primary goals. The first is the development of the fundamental theoretical limits on the achievable performance when communicating a given information source over a given communications channel using coding schemes from within a prescribed class. The second goal is the development of coding schemes that provide performance that is reasonably good in comparison with the optimal performance given by the theory. Information theory was born in a surprisingly rich state in the classic papers of Claude E.

Shannon [131] [132], which contained the basic results for simple memoryless sources and channels and introduced more general communication systems models, including finite state sources and channels. The key tools used to prove the original results and many of those that followed were special cases of the ergodic theorem and a new variation of the ergodic theorem which considered sample averages of a measure of the entropy or self-information in a process. Information theory can be viewed as simply a branch of applied probability theory. Because of its dependence on ergodic theorems, however, it can also be viewed as a branch of ergodic theory, the theory of invariant transformations and transformations related to invariant transformations. In order to develop the ergodic theory example of principal interest to information theory, suppose that one has a random process, which for the moment we consider as a sample space or ensemble of possible output sequences together with a probability measure on events composed of collections of such sequences.
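The "sample averages of a measure of the entropy or self-information" mentioned above can be made concrete in the simplest setting. A sketch (parameters and names are my own, assuming an i.i.d. binary source, the easiest case of the Shannon-McMillan-Breiman theorem): the per-symbol sample entropy, minus one over n times the log probability of the observed sequence, converges to the entropy rate.

```python
import math
import random

# An i.i.d. binary source with P(X = 1) = p; for such a memoryless source
# the entropy rate is just the single-letter entropy
# H = -p log2 p - (1 - p) log2(1 - p).
random.seed(0)
p = 0.2
n = 200_000
xs = [1 if random.random() < p else 0 for _ in range(n)]

# Sample entropy: -(1/n) log2 of the probability of the observed sample
# path, i.e. the sample average of the self-information -log2 p(x_i).
sample_entropy = -sum(
    math.log2(p if x == 1 else 1.0 - p) for x in xs
) / n

entropy_rate = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(sample_entropy, entropy_rate)  # close for large n
```

For i.i.d. sources this convergence is just the strong law of large numbers applied to the self-information; the entropy ergodic theorem of Chapter 3 extends the statement to stationary (and asymptotically mean stationary) sources, where the plain law of large numbers no longer applies.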