Entropy and Information Theory
First Edition, Corrected (March 3, 2013)

Robert M. Gray
Information Systems Laboratory
Electrical Engineering Department
Stanford University

Springer-Verlag, New York
(c) 1990 by Springer-Verlag. Revised 2000, 2007, 2008, 2009, 2013 by Robert M. Gray

To Tim, Lori, Julia, Peter, Gus, Amy Elizabeth, and Alice, and in memory of Tino

Contents

Prologue
1  Information Sources
   Introduction; Probability Spaces and Random Variables; Random Processes and Dynamical Systems; Distributions; Standard Alphabets; Expectation; Asymptotic Mean Stationarity; Ergodic Properties
2  Entropy and Information
   Introduction; Entropy and Entropy Rate; Basic Properties of Entropy; Entropy Rate; Conditional Entropy and Information; Entropy Rate Revisited; Relative Entropy Densities
3  The Entropy Ergodic Theorem
   Introduction; Stationary Ergodic Sources; Stationary Nonergodic Sources; AMS Sources; The Asymptotic Equipartition Property
4  Information Rates I
   Introduction; Stationary Codes and Approximation; Information Rate of Finite Alphabet Processes
5  Relative Entropy
   Introduction; Divergence; Conditional Relative Entropy; Limiting Entropy Densities; Information for General Alphabets; Some Convergence Results
6  Information Rates II
   Introduction; Information Rates for General Alphabets; A Mean Ergodic Theorem for Densities; Information Rates of Stationary Processes
7  Relative Entropy Rates
   Introduction; Relative Entropy Densities and Rates; Markov Dominating Measures; Stationary Processes; Mean Ergodic Theorems
8  Ergodic Theorems for Densities
   Introduction; Stationary Ergodic Sources; Stationary Nonergodic Sources; AMS Sources; Ergodic Theorems for Information Densities
9  Channels and Codes
   Introduction; Channels; Stationarity Properties of Channels; Examples of Channels; The Rohlin-Kakutani Theorem
10 Distortion
   Introduction; Distortion and Fidelity Criteria; Performance; The rho-bar Distortion; d-bar Continuous Channels; The Distortion-Rate Function
11 Source Coding Theorems
   Source Coding and Channel Coding; Block Source Codes for AMS Sources; Block Coding Stationary Sources; Block Coding AMS Ergodic Sources; Subadditive Fidelity Criteria; Asynchronous Block Codes; Sliding Block Source Codes; A Geometric Interpretation of Operational DRFs
12 Coding for Noisy Channels
   Noisy Channels; Feinstein's Lemma; Feinstein's Theorem; Channel Capacity; Robust Block Codes; Block Coding Theorems for Noisy Channels; Joint Source and Channel Block Codes; Synchronizing Block Channel Codes; Sliding Block Source and Channel Coding
Bibliography
Index

Prologue

This book is devoted to the theory of probabilistic information measures and their application to coding theorems for information sources and noisy channels. The eventual goal is a general development of Shannon's mathematical theory of communication, but much of the space is devoted to the tools and methods required to prove the Shannon coding theorems.
These tools form an area common to ergodic theory and information theory and comprise several quantitative notions of the information in random variables, random processes, and dynamical systems. Examples are entropy, mutual information, conditional entropy, conditional information, and relative entropy (discrimination, Kullback-Leibler information), along with the limiting normalized versions of these quantities such as entropy rate and information rate. When considering multiple random objects, in addition to information we will be concerned with the distance or distortion between the random objects, that is, the accuracy of the representation of one random object by another. Much of the book is concerned with the properties of these quantities, especially the long-term asymptotic behavior of average information and distortion, where both sample averages and probabilistic averages are of interest.
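For finite-alphabet distributions the measures named above reduce to simple sums. The following sketch (not from the book; the function names are illustrative) computes entropy, relative entropy, and mutual information in bits for discrete probability mass functions:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x) of a discrete pmf."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def relative_entropy(p, q):
    """Divergence D(p||q) = sum_x p(x) log2(p(x)/q(x)); requires p << q."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(pxy):
    """I(X;Y) = D(p_XY || p_X p_Y) for a joint pmf given as a 2-D list."""
    px = [sum(row) for row in pxy]            # marginal of X (row sums)
    py = [sum(col) for col in zip(*pxy)]      # marginal of Y (column sums)
    return sum(pxy[i][j] * math.log2(pxy[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if pxy[i][j] > 0)

# A fair coin carries one bit of entropy; independent variables share none.
print(entropy([0.5, 0.5]))                               # 1.0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```

The same definitions extend to the general alphabets of the later chapters by replacing sums over a pmf with integrals against a dominating measure.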
The book has been strongly influenced by M. S. Pinsker's classic Information and Information Stability of Random Variables and Processes and by the seminal work of A. N. Kolmogorov, I. M. Gelfand, A. M. Yaglom, and R. L. Dobrushin on information measures for abstract alphabets and their convergence properties. Many of the results herein are extensions of their generalizations of Shannon's original results. The mathematical models of this treatment are more general than traditional treatments in that nonstationary and nonergodic information processes are treated. The models are somewhat less general than those of the Soviet school of information theory in the sense that standard alphabets rather than completely abstract alphabets are considered. This restriction, however, permits many stronger results as well as the extension to nonergodic processes.
In addition, the assumption of standard spaces simplifies many proofs, and such spaces include as examples virtually all examples of engineering interest. The information convergence results are combined with ergodic theorems to prove general Shannon coding theorems for sources and channels. The results are not the most general known and the converses are not the strongest available, but they are sufficiently general to cover most systems encountered in applications, and they provide an introduction to recent extensions requiring significant additional mathematical machinery. Several of the generalizations have not previously been treated in book form. Examples of novel topics for an information theory text include asymptotically mean stationary sources, one-sided sources as well as two-sided sources, nonergodic sources, d-bar continuous channels, and sliding block or stationary codes.
Another novel aspect is the use of recent proofs of general Shannon-McMillan-Breiman theorems which do not use martingale theory: a coding proof of Ornstein and Weiss is used to prove the almost everywhere convergence of sample entropy for discrete alphabet processes, and a variation on the sandwich approach of Algoet and Cover is used to prove the convergence of relative entropy densities for general standard alphabet processes. Both results are proved for asymptotically mean stationary processes which need not be ergodic. This material can be considered as a sequel to my book Probability, Random Processes, and Ergodic Properties, wherein the prerequisite results on probability, standard spaces, and ordinary ergodic properties may be found. This book is self-contained with the exception of common (and a few less common) results which may be found in the first book. It is my hope that the book will interest engineers in some of the mathematical aspects and general models of the theory, and mathematicians in some of the important engineering applications of performance bounds and code design for communication systems. Information theory, the mathematical theory of communication, has two primary goals. The first is the development of the fundamental theoretical limits on the achievable performance when communicating a given information source over a given communications channel using coding schemes from within a prescribed class. The second goal is the development of coding schemes that provide performance that is reasonably good in comparison with the optimal performance given by the theory. Information theory was born in a surprisingly rich state in the classic papers of Claude E.
Shannon, which contained the basic results for simple memoryless sources and channels and introduced more general communication systems models, including finite state sources and channels. The key tools used to prove the original results and many of those that followed were special cases of the ergodic theorem and a new variation of the ergodic theorem which considered sample averages of a measure of the entropy or self-information in a process. Information theory can be viewed as simply a branch of applied probability theory. Because of its dependence on ergodic theorems, however, it can also be viewed as a branch of ergodic theory, the theory of invariant transformations and transformations related to invariant transformations. In order to develop the ergodic theory example of principal interest to information theory, suppose that one has a random process, which for the moment we consider as a sample space or ensemble of possible output sequences together with a probability measure on events composed of collections of such sequences.
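The variation of the ergodic theorem alluded to here concerns sample averages of self-information. In the simplest case, an i.i.d. finite-alphabet source, it reduces to the law of large numbers: the per-symbol self-information of a long sample sequence converges to the entropy of the marginal distribution. A minimal simulation sketch (illustrative only; names are not from the book):

```python
import math
import random

def sample_entropy(pmf, n, seed=0):
    """Average self-information -(1/n) sum_i log2 p(X_i) of n i.i.d. draws.

    For an i.i.d. source the entropy ergodic theorem is just the law of
    large numbers: this quantity converges to H = -sum p log2 p as n grows.
    """
    rng = random.Random(seed)
    symbols = list(pmf)
    weights = [pmf[s] for s in symbols]
    draws = rng.choices(symbols, weights=weights, k=n)
    return -sum(math.log2(pmf[x]) for x in draws) / n

pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
true_h = -sum(p * math.log2(p) for p in pmf.values())  # 1.5 bits
print(true_h, sample_entropy(pmf, 100000))
```

The content of the Shannon-McMillan-Breiman theorems developed in Chapter 3 is that this convergence persists far beyond the i.i.d. case, for stationary ergodic and, in the form proved here, asymptotically mean stationary sources.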