Novelty, Information, and Surprise
Novelty, Information, and Surprise
Novelty, Information, and Surprise
Gunther Palm
Novelty,
Information
and Surprise
123
Gunther Palm
Neural Information Processing
University of Ulm
James-Franck-Ring
Ulm, Germany
ISBN 978-3-642-29074-9
ISBN 978-3-642-29075-6 (eBook)
DOI 10.1007/978-3-642-29075-6
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2012942731
c Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publishers location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . xiii
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . xxii
Part I
3
3
5
7
10
10
11
11
14
15
18
24
30
31
32
34
35
35
36
38
42
44
45
46
vi
Contents
Part II
51
51
53
54
56
57
58
60
60
62
Information Transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.1 Introductory Examples .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.2 Transition Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.3 Transmission of Information Across Simple Channels .. . . . . . . . . . . .
5.4 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.5 Exercises .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Reference .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
63
63
65
67
71
72
74
Part III
77
77
78
80
81
84
85
87
87
88
Channel Capacity .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.1 Information Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.2 Memory and Anticipation.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.3 Channel Capacity.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.4 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.5 Exercises .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
89
89
90
91
94
94
95
97
97
98
Contents
vii
105
106
109
115
117
119
120
120
123
123
125
133
138
138
139
141
141
142
146
152
154
Part V
155
156
157
161
161
166
167
168
170
170
173
173
viii
Contents
12.5 Conclusion .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6.1 Coincidence .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6.2 Coincidental Patterns .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6.3 Spatio-Temporal Patterns. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
175
175
179
179
179
181
189
189
191
194
194
14 Entropy in Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
14.1 Classical Entropy .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
14.2 Modern Entropies and the Second Law . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
14.3 The Second Law in Terms of Information Gain . . . . . . . . . . . . . . . . . . . .
14.4 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
195
195
198
201
204
204
Part VI
207
207
213
214
215
217
217
220
222
226
227
228
228
229
229
231
232
233
234
235
235
235
Contents
ix
Appendices
A
237
238
240
242
Glossary . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 243
Index . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 245
List of Figures
Fig. 1
xv
Fig. 2.1
Fig. 2.2
Fig. 2.3
Fig. 2.4
16
17
25
28
Fig. 4.1
Fig. 4.2
Fig. 4.3
52
52
54
Fig. 5.1
Fig. 5.2
Fig. 5.3
An information channel .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Channel example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Three different channels . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
66
67
73
Fig. 6.1
Channel example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
78
Fig. 7.1
Fig. 7.2
Fig. 7.3
90
94
95
Fig. 8.1
98
Fig. 9.1
Fig. 9.2
112
Fig. 9.3
Fig. 9.4
Fig. 9.5
Fig. 9.6
Illustration of tightness.. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Illustration of cleanness. The vertically hatched region
on the left can be removed, because it is the union of
the two diagonally hatched regions . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Illustration of the product of two repertoires.. . .. . . . . . . . . . . . . . . . . . . .
Examples of narrow covers and chains. . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Example for a shallow repertoire (a) and for a chain (b) .. . . . . . . . . .
Illustration of classes of repertoires . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Fig. 12.1
113
116
118
118
119
xi
xii
Fig. 12.2
Fig. 12.3
Fig. 12.4
Fig. 12.5
Fig. 16.1
Fig. 16.2
List of Figures
169
170
177
178
Introduction
Nowadays there are many practical applications of information theory in fields like
pattern recognition, machine learning, and data mining (e.g., Deco and Obradovic
1996; MacKay 2005), used in particular in the life sciences [e.g., Herzel et al.
(1994); Schmitt and Herzel (1997); Bialek et al. (2007); Taylor et al. (2007);
Tkacik and Bialek (2007); Koepsell et al. (2009)], i.e., far beyond the classical
applications in communication technology. But Claude Shannons groundbraking
original concept of information remained essentially unchanged.
The main purpose of this book is to extend classical information theory to
incorporate the subjective element of interestingness, novelty, or surprise. These
concepts can only be defined relative to a persons interests, intentions, or purposes,
and indeed classical information theory was often criticized for not being able to
incorporate these ideas. Actually, classical information theory comes quite close to
this, when introducing the information contained in a proposition or statement A
(as log2 p.A/). But in everyday life most commonly this information is not really
transferred if a person X tells the statement A to another person Y , for example
because Y may not be interested in A or because Y may rather be interested in the
fact that it is X who tells A. An interesting extension of information theory could
consider the following question: If Y is interested in B instead of A and perhaps B
largely overlaps with A, then how much information does Y obtain from being told
A? This question and other similar ones will be answered in this book; they are not
totally alien to classical information theory. This means that our new theory does
not have to go far from it. In a technical sense it can be regarded as only a slight (but
perhaps important) extension of Shannons definition of information from partitions
to covers.
This needs some explanation: Shannon tried to define his information as an
objective almost physical quantity (measured in bit). This led him to define information for random variables X . For a discrete random
Pvariable X with finitely many
possible values x1 ; x2 ; : : : ; xn he defined I.X / D i pX D xi log2 pX D xi ,
i.e., as the average of log2 pX D xi where the possible outcomes X D xi
of X are now the propositions A of interest. Again, this definition presupposes that
we are equally interested in all outcomes xi of the random variable. This may not
xiii
xiv
Introduction
Introduction
xv
A1
A2
A
A3
C2 C3
C1
AB
A4
A1
A2
A
A3
A4
A1
A2
A3
A4
AB
C1
C2
C3
Fig. 1 Examples of covers (top) and the induced hierarchical structures (bottom)
xvi
Part II:
Part III:
Part IV:
Part V:
Part VI:
Introduction
Introduction
xvii
Jeffrey 1992, 2004. This discussion can be and has been carried over to information
theory, but this is not the purpose of this book. Instead I have discovered an
additional source of subjectivity in probability and information theory that is
perhaps less relevant in probability but has a definite impact on information theory
once it is brought into the focus of attention. Classical information theory tried to
determine the amount of information contained in a message (as a real number
measured in bits). To do this it has to presuppose that the message is exchanged
between two agents (that are normally assumed to be people not animals, computers,
or brain regions), who have already agreed on a common language for the expression
of the message. This approach makes it possible to develop information theory in a
discrete framework, dealing mainly with a finite alphabet from which the messages
are composed. Thus classical information theory does not consider the process of
perception or of formation of the messages. It starts where this process has ended.
I believe and I will actually show in this book that information theory needs
only a slight modification to include this process of perception and formation of a
message. This process is captured in the definition of a description which connects
every event ! that can happen, with a proposition d.!/ about it. When we describe
actual events that have happened, we will normally be unable to give an exact
account of them, let alone of the total state of the world around us. We are restricted
both by our senses and by our language: we cannot sense everything that there is and
we cannot express everything that we sense (at least not exactly). So the description
d.x/ that we give about x will usually be true not only for x but also for other events
y which are somewhat similar to x.
Now it is quite clear that classical information theory does not deal with events
x that happen, but rather with descriptions of these events or propositions about
those events. Unfortunately, this simple fact is obscured by the usual parlance in
probability theory and statistics where a proposition is called an event and an
event is called an elementary event. This leads to such strange phrases as is the
certain event and ; is the impossible event. What is meant is that is the trivial
proposition, which is true because it says nothing about the event x, and that ; is
never true because it is a self-contradictory statement (like A and not A) about the
event x. Due to this strange use of the language (which is usually carried over from
probability theory to information theory) it may easily happen that this humanoid
or language-related aspect of the concept of information is forgotten in practical
applications of information theory. This is harmless when information theory is
used to optimize telephone lines, but it may become problematic when information
theory is applied in the natural sciences. This had already happened in the nineteenth
century when the term entropy was introduced in statistical mechanics to explain the
second law of thermodynamics and it is happening again today when information
theory is used in theoretical neuroscience or in genetics, e.g., Kuppers 1986. One
problem is that in such applications the concept of information may be taken to
be more objective than it actually is. This argument eventually has led me and
also others to some rather subtle and probably controversial criticisms of these
applications of information theory Knoblauch and Palm 2004; Palm 1985, 1996;
Bar-Hillel and Carnap 1953 which really were not the main motive for writing this
xviii
Introduction
book and which are neither part of the theory of novelty and surprise nor its most
interesting applications. In a nutshell, the situation can be described as follows.
Classical Shannon information theory requires an agreement between the sender
and the receiver of the messages whose information content has to be determined.
So information travels from a human sender to a human receiver, and the agreement
concerns the code, i.e., the propositional meaning of the symbols that constitute
the message; this implies that both sender and receiver use the same description
of events. The slightly broader information theory developed here includes the
situation where information about a purely physical observation is extracted by
human observation, so only the receiver needs to be a human. In many scientific
and everyday uses of information terminology, however, neither the sender nor the
receiver is human. And the problem is not merely that instead of the human we
have some other intelligent or intentional being, for example an alien, a monkey,
a chess-computer, or a frog. Information terminology is also used for completely
mechanical situations. For example, a cable in my car carries the information that
I have set the indicator to turn left to the corresponding lights. Or a cable transmits
visual information from a camera to the computer of my home intrusion-warning
system.
In biology and in particular in neuroscience this common use of information
terminology may interfere in strange ways with our ontological prejudices, for
example concerning the consciousness of animals, because on the one hand
information terminology is handy, but on the other hand we dont want to imply
that the receiver of the information has the corresponding properties comparable to
a human. For example, we may want to quantify the amount of visual information
the optic nerve sends to the brain of a frog (Letvin et al. 1959; Atick 1992), without
assuming that the frog (let alone its brain) has made some agreement with its
eyes about the meaning of the signals that are sent through the nerve. Similarly,
we can easily classify the maximal amount of information that can be expressed
by our genes, but we get into much deeper waters when we try to estimate how
much information is actually transmitted, and who is the sender and who is the
receiver. Does father Drosophila transmit some information (for example, about
how to behave) to his son by his genes? Or is it somehow the whole process of
evolution that produces information (Kuppers 1986, see also Taylor et al. 2007)
and who is the receiver of this information? The usual way out of this dilemma
is to avoid the question who might be the sender and who might be the receiver
altogether. In the technical examples eventually both are always humans, anyway.
In the case of information transmission by genes, neurons, or the optic nerve one can
argue that we are just interested in the physical properties of the device that limit
the amount of transmittable information, i.e., the channel capacity. In all these three
cases actually, the channel capacity has been estimated shortly after its definition by
Shannon 1948. In the case of the genome, the capacity is simply twice the number
of base pairs of the DNA-molecule. But this leads to the question how much of this
capacity is actually used and, invariably in biology, to the suspicion that it is much
less than the capacity. So it is found out that most of the genetic information is not
used or not coding, or that most of the visual information available from our
Introduction
xix
xx
Introduction
sequences containing 4 prime numbers (many more than 44, but much less than all
possible sequences), his surprise will certainly be not as large as for the sequence
.22; 23; 24; 25; 26; 27/.
But if we combine all these different reasons for finding surprise, we may
eventually find something surprising in almost every sequence. In this case, it would
seem naive to add all surprises we can get for a given sequence from various
considerations, rather one would believe that the real surprise provided by any
concrete sequence becomes less when everything is considered as surprising. This
obviously confused wording leads to the distinction between novelty and surprise
that is also made in this book.
Introduction
xxi
xxii
Introduction
in Ulm. So it was only towards the end of 2006 when I could take up the book project
again. This time I added more motivating examples to the text. And again I found a
talented student, Stefan Menz, who integrated everything into a neat LATEX-version
of the text. During the last years we also had the opportunity to pursue some of the
new ideas and applications of information theory with excellent PhD students in
an interdisciplinary school on Evolution, Information and Complexity, see Arendt
and Schleich (2009).
This book would never have been finished without the help, encouragement, and
inspiration from all the people I mentioned and also from many others whom I did
not mention. I would like to thank them all!
References
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal
firing correlation: modulation of effective connectivity. Journal of Neurophysiology, 61(5),
900917.
Arendt, W., & Schleich, W. P. (Eds.) (2009). Mathematical analysis of evolution, information, and
complexity. New York: Wiley.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing?
Network: Computation in Neural Systems, 3, 213251.
Bar-Hillel, Y., & Carnap, R. (1953). Semantic information. In London information theory
symposium (pp. 503512). New York: Academic.
Bialek, W., de Ruyter van Steveninck, R. R., & Tishby, N. (2007). Efficient representation as a
design principle for neural coding and computation. Neural Computation, 19(9), 2387-2432.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience,
2(11), 947957.
Braitenberg, V. (1977). On the texture of brain. New York: Springer.
Braitenberg, V. (2011). Information - der Geist in der Natur. Stuttgart: Schatthauer.
Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing.
New York: Springer.
de Finetti, B. (1974). Theory of Probability (Vol. 1). New York: Wiley.
Eckhorn, R., Grusser, O.-J., Kroller, J., Pellnitz, K., & Popel, B. (1976). Efficiency of different neuronal codes: Information transfer calculations for three different neuronal systems. Biological
Cybernetics, 22(1), 4960.
Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats.
Physical Review E, 50(6), 50615071.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4),
620630.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE,
70, 939952.
Jeffreys, H. (1939). Theory of probability. New York: Oxford University Press.
Jeffrey, R. C. (1992). Probability and the art of judgment. New York: Cambridge University Press.
Jeffrey, R. C. (2004). Subjective probability: The real thing. New York: Cambridge University
Press.
Kerridge, D. F. (1961). Inaccuracy and inference. Journal of the Royal Statistical Society. Series B
(Methodological), 23(1), 184194.
Keynes, J. M. (1921). A treatise on probability. London: MacMillan.
Knoblauch, A., & Palm, G. (2004). What is Signal and What is Noise in the Brain? BioSystems,
79, 8390.
References
xxiii
Koepsell, K., Wang, X., Vaingankar, V., Wei, Y., Wang, Q., Rathbun, D. L., Usrey, W. M., Hirsch,
J. A., & Sommer, F. T. (2009). Retinal oscillations carry visual information to cortex. Frontiers
in Systems Neuroscience, 3, 118.
Kuppers, B.-O. (1986). Der Ursprung biologischer Information - Zur Naturphilosophie der
Lebensentstehung. Munchen: Piper.
Legendy, C. R. (1975). Three principles of brain function and structure. International Journal of
Neuroscience, 6, 237254.
Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of
spontaneously active striate cortex neurons. Journal of Neurophysiology, 53(4), 926939.
Letvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frogs eye tells
the frogs brain. Proceedings of the IRE, 47(11), 19401951.
MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. New York:
Cambridge University Press.
MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link.
Bulletin of Mathematical Biology, 14(2), 127135.
Palm, G. (1975). Entropie und Generatoren in dynamischen Verbanden. PhD Thesis, Tubingen.
Palm, G. (1976b). Entropie und Erzeuer in dynamischen Verbanden. Z. Wahrscheinlichkeitstheorie
verw. Geb., 36, 2745.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 167183.
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Palm, G. (1996). Information and surprise in brain theory. In G. Rusch, S. J. Schmidt,
& O. Breidbach (Eds.), Innere ReprasentationenNeue Konzepte der Hirnforschung,
DELFIN Jahrbuch (stw-reihe edition) (pp. 153173). Frankfurt: Suhrkamp.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among
neuronal spike trains. Biological Cybernetics, 59(1), 111.
Pfaffelhuber, E. (1972). Learning and information theory. International Journal of Neuroscience,
3, 83.
Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of
Theoretical Biology, 188(3), 369377.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.
Tkacik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers
(Ed.) Encyclopedia of complexity and systems science (pp. 719741). Berlin: Springer.
arXiv:0712.4385 [q-bio.MN].
Part I
Chapter 1
This chapter lays the probabilistic groundwork for the rest of the book. We introduce
standard probability theory. We call the elements A of the -algebra propositions
instead of events, which would be more common. We reserve the word event
for the elements of the probability space .
2. The fact that we only live for a limited time and therefore are only able to produce
sequences of a limited finite length
When we refer to the set of all possible propositions, we will usually allow
for sequences of finite but unlimited length. Thus we shall usually allow at least
for a countably infinite . In probability theory one usually assumes that
contains all sets f!g (for all ! 2 ), in this case, of course #./ #.),
i.e., has rather more elements then , or a larger or equal cardinality (denoted
by #). Of course, the necessary mathematical assumptions on will always be
explicitly stated.
In real languages it may well happen that has a larger cardinality than . For
example, if an event ! involves the exact position of a glass on a table and this is
described by a pair .x; y/ of real numbers (coordinates), then has the cardinality
of the continuum, which is more than countable. On the other hand, may still be
countable or even finite. If, however, is finite, then will normally contain more
elements than , but cannot be infinite, and in fact #./ 2n , when #./ D n (the
cardinality of is at most 2n , where n is the number of elements in ).
There is another restriction on that we will always assume:
should be closed under the logical operations. This has to be explained: We
may join propositions to new propositions by means of logical operations. If A and
B are propositions, we may also say A and B, A or B, A and not B and the
like. The mathematical counterpart for this are the three set operations:
The intersection of two propositions: A \ B holds for an event !, if ! fulfills A
and ! fulfills B.
A [ B holds for ! if ! fulfills A or if it fulfills B.
AN holds for !, if ! does not fulfill A. Instead of AN the notation Ac (the
complement of A) will be used in the subsequent text.
Definition 1.1. A (nonempty) set of subsets of a set is closed under the logical
operations, if
i) For any A in , Ac is also in .
ii) For any two A, B in , A \ B and A [ B are also in .
A set of subsets of a set is called an algebra, if it is closed under the logical
operations and contains .
In the following we will always assume that P./ is an algebra; here P./
denotes the set of all subsets of . We will even require a little more as is customary
in probability theory (see Definition 1.4).
The set itself, interpreted as a proposition, says that ! 2 , in other words it
says nothing or it holds for every event !: it is a tautology. Its negation c holds
for no event !. As a set, c is called the empty set and written as c D ;.
Finally p is a mapping that assigns a positive number p.A/ to each A 2 . p.A/
is called the probability of the proposition A, i.e., the probability that an event !
fulfills A.
t
u
n
X
ai 1Ai
i D1
and determine
E.X / D
n
X
ai p.Ai /:
i D1
For more details see any book on probability or measure theory, e.g., Ash (1972); Bauer (1972);
Billingsley (1979); Halmos (1950); Jacobs (1978); Lamperti (1966).
If X.!/ D lim Xn .!/, for every ! 2 , and E.Xn / is known, then E.X /
n!1
should be determinable as E.X / D lim E.Xn /. For this idea to work we need yet
n!1
nD1
n!1
M n
X
kDM n
k
1Ak
n
Xn C
1
DW Xn0 :
n
n!1
M n
X
kDM n
k
kC1
k
p
X
:
n
n
n
A slightly different argument for the same computation of E.X / is the following:
From Xn X Xn0 it is clear that E.Xn / E.X / E.Xn0 /. And from Xn0
Xn D n1 we get E.Xn0 / E.Xn / D n1 and this shows lim E.Xn0 / D lim E.Xn /
n!1
n!1
1
S
nD1
nD1
B./ is the smallest -algebra containing all open intervals .a; b/ 0; 1.
10
References
Ash, R. B. (1972). Real analysis and probability. New York: Academic Press.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart
and Winston.
Billingsley, P. (1979). Probability and measure. New York, London, Toronto: Wiley.
Halmos, P. R. (1950). Measure theory. Princeton: Van Nostrand.
Jacobs, K. (1978). Measure and integral. New York: Academic Press.
Lamperti, J. (1966). Probability : A survey of the mathematical theory. Reading, Massachusetts:
Benjamin/Cummings.
Chapter 2
11
12
three children, since this describes his situation more accurately. This is what we
normally expect in a conversation.
For example, it may have started by Mr. Miller pointing out This is my son.
Then you may have asked Do you have more children? and he may have answered
Yes, I have two. In this case, one statement would actually mean [he has two
children and no more]. Let us turn to the other statement this is my son. We would
assume that Mr. Miller happened to be accompanied by one of his children. Indeed,
if he would be accompanied by two of them, we would expect him to mention
both of them in a usual conversation. So the other statement means [Mr. Miller was
accompanied by one of his two children and this was his son]. Now we can work
out the desired probability if we assume that it is equally probable that Mr. Miller is
accompanied by one or the other of his two children, if he is accompanied by just
one of them (which seems quite reasonable).
Mathematically, the situation can be described by three random variables X1 and
X2 2 fm; f g for child one and two, and C 2 f0; 1; 2g for his companion: C D 0
means he is not accompanied by just one child, C D 1, he is accompanied by child
one, C D 2 for child two. We assume that
p.X1 D m/ D p.X1 D f / D p.X2 D m/ D p.X2 D f / D
1
2
p.Af \ Am /
D
p.Am /
1
2
3
4
2
:
3
The other interpretation of the same statements is what is described more formally in
this chapter. Mr. Miller describes the situation differently, depending on the values
of X1 ; X2 , and C :
If X1 D m and C D 1 he says Am ,
If X2 D m and C D 2 he says Am ,
If X1 D f and C D 1 he says Af ,
If X2 D f and C D 2 he says Af ,
If C D 0 he says nothing (i.e., ) about the sex of his children.
Since he has said Am , we have to ask for the set of all conditions under which he
says Am . This is called e
Am in our theory:
e
Am D X1 D m; C D 1 [ X2 D m; C D 2:
13
pS D C; M D i
:
pM D i
pS D 1; C D 1; M D 2
pM D 2; C D 1
pS D 1; C D 1; M D 2
pS D 1; C D 1; M D 2 C pS D 3; C D 1; M D 2
1 1 1
1
3 3 2
D
D :
1 1
1 1 1
3
C
3 3 2
3 3
14
This definition is the classical basic definition of information or entropy, which goes back to
Boltzmann (1887) (see also Brush 1966).
2.3 Descriptions
15
pB .A/ D p.AjB/ WD
p.A \ B/
:
p.B/
p.A \ B/
p.A \ B/ p.B/
p.B/
D
D p.AjB/
equals p.B/
p.A/
p.B/
p.A/
p.A/
if p.AjB/ D p.A/.
Of course, all of this only makes sense if p.A/ and p.B/ are not zero.
It is also clear that the two equivalent conditions are essentially the same as
p.A \ B/ D p.A/ p.B/, because p.AjB/ D p.A/ also implies
p.A/ p.B/ D p.AjB/ p.B/ D p.A \ B/:
These considerations are summarized in the following definition:
Definition 2.2. Two propositions A and B are called independent, if
p.A \ B/ D p.A/ p.B/:
Proposition 2.1. If we define the conditional novelty of A given B as
N .AjB/ D log2 p.AjB/;
then we have
i) N .A \ B/ D N .B/ C N .AjB/
ii) N .A \ B/ D N .A/ C N .B/ if A and B are independent.
Proof. Obvious.
t
u
2.3 Descriptions
Let us come back to an observation already made in Sect. 1.1. In general, the set
of all possible events may be very large. In a typical physical model the events
! 2 would be represented as real vectors ! D .!1 ; : : : ; !n / 2 Rn and thus
would be larger than countable. On the other hand, may well be countable, and
so a description of an element ! 2 by propositions A 2 would be essentially
inexact. Moreover, different persons may use different propositions to describe the
16
World
description
World
interpretation
x
A = d(x)
same event !. For example, when we walk on the street, we may see an event !,
lets say we see a car passing by. But this is not an exact description, and if we say
that we see a blue Mercedes driven by an old man passing by very slowly, this still
is not an exact description and somebody else might rather describe the same event
as Mr. Miller is driving with his wifes car into town.
What goes on here is that
1. Given the event, somebody describes it by a statement in a certain language.
2. Then this statement is interpreted again in our model of the possible events as
a proposition A, i.e., as the set A of all events y which are also described by the
same statement (Fig. 2.1).
The point of view taken by a particular observer in describing the events ! 2
by propositions A 2 , constitutes a particular description of these events. This
process of description can be defined mathematically as a mapping.
Definition 2.3. Given .; ; p/, a mapping d W ! that assigns to each ! 2
a proposition d.!/ D A 2 such that ! 2 A, is called a description.
In addition we require2 that for every A 2 R.d /
pd D A D p.f! 2 W d.!/ D Ag/ 0:
This means that we dont want to bother with propositions that may happen
only with probability 0. This additional requirement is quite helpful for technical
reasons, but it is quite restrictive and it rules out some interesting examples (see
Example 2.3). Note that the requirement that ! 2 d.!/ means that the description
has to be true. So every event ! is described by a true proposition about it.
This requirement obviously implies that the propositions d D A are in for every A 2 R.d /.
It also implies that R.d / is finite or countable.
2.3 Descriptions
17
Example 2.1. Consider the throwing of a dice, i.e., the space of possible events
D f1; 2; 3; 4; 5; 6g. Consider the following descriptions:
1) Even vs. odd: For A D f2; 4; 6g and B D f1; 3; 5g D Ac we define the
description e by eW 1 7! B, 2 7! A, 3 7! B, 4 7! A, 5 7! B, 6 7! A.
2) Small vs. large:
d W 1 7! f1; 2; 3; 4g D A; 2 7! A; 3 7! A;
4 7! f3; 4; 5; 6g D B; 5 7! B; 6 !
7 B:
3) Pairs:
Example 2.2. Try to define some descriptions that make use of the propositions
indicated in Fig. 2.2 about locations on a square table including some logical
operations on them.3
t
u
Example 2.3. Without the requirement added to Definition 2.3 a description d may
have an uncountable range R.d /. Here are two examples for this.
Take D R and a (continuous) probability p on .R; B/ and > 0. For ! 2
we define d.!/ D .! ; ! C / and c.!/ D !; 1/ D fx 2 RW x !g.
Both c and d are interesting descriptions, but for every A in R.c/ or R.d / we
observe pc D A D 0 and pd D A D 0.
t
u
Definition 2.4. A finite or countable collection fA1 ; : : : ; An g or fAi W i 2 Ng of
(measurable) subsets of is called a (measurable) partition, if
For example, one can describe points x 2 AnC by d.x/ D A, points x in A\C by d.x/ D A\C ,
points x in B n C by d.x/ D B, and points x in B \ C by d.x/ D C .
18
i)
Ai D and
n
X
i D1
From Definition (ii) it is obvious that the probability that ! is in two sets, e.g., Ai and Aj , is 0.
So, disregarding propositions with probability 0, every ! 2 is in exactly one of the sets Ai . We
will usually disregard propositions with probability 0 and this is meant by the word essentially.
5
Due to the additional requirement in Definition 2.3 the function Nd is measurable. However, it
may happen that E.Nd / is infinite. For an example see Proposition 2.17.
N .d / D E.N d / D
19
n
X
p e
Ai log2 p.Ai /:
i D1
20
As a first step we can order the sets Ai according to their probability such that
p.A1 / p.A2 / : : : p.An /. Then we can say that A1 is more surprising
than A2 and so on. To quantify the amount of surprise we really get from Ai we
determine the probability pi that d gives us at least as much surprise as Ai does.
This is pi D pd.x/ D A1 _ d.x/ D A2 _ : : : _ d.x/ D Ai . Since d can take only
one value for every ! 2 , this is the sum of probabilities
pi D
i
X
pd D Aj D
j D1
i
X
p.e
Aj /:
j D1
D p d p.A/
D f! 2 W p.d.!// p.A/g
and the description dE by
dE.!/ WD f! 0 W p.d.! 0 // p.d.!//g:
A description d is called directed, if d D dE.
Definition 2.8. A description d is called symmetric, if for every x; y 2 , x 2
d.y/ implies y 2 d.x/.
We can now reintroduce the set-theoretical operations that are defined on
propositions, on the level of descriptions; in particular, we have the natural ordering
and the union and intersection of descriptions:
Definition 2.9. We say that a description c is finer than a description d , or that d is
coarser than c and write c d , if c.!/ d.!/ for every ! 2 .
With this definition we see that the completion dQ of any description d is always
finer than d . This is because for any ! 2 , x 2 dQ .!/ means d.x/ D d.!/ and this
implies x 2 d.x/ D d.!/. Obviously we also have dQ dE, since d.! 0 / D d.!/
implies p.d.! 0 // p.d.!//.
Usually d dE, but dE d is also possible.
21
and
Then
dE .!/ D ;
cE.1/ D f1g;
E
b.!/
D b.!/;
and
t
u
Proof. Obvious.
22
and
d.1/ D f1; 2; 3; 4; 5g;
23
\
fA 2 R.d /W ! 2 Ag:
24
and
If p.e
A/ D 0 for some e
A in the range of dQ, we set p.e
A/ log p.e
A/ D 0 since lim x log x D 0.
x!0C
25
h(p)
0.75
0.5
0.25
0.25
0.5
0.75
1
and S.d / I.d /.
ln 2
then
S.d / D
n
X
.pi pi 1 / log pi
i D1
i D1
Z1
n
X
log2 .x/ dx D
1
ln 2
Zpi
log2 .x/dx
pi 1
t
u
8
Here we rely on the approximation argument as in Sect. 1.3 for the calculation of an expectation
value. If all the finite sum approximations satisfy the same inequality ( ln12 ), then this inequality
is also satisfied by the limit.
26
Proposition 2.12. If c and d are tight descriptions then c d implies I.c/ I.d /.
t
u
From Proposition 2.5 it is obvious that both complete and directed descriptions
are tight. So we have monotonicity of information also on complete and on directed
descriptions.
Novelty and surprise are both smaller than information, but how do they
relate to each other? Usually novelty will be much larger than surprise. For
example, a complete description d with p.d.!// D n1 for every ! 2 has
N .d / D I.d / D log2 n, but S.d / D 0. The following example shows, however,
that N < S is also possible.
Example 2.8. Let .; ; p/ D E16 and9
d W 1 ! f1; : : : ; 11g
6!
14 ! f7; : : : ; 16g
2 ! f1; : : : ; 12g
::
:
7!
::
:
15 ! f8; : : : ; 16g
5 ! f1; : : : ; 15g
13 !
16 ! f9; : : : ; 16g
6!
7!
::
:
13 !
Thus, Nd .!/ < Sd .!/ for ! f6; : : : ; 13g and Nd .!/ D Sd .!/ D 0 for
! 2 f6; : : : ; 13g. Thus N .d / < S.d / in this example.
u
t
It is also quite easy to characterize the extreme cases where two of the three
quantities N , I, and S coincide. This is done in the next proposition.
Proposition 2.13. Let d be a description, then
i) N .d / D I.d / implies d D dQ essentially
ii) S.d / D I.d / implies d
essentially
iii) If d is tight then N .d / D S.d / implies d D dE essentially
Proof. (i) We have N .d / D E.Nd /, I.d / D E.NdQ /, and Nd NdQ . If for some
! 2 , Nd .!/ < NdQ .!/, the same is true for all ! 0 2 dQ .!/, and therefore
9
27
N .d / < I.d /. Thus Nd .!/ D NdQ .!/ and therefore p.d.!// D p.dQ .!//
for every ! 2 . Since dQ .!/ d.!/, this implies that dQ .!/ and d.!/ are
essentially equal.
(ii) Since S.d / D E.NdE /, I.d / D E.NdQ /, and dQ dE we can again infer that
N E D N Q and that dE.!/ D dQ .!/ for every ! 2 . For some ! we have
d
t
u
For further reference we provide a simple comparison of the three basic formulae
to calculate N , I, and S.
Proposition 2.14. For any description d we have
P
p.e
A/ log p.A/,
i) N .d / D
A2R.d /
P
ii) I.d / D
p.e
A/ log p.e
A/,
A2R.d /
iii) S.d / D
E
p.e
A/ log p.A/.
A2R.d /
Proof. Here we definitely need the additional requirement of Definition 2.3. Then
we just have to compute the expectation of a step function with at most countably
many values. Let us show (iii) for example:
S.d / D E.Sd / D
pSd D xx D
x2R.Sd /
E
pd D Alog p.A/:
t
u
A2R.d /
Then
I.d / D p log2 p .1 p/ log2 .1 p/ DW I.p/:
This function is plotted in Fig. 2.4.
t
u
In everyday life the common use of the word information is closely related
with the common use of the word uncertainty. In fact, we expect the information
provided by the description of a phenomenon to be equivalent to the amount of
uncertainty the description eliminates. The concept of uncertainty has been treated
in the contexts of thermodynamics and statistical physics (Brush 1966) and has been
expressed in terms of entropy (a term introduced by Clausius (1865) and defined by
a formula similar to Proposition 2.14.(ii) by Boltzmann 1887). It is one of our aims
to elucidate the relation between entropy and information in depth (see Chap. 14).
For the moment we limit ourselves to observing that the information of a partition
is also called its entropy by many authors.
28
Fig. 2.4 Plot of function
I .p/
I(p)
1
0.75
0.5
0.25
0.25
0.5
0.75
In a nutshell, the words information, novelty, and surprise introduced here can be
distinguished or characterized as follows:
Information you get whether you can use it or not, whether it is interesting or not,
Novelty measures how much of this is new and interesting for you,
Surprise is provided by an event that is comparatively improbable; if everything
is equally improbable, nothing is surprising.
The following proposition characterizes descriptions that provide no information.
Proposition 2.15. The information of a description d is zero if and only if all sets
in the range of its completion dQ have probability zero except for one set e
A with
p.e
A/ D 1. In accordance with our additional requirement in Definition 2.3, this is
equivalent to d
.
Proof. Clearly the condition is sufficient: if p.e
A/ D 1 for one set in the range of
e D 0 for the rest then I.d / D 0. Now, if I.d / D 0 we
dQ and p.B/
must have p.e
A/ log p.e
A/ D 0 for all e
A 2 dQ ./. Therefore p.e
A/ is 0 or 1. Since
P
e
e Q
e
t
u
e
A2dQ./ p.A/ D 1, exactly one of the sets A 2 d ./ must satisfy p.A/ D 1.
This proposition corresponds to the natural expectation that a description
provides no new information if we are certain about the outcome of the situation.
The other extreme case that provides maximal information and corresponds to
maximal uncertainty is attained by descriptions on which the probability measure is
uniformly distributed, as stated by the following proposition.
29
Proposition 2.16. For a fixed n consider all descriptions whose range has n
elements, that is all d on .; ; p/ with d./ D fA1 ; : : : ; An g. I.d / attains a
maximum of log2 n for d D dQ and p.Ai / D n1 for each i .D 1; : : : ; n/. The same is
true for N .d /.
t
u
Z1
p.x/dx D
h
1 i1
x 1 .ln x/2 dx D
D1
ln x e
1
for i 2 N
e
Thus
p.Ai / log2 p.Ai / > p.i / log2 p.i /
> .i C e/1 .ln.i C e//2 log2 .i C e/
D .i C e/1 .ln.i C e//1 .ln 2/1
> .ln 2/
1
i CeC1
Z
i Ce
1
dx:
x ln x
Therefore
I.d / > .ln 2/
1
Z1
1
.x ln x/1 dx D .ln 2/1 ln ln x 1Ce D 1:
t
u
1Ce
30
The first property, already shown in Proposition 2.3 to hold for novelty, is
monotonicity, i.e., the requirement that finer descriptions should have larger novelty,
information, and surprise. Unfortunately, it holds neither for information nor for
surprise, because c d does not imply cQ dQ , nor cE dE. From Example 2.6 we
can easily create an example where c d and N .c/ > N .d /, but I.c/ < I.d /.
However, we get monotonicity of I for tight descriptions because of Proposition 2.6.
Counterexamples against monotonicity of surprise are easy to find. For example,
e X , but
for an equally distributed discrete random variable X , clearly X
e / D 0, whereas S.X / > 0.
S.X
The other important property of classical information is its subadditivity:
I.c \ d / I.c/ C I.d /. This will be shown in the next chapter (Proposition 3.5).
It is quite easy to see that the novelty N as defined here does not have this property,
i.e., in general, N .c \ d / N .c/ C N .d /. An example for this can be obtained
by considering a description d and its complement d c . (See also Exercise 7).) Also
surprise does not have this property (see Exercise 16)). We will see in the next
section that information has this property.
In order to obtain both properties, monotonicity and subadditivity, for both,
information and novelty, the definitions given on the level of descriptions in this
chapter are not sufficient. We have to elevate these definitions to a higher level,
which is done in Part III of the book.
The other possibility is to consider only descriptions with particular properties,
for example, tight or complete descriptions. For complete descriptions, information
and novelty coincide and clearly have both properties. This is the framework of
classical information theory.
31
32
2.8 Exercises
1) For the descriptions given in Examples 2.1, 2.52.7 determine their completion,
their tightening,their novelty, their information, and their surprise.
2) Let D f0; : : : ; 999g and consider the following random variables describing
these numbers:
X.!/ WD first digit of !;
Y .!/ WD last digit of !;
Z.!/ WD number of digits of !; for every ! 2 :
What are the corresponding descriptions, what is the information content of X ,
Y , and Z, and what is the corresponding surprise (assuming equal probabilities
for all thousand numbers)?
3) Measuring the height of a table by means of an instrument with an inaccuracy of
about 1 mm can be described by two different descriptions on D 500; 1500
(these are the possible table heights in mm):
d1 .!/ WD \ ! 0:5; ! C 0:5 and
d2 .!/ WD i; i C 1 for ! 2 i; i C 1/ for i D 500; 501; : : : ; 1499:
What is the completion in these two cases and what is the average novelty,
information, and surprise (assuming a uniform distribution of table heights
on )?
2.8 Exercises
33
B
e\Y
eDX
A
.X; Y / D X
Y:
p.e
A1 /; : : : ; p.e
An1 /. Then compute a local extremum by setting the derivatives
of I.d / to 0.
9) Given a probability space .; ; p/ and the events A; B 2 . We say
A supports B, if p.BjA/ > p.B/
A weakens B, if p.BjA/ < p.B/
If A supports B, which of the relations supports and weakens hold for the
following expressions?
a)
b)
c)
d)
A and B c
B and A
Ac and B c
B c and A
x
2
34
14) Determine all complete, all directed, and all tight descriptions on D f1; 2; 3g.
15) Let D f1; : : : ; 8g. Let c.i / D f1; 2; 3; 4g for i D 1; 2; 3; 4, and c.i / D for
i D 5; 6; 7; 8, and d.i / D f2; : : : ; 6g for i D 2; : : : ; 6, d.i / D for i D 1; 7; 8.
Calculate N , S, and I for c, d , and c \ d .
16) Let D f1; : : : ; 6g, c.1/ D c.2/ D f1; 2g, c.i / D for i D 3; : : : ; 6, and
d.1/ D d.6/ D , d.i / D f2; : : : ; 5g for i D 2; : : : ; 5. Calculate N , S, and I
for c, d , and c \ d .
References
Ash, R. B. (1965). Information theory. New York, London, Sidney: Interscience.
Billingsley, P. (1978). Ergodic theory and information. Huntington, NY: Robert E. Krieger
Publishing Co.
Chapter 3
This chapter introduces some more involved versions of the concepts of novelty
and information, such as subjective and conditional novelty.
35
36
1
8
D 3;
1
8
220 / 0:1926
and
37
For a discrete random variable X and two probabilities p and q, we define the
subjective information of X as
e /;
Npq .X / WD Npq .X
the information gain as
e/
Gpq .X / WD Gpq .X
and the subjective surprise as
Spq .X / WD Npq .X /
Remark: For d D e
d , Gpq .d / is also called the information gain or the Kullback
Leibler distance between p and q (with respect to d ) (Kullback 1959, 1968;
Kullback and Leibler 1951).
Proposition 3.1. Gpq .d / 0 for any complete description d .
Proof. The assertion Gpq .d / D Ep . log.q d / C log.p d // 0 is independent
from the base of the logarithm. Since the proof is the least clumsy for the natural
logarithm, which satisfies the simple inequality ln x x 1, we use it here.
Gpq .d /
D
D
q.d.!//
Ep log
p.d.!//
X
q.D/
p.D/ log
p.D/
D2R.d /
ln x x 1
X
D2R.d /
D2R.d /
q.D/
1
p.D/
p.D/
q.D/
p.D/ D 1 1 D 0:
D2R.d /
Thus Gpq .d / 0. Here the summation extends over all propositions D in the range
R.d / of d .
t
u
Remark: The inequality ln x x 1 is very important for information theory; it can
be used to prove most of the interesting inequalities.
The following example shows that Gpq .d / can be negative for descriptions d that
are not complete.
38
p.A/
< 0:
q.A/
t
u
p.d.!// q.c.!//
log
q.d.!// p.c.!//
p.C / log
C 2R.c/
X
D2R.d /
p.d.!// q.C /
q.d.!// p.C /
X p.C /
p.D/ q.C /
log
p.D/
p.D/
q.D/ p.C /
C D
!
X p.C / p.D/ q.C /
1
p.D/
p.D/ p.C / q.D/
D
C D
!
X
X p.C /
X q.C /
D
D0
p.D/
q.D/ C D p.D/
D
C D
t
u
D0
conditional expectation,
NA .d / D N .d jA/
conditional novelty,
SA .d / D S.d jA/
IA .d / D I.d jA/
conditional information
39
3
4
and p.f2g/ D 14 .
c.1/ D f1g; c.2/ D f1; 2g; d.1/ D f1g and d.2/ D f2g:
Then we have R.c/ D fC1 ; C2 g D ff1g; f1; 2gg; R.d / D fD1 ; D2 g D ff1g; f2gg,
e1 ; C
e 2 g D ff1g; f2gg D fD
e 1; D
e 2 g D R.e
and correspondingly R.e
c / D fC
d /.
Now we can calculate
N .d jc/ D
2
2 X
X
ei \ D
e j / log2 p.Dj jCi / D : : : D
p.C
i D1 j D1
N 0 .d jc/ D E! .Sc.!/ .d // D
2
X
1
2
and
e i /ECi .sCi .d //
p.C
i D1
D
2
X
i D1
ei /
p.C
2
X
j D1
3
1
log2 3:
2 16
t
u
40
C 2c./
C 2c./
t
u
Proof. Obvious.
The first equation in this proposition implies is called the additivity of novelty.
In classical information theory, monotonicity and subadditivity are the most
important properties of information. Since for complete descriptions information
and novelty coincide with the classical concept of information, both measures have
these two properties on complete descriptions, but for general descriptions novelty
is monotonic (Proposition 2.3) and not subadditive (Exercises 2.7) and 3.5) on
page 45), whereas information is subadditive (cf. the following Proposition 3.5)
but not monotonic (cf. the following example). The last assertion is quite obvious
because c d does not imply e
c e
d (see Example 2.6 and the subsequent
discussion). The subadditivity of information is the subject of the next proposition.
Example 3.3. Consider D f1; : : : ; 32g with equal probabilities on , i.e.,
.; ; p/ D E32 . Define
a.1/ D f1g;
a.32/ D f32g; and for all other i 2
a.i / D f2; : : : ; 31g:
and define
b.i / D f1; : : : ; 31g for i D 1; : : : ; 16;
b.i / D f2; : : : ; 32g for i D 17; : : : ; 32:
Then a b but I.a/ I.b/.
Another example was given in Example 2.4.
t
u
41
Proposition 3.5.
I.c \ d / I.c/ C I.d /:
and
I.c \ d / D I.c/ C I.d /
if and only if p .c \ d /.!/ D p.e
c.!// p.e
d .!// for (almost) every ! 2 .
d / D fD1 ; : : : ; Dn g. We have e
c \e
d
Proof. Let R.e
c / D fC1 ; : : : ; Cn g and R.e
c \ d . Thus
c \e
d / N .e
c / C N .e
d /;
I.c \ d / D N .c \ d / N .e
because
N .e
c \e
d / N .e
c / N .e
d/ D
k
n X
X
i D1 j D1
n
X
i D1
k
n X
X
p.Ci \ Dj / log2
i D1 j D1
k
X
j D1
p.Ci /p.Dj /
p.Ci \ Dj /
p.Ci /p.Dj /
1
1
p.Ci \ Dj /
ln 2
p.Ci \ Dj /
i D1 j D1
0
1
k
k
n
n X
X
1 @X X
D
p.Ci /p.Dj /
p.Ci \ Dj /A
ln 2 i D1 j D1
i D1 j D1
k
n X
X
D 0:
x1
ln x
.
ln 2
ln 2
In order to obtain equality in both inequalities in this proof, we need that
p.C
i \ Dj / D
p.Ci/ p.Dj / for
every i and j (with Ci \ Dj ;) and that
c.!/ \ e
d .!/ for every ! 2 .
t
u
p .c \ d /.!/ D p e
The inequality holds because log2 x D
42
t
u
Together with Proposition 2.6 we have now shown that on tight descriptions
the information I has all the classical properties: monotonicity, subadditivity, and
additivity. The novelty N , however, lacks subadditivity.
Proposition 3.7. Let c and d be two descriptions and R.c/ finite. Then
I.d jc/ D
p.A/I.d jA/:
A2R.e
c/
t
u
Definition 3.4. For two descriptions c and d we define their mutual novelty as
M.c; d / WD N .c/ C N .d / N .c \ d /
and their transinformation1 as
T .c; d / WD I.c/ C I.d / I.c \ d /:
Proposition 3.8. Let c and d be descriptions. Then
i) T .c; d / 0,
ii) M.c; d / D N .c/ N .cjd / D N .d / N .d jc/,
iii) If c and d are tight, then T .c; d / D I.c/ I.cjd / D I.d / I.d jc/.
Proof. (i) From Proposition 3.5,
(ii) From Proposition 3.4,
(iii) From Proposition 3.6.
t
u
43
ii)
iii)
iv)
v)
vi)
e\Y
e.
and we have .X; Y / D X
I.X; Y / I.X / C I.Y /
X 4 Y implies I.X / I.Y /
I.X; Y / D I.X / C I.Y jX /
0 I.Y jX / I.Y /
X 4 Y implies I.X jZ/ I.Y jZ/ and I.ZjX / I.ZjY /
(subadditivity)
(monotonicity)
(additivity)
t
u
t
u
44
t
u
The concept of mutual information can also be defined for three and more random
variables. We define T .X; Y; Z/ WD T .X; Y / T .X; Y j Z/. This definition is
again symmetric in X; Y; and Z, because
T .X; Y; Z/ D I.X / C I.Y / C I.Z/
I.X; Y / I.Y; Z/ I.X; Z/
C I.X; Y; Z/:
T .X; Y; Z/ can have both positive and negative values in contrast to T .X; Y /
(see Exercise 3.15)). This definition can of course be extended to more than three
variables, see, for example, Bell (2003); Attneave (1959).
3.6 Exercises
45
3.6 Exercises
1) What is T .X; Y /, T .Y; Z/, T .Z; X / for the three random variables X , Y , and
Z defined in Exercise 2.2).
2) Compute N .a/, N .b/, I.a/, I.b/, M.a; b/, and T .a; b/ for Example 3.3 on
page 40.
3) Compute T .c; d / for Example 2.6 on page 21 and T .c; d /, T .c; e/, T .d; e/
for Example 2.1 on page 16.
4) Given two dice, i.e., D f1; : : : ; 6g
f1; : : : ; 6g. Let X1 D first dice, X2 D
second dice, D D doublets, i.e.,
(
D.i; j / D
1 if i D j ,
0 otherwise.
Determine T .X1 ; X2 /, T .X1 ; D/, T .X2 ; D/, T .X1 ; .X2 ; D//, T ..X1 ; X2 /; D/.
5) Find examples that show that in general N .c \ d / N .c/ C N .d /,
i.e., M.c; d / 0, is false.
6) In a container there are w white and r red balls. Two balls are drawn
consecutively. What is the uncertainty or novelty of the outcome of the first
drawing as compared to the (conditional) uncertainty of the second drawing?
7) There are 12 balls of equal size, but one of them is slightly heavier or lighter
than the others. You have a simple balance scale with two pans and are allowed
to put any number of balls on either pan in order to find out the odd ball and
the direction of its weight difference to the others. How many weighings (each
with three possible outcomes) do you need? How can information theory help
in answering that question?
8) Show Proposition 3.9 on page 43.
9) Show the following.
Proposition 3.12. i) I.X jY; Z/ C I.Y jZ/ D I.X; Y jZ/,
eY
e implies I.X jZ/ I.Y jZ/ and I.ZjX / I.ZjY /,
ii) X
iii) I.X jY; Z/ minfI.X jY /; I.X jZ/g.
10) Prove or refute each of the following statements:
a) T .X; Z/ max.T .X; Y /; T .Y; Z//
b) T .X; Z/ T .X; Y / C T .Y; Z/
c) T ..X; Z/; Y / T .X I Y / C T .Y I Z/.
11) There are three different 40-Cent coins, each of these coins consisting of two
20-Cent coins glued together (head-to-head, head-to-tail, or tail-to-tail). One
of these three coins is drawn randomly and placed on a table. The random
variables X and Y denote the top and bottom side of the drawn coin and
can take the values h (head) or t (tails). Describe this experiment by
means of a probability space. Compute pX D Y , pX D t, pX D Y jY D t,
pX D tjY D h and T .X; Y /.
46
d.3/ D f1; 3; 5g
d.6/ D f6g:
2 7! 3
3 7! .3; 5/
4 7! .2; 4/
5 7! 2
6 7! .1; 5/
References
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. AMS and Oxford University
Press.
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart
and Winston.
References
47
Part II
Chapter 4
51
52
Is it odd?
yes
Second question:
no
Is it 1?
Third question:
Is it 2?
yes
no
yes
no
Is it 3?
Is it 4?
yes
no
yes
no
Second question:
Third question:
no
Is it card 1?
Is card 4 or 5 an ace?
yes
no
yes
no
Is it card 2?
Is it card 4?
Is it card 6?
yes
no
yes
no
yes
no
7,8
A similar example is the following: You are shown a deck of 8 cards containing
2 aces. The deck is shuffled and the cards are put on the table (face down). Your task
is to find one ace by asking yesno-questions. How many questions do you need?
This task turns out to be much harder. In fact, the theory provided in this chapter
does not solve it, nor does classical information theory. We will return to it in
Chap. 10. A reasonable strategy for this task could be the following (see Fig. 4.2):
This strategy needs three questions, except when card 1 is an ace; in this case it
needs two. The probability for this is 14 . So on average this strategy needs 14 2 C
3
3 D 2 34 questions.
4
Is there a better strategy? The answer will be given in Chap. 10. There are good
reasons why the best strategy should need between 2 and 3 questions, but it may be
surprising that the result is closer to 3. After all, with 3 questions one can find 1 ace
53
among 8 cards. Adding another ace to the deck doubles the probability of hitting an
ace (from 18 to 14 ), so should we not be able to do it in 2 questions? This proves to be
impossible after a few trials. Another argument for an information content of 2 goes
as follows. Let us assume one of the aces is black and one red. In order to determine
the color of the ace we have found, we need one additional bit of information. To
find out one ace plus its color (e.g., the red ace) we need 3 questions, i.e., 3 bits
of information. So the information content of the localization of one ace should be
3 1 D 2 bits (if information is additive). Well, it turns out that in this problem
information is not additive.
k
X
Li p.f! 2 W d.!/ D Ai g/ D
i D1
k
X
p e
Ai Li ;
i D1
where e
Ai D d D Ai as defined in Definition 2.6.
More generally, assume that we want to guess the values of a random variable
X W ! A, where A D fa1 ; : : : ; an g. Let pX D ai D pi . We may summarize
this situation in the scheme
a1 : : : an
:
p1 : : : pn
Again we denote by Li the number of questions needed (in a certain guessing
strategy) to determine ai , then the average number of questions needed is
E.L/ D
n
X
pi Li :
i D1
A fixed guessing strategy starts with a fixed first question. Then, depending on
the first answer (yes or no), there are two alternative fixed second questions. Again
in every possible case after the second answer, there is a fixed third question, and so
on. A useful way of picturing such a guessing strategy and its outcomes is by means
54
0
0
000
1
01
1000
1010
1110
001
of a tree (Fig. 4.3). It has a certain number of levels; at each level (l D 1; : : : ; k), the
number b.l/ of branches corresponds to the number of different cases after the lth
question. It is clear that b.l/ 2l for yesno questions, and b.l/ will indeed usually
be smaller, because whenever one value ai of X is determined (at level Li ) we stop
asking questions and therefore cut off the subsequent branches at higher levels. The
number of possible branches at levels l > Li that are thus cut off are 2lLi . The
n
highest level k in the tree is, of course, k D max Li .
At the level k,
n
P
i D1
kLi
i D1
n
P
i D1
n
P
2Li 1. This
i D1
1
[
i D1
Bi ;
55
It should be noted that the codes c that occur in this correspondence to guessing
strategies have a particulary nice property: they are irreducible (or prefix-free).
Definition 4.2. A code cW A ! B is called irreducible, if there are no two a; a0
in A such that c.a/ is a beginning (or prefix) of c.a0 /. A codeword b is called a
beginning of b 0 , if l.b/ l.b 0 / and bi D bi0 for i D 1; : : : ; l.b/.
Example 4.1. For A D fa; b; c; d; e; f; gg and B D f0; 1g, consider the two codes
c and c 0 defined by
c.a/ D 0;
c.b/ D 10;
c.c/ D 1100;
c.e/ D 1110;
c.f / D 11110;
c.g/ D 11111
c 0 .a/ D 0;
c 0 .b/ D 10;
c 0 .c/ D 110;
c 0 .e/ D 1110;
c 0 .f / D 0101;
c 0 .g/ D 1111:
c.d / D 1101;
and
c 0 .d / D 1101;
t
u
a1 a2 : : : an
p1 p2 : : : pn
we may ask for an optimal irreducible 0-1-code for p, i.e., an irreducible code
cW f1; : : : ; ng ! f0; 1g with minimal average length L.c/. We define L.p/ as the
average length L.c/ of this code.
56
Proof. .i / .i i /: is obvious.
.i i / .i i i /: Let us repeat the proof of this fact in the language of coding: Let
n
c be an irreducible code for A with L1 ; : : : ; Ln . Let k D max Li .
i D1
kLi
i D1
Therefore
n
X
i D1
# f0; 1gk D 2k :
n
P
n
[
!
Mk .c.ai //
i D1
2Li 1.
i D1
.i i i / .i i /: Assume that L1 L2 : : : Ln .
We select an arbitrary codeword c.a1 / of length L1 for a1 . Then
we select an arbitrary codeword c.a2 / of length L2 , which does
not have c.a1 / as a beginning. This means that the sets ML2 .c.a1 //
and ML2 .c.a2 // defined above, have to be disjoint. We repeat this
procedure until we select c.an /. It will work as long as in every
step j there is a suitable codeword left, i.e., a word of length Lj
that has none of the words c.a1 /; : : : ; c.aj 1 / as a beginning. But
jP
1
jP
1
2Lj Li , i.e., 1 >
2Li ,
this is the case as long as 2Lj >
i D1
i D1
t
u
57
t
u
a1 : : : an2
an1
p1 : : : pn2 pn1 C pn
where an and an1 have merged into an1 . The reduced scheme may now be
reordered so that the probabilities in the lower row decrease from left to right. We
continue this procedure with the same recipe until n D 1. The code obtained by
means of this procedure is called the Huffman code (cf. Huffman 1952).
Lemma 4.1. If c is an optimal code for a scheme with probabilities .p1 ; : : : ; pn /
that has the corresponding codeword lengths L1 ; : : : ; Ln and pi <pj , then Li Lj .
Proof. Otherwise one would get a better code c 0 by choosing c 0 .ai / D c.aj /,
c 0 .aj / D c.ai / and c 0 .ak / D c.ak / 8 k i; j :
L.c/ L.c 0 / D
n
X
rD1
0
pr Lr @
1
pr Lr C pi Lj C pj Li A
ri;j
D pi Li C pj Lj pi Lj pj Li
D .pj pi /.Lj Li / > 0;
if Lj > Li :
t
u
58
Theorem 4.2. The Huffman code h is an optimal irreducible code for the scheme
a1 : : : an
p1 : : : pn
;
()
if .p1 ; : : : ; pn / was ordered such that pn1 and pn are the two smallest elements of
the vector.
Proof. By induction on n: Let c be an optimal code for p, then by Lemmas 4.1
and 4.2 we can assume that the codewords for an1 and an are among the longest
codewords and (possibly by exchanging longest codewords) that they are of the form
c.an1 / D w0 and c.an / D w1.
Thus .c.a1 /; : : : ; c.an2 /; w/ D c 0 is a code for .p1 ; : : : ; pn2 ; pn1 C pn / and
L.c/ D L.c 0 / C pn1 C pn . Clearly c 0 has to be optimal, because otherwise it could
be shortened leading also to a shorter code for p. This proves ().
By construction, the Huffman codes h also fulfill (), i.e.,
L h.p1 ; : : : ; pn / D L h.p1 ; : : : ; pn2 ; pn1 C pn / C pn1 C pn ;
t
u
n
X
i D1
pi log2 pi
n
X
i D1
n
P
i D1
pi D 1 and
pi log2 qi :
n
P
i D1
qi 1
59
Proof. This statement has essentially been proved in Proposition 3.1. As in Proposition 3.1, we use the properties of the natural logarithm in the proof. Our assertion
follows from the fact that
n
X
pi ln pi C
i D1
n
X
pi ln qi D
i D1
n
X
i D1
qi
pi
n
X
n
X
pi ln
qi
i D1
ln x x 1
n
X
i D1
pi .
qi
1/
pi
pi 0:
t
u
i D1
If we now take qi D 2Li in Proposition 4.1, then Theorem 4.1 holds for the
lengths Li and we see that
I.X / D
n
X
pi log2 pi
i D1
n
X
pi log2 qi D
i D1
n
X
pi Li D E.L/:
i D1
Li
i D1
n
X
2. log2 pi / D 1;
i D1
n
X
pi Li D
i D1
n
X
pi .d log2 pi e/
i D1
n
X
pi . log2 pi C 1/ D I.X / C 1:
i D1
Thus the optimal guessing strategy (or code) has an average number of questions
E.L/, which is close to I.X /, and we have shown the following
Proposition 4.2. If L is the number of questions needed in an optimal guessing
strategy for X , then
I.X / E.L/
n
X
i D1
t
u
60
4.8 Exercises
1) Try to find an optimal guessing strategy for Example 4.2 and draw the
corresponding tree.
2) Given a deck of 16 cards, among them 4 aces. Let Ai D [the i th card is the first
ace,
from the top of the deck]. Find an optimal code for the scheme
A1 :::Acounted
16
with
the corresponding probabilities.
p1 :::p16
3) Given a dice, i.e., D f1; : : : ; 6g with equal probabilities, find an optimal code
for it.
4) Consider the information I as a function on probability vectors
n
P
p D .p1 ; : : : ; pn / with pi 0 and
pi D 1 defined by
i D1
I.p/ D
n
X
pi log2 pi :
i D1
4.8 Exercises
61
1
1
n; : : : ; n
D log2 n
6:51 %
1:89 %
3:06 %
5:08 %
17:40 %
1:66 %
3:01 %
H
I
J
K
L
M
N
4:76 %
7:55 %
0:27 %
1:21 %
3:44 %
2:53 %
9:78 %
O
P
Q
R
S
T
U
2:51 %
0:79 %
0:02 %
7:00 %
7:27 %
6:15 %
4:35 %
V
W
X
Y
Z
0:67 %
1:89 %
0:03 %
0:04 %
1:13 %
8) Can it be that there are two optimal codes with different codewordlengths for
the same probability vector p?
9) A game with 5 (or 6) uniformly distributed outcomes is played repeatedly. The
result shall be transmitted binary with a maximum of 2.5 bit available for each
result. For which n 2 N exists a proper n-tuple code?
10) 6 cards (3 aces, 3 kings) are placed side by side on a table in random order. One
wants to find the position of an ace. Determine an optimal guessing strategy for
this purpose.
11) Answer the following questions:
1) Is there a binary code with six codewords of length 1,2,2,2,2, and 2?
2) Is there a binary code with six codewords of length 1,3,3,3,3, and 3?
3) Is there a prefix-free binary code with six codewords of length 1,3,3,3,3,
and 3?
4) Is there a prefix-free binary code with six codewords of length 2,3,3,3,3,
and 3?
1
Modified from Beutelspacher, A. (1993). Kryptologie. Friedr. Vieweg & Sohn Verlagsgesellschaft
mbH, Braunschweig/Wiesbaden.
62
References
Bose, R. C., & Ray-Chaudhuri, D. K. (1960). On a class of error correcting binary group codes.
Information and Control, 3, 6879.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications.
New York: Wiley.
Hamming, R. V. (1950). Error detecting and error correcting codes. Bell Systems Technical Journal,
29, 147160.
Huffman, D. A. (1952). A method for the construction of minimum redundancy codes. Proceedings
of the IRE, 40, 10981101.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Chapter 5
Information Transmission
This chapter introduces the concept of a transition probability and the problem of
guessing the input of an information channel from observing its output. It gives a
first idea on the classical results of Shannon, without introducing the technicalities
of stationary stochastic processes and the proof of Shannys Theorem. This material
is provided in the next three chapters. Since it is not necessary for the understanding
of Parts IV, V, and VI, one can move directly to Part IV after this chapter.
pX D 6; E1 D 1; E2 D 1
pE1 D 1; E2 D 1
63
64
5 Information Transmission
1
0:9 0:8
6
D5
5
1
0:9 0:8 C 0:1 0:2
6
6
0:72
0:72
D5
D5
>4
0:72 C 5 0:02
0:82
E.W jE1 D 0; E2 D 0/ D 5
pX D 6; E1 D 0; E2 D 0
pE1 D 0; E2 D 0
0:1 0:2
0:1 0:2 C 5 0:9 0:8
0:02
< 0:01
D5
3:62
D5
E.W jE1 D 0; E2 D 1/ D 5
pX D 6; E1 D 0; E2 D 1
pE1 D 0; E2 D 1
0:1 0:8
0:1 0:8 C 5 0:9 0:2
0:40
1
0:08
D
<
D5
0:98
0:98
2
D5
E.W jE1 D 1; E2 D 0/ D 5
pX D 6; E1 D 1; E2 D 0
pE1 D 1; E2 D 0
0:9 0:2
0:9 0:2 C 5 0:1 0:8
0:45
0:18
D
> 1:5
D5
0:58
0:29
D5
The most interesting cases are those of conflicting expert opinion. Our result is
that one should rely on the better of the two experts, which seems obvious.
This case can also be used to show the difference between maximum likelihood and maximum Bayesian probability. Here we ask the following question:
Given the two conflicting expert opinions, is it more likely that X D 6 or that
X 6? The correct (and Bayesian) interpretation of this question compares
pX D 6jE1 ; E2 with pX 6jE1 ; E2 . A more sloppy interpretation of this
question might compare pE1 ; E2 jX D 6 with pE1 ; E2 jX 6. In fact, the two
interpretations may lead to different results. Here it happens for E1 D 1; E2 D 0.
To see this we can compute the so-called likelihood-ratio
65
0:9 0:2
0:18
pE1 D 1; E2 D 0jX D 6
D
D
>1
pE1 D 1; E2 D 0jX 6
0:1 0:8
0:08
and compare it with the Bayesian probability ratio
pX D 6jE1 D 1; E2 D 0
pX D 6; E1 D 1; E2 D 0
1 0:18
0:18
D
D
D
< 1:
pX 6jE1 D 1; E2 D 0
pX 6; E1 D 1; E2 D 0
5 0:08
0:4
This result means that in the case of conflicting values of our experts E1 and E2 , the
most probably correct guess on X is that X 6 rather than X D 6. However, we
may still bet successfully on X D 6.
This definition is correct for finite sets A. In general, there are some more technical requirements
concerning measurability (see Bauer 1972 for example) which we do not mention here.
66
5 Information Transmission
Definition 5.2. For two random variables X and Y with values in A and B
respectively, and a transition probability P W A
B, we write P W X
Y and
say that X is transmitted into Y by P , or P links Y to X , if for every a 2 A and
every (measurable) M B
pY 2 M jX D a D pa .M /:
This definition essentially means that the two random variables X and Y are
linked through the transition probability P .
The knowledge of the channel mechanism, i.e., the probabilities pa for each
a 2 A makes it possible to determine the joint probabilities
pab D prX D a; Y D b D prX D a pa .b/ D q.a/pa .b/;
given a probability distribution q on A.
From these joint probabilities one can also obtain the corresponding probability
distribution on B by
X
prY D b D
prX D a; Y D b:
a2A
67
0
68
5 Information Transmission
Unfortunately, not all examples work out as nicely and so one eventually needs a
rather complicated theory that again makes use of sequences of inputs and outputs
in a similar way as we just remarked after Proposition 4.2. With these techniques
one can eventually prove that (up to an arbitrarily small error probability) one can
indeed get as many bits of information safely (or with very high fidelity) through a
channel as is computed by maximizing the transinformation.2 This quantity is called
the channel capacity.
Definition 5.3. Let P W A
Y g:
Thus the maximum is taken over all input random variables X with values in
A and the output Y is linked to X by P . For A D a1 ; : : : ; an we could as well
maximize over all probability vectors p D .p1 ; : : : ; pn / where pi D p.ai / D
pX D ai . The maximization can then be carried out by the ordinary methods of
real analysis (Exercises 9 and 10).
The meaning of this definition of capacity is quite easy to understand in view of
the remarks at the end of Chap. 4. The transinformation T .X; Y / D I.X / I.X jY /
measures the amount of uncertainty about X that is removed by knowing Y , or the
amount of information about X provided by Y , or the average reduction in the
number of yesno questions needed to guess the value of X , that one gets from
knowing the value of Y . In simple words: T .X; Y / measures, how much the channel
output Y says about the channel input X . And the channel capacity is the maximum
that the channel output can say about the channel input for an appropriately chosen
input.
Definition 5.4. Let P W A
B and QW B
C be two transition probabilities
we define their composition R D Q P as the transition probability RW A
C
given by
X
ra .c/ WD
pa .b/ qb .c/:
b2B
69
pX D a pY D bjX D a
pX D a; Y D b
D
;
pY D b
pY D b
pX D a jY D b > pX D ajY D b
pX D a pa .b/ > pX D a pa .b/:
Thus one has to maximize simply the forward transition probability pa .b/,
weighted by the so-called a-priori probability pX D a. And in many practical
cases the a priori probabilities are all equal and can also be left out.
Definition 5.5. Given a transition probability P W A
B (A and B finite sets), an a
priori probability p for A and an output b 2 B, then a D G.b/ is called a Bayesian
guess for b, if p.a / pa .b/ p.a/ pa .b/ for every a 2 A.
The mapping GW B
A can also be regarded as a channel and one can now
consider the combined channel Q D G P W A
A which is defined by the
transition probabilities qa .A0 / WD pa .G 1 .A0 // for any a 2 A and any A0 A.
A good measure for the fidelity of the whole procedure of transmitting a message
in A through a channel and guessing it from the channels output is the error
probability, which is defined as e D pG.Y / X , where again P W X
Y . It
can be calculated as follows:
XX
eD
pX D a pa .b/ 1G.b/a :
a2A b2B
If we observe that
a2A
eD
X
X
pY D b pX D G.b/ pG.b/ .b/ D 1
pX D G.b/ pG.b/ .b/:
b2B
b2B
Clearly this error probability is not the same as the average error e introduced
in Chap. 5.2 because it depends on the probabilities of the input variable X . The
error probability of the combined channel from A back to A is also related to
the transinformation of this channel, because, if the error probability is low, the
transinformation has to be high. Clearly this also means that the transinformation of
the original channel P was high. This is made explicit in the following proposition.
Since T .X; Y / D I.X / I.X jY /, high transinformation means that I.X jY / has
to be close to zero. So we consider the relation between I.X jY / and e.
70
5 Information Transmission
b2B
pY D b eb .
(ii) We first consider the case that Y D b and consider I.X jY D b/ D Ib .X /
which is just the information of X for the probability pY Db and denoted as
e /, i.e., NY Db .X
e / D Ib .X /.
NY Db .X
P
From Proposition 3.7 we know that I.X jY / D
pY D bI.X jY D b/.
b2B
Now
Ib .X / D Ib .X; Z/ D Ib .Z/ C Ib .X jZ/
D eb log2 eb .1 eb / log2 .1 eb /
e / C .1 eb / NY Db;X DG.b/ .X/:
e
C eb NY Db;X G.b/ .X
The first equality holds because knowing the values of both X and Y , we also
know Z.
The last term on the right is zero because we know X D G.b/. It is quite easy
to estimate the second-last term since for a G.b/, we have
pX D ajY D b pX D G.b/jY D b D 1 eb
and so
pX D ajX G.b/; Y D b
1 eb
eb
.
e / log2 . 1eb /. We use this estimate only
This implies that NY Db;X G.b/ .X
eb
for eb > 12 . For eb 12 we obtain
Ib .X / eb log2 eb .1 eb / log2 .1 eb / 2eb
and also for eb >
1
2
we get
1 e
b
eb
Thus I.X jY / D
71
pY D bIb .X / 2e.
(i) We have
I.X jY / D I..X; Z/jY / I.X; Z/
D I.Z/ C I.X jZ/
e log2 e .1 e/ log2 .1 e/ C e log2 n:
t
u
We end this chapter with a few remarks on the practical computation of the
channel capacity. A case that is quite common in applications is a so-called
uniformly disturbing channel. We call a transition probability P W A
B uniformly
disturbing, if all conditional probabilities pa (for every a 2 A) on B have the same
information I.Pa /, i.e., if I.Pa / D c for all a 2 A. Such a channel produces
the same amount of uncertainty on its output for every input a. For a uniformly
disturbing channel P , it is easy to compute c.P / because P W X
Y implies
T .X; Y / D I.Y / I.Y jX / D I.Y / c. So
c.P / D maxfI.Y /W P W X
Y; X W ! Ag c:
t
u
In many cases I.Y / can be maximized quite easily and often the maximal
possible value of I.Y /, namely I.Y / D log2 #.B/ can be obtained. This is the
case for example for so-called symmetric channels, where one has to choose X
equally distributed to make also Y equally distributed yielding I.Y / D log2 #.B/.
Another case occurs when the matrix P is invertible, so that one can compute the
distribution for X that makes Y equally distributed. In these cases one simply gets
c.P / D log2 #.B/ c.
The next three chapters need a little more mathematics on stochastic processes
than the rest of the book; they may be skipped because they are not necessary for
the understanding of the subsequent chapters.
72
5 Information Transmission
5.5 Exercises
1) A lonesome islander has tried to repair the telegraph from Exercise 3.14) so
that he can play dice with his neighbors again. Since he has never learned to
fix electronic devices, the attempted repair failed. Now the telegraph shows the
following transmission behaviour:
1 7! 6
2 7! .2; 4; 6/
3 7! .3; 5/
4 7! .2; 4/
5 7! .2; 6/
6 7! .1; 3; 5; 6/
1p p
:
p p1
Here p stands for the error probability of the channel. Calculate the channel
capacity. What is the result of T .X; Y / for p D 0:17?
Hint: Use T .X; Y / D I.Y / I.Y jX /.
3) The channel from Exercise 2) with error probability p D 0:17 shall now be
used to transmit symbols. What error probability per bit results from optimal
guessing by the use of gW f0; 1g3 ! f0; 1g? Determine the transinformation
T .X; g.Y //, whereas X and Y denote the sent and received bit, respectively.
4) Let P be the symmetric binary channel from Exercise 2). Calculate the
transition probability of channel Q that results from applying the channel P
twice.
5) Consider three discrete random variables X; Y; and Z. X and Y are called
independent given Z, if for every a; b; and c
pX D a; Y D bjZ D c D pX D ajZ D c pY D bjZ D c:
Show that T .X; Y / T .X; Z/ if X and Y are independent given Z.
5.5 Exercises
73
a
0
b
0
2
p
1
p
7) 8 shells are placed on a street. Underneath two of the shells there lies
1 e, respectively. Someone points at one of the shells and says that there is
a euro lying underneath it.
This situation can be modeled with the probability space D f!
f1; 2; : : : ; 8gW j!j D 2g with D P./ and the equipartion (uniform
distribution) p.!/. How big is p.!/ for ! 2 ? We define the random
variable Xk WD 1f!Wk2!g for k D 1; 2; : : : ; 8. Calculate I.Xk /, I.X1 ; X3 / and
I.X1 ; X2 ; X3 /.
Let the description dk .!/ D f! 0 W k 2 ! 0 g if k 2 !, and dk .!/ D if k !.
How big are N .Xk / and I.dk /?
8
T
Next, we define d.!/ WD
dk .!/. How big is N .d /?
kD1
74
5 Information Transmission
842
B3 8 4
B
1 B
B1 3 8
B
P D
16 B 0 1 3
B
@0 0 1
011
11
10
31
83
48
24
1
0
0C
C
C
0C
C:
1C
C
3A
8
X D 6
if x D 6;
X
if x 6:
11) In a game show one can bet 10 e on wether a blindly cast dice shows a six or
not. If the guess is correct one wins 60 e, otherwise one loses the 10 e.
You happen to know a staff member of the show, who reveals insider information to you. Before the show is broadcast, you come to know if the dice is
showing a six or not. Via the channel given in Exercise 10) your informant sends
1 if the outcome is not a six, otherwise he sends a 6. Actually, he had used
the channel before to transmit the cast number, but has decided that it is better
to just send a 1 or a 6. After receiving the information you can decide whether
to take part in the game or not. How does this information need to be evaluated
for maximizing profit? How much is the average profit per show?
Reference
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart
and Winston.
Fano, RM. (1961). Transmission of Information: A Statistical Theory of Communication. Wiley,
New York
Part III
Chapter 6
This chapter briefly introduces the necessary concepts from the theory of stochastic
processes (see for example Lamperti 1977; Doob 1953) that are needed for a proper
definition of information rate and channel capacity, following Shannon.
The purpose of Part III is a compact development of the main results of classical
information theory, including Shannons theorem. I believe that the use of the
e as defined in Chap. 2 simplifies
concept of a description and in particular X
the notation and perhaps also the understanding a bit. These results need the
terminology of stochastic processes. For this reason they are usually regarded as
technically demanding and not included in introductory textbooks such as Topse
(1974). Part III can be skipped by experts who already know the classical results on
channel capacity and by beginners who want to understand basic information theory
and the new concepts of novelty and surprise introduced in this book.
77
78
Fig. 6.1 Channel example
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
The transinformation of the channel C is 2:6 bit1 . Sending the digits twice would
4
result in an error probability of 160
at most, because only two transmission errors
would lead to a wrong guess at the channel output. This yields a transinformation of
3:10245 bit.2 This means a bit-rate of 3:10245=2 D 1:55123 bits per channel use.
The method of coding two digits in three symbols which are transmitted reliably
yields a bit-rate of 23 log 10 D 2:2146. The last method, coding 9 digits in 13
9
symbols yields a bit-rate of 13
log 10 D 2:2998.
All these bit-rates stay below the capacity of the channel C. In this section
we work towards the demonstration that for sufficiently long sequence-codes it is
possible to achieve bit-rates close to the capacity with small error probabilities. To
this end we have to consider sequences of random inputs and outputs at the channel
under consideration. These sequences are called stochastic processes.
8
2
c D log2 10 C 10
log2 8 C 10
log2 2 log2 10 D 2:6 bit.
68 16
1
16
1
2
bit-rate D log2 10 C 100
log2 17
C 17
log2 17
D log2 10 C
17
log2 10 0:322757 0:68 D 3:10245.
1
68 64
100 17
log2 17 D
79
n
Y
pXi 2 Ai D
i D1
n
Y
n
Y
pX1 2 Ai
i D1
pXt Ci 2 Ai
i D1
D pXt C1 2 A1 ; : : : ; Xt Cn 2 An :
t
u
80
following: For a Markov process the future is independent of the past given the
presence.
Proposition 6.2. Let X be a stationary Markov process on a finite set A. Let
PWA
A be given by pab D pX2 D bjX1 D a. Then
pX1 D a1 ; X2 D a2 ; : : : ; Xn D an D pX1 D a1
n1
Y
Pai ai C1 :
i D1
n1
Y
pXi C1
i D1
D ai C1 jX1 D a1 ; : : : ; Xi D ai
D pX1 D a1
n1
Y
pXi C1 D ai C1 jXi D ai
i D1
D pX1 D a1
n1
Y
pX2 D ai C1 jXi D ai
i D1
qi D pX1 D ai DpX2 D ai D
pX2 D ai jX1 D aj qj D
Pj i qj :u
t
n!1
where Yn D
1
n
n
P
i D1
Xi .
This obviously means that for large n, the average Yn of the first n random
variables Xi will be with very high probability very close to a constant valueits
expectation. Thus the scatter in Yn will become negligible.
81
Proposition 6.3. Every i.i.d. process with finite E.X12 / satisfies the w.l.l.n.
Proof. For the proof we need a well known estimate of pjYn E.Yn /j , called
the Chebyshev-inequality:
For any random variable X and any > 0 we have
1jX j jX j
and thus
pjX j E.jX j/:
Now we consider the function
X
n
2
1
Xi E.Xi /
X WD .Yn E.Yn // D
n i D1
2
n
n
1 XX
E..Xi E.Xi //.Xj E.Xj ///:
n2 i D1 j D1
For i j these expectations are E .Xi E.Xi // E.Xj E.Xj // D 0, and for
i D j they are
E..Xi E.Xi //2 / D E.Xi2 / .E.Xi //2 D Var.Xi / D Var.X1 /:
Thus
E.X / D
1
1
E.X12 / .E.X1 //2 D Var.X1 /
n
n
and therefore
p.Yn E.Yn //2 2 D pjYn E.Yn /j
Var.X1 /
!0
2 n
for n ! 1:u
t
82
n
X
I.Xi /:
i D1
n
X
I.Xi /:
t
u
i D1
I.X / D lim
exists.
Proof. Recall that
(6.1)
n
X
I.Xi jX1 ; : : : ; Xi 1 /
for each n 2 N.
(6.2)
i D2
It is also not difficult to check (see Prop. 3.12.(iii) or 3.9.(iv) and consider the
stationarity of X ) that for each i 2 N
I.Xi jX1 ; : : : ; Xi 1 / I.Xi 1 jX1 ; : : : ; Xi 2 /:
(6.3)
Thus in (6.2) every term in the sum on the right hand side is less than or equal to
its predecessor. Furthermore, we have
I.X1 / D I.X2 / I.X2 jX1 /
by Proposition 3.9.(v).
Therefore
n I.Xn jX1 ; : : : ; Xn1 / I.X1 ; : : : ; Xn /
(6.4)
83
and
(6.1)
I.X1 ; : : : ; Xn / D I.X1 ; : : : ; Xn1 / C I.Xn jX1 ; : : : ; Xn1 /
(6.3)
I.X1 ; : : : ; Xn1 / C I.Xn1 jX1 ; : : : ; Xn2 /
(6.4)
I.X1 ; : : : ; Xn1 / C
D
1
I.X1 ; : : : ; Xn1 /
n1
n
I.X1 ; : : : ; Xn1 /;
n1
and so
1
1
I.X1 ; : : : ; Xn /
I.X1 ; : : : ; Xn1 /:
n
n1
This means that the sequence n1 I.X1 ; : : : ; Xn / is decreasing. Since it is always
positive, the limit exists.
t
u
Definition 6.6. The limit in Proposition 6.4 is called the information rate I.X / of
the process X D .Xn /n2N .
It is important to observe that the information rate I.X / of a process X coincides
with the average information needed for determining the value taken by one of the
random variables Xn , when knowing the previous ones, as the following proposition
shows.
Proposition 6.5. Given a stationary process X D .Xn /n2N , then
lim I.Xn jX1 ; : : : ; Xn1 / D I.X /:
n!1
bn D
t
u
Proposition 6.6. Let X be a stationary process in A. Let cn denote an optimal 0-1code for .X1 ; : : : ; Xn /, then
lim
n!1
1
L.cn / D I.X /:
n
Proof.
1
1
1
1
I.X1 ; : : : ; Xn / L.cn / I.X1 ; : : : ; Xn / C
n
n
n
n
by Proposition 4.2.
t
u
84
and
I.X1 / D I.X /:
t
u
Proof. Obvious.
The last two propositions show that the information content of a random variable
can be precisely identified with the average number of yesno questions needed in
repeated guessing of this variable for one determination of its value.
A similar interpretation as for I.X / can now also be given for
I.X jY / D
pY D b I.X jY D b/:
n!1
1
T ..X1 ; : : : ; Xn /; .Y1 ; : : : ; Yn //:
n
85
1
I.X1 ; : : : ; Xn /
n!1 n
I.X / D I exists. Now we are interested not only in the averages but also
in the individual values of the functions In .x/, where x D .x1 ; : : : ; xn / is a
particular value of the random vector .X1 ; : : : ; Xn /. We want to investigate, for
which individual vectors x they come close to the average I for large n. It will
turn out that this can happen for most x, i.e., with high probability.
Definition 6.8. A process X D .Xn /n2N with values in a finite set A is said to
have the asymptotic equipartition property (a.e.p.), if the sequence .InC1 In /n2N
satisfies the w.l.l.n.
This definition actually means that for processes with a.e.p. n1 In comes close to
I with high probability.
Proposition 6.8. An i.i.d. process X D .Xn /n2N with values in a finite set A
satisfies the a.e.p.
Proof. Because the Xi are independent, we have
n
X
e
e
ei :
N X
In D N X 1 \ : : : \ X n D
i D1
of the
more detail. It means that p n1 In I < ! 1 for any > 0. Here I D I.X / is
the information rate of the process X D .Xn /n2N . If we define
86
1
D ! 2 W In .!/ I <
n
Hn;
and
log2 p.a/
< I C ;
n
If we sum these inequalities for all a 2 An; , we obtain the following estimates
for #An; :
#An; > p.An; / 2n.I / > .1 / 2n.I / ;
#An; < p.An; / 2n.I C/ 2n.I C/ :
Thus we have proved the following.
Proposition 6.9. Let X D .Xn /n2N be a process with values in A that has the
a.e.p. and the information rate I . Then there is for every > 0, an n 2 N and a set
An; An of so-called high-probability sequences satisfying
i) p.An; / > 1 ,
ii) 2n.I / > p.a/ > 2n.I C/ for every a 2 An; ,
iii) .1 / 2n.I / < #An; < 2n.I C/ .
6.8 Exercises
87
The classes of processes that satisfy the w.l.l.n. or the a.e.p. have been more
thoroughly investigated in the mathematical literature, in particular, in a branch of
ergodic theory which analyzes dynamical systems from a probabilistic point of
view (see also Gray 1990). It turns out that they are in fact rather large compared to
the very special case of i.i.d. processes, since they contain for example the class
of all ergodic processes (see for example Billingsley 1978 or Friedman 1970;
Walters 1982). It is the requirement of independence of the variables Xi that is so
strong, and it has actually been weakened in several interesting ways (Billingsley
1978; Gray 1990).
6.8 Exercises
1) Let A D f1; : : : ; ng and .Xi /i 2N be independent identically distributed on A
such that pXi D k D n1 for every i 2 N and k 2 A. For any B A obviously
p.B/ D n1 #.B/. The relative frequency with which the Xi hit the set B is defined
m
P
1B Xi . Can one say that the random variables Ym converge in
as Ym D m1
i D1
88
3) Is it always true that the combined process .Xi ; Yi /i 2N is i.i.d. if the processes
.Xi /i 2N and .Yi /i 2N are i.i.d.?
4) Let .Xi /i 2N be independent identically distributed random variables on the finite
set A with pX1 D a 0 for every a 2 A. What is the probability that every
word from A occurs infinitely often in a sequence .Xi /i 2N ? Here A denotes
the set of all finite words on A, given by
A D
Ai
i 2N
References
Ash, R. B. (1965). Information theory. New York, London, Sidney: Interscience.
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart
and Winston.
Billingsley, P. (1978). Ergodic theory and information. Huntington, NY: Robert E. Krieger
Publishing Co.
Cover, T. M. & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Doob, J. L. (1953). Stochastic processes. New York: Wiley.
Friedman, N. A. (1970). Introduction to ergodic theory. New York: Van Nostrand Reinhold
Company.
Gray, R. M. (1990). Entropy and information theory. New York: Springer.
Khinchin, A. (1957). Mathematical foundations of information theory. New York: Dover Publications, Inc.
Lamperti, J. (1966). Probability : A survey of the mathematical theory. Reading, Massachusetts:
Benjamin/Cummings.
Lamperti, J. (1977). Stochastic processes - a survey of the mathematical theory. Applied Mathematical Sciences 23, New York: Springer.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Topse, F. (1974). Informationstheorie: eine Einfuhrung. Stuttgart: Teubner Verlag.
Walters, P. (1982). An introduction to ergodic theory. New York, Berlin, Heidelberg: Springer.
Chapter 7
Channel Capacity
In this chapter we extend the definitions of Chap. 5 to real information channels that
handle sequences of symbols instead of single symbols. This extension is necessary
to use the idea of taking a limit of very long sequences to define information rate
(Definition 5.5) now to define transinformation rate and channel capacity. This leads
to the proof of Shannons famous theorem in the next chapter.
In this very general definition, one needs the canonical -algebras on AN and B N that are
generated by the so-called cylinder sets (cf. Bauer 1972 and see also Definition 7.2).
89
90
Fig. 7.1 Data transmission
on a channel
7 Channel Capacity
t=1
a1
t=2
b1
a2
b3
a4
t=4
t=3
a3
b2
b4
91
In the following we shall concentrate on the simplest type of channel: the channel
without memory and anticipation, also called the simple channel.
Let .pa /a2A be a transition probability from A to B, where A and B are both
finite sets.
We can then construct an information channel with input alphabet A and
output alphabet B that simply works independently on each term ai in an
input sequence .a1 ; a2 ; : : :/ and produces the corresponding term bi of the output
sequence .b1 ; b2 ; : : :/.
For a 2 AN we thus define
Ca b1 2 B1 ; : : : ; bn 2 Bn WD pa1 .B1 / pa2 .B2 / : : : pan .Bn /:
(7.1)
Y, then
i) If X is stationary, so is Y,
ii) If X is independent, so is Y.
t
u
B N is defined by2
Yg:
Of course, a general definition should not make requirements on the processes involved. Yet
Shannons theorem relies on strong properties and it seems adequate to restrict the definition to
stationary processes.
92
7 Channel Capacity
B and a probability p0
Y g;
a2A
where the maximum3 extends over all probability vectors q D .qa /a2A and I denotes
the information of a probability vector as defined in Exercise 4.4.
Proof. Let A D fa1 ; : : : ; am g, B D fb1 ; : : : ; bn g. For two stationary processes X
and Y with CW X
Y we have
T .X ; Y/
D
D
D
./
(7.1)
1
I .Y1 ; : : : ; Yn / j .X1 ; : : : ; Xn /
n
1 e
e n / j .X
e1 \ : : : \ X
e n/
N .Y 1 \ : : : \ Y
I.Y/ lim
n!1 n
1
e1 \ : : : \ Y
en / j X
en
E log2 p .Y
I.Y/ C lim
n!1 n
n
Y
1
I.Y/ C lim E log2
PXi Yi
n!1 n
i D1
I.Y/ lim
n!1
1X
E log2 PXi Yi
n!1 n
i D1
n
I.Y/ C lim
The maximum is attained because the set of all probability vectors q is compact.
93
I.Y/ C E log2 PX1 Y1
XX
pX1 D aPa .b/ log2 Pa .b/ :
I.Y/ C
a2A b2B
depends only on X1
n
Y
pxi .yi /:
i D1
XX
a2A b2B
D I.Y1 /
pX1 D aI.pa /
(7.2)
a2A
Y1 .
t
u
pa b1 D m1 ; : : : ; bn D mn
m1 ;:::;mn
qm c1 2 C1 ; : : : ; cn 2 Cn
where m 2 B N starts with .m1 ; : : : ; mn /.
94
7 Channel Capacity
1p
00
0.5
01
2
2
10
0.5
1p
11
7.5 Exercises
1) What is the memory- and anticipationspan of Examples 7.17.5?
2) How could one modify Example 7.4 with P W A
A
B in order to define a
channel with memory span 4?
3) When two channels with finite memory and anticipation are connected, how
does the memory- and anticipationspan of the composed channel depend on
those of the two channels?
4) Show that for a simple channel (see Definition 7.4 on page 91) Cp defined by
a transition probability pW A
B, the capacity c.Cp / is attained for an i.i.d.
input process on A.
5) Show that the channel capacity c.p/ of a channel pW AN B N , where
Ipa .b1 ; : : : ; bn / is independent of a 2 AN for every n 2 N, can be obtained
by maximizing fI.Y/W X process on A; pW X
Yg.
6) In Example 7.1 take B D f0; 1g, A0 A, and consider the mapping f that has
(
f .a/ D
for a 2 A0 ;
otherwise :
References
Fig. 7.3 Two simple
channels
95
0
2
0.5
1
0.25
0.5
0.5
0.25
9) For the simple channels C given in Fig. 7.3, what is their capacity and what is
the capacity of C C?
10) Show the following:
Proposition 7.3. Let C W AN
B N be a channel. Its capacity
c.C/ satisfies c min.#.A/; #.B//.
References
Ash, R. B. (1965). Information theory. New York, London, Sidney: Interscience.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt,
Rinehart and Winston.
Feinstein, A. (1958). Foundations of information theory. New York: McGraw-Hill Book
Company, Inc.
Chapter 8
The goal of our rather technical excursion into the field of stationary processes
was to formulate and prove Shannons theorem. This is done in this last chapter
of Part III.
Shannons Theorem is one of the most important results for the foundation of
information theory (Shannon and Weaver 1949). It says that the channel capacity
c determines exactly what can effectively be transmitted across the channel. If you
want to transmit less than c bits of information per time unit across the channel you
can manage to do it in such a way that you can recover the original information from
the channel output with high fidelity (i.e., with low error probabilities). However, if
you want to transmit more than c bits per time unit across the channel, this cannot
be done with high fidelity. This theorem again underlines the fact that information
is incompressible (like water) and that a given channel can only transmit a given
amount of it in a given time.
97
98
Coding
C
U
A
X
CAB
B
Y
i
h 1
e1 \ : : : \ X
e n I.X / > <
i) p N X
n
h 1
i
e1 \ : : : \ Y
en I.Y/ > <
ii) p N Y
n
i
h 1
e1 \ : : : \ X
en \ Y
e1 \ : : : \ Y
en I.X ; Y/ > <
iii) p N X
n
h 1
i
e1 \ : : : \ U
e n I.U/ > <
iv) p N U
n
From Proposition 6.8 we can also estimate the number of the corresponding highprobability sequences, i.e., N D #.An; /, #.Bn; /, M D #.Cn; /, and also the number
P of high-probability pairs.
Now the idea is to consider only the high-probability elements in C n , An , B n ,
and An
B n , and to map each high-probability element c D .c1 ; : : : ; cn / 2 Cn;
onto a different randomly chosen a D .a1 ; : : : ; an / that is the first element in a highprobability pair .a; b/. This procedure will work if there are more such as than
there are high-probability cs, and if the probability of finding the first element a
from the second element b in a high-probability pair .a; b/ is sufficiently high. In
this case, we can guess first a and then c from the channel output b. In order to
carry out the proof, we now have to estimate the number of these as appearing in
high-probability pairs .a; b/.
1. Given a high-probability a, we estimate the number Na of high-probability
pairs (a,b) containing a as follows: We use the abbreviations X D .X1 ; : : : ; Xn /,
Y D .Y1 ; : : : ; Yn /, and U D .U1 ; : : : ; Un /, and consider only high-probability
elements ! 2 . Then
99
e jX
e/ D
p.Y
p.X; Y /
2n.I.X ;Y/I.X /C2/:
e/
p.X
Thus
1
b2B n
P
P
:
2n.I.X ;Y/I.X /C2/
Na
Because of the estimates for M and P from Proposition 6.7, this is true if
2n.I.U /C/ .1 /2n.I.X /3/ :
Since I.X / T .X ; Y/ D c > r D I.U/ this is certainly true for sufficiently
small .
2. Given a high-probability b 2 B n , we estimate the number Nb of high-probability
pairs .a; b/ in An
B n containing b similarly to (1):
e jY
e/ D p.X; Y / 2n.I.X ;Y/I.Y/C2/ :
p.X
e/
p.Y
Thus
1
a2An
This number we use to estimate the probability that there is at most one m.c/
occurring as first component among the Nb pairs, for each of the high-probability
bs at the channel output. More exactly, for a fixed high-probability c we take
a D m.c/ as channel input and obtain b as channel output. Now we ask for
the probability pf that there is another c 0 such that .m.c 0 /; b/ is also a highprobability pair. For fixed b let nb be the number of codewords m.c 0 / such that
.m.c 0 /; b/ is a high-probability pair. Now we can estimate
pf pnb 1 < E.nb /
DM
Nb
N
100
t
u
References
101
8.4 Exercises
1) Let P C be the compound channel obtained from connecting two channels
P and C.
a) Show that c.P C/ minfc.P/; c.C/g.
b) Give an example where c.P/ 0 and c.C/ 0, but c.P C/ D 0. This can
be done with deterministic channels P and C.
c) Show that for any > 0 one can construct a deterministic adapter channel
R such that c.P R C/ minfc.P/; c.C/g .
d) What would be a good adapter channel for P D C, for the two channels of
Exercise 7.9?
References
Feinstein, A. (1954). A new basic theorem of information theory. IRE Transactions on Information
Theory, 4, 222.
Feinstein, A. (1959). On the coding theorem and its converse for finite-memory channels.
Information and Control, 2, 2544.
Gray, R. M. (1990). Entropy and information theory. New York: Springer.
Kieffer, J. (1981). Block coding for weakly continuous channels. IEEE Transactions on Information Theory, 27(6), 721727.
McMillan, B. (1953). The basic theorems of information theory. Annals of Mathematical Statistics,
24, 196219.
Pfaffelhuber, E. (1971). Channels with asymptotically decreasing memory and anticipation. IEEE
Transactions on Information Theory, 17(4), 379385.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Shannon, C. E. & Weaver, W. (1949). The mathematical theory of communication. Champaign:
University of Illinois Press.
Wolfowitz, J. (1964). Coding theorems of information theory (2nd ed.). Springer-Verlag New York,
Inc.: Secaucus.
Part IV
Chapter 9
This chapter introduces the notion of a cover or repertoire and its proper descriptions. Based on the new idea of relating covers and descriptions, some interesting
properties of covers are defined.
Definition 9.1. For a probability space .; ; p/, a cover is a subset of n f;g
such that [ D .1
In general, a cover may be a finite or an infinite set of propositions. For a basic
understanding of the concepts, it will be much easier to consider finite covers.
Part II on coding and information transmission illustrated that information is
an objective quantity, which quantifies the amount of information needed to
determine the value of a random variable. On the other hand, novelty is slightly
more subjective in the sense that it takes explicitly into account a particular
interpretation or description d of events x, and so it is a measure of the novelty
provided by viewing the world (the events) through d . Of course, interest in the
value of a particular random variable X , also implies a particular description of
e , but this description is of a special clear-cut type: we have said
events, namely X
it is complete. After having investigated the clear-cut, complete descriptions that
are needed for optimal guessing and coding in Parts II and III, we now want to
come back to the more personal, biased, one-sided descriptions of events. The
opposite extreme to complete descriptions are descriptions that express a very
particular interest pointing only to one direction. They are the directed descriptions,
corresponding to the narrow covers defined in Sect. 9.4.
The world view of a person could be characterized as the collection of
all propositions that (s)he will eventually use (externally or internally) to describe
events in the world. We may also understand as the collection of all propositions
a person is potentially interested in. Such a collection will also be called a
repertoire, meaning the repertoire of all elementary observations (propositions),
through which a person views or describes the occurring events x 2 . One could
105
106
assume that such a repertoire should be closed under the logical operations NOT,
AND, and OR, i.e., complement, intersection, and union. This would make it an
algebra. We do not take this point of view here, in particular with respect to negation.
Indeed, non-table would not appear as such a natural elementary description of a
thing as table.
Definition 9.2. For a cover and a description d , we define the novelty provided
by d for (novelty of d for ) by N .d / WD E.Nd / where
Nd .!/ D supfN .A/W d.!/ A 2 g:
Proposition 9.1. For any finite cover and any description d , we have
i) N .d / N .d /
ii) N .d / D N .d / if and only if R.d / .2
Proof. Obvious.
t
u
107
D fCi D aceW i D 1; : : : ; 8g, where Ci 2 face; no aceg describes the i -th card.
Here is not a partition, because there are two aces among the eight cards.
In the example of Chap. 2, the Monty Hall problem, there is the location of the
sportscar S 2 f1; 2; 3g and the candidates guess C 2 f1; 2; 3g. The quizmaster is
allowed to make a true statement from the repertoire D fS i W i D 1; 2; 3g.
Here, however, there is another restriction given by the situation: He cannot chose
to say S C , because he should not open the door the candidate has chosen.
Further examples for repertoires can be considered in the context of the game
of lotto: 6 numbers are drawn from L D f1; : : : ; 49g without replacement. If you
have guessed them right, you get a lot of money. Usually the six numbers are not
displayed in the order they are drawn, but ordered by their size. Thus we have 6
random variables X1 < X2 < X3 < X4 < X5 < X6 with values in L.
If you regularly take part in the game, you want to know the six numbers. This
interest is described by
D fX1 D a1 ; X2 D a2 ; : : : ; X6 D a6 W a1 < a2 < a3 < a4 < a5 < a6 2 Lg;
which is a partition with large information content, (6 pairs of digits have an
information content of about 40 bit, since they are less than 49 it will be
about 6 bit less). Normally, you are unable to remember these 6 numbers, if
you are just told them. You may, however, be able to remember them, if they
are interesting configurations for example .1; 2; 3; 4; 5; 6/. It seems natural to
say that this configuration is more surprising than any of the usual configurations, like .4; 5; 18; 27; 34; 40/. Some people may be interested in particular
numbers like 3; 13; 33, or perhaps prime numbers. Here I only want to consider
a few particular configurations: it is surprising, if 3 or more numbers come in
a row. This interest in configurations can be expressed by the repertoire D
f3 in a row; 4 in a row; 5 in a row; 6 in a row; g. Another interesting feature
may be the compactness of the sequence, i.e., X6 X1 . So we may be interested
in 2 D fX6 X1 D aW a 2 Lg. Since small values of X6 X1 are more surprising
than large values, this interest may be even better expressed by 2 D fX6 X1
aW a 2 Lg. In addition we may be surprised by large values of X1 or by small values
of X6 as described by
1 D fX1 D aW a 2 Lg;
Let us play a bit with these 7 repertoires. First we should note that X1 ,X6 , and
X6 X1 do not really take all values in L. For example, X6 cannot be less than
6 and X1 cannot be more than 44. So X6 D 3 is the empty set, so is X6 4.
Now it does not really matter whether we put the empty set into a repertoire or not.
The empty set cannot be used to describe anything. In our definitions we usually
assume that ; ; normally, we even assume that p.A/ 0 for all propositions in
a repertoire.
108
109
and 2 2 1 6 :
110
In line with the observations made in Sect. 9.1, it is obvious that the novelty
of the same event ! can be larger if described in En with larger n (n k
implies NEn .!/ NEk .!/). So the real surprise of ! should be evaluated as some
combination of NEn .!/ and n, the idea being that an event ! is more surprising if it
can be described with fewer words (or letters) and has smaller probability.
Also in this line of thinking one could try to take the shortest English expression
(i.e., the smallest n such that En contains the proposition) as a measure for the
information content of a proposition. A more formalized version of this idea indeed
leads to an alternative, algorithmic approach to information theory by Chaitin (1975,
1977, 1987), Kolmogorov (1965, 1968), or to the so-called description complexity.
t
u
Example 9.3. Another example close to the simple extreme fg is the cover
f; Ag, where 0 < p.A/ < 1, which describes a simple bet on the proposition
t
A about events ! 2 , I bet that A. A different example is the cover fA; Ac g. u
Further examples are provided by the range R.d / of a description d , or by the
cover depicted in Fig. 2.2 in Sect. 2.3.
In this section we want to study the relationship between covers or repertoires
(see Definition 9.4) and descriptions.
Definition 9.3. Given a cover , a description d is called consistent with or in terms
of or a choice from , if d.x/ 2 for every x 2 . A choice from a cover is called
proper, if for every x 2 there is no element A of such that x 2 A d.x/.3
We denote by D.) the set of all proper choices4 from .
The word choice stems from the idea that a particular description d is always
a choice in the sense that for a given x 2 , d.x/ has to be chosen from those
propositions A 2 that contain x. Of course, it may also be possible that there is
no choice (or rather only one choice) for a repertoire .
Proposition 9.2. A cover admits only one description d , if and only if A\B D ;
for any A B 2 .
Proof. Obvious.
t
u
Such a cover is called disjoint or a partition (cf. Definition 2.6). Note that the
requirement that
Pp.A/ 0 for every A 2 implies that a disjoint cover has to be
countable and
p.A/ D 1.
A2
3
In line with our general strategy to disregard sets of probability 0, we can interpret A d.x/ as
p.A n d.x// D 0 and p.d.x/ n A/ > 0.
4
In the definition of D./, we understand a description d simply as a mapping d W ! with
! 2 d.!/, i.e., without the additional requirement of Definition 2.3.
111
For finite covers there are always proper choices. Unfortunately, this may no
longer be true for infinite covers as the following example shows.
Example 9.4. D f.a; b/W a < b 2 R D g.
t
u
Let me explain the idea behind the concept of a proper choice or description:
If you ask someone for a description of Mr. Miller, and in the course of this
description he says Mr. Miller has two children, then you usually assume not
only that Mr. Miller has two children, but also that Mr. Miller has no more than
two children. On the other hand, if Mr. Miller had three children, it would still be
correct to say that he has two children, since having three children implies having
two children.
In the repertoire of the person describing Mr. Miller, there are certainly propositions about Mr. Miller of the type Mr. Miller has n children. Among these
Mr. Miller has two children and Mr. Miller has three children are both correct
choices, if Mr. Miller has three children. But we assume that our informer will
choose the stricter of these two statements in this case. His proper choice of
statement should be such that there is no stricter statement available (in his
repertoire) about Mr. Miller that is true.
In the following we will be mostly concerned with proper choices. We might ask
whether there is always a proper choice for a repertoire .
We define .x/ WD fA 2 W x 2 Ag. An element M 2 .x/ is called minimal, if
there is no A 2 .x/ with A M . Thus a proper choice d picks for every x 2
a minimal element d.x/ 2 .x/. If .x/ is a finite set for every x 2 , as will be
usually the case, then it is clear that .x/ has minimal elements and that a proper
choice exists.
Definition 9.4. A cover is called finitary, if
i) p.A/ 0 for every A 2 and
ii) For almost every ! 2 and every A 2 containing ! there is a minimal
proposition B 2 with ! 2 B A. Minimality of B means that ! 2 C B
and C 2 implies C D B.
A finitary cover is also called a repertoire.
Example 9.5. Take D R and
1 D f.a ; a C /; a 2 Rg for > 0;
2 D f.a; 1/; a 2 Rg;
3 D fa; 1/; a 2 Rg;
4 D ff!gW ! 2 Rg:
Which of these covers are finitary?
t
u
Obviously finite covers with property (i) are finitary. This requirement is used to
ensure the existence of proper choices also from infinite covers.
112
tight
not tight
Proposition 9.3. Let be a finitary cover. For every ! 2 and every A 2 with
! 2 A there is a proper choice4 d 2 D./ such that d.!/ A.
Proof. We obtain a proper choice d 2 D./ by choosing for every ! 0 2 a
minimal B 0 2 with ! 0 2 B 0 . Now take ! 2 and A 3 !. Then there is a
minimal B 2 with ! 2 B A. We define d.!/ WD B.
t
u
For further reference we define two particular repertoires, a very fine one, called
(for large) and a very coarse one, called (for small).
Example 9.6. Let be finite or countable and p.!/ 0 for every ! 2 . On
.; ; p/ we define two covers or repertoires:
D ff!gW ! 2 g
and
D f n f!gW ! 2 g :
t
u
Definition 9.5. A repertoire is called tight if there is exactly one proper choice
from it. In this case, this choice is called d .
Example 9.7. The pictures in Fig. 9.1 illustrate tightness.
The repertoire in the right picture is not tight, because for the lower left corner,
both the horizontally hatched region or the diagonally hatched region can be (part
of) a proper choice. Thus there is more than one proper choice for this repertoire. In
the left picture the horizontally hatched region cannot be chosen because it contains
each of the diagonally hatched regions and will therefore not be a proper choice in
those regions.
t
u
Example 9.8. A bet on the truth of a single proposition A can also be described by
a repertoire, namely by D fA; g (cf. Example 9.3). This repertoire is tight and
the only choice d is
(
A for x 2 A and
d .x/ D
for x A.
For a random variable
X we defined the description X in Definition 2.16. The
range D R X of this description is a tight repertoire and d D X again.
e for
Any partition is obviously tight, and so is the range of the description X
any discrete random variable X on .
t
u
113
clean
114
c [
D./ D D./ implies c D c
D.c / D D./ D D.[ /
c[ D [ D [[ and [c D c D cc
t
u
D./ D D./
c D c
[ D [
c [
We need this assumption only for the first equation in (iii) and (iv).
115
116
ab
117
Proposition 9.11. For two tight covers and the product is tight and d D
d \ d .
Proof. Obviously for every A 2 with x 2 A, we have d .x/ \ d .x/ A. For
this reason there is only one proper choice for .
t
u
It will be seen later, in Part VI, that for tight repertoires and , their product
is essentially the smallest tight repertoire containing and .
The flattening of an arbitrary cover may not exist, because f may not be a cover. An example
for this is D R.X / for a random variable X with R.X/ D R. In this case, has no
maximal elements. If the flattening exists, it is clearly flat. Usually we consider finite covers which
guarantees that the flattening exists.
118
Fig. 9.4 The hierarchy for D fA; B; C; D; E; F; G; H g where D f1; 2; 3; 4; 5; 6g, and A D
f1; 2; 3; 4; 6g, B D f2; 4; 5; 6g, C D f1; 2; 3; 4g, D D f1; 6g, E D f4; 5; 6g, F D f3g, G D f4g,
H D f5g
Fig. 9.5 Example for a shallow repertoire (a) and for a chain (b)
Shallow or flat covers are obviously clean repertoires. Clearly a cover is shallow
if and only if every choice for it is proper. So for shallow covers the distinction
between choices and proper choices is unnecessary. fg is the only cover that is
shallow and narrow.
The flattening is a very bold operation on covers or repertoires, it can easily make
them uninteresting, i.e., very coarse. Indeed, every cover is equivalent to (Def. 9.9)
[ fg and . [ fg/f D fg.
Proposition 9.13. A cover is a partition8 if and only if it is shallow and tight.
Proof. Clearly every partition is shallow and tight. Let be shallow and tight. Take
A B 2 and assume that there is an x 2 A \ B. Then d .x/ is contained in A
119
clean
template
tight
partition
chain
shallow
all repertoires
{}
and in B, and thus cannot be shallow. This shows that is disjoint and therefore
a partition.
t
u
The picture in Fig. 9.6 illustrates the different classes of repertoires and how they
are related with each other.
120
9.6 Exercises
1) For the examples of repertoires in Examples 9.3, 9.7, and 9.9, what is c ,
which of them are tight, and what is the cardinality (number of elements) of
D./ in each case?
2) For which repertoires with n elements does D./ have the largest (resp. smallest) cardinality? Give an example for each case. What is the result if jj D n?
3) Prove the following:
Proposition 9.14. For a repertoire : [ is an algebra if and only if c is a
partition.
4) Let be a repertoire and jj D n. What is the maximal number of elements in
[ and in \ ?
5) When jj D n and is a clean repertoire, or a shallow repertoire, respectively,
what is the maximal number of elements in [ and in \ ?
6) Let be a repertoire and jj D n. Consider the unique proper choice d for \ .
What is the maximal number of elements in R.d /? How is R.e
d / related to the
algebra .) generated by ?
7) Given a description d , is there a difference between R.d\ / and R.d /\ ?
f
8) Given a description d , is there a difference between .e
d /\ and .d
\ /?
9) Give an illustrating example for each of the set theoretical differences and
intersections of sets in Fig. 9.6.
10) Show that c is clean for any repertoire.
11) Show that for finite (and p.!/ 0 8 ! 2 ) all covers are repertoires.
12) Give an example that this is not the case when is countable.
13) For finite with jj D n,
a) which are the largest and which are the smallest tight repertoires and what
are their cardinalities?
b) the same for templates,
c) the same for flat covers.
14) Show that (Example 9.6) is flat, but not tight (for jj > 2).
References
Chaitin, G. J. (1975). Randomness and mathematical proof. Scientific American, 232(5), 4752.
Chaitin, G. J. (1977). Algorithmic information theory. IBM Journal of Research and Development,
21, 350359.
Chaitin, G. J. (1987). Algorithmic information theory, volume I of Cambridge tracts in theoretical
computer science. Cambridge: Cambridge University Press.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information.
Problems in Information Transmission, 1, 47.
Kolmogorov, A. N. (1968). Logical basis for information theory and probability theory. IEEE
Transactions on Information Theory, 14, 662664.
References
121
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Palm, G. (1996). Information and surprise in brain theory. In G. Rusch, S. J. Schmidt, &
O. Breidbach (Eds.), Innere ReprasentationenNeue Konzepte der Hirnforschung, DELFIN
Jahrbuch (stw-reihe edition) (pp. 153173). Frankfurt: Suhrkamp.
Palm, G. (2007). Information theory for the brain. In V. Braitenberg, & F. Radermacher (Eds.),
Interdisciplinary approaches to a new understanding of cognition and consciousness: vol.
20 (pp. 215244). Wissensverarbeitung und Gesellschaft: die Publikationsreihe des FAW/n
Ulm, Ulm.
Chapter 10
This chapter finally contains the definition of novelty, information and surprise for
arbitrary covers and in particular for repertoires and some methods for their practical
calculation. We give the broadest possible definitions of these terms for arbitrary
covers, because we use it occasionally in Part VI. Practically it would be sufficient
to define everything just for repertoires. It turns out that the theories of novelty and
of information on repertoires are both proper extensions of classical information
theory (where complementary theorems hold), which coincide with each other and
with classical information theory, when the repertoires are partitions.
123
124
have reduced our problem to the familiar problem of determining an optimal code
for the random variable X1 . It is easy to calculate the probabilities that X1 D i and
to work out I.X1 / 2:61 (cf. Exercise 10.7)).
Next I want to give some examples illustrating the definition of surprise S./.
Someone, Mr. Miller, wants to prove the efficiency of a certain medication by a
statistical test. To this end he has acquired 2,000 data points: each comparing the
result with and without medication on the same patient. A textbook on statistical
testing suggests a certain test T and mentions that this test should be carried out
with at least 10, better 20 data points. Of course, it is also recommended to use as
many data points as possible.
Now Mr. Miller has the idea to divide his 2,000 data points into 100 batches of
20 and perform 100 tests with the idea to report the best result.
He finds out that 6 of his tests lead to significant results and two are even highly
significant. Instead of simply mentioning one of those highly significant results,
he now starts to wonder, that two significant results should somehow be more
significant than one significant result. A friend tells him that this is obviously the
case because the 100 tests were performed on different persons and therefore can
be assumed to be independent. Thus the significance probabilities can simply be
multiplied. But what about the nonsignificant results? Well, says his friend, even
if they have rather high-significance probabilities pi , these are certainly smaller
than 1 and so they can safely be multiplied also, they will just make the resulting
significance probability smaller.
What do you think about this argumentation?
The 100 statistical tests can be described by 100 random variables Ti , and we
can indeed assume them to be independent. The significance probability pi of Ti
is p.Ti /, which is also described by the cover i D fTi xW x > 0g. The first
100
S
idea, to report the best result, corresponds to the repertoire D
i , whereas the
i D1
125
For arbitrary, possibly uncountable repertoires the max may be a sup (i.e., not attained). It may
also be that the expectation does not exist or is infinite for some d 2 D./; in this case, the max is
defined as 1. For finite repertoires we will see that the max exists and is finite.
2
For arbitrary, possible uncountable covers the min may be a inf (i.e., not attained). It may also be
that the expectation does not exist or is infinite for all d 2 D./; in this case, the min is defined
as 1. For finite repertoires we will see that the min exists and is finite.
3
If there is no d 2 D./ with N .d / D N ./, this definition is not reasonable. In this case,
it should be replaced by b
I ./ D
lim minfI .d /W d 2 D./; N .d / an g and b
S ./ D
an !N ./
lim
an !N ./
maxfS .d /W d 2 D./; N .d / an g.
126
These definitions are formulated in the most general way for arbitrary repertoires.
Usually they will be applied to finite repertoires. In more general cases some
technical details have to be observed as indicated in the footnotes. Clearly N ./
is called the novelty of , both I./ and b
I./ may be called the information of ,
and both S./ and b
S./ may be called the surprise of .
Proposition 10.1. For any repertoire the following inequalities hold:
1
S./ b
S./ N ./ b
I./ I./:
ln 2
Proof. (i) For any d 2 D./, S.d /
(ii)
(iii)
(iv)
(v)
1
ln 2
1
ln 2 .
S./ b
S./ is obvious.
b
I./ I./ is obvious.
N ./ b
I./, because I.d / N .d / for any d .
b
S./ N ./. Assume S.d / D b
S./ which implies N .d / D N ./.
Let d.!/ D A then, p.A/ D minfp.B/W ! 2 B 2 g.
Now AE D p.d / p.A/ D f! 0 W p.d.! 0 // p.A/g A and we are done.
E
t
u
Indeed ! 0 2 A ) p.d.! 0 // p.A/ ) ! 0 2 A.
127
1
6
4
2
log2 C log2 3 D log2 3 ;
6
4
3
3
1
1
1
1
2
I./ D log2 3 C log2 6 C D log2 3 C ;
3
6
2
2
3
b
I./ D 1;
I./ D
b
I./ D log2 3;
1 b
D S./;
2
1
3
1
S./ D log2 3 C log2 D b
S./;
3
3
2
S./ D
t
u
1
6
5
log2 C log2 6;
6
5
6
1
2
b
I. / D log2 3 C log2 6;
3
3
I. / D
S. / D
1
log2 3;
3
b
S. / D 0:
t
u
b
S./ D S.N /:
128
Proposition2:10
N .N / D S.N /.
t
u
consider D fa; a C 1W a 0g and D fa; 1/W a 0g. D./ is very large,
D./ is very small: it contains just one description bW x 7! x; 1/.
Here N ./ D N .d / for d W x 7! x; x C 1, yielding N ./ D log2 .e/
log2 .1 e 1 / after some computation (see the next example), S./ D b
S./ D
S.d / D S./ D ln12 . On the other hand, I./ D I.c/ for cW x 7! bxc; bxc C 1,
which is again computed in the next example, but b
I./ D I.d / D 1. N ./ D
N .b/ for bW x 7! x; 1/, S./ D b
S./ D S.b/ D N .b/, and I./ D b
I./ D
I.b/ D 1.
Next we calculate
Z1
Z1
Z1 Z1
1
1
N .b/ D e x . log e x /dx D
x e x dx D
1yx e x dydx
ln 2
ln 2
0
1
ln 2
Z1
e y dy D
1
1:44269;
ln 2
Z1
N .d / D
x
. log.e
x
e
x1
1
//dx D
ln 2
Z1
e x .x ln.1 e 1 //dx
0
1
1 ln.1 e /
2:10442;
ln 2
1
X
N .c/ D
.e j e j 1 /. log.e j e j 1 //
D
j D0
D .1 e 1 /
1
1 X j
e .j ln.1 e 1 //
ln 2 j D0
e 1
1 e 1
ln.1 e 1 /
ln 2
.1 e 1 /2
1 e 1
1
e
1
1
D
ln.1
e
/
1:50134.
ln 2
1 e 1
D
t
u
Example 10.4. Let us now consider Example 9.5 and evaluate N , I and the other
quantities for the exponential distribution. For 1 and x 2 RC we obtain
129
Dd >0
x
C d:
D
ln 2
So N .1 / D
R1 x
. ln 2 C d /e x dx D
0
1
ln 2
C d.
1
X
e 2i .1 e 2 /.
i D0
D .1 e 2 /
X
1
i D0
2i
C d/
ln 2
1
2i 2i X 2i
e
C
de
ln 2
i D0
e 2
2
C d:
ln 2 1 e 2
For ! 0, I becomes ln12 Cd which goes to infinity like log2 . For increasing
the information I decreases monotonically towards 0.
2 is not a repertoire.
For the exponential distribution 3 is the same as from Example 10.3.
For 3 and x 2 RC we obtain
N3 .x/ D maxfN .A/W x 2 A 2 3 g D log.e x / D
So N .3 / D
R1
0
x x
ln 2 e
dx D
1
ln 2 .
x
:
ln 2
130
4 is not a repertoire, but it would be quite obvious that N4 .x/ D 1 for any
x 2 R and we obtain N .4 / D b
I.4 / D I.4 / D 1.
t
u
Now we proceed to prove some of the statements made in the remarks following
Proposition 10.1.
Definition 10.2. Let be a cover. A proposition A 2 is called small in , if for
almost every x 2 A and every B 2 with x 2 B we have p.B/ p.A/.
Proposition 10.3. For a repertoire the following statements are equivalent:
i) N ./ D b
I./ < 1.
ii) contains a small partition, i.e., a partition of small propositions.
Proof.
0
(ii) ) (i) : Let 0 be a partition that is small
P in . For 2 A 2 define d.x/ D A.
Then N ./ D N .d / D
p.A/ log2 p.a/ D I.d / D b
I./.
A2 0
A2
A3
131
N ./ D 0:7 log2
10
6
b
S./ D 0:7 log2
10
7
D 0:36020;
The definition of coincides with the more general Definition 9.9 (i.e., [ D
[ ). Also the -relation as defined here can be used for arbitrary covers.
Proposition 10.5.
i) The relation is an equivalence relation on repertoires.
ii) For , we have N ./ D N ./, I./ D I./, b
I./ D b
I./, S./ D
b
b
S./ and S./ D S./.
iii) We have if and only if and .
iv) For we have N ./ N ./.
Proof. (i), (ii) are obvious.
(iii) ): By Proposition 9.4 D./ D D./ implies c D c which implies
[ D c[ D c[ D [ :5
(: [ implies [ [[ D [ and vice versa. So [ D [ : Again
by Proposition 9.4 D./ D D.[ / D D.[ / D D./.
(iv) Take c 2 D./. For ! 2 ; c.!/ 2 [ , so there is a minimal B 2 with
! 2 B c.!/. Now we define d.!/ D B and obtain d 2 D./ with d c
and therefore N .d / N .c/.
t
u
In particular, c and [ have the same information, novelty, and surprise as
a repertoire . The equivalence classes of have already been determined in
Proposition 9.3.
The next proposition shows an interesting property of N .
Proposition 10.6. Let and be two repertoires.
i) N[ .!/ D max.N .!/; N .!//
5
8! 2
132
ii) N ./; N ./ N . [ / N ./ C N ./
t
u
and I. [ / D I./ D 1;
whereas
I.. [ / . [ // D I. / D I. / C I./ > I./ C I./
and even
I.. [ / [ . [ // D I./ > I./ C I./
and thus I is not subadditive (i.e., in general I. [ / and also I. / can be
greater than I./ C I./).
u
t
Of course, we should be able to obtain monotonicity and subadditivity also for I
for reasonably large subclasses of repertoires. This is indeed the case for tight and
for flat repertoires as we will see in Chap. 17.
In order to focus and simplify this discussion of information properties on various
sets of covers, we formally introduce symbols for the most important sets. They will
be the subject of Part VI.
133
Next we ask the question for which repertoires we can achieve maximal and
minimal values of S, b
S, I, b
I, and N . To this end we consider a finite or countable
space .; ; p/ with p.!/ 0 for every ! 2 . It is quite obvious that the minimal
value for all these quantities S D b
S DN Db
I D I D 0 is achieved for D fg.
On the other hand, the maximal value for N D b
I D I is achieved for the partition
introduced in Example 9.6. For jj D n it is at most log n (for p.!/ D n1 for
every ! 2 ) and for infinite it can be infinite (Proposition 2.17). The maximal
value of S and b
S is rather small, which is the subject of Exercises 10.12). The final
problem we have to tackle in this chapter is how to compute I and b
I in concrete
cases. (Computing N , b
S, and S is in fact quite simple as we saw above.) The next
section is devoted to this problem.
for k i; j and
pi0 WD q
and
pj0
WD pi C pj q:
q pi
2 .0; 1/. Then
pj pi
r pi C .1 r/pj D q C pj C pi D pj0
and
r pj C .1 r/pi D q D pi0 :
134
Thus
h.pi0 / C h.pj0 / > r h.pj / C .1 r/h.pi / C .1 r/h.pj / C r h.pi /
D h.pj / C h.pi /:
t
u
This lemma provides the essential justification for the following idea: For any
d 2 D./ the propositions in R.e
d / that are used for e
d will be contained in elements
of . The idea now is that we can restrict our search to partitions built from
propositions B that are formed by means of complements and intersections from .
The following definition introduces descriptions defined by orderings within
repertoires which are built using set differences.
Definition 10.5. Let be a repertoire with n elements.6 A one-to-one mapping
aW f1; : : : ; ng ! is called an ordering of . Given an ordering a of , the
description da for by the ordering a is defined as
da .!/ D a.1/ for every ! 2 a.1/;
da .!/ D a.2/ for every remaining ! 2 a.2/; i.e., for ! 2 a.2/ n a.1/;
da .!/ D a.3/ for every remaining ! 2 a.3/; i.e., for ! 2 a.3/ n .a.1/ [ a.2//;
da .!/ D a.n/ for ! 2 a.n/ n .a.1/ [ : : : [ a.n 1//:
Any description d such that d D da for an ordering a of , is called a description
by ordering of . Note that d .!/ D a.k/ for k D minfi W ! 2 a.i /g.
Proposition 10.7. For any finite repertoire7 the minimum
minfI.d /W d description in g
is obtained at a description da by ordering of .
d / is a partition,
Proof. Let d be any description for , R.e
d / D fD1 ; : : : ; Dn g. R.e
n
P
h.p.Di //, and we may in addition assume that p.D1 / : : : p.Dn /.
I.d / D
i D1
Define A1 WD d.D1 /.
Now consider the description d 1 defined by
(
d 1 .!/ D
A1
for ! 2 A1 ;
d.!/
otherwise:
6
7
135
I.d 1 / D h.p.D1 / C
n
X
p.A1 \ Di // C
i D2
n
X
i D2
d .!/ D
A2
1
d .!/
for ! 2 A2 n A1 ;
otherwise:
Again Lemma 10.1 shows that I.d 2 / I.d 1 /. So we proceed until we obtain d n
with I.d n / : : : I.d 1 / I.d /; d n is a description by ordering of .
t
u
Note that in Proposition 10.7 we have not yet considered proper descriptions. The
condition of properness further constrains the order in which we subtract subsets.
This is the subject of the next two propositions.
S
Definition 10.6. For a repertoire we define A4 WD A n fB 2 W B Ag for
any A 2 and 4 WD fA4 W A 2 g n f;g called the difference repertoire of .
4 is a cover with the interesting property that for ! 2 and A 2 , we have
! 2 A4 if and only if A is minimal in ! .
The idea in the definition of 4 is the same that led to the definition of the
completion e
d of a description d . When we know that a person could also take the
description B A instead of A which would be more exact, we can assume that !
is not in B, when he describes ! just by A. So actually we can infer that ! 2 A4
when it is described by A. However, this kind of completion is only partial, because
in general 44 4 , i.e., 4 is not yet flat. The following is an example for this
(cf. Exercise 1).
Example 10.7. Let D f1; : : : ; 6g and
D ff1g; f1; 2g; f1; 3g; f3; 4g; f2; 4; 6g; f2; 3; 5; 6gg.
Then 4 D ff1g; f2g; f3g; f3; 4g; f2; 4; 6g; f2; 3; 5; 6gg;
44 D ff1g; f2g; f3g; f4g; f4; 6g; f5; 6gg;
444 D ff1g; f2g; f3g; f4g; f6g; f5; 6gg;
4444 D ff1g; f2g; f3g; f4g; f5g; f6gg:
t
u
136
t
u
b
I./ D minfI.d /W d description in 4 and N .d / D N ./g:
8
This proposition actually holds for arbitrary repertoires in the same way as Proposition 10.7, if
there is a description d 2 D./ which satisfies the additional condition in Definition 2.3.
137
0
4
0
then d.y/ D Ak D d.!/ ) k D minfi W ! 2 A4
i g ) d .y/ D Ak D d .!/.
Thus e
d D de0 and I.d / D I.d 0 /.
t
u
Example 10.8. We take the throwing of two dice X1 and X2 as our basic experiment,
i.e., .; ; p/ D D 2 . We consider a simple example of a repertoire . Take
A1 D f.5; 5/; .6; 6/g;
A2 D X1 D X2 ;
A4 D X1 D 1 [ X2 D 1;
A5 D ;
and
A3 D X1 5;
D fA1 ; A2 ; A3 ; A4 ; A5 g:
fA1 ; A3 ; A2 ; A4 ; A5 g leads to the partition fA01 ; A03 ; A02 ; A04 n .A02 [ A03 /; A05 g D 1
fA2 ; A3 ; A4 ; A2 ; A5 g leads to the partition
For each of these partitions we can calculate the information. The best of these six
possibilities is 4 with I.4 / 1; 7196.
The following is a slightly more complicated example:
Take
A1 D X1 ; X2 2 f2; 4; 6g; A2 D f.1; 1/; .6; 6/g; A3 D X1 C X2 2 f3; 5; 7; 9; 11g;
A4 D X1 C X2 D 4; A5 D X1 D X2 ; A6 D X1 D 5 [ X2 D 5; and
A7 D X1 C X2 D 5:
t
u
In many cases one obtains the minimum information description in by the
following recipe:
1. Consider the largest element in 0 WD 4 , i.e.,
A1 WD arg maxfp.A/W A 2 0 g:
2. Define 1 WD fA n A1 W A 2 0 g and repeat, i.e.,
A2 WD arg maxfp.A/W A 2 1 g and 2 WD fA n A2 W A 2 1 g:
138
The following example shows that this procedure does not always work.
Example 10.9. Let .; ; p/ D E34 and D fA1 ; : : : ; A6 g with
A1 D f1; : : : ; 18g;
A2 D f19; : : : ; 34g;
A3 D f1; : : : ; 8; 31; 32; 33; 34g;
A4 D f9; : : : ; 16; 27; 28; 29; 30g;
A5 D f17; 23; 24; 25; 26g;
A6 D f18; 19; 20; 21; 22g:
Beginning with A1 we obtain a description d1 with d1 .!/ D A1 for ! 2 A1 and
so on. This is the description we obtain with the rule of thumb. We get I1 D
I.d1 / D log 34 18
log 18 16
log 4 D 1:93868. However, beginning with A2
34
34
we obtain a description d2 with d2 .!/ D A2 for ! 2 A2 and so on, leading to
I2 D I.d2 / D log 34 18
log 18 16
log 8 D 1:46809.
t
u
34
34
10.5 Exercises
1) Calculate I, b
I, N , b
S, and S for the repertoires in Examples 9.3, 9.7, 9.9,
10.7, 10.8, and 10.9.
2) For the repertoires in Examples 9.7, 9.9, 10.7, 10.8, and 10.9, which are not
tight, maximize N .d / I.d / and E.N d=N e
d / on D./.
3) What is the surprise of a chain D fA1 ; : : : ; An g, what is the information?
What is the maximal surprise of a chain of n elements, what is the maximal
information?
4) What is the surprise of a continuous chain, i.e., a repertoire X for a continuous
random variable X ? Compare this result with Exercise 3).
References
139
References
Palm, G. (1975). Entropie und Generatoren in dynamischen Verbanden. PhD Thesis, Tubingen.
Palm, G. (1976a). A common generalization of topological and measure-theoretic entropy.
Asterisque, 40, 159165.
Palm, G. (1976b). Entropie und Erzeuer in dynamischen Verbanden. Z. Wahrscheinlichkeitstheorie
verw. Geb., 36, 2745.
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Chapter 11
141
142
repetition of experiments), which has a probability close to 0 for the old theory
and close to 1 for the new theory. Another common version of this procedure is a
statistical test: In this case, the proponent of the new theory bets not just on one
proposition A, but on a chain of propositions A1 A2 A3 : : : An , each of
which is much more likely for the new theory than for the old one. If ! 2 Ak n Ak1
happens, he reports pold .Ak / as the significance of his result.1
Now we want to assume that both, the old probability q and the new
probability p are known, and want to construct a set A for which the novelty Npq ./
is maximal (with D fA; g). Then we consider the same problem for chains
D fA1 ; : : : ; An ; g. This is done in Sect. 11.3.
Positivity: I.j/ 0,
Identity: I.j/ D 0,
Two-way monotonicity: ) I.j / I.j / and I. j/ I. j/,
Additive symmetry: I./ C I.j/ D I./ C I.j/.
Note that requirement 4 is necessary for additivity2 :
I./ C I.j/ D I. _ / D I./ C I.j/:
143
min I.ajb/
b2D./ a2D./
and
N .j/ WD max
min N .ajb/:
a2D./ b2D./
When we consider the sets R and F of repertoires and flat covers, respectively,
we observe that identity is easily fulfilled, monotonicity is half-fulfilled. It holds
in the first argument, but not in the second, if we use the proper ordering, namely
4 (this will be defined and analyzed in Chap. 16) for I and for N . Additive
symmetry is not fulfilled. This is also shown in the next example.
Example 11.1. In general we dont get I.j/ D 0 with the minmax construction:
Consider .; ; p/ D D 2 , X1 D first coordinate, X2 D second coordinate, and
e1 [ X
e 2 . For any .i; j / 2 there are exactly two choices from , namely
DX
X1 D i and X2 D j .
For any b 2 d./ and any ! 2 , we take ab .!/ b.!/. Thus e
ab .!/ \ e
b.!/
ab .!/ \ b.!/ D f!g. Then
ab je
b/ D N .e
ab \e
b/N .e
b/ D log 36N.e
b/ log 36log 12 D log 3
I.ab jb/ D N .e
and
N .ab jb/ D N .ab \ b/ N .b/ D log 36 log 6 D log 6:
Thus
min max I.ajb/ min I.ab jb/ log 3 > 0
b2D./ a2D./
b2D./
and similarly for N . Thus the minmax definition of I or N would not satisfy
identity.
To construct a counterexample against additive symmetry, we have to remember:
I./ C I.j/ D min I.b/ C max min I.ajb/;
b
144
e 1 and D X
e1 [ X
e 2 . Then from Definition 11.1 we find
Now let D X
I./ C I.j/ D log 6 C log 6;
N ./ C N .j/ D log 6;
t
u
Still the two definitions of conditional information and novelty may be quite
interesting when they are interpreted in a game-theoretical way:
Player A chooses a proper choice from and player B a proper choice from ,
A with the goal to minimize information and to maximize novelty and B the opposite goal. In this book I do not want to explore this game-theoretical interpretation
further.
When we are considering the set T of templates, there is only one proper choice
from and from . So we can forget about the minima and maxima and expect a
reasonable theory (in particular, looking back at Sect. 3.3).
Proposition 11.1. On T we have
i) I.j/ D 0 and N .j/ D 0,
ii) implies I.j / I.j /; I. j/ I. j/; N .j / N .j / and
iii) I./ C I.j/ D I. / and N ./ C N .j/ D N . /:
for any tight covers , , and . In particular, the same holds for partitions , ,
and .
Proof. Let a be the only proper choice from , and b the same for , and c for .
(i) Obviously I.aja/ D N .aja/ D 0.
a and
(ii) N .a/ C N .bja/ D N .a \ b/ by Proposition 3.4 and the same holds for e
e
b and therefore for I (Proposition 3.6).
(iii) implies a b and therefore p a.!/jc.!/ p b.!/jc.!/ for every
! 2 , so N .ajc/ N .bjc/. Similarly, the assertions on I have been shown
for tight descriptions (Propositions 3.5 and 2.6).
t
u
With this proposition we have extended all essential properties of classical
information to the information I on T, and almost all properties (except monotonicity in the second argument) to the novelty N on T. Indeed, we cannot get this
monotonicity for N , and equivalently we cannot get subadditivity, as is shown in
the following example.
Example 11.2. N is not subadditive on T. Let .; ; p/ D E16 , and
D fi; : : : ; 16gW i 2 ;
D f1; : : : ; i gW i 2 ;
D fi; : : : ; j gW i j 2 fi gW i 2 :
So N ./ D N ./
N ./ C N ./.
1
ln 2
145
This discussion shows that it is not at all straightforward how to define mutual
information for arbitrary covers, as long as we do not have additive symmetry,
because it may happen that the expressions I./ C I./ I. _ /, I./ I.j/,
and I./ I.j/ all give different results. For templates we can safely define
mutual information, because of Proposition 11.1.
We now define the mutual information or transinformation T .; / and the
mutual novelty M.; / for templates.
Definition 11.2. For any two templates and we define
T .; / WD I./CI./I. /
and
and I D Id ; I D Id ; I D Id
the corresponding random variables (defined in Definitions 2.5 and 2.13). Then
!
p e
d
i) T .; / D E I C I I D E log
0 and
p e
d p e
d
p d
.
ii) M.; / D E N C N N D E log
p d p d
Proof. Everything is obvious from the definitions, the positivity in (i) was shown in
Proposition 3.9.(ii).
t
u
We observe that it is possible that M.; / < 0 (see Example 11.2).
Proposition 11.3. Let ; be templates. Then
i)
ii)
iii)
iv)
v)
vi)
vii)
viii)
146
147
d /:
IGpq ./ WD Gpq .e
It is obvious that this definition of novelty gain coincides with Definition 11.3.
Proposition 11.4. Let p and q be probabilities on . If is a partition, then
Gpq ./ D IGpq ./:
Proof. There is only one choice d 2 D./ and d D e
d by Proposition 2.2.
t
u
This proposition shows that IG could as well be defined as the novelty gain of
partitions. By analogy we can now define the surprise loss SL as the novelty gain
of chains. This is also the idea explained in the introductory example.
We use the name surprise loss because the semantics of information and
surprise appears to be opposite: when you gain information you loose surprise. With
this idea it is possible to define G, SL, and IG even for arbitrary infinite covers.
Definition 11.5. For two probabilities p and q on and for an arbitrary cover ,
we define
SLpq ./ WD sup Gpq .d /W [ ; finite chain and
IGpq ./ WD sup Gpq .d /W ./; finite partition :
Definition 11.5 defines information gain and surprise loss as the solutions of
two optimization problems: namely to find a partition with maximal information
gain and to find a chain with maximal surprise loss. The second problem is the one
we posed in the introductory example. Proposition 11.7 gives the solution to these
problems.
148
Proposition 11.5. Definition 11.5 for IGpq ./ is consistent with Definition 11.4 for
templates.
Proof. Let be a template and d the corresponding description. Let ./
be a finite partition and d the corresponding description. We consider one ! 2 .
d .!/ D B n A, where both A and B are unions of elements of \ .
Since is a template, for any ! 0 2 A we have d A, so d .! 0 / d .!/
because ! A. On the other hand, d .!/ B and thus e
d .!/ B n A D d .!/.
Thus IG.d / IG.d / by Proposition 3.2.
t
u
Proposition 11.6. For any cover , we have
SLpp ./ D 0 D IGpp ./:
t
u
Proof. Obvious.
p.d .x//
:
q.d .x//
This shows (i) in the finite case. For the infinite case, we need an approximation
argument.
In (i) and (ii) we have to show that the supremum in Definition 11.6 is actually
attained at the given formula. In (iii) because of Proposition 11.8 we only have to
show the first inequality. We first observe that for any proposition B 2 , we can
Eq .1B f /
p.B/
interpret
D
DW fNB as the average of f on B with respect to
q.B/
q.B/
149
the probability q. More formally, we can define the conditional probability qB (as
p.B/
in Chap. 3 and observe that
D EqB .f /.
q.B/
(i) By definition or construction of the expectation E, it is sufficient to take any
finite partition D fA1 ; : : : ; An g and show that Gpq ./ Ep .log2 f / since
Ep .log2 f / can be approximated by sufficiently large partitions. Now
Gpq ./ D
n
X
p.Ai / log2
i D1
p.Ai /
:
q.Ai /
We use the trick of Proposition 11.8 again and take the natural logarithms to
show
n
X
p.Ai /
Ep .ln f /:
p.Ai / ln
q.Ai /
i D1
Indeed
n
X
X
p.Ai /
q.Ai /EAi .f / ln EAi .f /
D
q.Ai /
i D1
n
p.Ai / ln
i D1
and
Ep .ln f / D Eq .f ln f / D
n
X
q.Ai /EAi .f ln f /:
i D1
n
X
i D1
X
p.Bi /
D
p.Ai / log2 EBi .f /:
q.Bi /
i D1
n
p.Ai / log2
150
In order to maximize this, the sets Ai should be chosen and ordered in such
a way that EBi .f / becomes as large as possible. Since Bi C1 includes Bi
and Bn D (implying EBn .f / D 1) the best choice is to have the largest
values of f on A1 and generally f should be larger on Ai than on Ai C1 . This
implies EAi .f / EAi C1 .f / and also EBi .f / EBi C1 .f /. Incidentally, it
also implies that EBi .f / EAi .f / since Bi contains on average larger values
of f than in Ai . This already shows (iii).
Furthermore, if we can split a set Ai into two sets A0i and A00i with EA00i .f /
EA00i .f /, the surprise gain will increase, because it can only become larger on
A0i and remains the same on A00i .
(iii) We show that for every x 2 in fact
F .x/ D p.f .x//=q.f .x// D Eq .f jf f .x// f .x/:
Indeed,
Eq .f jf f .x// D Eq .f 1f f .x/ /=qf f .x/
D Ep .1f f .x/ /=qf f .x/
D p.f .x//=q.f .x//:
t
u
t
u
Example 11.4. Take D 0; 1 with the usual Lebesgue measure q. Let p be given
by a density f with respect to q. Define
WD fx; x C W 0 x 1 g [
0; 2 ; 1 2 ; 1
t
u
151
Definition 11.6. For any two probability distributions p and q on .; /, we define
i) IG.p; q/ WD IGpq ./ and
ii) SL.p; q/ WD SLpq ./
These definitions use the concepts of information gain and surprise loss to define
measures for the distance between two probability distributions. Information gain
IG is the same as the KullbackLeibler distance, whereas SL is a new distance
measure.
Proposition 11.9. Let .; / be a measurable space and p; q be two probabilities
on .; /, and let p have a density f with respect to q. Then
i) IG.p; q/ D IGpq ./ D Ep .log2 f /,
ii) SL.p; q/ D Gpq .f / D Ep .log2 p f log2 q f / and
iii) SL.p; q/ IG.p; q/ 0:
Proof. This follows directly from Proposition 11.7. Observe that
log2 p f log2 q f D log2 F
for the function F defined in Proposition 11.7.
t
u
3x 2 log2 .3x 2 / dx
0
Z1
SL.p; q/ D
3x 2 log2 .pf x/ log2 .qf x/ dx; so
Z1
ln 2 IG D
Z1
2
3x 2 ln 3 dx
3x 2 ln x dx C
0
1
2
D 6 2 C ln 3 D ln 3 0:432
3
3
and IG 0:623I
Z1
ln 2 SL D
0
3x 2 ln.1 x 3 / ln.1 x/ dx
152
yD1x 3
Z1
Z1
3.1 y/2 ln y dy
ln y dy
0
D 1 C 3 6
1
1
1
1
C 3 D C D 0:833
4
9
2
3
and SL 1:202:
For comparison we can also compute IG.q; p/ and SL.q; p/:
Z1
ln 2 IG D
ln.3x 2 / dx D 2 ln 3 0:9014
0
Z1
ln 2 SL D
ln.qf x/ ln.pf x/ dx
Z1
D
0
ln x ln x
Z1
dx D 2
ln x dx D 2
0
It is plausible that these two values are larger than the other two computed before,
because the density of q with respect to p has much larger values than the density
f .x/ D 3x 2 of p with respect to q.
Similar effects can also be seen by maximizing the surprise loss SLpq ./ for a
simple bet D fA; g (see also Exercise 4).
t
u
The transinformation T .X; Y / between two random variables X and Y , which
was introduced in Chap. 4, can obviously also be regarded as the information gain
between the common distribution of X and Y (i.e., the distribution of .X; Y /) and
the product of the distributions of X and of Y . Thus Proposition 3.9.(ii) which
shows the positivity of the transinformation, can be regarded as a special case of
Proposition 3.1 or 11.8. We can use this observation to extend the definition of
transinformation from discrete to continuous random variables. This will be carried
out in the next section.
153
Shannon 1948), in spite of the fact that by any reasonable definition information
will be infinite for such variables because events can occur with arbitrarily small
probabilities.
The idea is to use only conditional information as defined in Definition 11.1
and
random variable X by the cover (-algebra) .X / D
to describe a continuous
e as in Chap. 3. With
fX aW a 2 Rg instead of simply using the description X
this idea we can reproduce all basic theorems, i.e., Proposition 3.9.
Unfortunately we cannot directly define I.X / by I..X //, because usually
.X / is not a repertoire. So we have to use the method of definition that we
used for information gain (Definition 11.5) and work with finite partitions or finite
subalgebras of .X /.
Definition 11.7. Let X; Y be arbitrary random variables. We define
I.X / WD supfI./ W partition; .X /g;
I.X jY / WD sup .X / inf .Y / I.j/;
where both sup and inf are extended over all partitions.
Clearly for discrete random variables X and Y these definitions coincide with
Definition 3.4.
There is a connection to another common definition of conditional information
that is worth mentioning.
For A 2 , p.A/ 0 and a random variable X we can define the random
variable5 p.AjX /. Based on this, for a partition we define the random variable
P
I.jX / WD A2 p.AjX / log2 p.AjX / and
I.jX / WD E.I.jX //.
Proposition 11.10. For two random variables X and Y we have I.Y jX / D
supfI.jX / W finite partition .Y /g.
Now we can show a useful continuous analog to Proposition 3.9.
Proposition 11.11. Let U , X , Y , and Z be arbitrary random variables.
i) I..X; Y /jU / I.X jU / C I.Y jU /;
ii) X 4 Y implies I.X jU / I.Y jU / and I.U jX / I.U jY /;
iii) I..X; Y /jU / D I.X jU / C I.Y j.X; U //:
Proof. These statements follow directly from their discrete analogs in Proposition 3.9.
t
u
We can use the idea of Definition 11.5 also for an extension of our previous
Definition 11.2 of transinformation.
5
p.AjX/ is a random variable that depends on the value of X, i.e., p.AjX/ D f .X/, where
f .x/ D p.AjX D x/ which can be properly defined for almost every x 2 R.X/.
154
pX D x; Y D y
pX D x pY D y
and
e \Y
e/
p.X
T .X; Y / D I.X / C I.Y / I.X; Y / D E log2
D E.log2 f /:
e / p.Y
e/
p.X
(ii) In the general case we simply have to observe that our definitions fit together:
T .X; Y / is defined in Definitions 11.9 and 11.8, the information gain is defined
in Definitions 11.6 and 11.5. Both definitions reduce the calculation to finite
partitions of R2 , where we have shown the equality in (i). In order to see that
the two resulting suprema are the same, we need an approximation argument
and Proposition 3.2.
t
u
155
(see Pearlmutter and Hinton 1987; Linsker 1989b; Atick 1992; Deco and Obradovic
1996; Hinton and Ghahramani 1997; Dayan and Abbott 2001; Kamimura 2002;
Erdogmus et al. 2003; Ozertem et al. 2006; Coulter et al. 2009; Brown 2009).
In general, expressions of information gain or conditional information are used
quite often for optimization in pattern recognition (Amari 1967; Battiti 1994; Amari
et al. 1996; Amari and Nagaoka 2000; Principe et al. 2000; Torkkola and Campbell
2000; Hyvarinen 2002; Mongillo and Den`eve 2008).
Actually, one can distinguish three different but related lines of argumentation
that converge on the use of information theory for a better understanding of learning
algorithms:
1. The statistical or Bayesian approach (e.g., MacKay 2005): Here the idea is
essentially that what is learned is a common distribution p.x; y/ of the two
variables (or sets of variables) X and Y , often called data and labels, whose
relationship has to be learned. Here it is common to use the KL distance to
measure the distance between the currently learned distribution and the true
distribution p.
2. The statistical physics approach is actually very similar, but arises from the
tradition in statistical physics to use the principle of maximal ignorance
(Jaynes 1957, 1982), which then leads to approaches that maximize entropy
(i.e., information) or transinformation.
3. Approaches that try to understand and mimic biological learning processes. Here
the idea often is that biological learning has the goal to optimize the neural
representation of the learning situation, i.e., of the values of the variables X
and Y , now interpreted as stimulus or stimulus situation and response of
the animal. Very often this leads again to maximization of the transinformation
between the neural representation Z and the variable X or Y or both (e.g., Barlow
1989; Atick 1992; Linsker 1989b, 1992, 1997; Zemel and Hinton 1995).
For these purposes we do not need to consider repertoires. However, this chapter
also introduces a new novelty-based measure that could be used for learning or
optimization: the surprise loss which could be used in a similar way as information
gain. For these applications the most important results are Propositions 11.7 and
11.9. In this book we do not try to elaborate these possibilities further.
156
A partial question may be how much information the data X provide about the labels
Y , i.e., the transinformation or the information gain between the joint distribution
of X and Y and the product of their individual distributions. The learning process,
i.e., the process of approximating the joint distribution by a sequence of distributions
that are calculated from the data, can often be well described again in terms of
the information gain between these distributions (Amari 1982, 1985; Amari and
Nagaoka 2000; MacKay 2005). Practically all recent applications of information
theory to such data-driven learning problems in the fields of pattern recognition,
machine learning, or data mining are of this type. They are based on the use of
information gain or KullbackLeibler distance to measure the distance between
probabilities.
In the life sciences there are two particular fields of application where arguments
based on information terminology have a particular additional appeal: the neurosciences (which are the subject of the next section) and molecular and cell biology
where the obvious link is the information contained in the DNA (Herzel et al. 1994;
Schmitt and Herzel 1997; Grosse et al. 2000; Weiss et al. 2000; Slonim et al. 2005;
Taylor et al. 2007; Tkacik and Bialek 2007; Mac Donaill 2009).
11.7 Exercises
1) Given two probabilities p and q on a finite find a function X on for which
Ep .Nq X / is maximal! Do the same for Gpq .X / What is the difference?
2) Compare the surprise values obtained in Exercise 1) with SL.q; p/, SL.p; q/,
IG.q; p/, and IG.p; q/! Are all relations between these numbers possible?
3) Give examples for two continuous probabilities p q on 0; 1 such that
a) IG.p; q/ D IG.q; p/,
b) SL.p; q/ D SL.q; p/.
4) For D f0; 1g, p D .0:7; 0:3/, and q D 13 ; 23 determine the chain with maximal subjective surprise. Compare this to Exercise 1)
5) Given n independent random variables X1 ; : : : ; Xn what is the novelty and what
the surprise of X1 \ X2 \ : : : \ Xn ?
The solution is of interest for the evaluation of a number of statistical tests
for the same hypothesis undertaken in independent settings (for example,
by different research groups). For this reason we will give the result here:
The novelty for a particular event x is of course the sum of the novelties
N Xi Xi .x/, and if this sum is s, then the surprise is
2s
n1
X
.s ln 2/k
kD0
References
157
References
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic
Computers, 16(3), 299307.
Amari, S. (1982). Differential geometry of curved exponential familiescurvature and information
loss. Annals of Statistics, 10, 357385.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. USA: AMS and Oxford
University Press.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal
separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural
Information Processing Systems (Vol. 8) (pp. 757763). Cambridge: MIT Press.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing?
Network: Computation in Neural Systems, 3, 213251.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295311.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning.
Neural Networks, 5, 537550.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart
and Winston.
Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings
of the 12th international conference on artificial intelligence and statistics (AI-STATS 2009).
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.
Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensinga new class
of self-organizing coding models for neuroscience. arXiv:0906.1202v1.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. London: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical
modeling of neural systems. MA: MIT Press.
Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing.
New York: Springer.
Erdogmus, D., Principe, J. C., & II, K. E. H. (2003). On-line entropy manipulation: Stochastic
information gradient. IEEE Signal Processing Letters, 10(8), 242245.
Grosse, I., Herzel, H., Buldyrev, S., & Stanley, H. (2000). Species independence of mutual
information in coding and noncoding DNA. Physical Review E, 61(5), 56245629.
Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats.
Physical Review E, 50(6), 50615071.
Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed
representations. Philosophical Transactions of the Royal Society B: Biological Sciences,
352(1358), 11771190.
Hyvarinen, A. (2002). An alternative approach to infomax and independent component analysis.
Neurocomputing, 4446, 10891097.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4),
620630.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE, 70,
939952.
158
Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.
Kolmogorov, A. N. (1956) On the Shannon theory of information transmission in the case of
continuoussignals. IRE Transactions on Information Theory, IT-2, 102108.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical
Statistics, 22(1), 7986.
Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between
input and output signals. Neural Computation, 1(3), 402411.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear
network. Neural Computation, 4, 691702.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input
distributions. Neural Computation, 9, 16611665.
MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. UK: Cambridge
University Press.
Mac Donaill, D. (2009). Molecular informatics: Hydrogen-bonding, error-coding, and genetic
replication. In 43rd Annual Conference on Information Sciences and Systems (CISS 2009).
MD: Baltimore.
Mongillo, G., & Den`eve, S. (2008). On-line learning with hidden Markov models. Neural
Computation, 20, 17061716.
Ozertem, U., Erdogmus, D., & Jenssen, R. (2006). Spectral feature projections that maximize
shannon mutual information with class labels. Pattern Recognition, 39(7), 12411252.
Pearlmutter, B. A., & Hinton, G. E. (1987). G-maximization: An unsupervised learning procedure
for discovering regularities. In J. S. Denker (Ed.), AIP conference proceedings 151 on neural
networks for computing (pp. 333338). Woodbury: American Institute of Physics Inc.
Principe, J. C., Fischer III, J., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.),
Unsupervised adaptive filtering (pp. 265319). New York: Wiley.
Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of
Theoretical Biology, 188(3), 369377.
Slonim, N., Atwal, G., Tkacik, G., & Bialek, W. (2005). Estimating mutual information and multiinformation in large networks. arXiv:cs/0502017v1.
Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.
Tkacik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers
(Ed.) Encyclopedia of complexity and systems science (pp. 719741). Berlin: Springer.
arXiv:0712.4385 [qbio.MN]
Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations.
In ICML 00: Proceedings of the Seventeenth International Conference on Machine Learning
(pp. 10151022). San Francisco: Morgan Kaufmann.
Weiss, O., Jimenez-Montano, M., & Herzel, H. (2000). Information content protein sequences.
Journal of Theoretical Biology, 206, 379386.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length.
Neural Computation, 7, 549564.
Part V
Chapter 12
161
162
its representation. Thus the issue of representation has to be studied with about
the same intensity as the problems of computation. In a way computation can be
regarded simply as a transformation between different representations of the same
information.
Here we do not address problems of computation but focus on the issue of
representation and on the use of information theory in this context. The first
question to be discussed is a methodological or even philosophical one: How is it
possible to use rather technical concepts from information theory to discuss issues
of representation in neuroscience and brain research?
In the 1950s and 1960s the use of information measurements in brain research
and experimental psychology became quite popular (e.g., MacKay and McCulloch
1952; Barnard 1955; Quastler 1956a,b; Attneave 1959; Barlow 1961; Wenzel 1961;
Miller 1962; Yovits et al. 1962; Gerstein and Mandelbrot 1964; Cherry 1966;
Perkel and Bullock 1967; Pfaffelhuber 1972; Abeles and Lass 1975; Massaro 1975;
Eckhorn et al. 1976; Uttley 1979; Johannesma 1981; Srinivasan et al. 1982); for
example, the information transmission rates of a single neuron, the optic nerve,
or of conscious reactions to stimuli were determined and discussed, also during
the following years. Similarly the information storage capacity of short-term, longterm and some other memories was discussed. In my early work, I used information
theory to investigate the optimal storage capacity of neural associative memories
(Palm 1980, 1987b; Palm and Sommer 1992) based on Hebbian synaptic plasticity
(Hebb 1949). This led me to the prediction that spiking activity in associative neural
populations should be sparse. An important theme in the early discussions was the
question of the neural code (see Perkel and Bullock 1967), i.e., whether the
neurons use a kind of Morse code, where patterns of exact time-intervals between
spikes are crucial, whether the single spikes of a neuron can be interpreted simply
as binary yes or no signals, or whether it is the vigor or frequency of the single
neurons spiking that signals some degree of intensity (either for the size of some
measured variable or for the certainty of a proposition).
This discussion has been revived and deepened in the last years from the 1990s
to the present (e.g., Optican and Richmond 1987; Linsker 1988; Barlow et al. 1989;
Bialek et al. 1991; Optican et al. 1991; van Essen et al. 1991; Atick 1992; Kjaer
et al. 1994; Shadlen and Newsome 1994; Tononi et al. 1994; Softky 1995; Dan et al.
1996; Gerstner et al. 1997; Golomb et al. 1997; Rieke et al. 1997; Rolls et al. 1997;
Tsodyks and Markram 1997; Brunel and Nadal 1998; Borst and Theunissen 1999;
Eckhorn 1999; Brenner et al. 2000; Panzeri and Schultz 2001; Nakahara and Amari
2002; Adelman et al. 2003; Seri`es et al. 2004; Dean et al. 2005; Butts and Goldman
2006; Butts et al. 2007; Gutnisky and Dragoi 2008; Koepsell and Sommer 2008;
Coulter et al. 2009; Wang et al. 2010). This is not surprising since the available
experimental evidence about neural activity, spike trains of single and multiple
neurons, and neural systems has increased dramatically over the last 50 years.
However, the discussion of the neural code still remained in the context of
classical experimental paradigms and traditional Shannon information (one notable
exception was Charles (Legendy 1975; Legendy and Salcman 1985; Legendy
2009), who used an early version of the idea of novelty introduced in this book).
163
One question which is still being discussed concerns spike rate vs. single spike
timing: is it just the rate of spikes in a neurons output spike train that contains the
information it conveys (Abbott 1994; Dan et al. 1996; Deadwyler and Hampson
1997; Gerstner et al. 1997; Golomb et al. 1997; Kang and Sompolinsky 2001;
Kjaer et al. 1994; Nirenberg and Latham 2003; Optican et al. 1991; Optican
and Richmond 1987; Panzeri and Schultz 2001; Panzeri et al. 1999; Seri`es et al.
2004; Shadlen and Newsome 1998; Treves and Panzeri 1995) or is there additional
information in the precise timing of individual spikes? Since there is no clock in
the brain, the second alternative requires a temporal reference. This could be in the
preceding spikes of the neuron itself. This idea leads to the observation of suspicious
interspike interval patterns (Abeles et al. 1993; Abeles and Gerstein 1988; Baker and
Lemon 2000; Dayhoff and Gerstein 1983a,b; Grun et al. 2002a,b, 1999; Tetko and
Villa 1992; Martignon et al. 1994, 2000, 1995). Or the reference could be spikes
of other neurons. This leads to the idea of spike patterns across populations of
neurons, which would be very hard to observe experimentally, or at least to the very
common idea of coincidence or synchronicity of spikes in two or more neurons
which could be measured by correlation (Tsodyks et al. 2000; Engel et al. 2001;
Grun et al. 1994b; Konig et al. 1995; Palm et al. 1988). Today there is a lot of
experimental evidence for the importance of both spike rates and synchronicity.
The idea of a population code as opposed to single neuron codes, be it in terms
of spike frequencies, ordering of spike latencies (Perrinet et al. 2003; Thorpe
et al. 2004; Guyonneau et al. 2004; Loiselle et al. 2005) or spike coincidences
in the population, has also become quite popular. An important aspect here is the
sparseness of co-occurring spikes in a typical population of excitatory neurons
(see Palm 1980, 1982, 1987a; Palm and Sommer 1992; Hyvarinen and Karhunen
2001; Furber et al. 2007). The population idea has been considered to derive rules
for synaptic plasticity and learning. The guiding principle in these theories is that
the neural interconnectivity should be modified by synaptic plasticity (the synapses
form the connections between neurons) in such a way that it creates population
activities that maximize the information content about the stimuli (Barlow 1989;
Bell and Sejnowski 1995; Haft and van Hemmen 1998; Linsker 1989b,a, 1992,
1997; Yang and Amari 1997) that are believed to be represented in that neural
population or cortex area (for example, visual information in the visual cortex).
More recently, details of the precise timing of pre- and postsynaptic spikes and
the resulting influence on synaptic efficiency have been the focus of experimental,
information theoretical, and neural modeling studies (Dan and Poo 2006; Bi and
Poo 1998; Bialek et al. 1991; Kempter et al. 1999; Guyonneau et al. 2004; Hosaka
et al. 2008; Izhikevich 2007; Markram et al. 1997; Masquelier et al. 2009; Morrison
et al. 2008; Pfister and Gerstner 2006; van Rossum et al. 2000).
Another issue is the variability of neural responses. The same neuron (in
the visual cortex for example) will respond to repetitions of the same (visual)
stimulus not exactly in the same way. There may be large variations in both the
rate and the fine timing of the spikes. This leads to the question of what is the
signal (i.e., information about the experimental stimulus) and what is the noise
164
(this may also be information about something else) in the neural spike train
(Abeles et al. 1995; Arieli et al. 1996; Bair and Koch 1996; Christodoulou and
Bugmann 2001; Butts and Goldman 2006; Knoblauch and Palm 2004; Mainen and
Sejnowski 1995; Shadlen and Newsome 1998; Softky and Koch 1992, 1993; Stevens
and Zador 1998).
The ensuing discussions can get technically quite involved and detailed, but still
they often try to avoid questions about the purpose of the activity of an individual
neuron or a cortical area for the information processing of the behaving animal. It
is very likely that the purpose of the primary visual cortex (V1), for example, is not
just to present visual information. This information is presented quite efficiently
in the 106 fibers of the optic nerve. In the visual cortex, there are at least two
magnitudes more neuronsto represent just the same information? It can be that
certain important features that are implicit in the visual information in the optic
nerve are made explicit in the representation in V1. These can become important
presumably in terms of the behavioral responses and goals of the animal and may be
used by other brain areas to produce reasonable behavior. It may even be (and there
is experimental evidence for this) that general signals concerning the state of the
whole animal and in particular its attentiveness and even motivational or emotional
aspects also contribute to the neural responses in V1. These would normally be
regarded as noise with respect to the representation of the experimental visual
stimulus.
From this kind of argument it becomes evident that the search for a neural
code, even in a rather peripheral area like V1, ultimately requires an integrative
information processing theory of the whole brain. In fact, a number of brain theories,
often for partial functionalities, have already been published (Palm 1982; Shaw and
Palm 1988; Edelman and Tononi 2000; Grossberg 1999; Hawkins and Blakeslee
2004; Hecht-Nielsen 2007) and can serve as a background for these ideas on coding
and information theory (see also Tononi et al. 1992, 1994).
The whole idea of a neural code may even be (and has been) criticized from
a philosophical point of view (e.g., Bar-Hillel and Carnap 1953, see also Palm
1985): The use of information theory may seem rather inconspicuous with respect to
the so-called mindbrain (or mindbody) problem, in particular, since information
theorists never hesitate to admit that information in the technical sense does not
concern the meaning of the messages. Thus information can be introduced almost
innocently in a scientific theory of the brain that ends with a theory of consciousness
or at least points towards such a theory. I do not wish to say that such a theory
is totally impossible (for example, I sympathize with the theory put forward by
Edelman (Edelman and Tononi 2000), but I think one should be aware of the points
in the argument where aspects of intentionality are brought in. The problem with
this use of information terminology in brain research is essentially that the concept
of information or entropy may appear to be an objective one (since entropy is, after
all, a concept of physics; see Chap. 14), but contains in fact a strong subjective
and teleological element, namely the choice of the repertoire, the description or
the partition through which the world is viewed. This aspect was rather hidden in
classical information theory. In this new broader approach it becomes more evident.
165
In this use of information theory in brain research, where the receivers and
senders of information are not people but perhaps parts of a brain and we can
only indirectly try to infer their purposes, let alone the code that is used to
transmit information between them, classical information theory may be a little
too restrictive. One would like to be able to deal more directly with instances
of information extraction, formation of languages or language-like descriptions
of physical events, the formulation of more goal-directed purposeful variants of
information, and the like. A need for this is expressed more or less explicitly in
many attempts to use information theoretical ideas in the theoretical understanding
of brains (e.g., Uttley 1979; Palm 1982; Optican and Richmond 1987; Barlow 1989;
Barlow and Foldiak 1989; Barlow et al. 1989; Abeles 1991; Coulter et al. 2009).
For example, the definition of the information contained in a stimulus (Butts 2003)
should somehow reflect the importance of this stimulus for the animal, or at least
the importance of the neurons responding to this stimulus.
I believe the concepts of novelty, surprise, description, and repertoire, as
developed in this book, could help us to find a better and more appropriate use of
information theory in brain research. It may be possible to use these new concepts
in order to get rid of some (implicit) assumptions about the neural structures that
are to be analyzed in an information theoretical way. My intention is not to criticize
the ubiquitous recent applications of information theory in neuroscience and brain
research [a good review was given in Borst and Theunissen (1999), see also the
recent books Gerstner and Kistler (2002); Kamimura (2002); Rieke et al. (1997)].
Rather I want to show, where the particular new concepts introduced here, which
bear a certain ambivalence between information- and significance-related concepts,
can be successfully employed to achieve a better understanding or a theoretical
underpinning of argumentations that have been put forward in these fields and that,
in fact, have partially inspired me to introduce these new concepts. My exposition
will mainly consist of various applications of the concepts of novelty and surprise
introduced in Chap. 10.
The concept of information is clearly very useful in the neurosciences. The
activation of a neuron in the brain is considered as a signal for other neurons; it
represents some message or partial description about the state of the outside world
or of the animal itself. The neuron derives its activation from its synaptic inputs
which come from other neurons or (sometimes) directly from sense organs. If we
take together the knowledge about the activations of all neurons in the brain into a
so-called activity state or activity vector, this should contain a rather comprehensive
description of the state of the environment and of the animal itself, in other words: of
the current situation. Such a representation is necessary as far as it concerns aspects
of the situation that are vital for the animal, because it is the basis on which the
animal has to choose, plan, and perform the proper action in the given situation.
Clearly every single neuron in the brain can only be monitoring very few
specific aspects of the whole situation, and these aspects can well be described
mathematically by a repertoire or by a description.
166
167
depolarizing and thus positive and inhibitory IPSPs, which are hyperpolarizing and
thus negative. This simple model leads to the equation
XX
hi .t Tni /;
(12.1)
D.t/ D
n
168
access to the spike train(s), if he records from several neurons simultaneously they
can be regarded as his afferents. In addition he also has information about the
outside world, which he can see through his own eyes, not through the eyes of the
animal. Often he records some particular features of this state of the world which
he believes to be interesting for the animal and perhaps even for the particular
neuron(s) he is recording from. Often he controls these features as a so-called
stimulus. Of course, it is well possible that the animal or even the recorded neuron
is interested in or reacts to other aspects of the experimental situation which are
not captured by the stimulus as defined by the experimenter. In addition to the
neural responses the experimenter may also record behavioral responses of the
animal. In behavioral experiments with trained animals, the animal is rewarded
for certain responses to the different experimental stimuli. In these cases one can
use information to measure how well the different stimulus response configurations
are differentiated by the animal or by the neuron or neural population recorded
from. Usually the experimenter will look out for interesting neural events in order
to correlate them with observations on the stimulus or the behavioral response. In
order to define an interesting neural event, he normally has to rely on the same
features that are available to the single neuron, i.e., on bursts and coincidences.
In multiunit recordings one may also try to combine these features into a one-dimensional
repertoire, using plausible models for the functions h_i and working with
(12.1). Thus the experimenter may create a burst repertoire, a pause repertoire, a
coincidence repertoire, and even a depolarization repertoire in order to evaluate
his data. In contrast to the neuron, however, the experimenter will not only use the
novelty created by these repertoires, but also the statistically more relevant surprise.
In the following we will briefly describe these repertoires.
Fig. 12.2 Burst novelty (in bits) as a function of time for two individual spike trains and a simulated
Poisson spike train. (a) Unstimulated neuron in a fly, (b) stimulated neuron in cat LGN, (c) Geiger
counter
was really surprisingly surprising (cf. Chap. 10). Some experimental and theoretical
analysis of this model was given by Palm (1981). For these calculations one needs
a probability distribution p on the set of possible spike trains. This distribution
should reflect some of the statistical properties of the observed spike trains, but only
to a certain degree. It should also be possible to regard it as the naive distribution
against which the surprise of the actually observed spike train can be measured. For
this reason I have considered the Poisson distribution on this set for the calculation of
burst novelty and surprise shown in Figs. 12.2 and 12.3.
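A minimal sketch of this computation (assuming a homogeneous Poisson spike train with
known rate as the naive distribution; the function name and example numbers are mine):

import numpy as np
from scipy.stats import poisson

def burst_novelty(n_spikes, duration, rate):
    """Novelty (in bits) of seeing n_spikes within `duration` seconds,
    measured against a naive homogeneous Poisson spike train of the given
    rate: N = -log2 P[at least n_spikes in the interval]."""
    mu = rate * duration
    return -np.log2(poisson.sf(n_spikes - 1, mu))   # sf(k, mu) = P[X > k]

# e.g. 12 spikes within 50 ms from a neuron firing at 10 spikes/s on average
print(burst_novelty(12, 0.05, 10.0))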
A slightly more general model is the so-called renewal process (see also Grun
et al. 1994b; Grun and Rotter 2010). Here we observe the so-called interspike
interval distribution, i.e., the probability distribution of the time intervals between
successive spikes, D = T_{n+1} - T_n, and we assume that this random variable D is
[Fig. 12.3: burst surprise plotted against burst novelty (computer simulation for a Poisson spike train)]
independent of n and, in addition, that subsequent interspike intervals are independent.
In this case, the variables T_n - T_{n-k} have the same distribution for all
n, and this distribution is the distribution of the sum of k identical independent
versions of D. These distributions are quite well understood in probability theory
(e.g., Doob 1953). A special case is the Poisson distribution, which arises from the
assumption that D is exponentially distributed, i.e., p[D \le t] = 1 - e^{-\lambda t}. Even
for this simple distribution, I have not been able to calculate the relation between
burst novelty and burst surprise analytically. So I have used computer simulation to
obtain Fig. 12.3.
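A rough sketch of such a simulation (not the program used for Fig. 12.3; the rate,
observation time, and burst-novelty convention are assumptions of mine): simulate many
Poisson spike trains, record the maximal burst novelty of each, and read off the surprise
of a novelty value s as -log2 of the fraction of trains whose maximal novelty reaches s.

import numpy as np
from scipy.stats import poisson

RATE, T_OBS, N_SIM = 5.0, 5.0, 500      # spikes/s, observation time (s), trains
rng = np.random.default_rng(0)

def max_burst_novelty(t, rate):
    """Largest burst novelty -log2 P[Poisson(rate*d) >= n] over all groups of
    n consecutive spikes spanning a duration d within one spike train t."""
    best = 0.0
    for i in range(len(t)):
        for j in range(i + 1, len(t)):
            n, d = j - i + 1, t[j] - t[i]
            best = max(best, -np.log2(poisson.sf(n - 1, rate * d)))
    return best

# distribution of the maximal burst novelty under the Poisson null model
novelties = []
for _ in range(N_SIM):
    n_spikes = rng.poisson(RATE * T_OBS)
    spikes = np.sort(rng.uniform(0.0, T_OBS, n_spikes))
    novelties.append(max_burst_novelty(spikes, RATE))
novelties = np.array(novelties)

def burst_surprise(s):
    """Surprise of observing a maximal burst novelty of at least s bits."""
    return -np.log2(max((novelties >= s).mean(), 1.0 / N_SIM))

for s in (5.0, 10.0, 15.0):
    print(s, burst_surprise(s))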
= \{ (T_n^i) : n \in \mathbb{N},\ i = 1, \ldots, m \}.
The coincidence repertoire is defined as

\mathcal{C}_t = \{ C_t^{k,\tau} : k \in \mathbb{N},\ \tau > 0 \},    (12.2)

where

C_t^{k,\tau} = \Big[ \sum_n \sum_i 1_{[t-\tau,\,t]}(T_n^i) \ge k \Big].

For the statistical evaluation we divide the observation period into L time bins, write
x_i^j \in \{0,1\} for the activity of site j in bin i and n_j for the total number of spikes of
site j, and consider

p_k(n) = p\Big[ \sum_{i=1}^{L} \prod_{j=1}^{k} x_i^j = n \;\Big|\; \sum_{i=1}^{L} x_i^j = n_j \ \text{for } j = 1, \ldots, k \Big]    (12.3)

and

P_k(n) = \sum_{i=n}^{n_k} p_k(i).    (12.4)
Here x_i^j \in \{0,1\} and therefore \prod_{j=1}^{k} x_i^j = 1 exactly if there is a k-coincidence,
so the first sum in (12.3) counts the k-coincidences, whereas the other sums count the
number of spikes at the j-th site during the observation. This probability was first
calculated for k = 2 in Palm et al. (1988) and was incorporated as an analysis
tool in early versions of the Joint Peri-Stimulus-Time-Histogram (JPSTH) program
(Aertsen et al. 1989); it was later computed for higher values of k by Grun et al. (2002a,b) and
extended to more general recording conditions (Grun et al. 1999; Gutig et al. 2002).
The calculation is comparatively easy if we proceed by induction on k. The fact
that we don't know the parameters p_j doesn't matter, because all binary sequences
(x_1^j, \ldots, x_L^j) containing n_j ones are equally probable and therefore we can use
combinatorial counting arguments for these \binom{L}{n_j} sequences.
Obviously, for k = 1 we have

p_1(n) = \begin{cases} 1 & \text{if } n = n_1, \\ 0 & \text{otherwise.} \end{cases}

Assume that we know p_{k-1}(n) for all n (obviously p_{k-1}(n) = 0 for n > n_1).
To calculate p_k(n) we observe that for getting exactly n k-coincidences we have
to have at least n coincidences in the first k - 1 spike sequences and at least n spikes
in the k-th sequence, and if we have these (k-1)-coincidences in i \ge n places,
then exactly n of the n_k spikes of the k-th sequence have to be in these i places; the
remaining ones have to be in the remaining L - i places.
Thus

p_k(n) = \sum_{i=n}^{n_1} p_{k-1}(i)\, \frac{\binom{i}{n}\binom{L-i}{n_k-n}}{\binom{L}{n_k}} \qquad (\text{for } n \le n_k);    (12.5)

in particular,

p_2(i) = \frac{\binom{n_1}{i}\binom{L-n_1}{n_2-i}}{\binom{L}{n_2}}

and

P_2(n) = \sum_{i=n}^{n_2} \frac{\binom{n_1}{i}\binom{L-n_1}{n_2-i}}{\binom{L}{n_2}}.
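The recursion (12.5) can be implemented directly, for instance as follows (a sketch;
the function names and the example call are mine):

from math import comb

def coincidence_probabilities(counts, L):
    """p_k(n) of (12.5): probability of exactly n k-fold coincidences, given
    that site j fires n_j times within L time bins (counts = [n_1, ..., n_k])."""
    n1 = counts[0]
    p = {n1: 1.0}                                  # p_1(n) = 1 iff n = n_1
    for nk in counts[1:]:
        p = {n: sum(p_i * comb(i, n) * comb(L - i, nk - n)
                    for i, p_i in p.items()) / comb(L, nk)
             for n in range(nk + 1)}
    return p

def tail_probability(counts, L, n):
    """P_k(n): probability of at least n k-fold coincidences."""
    p = coincidence_probabilities(counts, L)
    return sum(v for m, v in p.items() if m >= n)

# e.g. two sites with 10 spikes each in 100 bins: P_2(4) is the chance of four
# or more coincidences, and -log2 of it the corresponding coincidence novelty
print(tail_probability([10, 10], 100, 4))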
The formulae become a bit simpler if we assume that the sites have been
reordered in such a way that n_1 \le n_2 \le n_3 \le \ldots \le n_k and furthermore that spikes are
relatively sparse such that n_1 + n_k \le L.
In this case (writing i_1 := n_1),

p_3(n) = \sum_{i_2=n}^{n_1} \frac{\binom{i_1}{i_2}\binom{L-i_1}{n_2-i_2}}{\binom{L}{n_2}} \cdot \frac{\binom{i_2}{n}\binom{L-i_2}{n_3-n}}{\binom{L}{n_3}},

p_k(i_k) = \sum_{i_2=i_k}^{i_1} \cdots \sum_{i_{k-1}=i_k}^{i_{k-2}} \prod_{j=1}^{k-1} \frac{\binom{i_j}{i_{j+1}}\binom{L-i_j}{n_{j+1}-i_{j+1}}}{\binom{L}{n_{j+1}}},    (12.6)

and

P_k(n) = \sum_{i_2=n}^{i_1} \cdots \sum_{i_k=n}^{i_{k-1}} \prod_{j=1}^{k-1} \frac{\binom{i_j}{i_{j+1}}\binom{L-i_j}{n_{j+1}-i_{j+1}}}{\binom{L}{n_{j+1}}}.    (12.7)
12.5 Conclusion
We can use concepts from information theory in trying to understand the functioning
of brains on many different levels, ranging from the neuron to the entire organism or
person. On each level one can distinguish the novelty or the surprise obtained from
the incoming signals by the unit under consideration from the information provided
by these signals and transmitted to this unit. In every case this information and the
corresponding transmission channel capacity are larger than the surprise obtained.
The discrepancy between information and surprise is most obvious for the whole
organism, when we consider how little surprise we typically get out of how much
input information.
As for the relation between novelty and surprise in the brain, this issue is more
relevant for the statistical evaluation of neurophysiological observations. It has been
the subject of controversial discussions in the literature, in particular concerning the
significance of spatio-temporal spike-patterns (citations are collected in the technical comments below),
without the use of the terminology introduced in this book. Now this topic can be
formulated as a neat mathematical problem.
Problem: For a reasonable probability distribution for neural spike trains (like the
Poisson distribution, which can often serve as a good null hypothesis), and for all the
repertoires defined in this chapter, one should try to calculate the average surprise
S(\cdot) and the surprise statistics, i.e., prob[S \ge t] for all t \in \mathbb{R}.
This problem is actually quite easy for the Poisson distribution and most of the
repertoires (the neuronal, the coincidence, and the population repertoire); it is harder
(we do not know the analytical answer yet) for the burst repertoire. There are a few
results dealing with some instances of this problem that can be found in the literature
of the last 30 years. Most of these results are collected in the technical comments
below.
Once again, one can roughly identify the three quantities novelty, information,
and surprise introduced in this book with three viewpoints on brain activity:
novelty with the subjective view, seeing the world through the eyes of individual
neurons or neural populations (Letvin et al. 1959; Legendy 1975); information (or
transinformation) with the functional view of measuring the neuron's contribution to
the animal's experimental performance (Barlow 1961; Borst and Theunissen 1999;
Nemenman et al. 2008); and surprise with the physiological statistical view that tries
to find significant patterns of neural activation (Dayhoff and Gerstein 1983a; Abeles
and Gerstein 1988; Palm et al. 1988; Aertsen et al. 1989; Grun et al. 1994b, 2002a;
Martignon et al. 1995).
1. Signal vs. noise in the variability of neural responses: Abbott (1994); Brunel
and Nadal (1998); Butts (2003); Butts and Goldman (2006); Christodoulou
and Bugmann (2001); Golomb et al. (1997); Kang and Sompolinsky (2001);
Kjaer et al. (1994); Knoblauch and Palm (2004); Mainen and Sejnowski (1995);
Shadlen and Newsome (1994, 1998); Softky and Koch (1992, 1993); Softky
(1995); Stevens and Zador (1998); Nakahara and Amari (2002); Nakahara et al.
(2006); Hansel and Sompolinsky (1996).
2. Rate coding vs. fine timing of spikes: Abbott (1994); Abeles et al. (1995);
Aertsen et al. (1989); Aertsen and Johannesma (1981); Bach and Kruger (1986);
Bair and Koch (1996); Barlow (1961); Bethge et al. (2002); Bialek et al. (1991);
Brown et al. (2004); Cessac et al. (2008); Deadwyler and Hampson (1997); Dean
et al. (2005); Den`eve (2008); Eckhorn et al. (1976); Engel et al. (2001); Gerstein
and Aertsen (1985); Gerstner and Kistler (2002); Gerstner et al. (1997); Golomb
et al. (1997); Gutig et al. (2002); Kjaer et al. (1994); Konig et al. (1995); Kostal
et al. (2007); Krone et al. (1986); Kruger and Bach (1981); Legendy and Salcman
(1985); Mainen and Sejnowski (1995); Markram et al. (1997); Morrison et al.
(2008); Nakahara and Amari (2002); Nirenberg and Latham (2003); Palm et al.
(1988); Panzeri and Schultz (2001); Panzeri et al. (1999); Perkel and Bullock
(1967); Pfister and Gerstner (2006); Rieke et al. (1997); Schneideman et al.
(2003); Seri`es et al. (2004); Shadlen and Newsome (1994); Softky (1995); Softky
and Koch (1993); Tsodyks and Markram (1997); Vaadia et al. (1995).
3. Population code: Aertsen and Johannesma (1981); Amari and Nakahara (2006);
Barlow (1989); Bethge et al. (2002); Bialek et al. (2007); Brenner et al. (2000);
Brunel and Nadal (1998); Butts and Goldman (2006); Coulter et al. (2009); Dan
et al. (1996); Deadwyler and Hampson (1997); Dean et al. (2005); Furber et al.
(2007); Gutnisky and Dragoi (2008); Kang and Sompolinsky (2001); Krone et al.
(1986); Legendy (2009); Linsker (1989b, 1992); Nirenberg and Latham (2005);
Osborne et al. (2008); Prut et al. (1998); Rolls et al. (1997); Schneideman et al.
(2003); Zemel and Hinton (1995).
4. Significance of spike patterns: Abeles (1991); Abeles et al. (1995, 1993); Abeles
and Gerstein (1988); Baker and Lemon (2000); Brown et al. (2004); Cessac
et al. (2008); Dan and Poo (2006); Dayhoff and Gerstein (1983a,b); Gerstein and
Aertsen (1985); Grun et al. (1994b, 2002a,b, 1999); Gutig et al. (2002); Hosaka
et al. (2008); Martignon et al. (2000, 1995); Masquelier et al. (2009); Nakahara
and Amari (2002); Palm et al. (1988); Pfister and Gerstner (2006); Tetko and
Villa (1992).
5. Neural coding in the visual system and natural scene statistics: Adelman et al.
(2003); Atick (1992); Atick and Redlich (1990, 1992); Barlow (1989); Bethge
et al. (2002); Butts et al. (2007); Dan et al. (1996); Dong and Atick (1995);
Field and Chichilnisky (2007); Gutnisky and Dragoi (2008); Haft and van
Hemmen (1998); Hoyer and Hyvarinen (2002); Hyvarinen and Karhunen (2001);
Hyvarinen and Hoyer (2001); Hyvarinen et al. (2009); Koepsell and Sommer
(2008); Koepsell et al. (2009); Krone et al. (1986); Legendy (2009); Linsker
(1989b,a); McClurkin et al. (1991); Optican et al. (1991); Optican and Richmond
(1987); Rolls et al. (1997); Seri`es et al. (2004); Wang et al. (2010).
[Fig. 12.4: histogram of the number of bursts as a function of their Poisson surprise S (from Legendy and Salcman 1985)]
Perhaps the first paper that tried to understand the organization of the brain from
the point of view of each single neuron, in terms of the information that a single
neuron could obtain from its afferents and transmit to its efferents, was written by
Legendy (1975). Putting ourselves in the position of one single neuron, we can ask
ourselves "What could it be interested in?" and "How could it transfer this into
surprising signals on its own axon?"
The second question is very familiar to the single-neuron physiologists who
record the spike trains from single neurons and try to make sense of them. What they
listen to are bursts, i.e., temporary increases in spiking activity, the surprising
event being that the neuron fires comparatively many spikes in a comparatively short
time interval. Typically bursts consist of anything between 5 and 50 spikes (for
more elaborate statistics, see Legendy and Salcman 1985). The analysis of bursts
in terms of surprise or novelty was initiated by Palm (1981) and Legendy and
Salcman (1985). Some results of the latter paper are shown in Figs. 12.4 and 12.5.
The first question has also become a technical question for those neurophysiologists
who started to do multiunit recordings. In this case, the physiologist's afferents
are the neurons he records from. Three answers have been given:
(a) Coincidence: A surprisingly large number of afferents (to the neuron or to the
physiologists electrode) fire within the same (short) time window.
(b) Afferent patterns: A certain pattern (i.e., subset of the afferents) fires within
the same short time window. This was probably most often the case when
a physiologist was surprised by a large coincidence (Kruger and Bach 1981;
Legendy and Salcman 1985; Bach and Kruger 1986; Abeles et al. 1993).
Fig. 12.5 Spike-burst statistics. (a) Histogram of the spike rate during high-surprise bursts (thick
bars: N > 10, thin bars: N > 20), (b) histogram of the number of spikes in high-surprise bursts.
Preparation as in Fig. 12.4, from Legendy and Salcman (1985)
12.6.1 Coincidence
The most straightforward possibility for a single neuron's surprise is in fact (a).
A typical neuron needs a certain number of coincident input spikes in order to
become sufficiently active. Additional bursting of some of these inputs may help, so
that perhaps the better formulation of the surprising event has both the ingredients
of bursting and of coincidence: all afferents taken together fire surprisingly many
spikes within a surprisingly short time interval. The mathematical analysis of this
kind of novelty or surprise is identical with the analysis of burst novelty or surprise
as described above.
\Omega = \{ \omega = (b_j^i)_{j \in N}^{i \in K} : b_j^i \in \{0,1\} \}.

A pattern of length L is a sequence c = (c_j^i)_{i \in K,\ j=1,\ldots,L}. If we are interested
in all patterns of length L, our cover is simply \{ C_c : c \text{ a pattern of length } L \},
where

C_c = \{ \omega = (b_j^i) \in \Omega : b_j^i = c_j^i \ \text{for all } j = 1,\ldots,L \ \text{and all } i = 1,\ldots,k \}.

In this case, every pattern would be very surprising, but exactly for this reason, this large
calculated novelty is not really surprising.
Let us model this situation more completely: we assume as a naive probability
assignment that all b_j^i are independent and have the same probability q of being a
1. Thus, if N_c is the number of 1s in a pattern c, we simply have p(c) = q^{N_c} (1-q)^{Lk-N_c}.
If q = 1/2, then all patterns are equally improbable: p(c) = 2^{-Lk}, and the
surprise for each pattern is equally high: S(c) = Lk. If q is small, which is typical
for spike trains, then r := (1-q)^{Lk} is not incredibly small (it lies between 0 and 1) and

p(c) = \left(\frac{q}{1-q}\right)^{N_c} r, \qquad N(c) = N_c \log_2\frac{1-q}{q} - \log_2 r.
So the surprise increases with N_c and our measurement reduces essentially to the
coincidence surprise of Sect. 12.6.2.
Another interpretation of the pattern surprise, in particular in the case where the
1s are much more surprising than the 0s, would be to consider only the 1s,
i.e., the occurrences of spikes in a pattern. This leads more or less to the same
statistics: in this case, we describe a pattern c by the proposition

D_c = \{ (b_j^i) \in \Omega : b_j^i = 1 \ \text{for all } i, j \ \text{where } c_j^i = 1 \}

and form the cover \{ D_c : c \text{ a pattern of length } L \}.
If again N_c denotes the number of 1s in a pattern c, then p(D_c) = q^{N_c}. Therefore
N(D_c) = N_c \cdot (-\log_2 q) for all patterns c.
This is quite obvious because in every case the novelty of a proposition in
the cover depends only on the total number of spikes implied by it. Thus the
calculation of the corresponding surprise reduces essentially to the calculation of
burst surprise (see Sect. 12.3.1).
After these observations one may argue that the real surprise in the case of spatio-temporal
patterns does not lie in the fact that each of the patterns is very surprising
by itself, but in the repetitive occurrence of one (or a few) of these patterns. This
case is best treated as the surprise of repetition (of an improbable event). This kind of
problem has already been analyzed to some extent by Dayhoff and Gerstein (1983b)
and Abeles and Gerstein (1988). We will return to it in the next chapter.
References
Abbott, L. F. (1994). Decoding neuronal firing and modeling neural networks. Quarterly Reviews
of Biophysics, 27, 291331.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge
University Press.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. Journal of Neurophysiology, 60(3), 909924.
Abeles, M., & Lass, Y. (1975). Transmission of information by the axon: II. The channel capacity.
Biological Cybernetics, 19(3), 121125.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the
frontal cortex of behaving monkeys. Journal of Neurophysiology, 70(4), 16291638.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995).
Cortical activity flips among quasi stationary states. Proceedings of the National Academy of
Sciences of the United States of America, 92, 86168620.
Adelman, T. L., Bialek, W., & Olberg, R. M. (2003). The information content of receptive fields.
Neuron, 40(13), 823833.
Aertsen, A. M. H. J., & Johannesma, P. I. M. (1981). The spectro-temporal receptive field.
A functional characteristic of auditory neurons. Biological Cybernetics, 42(2), 133143.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal
firing correlation: Modulation of effective connectivity. Journal of Neurophysiology, 61(5),
900917.
Amari, S.-i., & Nakahara, H. (2005). Difficulty of singularity in population coding. Neural
Computation, 17, 839858.
Amari, S., & Nakahara, H. (2006). Correlation and independence in the neural code. Neural
Computation, 18(6), 12591267.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. M. H. J. (1996). Dynamics of ongoing
activity: Explanation of the large variability in evoked cortical responses. Science, 273(5283),
18681871.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing?
Network: Computation in Neural Systems, 3, 213251.
Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural
Computation, 2(3), 308320.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Cambridge:
MIT Press.
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart
and Winston.
Bach, M., & Kruger, J. (1986). Correlated neuronal variability in monkey visual cortex revealed
by a multi-microelectrode. Experimental Brain Research, 61(3), 451456.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the
behaving macaque monkey. Neural Computation, 8(6), 11851202.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey
primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84,
17701780.
Bar-Hillel, Y., & Carnap, R. (1953). Semantic information. In London information theory
symposium (pp. 503512). New York: Academic.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages.
Cambridge: MIT Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295311.
Barlow, H. B., & Foldiak, P. (1989). Adaptation and decorrelation in the cortex. In C. Miall,
R. M. Durbin, & G. J. Mitcheson (Eds.), The computing neuron (pp. 5472). USA:
Addison-Wesley.
Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural
Computation, 1(3), 412423.
Barnard, G. A. (1955). Statistical calculation of word entropies for four Western languages. IEEE
Transactions on Information Theory, 1(1), 4953.
Bateson, G. (1972). Steps to an ecology of mind. London: Intertext Books.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation
and blind deconvolution. Neural Computation, 7, 11291159.
Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal short-term population coding: When
Fisher information fails. Neural Computation, 14, 23172351.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons:
Dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of
Neuroscience, 18, 1046410472.
Bialek, W., de Ruyter van Steveninck, R. R., & Tishby, N. (2007). Efficient representation as a
design principle for neural coding and computation. Neural Computation, 19(9), 2387-2432.
Bialek, W., Reike, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural
code. Science, 252, 18541857.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation
in the hippocampus. Nature, 361, 3139.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience,
2(11), 947957.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy
in a neural code. Neural Computation, 12(7), 15311552.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: Stateof-the-art and future challenges. Nature Neuroscience, 7, 456461. doi: 10.1038/nn1228.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding.
Neural Computation, 10(7), 17311757.
Butts, D. A. (2003). How much information is associated with a particular stimulus? Network:
Computation in Neural Systems, 14(2), 177187.
Butts, D. A., & Goldman, M. (2006). Tuning curves, neuronal variability and sensory coding.
PLOS Biology, 4, 639646.
Butts, D. A., Weng, C., Jin, J., Yeh, C.-I., Lesica, N. A., Alonso, J.-M., & Stanley, G. B. (2007).
Temporal precision in the neural code and the timescales of natural vision. Nature, 449(7158),
9295.
Cessac, B., Rostro-Gonzalez, H., Vasquez, J.-C., & Vieville, T. (2008). To which extend is
the neural code a metric? In Proceedings of the conference NeuroComp 2008. Informal
publication.
Cherry, C. (1966). On human communication. Cambridge: MIT Press.
Christodoulou, C., & Bugmann, G. (2001). Coefficient of variation (CV) vs mean inter-spikeinterval (ISI) curves: What do they tell us about the brain? Neurocomputing, 3840, 11411149.
Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensinga new class
of self-organizing coding models for neuroscience.
Dan, Y., & Poo, M.-M. (2006). Spike timing-dependent plasticity: From synapse to perception.
Physiology Review, 86, 10331048.
Dan, Y., Atick, J. J., & Reid, R. C. (1996). Efficient coding of natural scenes in the lateral geniculate
nucleus: Experimental test of a computational theory. Journal of Neuroscience, 16(10),
33513362.
Dayhoff, J. E., & Gerstein, G. L. (1983a). Favored patterns in spike trains. I. Detection. Journal of
Neurophysiology, 49(6), 13341348.
Dayhoff, J. E., & Gerstein, G. L. (1983b). Favored patterns in spike trains. II. Application. Journal
of Neurophysiology, 49(6), 13491363.
Deadwyler, S. A., & Hampson, R. E. (1997). The significance of neural ensemble codes during
behavior and cognition. Annual Review of Neuroscience, 20, 217244.
Dean, I., Harper, N. S., & D. McAlpine (2005). Neural population coding of sound level adapts to
stimulus statistics. Nature Neuroscience, 8(12), 16841689.
Den`eve, S. (2008). Bayesian spiking neurons I: Inference. Neural Computation, 20, 91117.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network, 6(3), 345
358.
Doob, J. L. (1953). Stochastic Processes. New York: Wiley.
Eckhorn, R. (1999). Neural mechanisms of scene segmentation: Recordings from the visual cortex
suggest basic circuits for linking field models. IEEE Transactions on Neural Networks, 10(3),
464479.
Eckhorn, R., Grusser, O.-J., Kroller, J., Pellnitz, K., & Popel, B. (1976). Efficiency of different neuronal codes: Information transfer calculations for three different neuronal systems. Biological
Cybernetics, 22(1), 4960.
Edelman, G. M., & Tononi, G. (2000). A universe of consciousness: How matter becomes
imagination. New York: Basic Books.
Engel, A., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in
top-down processing. Nature Reviews Neuroscience, 2(10), 704716.
Field, G. D., & Chichilnisky, E. J. (2007). Information processing in the primate retina: Circuitry
and coding. Annual Review of Neuroscience, 30, 130.
Furber, S. B., Brown, G., Bose, J., Cumpstey, J. M., Marshall, P., & Shapiro, J. L. (2007). Sparse
distributed memory using rank-order neural codes. IEEE Transactions on Neural Networks, 18,
648659.
Gerstein, G. L., & Aertsen, A. M. (1985). Representation of cooperative firing activity among
simultaneously recorded neurons. Journal of Neurophysiology, 54(6), 15131528.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single
neuron. Biophysical Journal, 4(1), 4168.
Gerstner, W., & Kistler, W. M. (2002). Spiking Neuron Models. New York: Cambridge University
Press.
Gerstner, W., Kreiter, A. K., Markram, H., & Herz, A. V. M. (1997). Neural codes: Firing rates
and beyond. Proceedings of the National Academy of Sciences of the United States of America,
94(24), 1274012741.
Golomb, D., Hertz, J., Panzeri, S., Treves, A., & Richmond, B. (1997). How well can we estimate
the information carried in neuronal responses from limited samples? Neural Computation, 9(3),
649665.
Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention and grouping by the
laminar circuits of visual cortex. Spatial Vision, 12, 163186.
Grun, S., Aertsen, A. M. H. J., Abeles, M., Gerstein, G., & Palm, G. (1994a). Behaviorrelated neuron group activity in the cortex. In Proceedings 17th Annual Meeting European
Neuroscience Association. Oxford. Oxford University Press.
Grun, S., Aertsen, A. M. H. J., Abeles, M., Gerstein, G., & Palm, G. (1994b). On the significance of
coincident firing in neuron group activity. In N. Elsner, & H. Breer (Eds.), Sensory transduction
(p. 558). Thieme: Stuttgart.
Grun, S., Diesmann, M., & Aertsen, A. (2002a). Unitary events in multiple single-neuron spiking
activity: I. Detection and significance. Neural Computation, 14(1), 4380.
Grun, S., Diesmann, M., & Aertsen, A. (2002b). Unitary events in multiple single-neuron spiking
activity: II. Nonstationary data. Neural Computation, 14(1), 81119.
Grun, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events
without discretization of time. Journal of Neuroscience, 94(1), 121154.
Grun, S., & Rotter, S. (Eds.) (2010). Analysis of spike trains. New York: Springer.
Gutig, R., Aertsen, A., & Rotter, S. (2002). Statistical significance of coincident spikes: Countbased versus rate-based statistics. Neural Computation, 14(1), 121153.
Gutnisky, D. A., & Dragoi, V. (2008). Adaptive coding of visual information in neural populations.
Nature, 452(7184), 220224.
Guyonneau, R., VanRullen, R., & Thorpe, S. J. (2004). Temporal codes and sparse representations:
A key to understanding rapid processing in the visual system. Journal of Physiology Paris,
98, 487497.
Haft, M., & van Hemmen, J. L. (1998). Theory and implementation of infomax filters for the retina.
Network, 9, 3971.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual
cortex. Journal of Computational Neuroscience, 3(1), 734.
Hawkins, J., & Blakeslee, S. (2004). On intelligence. New York: Times Books, Henry Holt and
Company.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hecht-Nielsen, R. (2007). Confabulation theory. The mechanism of thought. Berlin: Springer.
Holden, A. V. (1976). Models of the stochastic activity of neurons. New York: Springer.
Hosaka, R., Araki, O., & Ikeguchi, T. (2008). STDP provides the substrate for igniting synfire
chains by spatiotemporal input patterns. Neural Computation, 20(2), 415435.
Hoyer, P. O., & Hyvarinen, A. (2002). A multi-layer sparse coding network learns contour coding
from natural images. Vision Research, 42(12), 15931605.
Hyvarinen, A., & Hoyer, P. O. (2001). A two-layer sparse coding model learns simple and complex
cell receptive fields and topography from natural images. Vision Research, 41(18), 24132423.
Hyvarinen, A., Hurri, J., & Hoyer, P. O. (2009). Natural Image Statistics. New York: Springer.
Hyvarinen, A., & Karhunen, J. (2001). Independent Component Analysis. New York: Wiley.
Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and
dopamine signaling. Cerebral Cortex, 17, 24432452.
Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Computation, 15,
15111523.
Johannesma, P. I. M. (1981). Neural representation of sensory stimuli and sensory interpretation of
neural activity. Advanced Physiological Science, 30, 103125.
Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance
measures in probability space. Physical Review Letter, 86(21), 49584961.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons.
Physical Review E, 59, 44984514.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network
models, information estimation, and spatial tuning. Journal of Computational Neuroscience, 1,
109139.
Knoblauch, A., & Palm, G. (2004). What is Signal and What is Noise in the Brain? BioSystems,
79, 8390.
Koepsell, K., & Sommer, F. T. (2008). Information transmission in oscillatory neural activity.
Biological Cybernetics, 99, 403416.
Koepsell, K., Wang, X., Vaingankar, V., Wei, Y., Wang, Q., Rathbun, D. L., Usrey, W. M., Hirsch,
J. A., & Sommer, F. T. (2009). Retinal oscillations carry visual information to cortex. Frontiers
in Systems Neuroscience, 3, 118.
Konig, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range
synchronization in cat visual cortex. In Proceedings of the National Academy of Sciences of the
United States of America, 92, 290294.
Kostal, L., Lansky, P., & Rospars, J.-P. (2007). Neuronal coding and spiking randomness. European
Journal of Neuroscience, 26(10), 26932701.
Krone, G., Mallot, H., Palm, G., & Schuz, A. (1986). Spatiotemporal receptive fields: A dynamical
model derived from cortical architectonics. Proceedings of the Royal Society of London. Series
B, Biological Sciences, 226(1245), 421444.
Kruger, J., & Bach, M. (1981). Simultaneous recording with 30 microelectrodes in monkey visual
cortex. Experimental Brain Research, 41(2), 191194.
Legendy, C. (2009). Circuits in the braina model of shape processing in the primary visual
cortex. New York: Springer.
Legendy, C. R. (1975). Three principles of brain function and structure. International Journal of
Neuroscience, 6, 237254.
Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of
spontaneously active striate cortex neurons. Journal of Neurophysiology, 53(4), 926939.
Letvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frogs eye tells
the frogs brain. Proceedings of the IRE, 47(11), 19401951.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105117.
Linsker, R. (1989a). An application of the principle of maximum information preservation to linear
systems. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (Vol. 1)
(pp. 186194). San Mateo: Morgan Kaufmann.
Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between
input and output signals. Neural Computation, 1(3), 402411.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input
distributions. Neural Computation, 9, 16611665.
Lisman, J., & Spruston, N. (2005). Postsynaptic depolarization requirements for LTP and LTD: A
critique of spike timing-dependent plasticity. Nature Neuroscience, 8(7), 839841.
Loiselle, S., Rouat, J., Pressnitzer, D., & Thorpe, S. J. (2005). Exploration of rank order coding with
spiking neural networks for speech recognition. Proceedings of International Joint Conference
on Neural Networks, 4, 20762078.
MacGregor, R. J. (1987). Neural and brain modeling. New York: Academic.
MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link.
Bulletin of Mathematical Biology, 14(2), 127135.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons.
Science, 268(5216), 15031506.
Markram, H., Luebke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by
coincidence of postsynaptic APs and EPSPs. Science, 275, 213215.
Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W. A., & Vaadia, E. (2000).
Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural
Computation, 12(11), 26212653.
Martignon, L., von Hasseln, H., Grun, S., Aertsen, A. M. H. J., & Palm, G. (1995). Detecting
higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73(1), 6981.
Martignon, L., von Hasseln, H., Grun, S., & Palm, G. (1994). Modelling the interaction in a set
of neurons implicit in their frequency distribution: A possible approach to neural assemblies.
In F. Allocati, C. Musio, & C. Taddei-Ferretti (Eds.), Biocybernetics (Cibernetica Biologica)
(pp. 268288). Torino: Rosenberg & Sellier.
Masquelier, T., Guyonneau, R., & Thorpe, S. (2009). Competitive STDP-based spike pattern
learning. Neural Computation, 21(5), 12591276.
Massaro, D. W. (1975). Experimental psychology and human information processing. Chicago:
Rand McNally & Co.
McClurkin, J. W., Gawne, T. J., Optican, L. M., & Richmond, B. J. (1991). Lateral geniculate
neurons in behaving priimates II. Encoding of visual information in the temporal shape of the
response. Journal of Neurophysiology, 66(3), 794808.
Miller, J. G. (1962). Information input overload. In M. C. Yovits, G. T. Jacobi, & G. D. Goldstein
(Eds.), Self-Organizing Systems (pp. 6178). Washington DC: Spartan Books.
Morrison, A., Aertsen, A., & Diesmann, M. (2007). Spike-timing-dependent plasticity in balanced
random networks. Neural Computation, 19(6), 14371467.
Morrison, A., Diesmann, M., & Gerstner, W. (2008). Phenomenological models of synaptic
plasticity based on spike timing. Biological Cybernetics, 98, 459478.
Nakahara, H., & Amari, S. (2002). Information geometric measure for neural spikes. Neural
Computation, 14, 22692316.
Nakahara, H., Amari, S., & Richmond, B. J. (2006). A comparison of descriptive models of a single
spike train by information geometric measure. Neural Computation, 18, 545568.
Nemenman, I., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2008). Neural coding
of natural stimuli: Information at sub-millisecond resolution. PLoS Computational Biology,
4(3), e1000025.
Nirenberg, S., & Latham, P. (2003). Decoding neural spike trains: How important are correlations?
Proceedings of the National Academy of Science of the United States of America, 100,
73487353.
Nirenberg, S., & Latham, P. (2005). Synergy, redundancy and independence in population codes.
Journal of Neuroscience, 25, 51955206.
Optican, L. M., Gawne, T. J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures
of transmitted information and channel capacity from multivariate neuronal data. Biological
Cybernetics, 65(5), 305310.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by
single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of
Neurophysiology, 57(1), 162178.
Osborne, L. C., Palmer, S. E., Lisberger, S. G., & Bialek, W. (2008). The neural basis for
combinatorial coding in a cortical population response. Journal of Neuroscience, 28(50),
1352213531.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 167183.
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Palm, G. (1982). Neural assemblies, an alternative approach to artificial intelligence. New York:
Springer.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Palm, G. (1987a). Associative memory and threshold control in neural networks. In J. L. Casti, &
A. Karlqvist (Eds.), Real brains: artificial minds (pp. 165179). New York: Elsevier.
Palm, G. (1987b). Computing with neural networks. Science, 235, 12271228.
Palm, G. (1992). On the information storage capacity of local learning rules. Neural Computation,
4, 703711.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among
neuronal spike trains. Biological Cybernetics, 59(1), 111.
Palm, G., & Sommer, F. T. (1992). Information capacity in recurrent McCullochPitts networks
with sparsely coded memory states. Network, 3(2), 177186.
Panzeri, S., & Schultz, S. R. (2001). A unified approach to the study of temporal, correlational,
and rate coding. Neural Computation, 13(6), 13111349.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of
information in the nervous system. Proceedings of the Royal Society of London Series B;
Biological Science, 266(1423), 10011012.
Perkel, D. H., & Bullock, T. H. (1967). Neural coding. Neurosciences Research Program Bulletin,
6(3), 223344.
Perrinet, L., Samuelides, M., & Thorpe, S. J. (2003). Coding static natural images using spike event
times: Do neurons cooperate? IEEE Transactions on Neural Networks, 15, 11641175.
Pfaffelhuber, E. (1972). Learning and information theory. International Journal of Neuroscience,
3, 83.
Pfister, J.-P., & Gerstner, W. (2006). Triplets of spikes in a model of spike timing-dependent
plasticity. The Journal of Neuroscience, 26(38), 96739682.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal
structure of cortical activity: Properties and behavioral relevance. Journal of Neurophysiology,
79(6), 28572874.
Quastler, H. (1956a). Information theory in psychology: Problems and methods. Glencoe:
Free Press.
Quastler, H. (1956b). Studies of human channel capacity. In E. Cherry (Ed.), Information theory,
3rd London symposium (p. 361). London: Butterworths.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the
neural code. Cambridge: MIT Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed
encoding of information provided by populations of neurons in primate temporal visual cortex.
Experimental Brain Research, 114(1), 149162.
Schneideman, E., Bialek, W., & M. J. II. Berry (2003). Synergy, redundancy, and independence in
population codes. Journal of Neuroscience, 23, 1153911553.
Seri`es, P., Latham, P., & Pouget, A. (2004). Tuning curve sharpening for orientation slectivity:
Coding efficiency and the impact of correlations. Nature Neurosience, 7(10), 11291135.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current
Opinion in Neurobiology, 4(4), 569579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications
for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870
3896.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Shaw, G., & Palm, G. (Eds.) (1988). Brain Theory Reprint Volume. Singapore: World Scientific.
Softky, W., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation,
4, 643646.
Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5(2),
239247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with
temporal integration of random EPSPs. Journal of Neuroscience, 13(1), 334350.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spiketiming-dependent synaptic plasticity. Nature Neuroscience, 3, 919926.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition
in the retina. Proceedings of the Royal Society of London Series B; Biological Science,
216(1205), 427459.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons.
Nature Neuroscience, 1(3), 210217.
Tetko, I. V., & Villa, A. E. P. (1992). Fast combinitorial methods to estimate the probability of
complex temporal patterns of spikes. Biological Cybernetics, 76, 397407.
Thorpe, S. J., Guyonneau, R., Guilbaud, N., Allegraud, J.-M., & VanRullen, R. (2004). Spikenet:
Real-time visual processing with one spike per neuron. Neurocomputing, 5860, 857864.
Tononi, G., Sporns, O., & Edelman, G. M. (1992). Reentry and the problem of integrating multiple
cortical areas: Simulation of dynamic integration in the visual system. Cerebral Cortex, 2(4),
310335.
Tononi, G., Sporns, O., & Edelman, G. M. (1994). A measure for brain complexity: Relating
functional segregation and integration in the nervous system. Neurobiology, 91, 50335037.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited
data samples. Neural Computation, 7, 399407.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons
depends on neurotransmitter releaseprobability. Proceedings of the National Academy of
Sciences of the United States of America, 94(2), 719723.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with
frequency-dependent synapses. The Journal of Neuroscience, 20, 15.
Uttley, A. M. (1979). Information Transmission in the Nervous System. London: Academic.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. M. H. J.
(1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events.
Nature, 373, 515518.
van Essen, D. C., Olshausen, B., Anderson, C. H., & Gallant, J. L. (1991). Pattern recognition,
attention and information bottlenecks in the primate visual system. Proceedings of SPIE
Conference on Visual Information Processing: From Neurons to Chips, 1473, 1727.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike
timing-dependent plasticity. The Journal of Neuroscience, 20, 88128821.
Wang, X., Hirsch, J. A., & Sommer, F. T. (2010). Recoding of sensory information across the
retinothalamic synapse. The Journal of Neuroscience, 30, 1356713577.
Yang, H. H., & Amari, S. (1997). Adaptive online learning algorithms for blind separation:
Maximum entropy and minimum mutual information. Neural Computation, 9, 14571482.
Yovits, M. C., Jacobi, G. T., & Goldstein, G. D. (Eds.) (1962). Self-organizing systems. Proceedings
of the Conference on Self-Organizing Systems held on May 22, 23, and 24, 1962 in Chicago,
Illinois. Washington: Spartan Books.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length.
Neural Computation, 7, 549564.
Chapter 13
In this chapter we consider the surprise for a repertoire which represents the interest
in several statistical tests which were performed more or less independently. Then
we consider the surprise obtained from repetitions of the same low-probability
event.
The interest in combining evidence from several statistical tests is not uncommon
in practical situations. One example occurs when several researchers have carried
out statistical studies to evaluate the efficiency of a new drug or a new scientific
hypothesis. In neuroscience, one example is the evaluation of firing coincidence
within a small group of neurons, which was carried out as in the preceding chapter
not only for one combination of sites, but for several different combinations,
leading to a number of more or less independently performed statistical tests on the
same set of neurons. A particular example is the correlation analysis for two neurons
but for different time bins with respect to a stimulus in the so-called JPSTH (Aertsen
et al. 1989). In statistics the kind of analysis that can be carried out in these situations
is sometimes referred to as meta-analysis (Hedges and Olkin 1985; Hartung et al.
2008). The obvious question in such a situation is: How significant is an effect
which was studied in several instances and was found to be significant in some
cases and insignificant in others?
For example, one should not be very surprised if 5 out of 100 significance tests
which had been performed were significant at the 5% level. Of course, if one doesn't
know of the 95 insignificant tests, one still may be impressed by the 5 reported
significant results. This is a severe problem for many practical attempts at meta-analysis, which is more related to the sociology of science and cannot be solved
mathematically.
random variables, and we assume in our first analysis that they are independent. The
statistical interest is expressed by the descriptions \tilde{X}_i\ (i = 1, \ldots, n). The common
interest in all n tests is expressed by the description d = \bigcap_{i=1}^{n} \tilde{X}_i, i.e.,
d(\omega) = \bigcap_{i=1}^{n} [X_i \ge X_i(\omega)]. Our task is to calculate the surprise of d, which we
may call the combined surprise.
First we observe that

N_d(\omega) = -\log_2 \prod_{i=1}^{n} p[X_i \ge X_i(\omega)] = \sum_{i=1}^{n} \big(-\log_2 p[X_i \ge X_i(\omega)]\big) =: \sum_{i=1}^{n} Y_i(\omega).
If the X_i are continuous, each Y_i has the distribution p[Y_i \ge t] = 2^{-t} (t \ge 0), with
density f_i(t) = 2^{-t} \ln 2. For n = 2 we get

p[Y_1 + Y_2 \ge s] = \int_0^s p[Y_2 \ge s-t]\, f_1(t)\, dt + \int_s^\infty f_1(t)\, dt
                   = \int_0^s 2^{-(s-t)} \cdot 2^{-t} \ln 2 \, dt + 2^{-s}
                   = s\, 2^{-s} \ln 2 + 2^{-s}.
In the same way we can compute

p\Big[ \sum_{i=1}^{n} Y_i \ge s \Big] = 2^{-s} \sum_{i=0}^{n-1} \frac{(s \ln 2)^i}{i!}

and therefore

S(s) = s - \log_2\left( \sum_{i=0}^{n-1} \frac{(s \ln 2)^i}{i!} \right).
Based on this calculation there is an easy recipe for calculating the normalized
combined surprise (i.e., the combined significance) of n independent statistical
tests: first we calculate the novelty or naive combined surprise s by summing up
the individual surprises or novelties; then we calculate the combined surprise S(s)
by the above formula.
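A small sketch of this recipe, assuming the individual test results are given as p-values
(so that the individual novelties are Y_i = -log2 p_i); the function name is mine:

import math

def combined_surprise(p_values):
    """Combined surprise of n independent tests with the given p-values:
    s = sum of the individual novelties -log2 p_i, then
    S(s) = s - log2( sum_{i=0}^{n-1} (s ln 2)^i / i! )."""
    n = len(p_values)
    s = sum(-math.log2(p) for p in p_values)        # naive combined surprise
    tail = sum((s * math.log(2)) ** i / math.factorial(i) for i in range(n))
    return s - math.log2(tail)

# five tests, each just significant at the 5% level
print(combined_surprise([0.05] * 5))

For five tests that are each just significant at the 5% level this gives a combined
surprise of about 10 bits, i.e., a combined significance of roughly 10^{-3}.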
not only for lotteries but also in some investigations in brain research where the
experimenters looked for repetitions of unlikely events in the firing of small groups
of neurons (. . . ).
We are interested in sequences of (random) experiments, where at each time-step a
large number of very unlikely outcomes can happen. We model this by considering
random variables X_i^t for t = 1, \ldots, T and i = 1, \ldots, n, where T and n are quite
large integers, t stands for the time of the repetition of the experiment, and X_i^t = 1
signifies that the unlikely event number i occurred at time t. For simplicity we
assume that X_i^t \in \{0,1\} and p[X_i^t = 1] = p for all i and t, and that for t \ne t' the
random vectors (X_i^t)_{i=1,\ldots,n} and (X_i^{t'})_{i=1,\ldots,n} are independent of each other. Then
we count the number of repetitions of each of the unlikely events by

N_i = \sum_{t=1}^{T} X_i^t.

Of course, we assume that p is very small but n should be large enough such
that p \cdot n is not small; it could even be equal to 1 as in the case of the lottery. In
many cases the events X_i^t = 1 and X_j^t = 1 are mutually exclusive for i \ne j; in
some cases, we may assume them to be independent. Most often it is something in
between, i.e., p[X_i^t = 1, X_j^t = 1] \le p[X_i^t = 1] \cdot p[X_j^t = 1].
This also implies that p[N_i \ge k, N_j \ge k] \le p[N_i \ge k] \cdot p[N_j \ge k] and
p[N_i < k, N_j < k] \le p[N_i < k] \cdot p[N_j < k]. Also we can often assume
that X_i^s and X_i^t are independent¹ for s \ne t. In such cases it may be possible to
compute directly the probabilities that such events are repeated 2 or more times in
the sequence (see Exercise 1).
If we cannot make such assumptions, it is still possible to compute, at least
approximately, the probabilities for 3 or more repetitions, if pT is small, e.g., 1/8
(to give a definite value). This can be done by the Poisson approximation.
Let N_i := \sum_{t=1}^{T} X_i^t. We now assume that N_i obeys the Poisson distribution (which
is approximately true when the variables X_i^t are independent over time, but also in
most other practical cases). This means that

p[N_i = k] = e^{-\lambda} \frac{\lambda^k}{k!} \qquad \text{with} \qquad \lambda = p \cdot T.
¹ This is typically not the case for the analysis of spike patterns.
Table 13.1 Surprise S = -log_2(1 - (1 - q_3)^n) of the repetition event N_i \ge 3 for different
values of 1/\lambda = 1/(pT) (rows) and n (columns)

1/\lambda   n = 1,000   n = 10,000   n = 20     n = 100
8       1.9624      0.0764       7.4019     5.0971
10      2.8029      0.3455       8.3389     6.0259
15      4.4456      1.4155       10.0564    7.7371
20      5.6534      2.4594       11.2831    8.9624
30      7.3802      4.0972       13.0198    10.6982
50      9.5733      6.2599       15.2163    12.8944
100     12.5617     9.2408       18.2054    15.8835

Table 13.2 The same for the repetition event N_i \ge 4 (k = 4)

1/\lambda   n = 1,000   n = 10,000   n = 20     n = 100
8       6.7698      3.5071       12.4071    10.0857
10      8.0249      4.7278       13.666     11.3443
15      10.3242     7.0073       15.9675    13.6456
20      11.9647     8.6444       17.6084    15.2865
30      14.2852     10.9636      19.929     17.6071
50      17.2177     13.8958      22.8615    20.5396
100     21.2061     17.8842      26.85      24.5281
Writing

q_k := e^{-\lambda} \sum_{j=k}^{\infty} \frac{\lambda^j}{j!}

for the probability that a particular event is repeated at least k times, the novelty of the
repetition event N_i \ge k is -\log_2(q_k). For the surprise we have to take all n events into
account:

p\Big[ \bigcup_{i=1}^{n} (N_i \ge k) \Big] = 1 - p\Big[ \bigcap_{i=1}^{n} (N_i < k) \Big] \approx 1 - \prod_{i=1}^{n} p[N_i < k] = 1 - (1 - q_k)^n.

Thus

S = -\log_2\big(1 - (1 - q_k)^n\big).
We have tabulated some values for the novelty and the surprise of the repetition
event N_i \ge k for different values of \lambda = p \cdot T and n. We show two tables for
k = 3 and k = 4 (Tables 13.1 and 13.2).
Another interesting range is around the parameters of the lottery. Here we assume
that p \cdot n = 1 and consider different values of T and n (Table 13.3).
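The tabulated values can be reproduced by a few lines of this kind (a sketch under the
Poisson approximation, with q_k = P[Poisson(lambda) >= k] and lambda = pT; the function
name is mine):

import math

def repetition_surprise(lam, k, n):
    """Surprise S = -log2(1 - (1 - q_k)^n) of seeing some event repeated at
    least k times, with q_k = P[Poisson(lam) >= k] and lam = p*T."""
    q_k = 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return -math.log2(1.0 - (1.0 - q_k) ** n)

# first entries of Table 13.1 (k = 3, 1/lambda = 8)
print(repetition_surprise(1 / 8, 3, 1000), repetition_surprise(1 / 8, 3, 10000))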
Table 13.3

10^8:    45.77     41.0151    35.8042    25.8385    14.118
10^10:   59.0577   54.3029    49.092     39.1262    27.4055
References
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal
firing correlation: Modulation of effective connectivity. Journal of Neurophysiology, 61(5),
900917.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey
primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84,
17701780.
Hartung, J., Knapp, G., & Sinha, B. (2008). Statistical meta-analysis with applications. Wiley
Series in Probability and Statistics. New York: Wiley.
Hedges, L., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic.
Chapter 14
Entropy in Physics
Now the idea is that an ordered velocity distribution means a stronger restriction on
the (micro-)state \omega \in \Omega. In terms of the velocity-distribution repertoire, this leads
quite naturally to the definition: the velocity distribution (k_1, \ldots, k_n) is more ordered
than (l_1, \ldots, l_n) if A_{k_1 \ldots k_n} is less probable than A_{l_1 \ldots l_n}, i.e., if p(A_{k_1 \ldots k_n}) \le p(A_{l_1 \ldots l_n}).
Thus order is now defined as improbability with respect to a certain physical
repertoire or the corresponding description d. This idea leads us directly
to Boltzmann's formula. Again an additivity requirement is used to motivate the
logarithm and we obtain the negentropy

H(\omega) = -k \ln p(d(\omega)) = k (\ln 2)\, N(\omega).

The positive factor k is determined by thermodynamical considerations. (It is not
dimensionless: k = 1.38 \cdot 10^{-23}\ \mathrm{J/K}.)
Actually, a more extreme definition of entropy could also be envisaged in this
context, which is based on the use of surprise instead of novelty and produces the
same qualitative prediction, but of course differs quantitatively:

H(\omega) = k (\ln 2)\, S(\omega).

It may be interesting to consider the quantitative differences of these two
entropies in more detail. However, I am not enough of a physicist to do this.
There is a problem that we have not considered yet: how is the probability p on the
physical state space determined?
In classical mechanics there is indeed a unique probability measure p on the state
space \Omega, the so-called Liouville measure, which is distinguished by the fact that it
is invariant under the classical (Hamiltonian) dynamics. If this Liouville measure
is used for the calculation of the probabilities p for the two problems mentioned
above, it leads to the classical results that have been obtained by Boltzmann through
combinatorial considerations (his so-called complexion probabilities).
In this way, one can, for a classical mechanical model of a natural phenomenon,
calculate the entropy value for a specific description d(\omega) of a state \omega \in \Omega. This
understanding of entropy as a special case of information or novelty, namely, for
a particular description d that reflects the interest of the physicist or engineer, is
slightly different from the conventional modern point of view as expressed for example by Brillouin (1962), but closer to the classical point of view [for more details, see
Palm (1985)]. It is also related to the idea of basing thermodynamical predictions
and even other much broader applications on the principle of maximal ignorance,
maximization of entropy or infomax as it is called today. In the experimental and
engineering context, this mode has by and large worked reasonably well up to the
present day: empirically, the second law of thermodynamics has always held.
On the other hand, it has been shown that the second law cannot strictly be true
in the framework of Hamiltonian dynamics (Poincare 1890). After his own attempts
to prove the second law, Boltzmann also accepted the counterarguments (a good
collection of historical papers on thermodynamics can be found in Brush (1966))
and finally arrived at the conclusion that the second law does not hold strictly but
The more theoretical entropies are easily described from the point of view of
information theory:
1. The entropy used in statistical mechanics for the derivation of theoretical
distributions is (apart from the sign) what we have called the information gain
in Definition 11.6, defined in terms of a density function f, usually with respect
to the Lebesgue measure on the phase-space R^n.
2. The dynamical entropy used in ergodic theory for the classification of dynamical
systems is essentially what we have called the information rate in Definition 6.6
on page 83.
Neither entropy is used to prove (or at least understand theoretically) the
second law of thermodynamics from the underlying (micro-)dynamics. In fact, it
has turned out to be very hard to arrive at a proper theoretical understanding of the
second law at all. After all, the second law is an empirical law.
In the following we shall try to consider another version of physical entropy
which is closer to the phenomenological one and to the classical considerations
around the second law of thermodynamics. In order to formulate such an entropy,
we have to rely on a theoretical model of the dynamics that are studied in
thermodynamical experiments. Classically, one considers Hamiltonian dynamics on
a so-called state space \Omega. We shall consider a dynamical system, i.e., (\Omega, \Sigma, p; \varphi),
where (\Omega, \Sigma, p) is a probability space and \varphi: \Omega \to \Omega is a mapping that describes
the evolution of a state x in time: in one time unit x moves to \varphi(x), then to \varphi^2(x),
and so forth. The probability p should be the unique probability on \Omega that makes
sense for the physicist (see above) and that is invariant under the motion mapping \varphi.
In this setting the physicist is normally not capable of knowing exactly the
point x in state space (for a gas in a box it would mean knowing all positions
and velocities of all molecules). Instead he knows the values of certain macro-observables,
which are certain real-valued functions f_1, \ldots, f_n on the state space
(for example, pressure, temperature, energy). And he wants to describe or even
predict the evolution of the system over time in terms of these macro-observables.
In our language, this means that the physicist looks at the system through a
certain description d, which is given by d := \tilde{f} for a single observable f
(see Definition 2.16 on page 31), or by d_m := \tilde{f}_1 \cap \ldots \cap \tilde{f}_m for m observables
that he observes simultaneously.
Now it is straightforward to define the entropy of a state x by

H(x) = -(N \circ d)(x)

as above.
There is one additional point that should be mentioned: usually the macro-observables
are not instantaneous measurements on the system, but rather time
averages. If g: \Omega \to \mathbb{R} is an (instantaneous) observable, we define the time
average as

g_n(x) := \frac{1}{n} \sum_{i=0}^{n-1} g(\varphi^i(x)).
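For a concrete numerical illustration (a sketch of my own, not an example from the book)
one can take the logistic map \varphi(x) = 4x(1-x) on [0,1], a standard ergodic system, and
watch the time averages of an observable concentrate around their limiting value as n grows:

import numpy as np

def time_average(g, x, n):
    """g_n(x) = (1/n) * sum_{i=0}^{n-1} g(phi^i(x)) for the logistic map
    phi(x) = 4 x (1 - x), which is ergodic on [0, 1]."""
    total = 0.0
    for _ in range(n):
        total += g(x)
        x = 4.0 * x * (1.0 - x)
    return total / n

rng = np.random.default_rng(1)

def g(x):                        # an instantaneous observable
    return x

short = [time_average(g, x0, 10) for x0 in rng.random(2000)]
long_ = [time_average(g, x0, 1000) for x0 in rng.random(2000)]
# the spread of g_n around its limiting value G shrinks as n grows
print(np.std(short), np.std(long_))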
We may now observe that in ergodic dynamical systems the averages g_n converge
to a constant function G, at least in probability (...). This means that for any \varepsilon > 0
we can get

p[|g_n - G| > \varepsilon] < \varepsilon

for sufficiently large n.
For the set M := \{x \in \Omega : |g_n(x) - G| \le \varepsilon\}, we have p(M) > 1 - \varepsilon. Next, we
consider the observable f = g_n and a fixed measurement accuracy \delta > 0.
Then

p(f_\delta(x)) = p\{y \in \Omega : |g_n(y) - g_n(x)| < \delta\} \ge p(M) > 1 - \varepsilon

for x \in M and \varepsilon < \delta/2.
If the average novelty N(g_\delta) is finite for the original instantaneous observable g,
one can easily find a constant c such that N((g_n)_\delta)(x) < c for almost every x \in \Omega.
Therefore the average novelty of f = g_n can be estimated as

N(f_\delta) \le -(1 - \varepsilon) \log(1 - \varepsilon) + \varepsilon c,

which goes to zero as \varepsilon \to 0.
This means that for sufficiently large n the entropy for the average observable
f = g_n will be almost zero for any fixed measuring accuracy \delta.
If we start the system in an initial condition where we know that .f r/, this
corresponds to the negative entropy value
H.f r/ D log pjf rj < :
If we measure f again, averaging over some large number n of time steps, we will
almost always find f G and the average entropy we will get for this measurement
will be close to zero since
pjf Gj < > 1 :
In practive large n often means times below 1 s, and so this shows that we have
to expect the entropy to increase in time on the scale of seconds.
This argument is a theoretical underpinning for the second law of thermodynamics and its flavor of vagueness combined with triviality is typical for all such
arguments. Still it is in my opinion the best way of showing that entropy is a quantity
defined by the macro-observables for the system that is increasing or constant
over time, and therefore makes certain developments (the entropy-increasing ones)
irreversible. This phenomenon has been observed in connection with heat engines,
where the theoretical description of the state space would be a product of several
spaces i describing several subsystems, and the dynamics could be changed
by connecting or disconnecting some of these subsystems. In any case, such
a system can be led through irreversible processes and for certain procedural
201
pf ' k D r i jf D r
r2B
pf D r
pf 2 B
< pf D r
0
pf 2 B
pi D
:0
for r i 2 B and
otherwise.
This means that we can describe the probability distribution p k given any observation .f 2 B/ by means of a transition matrix
Mijk WD p.f ' k D r j jf D r i /;
i.e., we can describe our original system in terms of a Markov process.
With this notation, we may now look at the development of the entropy over time.
At our initial time t D 0 we have the observation f D ri and the entropy
H 0 D H.f D r i / D N .f D r i /:
202
14 Entropy in Physics
N
X
pf ' k D r j jf D r i log pf D r j
j D1
N
X
Mijk log pj :
j D1
H k D S.ei M k ; p/;
where ei D .0; : : : ; 0; 1; 0; : : : ; 0/ is the i -th unit vector. Here we use the notation
of Definition 3.1 in the following way: Two probability vectors p; q 2 0; 1n
considered as distributions for a random variable X with R.X / D f1; : : : ; ng. Then
S.q; p/ WD Spq .X /
It should be clear that we can only speak of probabilities for the values of the
observables k time steps ahead .k 1/, even when we initially (at t D 0) knew the
values of all the observables fi .
This will usually be the case when
1. The observables .f1 ; : : : ; fn / D f do not determine the state x 2 , and
2. The time of one time step is not too small.
Furthermore, in this argument, the time of one time step should also not be too
large, i.e., small compared to the relaxation time, because otherwise the observables
f1 ; : : : ; fn will be almost constant already after one single time step (as in the
argument of the last section).
With time steps in this medium range, one may try a further simplification. This
simplification will, however, change the dynamics of the system, i.e., whereas the
transition matrices M k defined above, still describe the same dynamical model,
only viewed through a coarse repertoire (namely through the description of states
x in terms of the observables .f1 ; : : : ; fn /, we will now define a kind of averaged
or coarse-grained dynamics. We simply assume that the transition probabilities are
given by a first-order Markov process, i.e., defining
203
n
X
i D1
pi Nij D
n
X
pf D r i p.f ' D r j jf D r i /
i D1
D p.f ' D rj / D pf D r j D pj ;
since p is '-invariant. I.e. p is N -invariant.
Defining p k as the probability vector p 0 N k for an initial vector p 0 that
corresponds to the initial observation f 2 B by
pi0 D
pf D r i
pf 2 B
for r i 2 B and pi0 D 0 otherwise, it is well known from ergodic theory (e.g., Walters
(1982)) that p k ! p, if p is the only N -invariant probability vector y. Thus
I.p k / ! I.p/ and S.p k ; p/ ! S.p; p/ D I.p/. Since S.p k ; p/ I.p k /
and I.p/ will usually be small for typical thermodynamical measurements f , one
can hope that S.p k ; p/ will decrease towards I.p/ and thus H k D S.p k ; p/
will increase. Unfortunately this is not the case in general. As an example one may
consider the start-vector p 0 D ei , where pi pj for every j D 1; : : : ; N . Then
usually H i < H 0 .
In this situation, Schlogl (1966) has suggested the information gain G.p k ; p/ of
k
p with respect to p, instead of the average entropy H k D S.p k ; p/, as a measure
for entropy. In the situation mentioned above, then obviously
G.p k ; p/ D S.p k ; p/ I.p k / ! 0
and it can indeed be shown that G.p k ; p/ decreases toward 0 in general.
Proposition 14.1
0 G.p kC1 ; p/ G.p k ; p/
for every k 2 N.
Proof. The positivity of G.q; p/ has been shown in Proposition 3.1. The main
inequality is again shown by means of the inequality log x x 1 and some
straightforward calculations using pN D p.
t
u
204
14 Entropy in Physics
References
Brillouin, L. (1962). Science and information theory (2nd ed.). New York: Academic.
Brush, S. G. (1966). Kinetic theory, Vol. 2, Irreversible processes. New York: Pergamon Press.
Kolmogorov, A. N. (1958). A new invariant for transitive dynamical systems. Doklady Akademii
nauk SSSR, 119, 861864.
Kolmogorov, A. N. (1959). Entropy per unit time as a metric invariant of automorphism. Doklady
Akademii nauk SSSR, 124, 754755.
Kuppers, B.-O. (1986). Der Ursprung biologischer Information - Zur Naturphilosophie der
Lebensentstehung. Munchen: Piper.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Poincare, H. (1890). Sur le probl`eme des trois corps et les e quations de la dynamique. Acta
Matematica XIII, 13, 1270.
Schlogl, F. (1966). Zur statistischen Theorie der Entropieproduktion in nicht abgeschlossenen
Systemen. Zeitschrift fur Physik A, 191(1), 8190.
Schlogl, F. (1971a). Fluctuations in thermodynamic non equilibrium states. Zeitschrift fur Physik
A, 244, 199205.
Schlogl, F. (1971b). On stability of steady states. Zeitschrift fur Physik A, 243(4), 303310.
Shields, P. (1973). The theory of Bernoulli Shifts. Chicago: The University of Chicago Press.
Smorodinsky, M. (1971). Ergodic theory, entropy (Vol. 214). New York: Springer.
Walters, P. (1982). An introduction to ergodic theory. New York: Springer.
Part VI
Chapter 15
In this part we want to condense the new mathematical ideas and structures that
have been introduced so far into a mathematical theory, which can be put in the
framework of lattice theory. In the next chapter we want to get a better understanding
of the order structure (defined in Definition 10.3) on the set of all covers. For this
purpose we now introduce a number of basic concepts concerning order and lattices
(Birkhoff 1967).
Definition 15.2.
i) Any reflexive, symmetric, and transitive relation is called an equivalence. For
equivalences, we often use the symbol .
207
208
ii) Any reflexive, antisymmetric, and transitive relation is called a partial order
(p.o.) relation.
iii) Any reflexive and transitive relation is called an ordering.
iv) Any partial order relation that is connecting is called a total order relation.
v) Any transitive, irreflexive relation is called a strict order relation.
Proposition 15.1. Every strict order relation is strictly antisymmetric.
Proof. Assuming x r y and y r x implies x r x which contradicts irreflexivity.
t
u
t
u
Usually, S has much more members than , because many elements x of S have
the same d.x/. More exactly: d.x/ D d.y/ if and only if x y. The set is
also referred to as D S= (reads: S modulo tilde). One says that every x 2 S is
identified with its equivalence class e
x and writes x D y= if and only if e
x De
y.
Given a partition we can define an equivalence by x y if and only if x and y
are in the same element of (or equivalently d .x/ D d .y/). Thus there is a oneto-one correspondence between partitions and equivalence relations. Furthermore
any mapping f W S ! T (T any set) gives rise to an equivalence relation f , namely
x f y if and only if f .x/ D f .y/. The corresponding partition is given by the
e.
complete description f
This chapter is a brief introduction to elementary order theory and provides
a slight generalization for some concepts from the most frequently considered
partially ordered sets (p.o.-sets) to what we call orderings.
Before we proceed with a more detailed discussion of order and orderings, here
are some examples.
Example 15.1. 1) The usual relation on Z (the integer numbers) is a total order
relation.
2) The relation on Zn defined by .x1 ; : : : ; xn / .y1 ; : : : ; yn / if and only if
xi i yi for every i D 1; : : : ; n is a partial order relation.
3) The usual relation between sets is a partial order relation on P.M / (the sets
of all subsets of M ).
4) The relation between real functions, defined by f g if and only if
f .x/ g.x/ for every x 2 X is a partial order relation on the set of all realvalued functions f W X ! R on a set X .
5) The relation between descriptions as introduced in Chap. 2 is also a partial
order relation.
t
u
209
If a partial order is not a total order relation, it can happen that two elements x
and y are uncomparable, i.e., neither x y nor x y. In these cases we need a
better definition for the minimum or the maximum of the two. The more general
terms are the meet x^y and the join x_y of x and y, which are used in analogy
to sets, where the meet x ^ y is the intersection and the join x _ y is the union of
the two sets x and y.
Intuitively, x ^ y is the largest element that is x and y and conversely x _ y
is the smallest element that is x and y. We will define this in the even more
general settings of orderings (Definition 15.5).
Definition 15.4. Let be an ordering on a set S . We define two relations < and
as follows:
i) x < y if and only if x y and not y x.
ii) x y if and only if x y and y x.
Proposition 15.3. Let be an ordering on S .
i) The relation < is a strict order relation.
ii) The relation is an equivalence.
iii) The relation on S= is a partial order relation.
Proof. (i) < is obviously irreflexive. We have to show transitivity: If x < y and
y < z, then x < z. To show that not z x, we assume z x and with x y
obtain z y, which contradicts y < z.
(ii) is obviously reflexive, transitive, and symmetric.
(iii) We have to show that is antisymmetric on the set S= of equivalence classes.
If x y and y x, then x y by definition. Thus x and y belong to the
same equivalence class, i.e., x D y in S= .
t
u
Definition 15.5. Let be an ordering on a set S and M S .
i) x is called minimal in M , if x 2 M and for every y 2 M , y x implies
y x.
ii) x is called maximal in M , if x 2 M and for every y 2 M , x y implies
y x.
iii) x is called a lower bound for M , if x y for every y 2 M .
iv) x is called an upper bound for M , if y x for every y 2 M .
v) x is called a largest element of M (also written x D max.M /), if x 2 M and
x is an upper bound for M .
vi) x is called a smallest element of M (also written x D min.M /), if x 2 M and
x is a lower bound for M .
vii) If the set U of all upper bounds of M has a smallest element x, this is called
the smallest upper bound (s.u.b.) of M and written as x D sup.M / D _.M /
(supremum).
210
viii) If the set L of all lower bounds of M has a largest element x, this is called
the largest lower bound (l.l.b.) of M and written as x D inf.M / D ^ .M /
(infimum).
ix) If M D fx; yg then sup.M / is written as x _ y and called the join and inf.M /
is written as x ^ y and called the meet of x and y.
Remarks:
Largest and smallest elements are unique up to equivalence.
Two minimal elements of a set M are either equivalent or incomparable.
If all minimal elements of a set M are equivalent, then they all are smallest
elements.
If two minimal elements of a set M are not equivalent, then the set M has no
smallest elements.
For a partial order the largest and the smallest element of M both are unique.
For a total order, every maximal element is the largest element of M and every
minimal element is the smallest element of M .
Proposition 15.4. A smallest element of a set M is a largest lower bound of M and
a largest element of M is a smallest upper bound of M .
Proof. Let x be the smallest element of M , i.e., x 2 M and y x for every y 2 M .
Thus x is a lower bound for M . If z is a lower bound for M , then z x 2 M . Thus
x is the largest lower bound for M .
t
u
Remark: A set M without a smallest (or largest) element may still have a l.l.b. (or
s.u.b.) outside M .
Proposition 15.5. Let be an ordering on S and M S .
i) x is minimal in M if and only if there is no y 2 M satisfying y < x.
ii) x is maximal in M if and only if there is no y 2 M satisfying x < y.
Proof. This follows directly from the definition.
t
u
The following (well-known) example shows that even for a total order relation
an infinite set M which has an upper bound may neither have a maximal element,
nor a largest element, nor a smallest upper bound. Consider
the usual ordering on
p
Q (the rational numbers) and M D fx 2 QW 0 x 2g.
Proposition 15.6. Let be an ordering on S and M S . If M is finite, then M
has maximal and minimal elements.
Proof. We can start with any element x 2 M . Either x is already maximal or there
is an y 2 M with y x. Now we proceed with y until we arrive at a maximal
element. The procedure ends in finitely many steps because M is finite.
t
u
211
This does not imply that M has a largest or a smallest element, not even that it
has an upper or a lower bound.
For example, let D f1; : : : ; 6g and S D .P./ n ; /. Consider
M D ff1g; f3g; f5g; f1; 2; 3g; f2; 3; 4g; f3; 4; 5g;
f4; 5; 6g; f1; 3g; f2; 4g; f3; 5g; f4; 6g; f2; 6gg
The minimal elements of M are
f1g; f3g; f5g; f2; 4g; f4; 6g; f2; 6g:
Thus M has no smallest element, but it has a lower bound in S , namely ;.
The maximal elements of M are
f1; 2; 3g; f2; 3; 4g; f3; 4; 5g; f4; 5; 6g; f2; 6g:
Note that f2; 6g is both maximal and minimal in M . M has no largest element and
no upper bound in S (because S ).
In the following we want to investigate the algebraic structure of the operations
join _ and meet ^ introduced in Definition 15.5.(ix). Since these are in general
not uniquely defined, but only up to equivalence, it is more convenient to consider
the equivalence classes S= for an ordering on S . On S= , i.e., if we identify
equivalent elements of S , the ordering becomes a partial order relation and join
and meet are unique (if they exist).
Thus in the following we will always assume that we have a partial order relation
on a set S .
Proposition 15.7. Let .Mi /i 2I be arbitrary many subsets of S and let xi D _Mi
for every i 2 I . Then _fxi W i 2 I g D _.[i 2I Mi /. Let yi D ^Mi for every i 2 I .
Then ^fyi W i 2 I g D ^.[i 2I Mi /.
The equation implies that the left hand side exists if the right hand side exists and
vice versa.
Proof.
y is an upper bound of fxi W i 2 I g , y xi for every i 2 I
, y x for every x 2 Mi for every i 2 I
, y x for every x 2 [i 2I Mi
, y is an upper bound of [i 2I Mi :
Thus fxi W i 2 I g and [i 2I Mi have the same upper bounds and therefore the same
smallest upper bounds.
t
u
212
Definition 15.6. Let be a partial order relation on a set S . If for any two x; y 2 S
both x ^ y and x _ y exists, then .S; ; ^; _/ is called a lattice.
Proposition 15.8. Let .S; ; ^; _/ be a lattice. Then the following is true:
i)
ii)
iii)
iv)
x ^ y D x , x _ y D y , x y.
.x ^ y/ _ x D x ^ .y _ x/ D x for any x; y 2 S .
x ^ y D y ^ x and x _ y D y _ x for any x; y 2 S .
x ^ x D x and x _ x D x for any x 2 S .
Proof. (i) If x y, then x is the smallest element of fx; yg. Therefore x is the
largest lower bound of fx; yg, i.e., x D x ^ y. Similarly y is the largest
element of fx; yg and therefore y D x _ y. Conversely, x D x ^ y y
and y D x _ y x.
(ii) x x ^ y and therefore (by (i)) x D .x ^ y/ _ x, x x _ y and therefore
(by (i)) x D .x _ y/ ^ x.
(iii) obvious from the definition
(iv) x is the smallest and largest element of fx; xg D fxg.
t
u
From each lattice operation ^ and _ we can retrieve the partial order by
defining x y if and only if x ^ y D x (, x _ y D y).
Proposition 15.9. Both lattice operations are associative and commutative and for
any finite set M D fx1 ; : : : ; xn g S we have _M D x1 _ x2 _ : : : _ xn (any order,
any bracketing) and ^M D x1 ^ x2 ^ : : : ^ xn (any order, any bracketing).
Proof. By induction on n:
n D 1 W _fx1 g D x1 D x1 _ x1 .
n D 2 W _fx1 ; x2 g D x1 _ x2 D x2 _ x1 by definition.
n ! n C 1 W Let M D fx1 ; : : : ; xnC1 g and Mi D M n fxi g. We have to show
_M D .x1 _ : : : _
xi _ : : : _ xnC1 / _ xi , where the first
bracket contains all xj (j D 1; : : : ; n C 1) except xi in any
order and bracketing. By the induction assumption .x1 _ : : : _
x
i _ : : : _ xnC1 / D _Mi . Since M D Mi [ fxi g we get from
Proposition 15.7
_M D _f_Mi ; _fxi gg D ._Mi / _ xi :
t
u
Remark:
From this it follows that a finite lattice S always has a maximal element, namely _S
and a minimal element, namely ^S . Usually, these are called 1 and 0, respectively,
i.e., 1 D _S and 0 D ^S .
0 and 1 are also used to name the smallest and largest element of a poset.
213
All suprema exist if M is finite, but the inequalities also hold for infinite sets.
214
Proof. (i) c d means c.!/ d.!/ for every ! 2 . Thus is a partial order
just like the set relation .
(ii) .c [ d /.!/ c.!/ for every ! 2 . So c [ d c and also c [ d d .
Conversely, b.!/ c.!/ and b.!/ d.!/ for every ! 2 implies b.!/
c.!/ [ d.!/ for every ! 2 .
(iii) Similar to (ii).
t
u
Proposition 15.13. .D; / has a smallest and a largest element.
i) The smallest element is called 0 and defined by 0.!/ D f!g for ! 2 .
ii) The largest element is called 1 and defined by 1.!/ D for ! 2 .
Proof. obvious.
t
u
t
u
Reference
215
are presented in a slightly more general fashion than usual, because we cannot
assume our orderings to be antisymmetric. Sect. 15.2 contains a first application of
these ideas, recasting the results of Chaps. 2 and 3 in the lattice D of descriptions.
Reference
Birkhoff, G. (1967). Lattice theory (3rd ed.). Providence: American Mathematical Society.
Chapter 16
The set of all repertoires actually has an interesting structure, when we look at
a repertoire in terms of its proper descriptions D./. This means that we should
consider two repertoires to be essentially the same if they have the same proper
descriptions, or we should say that is more refined than if the proper descriptions
in are contained in those in .
This idea leads to two almost equally reasonable definitions for an ordering of
repertoires, which we will call 1 and 2 . Then we will analyze these two orderings
and a third one in more detail.
217
218
219
220
t
u
221
3
D
f D f
f
t
u
As we will see in the next section, the set R of all repertoires turns out to be a
lattice for ordering 1 , but not for ordering 2 . Similarly, the set C of all covers turns
out to be a lattice for ordering 3 . A lattice is quite a strong order structure. This
means that now the two orderings i (i D 1; 3) are antisymmetric and for any two
covers and we have a join _i and a meet ^i as defined in the last section.
222
[ 1 and [ 1 .
If a repertoire satisfies 1 and 1 , then 1 [ .
[ \ [ 1 and [ \ [ 1 .
If a repertoire satisfies 1 and 1 , then 1 [ \ [ .
Proof. Let c 2 D. /. By definition of 2 there are a 2 D./ and b 2 D./ with
a c and b c, and thus c.!/ a.!/ \ b.!/.
t
u
223
g1
<2
g2
b
2
Fig. 16.2 Repertoires illustrating that 2 is not a lattice. Both 1 and 2 are minimally larger than
and . Thus there is no unique meet for 2
224
and
\ D . /
\ 3 and \ 3 .
If a cover satisfies 3 and 3 , then 3 \ .
. [ / 3 and . [ / 3 .
If a cover satisfies 3 and 3 , then 3 [ .
225
In the following we will write for the ordering 1 , ^ for the meet w.r.t. 1 ,
_ for the join with respect to 1 and 4 for the ordering 3 , f for the meet,
g for the join w.r.t 3 .
In Definition 10.4 we defined the sets C./, R./, T./, F./ and P./. To
this we add Ff ./ WD f 2 F./W finiteg. These sets can be made into lattices by
considering the proper orderings and using the proper equivalence relations. This
leads to the definition of five lattices.1
Definition 16.6. For a probability space .; ; p/, we define
C WD C./= 3 ; R WD R./= ; T WD T./; P WD P./= ; F WD F./:
i)
ii)
iii)
iv)
v)
vi)
The lattice property of C and R follows from Propositions 16.11 and 16.13. For T; P, and Ff it
will be shown in the next section. F is not a lattice, in general.
226
t
u
t
u
t
u
227
B; C 2 [ ) B \ C 2 [ :
t
u
Proposition 16.19. For tight covers , , and the following are equivalent:
i)
ii)
iii)
iv)
t
u
228
16.6 Exercises
1) Let D f0; 1g. Determine the lattices C; R; T; P, and F.
2) Let D 0; 1 with the Borel-sets and p the equidistribution. Let x D
fA 2 W p.A/ < xg, x D fA 2 W p.A/ xg, and x D f.a; b/W b a < xg.
What is the order relation between x and y , between x and y , and between
x and y for arbitrary x and y?
3) .; ; p/ as in Exercise 2). What are the minimal and the maximal elements in
C; R; T; P, and F?
4) Let D f1; 2; 3; 4g and consider C, R, T, P, and F. Is there always a smallest
and a largest element in these p.o.-sets? If yes, what is it? If these are left out,
what are the minimal and maximal elements in the remaining p.o.-sets?
5) Is (Definition 9.6) always the second-smallest element in the lattices C, R, T,
P, and F?
References
Adler, R. L., Konheim, A. G., & McAndrew, M. H. (1965). Topological entropy. Transactions of
the American Mathematical Society, 114, 309319.
Goodman, T. N. T. (1971). Relating topological entropy and measure entropy. Bulletin of the
London Mathematical Society, 3, 176180.
Goodwyn, L. W. (1969). Topological entropy bounds measure-theoretic entropy. Proceedings of
the American Mathematical Society, 23, 679688.
Hopf, E. (1937). Ergodentheorie. Berlin: Springer. Reprinted by Chelsea Publishing Co., New York
edition.
Keynes, H. B., & Robertson, J. (1969). Generators for topological entropy and expansivness.
Mathematical Systems Theory, 3, 5159.
Krieger, W. (1970). On entropy and generators of measure-preserving transformations. Transactions of the American Mathematical Society, 149, 453464.
Palm, G. (1976a). A common generalization of topological and measure-theoretic entropy.
Asterisque, 40, 159165.
Palm, G. (1976b). Entropie und Erzeuer in dynamischen Verbanden. Z. Wahrscheinlichkeitstheorie
verw. Geb., 36, 2745.
Walters, P. (1982). An introduction to ergodic theory. Berlin, Heidelberg, New York: Springer.
Chapter 17
229
230
consideration of open covers and also of covers built from other types of sets in our
definition. We again start with a probability space .; ; p/.
Definition 17.1. Let be a subset of (or even of P./).
i) is called \ -stable1 , if A \ B 2 for any A; B 2 ,
ii) is called [ -stable, if A [ B 2 for any A; B 2 .
Definition 17.2. Let be a \-stable and [-stable subset of containing .
S
i) A subset of is called an -cover if D .
ii) The set of all -covers is called C./.
On C./ we consider the ordering 4 or 3 as defined in the last section, and the
corresponding equivalence relation 3 .
.C./; 4/ is an ordering and .C./= 3 ; 4/ is a p.o. set. We define C WD
C./= 3 . The lattice C defined in Definition 16.6 was C .
Proposition 17.1. .C ; 4/ is a lattice with g D and f D [ .
t
u
t
u
f!gW ! 2 :
t
u
Proof. Obvious.
231
t
u
t
u
Consider
.; ; p/ D D and D f1; 3; 5g; f2; 4; 6g , D f1; 2; 3; 4g; f3; 4;
5; 6g . Then f D [ and N . f / D 1. So we have f 4 and
N . f / > N ./ D log2 32 .
Consider .; ; p/ D E4 and D f1; 2; 3g; f1; 2; 4g , D f2; 3; 4g; f1; 3; 4g .
Then g D D f2; 3g; f2; 4g; f1; 3g; f1; 4g and N . g / D 1, but
t
u
N ./ C N ./ D 2 log2 43 < 1.
232
Proposition 17.7.
i) .Ff ; 4/ has a largest element, namely ! D f!gW ! 2 , if is finite.
ii) .Ff ; 4/ has a smallest element, namely fg.
t
u
Proof. Obvious.
Proposition 17.8. .Ff ; 4/ is a distributive lattice. It is a sublattice of .C; 4/.
and _ D [ :
t
u
Example 17.2.
I is not monotonic on R:
Consider .; ; p/ D D and
D f1; 3; 5g; f2; 4; 6g and D f1; 2g; f3; 4g; f5; 6g :
Here _ D [ , but I./ D log2 3 > I. _ / D 1.
I is not subadditive on R:
Consider .; ; p/ D D 2 and the corresponding random variables X1 ; X2 .
Now take
D X1 D i W i D 1; : : : ; 6 [ X2 D 1; X2 1 and
D X2 D i W i D 1; : : : ; 6 [ X1 D 1; X1 1 :
Then I./ D I./ D 16 log 6 C 56 log 65 D log 6 56 log 5 and I. _ / D log 6.
t
u
Observe that I. _ / I./ I./ D log 6 C 53 log 5 > 0.
Proposition 17.10.
i) .R; / has a largest element, namely ! WD f!gW ! 2 , if is countable,
i.e., p.!/ 0 for every ! 2 .
ii) .R; / has a smallest element, namely fg.
Proof. Obvious.
t
u
233
N ./ D N .d /;
S./ D S.d /:
t
u
N . / N ./ N ./ D
1
8
1 1
log2 D log2 5 1 > 0
2 2
5
2
t
u
Proposition 17.15.
i) .T; / has a largest element, namely ! D ff!gW ! 2 g, if is countable.
ii) .T; / has a smallest element, namely fg.
Proof. Obvious.
t
u
234
D f1; 2g; f3; 4g ;
D f1; 3g; f2; 4g ;
D f1; 4g; f2; 3g :
Then
. g / f D f1g; f2g; f3g; f4g f D
. f / g . f / D fg g fg D fg
t
u
Proposition 17.17. .T; / is a p.o. subset, but not a sublattice of .D; / and
of .R; /.
Proof. The join _ in R is [ which usually is not tight. So the join of and
in T is . [ /\ . / (see Proposition 16.19) which is somewhat larger. The
meet ^ in R and T is the same.
Conversely, the join in T is the same as in D, whereas the meet in D is the union
of descriptions which does not correspond to the meet in T.
t
u
2
3
References
235
Proof. The join and meet in P are the same as in T, namely _ D and
^ D [ \ [ . In F they are g D . /f D (for partitions ; ) and
f D . [ /f . So the join is the same, but the meet is different.
t
u
Proposition 17.19. On .P; / both I and N are monotonic and subadditive.
Proof. N D I and I is monotonic and subadditive on T (Proposition 17.14).
t
u
Like T, also P is not a distributive lattice as can be seen from Example 17.4.
17.7 Exercises
1) For #./ D 2; 3, and 4 determine the lattices P, T, F, and R.
2) For #./ D 5; 6 determine the lattices P and F.
3) Find an example for small #./ that shows that T and P are not distributive
lattices.
4) A complement Ac of a lattice element A satisfies A ^ Ac D 0 (the smallest
element) and A _ Ac D 1 (the largest element).
For each of the lattices P, T, Ff , and R find examples for elements A that do
and that do not have a complement.
5) Are there examples of these lattices that have a unique second-largest or secondsmallest element?
References
Adler, R.L., Konheim, A.G., & McAndrew, M.H. (1965). Topological entropy. Transactions of the
American Mathematical Society, 114, 309319.
Goodwyn, L.W. (1969). Topological entropy bounds measure-theoretic entropy. Proceedings of
the American Mathematical Society, 23, 679688.
236
Goodman, T. N.T. (1971). Relating topological entropy and measure entropy. Bulletin of the
London Mathematical Society, 3, 176180.
Walters, P. (1982). An introduction to ergodic theory. Berlin, Heidelberg, New York: Springer.
Appendix A
237
238
239
e
e
WD f! 0 W D! 0 D D! g,
Its completion: D.!;
! 0 / WD 1D! DD! 0 or1 D.!/
Its novelty: ND W ! R; ND .!/ D log2 E.D! /; N .D/ WD E.ND /,
Its surprise: S.D/ WD N .ND /,
e
Its information: I.D/ D N .D/.
240
Then N ./ D E.N /.
241
log2 E.A/
1
for ! 2 A ;
S
for ! :
i) implies N ./ N ./.
ii) N . [ / N ./ C N ./.
Proposition A.7. Let be a fuzzy repertoire. Then I./ D I.f /.
When we consider the set Tf of fuzzy templates, we have a natural join _ of
two templates and , i.e., the smallest template that is larger than [ . Now we
can show that on fuzzy templates I is monotonic and subadditive.
Definition A.15. For two fuzzy repertoires and we define
1. _ WD fA ^ BW A 2 ; B 2 ; A ^ B essentialg and
2. WD fA BW A 2 ; B 2 ; A B essentialg.
Proposition A.8. 1. For two tight fuzzy repertoires and , implies
I./ I./.
2. For two fuzzy templates and we have I. _ / D I. / I./ C I./.
The set Pf of fuzzy partitions with the ordering turns out to be a sublattice of
Tf with join _ . Even on Pf novelty N and information I do not coincide. Of
course, on Tf and in particular on Pf we still have N I and also monotonicity
of N and I. Proposition A.8 shows that I is subadditive on Tf and on Pf in
particular. However, N is not subadditive, neither on Tf nor on Pf , as the following
example shows.
242
A1
A2
A3
A4
1
0
0
0
1
0
0
0
a
1
0
0
a
1
0
0
a
a
1
0
a
a
1
0
a
a
a
1
a
a
a
1
B1
B2
B3
B4
1
a
a
a
0
1
a
a
0
1
a
a
0
0
1
a
0
0
1
a
0
0
0
1
0
0
0
1
1
0
0
0
t
u
and
A2
I./ D I. / D
A2
A2
I./ D I.cr / D
A2
Reference
Zadeh, L.A. (1965). Fuzzy sets. Information and Control, 8, 338353.
and
Glossary
Notation
A; B
E
H
I
Nd .!/
P; Q
R
Sd .!/
T
X; Y; Z
B
C
C
F
G and Gpq .d /
I
IG
Z
L
M
L
N
N
Npq .d /
N
P
Q
R
R
Description.
Propositions.
Expectation value.
Entropy.
Information rate / Information as random variable.
The novelty provided by ! for the description d .
Transition Probabilites.
Range of a function.
Surprise (of an outcome !) of d .
Transinformation rate.
Random variables.
Borel -algebra.
Set of all covers.
Set of complex numbers.
Set of flat covers.
Novelty gain.
Information.
Information gain.
Set of integer numbers.
Average length of a code.
Mutual novelty.
Number of questions in a guessing strategy.
Set of natural numbers.
Average novelty.
Subjective novelty.
Novelty as a random variable.
Set of partitions.
Set of rational numbers.
Set of real numbers.
Set of repertoires.
243
244
Glossary
S
SL
S
T
T
Var
; ; ;
eN
c
C
e
X ; Y; Q
E
d; b; c
d\
e
p; q
.; ; p/
Surprise.
Surprise loss.
Surprise as a random variable.
Set of tight covers or templates.
Transinformation.
Variance.
Letters for covers and repertoires.
Average error of a Transition Probability.
The capacity of a channel. Also letter for a code.
A channel.
The completion of a description, e.g., dQ .
Stochastic proccesses.
The direction of a description, e.g., dE.
Descriptions.
The tightening of a description d .
Error probability of a Bayesian guess.
Letters for probabilities.
Probability space.
-algebra of propositions or events.
elementary event, element of .
Index
Additivity, 5, 14, 40
novelty, 14
Algebra, 4
-cover, 230
Alphabet, 3, 89, 91, 98, 100
index, 89
input, 91, 92
output, 89, 91, 92
Anticipation, 90, 93, 94
bound, 90
finite, 90
span, 90, 94
Antitone, 14
A-priori probability, 69
Asymptotic equipartition property, 85, 8587,
97, 98
Average entropy, 202, 203
Average error, 66
Average information, 81, 83, 84
Average length, 55, 56, 59
Average novelty, 18, 32, 200
Average number of questions, 53
Average surprise, 175
Bayesian guess, 69
Beginning, 55
Boolean algebra, 214
Burst novelty, 168
Burst repertoire, 168, 175
Burst surprise, 168
245
246
clean, 113
disjoint, 110
finitary, 111
flat, 117, 118
\-stable, 115
narrow, 117
partition, 110
product, 116
shallow, 117, 118
tight, 112, 113
Crisp, 238
version, 238
Element
minimal, 111
Elementary event, 9
Entropy, 27, 195, 197203
average, 202
Equivalence class, 208
Equivalent, 131
Error probability, 69, 97, 98, 100
Essential, 238
Essentially, 18
Event, 3, 4, 6, 9, 1517, 19, 105, 109, 110,
166168, 189
elementary, 9
Index
Excitatory postsynaptic potential (EPSP), 166
Expectation-value, 6, 7, 80
Finitary, 111
Finite function, 7
Finite memory, 94
Flat cover, 117
Flattening of a repertoire, 117
Function, 5
convex, 133
countable-valued, 7
finite, 7
integrable, 7
Fuzzy, 237
consequence, 240
descriptions, 237, 239
partitions, 240
propositions, 237, 237
relation, 237
templates, 240
High-probability, 98100
element, 98
pair, 98, 99
sequence, 86, 98
Huffman code, 57, 58
Identically distributed, 79
Improbability, 14
In terms of, 110
Independent, 79
Independent identically distributed, 79, 81, 84,
85, 87, 88, 94, 98
Independent proposition, 15
Information, 24, 27, 32, 40, 42, 51, 5860, 65,
81, 83, 84, 105, 109, 125, 126, 127,
132, 138, 161, 165, 195
average, 30, 81, 83, 84
gain, 37, 147, 152, 199, 201, 203
rate, 83, 8486, 100, 199
subadditive, 132
Inhibitory postsynaptic potential (IPSP), 167
Input alphabet, 89, 91, 92
Input process, 94, 98
Input sequence, 91
Integrable function, 7
\-stable, 174, 230
cover, 115
repertoire, 115
Irreducible, 55, 56
code, 55, 56, 58
Isotone, 14
Index
Join, 210
Mapping, 5
continuous, 7
discrete, 7
identity, 94
Markov process, 79, 201, 202
Maximal, 209
Measurability, 8
Measurable, 8, 9, 18, 30, 66, 79
Meet, 210
Memory, 9092, 94, 97
bound, 90
finite, 90, 94
internal, 91
span, 90, 94
Minimal, 209
Minimal element, 111
Monotonicity, 30
Monty Hall problem, 13
Mutual information, 43
of random variables, 43
Mutual novelty, 42, 145
of descriptions, 42
Narrow, 117
Negentropy, 195, 197
Neural repertoire, 166, 175
Novelty, 14, 18, 21, 32, 36, 3840, 51, 84, 105,
125, 127, 132, 156, 161, 165, 166,
168, 170, 196, 239
additivity, 14
average, 18, 32, 51
burst, 168
conditional, 15, 38, 39
of d for , 106
provided by d for , 106
gain, 36, 146, 147
247
pause, 170
subjective, 36, 146
Pair-process, 85
Partition, xiii, 17
Partition of a repertoire, 110, 112, 196
Pause novelty, 170
Pause repertoire, 168, 170
Pause surprise, 170
Payoff-function, 6, 7
Population repertoire, 175
Postsynaptic potential (PSP), 166, 167
Prefix, 55
Probability, 4, 6, 8, 9, 14, 32, 33, 65, 66, 89,
9194, 98100, 156, 199202
a-priori, 69
conditional, 14, 38
distribution, 65, 66, 146, 175, 201
error, 97, 100
measure, 65
space, 3, 9, 199
transition, 65, 66, 89, 9194
vector, 133, 201203
Process, 79, 84, 94
discrete, 79
input, 94, 98
Markov, 79
stationary, 79, 91
stochastic, 79
Product, 116
Product of covers, 116
Product of distributions, 152
Product of repertoires, 117
Proper, 110
Proper choice, 110, 111113, 115, 118, 120
Proper description, 110, 111, 113
Proposition, 3, 4, 79, 1417, 105, 109113,
127, 173, 174
independent, 15
small, 130
248
discrete, 30, 31, 72
independent continuous, 139
Random vector, 6
Range, 5, 238
Reflexive, 238
Relation
antisymmetric, 207
connecting, 207
equivalence, 207
irreflexive, 207
ordering, 208
partial order, 208
reflexive, 207
strictly antisymmetric, 207
strict order, 208
symmetric, 207
total order, 208
transitive, 207
Repertoire, 105, 110, 111, 112, 113, 115,
117, 120, 125, 130, 134, 138, 148,
164168, 170, 171, 173175, 189,
191, 192, 196198, 202, 204
burst, 168, 175
clean, 113, 174
coincidence, 168, 170, 171, 175
depolarization, 168, 173
disjoint, 110
finite, 148
infinite, 146
\-stable, 115
neural, 161, 166, 175
partition, 110, 112, 196
pause, 168, 170, 175
population, 175
product, 117
shallow, 117, 118, 119
tight, 112, 113, 115, 117, 118, 120, 174
Index
Smallest element, 209
Smallest upper bound (s.u.b.), 209
Spike, 166168, 170
Stationary, 98
Stationary process, 79, 91, 97
Stochastic process, 79, 8088, 98, 100
i.i.d., 79, 81, 84, 85, 87
stationary, 79, 82, 83, 85, 97, 98, 100
Subadditive information, 132
Subjective information, 37, 202
Subjective novelty, 36, 146
Subjective surprise, 37, 156
Surprise, 24, 32, 32, 124, 125, 138, 139, 156,
166168, 170, 174, 175, 239
average, 84, 175
burst, 168
loss, 147
of an outcome !, 24
pause, 170
random variable, 31
subjective, 37, 156
Surprising, 19
Symmetric, 20, 238
Tautology, 4
Tight, 22, 112
description, 22, 115
repertoire, 112, 113, 115, 117, 118, 120,
174
Tightening, 23, 32, 115
Time average, 199
Transinformation, 42, 43, 65, 66, 145, 152, 154
of covers, 154
of descriptions, 42
of random variables, 43, 154
rate, 84
Transition matrix, 201, 202
Transition probability, 65, 66, 89, 9194
Transitive, 238
Uncertainty, 27
Uncomparable, 209
Uniformly disturbing, 71
[-stable, 230
Upper bound, 209