Novelty, Information, and Surprise
Novelty, Information, and Surprise
Novelty, Information, and Surprise
Gunther Palm
Novelty,
Information
and Surprise
123
Gunther Palm
Neural Information Processing
University of Ulm
James-Franck-Ring
Ulm, Germany
ISBN 978-3-642-29074-9
ISBN 978-3-642-29075-6 (eBook)
DOI 10.1007/978-3-642-29075-6
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2012942731
c Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publishers location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . xiii
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . xxii
Part I
3
3
5
7
10
10
11
11
14
15
18
24
30
31
32
34
35
35
36
38
42
44
45
46
vi
Contents
Part II
51
51
53
54
56
57
58
60
60
62
Information Transmission . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.1 Introductory Examples .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.2 Transition Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.3 Transmission of Information Across Simple Channels .. . . . . . . . . . . .
5.4 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
5.5 Exercises .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Reference .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
63
63
65
67
71
72
74
Part III
77
77
78
80
81
84
85
87
87
88
Channel Capacity .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.1 Information Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.2 Memory and Anticipation.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.3 Channel Capacity.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.4 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
7.5 Exercises .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
89
89
90
91
94
94
95
97
97
98
Contents
vii
105
106
109
115
117
119
120
120
123
123
125
133
138
138
139
141
141
142
146
152
154
Part V
155
156
157
161
161
166
167
168
170
170
173
173
viii
Contents
12.5 Conclusion .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6.1 Coincidence .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6.2 Coincidental Patterns .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
12.6.3 Spatio-Temporal Patterns. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
175
175
179
179
179
181
189
189
191
194
194
14 Entropy in Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
14.1 Classical Entropy .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
14.2 Modern Entropies and the Second Law . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
14.3 The Second Law in Terms of Information Gain . . . . . . . . . . . . . . . . . . . .
14.4 Technical Comments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
References .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
195
195
198
201
204
204
Part VI
207
207
213
214
215
217
217
220
222
226
227
228
228
229
229
231
232
233
234
235
235
235
Contents
ix
Appendices
A
237
238
240
242
Glossary . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 243
Index . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 245
List of Figures
Fig. 1
xv
Fig. 2.1
Fig. 2.2
Fig. 2.3
Fig. 2.4
16
17
25
28
Fig. 4.1
Fig. 4.2
Fig. 4.3
52
52
54
Fig. 5.1
Fig. 5.2
Fig. 5.3
An information channel .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Channel example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Three different channels . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
66
67
73
Fig. 6.1
Channel example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
78
Fig. 7.1
Fig. 7.2
Fig. 7.3
90
94
95
Fig. 8.1
98
Fig. 9.1
Fig. 9.2
112
Fig. 9.3
Fig. 9.4
Fig. 9.5
Fig. 9.6
Illustration of tightness.. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Illustration of cleanness. The vertically hatched region
on the left can be removed, because it is the union of
the two diagonally hatched regions . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Illustration of the product of two repertoires.. . .. . . . . . . . . . . . . . . . . . . .
Examples of narrow covers and chains. . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Example for a shallow repertoire (a) and for a chain (b) .. . . . . . . . . .
Illustration of classes of repertoires . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .
Fig. 12.1
113
116
118
118
119
xi
xii
Fig. 12.2
Fig. 12.3
Fig. 12.4
Fig. 12.5
Fig. 16.1
Fig. 16.2
List of Figures
169
170
177
178
Introduction
Nowadays there are many practical applications of information theory in fields like
pattern recognition, machine learning, and data mining (e.g., Deco and Obradovic
1996; MacKay 2005), used in particular in the life sciences [e.g., Herzel et al.
(1994); Schmitt and Herzel (1997); Bialek et al. (2007); Taylor et al. (2007);
Tkacik and Bialek (2007); Koepsell et al. (2009)], i.e., far beyond the classical
applications in communication technology. But Claude Shannons groundbraking
original concept of information remained essentially unchanged.
The main purpose of this book is to extend classical information theory to
incorporate the subjective element of interestingness, novelty, or surprise. These
concepts can only be defined relative to a persons interests, intentions, or purposes,
and indeed classical information theory was often criticized for not being able to
incorporate these ideas. Actually, classical information theory comes quite close to
this, when introducing the information contained in a proposition or statement A
(as log2 p.A/). But in everyday life most commonly this information is not really
transferred if a person X tells the statement A to another person Y , for example
because Y may not be interested in A or because Y may rather be interested in the
fact that it is X who tells A. An interesting extension of information theory could
consider the following question: If Y is interested in B instead of A and perhaps B
largely overlaps with A, then how much information does Y obtain from being told
A? This question and other similar ones will be answered in this book; they are not
totally alien to classical information theory. This means that our new theory does
not have to go far from it. In a technical sense it can be regarded as only a slight (but
perhaps important) extension of Shannons definition of information from partitions
to covers.
This needs some explanation: Shannon tried to define his information as an
objective almost physical quantity (measured in bit). This led him to define information for random variables X . For a discrete random
Pvariable X with finitely many
possible values x1 ; x2 ; : : : ; xn he defined I.X / D i pX D xi log2 pX D xi ,
i.e., as the average of log2 pX D xi where the possible outcomes X D xi
of X are now the propositions A of interest. Again, this definition presupposes that
we are equally interested in all outcomes xi of the random variable. This may not
xiii
xiv
Introduction
Introduction
xv
A1
A2
A
A3
C2 C3
C1
AB
A4
A1
A2
A
A3
A4
A1
A2
A3
A4
AB
C1
C2
C3
Fig. 1 Examples of covers (top) and the induced hierarchical structures (bottom)
xvi
Part II:
Part III:
Part IV:
Part V:
Part VI:
Introduction
Introduction
xvii
Jeffrey 1992, 2004. This discussion can be and has been carried over to information
theory, but this is not the purpose of this book. Instead I have discovered an
additional source of subjectivity in probability and information theory that is
perhaps less relevant in probability but has a definite impact on information theory
once it is brought into the focus of attention. Classical information theory tried to
determine the amount of information contained in a message (as a real number
measured in bits). To do this it has to presuppose that the message is exchanged
between two agents (that are normally assumed to be people not animals, computers,
or brain regions), who have already agreed on a common language for the expression
of the message. This approach makes it possible to develop information theory in a
discrete framework, dealing mainly with a finite alphabet from which the messages
are composed. Thus classical information theory does not consider the process of
perception or of formation of the messages. It starts where this process has ended.
I believe and I will actually show in this book that information theory needs
only a slight modification to include this process of perception and formation of a
message. This process is captured in the definition of a description which connects
every event ! that can happen, with a proposition d.!/ about it. When we describe
actual events that have happened, we will normally be unable to give an exact
account of them, let alone of the total state of the world around us. We are restricted
both by our senses and by our language: we cannot sense everything that there is and
we cannot express everything that we sense (at least not exactly). So the description
d.x/ that we give about x will usually be true not only for x but also for other events
y which are somewhat similar to x.
Now it is quite clear that classical information theory does not deal with events
x that happen, but rather with descriptions of these events or propositions about
those events. Unfortunately, this simple fact is obscured by the usual parlance in
probability theory and statistics where a proposition is called an event and an
event is called an elementary event. This leads to such strange phrases as is the
certain event and ; is the impossible event. What is meant is that is the trivial
proposition, which is true because it says nothing about the event x, and that ; is
never true because it is a self-contradictory statement (like A and not A) about the
event x. Due to this strange use of the language (which is usually carried over from
probability theory to information theory) it may easily happen that this humanoid
or language-related aspect of the concept of information is forgotten in practical
applications of information theory. This is harmless when information theory is
used to optimize telephone lines, but it may become problematic when information
theory is applied in the natural sciences. This had already happened in the nineteenth
century when the term entropy was introduced in statistical mechanics to explain the
second law of thermodynamics and it is happening again today when information
theory is used in theoretical neuroscience or in genetics, e.g., Kuppers 1986. One
problem is that in such applications the concept of information may be taken to
be more objective than it actually is. This argument eventually has led me and
also others to some rather subtle and probably controversial criticisms of these
applications of information theory Knoblauch and Palm 2004; Palm 1985, 1996;
Bar-Hillel and Carnap 1953 which really were not the main motive for writing this
xviii
Introduction
book and which are neither part of the theory of novelty and surprise nor its most
interesting applications. In a nutshell, the situation can be described as follows.
Classical Shannon information theory requires an agreement between the sender
and the receiver of the messages whose information content has to be determined.
So information travels from a human sender to a human receiver, and the agreement
concerns the code, i.e., the propositional meaning of the symbols that constitute
the message; this implies that both sender and receiver use the same description
of events. The slightly broader information theory developed here includes the
situation where information about a purely physical observation is extracted by
human observation, so only the receiver needs to be a human. In many scientific
and everyday uses of information terminology, however, neither the sender nor the
receiver is human. And the problem is not merely that instead of the human we
have some other intelligent or intentional being, for example an alien, a monkey,
a chess-computer, or a frog. Information terminology is also used for completely
mechanical situations. For example, a cable in my car carries the information that
I have set the indicator to turn left to the corresponding lights. Or a cable transmits
visual information from a camera to the computer of my home intrusion-warning
system.
In biology and in particular in neuroscience this common use of information
terminology may interfere in strange ways with our ontological prejudices, for
example concerning the consciousness of animals, because on the one hand
information terminology is handy, but on the other hand we dont want to imply
that the receiver of the information has the corresponding properties comparable to
a human. For example, we may want to quantify the amount of visual information
the optic nerve sends to the brain of a frog (Letvin et al. 1959; Atick 1992), without
assuming that the frog (let alone its brain) has made some agreement with its
eyes about the meaning of the signals that are sent through the nerve. Similarly,
we can easily classify the maximal amount of information that can be expressed
by our genes, but we get into much deeper waters when we try to estimate how
much information is actually transmitted, and who is the sender and who is the
receiver. Does father Drosophila transmit some information (for example, about
how to behave) to his son by his genes? Or is it somehow the whole process of
evolution that produces information (Kuppers 1986, see also Taylor et al. 2007)
and who is the receiver of this information? The usual way out of this dilemma
is to avoid the question who might be the sender and who might be the receiver
altogether. In the technical examples eventually both are always humans, anyway.
In the case of information transmission by genes, neurons, or the optic nerve one can
argue that we are just interested in the physical properties of the device that limit
the amount of transmittable information, i.e., the channel capacity. In all these three
cases actually, the channel capacity has been estimated shortly after its definition by
Shannon 1948. In the case of the genome, the capacity is simply twice the number
of base pairs of the DNA-molecule. But this leads to the question how much of this
capacity is actually used and, invariably in biology, to the suspicion that it is much
less than the capacity. So it is found out that most of the genetic information is not
used or not coding, or that most of the visual information available from our
Introduction
xix
xx
Introduction
sequences containing 4 prime numbers (many more than 44, but much less than all
possible sequences), his surprise will certainly be not as large as for the sequence
.22; 23; 24; 25; 26; 27/.
But if we combine all these different reasons for finding surprise, we may
eventually find something surprising in almost every sequence. In this case, it would
seem naive to add all surprises we can get for a given sequence from various
considerations, rather one would believe that the real surprise provided by any
concrete sequence becomes less when everything is considered as surprising. This
obviously confused wording leads to the distinction between novelty and surprise
that is also made in this book.
Introduction
xxi
xxii
Introduction
in Ulm. So it was only towards the end of 2006 when I could take up the book project
again. This time I added more motivating examples to the text. And again I found a
talented student, Stefan Menz, who integrated everything into a neat LATEX-version
of the text. During the last years we also had the opportunity to pursue some of the
new ideas and applications of information theory with excellent PhD students in
an interdisciplinary school on Evolution, Information and Complexity, see Arendt
and Schleich (2009).
This book would never have been finished without the help, encouragement, and
inspiration from all the people I mentioned and also from many others whom I did
not mention. I would like to thank them all!
References
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal
firing correlation: modulation of effective connectivity. Journal of Neurophysiology, 61(5),
900917.
Arendt, W., & Schleich, W. P. (Eds.) (2009). Mathematical analysis of evolution, information, and
complexity. New York: Wiley.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing?
Network: Computation in Neural Systems, 3, 213251.
Bar-Hillel, Y., & Carnap, R. (1953). Semantic information. In London information theory
symposium (pp. 503512). New York: Academic.
Bialek, W., de Ruyter van Steveninck, R. R., & Tishby, N. (2007). Efficient representation as a
design principle for neural coding and computation. Neural Computation, 19(9), 2387-2432.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience,
2(11), 947957.
Braitenberg, V. (1977). On the texture of brain. New York: Springer.
Braitenberg, V. (2011). Information - der Geist in der Natur. Stuttgart: Schatthauer.
Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing.
New York: Springer.
de Finetti, B. (1974). Theory of Probability (Vol. 1). New York: Wiley.
Eckhorn, R., Grusser, O.-J., Kroller, J., Pellnitz, K., & Popel, B. (1976). Efficiency of different neuronal codes: Information transfer calculations for three different neuronal systems. Biological
Cybernetics, 22(1), 4960.
Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats.
Physical Review E, 50(6), 50615071.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4),
620630.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE,
70, 939952.
Jeffreys, H. (1939). Theory of probability. New York: Oxford University Press.
Jeffrey, R. C. (1992). Probability and the art of judgment. New York: Cambridge University Press.
Jeffrey, R. C. (2004). Subjective probability: The real thing. New York: Cambridge University
Press.
Kerridge, D. F. (1961). Inaccuracy and inference. Journal of the Royal Statistical Society. Series B
(Methodological), 23(1), 184194.
Keynes, J. M. (1921). A treatise on probability. London: MacMillan.
Knoblauch, A., & Palm, G. (2004). What is Signal and What is Noise in the Brain? BioSystems,
79, 8390.
References
xxiii
Koepsell, K., Wang, X., Vaingankar, V., Wei, Y., Wang, Q., Rathbun, D. L., Usrey, W. M., Hirsch,
J. A., & Sommer, F. T. (2009). Retinal oscillations carry visual information to cortex. Frontiers
in Systems Neuroscience, 3, 118.
Kuppers, B.-O. (1986). Der Ursprung biologischer Information - Zur Naturphilosophie der
Lebensentstehung. Munchen: Piper.
Legendy, C. R. (1975). Three principles of brain function and structure. International Journal of
Neuroscience, 6, 237254.
Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of
spontaneously active striate cortex neurons. Journal of Neurophysiology, 53(4), 926939.
Letvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frogs eye tells
the frogs brain. Proceedings of the IRE, 47(11), 19401951.
MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. New York:
Cambridge University Press.
MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link.
Bulletin of Mathematical Biology, 14(2), 127135.
Palm, G. (1975). Entropie und Generatoren in dynamischen Verbanden. PhD Thesis, Tubingen.
Palm, G. (1976b). Entropie und Erzeuer in dynamischen Verbanden. Z. Wahrscheinlichkeitstheorie
verw. Geb., 36, 2745.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 167183.
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Palm, G. (1996). Information and surprise in brain theory. In G. Rusch, S. J. Schmidt,
& O. Breidbach (Eds.), Innere ReprasentationenNeue Konzepte der Hirnforschung,
DELFIN Jahrbuch (stw-reihe edition) (pp. 153173). Frankfurt: Suhrkamp.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among
neuronal spike trains. Biological Cybernetics, 59(1), 111.
Pfaffelhuber, E. (1972). Learning and information theory. International Journal of Neuroscience,
3, 83.
Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of
Theoretical Biology, 188(3), 369377.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.
Tkacik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers
(Ed.) Encyclopedia of complexity and systems science (pp. 719741). Berlin: Springer.
arXiv:0712.4385 [q-bio.MN].
Part I
Chapter 1
This chapter lays the probabilistic groundwork for the rest of the book. We introduce
standard probability theory. We call the elements A of the -algebra propositions
instead of events, which would be more common. We reserve the word event
for the elements of the probability space .
2. The fact that we only live for a limited time and therefore are only able to produce
sequences of a limited finite length
When we refer to the set of all possible propositions, we will usually allow
for sequences of finite but unlimited length. Thus we shall usually allow at least
for a countably infinite . In probability theory one usually assumes that
contains all sets f!g (for all ! 2 ), in this case, of course #./ #.),
i.e., has rather more elements then , or a larger or equal cardinality (denoted
by #). Of course, the necessary mathematical assumptions on will always be
explicitly stated.
In real languages it may well happen that has a larger cardinality than . For
example, if an event ! involves the exact position of a glass on a table and this is
described by a pair .x; y/ of real numbers (coordinates), then has the cardinality
of the continuum, which is more than countable. On the other hand, may still be
countable or even finite. If, however, is finite, then will normally contain more
elements than , but cannot be infinite, and in fact #./ 2n , when #./ D n (the
cardinality of is at most 2n , where n is the number of elements in ).
There is another restriction on that we will always assume:
should be closed under the logical operations. This has to be explained: We
may join propositions to new propositions by means of logical operations. If A and
B are propositions, we may also say A and B, A or B, A and not B and the
like. The mathematical counterpart for this are the three set operations:
The intersection of two propositions: A \ B holds for an event !, if ! fulfills A
and ! fulfills B.
A [ B holds for ! if ! fulfills A or if it fulfills B.
AN holds for !, if ! does not fulfill A. Instead of AN the notation Ac (the
complement of A) will be used in the subsequent text.
Definition 1.1. A (nonempty) set of subsets of a set is closed under the logical
operations, if
i) For any A in , Ac is also in .
ii) For any two A, B in , A \ B and A [ B are also in .
A set of subsets of a set is called an algebra, if it is closed under the logical
operations and contains .
In the following we will always assume that P./ is an algebra; here P./
denotes the set of all subsets of . We will even require a little more as is customary
in probability theory (see Definition 1.4).
The set itself, interpreted as a proposition, says that ! 2 , in other words it
says nothing or it holds for every event !: it is a tautology. Its negation c holds
for no event !. As a set, c is called the empty set and written as c D ;.
Finally p is a mapping that assigns a positive number p.A/ to each A 2 . p.A/
is called the probability of the proposition A, i.e., the probability that an event !
fulfills A.
t
u
n
X
ai 1Ai
i D1
and determine
E.X / D
n
X
ai p.Ai /:
i D1
For more details see any book on probability or measure theory, e.g., Ash (1972); Bauer (1972);
Billingsley (1979); Halmos (1950); Jacobs (1978); Lamperti (1966).
If X.!/ D lim Xn .!/, for every ! 2 , and E.Xn / is known, then E.X /
n!1
should be determinable as E.X / D lim E.Xn /. For this idea to work we need yet
n!1
nD1
n!1
M n
X
kDM n
k
1Ak
n
Xn C
1
DW Xn0 :
n
n!1
M n
X
kDM n
k
kC1
k
p
X
:
n
n
n
A slightly different argument for the same computation of E.X / is the following:
From Xn X Xn0 it is clear that E.Xn / E.X / E.Xn0 /. And from Xn0
Xn D n1 we get E.Xn0 / E.Xn / D n1 and this shows lim E.Xn0 / D lim E.Xn /
n!1
n!1
1
S
nD1
nD1
B./ is the smallest -algebra containing all open intervals .a; b/ 0; 1.
10
References
Ash, R. B. (1972). Real analysis and probability. New York: Academic Press.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart
and Winston.
Billingsley, P. (1979). Probability and measure. New York, London, Toronto: Wiley.
Halmos, P. R. (1950). Measure theory. Princeton: Van Nostrand.
Jacobs, K. (1978). Measure and integral. New York: Academic Press.
Lamperti, J. (1966). Probability : A survey of the mathematical theory. Reading, Massachusetts:
Benjamin/Cummings.
Chapter 2
11
12
three children, since this describes his situation more accurately. This is what we
normally expect in a conversation.
For example, it may have started by Mr. Miller pointing out This is my son.
Then you may have asked Do you have more children? and he may have answered
Yes, I have two. In this case, one statement would actually mean [he has two
children and no more]. Let us turn to the other statement this is my son. We would
assume that Mr. Miller happened to be accompanied by one of his children. Indeed,
if he would be accompanied by two of them, we would expect him to mention
both of them in a usual conversation. So the other statement means [Mr. Miller was
accompanied by one of his two children and this was his son]. Now we can work
out the desired probability if we assume that it is equally probable that Mr. Miller is
accompanied by one or the other of his two children, if he is accompanied by just
one of them (which seems quite reasonable).
Mathematically, the situation can be described by three random variables X1 and
X2 2 fm; f g for child one and two, and C 2 f0; 1; 2g for his companion: C D 0
means he is not accompanied by just one child, C D 1, he is accompanied by child
one, C D 2 for child two. We assume that
p.X1 D m/ D p.X1 D f / D p.X2 D m/ D p.X2 D f / D
1
2
p.Af \ Am /
D
p.Am /
1
2
3
4
2
:
3
The other interpretation of the same statements is what is described more formally in
this chapter. Mr. Miller describes the situation differently, depending on the values
of X1 ; X2 , and C :
If X1 D m and C D 1 he says Am ,
If X2 D m and C D 2 he says Am ,
If X1 D f and C D 1 he says Af ,
If X2 D f and C D 2 he says Af ,
If C D 0 he says nothing (i.e., ) about the sex of his children.
Since he has said Am , we have to ask for the set of all conditions under which he
says Am . This is called e
Am in our theory:
e
Am D X1 D m; C D 1 [ X2 D m; C D 2:
13
pS D C; M D i
:
pM D i
pS D 1; C D 1; M D 2
pM D 2; C D 1
pS D 1; C D 1; M D 2
pS D 1; C D 1; M D 2 C pS D 3; C D 1; M D 2
1 1 1
1
3 3 2
D
D :
1 1
1 1 1
3
C
3 3 2
3 3
14
This definition is the classical basic definition of information or entropy, which goes back to
Boltzmann (1887) (see also Brush 1966).
2.3 Descriptions
15
pB .A/ D p.AjB/ WD
p.A \ B/
:
p.B/
p.A \ B/
p.A \ B/ p.B/
p.B/
D
D p.AjB/
equals p.B/
p.A/
p.B/
p.A/
p.A/
if p.AjB/ D p.A/.
Of course, all of this only makes sense if p.A/ and p.B/ are not zero.
It is also clear that the two equivalent conditions are essentially the same as
p.A \ B/ D p.A/ p.B/, because p.AjB/ D p.A/ also implies
p.A/ p.B/ D p.AjB/ p.B/ D p.A \ B/:
These considerations are summarized in the following definition:
Definition 2.2. Two propositions A and B are called independent, if
p.A \ B/ D p.A/ p.B/:
Proposition 2.1. If we define the conditional novelty of A given B as
N .AjB/ D log2 p.AjB/;
then we have
i) N .A \ B/ D N .B/ C N .AjB/
ii) N .A \ B/ D N .A/ C N .B/ if A and B are independent.
Proof. Obvious.
t
u
2.3 Descriptions
Let us come back to an observation already made in Sect. 1.1. In general, the set
of all possible events may be very large. In a typical physical model the events
! 2 would be represented as real vectors ! D .!1 ; : : : ; !n / 2 Rn and thus
would be larger than countable. On the other hand, may well be countable, and
so a description of an element ! 2 by propositions A 2 would be essentially
inexact. Moreover, different persons may use different propositions to describe the
16
World
description
World
interpretation
x
A = d(x)
same event !. For example, when we walk on the street, we may see an event !,
lets say we see a car passing by. But this is not an exact description, and if we say
that we see a blue Mercedes driven by an old man passing by very slowly, this still
is not an exact description and somebody else might rather describe the same event
as Mr. Miller is driving with his wifes car into town.
What goes on here is that
1. Given the event, somebody describes it by a statement in a certain language.
2. Then this statement is interpreted again in our model of the possible events as
a proposition A, i.e., as the set A of all events y which are also described by the
same statement (Fig. 2.1).
The point of view taken by a particular observer in describing the events ! 2
by propositions A 2 , constitutes a particular description of these events. This
process of description can be defined mathematically as a mapping.
Definition 2.3. Given .; ; p/, a mapping d W ! that assigns to each ! 2
a proposition d.!/ D A 2 such that ! 2 A, is called a description.
In addition we require2 that for every A 2 R.d /
pd D A D p.f! 2 W d.!/ D Ag/ 0:
This means that we dont want to bother with propositions that may happen
only with probability 0. This additional requirement is quite helpful for technical
reasons, but it is quite restrictive and it rules out some interesting examples (see
Example 2.3). Note that the requirement that ! 2 d.!/ means that the description
has to be true. So every event ! is described by a true proposition about it.
This requirement obviously implies that the propositions d D A are in for every A 2 R.d /.
It also implies that R.d / is finite or countable.
2.3 Descriptions
17
Example 2.1. Consider the throwing of a dice, i.e., the space of possible events
D f1; 2; 3; 4; 5; 6g. Consider the following descriptions:
1) Even vs. odd: For A D f2; 4; 6g and B D f1; 3; 5g D Ac we define the
description e by eW 1 7! B, 2 7! A, 3 7! B, 4 7! A, 5 7! B, 6 7! A.
2) Small vs. large:
d W 1 7! f1; 2; 3; 4g D A; 2 7! A; 3 7! A;
4 7! f3; 4; 5; 6g D B; 5 7! B; 6 !
7 B:
3) Pairs:
Example 2.2. Try to define some descriptions that make use of the propositions
indicated in Fig. 2.2 about locations on a square table including some logical
operations on them.3
t
u
Example 2.3. Without the requirement added to Definition 2.3 a description d may
have an uncountable range R.d /. Here are two examples for this.
Take D R and a (continuous) probability p on .R; B/ and > 0. For ! 2
we define d.!/ D .! ; ! C / and c.!/ D !; 1/ D fx 2 RW x !g.
Both c and d are interesting descriptions, but for every A in R.c/ or R.d / we
observe pc D A D 0 and pd D A D 0.
t
u
Definition 2.4. A finite or countable collection fA1 ; : : : ; An g or fAi W i 2 Ng of
(measurable) subsets of is called a (measurable) partition, if
For example, one can describe points x 2 AnC by d.x/ D A, points x in A\C by d.x/ D A\C ,
points x in B n C by d.x/ D B, and points x in B \ C by d.x/ D C .
18
i)
Ai D and
n
X
i D1
From Definition (ii) it is obvious that the probability that ! is in two sets, e.g., Ai and Aj , is 0.
So, disregarding propositions with probability 0, every ! 2 is in exactly one of the sets Ai . We
will usually disregard propositions with probability 0 and this is meant by the word essentially.
5
Due to the additional requirement in Definition 2.3 the function Nd is measurable. However, it
may happen that E.Nd / is infinite. For an example see Proposition 2.17.
N .d / D E.N d / D
19
n
X
p e
Ai log2 p.Ai /:
i D1
20
As a first step we can order the sets Ai according to their probability such that
p.A1 / p.A2 / : : : p.An /. Then we can say that A1 is more surprising
than A2 and so on. To quantify the amount of surprise we really get from Ai we
determine the probability pi that d gives us at least as much surprise as Ai does.
This is pi D pd.x/ D A1 _ d.x/ D A2 _ : : : _ d.x/ D Ai . Since d can take only
one value for every ! 2 , this is the sum of probabilities
pi D
i
X
pd D Aj D
j D1
i
X
p.e
Aj /:
j D1
D p d p.A/
D f! 2 W p.d.!// p.A/g
and the description dE by
dE.!/ WD f! 0 W p.d.! 0 // p.d.!//g:
A description d is called directed, if d D dE.
Definition 2.8. A description d is called symmetric, if for every x; y 2 , x 2
d.y/ implies y 2 d.x/.
We can now reintroduce the set-theoretical operations that are defined on
propositions, on the level of descriptions; in particular, we have the natural ordering
and the union and intersection of descriptions:
Definition 2.9. We say that a description c is finer than a description d , or that d is
coarser than c and write c d , if c.!/ d.!/ for every ! 2 .
With this definition we see that the completion dQ of any description d is always
finer than d . This is because for any ! 2 , x 2 dQ .!/ means d.x/ D d.!/ and this
implies x 2 d.x/ D d.!/. Obviously we also have dQ dE, since d.! 0 / D d.!/
implies p.d.! 0 // p.d.!//.
Usually d dE, but dE d is also possible.
21
and
Then
dE .!/ D ;
cE.1/ D f1g;
E
b.!/
D b.!/;
and
t
u
Proof. Obvious.
22
and
d.1/ D f1; 2; 3; 4; 5g;
23
\
fA 2 R.d /W ! 2 Ag:
24
and
If p.e
A/ D 0 for some e
A in the range of dQ, we set p.e
A/ log p.e
A/ D 0 since lim x log x D 0.
x!0C
25
h(p)
0.75
0.5
0.25
0.25
0.5
0.75
1
and S.d / I.d /.
ln 2
then
S.d / D
n
X
.pi pi 1 / log pi
i D1
i D1
Z1
n
X
log2 .x/ dx D
1
ln 2
Zpi
log2 .x/dx
pi 1
t
u
8
Here we rely on the approximation argument as in Sect. 1.3 for the calculation of an expectation
value. If all the finite sum approximations satisfy the same inequality ( ln12 ), then this inequality
is also satisfied by the limit.
26
Proposition 2.12. If c and d are tight descriptions then c d implies I.c/ I.d /.
t
u
From Proposition 2.5 it is obvious that both complete and directed descriptions
are tight. So we have monotonicity of information also on complete and on directed
descriptions.
Novelty and surprise are both smaller than information, but how do they
relate to each other? Usually novelty will be much larger than surprise. For
example, a complete description d with p.d.!// D n1 for every ! 2 has
N .d / D I.d / D log2 n, but S.d / D 0. The following example shows, however,
that N < S is also possible.
Example 2.8. Let .; ; p/ D E16 and9
d W 1 ! f1; : : : ; 11g
6!
14 ! f7; : : : ; 16g
2 ! f1; : : : ; 12g
::
:
7!
::
:
15 ! f8; : : : ; 16g
5 ! f1; : : : ; 15g
13 !
16 ! f9; : : : ; 16g
6!
7!
::
:
13 !
Thus, Nd .!/ < Sd .!/ for ! f6; : : : ; 13g and Nd .!/ D Sd .!/ D 0 for
! 2 f6; : : : ; 13g. Thus N .d / < S.d / in this example.
u
t
It is also quite easy to characterize the extreme cases where two of the three
quantities N , I, and S coincide. This is done in the next proposition.
Proposition 2.13. Let d be a description, then
i) N .d / D I.d / implies d D dQ essentially
ii) S.d / D I.d / implies d
essentially
iii) If d is tight then N .d / D S.d / implies d D dE essentially
Proof. (i) We have N .d / D E.Nd /, I.d / D E.NdQ /, and Nd NdQ . If for some
! 2 , Nd .!/ < NdQ .!/, the same is true for all ! 0 2 dQ .!/, and therefore
9
27
N .d / < I.d /. Thus Nd .!/ D NdQ .!/ and therefore p.d.!// D p.dQ .!//
for every ! 2 . Since dQ .!/ d.!/, this implies that dQ .!/ and d.!/ are
essentially equal.
(ii) Since S.d / D E.NdE /, I.d / D E.NdQ /, and dQ dE we can again infer that
N E D N Q and that dE.!/ D dQ .!/ for every ! 2 . For some ! we have
d
t
u
For further reference we provide a simple comparison of the three basic formulae
to calculate N , I, and S.
Proposition 2.14. For any description d we have
P
p.e
A/ log p.A/,
i) N .d / D
A2R.d /
P
ii) I.d / D
p.e
A/ log p.e
A/,
A2R.d /
iii) S.d / D
E
p.e
A/ log p.A/.
A2R.d /
Proof. Here we definitely need the additional requirement of Definition 2.3. Then
we just have to compute the expectation of a step function with at most countably
many values. Let us show (iii) for example:
S.d / D E.Sd / D
pSd D xx D
x2R.Sd /
E
pd D Alog p.A/:
t
u
A2R.d /
Then
I.d / D p log2 p .1 p/ log2 .1 p/ DW I.p/:
This function is plotted in Fig. 2.4.
t
u
In everyday life the common use of the word information is closely related
with the common use of the word uncertainty. In fact, we expect the information
provided by the description of a phenomenon to be equivalent to the amount of
uncertainty the description eliminates. The concept of uncertainty has been treated
in the contexts of thermodynamics and statistical physics (Brush 1966) and has been
expressed in terms of entropy (a term introduced by Clausius (1865) and defined by
a formula similar to Proposition 2.14.(ii) by Boltzmann 1887). It is one of our aims
to elucidate the relation between entropy and information in depth (see Chap. 14).
For the moment we limit ourselves to observing that the information of a partition
is also called its entropy by many authors.
28
Fig. 2.4 Plot of function
I .p/
I(p)
1
0.75
0.5
0.25
0.25
0.5
0.75
In a nutshell, the words information, novelty, and surprise introduced here can be
distinguished or characterized as follows:
Information you get whether you can use it or not, whether it is interesting or not,
Novelty measures how much of this is new and interesting for you,
Surprise is provided by an event that is comparatively improbable; if everything
is equally improbable, nothing is surprising.
The following proposition characterizes descriptions that provide no information.
Proposition 2.15. The information of a description d is zero if and only if all sets
in the range of its completion dQ have probability zero except for one set e
A with
p.e
A/ D 1. In accordance with our additional requirement in Definition 2.3, this is
equivalent to d
.
Proof. Clearly the condition is sufficient: if p.e
A/ D 1 for one set in the range of
e D 0 for the rest then I.d / D 0. Now, if I.d / D 0 we
dQ and p.B/
must have p.e
A/ log p.e
A/ D 0 for all e
A 2 dQ ./. Therefore p.e
A/ is 0 or 1. Since
P
e
e Q
e
t
u
e
A2dQ./ p.A/ D 1, exactly one of the sets A 2 d ./ must satisfy p.A/ D 1.
This proposition corresponds to the natural expectation that a description
provides no new information if we are certain about the outcome of the situation.
The other extreme case that provides maximal information and corresponds to
maximal uncertainty is attained by descriptions on which the probability measure is
uniformly distributed, as stated by the following proposition.
29
Proposition 2.16. For a fixed n consider all descriptions whose range has n
elements, that is all d on .; ; p/ with d./ D fA1 ; : : : ; An g. I.d / attains a
maximum of log2 n for d D dQ and p.Ai / D n1 for each i .D 1; : : : ; n/. The same is
true for N .d /.
t
u
Z1
p.x/dx D
h
1 i1
x 1 .ln x/2 dx D
D1
ln x e
1
for i 2 N
e
Thus
p.Ai / log2 p.Ai / > p.i / log2 p.i /
> .i C e/1 .ln.i C e//2 log2 .i C e/
D .i C e/1 .ln.i C e//1 .ln 2/1
> .ln 2/
1
i CeC1
Z
i Ce
1
dx:
x ln x
Therefore
I.d / > .ln 2/
1
Z1
1
.x ln x/1 dx D .ln 2/1 ln ln x 1Ce D 1:
t
u
1Ce
30
The first property, already shown in Proposition 2.3 to hold for novelty, is
monotonicity, i.e., the requirement that finer descriptions should have larger novelty,
information, and surprise. Unfortunately, it holds neither for information nor for
surprise, because c d does not imply cQ dQ , nor cE dE. From Example 2.6 we
can easily create an example where c d and N .c/ > N .d /, but I.c/ < I.d /.
However, we get monotonicity of I for tight descriptions because of Proposition 2.6.
Counterexamples against monotonicity of surprise are easy to find. For example,
e X , but
for an equally distributed discrete random variable X , clearly X
e / D 0, whereas S.X / > 0.
S.X
The other important property of classical information is its subadditivity:
I.c \ d / I.c/ C I.d /. This will be shown in the next chapter (Proposition 3.5).
It is quite easy to see that the novelty N as defined here does not have this property,
i.e., in general, N .c \ d / N .c/ C N .d /. An example for this can be obtained
by considering a description d and its complement d c . (See also Exercise 7).) Also
surprise does not have this property (see Exercise 16)). We will see in the next
section that information has this property.
In order to obtain both properties, monotonicity and subadditivity, for both,
information and novelty, the definitions given on the level of descriptions in this
chapter are not sufficient. We have to elevate these definitions to a higher level,
which is done in Part III of the book.
The other possibility is to consider only descriptions with particular properties,
for example, tight or complete descriptions. For complete descriptions, information
and novelty coincide and clearly have both properties. This is the framework of
classical information theory.
31
32
2.8 Exercises
1) For the descriptions given in Examples 2.1, 2.52.7 determine their completion,
their tightening,their novelty, their information, and their surprise.
2) Let D f0; : : : ; 999g and consider the following random variables describing
these numbers:
X.!/ WD first digit of !;
Y .!/ WD last digit of !;
Z.!/ WD number of digits of !; for every ! 2 :
What are the corresponding descriptions, what is the information content of X ,
Y , and Z, and what is the corresponding surprise (assuming equal probabilities
for all thousand numbers)?
3) Measuring the height of a table by means of an instrument with an inaccuracy of
about 1 mm can be described by two different descriptions on D 500; 1500
(these are the possible table heights in mm):
d1 .!/ WD \ ! 0:5; ! C 0:5 and
d2 .!/ WD i; i C 1 for ! 2 i; i C 1/ for i D 500; 501; : : : ; 1499:
What is the completion in these two cases and what is the average novelty,
information, and surprise (assuming a uniform distribution of table heights
on )?
2.8 Exercises
33
B
e\Y
eDX
A
.X; Y / D X
Y:
p.e
A1 /; : : : ; p.e
An1 /. Then compute a local extremum by setting the derivatives
of I.d / to 0.
9) Given a probability space .; ; p/ and the events A; B 2 . We say
A supports B, if p.BjA/ > p.B/
A weakens B, if p.BjA/ < p.B/
If A supports B, which of the relations supports and weakens hold for the
following expressions?
a)
b)
c)
d)
A and B c
B and A
Ac and B c
B c and A
x
2
34
14) Determine all complete, all directed, and all tight descriptions on D f1; 2; 3g.
15) Let D f1; : : : ; 8g. Let c.i / D f1; 2; 3; 4g for i D 1; 2; 3; 4, and c.i / D for
i D 5; 6; 7; 8, and d.i / D f2; : : : ; 6g for i D 2; : : : ; 6, d.i / D for i D 1; 7; 8.
Calculate N , S, and I for c, d , and c \ d .
16) Let D f1; : : : ; 6g, c.1/ D c.2/ D f1; 2g, c.i / D for i D 3; : : : ; 6, and
d.1/ D d.6/ D , d.i / D f2; : : : ; 5g for i D 2; : : : ; 5. Calculate N , S, and I
for c, d , and c \ d .
References
Ash, R. B. (1965). Information theory. New York, London, Sidney: Interscience.
Billingsley, P. (1978). Ergodic theory and information. Huntington, NY: Robert E. Krieger
Publishing Co.
Chapter 3
This chapter introduces some more involved versions of the concepts of novelty
and information, such as subjective and conditional novelty.
35
36
1
8
D 3;
1
8
220 / 0:1926
and
37
For a discrete random variable X and two probabilities p and q, we define the
subjective information of X as
e /;
Npq .X / WD Npq .X
the information gain as
e/
Gpq .X / WD Gpq .X
and the subjective surprise as
Spq .X / WD Npq .X /
Remark: For d D e
d , Gpq .d / is also called the information gain or the Kullback
Leibler distance between p and q (with respect to d ) (Kullback 1959, 1968;
Kullback and Leibler 1951).
Proposition 3.1. Gpq .d / 0 for any complete description d .
Proof. The assertion Gpq .d / D Ep . log.q d / C log.p d // 0 is independent
from the base of the logarithm. Since the proof is the least clumsy for the natural
logarithm, which satisfies the simple inequality ln x x 1, we use it here.
Gpq .d /
D
D
q.d.!//
Ep log
p.d.!//
X
q.D/
p.D/ log
p.D/
D2R.d /
ln x x 1
X
D2R.d /
D2R.d /
q.D/
1
p.D/
p.D/
q.D/
p.D/ D 1 1 D 0:
D2R.d /
Thus Gpq .d / 0. Here the summation extends over all propositions D in the range
R.d / of d .
t
u
Remark: The inequality ln x x 1 is very important for information theory; it can
be used to prove most of the interesting inequalities.
The following example shows that Gpq .d / can be negative for descriptions d that
are not complete.
38
p.A/
< 0:
q.A/
t
u
p.d.!// q.c.!//
log
q.d.!// p.c.!//
p.C / log
C 2R.c/
X
D2R.d /
p.d.!// q.C /
q.d.!// p.C /
X p.C /
p.D/ q.C /
log
p.D/
p.D/
q.D/ p.C /
C D
!
X p.C / p.D/ q.C /
1
p.D/
p.D/ p.C / q.D/
D
C D
!
X
X p.C /
X q.C /
D
D0
p.D/
q.D/ C D p.D/
D
C D
t
u
D0
conditional expectation,
NA .d / D N .d jA/
conditional novelty,
SA .d / D S.d jA/
IA .d / D I.d jA/
conditional information
39
3
4
and p.f2g/ D 14 .
c.1/ D f1g; c.2/ D f1; 2g; d.1/ D f1g and d.2/ D f2g:
Then we have R.c/ D fC1 ; C2 g D ff1g; f1; 2gg; R.d / D fD1 ; D2 g D ff1g; f2gg,
e1 ; C
e 2 g D ff1g; f2gg D fD
e 1; D
e 2 g D R.e
and correspondingly R.e
c / D fC
d /.
Now we can calculate
N .d jc/ D
2
2 X
X
ei \ D
e j / log2 p.Dj jCi / D : : : D
p.C
i D1 j D1
N 0 .d jc/ D E! .Sc.!/ .d // D
2
X
1
2
and
e i /ECi .sCi .d //
p.C
i D1
D
2
X
i D1
ei /
p.C
2
X
j D1
3
1
log2 3:
2 16
t
u
40
C 2c./
C 2c./
t
u
Proof. Obvious.
The first equation in this proposition implies is called the additivity of novelty.
In classical information theory, monotonicity and subadditivity are the most
important properties of information. Since for complete descriptions information
and novelty coincide with the classical concept of information, both measures have
these two properties on complete descriptions, but for general descriptions novelty
is monotonic (Proposition 2.3) and not subadditive (Exercises 2.7) and 3.5) on
page 45), whereas information is subadditive (cf. the following Proposition 3.5)
but not monotonic (cf. the following example). The last assertion is quite obvious
because c d does not imply e
c e
d (see Example 2.6 and the subsequent
discussion). The subadditivity of information is the subject of the next proposition.
Example 3.3. Consider D f1; : : : ; 32g with equal probabilities on , i.e.,
.; ; p/ D E32 . Define
a.1/ D f1g;
a.32/ D f32g; and for all other i 2
a.i / D f2; : : : ; 31g:
and define
b.i / D f1; : : : ; 31g for i D 1; : : : ; 16;
b.i / D f2; : : : ; 32g for i D 17; : : : ; 32:
Then a b but I.a/ I.b/.
Another example was given in Example 2.4.
t
u
41
Proposition 3.5.
I.c \ d / I.c/ C I.d /:
and
I.c \ d / D I.c/ C I.d /
if and only if p .c \ d /.!/ D p.e
c.!// p.e
d .!// for (almost) every ! 2 .
d / D fD1 ; : : : ; Dn g. We have e
c \e
d
Proof. Let R.e
c / D fC1 ; : : : ; Cn g and R.e
c \ d . Thus
c \e
d / N .e
c / C N .e
d /;
I.c \ d / D N .c \ d / N .e
because
N .e
c \e
d / N .e
c / N .e
d/ D
k
n X
X
i D1 j D1
n
X
i D1
k
n X
X
p.Ci \ Dj / log2
i D1 j D1
k
X
j D1
p.Ci /p.Dj /
p.Ci \ Dj /
p.Ci /p.Dj /
1
1
p.Ci \ Dj /
ln 2
p.Ci \ Dj /
i D1 j D1
0
1
k
k
n
n X
X
1 @X X
D
p.Ci /p.Dj /
p.Ci \ Dj /A
ln 2 i D1 j D1
i D1 j D1
k
n X
X
D 0:
x1
ln x
.
ln 2
ln 2
In order to obtain equality in both inequalities in this proof, we need that
p.C
i \ Dj / D
p.Ci/ p.Dj / for
every i and j (with Ci \ Dj ;) and that
c.!/ \ e
d .!/ for every ! 2 .
t
u
p .c \ d /.!/ D p e
The inequality holds because log2 x D
42
t
u
Together with Proposition 2.6 we have now shown that on tight descriptions
the information I has all the classical properties: monotonicity, subadditivity, and
additivity. The novelty N , however, lacks subadditivity.
Proposition 3.7. Let c and d be two descriptions and R.c/ finite. Then
I.d jc/ D
p.A/I.d jA/:
A2R.e
c/
t
u
Definition 3.4. For two descriptions c and d we define their mutual novelty as
M.c; d / WD N .c/ C N .d / N .c \ d /
and their transinformation1 as
T .c; d / WD I.c/ C I.d / I.c \ d /:
Proposition 3.8. Let c and d be descriptions. Then
i) T .c; d / 0,
ii) M.c; d / D N .c/ N .cjd / D N .d / N .d jc/,
iii) If c and d are tight, then T .c; d / D I.c/ I.cjd / D I.d / I.d jc/.
Proof. (i) From Proposition 3.5,
(ii) From Proposition 3.4,
(iii) From Proposition 3.6.
t
u
43
ii)
iii)
iv)
v)
vi)
e\Y
e.
and we have .X; Y / D X
I.X; Y / I.X / C I.Y /
X 4 Y implies I.X / I.Y /
I.X; Y / D I.X / C I.Y jX /
0 I.Y jX / I.Y /
X 4 Y implies I.X jZ/ I.Y jZ/ and I.ZjX / I.ZjY /
(subadditivity)
(monotonicity)
(additivity)
t
u
t
u
44
t
u
The concept of mutual information can also be defined for three and more random
variables. We define T .X; Y; Z/ WD T .X; Y / T .X; Y j Z/. This definition is
again symmetric in X; Y; and Z, because
T .X; Y; Z/ D I.X / C I.Y / C I.Z/
I.X; Y / I.Y; Z/ I.X; Z/
C I.X; Y; Z/:
T .X; Y; Z/ can have both positive and negative values in contrast to T .X; Y /
(see Exercise 3.15)). This definition can of course be extended to more than three
variables, see, for example, Bell (2003); Attneave (1959).
3.6 Exercises
45
3.6 Exercises
1) What is T .X; Y /, T .Y; Z/, T .Z; X / for the three random variables X , Y , and
Z defined in Exercise 2.2).
2) Compute N .a/, N .b/, I.a/, I.b/, M.a; b/, and T .a; b/ for Example 3.3 on
page 40.
3) Compute T .c; d / for Example 2.6 on page 21 and T .c; d /, T .c; e/, T .d; e/
for Example 2.1 on page 16.
4) Given two dice, i.e., D f1; : : : ; 6g
f1; : : : ; 6g. Let X1 D first dice, X2 D
second dice, D D doublets, i.e.,
(
D.i; j / D
1 if i D j ,
0 otherwise.
Determine T .X1 ; X2 /, T .X1 ; D/, T .X2 ; D/, T .X1 ; .X2 ; D//, T ..X1 ; X2 /; D/.
5) Find examples that show that in general N .c \ d / N .c/ C N .d /,
i.e., M.c; d / 0, is false.
6) In a container there are w white and r red balls. Two balls are drawn
consecutively. What is the uncertainty or novelty of the outcome of the first
drawing as compared to the (conditional) uncertainty of the second drawing?
7) There are 12 balls of equal size, but one of them is slightly heavier or lighter
than the others. You have a simple balance scale with two pans and are allowed
to put any number of balls on either pan in order to find out the odd ball and
the direction of its weight difference to the others. How many weighings (each
with three possible outcomes) do you need? How can information theory help
in answering that question?
8) Show Proposition 3.9 on page 43.
9) Show the following.
Proposition 3.12. i) I.X jY; Z/ C I.Y jZ/ D I.X; Y jZ/,
eY
e implies I.X jZ/ I.Y jZ/ and I.ZjX / I.ZjY /,
ii) X
iii) I.X jY; Z/ minfI.X jY /; I.X jZ/g.
10) Prove or refute each of the following statements:
a) T .X; Z/ max.T .X; Y /; T .Y; Z//
b) T .X; Z/ T .X; Y / C T .Y; Z/
c) T ..X; Z/; Y / T .X I Y / C T .Y I Z/.
11) There are three different 40-Cent coins, each of these coins consisting of two
20-Cent coins glued together (head-to-head, head-to-tail, or tail-to-tail). One
of these three coins is drawn randomly and placed on a table. The random
variables X and Y denote the top and bottom side of the drawn coin and
can take the values h (head) or t (tails). Describe this experiment by
means of a probability space. Compute pX D Y , pX D t, pX D Y jY D t,
pX D tjY D h and T .X; Y /.
46
d.3/ D f1; 3; 5g
d.6/ D f6g:
2 7! 3
3 7! .3; 5/
4 7! .2; 4/
5 7! 2
6 7! .1; 5/
References
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. AMS and Oxford University
Press.
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart
and Winston.
References
47
Part II
Chapter 4
51
52
Is it odd?
yes
Second question:
no
Is it 1?
Third question:
Is it 2?
yes
no
yes
no
Is it 3?
Is it 4?
yes
no
yes
no
Second question:
Third question:
no
Is it card 1?
Is card 4 or 5 an ace?
yes
no
yes
no
Is it card 2?
Is it card 4?
Is it card 6?
yes
no
yes
no
yes
no
7,8
A similar example is the following: You are shown a deck of 8 cards containing
2 aces. The deck is shuffled and the cards are put on the table (face down). Your task
is to find one ace by asking yesno-questions. How many questions do you need?
This task turns out to be much harder. In fact, the theory provided in this chapter
does not solve it, nor does classical information theory. We will return to it in
Chap. 10. A reasonable strategy for this task could be the following (see Fig. 4.2):
This strategy needs three questions, except when card 1 is an ace; in this case it
needs two. The probability for this is 14 . So on average this strategy needs 14 2 C
3
3 D 2 34 questions.
4
Is there a better strategy? The answer will be given in Chap. 10. There are good
reasons why the best strategy should need between 2 and 3 questions, but it may be
surprising that the result is closer to 3. After all, with 3 questions one can find 1 ace
53
among 8 cards. Adding another ace to the deck doubles the probability of hitting an
ace (from 18 to 14 ), so should we not be able to do it in 2 questions? This proves to be
impossible after a few trials. Another argument for an information content of 2 goes
as follows. Let us assume one of the aces is black and one red. In order to determine
the color of the ace we have found, we need one additional bit of information. To
find out one ace plus its color (e.g., the red ace) we need 3 questions, i.e., 3 bits
of information. So the information content of the localization of one ace should be
3 1 D 2 bits (if information is additive). Well, it turns out that in this problem
information is not additive.
k
X
Li p.f! 2 W d.!/ D Ai g/ D
i D1
k
X
p e
Ai Li ;
i D1
where e
Ai D d D Ai as defined in Definition 2.6.
More generally, assume that we want to guess the values of a random variable
X W ! A, where A D fa1 ; : : : ; an g. Let pX D ai D pi . We may summarize
this situation in the scheme
a1 : : : an
:
p1 : : : pn
Again we denote by Li the number of questions needed (in a certain guessing
strategy) to determine ai , then the average number of questions needed is
E.L/ D
n
X
pi Li :
i D1
A fixed guessing strategy starts with a fixed first question. Then, depending on
the first answer (yes or no), there are two alternative fixed second questions. Again
in every possible case after the second answer, there is a fixed third question, and so
on. A useful way of picturing such a guessing strategy and its outcomes is by means
54
0
0
000
1
01
1000
1010
1110
001
of a tree (Fig. 4.3). It has a certain number of levels; at each level (l D 1; : : : ; k), the
number b.l/ of branches corresponds to the number of different cases after the lth
question. It is clear that b.l/ 2l for yesno questions, and b.l/ will indeed usually
be smaller, because whenever one value ai of X is determined (at level Li ) we stop
asking questions and therefore cut off the subsequent branches at higher levels. The
number of possible branches at levels l > Li that are thus cut off are 2lLi . The
n
highest level k in the tree is, of course, k D max Li .
At the level k,
n
P
i D1
kLi
i D1
n
P
i D1
n
P
2Li 1. This
i D1
1
[
i D1
Bi ;
55
It should be noted that the codes c that occur in this correspondence to guessing
strategies have a particulary nice property: they are irreducible (or prefix-free).
Definition 4.2. A code cW A ! B is called irreducible, if there are no two a; a0
in A such that c.a/ is a beginning (or prefix) of c.a0 /. A codeword b is called a
beginning of b 0 , if l.b/ l.b 0 / and bi D bi0 for i D 1; : : : ; l.b/.
Example 4.1. For A D fa; b; c; d; e; f; gg and B D f0; 1g, consider the two codes
c and c 0 defined by
c.a/ D 0;
c.b/ D 10;
c.c/ D 1100;
c.e/ D 1110;
c.f / D 11110;
c.g/ D 11111
c 0 .a/ D 0;
c 0 .b/ D 10;
c 0 .c/ D 110;
c 0 .e/ D 1110;
c 0 .f / D 0101;
c 0 .g/ D 1111:
c.d / D 1101;
and
c 0 .d / D 1101;
t
u
a1 a2 : : : an
p1 p2 : : : pn
we may ask for an optimal irreducible 0-1-code for p, i.e., an irreducible code
cW f1; : : : ; ng ! f0; 1g with minimal average length L.c/. We define L.p/ as the
average length L.c/ of this code.
56
Proof. .i / .i i /: is obvious.
.i i / .i i i /: Let us repeat the proof of this fact in the language of coding: Let
n
c be an irreducible code for A with L1 ; : : : ; Ln . Let k D max Li .
i D1
kLi
i D1
Therefore
n
X
i D1
# f0; 1gk D 2k :
n
P
n
[
!
Mk .c.ai //
i D1
2Li 1.
i D1
.i i i / .i i /: Assume that L1 L2 : : : Ln .
We select an arbitrary codeword c.a1 / of length L1 for a1 . Then
we select an arbitrary codeword c.a2 / of length L2 , which does
not have c.a1 / as a beginning. This means that the sets ML2 .c.a1 //
and ML2 .c.a2 // defined above, have to be disjoint. We repeat this
procedure until we select c.an /. It will work as long as in every
step j there is a suitable codeword left, i.e., a word of length Lj
that has none of the words c.a1 /; : : : ; c.aj 1 / as a beginning. But
jP
1
jP
1
2Lj Li , i.e., 1 >
2Li ,
this is the case as long as 2Lj >
i D1
i D1
t
u
57
t
u
a1 : : : an2
an1
p1 : : : pn2 pn1 C pn
where an and an1 have merged into an1 . The reduced scheme may now be
reordered so that the probabilities in the lower row decrease from left to right. We
continue this procedure with the same recipe until n D 1. The code obtained by
means of this procedure is called the Huffman code (cf. Huffman 1952).
Lemma 4.1. If c is an optimal code for a scheme with probabilities .p1 ; : : : ; pn /
that has the corresponding codeword lengths L1 ; : : : ; Ln and pi <pj , then Li Lj .
Proof. Otherwise one would get a better code c 0 by choosing c 0 .ai / D c.aj /,
c 0 .aj / D c.ai / and c 0 .ak / D c.ak / 8 k i; j :
L.c/ L.c 0 / D
n
X
rD1
0
pr Lr @
1
pr Lr C pi Lj C pj Li A
ri;j
D pi Li C pj Lj pi Lj pj Li
D .pj pi /.Lj Li / > 0;
if Lj > Li :
t
u
58
Theorem 4.2. The Huffman code h is an optimal irreducible code for the scheme
a1 : : : an
p1 : : : pn
;
()
if .p1 ; : : : ; pn / was ordered such that pn1 and pn are the two smallest elements of
the vector.
Proof. By induction on n: Let c be an optimal code for p, then by Lemmas 4.1
and 4.2 we can assume that the codewords for an1 and an are among the longest
codewords and (possibly by exchanging longest codewords) that they are of the form
c.an1 / D w0 and c.an / D w1.
Thus .c.a1 /; : : : ; c.an2 /; w/ D c 0 is a code for .p1 ; : : : ; pn2 ; pn1 C pn / and
L.c/ D L.c 0 / C pn1 C pn . Clearly c 0 has to be optimal, because otherwise it could
be shortened leading also to a shorter code for p. This proves ().
By construction, the Huffman codes h also fulfill (), i.e.,
L h.p1 ; : : : ; pn / D L h.p1 ; : : : ; pn2 ; pn1 C pn / C pn1 C pn ;
t
u
n
X
i D1
pi log2 pi
n
X
i D1
n
P
i D1
pi D 1 and
pi log2 qi :
n
P
i D1
qi 1
59
Proof. This statement has essentially been proved in Proposition 3.1. As in Proposition 3.1, we use the properties of the natural logarithm in the proof. Our assertion
follows from the fact that
n
X
pi ln pi C
i D1
n
X
pi ln qi D
i D1
n
X
i D1
qi
pi
n
X
n
X
pi ln
qi
i D1
ln x x 1
n
X
i D1
pi .
qi
1/
pi
pi 0:
t
u
i D1
If we now take qi D 2Li in Proposition 4.1, then Theorem 4.1 holds for the
lengths Li and we see that
I.X / D
n
X
pi log2 pi
i D1
n
X
pi log2 qi D
i D1
n
X
pi Li D E.L/:
i D1
Li
i D1
n
X
2. log2 pi / D 1;
i D1
n
X
pi Li D
i D1
n
X
pi .d log2 pi e/
i D1
n
X
pi . log2 pi C 1/ D I.X / C 1:
i D1
Thus the optimal guessing strategy (or code) has an average number of questions
E.L/, which is close to I.X /, and we have shown the following
Proposition 4.2. If L is the number of questions needed in an optimal guessing
strategy for X , then
I.X / E.L/
n
X
i D1
t
u
60
4.8 Exercises
1) Try to find an optimal guessing strategy for Example 4.2 and draw the
corresponding tree.
2) Given a deck of 16 cards, among them 4 aces. Let Ai D [the i th card is the first
ace,
from the top of the deck]. Find an optimal code for the scheme
A1 :::Acounted
16
with
the corresponding probabilities.
p1 :::p16
3) Given a dice, i.e., D f1; : : : ; 6g with equal probabilities, find an optimal code
for it.
4) Consider the information I as a function on probability vectors
n
P
p D .p1 ; : : : ; pn / with pi 0 and
pi D 1 defined by
i D1
I.p/ D
n
X
pi log2 pi :
i D1
4.8 Exercises
61
1
1
n; : : : ; n
D log2 n
6:51 %
1:89 %
3:06 %
5:08 %
17:40 %
1:66 %
3:01 %
H
I
J
K
L
M
N
4:76 %
7:55 %
0:27 %
1:21 %
3:44 %
2:53 %
9:78 %
O
P
Q
R
S
T
U
2:51 %
0:79 %
0:02 %
7:00 %
7:27 %
6:15 %
4:35 %
V
W
X
Y
Z
0:67 %
1:89 %
0:03 %
0:04 %
1:13 %
8) Can it be that there are two optimal codes with different codewordlengths for
the same probability vector p?
9) A game with 5 (or 6) uniformly distributed outcomes is played repeatedly. The
result shall be transmitted binary with a maximum of 2.5 bit available for each
result. For which n 2 N exists a proper n-tuple code?
10) 6 cards (3 aces, 3 kings) are placed side by side on a table in random order. One
wants to find the position of an ace. Determine an optimal guessing strategy for
this purpose.
11) Answer the following questions:
1) Is there a binary code with six codewords of length 1,2,2,2,2, and 2?
2) Is there a binary code with six codewords of length 1,3,3,3,3, and 3?
3) Is there a prefix-free binary code with six codewords of length 1,3,3,3,3,
and 3?
4) Is there a prefix-free binary code with six codewords of length 2,3,3,3,3,
and 3?
1
Modified from Beutelspacher, A. (1993). Kryptologie. Friedr. Vieweg & Sohn Verlagsgesellschaft
mbH, Braunschweig/Wiesbaden.
62
References
Bose, R. C., & Ray-Chaudhuri, D. K. (1960). On a class of error correcting binary group codes.
Information and Control, 3, 6879.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications.
New York: Wiley.
Hamming, R. V. (1950). Error detecting and error correcting codes. Bell Systems Technical Journal,
29, 147160.
Huffman, D. A. (1952). A method for the construction of minimum redundancy codes. Proceedings
of the IRE, 40, 10981101.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Chapter 5
Information Transmission
This chapter introduces the concept of a transition probability and the problem of
guessing the input of an information channel from observing its output. It gives a
first idea on the classical results of Shannon, without introducing the technicalities
of stationary stochastic processes and the proof of Shannys Theorem. This material
is provided in the next three chapters. Since it is not necessary for the understanding
of Parts IV, V, and VI, one can move directly to Part IV after this chapter.
pX D 6; E1 D 1; E2 D 1
pE1 D 1; E2 D 1
63
64
5 Information Transmission
1
0:9 0:8
6
D5
5
1
0:9 0:8 C 0:1 0:2
6
6
0:72
0:72
D5
D5
>4
0:72 C 5 0:02
0:82
E.W jE1 D 0; E2 D 0/ D 5
pX D 6; E1 D 0; E2 D 0
pE1 D 0; E2 D 0
0:1 0:2
0:1 0:2 C 5 0:9 0:8
0:02
< 0:01
D5
3:62
D5
E.W jE1 D 0; E2 D 1/ D 5
pX D 6; E1 D 0; E2 D 1
pE1 D 0; E2 D 1
0:1 0:8
0:1 0:8 C 5 0:9 0:2
0:40
1
0:08
D
<
D5
0:98
0:98
2
D5
E.W jE1 D 1; E2 D 0/ D 5
pX D 6; E1 D 1; E2 D 0
pE1 D 1; E2 D 0
0:9 0:2
0:9 0:2 C 5 0:1 0:8
0:45
0:18
D
> 1:5
D5
0:58
0:29
D5
The most interesting cases are those of conflicting expert opinion. Our result is
that one should rely on the better of the two experts, which seems obvious.
This case can also be used to show the difference between maximum likelihood and maximum Bayesian probability. Here we ask the following question:
Given the two conflicting expert opinions, is it more likely that X D 6 or that
X 6? The correct (and Bayesian) interpretation of this question compares
pX D 6jE1 ; E2 with pX 6jE1 ; E2 . A more sloppy interpretation of this
question might compare pE1 ; E2 jX D 6 with pE1 ; E2 jX 6. In fact, the two
interpretations may lead to different results. Here it happens for E1 D 1; E2 D 0.
To see this we can compute the so-called likelihood-ratio
65
0:9 0:2
0:18
pE1 D 1; E2 D 0jX D 6
D
D
>1
pE1 D 1; E2 D 0jX 6
0:1 0:8
0:08
and compare it with the Bayesian probability ratio
pX D 6jE1 D 1; E2 D 0
pX D 6; E1 D 1; E2 D 0
1 0:18
0:18
D
D
D
< 1:
pX 6jE1 D 1; E2 D 0
pX 6; E1 D 1; E2 D 0
5 0:08
0:4
This result means that in the case of conflicting values of our experts E1 and E2 , the
most probably correct guess on X is that X 6 rather than X D 6. However, we
may still bet successfully on X D 6.
This definition is correct for finite sets A. In general, there are some more technical requirements
concerning measurability (see Bauer 1972 for example) which we do not mention here.
66
5 Information Transmission
Definition 5.2. For two random variables X and Y with values in A and B
respectively, and a transition probability P W A
B, we write P W X
Y and
say that X is transmitted into Y by P , or P links Y to X , if for every a 2 A and
every (measurable) M B
pY 2 M jX D a D pa .M /:
This definition essentially means that the two random variables X and Y are
linked through the transition probability P .
The knowledge of the channel mechanism, i.e., the probabilities pa for each
a 2 A makes it possible to determine the joint probabilities
pab D prX D a; Y D b D prX D a pa .b/ D q.a/pa .b/;
given a probability distribution q on A.
From these joint probabilities one can also obtain the corresponding probability
distribution on B by
X
prY D b D
prX D a; Y D b:
a2A
67
0
68
5 Information Transmission
Unfortunately, not all examples work out as nicely and so one eventually needs a
rather complicated theory that again makes use of sequences of inputs and outputs
in a similar way as we just remarked after Proposition 4.2. With these techniques
one can eventually prove that (up to an arbitrarily small error probability) one can
indeed get as many bits of information safely (or with very high fidelity) through a
channel as is computed by maximizing the transinformation.2 This quantity is called
the channel capacity.
Definition 5.3. Let P W A
Y g:
Thus the maximum is taken over all input random variables X with values in
A and the output Y is linked to X by P . For A D a1 ; : : : ; an we could as well
maximize over all probability vectors p D .p1 ; : : : ; pn / where pi D p.ai / D
pX D ai . The maximization can then be carried out by the ordinary methods of
real analysis (Exercises 9 and 10).
The meaning of this definition of capacity is quite easy to understand in view of
the remarks at the end of Chap. 4. The transinformation T .X; Y / D I.X / I.X jY /
measures the amount of uncertainty about X that is removed by knowing Y , or the
amount of information about X provided by Y , or the average reduction in the
number of yesno questions needed to guess the value of X , that one gets from
knowing the value of Y . In simple words: T .X; Y / measures, how much the channel
output Y says about the channel input X . And the channel capacity is the maximum
that the channel output can say about the channel input for an appropriately chosen
input.
Definition 5.4. Let P W A
B and QW B
C be two transition probabilities
we define their composition R D Q P as the transition probability RW A
C
given by
X
ra .c/ WD
pa .b/ qb .c/:
b2B
69
pX D a pY D bjX D a
pX D a; Y D b
D
;
pY D b
pY D b
pX D a jY D b > pX D ajY D b
pX D a pa .b/ > pX D a pa .b/:
Thus one has to maximize simply the forward transition probability pa .b/,
weighted by the so-called a-priori probability pX D a. And in many practical
cases the a priori probabilities are all equal and can also be left out.
Definition 5.5. Given a transition probability P W A
B (A and B finite sets), an a
priori probability p for A and an output b 2 B, then a D G.b/ is called a Bayesian
guess for b, if p.a / pa .b/ p.a/ pa .b/ for every a 2 A.
The mapping GW B
A can also be regarded as a channel and one can now
consider the combined channel Q D G P W A
A which is defined by the
transition probabilities qa .A0 / WD pa .G 1 .A0 // for any a 2 A and any A0 A.
A good measure for the fidelity of the whole procedure of transmitting a message
in A through a channel and guessing it from the channels output is the error
probability, which is defined as e D pG.Y / X , where again P W X
Y . It
can be calculated as follows:
XX
eD
pX D a pa .b/ 1G.b/a :
a2A b2B
If we observe that
a2A
eD
X
X
pY D b pX D G.b/ pG.b/ .b/ D 1
pX D G.b/ pG.b/ .b/:
b2B
b2B
Clearly this error probability is not the same as the average error e introduced
in Chap. 5.2 because it depends on the probabilities of the input variable X . The
error probability of the combined channel from A back to A is also related to
the transinformation of this channel, because, if the error probability is low, the
transinformation has to be high. Clearly this also means that the transinformation of
the original channel P was high. This is made explicit in the following proposition.
Since T .X; Y / D I.X / I.X jY /, high transinformation means that I.X jY / has
to be close to zero. So we consider the relation between I.X jY / and e.
70
5 Information Transmission
b2B
pY D b eb .
(ii) We first consider the case that Y D b and consider I.X jY D b/ D Ib .X /
which is just the information of X for the probability pY Db and denoted as
e /, i.e., NY Db .X
e / D Ib .X /.
NY Db .X
P
From Proposition 3.7 we know that I.X jY / D
pY D bI.X jY D b/.
b2B
Now
Ib .X / D Ib .X; Z/ D Ib .Z/ C Ib .X jZ/
D eb log2 eb .1 eb / log2 .1 eb /
e / C .1 eb / NY Db;X DG.b/ .X/:
e
C eb NY Db;X G.b/ .X
The first equality holds because knowing the values of both X and Y , we also
know Z.
The last term on the right is zero because we know X D G.b/. It is quite easy
to estimate the second-last term since for a G.b/, we have
pX D ajY D b pX D G.b/jY D b D 1 eb
and so
pX D ajX G.b/; Y D b
1 eb
eb
.
e / log2 . 1eb /. We use this estimate only
This implies that NY Db;X G.b/ .X
eb
for eb > 12 . For eb 12 we obtain
Ib .X / eb log2 eb .1 eb / log2 .1 eb / 2eb
and also for eb >
1
2
we get
1 e
b
eb
Thus I.X jY / D
71
pY D bIb .X / 2e.
(i) We have
I.X jY / D I..X; Z/jY / I.X; Z/
D I.Z/ C I.X jZ/
e log2 e .1 e/ log2 .1 e/ C e log2 n:
t
u
We end this chapter with a few remarks on the practical computation of the
channel capacity. A case that is quite common in applications is a so-called
uniformly disturbing channel. We call a transition probability P W A
B uniformly
disturbing, if all conditional probabilities pa (for every a 2 A) on B have the same
information I.Pa /, i.e., if I.Pa / D c for all a 2 A. Such a channel produces
the same amount of uncertainty on its output for every input a. For a uniformly
disturbing channel P , it is easy to compute c.P / because P W X
Y implies
T .X; Y / D I.Y / I.Y jX / D I.Y / c. So
c.P / D maxfI.Y /W P W X
Y; X W ! Ag c:
t
u
In many cases I.Y / can be maximized quite easily and often the maximal
possible value of I.Y /, namely I.Y / D log2 #.B/ can be obtained. This is the
case for example for so-called symmetric channels, where one has to choose X
equally distributed to make also Y equally distributed yielding I.Y / D log2 #.B/.
Another case occurs when the matrix P is invertible, so that one can compute the
distribution for X that makes Y equally distributed. In these cases one simply gets
c.P / D log2 #.B/ c.
The next three chapters need a little more mathematics on stochastic processes
than the rest of the book; they may be skipped because they are not necessary for
the understanding of the subsequent chapters.
72
5 Information Transmission
5.5 Exercises
1) A lonesome islander has tried to repair the telegraph from Exercise 3.14) so
that he can play dice with his neighbors again. Since he has never learned to
fix electronic devices, the attempted repair failed. Now the telegraph shows the
following transmission behaviour:
1 7! 6
2 7! .2; 4; 6/
3 7! .3; 5/
4 7! .2; 4/
5 7! .2; 6/
6 7! .1; 3; 5; 6/
1p p
:
p p1
Here p stands for the error probability of the channel. Calculate the channel
capacity. What is the result of T .X; Y / for p D 0:17?
Hint: Use T .X; Y / D I.Y / I.Y jX /.
3) The channel from Exercise 2) with error probability p D 0:17 shall now be
used to transmit symbols. What error probability per bit results from optimal
guessing by the use of gW f0; 1g3 ! f0; 1g? Determine the transinformation
T .X; g.Y //, whereas X and Y denote the sent and received bit, respectively.
4) Let P be the symmetric binary channel from Exercise 2). Calculate the
transition probability of channel Q that results from applying the channel P
twice.
5) Consider three discrete random variables X; Y; and Z. X and Y are called
independent given Z, if for every a; b; and c
pX D a; Y D bjZ D c D pX D ajZ D c pY D bjZ D c:
Show that T .X; Y / T .X; Z/ if X and Y are independent given Z.
5.5 Exercises
73
a
0
b
0
2
p
1
p
7) 8 shells are placed on a street. Underneath two of the shells there lies
1 e, respectively. Someone points at one of the shells and says that there is
a euro lying underneath it.
This situation can be modeled with the probability space D f!
f1; 2; : : : ; 8gW j!j D 2g with D P./ and the equipartion (uniform
distribution) p.!/. How big is p.!/ for ! 2 ? We define the random
variable Xk WD 1f!Wk2!g for k D 1; 2; : : : ; 8. Calculate I.Xk /, I.X1 ; X3 / and
I.X1 ; X2 ; X3 /.
Let the description dk .!/ D f! 0 W k 2 ! 0 g if k 2 !, and dk .!/ D if k !.
How big are N .Xk / and I.dk /?
8
T
Next, we define d.!/ WD
dk .!/. How big is N .d /?
kD1
74
5 Information Transmission
842
B3 8 4
B
1 B
B1 3 8
B
P D
16 B 0 1 3
B
@0 0 1
011
11
10
31
83
48
24
1
0
0C
C
C
0C
C:
1C
C
3A
8
X D 6
if x D 6;
X
if x 6:
11) In a game show one can bet 10 e on wether a blindly cast dice shows a six or
not. If the guess is correct one wins 60 e, otherwise one loses the 10 e.
You happen to know a staff member of the show, who reveals insider information to you. Before the show is broadcast, you come to know if the dice is
showing a six or not. Via the channel given in Exercise 10) your informant sends
1 if the outcome is not a six, otherwise he sends a 6. Actually, he had used
the channel before to transmit the cast number, but has decided that it is better
to just send a 1 or a 6. After receiving the information you can decide whether
to take part in the game or not. How does this information need to be evaluated
for maximizing profit? How much is the average profit per show?
Reference
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart
and Winston.
Fano, RM. (1961). Transmission of Information: A Statistical Theory of Communication. Wiley,
New York
Part III
Chapter 6
This chapter briefly introduces the necessary concepts from the theory of stochastic
processes (see for example Lamperti 1977; Doob 1953) that are needed for a proper
definition of information rate and channel capacity, following Shannon.
The purpose of Part III is a compact development of the main results of classical
information theory, including Shannons theorem. I believe that the use of the
e as defined in Chap. 2 simplifies
concept of a description and in particular X
the notation and perhaps also the understanding a bit. These results need the
terminology of stochastic processes. For this reason they are usually regarded as
technically demanding and not included in introductory textbooks such as Topse
(1974). Part III can be skipped by experts who already know the classical results on
channel capacity and by beginners who want to understand basic information theory
and the new concepts of novelty and surprise introduced in this book.
77
78
Fig. 6.1 Channel example
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
0.2
0.8
The transinformation of the channel C is 2:6 bit1 . Sending the digits twice would
4
result in an error probability of 160
at most, because only two transmission errors
would lead to a wrong guess at the channel output. This yields a transinformation of
3:10245 bit.2 This means a bit-rate of 3:10245=2 D 1:55123 bits per channel use.
The method of coding two digits in three symbols which are transmitted reliably
yields a bit-rate of 23 log 10 D 2:2146. The last method, coding 9 digits in 13
9
symbols yields a bit-rate of 13
log 10 D 2:2998.
All these bit-rates stay below the capacity of the channel C. In this section
we work towards the demonstration that for sufficiently long sequence-codes it is
possible to achieve bit-rates close to the capacity with small error probabilities. To
this end we have to consider sequences of random inputs and outputs at the channel
under consideration. These sequences are called stochastic processes.
8
2
c D log2 10 C 10
log2 8 C 10
log2 2 log2 10 D 2:6 bit.
68 16
1
16
1
2
bit-rate D log2 10 C 100
log2 17
C 17
log2 17
D log2 10 C
17
log2 10 0:322757 0:68 D 3:10245.
1
68 64
100 17
log2 17 D
79
n
Y
pXi 2 Ai D
i D1
n
Y
n
Y
pX1 2 Ai
i D1
pXt Ci 2 Ai
i D1
D pXt C1 2 A1 ; : : : ; Xt Cn 2 An :
t
u
80
following: For a Markov process the future is independent of the past given the
presence.
Proposition 6.2. Let X be a stationary Markov process on a finite set A. Let
PWA
A be given by pab D pX2 D bjX1 D a. Then
pX1 D a1 ; X2 D a2 ; : : : ; Xn D an D pX1 D a1
n1
Y
Pai ai C1 :
i D1
n1
Y
pXi C1
i D1
D ai C1 jX1 D a1 ; : : : ; Xi D ai
D pX1 D a1
n1
Y
pXi C1 D ai C1 jXi D ai
i D1
D pX1 D a1
n1
Y
pX2 D ai C1 jXi D ai
i D1
qi D pX1 D ai DpX2 D ai D
pX2 D ai jX1 D aj qj D
Pj i qj :u
t
n!1
where Yn D
1
n
n
P
i D1
Xi .
This obviously means that for large n, the average Yn of the first n random
variables Xi will be with very high probability very close to a constant valueits
expectation. Thus the scatter in Yn will become negligible.
81
Proposition 6.3. Every i.i.d. process with finite E.X12 / satisfies the w.l.l.n.
Proof. For the proof we need a well known estimate of pjYn E.Yn /j , called
the Chebyshev-inequality:
For any random variable X and any > 0 we have
1jX j jX j
and thus
pjX j E.jX j/:
Now we consider the function
X
n
2
1
Xi E.Xi /
X WD .Yn E.Yn // D
n i D1
2
n
n
1 XX
E..Xi E.Xi //.Xj E.Xj ///:
n2 i D1 j D1
For i j these expectations are E .Xi E.Xi // E.Xj E.Xj // D 0, and for
i D j they are
E..Xi E.Xi //2 / D E.Xi2 / .E.Xi //2 D Var.Xi / D Var.X1 /:
Thus
E.X / D
1
1
E.X12 / .E.X1 //2 D Var.X1 /
n
n
and therefore
p.Yn E.Yn //2 2 D pjYn E.Yn /j
Var.X1 /
!0
2 n
for n ! 1:u
t
82
n
X
I.Xi /:
i D1
n
X
I.Xi /:
t
u
i D1
I.X / D lim
exists.
Proof. Recall that
(6.1)
n
X
I.Xi jX1 ; : : : ; Xi 1 /
for each n 2 N.
(6.2)
i D2
It is also not difficult to check (see Prop. 3.12.(iii) or 3.9.(iv) and consider the
stationarity of X ) that for each i 2 N
I.Xi jX1 ; : : : ; Xi 1 / I.Xi 1 jX1 ; : : : ; Xi 2 /:
(6.3)
Thus in (6.2) every term in the sum on the right hand side is less than or equal to
its predecessor. Furthermore, we have
I.X1 / D I.X2 / I.X2 jX1 /
by Proposition 3.9.(v).
Therefore
n I.Xn jX1 ; : : : ; Xn1 / I.X1 ; : : : ; Xn /
(6.4)
83
and
(6.1)
I.X1 ; : : : ; Xn / D I.X1 ; : : : ; Xn1 / C I.Xn jX1 ; : : : ; Xn1 /
(6.3)
I.X1 ; : : : ; Xn1 / C I.Xn1 jX1 ; : : : ; Xn2 /
(6.4)
I.X1 ; : : : ; Xn1 / C
D
1
I.X1 ; : : : ; Xn1 /
n1
n
I.X1 ; : : : ; Xn1 /;
n1
and so
1
1
I.X1 ; : : : ; Xn /
I.X1 ; : : : ; Xn1 /:
n
n1
This means that the sequence n1 I.X1 ; : : : ; Xn / is decreasing. Since it is always
positive, the limit exists.
t
u
Definition 6.6. The limit in Proposition 6.4 is called the information rate I.X / of
the process X D .Xn /n2N .
It is important to observe that the information rate I.X / of a process X coincides
with the average information needed for determining the value taken by one of the
random variables Xn , when knowing the previous ones, as the following proposition
shows.
Proposition 6.5. Given a stationary process X D .Xn /n2N , then
lim I.Xn jX1 ; : : : ; Xn1 / D I.X /:
n!1
bn D
t
u
Proposition 6.6. Let X be a stationary process in A. Let cn denote an optimal 0-1code for .X1 ; : : : ; Xn /, then
lim
n!1
1
L.cn / D I.X /:
n
Proof.
1
1
1
1
I.X1 ; : : : ; Xn / L.cn / I.X1 ; : : : ; Xn / C
n
n
n
n
by Proposition 4.2.
t
u
84
and
I.X1 / D I.X /:
t
u
Proof. Obvious.
The last two propositions show that the information content of a random variable
can be precisely identified with the average number of yesno questions needed in
repeated guessing of this variable for one determination of its value.
A similar interpretation as for I.X / can now also be given for
I.X jY / D
pY D b I.X jY D b/:
n!1
1
T ..X1 ; : : : ; Xn /; .Y1 ; : : : ; Yn //:
n
85
1
I.X1 ; : : : ; Xn /
n!1 n
I.X / D I exists. Now we are interested not only in the averages but also
in the individual values of the functions In .x/, where x D .x1 ; : : : ; xn / is a
particular value of the random vector .X1 ; : : : ; Xn /. We want to investigate, for
which individual vectors x they come close to the average I for large n. It will
turn out that this can happen for most x, i.e., with high probability.
Definition 6.8. A process X D .Xn /n2N with values in a finite set A is said to
have the asymptotic equipartition property (a.e.p.), if the sequence .InC1 In /n2N
satisfies the w.l.l.n.
This definition actually means that for processes with a.e.p. n1 In comes close to
I with high probability.
Proposition 6.8. An i.i.d. process X D .Xn /n2N with values in a finite set A
satisfies the a.e.p.
Proof. Because the Xi are independent, we have
n
X
e
e
ei :
N X
In D N X 1 \ : : : \ X n D
i D1
of the
more detail. It means that p n1 In I < ! 1 for any > 0. Here I D I.X / is
the information rate of the process X D .Xn /n2N . If we define
86
1
D ! 2 W In .!/ I <
n
Hn;
and
log2 p.a/
< I C ;
n
If we sum these inequalities for all a 2 An; , we obtain the following estimates
for #An; :
#An; > p.An; / 2n.I / > .1 / 2n.I / ;
#An; < p.An; / 2n.I C/ 2n.I C/ :
Thus we have proved the following.
Proposition 6.9. Let X D .Xn /n2N be a process with values in A that has the
a.e.p. and the information rate I . Then there is for every > 0, an n 2 N and a set
An; An of so-called high-probability sequences satisfying
i) p.An; / > 1 ,
ii) 2n.I / > p.a/ > 2n.I C/ for every a 2 An; ,
iii) .1 / 2n.I / < #An; < 2n.I C/ .
6.8 Exercises
87
The classes of processes that satisfy the w.l.l.n. or the a.e.p. have been more
thoroughly investigated in the mathematical literature, in particular, in a branch of
ergodic theory which analyzes dynamical systems from a probabilistic point of
view (see also Gray 1990). It turns out that they are in fact rather large compared to
the very special case of i.i.d. processes, since they contain for example the class
of all ergodic processes (see for example Billingsley 1978 or Friedman 1970;
Walters 1982). It is the requirement of independence of the variables Xi that is so
strong, and it has actually been weakened in several interesting ways (Billingsley
1978; Gray 1990).
6.8 Exercises
1) Let A D f1; : : : ; ng and .Xi /i 2N be independent identically distributed on A
such that pXi D k D n1 for every i 2 N and k 2 A. For any B A obviously
p.B/ D n1 #.B/. The relative frequency with which the Xi hit the set B is defined
m
P
1B Xi . Can one say that the random variables Ym converge in
as Ym D m1
i D1
88
3) Is it always true that the combined process .Xi ; Yi /i 2N is i.i.d. if the processes
.Xi /i 2N and .Yi /i 2N are i.i.d.?
4) Let .Xi /i 2N be independent identically distributed random variables on the finite
set A with pX1 D a 0 for every a 2 A. What is the probability that every
word from A occurs infinitely often in a sequence .Xi /i 2N ? Here A denotes
the set of all finite words on A, given by
A D
Ai
i 2N
References
Ash, R. B. (1965). Information theory. New York, London, Sidney: Interscience.
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart
and Winston.
Billingsley, P. (1978). Ergodic theory and information. Huntington, NY: Robert E. Krieger
Publishing Co.
Cover, T. M. & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Doob, J. L. (1953). Stochastic processes. New York: Wiley.
Friedman, N. A. (1970). Introduction to ergodic theory. New York: Van Nostrand Reinhold
Company.
Gray, R. M. (1990). Entropy and information theory. New York: Springer.
Khinchin, A. (1957). Mathematical foundations of information theory. New York: Dover Publications, Inc.
Lamperti, J. (1966). Probability : A survey of the mathematical theory. Reading, Massachusetts:
Benjamin/Cummings.
Lamperti, J. (1977). Stochastic processes - a survey of the mathematical theory. Applied Mathematical Sciences 23, New York: Springer.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Topse, F. (1974). Informationstheorie: eine Einfuhrung. Stuttgart: Teubner Verlag.
Walters, P. (1982). An introduction to ergodic theory. New York, Berlin, Heidelberg: Springer.
Chapter 7
Channel Capacity
In this chapter we extend the definitions of Chap. 5 to real information channels that
handle sequences of symbols instead of single symbols. This extension is necessary
to use the idea of taking a limit of very long sequences to define information rate
(Definition 5.5) now to define transinformation rate and channel capacity. This leads
to the proof of Shannons famous theorem in the next chapter.
In this very general definition, one needs the canonical -algebras on AN and B N that are
generated by the so-called cylinder sets (cf. Bauer 1972 and see also Definition 7.2).
89
90
Fig. 7.1 Data transmission
on a channel
7 Channel Capacity
t=1
a1
t=2
b1
a2
b3
a4
t=4
t=3
a3
b2
b4
91
In the following we shall concentrate on the simplest type of channel: the channel
without memory and anticipation, also called the simple channel.
Let .pa /a2A be a transition probability from A to B, where A and B are both
finite sets.
We can then construct an information channel with input alphabet A and
output alphabet B that simply works independently on each term ai in an
input sequence .a1 ; a2 ; : : :/ and produces the corresponding term bi of the output
sequence .b1 ; b2 ; : : :/.
For a 2 AN we thus define
Ca b1 2 B1 ; : : : ; bn 2 Bn WD pa1 .B1 / pa2 .B2 / : : : pan .Bn /:
(7.1)
Y, then
i) If X is stationary, so is Y,
ii) If X is independent, so is Y.
t
u
B N is defined by2
Yg:
Of course, a general definition should not make requirements on the processes involved. Yet
Shannons theorem relies on strong properties and it seems adequate to restrict the definition to
stationary processes.
92
7 Channel Capacity
B and a probability p0
Y g;
a2A
where the maximum3 extends over all probability vectors q D .qa /a2A and I denotes
the information of a probability vector as defined in Exercise 4.4.
Proof. Let A D fa1 ; : : : ; am g, B D fb1 ; : : : ; bn g. For two stationary processes X
and Y with CW X
Y we have
T .X ; Y/
D
D
D
./
(7.1)
1
I .Y1 ; : : : ; Yn / j .X1 ; : : : ; Xn /
n
1 e
e n / j .X
e1 \ : : : \ X
e n/
N .Y 1 \ : : : \ Y
I.Y/ lim
n!1 n
1
e1 \ : : : \ Y
en / j X
en
E log2 p .Y
I.Y/ C lim
n!1 n
n
Y
1
I.Y/ C lim E log2
PXi Yi
n!1 n
i D1
I.Y/ lim
n!1
1X
E log2 PXi Yi
n!1 n
i D1
n
I.Y/ C lim
The maximum is attained because the set of all probability vectors q is compact.
93
I.Y/ C E log2 PX1 Y1
XX
pX1 D aPa .b/ log2 Pa .b/ :
I.Y/ C
a2A b2B
depends only on X1
n
Y
pxi .yi /:
i D1
XX
a2A b2B
D I.Y1 /
pX1 D aI.pa /
(7.2)
a2A
Y1 .
t
u
pa b1 D m1 ; : : : ; bn D mn
m1 ;:::;mn
qm c1 2 C1 ; : : : ; cn 2 Cn
where m 2 B N starts with .m1 ; : : : ; mn /.
94
7 Channel Capacity
1p
00
0.5
01
2
2
10
0.5
1p
11
7.5 Exercises
1) What is the memory- and anticipationspan of Examples 7.17.5?
2) How could one modify Example 7.4 with P W A
A
B in order to define a
channel with memory span 4?
3) When two channels with finite memory and anticipation are connected, how
does the memory- and anticipationspan of the composed channel depend on
those of the two channels?
4) Show that for a simple channel (see Definition 7.4 on page 91) Cp defined by
a transition probability pW A
B, the capacity c.Cp / is attained for an i.i.d.
input process on A.
5) Show that the channel capacity c.p/ of a channel pW AN B N , where
Ipa .b1 ; : : : ; bn / is independent of a 2 AN for every n 2 N, can be obtained
by maximizing fI.Y/W X process on A; pW X
Yg.
6) In Example 7.1 take B D f0; 1g, A0 A, and consider the mapping f that has
(
f .a/ D
for a 2 A0 ;
otherwise :
References
Fig. 7.3 Two simple
channels
95
0
2
0.5
1
0.25
0.5
0.5
0.25
9) For the simple channels C given in Fig. 7.3, what is their capacity and what is
the capacity of C C?
10) Show the following:
Proposition 7.3. Let C W AN
B N be a channel. Its capacity
c.C/ satisfies c min.#.A/; #.B//.
References
Ash, R. B. (1965). Information theory. New York, London, Sidney: Interscience.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt,
Rinehart and Winston.
Feinstein, A. (1958). Foundations of information theory. New York: McGraw-Hill Book
Company, Inc.
Chapter 8
The goal of our rather technical excursion into the field of stationary processes
was to formulate and prove Shannons theorem. This is done in this last chapter
of Part III.
Shannons Theorem is one of the most important results for the foundation of
information theory (Shannon and Weaver 1949). It says that the channel capacity
c determines exactly what can effectively be transmitted across the channel. If you
want to transmit less than c bits of information per time unit across the channel you
can manage to do it in such a way that you can recover the original information from
the channel output with high fidelity (i.e., with low error probabilities). However, if
you want to transmit more than c bits per time unit across the channel, this cannot
be done with high fidelity. This theorem again underlines the fact that information
is incompressible (like water) and that a given channel can only transmit a given
amount of it in a given time.
97
98
Coding
C
U
A
X
CAB
B
Y
i
h 1
e1 \ : : : \ X
e n I.X / > <
i) p N X
n
h 1
i
e1 \ : : : \ Y
en I.Y/ > <
ii) p N Y
n
i
h 1
e1 \ : : : \ X
en \ Y
e1 \ : : : \ Y
en I.X ; Y/ > <
iii) p N X
n
h 1
i
e1 \ : : : \ U
e n I.U/ > <
iv) p N U
n
From Proposition 6.8 we can also estimate the number of the corresponding highprobability sequences, i.e., N D #.An; /, #.Bn; /, M D #.Cn; /, and also the number
P of high-probability pairs.
Now the idea is to consider only the high-probability elements in C n , An , B n ,
and An
B n , and to map each high-probability element c D .c1 ; : : : ; cn / 2 Cn;
onto a different randomly chosen a D .a1 ; : : : ; an / that is the first element in a highprobability pair .a; b/. This procedure will work if there are more such as than
there are high-probability cs, and if the probability of finding the first element a
from the second element b in a high-probability pair .a; b/ is sufficiently high. In
this case, we can guess first a and then c from the channel output b. In order to
carry out the proof, we now have to estimate the number of these as appearing in
high-probability pairs .a; b/.
1. Given a high-probability a, we estimate the number Na of high-probability
pairs (a,b) containing a as follows: We use the abbreviations X D .X1 ; : : : ; Xn /,
Y D .Y1 ; : : : ; Yn /, and U D .U1 ; : : : ; Un /, and consider only high-probability
elements ! 2 . Then
99
e jX
e/ D
p.Y
p.X; Y /
2n.I.X ;Y/I.X /C2/:
e/
p.X
Thus
1
b2B n
P
P
:
2n.I.X ;Y/I.X /C2/
Na
Because of the estimates for M and P from Proposition 6.7, this is true if
2n.I.U /C/ .1 /2n.I.X /3/ :
Since I.X / T .X ; Y/ D c > r D I.U/ this is certainly true for sufficiently
small .
2. Given a high-probability b 2 B n , we estimate the number Nb of high-probability
pairs .a; b/ in An
B n containing b similarly to (1):
e jY
e/ D p.X; Y / 2n.I.X ;Y/I.Y/C2/ :
p.X
e/
p.Y
Thus
1
a2An
This number we use to estimate the probability that there is at most one m.c/
occurring as first component among the Nb pairs, for each of the high-probability
bs at the channel output. More exactly, for a fixed high-probability c we take
a D m.c/ as channel input and obtain b as channel output. Now we ask for
the probability pf that there is another c 0 such that .m.c 0 /; b/ is also a highprobability pair. For fixed b let nb be the number of codewords m.c 0 / such that
.m.c 0 /; b/ is a high-probability pair. Now we can estimate
pf pnb 1 < E.nb /
DM
Nb
N
100
t
u
References
101
8.4 Exercises
1) Let P C be the compound channel obtained from connecting two channels
P and C.
a) Show that c.P C/ minfc.P/; c.C/g.
b) Give an example where c.P/ 0 and c.C/ 0, but c.P C/ D 0. This can
be done with deterministic channels P and C.
c) Show that for any > 0 one can construct a deterministic adapter channel
R such that c.P R C/ minfc.P/; c.C/g .
d) What would be a good adapter channel for P D C, for the two channels of
Exercise 7.9?
References
Feinstein, A. (1954). A new basic theorem of information theory. IRE Transactions on Information
Theory, 4, 222.
Feinstein, A. (1959). On the coding theorem and its converse for finite-memory channels.
Information and Control, 2, 2544.
Gray, R. M. (1990). Entropy and information theory. New York: Springer.
Kieffer, J. (1981). Block coding for weakly continuous channels. IEEE Transactions on Information Theory, 27(6), 721727.
McMillan, B. (1953). The basic theorems of information theory. Annals of Mathematical Statistics,
24, 196219.
Pfaffelhuber, E. (1971). Channels with asymptotically decreasing memory and anticipation. IEEE
Transactions on Information Theory, 17(4), 379385.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Shannon, C. E. & Weaver, W. (1949). The mathematical theory of communication. Champaign:
University of Illinois Press.
Wolfowitz, J. (1964). Coding theorems of information theory (2nd ed.). Springer-Verlag New York,
Inc.: Secaucus.
Part IV
Chapter 9
This chapter introduces the notion of a cover or repertoire and its proper descriptions. Based on the new idea of relating covers and descriptions, some interesting
properties of covers are defined.
Definition 9.1. For a probability space .; ; p/, a cover is a subset of n f;g
such that [ D .1
In general, a cover may be a finite or an infinite set of propositions. For a basic
understanding of the concepts, it will be much easier to consider finite covers.
Part II on coding and information transmission illustrated that information is
an objective quantity, which quantifies the amount of information needed to
determine the value of a random variable. On the other hand, novelty is slightly
more subjective in the sense that it takes explicitly into account a particular
interpretation or description d of events x, and so it is a measure of the novelty
provided by viewing the world (the events) through d . Of course, interest in the
value of a particular random variable X , also implies a particular description of
e , but this description is of a special clear-cut type: we have said
events, namely X
it is complete. After having investigated the clear-cut, complete descriptions that
are needed for optimal guessing and coding in Parts II and III, we now want to
come back to the more personal, biased, one-sided descriptions of events. The
opposite extreme to complete descriptions are descriptions that express a very
particular interest pointing only to one direction. They are the directed descriptions,
corresponding to the narrow covers defined in Sect. 9.4.
The world view of a person could be characterized as the collection of
all propositions that (s)he will eventually use (externally or internally) to describe
events in the world. We may also understand as the collection of all propositions
a person is potentially interested in. Such a collection will also be called a
repertoire, meaning the repertoire of all elementary observations (propositions),
through which a person views or describes the occurring events x 2 . One could
105
106
assume that such a repertoire should be closed under the logical operations NOT,
AND, and OR, i.e., complement, intersection, and union. This would make it an
algebra. We do not take this point of view here, in particular with respect to negation.
Indeed, non-table would not appear as such a natural elementary description of a
thing as table.
Definition 9.2. For a cover and a description d , we define the novelty provided
by d for (novelty of d for ) by N .d / WD E.Nd / where
Nd .!/ D supfN .A/W d.!/ A 2 g:
Proposition 9.1. For any finite cover and any description d , we have
i) N .d / N .d /
ii) N .d / D N .d / if and only if R.d / .2
Proof. Obvious.
t
u
107
D fCi D aceW i D 1; : : : ; 8g, where Ci 2 face; no aceg describes the i -th card.
Here is not a partition, because there are two aces among the eight cards.
In the example of Chap. 2, the Monty Hall problem, there is the location of the
sportscar S 2 f1; 2; 3g and the candidates guess C 2 f1; 2; 3g. The quizmaster is
allowed to make a true statement from the repertoire D fS i W i D 1; 2; 3g.
Here, however, there is another restriction given by the situation: He cannot chose
to say S C , because he should not open the door the candidate has chosen.
Further examples for repertoires can be considered in the context of the game
of lotto: 6 numbers are drawn from L D f1; : : : ; 49g without replacement. If you
have guessed them right, you get a lot of money. Usually the six numbers are not
displayed in the order they are drawn, but ordered by their size. Thus we have 6
random variables X1 < X2 < X3 < X4 < X5 < X6 with values in L.
If you regularly take part in the game, you want to know the six numbers. This
interest is described by
D fX1 D a1 ; X2 D a2 ; : : : ; X6 D a6 W a1 < a2 < a3 < a4 < a5 < a6 2 Lg;
which is a partition with large information content, (6 pairs of digits have an
information content of about 40 bit, since they are less than 49 it will be
about 6 bit less). Normally, you are unable to remember these 6 numbers, if
you are just told them. You may, however, be able to remember them, if they
are interesting configurations for example .1; 2; 3; 4; 5; 6/. It seems natural to
say that this configuration is more surprising than any of the usual configurations, like .4; 5; 18; 27; 34; 40/. Some people may be interested in particular
numbers like 3; 13; 33, or perhaps prime numbers. Here I only want to consider
a few particular configurations: it is surprising, if 3 or more numbers come in
a row. This interest in configurations can be expressed by the repertoire D
f3 in a row; 4 in a row; 5 in a row; 6 in a row; g. Another interesting feature
may be the compactness of the sequence, i.e., X6 X1 . So we may be interested
in 2 D fX6 X1 D aW a 2 Lg. Since small values of X6 X1 are more surprising
than large values, this interest may be even better expressed by 2 D fX6 X1
aW a 2 Lg. In addition we may be surprised by large values of X1 or by small values
of X6 as described by
1 D fX1 D aW a 2 Lg;
Let us play a bit with these 7 repertoires. First we should note that X1 ,X6 , and
X6 X1 do not really take all values in L. For example, X6 cannot be less than
6 and X1 cannot be more than 44. So X6 D 3 is the empty set, so is X6 4.
Now it does not really matter whether we put the empty set into a repertoire or not.
The empty set cannot be used to describe anything. In our definitions we usually
assume that ; ; normally, we even assume that p.A/ 0 for all propositions in
a repertoire.
108
109
and 2 2 1 6 :
110
In line with the observations made in Sect. 9.1, it is obvious that the novelty
of the same event ! can be larger if described in En with larger n (n k
implies NEn .!/ NEk .!/). So the real surprise of ! should be evaluated as some
combination of NEn .!/ and n, the idea being that an event ! is more surprising if it
can be described with fewer words (or letters) and has smaller probability.
Also in this line of thinking one could try to take the shortest English expression
(i.e., the smallest n such that En contains the proposition) as a measure for the
information content of a proposition. A more formalized version of this idea indeed
leads to an alternative, algorithmic approach to information theory by Chaitin (1975,
1977, 1987), Kolmogorov (1965, 1968), or to the so-called description complexity.
t
u
Example 9.3. Another example close to the simple extreme fg is the cover
f; Ag, where 0 < p.A/ < 1, which describes a simple bet on the proposition
t
A about events ! 2 , I bet that A. A different example is the cover fA; Ac g. u
Further examples are provided by the range R.d / of a description d , or by the
cover depicted in Fig. 2.2 in Sect. 2.3.
In this section we want to study the relationship between covers or repertoires
(see Definition 9.4) and descriptions.
Definition 9.3. Given a cover , a description d is called consistent with or in terms
of or a choice from , if d.x/ 2 for every x 2 . A choice from a cover is called
proper, if for every x 2 there is no element A of such that x 2 A d.x/.3
We denote by D.) the set of all proper choices4 from .
The word choice stems from the idea that a particular description d is always
a choice in the sense that for a given x 2 , d.x/ has to be chosen from those
propositions A 2 that contain x. Of course, it may also be possible that there is
no choice (or rather only one choice) for a repertoire .
Proposition 9.2. A cover admits only one description d , if and only if A\B D ;
for any A B 2 .
Proof. Obvious.
t
u
Such a cover is called disjoint or a partition (cf. Definition 2.6). Note that the
requirement that
Pp.A/ 0 for every A 2 implies that a disjoint cover has to be
countable and
p.A/ D 1.
A2
3
In line with our general strategy to disregard sets of probability 0, we can interpret A d.x/ as
p.A n d.x// D 0 and p.d.x/ n A/ > 0.
4
In the definition of D./, we understand a description d simply as a mapping d W ! with
! 2 d.!/, i.e., without the additional requirement of Definition 2.3.
111
For finite covers there are always proper choices. Unfortunately, this may no
longer be true for infinite covers as the following example shows.
Example 9.4. D f.a; b/W a < b 2 R D g.
t
u
Let me explain the idea behind the concept of a proper choice or description:
If you ask someone for a description of Mr. Miller, and in the course of this
description he says Mr. Miller has two children, then you usually assume not
only that Mr. Miller has two children, but also that Mr. Miller has no more than
two children. On the other hand, if Mr. Miller had three children, it would still be
correct to say that he has two children, since having three children implies having
two children.
In the repertoire of the person describing Mr. Miller, there are certainly propositions about Mr. Miller of the type Mr. Miller has n children. Among these
Mr. Miller has two children and Mr. Miller has three children are both correct
choices, if Mr. Miller has three children. But we assume that our informer will
choose the stricter of these two statements in this case. His proper choice of
statement should be such that there is no stricter statement available (in his
repertoire) about Mr. Miller that is true.
In the following we will be mostly concerned with proper choices. We might ask
whether there is always a proper choice for a repertoire .
We define .x/ WD fA 2 W x 2 Ag. An element M 2 .x/ is called minimal, if
there is no A 2 .x/ with A M . Thus a proper choice d picks for every x 2
a minimal element d.x/ 2 .x/. If .x/ is a finite set for every x 2 , as will be
usually the case, then it is clear that .x/ has minimal elements and that a proper
choice exists.
Definition 9.4. A cover is called finitary, if
i) p.A/ 0 for every A 2 and
ii) For almost every ! 2 and every A 2 containing ! there is a minimal
proposition B 2 with ! 2 B A. Minimality of B means that ! 2 C B
and C 2 implies C D B.
A finitary cover is also called a repertoire.
Example 9.5. Take D R and
1 D f.a ; a C /; a 2 Rg for > 0;
2 D f.a; 1/; a 2 Rg;
3 D fa; 1/; a 2 Rg;
4 D ff!gW ! 2 Rg:
Which of these covers are finitary?
t
u
Obviously finite covers with property (i) are finitary. This requirement is used to
ensure the existence of proper choices also from infinite covers.
112
tight
not tight
Proposition 9.3. Let be a finitary cover. For every ! 2 and every A 2 with
! 2 A there is a proper choice4 d 2 D./ such that d.!/ A.
Proof. We obtain a proper choice d 2 D./ by choosing for every ! 0 2 a
minimal B 0 2 with ! 0 2 B 0 . Now take ! 2 and A 3 !. Then there is a
minimal B 2 with ! 2 B A. We define d.!/ WD B.
t
u
For further reference we define two particular repertoires, a very fine one, called
(for large) and a very coarse one, called (for small).
Example 9.6. Let be finite or countable and p.!/ 0 for every ! 2 . On
.; ; p/ we define two covers or repertoires:
D ff!gW ! 2 g
and
D f n f!gW ! 2 g :
t
u
Definition 9.5. A repertoire is called tight if there is exactly one proper choice
from it. In this case, this choice is called d .
Example 9.7. The pictures in Fig. 9.1 illustrate tightness.
The repertoire in the right picture is not tight, because for the lower left corner,
both the horizontally hatched region or the diagonally hatched region can be (part
of) a proper choice. Thus there is more than one proper choice for this repertoire. In
the left picture the horizontally hatched region cannot be chosen because it contains
each of the diagonally hatched regions and will therefore not be a proper choice in
those regions.
t
u
Example 9.8. A bet on the truth of a single proposition A can also be described by
a repertoire, namely by D fA; g (cf. Example 9.3). This repertoire is tight and
the only choice d is
(
A for x 2 A and
d .x/ D
for x A.
For a random variable
X we defined the description X in Definition 2.16. The
range D R X of this description is a tight repertoire and d D X again.
e for
Any partition is obviously tight, and so is the range of the description X
any discrete random variable X on .
t
u
113
clean
114
c [
D./ D D./ implies c D c
D.c / D D./ D D.[ /
c[ D [ D [[ and [c D c D cc
t
u
D./ D D./
c D c
[ D [
c [
We need this assumption only for the first equation in (iii) and (iv).
115
116
ab
117
Proposition 9.11. For two tight covers and the product is tight and d D
d \ d .
Proof. Obviously for every A 2 with x 2 A, we have d .x/ \ d .x/ A. For
this reason there is only one proper choice for .
t
u
It will be seen later, in Part VI, that for tight repertoires and , their product
is essentially the smallest tight repertoire containing and .
The flattening of an arbitrary cover may not exist, because f may not be a cover. An example
for this is D R.X / for a random variable X with R.X/ D R. In this case, has no
maximal elements. If the flattening exists, it is clearly flat. Usually we consider finite covers which
guarantees that the flattening exists.
118
Fig. 9.4 The hierarchy for D fA; B; C; D; E; F; G; H g where D f1; 2; 3; 4; 5; 6g, and A D
f1; 2; 3; 4; 6g, B D f2; 4; 5; 6g, C D f1; 2; 3; 4g, D D f1; 6g, E D f4; 5; 6g, F D f3g, G D f4g,
H D f5g
Fig. 9.5 Example for a shallow repertoire (a) and for a chain (b)
Shallow or flat covers are obviously clean repertoires. Clearly a cover is shallow
if and only if every choice for it is proper. So for shallow covers the distinction
between choices and proper choices is unnecessary. fg is the only cover that is
shallow and narrow.
The flattening is a very bold operation on covers or repertoires, it can easily make
them uninteresting, i.e., very coarse. Indeed, every cover is equivalent to (Def. 9.9)
[ fg and . [ fg/f D fg.
Proposition 9.13. A cover is a partition8 if and only if it is shallow and tight.
Proof. Clearly every partition is shallow and tight. Let be shallow and tight. Take
A B 2 and assume that there is an x 2 A \ B. Then d .x/ is contained in A
119
clean
template
tight
partition
chain
shallow
all repertoires
{}
and in B, and thus cannot be shallow. This shows that is disjoint and therefore
a partition.
t
u
The picture in Fig. 9.6 illustrates the different classes of repertoires and how they
are related with each other.
120
9.6 Exercises
1) For the examples of repertoires in Examples 9.3, 9.7, and 9.9, what is c ,
which of them are tight, and what is the cardinality (number of elements) of
D./ in each case?
2) For which repertoires with n elements does D./ have the largest (resp. smallest) cardinality? Give an example for each case. What is the result if jj D n?
3) Prove the following:
Proposition 9.14. For a repertoire : [ is an algebra if and only if c is a
partition.
4) Let be a repertoire and jj D n. What is the maximal number of elements in
[ and in \ ?
5) When jj D n and is a clean repertoire, or a shallow repertoire, respectively,
what is the maximal number of elements in [ and in \ ?
6) Let be a repertoire and jj D n. Consider the unique proper choice d for \ .
What is the maximal number of elements in R.d /? How is R.e
d / related to the
algebra .) generated by ?
7) Given a description d , is there a difference between R.d\ / and R.d /\ ?
f
8) Given a description d , is there a difference between .e
d /\ and .d
\ /?
9) Give an illustrating example for each of the set theoretical differences and
intersections of sets in Fig. 9.6.
10) Show that c is clean for any repertoire.
11) Show that for finite (and p.!/ 0 8 ! 2 ) all covers are repertoires.
12) Give an example that this is not the case when is countable.
13) For finite with jj D n,
a) which are the largest and which are the smallest tight repertoires and what
are their cardinalities?
b) the same for templates,
c) the same for flat covers.
14) Show that (Example 9.6) is flat, but not tight (for jj > 2).
References
Chaitin, G. J. (1975). Randomness and mathematical proof. Scientific American, 232(5), 4752.
Chaitin, G. J. (1977). Algorithmic information theory. IBM Journal of Research and Development,
21, 350359.
Chaitin, G. J. (1987). Algorithmic information theory, volume I of Cambridge tracts in theoretical
computer science. Cambridge: Cambridge University Press.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information.
Problems in Information Transmission, 1, 47.
Kolmogorov, A. N. (1968). Logical basis for information theory and probability theory. IEEE
Transactions on Information Theory, 14, 662664.
References
121
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Palm, G. (1996). Information and surprise in brain theory. In G. Rusch, S. J. Schmidt, &
O. Breidbach (Eds.), Innere ReprasentationenNeue Konzepte der Hirnforschung, DELFIN
Jahrbuch (stw-reihe edition) (pp. 153173). Frankfurt: Suhrkamp.
Palm, G. (2007). Information theory for the brain. In V. Braitenberg, & F. Radermacher (Eds.),
Interdisciplinary approaches to a new understanding of cognition and consciousness: vol.
20 (pp. 215244). Wissensverarbeitung und Gesellschaft: die Publikationsreihe des FAW/n
Ulm, Ulm.
Chapter 10
This chapter finally contains the definition of novelty, information and surprise for
arbitrary covers and in particular for repertoires and some methods for their practical
calculation. We give the broadest possible definitions of these terms for arbitrary
covers, because we use it occasionally in Part VI. Practically it would be sufficient
to define everything just for repertoires. It turns out that the theories of novelty and
of information on repertoires are both proper extensions of classical information
theory (where complementary theorems hold), which coincide with each other and
with classical information theory, when the repertoires are partitions.
123
124
have reduced our problem to the familiar problem of determining an optimal code
for the random variable X1 . It is easy to calculate the probabilities that X1 D i and
to work out I.X1 / 2:61 (cf. Exercise 10.7)).
Next I want to give some examples illustrating the definition of surprise S./.
Someone, Mr. Miller, wants to prove the efficiency of a certain medication by a
statistical test. To this end he has acquired 2,000 data points: each comparing the
result with and without medication on the same patient. A textbook on statistical
testing suggests a certain test T and mentions that this test should be carried out
with at least 10, better 20 data points. Of course, it is also recommended to use as
many data points as possible.
Now Mr. Miller has the idea to divide his 2,000 data points into 100 batches of
20 and perform 100 tests with the idea to report the best result.
He finds out that 6 of his tests lead to significant results and two are even highly
significant. Instead of simply mentioning one of those highly significant results,
he now starts to wonder, that two significant results should somehow be more
significant than one significant result. A friend tells him that this is obviously the
case because the 100 tests were performed on different persons and therefore can
be assumed to be independent. Thus the significance probabilities can simply be
multiplied. But what about the nonsignificant results? Well, says his friend, even
if they have rather high-significance probabilities pi , these are certainly smaller
than 1 and so they can safely be multiplied also, they will just make the resulting
significance probability smaller.
What do you think about this argumentation?
The 100 statistical tests can be described by 100 random variables Ti , and we
can indeed assume them to be independent. The significance probability pi of Ti
is p.Ti /, which is also described by the cover i D fTi xW x > 0g. The first
100
S
idea, to report the best result, corresponds to the repertoire D
i , whereas the
i D1
125
For arbitrary, possibly uncountable repertoires the max may be a sup (i.e., not attained). It may
also be that the expectation does not exist or is infinite for some d 2 D./; in this case, the max is
defined as 1. For finite repertoires we will see that the max exists and is finite.
2
For arbitrary, possible uncountable covers the min may be a inf (i.e., not attained). It may also be
that the expectation does not exist or is infinite for all d 2 D./; in this case, the min is defined
as 1. For finite repertoires we will see that the min exists and is finite.
3
If there is no d 2 D./ with N .d / D N ./, this definition is not reasonable. In this case,
it should be replaced by b
I ./ D
lim minfI .d /W d 2 D./; N .d / an g and b
S ./ D
an !N ./
lim
an !N ./
maxfS .d /W d 2 D./; N .d / an g.
126
These definitions are formulated in the most general way for arbitrary repertoires.
Usually they will be applied to finite repertoires. In more general cases some
technical details have to be observed as indicated in the footnotes. Clearly N ./
is called the novelty of , both I./ and b
I./ may be called the information of ,
and both S./ and b
S./ may be called the surprise of .
Proposition 10.1. For any repertoire the following inequalities hold:
1
S./ b
S./ N ./ b
I./ I./:
ln 2
Proof. (i) For any d 2 D./, S.d /
(ii)
(iii)
(iv)
(v)
1
ln 2
1
ln 2 .
S./ b
S./ is obvious.
b
I./ I./ is obvious.
N ./ b
I./, because I.d / N .d / for any d .
b
S./ N ./. Assume S.d / D b
S./ which implies N .d / D N ./.
Let d.!/ D A then, p.A/ D minfp.B/W ! 2 B 2 g.
Now AE D p.d / p.A/ D f! 0 W p.d.! 0 // p.A/g A and we are done.
E
t
u
Indeed ! 0 2 A ) p.d.! 0 // p.A/ ) ! 0 2 A.
127
1
6
4
2
log2 C log2 3 D log2 3 ;
6
4
3
3
1
1
1
1
2
I./ D log2 3 C log2 6 C D log2 3 C ;
3
6
2
2
3
b
I./ D 1;
I./ D
b
I./ D log2 3;
1 b
D S./;
2
1
3
1
S./ D log2 3 C log2 D b
S./;
3
3
2
S./ D
t
u
1
6
5
log2 C log2 6;
6
5
6
1
2
b
I. / D log2 3 C log2 6;
3
3
I. / D
S. / D
1
log2 3;
3
b
S. / D 0:
t
u
b
S./ D S.N /:
128
Proposition2:10
N .N / D S.N /.
t
u
consider D fa; a C 1W a 0g and D fa; 1/W a 0g. D./ is very large,
D./ is very small: it contains just one description bW x 7! x; 1/.
Here N ./ D N .d / for d W x 7! x; x C 1, yielding N ./ D log2 .e/
log2 .1 e 1 / after some computation (see the next example), S./ D b
S./ D
S.d / D S./ D ln12 . On the other hand, I./ D I.c/ for cW x 7! bxc; bxc C 1,
which is again computed in the next example, but b
I./ D I.d / D 1. N ./ D
N .b/ for bW x 7! x; 1/, S./ D b
S./ D S.b/ D N .b/, and I./ D b
I./ D
I.b/ D 1.
Next we calculate
Z1
Z1
Z1 Z1
1
1
N .b/ D e x . log e x /dx D
x e x dx D
1yx e x dydx
ln 2
ln 2
0
1
ln 2
Z1
e y dy D
1
1:44269;
ln 2
Z1
N .d / D
x
. log.e
x
e
x1
1
//dx D
ln 2
Z1
e x .x ln.1 e 1 //dx
0
1
1 ln.1 e /
2:10442;
ln 2
1
X
N .c/ D
.e j e j 1 /. log.e j e j 1 //
D
j D0
D .1 e 1 /
1
1 X j
e .j ln.1 e 1 //
ln 2 j D0
e 1
1 e 1
ln.1 e 1 /
ln 2
.1 e 1 /2
1 e 1
1
e
1
1
D
ln.1
e
/
1:50134.
ln 2
1 e 1
D
t
u
Example 10.4. Let us now consider Example 9.5 and evaluate N , I and the other
quantities for the exponential distribution. For 1 and x 2 RC we obtain
129
Dd >0
x
C d:
D
ln 2
So N .1 / D
R1 x
. ln 2 C d /e x dx D
0
1
ln 2
C d.
1
X
e 2i .1 e 2 /.
i D0
D .1 e 2 /
X
1
i D0
2i
C d/
ln 2
1
2i 2i X 2i
e
C
de
ln 2
i D0
e 2
2
C d:
ln 2 1 e 2
For ! 0, I becomes ln12 Cd which goes to infinity like log2 . For increasing
the information I decreases monotonically towards 0.
2 is not a repertoire.
For the exponential distribution 3 is the same as from Example 10.3.
For 3 and x 2 RC we obtain
N3 .x/ D maxfN .A/W x 2 A 2 3 g D log.e x / D
So N .3 / D
R1
0
x x
ln 2 e
dx D
1
ln 2 .
x
:
ln 2
130
4 is not a repertoire, but it would be quite obvious that N4 .x/ D 1 for any
x 2 R and we obtain N .4 / D b
I.4 / D I.4 / D 1.
t
u
Now we proceed to prove some of the statements made in the remarks following
Proposition 10.1.
Definition 10.2. Let be a cover. A proposition A 2 is called small in , if for
almost every x 2 A and every B 2 with x 2 B we have p.B/ p.A/.
Proposition 10.3. For a repertoire the following statements are equivalent:
i) N ./ D b
I./ < 1.
ii) contains a small partition, i.e., a partition of small propositions.
Proof.
0
(ii) ) (i) : Let 0 be a partition that is small
P in . For 2 A 2 define d.x/ D A.
Then N ./ D N .d / D
p.A/ log2 p.a/ D I.d / D b
I./.
A2 0
A2
A3
131
N ./ D 0:7 log2
10
6
b
S./ D 0:7 log2
10
7
D 0:36020;
The definition of coincides with the more general Definition 9.9 (i.e., [ D
[ ). Also the -relation as defined here can be used for arbitrary covers.
Proposition 10.5.
i) The relation is an equivalence relation on repertoires.
ii) For , we have N ./ D N ./, I./ D I./, b
I./ D b
I./, S./ D
b
b
S./ and S./ D S./.
iii) We have if and only if and .
iv) For we have N ./ N ./.
Proof. (i), (ii) are obvious.
(iii) ): By Proposition 9.4 D./ D D./ implies c D c which implies
[ D c[ D c[ D [ :5
(: [ implies [ [[ D [ and vice versa. So [ D [ : Again
by Proposition 9.4 D./ D D.[ / D D.[ / D D./.
(iv) Take c 2 D./. For ! 2 ; c.!/ 2 [ , so there is a minimal B 2 with
! 2 B c.!/. Now we define d.!/ D B and obtain d 2 D./ with d c
and therefore N .d / N .c/.
t
u
In particular, c and [ have the same information, novelty, and surprise as
a repertoire . The equivalence classes of have already been determined in
Proposition 9.3.
The next proposition shows an interesting property of N .
Proposition 10.6. Let and be two repertoires.
i) N[ .!/ D max.N .!/; N .!//
5
8! 2
132
ii) N ./; N ./ N . [ / N ./ C N ./
t
u
and I. [ / D I./ D 1;
whereas
I.. [ / . [ // D I. / D I. / C I./ > I./ C I./
and even
I.. [ / [ . [ // D I./ > I./ C I./
and thus I is not subadditive (i.e., in general I. [ / and also I. / can be
greater than I./ C I./).
u
t
Of course, we should be able to obtain monotonicity and subadditivity also for I
for reasonably large subclasses of repertoires. This is indeed the case for tight and
for flat repertoires as we will see in Chap. 17.
In order to focus and simplify this discussion of information properties on various
sets of covers, we formally introduce symbols for the most important sets. They will
be the subject of Part VI.
133
Next we ask the question for which repertoires we can achieve maximal and
minimal values of S, b
S, I, b
I, and N . To this end we consider a finite or countable
space .; ; p/ with p.!/ 0 for every ! 2 . It is quite obvious that the minimal
value for all these quantities S D b
S DN Db
I D I D 0 is achieved for D fg.
On the other hand, the maximal value for N D b
I D I is achieved for the partition
introduced in Example 9.6. For jj D n it is at most log n (for p.!/ D n1 for
every ! 2 ) and for infinite it can be infinite (Proposition 2.17). The maximal
value of S and b
S is rather small, which is the subject of Exercises 10.12). The final
problem we have to tackle in this chapter is how to compute I and b
I in concrete
cases. (Computing N , b
S, and S is in fact quite simple as we saw above.) The next
section is devoted to this problem.
for k i; j and
pi0 WD q
and
pj0
WD pi C pj q:
q pi
2 .0; 1/. Then
pj pi
r pi C .1 r/pj D q C pj C pi D pj0
and
r pj C .1 r/pi D q D pi0 :
134
Thus
h.pi0 / C h.pj0 / > r h.pj / C .1 r/h.pi / C .1 r/h.pj / C r h.pi /
D h.pj / C h.pi /:
t
u
This lemma provides the essential justification for the following idea: For any
d 2 D./ the propositions in R.e
d / that are used for e
d will be contained in elements
of . The idea now is that we can restrict our search to partitions built from
propositions B that are formed by means of complements and intersections from .
The following definition introduces descriptions defined by orderings within
repertoires which are built using set differences.
Definition 10.5. Let be a repertoire with n elements.6 A one-to-one mapping
aW f1; : : : ; ng ! is called an ordering of . Given an ordering a of , the
description da for by the ordering a is defined as
da .!/ D a.1/ for every ! 2 a.1/;
da .!/ D a.2/ for every remaining ! 2 a.2/; i.e., for ! 2 a.2/ n a.1/;
da .!/ D a.3/ for every remaining ! 2 a.3/; i.e., for ! 2 a.3/ n .a.1/ [ a.2//;
da .!/ D a.n/ for ! 2 a.n/ n .a.1/ [ : : : [ a.n 1//:
Any description d such that d D da for an ordering a of , is called a description
by ordering of . Note that d .!/ D a.k/ for k D minfi W ! 2 a.i /g.
Proposition 10.7. For any finite repertoire7 the minimum
minfI.d /W d description in g
is obtained at a description da by ordering of .
d / is a partition,
Proof. Let d be any description for , R.e
d / D fD1 ; : : : ; Dn g. R.e
n
P
h.p.Di //, and we may in addition assume that p.D1 / : : : p.Dn /.
I.d / D
i D1
Define A1 WD d.D1 /.
Now consider the description d 1 defined by
(
d 1 .!/ D
A1
for ! 2 A1 ;
d.!/
otherwise:
6
7
135
I.d 1 / D h.p.D1 / C
n
X
p.A1 \ Di // C
i D2
n
X
i D2
d .!/ D
A2
1
d .!/
for ! 2 A2 n A1 ;
otherwise:
Again Lemma 10.1 shows that I.d 2 / I.d 1 /. So we proceed until we obtain d n
with I.d n / : : : I.d 1 / I.d /; d n is a description by ordering of .
t
u
Note that in Proposition 10.7 we have not yet considered proper descriptions. The
condition of properness further constrains the order in which we subtract subsets.
This is the subject of the next two propositions.
S
Definition 10.6. For a repertoire we define A4 WD A n fB 2 W B Ag for
any A 2 and 4 WD fA4 W A 2 g n f;g called the difference repertoire of .
4 is a cover with the interesting property that for ! 2 and A 2 , we have
! 2 A4 if and only if A is minimal in ! .
The idea in the definition of 4 is the same that led to the definition of the
completion e
d of a description d . When we know that a person could also take the
description B A instead of A which would be more exact, we can assume that !
is not in B, when he describes ! just by A. So actually we can infer that ! 2 A4
when it is described by A. However, this kind of completion is only partial, because
in general 44 4 , i.e., 4 is not yet flat. The following is an example for this
(cf. Exercise 1).
Example 10.7. Let D f1; : : : ; 6g and
D ff1g; f1; 2g; f1; 3g; f3; 4g; f2; 4; 6g; f2; 3; 5; 6gg.
Then 4 D ff1g; f2g; f3g; f3; 4g; f2; 4; 6g; f2; 3; 5; 6gg;
44 D ff1g; f2g; f3g; f4g; f4; 6g; f5; 6gg;
444 D ff1g; f2g; f3g; f4g; f6g; f5; 6gg;
4444 D ff1g; f2g; f3g; f4g; f5g; f6gg:
t
u
136
t
u
b
I./ D minfI.d /W d description in 4 and N .d / D N ./g:
8
This proposition actually holds for arbitrary repertoires in the same way as Proposition 10.7, if
there is a description d 2 D./ which satisfies the additional condition in Definition 2.3.
137
0
4
0
then d.y/ D Ak D d.!/ ) k D minfi W ! 2 A4
i g ) d .y/ D Ak D d .!/.
Thus e
d D de0 and I.d / D I.d 0 /.
t
u
Example 10.8. We take the throwing of two dice X1 and X2 as our basic experiment,
i.e., .; ; p/ D D 2 . We consider a simple example of a repertoire . Take
A1 D f.5; 5/; .6; 6/g;
A2 D X1 D X2 ;
A4 D X1 D 1 [ X2 D 1;
A5 D ;
and
A3 D X1 5;
D fA1 ; A2 ; A3 ; A4 ; A5 g:
fA1 ; A3 ; A2 ; A4 ; A5 g leads to the partition fA01 ; A03 ; A02 ; A04 n .A02 [ A03 /; A05 g D 1
fA2 ; A3 ; A4 ; A2 ; A5 g leads to the partition
For each of these partitions we can calculate the information. The best of these six
possibilities is 4 with I.4 / 1; 7196.
The following is a slightly more complicated example:
Take
A1 D X1 ; X2 2 f2; 4; 6g; A2 D f.1; 1/; .6; 6/g; A3 D X1 C X2 2 f3; 5; 7; 9; 11g;
A4 D X1 C X2 D 4; A5 D X1 D X2 ; A6 D X1 D 5 [ X2 D 5; and
A7 D X1 C X2 D 5:
t
u
In many cases one obtains the minimum information description in by the
following recipe:
1. Consider the largest element in 0 WD 4 , i.e.,
A1 WD arg maxfp.A/W A 2 0 g:
2. Define 1 WD fA n A1 W A 2 0 g and repeat, i.e.,
A2 WD arg maxfp.A/W A 2 1 g and 2 WD fA n A2 W A 2 1 g:
138
The following example shows that this procedure does not always work.
Example 10.9. Let .; ; p/ D E34 and D fA1 ; : : : ; A6 g with
A1 D f1; : : : ; 18g;
A2 D f19; : : : ; 34g;
A3 D f1; : : : ; 8; 31; 32; 33; 34g;
A4 D f9; : : : ; 16; 27; 28; 29; 30g;
A5 D f17; 23; 24; 25; 26g;
A6 D f18; 19; 20; 21; 22g:
Beginning with A1 we obtain a description d1 with d1 .!/ D A1 for ! 2 A1 and
so on. This is the description we obtain with the rule of thumb. We get I1 D
I.d1 / D log 34 18
log 18 16
log 4 D 1:93868. However, beginning with A2
34
34
we obtain a description d2 with d2 .!/ D A2 for ! 2 A2 and so on, leading to
I2 D I.d2 / D log 34 18
log 18 16
log 8 D 1:46809.
t
u
34
34
10.5 Exercises
1) Calculate I, b
I, N , b
S, and S for the repertoires in Examples 9.3, 9.7, 9.9,
10.7, 10.8, and 10.9.
2) For the repertoires in Examples 9.7, 9.9, 10.7, 10.8, and 10.9, which are not
tight, maximize N .d / I.d / and E.N d=N e
d / on D./.
3) What is the surprise of a chain D fA1 ; : : : ; An g, what is the information?
What is the maximal surprise of a chain of n elements, what is the maximal
information?
4) What is the surprise of a continuous chain, i.e., a repertoire X for a continuous
random variable X ? Compare this result with Exercise 3).
References
139
References
Palm, G. (1975). Entropie und Generatoren in dynamischen Verbanden. PhD Thesis, Tubingen.
Palm, G. (1976a). A common generalization of topological and measure-theoretic entropy.
Asterisque, 40, 159165.
Palm, G. (1976b). Entropie und Erzeuer in dynamischen Verbanden. Z. Wahrscheinlichkeitstheorie
verw. Geb., 36, 2745.
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Chapter 11
141
142
repetition of experiments), which has a probability close to 0 for the old theory
and close to 1 for the new theory. Another common version of this procedure is a
statistical test: In this case, the proponent of the new theory bets not just on one
proposition A, but on a chain of propositions A1 A2 A3 : : : An , each of
which is much more likely for the new theory than for the old one. If ! 2 Ak n Ak1
happens, he reports pold .Ak / as the significance of his result.1
Now we want to assume that both, the old probability q and the new
probability p are known, and want to construct a set A for which the novelty Npq ./
is maximal (with D fA; g). Then we consider the same problem for chains
D fA1 ; : : : ; An ; g. This is done in Sect. 11.3.
Positivity: I.j/ 0,
Identity: I.j/ D 0,
Two-way monotonicity: ) I.j / I.j / and I. j/ I. j/,
Additive symmetry: I./ C I.j/ D I./ C I.j/.
Note that requirement 4 is necessary for additivity2 :
I./ C I.j/ D I. _ / D I./ C I.j/:
143
min I.ajb/
b2D./ a2D./
and
N .j/ WD max
min N .ajb/:
a2D./ b2D./
When we consider the sets R and F of repertoires and flat covers, respectively,
we observe that identity is easily fulfilled, monotonicity is half-fulfilled. It holds
in the first argument, but not in the second, if we use the proper ordering, namely
4 (this will be defined and analyzed in Chap. 16) for I and for N . Additive
symmetry is not fulfilled. This is also shown in the next example.
Example 11.1. In general we dont get I.j/ D 0 with the minmax construction:
Consider .; ; p/ D D 2 , X1 D first coordinate, X2 D second coordinate, and
e1 [ X
e 2 . For any .i; j / 2 there are exactly two choices from , namely
DX
X1 D i and X2 D j .
For any b 2 d./ and any ! 2 , we take ab .!/ b.!/. Thus e
ab .!/ \ e
b.!/
ab .!/ \ b.!/ D f!g. Then
ab je
b/ D N .e
ab \e
b/N .e
b/ D log 36N.e
b/ log 36log 12 D log 3
I.ab jb/ D N .e
and
N .ab jb/ D N .ab \ b/ N .b/ D log 36 log 6 D log 6:
Thus
min max I.ajb/ min I.ab jb/ log 3 > 0
b2D./ a2D./
b2D./
and similarly for N . Thus the minmax definition of I or N would not satisfy
identity.
To construct a counterexample against additive symmetry, we have to remember:
I./ C I.j/ D min I.b/ C max min I.ajb/;
b
144
e 1 and D X
e1 [ X
e 2 . Then from Definition 11.1 we find
Now let D X
I./ C I.j/ D log 6 C log 6;
N ./ C N .j/ D log 6;
t
u
Still the two definitions of conditional information and novelty may be quite
interesting when they are interpreted in a game-theoretical way:
Player A chooses a proper choice from and player B a proper choice from ,
A with the goal to minimize information and to maximize novelty and B the opposite goal. In this book I do not want to explore this game-theoretical interpretation
further.
When we are considering the set T of templates, there is only one proper choice
from and from . So we can forget about the minima and maxima and expect a
reasonable theory (in particular, looking back at Sect. 3.3).
Proposition 11.1. On T we have
i) I.j/ D 0 and N .j/ D 0,
ii) implies I.j / I.j /; I. j/ I. j/; N .j / N .j / and
iii) I./ C I.j/ D I. / and N ./ C N .j/ D N . /:
for any tight covers , , and . In particular, the same holds for partitions , ,
and .
Proof. Let a be the only proper choice from , and b the same for , and c for .
(i) Obviously I.aja/ D N .aja/ D 0.
a and
(ii) N .a/ C N .bja/ D N .a \ b/ by Proposition 3.4 and the same holds for e
e
b and therefore for I (Proposition 3.6).
(iii) implies a b and therefore p a.!/jc.!/ p b.!/jc.!/ for every
! 2 , so N .ajc/ N .bjc/. Similarly, the assertions on I have been shown
for tight descriptions (Propositions 3.5 and 2.6).
t
u
With this proposition we have extended all essential properties of classical
information to the information I on T, and almost all properties (except monotonicity in the second argument) to the novelty N on T. Indeed, we cannot get this
monotonicity for N , and equivalently we cannot get subadditivity, as is shown in
the following example.
Example 11.2. N is not subadditive on T. Let .; ; p/ D E16 , and
D fi; : : : ; 16gW i 2 ;
D f1; : : : ; i gW i 2 ;
D fi; : : : ; j gW i j 2 fi gW i 2 :
So N ./ D N ./
N ./ C N ./.
1
ln 2
145
This discussion shows that it is not at all straightforward how to define mutual
information for arbitrary covers, as long as we do not have additive symmetry,
because it may happen that the expressions I./ C I./ I. _ /, I./ I.j/,
and I./ I.j/ all give different results. For templates we can safely define
mutual information, because of Proposition 11.1.
We now define the mutual information or transinformation T .; / and the
mutual novelty M.; / for templates.
Definition 11.2. For any two templates and we define
T .; / WD I./CI./I. /
and
and I D Id ; I D Id ; I D Id
the corresponding random variables (defined in Definitions 2.5 and 2.13). Then
!
p e
d
i) T .; / D E I C I I D E log
0 and
p e
d p e
d
p d
.
ii) M.; / D E N C N N D E log
p d p d
Proof. Everything is obvious from the definitions, the positivity in (i) was shown in
Proposition 3.9.(ii).
t
u
We observe that it is possible that M.; / < 0 (see Example 11.2).
Proposition 11.3. Let ; be templates. Then
i)
ii)
iii)
iv)
v)
vi)
vii)
viii)
146
147
d /:
IGpq ./ WD Gpq .e
It is obvious that this definition of novelty gain coincides with Definition 11.3.
Proposition 11.4. Let p and q be probabilities on . If is a partition, then
Gpq ./ D IGpq ./:
Proof. There is only one choice d 2 D./ and d D e
d by Proposition 2.2.
t
u
This proposition shows that IG could as well be defined as the novelty gain of
partitions. By analogy we can now define the surprise loss SL as the novelty gain
of chains. This is also the idea explained in the introductory example.
We use the name surprise loss because the semantics of information and
surprise appears to be opposite: when you gain information you loose surprise. With
this idea it is possible to define G, SL, and IG even for arbitrary infinite covers.
Definition 11.5. For two probabilities p and q on and for an arbitrary cover ,
we define
SLpq ./ WD sup Gpq .d /W [ ; finite chain and
IGpq ./ WD sup Gpq .d /W ./; finite partition :
Definition 11.5 defines information gain and surprise loss as the solutions of
two optimization problems: namely to find a partition with maximal information
gain and to find a chain with maximal surprise loss. The second problem is the one
we posed in the introductory example. Proposition 11.7 gives the solution to these
problems.
148
Proposition 11.5. Definition 11.5 for IGpq ./ is consistent with Definition 11.4 for
templates.
Proof. Let be a template and d the corresponding description. Let ./
be a finite partition and d the corresponding description. We consider one ! 2 .
d .!/ D B n A, where both A and B are unions of elements of \ .
Since is a template, for any ! 0 2 A we have d A, so d .! 0 / d .!/
because ! A. On the other hand, d .!/ B and thus e
d .!/ B n A D d .!/.
Thus IG.d / IG.d / by Proposition 3.2.
t
u
Proposition 11.6. For any cover , we have
SLpp ./ D 0 D IGpp ./:
t
u
Proof. Obvious.
p.d .x//
:
q.d .x//
This shows (i) in the finite case. For the infinite case, we need an approximation
argument.
In (i) and (ii) we have to show that the supremum in Definition 11.6 is actually
attained at the given formula. In (iii) because of Proposition 11.8 we only have to
show the first inequality. We first observe that for any proposition B 2 , we can
Eq .1B f /
p.B/
interpret
D
DW fNB as the average of f on B with respect to
q.B/
q.B/
149
the probability q. More formally, we can define the conditional probability qB (as
p.B/
in Chap. 3 and observe that
D EqB .f /.
q.B/
(i) By definition or construction of the expectation E, it is sufficient to take any
finite partition D fA1 ; : : : ; An g and show that Gpq ./ Ep .log2 f / since
Ep .log2 f / can be approximated by sufficiently large partitions. Now
Gpq ./ D
n
X
p.Ai / log2
i D1
p.Ai /
:
q.Ai /
We use the trick of Proposition 11.8 again and take the natural logarithms to
show
n
X
p.Ai /
Ep .ln f /:
p.Ai / ln
q.Ai /
i D1
Indeed
n
X
X
p.Ai /
q.Ai /EAi .f / ln EAi .f /
D
q.Ai /
i D1
n
p.Ai / ln
i D1
and
Ep .ln f / D Eq .f ln f / D
n
X
q.Ai /EAi .f ln f /:
i D1
n
X
i D1
X
p.Bi /
D
p.Ai / log2 EBi .f /:
q.Bi /
i D1
n
p.Ai / log2
150
In order to maximize this, the sets Ai should be chosen and ordered in such
a way that EBi .f / becomes as large as possible. Since Bi C1 includes Bi
and Bn D (implying EBn .f / D 1) the best choice is to have the largest
values of f on A1 and generally f should be larger on Ai than on Ai C1 . This
implies EAi .f / EAi C1 .f / and also EBi .f / EBi C1 .f /. Incidentally, it
also implies that EBi .f / EAi .f / since Bi contains on average larger values
of f than in Ai . This already shows (iii).
Furthermore, if we can split a set Ai into two sets A0i and A00i with EA00i .f /
EA00i .f /, the surprise gain will increase, because it can only become larger on
A0i and remains the same on A00i .
(iii) We show that for every x 2 in fact
F .x/ D p.f .x//=q.f .x// D Eq .f jf f .x// f .x/:
Indeed,
Eq .f jf f .x// D Eq .f 1f f .x/ /=qf f .x/
D Ep .1f f .x/ /=qf f .x/
D p.f .x//=q.f .x//:
t
u
t
u
Example 11.4. Take D 0; 1 with the usual Lebesgue measure q. Let p be given
by a density f with respect to q. Define
WD fx; x C W 0 x 1 g [
0; 2 ; 1 2 ; 1
t
u
151
Definition 11.6. For any two probability distributions p and q on .; /, we define
i) IG.p; q/ WD IGpq ./ and
ii) SL.p; q/ WD SLpq ./
These definitions use the concepts of information gain and surprise loss to define
measures for the distance between two probability distributions. Information gain
IG is the same as the KullbackLeibler distance, whereas SL is a new distance
measure.
Proposition 11.9. Let .; / be a measurable space and p; q be two probabilities
on .; /, and let p have a density f with respect to q. Then
i) IG.p; q/ D IGpq ./ D Ep .log2 f /,
ii) SL.p; q/ D Gpq .f / D Ep .log2 p f log2 q f / and
iii) SL.p; q/ IG.p; q/ 0:
Proof. This follows directly from Proposition 11.7. Observe that
log2 p f log2 q f D log2 F
for the function F defined in Proposition 11.7.
t
u
3x 2 log2 .3x 2 / dx
0
Z1
SL.p; q/ D
3x 2 log2 .pf x/ log2 .qf x/ dx; so
Z1
ln 2 IG D
Z1
2
3x 2 ln 3 dx
3x 2 ln x dx C
0
1
2
D 6 2 C ln 3 D ln 3 0:432
3
3
and IG 0:623I
Z1
ln 2 SL D
0
3x 2 ln.1 x 3 / ln.1 x/ dx
152
yD1x 3
Z1
Z1
3.1 y/2 ln y dy
ln y dy
0
D 1 C 3 6
1
1
1
1
C 3 D C D 0:833
4
9
2
3
and SL 1:202:
For comparison we can also compute IG.q; p/ and SL.q; p/:
Z1
ln 2 IG D
ln.3x 2 / dx D 2 ln 3 0:9014
0
Z1
ln 2 SL D
ln.qf x/ ln.pf x/ dx
Z1
D
0
ln x ln x
Z1
dx D 2
ln x dx D 2
0
It is plausible that these two values are larger than the other two computed before,
because the density of q with respect to p has much larger values than the density
f .x/ D 3x 2 of p with respect to q.
Similar effects can also be seen by maximizing the surprise loss SLpq ./ for a
simple bet D fA; g (see also Exercise 4).
t
u
The transinformation T .X; Y / between two random variables X and Y , which
was introduced in Chap. 4, can obviously also be regarded as the information gain
between the common distribution of X and Y (i.e., the distribution of .X; Y /) and
the product of the distributions of X and of Y . Thus Proposition 3.9.(ii) which
shows the positivity of the transinformation, can be regarded as a special case of
Proposition 3.1 or 11.8. We can use this observation to extend the definition of
transinformation from discrete to continuous random variables. This will be carried
out in the next section.
153
Shannon 1948), in spite of the fact that by any reasonable definition information
will be infinite for such variables because events can occur with arbitrarily small
probabilities.
The idea is to use only conditional information as defined in Definition 11.1
and
random variable X by the cover (-algebra) .X / D
to describe a continuous
e as in Chap. 3. With
fX aW a 2 Rg instead of simply using the description X
this idea we can reproduce all basic theorems, i.e., Proposition 3.9.
Unfortunately we cannot directly define I.X / by I..X //, because usually
.X / is not a repertoire. So we have to use the method of definition that we
used for information gain (Definition 11.5) and work with finite partitions or finite
subalgebras of .X /.
Definition 11.7. Let X; Y be arbitrary random variables. We define
I.X / WD supfI./ W partition; .X /g;
I.X jY / WD sup .X / inf .Y / I.j/;
where both sup and inf are extended over all partitions.
Clearly for discrete random variables X and Y these definitions coincide with
Definition 3.4.
There is a connection to another common definition of conditional information
that is worth mentioning.
For A 2 , p.A/ 0 and a random variable X we can define the random
variable5 p.AjX /. Based on this, for a partition we define the random variable
P
I.jX / WD A2 p.AjX / log2 p.AjX / and
I.jX / WD E.I.jX //.
Proposition 11.10. For two random variables X and Y we have I.Y jX / D
supfI.jX / W finite partition .Y /g.
Now we can show a useful continuous analog to Proposition 3.9.
Proposition 11.11. Let U , X , Y , and Z be arbitrary random variables.
i) I..X; Y /jU / I.X jU / C I.Y jU /;
ii) X 4 Y implies I.X jU / I.Y jU / and I.U jX / I.U jY /;
iii) I..X; Y /jU / D I.X jU / C I.Y j.X; U //:
Proof. These statements follow directly from their discrete analogs in Proposition 3.9.
t
u
We can use the idea of Definition 11.5 also for an extension of our previous
Definition 11.2 of transinformation.
5
p.AjX/ is a random variable that depends on the value of X, i.e., p.AjX/ D f .X/, where
f .x/ D p.AjX D x/ which can be properly defined for almost every x 2 R.X/.
154
pX D x; Y D y
pX D x pY D y
and
e \Y
e/
p.X
T .X; Y / D I.X / C I.Y / I.X; Y / D E log2
D E.log2 f /:
e / p.Y
e/
p.X
(ii) In the general case we simply have to observe that our definitions fit together:
T .X; Y / is defined in Definitions 11.9 and 11.8, the information gain is defined
in Definitions 11.6 and 11.5. Both definitions reduce the calculation to finite
partitions of R2 , where we have shown the equality in (i). In order to see that
the two resulting suprema are the same, we need an approximation argument
and Proposition 3.2.
t
u
155
(see Pearlmutter and Hinton 1987; Linsker 1989b; Atick 1992; Deco and Obradovic
1996; Hinton and Ghahramani 1997; Dayan and Abbott 2001; Kamimura 2002;
Erdogmus et al. 2003; Ozertem et al. 2006; Coulter et al. 2009; Brown 2009).
In general, expressions of information gain or conditional information are used
quite often for optimization in pattern recognition (Amari 1967; Battiti 1994; Amari
et al. 1996; Amari and Nagaoka 2000; Principe et al. 2000; Torkkola and Campbell
2000; Hyvarinen 2002; Mongillo and Den`eve 2008).
Actually, one can distinguish three different but related lines of argumentation
that converge on the use of information theory for a better understanding of learning
algorithms:
1. The statistical or Bayesian approach (e.g., MacKay 2005): Here the idea is
essentially that what is learned is a common distribution p.x; y/ of the two
variables (or sets of variables) X and Y , often called data and labels, whose
relationship has to be learned. Here it is common to use the KL distance to
measure the distance between the currently learned distribution and the true
distribution p.
2. The statistical physics approach is actually very similar, but arises from the
tradition in statistical physics to use the principle of maximal ignorance
(Jaynes 1957, 1982), which then leads to approaches that maximize entropy
(i.e., information) or transinformation.
3. Approaches that try to understand and mimic biological learning processes. Here
the idea often is that biological learning has the goal to optimize the neural
representation of the learning situation, i.e., of the values of the variables X
and Y , now interpreted as stimulus or stimulus situation and response of
the animal. Very often this leads again to maximization of the transinformation
between the neural representation Z and the variable X or Y or both (e.g., Barlow
1989; Atick 1992; Linsker 1989b, 1992, 1997; Zemel and Hinton 1995).
For these purposes we do not need to consider repertoires. However, this chapter
also introduces a new novelty-based measure that could be used for learning or
optimization: the surprise loss which could be used in a similar way as information
gain. For these applications the most important results are Propositions 11.7 and
11.9. In this book we do not try to elaborate these possibilities further.
156
A partial question may be how much information the data X provide about the labels
Y , i.e., the transinformation or the information gain between the joint distribution
of X and Y and the product of their individual distributions. The learning process,
i.e., the process of approximating the joint distribution by a sequence of distributions
that are calculated from the data, can often be well described again in terms of
the information gain between these distributions (Amari 1982, 1985; Amari and
Nagaoka 2000; MacKay 2005). Practically all recent applications of information
theory to such data-driven learning problems in the fields of pattern recognition,
machine learning, or data mining are of this type. They are based on the use of
information gain or KullbackLeibler distance to measure the distance between
probabilities.
In the life sciences there are two particular fields of application where arguments
based on information terminology have a particular additional appeal: the neurosciences (which are the subject of the next section) and molecular and cell biology
where the obvious link is the information contained in the DNA (Herzel et al. 1994;
Schmitt and Herzel 1997; Grosse et al. 2000; Weiss et al. 2000; Slonim et al. 2005;
Taylor et al. 2007; Tkacik and Bialek 2007; Mac Donaill 2009).
11.7 Exercises
1) Given two probabilities p and q on a finite find a function X on for which
Ep .Nq X / is maximal! Do the same for Gpq .X / What is the difference?
2) Compare the surprise values obtained in Exercise 1) with SL.q; p/, SL.p; q/,
IG.q; p/, and IG.p; q/! Are all relations between these numbers possible?
3) Give examples for two continuous probabilities p q on 0; 1 such that
a) IG.p; q/ D IG.q; p/,
b) SL.p; q/ D SL.q; p/.
4) For D f0; 1g, p D .0:7; 0:3/, and q D 13 ; 23 determine the chain with maximal subjective surprise. Compare this to Exercise 1)
5) Given n independent random variables X1 ; : : : ; Xn what is the novelty and what
the surprise of X1 \ X2 \ : : : \ Xn ?
The solution is of interest for the evaluation of a number of statistical tests
for the same hypothesis undertaken in independent settings (for example,
by different research groups). For this reason we will give the result here:
The novelty for a particular event x is of course the sum of the novelties
N Xi Xi .x/, and if this sum is s, then the surprise is
2s
n1
X
.s ln 2/k
kD0
References
157
References
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic
Computers, 16(3), 299307.
Amari, S. (1982). Differential geometry of curved exponential familiescurvature and information
loss. Annals of Statistics, 10, 357385.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. USA: AMS and Oxford
University Press.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal
separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural
Information Processing Systems (Vol. 8) (pp. 757763). Cambridge: MIT Press.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing?
Network: Computation in Neural Systems, 3, 213251.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295311.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning.
Neural Networks, 5, 537550.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart
and Winston.
Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings
of the 12th international conference on artificial intelligence and statistics (AI-STATS 2009).
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.
Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensinga new class
of self-organizing coding models for neuroscience. arXiv:0906.1202v1.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. London: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical
modeling of neural systems. MA: MIT Press.
Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing.
New York: Springer.
Erdogmus, D., Principe, J. C., & II, K. E. H. (2003). On-line entropy manipulation: Stochastic
information gradient. IEEE Signal Processing Letters, 10(8), 242245.
Grosse, I., Herzel, H., Buldyrev, S., & Stanley, H. (2000). Species independence of mutual
information in coding and noncoding DNA. Physical Review E, 61(5), 56245629.
Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats.
Physical Review E, 50(6), 50615071.
Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed
representations. Philosophical Transactions of the Royal Society B: Biological Sciences,
352(1358), 11771190.
Hyvarinen, A. (2002). An alternative approach to infomax and independent component analysis.
Neurocomputing, 4446, 10891097.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4),
620630.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE, 70,
939952.
158
Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.
Kolmogorov, A. N. (1956) On the Shannon theory of information transmission in the case of
continuoussignals. IRE Transactions on Information Theory, IT-2, 102108.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical
Statistics, 22(1), 7986.
Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between
input and output signals. Neural Computation, 1(3), 402411.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear
network. Neural Computation, 4, 691702.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input
distributions. Neural Computation, 9, 16611665.
MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. UK: Cambridge
University Press.
Mac Donaill, D. (2009). Molecular informatics: Hydrogen-bonding, error-coding, and genetic
replication. In 43rd Annual Conference on Information Sciences and Systems (CISS 2009).
MD: Baltimore.
Mongillo, G., & Den`eve, S. (2008). On-line learning with hidden Markov models. Neural
Computation, 20, 17061716.
Ozertem, U., Erdogmus, D., & Jenssen, R. (2006). Spectral feature projections that maximize
shannon mutual information with class labels. Pattern Recognition, 39(7), 12411252.
Pearlmutter, B. A., & Hinton, G. E. (1987). G-maximization: An unsupervised learning procedure
for discovering regularities. In J. S. Denker (Ed.), AIP conference proceedings 151 on neural
networks for computing (pp. 333338). Woodbury: American Institute of Physics Inc.
Principe, J. C., Fischer III, J., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.),
Unsupervised adaptive filtering (pp. 265319). New York: Wiley.
Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of
Theoretical Biology, 188(3), 369377.
Slonim, N., Atwal, G., Tkacik, G., & Bialek, W. (2005). Estimating mutual information and multiinformation in large networks. arXiv:cs/0502017v1.
Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.
Tkacik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers
(Ed.) Encyclopedia of complexity and systems science (pp. 719741). Berlin: Springer.
arXiv:0712.4385 [qbio.MN]
Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations.
In ICML 00: Proceedings of the Seventeenth International Conference on Machine Learning
(pp. 10151022). San Francisco: Morgan Kaufmann.
Weiss, O., Jimenez-Montano, M., & Herzel, H. (2000). Information content protein sequences.
Journal of Theoretical Biology, 206, 379386.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length.
Neural Computation, 7, 549564.
Part V
Chapter 12
161
162
its representation. Thus the issue of representation has to be studied with about
the same intensity as the problems of computation. In a way computation can be
regarded simply as a transformation between different representations of the same
information.
Here we do not address problems of computation but focus on the issue of
representation and on the use of information theory in this context. The first
question to be discussed is a methodological or even philosophical one: How is it
possible to use rather technical concepts from information theory to discuss issues
of representation in neuroscience and brain research?
In the 1950s and 1960s the use of information measurements in brain research
and experimental psychology became quite popular (e.g., MacKay and McCulloch
1952; Barnard 1955; Quastler 1956a,b; Attneave 1959; Barlow 1961; Wenzel 1961;
Miller 1962; Yovits et al. 1962; Gerstein and Mandelbrot 1964; Cherry 1966;
Perkel and Bullock 1967; Pfaffelhuber 1972; Abeles and Lass 1975; Massaro 1975;
Eckhorn et al. 1976; Uttley 1979; Johannesma 1981; Srinivasan et al. 1982); for
example, the information transmission rates of a single neuron, the optic nerve,
or of conscious reactions to stimuli were determined and discussed, also during
the following years. Similarly the information storage capacity of short-term, longterm and some other memories was discussed. In my early work, I used information
theory to investigate the optimal storage capacity of neural associative memories
(Palm 1980, 1987b; Palm and Sommer 1992) based on Hebbian synaptic plasticity
(Hebb 1949). This led me to the prediction that spiking activity in associative neural
populations should be sparse. An important theme in the early discussions was the
question of the neural code (see Perkel and Bullock 1967), i.e., whether the
neurons use a kind of Morse code, where patterns of exact time-intervals between
spikes are crucial, whether the single spikes of a neuron can be interpreted simply
as binary yes or no signals, or whether it is the vigor or frequency of the single
neurons spiking that signals some degree of intensity (either for the size of some
measured variable or for the certainty of a proposition).
This discussion has been revived and deepened in the last years from the 1990s
to the present (e.g., Optican and Richmond 1987; Linsker 1988; Barlow et al. 1989;
Bialek et al. 1991; Optican et al. 1991; van Essen et al. 1991; Atick 1992; Kjaer
et al. 1994; Shadlen and Newsome 1994; Tononi et al. 1994; Softky 1995; Dan et al.
1996; Gerstner et al. 1997; Golomb et al. 1997; Rieke et al. 1997; Rolls et al. 1997;
Tsodyks and Markram 1997; Brunel and Nadal 1998; Borst and Theunissen 1999;
Eckhorn 1999; Brenner et al. 2000; Panzeri and Schultz 2001; Nakahara and Amari
2002; Adelman et al. 2003; Seri`es et al. 2004; Dean et al. 2005; Butts and Goldman
2006; Butts et al. 2007; Gutnisky and Dragoi 2008; Koepsell and Sommer 2008;
Coulter et al. 2009; Wang et al. 2010). This is not surprising since the available
experimental evidence about neural activity, spike trains of single and multiple
neurons, and neural systems has increased dramatically over the last 50 years.
However, the discussion of the neural code still remained in the context of
classical experimental paradigms and traditional Shannon information (one notable
exception was Charles (Legendy 1975; Legendy and Salcman 1985; Legendy
2009), who used an early version of the idea of novelty introduced in this book).
163
One question which is still being discussed concerns spike rate vs. single spike
timing: is it just the rate of spikes in a neurons output spike train that contains the
information it conveys (Abbott 1994; Dan et al. 1996; Deadwyler and Hampson
1997; Gerstner et al. 1997; Golomb et al. 1997; Kang and Sompolinsky 2001;
Kjaer et al. 1994; Nirenberg and Latham 2003; Optican et al. 1991; Optican
and Richmond 1987; Panzeri and Schultz 2001; Panzeri et al. 1999; Seri`es et al.
2004; Shadlen and Newsome 1998; Treves and Panzeri 1995) or is there additional
information in the precise timing of individual spikes? Since there is no clock in
the brain, the second alternative requires a temporal reference. This could be in the
preceding spikes of the neuron itself. This idea leads to the observation of suspicious
interspike interval patterns (Abeles et al. 1993; Abeles and Gerstein 1988; Baker and
Lemon 2000; Dayhoff and Gerstein 1983a,b; Grun et al. 2002a,b, 1999; Tetko and
Villa 1992; Martignon et al. 1994, 2000, 1995). Or the reference could be spikes
of other neurons. This leads to the idea of spike patterns across populations of
neurons, which would be very hard to observe experimentally, or at least to the very
common idea of coincidence or synchronicity of spikes in two or more neurons
which could be measured by correlation (Tsodyks et al. 2000; Engel et al. 2001;
Grun et al. 1994b; Konig et al. 1995; Palm et al. 1988). Today there is a lot of
experimental evidence for the importance of both spike rates and synchronicity.
The idea of a population code as opposed to single neuron codes, be it in terms
of spike frequencies, ordering of spike latencies (Perrinet et al. 2003; Thorpe
et al. 2004; Guyonneau et al. 2004; Loiselle et al. 2005) or spike coincidences
in the population, has also become quite popular. An important aspect here is the
sparseness of co-occurring spikes in a typical population of excitatory neurons
(see Palm 1980, 1982, 1987a; Palm and Sommer 1992; Hyvarinen and Karhunen
2001; Furber et al. 2007). The population idea has been considered to derive rules
for synaptic plasticity and learning. The guiding principle in these theories is that
the neural interconnectivity should be modified by synaptic plasticity (the synapses
form the connections between neurons) in such a way that it creates population
activities that maximize the information content about the stimuli (Barlow 1989;
Bell and Sejnowski 1995; Haft and van Hemmen 1998; Linsker 1989b,a, 1992,
1997; Yang and Amari 1997) that are believed to be represented in that neural
population or cortex area (for example, visual information in the visual cortex).
More recently, details of the precise timing of pre- and postsynaptic spikes and
the resulting influence on synaptic efficiency have been the focus of experimental,
information theoretical, and neural modeling studies (Dan and Poo 2006; Bi and
Poo 1998; Bialek et al. 1991; Kempter et al. 1999; Guyonneau et al. 2004; Hosaka
et al. 2008; Izhikevich 2007; Markram et al. 1997; Masquelier et al. 2009; Morrison
et al. 2008; Pfister and Gerstner 2006; van Rossum et al. 2000).
Another issue is the variability of neural responses. The same neuron (in
the visual cortex for example) will respond to repetitions of the same (visual)
stimulus not exactly in the same way. There may be large variations in both the
rate and the fine timing of the spikes. This leads to the question of what is the
signal (i.e., information about the experimental stimulus) and what is the noise
164
(this may also be information about something else) in the neural spike train
(Abeles et al. 1995; Arieli et al. 1996; Bair and Koch 1996; Christodoulou and
Bugmann 2001; Butts and Goldman 2006; Knoblauch and Palm 2004; Mainen and
Sejnowski 1995; Shadlen and Newsome 1998; Softky and Koch 1992, 1993; Stevens
and Zador 1998).
The ensuing discussions can get technically quite involved and detailed, but still
they often try to avoid questions about the purpose of the activity of an individual
neuron or a cortical area for the information processing of the behaving animal. It
is very likely that the purpose of the primary visual cortex (V1), for example, is not
just to present visual information. This information is presented quite efficiently
in the 106 fibers of the optic nerve. In the visual cortex, there are at least two
magnitudes more neuronsto represent just the same information? It can be that
certain important features that are implicit in the visual information in the optic
nerve are made explicit in the representation in V1. These can become important
presumably in terms of the behavioral responses and goals of the animal and may be
used by other brain areas to produce reasonable behavior. It may even be (and there
is experimental evidence for this) that general signals concerning the state of the
whole animal and in particular its attentiveness and even motivational or emotional
aspects also contribute to the neural responses in V1. These would normally be
regarded as noise with respect to the representation of the experimental visual
stimulus.
From this kind of argument it becomes evident that the search for a neural
code, even in a rather peripheral area like V1, ultimately requires an integrative
information processing theory of the whole brain. In fact, a number of brain theories,
often for partial functionalities, have already been published (Palm 1982; Shaw and
Palm 1988; Edelman and Tononi 2000; Grossberg 1999; Hawkins and Blakeslee
2004; Hecht-Nielsen 2007) and can serve as a background for these ideas on coding
and information theory (see also Tononi et al. 1992, 1994).
The whole idea of a neural code may even be (and has been) criticized from
a philosophical point of view (e.g., Bar-Hillel and Carnap 1953, see also Palm
1985): The use of information theory may seem rather inconspicuous with respect to
the so-called mindbrain (or mindbody) problem, in particular, since information
theorists never hesitate to admit that information in the technical sense does not
concern the meaning of the messages. Thus information can be introduced almost
innocently in a scientific theory of the brain that ends with a theory of consciousness
or at least points towards such a theory. I do not wish to say that such a theory
is totally impossible (for example, I sympathize with the theory put forward by
Edelman (Edelman and Tononi 2000), but I think one should be aware of the points
in the argument where aspects of intentionality are brought in. The problem with
this use of information terminology in brain research is essentially that the concept
of information or entropy may appear to be an objective one (since entropy is, after
all, a concept of physics; see Chap. 14), but contains in fact a strong subjective
and teleological element, namely the choice of the repertoire, the description or
the partition through which the world is viewed. This aspect was rather hidden in
classical information theory. In this new broader approach it becomes more evident.
165
In this use of information theory in brain research, where the receivers and
senders of information are not people but perhaps parts of a brain and we can
only indirectly try to infer their purposes, let alone the code that is used to
transmit information between them, classical information theory may be a little
too restrictive. One would like to be able to deal more directly with instances
of information extraction, formation of languages or language-like descriptions
of physical events, the formulation of more goal-directed purposeful variants of
information, and the like. A need for this is expressed more or less explicitly in
many attempts to use information theoretical ideas in the theoretical understanding
of brains (e.g., Uttley 1979; Palm 1982; Optican and Richmond 1987; Barlow 1989;
Barlow and Foldiak 1989; Barlow et al. 1989; Abeles 1991; Coulter et al. 2009).
For example, the definition of the information contained in a stimulus (Butts 2003)
should somehow reflect the importance of this stimulus for the animal, or at least
the importance of the neurons responding to this stimulus.
I believe the concepts of novelty, surprise, description, and repertoire, as
developed in this book, could help us to find a better and more appropriate use of
information theory in brain research. It may be possible to use these new concepts
in order to get rid of some (implicit) assumptions about the neural structures that
are to be analyzed in an information theoretical way. My intention is not to criticize
the ubiquitous recent applications of information theory in neuroscience and brain
research [a good review was given in Borst and Theunissen (1999), see also the
recent books Gerstner and Kistler (2002); Kamimura (2002); Rieke et al. (1997)].
Rather I want to show, where the particular new concepts introduced here, which
bear a certain ambivalence between information- and significance-related concepts,
can be successfully employed to achieve a better understanding or a theoretical
underpinning of argumentations that have been put forward in these fields and that,
in fact, have partially inspired me to introduce these new concepts. My exposition
will mainly consist of various applications of the concepts of novelty and surprise
introduced in Chap. 10.
The concept of information is clearly very useful in the neurosciences. The
activation of a neuron in the brain is considered as a signal for other neurons; it
represents some message or partial description about the state of the outside world
or of the animal itself. The neuron derives its activation from its synaptic inputs
which come from other neurons or (sometimes) directly from sense organs. If we
take together the knowledge about the activations of all neurons in the brain into a
so-called activity state or activity vector, this should contain a rather comprehensive
description of the state of the environment and of the animal itself, in other words: of
the current situation. Such a representation is necessary as far as it concerns aspects
of the situation that are vital for the animal, because it is the basis on which the
animal has to choose, plan, and perform the proper action in the given situation.
Clearly every single neuron in the brain can only be monitoring very few
specific aspects of the whole situation, and these aspects can well be described
mathematically by a repertoire or by a description.
166
167
depolarizing and thus positive and inhibitory IPSPs, which are hyperpolarizing and
thus negative. This simple model leads to the equation
XX
hi .t Tni /;
(12.1)
D.t/ D
n
168
access to the spike train(s), if he records from several neurons simultaneously they
can be regarded as his afferents. In addition he also has information about the
outside world, which he can see through his own eyes, not through the eyes of the
animal. Often he records some particular features of this state of the world which
he believes to be interesting for the animal and perhaps even for the particular
neuron(s) he is recording from. Often he controls these features as a so-called
stimulus. Of course, it is well possible that the animal or even the recorded neuron
is interested in or reacts to other aspects of the experimental situation which are
not captured by the stimulus as defined by the experimenter. In addition to the
neural responses the experimenter may also record behavioral responses of the
animal. In behavioral experiments with trained animals, the animal is rewarded
for certain responses to the different experimental stimuli. In these cases one can
use information to measure how well the different stimulus response configurations
are differentiated by the animal or by the neuron or neural population recorded
from. Usually the experimenter will look out for interesting neural events in order
to correlate them with observations on the stimulus or the behavioral response. In
order to define an interesting neural event, he normally has to rely on the same
features that are available to the single neuron, i.e., on bursts and coincidences.
In multiunit recordings one may also try to combine these features into a one-dimensional
repertoire, using plausible models for the functions h_i and working with
(12.1). Thus the experimenter may create a burst repertoire, a pause repertoire, a
coincidence repertoire, and even a depolarization repertoire in order to evaluate
his data. In contrast to the neuron, however, the experimenter will not only use the
novelty created by these repertoires, but also the statistically more relevant surprise.
In the following we will briefly describe these repertoires.
Fig. 12.2 Burst novelty (in bits) as a function of time for two individual spike trains and a simulated
Poisson spike train. (a) Unstimulated neuron in a fly, (b) stimulated neuron in cat LGN, (c) Geiger
counter
was really surprisingly surprising (cf. Chap. 10). Some experimental and theoretical
analysis of this model was given by Palm (1981). For these calculations one needs
a probability distribution p on the set of possible spike trains. This distribution
should reflect some of the statistical properties of the observed spike trains, but only
to a certain degree. It should also be possible to regard it as the naive distribution
against which the surprise of the actually observed spike train can be measured. For
this reason I have considered the Poisson distribution on this set for the calculation of
burst novelty and surprise shown in Figs. 12.2 and 12.3.
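A minimal sketch of this computation (assuming a homogeneous Poisson spike train with
known rate as the naive distribution; the function name and example numbers are mine):

import numpy as np
from scipy.stats import poisson

def burst_novelty(n_spikes, duration, rate):
    """Novelty (in bits) of seeing n_spikes within `duration` seconds,
    measured against a naive homogeneous Poisson spike train of the given
    rate: N = -log2 P[at least n_spikes in the interval]."""
    mu = rate * duration
    return -np.log2(poisson.sf(n_spikes - 1, mu))   # sf(k, mu) = P[X > k]

# e.g. 12 spikes within 50 ms from a neuron firing at 10 spikes/s on average
print(burst_novelty(12, 0.05, 10.0))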
A slightly more general model is the so-called renewal process (see also Grun
et al. 1994b; Grun and Rotter 2010). Here we observe the so-called interspike
interval distribution, i.e., the probability distribution of the time intervals between
successive spikes, D = T_{n+1} - T_n, and we assume that this random variable D is
[Fig. 12.3: burst surprise plotted against burst novelty (computer simulation for a Poisson spike train)]
independent of n and, in addition, that subsequent interspike intervals are independent.
In this case, the variables T_n - T_{n-k} have the same distribution for all
n, and this distribution is the distribution of the sum of k identical independent
versions of D. These distributions are quite well understood in probability theory
(e.g., Doob 1953). A special case is the Poisson distribution, which arises from the
assumption that D is exponentially distributed, i.e., p[D \le t] = 1 - e^{-\lambda t}. Even
for this simple distribution, I have not been able to calculate the relation between
burst novelty and burst surprise analytically. So I have used computer simulation to
obtain Fig. 12.3.
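A rough sketch of such a simulation (not the program used for Fig. 12.3; the rate,
observation time, and burst-novelty convention are assumptions of mine): simulate many
Poisson spike trains, record the maximal burst novelty of each, and read off the surprise
of a novelty value s as -log2 of the fraction of trains whose maximal novelty reaches s.

import numpy as np
from scipy.stats import poisson

RATE, T_OBS, N_SIM = 5.0, 5.0, 500      # spikes/s, observation time (s), trains
rng = np.random.default_rng(0)

def max_burst_novelty(t, rate):
    """Largest burst novelty -log2 P[Poisson(rate*d) >= n] over all groups of
    n consecutive spikes spanning a duration d within one spike train t."""
    best = 0.0
    for i in range(len(t)):
        for j in range(i + 1, len(t)):
            n, d = j - i + 1, t[j] - t[i]
            best = max(best, -np.log2(poisson.sf(n - 1, rate * d)))
    return best

# distribution of the maximal burst novelty under the Poisson null model
novelties = []
for _ in range(N_SIM):
    n_spikes = rng.poisson(RATE * T_OBS)
    spikes = np.sort(rng.uniform(0.0, T_OBS, n_spikes))
    novelties.append(max_burst_novelty(spikes, RATE))
novelties = np.array(novelties)

def burst_surprise(s):
    """Surprise of observing a maximal burst novelty of at least s bits."""
    return -np.log2(max((novelties >= s).mean(), 1.0 / N_SIM))

for s in (5.0, 10.0, 15.0):
    print(s, burst_surprise(s))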
= \{ (T_n^i) : n \in \mathbb{N},\ i = 1, \ldots, m \}.
The coincidence repertoire is defined as

\mathcal{C}_t = \{ C_t^{k,\tau} : k \in \mathbb{N},\ \tau > 0 \},    (12.2)

where

C_t^{k,\tau} = \Big[ \sum_n \sum_i 1_{[t-\tau,\,t]}(T_n^i) \ge k \Big].

For the statistical evaluation we divide the observation period into L time bins, write
x_i^j \in \{0,1\} for the activity of site j in bin i and n_j for the total number of spikes of
site j, and consider

p_k(n) = p\Big[ \sum_{i=1}^{L} \prod_{j=1}^{k} x_i^j = n \;\Big|\; \sum_{i=1}^{L} x_i^j = n_j \ \text{for } j = 1, \ldots, k \Big]    (12.3)

and

P_k(n) = \sum_{i=n}^{n_k} p_k(i).    (12.4)
Here x_i^j \in \{0,1\} and therefore \prod_{j=1}^{k} x_i^j = 1 exactly if there is a k-coincidence,
so the first sum in (12.3) counts the k-coincidences, whereas the other sums count the
number of spikes at the j-th site during the observation. This probability was first
calculated for k = 2 in Palm et al. (1988) and was incorporated as an analysis
tool in early versions of the Joint Peri-Stimulus-Time-Histogram (JPSTH) program
(Aertsen et al. 1989); it was later computed for higher values of k by Grun et al. (2002a,b) and
extended to more general recording conditions (Grun et al. 1999; Gutig et al. 2002).
The calculation is comparatively easy if we proceed by induction on k. The fact
that we don't know the parameters p_j doesn't matter, because all binary sequences
(x_1^j, \ldots, x_L^j) containing n_j ones are equally probable and therefore we can use
combinatorial counting arguments for these \binom{L}{n_j} sequences.
Obviously, for k = 1 we have

p_1(n) = \begin{cases} 1 & \text{if } n = n_1, \\ 0 & \text{otherwise.} \end{cases}

Assume that we know p_{k-1}(n) for all n (obviously p_{k-1}(n) = 0 for n > n_1).
To calculate p_k(n) we observe that for getting exactly n k-coincidences we have
to have at least n coincidences in the first k - 1 spike sequences and at least n spikes
in the k-th sequence, and if we have these (k-1)-coincidences in i \ge n places,
then exactly n of the n_k spikes of the k-th sequence have to be in these i places; the
remaining ones have to be in the remaining L - i places.
Thus

p_k(n) = \sum_{i=n}^{n_1} p_{k-1}(i)\, \frac{\binom{i}{n}\binom{L-i}{n_k-n}}{\binom{L}{n_k}} \qquad (\text{for } n \le n_k);    (12.5)

in particular,

p_2(i) = \frac{\binom{n_1}{i}\binom{L-n_1}{n_2-i}}{\binom{L}{n_2}}

and

P_2(n) = \sum_{i=n}^{n_2} \frac{\binom{n_1}{i}\binom{L-n_1}{n_2-i}}{\binom{L}{n_2}}.
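The recursion (12.5) can be implemented directly, for instance as follows (a sketch;
the function names and the example call are mine):

from math import comb

def coincidence_probabilities(counts, L):
    """p_k(n) of (12.5): probability of exactly n k-fold coincidences, given
    that site j fires n_j times within L time bins (counts = [n_1, ..., n_k])."""
    n1 = counts[0]
    p = {n1: 1.0}                                  # p_1(n) = 1 iff n = n_1
    for nk in counts[1:]:
        p = {n: sum(p_i * comb(i, n) * comb(L - i, nk - n)
                    for i, p_i in p.items()) / comb(L, nk)
             for n in range(nk + 1)}
    return p

def tail_probability(counts, L, n):
    """P_k(n): probability of at least n k-fold coincidences."""
    p = coincidence_probabilities(counts, L)
    return sum(v for m, v in p.items() if m >= n)

# e.g. two sites with 10 spikes each in 100 bins: P_2(4) is the chance of four
# or more coincidences, and -log2 of it the corresponding coincidence novelty
print(tail_probability([10, 10], 100, 4))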
The formulae become a bit simpler if we assume that the sites have been
reordered in such a way that n_1 \le n_2 \le n_3 \le \ldots \le n_k and furthermore that spikes are
relatively sparse such that n_1 + n_k \le L.
In this case (writing i_1 := n_1),

p_3(n) = \sum_{i_2=n}^{n_1} \frac{\binom{i_1}{i_2}\binom{L-i_1}{n_2-i_2}}{\binom{L}{n_2}} \cdot \frac{\binom{i_2}{n}\binom{L-i_2}{n_3-n}}{\binom{L}{n_3}},

p_k(i_k) = \sum_{i_2=i_k}^{i_1} \cdots \sum_{i_{k-1}=i_k}^{i_{k-2}} \prod_{j=1}^{k-1} \frac{\binom{i_j}{i_{j+1}}\binom{L-i_j}{n_{j+1}-i_{j+1}}}{\binom{L}{n_{j+1}}},    (12.6)

and

P_k(n) = \sum_{i_2=n}^{i_1} \cdots \sum_{i_k=n}^{i_{k-1}} \prod_{j=1}^{k-1} \frac{\binom{i_j}{i_{j+1}}\binom{L-i_j}{n_{j+1}-i_{j+1}}}{\binom{L}{n_{j+1}}}.    (12.7)
12.5 Conclusion
We can use concepts from information theory in trying to understand the functioning
of brains on many different levels, ranging from the neuron to the entire organism or
person. On each level one can distinguish the novelty or the surprise obtained from
the incoming signals by the unit under consideration from the information provided
by these signals and transmitted to this unit. In every case this information and the
corresponding transmission channel capacity are larger than the surprise obtained.
The discrepancy between information and surprise is most obvious for the whole
organism, when we consider how little surprise we typically get out of how much
input information.
As for the relation between novelty and surprise in the brain, this issue is more
relevant for the statistical evaluation of neurophysiological observations. It has been
the subject of controversial discussions in the literature, in particular concerning the
significance of spatio-temporal spike-patterns (citations are collected in the technical comments below),
without the use of the terminology introduced in this book. Now this topic can be
formulated as a neat mathematical problem.
Problem: For a reasonable probability distribution for neural spike trains (like the
Poisson distribution, which can often serve as a good null hypothesis), and for all the
repertoires defined in this chapter, one should try to calculate the average surprise
S(\cdot) and the surprise statistics, i.e., prob[S \ge t] for all t \in \mathbb{R}.
This problem is actually quite easy for the Poisson distribution and most of the
repertoires (the neuronal, the coincidence, and the population repertoire); it is harder
(we do not know the analytical answer yet) for the burst repertoire. There are a few
results dealing with some instances of this problem that can be found in the literature
of the last 30 years. Most of these results are collected in the technical comments
below.
Once again, one can roughly identify the three quantities novelty, information,
and surprise introduced in this book with three viewpoints on brain activity:
novelty with the subjective view, seeing the world through the eyes of individual
neurons or neural populations (Letvin et al. 1959; Legendy 1975); information (or
transinformation) with the functional view of measuring the neuron's contribution to
the animal's experimental performance (Barlow 1961; Borst and Theunissen 1999;
Nemenman et al. 2008); and surprise with the physiological statistical view that tries
to find significant patterns of neural activation (Dayhoff and Gerstein 1983a; Abeles
and Gerstein 1988; Palm et al. 1988; Aertsen et al. 1989; Grun et al. 1994b, 2002a;
Martignon et al. 1995).
1. Signal vs. noise in the variability of neural responses: Abbott (1994); Brunel
and Nadal (1998); Butts (2003); Butts and Goldman (2006); Christodoulou
and Bugmann (2001); Golomb et al. (1997); Kang and Sompolinsky (2001);
Kjaer et al. (1994); Knoblauch and Palm (2004); Mainen and Sejnowski (1995);
Shadlen and Newsome (1994, 1998); Softky and Koch (1992, 1993); Softky
(1995); Stevens and Zador (1998); Nakahara and Amari (2002); Nakahara et al.
(2006); Hansel and Sompolinsky (1996).
2. Rate coding vs. fine timing of spikes: Abbott (1994); Abeles et al. (1995);
Aertsen et al. (1989); Aertsen and Johannesma (1981); Bach and Kruger (1986);
Bair and Koch (1996); Barlow (1961); Bethge et al. (2002); Bialek et al. (1991);
Brown et al. (2004); Cessac et al. (2008); Deadwyler and Hampson (1997); Dean
et al. (2005); Den`eve (2008); Eckhorn et al. (1976); Engel et al. (2001); Gerstein
and Aertsen (1985); Gerstner and Kistler (2002); Gerstner et al. (1997); Golomb
et al. (1997); Gutig et al. (2002); Kjaer et al. (1994); Konig et al. (1995); Kostal
et al. (2007); Krone et al. (1986); Kruger and Bach (1981); Legendy and Salcman
(1985); Mainen and Sejnowski (1995); Markram et al. (1997); Morrison et al.
(2008); Nakahara and Amari (2002); Nirenberg and Latham (2003); Palm et al.
(1988); Panzeri and Schultz (2001); Panzeri et al. (1999); Perkel and Bullock
(1967); Pfister and Gerstner (2006); Rieke et al. (1997); Schneideman et al.
(2003); Seri`es et al. (2004); Shadlen and Newsome (1994); Softky (1995); Softky
and Koch (1993); Tsodyks and Markram (1997); Vaadia et al. (1995).
3. Population code: Aertsen and Johannesma (1981); Amari and Nakahara (2006);
Barlow (1989); Bethge et al. (2002); Bialek et al. (2007); Brenner et al. (2000);
Brunel and Nadal (1998); Butts and Goldman (2006); Coulter et al. (2009); Dan
et al. (1996); Deadwyler and Hampson (1997); Dean et al. (2005); Furber et al.
(2007); Gutnisky and Dragoi (2008); Kang and Sompolinsky (2001); Krone et al.
(1986); Legendy (2009); Linsker (1989b, 1992); Nirenberg and Latham (2005);
Osborne et al. (2008); Prut et al. (1998); Rolls et al. (1997); Schneideman et al.
(2003); Zemel and Hinton (1995).
4. Significance of spike patterns: Abeles (1991); Abeles et al. (1995, 1993); Abeles
and Gerstein (1988); Baker and Lemon (2000); Brown et al. (2004); Cessac
et al. (2008); Dan and Poo (2006); Dayhoff and Gerstein (1983a,b); Gerstein and
Aertsen (1985); Grun et al. (1994b, 2002a,b, 1999); Gutig et al. (2002); Hosaka
et al. (2008); Martignon et al. (2000, 1995); Masquelier et al. (2009); Nakahara
and Amari (2002); Palm et al. (1988); Pfister and Gerstner (2006); Tetko and
Villa (1992).
5. Neural coding in the visual system and natural scene statistics: Adelman et al.
(2003); Atick (1992); Atick and Redlich (1990, 1992); Barlow (1989); Bethge
et al. (2002); Butts et al. (2007); Dan et al. (1996); Dong and Atick (1995);
Field and Chichilnisky (2007); Gutnisky and Dragoi (2008); Haft and van
Hemmen (1998); Hoyer and Hyvarinen (2002); Hyvarinen and Karhunen (2001);
Hyvarinen and Hoyer (2001); Hyvarinen et al. (2009); Koepsell and Sommer
(2008); Koepsell et al. (2009); Krone et al. (1986); Legendy (2009); Linsker
(1989b,a); McClurkin et al. (1991); Optican et al. (1991); Optican and Richmond
(1987); Rolls et al. (1997); Seri`es et al. (2004); Wang et al. (2010).
[Fig. 12.4: histogram of the number of bursts as a function of their Poisson surprise S (from Legendy and Salcman 1985)]
Perhaps the first paper that tried to understand the organization of the brain from
the point of view of each single neuron, in terms of the information that a single
neuron could obtain from its afferents and transmit to its efferents, was written by
Legendy (1975). Putting ourselves in the position of one single neuron, we can ask
ourselves "What could it be interested in?" and "How could it transfer this into
surprising signals on its own axon?"
The second question is very familiar to the single-neuron physiologists who
record the spike trains from single neurons and try to make sense of them. What they
listen to are bursts, i.e., temporary increases in spiking activity, the surprising
event being that the neuron fires comparatively many spikes in a comparatively short
time interval. Typically bursts consist of anything between 5 and 50 spikes (for
more elaborate statistics, see Legendy and Salcman 1985). The analysis of bursts
in terms of surprise or novelty was initiated by Palm (1981) and Legendy and
Salcman (1985). Some results of the latter paper are shown in Figs. 12.4 and 12.5.
The first question has also become a technical question for those neurophysiologists
who started to do multiunit recordings. In this case, the physiologist's afferents
are the neurons he records from. Three answers have been given:
(a) Coincidence: A surprisingly large number of afferents (to the neuron or to the
physiologists electrode) fire within the same (short) time window.
(b) Afferent patterns: A certain pattern (i.e., subset of the afferents) fires within
the same short time window. This was probably most often the case when
a physiologist was surprised by a large coincidence (Kruger and Bach 1981;
Legendy and Salcman 1985; Bach and Kruger 1986; Abeles et al. 1993).
Fig. 12.5 Spike-burst statistics. (a) Histogram of the spike rate during high-surprise bursts (thick
bars: N > 10, thin bars: N > 20), (b) histogram of the number of spikes in high-surprise bursts.
Preparation as in Fig. 12.4, from Legendy and Salcman (1985)
12.6.1 Coincidence
The most straightforward possibility for a single neuron's surprise is in fact (a).
A typical neuron needs a certain number of coincident input spikes in order to
become sufficiently active. Additional bursting of some of these inputs may help, so
that perhaps the better formulation of the surprising event has both the ingredients
of bursting and of coincidence: all afferents taken together fire surprisingly many
spikes within a surprisingly short time interval. The mathematical analysis of this
kind of novelty or surprise is identical with the analysis of burst novelty or surprise
as described above.
\Omega = \{ \omega = (b_j^i)_{j \in N}^{i \in K} : b_j^i \in \{0,1\} \}.

A pattern of length L is a sequence c = (c_j^i)_{i \in K,\ j=1,\ldots,L}. If we are interested
in all patterns of length L, our cover is simply \{ C_c : c \text{ a pattern of length } L \},
where

C_c = \{ \omega = (b_j^i) \in \Omega : b_j^i = c_j^i \ \text{for all } j = 1,\ldots,L \ \text{and all } i = 1,\ldots,k \}.

In this case, every pattern would be very surprising, but exactly for this reason, this large
calculated novelty is not really surprising.
Let us model this situation more completely: we assume as a naive probability
assignment that all b_j^i are independent and have the same probability q of being a
1. Thus, if N_c is the number of 1s in a pattern c, we simply have p(c) = q^{N_c} (1-q)^{Lk-N_c}.
If q = 1/2, then all patterns are equally improbable: p(c) = 2^{-Lk}, and the
surprise for each pattern is equally high: S(c) = Lk. If q is small, which is typical
for spike trains, then r := (1-q)^{Lk} is not incredibly small (it lies between 0 and 1) and

p(c) = \left(\frac{q}{1-q}\right)^{N_c} r, \qquad N(c) = N_c \log_2\frac{1-q}{q} - \log_2 r.
So the surprise increases with N_c and our measurement reduces essentially to the
coincidence surprise of Sect. 12.6.2.
Another interpretation of the pattern surprise, in particular in the case where the
1s are much more surprising than the 0s, would be to consider only the 1s,
i.e., the occurrences of spikes in a pattern. This leads more or less to the same
statistics: in this case, we describe a pattern c by the proposition

D_c = \{ (b_j^i) \in \Omega : b_j^i = 1 \ \text{for all } i, j \ \text{where } c_j^i = 1 \}

and form the cover \{ D_c : c \text{ a pattern of length } L \}.
If again N_c denotes the number of 1s in a pattern c, then p(D_c) = q^{N_c}. Therefore
N(D_c) = N_c \cdot (-\log_2 q) for all patterns c.
This is quite obvious because in every case the novelty of a proposition in
the cover depends only on the total number of spikes implied by it. Thus the
calculation of the corresponding surprise reduces essentially to the calculation of
burst surprise (see Sect. 12.3.1).
After these observations one may argue that the real surprise in the case of spatio-temporal
patterns does not lie in the fact that each of the patterns is very surprising
by itself, but in the repetitive occurrence of one (or a few) of these patterns. This
case is best treated as the surprise of repetition (of an improbable event). This kind of
problem has already been analyzed to some extent by Dayhoff and Gerstein (1983b)
and Abeles and Gerstein (1988). We will return to it in the next chapter.
References
Abbott, L. F. (1994). Decoding neuronal firing and modeling neural networks. Quarterly Reviews
of Biophysics, 27, 291331.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge
University Press.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. Journal of Neurophysiology, 60(3), 909924.
Abeles, M., & Lass, Y. (1975). Transmission of information by the axon: II. The channel capacity.
Biological Cybernetics, 19(3), 121125.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the
frontal cortex of behaving monkeys. Journal of Neurophysiology, 70(4), 16291638.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995).
Cortical activity flips among quasi stationary states. Proceedings of the National Academy of
Sciences of the United States of America, 92, 86168620.
Adelman, T. L., Bialek, W., & Olberg, R. M. (2003). The information content of receptive fields.
Neuron, 40(13), 823833.
Aertsen, A. M. H. J., & Johannesma, P. I. M. (1981). The spectro-temporal receptive field.
A functional characteristic of auditory neurons. Biological Cybernetics, 42(2), 133143.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal
firing correlation: Modulation of effective connectivity. Journal of Neurophysiology, 61(5),
900917.
Amari, S.-i., & Nakahara, H. (2005). Difficulty of singularity in population coding. Neural
Computation, 17, 839858.
Amari, S., & Nakahara, H. (2006). Correlation and independence in the neural code. Neural
Computation, 18(6), 12591267.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. M. H. J. (1996). Dynamics of ongoing
activity: Explanation of the large variability in evoked cortical responses. Science, 273(5283),
18681871.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing?
Network: Computation in Neural Systems, 3, 213251.
Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural
Computation, 2(3), 308320.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Cambridge:
MIT Press.
Attneave, F. (1959). Applications of information theory to psychology. New York: Holt, Rinehart
and Winston.
Bach, M., & Kruger, J. (1986). Correlated neuronal variability in monkey visual cortex revealed
by a multi-microelectrode. Experimental Brain Research, 61(3), 451456.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the
behaving macaque monkey. Neural Computation, 8(6), 11851202.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey
primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84,
17701780.
Bar-Hillel, Y., & Carnap, R. (1953). Semantic information. In London information theory
symposium (pp. 503512). New York: Academic.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages.
Cambridge: MIT Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295311.
Barlow, H. B., & Foldiak, P. (1989). Adaptation and decorrelation in the cortex. In C. Miall,
R. M. Durbin, & G. J. Mitcheson (Eds.), The computing neuron (pp. 5472). USA:
Addison-Wesley.
Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural
Computation, 1(3), 412423.
Barnard, G. A. (1955). Statistical calculation of word entropies for four Western languages. IEEE
Transactions on Information Theory, 1(1), 4953.
Bateson, G. (1972). Steps to an ecology of mind. London: Intertext Books.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation
and blind deconvolution. Neural Computation, 7, 11291159.
Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal short-term population coding: When
Fisher information fails. Neural Computation, 14, 23172351.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons:
Dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of
Neuroscience, 18, 1046410472.
Bialek, W., de Ruyter van Steveninck, R. R., & Tishby, N. (2007). Efficient representation as a
design principle for neural coding and computation. Neural Computation, 19(9), 2387-2432.
Bialek, W., Reike, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural
code. Science, 252, 18541857.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation
in the hippocampus. Nature, 361, 3139.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience,
2(11), 947957.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy
in a neural code. Neural Computation, 12(7), 15311552.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: Stateof-the-art and future challenges. Nature Neuroscience, 7, 456461. doi: 10.1038/nn1228.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding.
Neural Computation, 10(7), 17311757.
Butts, D. A. (2003). How much information is associated with a particular stimulus? Network:
Computation in Neural Systems, 14(2), 177187.
Butts, D. A., & Goldman, M. (2006). Tuning curves, neuronal variability and sensory coding.
PLOS Biology, 4, 639646.
Butts, D. A., Weng, C., Jin, J., Yeh, C.-I., Lesica, N. A., Alonso, J.-M., & Stanley, G. B. (2007).
Temporal precision in the neural code and the timescales of natural vision. Nature, 449(7158),
9295.
Cessac, B., Rostro-Gonzalez, H., Vasquez, J.-C., & Vieville, T. (2008). To which extend is
the neural code a metric? In Proceedings of the conference NeuroComp 2008. Informal
publication.
Cherry, C. (1966). On human communication. Cambridge: MIT Press.
Christodoulou, C., & Bugmann, G. (2001). Coefficient of variation (CV) vs mean inter-spikeinterval (ISI) curves: What do they tell us about the brain? Neurocomputing, 3840, 11411149.
Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensinga new class
of self-organizing coding models for neuroscience.
Dan, Y., & Poo, M.-M. (2006). Spike timing-dependent plasticity: From synapse to perception.
Physiology Review, 86, 10331048.
Dan, Y., Atick, J. J., & Reid, R. C. (1996). Efficient coding of natural scenes in the lateral geniculate
nucleus: Experimental test of a computational theory. Journal of Neuroscience, 16(10),
33513362.
Dayhoff, J. E., & Gerstein, G. L. (1983a). Favored patterns in spike trains. I. Detection. Journal of
Neurophysiology, 49(6), 13341348.
Dayhoff, J. E., & Gerstein, G. L. (1983b). Favored patterns in spike trains. II. Application. Journal
of Neurophysiology, 49(6), 13491363.
Deadwyler, S. A., & Hampson, R. E. (1997). The significance of neural ensemble codes during
behavior and cognition. Annual Review of Neuroscience, 20, 217244.
Dean, I., Harper, N. S., & D. McAlpine (2005). Neural population coding of sound level adapts to
stimulus statistics. Nature Neuroscience, 8(12), 16841689.
Den`eve, S. (2008). Bayesian spiking neurons I: Inference. Neural Computation, 20, 91117.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network, 6(3), 345
358.
Doob, J. L. (1953). Stochastic Processes. New York: Wiley.
Eckhorn, R. (1999). Neural mechanisms of scene segmentation: Recordings from the visual cortex
suggest basic circuits for linking field models. IEEE Transactions on Neural Networks, 10(3),
464479.
Eckhorn, R., Grusser, O.-J., Kroller, J., Pellnitz, K., & Popel, B. (1976). Efficiency of different neuronal codes: Information transfer calculations for three different neuronal systems. Biological
Cybernetics, 22(1), 4960.
Edelman, G. M., & Tononi, G. (2000). A universe of consciousness: How matter becomes
imagination. New York: Basic Books.
Engel, A., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in
top-down processing. Nature Reviews Neuroscience, 2(10), 704716.
Field, G. D., & Chichilnisky, E. J. (2007). Information processing in the primate retina: Circuitry
and coding. Annual Review of Neuroscience, 30, 130.
Furber, S. B., Brown, G., Bose, J., Cumpstey, J. M., Marshall, P., & Shapiro, J. L. (2007). Sparse
distributed memory using rank-order neural codes. IEEE Transactions on Neural Networks, 18,
648659.
Gerstein, G. L., & Aertsen, A. M. (1985). Representation of cooperative firing activity among
simultaneously recorded neurons. Journal of Neurophysiology, 54(6), 15131528.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single
neuron. Biophysical Journal, 4(1), 4168.
Gerstner, W., & Kistler, W. M. (2002). Spiking Neuron Models. New York: Cambridge University
Press.
Gerstner, W., Kreiter, A. K., Markram, H., & Herz, A. V. M. (1997). Neural codes: Firing rates
and beyond. Proceedings of the National Academy of Sciences of the United States of America,
94(24), 1274012741.
Golomb, D., Hertz, J., Panzeri, S., Treves, A., & Richmond, B. (1997). How well can we estimate
the information carried in neuronal responses from limited samples? Neural Computation, 9(3),
649665.
Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention and grouping by the
laminar circuits of visual cortex. Spatial Vision, 12, 163186.
Grun, S., Aertsen, A. M. H. J., Abeles, M., Gerstein, G., & Palm, G. (1994a). Behaviorrelated neuron group activity in the cortex. In Proceedings 17th Annual Meeting European
Neuroscience Association. Oxford. Oxford University Press.
Grun, S., Aertsen, A. M. H. J., Abeles, M., Gerstein, G., & Palm, G. (1994b). On the significance of
coincident firing in neuron group activity. In N. Elsner, & H. Breer (Eds.), Sensory transduction
(p. 558). Thieme: Stuttgart.
Grun, S., Diesmann, M., & Aertsen, A. (2002a). Unitary events in multiple single-neuron spiking
activity: I. Detection and significance. Neural Computation, 14(1), 4380.
Grun, S., Diesmann, M., & Aertsen, A. (2002b). Unitary events in multiple single-neuron spiking
activity: II. Nonstationary data. Neural Computation, 14(1), 81119.
Grun, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events
without discretization of time. Journal of Neuroscience, 94(1), 121154.
Grun, S., & Rotter, S. (Eds.) (2010). Analysis of spike trains. New York: Springer.
Gutig, R., Aertsen, A., & Rotter, S. (2002). Statistical significance of coincident spikes: Countbased versus rate-based statistics. Neural Computation, 14(1), 121153.
Gutnisky, D. A., & Dragoi, V. (2008). Adaptive coding of visual information in neural populations.
Nature, 452(7184), 220224.
Guyonneau, R., VanRullen, R., & Thorpe, S. J. (2004). Temporal codes and sparse representations:
A key to understanding rapid processing in the visual system. Journal of Physiology Paris,
98, 487497.
Haft, M., & van Hemmen, J. L. (1998). Theory and implementation of infomax filters for the retina.
Network, 9, 3971.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual
cortex. Journal of Computational Neuroscience, 3(1), 734.
Hawkins, J., & Blakeslee, S. (2004). On intelligence. New York: Times Books, Henry Holt and
Company.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hecht-Nielsen, R. (2007). Confabulation theory. The mechanism of thought. Berlin: Springer.
Holden, A. V. (1976). Models of the stochastic activity of neurons. New York: Springer.
Hosaka, R., Araki, O., & Ikeguchi, T. (2008). STDP provides the substrate for igniting synfire
chains by spatiotemporal input patterns. Neural Computation, 20(2), 415435.
Hoyer, P. O., & Hyvarinen, A. (2002). A multi-layer sparse coding network learns contour coding
from natural images. Vision Research, 42(12), 15931605.
Hyvarinen, A., & Hoyer, P. O. (2001). A two-layer sparse coding model learns simple and complex
cell receptive fields and topography from natural images. Vision Research, 41(18), 24132423.
Hyvarinen, A., Hurri, J., & Hoyer, P. O. (2009). Natural Image Statistics. New York: Springer.
Hyvarinen, A., & Karhunen, J. (2001). Independent Component Analysis. New York: Wiley.
Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and
dopamine signaling. Cerebral Cortex, 17, 24432452.
Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Computation, 15,
15111523.
Johannesma, P. I. M. (1981). Neural representation of sensory stimuli and sensory interpretation of
neural activity. Advanced Physiological Science, 30, 103125.
Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance
measures in probability space. Physical Review Letter, 86(21), 49584961.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons.
Physical Review E, 59, 44984514.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network
models, information estimation, and spatial tuning. Journal of Computational Neuroscience, 1,
109139.
Knoblauch, A., & Palm, G. (2004). What is Signal and What is Noise in the Brain? BioSystems,
79, 8390.
Koepsell, K., & Sommer, F. T. (2008). Information transmission in oscillatory neural activity.
Biological Cybernetics, 99, 403416.
Koepsell, K., Wang, X., Vaingankar, V., Wei, Y., Wang, Q., Rathbun, D. L., Usrey, W. M., Hirsch,
J. A., & Sommer, F. T. (2009). Retinal oscillations carry visual information to cortex. Frontiers
in Systems Neuroscience, 3, 118.
Konig, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range
synchronization in cat visual cortex. In Proceedings of the National Academy of Sciences of the
United States of America, 92, 290294.
Kostal, L., Lansky, P., & Rospars, J.-P. (2007). Neuronal coding and spiking randomness. European
Journal of Neuroscience, 26(10), 26932701.
Krone, G., Mallot, H., Palm, G., & Schuz, A. (1986). Spatiotemporal receptive fields: A dynamical
model derived from cortical architectonics. Proceedings of the Royal Society of London. Series
B, Biological Sciences, 226(1245), 421444.
Kruger, J., & Bach, M. (1981). Simultaneous recording with 30 microelectrodes in monkey visual
cortex. Experimental Brain Research, 41(2), 191194.
Legendy, C. (2009). Circuits in the braina model of shape processing in the primary visual
cortex. New York: Springer.
Legendy, C. R. (1975). Three principles of brain function and structure. International Journal of
Neuroscience, 6, 237254.
Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of
spontaneously active striate cortex neurons. Journal of Neurophysiology, 53(4), 926939.
Letvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frogs eye tells
the frogs brain. Proceedings of the IRE, 47(11), 19401951.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105117.
Linsker, R. (1989a). An application of the principle of maximum information preservation to linear
systems. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (Vol. 1)
(pp. 186194). San Mateo: Morgan Kaufmann.
Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between
input and output signals. Neural Computation, 1(3), 402411.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input
distributions. Neural Computation, 9, 16611665.
Lisman, J., & Spruston, N. (2005). Postsynaptic depolarization requirements for LTP and LTD: A
critique of spike timing-dependent plasticity. Nature Neuroscience, 8(7), 839841.
Loiselle, S., Rouat, J., Pressnitzer, D., & Thorpe, S. J. (2005). Exploration of rank order coding with
spiking neural networks for speech recognition. Proceedings of International Joint Conference
on Neural Networks, 4, 20762078.
MacGregor, R. J. (1987). Neural and brain modeling. New York: Academic.
MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link.
Bulletin of Mathematical Biology, 14(2), 127135.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons.
Science, 268(5216), 15031506.
Markram, H., Luebke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by
coincidence of postsynaptic APs and EPSPs. Science, 275, 213215.
Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W. A., & Vaadia, E. (2000).
Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural
Computation, 12(11), 26212653.
Martignon, L., von Hasseln, H., Grun, S., Aertsen, A. M. H. J., & Palm, G. (1995). Detecting
higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73(1), 6981.
Martignon, L., von Hasseln, H., Grun, S., & Palm, G. (1994). Modelling the interaction in a set
of neurons implicit in their frequency distribution: A possible approach to neural assemblies.
In F. Allocati, C. Musio, & C. Taddei-Ferretti (Eds.), Biocybernetics (Cibernetica Biologica)
(pp. 268288). Torino: Rosenberg & Sellier.
Masquelier, T., Guyonneau, R., & Thorpe, S. (2009). Competitive STDP-based spike pattern
learning. Neural Computation, 21(5), 12591276.
Massaro, D. W. (1975). Experimental psychology and human information processing. Chicago:
Rand McNally & Co.
McClurkin, J. W., Gawne, T. J., Optican, L. M., & Richmond, B. J. (1991). Lateral geniculate
neurons in behaving priimates II. Encoding of visual information in the temporal shape of the
response. Journal of Neurophysiology, 66(3), 794808.
Miller, J. G. (1962). Information input overload. In M. C. Yovits, G. T. Jacobi, & G. D. Goldstein
(Eds.), Self-Organizing Systems (pp. 6178). Washington DC: Spartan Books.
Morrison, A., Aertsen, A., & Diesmann, M. (2007). Spike-timing-dependent plasticity in balanced
random networks. Neural Computation, 19(6), 14371467.
Morrison, A., Diesmann, M., & Gerstner, W. (2008). Phenomenological models of synaptic
plasticity based on spike timing. Biological Cybernetics, 98, 459478.
Nakahara, H., & Amari, S. (2002). Information geometric measure for neural spikes. Neural
Computation, 14, 22692316.
Nakahara, H., Amari, S., & Richmond, B. J. (2006). A comparison of descriptive models of a single
spike train by information geometric measure. Neural Computation, 18, 545568.
Nemenman, I., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2008). Neural coding
of natural stimuli: Information at sub-millisecond resolution. PLoS Computational Biology,
4(3), e1000025.
Nirenberg, S., & Latham, P. (2003). Decoding neural spike trains: How important are correlations?
Proceedings of the National Academy of Science of the United States of America, 100,
73487353.
Nirenberg, S., & Latham, P. (2005). Synergy, redundancy and independence in population codes.
Journal of Neuroscience, 25, 51955206.
Optican, L. M., Gawne, T. J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures
of transmitted information and channel capacity from multivariate neuronal data. Biological
Cybernetics, 65(5), 305310.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by
single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of
Neurophysiology, 57(1), 162178.
Osborne, L. C., Palmer, S. E., Lisberger, S. G., & Bialek, W. (2008). The neural basis for
combinatorial coding in a cortical population response. Journal of Neuroscience, 28(50),
1352213531.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 167183.
Palm, G. (1981). Evidence, information and surprise. Biological Cybernetics, 42(1), 5768.
Palm, G. (1982). Neural assemblies, an alternative approach to artificial intelligence. New York:
Springer.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Palm, G. (1987a). Associative memory and threshold control in neural networks. In J. L. Casti, &
A. Karlqvist (Eds.), Real brains: artificial minds (pp. 165179). New York: Elsevier.
Palm, G. (1987b). Computing with neural networks. Science, 235, 12271228.
Palm, G. (1992). On the information storage capacity of local learning rules. Neural Computation,
4, 703711.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among
neuronal spike trains. Biological Cybernetics, 59(1), 111.
Palm, G., & Sommer, F. T. (1992). Information capacity in recurrent McCullochPitts networks
with sparsely coded memory states. Network, 3(2), 177186.
Panzeri, S., & Schultz, S. R. (2001). A unified approach to the study of temporal, correlational,
and rate coding. Neural Computation, 13(6), 13111349.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of
information in the nervous system. Proceedings of the Royal Society of London Series B;
Biological Science, 266(1423), 10011012.
Perkel, D. H., & Bullock, T. H. (1967). Neural coding. Neurosciences Research Program Bulletin,
6(3), 223344.
Perrinet, L., Samuelides, M., & Thorpe, S. J. (2003). Coding static natural images using spike event
times: Do neurons cooperate? IEEE Transactions on Neural Networks, 15, 11641175.
Pfaffelhuber, E. (1972). Learning and information theory. International Journal of Neuroscience,
3, 83.
Pfister, J.-P., & Gerstner, W. (2006). Triplets of spikes in a model of spike timing-dependent
plasticity. The Journal of Neuroscience, 26(38), 96739682.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal
structure of cortical activity: Properties and behavioral relevance. Journal of Neurophysiology,
79(6), 28572874.
Quastler, H. (1956a). Information theory in psychology: Problems and methods. Glencoe:
Free Press.
Quastler, H. (1956b). Studies of human channel capacity. In E. Cherry (Ed.), Information theory,
3rd London symposium (p. 361). London: Butterworths.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the
neural code. Cambridge: MIT Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed
encoding of information provided by populations of neurons in primate temporal visual cortex.
Experimental Brain Research, 114(1), 149162.
Schneideman, E., Bialek, W., & M. J. II. Berry (2003). Synergy, redundancy, and independence in
population codes. Journal of Neuroscience, 23, 1153911553.
Seri`es, P., Latham, P., & Pouget, A. (2004). Tuning curve sharpening for orientation slectivity:
Coding efficiency and the impact of correlations. Nature Neurosience, 7(10), 11291135.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current
Opinion in Neurobiology, 4(4), 569579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications
for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870
3896.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal,
27, 379423, 623656.
Shaw, G., & Palm, G. (Eds.) (1988). Brain Theory Reprint Volume. Singapore: World Scientific.
Softky, W., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation,
4, 643646.
Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5(2),
239247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with
temporal integration of random EPSPs. Journal of Neuroscience, 13(1), 334350.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spiketiming-dependent synaptic plasticity. Nature Neuroscience, 3, 919926.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition
in the retina. Proceedings of the Royal Society of London Series B; Biological Science,
216(1205), 427459.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons.
Nature Neuroscience, 1(3), 210217.
Tetko, I. V., & Villa, A. E. P. (1992). Fast combinitorial methods to estimate the probability of
complex temporal patterns of spikes. Biological Cybernetics, 76, 397407.
Thorpe, S. J., Guyonneau, R., Guilbaud, N., Allegraud, J.-M., & VanRullen, R. (2004). Spikenet:
Real-time visual processing with one spike per neuron. Neurocomputing, 5860, 857864.
Tononi, G., Sporns, O., & Edelman, G. M. (1992). Reentry and the problem of integrating multiple
cortical areas: Simulation of dynamic integration in the visual system. Cerebral Cortex, 2(4),
310335.
Tononi, G., Sporns, O., & Edelman, G. M. (1994). A measure for brain complexity: Relating
functional segregation and integration in the nervous system. Neurobiology, 91, 50335037.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited
data samples. Neural Computation, 7, 399407.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons
depends on neurotransmitter releaseprobability. Proceedings of the National Academy of
Sciences of the United States of America, 94(2), 719723.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with
frequency-dependent synapses. The Journal of Neuroscience, 20, 15.
Uttley, A. M. (1979). Information Transmission in the Nervous System. London: Academic.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. M. H. J.
(1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events.
Nature, 373, 515518.
van Essen, D. C., Olshausen, B., Anderson, C. H., & Gallant, J. L. (1991). Pattern recognition,
attention and information bottlenecks in the primate visual system. Proceedings of SPIE
Conference on Visual Information Processing: From Neurons to Chips, 1473, 1727.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike
timing-dependent plasticity. The Journal of Neuroscience, 20, 88128821.
Wang, X., Hirsch, J. A., & Sommer, F. T. (2010). Recoding of sensory information across the
retinothalamic synapse. The Journal of Neuroscience, 30, 1356713577.
Yang, H. H., & Amari, S. (1997). Adaptive online learning algorithms for blind separation:
Maximum entropy and minimum mutual information. Neural Computation, 9, 14571482.
Yovits, M. C., Jacobi, G. T., & Goldstein, G. D. (Eds.) (1962). Self-organizing systems. Proceedings
of the Conference on Self-Organizing Systems held on May 22, 23, and 24, 1962 in Chicago,
Illinois. Washington: Spartan Books.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length.
Neural Computation, 7, 549564.
Chapter 13
In this chapter we consider the surprise for a repertoire which represents the interest
in several statistical tests which were performed more or less independently. Then
we consider the surprise obtained from repetitions of the same low-probability
event.
The interest in combining evidence from several statistical tests is not uncommon
in practical situations. One example occurs when several researchers have carried
out statistical studies to evaluate the efficiency of a new drug or a new scientific
hypothesis. In neuroscience, one example is the evaluation of firing coincidence
within a small group of neurons, which was carried out as in the preceding chapter
not only for one combination of sites, but for several different combinations,
leading to a number of more or less independently performed statistical tests on the
same set of neurons. A particular example is the correlation analysis for two neurons
but for different time bins with respect to a stimulus in the so-called JPSTH (Aertsen
et al. 1989). In statistics the kind of analysis that can be carried out in these situations
is sometimes referred to as meta-analysis (Hedges and Olkin 1985; Hartung et al.
2008). The obvious question in such a situation is: How significant is an effect
which was studied in several instances and was found to be significant in some
cases and insignificant in others?
For example, one should not be very surprised if 5 out of 100 significance tests
which had been performed were significant at the 5% level. Of course, if one doesn't
know of the 95 insignificant tests, one still may be impressed by the 5 reported
significant results. This is a severe problem for many practical attempts at meta-analysis, which is more related to the sociology of science and cannot be solved
mathematically.
random variables, and we assume in our first analysis that they are independent. The
statistical interest is expressed by the descriptions \tilde{X}_i\ (i = 1, \ldots, n). The common
interest in all n tests is expressed by the description d = \bigcap_{i=1}^{n} \tilde{X}_i, i.e.,
d(\omega) = \bigcap_{i=1}^{n} [X_i \ge X_i(\omega)]. Our task is to calculate the surprise of d, which we
may call the combined surprise.
First we observe that

N_d(\omega) = -\log_2 \prod_{i=1}^{n} p[X_i \ge X_i(\omega)] = \sum_{i=1}^{n} \big(-\log_2 p[X_i \ge X_i(\omega)]\big) =: \sum_{i=1}^{n} Y_i(\omega).
If the X_i are continuous, each Y_i has the distribution p[Y_i \ge t] = 2^{-t} (t \ge 0), with
density f_i(t) = 2^{-t} \ln 2. For n = 2 we get

p[Y_1 + Y_2 \ge s] = \int_0^s p[Y_2 \ge s-t]\, f_1(t)\, dt + \int_s^\infty f_1(t)\, dt
                   = \int_0^s 2^{-(s-t)} \cdot 2^{-t} \ln 2 \, dt + 2^{-s}
                   = s\, 2^{-s} \ln 2 + 2^{-s}.
In the same way we can compute

p\Big[ \sum_{i=1}^{n} Y_i \ge s \Big] = 2^{-s} \sum_{i=0}^{n-1} \frac{(s \ln 2)^i}{i!}

and therefore

S(s) = s - \log_2\left( \sum_{i=0}^{n-1} \frac{(s \ln 2)^i}{i!} \right).
Based on this calculation there is an easy recipe for calculating the normalized
combined surprise (i.e., the combined significance) of n independent statistical
tests: first we calculate the novelty or naive combined surprise s by summing up
the individual surprises or novelties; then we calculate the combined surprise S(s)
by the above formula.
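A small sketch of this recipe, assuming the individual test results are given as p-values
(so that the individual novelties are Y_i = -log2 p_i); the function name is mine:

import math

def combined_surprise(p_values):
    """Combined surprise of n independent tests with the given p-values:
    s = sum of the individual novelties -log2 p_i, then
    S(s) = s - log2( sum_{i=0}^{n-1} (s ln 2)^i / i! )."""
    n = len(p_values)
    s = sum(-math.log2(p) for p in p_values)        # naive combined surprise
    tail = sum((s * math.log(2)) ** i / math.factorial(i) for i in range(n))
    return s - math.log2(tail)

# five tests, each just significant at the 5% level
print(combined_surprise([0.05] * 5))

For five tests that are each just significant at the 5% level this gives a combined
surprise of about 10 bits, i.e., a combined significance of roughly 10^{-3}.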
not only for lotteries but also in some investigations in brain research where the
experimenters looked for repetitions of unlikely events in the firing of small groups
of neurons (. . . ).
We are interested in sequences of (random) experiments, where at each time-step a
large number of very unlikely outcomes can happen. We model this by considering
random variables X_i^t for t = 1, \ldots, T and i = 1, \ldots, n, where T and n are quite
large integers, t stands for the time of the repetition of the experiment, and X_i^t = 1
signifies that the unlikely event number i occurred at time t. For simplicity we
assume that X_i^t \in \{0,1\} and p[X_i^t = 1] = p for all i and t, and that for t \ne t' the
random vectors (X_i^t)_{i=1,\ldots,n} and (X_i^{t'})_{i=1,\ldots,n} are independent of each other. Then
we count the number of repetitions of each of the unlikely events by

N_i = \sum_{t=1}^{T} X_i^t.

Of course, we assume that p is very small but n should be large enough such
that p \cdot n is not small; it could even be equal to 1 as in the case of the lottery. In
many cases the events X_i^t = 1 and X_j^t = 1 are mutually exclusive for i \ne j; in
some cases, we may assume them to be independent. Most often it is something in
between, i.e., p[X_i^t = 1, X_j^t = 1] \le p[X_i^t = 1] \cdot p[X_j^t = 1].
This also implies that p[N_i \ge k, N_j \ge k] \le p[N_i \ge k] \cdot p[N_j \ge k] and
p[N_i < k, N_j < k] \le p[N_i < k] \cdot p[N_j < k]. Also we can often assume
that X_i^s and X_i^t are independent¹ for s \ne t. In such cases it may be possible to
compute directly the probabilities that such events are repeated 2 or more times in
the sequence (see Exercise 1).
If we cannot make such assumptions, it is still possible to compute, at least
approximately, the probabilities for 3 or more repetitions, if pT is small, e.g., 1/8
(to give a definite value). This can be done by the Poisson approximation.
Let N_i := \sum_{t=1}^{T} X_i^t. We now assume that N_i obeys the Poisson distribution (which
is approximately true when the variables X_i^t are independent over time, but also in
most other practical cases). This means that

p[N_i = k] = e^{-\lambda} \frac{\lambda^k}{k!} \qquad \text{with} \qquad \lambda = p \cdot T.
¹ This is typically not the case for the analysis of spike patterns.
Table 13.1 Surprise S = -log_2(1 - (1 - q_3)^n) of the repetition event N_i \ge 3 for different
values of 1/\lambda = 1/(pT) (rows) and n (columns)

1/\lambda   n = 1,000   n = 10,000   n = 20     n = 100
8       1.9624      0.0764       7.4019     5.0971
10      2.8029      0.3455       8.3389     6.0259
15      4.4456      1.4155       10.0564    7.7371
20      5.6534      2.4594       11.2831    8.9624
30      7.3802      4.0972       13.0198    10.6982
50      9.5733      6.2599       15.2163    12.8944
100     12.5617     9.2408       18.2054    15.8835

Table 13.2 The same for the repetition event N_i \ge 4 (k = 4)

1/\lambda   n = 1,000   n = 10,000   n = 20     n = 100
8       6.7698      3.5071       12.4071    10.0857
10      8.0249      4.7278       13.666     11.3443
15      10.3242     7.0073       15.9675    13.6456
20      11.9647     8.6444       17.6084    15.2865
30      14.2852     10.9636      19.929     17.6071
50      17.2177     13.8958      22.8615    20.5396
100     21.2061     17.8842      26.85      24.5281
Writing

q_k := e^{-\lambda} \sum_{j=k}^{\infty} \frac{\lambda^j}{j!}

for the probability that a particular event is repeated at least k times, the novelty of the
repetition event N_i \ge k is -\log_2(q_k). For the surprise we have to take all n events into
account:

p\Big[ \bigcup_{i=1}^{n} (N_i \ge k) \Big] = 1 - p\Big[ \bigcap_{i=1}^{n} (N_i < k) \Big] \approx 1 - \prod_{i=1}^{n} p[N_i < k] = 1 - (1 - q_k)^n.

Thus

S = -\log_2\big(1 - (1 - q_k)^n\big).
We have tabulated some values for the novelty and the surprise of the repetition
event N_i \ge k for different values of \lambda = p \cdot T and n. We show two tables for
k = 3 and k = 4 (Tables 13.1 and 13.2).
Another interesting range is around the parameters of the lottery. Here we assume
that p \cdot n = 1 and consider different values of T and n (Table 13.3).
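The tabulated values can be reproduced by a few lines of this kind (a sketch under the
Poisson approximation, with q_k = P[Poisson(lambda) >= k] and lambda = pT; the function
name is mine):

import math

def repetition_surprise(lam, k, n):
    """Surprise S = -log2(1 - (1 - q_k)^n) of seeing some event repeated at
    least k times, with q_k = P[Poisson(lam) >= k] and lam = p*T."""
    q_k = 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return -math.log2(1.0 - (1.0 - q_k) ** n)

# first entries of Table 13.1 (k = 3, 1/lambda = 8)
print(repetition_surprise(1 / 8, 3, 1000), repetition_surprise(1 / 8, 3, 10000))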
Table 13.3

10^8:    45.77     41.0151    35.8042    25.8385    14.118
10^10:   59.0577   54.3029    49.092     39.1262    27.4055
References
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal
firing correlation: Modulation of effective connectivity. Journal of Neurophysiology, 61(5),
900917.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey
primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84,
17701780.
Hartung, J., Knapp, G., & Sinha, B. (2008). Statistical meta-analysis with applications. Wiley
Series in Probability and Statistics. New York: Wiley.
Hedges, L., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic.
Chapter 14
Entropy in Physics
Now the idea is that an ordered velocity distribution means a stronger restriction on
the (micro-)state \omega \in \Omega. In terms of the velocity-distribution repertoire, this leads
quite naturally to the definition: the velocity distribution (k_1, \ldots, k_n) is more ordered
than (l_1, \ldots, l_n) if A_{k_1 \ldots k_n} is less probable than A_{l_1 \ldots l_n}, i.e., if p(A_{k_1 \ldots k_n}) \le p(A_{l_1 \ldots l_n}).
Thus order is now defined as improbability with respect to a certain physical
repertoire or the corresponding description d. This idea leads us directly
to Boltzmann's formula. Again an additivity requirement is used to motivate the
logarithm and we obtain the negentropy

H(\omega) = -k \ln p(d(\omega)) = k (\ln 2)\, N(\omega).

The positive factor k is determined by thermodynamical considerations. (It is not
dimensionless: k = 1.38 \cdot 10^{-23}\ \mathrm{J/K}.)
Actually, a more extreme definition of entropy could also be envisaged in this
context, which is based on the use of surprise instead of novelty and produces the
same qualitative prediction, but of course differs quantitatively:

H(\omega) = k (\ln 2)\, S(\omega).

It may be interesting to consider the quantitative differences of these two
entropies in more detail. However, I am not enough of a physicist to do this.
There is a problem that we have not considered yet: how is the probability p on the
physical state space determined?
In classical mechanics there is indeed a unique probability measure p on the state
space \Omega, the so-called Liouville measure, which is distinguished by the fact that it
is invariant under the classical (Hamiltonian) dynamics. If this Liouville measure
is used for the calculation of the probabilities p for the two problems mentioned
above, it leads to the classical results that have been obtained by Boltzmann through
combinatorial considerations (his so-called complexion probabilities).
In this way, one can, for a classical mechanical model of a natural phenomenon,
calculate the entropy value for a specific description d(\omega) of a state \omega \in \Omega. This
understanding of entropy as a special case of information or novelty, namely, for
a particular description d that reflects the interest of the physicist or engineer, is
slightly different from the conventional modern point of view as expressed for example by Brillouin (1962), but closer to the classical point of view [for more details, see
Palm (1985)]. It is also related to the idea of basing thermodynamical predictions
and even other much broader applications on the principle of maximal ignorance,
maximization of entropy or infomax as it is called today. In the experimental and
engineering context, this mode has by and large worked reasonably well up to the
present day: empirically, the second law of thermodynamics has always held.
On the other hand, it has been shown that the second law cannot strictly be true
in the framework of Hamiltonian dynamics (Poincare 1890). After his own attempts
to prove the second law, Boltzmann also accepted the counterarguments (a good
collection of historical papers on thermodynamics can be found in Brush (1966))
and finally arrived at the conclusion that the second law does not hold strictly but
The more theoretical entropies are easily described from the point of view of
information theory:
1. The entropy used in statistical mechanics for the derivation of theoretical
distributions is (apart from the sign) what we have called the information gain
in Definition 11.6, defined in terms of a density function f, usually with respect
to the Lebesgue measure on the phase-space R^n.
2. The dynamical entropy used in ergodic theory for the classification of dynamical
systems is essentially what we have called the information rate in Definition 6.6
on page 83.
Neither entropy is used to prove (or at least understand theoretically) the
second law of thermodynamics from the underlying (micro-)dynamics. In fact, it
has turned out to be very hard to arrive at a proper theoretical understanding of the
second law at all. After all, the second law is an empirical law.
In the following we shall try to consider another version of physical entropy
which is closer to the phenomenological one and to the classical considerations
around the second law of thermodynamics. In order to formulate such an entropy,
we have to rely on a theoretical model of the dynamics that are studied in
thermodynamical experiments. Classically, one considers Hamiltonian dynamics on
a so-called state space \Omega. We shall consider a dynamical system, i.e., (\Omega, \Sigma, p; \varphi),
where (\Omega, \Sigma, p) is a probability space and \varphi: \Omega \to \Omega is a mapping that describes
the evolution of a state x in time: in one time unit x moves to \varphi(x), then to \varphi^2(x),
and so forth. The probability p should be the unique probability on \Omega that makes
sense for the physicist (see above) and that is invariant under the motion mapping \varphi.
In this setting the physicist is normally not capable of knowing exactly the
point x in state space (for a gas in a box it would mean knowing all positions
and velocities of all molecules). Instead he knows the values of certain macro-observables,
which are certain real-valued functions f_1, \ldots, f_n on the state space
(for example, pressure, temperature, energy). And he wants to describe or even
predict the evolution of the system over time in terms of these macro-observables.
In our language, this means that the physicist looks at the system through a
certain description d, which is given by d := \tilde{f} for a single observable f
(see Definition 2.16 on page 31), or by d_m := \tilde{f}_1 \cap \ldots \cap \tilde{f}_m for m observables
that he observes simultaneously.
Now it is straightforward to define the entropy of a state x by

H(x) = -(N \circ d)(x)

as above.
There is one additional point that should be mentioned: usually the macro-observables
are not instantaneous measurements on the system, but rather time
averages. If g: \Omega \to \mathbb{R} is an (instantaneous) observable, we define the time
average as

g_n(x) := \frac{1}{n} \sum_{i=0}^{n-1} g(\varphi^i(x)).
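For a concrete numerical illustration (a sketch of my own, not an example from the book)
one can take the logistic map \varphi(x) = 4x(1-x) on [0,1], a standard ergodic system, and
watch the time averages of an observable concentrate around their limiting value as n grows:

import numpy as np

def time_average(g, x, n):
    """g_n(x) = (1/n) * sum_{i=0}^{n-1} g(phi^i(x)) for the logistic map
    phi(x) = 4 x (1 - x), which is ergodic on [0, 1]."""
    total = 0.0
    for _ in range(n):
        total += g(x)
        x = 4.0 * x * (1.0 - x)
    return total / n

rng = np.random.default_rng(1)

def g(x):                        # an instantaneous observable
    return x

short = [time_average(g, x0, 10) for x0 in rng.random(2000)]
long_ = [time_average(g, x0, 1000) for x0 in rng.random(2000)]
# the spread of g_n around its limiting value G shrinks as n grows
print(np.std(short), np.std(long_))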
We may now observe that in ergodic dynamical systems the averages g_n converge
to a constant function G, at least in probability (...). This means that for any \varepsilon > 0
we can get

p[|g_n - G| > \varepsilon] < \varepsilon

for sufficiently large n.
For the set M := \{x \in \Omega : |g_n(x) - G| \le \varepsilon\}, we have p(M) > 1 - \varepsilon. Next, we
consider the observable f = g_n and a fixed measurement accuracy \delta > 0.
Then

p(f_\delta(x)) = p\{y \in \Omega : |g_n(y) - g_n(x)| < \delta\} \ge p(M) > 1 - \varepsilon

for x \in M and \varepsilon < \delta/2.
If the average novelty N(g_\delta) is finite for the original instantaneous observable g,
one can easily find a constant c such that N((g_n)_\delta)(x) < c for almost every x \in \Omega.
Therefore the average novelty of f = g_n can be estimated as

N(f_\delta) \le -(1 - \varepsilon) \log(1 - \varepsilon) + \varepsilon c,

which goes to zero as \varepsilon \to 0.
This means that for sufficiently large n the entropy for the average observable
f = g_n will be almost zero for any fixed measuring accuracy \delta.
If we start the system in an initial condition where we know that .f r/, this
corresponds to the negative entropy value
H.f r/ D log pjf rj < :
If we measure f again, averaging over some large number n of time steps, we will
almost always find f G and the average entropy we will get for this measurement
will be close to zero since
pjf Gj < > 1 :
In practive large n often means times below 1 s, and so this shows that we have
to expect the entropy to increase in time on the scale of seconds.
This argument is a theoretical underpinning for the second law of thermodynamics and its flavor of vagueness combined with triviality is typical for all such
arguments. Still it is in my opinion the best way of showing that entropy is a quantity
defined by the macro-observables for the system that is increasing or constant
over time, and therefore makes certain developments (the entropy-increasing ones)
irreversible. This phenomenon has been observed in connection with heat engines,
where the theoretical description of the state space would be a product of several
spaces i describing several subsystems, and the dynamics could be changed
by connecting or disconnecting some of these subsystems. In any case, such
a system can be led through irreversible processes and for certain procedural
201
pf ' k D r i jf D r
r2B
pf D r
pf 2 B
< pf D r
0
pf 2 B
pi D
:0
for r i 2 B and
otherwise.
This means that we can describe the probability distribution p k given any observation .f 2 B/ by means of a transition matrix
Mijk WD p.f ' k D r j jf D r i /;
i.e., we can describe our original system in terms of a Markov process.
With this notation, we may now look at the development of the entropy over time.
At our initial time t D 0 we have the observation f D ri and the entropy
H 0 D H.f D r i / D N .f D r i /:
202
14 Entropy in Physics
N
X
pf ' k D r j jf D r i log pf D r j
j D1
N
X
Mijk log pj :
j D1
H k D S.ei M k ; p/;
where ei D .0; : : : ; 0; 1; 0; : : : ; 0/ is the i -th unit vector. Here we use the notation
of Definition 3.1 in the following way: Two probability vectors p; q 2 0; 1n
considered as distributions for a random variable X with R.X / D f1; : : : ; ng. Then
S.q; p/ WD Spq .X /
It should be clear that we can only speak of probabilities for the values of the
observables k time steps ahead .k 1/, even when we initially (at t D 0) knew the
values of all the observables fi .
This will usually be the case when
1. The observables .f1 ; : : : ; fn / D f do not determine the state x 2 , and
2. The time of one time step is not too small.
Furthermore, in this argument, the time of one time step should also not be too
large, i.e., small compared to the relaxation time, because otherwise the observables
f1 ; : : : ; fn will be almost constant already after one single time step (as in the
argument of the last section).
With time steps in this medium range, one may try a further simplification. This
simplification will, however, change the dynamics of the system, i.e., whereas the
transition matrices M k defined above, still describe the same dynamical model,
only viewed through a coarse repertoire (namely through the description of states
x in terms of the observables .f1 ; : : : ; fn /, we will now define a kind of averaged
or coarse-grained dynamics. We simply assume that the transition probabilities are
given by a first-order Markov process, i.e., defining
203
n
X
i D1
pi Nij D
n
X
pf D r i p.f ' D r j jf D r i /
i D1
D p.f ' D rj / D pf D r j D pj ;
since p is '-invariant. I.e. p is N -invariant.
Defining p k as the probability vector p 0 N k for an initial vector p 0 that
corresponds to the initial observation f 2 B by
pi0 D
pf D r i
pf 2 B
for r i 2 B and pi0 D 0 otherwise, it is well known from ergodic theory (e.g., Walters
(1982)) that p k ! p, if p is the only N -invariant probability vector y. Thus
I.p k / ! I.p/ and S.p k ; p/ ! S.p; p/ D I.p/. Since S.p k ; p/ I.p k /
and I.p/ will usually be small for typical thermodynamical measurements f , one
can hope that S.p k ; p/ will decrease towards I.p/ and thus H k D S.p k ; p/
will increase. Unfortunately this is not the case in general. As an example one may
consider the start-vector p 0 D ei , where pi pj for every j D 1; : : : ; N . Then
usually H i < H 0 .
In this situation, Schlogl (1966) has suggested the information gain G.p k ; p/ of
k
p with respect to p, instead of the average entropy H k D S.p k ; p/, as a measure
for entropy. In the situation mentioned above, then obviously
G.p k ; p/ D S.p k ; p/ I.p k / ! 0
and it can indeed be shown that G.p k ; p/ decreases toward 0 in general.
Proposition 14.1
0 G.p kC1 ; p/ G.p k ; p/
for every k 2 N.
Proof. The positivity of G.q; p/ has been shown in Proposition 3.1. The main
inequality is again shown by means of the inequality log x x 1 and some
straightforward calculations using pN D p.
t
u
204
14 Entropy in Physics
References
Brillouin, L. (1962). Science and information theory (2nd ed.). New York: Academic.
Brush, S. G. (1966). Kinetic theory, Vol. 2, Irreversible processes. New York: Pergamon Press.
Kolmogorov, A. N. (1958). A new invariant for transitive dynamical systems. Doklady Akademii
nauk SSSR, 119, 861864.
Kolmogorov, A. N. (1959). Entropy per unit time as a metric invariant of automorphism. Doklady
Akademii nauk SSSR, 124, 754755.
Kuppers, B.-O. (1986). Der Ursprung biologischer Information - Zur Naturphilosophie der
Lebensentstehung. Munchen: Piper.
Palm, G. (1985). Information und entropie. In H. Hesse (Ed.), Natur und Wissenschaft. Tubingen:
Konkursbuch Tubingen.
Poincare, H. (1890). Sur le probl`eme des trois corps et les e quations de la dynamique. Acta
Matematica XIII, 13, 1270.
Schlogl, F. (1966). Zur statistischen Theorie der Entropieproduktion in nicht abgeschlossenen
Systemen. Zeitschrift fur Physik A, 191(1), 8190.
Schlogl, F. (1971a). Fluctuations in thermodynamic non equilibrium states. Zeitschrift fur Physik
A, 244, 199205.
Schlogl, F. (1971b). On stability of steady states. Zeitschrift fur Physik A, 243(4), 303310.
Shields, P. (1973). The theory of Bernoulli Shifts. Chicago: The University of Chicago Press.
Smorodinsky, M. (1971). Ergodic theory, entropy (Vol. 214). New York: Springer.
Walters, P. (1982). An introduction to ergodic theory. New York: Springer.
Part VI
Chapter 15
In this part we want to condense the new mathematical ideas and structures that
have been introduced so far into a mathematical theory, which can be put in the
framework of lattice theory. In the next chapter we want to get a better understanding
of the order structure (defined in Definition 10.3) on the set of all covers. For this
purpose we now introduce a number of basic concepts concerning order and lattices
(Birkhoff 1967).
Definition 15.2.
i) Any reflexive, symmetric, and transitive relation is called an equivalence. For
equivalences, we often use the symbol .
207
208
ii) Any reflexive, antisymmetric, and transitive relation is called a partial order
(p.o.) relation.
iii) Any reflexive and transitive relation is called an ordering.
iv) Any partial order relation that is connecting is called a total order relation.
v) Any transitive, irreflexive relation is called a strict order relation.
Proposition 15.1. Every strict order relation is strictly antisymmetric.
Proof. Assuming x r y and y r x implies x r x which contradicts irreflexivity.
t
u
t
u
Usually, S has much more members than , because many elements x of S have
the same d.x/. More exactly: d.x/ D d.y/ if and only if x y. The set is
also referred to as D S= (reads: S modulo tilde). One says that every x 2 S is
identified with its equivalence class e
x and writes x D y= if and only if e
x De
y.
Given a partition we can define an equivalence by x y if and only if x and y
are in the same element of (or equivalently d .x/ D d .y/). Thus there is a oneto-one correspondence between partitions and equivalence relations. Furthermore
any mapping f W S ! T (T any set) gives rise to an equivalence relation f , namely
x f y if and only if f .x/ D f .y/. The corresponding partition is given by the
e.
complete description f
This chapter is a brief introduction to elementary order theory and provides
a slight generalization for some concepts from the most frequently considered
partially ordered sets (p.o.-sets) to what we call orderings.
Before we proceed with a more detailed discussion of order and orderings, here
are some examples.
Example 15.1. 1) The usual relation on Z (the integer numbers) is a total order
relation.
2) The relation on Zn defined by .x1 ; : : : ; xn / .y1 ; : : : ; yn / if and only if
xi i yi for every i D 1; : : : ; n is a partial order relation.
3) The usual relation between sets is a partial order relation on P.M / (the sets
of all subsets of M ).
4) The relation between real functions, defined by f g if and only if
f .x/ g.x/ for every x 2 X is a partial order relation on the set of all realvalued functions f W X ! R on a set X .
5) The relation between descriptions as introduced in Chap. 2 is also a partial
order relation.
t
u
209
If a partial order is not a total order relation, it can happen that two elements x
and y are uncomparable, i.e., neither x y nor x y. In these cases we need a
better definition for the minimum or the maximum of the two. The more general
terms are the meet x^y and the join x_y of x and y, which are used in analogy
to sets, where the meet x ^ y is the intersection and the join x _ y is the union of
the two sets x and y.
Intuitively, x ^ y is the largest element that is x and y and conversely x _ y
is the smallest element that is x and y. We will define this in the even more
general settings of orderings (Definition 15.5).
Definition 15.4. Let be an ordering on a set S . We define two relations < and
as follows:
i) x < y if and only if x y and not y x.
ii) x y if and only if x y and y x.
Proposition 15.3. Let be an ordering on S .
i) The relation < is a strict order relation.
ii) The relation is an equivalence.
iii) The relation on S= is a partial order relation.
Proof. (i) < is obviously irreflexive. We have to show transitivity: If x < y and
y < z, then x < z. To show that not z x, we assume z x and with x y
obtain z y, which contradicts y < z.
(ii) is obviously reflexive, transitive, and symmetric.
(iii) We have to show that is antisymmetric on the set S= of equivalence classes.
If x y and y x, then x y by definition. Thus x and y belong to the
same equivalence class, i.e., x D y in S= .
t
u
Definition 15.5. Let be an ordering on a set S and M S .
i) x is called minimal in M , if x 2 M and for every y 2 M , y x implies
y x.
ii) x is called maximal in M , if x 2 M and for every y 2 M , x y implies
y x.
iii) x is called a lower bound for M , if x y for every y 2 M .
iv) x is called an upper bound for M , if y x for every y 2 M .
v) x is called a largest element of M (also written x D max.M /), if x 2 M and
x is an upper bound for M .
vi) x is called a smallest element of M (also written x D min.M /), if x 2 M and
x is a lower bound for M .
vii) If the set U of all upper bounds of M has a smallest element x, this is called
the smallest upper bound (s.u.b.) of M and written as x D sup.M / D _.M /
(supremum).
210
viii) If the set L of all lower bounds of M has a largest element x, this is called
the largest lower bound (l.l.b.) of M and written as x D inf.M / D ^ .M /
(infimum).
ix) If M D fx; yg then sup.M / is written as x _ y and called the join and inf.M /
is written as x ^ y and called the meet of x and y.
Remarks:
Largest and smallest elements are unique up to equivalence.
Two minimal elements of a set M are either equivalent or incomparable.
If all minimal elements of a set M are equivalent, then they all are smallest
elements.
If two minimal elements of a set M are not equivalent, then the set M has no
smallest elements.
For a partial order the largest and the smallest element of M both are unique.
For a total order, every maximal element is the largest element of M and every
minimal element is the smallest element of M .
Proposition 15.4. A smallest element of a set M is a largest lower bound of M and
a largest element of M is a smallest upper bound of M .
Proof. Let x be the smallest element of M , i.e., x 2 M and y x for every y 2 M .
Thus x is a lower bound for M . If z is a lower bound for M , then z x 2 M . Thus
x is the largest lower bound for M .
t
u
Remark: A set M without a smallest (or largest) element may still have a l.l.b. (or
s.u.b.) outside M .
Proposition 15.5. Let be an ordering on S and M S .
i) x is minimal in M if and only if there is no y 2 M satisfying y < x.
ii) x is maximal in M if and only if there is no y 2 M satisfying x < y.
Proof. This follows directly from the definition.
t
u
The following (well-known) example shows that even for a total order relation
an infinite set M which has an upper bound may neither have a maximal element,
nor a largest element, nor a smallest upper bound. Consider
the usual ordering on
p
Q (the rational numbers) and M D fx 2 QW 0 x 2g.
Proposition 15.6. Let be an ordering on S and M S . If M is finite, then M
has maximal and minimal elements.
Proof. We can start with any element x 2 M . Either x is already maximal or there
is an y 2 M with y x. Now we proceed with y until we arrive at a maximal
element. The procedure ends in finitely many steps because M is finite.
t
u
211
This does not imply that M has a largest or a smallest element, not even that it
has an upper or a lower bound.
For example, let D f1; : : : ; 6g and S D .P./ n ; /. Consider
M D ff1g; f3g; f5g; f1; 2; 3g; f2; 3; 4g; f3; 4; 5g;
f4; 5; 6g; f1; 3g; f2; 4g; f3; 5g; f4; 6g; f2; 6gg
The minimal elements of M are
f1g; f3g; f5g; f2; 4g; f4; 6g; f2; 6g:
Thus M has no smallest element, but it has a lower bound in S , namely ;.
The maximal elements of M are
f1; 2; 3g; f2; 3; 4g; f3; 4; 5g; f4; 5; 6g; f2; 6g:
Note that f2; 6g is both maximal and minimal in M . M has no largest element and
no upper bound in S (because S ).
In the following we want to investigate the algebraic structure of the operations
join _ and meet ^ introduced in Definition 15.5.(ix). Since these are in general
not uniquely defined, but only up to equivalence, it is more convenient to consider
the equivalence classes S= for an ordering on S . On S= , i.e., if we identify
equivalent elements of S , the ordering becomes a partial order relation and join
and meet are unique (if they exist).
Thus in the following we will always assume that we have a partial order relation
on a set S .
Proposition 15.7. Let .Mi /i 2I be arbitrary many subsets of S and let xi D _Mi
for every i 2 I . Then _fxi W i 2 I g D _.[i 2I Mi /. Let yi D ^Mi for every i 2 I .
Then ^fyi W i 2 I g D ^.[i 2I Mi /.
The equation implies that the left hand side exists if the right hand side exists and
vice versa.
Proof.
y is an upper bound of fxi W i 2 I g , y xi for every i 2 I
, y x for every x 2 Mi for every i 2 I
, y x for every x 2 [i 2I Mi
, y is an upper bound of [i 2I Mi :
Thus fxi W i 2 I g and [i 2I Mi have the same upper bounds and therefore the same
smallest upper bounds.
t
u
212
Definition 15.6. Let be a partial order relation on a set S . If for any two x; y 2 S
both x ^ y and x _ y exists, then .S; ; ^; _/ is called a lattice.
Proposition 15.8. Let .S; ; ^; _/ be a lattice. Then the following is true:
i)
ii)
iii)
iv)
x ^ y D x , x _ y D y , x y.
.x ^ y/ _ x D x ^ .y _ x/ D x for any x; y 2 S .
x ^ y D y ^ x and x _ y D y _ x for any x; y 2 S .
x ^ x D x and x _ x D x for any x 2 S .
Proof. (i) If x y, then x is the smallest element of fx; yg. Therefore x is the
largest lower bound of fx; yg, i.e., x D x ^ y. Similarly y is the largest
element of fx; yg and therefore y D x _ y. Conversely, x D x ^ y y
and y D x _ y x.
(ii) x x ^ y and therefore (by (i)) x D .x ^ y/ _ x, x x _ y and therefore
(by (i)) x D .x _ y/ ^ x.
(iii) obvious from the definition
(iv) x is the smallest and largest element of fx; xg D fxg.
t
u
From each lattice operation ^ and _ we can retrieve the partial order by
defining x y if and only if x ^ y D x (, x _ y D y).
Proposition 15.9. Both lattice operations are associative and commutative and for
any finite set M D fx1 ; : : : ; xn g S we have _M D x1 _ x2 _ : : : _ xn (any order,
any bracketing) and ^M D x1 ^ x2 ^ : : : ^ xn (any order, any bracketing).
Proof. By induction on n:
n D 1 W _fx1 g D x1 D x1 _ x1 .
n D 2 W _fx1 ; x2 g D x1 _ x2 D x2 _ x1 by definition.
n ! n C 1 W Let M D fx1 ; : : : ; xnC1 g and Mi D M n fxi g. We have to show
_M D .x1 _ : : : _
xi _ : : : _ xnC1 / _ xi , where the first
bracket contains all xj (j D 1; : : : ; n C 1) except xi in any
order and bracketing. By the induction assumption .x1 _ : : : _
x
i _ : : : _ xnC1 / D _Mi . Since M D Mi [ fxi g we get from
Proposition 15.7
_M D _f_Mi ; _fxi gg D ._Mi / _ xi :
t
u
Remark:
From this it follows that a finite lattice S always has a maximal element, namely _S
and a minimal element, namely ^S . Usually, these are called 1 and 0, respectively,
i.e., 1 D _S and 0 D ^S .
0 and 1 are also used to name the smallest and largest element of a poset.
213
All suprema exist if M is finite, but the inequalities also hold for infinite sets.
214
Proof. (i) c d means c.!/ d.!/ for every ! 2 . Thus is a partial order
just like the set relation .
(ii) .c [ d /.!/ c.!/ for every ! 2 . So c [ d c and also c [ d d .
Conversely, b.!/ c.!/ and b.!/ d.!/ for every ! 2 implies b.!/
c.!/ [ d.!/ for every ! 2 .
(iii) Similar to (ii).
t
u
Proposition 15.13. .D; / has a smallest and a largest element.
i) The smallest element is called 0 and defined by 0.!/ D f!g for ! 2 .
ii) The largest element is called 1 and defined by 1.!/ D for ! 2 .
Proof. obvious.
t
u
t
u
Reference
215
are presented in a slightly more general fashion than usual, because we cannot
assume our orderings to be antisymmetric. Sect. 15.2 contains a first application of
these ideas, recasting the results of Chaps. 2 and 3 in the lattice D of descriptions.
Reference
Birkhoff, G. (1967). Lattice theory (3rd ed.). Providence: American Mathematical Society.
Chapter 16
The set of all repertoires actually has an interesting structure, when we look at
a repertoire in terms of its proper descriptions D./. This means that we should
consider two repertoires to be essentially the same if they have the same proper
descriptions, or we should say that is more refined than if the proper descriptions
in are contained in those in .
This idea leads to two almost equally reasonable definitions for an ordering of
repertoires, which we will call 1 and 2 . Then we will analyze these two orderings
and a third one in more detail.
217
218
219
220
t
u
221
3
D
f D f
f
t
u
As we will see in the next section, the set R of all repertoires turns out to be a
lattice for ordering 1 , but not for ordering 2 . Similarly, the set C of all covers turns
out to be a lattice for ordering 3 . A lattice is quite a strong order structure. This
means that now the two orderings i (i D 1; 3) are antisymmetric and for any two
covers and we have a join _i and a meet ^i as defined in the last section.
222
[ 1 and [ 1 .
If a repertoire satisfies 1 and 1 , then 1 [ .
[ \ [ 1 and [ \ [ 1 .
If a repertoire satisfies 1 and 1 , then 1 [ \ [ .
Proof. Let c 2 D. /. By definition of 2 there are a 2 D./ and b 2 D./ with
a c and b c, and thus c.!/ a.!/ \ b.!/.
t
u
223
g1
<2
g2
b
2
Fig. 16.2 Repertoires illustrating that 2 is not a lattice. Both 1 and 2 are minimally larger than
and . Thus there is no unique meet for 2
224
and
\ D . /
\ 3 and \ 3 .
If a cover satisfies 3 and 3 , then 3 \ .
. [ / 3 and . [ / 3 .
If a cover satisfies 3 and 3 , then 3 [ .
225
In the following we will write for the ordering 1 , ^ for the meet w.r.t. 1 ,
_ for the join with respect to 1 and 4 for the ordering 3 , f for the meet,
g for the join w.r.t 3 .
In Definition 10.4 we defined the sets C./, R./, T./, F./ and P./. To
this we add Ff ./ WD f 2 F./W finiteg. These sets can be made into lattices by
considering the proper orderings and using the proper equivalence relations. This
leads to the definition of five lattices.1
Definition 16.6. For a probability space .; ; p/, we define
C WD C./= 3 ; R WD R./= ; T WD T./; P WD P./= ; F WD F./:
i)
ii)
iii)
iv)
v)
vi)
The lattice property of C and R follows from Propositions 16.11 and 16.13. For T; P, and Ff it
will be shown in the next section. F is not a lattice, in general.
226
t
u
t
u
t
u
227
B; C 2 [ ) B \ C 2 [ :
t
u
Proposition 16.19. For tight covers , , and the following are equivalent:
i)
ii)
iii)
iv)
t
u
228
16.6 Exercises
1) Let D f0; 1g. Determine the lattices C; R; T; P, and F.
2) Let D 0; 1 with the Borel-sets and p the equidistribution. Let x D
fA 2 W p.A/ < xg, x D fA 2 W p.A/ xg, and x D f.a; b/W b a < xg.
What is the order relation between x and y , between x and y , and between
x and y for arbitrary x and y?
3) .; ; p/ as in Exercise 2). What are the minimal and the maximal elements in
C; R; T; P, and F?
4) Let D f1; 2; 3; 4g and consider C, R, T, P, and F. Is there always a smallest
and a largest element in these p.o.-sets? If yes, what is it? If these are left out,
what are the minimal and maximal elements in the remaining p.o.-sets?
5) Is (Definition 9.6) always the second-smallest element in the lattices C, R, T,
P, and F?
References
Adler, R. L., Konheim, A. G., & McAndrew, M. H. (1965). Topological entropy. Transactions of
the American Mathematical Society, 114, 309319.
Goodman, T. N. T. (1971). Relating topological entropy and measure entropy. Bulletin of the
London Mathematical Society, 3, 176180.
Goodwyn, L. W. (1969). Topological entropy bounds measure-theoretic entropy. Proceedings of
the American Mathematical Society, 23, 679688.
Hopf, E. (1937). Ergodentheorie. Berlin: Springer. Reprinted by Chelsea Publishing Co., New York
edition.
Keynes, H. B., & Robertson, J. (1969). Generators for topological entropy and expansivness.
Mathematical Systems Theory, 3, 5159.
Krieger, W. (1970). On entropy and generators of measure-preserving transformations. Transactions of the American Mathematical Society, 149, 453464.
Palm, G. (1976a). A common generalization of topological and measure-theoretic entropy.
Asterisque, 40, 159165.
Palm, G. (1976b). Entropie und Erzeuer in dynamischen Verbanden. Z. Wahrscheinlichkeitstheorie
verw. Geb., 36, 2745.
Walters, P. (1982). An introduction to ergodic theory. Berlin, Heidelberg, New York: Springer.
Chapter 17
229
230
consideration of open covers and also of covers built from other types of sets in our
definition. We again start with a probability space .; ; p/.
Definition 17.1. Let be a subset of (or even of P./).
i) is called \ -stable1 , if A \ B 2 for any A; B 2 ,
ii) is called [ -stable, if A [ B 2 for any A; B 2 .
Definition 17.2. Let be a \-stable and [-stable subset of containing .
S
i) A subset of is called an -cover if D .
ii) The set of all -covers is called C./.
On C./ we consider the ordering 4 or 3 as defined in the last section, and the
corresponding equivalence relation 3 .
.C./; 4/ is an ordering and .C./= 3 ; 4/ is a p.o. set. We define C WD
C./= 3 . The lattice C defined in Definition 16.6 was C .
Proposition 17.1. .C ; 4/ is a lattice with g D and f D [ .
t
u
t
u
f!gW ! 2 :
t
u
Proof. Obvious.
231
t
u
t
u
Consider
.; ; p/ D D and D f1; 3; 5g; f2; 4; 6g , D f1; 2; 3; 4g; f3; 4;
5; 6g . Then f D [ and N . f / D 1. So we have f 4 and
N . f / > N ./ D log2 32 .
Consider .; ; p/ D E4 and D f1; 2; 3g; f1; 2; 4g , D f2; 3; 4g; f1; 3; 4g .
Then g D D f2; 3g; f2; 4g; f1; 3g; f1; 4g and N . g / D 1, but
t
u
N ./ C N ./ D 2 log2 43 < 1.
232
Proposition 17.7.
i) .Ff ; 4/ has a largest element, namely ! D f!gW ! 2 , if is finite.
ii) .Ff ; 4/ has a smallest element, namely fg.
t
u
Proof. Obvious.
Proposition 17.8. .Ff ; 4/ is a distributive lattice. It is a sublattice of .C; 4/.
and _ D [ :
t
u
Example 17.2.
I is not monotonic on R:
Consider .; ; p/ D D and
D f1; 3; 5g; f2; 4; 6g and D f1; 2g; f3; 4g; f5; 6g :
Here _ D [ , but I./ D log2 3 > I. _ / D 1.
I is not subadditive on R:
Consider .; ; p/ D D 2 and the corresponding random variables X1 ; X2 .
Now take
D X1 D i W i D 1; : : : ; 6 [ X2 D 1; X2 1 and
D X2 D i W i D 1; : : : ; 6 [ X1 D 1; X1 1 :
Then I./ D I./ D 16 log 6 C 56 log 65 D log 6 56 log 5 and I. _ / D log 6.
t
u
Observe that I. _ / I./ I./ D log 6 C 53 log 5 > 0.
Proposition 17.10.
i) .R; / has a largest element, namely ! WD f!gW ! 2 , if is countable,
i.e., p.!/ 0 for every ! 2 .
ii) .R; / has a smallest element, namely fg.
Proof. Obvious.
t
u
233
N ./ D N .d /;
S./ D S.d /:
t
u
N . / N ./ N ./ D
1
8
1 1
log2 D log2 5 1 > 0
2 2
5
2
t
u
Proposition 17.15.
i) .T; / has a largest element, namely ! D ff!gW ! 2 g, if is countable.
ii) .T; / has a smallest element, namely fg.
Proof. Obvious.
t
u
234
D f1; 2g; f3; 4g ;
D f1; 3g; f2; 4g ;
D f1; 4g; f2; 3g :
Then
. g / f D f1g; f2g; f3g; f4g f D
. f / g . f / D fg g fg D fg
t
u
Proposition 17.17. .T; / is a p.o. subset, but not a sublattice of .D; / and
of .R; /.
Proof. The join _ in R is [ which usually is not tight. So the join of and
in T is . [ /\ . / (see Proposition 16.19) which is somewhat larger. The
meet ^ in R and T is the same.
Conversely, the join in T is the same as in D, whereas the meet in D is the union
of descriptions which does not correspond to the meet in T.
t
u
2
3
References
235
Proof. The join and meet in P are the same as in T, namely _ D and
^ D [ \ [ . In F they are g D . /f D (for partitions ; ) and
f D . [ /f . So the join is the same, but the meet is different.
t
u
Proposition 17.19. On .P; / both I and N are monotonic and subadditive.
Proof. N D I and I is monotonic and subadditive on T (Proposition 17.14).
t
u
Like T, also P is not a distributive lattice as can be seen from Example 17.4.
17.7 Exercises
1) For #./ D 2; 3, and 4 determine the lattices P, T, F, and R.
2) For #./ D 5; 6 determine the lattices P and F.
3) Find an example for small #./ that shows that T and P are not distributive
lattices.
4) A complement Ac of a lattice element A satisfies A ^ Ac D 0 (the smallest
element) and A _ Ac D 1 (the largest element).
For each of the lattices P, T, Ff , and R find examples for elements A that do
and that do not have a complement.
5) Are there examples of these lattices that have a unique second-largest or secondsmallest element?
References
Adler, R.L., Konheim, A.G., & McAndrew, M.H. (1965). Topological entropy. Transactions of the
American Mathematical Society, 114, 309319.
Goodwyn, L.W. (1969). Topological entropy bounds measure-theoretic entropy. Proceedings of
the American Mathematical Society, 23, 679688.
236
Goodman, T. N.T. (1971). Relating topological entropy and measure entropy. Bulletin of the
London Mathematical Society, 3, 176180.
Walters, P. (1982). An introduction to ergodic theory. Berlin, Heidelberg, New York: Springer.
Appendix A
237
238
239
e
e
WD f! 0 W D! 0 D D! g,
Its completion: D.!;
! 0 / WD 1D! DD! 0 or1 D.!/
Its novelty: ND W ! R; ND .!/ D log2 E.D! /; N .D/ WD E.ND /,
Its surprise: S.D/ WD N .ND /,
e
Its information: I.D/ D N .D/.
240
Then N ./ D E.N /.
241
log2 E.A/
1
for ! 2 A ;
S
for ! :
i) implies N ./ N ./.
ii) N . [ / N ./ C N ./.
Proposition A.7. Let be a fuzzy repertoire. Then I./ D I.f /.
When we consider the set Tf of fuzzy templates, we have a natural join _ of
two templates and , i.e., the smallest template that is larger than [ . Now we
can show that on fuzzy templates I is monotonic and subadditive.
Definition A.15. For two fuzzy repertoires and we define
1. _ WD fA ^ BW A 2 ; B 2 ; A ^ B essentialg and
2. WD fA BW A 2 ; B 2 ; A B essentialg.
Proposition A.8. 1. For two tight fuzzy repertoires and , implies
I./ I./.
2. For two fuzzy templates and we have I. _ / D I. / I./ C I./.
The set Pf of fuzzy partitions with the ordering turns out to be a sublattice of
Tf with join _ . Even on Pf novelty N and information I do not coincide. Of
course, on Tf and in particular on Pf we still have N I and also monotonicity
of N and I. Proposition A.8 shows that I is subadditive on Tf and on Pf in
particular. However, N is not subadditive, neither on Tf nor on Pf , as the following
example shows.
242
A1
A2
A3
A4
1
0
0
0
1
0
0
0
a
1
0
0
a
1
0
0
a
a
1
0
a
a
1
0
a
a
a
1
a
a
a
1
B1
B2
B3
B4
1
a
a
a
0
1
a
a
0
1
a
a
0
0
1
a
0
0
1
a
0
0
0
1
0
0
0
1
1
0
0
0
t
u
and
A2
I./ D I. / D
A2
A2
I./ D I.cr / D
A2
Reference
Zadeh, L.A. (1965). Fuzzy sets. Information and Control, 8, 338353.
and
Glossary
Notation
A; B
E
H
I
Nd .!/
P; Q
R
Sd .!/
T
X; Y; Z
B
C
C
F
G and Gpq .d /
I
IG
Z
L
M
L
N
N
Npq .d /
N
P
Q
R
R
Description.
Propositions.
Expectation value.
Entropy.
Information rate / Information as random variable.
The novelty provided by ! for the description d .
Transition Probabilites.
Range of a function.
Surprise (of an outcome !) of d .
Transinformation rate.
Random variables.
Borel -algebra.
Set of all covers.
Set of complex numbers.
Set of flat covers.
Novelty gain.
Information.
Information gain.
Set of integer numbers.
Average length of a code.
Mutual novelty.
Number of questions in a guessing strategy.
Set of natural numbers.
Average novelty.
Subjective novelty.
Novelty as a random variable.
Set of partitions.
Set of rational numbers.
Set of real numbers.
Set of repertoires.
243
244
Glossary
S
SL
S
T
T
Var
; ; ;
eN
c
C
e
X ; Y; Q
E
d; b; c
d\
e
p; q
.; ; p/
Surprise.
Surprise loss.
Surprise as a random variable.
Set of tight covers or templates.
Transinformation.
Variance.
Letters for covers and repertoires.
Average error of a Transition Probability.
The capacity of a channel. Also letter for a code.
A channel.
The completion of a description, e.g., dQ .
Stochastic proccesses.
The direction of a description, e.g., dE.
Descriptions.
The tightening of a description d .
Error probability of a Bayesian guess.
Letters for probabilities.
Probability space.
-algebra of propositions or events.
elementary event, element of .
Index
Additivity, 5, 14, 40
novelty, 14
Algebra, 4
-cover, 230
Alphabet, 3, 89, 91, 98, 100
index, 89
input, 91, 92
output, 89, 91, 92
Anticipation, 90, 93, 94
bound, 90
finite, 90
span, 90, 94
Antitone, 14
A-priori probability, 69
Asymptotic equipartition property, 85, 8587,
97, 98
Average entropy, 202, 203
Average error, 66
Average information, 81, 83, 84
Average length, 55, 56, 59
Average novelty, 18, 32, 200
Average number of questions, 53
Average surprise, 175
Bayesian guess, 69
Beginning, 55
Boolean algebra, 214
Burst novelty, 168
Burst repertoire, 168, 175
Burst surprise, 168
245
246
clean, 113
disjoint, 110
finitary, 111
flat, 117, 118
\-stable, 115
narrow, 117
partition, 110
product, 116
shallow, 117, 118
tight, 112, 113
Crisp, 238
version, 238
Element
minimal, 111
Elementary event, 9
Entropy, 27, 195, 197203
average, 202
Equivalence class, 208
Equivalent, 131
Error probability, 69, 97, 98, 100
Essential, 238
Essentially, 18
Event, 3, 4, 6, 9, 1517, 19, 105, 109, 110,
166168, 189
elementary, 9
Index
Excitatory postsynaptic potential (EPSP), 166
Expectation-value, 6, 7, 80
Finitary, 111
Finite function, 7
Finite memory, 94
Flat cover, 117
Flattening of a repertoire, 117
Function, 5
convex, 133
countable-valued, 7
finite, 7
integrable, 7
Fuzzy, 237
consequence, 240
descriptions, 237, 239
partitions, 240
propositions, 237, 237
relation, 237
templates, 240
High-probability, 98100
element, 98
pair, 98, 99
sequence, 86, 98
Huffman code, 57, 58
Identically distributed, 79
Improbability, 14
In terms of, 110
Independent, 79
Independent identically distributed, 79, 81, 84,
85, 87, 88, 94, 98
Independent proposition, 15
Information, 24, 27, 32, 40, 42, 51, 5860, 65,
81, 83, 84, 105, 109, 125, 126, 127,
132, 138, 161, 165, 195
average, 30, 81, 83, 84
gain, 37, 147, 152, 199, 201, 203
rate, 83, 8486, 100, 199
subadditive, 132
Inhibitory postsynaptic potential (IPSP), 167
Input alphabet, 89, 91, 92
Input process, 94, 98
Input sequence, 91
Integrable function, 7
\-stable, 174, 230
cover, 115
repertoire, 115
Irreducible, 55, 56
code, 55, 56, 58
Isotone, 14
Index
Join, 210
Mapping, 5
continuous, 7
discrete, 7
identity, 94
Markov process, 79, 201, 202
Maximal, 209
Measurability, 8
Measurable, 8, 9, 18, 30, 66, 79
Meet, 210
Memory, 9092, 94, 97
bound, 90
finite, 90, 94
internal, 91
span, 90, 94
Minimal, 209
Minimal element, 111
Monotonicity, 30
Monty Hall problem, 13
Mutual information, 43
of random variables, 43
Mutual novelty, 42, 145
of descriptions, 42
Narrow, 117
Negentropy, 195, 197
Neural repertoire, 166, 175
Novelty, 14, 18, 21, 32, 36, 3840, 51, 84, 105,
125, 127, 132, 156, 161, 165, 166,
168, 170, 196, 239
additivity, 14
average, 18, 32, 51
burst, 168
conditional, 15, 38, 39
of d for , 106
provided by d for , 106
gain, 36, 146, 147
247
pause, 170
subjective, 36, 146
Pair-process, 85
Partition, xiii, 17
Partition of a repertoire, 110, 112, 196
Pause novelty, 170
Pause repertoire, 168, 170
Pause surprise, 170
Payoff-function, 6, 7
Population repertoire, 175
Postsynaptic potential (PSP), 166, 167
Prefix, 55
Probability, 4, 6, 8, 9, 14, 32, 33, 65, 66, 89,
9194, 98100, 156, 199202
a-priori, 69
conditional, 14, 38
distribution, 65, 66, 146, 175, 201
error, 97, 100
measure, 65
space, 3, 9, 199
transition, 65, 66, 89, 9194
vector, 133, 201203
Process, 79, 84, 94
discrete, 79
input, 94, 98
Markov, 79
stationary, 79, 91
stochastic, 79
Product, 116
Product of covers, 116
Product of distributions, 152
Product of repertoires, 117
Proper, 110
Proper choice, 110, 111113, 115, 118, 120
Proper description, 110, 111, 113
Proposition, 3, 4, 79, 1417, 105, 109113,
127, 173, 174
independent, 15
small, 130
248
discrete, 30, 31, 72
independent continuous, 139
Random vector, 6
Range, 5, 238
Reflexive, 238
Relation
antisymmetric, 207
connecting, 207
equivalence, 207
irreflexive, 207
ordering, 208
partial order, 208
reflexive, 207
strictly antisymmetric, 207
strict order, 208
symmetric, 207
total order, 208
transitive, 207
Repertoire, 105, 110, 111, 112, 113, 115,
117, 120, 125, 130, 134, 138, 148,
164168, 170, 171, 173175, 189,
191, 192, 196198, 202, 204
burst, 168, 175
clean, 113, 174
coincidence, 168, 170, 171, 175
depolarization, 168, 173
disjoint, 110
finite, 148
infinite, 146
\-stable, 115
neural, 161, 166, 175
partition, 110, 112, 196
pause, 168, 170, 175
population, 175
product, 117
shallow, 117, 118, 119
tight, 112, 113, 115, 117, 118, 120, 174
Index
Smallest element, 209
Smallest upper bound (s.u.b.), 209
Spike, 166168, 170
Stationary, 98
Stationary process, 79, 91, 97
Stochastic process, 79, 8088, 98, 100
i.i.d., 79, 81, 84, 85, 87
stationary, 79, 82, 83, 85, 97, 98, 100
Subadditive information, 132
Subjective information, 37, 202
Subjective novelty, 36, 146
Subjective surprise, 37, 156
Surprise, 24, 32, 32, 124, 125, 138, 139, 156,
166168, 170, 174, 175, 239
average, 84, 175
burst, 168
loss, 147
of an outcome !, 24
pause, 170
random variable, 31
subjective, 37, 156
Surprising, 19
Symmetric, 20, 238
Tautology, 4
Tight, 22, 112
description, 22, 115
repertoire, 112, 113, 115, 117, 118, 120,
174
Tightening, 23, 32, 115
Time average, 199
Transinformation, 42, 43, 65, 66, 145, 152, 154
of covers, 154
of descriptions, 42
of random variables, 43, 154
rate, 84
Transition matrix, 201, 202
Transition probability, 65, 66, 89, 9194
Transitive, 238
Uncertainty, 27
Uncomparable, 209
Uniformly disturbing, 71
[-stable, 230
Upper bound, 209