
A theory of case-based decisions

2001, The Economic Journal

A normative theory attempts to provide recommendations regarding what to do. It follows that normative theories are addressed to an audience of people facing decisions who are capable of understanding their recommendations. However, not every recommendation qualifies as normative science. There are recommendations that may be classified as moral, religious, or political preaching. These are characterized by suggesting goals to the decision makers, and, as such, are outside the realm of academic activity. There is an additional type of recommendations that we do not consider as normative theories. These are recommendations that belong to the domain of social planning or engineering. They are characterized by recommending tools for achieving pre-specified goals. For instance, the design of allocation mechanisms that yield Pareto optimal outcomes accepts the given goal of Pareto optimality and solves an engineering-like problem of obtaining it. Our use of "normative science" differs from both these types of recommendations. A normative scientific claim may be viewed as an implicit descriptive statement about decision makers' preferences. The latter are about conceivable realities that are the subject of descriptive theories. For instance, whereas descriptive theory of choice investigates actual preferences, normative theory of choice analyzes the kind of preferences that the decision maker would like to have, that is, preferences over preferences. An axiom such as transitivity of preferences, when normatively interpreted, attempts to describe the way the decision maker would prefer to make choices. Similarly, Harsanyi (1953, 1955) and Rawls (1971) can be interpreted as normative theories for social choice in that they attempt to describe to what society one would like to belong. According to this definition, normative theories are also descriptive.
They attempt to describe a certain reality, namely, the preferences an individual has over the reality she encounters. To avoid confusion, we will reserve the term "descriptive theory" for theories that are not normative. That is, descriptive theories would deal, by definition, with "first-order" reality, whereas normative theories would deal with "second-order" reality, namely, with preferences over first-order reality. First-order reality may be external or objective, whereas second-order reality always has to do with subjective preferences that lie within the mind of an individual. Yet, first-order reality might include actual preferences, in which case the distinction between first

A Theory of Case-Based Decisions

by Itzhak Gilboa and David Schmeidler

To our families

Acknowledgments

We thank many people for conversations, comments, and references that have influenced our thinking and shaped this work. In particular, we wish to thank Yoav Binyamini, Steven Brams, Didier Dubois, Ilan Eshel, Eva Gilboa-Schechtman, Ehud Kalai, Edi Karni, Kimberly Katz, Daniel Lehman, Akihiko Matsui, Sujoy Mukerji, Roger Myerson, Andrew Postlewaite, Ariel Rubinstein, Dov Samet, Peter Wakker, Bernard Walliser, and Peyton Young. Special thanks go to Benjamin Polak and to Enriqueta Aragones for many specific comments on earlier drafts of several chapters.

“In reality, all arguments from experience are founded on the similarity which we discover among natural objects, and by which we are induced to expect effects similar to those which we have found to follow from such objects. And though none but a fool or madman will ever pretend to dispute the authority of experience, or to reject that great guide of human life, it may surely be allowed a philosopher to have so much curiosity at least as to examine the principle of human nature, which gives this mighty authority to experience, and makes us draw advantage from that similarity which nature has placed among different objects. From causes which appear similar we expect similar effects. This is the sum of all our experimental conclusions.” (Hume 1748, Section IV)

Contents

1 Prologue
  1 The Scope of This Book
  2 Meta-Theoretical Vocabulary
    2.1 Theories and conceptual frameworks
    2.2 Descriptive and normative theories
    2.3 Axiomatizations
    2.4 Behaviorist, behavioral, and cognitive theories
    2.5 Rationality
    2.6 Deviations from rationality
    2.7 Subjective and objective terms
  3 Meta-Theoretical Prejudices
    3.1 Preliminary remark on the philosophy of science
    3.2 Utility and expected utility “theories” as conceptual frameworks and as theories
    3.3 On the validity of purely behavioral economic theory
    3.4 What does all this have to do with CBDT?
2 Decision Rules
  4 Elementary Formula and Interpretations
    4.1 Motivating examples
    4.2 Model
    4.3 Aspirations and satisficing
    4.4 Comparison with EUT
    4.5 Comments
  5 Variations and Generalizations
    5.1 Average similarity
    5.2 Act similarity
    5.3 Case similarity
  6 CBDT as a Behaviorist Theory
    6.1 W-Maximization
    6.2 Cognitive Specification: EUT
    6.3 Cognitive Specification: CBDT
    6.4 Comparing the cognitive specifications
  7 Case-Based Prediction
3 Axiomatic Derivation
  8 Highlights
  9 Model and Result
    9.1 Axioms
    9.2 Basic result
    9.3 Learning new cases
    9.4 Equivalent cases
    9.5 U-Maximization
  10 Discussion of the Axioms
  11 Proofs
4 Conceptual Foundations
  12 CBDT and Expected Utility Theory
    12.1 Reduction of theories
    12.2 Hypothetical reasoning
    12.3 Observability of data
    12.4 The primacy of similarity
    12.5 Bounded rationality?
  13 CBDT and Rule-Based Systems
    13.1 What can be known?
    13.2 Deriving case-based decision theory
    13.3 Implicit knowledge of rules
    13.4 Two roles of rules
5 Planning
  14 Representation and Evaluation of Plans
    14.1 Dissection, selection, and recombination
    14.2 Representing uncertainty
    14.3 Plan evaluation
    14.4 Discussion
  15 Axiomatic Derivation
    15.1 Set-up
    15.2 Axioms and result
    15.3 Proof
6 Repeated Choice
  16 Cumulative Utility Maximization
    16.1 Memory-dependent preferences
    16.2 Related literature
    16.3 Model and results
    16.4 Comments
    16.5 Proofs
  17 The Potential
    17.1 Definition
    17.2 Normalized potential and neo-classical utility
    17.3 Substitution and Complementarity
7 Learning and Induction
  18 Learning to Maximize Expected Payoff
    18.1 Aspiration level adjustment
    18.2 Realism and ambitiousness
    18.3 Highlights
    18.4 Model
    18.5 Results
    18.6 Comments
    18.7 Proofs
  19 Learning the Similarity Function
    19.1 Examples
    19.2 Counter-example to U-maximization
    19.3 Learning and expertise
  20 Two Views of Induction: CBDT and Simplicism
    20.1 Wittgenstein and Hume
    20.2 Examples
Bibliography
Index

Chapter 1

Prologue

1 The Scope of This Book

The focus of this book is formal modeling of decision making by a single person who is aware of the uncertainty she is facing. Some of the models and results we propose may be applicable to other situations. For instance, the decision maker may be an organization or a computer program. Alternatively, the decision maker may not be aware of the uncertainty involved or of the very fact that a decision is being made.
Yet, our main interest is in descriptive and normative models of conscious decisions made by humans.

There are two main paradigms for formal modeling of human reasoning, which have also been applied to decision making under uncertainty. One involves probabilistic and statistical reasoning. In particular, the Bayesian model coupled with expected utility maximization is the most prominent paradigm for formal models of decision making under uncertainty. The other employs rule-based deductive systems. Each of these paradigms provides a conceptual framework and a set of guidelines for constructing specific models for a wide range of decision problems.

These two paradigms are not the only ways in which people’s reasoning may be, or has been, described. In particular, the claim that people reason by analogies dates back at least to Hume. However, reasoning by analogies has not been the subject of formal analysis to the same degree that the other paradigms have. Moreover, there is no general purpose theory we are aware of that links reasoning by analogies to decision making under uncertainty.

Our goal is to fill this gap. That is, we seek a general purpose formal model, comparable to the model of expected utility maximization, that will (i) provide a framework within which a large class of specific problems can be modeled; (ii) be based on data that are, at least in principle, observable; (iii) allow mathematical analysis of qualitative issues, such as asymptotic behavior; and (iv) be based on reasoning by analogies.

We believe that human reasoning typically involves a combination of the three basic techniques, namely, rule-based deduction, probabilistic inference, and analogies. Formal modeling tends to opt for elegance, and to focus on certain aspects of a problem at the expense of others. Indeed, our aim is to provide a model of case- or analogy-based decision making that will be simple enough to highlight main insights.
We discuss the various ways in which our model may capture deductive and probabilistic reasoning, but we do not formally model the latter. It should be taken for granted that a realistic model of the human mind would have to include ingredients of all three paradigms, and perhaps several others as well. At this stage we merely attempt to lay the foundations for one paradigm whose absence from the theoretical discussion we find troubling.

The theory we present here does not purport to be more realistic than other theories of human reasoning or of choice. In particular, our goal is not to fine-tune expected utility theory as a descriptive theory of decision making in situations described by probabilities or states of the world. Rather, we wish to suggest a framework within which one can analyze choice in situations that do not fit existing formal models very naturally. Our theory is just as idealized as existing theories. We only claim that in many situations it is a more natural conceptualization of reality than are these other theories.

This book does not attempt to provide even sketchy surveys of the established paradigms for formal modeling of reasoning, or of the existing literature on case-based reasoning. The reader is referred to standard texts for basic definitions and background.

Many of the ideas and mathematical results in this book have appeared in journal articles and working papers (Gilboa and Schmeidler 1995, 1996a, 1997a,b, 2000a,b,c). This material has been integrated, organized, and interpreted in new ways. Additionally, several sections appear here for the first time.

In writing this book, we made an effort to address readers from different academic disciplines. Whereas several chapters are of common interest, others may address more specific audiences. The following is a brief guide to the book.
We start with two meta-theoretical sections, one devoted to definitions of philosophical terms, and the other to our own views on the way decision theory and economic theory should be conducted. These two sections may be skipped with no great loss to the main substance of the book. Yet, Section 2 may help to clarify the way we use certain terms (such as “rationality”, “normative science”, and the like), and Section 3 explains part of our motivation in developing the theory described in this book.

Chapter 2 of the book presents the main ideas of case-based decision theory (CBDT), as well as its formal model. It offers several decision rules, a behaviorist interpretation of CBDT, and a specification of the theory for prediction problems.

Chapter 3 provides the axiomatic foundations for the decision rules in Chapter 2. In line with the tradition in decision theory and in economics, it seeks to relate theoretical concepts to observables and to specify conditions under which the theory might be refuted.

Chapter 4, on the other hand, focuses on the epistemological underpinnings of CBDT. It compares it with the other two paradigms of human reasoning and argues that, from a conceptual viewpoint, analogical reasoning is primitive, whereas both deductive inference and probabilistic reasoning are derived from it. Whereas Chapter 3 provides the mathematical foundations of our theory, Chapter 4 offers the conceptual foundations of the theory and of the language within which the mathematical model is formulated.

Chapter 5 deals with planning. It generalizes the CBDT model from a single-stage decision to a multi-stage one, and offers an axiomatic foundation for this generalization.

Chapter 6 focuses on a special case of our general model, in which the same problem is repeated over and over again. It relates to problems of discrete choice in decision theory and in marketing, and it touches upon issues of consumer theory.
It also contains some results that are used later in the book.

Chapter 7 addresses questions of learning, dynamic evolution, and induction in our model. We start with an optimality result for the case of a repeated problem, which is based on a rather rudimentary form of learning. We continue to discuss more interesting forms of learning, as well as inductive inference. Unfortunately, we do not offer any profound results about the more interesting issues. Yet, we hope that the formal model we propose may facilitate discussion of these issues.

2 Meta-Theoretical Vocabulary

We devote this section to defining the way we use certain terms that are borrowed from philosophy. Definitions of terms and distinctions among concepts tend to be fuzzy and subjective. The following are no exception. These are merely the definitions that we have found to be the most useful for discussing theories of decision making under uncertainty at the present state. While our definitions are geared toward a specific goal, several of them may facilitate discussion of other topics as well.

2.1 Theories and conceptual frameworks

A theory of social science can be viewed as a formal mathematical structure coupled with an informal interpretation. Consider, for example, the economic theory that a consumer’s demand is derived from maximizing a utility function under a budget constraint. A possible formal representation of this theory consists of two components, describing two sets, C and P. The first set, C, consists of all conceivable demand functions. A demand function maps a vector of positive prices $p \in \mathbb{R}^n_{++}$ and an income level $I \in \mathbb{R}_+$ to a vector of quantities $d(p, I) \in \mathbb{R}^n_+$, interpreted as the consumer’s desired quantities of consumption under the budget constraint that total expenditure $d(p, I) \cdot p$ does not exceed income $I$. The second set, P, is the subset of C that is consistent with the theory.
Specifically, P consists of the demand functions that can be described as maximizing a utility function.¹ When the theory is descriptive, the set P is interpreted as all phenomena (in C) that might actually be observed. When the theory is normative, P is interpreted as all phenomena (in C) that the theory recommends. Thus, whether the theory is descriptive or normative is part of the informal interpretation.

¹ Standard (neo-classical) consumer theory imposes additional constraints. For instance, homogeneity and continuity are often part of the definition of demand functions, and utility functions are required to be continuous, monotone, and strictly quasi-concave. We omit these details for clarity of exposition.

The informal interpretation should also specify the intended applications of the theory. This is done at two levels. First, there are “nicknames” attached to mathematical objects. Thus $\mathbb{R}^n_+$ is referred to as a set of “bundles”, $\mathbb{R}^n_{++}$ – as a set of positive “price vectors”, whereas $I$ is supposed to represent “income” and $d$ – “demand”. Second, there are more detailed descriptions that specify whether, say, the set $\mathbb{R}^n_+$ should be viewed as representing physical commodities in an atemporal model, consumption plans over time, or financial assets including contingent claims, whether $d$ denotes the demand of an individual or a household, and so forth.

Generally, the formal structure of a theory consists of a description of a set C and a description of a subset thereof, P. The set C is understood to consist of conceivably observable phenomena. It may be referred to as the scope of the theory. A theory thus selects a set of phenomena P out of the set of conceivable phenomena C, and excludes its complement $C \setminus P$. What is being said about this set P, however, is specified by the informal interpretation: it may be the prediction or the recommendation of the theory.
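The consumer example above can be made concrete with a minimal sketch. It assumes a Cobb-Douglas utility function, which is our illustrative choice and does not appear in the text, and all function names are hypothetical. The closed-form demand below is one element of the set P: it respects the budget constraint, and a crude grid search finds no affordable bundle with higher utility.

```python
from itertools import product
from math import prod

def cobb_douglas_utility(bundle, weights):
    """Illustrative utility u(x) = prod_i x_i ** a_i (an assumption, not from the text)."""
    return prod(x ** a for x, a in zip(bundle, weights))

def cobb_douglas_demand(prices, income, weights):
    """Maximizer of the utility above under the budget constraint, for weights
    summing to 1: spend the share a_i of income on good i, d_i(p, I) = a_i * I / p_i."""
    return [a * income / p for a, p in zip(weights, prices)]

def within_budget(bundle, prices, income, tol=1e-9):
    """Budget constraint: total expenditure d(p, I) . p does not exceed income I."""
    return sum(x * p for x, p in zip(bundle, prices)) <= income + tol

def beats_grid(prices, income, weights, steps=50):
    """Crude numerical check: no affordable bundle on a grid has strictly
    higher utility than the closed-form demand."""
    demand = cobb_douglas_demand(prices, income, weights)
    best = cobb_douglas_utility(demand, weights)
    axes = [[k * (income / p) / steps for k in range(steps + 1)] for p in prices]
    for bundle in product(*axes):
        if within_budget(bundle, prices, income):
            if cobb_douglas_utility(bundle, weights) > best + 1e-9:
                return False
    return True

# Hypothetical numbers: two goods, positive prices, and a positive income.
prices, income, weights = [2.0, 5.0], 100.0, [0.4, 0.6]
demand = cobb_douglas_demand(prices, income, weights)
print(demand)
print(within_budget(demand, prices, income))
print(beats_grid(prices, income, weights))
```

A function that merely satisfies the budget constraint, such as one that always spends all income on the first good, would still belong to C. Whether it also belongs to P depends on whether some utility function rationalizes it.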
Observe that the formal structure of the theory does not consist of the sets C and P themselves. Rather, it consists of formal descriptions of these sets, $D_C$ and $D_P$, respectively. These formal descriptions are strings of characters that define the sets in standard mathematical notation. Thus, theories are not extensional. In particular, two different mathematical descriptions $D_P$ and $D'_P$ of the same set P will give rise to two different theories. It may be a non-trivial mathematical task to discover relationships between sets described by different theories.

It is possible that two theories that differ not only in the formal structure $(D_C, D_P)$ but also in the sets (C, P) may coincide in the real world phenomena they describe. For example, consider again the paradigm of utility maximization in consumer theory. We have spelled out above one manifestation of this paradigm in the language of demand functions. But the literature also offers other theories within the same paradigm. For instance, one may define the set of conceivable phenomena to be all binary relations over $\mathbb{R}^n_+$, with a corresponding definition of the subset of these relations that conform to maximization of a real-valued function.

The informal interpretation of a theory may also be formally defined. For instance, the assignment of nicknames to mathematical objects can be viewed as a mapping from the formal descriptions of these objects, appearing in $D_C$ and in $D_P$, into a natural language, provided that the latter is a formal mathematical object. Similarly, one may formally define “real world phenomena” and represent the (intended) applications of the theory as a collection of mappings from the mathematical entities to this set. Finally, the type of interpretation of the theory, namely, whether it is descriptive or normative, can easily be formalized.²

Thus a theory may be described as a quintuple consisting of $D_C$, $D_P$, the nicknames assignment, the applications, and the type of interpretation. We refer to the first three components of this quintuple, that is, $D_C$, $D_P$, and the nicknames assignment, as a conceptual framework (or framework for short). A conceptual framework thus describes a scope and a description of a prediction or a recommendation, and it points to a type of applications through the assignment of nicknames. But a framework does not completely specify the applications. Thus, frameworks fall short of qualifying as theories, even if the type of interpretation is given.

For instance, Savage’s (1954) model of expected utility theory involves binary relations over functions defined on a measurable space. The mathematical model is consistent with real world interpretations that have nothing to do with choice under uncertainty, such as choice of streams of consumption over time, or of income profiles in a society. The nickname “space of states of the world”, which is attached to the measurable space in Savage’s model, defines a framework that deals with decision under uncertainty. But the conceptual framework of expected utility theory does not specify exactly what the states of the world are, or how they should be constructed. Similarly, the conceptual framework of Nash equilibrium (Nash (1951)) in game theory refers to “players” and to “strategies”, but it does not specify whether the players are individuals, organizations, or states, whether the theory should be applied to repeated or to one-shot situations, to situations involving few or many players, and so forth.
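As a rough illustration of the vocabulary just introduced, the quintuple and its first three components can be pictured as a small data structure. This is only a sketch under our own assumptions: the class and field names are hypothetical, and the formal descriptions are abbreviated to plain strings rather than mathematical notation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConceptualFramework:
    """First three components of the quintuple (field names are hypothetical)."""
    scope_description: str      # D_C: formal description of the conceivable phenomena C
    selection_description: str  # D_P: formal description of the selected subset P
    nicknames: dict             # formal object -> natural-language nickname

@dataclass(frozen=True)
class Theory:
    """A framework plus the two components that turn it into a theory."""
    framework: ConceptualFramework
    applications: tuple         # intended real-world applications
    interpretation: str         # type of interpretation, e.g. "descriptive" or "normative"

# The consumer example from the text, with descriptions abbreviated to strings.
consumer_framework = ConceptualFramework(
    scope_description="demand functions d mapping prices p and income I to bundles",
    selection_description="demand functions that maximize some utility function",
    nicknames={"R^n_+": "bundles", "R^n_++": "price vectors", "I": "income", "d": "demand"},
)

applications = ("demand of an individual for physical commodities in an atemporal model",)
descriptive_theory = Theory(consumer_framework, applications, "descriptive")
normative_theory = Theory(consumer_framework, applications, "normative")

# Same framework and applications, different theories.
print(descriptive_theory.framework == normative_theory.framework)
print(descriptive_theory == normative_theory)
```

The design mirrors the point made above: a framework alone leaves the applications and the type of interpretation open.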
By contrast, the theory of expected utility maximization under risk (von Neumann and Morgenstern (1944)), as well as prospect theory (Kahneman and Tversky (1979)), are conceptual frameworks according to our definition. Still, they may be classified also as theories, because the scope and nicknames they employ almost completely define their applications.

² Our formal model allows other interpretations as well. For instance, it may represent a formal theory of aesthetics, where the set P is interpreted as defining what is beautiful. One may argue that such a theory can still be interpreted as a normative theory, prescribing how aesthetical judgment should be conducted.

Terminological remark: The discussion above implies that expected utility theory should be termed a framework rather than a theory. Similarly, non-cooperative games coupled with Nash equilibrium constitute a framework. Still, we follow standard usage throughout most of the book and often use “theory” where our vocabulary suggests “framework”.³ However, the term “framework” will be used only for conceptual frameworks that have several substantially distinct applications.

³ We apply the standard usage to the title of this book as well.

2.2 Descriptive and normative theories

There are many possible meanings of a selection of a set P out of a set of conceivable phenomena C. Among them, we find it crucial to focus on, and to distinguish between, two that are relevant to theories in the social sciences: descriptive and normative.

A descriptive theory attempts to describe, explain, or predict observations. Despite the different intuitive meanings, one may find it challenging to provide formal qualitative distinctions between description and explanation. Moreover, the distinction between these and prediction may not be very fundamental either. We therefore do not dwell on the distinctions among these goals.

A normative theory attempts to provide recommendations regarding what to do. It follows that normative theories are addressed to an audience of people facing decisions who are capable of understanding their recommendations. However, not every recommendation qualifies as normative science. There are recommendations that may be classified as moral, religious, or political preaching. These are characterized by suggesting goals to the decision makers, and, as such, are outside the realm of academic activity. There is an additional type of recommendations that we do not consider as normative theories. These are recommendations that belong to the domain of social planning or engineering. They are characterized by recommending tools for achieving pre-specified goals. For instance, the design of allocation mechanisms that yield Pareto optimal outcomes accepts the given goal of Pareto optimality and solves an engineering-like problem of obtaining it. Our use of “normative science” differs from both these types of recommendations.

A normative scientific claim may be viewed as an implicit descriptive statement about decision makers’ preferences. The latter are about conceivable realities that are the subject of descriptive theories. For instance, whereas descriptive theory of choice investigates actual preferences, normative theory of choice analyzes the kind of preferences that the decision maker would like to have, that is, preferences over preferences. An axiom such as transitivity of preferences, when normatively interpreted, attempts to describe the way the decision maker would prefer to make choices. Similarly, Harsanyi (1953, 1955) and Rawls (1971) can be interpreted as normative theories for social choice in that they attempt to describe to what society one would like to belong. According to this definition, normative theories are also descriptive.
They attempt to describe a certain reality, namely, the preferences an individual has over the reality she encounters. To avoid confusion, we will reserve the term “descriptive theory” for theories that are not normative. That is, descriptive theories would deal, by definition, with “first-order” reality, whereas normative theories would deal with “second-order” reality, namely, with preferences over first-order reality. First-order reality may be external or objective, whereas second-order reality always has to do with subjective preferences that lie within the mind of an individual. Yet, first-order reality might include actual preferences, in which case the distinction between first-order and second-order reality may become a relative matter.⁴,⁵

⁴ There is a temptation to consider a hierarchy of preferences, and to ask which are in the realm of descriptive theories. We resist this temptation.

⁵ In many cases first-order preferences would be revealed by actual choices, whereas second-order preferences would only be verbally reported. Yet, this distinction is not sharp. First, there may be first-order preferences that cannot be observed in actual choice. Second, one may imagine elaborate choice situations in which second-order preferences might be observed, as in cases where one decides on a decision-making procedure or on a commitment device.

Needless to say, these distinctions are sometimes fuzzy and subjective. A scientific essay may belong to several different categories, and it may be differently interpreted by different readers, who may also disagree with the author’s interpretation. For instance, the independence axiom of von Neumann and Morgenstern’s expected utility theory may be interpreted as a component of a descriptive theory. Indeed, testing it experimentally presupposes that it has a claim to describe reality. But it may also be interpreted normatively, as a recommendation for decision making under risk. To cite a famous example, Maurice Allais presented his paradox (see Allais (1953)) to several researchers, including the late Leonard Savage. The latter expressed preferences in violation of expected utility theory. Allais argued that expected utility maximization is not a successful descriptive theory. Savage’s reply was that his theory should be interpreted normatively, and that it could indeed help a decision maker avoid such mistakes.

Further, even when a theory is interpreted as a recommendation it may involve different types of recommendations. For instance, Shapley axiomatized his value for cooperative transferable utility games (Shapley (1953)). When interpreted normatively, the axioms attempt to capture decision makers’ preferences over the way in which, say, cost is allocated in different problems. A related result by Shapley shows that a player’s value can be computed as a weighted average of her marginal contributions. This result can be interpreted in two ways. First, one may view it as an engineering recommendation: given the goal of computing a player’s value, it suggests a formula for its computation. Second, one may also argue that compensating a player to the extent of her average marginal contribution is ethically appealing, and thus the formula, like the axioms, has a normative flavor.

To conclude, the distinctions between descriptive and normative theories, as well as between the latter and engineering on the one hand and preaching on the other, are not based on mathematical criteria. The same mathematical result can be part of any of these types of scientific or non-scientific activities. These distinctions are based on the author’s intent, or on her perceived intent. It is precisely the inherently subjective nature of these distinctions that demands that one be explicit about the suggested interpretation of a theory.
It also follows that one has to have a suggested interpretation in mind when attempting to evaluate a theory. Whereas a descriptive theory is evaluated by its conformity with objective reality, a normative theory is not. On the contrary, a normative theory that suggests that people do what they would do anyway is of questionable value. How should we evaluate a normative theory, then? Since we define normative theories to be descriptive theories dealing with second-order reality, a normative theory should also be tested according to its conformity to reality. But it is second-order reality that a normative theory should be compared to. For instance, a reasonable test of a normative theory might be whether its subjects accept its recommendations.

It is undeniable, however, that the evaluation of normative theories is inherently more problematic than that of descriptive theories. The evidence for normative theories would have to rely on introspection and self-report data much more than would the evidence for descriptive theories. Moreover, these data may be subject to manipulation. To consider a simple example, suppose we are trying to test the normative claim that income should be evenly distributed. We are supposed to find out whether people would like to live in an egalitarian society. Simply asking people might confound their ethical preferences with their self-interest. Indeed, a person might not be sure whether she subscribes to a philosophical or a political thesis due to pure conviction or to self-serving biases. A gedanken experiment such as putting oneself behind the “veil of ignorance” (Harsanyi (1953, 1955), Rawls (1971)) may assist one in finding one’s own preferences, but may still be of little help in a social context.
Further, the reality one tries to model, namely, a person’s ethical preferences over the rules governing society, or her logico-aesthetical preferences over her own preferences, is a moving target that changes constantly with actual behavior and social reality. Yet, the way we define normative theories admits a certain notion of empirical testing.

2.3 Axiomatizations

In common usage, the term “axiomatization” refers to a theory. However, most axiomatizations in the literature apply to conceptual frameworks according to our definitions. In fact, the following definition of axiomatizations refers only to a formal structure $(D_C, D_P)$.

An axiomatization of $T = (D_C, D_P)$ is a mathematical model consisting of:
(i) a formal structure $T' = (D_C, D_{P'})$, which shares the description of the set $C$ with $T$, but whose set $P'$ is described only in the language of phenomena that are deemed observable;
(ii) mathematical results relating the set $P'$ of $T'$ to the set $P$ of $T$.

Ideally, one would like to have conditions on observable phenomena that are necessary and sufficient for $P$ to hold, namely, to have a structure $T'$ such that $P = P'$. In this case, $T'$ is considered to be a “complete” axiomatization of $T$. The conditions that describe $P'$ are referred to as “axioms”.6

6 As opposed to the original meaning of the word, an “axiom” need not be indisputable or self-evident. However, evaluation of axiomatic systems typically prefers axioms that are simple in terms of mathematical formulation and transparent in terms of empirical content.

Observe that whether $T' = (D_C, D_{P'})$ is considered to be an axiomatization of $T$ depends on the question of observability of terms in $D_{P'}$. Consequently, the definition above will be complete only given a formal definition of “observability”.
We do not attempt to provide such a definition here, and we use the term “axiomatization” as if observability were well-defined.7 Throughout the rest of this subsection we assume that the applications of the conceptual frameworks are agreed upon. We will therefore stick to standard usage and refer to axiomatizations of theories (rather than of formal structures).

Because human decisions involve inherently subjective phenomena, it is often the case that the formulation of a theory contains existential quantifiers. In this case, a complete axiomatization would also include a result regarding uniqueness. For instance, consider again the theory stating that “there exists a [so-called utility] function such that, in any decision between two choices, the consumer would opt for the one to which the [utility] function attaches a higher value”. An axiomatization of this theory should provide conditions under which the consumer can indeed be viewed as maximizing a utility function in binary decisions. Further, it should address the question of uniqueness of this function: to what extent can we argue that the utility function is defined by the observable data of binary choices?

There are three reasons for which one might be interested in axiomatizations of a theory. First, the meta-scientific reason mentioned above: an axiomatization provides a link between the theoretical terms and the (allegedly) observable terms used by the theory. True to the teachings of logical positivism (Carnap (1923), see also Suppe (1974)), one would like to have observable definitions of theoretical terms, in order to render the latter meaningful. Axiomatizations help us identify those theoretical differences that have observable implications, and avoid debates between theories that are observationally equivalent.
7 Indeed, people who disagree about the definition of observability may consequently disagree whether a certain mathematical result qualifies as an axiomatization.

Second, one might wish to have an axiomatization of a theory for descriptive reasons. Since an axiomatization translates the theory to directly observable claims, it prescribes a way to test the empirical validity of the theory. To the extent that the axioms do rule out certain conceivable observations, they also ensure that the theory is non-vacuous, that is, falsifiable, as preached by Popper (1934). Note also that there are many situations, especially in the social sciences, where it is impractical to subject a theory to direct empirical tests. In those cases, an axiomatization can help us judge the plausibility of the theory. In this sense, axiomatizations may serve a rhetorical purpose.

Finally, one is often interested in axiomatizations for normative reasons. A normative theory is successful to the extent that it convinces decision makers to change the way they make their choices.8 A set of axioms, formulated in the language of observable choice, can often do more than the theory’s mathematical formulation to convince decision makers that the theory has merit. Thus, the normative role of axiomatizations is inherently rhetorical.

It is often the case that an axiomatization serves all three purposes. For instance, an axiomatization of utility maximization in consumer theory provides a definition of the term “utility” and shows that binary choices can only define such a utility function up to an arbitrary monotone transformation. This meta-scientific exercise saves us useless debates about the choice between observationally equivalent utility functions.9 On a descriptive level, such an axiomatization shows what utility maximization actually entails. This allows one to test the theory that consumers are indeed utility maximizers.
Moreover, it renders the theory of utility maximization much more plausible, because it shows that relatively mild consistency assumptions suffice to treat consumers as if they were utility maximizers, even if they are not conscious of any utility function, of maximization, or even of the act of choice. Finally, on a normative level, an axiomatization of utility maximization may convince a person or an organization that, according to their own judgment, they would do better to adopt a utility function and act so as to maximize it.

8 Changing actual decision making is the ultimate goal of a normative theory. Such changes are often gradual and indirect. For instance, normative theories may first convince scientists before they filter through to practitioners and to the general public. Also, a normative theory may change the way people would like to behave even if they fail to implement their stated policies, for, say, reasons of self-discipline. Finally, a normative theory may convince many people that they would like society to make certain choices, but they may not be able to bring them about for political reasons. In all of these cases the normative theories definitely have some success.

9 This, at least, is the standard microeconomic textbook view. For an opposing view see Beja and Gilboa (1992) and Gilboa and Lapson (1995).

Similarly, Savage’s (1954) axiomatization of subjective expected utility maximization provides observable definitions of the terms “utility” and “(subjective) probability”. Descriptively, it provides us with reasons to believe that there are decision makers who can be described as expected utility maximizers even if they are not aware of it, thus making expected utility maximization a more plausible theory. Finally, from a normative point of view, decision makers who shrug their shoulders when faced with the theory of expected utility maximization may find it more compelling if they are exposed to Savage’s axioms.
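The claim above that binary choices can only define a utility function up to a monotone transformation can be checked on a toy example. The Python sketch below is only an illustration under stated assumptions: the alternatives, the utility numbers, and the two transformations are hypothetical and do not come from the text. It compares the choices induced by a utility function with those induced by a strictly increasing transformation of it and by a non-monotone one.

```python
import math
from itertools import combinations

# Hypothetical alternatives and utility numbers, chosen only for illustration.
utility = {"apple": 1.0, "banana": 2.5, "cherry": 4.0}


def binary_choices(u):
    """Return the alternative chosen from every pair under utility u."""
    return {
        frozenset((x, y)): (x if u[x] > u[y] else y)
        for x, y in combinations(sorted(u), 2)
    }


# A strictly increasing transformation of the utility numbers...
transformed = {x: math.exp(value) + 7 for x, value in utility.items()}
# ...and a transformation that is not monotone (it peaks at the value 2).
scrambled = {x: -(value - 2.0) ** 2 for x, value in utility.items()}

same_choices = binary_choices(utility) == binary_choices(transformed)
different_choices = binary_choices(utility) != binary_choices(scrambled)
```

In this example `same_choices` and `different_choices` are both true: the observable data of binary choices cannot tell `utility` apart from `transformed`, which is why a debate over which of the two is the "real" utility function would be a debate between observationally equivalent descriptions.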
While our use of the term “axiomatization” highlights the role of providing a link between theoretical concepts and observable data, this term is often used in other meanings. Both in mathematics and in the sciences, an “axiomatization” sometimes refers to a characterization of a certain entity by some of its properties. One is often interested in such axiomatizations as decompositions of the concept that is axiomatized. Studying the “building blocks” of which a concept is made typically enhances our understanding thereof, and allows the study of related concepts. Axiomatizations in the sense used in this book will typically also have the flavor of decomposition of a construct. Thus, on top of the reasons mentioned above, one may also be interested in axiomatizations simply as a way to better understand what characterizes a certain concept and what is the driving force behind certain results. For instance, an axiomatization of utility maximization in consumer theory will often reduce utility theory to more primitive “ingredients”, and will also suggest alternative theories that share only some of these ingredients, such as preferences that are transitive but not necessarily complete.

2.4 Behaviorist, behavioral, and cognitive theories

We distinguish between two types of data. Behavioral data are observations of actions taken by decision makers. By contrast, cognitive data are choice-related observations that are derived from introspection, self-reports, and so forth. We are interested in actions that are open to introspection, even if they are not necessarily preceded by a deliberate decision making process. Thus, what people say about what their choices might be, the reasons they give for actual or hypothetical choices, their recollection of and motivation for such choices are all within the category of cognitive data. In contrast to common psychological usage, we do not distinguish between cognition and emotion.
Emotional motives, inasmuch as they can be retrieved by introspection and memory, are cognitive data. Other relevant data, such as certain physiological or neurological activities, will be considered neither behavioral nor cognitive.

Theories of choice can be classified according to the types of data they recognize as valid, as well as by the types of theoretical concepts they resort to. A theory is behaviorist if it only admits behavioral data, and if it also makes no use of cognitive theoretical concepts. (See Watson (1913, 1930) and Skinner (1938).) We reserve the term “behavioral” for theories that only recognize behavioral data, but that make use of cognitive metaphors. Neoclassical economics and Savage’s (1954) expected utility theory are behavioral in this sense: they only recognize revealed preferences as legitimate data, but they resort to metaphors such as “utility” and “belief”. Finally, cognitive theories make use of cognitive data, as well as of behavioral data. Typically, they also use cognitive metaphors.

Cognitive and behavioral theories often have behaviorist implications. That is, they may say something about behavior that will be consistent with some behaviorist theories but not with others. In this case, we refer to the cognitive or the behavioral theory as a cognitive specification of the behaviorist theories it corresponds to. One may view a cognitive specification of a behaviorist theory as a description of a mental process that implements the theory.

2.5 Rationality

We find that purely behavioral definitions of rationality invariably miss an important ingredient of the intuitive meaning of the term. Indeed, if one adheres to the behavioral definition of rationality embodied in, say, Savage’s axioms, one has a foolproof method of making rational decisions: choose any prior and any utility function, and maximize the corresponding expected utility.
Adopting this method, one will never be caught violating Savage’s axioms. Yet, few would accept an arbitrary choice of a prior as rational. It follows that rationality has to do with reasoning as well as with behavior.

As a first approximation we suggest the following definition. An action, or a sequence of actions, is rational for a decision maker if, when the decision maker is confronted with an analysis of the decisions involved, but with no additional information, she does not regret her choices.10 This definition of rationality may apply not only to behavior, but also to decision processes leading to behavior. Observe that our definition presupposes a decision making entity capable of understanding the analysis of the problems encountered.

10 Alternatively, one can substitute “does not feel embarrassed by” for “does not regret”.

Consider the example of transitivity of binary preferences. Many people who exhibit cyclical preferences regret some of their choices when exposed to this fact. For these decision makers, violating transitivity would be considered irrational. Casual observation shows that most people feel embarrassed when it is shown to them that they have fallen prey to framing effects (Tversky and Kahneman (1981)). Hence we would say that, for most people, rationality dictates that they be immune to framing effects. Observe, however, that regret that follows from unfavorable resolution of uncertainty does not qualify as a test of rationality.

As another example, consider the decision maker’s autonomy. Suppose that a decision maker decides on an act, a, and ends up choosing another act, b, because, say, she is emotionally incapable of foregoing b. If she is surprised or embarrassed by her act, it may be considered irrational. But irrationality in this example may be due to the intent to choose a, or to the implicit prediction that she would implement this decision.
Indeed, if the decision maker knows that she is incapable of foregoing act b, it would be rational for her to adjust her decisions and predictions to the actual feasible set. That is, if she accepts the fact that she is technologically constrained to choose b, and if she plans to do so, there will be nothing irrational in this decision, and there will be no reason for her to regret not making the choice a, which was never really feasible. Similarly, a decision maker who has limitations in terms of simple mistakes, failing memory, limited computational capacity, and the like, may be rational as long as her decision takes these limitations into account, to the extent that they can be predicted.

Our definition has two properties that we find necessary for any definition of rationality and one that we find useful for our purposes. First, as mentioned above, it relies on cognitive or introspective data, as well as on behavioral data, and it cannot be applied to decision makers who cannot understand the analysis of their decisions. According to this definition it is meaningless to ask whether, say, bees are rational. Second, it is subjective in nature. A decision maker who, despite all our preaching, insists on making frame-dependent choices, will have to be admitted into the hall of rationality. Indeed, there is too little agreement among researchers in the field to justify the hope for a unified and objective definition of rationality. Finally, our definition of rationality is closely related to the practical question of what should be done about observed violations of classical theories of choice, as we explain in the sequel. As such, we hope that this definition may go beyond capturing intuition to simplifying the discussion that follows.11

2.6 Deviations from rationality

There is a large body of evidence that people do not always behave as classical decision theory predicts.
What should we do about observed deviations from the classical notion of rational choice? Should we refine our descriptive theories or dismiss the contradicting data as exceptions that can only clutter the basic rules? Should we teach our normative theories, or modify them? We find the definition of rationality given above useful in making these choices.

If an observed mode of behavior is irrational for most people, one may suggest a normative recommendation to avoid that mode of behavior. By definition of irrationality, most people would accept this recommendation, rendering it a successful normative theory. By contrast, there is a weaker incentive to incorporate this mode of behavior into descriptive theories: if these theories were known to the decision makers they describe, the decision makers would wish to change their behavior. Differently put, a descriptive theory of irrational choice is a self-refuting prophecy.

If, however, an observed mode of behavior is rational for most people, they will stick to it even if theorists preach the opposite. Hence, recommending to avoid it would make a poor normative theory. But then the theorists should include this mode of behavior in their descriptive theories. This would improve the accuracy of these theories even if the theories are known to their subjects.

11 For other views of rationality, see Arrow (1986), Etzioni (1986), and Simon (1986).

2.7 Subjective and objective terms

Certain terms, such as “probability”, are sometimes classified as subjective or objective. Some authors argue that all such terms are inherently subjective, and that the term “objective” is but a nickname for subjective terms on which there happens to be agreement. (See Anscombe and Aumann (1963).) A possible objection is raised by the following example. Five people are standing around a well that they have just found in the field. They all estimate its depth to be more than 100 feet.
This is the subjective estimate of each of the five people. Yet, while they all agree on the estimate, they also all agree that there is a more objective way to measure the depth of the well. Specifically, assume that Judy is one of the five people who have discovered the well, and that she recounts the story to her friend Jerry. Compare two scenarios. In the first scenario, Judy says, “I looked inside, and I saw that it was over 100 feet deep.” In the second scenario she says, “I had dropped a stone into the well and three seconds had passed before I heard the splash.” Jerry is more likely to accept Judy’s estimate in the second scenario than he is in the first. We would also like to argue that Judy’s estimate of the well’s depth in the second scenario is more objective than in the first. This suggests a definition of objectivity that requires more than agreement among subjective judgments: an assessment is objective if someone else is likely to agree with it. Generally, one may argue that objective judgments require some justification beyond the perhaps coincidental agreement among subjective ones.

It may be useful to conceptualize the distinction between subjective and objective judgments as quantitative, rather than qualitative. As such, the first definition of objectivity suggests the following measure: an assessment is more objective, the more people happen to share it as their subjective assessment. The definition we propose here offers a different measure: an assessment is more objective, the more likely are people, who have not given thought to it and who have no additional relevant information, to agree with it.

To consider another example, let us contrast classical with Bayesian statistics. If a classical statistician rejects a hypothesis, the belief that it should be rejected is relatively objective. By contrast, a prior of a Bayesian statistician is more subjective.
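Returning briefly to the well: part of what makes Judy’s second report more persuasive is that Jerry can redo the calculation himself. The Python sketch below shows one way he might do so. It rests on assumptions that are not in the text: free fall without air resistance, a gravitational acceleration of 9.81 m/s², a speed of sound of 343 m/s, and a delay of exactly three seconds. The function names are hypothetical.

```python
import math

G = 9.81                # m/s^2, assumed gravitational acceleration
SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly 20 degrees Celsius)
METERS_PER_FOOT = 0.3048


def depth_ignoring_sound(seconds):
    """Depth in meters if the whole delay is the stone's fall time."""
    return 0.5 * G * seconds ** 2


def depth_with_sound(seconds):
    """Depth in meters when the delay also covers the splash's return trip.

    Solves seconds = sqrt(2 d / G) + d / SPEED_OF_SOUND, which is a
    quadratic equation in s = sqrt(d).
    """
    a = 1.0 / SPEED_OF_SOUND
    b = math.sqrt(2.0 / G)
    s = (-b + math.sqrt(b * b + 4.0 * a * seconds)) / (2.0 * a)
    return s * s


upper_estimate_ft = depth_ignoring_sound(3.0) / METERS_PER_FOOT
refined_estimate_ft = depth_with_sound(3.0) / METERS_PER_FOOT
```

Under these assumptions the depth comes out at roughly 44 meters if the travel time of the sound is ignored and roughly 41 meters if it is included; both are comfortably more than 100 feet. Anyone who accepts the assumptions reaches the same figure, which is the sense in which the second estimate is more objective.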
Indeed, another classical statistician is more likely to agree with the decision to reject the hypothesis than is another Bayesian statistician likely to adopt a prior if it differs from her own. Two Bayesian statisticians may have the same subjective prior, but if they are about to meet a third one, they have no way to convince her to agree with their prior. There is no agreed-upon method to adopt a prior, while there are agreed-upon methods to test certain hypotheses.12

One may argue that our definition of objectivity again resorts to agreement among subjective judgments, only that the latter should include all hypothetical judgments made by various individuals at various times and at various states of the world. But for practical purposes of the following discussion, we find the distinction between objective terms and coincidence of subjective terms quite useful.

12 Two Bayesians will agree on a prior, i.e., will have a common prior, in practice, if this prior conveys minimal information (say, a uniform prior when it is well-defined). This is essentially Laplace’s dictum for the case of complete ignorance, which indicates a failure of the Bayesian claim that any prior information may be expressed by a probability function.

3 Meta-Theoretical Prejudices

This section expresses our views on certain issues relating to the state of the art in decision theory as well as to its applications to economics and to other social sciences. These views undoubtedly correlate with our motivation for developing case-based decision theory. Yet, as we explain below, the theory may be suggested and evaluated independently of our prejudices.

3.1 Preliminary remark on the philosophy of science

Analytical philosophy is a science, in the sense that the social sciences are. Namely, it uses formal or formalizable analytical models to describe, explain, and predict certain phenomena (as a descriptive science), or to produce recommendations and prescriptions (as a normative science). Generally, philosophy is concerned with phenomena that are primarily concentrated in the human mind, such as language, religion, ethics, aesthetics, science, and others. Hence, philosophy is closer to the social or human sciences than it is to the natural ones.

Many social scientists would agree that their theories are no more than suggestive illustrations or rough sketches.13 These theories are supposed to provide insights without necessarily being literally true. For instance, the economic model of perfect competition offers important insights regarding the way markets work. It obviously describes a highly idealized reality. That is, it is false. But it is neither worthless nor useless. Naturally, one has to study other models that are more realistic in certain aspects, such as models of imperfect competition, asymmetric information, and so forth. The analysis of these models qualifies and refines the insights of the perfect competition model, but it does not nullify them.

Viewing philosophy as a social science, there is no reason to expect its theories and models to be more accurate or complete than are those of economics or psychology. For example, the logical positivists’ Received View offers useful guidelines for developing scientific theories. It is also very insightful when interpreted descriptively, as a model of how science is done. But one expects the Received View to be neither a perfect description of, nor a foolproof prescription for scientific activity. Hanson’s theory that observations are theory-laden (Hanson (1958)), for example, also provides profound insights and enhances our understanding of the scientific process. It qualifies, but does not nullify the insights gained from the Received View.
Allowing ourselves a mild exaggeration, we argue for humility in evaluating theories in the social sciences, philosophy included: ask not, is there a counterexample to a theory; ask, is there an example thereof.14, 15

13 An extreme term, which was suggested by Gibbard and Varian (1978), is “caricatures”.

14 In this sense, our view is reminiscent of verificationism. For a review, see, for instance, Lewin (1996).

15 It is inevitable that similar qualifications would apply to this very paragraph.

3.2 Utility and expected utility “theories” as conceptual frameworks and as theories

As mentioned in Subsection 2.1, in common usage the term “theory” often defines only a conceptual framework. Consider, for instance, utility theory, suggesting that, given a set of feasible alternatives, decision makers maximize a utility function. When an intended application is explicitly or implicitly given, this is a descriptive theory in the tradition of logical positivism, which would also pass Popper’s falsifiability test. The theoretical term “utility” is axiomatically derived from observations. The axioms define quite precisely the conditions under which the theory is falsified. It can therefore be tested experimentally.

However, when the application is neither obvious nor agreed upon, it is not as clear what constitutes a violation of the axioms, or a falsification of the theory. In fact, almost any case in which the theory seems to be refuted can be re-interpreted in a way that conforms to the theory. One may always enrich the model, redefining the alternatives involved, in such a way that apparent refutations are reconciled.

Utility theory is a conceptual framework in the language of Subsection 2.1 above. There is no canon of application that would state precisely what are the real world phenomena to which utility theory may be applied.
In certain situations, such as carefully designed laboratory experiments, there is only one reasonable mapping from the formal structure to observed phenomena. But in many real life situations one has a certain degree of freedom in choosing this mapping. It is then not obvious what counts as a refutation of the theory. Thus, utility maximization may not be a theory in the Popperian sense. Rather, it is a conceptual framework, offering a certain way of conceptualizing and organizing data. Working within this conceptual framework, one may suggest specific theories that also define the set of alternatives among which choice is made.

Similarly, expected utility theory seems to be well defined and clearly falsifiable when one accepts a certain application. Indeed, in certain laboratory experiments it can be directly tested. But most empirical data do not present themselves with a clear definition of states of the world or of outcomes. Hence apparent violations of the theory may be accounted for by re-interpreting these terms.

The “theories” of utility maximization and of expected utility maximization therefore serve a dual purpose: they are formulated as testable theories, but they are also used as conceptual frameworks for the generation of more specific theories. We believe that this dual status is useful. On the one hand, treating these conceptual frameworks as falsifiable theories with axiomatic foundations facilitates the scientific discussion and ensures that we do not engage in debates between observationally equivalent theories. On the other hand, decision theory would be very limited in its applications if one were to insist that these conceptual frameworks be used only when there is no ambiguity regarding their interpretation.
It is therefore useful to insist that every conceptual framework be grounded in a falsifiable axiomatization, as if it were a specific theory, yet to use the framework also beyond the realm in which falsification is unambiguously defined.16

Moreover, many of the theoretical successes of utility theory and of expected utility theory involve situations in which one cannot reasonably hope to test their axioms. For instance, the First Welfare Theorem (Arrow and Debreu (1954)) provides a fundamental insight regarding the optimality of competitive equilibria, even though we cannot measure the utility functions of all individuals involved. Similarly, economic and game theoretic analysis that relies on expected utility maximization offers important insights and recommendations regarding financial markets, markets with incomplete information, and so forth, while neither subjective probabilities nor utility functions can practically be measured. Again, we believe that one should make sure that all theoretical terms are in-principle observable, but allow oneself some freedom in using these terms even when actual observations are impossible.

16 Historically, conceptual frameworks often originate from theories. A formal theory may be suggested with an implicit understanding of its intended applications, but it might later be generalized and evolve into a conceptual framework.

3.3 On the validity of purely behavioral economic theory

The official position of economic theory, as reflected in many graduate (and undergraduate) level courses and textbooks, is that economics is and should be purely behavioral. Economists do not care what people say, they only care what they do. Moreover, it is argued, people are often not aware of the way they make decisions, and their verbal introspective account of the decision process may be more confusing than helpful. As long as they behave as if they follow the theory, we should be satisfied.
One can hardly take issue with this approach if the underlying axioms are tested and proved true, or if the theory generates precise quantitative predictions that conform to reality. But economics often uses a theory such as expected utility maximization in cases where its axioms cannot be tested directly, and its predictions are at best qualitative. That is, theories are often used as rhetorical devices. In these cases, one cannot afford to ignore the intuitive appeal and cognitive plausibility of behavioral theories. The more cognitively plausible a theory is, the more effective a rhetorical device it will be.

Many economists would argue that the cognitive plausibility of a theory follows from that of its axioms. For instance, Savage’s behavioral axioms are very reasonable; thus it is very reasonable that people should and would behave as if they were expected utility maximizers. However, we claim that behavioral axioms that appear plausible when formulated in a given language may not be as convincing when this language is unnatural. For instance, Savage’s axiom P2 (often referred to as the “sure-thing principle”) is very compelling when acts are given as functions from states to outcomes. But in examples in which outcomes and states are not naturally given in the description of the problem, it is not clear what all the implications of the sure-thing principle are, namely, how preferences among acts are constrained by the sure-thing principle. It is therefore not at all obvious that actual behavior would follow this seemingly very compelling principle. More generally, the predictive validity of behavioral axioms is not divorced from the cognitive plausibility of the language in which they are formulated.

Moreover, when there is no obvious mapping between the reality modeled and the mathematical model, it is no longer clear that the theory is indeed behavioral. For example, Savage’s model assumes that choices between acts are observable.
This assumption is valid when one chooses between bets on, say, the color of a ball drawn from an urn. In this case it is quite clear that the states of the world correspond to the balls in the urn, and these states are observable by the modeler. But in many problems to which the theory is applied, states of the world need to be defined, and there is more than one way to define them. In these cases one cannot be sure that the states that are supposedly conceived of by the decision maker are those to which the formal model refers. The need to refer to states that exist in the decision maker’s mind renders the theory’s data cognitive rather than behavioral. As a matter of principle, a behavioral theory has to rely on data that are observable by an outside observer without ambiguity.

In conclusion, the position that cognitive plausibility of economic or choice theories is immaterial seems untenable. A behaviorist theory, treating a decision maker as a “black box”, has to be supported by a convincing cognitive specification. On the other hand, a seemingly compelling mental account of a decision making process is also unsatisfactory as long as we do not know what types of behavior it induces. Thus, a satisfactory theory of choice should tell a convincing story both about the cognitive process and about the resulting behavior.

3.4 What does all this have to do with CBDT?

The motivation for the development of case-based decision theory has to do with our dissatisfaction with classical decision theory. Specifically, we find that expected utility theory is not always cognitively plausible, and that the behavioral approach, often quoted as its theoretical foundation, is untenable. Moreover, we believe that expected utility theory is not always successful as a normative theory, because it is often impractical.
And in those cases where one cannot follow expected utility theory, it is neither irrational, nor even boundedly rational, to deviate from Savage’s postulates.

Yet, the theory we present in this book is independent of these prejudices. In particular, the following chapters may be read also by people who believe only in purely behavioral theories, or who are only willing to consider CBDT as a descriptive theory of bounded rationality.

Chapter 2. Decision Rules

4 Elementary Formula and Interpretations

Expected utility theory enjoys the status of an almost unrivaled paradigm for decision making in face of uncertainty. Relying on such sound foundations as the classical works of Ramsey (1931), de Finetti (1937), von Neumann and Morgenstern (1944), and Savage (1954), the theory has formidable power and elegance, whether interpreted positively or normatively, for situations of given probabilities (“risk”) or unknown ones (“uncertainty”) alike. While evidence has been accumulating that the theory is too restrictive (at least from a descriptive viewpoint), its various generalizations only attest to the strength and appeal of the expected utility paradigm. With few exceptions, all suggested alternatives retain the framework of the model, relaxing some of the more demanding axioms while adhering to the more basic ones. (See Machina (1987), Karni and Schmeidler (1991), Camerer and Weber (1992), and Harless and Camerer (1994) for extensive surveys.)

Yet it seems that in many situations of choice under uncertainty, the very language of expected utility models is inappropriate. For instance, states of the world are neither naturally given, nor can they be simply formulated. Furthermore, sometimes even a comprehensive list of all possible outcomes is not readily available or easily imagined. The following examples illustrate these points.
4.1 Motivating examples

Example 2.1 As a benchmark, we first consider Savage’s famous omelet problem (Savage 1954, pp. 13-15): Leonard Savage is making an omelet using six eggs. Five of them are already cracked into a bowl. He is holding the sixth, and has to decide whether to crack it directly into the bowl, or into a separate, clean dish to examine its freshness. This is a decision problem under uncertainty, because Leonard Savage does not know whether the egg is fresh or not. Moreover, uncertainty matters: if he knew that the egg were fresh, he would be better off cracking it directly into the bowl, saving the need to wash another dish. On the other hand, a rotten egg would result in losing the five eggs already in the bowl; thus, if he knew that the egg were not fresh, he would prefer to crack it into the clean dish.

In this example, uncertainty may be fully described by two states of the world: “the egg is fresh” and “the egg isn’t fresh”. Each of these states “resolves all uncertainty” as prescribed by Savage (1954). Not only are there relatively few relevant states of the world in this example, they are also naturally given in the description of the problem. In particular, they can be defined independently of the acts available to the decision maker. Furthermore, the possible outcomes can be easily defined. Thus this example falls neatly into decision making under uncertainty in Savage’s model.

Example 2.2 A couple has to hire a nanny for their child. The available acts are the various candidates for the job. The decision makers do not know how each candidate would perform if hired. For instance, each candidate may turn out to be negligent or dishonest. Coming to think about it, they realize that other problems may also occur. Some nannies treat children well, but cannot be trusted with keeping the house in order. Others appear to be
just perfect on the job, but are not very loyal and may quit the job on short notice. The couple is facing uncertainty regarding the candidates’ performance on several measures. However, there are several difficulties in fitting this problem into the framework of expected utility theory (EUT). First, imagining all possible outcomes is not a trivial task. Second, the states of the world do not naturally suggest themselves in this problem. Furthermore, should the decision makers try to construct them analytically, their number and complexity would be daunting: every state of the world should specify the exact performance of each candidate on each measure.¹

¹ It may be easier to assess distributions of utility for each candidate and to then apply EUT to these distributions than to spell out all the states of the world. This proves our point, namely, that the state model is unnatural in this example. It follows that one cannot use de Finetti’s or Savage’s axiomatic derivations of subjective probability as foundations for the assessment of distributions in such examples.

Example 2.3 President Clinton has to decide on military intervention in Bosnia-Herzegovina.² The alternative acts are relatively clear: one may do nothing, impose economic sanctions, use limited military force (say, only air strikes) or opt for a full-blown military intervention. The main problem is to decide what are the likely short-run and long-run outcomes of each act. For instance, it is not exactly clear how strong are the military forces of the warring factions in Bosnia; it is hard to judge how many casualties each military option would involve, and what would be the public opinion response; there is some uncertainty about the reaction of Russia, especially if it goes through a military coup. In short, the problem is definitely one of decision under uncertainty. But, again, neither all possible eventualities, nor all possible scenarios are readily
available. Any list of outcomes or of states is bound to be incomplete. Furthermore, each state of the world should specify the result of each act at each point of time. Thus, an exhaustive set of the states of the world certainly does not naturally pop up.

² U.S. military involvement in parts of what was Yugoslavia has been on the political agenda in the USA from the time we wrote the first version of our first paper on CBDT (March 1992) until now (September 2000).

In example 2.1, expected utility theory seems a reasonable description of how people think about the decision problem. By contrast, we argue that in examples such as 2.2 and 2.3, EUT does not describe a plausible cognitive process. Should the decision maker attempt to think in the language of EUT, she would have to imagine all possible outcomes and all relevant states. Often the definition of states of the world would involve conditional statements, attaching outcomes to acts. Not only would the number of states be huge, the states themselves would not be defined in an intuitive way.

Moreover, even if the agent managed to imagine all outcomes and states, her task would by no means be done. Next she would have to assess the utility of each outcome, and to form a prior over the state space. It is not clear how the utility and the prior are to be defined, especially since past experience appears to be of limited help in these examples. For instance, what is the probability that a particular candidate for the job in example 2.2 will end up being negligent? Or being both negligent and dishonest? Or, considering example 2.3, what are the chances that a military intervention will develop into a full-blown war, while air strikes will not? What is the probability that a scenario that no expert predicted will eventually materialize?

It seems unlikely that decision makers can answer these questions. Expected utility theory does not describe the way people actually think about such problems.
Correspondingly, it is doubtful that EUT is the most useful tool for predicting behavior in decision problems of this nature. A theory that provides a more faithful description of how people think would have a better chance of predicting what they will do.

How, then, do people think about such decision problems? We resort to Hume (1748), who argued that “From causes which appear similar we expect similar effects. This is the sum of all our experimental conclusions.” That is, the main reasoning technique that people use is drawing analogies between past cases and the one at hand.³

Applying this idea to decision making, we suggest that people choose acts based on their performance in similar problems in the past. For instance, in example 2.2, a common, and indeed very reasonable thing to do is to ask each candidate for references. Every recommendation letter provided by a candidate attests to his/her performance (as a nanny) in a different situation or “problem”. In this example, the agents do not rely on their own memory; rather, they draw on the experience of other employers. Each past “case” would be judged for its similarity; for instance, serving as a nanny to a month-old toddler is somewhat different from the same job when a two-year-old child is concerned. Similarly, the house, neighborhood and other factors may affect the relevance of past cases to the problem at hand. Thus we expect decision makers to put more weight on the experience of people whose decision problem was more similar to theirs. Furthermore, they may rely more heavily on the experience of people they happen to know, or judge to have similar tastes to their own.

Next consider example 2.3. While military and political experts certainly do try to write down possible scenarios and to assign likelihoods to them, this is by no means the only reasoning technique used.
(Nor is it necessarily the most compelling a-priori or the most successful a-posteriori.) Very often people’s reasoning employs analogies to past cases. For instance, proponents of military intervention tend to cite the Gulf War as a “successful” case. They stress the similarity of the two problems, say, as local conflicts in a post-cold-war world. Opponents adduce the Vietnam War as a case in which military intervention is generally considered to have been a mistake. They also point to the similarity of the cases, for instance to the “peace-keeping mission” mentioned in both.

³ We were first exposed to this idea as an explicit “theory” in the form of “case-based reasoning” (Schank (1986), Riesbeck and Schank (1989)), to which we owe the epithet “case-based”. (See also Kolodner and Riesbeck (1986) and Kolodner (1988).) Needless to say, our thinking about the problem was partly inspired by case-based reasoning. At this stage, however, there does not seem to be much in common between our theory and case-based reasoning, beyond Hume’s basic idea. It should be mentioned that similar ideas were also expressed in the economic literature by Keynes (1921), Selten (1978), and Cross (1983).

Case-based decision theory (CBDT) attempts to formally capture this mode of reasoning as it applies to decision making. In the next subsection we begin by describing the simplest version of CBDT. This version is admittedly rather restrictive. It will be generalized later on to encompass a wider class of phenomena. At this point we only wish to highlight some features of the general theory, which are best illustrated by the simplest version presented below.

4.2 Model

Assume that a set of problems is given as primitive, and that there is some measure of similarity on it. The problems are to be thought of as descriptions of choice situations, as stories involving decision problems.
Generally, a decision maker would remember some of the problems that she and other decision makers encountered in the past. When faced with a new problem, the similarity of the situation brings this memory to mind, and with it the recollection of the choice made and the outcome that resulted. We refer to the combination of these three, the problem, the act, and the result, as a “case”. Thus similar cases are recalled, and based on them each possible decision is evaluated. The specific model we propose here evaluates each act by the sum, over all cases in which it was chosen, of the product of the similarity of the problem to the one at hand and the resulting utility.

Formally, we start with three sets: let P be a set of decision problems, A a set of acts that may be chosen at the current problem, and R a set of possible outcomes. A case is a triple (q, a, r) where q is a problem, a is an act, and r is an outcome. Thus, the set of conceivable cases is the set of all such triples:

C ≡ P × A × R.

Observe that C is not the set of cases that actually have occurred. Moreover, it will typically be impossible for all cases in C to co-occur, because different cases in C may attach a different act or a different outcome to a given decision problem. The set of cases that are known to have occurred will thus be a subset of C.

The next two components of the formal model are similarity and utility functions. The similarity function

s : P × P → [0, 1]

is assumed to provide a quantification of similarity judgments between decision problems. The concept of similarity is the main engine of the decision models introduced in this book. Performing similarity judgments is the chief cognitive task of the decision maker we have in mind. Hume (1748) has already suggested similarity as a basis for inductive reasoning.
Formal models of similarity date back at least to Quine (1969b) and Tversky (1977),⁴ and much attention has been given to analogical reasoning. (See, for instance, Gick and Holyoak (1980, 1983), and Falkenhainer, Forbus, and Gentner (1989).) However, we are unaware of a formal theory of decision making that is explicitly based on similarity judgments. As is normally the case with theoretical concepts, the exact meaning of the similarity function will be given by the way in which it is employed. The way we use similarity between problems, to be specified shortly, implies uniqueness only up to multiplication by a positive number.

⁴ The classical notion of similarity, as employed in Euclidean geometry, suggests that similarity be modeled as an equivalence relation. Obviously, a real-valued similarity function can capture such a notion of similarity. But it allows similarity relations that are intransitive or asymmetric, and, importantly, it can also capture gradations of similarity.

The term “similarity” should not be taken too literally. Past decision problems affect the decision maker’s choice only if they are recalled. For instance, assume that a voter named John is asked to voice his view regarding military intervention in Bosnia. It is possible that, if asked, John would judge the Korean War to be similar to the situation in Bosnia. But it is also possible that John would not be reminded of the Korean War on his own. For our purposes, John may be described as attaching a zero similarity value to the two decision problems. Thus, while we use the suggestive term “similarity”, we think of this function as representing awareness and probability of recall as well as conscious similarity assessments.

Our formulation assumes that similarity judgments cannot be negative. This may sometimes be too restrictive. For instance, Sarah may know that she and Jim always disagree about movies.
If Sarah knows that Jim liked a movie, she will take it as a piece of evidence against watching this movie. This may be thought of as if Sarah’s problem has a negative similarity to a decision problem in which Jim was involved. However, for concreteness we choose to restrict the discussion to non-negative similarity values at this point.⁵

⁵ This restriction may pose a problem in our example. It can be solved by assuming that Sarah assigns negative utility values to outcomes that Jim likes. The utility function is introduced below.

This section restricts similarity judgments to decision problems. We later extend the formal model to include similarity judgments between pairs of problems and acts, and even between entire cases.

The decision maker is also characterized by a utility function:

u : R → ℝ.

The utility function measures the desirability of outcomes. The higher the value of this function, the more desirable the outcome is considered to be. Moreover, positive u-values can be associated with positive experiences, which the decision maker would like to repeat, whereas negative u-values correspond to negative experiences, which the decision maker would rather avoid. As in the case of the similarity function, the exact measurement of desirability of positive outcomes and of the undesirability of negative ones will be tied to the particular formula in which this function is employed.

We now proceed to specify the decision model. A decision maker is facing a problem p ∈ P. We assume that she knows the possible courses of action she might take. These are denoted by A. We do not address the question of specification of the acts that are, indeed, available in a given decision problem, or of identification of the decision problem in the first place.⁶ The decision maker can base her decision on the cases she knows of. We refer to the set consisting of these cases as the decision maker’s memory.
Formally, memory is a subset of cases: M ⊂ C. Whereas C is the set of all conceivable cases, M represents those that actually occurred, and that the decision maker was informed of. Cases in which the decision maker was the protagonist would naturally belong to M. But so will cases in which other people were the protagonists, if they were related to our decision maker, reported in the media, and so forth.

We assume that each decision problem q ∈ P may appear in M at most once. This will naturally be the case if the description of a decision problem is detailed enough. For instance, if the identity of the decision maker and the exact time of decision are part of the problem’s description, no problem would repeat itself precisely. Adopting this assumption allows us to use set notation and to obviate the need for indices. This notation involves no loss of generality, because a problem that does repeat itself in an identical fashion can be represented as several problems that happen to have the same features, though not the same formal description.

⁶ In Chapter 5, however, we discuss the way a decision maker develops plans. As such, it may be viewed as an attempt to model the process by which decision makers generate the set of acts available to them.

The most elementary version of CBDT suggests that the decision maker ranks available acts according to the similarity-weighted sum of the utilities they have yielded in the past. Formally, a decision maker with memory M, similarity function s, and utility function u, who now faces a new decision problem p, will rank each act a ∈ A according to

$$U(a) = U_{p,M}(a) = \sum_{(q,a,r) \in M} s(p,q)\,u(r) \qquad (*)$$

(where the summation over the empty set is taken to yield zero), and will choose an act that maximizes this sum. Observe that, given memory M, the function U only uses the u values of outcomes r that have appeared in M.
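As a minimal sketch of the elementary formula (∗), the following Python fragment computes U(a) as a similarity-weighted sum over the cases in memory in which act a was chosen, and then picks a maximizing act. The names `Case`, `cbdt_value` and `cbdt_choice`, the labels of problems, acts and outcomes, and all numerical similarity and utility values are assumptions made purely for illustration; none of them comes from the text.

```python
from typing import NamedTuple


class Case(NamedTuple):
    """A case (q, a, r): a past problem, the act chosen, and the outcome."""
    problem: str
    act: str
    outcome: str


def cbdt_value(act, problem, memory, similarity, utility):
    """U(a): sum of s(p, q) * u(r) over the cases (q, a, r) in memory.

    The sum over an empty set of cases is zero, as in the text.
    """
    return sum(
        similarity(problem, case.problem) * utility(case.outcome)
        for case in memory
        if case.act == act
    )


def cbdt_choice(acts, problem, memory, similarity, utility):
    """Return an act maximizing U; ties go to the first act listed (an assumption)."""
    return max(acts, key=lambda a: cbdt_value(a, problem, memory, similarity, utility))


# Hypothetical memory: act "a" was tried twice, act "b" once, act "c" never.
memory = [
    Case("q1", "a", "good"),
    Case("q2", "a", "bad"),
    Case("q3", "b", "good"),
]

# Hypothetical similarity of each past problem to the current problem "p".
similarity_to_p = {"q1": 0.5, "q2": 0.25, "q3": 0.125}
utility_of = {"good": 1.0, "bad": -1.0}


def s(p, q):
    return similarity_to_p[q]


def u(r):
    return utility_of[r]


values = {act: cbdt_value(act, "p", memory, s, u) for act in ("a", "b", "c")}
best = cbdt_choice(("a", "b", "c"), "p", memory, s, u)
```

With these made-up numbers, act "a" collects 0.5 · 1 + 0.25 · (−1) = 0.25, act "b" collects 0.125, and the untried act "c" is evaluated at zero, so "a" is chosen.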
For each act a ∈ A and each case c = (q, a, r) ∈ M in which act a was indeed chosen, one may view the product s(p, q)u(r) as the effect that case c has on the evaluation of act a in the present problem p. If act a has resulted in case c in a desirable outcome r, namely an outcome such that u(r) > 0, having case c in memory would make act a more attractive in the new problem as well. Similarly, recalling case c in which act a has resulted in an undesirable outcome r, that is, an outcome for which u(r) < 0, would render act a less attractive in the current decision problem. In both cases, the impact that case c would have on the way act a is viewed depends on the similarity of the problem at hand to the problem in case c. Finally, the overall evaluation of act a is obtained by summing up the products s(p, q)u(r) over all cases of the form (q, a, r), namely, over all cases in which a was chosen in the past.

Let us consider example 2.3 again. President Clinton has to choose among several acts, such as “send in troops”, “use air strikes”, “do nothing”, and so forth. The act “send in troops” would be evaluated based on its performance in the past. The Vietnam War would probably be a case in which this act yielded an undesirable outcome. The Gulf War is another relevant case, in which the same act resulted in a desirable outcome. One has to assess the degree of desirability or undesirability of each outcome, as well as the degree to which each past problem is similar to the problem at hand. One then multiplies the desirability values by the corresponding similarity values. The summation of these products will constitute the overall evaluation of the act, which may turn out to be positive or negative. An alternative act, “do nothing”, might be judged based on its performance in various cases where no intervention was chosen, and so forth.

The formula (∗) may be viewed as modeling learning.
It attempts to capture the way in which experience affects decision making, and it suggests a dynamic interpretation: with the addition of new cases, memory grows. As we will see in Chapter 7, adding cases to memory is but the simplest form of learning. Yet, it appears to be the basis for other forms of learning as well.

4.3 Aspirations and satisficing

The presentation of the elementary formula above highlights the distinction between desirable outcomes, whose utility values are positive, and undesirable ones, to which negative utility values are attached. While this distinction is rather intuitive, classical decision theory typically finds it redundant: higher utility values are considered more desirable, or, equivalently, less undesirable. All utility indices are used only in the context of pairwise comparisons, and their values have no meaning on any absolute scale. Indeed, most classical theories of decision allow the utility function to be “shifted” by adding a constant to all values, without changing the theory’s observable implications.

This is not the case with the elementary formula (∗). One may easily observe that shifting the utility function by a constant will generally result in a different prediction of choice. Indeed, shifting the function u by a constant c would change U(a) by $c \cdot \sum_{(q,a,r) \in M} s(p,q)$. Since there is no reason that $\sum_{(q,a,r) \in M} s(p,q)$ will be the same for different acts a, such a shift is not guaranteed to preserve the U-ranking of the acts. Specifically, if c > 0, such a shift would favor acts that were chosen in the past in similar problems, whereas a shift by c < 0 would favor acts that have not been chosen in similar problems.

It follows that the choice of the reference point zero on the utility scale cannot be arbitrary. This suggests that our intuitive distinction between desirable and undesirable outcomes might have a behavioral meaning as well.
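The effect of shifting the utility function can be checked on a small numerical case. The sketch below is illustrative only: the function name `shifted_U`, the representation of memory as (act, similarity, utility) triples, and every number in it are assumptions, not part of the model's presentation. It confirms that a shift by c changes U(a) by c times the total similarity attached to a, and that this can reverse the ranking of two acts.

```python
def shifted_U(act, memory, shift=0.0):
    """Formula (*) with every utility value shifted by the constant `shift`.

    `memory` is a list of (act, similarity_to_current_problem, utility) triples.
    """
    return sum(s * (u + shift) for a, s, u in memory if a == act)


def total_similarity(act, memory):
    """Sum of s(p, q) over the cases in which `act` was chosen."""
    return sum(s for a, s, _ in memory if a == act)


# Hypothetical memory: act "a" was chosen three times (utility 1 each time),
# act "b" once (utility 2); all past problems have similarity 1 to the new one.
memory = [
    ("a", 1.0, 1.0),
    ("a", 1.0, 1.0),
    ("a", 1.0, 1.0),
    ("b", 1.0, 2.0),
]

c = -2.0
before = {act: shifted_U(act, memory) for act in ("a", "b")}
after = {act: shifted_U(act, memory, shift=c) for act in ("a", "b")}
change = {act: after[act] - before[act] for act in ("a", "b")}
```

Here U(a) = 3 and U(b) = 2 before the shift, so a is ranked first. After shifting by c = −2 the values become −3 and 0, so b is ranked first: the negative shift penalizes a, the act with the larger total similarity, three times as heavily as b.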
Indeed, consider a simple case in which there is no uncertainty. That is, whenever an act a ∈ A is chosen, a particular outcome r_a ∈ R results. The decision maker is not assumed to know this fact. Moreover, she certainly does not know what utility value corresponds to which act. All she knows are the cases she has experienced, and she follows U-maximization as in (∗). Starting out with an empty memory, all acts are assigned a zero U-value. At this point the decision maker’s choice is arbitrary. Assume that she chose an act a that resulted in an undesirable outcome r_a, that is, an outcome with u(r_a) < 0. Let us now consider the same decision maker confronted with the next problem. Suppose that this problem bears some similarity to the first one. In this case, act a will have a negative U-value (u(r_a) multiplied by the similarity of the two problems), whereas all other acts, which have empty histories, still have a U-value of zero. Thus the one act that has been tried, and that has resulted in an undesirable outcome, is the only one the decision maker has some information about, and she tries to veer away from this act. Her choice among the other ones will be arbitrary again. Suppose she chooses act b. If u(r_b) < 0, then, in the next similar problem, she will find that both a and b are unattractive acts, and will choose (in an arbitrary fashion) among the other ones, which have not yet been tried.

This process may continue until the decision maker has a negative experience with each and every possible act. In this situation she would have to choose the lesser of these evils and opt for an act whose U-value is highest, that is, least negative. Assume, however, that the decision maker finds an act, say, d, that leads to a desirable outcome, namely, an act such that u(r_d) > 0. In this case, in the next similar problem act d would have a positive U-value, whereas all other acts would have non-positive U-values.
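The process just described can be simulated in a few lines. This is a rough sketch under stated assumptions: each act always yields the same utility, every pair of problems has similarity 1, and an "arbitrary" choice among equally ranked acts is modeled as taking the first such act in listing order. The function name `simulate_satisficing` and the utility numbers are hypothetical.

```python
def simulate_satisficing(act_utilities, rounds, similarity=1.0):
    """Repeatedly choose a U-maximizing act, starting from an empty memory.

    `act_utilities` maps each act to the fixed utility of its outcome
    (no uncertainty). Returns the sequence of acts chosen.
    """
    acts = list(act_utilities)
    memory = []   # (act, utility) pairs experienced so far
    history = []

    for _ in range(rounds):
        def U(act):
            # Formula (*) with a constant similarity between all problems.
            return sum(similarity * u for a, u in memory if a == act)

        # max() returns the first maximal act, which stands in for the
        # arbitrary choice among acts with equal U-values.
        chosen = max(acts, key=U)
        memory.append((chosen, act_utilities[chosen]))
        history.append(chosen)

    return history


# Hypothetical acts: "a" and "b" are undesirable, "d" is mildly desirable,
# and "e" would be the best act but is listed last.
history = simulate_satisficing({"a": -1.0, "b": -2.0, "d": 1.0, "e": 5.0}, rounds=6)
```

In this run the decision maker tries a, then b, then d, and from then on repeats d. Act e is never tried even though its outcome has the highest utility, because d already yields a positive U-value while e remains at zero.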
Thus d will get to be chosen again. Since act d always leads to the same desirable outcome r_d, its U-value will always remain positive, and d will always be chosen. The decision maker will never attempt any act other than d, despite the fact that many other acts might not have been tried even once.

One might view this mode of behavior as satisficing in the sense of Simon (1957) and March and Simon (1958). Taking the value zero on the utility scale to be the decision maker’s aspiration level, the decision maker would cling to an act that achieves this value without attempting other acts and without experimentation. It is only when the decision maker’s current choice is evaluated below the aspiration level that the decision maker is “unsatisficed” and is prodded to experiment with other options.

Aspiration levels are likely to change as a result of past experiences.⁷ One might therefore wish to explicitly introduce notation that keeps track of these changes. Let u_M denote the utility function given memory M, restricted to outcomes that have appeared in M. Assume that (for all M and M′) u_M and u_{M′} differ by an additive constant on the intersection of their domains. We may then choose a utility function û and define the aspiration level given memory M, H_M, such that u_M = û − H_M. In this case, we can reformulate (∗) as follows. The decision maker is maximizing

$$U(a) = U_{p,M}(a) = \sum_{(q,a,r) \in M} s(p,q)\,[\hat{u}(r) - H_M] . \qquad (*')$$

⁷ Indeed, experience may and generally does shape both similarity judgments and desirability evaluations in more general ways. At this point we focus only on shifts of the utility function.

To see the role of the aspiration level, consider the following example. Suppose that act a was chosen 10 times, yielding a payoff (û) of 1 each time. Act b, by contrast, was only tried twice, yielding a payoff of 4 each time. Assume that all past problems are equally similar to the one at hand.
If the aspiration level is zero, one may equate the utility u with the payoff û, and thus the U-value of a exceeds that of b. Next, assume that, with the same payoff function û, the aspiration level is 2 rather than 0. This makes all the outcomes of act a undesirable, whereas those corresponding to b are still desirable. Hence, b will be preferred to a.

Consider now a dynamic process of aspiration level adjustment. Assume, as above, that a decision maker starts out with an aspiration level of 0, and considers all outcomes of both acts a and b to be desirable. Having twice observed that 4 is a possible payoff, however, the decision maker considers the payoff 1 to be less satisfactory than she used to. In other words, her aspiration level rises. As we saw above, a large enough increase in the aspiration level would render b more desirable than a according to (∗′).

In general, the function U, being cumulative in nature, may give preference to past choices that yielded desirable outcomes simply because they were chosen many times. Among those acts that, overall, yielded desirable outcomes, U-maximization does not opt for the highest average performance. It may even exhibit habit formation. But this conservative nature of U-maximization is mitigated by the process of aspiration level adjustment. We devote Section 18 to this issue, and show that reasonable assumptions on the aspiration level adjustment process lead to optimal choices in situations where the same problem is encountered in exactly the same way.

4.4 Comparison with EUT

In CBDT, as in EUT, acts are ranked by weighted sums of utilities. Indeed, the formula (∗) so resembles that of expected utility theory that one may suspect CBDT to be no more than EUT in a different guise. However, despite appearances, the two theories have little in common.

First, note some mathematical differences between the formulae. In (∗) there is no reason for the coefficients s(p, ·) to add up to 1 or to any other constant.
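That the coefficients need not sum to a common constant is visible in the aspiration-level example of Section 4.3, where act a was chosen 10 times with payoff 1 and act b twice with payoff 4. The sketch below evaluates (∗′) on those numbers, under the additional assumption that every similarity value equals 1; the function name `U_with_aspiration` is hypothetical.

```python
def U_with_aspiration(payoffs, aspiration, similarity=1.0):
    """Formula (*') for one act: sum of s * (payoff - H) over its past cases.

    `payoffs` lists the payoff obtained each time the act was chosen;
    all similarities are assumed equal to `similarity`.
    """
    return sum(similarity * (payoff - aspiration) for payoff in payoffs)


payoffs_a = [1.0] * 10   # act a: chosen 10 times, payoff 1 each time
payoffs_b = [4.0] * 2    # act b: chosen twice, payoff 4 each time

# Total similarity weight per act (here simply the number of past cases).
weight_a = len(payoffs_a) * 1.0
weight_b = len(payoffs_b) * 1.0

U_a_at_0 = U_with_aspiration(payoffs_a, aspiration=0.0)
U_b_at_0 = U_with_aspiration(payoffs_b, aspiration=0.0)
U_a_at_2 = U_with_aspiration(payoffs_a, aspiration=2.0)
U_b_at_2 = U_with_aspiration(payoffs_b, aspiration=2.0)
```

The weights are 10 for a and 2 for b, so they share no normalizing constant. With aspiration level 0 the values are 10 and 8 and a is preferred; with aspiration level 2 they are −10 and 4 and b is preferred. With these particular numbers the two acts tie at an aspiration level of 0.25, since 10(1 − H) = 2(4 − H) exactly when H = 1/4.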
More importantly, while in EUT every act is evaluated at every state, in U-maximization each act is evaluated over a different set of cases. To be precise, if a ≠ b, the set of elements of M summed over in U(a) is disjoint from that corresponding to U(b). In particular, this set may well be empty for some a’s.

On a more conceptual level, in expected utility theory the set of states is assumed to be an exhaustive list of all possible scenarios. Each state “resolves all uncertainty”, and, in particular, attaches a result to each available act. By contrast, in case-based decision theory the memory contains only those cases that actually happened. Correspondingly, the utility values that are used in (∗) are only those that were actually experienced. To apply EUT one needs to engage in hypothetical reasoning, namely to consider all possible states and the outcome that would result from each act in each state. To apply CBDT, no hypothetical reasoning is required.

As opposed to expected utility theory, CBDT does not distinguish between certain and uncertain acts. In hindsight, a decision maker may observe that a particular act always resulted in the same outcome (i.e., that it seems to involve no uncertainty), or that it is “uncertain” in the sense that it resulted in different outcomes in similar problems. But the decision maker is not assumed to know a-priori which acts involve uncertainty and which do not. Indeed, she is not assumed to know anything about the outside world, apart from past cases.

CBDT and EUT also differ in the way they treat new information and evolve over time. In EUT new information is modeled as an event (a subset of states) that has obtained. The model is restricted to this event and the probability is updated according to Bayes’ rule. By contrast, in CBDT new information is modeled primarily by adding cases to memory.
In the basic model, the similarity function calls for no update in face of new information. Thus, EUT implicitly assumes that the decision maker was born with knowledge of and beliefs over all possible scenarios, and her learning consists of ruling out scenarios that are no longer possible. On the other hand, according to CBDT the decision maker was born completely ignorant, and she learns by expanding her memory.⁸ Roughly, an EUT agent learns by observing what cannot happen, whereas a CBDT agent learns by observing what can.

We do not consider case-based decision theory to be better than or a substitute for expected utility theory. Rather, we view the two theories as complementary. While these theories are justified by supposedly behavioral axiomatizations, we believe that their scope of applicability may be more accurately delineated if we attempt to judge the psychological plausibility of the various constructs. Two related criteria for classification of decision problems may be relevant. One is the problem’s description, the second is its relative novelty.

If a problem is formulated in terms of probabilities, EUT is certainly a natural choice for analysis and prediction. Similarly, when states of the world are naturally defined, it is likely that they will be used in the decision maker’s reasoning process, even if a (single, additive) prior cannot be easily formed. However, when neither probabilities nor states of the world are salient (or easily accessible) features of the problem, CBDT may be more plausible than EUT. We may thus refine Knight’s dichotomous distinction between risk and uncertainty (Knight (1921)) by introducing a third category of structural ignorance: “risk” refers to situations where probabilities are given; “uncertainty” refers to situations in which states are naturally defined, or can be simply constructed, but probabilities are not.
Finally, decision under “structural ignorance” refers to decision problems for which states are neither (i) naturally given in the problem nor (ii) easily constructed by the decision maker. EUT is appropriate for decision making under risk. In the face of uncertainty (and in the absence of a subjective prior) one may still use those generalizations of EUT that were developed to deal with this problem specifically, such as non-additive probabilities (Schmeidler (1989)⁹) and multiple priors (Bewley (1986)¹⁰, Gilboa and Schmeidler (1989)). However, in cases of structural ignorance CBDT is a viable alternative to the EUT paradigm.

⁸ In Chapter 7 we discuss other forms of learning, including changes of the similarity judgments.

Classifying problems based on their novelty, one may consider three categories. When a problem is repeated frequently enough, such as whether to stop at a red traffic light, the decision becomes almost automated. Very little thinking is involved in making such a decision, and it may be best described as rule-based. When deliberation is required, but the problem is familiar enough, such as whether to buy insurance, it can be analyzed in isolation. In these situations the history of the same problem suffices for the formulation of states of the world and perhaps even of a prior, and EUT (or some generalization thereof) may be cognitively plausible. Finally, if the problem is unfamiliar, such as whether to get married or to invest in a politically unstable country, it needs to be analyzed in a context- or memory-dependent fashion, and CBDT is a more accurate description of the way decisions are made.
Thus, rule-based systems are the simplest description of decision making when the decision maker is not aware of uncertainty or even of the fact that she is making a decision.¹¹ Expected utility theory offers the best description of decisions in which uncertainty is present and the decision maker has enough data to analyze it. By contrast, case-based decision theory is probably the most natural description of decision making when the decision problem is amorphous and there are insufficient data to analyze it properly.

We defer a more detailed discussion of CBDT and EUT until Chapter 4, following the axiomatic derivation of CBDT in Chapter 3.

⁹ See also Gilboa (1987) and Wakker (1989).
¹⁰ See also Aumann (1962).
¹¹ Such decisions may also be viewed as a special type of case-based decisions. Specifically, a rule can be thought of as a summary of many cases, from which it was probably derived in the first place. See Section 13 for a more detailed discussion.

4.5 Comments

Subjective Similarity. As will be shown in Chapter 3, the similarity function in our model may be derived from observed preferences. Since different people may have different preferences, we should expect that the derived similarity functions would also differ across individuals. The similarity function is therefore subjective, as is probability in the works of de Finetti (1937) and Savage (1954). Yet, for some applications one may find that the data suggest an agreed-upon way to measure similarity, which may be viewed as objective.

Cumulative Utility. In the description of CBDT above, we advance a certain cognitive interpretation of the function u. We assume that it represents fixed preferences, and that memory may affect choices only by providing information about the u-value of the outcomes that acts yielded in the past. However, one may suggest that memory has a direct effect on preferences.
According to this interpretation, the utility function is the aggregate U, while the function u describes the way in which U changes with experience. For instance, if the decision maker has a high aspiration level, corresponding to negative u values, she will like an option less the more she has used it in the past, and will exhibit change-seeking behavior. On the other hand, a low aspiration level, corresponding to positive u values, would make her evaluate an option more highly the more familiar she is with it, and would result in habit formation. Chapter 6 is devoted to this interpretation.

Hypothetical Cases. Consider the following example. Jane has to drive to the airport and she can choose between road A and road B. She chooses road A and arrives at the airport on time. On the way to the airport, however, she learns that road B was closed for construction. A week later Jane is faced with the same problem. Regardless of her aspiration level, it seems obvious that she will choose road A again. (Road constructions, at least in psychologically plausible models, never end.) This example shows that relevant cases may also be hypothetical, or counterfactual. More explicitly, Jane’s reasoning probably involves a counterfactual proposition such as “Had I taken road B, I would never have made it.” This may be modeled as a hypothetical case in which she took road B and arrived at the airport with a considerable delay.

Hypothetical cases may endow a case-based decision maker with reasoning abilities she would otherwise lack. Moreover, it seems that any knowledge the agent possesses and any conclusions she deduces from it can, inasmuch as they are relevant to the decision at hand, be reflected by hypothetical cases.

5 Variations and Generalizations

5.1 Average similarity

The previous section introduced the basic decision criterion of CBDT, namely maximization of the function

\[
U(a) = U_{p,M}(a) = \sum_{(q,a,r) \in M} s(p,q)\, u(r). \tag{$\ast$}
\]

This function is cumulative, summing up the impact of past cases. Consequently, the number of times a certain act was chosen in the past affects its perceived desirability. As mentioned above, according to the function U, an act a that was chosen relatively many times, yielding low but positive payoffs, may be preferred to an act b that was chosen less frequently, yielding consistently high payoffs. Moreover, this is true even if both acts were chosen frequently enough to allow statistical estimation of their mean payoffs, or even if no uncertainty is present.

We have argued that in the presence of high and low payoffs, the aspiration level is likely to take an intermediate value, making the low payoffs appear negative. While this may make a U-maximizer prefer b over a in the example above, one may still wonder whether the function U is the most reasonable way to evaluate alternatives. An obvious variation would be to use a similarity-based “average” utility function, namely

\[
V(a) = \sum_{(q,a,r) \in M} s'(p,q)\, u(r),
\]

where

\[
s'(p,q) =
\begin{cases}
\dfrac{s(p,q)}{\sum_{(q',a,r) \in M} s(p,q')} & \text{if } \sum_{(q',a,r) \in M} s(p,q') > 0, \\[2ex]
0 & \text{otherwise.}
\end{cases}
\]

V-maximization may be viewed as a way to formalize the idea of frequentist belief formation (insofar as it is reflected in behavior). Although beliefs and probabilities do not explicitly exist in this model, in some cases they may be implicitly inferred from the weights s′. That is, if the decision maker happens to choose the same act in many similar cases, the evaluation function V may be interpreted as gathering statistical data, or as forming a “frequentist” prior. Observe that CBDT does not presuppose any a priori beliefs. Actual cases generate statistics, but no beliefs are assumed in the absence of data.

V-maximization is more plausible than U-maximization if very similar problems are encountered over and over again. But when memory consists only of remotely related problems, V-maximization may not be as convincing.
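The contrast between the cumulative index U and the average index V can be made concrete with a small numerical sketch. The Python fragment below is only an illustration under stated assumptions: the `Case` tuple, the function names, the constant similarity function, and the payoff numbers are all hypothetical and are not taken from the text. For brevity each case stores u(r) directly rather than the outcome r.

```python
from collections import namedtuple

# A past case (q, a, r). Assumption: we store u(r) instead of the outcome r.
Case = namedtuple("Case", ["problem", "act", "utility"])


def cumulative_U(act, p, memory, sim):
    """U(a): sum of s(p, q) * u(r) over the cases in which `act` was chosen."""
    return sum(sim(p, c.problem) * c.utility for c in memory if c.act == act)


def average_V(act, p, memory, sim):
    """V(a): similarity-weighted average of u(r); 0 if total similarity is 0."""
    relevant = [c for c in memory if c.act == act]
    total = sum(sim(p, c.problem) for c in relevant)
    if total <= 0:
        return 0.0
    return sum(sim(p, c.problem) * c.utility for c in relevant) / total


def constant_sim(p, q):
    """Hypothetical similarity: every past problem is fully similar to p."""
    return 1.0


# Act "a" was chosen ten times with a low positive payoff,
# act "b" only twice with a high payoff (illustrative numbers).
memory = [Case("q", "a", 1.0)] * 10 + [Case("q", "b", 4.0)] * 2

U_scores = {x: cumulative_U(x, "p", memory, constant_sim) for x in ("a", "b")}
V_scores = {x: average_V(x, "p", memory, constant_sim) for x in ("a", "b")}
```

With these hypothetical numbers U ranks the frequently chosen act a above b (10 versus 8), whereas V ranks b above a (4 versus 1), which is the pattern described in the text.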
In particular, V-maximization involves a discontinuity at zero total similarity. For instance, if there is but one case in memory, (q, a, r), with problem similarity s(p, q) = ε, then V(a) = u(r) for every ε > 0, but V(a) = 0 for ε = 0. One may view both U-maximization and V-maximization as rough approximations, whose appropriateness depends on the range of the similarity function.

In Section 18 we assume that the aspiration level changes over time, mostly as a result of past experiences. We show that, under certain assumptions, U-maximization, but not V-maximization, leads to asymptotically optimal choice in a stochastically repeated decision problem. In view of this result, we tend to view U-maximization as the primary formula, despite the fact that it seems inappropriate for repeated problems with a constant aspiration level.

5.2 Act similarity

While it stands to reason that past performance of an act would affect the act’s evaluation in current problems, it is not necessarily the case that past performance is the only relevant factor in the evaluation process. Specifically, an act’s desirability may be affected by the performance of other, similar acts. For instance, suppose that Ann is looking for an apartment to rent. One of her options is to rent apartment A. Ann hasn’t lived in apartment A before. Thus, she has never chosen the particular act now available to her. But Ann has lived in the past in apartment B, which is located in the same neighborhood as A. The act “renting apartment A” is similar to the act “renting apartment B” at least in that the two apartments are in the same neighborhood. It seems unavoidable that Ann’s experience with the similar act will affect her evaluation of the act she is now considering. Similarly, suppose that Bob tries to decide whether or not to buy a new product in the supermarket.
He has never purchased this product in the past, but he has consumed similar products by the same producer. Thus the particular act that Bob is now evaluating has no history. But Bob’s memory contains cases in which he chose other acts that are similar to the one he now considers. Again, we would expect Bob’s decision to depend on his experience with similar acts.

A decision maker is often faced with new acts, that is, acts about whose past performance her memory contains no information. According to formula (∗), the evaluation index attached to these acts is the default value zero. As in the examples above, this application of CBDT is not very plausible. Correspondingly, it may lead to counter-intuitive behavioral predictions. For instance, it would suggest that Ann will be as likely to rent the apartment in the neighborhood she knows as an apartment in a neighborhood she does not know. Similarly, Bob will be predicted to buy the new product based on his aspiration level alone, without distinguishing among products by the record of their producers. In reality, however, decision makers are reminded of cases involving similar acts, and expect similar acts, chosen in similar problems, to result in similar outcomes. Thus we need to extend the model to reflect act similarity and not just problem similarity.

Act similarity effects are especially pronounced in economic problems involving a continuous parameter. For instance, the decision whether or not to “Offer to sell at price p”, for a specific value p, will likely be affected by the results of previous sell offers at different but close values of p. Generally, if there are infinitely many acts available to the decision maker, it is always the case that most of them are new to her. However, she will typically infer something about these new acts from the performance of other acts she has actually tried.
While a straightforward application of CBDT to economic models with an infinite set of acts may result in counter-intuitive and unrealistic predictions, the introduction of act similarity may improve these predictions.

Observe that act similarity effects are not restricted to the evaluation of new acts. Even if an act was chosen in the past in a similar problem, its evaluation is likely to be colored by the performance of similar acts.

The need for modeling act similarity may sometimes be obviated by redefining “acts” and “problems”. For instance, Ann’s acts may be simply “To Buy” and “Not to Buy”, where each possible purchase is modeled as a separate decision problem. However, such a model is hardly very intuitive, especially when many acts are considered simultaneously. It is more natural to explicitly model a similarity function between acts.

Moreover, in many cases the similarity function is most naturally defined on problem-act pairs. For example, “Driving on the left in New York” may be more similar to “Driving on the right in London” than to “Driving on the left in London”, “Buying when the price is low” may be more similar to “Selling when the price is high” than to “Selling when the price is low”, and so forth.

In short, we would like to have a model in which the similarity function s is defined on problem-act pairs, and, given a memory M and a decision problem p, each act a is evaluated according to the weighted sum

\[
U'(a) = U'_{p,M}(a) = \sum_{(q,b,r) \in M} s((p,a),(q,b))\, u(r). \tag{$\bullet$}
\]

Observe that a case (q, b, r) in memory may be viewed as a pair ((q, b), r), where the problem-act pair (q, b) is a single entity, describing the circumstances from which the outcome r resulted. That is, when past cases are considered, the distinction between problems and acts is immaterial. Indeed, it may also be fuzzy in the decision maker’s memory.
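To see how similarity over problem-act pairs lets a new act inherit an evaluation, the weighted sum (•) can be sketched in a few lines of Python. Everything specific in the fragment is an assumption made for illustration: the apartment and neighborhood labels, the numerical similarity values, the utility of the outcome, and in particular the product form of `pair_sim` (problem similarity times act similarity), which is only one simple choice. As the driving example in the text indicates, similarity over pairs need not factor in this way.

```python
def U_prime(act, p, memory, pair_sim, u):
    """U'(a): sum over all cases (q, b, r) of s((p, a), (q, b)) * u(r)."""
    return sum(pair_sim((p, act), (q, b)) * u(r) for (q, b, r) in memory)


# Hypothetical data: which neighborhood each apartment is in.
NEIGHBORHOOD = {
    "apartment A": "north",
    "apartment B": "north",
    "apartment C": "south",
}


def pair_sim(current, past):
    """Assumed similarity on problem-act pairs, in product form."""
    (p, a), (q, b) = current, past
    problem_sim = 1.0 if p == q else 0.0
    if a == b:
        act_sim = 1.0
    elif NEIGHBORHOOD[a] == NEIGHBORHOOD[b]:
        act_sim = 0.5  # same neighborhood, different apartment
    else:
        act_sim = 0.0
    return problem_sim * act_sim


def u(outcome):
    """Assumed utility of the single outcome Ann has experienced."""
    return {"pleasant stay": 5.0}[outcome]


# Ann has lived only in apartment B; A and C are new acts.
memory = [("rent", "apartment B", "pleasant stay")]
scores = {a: U_prime(a, "rent", memory, pair_sim, u)
          for a in ("apartment A", "apartment C")}
```

Under formula (∗) both new apartments would receive the default value zero. In this sketch apartment A, which shares a neighborhood with B, is evaluated at 2.5, while apartment C remains at zero.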
When currently available acts are evaluated, by contrast, the distinction between problems and acts is both clearer and more important: the problem refers to the given circumstances, which are not under the decision maker’s control, whereas the various acts describe alternative choices.

5.3 Case similarity

It is sometimes convenient to think of a further generalization, in which the similarity function is defined over entire cases, rather than over problem-act pairs alone. According to this view, the decision maker may realize that a similar act in a similar problem may lead to a correspondingly similar (rather than identical) result. For instance, assume that John has used automatic vending machines for soft drinks in the past, and that he now faces a vending machine for sandwiches for the first time. There is evident similarity between pushing a button on this machine and pushing buttons on the soft drink machines he has used. Based on this similarity, John expects that pushing a button will result in receiving the product shown in the picture that is adjacent to the button. That is, he expects a sandwich to drop out of the sandwich machine in spite of the fact that he has never seen a sandwich dropping out of any machine before. Indeed, he has never seen such a machine either. Given the similarity of the problem and the act to previous problem-act pairs, John expects a correspondingly similar outcome to result. He probably considers such an outcome to be more likely than any of the outcomes he has experienced.

While one may attempt to fit this type of reasoning into the framework of U′-maximization by a re-definition of the results, it is probably most natural to assume a similarity function that is defined over whole cases. Thus, the case (soft drink machine, push button A, get drink A) is more similar to the case (sandwich machine, push button B, get sandwich B) than to (sandwich machine, push button B, get drink A).
If we assume that the decision maker can imagine the utility of every outcome (even if it has not been actually experienced in the past), we are naturally led to the following generalization of CBDT:

\[
U''(a) = U''_{p,M}(a) = \sum_{r \in R} \sum_{(q,b,t) \in M} s((p,a,r),(q,b,t))\, u(r). \tag{$\bullet\bullet$}
\]

In this formula every outcome r is considered as a possible result of act a in problem p, and the weight of the utility of outcome r is the sum of similarity values of the present case, should act a be chosen and outcome r result in it, to past cases.

Observe that case similarity can also capture asymmetric inferences.¹² For instance, consider a seller of a product, who posts a price and observes an outcome out of {sale, no-sale}. Consider two acts, offer to sell at $10 and offer to sell at $12. Should the seller fail to sell at $10, she can safely assume that asking $12 would also result in no-sale. By contrast, selling at $10 provides but indirect evidence regarding the result of offer to sell at $12. Denoting two generic decision problems (say, days) by p and q, the similarity between (p, offer to sell at $10, no-sale) and (q, offer to sell at $12, no-sale) is higher than that between (q, offer to sell at $12, no-sale) and (p, offer to sell at $10, no-sale).

¹² This point is due to Ilan Eshel.

6 CBDT as a Behaviorist Theory

6.1 W-Maximization

Our focus so far has been on CBDT as a theory that attempts to describe an actual mental process. It makes explicit reference to cognitive terms such as “utility” and “similarity”, and it has a claim to cognitive plausibility. But the reference to cases, representing factual knowledge on the one hand, and to choices on the other also allows a behaviorist version of CBDT. Such a theory would treat cases that actually happened as stimuli, or input, and decisions as responses, or output, without specifying the mental process by which cases affect decisions.
The behaviorist case-based decision theory we suggest here retains the additivity of the theories presented above, but abstracts away from the structure of cases and the related similarity and utility functions. Specifically, we assume that cases are abstract entities. For each case c and each act a we assume that there is a number w_p(a, c) that summarizes the degree to which case c supports the choice of a in the given problem p. Further, we assume that these support indices are accumulated additively, that is, that given a set of cases M, the decision maker will choose an act that maximizes

\[
W(a) = W_{p,M}(a) = \sum_{c \in M} w_p(a,c) \tag{$\circ$}
\]

over a ∈ A.

In the behaviorist interpretation, M is the set of cases to which the decision maker is exposed when faced with the decision problem p. For instance, it may represent the collection of stories that were published in the media available to her. Memory, the set of cases, is the input, or the stimulus. The act chosen is the output, or the response. The theory of W-maximization makes no reference to any cognitive constructs such as probability, similarity, or utility. It does not try to capture wants or needs, beliefs or evaluations. It merely lumps together the impact of stimulus c on potential response a by a number w_p(a, c), and argues that the impact of several stimuli is additive.

In contrast to the theories described in previous sections, this theory allows cases to be completely abstract, in the sense that they do not have to consist of a decision problem, an act, and an outcome. Cases here need not have any formal relationship to the acts among which the decision maker has to choose. Hence the interpretation of a “case” may go beyond an explicit story of a previous decision. For instance, suppose that Jane has to decide whether to go on a trip to the country. A relevant case might be the fact that it rained the day before.
This case specifies neither an act that was chosen nor an outcome that was experienced. It is still possible, and indeed likely, that it will influence Jane’s choice of an act in the current decision problem. Similarly, a stock market crash in 1929 may be a case that affects an investment decision of a person who was born after this case occurred.¹³

It will be convenient (and essential for the axiomatic derivation) to allow cases in memory to appear more than once.¹⁴ It is then natural to define a memory to be a function I : M → Z+ (where Z+ stands for the non-negative integers) that counts occurrences of cases in M. In this formulation, CBDT suggests that the decision maker will choose an act that maximizes

\[
W(a) = W_{p,I}(a) = \sum_{c \in M} I(c)\, w_p(a,c) \tag{$\diamond$}
\]

over a ∈ A.

¹³ As argued above, all the abstract cases discussed here, to the extent that they might affect decisions, can be translated into the language of problems, acts, and outcomes as hypothetical cases. Such a translation, however, constitutes a cognitive task. The present formulation is consistent with a purely behaviorist approach.
¹⁴ Alternatively, one may assume that each case is unique, but that two cases may be equivalent in the eyes of the decision maker, as far as problem p is concerned. One can then require that for each case there be infinitely many other cases that are equivalent to it in this sense.

6.2 Cognitive Specification: EUT

In the next chapter we axiomatize the behaviorist theory of W-maximization. While we believe that the axioms suggested therein may convince the reader that our theory is plausible, both descriptively and normatively, we wish to support it also by a cognitive account. Descriptively, our belief in the theory as a valid description of actual decision making will be enhanced if we can describe a mental process that implements it.
From a normative viewpoint, such a process is essential to make the theory a practical recommendation for decision making. In Section 2 we refer to such a mental process as a “cognitive specification” of the theory.

A behaviorist theory may have more than one cognitive specification. While we view W-maximization as the behaviorist manifestation of mental processes involving similarity and utility judgments, it is also true that expected utility theory is a cognitive specification of W-maximization. To see this, assume that the decision maker conceives of a set of states of the world Ω. Assume further that she forms beliefs µ over Ω as follows. Each case c induces a probability measure µ_c over Ω, such that, given memory I, the decision maker forms beliefs¹⁵

\[
\mu = \sum_{c \in M} I(c)\, \mu_c .
\]

Should this decision maker maximize the expected value of a utility function u : A × Ω → R with respect to µ, she may be viewed as maximizing W in (⋄) with

\[
w_p(a,c) = \int_{\Omega} u(a,\omega)\, d\mu_c(\omega).
\]

¹⁵ Observe that, in order to be a probability measure, µ may have to be normalized (separately for each memory I). The process of generation of subjective beliefs in this way is axiomatized in Gilboa-Schmeidler (2000b).

6.3 Cognitive Specification: CBDT

We have seen that expected utility theory with case-based probabilities is a possible cognitive specification of W-maximization. Naturally, the case-based decision rules of U-, U′-, and U′′-maximization discussed above are also cognitive specifications of the same theory. Specifically, assume, as in the previous sections, that all cases c are triples of the form (q, b, r), namely, that each case describes an act chosen in a decision problem and an outcome that resulted from it. The theory of U-maximization corresponds to the following definition of the support weights w_p(a, c):

\[
w_p(a,c) = w_p(a,(q,b,r)) =
\begin{cases}
s(q,p)\, u(r) & \text{if } a = b, \\
0 & \text{otherwise.}
\end{cases}
\]

That is, U-maximization tells a story about a mental process in which each act is evaluated based only on those cases in which it was chosen, and for such a case c = (q, a, r), w_p(a, c) = w_p(a, (q, a, r)) is the product of the similarity of the problems and the utility of the outcome.

Similarly, U′-maximization would result from defining

\[
w_p(a,c) = w_p(a,(q,b,r)) = s((p,a),(q,b))\, u(r).
\]

In other words, the cognitive account of w_p(a, c) suggested by U′-maximization involves the separation of similarity and utility, where similarity is defined for problem-act pairs.

Finally, U′′-maximization may be viewed as W-maximization where one defines

\[
w_p(a,c) = w_p(a,(q,b,r)) = \sum_{t \in R} s((p,a,t),(q,b,r))\, u(r).
\]

Note that V-maximization is not a special case of W-maximization. To fit the W function, one may define

\[
w^M_p(a,c) = w^M_p(a,(q,b,r)) = \frac{s(p,q)}{\sum_{(q',a,r) \in M} s(p,q')}\, u(r)
\]

if a = b and if the denominator is positive, and w^M_p(a, c) = 0 otherwise. But in this case w^M_p(a, c) depends on the entire memory M, and not only on the case c in it. Formally, in V-maximization memory changes the similarity function. This is done in a proportional manner: the similarity function is normalized to add up to 1. Generally, memory may affect similarity judgments in more subtle ways, as will be discussed in Section 19. Whenever the similarity function depends on memory, the resulting decision rule should not be expected to be a special case of W-maximization.

Viewing both EUT and various variants of CBDT as special cases of W-maximization facilitates the comparison between them.
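The way a cognitive specification plugs into the behaviorist formula (⋄) can be illustrated by a short Python sketch. It is not the authors’ implementation. It assumes that a memory I is represented by a `collections.Counter` mapping each case to its number of occurrences, that cases are triples (q, b, u(r)) with the utility stored directly, and that the similarity values and payoffs are made up for the example. The helper `make_w_U` builds the support weights that correspond to U-maximization.

```python
from collections import Counter


def W(act, p, memory, w):
    """W(a) = sum over cases c of I(c) * w_p(a, c), with I given as a Counter."""
    return sum(count * w(p, act, case) for case, count in memory.items())


def make_w_U(sim):
    """Support weights of U-maximization: s(q, p) * u(r) if a = b, else 0."""
    def w(p, act, case):
        q, b, utility = case  # assumption: the case stores u(r) directly
        return sim(q, p) * utility if act == b else 0.0
    return w


def sim(q, p):
    """Hypothetical problem similarity."""
    return 1.0 if q == p else 0.5


# Memory with repetitions: the first case occurred three times.
memory = Counter({("q1", "a", 1.0): 3, ("q2", "b", 4.0): 1})

w_U = make_w_U(sim)
scores = {act: W(act, "q1", memory, w_U) for act in ("a", "b")}
best_act = max(scores, key=scores.get)
```

Here W(a) = 3 · 1 · 1 = 3 and W(b) = 1 · 0.5 · 4 = 2, so act a is chosen. Any other weight function that looks only at the single case c (for instance one built from similarity over problem-act pairs) could be passed to `W` in the same way, whereas a weight that needs the whole memory, as in V-maximization, would not fit this signature.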
While both EUT and CBDT can be described as accumulating evidence from past cases, in EUT the decision maker is capable of hypothetical thinking, as she uses a case that occurred in the past both for the evaluation of the act she has indeed chosen in this case (if there was one), and for the evaluation of acts she has not chosen. Moreover, the same probabilities of states of the world are used for all acts. Thus an expected utility maximizer trusts her hypothetical thinking as much as she trusts her actual experience. By contrast, a case-based decision maker who maximizes U or U′ evaluates acts based only on the outcomes that have actually been experienced in the past. She does not attempt to imagine what outcomes might have resulted from other acts in the same situations.

6.4 Comparing the cognitive specifications

Viewed thus, both theories appear extreme. U′-maximization (and, perforce, U-maximization) does not allow the decision maker to engage in hypothetical thinking, unless hypothetical cases are explicitly introduced into memory. By contrast, EUT insists that hypothetical thinking is possible and, moreover, that it is just as influential as actual experience. However, the behaviorist theory of W-maximization allows many intermediate degrees of hypothetical thinking. Indeed, U′′-maximization is a cognitive specification of W-maximization that spans this range. At one extreme, we may define s((p, a, t), (q, b, r)) in (••) as

\[
s((p,a,t),(q,b,r)) =
\begin{cases}
s((p,a),(q,b)) & \text{if } t = r, \\
0 & \text{otherwise,}
\end{cases}
\]

yielding U′-maximization as a special case, and precluding hypothetical thinking about the outcome that could have resulted in case (q, b, r) had a been chosen rather than b. At the other extreme, we may set

\[
s((p,a,t),(q,b,r)) = \mu_c(\{\omega \in \Omega \mid a(\omega) = r\}),
\]

yielding EUT (with respect to beliefs \(\mu = \sum_{c \in M} I(c)\mu_c\)) as a special case.
But s((p, a, t), (q, b, r)) may, in general, be non-zero for t ≠ r (as in EUT), allowing the decision maker to imagine that a different act in a different problem might result in a different outcome, while still depending on t, allowing a distinction between actual and hypothetical outcomes. Thus, the behaviorist theory of W-maximization offers a wide range of decision making patterns, as described by its cognitive specification of U′′-maximization.

The theory of W-maximization has non-behaviorist interpretations as well. First, the weights w_p(a, c), measuring the support that a case c provides to act a, may be directly accessible by introspection, even if they do not take one of the forms suggested by the cognitive specifications above. Second, one may apply W-maximization to a decision maker’s memory that may not be observable by the modeler directly, provided that this memory is available to the decision maker by introspection. Thus W-maximization may be a behavioral or even a cognitive theory.

In the next chapter we axiomatize CBDT in the general form of W-maximization.¹⁶ But most of the discussion that follows relates to the simplest model, namely, U-maximization. While this model is rather restrictive, it highlights the way in which CBDT differs from EUT. Thus we focus on the model that is most likely to be wrong but also most likely to be insightful.¹⁷

¹⁶ As mentioned above, W-maximization does not generalize V-maximization. For an axiomatization of V-maximization, see Gilboa and Schmeidler (1995).
¹⁷ An exception is the discussion of planning in Chapter 5, which requires the use of U′′-maximization.

7 Case-Based Prediction

In a prediction problem a predictor is faced with certain circumstances, and is asked to name one of the possible eventualities that might arise. Prediction may be regarded as a special type of decision making under uncertainty: the acts available to the predictor are the possible predictions, and the possible outcomes are success (for a correct prediction) and failure (for a wrong one). In a more general model, one may also rank predictions on a continuous scale, measuring the proximity of the prediction to the eventuality that actually transpires, allow set-valued predictions, probabilistic predictions, and so forth.

Like any other type of decision under uncertainty, prediction is based on past knowledge. Indeed, the formal structure of a prediction problem would typically include also a history, or memory, namely, a set of examples, consisting of circumstances that were encountered in the past, coupled with the eventualities that are known to have resulted from them. This structure also encompasses problems referred to in the literature as classification, categorization, or learning.

Viewed as a special case of decision problems, prediction problems are characterized by the following feature: outcomes are assumed to be independent of acts. The predictor, viewed as a decision maker, believes that she is an outside observer who may guess the eventuality that would result from circumstances, but that she cannot affect it in any way. She affects the outcome of her own decision problem by guessing the eventuality correctly or incorrectly, but not through the eventuality itself. It follows that, when considering a past case of prediction, the decision maker knows that the eventuality that actually transpired would have been the same had she predicted differently. Thus she can estimate the success of each possible prediction she could have made with the same ease of imagination and the same certainty as for the prediction she actually did make.
In fact, in formally describing a past case of prediction, one may completely suppress the predictor’s choice and focus only on the circumstances, namely, the decision problem, and the eventuality.

Let us consider a simple formal model in which memory M contains examples of the form (q, r) ∈ P × R, i.e., pairs of problems, or circumstances, q and eventualities r. Assume that a similarity function s : P × P → [0, 1] is given as above. Given a problem p ∈ P, it is natural to suggest that possible eventualities be ranked according to

\[
W'(r) = W'_{p,M}(r) = \sum_{(q,r) \in M} s(p,q).
\]

That is, an eventuality r is assigned a numerical value corresponding to the sum, over all examples in which r actually transpired, of the similarity values of the problems in the past to the one at hand.

It is easily seen that W′ is a special case of W defined above. Indeed, consider a decision problem in which A = R. That is, the set of possible predictions coincides with the set of eventualities. Let cases be pairs (q, t) ∈ P × R and define the support weights w_p(r, c) by

\[
w_p(r,c) = w_p(r,(q,t)) =
\begin{cases}
s(p,q) & \text{if } r = t, \\
0 & \text{otherwise.}
\end{cases}
\]

With these definitions, W-maximization in the decision problem coincides with W′-maximization in the prediction problem. Moreover, W-maximization also allows more general prediction forms. That is, knowing that eventuality t transpired in a past case may make another eventuality r ≠ t more or less plausible.

The following chapter provides an axiomatic derivation of W-maximization in the context of decision under uncertainty. It can also be interpreted as an axiomatization of a general prediction rule based on abstract cases. With the special structure above in mind, one may easily formulate additional conditions that would characterize W′-maximization. (See Gilboa and Schmeidler (2000b,c), in which we follow this path.) Observe that W′-maximization generalizes kernel estimates of density functions.
(See Akaike (1954), Rosenblatt (1956), and Parzen (1962).)

Chapter 3
Axiomatic Derivation

8 Highlights

We devote this chapter to the axiomatization of $W$-maximization. In this section we describe the model and our key assumptions. Formal discussion is presented in the following section.

The decision maker is asked to rank acts in a set $A$ by a binary preference relation, based on her memory $M$. Further, we assume that the decision maker has a well-defined preference relation on $A$ not only for the actual memory $M$, but also for other, hypothetical memories that can be generated from $M$ by replication of cases in it. While the representation (⋄) in Section 6.1 allows repetitions in memory, we now require that preferences be defined for any such vector of repetitions.

In many situations it is not hard to imagine cases appearing in memory repeatedly. For instance, in the recommendation letters example one may imagine several letters relating practically identical stories about a candidate. Cases that are available to a physician in treating a specific patient will often be identical to the best of the physician's knowledge. Moreover, cases that are known to be different may still be considered equivalent as far as the present decision is concerned. Patterns in the history of the weather or of the stock market can also be imagined to have occurred in history more or fewer times than they actually have.

Yet, the assumption that cases can be repeated is restrictive. Consider the example of war in Bosnia again. We quoted the Vietnam war as a relevant case. But it is not entirely clear what is meant by assuming that the Vietnam war occurred, say, five times. In what order are these repeated wars supposed to occur? Were the relevant decision makers aware of the previous occurrences? And, on a more philosophical level, can anything ever repeat itself in exactly the same way?
Will the repetition itself not make it different?

Indeed, no two cases are identical. But many cases can be identical to the best of the decision maker's knowledge, or to the best of her judgment, at least as far as the present decision problem is concerned. As explained below, one might derive the notion of identicality of cases from preference relations as well. In this approach, the philosophical problem is resolved, because identicality is not assumed primitive. The decision maker is not supposed to believe that cases are identical. Rather, if the decision maker's preferences reveal that two cases are equivalent in her eyes for the problem at hand, we may treat them as if they were identical. Still, one would need to assume that for each case there are infinitely many other cases that are equivalent to it in this sense.

Another approach to the axiomatic derivation of CBDT involves a different type of hypothetical rankings: rather than assume that entire cases repeated themselves in exactly the same way, one may assume that each case occurred only once, but that it has resulted in a different outcome than the one actually experienced. We follow this tack in the axiomatization of $U$- and of $V$-maximization with a given utility function in Gilboa and Schmeidler (1995), in the axiomatization of $U'$-maximization in Gilboa and Schmeidler (1997), and in that of $U'$-maximization with an endogenously derived utility function in Gilboa, Schmeidler, and Wakker (1999). There are situations where it is easier to imagine an entire case repeating itself with the same outcome than to imagine the same case resulting in a different outcome, and there are situations where the converse holds. Observe, however, that both types of hypothetical rankings are needed only for the axiomatization, not for the formulation of the theory itself.
They will be required for elicitation of parameters, but not for application of the theory by a decision maker who can estimate these parameters directly.

The repetitions model makes yet another implicit assumption: not only do repetitions make sense, they are also supposed to contain all relevant information for the generation of preferences. Specifically, the order in which cases have supposedly occurred is not reflected in memory. This assumption is not too restrictive, since the description of a case may include a time parameter or certain relevant features of history. Yet, it has a flavor of de Finetti's exchangeability condition.

Our main assumption within this set-up is a combination axiom. Roughly, it states that, if act $a$ is preferred to act $b$ given two disjoint databases of cases, then $a$ should be preferred to $b$ also given their union. To see the basic logic of this condition, imagine that the science of medicine equips physicians with an algorithm to provide recommendations to patients given their personal experience. Imagine that a patient consults two physicians who work in different hospitals, and who therefore have been exposed to disjoint databases of patients. Imagine further that both physicians recommend treatment $a$ over treatment $b$. Should the patient insist that they get together to consult on her case? A negative answer elicits an implicit belief in the combination axiom: applying the same decision making algorithm to the union of the two databases should result in the same conclusion as applying it to each database separately (if these yielded an identical result).

Yet, the combination axiom may seem rather implausible in various cases. We first present the formal model and then discuss this and the other axioms in more detail.

9 Model and Result

The formal model and the results in this section appear in Gilboa and Schmeidler (2000b).
In that paper we discuss predictions and inductive inference, where one assumes that rankings of eventualities by their plausibility are given as data. Here we interpret the rankings as preferences over possible acts.

Let $M$ be a finite and non-empty set of cases, and let $A$ be a set of acts available at the decision problem $p$ under discussion. We assume that $A$ contains at least two acts. For simplicity of notation we suppress $p$ throughout. Denote $\mathbb{J} = \mathbb{Z}_+^M = \{ I \mid I : M \to \mathbb{Z}_+ \}$, where $\mathbb{Z}_+$ stands for the non-negative integers. $\mathbb{J}$ is the set of hypothetical memories, or simply memories. We assume that for every $I \in \mathbb{J}$ the decision maker has a binary relation "at least as desirable as" on $A$, denoted by $\succsim_I$ (i.e., $\succsim_I \subset A \times A$).¹

¹ This mathematical structure is akin to, and partly inspired by, Young (1975) and Myerson (1995), who derive scoring rules in voting theory. Rather than a binary relation $\succsim_I$ on $A$, they assume a function selecting a subset of $A$ for every $I \in \mathbb{J}$, to be interpreted as the set of winning candidates given a distribution of ballots $I$. Our model assumes more information, since we require a complete ranking of alternatives in $A$ for each $I$. In return, we obtain an almost unique representation, and we do not need to assume symmetry among the alternatives.

In other words, we are formalizing here a rule, as opposed to an act, of decision making. The rule generates decisions not only for the available set of cases, but also for hypothetical ones. Some of the hypothetical collections of cases, i.e., vectors in $\mathbb{J}$, may become actual when the decision maker acquires additional information. (On this see also Subsection 9.3.) In conclusion, our structural assumption (axiom 0) is the existence of a binary relation $\succsim_I$ on $A$ for all $I \in \mathbb{J}$.

9.1 Axioms

We will use the four axioms stated below. In their formalization let $\succ_I$ and $\approx_I$ denote the asymmetric and symmetric parts of $\succsim_I$, as usual. $\succsim_I$ is complete if $x \succsim_I y$ or $y \succsim_I x$ for all $x, y \in A$. Finally, algebraic operations on $\mathbb{J}$ are performed pointwise.

A1 Ranking: For every $I \in \mathbb{J}$, $\succsim_I$ is complete and transitive on $A$.

A2 Combination: For every $I, J \in \mathbb{J}$ and every $a, b \in A$, if $a \succsim_I b$ ($a \succ_I b$) and $a \succsim_J b$, then $a \succsim_{I+J} b$ ($a \succ_{I+J} b$).

A3 Archimedean Axiom: For every $I, J \in \mathbb{J}$ and every $a, b \in A$, if $a \succ_I b$, then there exists $l \in \mathbb{N}$ such that $a \succ_{lI+J} b$.

Observe that in the presence of Axiom 2, Axiom 3 also implies that for every $I, J \in \mathbb{J}$ and every $a, b \in A$, if $a \succ_I b$, then there exists $l \in \mathbb{N}$ such that for all $k \geq l$, $a \succ_{kI+J} b$.

Axiom 1 simply requires that, given any conceivable memory, the decision maker's preference relation over acts be a weak order. Axiom 2 states that, if act $a$ is preferred to act $b$ given two disjoint memories, then $a$ should also be preferred to $b$ given the combination of these memories. In our setup, combination (or concatenation) of memories takes the form of adding the number of repetitions of each case in the two memories. Axiom 3 is a continuity condition. It states that if, given the memory $I$, the decision maker strictly prefers act $a$ to act $b$, then, no matter what her ranking is for another memory, $J$, there is a number of repetitions of $I$ that is large enough to overwhelm the ranking induced by $J$.

Finally, we need a diversity axiom that is not necessary for the functional form we would like to derive. While the theorem we present is an equivalence theorem, it characterizes a more restricted class of preferences than those discussed in the introduction. Specifically, we require that for any four acts, there is a memory that will distinguish among all four of them.

A4 Diversity: For every list $(a, b, d, e)$ of distinct elements of $A$ there exists $I \in \mathbb{J}$ such that $a \succ_I b \succ_I d \succ_I e$. If $|A| < 4$, then for any strict ordering of the elements of $A$ there exists $I \in \mathbb{J}$ such that $\succ_I$ is that ordering.

We remind the reader that the next section is devoted to a detailed discussion of the axioms, and proceed to state their implication.

9.2 Basic result

The key result of this chapter can now be formulated.

Theorem 3.1: Let there be given $A$, $M$, and $\{\succsim_I\}_{I \in \mathbb{J}}$ as above. Then the following two statements are equivalent if $|A| \geq 4$:

(i) $\{\succsim_I\}_{I \in \mathbb{J}}$ satisfy A1-A4;

(ii) There is a matrix $w : A \times M \to \mathbb{R}$ such that, for every $I \in \mathbb{J}$ and every $a, b \in A$,

$$(\ast\ast) \qquad a \succsim_I b \quad \text{iff} \quad \sum_{c \in M} I(c)\, w(a, c) \;\geq\; \sum_{c \in M} I(c)\, w(b, c),$$

and for every list $(a, b, d, e)$ of distinct elements of $A$, the convex hull of differences of the row-vectors $(w(a, \cdot) - w(b, \cdot))$, $(w(b, \cdot) - w(d, \cdot))$, and $(w(d, \cdot) - w(e, \cdot))$ does not intersect $\mathbb{R}_-^M$.

Furthermore, in this case the matrix $w$ is unique in the following sense: $w$ and $\hat{w}$ both satisfy $(\ast\ast)$ iff there are a scalar $\lambda > 0$ and a matrix $u : A \times M \to \mathbb{R}$ with identical rows (i.e., with constant columns) such that $\hat{w} = \lambda w + u$.

Finally, in the case $|A| < 4$, the numerical representation result (as in $(\ast\ast)$) holds, and uniqueness as above is guaranteed.

The main point of the theorem is that the infinite family of rankings has a linear representation via a real matrix of order $|A| \times |M|$. For $(a, c) \in A \times M$ the number $w(a, c)$ is interpreted as the amount of support that case $c$ lends to act $a$.

Given any real matrix of order $|A| \times |M|$, one can define for every $I \in \mathbb{J}$ a weak order on $A$ through $(\ast\ast)$. It is easy to see that these orders would satisfy A2 and A3 but not necessarily A4. For example, A4 will be violated if a row of the matrix dominates another row. It is therefore natural to wonder whether one can obtain a numerical representation using only A1-A3. The answer is given by the following proposition.

Proposition 3.2: Axioms A1, A2, and A3 do not imply the existence of a matrix $w$ satisfying $(\ast\ast)$.

In other words, A1, A2, and A3 are necessary but (jointly) insufficient conditions for the existence of a matrix $w$ satisfying $(\ast\ast)$.²

9.3 Learning new cases

The decision rule is constructed in anticipation of incorporating new information.
Let us assume that the decision maker conceives of cases in a set $M$ and has information represented by $I \in \mathbb{Z}_+^M$. Correspondingly, she ranks acts by $\succsim_I$. Assume now that the decision maker acquires a new database. If the new database consists of additional repetitions of cases that the decision maker has conceived of, namely, cases in $M$, it can be represented by a vector $J \in \mathbb{Z}_+^M$, and the decision maker's new preference will be $\succsim_{I+J}$. But if the new database contains at least one case $c$ that is not in $M$, it cannot be represented by a member of $\mathbb{Z}_+^M$. One may choose another set $M'$ and represent the new database by $J \in \mathbb{Z}_+^{M'}$, but then $I$ and $J$ do not belong to the same space and they cannot be summed up. In particular, the combination axiom does not apply and the representation theorem cannot be invoked.

True, one may apply the theorem separately to $\mathbb{Z}_+^M$ and to $\mathbb{Z}_+^{M'}$, but then the theorem would yield a matrix $w'$ for memory $M'$ that may be completely unrelated to the matrix $w$ of the memory $M$. Specifically, even if case $c$ belongs to both $M$ and $M'$, the degree of support it lends to an act $a$ in memory $M$ need not be the same as in memory $M'$. This would imply that the support that a case lends to an act may depend not only on their intrinsic features, but also on other cases in memory.

² A theorem in Ashkenazi and Lehrer (2000) suggests a condition on $\{\succsim_I\}_{I \in \mathbb{J}}$ that is necessary and sufficient for a representation as in $(\ast\ast)$. This condition is less intuitive than the axioms we use and is therefore omitted.

Should one want to rule out this dependence, one should require an additional axiom. Before we introduce it, a piece of notation will be useful: for two memories $M$ and $M'$ satisfying $M \subset M'$, and for each $I \in \mathbb{Z}_+^M$, let $I' \in \mathbb{Z}_+^{M'}$ be the extension of $I$ to $M'$ defined by $I'(c) = 0$ for $c \in M' \setminus M$.
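The extension notation can be illustrated with a minimal sketch. It assumes that a memory is stored as a mapping from case labels to repetition counts and that support weights depend only on the act and the case; the labels `c1`, `c2`, `c3`, the acts, the weights in `w`, and the helper names `extend` and `score` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable

Memory = Dict[str, int]  # case label -> number of repetitions


def extend(I: Memory, M_prime: Iterable[str]) -> Memory:
    """Extend I from M to a superset M': keep the counts on M, use 0 elsewhere."""
    M_prime = list(M_prime)
    if not set(I) <= set(M_prime):
        raise ValueError("M must be a subset of M'")
    return {c: I.get(c, 0) for c in M_prime}


def score(w: Dict[str, Dict[str, float]], act: str, I: Memory) -> float:
    """Total support for an act: sum over cases of I(c) * w(act, c)."""
    return sum(n * w[act][c] for c, n in I.items())


# Hypothetical support weights, defined for every conceivable case.
w = {
    "a": {"c1": 1.0, "c2": -0.5, "c3": 2.0},
    "b": {"c1": 0.2, "c2": 0.4, "c3": -1.0},
}

I = {"c1": 3, "c2": 2}                    # memory over M = {c1, c2}
I_prime = extend(I, ["c1", "c2", "c3"])   # the same memory viewed over M'

for act in ("a", "b"):
    print(act, score(w, act, I), score(w, act, I_prime))
```

Because the added case has zero repetitions, it contributes nothing to any act's score, so the ranking induced by the zero-padded memory coincides with the ranking induced by the original one whenever the weights themselves do not change with the memory.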
We can now state the condition guaranteeing that the similarity function is independent of memory:

A5 Case Independence: If $M \subset M'$, then $\succsim_I = \succsim_{I'}$ for all $I \in \mathbb{Z}_+^M$.

We now state the sought-for result. Its proof is quite simple and hence omitted.

Proposition 3.3: Let there be given a set of cases $C$, a set of alternatives $A$, and a family $\mathcal{M}$ of finite subsets of $C$. Assume that for every $M \in \mathcal{M}$, $\{\succsim_I\}_{I \in \mathbb{Z}_+^M}$ satisfy A1-A4, that $|A| \geq 4$, and that $\mathcal{M}$ is closed under union. Then the following two statements are equivalent:

(i) A5 holds on $\mathcal{M}$;

(ii) For every $c \in C$ and every $a \in A$ there exists a number $w(a, c)$ such that $(\ast\ast)$ of Theorem 3.1 (ii) holds for every $M \in \mathcal{M}$.

9.4 Equivalent cases

Axiom 5 and Proposition 3.3 enable us to resolve an additional modeling difficulty within our framework. From the description of our model it seems that the initial information of the decision maker is $M$ or, equivalently, $1_M \in \mathbb{Z}_+^M$.³ Assume, for instance, that the observed cases are 5 flips of a coin, say $HTHHT$. Are these five separate cases, with $M = \{c_i \mid i = 1, \ldots, 5\}$ and $I = (1, 1, 1, 1, 1)$? Or is the correct model one where $M' = \{H, T\}$ and $I = (3, 2) \in \mathbb{Z}_+^{M'}$? Or should we perhaps opt for a third alternative where the decision maker's knowledge is $J = (3, 2, 0, 0, 0) \in \mathbb{Z}_+^M$? It would be nice to know that these modeling choices do not affect the result.

³ $1_E$ stands for the indicator vector of a set $E$ in the appropriate space. In our model we identify $M$ with $1_M \in \mathbb{Z}_+^M$.

First we note that, indeed, under Axiom 5 and the proposition, the rankings of $A$ for $(3, 2)$ and for $(3, 2, 0, 0, 0)$ should be identical. Second, if the decision maker believes that $c_1, c_3, c_4 \in M$ are the "same" case, and so are $c_2, c_5 \in M$, we should also expect the rankings for $(1, 1, 1, 1, 1)$ and for $(3, 2, 0, 0, 0)$ to be identical. Moreover, if this is the case also whenever we "replace" occurrences of $c_1$ by occurrences of, say, $c_3$, we may take this as a definition of "sameness" of cases.
Formally, for two cases $c_1, c_2 \in M$, define $c_1 \simeq c_2$ if the following holds: for every $I, J \in \mathbb{J}$ such that $I(c) = J(c)$ for all $c \neq c_1, c_2$, and $I(c_1) + I(c_2) = J(c_1) + J(c_2)$, we have $\succsim_I = \succsim_J$. In conclusion we state the obvious:

Proposition 3.4: (i) $\simeq$ is an equivalence relation on $M$. (ii) If $(\ast\ast)$ holds then:

$$c_1 \simeq c_2 \iff \exists \beta \in \mathbb{R} \ \text{s.t.} \ \forall a \in A, \quad w(a, c_1) = w(a, c_2) + \beta.$$

It follows that one may start out with a large set of distinct cases, and allow the decision maker's subjective preferences to reveal possible equivalence relations between them. It is then possible to have an equivalent representation of the decision maker's knowledge, based on the smallest set of different cases. Thus for the minimal representing matrix $w$, not only are all columns different, but no column is obtained from another by addition of a constant.

This derivation of an equivalence relation between cases, however, still assumes that cases may be repeated in an identical manner. As mentioned above, one may find this objectionable on philosophical grounds. In this case, one may prefer a model that does not assume that identicality of cases is primitive. Such a model may be adapted from that presented above as follows. Let there be given a set of distinct cases $C$. Assume that for every finite $M \subset C$ there exists a relation $\succsim_M$ on $A$. Define cases $i, j \in C$ to be equivalent if $\succsim_{M \cup \{i\}} = \succsim_{M \cup \{j\}}$ for all finite $M \subset C$ with $M \cap \{i, j\} = \emptyset$. It is easy to verify that this is indeed an equivalence relation. The structural assumption is then that every equivalence class of this relation is infinite.

9.5 U-Maximization

Consider again the model described in Section 4, where a case consists of a problem, an act, and an outcome. With this additional structure one may distinguish $U$-maximization from $U'$- and $U''$-maximization by the fact that in $U$-maximization an act is evaluated based only on its own history. This may be formulated by the following axiom.
A6 Specificity: For all $I, J \in \mathbb{J}$, and all $a, b \in A$, if $I(c) = J(c)$ whenever $c = (q, a, r)$ or $c = (q, b, r)$, then $a \succsim_I b$ iff $a \succsim_J b$.

A6 requires that if two memories agree on the number of occurrences of all cases involving acts $a$ and $b$, then the two memories induce the same preference order between these two acts. Observe that this axiom should hold if the history of each given act $a$ affects only the evaluation of this very act. In other words, if we knew that the impact of act $a$'s history were specific to the evaluation of act $a$ itself, we would expect A6 to hold. The following result shows that this axiom is also sufficient to obtain the kind of representation we seek. That is, under this additional axiom, the evaluation of each act does not depend on past performance of other acts.

Proposition 3.5: Assume that $\{\succsim_I\}_{I \in \mathbb{J}}$ satisfy A1-A4 and A6. Then there exists a matrix $w : A \times M \to \mathbb{R}$ such that, for every $I \in \mathbb{J}$ and every $a, b \in A$,

$$(\ast\ast\ast) \qquad \begin{aligned} & a \succsim_I b \\ \text{iff} \quad & \sum_{c \in M} I(c)\, w(a, c) \;\geq\; \sum_{c \in M} I(c)\, w(b, c) \\ \text{iff} \quad & \sum_{c = (q, a, r) \in M} I(c)\, w(a, c) \;\geq\; \sum_{c = (q, b, r) \in M} I(c)\, w(b, c). \end{aligned}$$

Further, this matrix $w$ is unique up to multiplication by a positive number and it satisfies $w(a, (q, b, r)) = 0$ whenever $a \neq b$.

A6 implies that the matrix $w$ of Theorem 3.1 has the following property: if acts $b, d$ were not chosen in case $c$, then $w(b, c) = w(d, c)$. Since every column in the matrix $w$ can be shifted by a constant, one may set $w(b, c)$ to zero whenever act $b$ was not chosen in case $c$. Given this choice of $w$, the last two lines of $(\ast\ast\ast)$ are obviously equivalent. Moreover, this choice makes $w$ unique up to multiplication by a positive number. We omit the formal proof of the proposition, which follows this reasoning.

Observe that the resulting matrix $w$ consists mostly of zeroes: in every column there exists at most one entry that differs from zero.
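The structure of the normalized matrix can be illustrated with a small sketch. It assumes that a case is stored as a triple (problem, act chosen, outcome) and that the only nonzero weight attached to a case is the one for the act that was actually chosen in it. All labels and numbers, and the names `own_weight`, `full_score`, and `own_history_score`, are hypothetical.

```python
import math

# A case is a triple (problem, act chosen, outcome); values are hypothetical.
own_weight = {
    ("q1", "a", "r1"): 0.9,
    ("q2", "a", "r2"): -0.5,
    ("q1", "b", "r1"): 0.7,
    ("q3", "b", "r2"): -0.2,
}


def w(act, case):
    """Support of `case` for `act`: zero unless `act` was the act chosen in the case."""
    return own_weight[case] if case[1] == act else 0.0


def full_score(act, memory):
    """Sum over all cases in memory (second line of the displayed condition)."""
    return sum(n * w(act, c) for c, n in memory.items())


def own_history_score(act, memory):
    """Sum only over cases in which `act` itself was chosen (third line)."""
    return sum(n * w(act, c) for c, n in memory.items() if c[1] == act)


# A memory: repetition counts for each case.
memory = {
    ("q1", "a", "r1"): 2,
    ("q2", "a", "r2"): 1,
    ("q1", "b", "r1"): 1,
    ("q3", "b", "r2"): 4,
}

for act in ("a", "b"):
    print(act, full_score(act, memory), own_history_score(act, memory))
```

With this normalization each column of the matrix has a single possibly nonzero entry, so summing over the whole memory and summing over an act's own history give the same number for every act.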
In case a utility function $u : R \to \mathbb{R}$ is given, one may write $w(a, (q, a, r)) = s(q) u(r)$ (unless $u(r)$ vanishes and $w(a, (q, a, r))$ does not).⁴ Interpreting $s(q)$ as the similarity of the past problem $q$ to the problem at hand $p$ yields $U$-maximization. Indeed, with a given utility function one may decompose the function $w$ even without A6 and write $w(a, (q, b, r)) = s((p, a), (q, b)) u(r)$ to get $U'$-maximization. (Again, such a decomposition assumes that $u(r) = 0$ implies $w(a, (q, b, r)) = 0$.)

⁴ There are many situations in which a utility function may be assumed given. In particular, one may have a cognitive measure of utility that is independent of the behavioral theory under discussion. See, for instance, Alt (1936) for such a derivation.

Observe, however, that in the absence of a given utility function there is no unique decomposition of $w(a, (q, b, r))$ (or of $w(a, (q, a, r))$) to a product of a similarity function and a utility function. Indeed, if cases can only repeat themselves in memory in their entirety, one cannot hope to observe the separate impacts of similarity judgment and of desirability evaluation on behavior.⁵ If a function $w$ can be written as the product of similarity $s$ and utility $u$, it is also the product of $-s$ and $-u$. Thus even the ordinal ranking of past outcomes cannot be gleaned from the function $w$ alone. For instance, if Jane tends to choose act $a$ in problem $p$ after having chosen $a$ in case $(q, a, r)$, it is possible that she liked outcome $r$ and that she considers problem $p$ to be similar to problem $q$ (i.e., $u(r), s(p, q) > 0$), but it is also possible that she did not like outcome $r$, and that she considers problems $p$ and $q$ to be dissimilar (i.e., $u(r), s(p, q) < 0$).

⁵ This problem is reminiscent of the axiomatization of state-dependent utility with subjective probability. See Karni, Schmeidler, and Vind (1983) and Karni and Mongin (2000).
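The sign ambiguity can be seen in a two-line computation. The sketch below uses hypothetical numbers for Jane's similarity and utility values; the function name `support` is an assumption made for this illustration.

```python
def support(s_pq: float, u_r: float) -> float:
    """Support weight written as a product of a similarity value and a utility value."""
    return s_pq * u_r


# Story 1 (hypothetical numbers): Jane liked outcome r and finds p similar to q.
s_similar, u_liked = 0.8, 2.0

# Story 2: Jane disliked outcome r and finds p dissimilar to q.
s_dissimilar, u_disliked = -0.8, -2.0

w_story_1 = support(s_similar, u_liked)
w_story_2 = support(s_dissimilar, u_disliked)

print(w_story_1, w_story_2)
```

Both stories produce the same support weight, so behavior that depends only on the weight cannot tell them apart.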
To obtain a decomposition of the function $w$ into a product of utility and similarity based on behavior data alone, one would need to consider not only repetitions of cases, but also hypothetical cases in which different outcomes resulted from the same problem-act pairs. Such an approach is used by Gilboa and Schmeidler (1995, 1997a) and Gilboa, Schmeidler, and Wakker (1999).

10 Discussion of the Axioms

Axiom 1 states that the decision maker has a weak preference over the set of acts given any memory. This axiom is standard in decision theory, and it is as plausible here as in most models. Once one accepts the structural assumptions, namely, that the decision maker has preferences given any conceivable memory and that only the number, and not the order, of cases matters, A1 is not too objectionable.⁶

⁶ It should be mentioned that both completeness and transitivity of preferences, and, in particular, transitivity of indifference, have been challenged in decision theory. (See Luce (1956) and Fishburn (1970, 1985).) The criticism of these axioms applies here as in any other context in which it was raised.

As mentioned above, the combination axiom (A2) is the most fundamental of our assumptions. Whereas its basic logic seems sound, it is by no means universally applicable. In particular, the CBDT rule of $V$-maximization does not satisfy it. To see this, consider the following version of "Simpson's Paradox" (see, for instance, DeGroot (1975)).⁷ One has to choose between betting on coin $a$ coming up Head or on coin $b$ coming up Head. In database $I$, coin $a$ resulted in Head in 110 out of 1,000 tosses, whereas coin $b$ resulted in Head in 10 out of 100 tosses. Thus, $V$-maximization, which is a rather reasonable decision rule in this context, would prefer $a$ (with a success rate of 11%) to $b$ (with 10%). In database $J$, $a$ resulted in Head in 90 out of 100 tosses, whereas $b$ did so in 800 out of 1,000. Again, $V$-maximization would prefer $a$ to $b$. But in the combined database $a$ resulted in Head 200 times out of 1,100, whereas $b$ did so 810 times out of 1,100, and $V$-maximization would prefer $b$ to $a$.

⁷ Simpson's paradox was suggested to us as a counterexample to the combination axiom by Bernard Walliser.

The implicit assumption that makes $V$-maximization a reasonable decision rule in this situation is that the coin tosses for both coin $a$ and coin $b$ result from independent sampling from fixed distributions. But this assumption also makes the scenario above unlikely. Observing databases such as $I$ and $J$ probably indicates that this assumption is false, and that not all tosses of the coins are similar. In particular, in database $I$ the percentage of Heads is much lower than in database $J$, for both coins. It is possible, for instance, that the tossing machine in database $J$ has a Head bias. Introducing the tossing machine as part of the description of a case would resolve the inconsistency. Specifically, coin tosses from database $I$ would differ from tosses of the same coins from database $J$, and it would no longer make sense to apply $V$-maximization to the union of the two databases.

The combination axiom may also be violated when small databases are considered. For instance, consider a statistician who has to decide whether to accept or to reject a null hypothesis. Assume that the statistician uses a standard hypothesis testing technique. It is normally the case that one can find two samples, each of which is too small to reach statistical significance, and whose union is large enough for this purpose. Thus, the decision given each database would be "accept", whereas given their union it would be "reject". It should be noted, however, that this is due to the inherent asymmetry between acceptance and rejection of a null hypothesis, which often stems from the fact that the null hypothesis enjoys the status of "accepted view" or "common wisdom". This, in turn, probably relies on past cases that precede the samples under discussion.
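The arithmetic of the coin example used earlier in this section to illustrate Simpson's paradox can be checked directly. The sketch below uses the counts given in the text and treats $V$-maximization, for this example only, as choosing the coin with the highest empirical frequency of Heads; that simplification and the helper names `combine`, `success_rate`, and `preferred` are assumptions made for illustration.

```python
from fractions import Fraction

# (number of Heads, number of tosses) for each coin, as given in the text.
database_I = {"a": (110, 1000), "b": (10, 100)}
database_J = {"a": (90, 100), "b": (800, 1000)}


def combine(d1, d2):
    """Union of two disjoint databases: add Heads counts and toss counts per coin."""
    return {coin: (d1[coin][0] + d2[coin][0], d1[coin][1] + d2[coin][1]) for coin in d1}


def success_rate(database, coin):
    heads, tosses = database[coin]
    return Fraction(heads, tosses)


def preferred(database):
    """Coin with the highest empirical frequency of Heads."""
    return max(database, key=lambda coin: success_rate(database, coin))


union = combine(database_I, database_J)

for name, db in (("I", database_I), ("J", database_J), ("I + J", union)):
    rates = {coin: float(success_rate(db, coin)) for coin in db}
    print(name, rates, "prefers", preferred(db))
```

Coin $a$ has the higher rate in each database separately, while coin $b$ has the higher rate in the union, which is exactly the pattern that the combination axiom rules out.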
If we were to consider the entire database on which the statistician relies, we would include these past cases as well as the new samples. In this case, the two databases defined by past cases combined with each of the new samples are not disjoint.

To consider another example, assume that an inspector has to decide whether a roulette wheel is fair. Observing a hundred cases in which the wheel came up on red, she may conclude that the roulette is biased and that legal action is called for. Obviously, she should make the same decision if she observes a hundred cases, all of which involve black outcomes. Yet, the combination of the two databases will not warrant any action.⁸ Observe, however, that this example hinges on the fact that the acts are not described in sufficient detail. If one were to choose among "no act", "sue for a red bias", and "sue for a black bias", the violation of the combination axiom would disappear.⁹

Perhaps the most important violation of the combination axiom occurs when the function $w(a, c)$ itself is learned from experience. We refer to this process as "second-order induction", since it describes how one learns how to learn from experience. Chapter 7 discusses this issue in detail.

The Archimedean axiom states that conceivable memories are comparable, or that no conceivable memory can be infinitely more weighty than another. It is violated, for instance, by decision systems that are based on nearest neighbor classifications. To consider a concrete example, imagine that a physician has to decide whether to perform a surgery on a patient. One possible algorithm for making this decision is to consider the most similar patient who has undergone the surgical procedure, and to perform the surgery on the current patient if and only if the most similar past case of surgery was successful.
For this decision rule, a single most-similar case of, say, a successful operation will outweigh any number of less similar cases of unsuccessful operations. We view this example as a criticism of nearest neighbor approaches more than of the Archimedean axiom. At any rate, one may drop the Archimedean axiom and develop a nonstandard analysis version of our theorem.

⁸ This example is based on an example of Daniel Lehman.

⁹ As in the example of Simpson's paradox, the scenario described in this example is highly unlikely under the assumption of stochastic independence (or exchangeability).

In Gilboa and Schmeidler (2000b) we use the combination and the Archimedean axioms as applied to prediction, and we identify several classes of examples in which they are unlikely to hold. Since every prediction problem can be embedded in a decision problem, these classes of examples apply here as well. Yet, it appears that these axioms are rather plausible in a wide range of applications.

We now turn to the diversity axiom. Its main justification is necessity: without it the representation of preferences by $W$-maximization cannot be derived. We readily admit that it is too restrictive for many purposes. For instance, consider three acts, $a$ = "sell 100 shares", $b$ = "sell 200 shares", and $d$ = "sell 300 shares". One may argue that act $b$ will always be ranked between acts $a$ and $d$. But this contradicts the diversity axiom.

Yet, with a large enough set of conceivable cases $M$, the diversity axiom is not too objectionable. For instance, there might be circumstances under which obtaining the optimal portfolio would require selling precisely 200 shares. Having strong enough evidence from past cases that the present problem belongs to this category, one may indeed rank, say, $b$ over $a$ and $a$ over $d$.

The above notwithstanding, it would be useful to have another axiom that, in the presence of A1-A3, would yield $W$-maximization.
It will be clear from the proof that one does not need the full strength of A4 to obtain the desired result. It suffices that there be enough quadruples of acts for which the conclusion of the axiom holds, rather than that it hold for all quadruples. Such a weakening would allow act $b$ in the example above to be always ranked between $a$ and $d$, provided that there are other acts, say $e$ and $f$, such that for every pair of distinct acts $\{x, y\} \subset \{a, b, d\}$, every strict ranking of $\{x, y, e, f\}$ is obtained for some $I$. For simplicity of exposition we formulate A4 in its full strength, rather than tailor it to its exact usage in the proof. At present we are not aware of any elegant alternative to A4.

11 Proofs

Proof of Theorem 3.1: Theorem 3.1 is reminiscent of the main result in Gilboa and Schmeidler (1997). In that work, cases are assumed to involve numerical payoffs, and algebraic and topological axioms are formulated in the payoff space. Here, by contrast, cases are not assumed to have any structure, and the algebraic and topological structures are given by the number of repetitions. This fact introduces two main difficulties. First, the space of "contexts" for which preferences are defined is not a Euclidean space, but only integer points thereof. This requires some care with the application of separation theorems. Second, repetitions can only be non-negative. This fact introduces several complications, and, in particular, changes the algebraic implication of the diversity condition.

We present the proof for the case $|A| \geq 4$. The proofs for the cases $|A| = 2$ and $|A| = 3$ will be described as by-products along the way.

We start by proving that (i) implies (ii). We first note that the following homogeneity property holds:

Claim 3.1: For every $I \in \mathbb{Z}_+^M$ and every $k \in \mathbb{N}$, $\succsim_I = \succsim_{kI}$.

Proof: Follows from consecutive application of the combination axiom. □
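One way to spell out "consecutive application" is sketched below. This is an expansion under the stated axioms A1 and A2 only, not a quotation of the authors' argument.

First, suppose $a \succsim_I b$. Applying A2 with the second memory equal to $I$, then to $2I$, and so on, gives

$$a \succsim_I b \;\Longrightarrow\; a \succsim_{I+I} b = a \succsim_{2I} b \;\Longrightarrow\; a \succsim_{2I+I} b = a \succsim_{3I} b \;\Longrightarrow\; \cdots \;\Longrightarrow\; a \succsim_{kI} b.$$

Conversely, suppose $a \succsim_{kI} b$ but not $a \succsim_I b$. By completeness (A1), $b \succ_I a$. The strict version of A2, applied repeatedly in the same way, gives

$$b \succ_I a \;\Longrightarrow\; b \succ_{2I} a \;\Longrightarrow\; \cdots \;\Longrightarrow\; b \succ_{kI} a,$$

which contradicts $a \succsim_{kI} b$. Hence $a \succsim_I b$ holds if and only if $a \succsim_{kI} b$, that is, $\succsim_I = \succsim_{kI}$.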
In view of this claim, we extend the definition of $\succsim_I$ to functions $I$ whose values are non-negative rationals (denoted $\mathbb{Q}_+$). Given $I \in \mathbb{Q}_+^M$, let $k \in \mathbb{N}$ be such that $kI \in \mathbb{Z}_+^M$ and define $\succsim_I = \succsim_{kI}$. $\succsim_I$ is well-defined in view of Claim 3.1. By definition and Claim 3.1 we also have:

Claim 3.2 (Homogeneity): For every $I \in \mathbb{Q}_+^M$ and every $q \in \mathbb{Q}$, $q > 0$: $\succsim_{qI} = \succsim_I$.

Claim 3.2, A1, and A2 imply:

Claim 3.3: (The ranking axiom) For every $I \in \mathbb{Q}_+^M$, $\succsim_I$ is complete and transitive on $A$, and (the combination axiom) for every $I, J \in \mathbb{Q}_+^M$ and every $a, b \in A$ and $p, q \in \mathbb{Q}$, $p, q > 0$: if $a \succsim_I b$ ($a \succ_I b$) and $a \succsim_J b$, then $a \succsim_{pI+qJ} b$ ($a \succ_{pI+qJ} b$).

Two special cases of the combination axiom are of interest: (i) $p = q = 1$, and (ii) $p + q = 1$. Claims 3.2 and 3.3, and the Archimedean axiom, A3, imply the following version of the axiom for the $\mathbb{Q}_+^M$ case:

Claim 3.4 (The Archimedean axiom): For every $I, J \in \mathbb{Q}_+^M$ and every $a, b \in A$, if $a \succ_I b$, then there exists $r \in [0, 1) \cap \mathbb{Q}$ such that $a \succ_{rI+(1-r)J} b$.

It is easy to conclude from Claim 3.4 (and 3.3) that for every $I, J \in \mathbb{Q}_+^M$ and every $a, b \in A$, if $a \succ_I b$, then there exists $r \in [0, 1) \cap \mathbb{Q}$ such that $a \succ_{pI+(1-p)J} b$ for every $p \in (r, 1) \cap \mathbb{Q}$.

The following notation will be convenient for stating the first lemma. For every $a, b \in A$ let

$$Y^{ab} \equiv \{ I \in \mathbb{Q}_+^M \mid a \succ_I b \} \quad \text{and} \quad W^{ab} \equiv \{ I \in \mathbb{Q}_+^M \mid a \succsim_I b \}.$$

Observe that by definition and A1: $Y^{ab} \subset W^{ab}$, $W^{ab} \cap Y^{ba} = \emptyset$, and $W^{ab} \cup Y^{ba} = \mathbb{Q}_+^M$. The first main step in the proof of the theorem is:

Lemma 3.5: For every distinct $a, b \in A$ there is a vector $w^{ab} \in \mathbb{R}^M$ such that,

(i) $W^{ab} = \{ I \in \mathbb{Q}_+^M \mid w^{ab} \cdot I \geq 0 \}$;
(ii) $Y^{ab} = \{ I \in \mathbb{Q}_+^M \mid w^{ab} \cdot I > 0 \}$;
(iii) $W^{ba} = \{ I \in \mathbb{Q}_+^M \mid w^{ab} \cdot I \leq 0 \}$;
(iv) $Y^{ba} = \{ I \in \mathbb{Q}_+^M \mid w^{ab} \cdot I < 0 \}$;
(v) Neither $w^{ab} \leq 0$ nor $w^{ab} \geq 0$;
(vi) $-w^{ab} = w^{ba}$.

Moreover, the vector $w^{ab}$ satisfying (i)-(iv) is unique up to multiplication by a positive number.
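The content of the lemma can be illustrated in the easy direction, starting from a representation rather than from the axioms. The sketch below assumes a hypothetical weight matrix `w` for two acts and two cases, takes $w^{ab}$ to be the difference of the corresponding rows, and classifies a finite grid of rational memories by the sign of the inner product. The grid stands in for $\mathbb{Q}_+^M$ only for illustration; none of this is part of the proof.

```python
from fractions import Fraction
from itertools import product

# Hypothetical weight matrix: one row per act, one column per case (|M| = 2).
w = {
    "a": (Fraction(1), Fraction(-2)),
    "b": (Fraction(0), Fraction(1)),
}


def w_diff(x, y):
    """Candidate separating vector for the pair (x, y): row of x minus row of y."""
    return tuple(wx - wy for wx, wy in zip(w[x], w[y]))


def dot(v, I):
    return sum(vi * Ii for vi, Ii in zip(v, I))


w_ab = w_diff("a", "b")
w_ba = w_diff("b", "a")

# A finite sample of non-negative rational memories (multiples of 1/2 up to 3).
grid = {(Fraction(i, 2), Fraction(j, 2)) for i, j in product(range(7), repeat=2)}

W_ab = {I for I in grid if dot(w_ab, I) >= 0}   # a weakly preferred to b
Y_ab = {I for I in grid if dot(w_ab, I) > 0}    # a strictly preferred to b
W_ba = {I for I in grid if dot(w_ab, I) <= 0}   # b weakly preferred to a
Y_ba = {I for I in grid if dot(w_ab, I) < 0}    # b strictly preferred to a

print("w_ab =", w_ab, " w_ba =", w_ba)
print(len(W_ab), len(Y_ab), len(W_ba), len(Y_ba))
```

On this sample the sets behave as the lemma describes: the strict set sits inside the weak one, the weak set for $(a, b)$ and the strict set for $(b, a)$ partition the grid, and the two vectors are negatives of each other. The chosen vector also has entries of both signs, in line with part (v).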
The lemma states that we can associate with every pair of distinct acts a, b ∈ A a separating hyperplane defined by w^{ab} · x = 0 (x ∈ R^M), such that a ≽_I b iff I is on a given side of the plane (i.e., iff w^{ab} · I ≥ 0). Observe that if there are only two acts, Lemma 3.5 completes the proof of sufficiency: for instance, one may set w^a = w^{ab} and w^b = 0. It then follows that a ≽_I b iff w^{ab} · I ≥ 0, i.e., iff w^a · I ≥ w^b · I. More generally, we will show in the following lemmata that one can find a vector w^a for every alternative a, such that, for every a, b ∈ A, w^{ab} is a positive multiple of (w^a − w^b).

Before starting the proof we introduce additional notation: let Ŵ^{ab} and Ŷ^{ab} denote the convex hulls (in R^M) of W^{ab} and Y^{ab}, respectively. For a subset B of R^M let int(B) denote the set of interior points of B.

Proof of Lemma 3.5: We break the proof into several claims.

Claim 3.6 For every distinct a, b ∈ A, Y^{ab} ∩ int(Ŷ^{ab}) ≠ ∅.

Proof: By the diversity axiom Y^{ab} ≠ ∅ for all a, b ∈ A, a ≠ b. Let I ∈ Y^{ab} ∩ Z_+^M and let J ∈ Z_+^M with J(m) > 1 for all m ∈ M. By the Archimedean axiom there is an l ∈ N such that K = lI + J ∈ Y^{ab}. Let (x_j), j = 1, ..., 2^{|M|}, be the 2^{|M|} distinct vectors in R^M with coordinates 1 and −1. For j = 1, ..., 2^{|M|}, define y_j = K + x_j. Obviously, y_j ∈ Q_+^M for all j. By Claim 3.4 there is an r_j ∈ [0, 1) ∩ Q such that z_j = r_j K + (1 − r_j) y_j ∈ Y^{ab} (for all j). Clearly, the convex hull of {z_j | j = 1, ..., 2^{|M|}}, which is included in Ŷ^{ab}, contains an open neighborhood of K. □

Claim 3.7 For every distinct a, b ∈ A, Ŵ^{ba} ∩ int(Ŷ^{ab}) = ∅.

Proof: Suppose, by way of negation, that for some x ∈ int(Ŷ^{ab}) there are (y_i), i = 1, ..., k, and (λ_i), i = 1, ..., k, with k ∈ N, such that for all i, y_i ∈ W^{ba}, λ_i ∈ [0, 1], Σ_{i=1}^k λ_i = 1, and x = Σ_{i=1}^k λ_i y_i. Since x ∈ int(Ŷ^{ab}), there is a ball of radius ε > 0 around x included in Ŷ^{ab}. Let δ = ε/(2 Σ_{i=1}^k ||y_i||) and for each i let q_i ∈ Q ∩ [0, 1] be such that |q_i − λ_i| < δ and Σ_{i=1}^k q_i = 1.
Hence, y = Σ_{i=1}^k q_i y_i ∈ Q_+^M and ||y − x|| < ε, which, in turn, implies y ∈ Ŷ^{ab} ∩ Q_+^M. Since y_i ∈ W^{ba} for all i, consecutive application of the combination axiom (Claim 3.3) yields y = Σ_{i=1}^k q_i y_i ∈ W^{ba}. On the other hand, y is a convex combination of points in Y^{ab} ⊂ Q_+^M and thus it has a representation with rational coefficients (because the rationals are an algebraic field). Applying Claim 3.3 consecutively as above, we conclude that y ∈ Y^{ab}, a contradiction.

The main step in the proof of Lemma 3.5: The last two claims imply that (for all a, b ∈ A, a ≠ b) Ŵ^{ab} and Ŷ^{ba} satisfy the conditions of a separating hyperplane theorem. So there is a vector w^{ab} ≠ 0 and a number c so that

w^{ab} · I ≥ c for every I ∈ Ŵ^{ab},
w^{ab} · I ≤ c for every I ∈ Ŷ^{ba}.

Moreover,

w^{ab} · I > c for every I ∈ int(Ŵ^{ab}),
w^{ab} · I < c for every I ∈ int(Ŷ^{ba}).

By homogeneity (Claim 3.2), c = 0. Parts (i)-(iv) of the lemma are restated as a claim and proved below.

Claim 3.8 For all a, b ∈ A, a ≠ b: W^{ab} = {I ∈ Q_+^M | w^{ab} · I ≥ 0}; Y^{ab} = {I ∈ Q_+^M | w^{ab} · I > 0}; W^{ba} = {I ∈ Q_+^M | w^{ab} · I ≤ 0}; and Y^{ba} = {I ∈ Q_+^M | w^{ab} · I < 0}.

Proof: (a) W^{ab} ⊂ {I ∈ Q_+^M | w^{ab} · I ≥ 0} follows from the separation result and the fact that c = 0.

(b) Y^{ab} ⊂ {I ∈ Q_+^M | w^{ab} · I > 0}: assume that a ≻_I b, and, by way of negation, w^{ab} · I ≤ 0. Choose a J ∈ Y^{ba} ∩ int(Ŷ^{ba}). Such a J exists by Claim 3.6. Since c = 0, J satisfies w^{ab} · J < 0. By Claim 3.4 there exists r ∈ [0, 1) such that rI + (1 − r)J ∈ Y^{ab} ⊂ W^{ab}. By (a), w^{ab} · (rI + (1 − r)J) ≥ 0. But w^{ab} · I ≤ 0 and w^{ab} · J < 0, a contradiction. Therefore, Y^{ab} ⊂ {I ∈ Q_+^M | w^{ab} · I > 0}.

(c) Y^{ba} ⊂ {I ∈ Q_+^M | w^{ab} · I < 0}: assume that b ≻_I a and, by way of negation, w^{ab} · I ≥ 0. By Claim 3.6 there is a J ∈ Y^{ab} with J ∈ int(Ŷ^{ab}) ⊂ int(Ŵ^{ab}). The inclusion J ∈ int(Ŵ^{ab}) implies w^{ab} · J > 0. Using the Archimedean axiom, there is an r ∈ [0, 1) such that rI + (1 − r)J ∈ Y^{ba}.
The separation theorem implies that w^{ab} · (rI + (1 − r)J) ≤ 0, which is impossible if w^{ab} · I ≥ 0 and w^{ab} · J > 0. This contradiction proves that Y^{ba} ⊂ {I ∈ Q_+^M | w^{ab} · I < 0}.

(d) W^{ba} ⊂ {I ∈ Q_+^M | w^{ab} · I ≤ 0}: assume that b ≽_I a, and, by way of negation, w^{ab} · I > 0. Let J satisfy b ≻_J a. By (c), w^{ab} · J < 0. Define r = (w^{ab} · I)/(−w^{ab} · J) > 0. By homogeneity (Claim 3.2), b ≻_{rJ} a. By Claim 3.3, I + rJ ∈ Y^{ba}. Hence, by (c), w^{ab} · (I + rJ) < 0. However, direct computation yields w^{ab} · (I + rJ) = w^{ab} · I + r w^{ab} · J = 0, a contradiction. It follows that W^{ba} ⊂ {I ∈ Q_+^M | w^{ab} · I ≤ 0}.

(e) W^{ab} ⊃ {I ∈ Q_+^M | w^{ab} · I ≥ 0}: follows from completeness and (c).

(f) Y^{ab} ⊃ {I ∈ Q_+^M | w^{ab} · I > 0}: follows from completeness and (d).

(g) Y^{ba} ⊃ {I ∈ Q_+^M | w^{ab} · I < 0}: follows from completeness and (a).

(h) W^{ba} ⊃ {I ∈ Q_+^M | w^{ab} · I ≤ 0}: follows from completeness and (b). □

Completion of the proof of the Lemma. Part (v) of the Lemma, i.e., w^{ab} ∉ R_+^M ∪ R_-^M for a ≠ b, follows from the facts that Y^{ab} ≠ ∅ and Y^{ba} ≠ ∅. Before proving part (vi), we prove uniqueness.

Assume that both w^{ab} and u^{ab} satisfy (i)-(iv). In this case, u^{ab} · x ≤ 0 implies w^{ab} · x ≤ 0 for all x ∈ R_+^M. (Otherwise, there exists I ∈ Q_+^M with u^{ab} · I ≤ 0 but w^{ab} · I > 0, contradicting the fact that both w^{ab} and u^{ab} satisfy (i)-(iv).) Similarly, u^{ab} · x ≥ 0 implies w^{ab} · x ≥ 0. Applying the same argument with the roles of w^{ab} and u^{ab} reversed, we conclude that {x ∈ R_+^M | w^{ab} · x = 0} = {x ∈ R_+^M | u^{ab} · x = 0}. Moreover, since int(Ŷ^{ab}) ≠ ∅ and int(Ŷ^{ba}) ≠ ∅, it follows that {x ∈ R_+^M | w^{ab} · x = 0} ∩ int(R_+^M) ≠ ∅. This implies that {x ∈ R^M | w^{ab} · x = 0} = {x ∈ R^M | u^{ab} · x = 0}, i.e., that w^{ab} and u^{ab} have the same null set and are therefore multiples of each other. That is, there exists α such that u^{ab} = αw^{ab}. Since both satisfy (i)-(iv), α > 0.

Finally, we prove part (vi). Observe that both w^{ab} and −w^{ba} satisfy (i)-(iv) (stated for the ordered pair (a, b)).
By the uniqueness result, −w^{ab} = αw^{ba} for some positive number α. At this stage we redefine the vectors {w^{ab}}_{a,b∈A} from the separation result as follows: for every unordered pair {a, b} ⊂ A one of the two ordered pairs, say (b, a), is arbitrarily chosen and then w^{ab} is rescaled such that w^{ab} = −w^{ba}. (If A is of an uncountable power, the axiom of choice has to be used.)

Lemma 3.9 For every three distinct acts f, g, h ∈ A, and the corresponding vectors w^{fg}, w^{gh}, w^{fh} from Lemma 3.5, there are unique α, β > 0 such that:

αw^{fg} + βw^{gh} = w^{fh}.

The key argument in the proof of Lemma 3.9 is that, if w^{fh} is not a linear combination of w^{fg} and w^{gh}, one may find a vector I for which ≻_I is cyclical.

If there are only three alternatives f, g, h ∈ A, Lemma 3.9 allows us to complete the proof as follows: choose an arbitrary vector w^{fh} that separates between f and h. Then choose the multiples of w^{fg} and of w^{gh} defined by the lemma. Proceed to define w^f = w^{fh}, w^g = βw^{gh}, and w^h = 0. By construction, (w^f − w^h) is (equal and therefore) proportional to w^{fh}, hence f ≽_I h iff w^f · I ≥ w^h · I. Also, (w^g − w^h) is proportional to w^{gh} and it follows that g ≽_I h iff w^g · I ≥ w^h · I. The point is, however, that, by Lemma 3.9, we obtain the same result for the last pair: (w^f − w^g) = (w^{fh} − βw^{gh}) = αw^{fg} and it follows that f ≽_I g iff w^f · I ≥ w^g · I.

Proof of Lemma 3.9: First note that for every three distinct acts f, g, h ∈ A, if w^{fg} and w^{gh} are colinear, then for all I either f ≻_I g ⇔ g ≻_I h or f ≻_I g ⇔ h ≻_I g. Both implications contradict diversity. Therefore any two vectors in {w^{fg}, w^{gh}, w^{fh}} are linearly independent. This immediately implies the uniqueness claim of the lemma. Next we introduce

Claim 3.10 For every distinct f, g, h ∈ A, and every λ, µ ∈ R, if λw^{fg} + µw^{gh} ≤ 0, then λ = µ = 0.

Proof: Observe that Lemma 3.5(v) implies that if one of the numbers λ and µ is zero, so is the other.
Next, suppose, per absurdum, that λµ ≠ 0, and consider λw^{fg} ≤ µw^{hg}. If, say, λ, µ > 0, then w^{fg} · I ≥ 0 necessitates w^{hg} · I ≥ 0. Hence there is no I for which f ≻_I g ≻_I h, in contradiction to the diversity axiom. Similarly, λ > 0 > µ precludes f ≻_I h ≻_I g; µ > 0 > λ precludes g ≻_I f ≻_I h; and λ, µ < 0 implies that for no I ∈ Q_+^M is it the case that h ≻_I g ≻_I f. Hence the diversity axiom holds only if λ = µ = 0. □

We now turn to the main part of the proof. Suppose that w^{fg}, w^{gh}, and w^{hf} are column vectors and consider the |M| × 3 matrix (w^{fg}, w^{gh}, w^{hf}) as a 2-person 0-sum game. If its value is positive, then there is an x ∈ ∆(M) such that w^{fg} · x > 0, w^{gh} · x > 0, and w^{hf} · x > 0. Hence there is an I ∈ Q_+^M ∩ ∆(M) that satisfies the same inequalities. This, in turn, implies that f ≻_I g, g ≻_I h, and h ≻_I f, a contradiction.

Therefore the value of the game is zero or negative. In this case there are λ, µ, ζ ≥ 0, such that λw^{fg} + µw^{gh} + ζw^{hf} ≤ 0 and λ + µ + ζ = 1. The claim above implies that if one of the numbers λ, µ, and ζ is zero, so are the other two. Thus λ, µ, ζ > 0. We therefore conclude that there are α = λ/ζ > 0 and β = µ/ζ > 0 such that

(∗) αw^{fg} + βw^{gh} ≤ w^{fh}.

Applying the same reasoning to the triple h, g, and f, we conclude that there are γ, δ > 0 such that

(∗∗) γw^{hg} + δw^{gf} ≤ w^{hf}.

Summation yields

(∗∗∗) (α − δ)w^{fg} + (β − γ)w^{gh} ≤ 0.

Claim 3.10 applied to inequality (∗∗∗) implies α = δ and β = γ. Hence inequality (∗∗), multiplied by −1, may be rewritten as αw^{fg} + βw^{gh} ≥ w^{fh}, which together with (∗) yields the desired representation.

Lemma 3.9 shows that, if there are more than three alternatives, the preference ranking of every triple of alternatives can be represented as in the theorem. The question that remains is whether these separate representations (for different triples) can be "patched" together in a consistent way.
Lemma 3.11 There are vectors {w^{ab}}_{a,b∈A, a≠b}, as in Lemma 3.5, such that for any three distinct acts a, b, d ∈ A, the Jacobi identity w^{ab} + w^{bd} = w^{ad} holds.

Proof: The proof is by induction, which is transfinite if A is uncountably infinite. The main idea of the proof is the following. Assume that one has rescaled the vectors w^{ab} for all alternatives a, b in some subset of acts A′ ⊂ A, and one now wishes to add another act to this subset, d ∉ A′. Choose a ∈ A′ and consider the vectors ŵ^{ad}, ŵ^{bd} for a, b ∈ A′. By Lemma 3.9, there are unique positive coefficients α, β such that w^{ab} = αŵ^{ad} + βŵ^{db}. One would like to show that the coefficient α does not depend on the choice of b ∈ A′. Indeed, if it did, one would find that there are a, b, c ∈ A′ such that the vectors ŵ^{ad}, ŵ^{bd}, ŵ^{cd} are linearly dependent, and this contradicts the diversity axiom.

Claim 3.12 Let A′ ⊂ A, |A′| ≥ 3, d ∈ A\A′. Suppose that there are vectors {w^{ab}}_{a,b∈A′, a≠b}, as in Lemma 3.5, such that for any three distinct acts a, b, e ∈ A′, w^{ab} + w^{be} = w^{ae} holds. Then there are vectors {w^{ab}}_{a,b∈A′∪{d}, a≠b}, as in Lemma 3.5, such that for any three distinct acts a, b, e ∈ A′ ∪ {d}, w^{ab} + w^{be} = w^{ae} holds.

Proof: Choose distinct a, b, c ∈ A′. Let ŵ^{ad}, ŵ^{bd}, and ŵ^{cd} be the vectors provided by Lemma 3.5 when applied to the pairs (a, d), (b, d), and (c, d), respectively. Consider the triple {a, b, d}. By Lemma 3.9 there are unique coefficients λ({a, d}, b), λ({b, d}, a) > 0 such that

(I) w^{ab} = λ({a, d}, b)ŵ^{ad} + λ({b, d}, a)ŵ^{db}.

Applying the same reasoning to the triple {a, c, d}, we find that there are unique coefficients λ({a, d}, c), λ({c, d}, a) > 0 such that

w^{ac} = λ({a, d}, c)ŵ^{ad} + λ({c, d}, a)ŵ^{dc},

or

(II) w^{ca} = λ({a, d}, c)ŵ^{da} + λ({c, d}, a)ŵ^{cd}.

We wish to show that λ({a, d}, b) = λ({a, d}, c).
To see this, we consider also the triple {b, c, d} and conclude that there are unique coefficients λ({b, d}, c), λ({c, d}, b) > 0 such that

(III) w^{bc} = λ({b, d}, c)ŵ^{bd} + λ({c, d}, b)ŵ^{dc}.

Since a, b, c ∈ A′, we have

w^{ab} + w^{bc} + w^{ca} = 0,

and it follows that the summation of the right-hand sides of (I), (II), and (III) also vanishes:

[λ({a, d}, b) − λ({a, d}, c)]ŵ^{ad} + [λ({b, d}, c) − λ({b, d}, a)]ŵ^{bd} + [λ({c, d}, a) − λ({c, d}, b)]ŵ^{cd} = 0.

If some of the coefficients above are not zero, the vectors {ŵ^{ad}, ŵ^{bd}, ŵ^{cd}} are linearly dependent, and this contradicts the diversity axiom. For instance, if ŵ^{ad} is a non-negative linear combination of ŵ^{bd} and ŵ^{cd}, for no I will it be the case that b ≻_I c ≻_I d ≻_I a.

We therefore obtain λ({a, d}, b) = λ({a, d}, c) for every b, c ∈ A′\{a}. Hence for every a ∈ A′ there exists a unique λ({a, d}) > 0 such that, for every distinct a, b ∈ A′, w^{ab} = λ({a, d})ŵ^{ad} + λ({b, d})ŵ^{db}. Defining w^{ad} = λ({a, d})ŵ^{ad} completes the proof of the claim.

The lemma is proved by an inductive application of the claim. In case A is not countable, the induction is transfinite.

Note that Lemma 3.11, unlike Lemma 3.9, guarantees the possibility of rescaling simultaneously all the w^{ab}-s from Lemma 3.5 such that the Jacobi identity will hold on A.

We now complete the proof that (i) implies (ii). Choose an arbitrary act, say, e in A. Define w^e = 0, and for any other alternative a, define w^a = w^{ae}, where the w^{ae}-s are from Lemma 3.11.

Given I ∈ Q_+^M and a, b ∈ A we have:

a ≽_I b ⇔ w^{ab} · I ≥ 0 ⇔ (w^{ae} + w^{eb}) · I ≥ 0 ⇔ (w^{ae} − w^{be}) · I ≥ 0 ⇔ w^a · I − w^b · I ≥ 0 ⇔ w^a · I ≥ w^b · I.

The first equivalence follows from Lemma 3.5(i), the second from the Jacobi identity of Lemma 3.11, the third from Lemma 3.5(vi), and the fourth from the definition of the w^a-s. Hence, (∗∗) of the theorem has been proved.
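The last step of the argument, passing from pairwise vectors that satisfy the Jacobi identity to one vector per act, can be illustrated numerically. The following is a minimal sketch and not part of the proof: the act vectors in ACT_VECTORS, the helper names, and the grid of memories are all hypothetical choices made for illustration. The pairwise vectors are built as differences, so the Jacobi identity holds by construction, and the sketch then checks that anchoring at e (setting w^e = 0 and w^a = w^{ae}) reproduces the same rankings.

```python
from itertools import permutations, product


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(x * y for x, y in zip(u, v))


def add(u, v):
    return tuple(x + y for x, y in zip(u, v))


def sub(u, v):
    return tuple(x - y for x, y in zip(u, v))


# Hypothetical act vectors over |M| = 3 case types (illustrative values only).
ACT_VECTORS = {"a": (2, 0, 1), "b": (0, 3, 1), "e": (1, 1, 0)}
ACTS = sorted(ACT_VECTORS)

# Pairwise vectors w^{xy} = w^x - w^y satisfy the Jacobi identity by construction.
PAIR = {(x, y): sub(ACT_VECTORS[x], ACT_VECTORS[y])
        for x, y in permutations(ACTS, 2)}


def jacobi_holds(pair, acts):
    """Check w^{xy} + w^{yz} == w^{xz} for every ordered triple of distinct acts."""
    return all(add(pair[x, y], pair[y, z]) == pair[x, z]
               for x, y, z in permutations(acts, 3))


def anchored_vectors(pair, acts, anchor):
    """Rebuild act vectors from pairwise ones: w^anchor = 0, w^x = w^{x,anchor}."""
    dim = len(next(iter(pair.values())))
    return {x: (0,) * dim if x == anchor else pair[x, anchor] for x in acts}


def representation_agrees(pair, vectors, acts, memories):
    """Check that w^{xy}.I >= 0 iff w^x.I >= w^y.I for all pairs and memories."""
    return all((dot(pair[x, y], I) >= 0) == (dot(vectors[x], I) >= dot(vectors[y], I))
               for x, y in permutations(acts, 2) for I in memories)


MEMORIES = list(product(range(4), repeat=3))  # small grid of repetition counts
REBUILT = anchored_vectors(PAIR, ACTS, "e")
```

The rebuilt vectors differ from the hypothetical originals only by the common shift −w^e, which is consistent with the uniqueness statement of the theorem (act vectors are determined up to a positive scalar and a common translation).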
It remains to be shown that the vectors defined above are such that conv({w^a − w^b, w^b − w^d, w^d − w^e}) ∩ R_-^M = ∅ for every distinct a, b, d, e. Indeed, in Lemma 3.5(v) we have shown that w^a − w^b ∉ R_-^M. To see this, one only uses the diversity axiom for the pair {a, b}. Lemma 3.9 has shown, among other things, that a non-zero linear combination of w^a − w^b and w^b − w^c cannot be in R_-^M, using the diversity axiom for triples. Linear independence of all three vectors was established in Lemma 3.11. However, the full implication of the diversity condition will be clarified by the following lemma. Since it is a complete characterization, we will also use it in proving the converse implication, namely, that part (ii) of the theorem implies part (i). The proof of the lemma below relies on Lemma 3.5. It therefore holds under the assumption that for any distinct a, b ∈ A there is an I such that a ≻_I b.

Lemma 3.13 For every list (a, b, d, e) of distinct elements of A, there exists I ∈ 𝕁 such that

a ≻_I b ≻_I d ≻_I e iff conv({w^{ab}, w^{bd}, w^{de}}) ∩ R_-^M = ∅.

Proof: There exists I ∈ 𝕁 such that a ≻_I b ≻_I d ≻_I e iff there exists I ∈ 𝕁 such that w^{ab} · I, w^{bd} · I, w^{de} · I > 0. This is true iff there exists a probability vector p ∈ ∆(M) such that w^{ab} · p, w^{bd} · p, w^{de} · p > 0.

Suppose that w^{ab}, w^{bd}, and w^{de} are column vectors and consider the |M| × 3 matrix (w^{ab}, w^{bd}, w^{de}) as a 2-person 0-sum game. The argument above implies that there exists I ∈ 𝕁 such that a ≻_I b ≻_I d ≻_I e iff the maximin in this game is positive. This is equivalent to the minimax being positive, which means that for every mixed strategy of player 2 there exists m ∈ M that guarantees player 1 a positive payoff. In other words, there exists I ∈ 𝕁 such that a ≻_I b ≻_I d ≻_I e iff for every convex combination of {w^{ab}, w^{bd}, w^{de}} at least one entry is positive, i.e., conv({w^{ab}, w^{bd}, w^{de}}) ∩ R_-^M = ∅. □

This completes the proof that (i) implies (ii). □
Part 2: (ii) implies (i)

It is straightforward to verify that if {≽_I}_{I∈Q_+^M} are representable by {w^a}_{a∈A} as in (∗∗), they have to satisfy Axioms 1-3. To show that Axiom 4 holds, we quote Lemma 3.13 of the previous part. □

Part 3: Uniqueness

It is obvious that if ŵ^a = αw^a + u for some scalar α > 0, a vector u ∈ R^M, and all a ∈ A, then part (ii) of the theorem holds with the matrix ŵ replacing w.

Suppose that {w^a}_{a∈A} and {ŵ^a}_{a∈A} both satisfy (∗∗), and we wish to show that there are a scalar α > 0 and a vector u ∈ R^M such that for all a ∈ A, ŵ^a = αw^a + u.

Choose a ≠ e (a, e ∈ A, where e satisfies w^e = 0). From the uniqueness part of Lemma 3.5 there exists a unique α > 0 such that (ŵ^a − ŵ^e) = α(w^a − w^e) = αw^a. Define u = ŵ^e.

We now wish to show that, for any b ∈ A, ŵ^b = αw^b + u. By definition of α and of u, this equality holds for b = e and for b = a. Assume, then, that a ≠ b ≠ e. Again, from the uniqueness part of Lemma 3.5 there are unique γ, δ > 0 such that

(ŵ^b − ŵ^a) = γ(w^b − w^a),
(ŵ^e − ŵ^b) = δ(w^e − w^b).

Summing up these two with (ŵ^a − ŵ^e) = α(w^a − w^e), we get

0 = α(w^a − w^e) + γ(w^b − w^a) + δ(w^e − w^b)
  = α(w^a − w^e) + γ(w^b − w^e) + γ(w^e − w^a) + δ(w^e − w^b).

Thus

(α − γ)(w^a − w^e) + (γ − δ)(w^b − w^e) = 0,

and, since w^e = 0,

(α − γ)w^a + (γ − δ)w^b = 0.

Since w^a = (w^a − w^e) ≠ 0, w^b = (w^b − w^e) ≠ 0, and (w^a − w^e) ≠ λ(w^b − w^e) if 0 ≠ λ ∈ R, we get α = γ = δ. Plugging α = γ into (ŵ^b − ŵ^a) = γ(w^b − w^a) proves that ŵ^b = αw^b + u. □

This completes the proof of Theorem 3.1. □

Proof of Proposition 3.2 (Insufficiency of A1-3): We show that without the diversity axiom representability is not guaranteed. Let A = {a, b, d, e} and |M| = 3. Define the following vectors in R^3:

w^{ab} = (−1, 1, 0); w^{ad} = (0, −1, 1); w^{ae} = (1, 0, −1);
w^{bd} = (2, −3, 1); w^{de} = (1, 2, −3); w^{be} = (3, −1, −2),

and w^{xy} = −w^{yx} and w^{xx} = 0 for x, y ∈ A. For x, y ∈ A and I ∈ 𝕁 define: x ≽_I y iff w^{xy} · I ≥ 0.
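The vectors just defined are concrete enough to check by machine. The sketch below is only an illustration under stated assumptions: the helper names and the finite grid of memories (counts 0 to 6 for each of the three case types) are hypothetical, and a grid search is a sanity check rather than a proof. It encodes the six vectors, confirms the identity 2w^{ab} + w^{bd} = w^{ad}, tests transitivity of the induced relation on the grid, and searches the grid for a memory producing a given strict order.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(x * y for x, y in zip(u, v))


def combo(p, u, q, v):
    """Return p*u + q*v for integer scalars p, q and tuples u, v."""
    return tuple(p * x + q * y for x, y in zip(u, v))


# The six vectors from the proof of Proposition 3.2.
W = {
    ("a", "b"): (-1, 1, 0),
    ("a", "d"): (0, -1, 1),
    ("a", "e"): (1, 0, -1),
    ("b", "d"): (2, -3, 1),
    ("b", "e"): (3, -1, -2),
    ("d", "e"): (1, 2, -3),
}
# Antisymmetric completion: w^{yx} = -w^{xy}.
W.update({(y, x): tuple(-c for c in v) for (x, y), v in list(W.items())})


def weakly_prefers(x, y, memory):
    """x is weakly preferred to y at this memory iff w^{xy} . I >= 0 (w^{xx} = 0)."""
    return x == y or dot(W[x, y], memory) >= 0


def transitive_on(memories, acts="abde"):
    """Check transitivity of the induced relation at every memory in the list."""
    return all(weakly_prefers(x, z, I)
               for I in memories
               for x, y, z in product(acts, repeat=3)
               if weakly_prefers(x, y, I) and weakly_prefers(y, z, I))


def strict_order_exists(order, memories):
    """Is there a memory in the list that ranks the acts strictly in this order?"""
    return any(all(dot(W[x, y], I) > 0 for x, y in zip(order, order[1:]))
               for I in memories)


# Hypothetical finite grid of memories: 0..6 repetitions of each case type.
GRID = list(product(range(7), repeat=3))
```

On this grid, `strict_order_exists(("b", "d", "e", "a"), GRID)` finds no memory, in line with the failure of diversity, whereas some other strict orders, for example (b, d, a, e) at I = (3, 1, 0), are attained.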
It is easy to see that with this definition the axioms of continuity and combination, and the completeness part of the ranking axiom, hold. Only transitivity requires a proof. This can be done by direct verification. It suffices to check the four triples (x, y, z) where x, y, z ∈ A are distinct and in alphabetical order. For example, since 2w^{ab} + w^{bd} = w^{ad}, a ≽_I b and b ≽_I d imply a ≽_I d.

Suppose by way of negation that there are four vectors in R^3, w^a, w^b, w^d, w^e, that represent ≽_I for all I ∈ 𝕁 as in Theorem 3.1. By the uniqueness of representations of half-spaces in R^3, for every pair x, y ∈ A there is a positive real number λ_{xy} such that λ_{xy} w^{xy} = (w^x − w^y). Further, λ_{xy} = λ_{yx}. Since (w^a − w^b) + (w^b − w^d) + (w^d − w^a) = 0, we have

λ_{ab}(−1, 1, 0) + λ_{bd}(2, −3, 1) + λ_{da}(0, 1, −1) = 0.

So, λ_{bd} = λ_{da}, and λ_{ab} = 2λ_{bd}. Similarly, (w^a − w^b) + (w^b − w^e) + (w^e − w^a) = 0 implies

λ_{ab}(−1, 1, 0) + λ_{be}(3, −1, −2) + λ_{ea}(−1, 0, 1) = 0,

which in turn implies λ_{ab} = λ_{be} and λ_{ea} = 2λ_{be}. Finally, (w^a − w^d) + (w^d − w^e) + (w^e − w^a) = 0 implies

λ_{ad}(0, −1, 1) + λ_{de}(1, 2, −3) + λ_{ea}(−1, 0, 1) = 0.

Hence, λ_{ad} = 2λ_{de} and λ_{ea} = λ_{de}. Combining the above equalities we get λ_{ad} = 8λ_{da}, a contradiction.

Observe that in this example the diversity axiom does not hold. For explicitness, consider the order (b, d, e, a). If for some I ∈ 𝕁, say I = (k, l, m), b ≻_I d and d ≻_I e, then 2k − 3l + m > 0 and k + 2l − 3m > 0. Hence, 4k − 6l + 2m + 3k + 6l − 9m = 7k − 7m > 0. But e ≻_I a means m − k > 0, a contradiction. □

Chapter 4 Conceptual Foundations

This chapter deals with the conceptual and epistemological foundations of case-based decision theory. The discussion is most concrete when CBDT is juxtaposed with the two other formal theories of reasoning. We begin with a comparison of CBDT with EUT. We then proceed to compare CBDT with rule-based systems. We argue that, on epistemological grounds, CBDT is more naturally derived than the two other approaches.
12 CBDT and Expected Utility Theory

We devote this section to several remarks on the comparison between CBDT and EUT, and to our views concerning their possible applications and philosophical foundations. For the purposes of this discussion we focus on the most naive and simplest version of CBDT, namely, U-maximization.

12.1 Reduction of theories

While CBDT is probably a more natural framework in which one may model satisficing behavior, EUT can be used to explain this behavior as well. For instance, assume that we observe a decision maker who never uses certain acts available to her. By ascribing to her appropriately chosen prior beliefs, we may describe this decision maker as an expected utility maximizer. In fact, it is probably possible to provide an EUT account of any application in which CBDT can be used, by using a rich enough state space and an elaborate enough prior on it.

Conversely, one may also simulate an expected utility maximizer by a U-maximizer whose memory contains a sufficiently rich set of hypothetical cases.¹ Indeed, let there be given a set of states of the world Ω and a set of consequences R. Denote the set of acts by A = R^Ω = {a : Ω → R}. Assume that the agent has a utility function u : R → ℝ and a probability measure µ on an algebra of subsets of Ω. (For simplicity we may consider a finite Ω.) The corresponding case-based decision maker would have a hypothetical case for each pair of a state of the world ω and an act a:

M = {((ω, a), a, a(ω)) | ω ∈ Ω, a ∈ A}.

Letting the similarity of the problem at hand to the problem (ω, a) be µ(ω), U-maximization reduces to expected utility maximization. (Naturally, if Ω or R are infinite one would have to extend CBDT to deal with an infinite memory.)
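The reduction just described can be traced on a toy example. The sketch below is an illustration only. It assumes the U-maximization functional from the earlier chapters, U(a) = Σ s(p, q)u(r), summed over the cases (q, a, r) in memory in which act a was chosen, and it uses hypothetical states, consequences, probabilities, and utilities. Memory holds one hypothetical case ((ω, a), a, a(ω)) per state-act pair, and the similarity to problem (ω, a) is set to µ(ω).

```python
from fractions import Fraction
from itertools import product

# Hypothetical primitives of the expected-utility model.
STATES = ("rain", "sun")
CONSEQUENCES = (0, 1, 4)
MU = {"rain": Fraction(1, 4), "sun": Fraction(3, 4)}      # probability measure
UTILITY = {0: Fraction(0), 1: Fraction(1), 4: Fraction(2)}  # utility function u

# Acts are all functions from states to consequences, stored as tuples
# holding one consequence per state (in the order of STATES).
ACTS = list(product(CONSEQUENCES, repeat=len(STATES)))


def outcome(act, state):
    """The consequence a(omega) of an act at a state."""
    return act[STATES.index(state)]


def expected_utility(act):
    return sum(MU[s] * UTILITY[outcome(act, s)] for s in STATES)


# One hypothetical case ((omega, a), a, a(omega)) per state-act pair.
MEMORY = [((s, a), a, outcome(a, s)) for s in STATES for a in ACTS]


def similarity(problem):
    """Similarity of the problem at hand to hypothetical problem (omega, a)."""
    state, _act = problem
    return MU[state]


def u_value(act):
    """U(a): similarity-weighted sum of utilities over cases where a was chosen."""
    return sum(similarity(q) * UTILITY[r] for q, b, r in MEMORY if b == act)


BEST_BY_U = max(ACTS, key=u_value)
BEST_BY_EU = max(ACTS, key=expected_utility)
```

With these definitions `u_value` and `expected_utility` coincide act by act, so the two decision makers choose identically. The memory has |Ω| · |A| entries, which hints at why the construction is a simulation device rather than a plausible description of a decision maker's actual memory.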
Furthermore, Bayes' update of the probability measure may also be reflected in the similarity function: a problem whose description indicates that an event B ⊂ Ω has occurred should be set similar to degree zero to any hypothetical problem (ω, a) where ω ∉ B.

Since one can mathematically embed CBDT in EUT and vice versa, it is probably impossible to choose between the two on the basis of predicted observable behavior.² Each is a refutable theory given a description of a decision problem, where its axioms set the conditions for refutation. But in most applications there is enough freedom in the definition of states or cases, probability or similarity, for each theory to account for the data. Moreover, a problem that is formulated in terms of states has many potential translations to the language of cases and vice versa. It is therefore hard to offer real world phenomena that conform to one theory but that cannot be explained at all by the other. It is even harder to imagine a clear-cut test that will select the "correct" theory.

¹ As mentioned above, U′′-maximization can simulate expected utility maximization without introducing hypothetical cases explicitly.
² Matsui (2000) formally proves an equivalence result between EUT and CBDT. His construction does not resort to hypothetical cases.

To a large extent, EUT and CBDT are not competing theories; they are different conceptual frameworks, in which specific theories are formulated.³ Rather than asking which one of them is more accurate, we should ask which one is more convenient. The two conceptual frameworks are equally powerful in terms of the range of phenomena they can describe. But for each phenomenon, they will not necessarily be equally intuitive. Furthermore, the specific theories we develop in these conceptual frameworks need not provide the same predictions given the same observations.
For instance, assume that a theorist observes a collection of phenomena that could be classified as satisficing behavior. She can explain all the data observed whether she uses EUT or CBDT as her conceptual framework. But the language of CBDT will be more conducive to observing this pattern and to forming a theory of satisficing for future predictions. Needless to say, there are large classes of phenomena for which the language of EUT will be the more natural, and will also generate more successful predictions. Hence we believe that there is room for both languages and for the two conceptual frameworks that employ these languages.

³ See the discussion in Section 2.

12.2 Hypothetical reasoning

Judging the cognitive plausibility of EUT and CBDT, one notes a crucial difference between them: CBDT, as opposed to EUT, does not require the decision maker to think in hypothetical or counterfactual terms. In EUT, whether explicitly or implicitly, the decision maker considers states of the world and reasons in propositions of the form "If the state of the world were ω and I chose a, then r would result". In U-maximization no such hypothetical reasoning is assumed.

Similarly, there is a difference between EUT and CBDT in terms of the informational requirements they entail regarding the utility function: to consciously implement EUT, one needs to know the utility function u, that is, its value for any consequence that may result from any act. For U- (or U′-) maximization, on the other hand, it suffices to know the u-values of those outcomes that were actually experienced.

The reader will recall, however, that our axiomatic derivation of CBDT involved preferences between acts given hypothetical memories. It might appear therefore that CBDT depends on hypothetical reasoning just as EUT does. But this conclusion would be misleading.
First, one has to distinguish between elicitation of parameters by an outside observer, and application of the theory by the decision maker herself. While elicitation of parameters such as the decision maker's similarity function may involve hypothetical questions, a decision maker who knows her own tastes and similarity judgments need not engage in any hypothetical reasoning in order to apply CBDT. By contrast, hypothetical questions are intrinsic to the application of EUT.

Second, when states of the world are not naturally given, the elicitation of beliefs for EUT also involves inherently hypothetical questions. Classical EUT maintains that no loss of generality is involved in assuming that the states of the world are known, since one may always define the states of the world to be all the functions from available acts to conceivable outcomes. This view is theoretically very appealing, but it undermines the supposedly behavioral foundations of Savage's model. In such a construction, the set of conceivable acts one obtains is much larger than the set of acts from which the decision maker can actually choose. Specifically, let there be given a set of acts A and a set of outcomes X. The states of the world are X^A, that is, the functions from acts to outcomes. The set of conceivable acts will be Â ≡ X^(X^A), that is, all functions from states of the world to outcomes. Hence the cardinality of the set of conceivable acts Â is larger than that of the actual ones A by two orders of magnitude, that is, by two successive exponentiations. Yet, using a model such as Savage's, one needs to assume a (complete) preference order on Â, while in principle preferences can be observed only between elements of A. Differently put, such a canonical construction of the states of the world gives rise to preferences that are intrinsically hypothetical and is a far cry from the behavioral foundations of Savage's original model.
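The growth described here is easy to underestimate, so a short computation may help. This is a minimal sketch: the function names are hypothetical, and it assumes finite sets of acts and outcomes, so that there are |X|^|A| states and |X|^(|X|^|A|) conceivable acts.

```python
def state_space_size(num_acts, num_outcomes):
    """Number of states: all functions from acts to outcomes, |X| ** |A|."""
    return num_outcomes ** num_acts


def conceivable_acts_size(num_acts, num_outcomes):
    """Number of conceivable acts: all functions from states to outcomes."""
    return num_outcomes ** state_space_size(num_acts, num_outcomes)


# (|A|, |X|) -> (number of states, number of conceivable acts)
SIZES = {
    (acts, outcomes): (state_space_size(acts, outcomes),
                       conceivable_acts_size(acts, outcomes))
    for acts, outcomes in [(2, 2), (3, 2), (3, 3)]
}
```

Even with three actual acts and three outcomes, the canonical construction yields 27 states and 3^27 = 7,625,597,484,987 conceivable acts, over which a complete preference order would have to be specified.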
In summary, when there is no obvious mapping from the formal state space to real-world scenarios, both EUT and CBDT rely on hypothetical questions or on contingency plans for elicitation of parameters. The Savage questionnaire to elicit EUT parameters will typically involve a much larger set of acts than the corresponding one for CBDT. Importantly, the Savage questionnaire contains many acts that are hard to imagine. Furthermore, when it comes to application of the theory, CBDT clearly requires less hypothetical reasoning than does EUT.

12.3 Observability of data

In order to view choice behavior as observable, EUT needs to assume that the states of the world are known to an outside observer. Correspondingly, CBDT needs to assume that cases are observable in this sense. It might appear that states of the world are part of the objective description of the decision problem, whereas cases in the decision maker's memory are entirely subjective and cannot be directly observed. This would indeed be true if one were to use EUT only in problems where states are naturally defined, and to apply CBDT to all cases in a person's memory. But, as argued above, EUT is often applied to situations where states are not given. In these situations it is not obvious which states the decision maker conceives of. Further, one may apply CBDT only to cases that are objectively known, say, cases that were published in the media. With such applications, choice behavior in the EUT model would appear to rely on unobservable data, whereas behavior in the CBDT model would be based on sound, objective data. Thus, there is nothing inherent to states or to cases that makes one concept more easily observable than the other.

12.4 The primacy of similarity

We argue that any assignment of probabilities to events, and therefore any application of EUT, relies on subjective similarity judgments in one form or another.
Consider first the "classical" approach, according to which, for example, equal probabilities are assigned to the sides of a die about to be rolled. Evidently, such an assignment, justified by Laplace's "principle of insufficient reason", is based on perceived similarity between any two possible outcomes.

The "frequentist" approach uses empirical frequencies to determine probabilities. It thus relies on the concept of repetition, namely, on the idea that the same experiment is repeated over and over again. But what counts as a repetition of the same experiment is a matter of subjective judgment. Consider the example of medical data. Suppose that one asks one's doctor what the probability of success of a certain medical procedure is. One would hardly be happy with empirical frequencies that pertain to all patients who have ever undergone the procedure. Rather, one would like to know the empirical frequencies in a sub-population of patients with similar characteristics such as age and gender. But if one adopts too specific a definition of "similar patients", one might be left with an empty database, since no two human bodies are absolutely identical. It follows that there is a subjective element in the judgment of similar cases over which one would like to compute empirical frequencies. In principle, this observation applies to any application of the frequentist method. When using repetitions of a toss of a coin over and over again, one implicitly assumes that the conditions under which the coin is tossed are identical. But no two instances can be completely identical. Rather, one resorts to subjective judgment to determine what conditions are similar enough to be viewed as identical.

Finally, using a subjectivist approach of assigning subjective probabilities by responding to de Finetti's or Savage's questionnaires, one again has to assume that different binary decisions can be made under identical conditions.
Indeed, very little can be said about single instances in life. One never knows what the history of the world would have been had different events taken place, or had different choices been made. Our ability to discuss counterfactuals in an intelligent manner, as well as our ability to assign probabilities to events, relies on our subjective similarity judgments.

We therefore conclude that the notion of similarity is primitive. It lies at the heart of probability assignments, as well as at the heart of induction (as we argue in Chapter 7). While we do not attempt to defend the particular way in which CBDT makes use of similarity between cases, we maintain that the language of CBDT is more basic than the language of EUT (or the language of rule-based systems).

12.5 Bounded rationality?

There is a temptation to view EUT as a normative theory, and CBDT as a descriptive theory of "bounded rationality". We maintain, however, that normative theories should be constrained by practicality. Theoretically, one may always construct a state space. Further, Savage's axioms appear very compelling when formulated relative to such a space. But they do not give any guidance for the selection of a prior over the state space. Worse still, in many interesting applications the state space becomes too large for Bayesian learning to correct a "wrong" prior. Thus it is far from clear that, in complex problems for which past data are scarce, case-based decision making is any less rational than expected utility maximization.

Using the definition of rationality proposed in Section 2, there may be many situations in which decision makers who do not follow EUT would not wish to change their behavior even after being confronted with its analysis. Consider, for instance, the example of military intervention in Bosnia discussed in Section 4. Suppose that a decision theorist addresses President Clinton and suggests that he be rational and that he use expected utility theory rather than analogies to past cases. Clinton may say, "Fine, I see your theoretical point. But how should I form a prior over this vast state space?" In the absence of a practical way to generate such a prior, Clinton may not be embarrassed by the fact that he bases his decision on analogies to the Vietnam and the Gulf wars. He may well choose to follow the same decision procedure even if he agrees that Savage's axioms make sense.

More generally, we believe that, as long as one employs a definition of rationality that relates to reasoning, and not only to behavior, one may find cases in which CBDT is more rational than EUT.

13 CBDT and Rule-Based Systems

In this section we compare CBDT to rule-based systems from an epistemological viewpoint, and proceed to offer a conceptual derivation of CBDT. That is, we argue for a particular epistemological position, and show that the theory presented above is a natural application of this position to the problem of decision making under uncertainty.

13.1 What can be known?

In philosophy and in artificial intelligence (AI) it is common to assume that general rules of the form "For all x, P(x)" are objects of knowledge. Some rules of this form may be considered analytic propositions, which are true by definition. Yet, many are synthetic propositions, and represent empirical knowledge.⁴ Knowledge of such a proposition is supposedly obtained by explicit induction, that is, by generalization of particular cases that can be viewed as instances of the proposition.

⁴ By "synthetic" propositions we refer to non-tautological ones. While this distinction was already rejected by Quine (1953), we still find it useful for the purposes of the present discussion.

Explicit induction is a natural cognitive process. It is particularly common for people to convey knowledge by formulating rules.
Yet, as was already pointed out by Hume, explicit induction does not rely on sound logical foundations. Hume (1748, Section IV) writes, “... The contrary of every matter of fact is still possible; because it can never imply a contradiction, and is conceived by the mind with the same facility and distinctness, as if ever so conformable to reality. That the sun will not rise to-morrow is no less intelligible a proposition, and implies no more contradiction than the affirmation, that it will rise. We should in vain, therefore, attempt to demonstrate its falsehood.”

Hume claims that synthetic propositions are not necessarily true. Useful rules, which have implications regarding empirical facts in the future, cannot be known.⁵ Explicit induction suggests that such rules are nevertheless objects of knowledge. Because unwarranted generalizations may be refuted, explicit induction raises the problem of knowledge revision and update. Much attention has been devoted to this problem in the recent literature in philosophy and AI. (See Levi (1980), McDermott and Doyle (1980), Reiter (1980) and others.)

⁵ Quine (1969a) writes, “... I do not see that we are further along today than where Hume left us. The Humean predicament is the human predicament”.

The Humean alternative to this problem will attempt to avoid induction in the first place, rather than performing it and then dealing with the problems it poses. Following this line of thought, knowledge representation should confine itself to those things that can indeed be known. Facts, or cases, can be known, while rules can at best be conjectured. In this approach, knowledge revision is unnecessary. While rules can be contradicted by other rules or by evidence, cases cannot contradict each other.

Admittedly, even cases may not be objectively known. Terms such as “empirical knowledge” and “knowledge of a fact” prove elusive to define. (See, for instance, Moser (1986) for an anthology on this subject.) Furthermore, if we take Hanson’s (1958) view that observations may be theory-laden, we might find that the very formulation of the cases that we allegedly observe may depend on the rules that we believe to apply. If this is the case, one cannot know or even represent cases as independent of rules. Moreover, even if one could, such a representation would not solve the epistemological problem, since cases cannot be known with certainty any more than rules can.

We are sympathetic to both claims, but we tend to view them as issues of secondary importance. We believe that the theoretical literature on epistemology and knowledge representation may benefit, and indeed has benefited, from drawing distinctions between theory and observations, and between the knowledge of a case and that of a rule. As argued in Section 3, philosophy, being a social science, cannot expect models to be more than “approximations”, “metaphors”, or “images”. Namely, we should not expect philosophical models to provide a complete and accurate description of reality. In this spirit, we choose to employ a conceptual framework in which cases can be known, and can be observed independently of rules or theories.

13.2 Deriving case-based decision theory

Assume, then, that cases are the object of knowledge. How should they be described, and how are they used in decision making? In this subsection we argue for the particular model of CBDT presented in Chapter 2.

First, we maintain that the two questions are intrinsically linked. The choice of knowledge representation has to be guided by the use one makes of knowledge. Focusing on our final goal, namely, to provide a theory of decision making, we take the view that knowledge is ultimately used for action. Thus, we should choose a way to represent cases in anticipation of the way they will be used in decision making.
Hence the formal structure of a case should reflect its decision making aspect, if such exists.

Assume, first, that all cases tell stories of past decisions. When considering such a case, it is natural to set our clocks to the time at which the decision occurred. At that point, the past was whatever was known to the decision maker at the time when she was called upon to act. This may be referred to as the decision problem. The present was the act chosen by the decision maker. The future, at the time of decision, was what resulted from the choice. This is referred to as the outcome. Thus, the description of a case as a triple of decision problem, act, and outcome, is a natural way to organize knowledge of cases for the purposes of decision making.

As mentioned in Section 6, there are cases that do not explicitly involve any decision making, yet are relevant to future decisions. For instance, deciding whether or not to go on a trip to the country, and observing clouds in the sky, it is relevant that yesterday cloudy skies were followed by rain. Using the notion of hypothetical cases, one may translate such a case, which does not involve any choice, to a case that does: one might imagine that, had one gone on a trip yesterday, when the sky was cloudy, the trip would have been a disaster. Alternatively, one may suppress the act in the description of the case, and think of the “problem” simply as the “circumstances” that were followed by an outcome. Formally, one may assume that in any decision problem there is at least one act that is available, say, “do nothing”, and thus fit cases that involve no decisions into our framework.

Circumstances are the conditions under which an outcome has been observed. Presumably, a case is relevant only to cases that have similar circumstances. In other words, the circumstances of a past case help us delineate its scope of relevance. But some cases have no such qualifications.
For instance, Hume’s sun rises every morning. A case of the sun rising in the past does not involve a set of circumstances. We may thus suppress the decision problem, or represent it by an empty set of circumstances, and find that the resulting case is relevant to all future cases.

It appears that only the outcome, namely, a description of what happened, is essential for a case to be informative. The problem and the act may or may not be present. When they are, they may be jointly viewed as antecedents, whereas the outcome is the consequent. The distinction between the problem and the act is along the line drawn by control: those circumstances that were not under the decision maker’s control are referred to as the problem, and those that were, as the act.

How are the cases used in decision making, then? Again, we resort to Hume (1748) who writes, “In reality, all arguments from experience are founded on the similarity which we discover among natural objects, and by which we are induced to expect effects similar to those which we have found to follow from such objects. ... From causes which appear similar we expect similar effects. This is the sum of all our experimental conclusions.”

As quoted above, Hume argued against explicit induction, by which one formulates a general rule. Instead, he suggested that people engage in implicit induction: they learn from the past regarding the future by employing similarity judgments. The “causes” to which Hume refers are the circumstances we mention above, the combination of the decision problem and the act. The “effect” corresponds to our outcomes. Hence our description of a case may be taken as a formalization of Hume’s implicit structure, with the further specification allowing a distinction between problems and acts. When observing phenomena over which one has no control, this distinction is trivial, and our notion of a problem coincides with “circumstances” and with Hume’s “cause”.
But when we focus on decision making and on the way in which one may affect circumstances and thereby also outcomes, dividing circumstances into problems and acts seems natural. Moreover, when considering a current decision problem, the distinction between problems and acts is quintessential to rational choice: one has to know what is under one’s control and what is not, which variables are decision variables and which are given. As mentioned in Section 5, however, when learning from the experience one has with past problems, the distinction between problem and act might be blurred. What seems to be relevant is the circumstances that they jointly constitute, where the question regarding future problems would be whether one can bring about similar circumstances.

We assume that cases are objectively known. By contrast, the similarity function is a matter of subjective judgment. Where does one derive it from? How do and how should people measure similarities? We do not offer any general answers to these questions. But one might hope that the questions will be better defined, and the answers easier to obtain, if we know how similarity is used. That is, we should first ask ourselves, what would we do with a similarity function if we had one?

A well-known procedure that employs similarities is the “nearest neighbor” (NN) approach applied to classification problems. (See Fix and Hodges (1951, 1952), Royall (1966), Cover and Hart (1967), and Devroye, Gyorfi, and Lugosi (1996).) In these problems a “classifier” is faced with an instance that belongs to one of several possible categories. Based on examples of correct categorizations of past instances, the classifier has to guess the classification of the present one. The simplest nearest neighbor algorithm suggests choosing the category to which the most similar instance among the known examples belongs.
k-NN algorithms would consider the k most similar cases, and choose the category that is most common amongst them. This corresponds to a majority vote among the k most similar past instances, and it is also generalized to a weighted majority vote, where more similar known examples get a higher weight in the voting scheme.

Even when the problem is viewed merely as one of classification, we argue that one has to use all cases rather than a few nearest neighbors. When only one nearest neighbor is used, the algorithm appears rather extreme: should one most similar case single-handedly outweigh any number of slightly less similar ones? Indeed, this is the main theoretical reason to consider k-NN approaches. But for any k > 1, the weight assigned to a past instance depends not only on its intrinsic similarity to the new one, but also on that of other instances. This can be shown to result in violations of the combination axiom of Section 9: it is possible that a k-NN system would provide an identical classification given two disjoint databases of examples, but a different one given their union. It seems logical that each case would be relevant for the new classification to a degree that is independent of other cases. Thus, one may wish to use all neighbors, near or far, in each problem. They might be weighted by their similarities. Indeed, this would result in the prediction rule discussed in Section 7.⁶

But how does one generate a decision making procedure based on these ideas? As mentioned in Section 7, classification and prediction are decision problems of a very special type, in which the decision maker’s choices do not affect eventualities. Thus, by definition, each past case provides a correct answer, namely, the eventuality that has actually occurred, and each possible decision can be evaluated in relation to this correct answer. By contrast, in the general problem of decision making under uncertainty, acts affect outcomes.
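Returning for a moment to the classification setting, the violation of the combination axiom by k-NN can be checked on a toy example. The sketch below is an illustration under stated assumptions: instances are points on the real line, the similarity function 1/(1 + distance) is a hypothetical choice made only for this example, and the two databases are invented. With k = 3, each database on its own yields category "A", while their union yields "B". A similarity-weighted vote over all examples is additive across databases and therefore cannot reverse itself in this way.

```python
from collections import Counter, defaultdict


def knn_classify(examples, query, k):
    """Majority vote among the k examples closest to `query`.

    `examples` is a list of (position, label) pairs on the real line.
    """
    nearest = sorted(examples, key=lambda e: abs(e[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def similarity(x, y):
    # Hypothetical similarity function, assumed only for this illustration.
    return 1.0 / (1.0 + abs(x - y))


def weighted_classify(examples, query):
    """Similarity-weighted vote over ALL examples, near or far."""
    totals = defaultdict(float)
    for position, label in examples:
        totals[label] += similarity(position, query)
    return max(totals, key=totals.get)


# Two disjoint, invented databases; the new instance is at position 0.
db1 = [(1, "B"), (5, "A"), (6, "A")]
db2 = [(2, "B"), (7, "A"), (8, "A")]

for name, db in [("db1", db1), ("db2", db2), ("union", db1 + db2)]:
    print(name, knn_classify(db, 0, k=3), weighted_classify(db, 0))
```

In each separate database the single nearby "B" is outvoted by two distant "A" examples, but in the union the two "B" examples crowd one "A" out of the top three. The weighted rule gives each example a weight that depends only on its own similarity, so it returns "B" in all three runs.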
Considering a past case, one knows what resulted from the act that was actually chosen, but one does not know what would have resulted from other acts that could have been chosen. Thus, there is no “correct” choice, and there is no database of examples to learn correct or optimal choices from.

If the decision maker is satisfied with the outcome she has experienced in similar cases in the past, she might take the same decision in the new one as well. But what would a nearest neighbor approach prescribe if the decision maker finds her choice in similar cases truly disappointing? She would probably choose a different act than the one she has chosen. But which one? And how does the decision maker decide whether the most similar cases involved good or bad choices on her part?

⁶ Using all known examples may prove to be inefficient from a computational point of view. This difficulty may be overcome by using only examples whose similarity to the present instance is above a prespecified threshold. Such a solution is equivalent to rounding off low similarity values to zero, and it may thus render the problem tractable. The key point, however, is that the decision whether to use a given example depends only on its own similarity value, and not on the relative ranking of this value.

It follows that we need to employ some notion of utility, measuring desirability of outcomes. It is the interaction between utility of outcomes and similarity of circumstances that should determine whether one would like to repeat the choice made in similar cases. Combining this notion with the arguments above for using all cases in each problem, one comes up with the CBDT decision rules suggested in Chapter 2. These functions evaluate every act for each problem. Moreover, the impact that a case has on the decision is independent of other cases that may or may not exist in memory.
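A minimal sketch may help fix ideas. It assumes the basic similarity-weighted form of the CBDT rule, in which an act is evaluated by summing, over the past cases in which it was chosen, the similarity of the past problem to the present one times the utility of the outcome. The one-dimensional problem descriptions, the similarity function, the act names, and the numbers are all hypothetical, and the `threshold` parameter mimics the rounding of low similarity values to zero mentioned in the footnote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    problem: float  # toy one-dimensional description of the circumstances
    act: str
    utility: float  # utility of the outcome that followed


def similarity(p, q):
    # Hypothetical similarity function, assumed only for this illustration.
    return 1.0 / (1.0 + abs(p - q))


def U(act, problem, memory, threshold=0.0):
    """Similarity-weighted sum of utilities over past cases in which `act` was chosen.

    A case is skipped when its own similarity falls below `threshold`;
    whether it is used never depends on how other cases rank.
    """
    total = 0.0
    for case in memory:
        if case.act != act:
            continue
        s = similarity(problem, case.problem)
        if s >= threshold:
            total += s * case.utility
    return total


def choose(acts, problem, memory):
    """Pick a U-maximizing act; an act never tried has U = 0."""
    return max(acts, key=lambda a: U(a, problem, memory))


acts = ["stay", "switch"]
satisfied = [Case(problem=1.0, act="stay", utility=3.0)]
disappointed = [Case(problem=1.0, act="stay", utility=-2.0)]

print(choose(acts, 0.0, satisfied))     # past choice looked good
print(choose(acts, 0.0, disappointed))  # past choice looked bad
```

Because the evaluation is a sum over cases, the contribution of each case is unaffected by whatever else is in memory, which is the independence property stressed above.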
13.3 Implicit knowledge of rules

Jane and Jim are two decision makers who face the same problem: they are thirsty, and they see an automatic vending machine for soft drinks. Jane is a case-based decision maker, while Jim is a rule-based one. Jim knows the rule, “if coins are put into a vending machine and a button is pushed, a can comes out”. Jane has no knowledge of such general rules. But she has many past cases in memory, in which she has put some coins in a machine, pushed a button, and got a can. Based on this knowledge, Jane chooses to put some coins in and push a button.

Jane and Jim would probably seem identical in their behavior to an outside observer. When Jane lets the coins drop into the appropriate slot she has an air of self-confidence that would make an outside observer believe that she knows what the outcome of her act is going to be. It appears as if Jane knows the same rule as does Jim. Jane might not have such a rule explicitly formulated in her mind, and she might not even be able to formulate it should we ask her about it. But she behaves as if she knew such a rule, and, as such, exhibits implicit knowledge of this rule. This implicit knowledge is the result of the process we refer to as “implicit induction”.

Next assume that Jane walks confidently toward the machine, drops the coins in the slot, pushes a button, and observes that nothing happens. No can, no money. Jane is surprised. She evidently had some expectations regarding the outcome of her act, and these were proven wrong. Jane now has to reconsider her choice. She will probably find that, given the new case, the act of dropping coins has a low U-value. Consequently, she is likely to choose another act (the new U-maximizer), such as “try another machine”.

Suppose now that the same unfortunate occurrence happens to Jim. Jim is not only surprised to find that the machine does not deliver. Jim has a problem with his knowledge base.
Evidently, one of the rules he believed in is false. Is it that “if coins are put into a vending machine and a button is pushed, a can comes out”? Or perhaps he made a mistake in identifying the machine, and the rule that is wrong is “large colorful boxes with buttons on them are vending machines”? Jim has to decide whether to retract one or more rules, or to refine them, in order to render the rules and the observations in his knowledge base consistent. On top of this non-trivial task, Jim still has to cope with thirst. Thus, both Jane and Jim have to do something about their predicament, and the set of alternative choices they have is identical. But Jim looks for another machine in a state of mental commotion, trying to put his knowledge base in order. Jane looks for another machine with peace of mind.

There are many databases of cases that can be thought of as corresponding to certain rules. These databases can be described either by cases or by rules, where the latter enjoy the advantage of parsimony. But there are other databases in which rules do not provide a good description of the data. In general, rules are but rough approximations of databases, and in many instances they provide unsatisfactory descriptions thereof. In such instances case-based knowledge representation is found to be less pretentious and more flexible than rule-based representation. Specifically, case-based representation does not need to deal with inconsistencies, and it deals in a continuous and graceful manner with situations that are on the borderline between the realms of applicability of different rules.

13.4 Two roles of rules

Mary tells John, “Sarah is always late”. Thus, she conveys information in the form of a general rule. Let us distinguish between two scenarios: first, assume that John has never met Sarah. In this scenario, the rule that Mary relates to John is a parsimonious description of many cases that are known to Mary but not to John.
In the second scenario, John knows Sarah just as well as does Mary. In this situation, the rule that Mary relates does not tell John about cases that were unknown to him. Rather, it may draw his attention to a certain regularity in the cases in his memory.⁷

Case-based decision makers who do not know rules might still use them as convenient tools for the transfer of information. But rules can have two different roles: they can summarize many cases,⁸ and they can point out similarities between cases. As our example suggests, there is nothing in the rule itself that characterizes its specific role. Rather, whether a rule conveys cases or similarities would depend on the context: the state of knowledge of the communicators (and their knowledge thereof). For instance, most people learn that “objects are drawn to the ground” as an explicit rule when they already have a very clear implicit knowledge thereof. Telling this rule to a child of six years of age does not add much to the child’s set of cases that is not already there. But it does help the child see the similarity between many cases and use it for generalizations.

⁷ The same statement might also insinuate that Mary is upset, and this may be the only new information that Mary is trying to convey. We ignore this possibility here.

⁸ Riesbeck and Schank (1989) refer to rules in this capacity as “ossified cases”.

In both roles, rules may contradict each other. As parsimonious descriptions of cases, one rule may summarize one set of cases, while another rule describes another set. Similarly, as cues for similarity, two rules may point out two different patterns that are simultaneously evident in the same set of cases. Such rules may be thought of as proverbs. They do not purport to be literally true. Rather, their main goal is to affect similarity judgments. In this capacity, the fact that rules tend to contradict each other poses no theoretical difficulty.
Indeed, it is well known that proverbs are often contradictory.⁹ To sum up, a CBDT model may use rules. A rule may be a summary of cases or a feature that is common to many cases and that can thus help define the similarity function. In the CBDT model, however, rules are not taken literally and they may well contradict each other, or be inconsistent with some cases.

⁹ The notion of a “rule” as a “proverb” also appears in Riesbeck and Schank (1989). They distinguish among “stories”, “paradigmatic cases”, and “ossified cases”, where the latter “look a lot like rules, because they have been abstracted from cases”. Thus, CBR systems would also have “rules” of sorts, or “proverbs”, which may, indeed, lead to contradictions.

Chapter 5 Planning

14 Representation and Evaluation of Plans

14.1 Dissection, selection, and recombination

CBDT describes a decision as a single act that directly leads to an outcome. In many cases of interest, however, one may take an act not for its immediate outcome, but in order to be in a position to take another act following it. In other words, one may plan ahead. In this section we extend CBDT to a theory of case-based planning.

The formal model of CBDT distinguishes between problems, acts, and results. When planning is considered, the distinction between problems and results is blurred. The outcome of today’s acts will determine the decision problem faced tomorrow. Thus the formal model of case-based planning will not distinguish between the two. Rather, we employ a unified concept of a “position”, which also can be viewed as a set of circumstances. A position might be a starting point for making a decision, that is, a problem, but also the end result, namely, an outcome. We will therefore endow a position with (i) a set of available acts (in its capacity as a problem) and (ii) a utility valuation (when considered an outcome).
Part of the planning process will be the decision whether a certain position should be a completion of the plan, or a starting point for additional acts.

The basic premise of our model is that planning is case-based, but that it is carried out for each problem separately. That is, as in the single-stage version of CBDT, the decision maker has only her memory to rely on in planning her actions. But she does not have to restrict her attention to those plans that she has used, or learned of, in their entirety. Rather, she can use various parts of plans she knows of and recombine them in a modular way for the problem at hand.

Consider the following example. Mary owns an old car, and she would like to buy a new one. She has a friend, John, who has recently sold a used car, similar to her own, for $4,000. By the similarity of the cars and the recency of the sale, Mary imagines that she, too, can get $4,000 for her old car. Naturally, the new car will be more expensive. While its sticker price is rather high, Mary heard that another acquaintance of hers, Jim, managed to get it for $9,000. Since Mary wants to get a car of the same make and model, she believes that she, too, can get it for the same amount. It remains to figure out how to come up with that amount. If Mary succeeds in selling her old car, she will need an additional $5,000. But this should not be a problem. Or at least so says her friend Sarah. Indeed, last week Sarah just walked into the bank and walked out with a $5,000 loan which she used for some home renovation.

With all this information available to her, Mary devises her plan. She will first sell her old car, which should fetch $4,000, just like John’s car did. Then she’ll get a $5,000 loan, as did Sarah. Finally, she should be able to get her desired car for $9,000, as did Jim. Figure 5.1 illustrates Mary’s plan as a graph.
A node in this graph represents a position, that is, a set of circumstances that are relevant to an agent, before or after making a decision. Arcs denote acts that may be chosen by agents, namely, possible decisions. The arc corresponding to an act leads from the position node at which it was taken to the position node it yielded. The arcs on top are part of Mary’s memory, while those at the bottom are part of her plan.

[Figure 5.1: Planning as dissection, selection and recombination. Memory (top): “John had an old car” → John sold his car → “John had $4,000”; “Sarah had $x” → Sarah applied for a loan → “Sarah had $x+$5,000”; “Jim had $y” → Jim bought a new car → “Jim had $y-$9,000 and a new car”. Plan (bottom): “have old car” → sell old car → “have $4,000” → apply for a loan → “have $9,000” → buy new car → “have new car”.]

Memory arcs need not be isolated. They may be parts of involved and elaborate stories. For instance, John might have sold his car as part of his plan to take a trip around the world. Sarah took the loan to renovate her home, which might have been part of her plan to sell it and move elsewhere, and so forth. The cases that are depicted above as part of memory are therefore assumed to have been selected from entire stories stored in memory.

The point of this example is that, in order for a case-based planner to come up with a plan such as Mary’s, she does not need to know of anyone who has done this entire sequence of acts. It suffices that, for each act in each position, she knows of someone who has done a similar act in a similar position. According to this view, the essence of planning is dissection, selection, and recombination. Thinking involves analyzing stories, namely, sequences of cases, and coming up with new combinations of their building blocks. Planning is case-based in the sense that only similarity to known cases can serve as a basis for plan construction and evaluation.
But a case-based planner is smarter than a case-based decision maker in that the former can span a wider range of plans than those that were actually tried in the past.

14.2 Representing uncertainty

Next consider the incorporation of uncertainty. While Mary’s plan seems very reasonable, it might be wise on her part to consider various eventualities. What happens if she does not succeed in selling her old car for $4,000? After all, she can attempt to sell the car, but she cannot be sure of the outcome. And what will happen if her loan application is denied? In short, uncertainty is present.

The standard approach to describe decision problems of this nature employs decision trees. According to this approach, we are supposed to introduce chance nodes at this point, and allow an act to lead to a chance node, which, in turn, may lead to several position nodes. The distinction between chance nodes and decision nodes, or between states and acts, is fundamental to rational decision making, and may well be the most important cornerstone of decision theory. Confounding acts and states is probably one of the worst sins a theorist can commit, and it leads to such problems as choosing one’s beliefs, or to entertaining beliefs over acts.

Yet it seems as if we intend to commit this sin. The reason is that we seek to model decision situations in which decision makers may have very poor knowledge of the potential scenarios that might ensue from each given decision. In such problems, as in the one-stage version of CBDT, one has to specify how attractive a choice is when little or no information about it exists. Hence default values play an important role in our theory. In the plan evaluation rule suggested below, each position will have a utility index that may be interpreted as the desirability of being in this position indefinitely (and with certainty).
Every position will be assigned a weight, and plans will be evaluated by the weighted average of the utility function. Thus, even if the decision maker plans to act at a given position, the utility of this position will play a role in the evaluation of the plan. Modeling this feature in a standard decision tree would make it cyclical: one of the outcomes of an act will be the position at which it was taken. We find it more convenient to suppress chance nodes and retain the acyclic nature of the tree. We therefore model uncertainty simply by allowing more than one arc to leave a given position for a certain act. The plan tree we get might be thought of as a decision tree in which a chance move was integrated with the decision move preceding it. Thus potential cycles collapse into single nodes, and one retains a simpler mathematical structure.

Going back to our example, what does Mary plan to do if she doesn’t manage to sell her car for $4,000? She might reason as follows: if I get at least $3,000, I might still proceed with the previous plan, but I will borrow more money from the bank. It stands to reason that, if they were willing to lend $5,000 to Sarah, they will lend me $6,000,¹ and the cost of the loan will still not be prohibitive. But if I get something between $2,000 and $3,000 for my car, I can’t really get the car I want. I will go ahead with the general plan, and borrow up to $6,000, but I will buy a less expensive car. Finally, if I do not get even $2,000, I will just drop the whole idea.

¹ Mary might also know of another case, in which Carolyn got a loan of $6,000 to pay her dentist.

We illustrate this plan in Figure 5.2. To simplify the graphical exposition, we assume that the old car can only fetch one of three prices: $2,000, $3,000, and $4,000. For simplicity we also suppress the role of memory in the graph. However, every arc should be viewed as supported by known cases, as in Figure 5.1.
[Figure 5.2: Uncertainty represented by multiple arcs. Three arcs labeled “sell car” leave the initial position. Top branch: “have $4,000” → get a $5,000 loan → “have $9,000” → buy new car → “have new car”. Middle branch: “have $3,000” → get a $6,000 loan → “have $9,000” → buy new car → “have new car”. Bottom branch: “have $2,000” → get a $6,000 loan → “have $8,000” → buy less expensive car → “have a newer car”.]

Observe that three arcs correspond to the act “sell car” at the initial position. The plan, however, deals with four contingencies: the three prices mentioned in the graph, and prices that are lower than $2,000. If only such offers are received, Mary will just forget about the idea of getting a new car. This would mean staying at the initial position. One may therefore assume that at each position and for each act there is also an arc leading from that position to itself. For simplicity we suppress these arcs, and deal with acyclic graphs. But when we come to evaluate a plan we will recall that any position may also be viewed as a terminal node.

Observe also that the acts and positions are only sketchily described above. For instance, the act “sell car” should be described as “offer to sell old car for $4,000”.

Finally, one notes that the top two branches of the plan seem to be identical from a certain point onwards. In this example, this is only due to our imprecise description of the positions: in the top branch Mary will be richer by $1,000, and this is supposed to be reflected in the description of the position. But in other examples one should expect to find plan paths coinciding, and there is little doubt that in actual planning it is much more efficient to consider general directed acyclic graphs rather than trees. For the purposes of the theoretical study of rules for plan evaluation, however, we restrict attention to trees despite the wastefulness of this representation.

14.3 Plan evaluation

How should Mary evaluate this plan? We assume that she has a payoff assessment for each position.
This function should reflect, say, the desirability of staying in the initial position with her old car, or of being stuck with no loan and no car, and so forth. Each position will then be assigned a weight, and the plan will be evaluated by the weighted sum of the payoffs. Thus, the weight associated with the initial position should reflect the support that memory lends to the prospect of not being able to sell the old car for $2,000 or more. These weights naturally depend on the plan. If, for instance, Mary compares the plan above to another plan in which she does not attempt to sell her car at all, the weight of the initial position in the second plan will be higher than in the first.

We now turn to the formal model. Let $P$ be a finite and non-empty set of positions. We assume that it is endowed with a binary relation $\succ\ \subset P \times P$, interpreted as “can follow”. Assume that $\succ$ is a partial order, namely, that it is transitive and irreflexive. Let $A$ denote a finite and non-empty set of acts. For $p \in P$, let $A_p \subset A$ be the set of acts available at $p$. We introduce $a_0 \in A$, interpreted as “do nothing”, that is, as the “null act”, and assume that $a_0 \in A_p$ for all $p \in P$.

A case is a triple $(p, a, q)$ where $p, q \in P$; $a \in A_p$; $q \succ p$. The decision maker is assumed to know of past cases, constituting her memory $M$, and to imagine future cases that might occur. The set of cases imagined by the decision maker is denoted by $C$. The decision maker is at an initial position $p_0 \in P$. The set of imagined cases $C$ defines a graph on $P$, with arcs

\[ \hat{C} \equiv \{ (p, q) \in P^2 \mid \exists a \in A_p \ \text{s.t.}\ (p, a, q) \in C \} . \]

Assume that this graph is a tree whose root is $p_0$. Under this assumption, cases in $C$ can be identified with arcs in this graph.

A plan is a function $N : P \to A$ such that $N(p) \in A_p$ for all $p$. Without loss of generality we assume that a plan does not prescribe any action at a position that is considered terminal. That is, $N(p) = a_0$ whenever $p$ is $\succ$-maximal.
For a plan $N$, let $P_N \subset P$ be the positions that are consistent with $N$ (not including the initial position):

\[ P_N \equiv \{ p \in P \mid \exists k \ge 1,\ \exists p_1, \dots, p_k \in P \ \text{s.t.}\ \forall\, 0 \le i \le k-1,\ (p_i, N(p_i), p_{i+1}) \in C, \ \text{where}\ p_k = p \} . \]

A utility function is a real-valued function defined on positions. It will prove convenient to normalize it so that the utility of the initial position is zero. Formally, we define a utility function to be $u : P \setminus \{p_0\} \to \mathbb{R}$.

A support function $S : C \to \mathbb{R}_+$ assigns a weight to each arc, such that, for every $(p', a', p) \in C$ and every $a \in A_p$,

\[ \sum_{(p,a,q) \in C} S((p, a, q)) \le S((p', a', p)) . \]

Note that only one arc $(p', a', p)$ leads to any position $p$. The value $S((p, a, q))$ is interpreted as the support memory lends to the future case $(p, a, q)$ if act $a$ is to be taken at position $p$. (In a Bayesian interpretation, $S((p, a, q))$ would be the probability of the case $(p, a, q)$ conditioning on a plan that prescribes, among other choices, to take $a$ at $p$. Given the plan, however, $S$ reflects ex-ante probabilities, i.e., $S((p, a, q))$ is not conditional on arriving at $p$.)

A support function $S$ and a plan $N$ induce a weight function $w_S^N : P_N \to \mathbb{R}_+$ defined by

\[ w_S^N(p) = S((p', N(p'), p)) - \sum_{(p, N(p), q) \in C} S((p, N(p), q)) \qquad (*) \]

where $(p', N(p'), p)$ is the unique arc leading to $p$, and $p' \in P_N \cup \{p_0\}$. Observe that, if $N, N'$ are two plans and $p \in P_N \cap P_{N'}$, it follows that

\[ w_S^N(p) + \sum_{q \in P_N,\, q \succ p} w_S^N(q) = w_S^{N'}(p) + \sum_{q \in P_{N'},\, q \succ p} w_S^{N'}(q) = S((p', N(p'), p)) . \]

That is, if two plans prescribe the choice of acts that may lead to a given position $p$, then these two plans induce the same total weight on the sub-tree emanating from $p$. The plans may, however, differ in the weights they induce for specific positions in this sub-tree.
Conversely, every collection of weight functions $\{ w^N : P_N \to \mathbb{R}_+ \}_N$ that satisfies

\[ w^N(p) + \sum_{q \in P_N,\, q \succ p} w^N(q) = w^{N'}(p) + \sum_{q \in P_{N'},\, q \succ p} w^{N'}(q) \qquad (\bullet) \]

whenever $p \in P_N \cap P_{N'}$, defines a unique support function $S$ satisfying (∗) by

\[ S((p', N(p'), p)) = w^N(p) + \sum_{q \in P_N,\, q \succ p} w^N(q) . \qquad (**) \]

We assume (and later axiomatize) the following planning rule: given memory $M$ and an initial position $p_0 \in P$, the planner may be ascribed a support function $S \equiv S_{M,p_0}$, such that, given any payoff function $u$, she ranks plans according to

\[ U(N) = U_{M,p_0,u}(N) = \sum_{p \in P_N} w_S^N(p)\, u(p) . \]

In this formulation, as well as in the axiomatization that follows, $M$ is an abstract index. Thus it can be a set of cases, but also a set of rules, a data base of empirical frequencies, or anything else. If $M$ is indeed a collection of cases, one may further postulate that the support function is derived from similarity to past cases in an additive manner:

\[ S((p, a, q)) = \sum_{(p', a', q') \in M} s((p, a, q), (p', a', q')) \]

for some case similarity function $s$. Moreover, one may derive the additive formula axiomatically in an analogous way to our results from Chapter 3. In this case, the formula above is a straightforward extension of CBDT in a single-stage problem.

Observe that uncertainty is reflected in our model by the fact that a position and an act may evolve into several other positions. But the model also lends itself easily to situations of ignorance, namely, situations where the decision maker has but a vague concept of the possible eventualities. This is captured by the fact that the plan evaluation scheme may assign some weight to the payoff at a position even if one plans to act in it. In this sense, a position is taken to be the default for whatever may happen when an act is taken at it. The default value at the initial position is normalized to zero, and is therefore absent from the formula. But the payoff at any other position appears in the formula with a corresponding weight.
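The weight function (∗) and the planning rule U(N) can be traced on a small tree. The following Python sketch is only an illustration under stated assumptions: the arcs are loosely modeled on Mary's example, while the support values, the payoffs, the position name "have $9,000 (larger loan)" and the second plan are all hypothetical and do not come from the text. Since a support function matters only up to a positive scalar, integer supports are used to keep the arithmetic exact.

```python
NULL_ACT = "do nothing"
ROOT = "start"

# Imagined cases C, written as arcs (p, a, q), with HYPOTHETICAL supports S.
# Each position has one incoming arc (tree), and the support leaving a
# position under any act never exceeds the support of its incoming arc.
SUPPORT = {
    ("start", "sell car", "have $4,000"): 50,
    ("start", "sell car", "have $3,000"): 30,
    ("have $4,000", "get a $5,000 loan", "have $9,000"): 40,
    ("have $3,000", "get a $6,000 loan", "have $9,000 (larger loan)"): 20,
    ("have $9,000", "buy new car", "have new car"): 35,
}


def outgoing(p, act):
    """Arcs (p, act, q) in C that leave position p under the given act."""
    return [arc for arc in SUPPORT if arc[0] == p and arc[1] == act]


def weights(plan):
    """Weight function (*): incoming support minus support passed on by N(p).

    Positions missing from `plan` are assigned the null act.
    Returns a dict over P_N, the positions consistent with the plan.
    """
    w = {}
    frontier = outgoing(ROOT, plan.get(ROOT, NULL_ACT))
    while frontier:
        arc = frontier.pop()
        q = arc[2]
        onward = outgoing(q, plan.get(q, NULL_ACT))
        w[q] = SUPPORT[arc] - sum(SUPPORT[nxt] for nxt in onward)
        frontier.extend(onward)
    return w


def evaluate(plan, u):
    """U(N): sum over p in P_N of w(p) * u(p); the root has payoff zero."""
    return sum(weight * u[p] for p, weight in weights(plan).items())


# Two hypothetical plans: one follows every branch, one stops at $4,000.
full_plan = {
    "start": "sell car",
    "have $4,000": "get a $5,000 loan",
    "have $3,000": "get a $6,000 loan",
    "have $9,000": "buy new car",
}
cautious_plan = {"start": "sell car", "have $3,000": "get a $6,000 loan"}

# Hypothetical payoffs, normalized so that the initial position is worth 0.
payoff = {
    "have $4,000": 1,
    "have $3,000": 1,
    "have $9,000": 0,
    "have $9,000 (larger loan)": 0,
    "have new car": 10,
}

print(weights(full_plan))
print(evaluate(full_plan, payoff), evaluate(cautious_plan, payoff))
```

Under these made-up numbers both plans place the same total weight on the sub-tree emanating from "have $4,000", namely the support of the arc leading to it, but they distribute that weight differently among the positions of the sub-tree. This is the invariance noted after (∗).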
The entire formula may be viewed as a linear function of default values.

14.4 Discussion

In what ways do our plans differ from Bayesian decision trees? Superficial differences are the following: (i) Bayesian decision trees distinguish between decision nodes and chance nodes, whereas plan trees allow both the planner and so-called Nature to act at every position; and (ii) in a Bayesian decision tree payoff is only assessed at the terminal nodes, whereas in a plan tree it is assessed at every position.

These differences are mostly notational. One may embed our model in a Bayesian decision tree: one only has to introduce a “chance node” after every act at every position, and allow each position to lead immediately to a corresponding terminal node whose ex-ante probability is that position’s weight.

Despite this formal equivalence, we believe that plan trees are closer to the way people conceive of plans, and may therefore be more useful than Bayesian decision trees for descriptive and normative purposes alike. In particular, plan trees are better suited to describe the dynamic process by which a plan evolves. Typically, a decision maker is only aware of certain cases when she plans. As her plan is being executed, she might think further ahead, be reminded of additional cases, or learn new information. The way people think about plans does not admit a clear, static distinction between decision nodes and terminal nodes. Rather, this distinction is dynamic: a set of circumstances that was thought of as terminal yesterday might be the beginning of a new story today.

Is it rational to develop plans in a piecewise manner, rather than thinking them through? Indeed, our model describes people who think ahead several steps, and then treat positions as if they were terminal nodes. But this is often the only way possible. In complex environments, such as a game of chess, one cannot even imagine the entire Bayesian decision tree.
Given this constraint, one may not be embarrassed to learn that one has not charted the entire tree. In other words, partial planning may well pass our rationality test of Section 2 on grounds of feasibility.

Planning trees, where weights are assigned to positions, can also be interpreted temporally: consider first a planning tree whose positions are linearly ordered. One may think of $w_S^N(p)$ as the time the planner will be in position $p$ (rather than the probability that the terminal outcome be $p$). More generally, $w_S^N(p)$ may be the (discounted) expected time of being at $p$.

Our notion of a “position” is also closely related to the notion of a “state” in the literature on Markov chains and dynamic programming, as well as in the planning literature. Indeed, dynamic programming models also do not distinguish between a “decision node” and an “outcome”. Planning trees, however, are acyclic and do not assume any stationarity.

15 Axiomatic Derivation

15.1 Set-up

We devote this section to an axiomatic derivation of the plan evaluation rule presented in Section 14. We assume that, given every payoff function $u$, there is a preference relation $\succsim_u$ on plans. By referring to $\{\succsim_u\}_u$, we implicitly assume that planners can meaningfully express preferences between pairs of plans given various payoff functions, and not only given their actual function. Thus, the planner is assumed to be able to say, “I plan to sell my house and buy a condo; but if taxes were lower, I would plan not to sell the house”.

This set-up deviates from the revealed preference paradigm. First, we expect planners to express preferences on what they will do if a certain position is reached in the future. Then we also require that they condition these preferences on new, imagined payoffs. We believe that such a deviation from the revealed preference paradigm is inherent to the formal study of planning. Plans often are not carried out for a variety of reasons.
In particular, a planner may find, upon reaching a position, new information that makes her choose a different path than she has previously planned to take. Therefore, one cannot restrict oneself to the planner’s actual choices and has to resort to cognitive and introspective data.

To simplify notation, we will assume that preferences are defined on payoff vectors rather than on plans. That is, given a payoff function $u$ and a plan $N$, we identify $N$ with the restriction of $u$ to $P_N$, which is an element of $F_N \equiv \mathbb{R}^{P_N}$. Formally, we impose the following axiom: for every two plans $N, \tilde{N}$, and every two payoff functions $u, u'$, if $u(p) = u'(p)$ for all $p \in P_N \cup P_{\tilde{N}}$, then $N \succsim_u \tilde{N} \Leftrightarrow N \succsim_{u'} \tilde{N}$.

Denote $F = \bigcup_N F_N$ and define $\succsim'\ \subset F \times F$ as follows: for $x \in F_N$, $y \in F_{\tilde{N}}$, $N \ne \tilde{N}$, $x \succsim' y$ if there exists a (by the axiom above, for all) payoff function $u$ such that $u(p) = x(p)$ $\forall p \in P_N$, $u(p) = y(p)$ $\forall p \in P_{\tilde{N}}$, and $N \succsim_u \tilde{N}$. Thus $\succsim'$ ranks any two vectors $x \in F_N$, $y \in F_{\tilde{N}}$, $N \ne \tilde{N}$, such that $x(p) = y(p)$ for all $p \in P_N \cap P_{\tilde{N}}$.

We will assume without loss of generality that $N \ne \tilde{N}$ implies that $P_N \setminus P_{\tilde{N}} \ne \emptyset$. (In particular, $P_N \ne \emptyset$ for all $N$.) That is, we assume that the set of imagined cases is rich enough so that each of two distinct plans reaches positions that the other does not. This assumption, however, is only made for notational convenience.

Many interesting applications will involve situations in which the decision maker does not have any idea what would be the outcome of a given act in a given position. Rather than dealing with absence of arcs, we model these situations by arcs (imagined cases) with zero support. We say that $\hat{P} \subset P$ is ($\succsim$-) null if for all $x, x', y, y' \in F$ such that $x(p) = x'(p)$; $y(p) = y'(p)$ for all $p \in \hat{P}$, we have $x \succsim y \Leftrightarrow x' \succsim y'$.

15.2 Axioms and result

In the following we start with a relation $\succsim'$ that ranks utility vectors corresponding to different plans. We will define $\succsim$ to be its transitive closure.
Thus $\succsim$ may compare two vectors $x, y \in F_N$ that describe the utility derived from the same plan $N$, if there exists another vector $z \in F_{\tilde{N}}$ for $\tilde{N} \ne N$ that may indirectly compare $x$ and $y$. Let $\succsim\ \subset F \times F$ satisfy the following axioms:

A1 Order: $\succsim$ is reflexive and transitive, and whenever $x \in F_N$, $y \in F_{\tilde{N}}$, $N \ne \tilde{N}$, $x \succsim y \Leftrightarrow x \succsim' y$.

We define $\precsim, \succ, \prec, \approx$ as usual. Implicit in the following axioms is the assumption that a utility function is given, and that the vectors $x, y \in F_N$ are already measured on a utility scale.

A2 Continuity and Comparability: For every two distinct plans $N, \tilde{N}$, and every $x \in F_N$, the sets $\{ y \in F_{\tilde{N}} \mid y \succ x \}$ and $\{ y \in F_{\tilde{N}} \mid y \prec x \}$ are open in $F_{\tilde{N}}$. If $P_{\tilde{N}} \setminus P_N$ is non-null, they are also non-empty.

A3 Monotonicity: For every two plans $N, \tilde{N}$, and every $x, z \in F_N$, $y \in F_{\tilde{N}}$, if $x(p) \ge z(p)$ $\forall p \in P_N$, then $z \succsim y$ implies $x \succsim y$ and $y \succsim x$ implies $y \succsim z$.

A4 Separability: For every two plans $N, \tilde{N}$, and every $x, z \in F_N$, $y, w \in F_{\tilde{N}}$, if $x \approx w$ then $x \succsim y$ iff $x + z \succsim y + w$.

A5 Consequentialism: For every two plans $N, \tilde{N}$, and every $x \in F_N$, $y \in F_{\tilde{N}}$, if for every $p \in P_N \cap P_{\tilde{N}}$ for which $N(p) \ne \tilde{N}(p)$ it is true that $x(q) = y(\tilde{q})$ for all $q \succ p$, $q \in P_N$ and all $\tilde{q} \succ p$, $\tilde{q} \in P_{\tilde{N}}$, as well as $x(p) = y(p)$, then $x \approx y$.

A6 Non-Triviality: There exist two plans $N, \tilde{N}$ such that $P_N$ and $P_{\tilde{N}}$ are non-null, and $N(p_0) \ne \tilde{N}(p_0)$.

Axioms A1, A2, A3, A4, and A6 are rather standard. (See Gilboa and Schmeidler (1995).) A5 is a new axiom that captures the intuition that plans only matter to the extent that they affect outcomes. If two plans are bound to yield the same utility, they are bound to be equivalent.

Theorem 5.1 Assume that $\succsim$ satisfies A6. Then the following are equivalent:

(i) $\succsim$ satisfies A1-A5;

(ii) There is a support function $S$ such that, for every two plans $N, \tilde{N}$, and every $x \in F_N$, $y \in F_{\tilde{N}}$,

\[ x \succsim y \quad \text{iff} \quad U(x) \ge U(y) \]

where, for $x \in F_N$,

\[ U(x) = \sum_{p \in P_N} w_S^N(p)\, x(p) . \]

Furthermore, in this case $S$ (hence also $w_S^N$) is unique up to multiplication by a positive number.
Note that $w_S^N$ is uniquely derived from $S$ by (∗) of Subsection 14.3.

15.3 Proof

That (ii) implies (i) is obvious. We prove the converse. Consider an act $a \in A_{p_0}$. Let $[N_a]$ be the set of plans that prescribe act $a$ at position $p_0$. We claim that $P_N$ is null for some $N \in [N_a]$ iff it is null for all $N \in [N_a]$. Indeed, assume that $P_N$ is null for $N \in [N_a]$ and consider $N' \in [N_a]$. By A6 there exists a plan $\tilde{N}$ such that $P_{\tilde{N}}$ is non-null and $\tilde{N} \notin [N_a]$. By A2 there exists $y \in F_{\tilde{N}}$ such that $y \approx x$ for all $x \in F_N$. Consider an arbitrary $z \in F_{N'}$. Let $z', z'' \in F_{N'}$ be constant vectors that equal the minimal and maximal value of $z$, respectively, throughout $F_{N'}$. By A5, $z', z''$ are equivalent to the corresponding constant vectors in $F_N$ and hence also to $y$. A3 implies that $z \approx y$. This being true for all $z \in F_{N'}$, we conclude that $P_{N'}$ is null.

We can therefore distinguish between sets $[N_a]$ where $P_N$ is null for all $N \in [N_a]$, and sets $[N_a]$ where $P_N$ is null for no $N \in [N_a]$. We refer to the former as sets of null plans. Observe that, by A5, all null plans are equivalent.

Next choose a plan $N_a \in [N_a]$ for each set that consists of non-null plans. By A6 there are at least two such sets. Note that $F_{N_a} \cap F_{N_b} = \emptyset$ for $a \ne b$. Restricting attention to $\hat{F} = \bigcup_a F_{N_a}$, we find that $\succsim$ satisfies all the axioms of the representation theorem of Appendix B in Gilboa and Schmeidler (1995). It follows that there exist non-negative and non-zero vectors of weights $w^{N_a}$ on $P_{N_a}$ such that, for all $x, y \in \hat{F}$, $x \succsim y$ iff $U(x) \ge U(y)$, where, for $x \in F_{N_a}$,

\[ U(x) = \sum_{p \in P_{N_a}} w^{N_a}(p)\, x(p) . \]

Further, the vectors are unique up to multiplication by a positive number.

Consider a plan $N_a' \in [N_a]$ where $N_a' \ne N_a$. Applying the same logic to the collection $\{N_a'\} \cup \{N_b\}_{b \ne a}$ we obtain representation by weights $\{w^{N_a'}\} \cup \{\hat{w}^{N_b}\}_{b \ne a}$ for all vectors in $\hat{F}' = \bigcup_{b \ne a} F_{N_b} \cup F_{N_a'}$. By uniqueness, there exists a $\lambda > 0$ such that $\lambda \hat{w}^{N_b} = w^{N_b}$ for $b \ne a$.
By transitivity and A6, $\{\lambda w^{N_a'}\} \cup \{w^{N_b}\}$ induces a representation on all $\bigcup_b F_{N_b} \cup F_{N_a'}$. Continuing inductively, we obtain that for every non-null plan $N$ there exists a non-negative and non-zero vector of weights $w^N$ on $P_N$ such that, for all $x, y \in \bigcup_{N \text{ non-null}} F_N$, $x \succsim y$ iff $U(x) \ge U(y)$, where, for $x \in F_N$,

\[ U(x) = \sum_{p \in P_N} w^N(p)\, x(p) . \]

Further, the vectors are unique up to multiplication by a positive number.

We wish to show that for every pair of non-null plans, $N, N'$, $w^N$ and $w^{N'}$ satisfy

\[ w^N(p) + \sum_{q \in P_N,\, q \succ p} w^N(q) = w^{N'}(p) + \sum_{q \in P_{N'},\, q \succ p} w^{N'}(q) \qquad (\bullet) \]

for every $p \in P_N \cap P_{N'}$. Consider $p \in P_N \cap P_{N'}$, and the following utility function $u$: $u(p) = 1$; $u(q) = 1$ whenever $q$ follows $p$ in $N$ or in $N'$; $u(q) = 0$ otherwise. Let $x \in F_N$ and $y \in F_{N'}$ be the vectors induced by $u$. By A5, $x \approx y$. Hence (•) follows.

Finally, we extend the representation to null plans. Define $w(p) = 0$ for all $p \in P_N$ where $N$ is null. We have already established that all $x \in F_N$ are equivalent to a certain $y \in F_{\tilde{N}}$ where $\tilde{N}$ is non-null. Since $x \approx 2x$ we can use A4 to conclude that $y \approx 2y$, and $U(y) = 0$ follows. This completes the proof of Theorem 5.1.

Chapter 6 Repeated Choice

16 Cumulative Utility Maximization

16.1 Memory-dependent preferences

The rule of U-maximization was suggested in Section 4 as a way to cope with uncertainty. But it can also be differently interpreted: rather than viewing the aggregation of past experiences in $U$ as providing information regarding uncertain payoffs, it can be viewed as a dynamic model of changing preferences. To highlight this aspect of U-maximization, let us focus on a case with no uncertainty, and where the decision maker encounters a sequence of problems that are, according to her subjective judgment, identical. That is, for each act $i \in A$ there exists a unique outcome $r_i \in R$ such that only cases of the form $(\cdot, i, r_i)$ may appear in the decision maker’s memory with $i$ as their second component.
Further, the similarity function between any two problems is constant, say, 1.

Examples of such decisions abound. Every day Mary has to decide where to go for lunch. The problems she is facing on different days are basically identical. Moreover, every choice she makes would lead to a payoff that may be assumed known.¹ John decides every morning whether he should drive to work, use public transportation, or walk. In short, many everyday decisions are of this type. Our goal is to analyze the pattern of choices that would result from U-maximization in these problems.

¹ One may assume that the payoff is uncertain, but is drawn from the same distribution every time an act is chosen. As mentioned below, our results extend to the stochastic version of the model.

The general model of U-maximization reduces to a simple process governed by few parameters: the decision maker has a cumulative index of satisfaction for each possible act $i$, denoted $U_i$, which depends on past choices. In every period, a maximizer of the satisfaction index is chosen. As a result of experiencing the outcome associated with the act, say, consuming the product, the satisfaction index of the act is updated. In the simplest version analyzed here, the update is additive, deterministic, and constant. That is, a number $u_i$ is added to $U_i$ whenever $i$ is chosen (and $U_i$ is unaffected by a choice of $j \ne i$). If it is positive, the alternative appears to be better the more it has been chosen. If negative, the alternative is less desirable as a result of experiencing it. The decision maker is implicitly assumed to be aware of $U_i$, but not necessarily of $u_i$.

Consider a decision maker for whom all alternatives appear equally desirable at the outset. As observed in Section 4, if there is at least one alternative $i$ whose payoff exceeds the aspiration level, namely, $u_i > 0$, then, as soon as one such alternative is chosen, it will be chosen forever, and the decision maker may be said to be satisficed in the sense of Simon (1957). If, however, all alternatives have negative $u_i$’s, the decision maker will never be satisficed. A decision maker with a low aspiration level, namely, one for which $u_i > 0$, would tend to like an alternative more, the more it was chosen in the past, thus explaining habit-formation. By contrast, a high aspiration level implies that the decision maker would appear to be bored with alternatives she recently experienced, and would seek change.

Suppose that Mary can choose between a Pizza place and a Chinese restaurant for lunch. Let us assume that $u_{Pizza} = -1$ and that $u_{Chinese} = -2$, and that the initial $U$ value is zero for both alternatives. It is easy to see that, asymptotically, Mary will choose the Pizza place twice as frequently as she will the Chinese restaurant. Because she has a relatively high aspiration level, she does not stick to one lunch option forever. Moreover, the $u$-values of the alternatives determine the limit relative frequencies of choice: the latter are inversely proportional to the former. We later prove that this result holds in general.

One may interpret the vector of relative frequencies of choices as a bundle in neoclassical consumer theory, that is, as a vector of quantities of products. Viewed thus, if the aspiration level of the individual is low, she is easily satisficed and will select a corner point as an aggregate choice, as if she had a convex neoclassical utility function over the space of bundles. This corner point may not be optimal in the usual sense, namely, it may not be a maximizer of $u_i$.
If, however, the aspiration level is relatively high, the consumer will keep switching among the various products. The relative frequencies with which the various products are consumed will converge to a limit, which will be an interior point, as if the individual were maximizing a concave neoclassical utility over the bundle space.

According to this interpretation, tastes, and thus decisions, are intrinsically context- or history-dependent. The decision maker does not attempt to maximize the function $u$. This function is only the rate by which $U$ changes as a result of experience. $u$ may be interpreted as an index of instantaneous utility. But it is $U$, the cumulative satisfaction index, that the decision maker maximizes. Thus, $U$ is the utility function. Whereas it changes as a function of history, it does so at a constant rate, given by $u$.

16.2 Related literature

Before presenting the formal model, we mention some related literature. Our discussion of small, repeated decisions may bring to mind Herrnstein and Prelec’s “melioration theory”. (See Herrnstein and Prelec (1991).) Indeed, the small decisions we study are very close to what they call “distributed choice”. Yet, our focus is different from theirs. Melioration theory deals primarily with choices that change intrinsically as a result of experience. In examples such as addictive behavior, or other problems relating to self-discipline (regarding work, savings, and the like), the payoff associated with an act depends on how often it has been chosen in the past. By contrast, we assume that the instantaneous utility of act $i$, $u_i$, is constant. Our decision maker may not wish to have the same meal every day; but there is no change in the pleasure derived from it as in the case of addiction, nor are there any long-term effects as in the case of savings.
Further, melioration theory compares actual behavior to an optimum, defined by the highest average quality that can be obtained if inter-period interaction is taken into account. We study more commonplace phenomena, in which no such interaction exists. Correspondingly, in our model there is no well defined optimum, and, at least when high aspiration levels are involved, we do not deal with patently suboptimal choice. Finally, the decision rules in the two models differ. Whereas melioration theory uses averages, ours employs sums.

Our model is similar to the learning model of Bush and Mosteller (1955) in that, in every period, the outcome of the most recent choice is compared to an aspiration level, and the choice is reinforced if (and only if) its outcome exceeds the aspiration level. However, there are two main differences between these models. First, Bush and Mosteller are motivated by learning; thus, in their model past experiences serve mostly as sources of information, rather than determinants of preferences. Second, their model is inherently random, where a decision maker’s history is summarized by a probability vector, describing the probability of each choice. By contrast, in our model the choice is deterministic. Stochastic choice is useful to summarize hidden variables. But our goal here is to expose the decision process, and we focus on the deterministic implications of our basic assumptions.

Gilboa and Pazgal (1996, 2000) extend the present model by assuming that the rate of change of $U_i$, namely, $u_i$, is a random variable, whose distribution depends only on the act $i$. They also conduct an empirical study of the random payoff model. Finally, the basic model suggested here was applied to electoral competition in Aragones (1997). She found that change-seeking behavior on the part of the voters gives rise to ideological behavior of the competing parties.
16.3 Model and results

Let the set of alternatives be $A = \{1, 2, \dots, n\}$. For $i \in A$, denote the rate of the change of the utility function of $i$ by $u_i$. Since the interesting case will be that of negative rates of change, it will prove convenient to define $a_i = -u_i$. For a sequence $x \in A^\infty$, define the number of appearances of choice $i \in A$ in $x$ up to stage $t \ge 0$ to be

\[ F(x, i, t) = \#\{ 1 \le j \le t \mid x(j) = i \} , \]

and let $U(x, i, t)$ denote the cumulative satisfaction index of alternative $i \in A$ at that stage:

\[ U(x, i, t) = F(x, i, t)\, u_i . \]

For a vector $u = (u_1, \dots, u_n)$ let $S(u)$ denote the set of all sequences of choices that are stagewise $U$-maximizing. That is,

\[ S(u) = \{ x \in A^\infty \mid \text{for all } t \ge 1,\ x(t) \in \arg\max_{i \in A} U(x, i, t-1) \} . \]

Of special interest will be the relative frequencies of the alternatives, denoted

\[ f(x, i, t) = \frac{F(x, i, t)}{t} \]

and their limit

\[ f(x, i) \equiv \lim_{t \to \infty} f(x, i, t) . \]

(We will use this notation even if the limit is not guaranteed to exist.) Omission of the index of the alternative will be understood to refer to the corresponding vector:

\[ f(x, t) = (f(x, 1, t), \dots, f(x, n, t)), \qquad f(x) = (f(x, 1), \dots, f(x, n)) . \]

Finally, denote

\[ Y = \{ u = (u_1, \dots, u_n) \mid \forall i\ u_i \ne 0 \}, \qquad V = \{ u = (u_1, \dots, u_n) \mid \forall i\ u_i < 0 \} . \]

We can now formulate our first result:

Proposition 6.1 Assume that $u \in Y$. Then:

(i) for all $x \in S(u)$, $f(x)$ exists;

(ii) there exists $x \in S(u)$ for which $f(x)$ is one of the extreme points of the $(n-1)$-dimensional simplex iff this is the case for all $x \in S(u)$, and this holds iff $u_i > 0$ for some $i \in A$;

(iii) $f(x)$ is an interior point of the $(n-1)$-dimensional simplex iff $u_i < 0$ for all $i \in A$;

(iv) if $u \in V$ (i.e., $a_i > 0$ for all $i \in A$), then for all $x \in S(u)$, $f(x)$ is given by

\[ f(x, i) = \frac{\prod_{j \ne i} a_j}{\sum_{k \in A} \prod_{j \ne k} a_j} ; \]

(v) for every interior point $y$ in the $(n-1)$-dimensional simplex there exists a vector $u \in V$, unique up to a multiplicative scalar, such that $f(x) = y$ for all $x \in S(u)$.
Remark: In the case that some utility values do vanish ($u \notin Y$), this result does not hold. Consider, for instance, the extreme case where $u_i = 0$ for all $i \in A$. Then we get $S(u) = A^\infty$ and, in particular, $f(x)$ need not exist for every $x \in S(u)$. Furthermore, one may get the entire $(n-1)$-dimensional simplex as the range of $f(x)$ when it is well defined.

When we aggregate choices over time, relative frequencies can be interpreted as quantities defining a bundle of products in the language of consumer theory. With this interpretation in mind, this result shows that low aspiration levels (positive $u_i$’s) are related to extreme solutions of the consumer’s aggregate choice problem, whereas high aspiration levels (negative $u_i$’s) give rise to interior solutions. Similarly, high and low aspiration levels may explain change seeking behavior and habit formation, respectively.

Proposition 6.1 also shows that the interpretation of the instantaneous utility function $u$ as “inverse frequency” holds in general: assume that all utility indices are negative, and consider two alternatives $i, j \in A$. The limit relative frequencies with which they will be chosen (for all $x \in S(u)$) are in inverse proportion to their utility levels:

\[ \frac{f(x, i)}{f(x, j)} = \frac{a_j}{a_i} = \frac{u_j}{u_i} . \]

For example, if $n = 3$, $u_1 = -1$, $u_2 = -2$, and $u_3 = -3$, alternative 1 will be chosen twice as frequently as alternative 2, and 3 times as frequently as alternative 3. It follows that the relative frequencies of choice will converge to $(6/11, 3/11, 2/11)$.

Thus the instantaneous utility of an alternative, $u_i$, has a different meaning than the utility value in standard choice theory, or in neoclassical consumer theory. On the one hand, the fact that alternative 2 has a lower utility than alternative 1 does not imply that the former will never be chosen. It will, however, be chosen less often than its substitute.
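The limit frequencies in the example above can be checked numerically. The following Python sketch simulates stagewise U-maximization and compares the observed counts with the closed form of Proposition 6.1(iv). The function names and the rule of breaking ties in favor of the lowest-index alternative are assumptions made for this illustration only; the model itself leaves the choice among maximizers arbitrary.

```python
from fractions import Fraction
from math import prod


def simulate(u, steps):
    """Stagewise U-maximization with U(x, i, t) = F(x, i, t) * u_i.

    Ties are broken in favor of the lowest-index alternative, which is an
    assumption of this sketch (any maximizer would be admissible).
    Returns the list of counts F(x, i, steps).
    """
    counts = [0] * len(u)
    for _ in range(steps):
        U = [count * rate for count, rate in zip(counts, u)]
        counts[U.index(max(U))] += 1  # index() returns the first maximizer
    return counts


def limit_frequencies(u):
    """Closed form of Proposition 6.1(iv); needs integer rates u_i < 0."""
    a = [-rate for rate in u]
    if any(a_i <= 0 for a_i in a):
        raise ValueError("the formula needs u_i < 0 for every alternative")
    numerators = [prod(a[:i] + a[i + 1:]) for i in range(len(a))]
    return [Fraction(num, sum(numerators)) for num in numerators]


u = [-1, -2, -3]
counts = simulate(u, 1100)
print(counts, [Fraction(c, 1100) for c in counts])
print(limit_frequencies(u))
```

With integer rates the simulated path is periodic: after 11 stages all three cumulative indices are equal again, so the observed frequencies coincide with the limit exactly at every multiple of 11. The same functions reproduce Mary's lunch example with `u = [-1, -2]`.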
On the other hand, the utility indices do not merely rank the alternatives; they also provide the frequency ratios, and are therefore cardinal.

Proposition 6.1 may be viewed as an axiomatization of the instantaneous utility function $u$. Explicitly, for any choice sequence $x \in A^\infty$ the following two statements are equivalent: (i) there is a vector $u \in V$ such that $x \in S(u)$; (ii) $f(x)$ exists, it is strictly positive, and $x \in S(u)$ for any $u \in V$ satisfying $u_i f_i = u_j f_j$ for all $i, j \in A$. Thus the function $u$ can be derived from observed choice, and it is unique up to a multiplicative scalar.

We also find that any point in the interior of the simplex may be the aggregate choice of a cumulative utility decision maker for an appropriately chosen instantaneous utility function. While it is comforting to know that the theory is general enough in this sense, one may worry about its meaningfulness. Are there any choices it precludes? The following result shows that in an appropriately defined sense, very few sequences are compatible with it.

Proposition 6.2 Assume that $u \in Y$. Then:

(i) if $u_i > 0$ for some $i \in A$, then $S(u)$ is finite;

(ii) if $u_i < 0$ (i.e., $a_i > 0$) for all $i \in A$, then

\[ |S(u)| = \begin{cases} n! & \text{if } a_i/a_j \text{ is irrational for all } i, j \in A \\ \aleph & \text{otherwise} \end{cases} \]

(iii) denote

\[ S^- = \bigcup_{u \in V} S(u), \qquad S^+ = \bigcup_{u \in Y \setminus V} S(u) . \]

Let $p = (p_i)_{i \in A}$ be some probability vector on $A$, and let $\lambda_p$ be the induced product measure on $A^\infty$ (endowed with the product σ-field). Then $S^-$ is an uncountable subset of a $\lambda_p$-null set, whereas $S^+$ is finite, and if $p$ is not degenerate, $S^+$ is a $\lambda_p$-null set.

Thus there are uncountably many sequences of choices that are $U$-maximizing. Yet part (iii) of the proposition states that, overall, the set $S \equiv S^- \cup S^+$ is “small” by any of the reasonable measures defined above. Furthermore, it is easy to verify that finitely many observed choices may often suffice to conclude that they cannot be a prefix of any sequence in $S$.
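The dichotomy in Proposition 6.2(ii) can be made concrete by counting finite prefixes. The Python sketch below enumerates all length-T choice prefixes that are stagewise U-maximizing for given values $a_i = -u_i > 0$. It is an illustration under stated assumptions: the function name is hypothetical, and irrational ratios are represented by floating-point square roots, which is adequate only for the short horizons used here.

```python
import math


def count_prefixes(a, horizon):
    """Count length-`horizon` prefixes that are stagewise U-maximizing.

    `a` holds a_i = -u_i > 0. Maximizing U_i = -F_i * a_i is the same as
    minimizing F_i * a_i; every minimizer is an admissible next choice,
    so each tie splits the count into branches.
    """
    counts = [0] * len(a)

    def extend(stage):
        if stage == horizon:
            return 1
        costs = [count * a_i for count, a_i in zip(counts, a)]
        lowest = min(costs)
        total = 0
        for i, cost in enumerate(costs):
            if cost == lowest:
                counts[i] += 1
                total += extend(stage + 1)
                counts[i] -= 1  # undo the choice before trying the next tie
        return total

    return extend(0)


# Rational ratio a_1 / a_2 = 1/2: ties recur, so the count keeps doubling.
print([count_prefixes([1, 2], t) for t in (1, 4, 7, 10)])

# Irrational ratios (floating-point approximations): only the n! orderings
# of the first n choices survive.
print(count_prefixes([1, math.sqrt(2)], 10))
print(count_prefixes([1, math.sqrt(2), math.sqrt(3)], 8))
```

In the rational case the number of admissible prefixes grows without bound as the horizon increases, in line with the uncountable $S(u)$ of part (ii). In the irrational cases it stays at $n!$ once each alternative has been chosen once.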
To sum, the repeated choice theory presented here is non-vacuous. On the contrary, it is too easily refutable. However, Gilboa and Pazgal (1996) show that, if the $u_i$’s as well as the initial values $U(\cdot, i, 0)$ are random variables, one may choose their distributions so that every finite sequence of choices would have a positive probability. Yet, they also show that, with probability 1, there will be limit frequencies of choice as in the deterministic case, where the limit frequencies are inversely proportional to the expected instantaneous payoffs.

16.4 Comments

Fading Memory. In this section we assume that the similarity function is constant. However, there are other ways in which one might model the fact that a problem is repeatedly encountered. For instance, the weight assigned to past cases may decrease exponentially with time, reflecting the fact that past cases may be forgotten, that they are deemed less relevant, or that their impact on future choices is diminishing with time. Incidentally, such a model does not require that the decision maker remember more than one number per alternative. (See Aragones (1997).)

The Interpretation of Aspirations. The notion of “aspiration level” in this model need not be taken literally. Consider, for instance, the choice of a piece of music to listen to. A decision maker may listen to “Don Giovanni” three times as often as to “Turandot”. In our model, the former would have a utility of −1, and the latter a utility of −3. Yet it would be wrong, if not blasphemous, to suggest that these works do not achieve the decision maker’s aspiration level. In this case, zero utility level may better be interpreted as a “bliss point”. Thus, negative utility values need not conjure up sour faces in one’s mind. They may also describe perfectly content decision makers who merrily and gingerly alternate choices to keep their lives interesting.
On the other hand, a satisficed decision maker need not be happy in any intuitive sense. For instance, consider a consumer who often suffers from acute headaches. If she is loyal to a certain brand of medication, she is satisficed according to our model. Yet she is by no means happy.²

² Eva Gilboa-Schechtman has pointed out to us that the interpretation of zero utility level is likely to differ between pleasure seeking activities (such as going to a concert) and pain avoidance activities (such as taking a medication).

16.5 Proofs

Proof of Proposition 6.1: First assume that $u_i > 0$ for some $i \in A$. After at most $(n - 1)$ choices of alternatives with negative utility, one with $u_i > 0$ is chosen. From that point on, that alternative is the unique $U$-maximizer and is chosen forever. Hence in this case $f(x)$ exists for all $x \in S(u)$, and it is one of the extreme points of the $(n - 1)$-dimensional simplex. Note, however, that it need not be the same for all such $x$'s.

Let us now consider the case of $u_i < 0$ (i.e., $a_i > 0$) for all $i \in A$. At stage $t \geq 0$ alternative $i$ is weakly preferred to $j$ iff

$$U(x, i, t) = F(x, i, t)\,u_i \geq F(x, j, t)\,u_j = U(x, j, t)$$

or, equivalently,

$$F(x, i, t)\,a_i \leq F(x, j, t)\,a_j .$$

Hence for every stage $t \geq 0$, regardless of whether $i$ is to be chosen at it or not, we have

$$F(x, i, t) \leq \frac{a_j}{a_i} F(x, j, t) + 1 .$$

Thus, for $t \geq 1$ we obtain

$$f(x, i, t) \leq \frac{a_j}{a_i} f(x, j, t) + \frac{1}{t} .$$

For $t \geq n$ none of the frequencies vanishes, and it follows that

$$\exists \lim_{t \to \infty} \frac{f(x, i, t)}{f(x, j, t)} = \frac{a_j}{a_i} .$$

With finitely many alternatives, this also implies that $f(x)$ exists. Furthermore, we find that it is independent of $x$. Finally, it follows that

$$f(x, i) = \frac{\prod_{j \neq i} a_j}{\sum_{k \in A} \prod_{j \neq k} a_j} .$$

To sum up, we have proven that $f(x)$ exists in both cases, hence (i) is proven. If there is an alternative $i$ with $u_i > 0$, only extreme points could be limit frequencies; on the other hand, if $u_i < 0$ for all $i \in A$, only interior points can be chosen.
Thus the existence of one $x \in S(u)$ for which $f(x)$ is an extreme point implies that for some alternative $i$, $u_i > 0$, hence for all $x \in S(u)$, $f(x)$ is an extreme point of the simplex. This concludes the proof of (ii). Claims (iii) and (iv) were explicitly proven above.

We are left with claim (v), i.e., that for every interior point $y$ in the $(n - 1)$-dimensional simplex there exist negative utility indices $(u_i)_{i \in A}$ such that $f(x) = y$ for all $x \in S(u)$, and that these are unique up to a multiplicative scalar. Let there be given such a point $y = (y_1, \ldots, y_n)$. Define

$$u_i = -a_i = -\prod_{j \neq i} y_j .$$

For all $i, j \in A$, $y_i / y_j = a_j / a_i$. It follows that $f(x) = y$ for all $x \in S(u)$. Furthermore, this equation shows that the utility $(u_i)_{i \in A}$ is unique up to multiplication by a positive scalar. This completes the proof. □

Proof of Proposition 6.2: In light of the preceding analysis, (i) is immediate. Consider the case $u_i < 0$ for all $i \in A$. If $a_i / a_j$ is irrational for all $i \neq j$ in $A$, after the first $n$ stages, where each alternative is chosen once, $U$-maximization determines the choice uniquely. Thus there are $n!$ sequences in $S(u)$. If, however, $a_i / a_j$ is rational for some $i \neq j$ in $A$, say $a_i / a_j = l/k$ for some integers $l, k \geq 1$, after exactly $l$ choices of alternative $j$ and $k$ choices of $i$, there will come a stage where both $i$ and $j$ are among the maximizers of $U$. The choice at this stage is arbitrary, and therefore there are at least two different continuations that are consistent with $U$-maximization. Since for every choice made at this point there will be a similar one after $2l$ choices of alternative $j$ and $2k$ choices of $i$, there are at least 4 such continuations at this stage. By similar reasoning, there are at least $|\{0, 1\}^{\mathbb{N}}| = \aleph$ sequences in $S(u)$. Since $\aleph$ is also an upper bound on $|S(u)|$, (ii) is established.

We need to show (iii). First consider $S^-$. Denote

$$S_p = \{ x \in A^\infty \mid \exists f(x) = p \} .$$

By the strong law of large numbers, $\lambda_p(S_p) = 1$.
Hence $\lambda_p(S^- \setminus S_p) = 0$. It therefore suffices to show that $\lambda_p(S^- \cap S_p) = 0$. First consider the case where $p_i / p_j$ is irrational for all $i \neq j$ in $A$. Then $S^- \cap S_p$ is finite and of measure zero. Next assume that $p_i / p_j = l/k$ for some integers $l, k \geq 1$. In this case $S^- \cap S_p$ is uncountable, but it is a subset of the event in which for every $m \geq 1$, the $(ml + 1)$-th appearance of $j$ occurs after the $mk$-th appearance of $i$, and the $(mk + 1)$-th appearance of $i$ occurs after the $ml$-th appearance of $j$. Since at every point there is a positive $\lambda_p$-probability of a sequence of $(mk + 1)$ consecutive appearances of $i$, this event is of measure zero.

Finally, the set $S^+$ is finite. Furthermore, it is $\lambda_p$-null unless $p$ is a degenerate probability vector. This completes the proof of the proposition. □

17 The Potential

17.1 Definition

In this chapter we re-consider the decision process described in Section 16 as a maximization of a function that is defined on the simplex of relative frequencies. This analysis serves two purposes. First, it allows us to compare the asymptotic behavior of the dynamic process to neoclassical consumer theory, which is atemporal. Second, it provides a useful theoretical tool. In particular, we employ the potential to study the effect of act similarity in repeated choice.

For a choice vector $x \in A^\infty$, let $x_t \in A^t$ be the $t$-prefix of $x$. Let $\cdot$ denote vector concatenation. Denote the set of all prefixes by $A^* = \bigcup_{t \geq 0} A^t$. For a function $\varphi : A^* \to \mathbb{R}$, $x_t \in A^t$, and $i \in A$, define

$$\frac{\partial \varphi}{\partial i}(x_t) = \varphi(x_t \cdot (i)) - \varphi(x_t) .$$

That is, $\partial \varphi / \partial i$ is the change in the value of $\varphi$ that will result from adding $i$ to the choice vector $x_t$.

Denote by $U_i : A^* \to \mathbb{R}$ the $U$-value of alternative $i$. That is, $U_i(x_t) = U(x, i, t)$. Then $u_i$ may be viewed as the derivative of $U_i$ with respect to experiencing alternative $i$:

$$\frac{\partial U_i}{\partial i}(x_t) = u_i$$

for all $x_t \in A^*$. Similarly, for $j \neq i$,

$$\frac{\partial U_i}{\partial j}(x_t) = 0 .$$

Suppose we wish to measure the well-being of the decision maker at a certain time.
One possibility to do so is by the sum of all past experiences. Define $Q : A^* \to \mathbb{R}$ by

$$Q(x_t) = \sum_{\tau = 1}^{t} U_{x(\tau)}(x_{\tau - 1})$$

for $x_t \in A^*$. A $U$-maximizing decision maker may be described as maximizing her $Q$ function at any given time $t$. Furthermore, $Q$ is uniquely defined (up to a shift by an additive constant) by

$$\frac{\partial Q}{\partial i}(x_t) = U_i(x_t)$$

for all $x_t \in A^*$ and $i \in A$. Hence $Q$ is a single function, such that the utility of alternative $i$, $U_i$, is its derivative with respect to $i$. It thus deserves the title the potential of the utility.

In certain ways, the potential is closer to the utility function in neoclassical consumer theory than either $U$ or $u$ are. Both $u$ and $U$ are defined for a single alternative, whereas the neoclassical function is defined for bundles. Correspondingly, if past experiences linger in the decision maker’s memory, neither $U$ nor $u$ may qualify as measures of overall well-being, while the neoclassical utility does. By contrast, the potential function $Q$ is defined for bundles (implicit in the vector $x_t$), and may be interpreted as a measure of well-being. Furthermore, since

$$\frac{\partial Q}{\partial i} = U_i \qquad \text{and} \qquad \frac{\partial U_j}{\partial i} = \begin{cases} u_i & i = j \\ 0 & i \neq j , \end{cases}$$

we get

$$\frac{\partial^2 Q}{\partial j\, \partial i} = \begin{cases} u_i & i = j \\ 0 & i \neq j . \end{cases}$$

If we were to define convexity and concavity of $Q$ by its second derivatives, we would find that $Q$ is concave if $u_i < 0$ (for all $i \in A$) and convex if $u_i > 0$. Indeed, we have found earlier (see Proposition 6.1) that negative $u$ values induce limit frequencies of choice that are also predicted by a concave neoclassical utility function, while positive $u$ values correspond to a convex one. Thus the potential parallels the neoclassical utility function also in terms of the relationship between concavity/convexity and interior/corner solutions.

From a mathematical viewpoint the potential is a rather different creature from the neoclassical utility function. While the former is defined on choice sequences, the latter is defined on bundles.
However, since all permutations of a given sequence are equivalent in terms of the behavior they induce, a sequence may be identified with the relative frequencies of the alternatives in it. It follows that, after an appropriate normalization, we may re-define the potential on the simplex of bundles.

17.2 Normalized potential and neo-classical utility

Formally, for $x \in A^\infty$ and $t \geq 0$, recall that $F(x, i, t)$ denotes the number of appearances of $i$ in $x$ up to time $t$. Let $T(x, i, k)$ stand for the time at which the $k$-th appearance of $i$ in $x$ occurs, that is,

$$T(x, i, k) = \min \{ t \geq 0 \mid F(x, i, t) \geq k \} .$$

This function will be taken to equal $\infty$ if the set on the right is empty. However, for the case $u_i < 0$, $T$ will be finite. We obtain

$$\begin{aligned}
Q(x_t) &= \sum_{\tau = 1}^{t} U(x, x(\tau), \tau - 1) \\
&= \sum_{i = 1}^{n} \sum_{k = 1}^{F(x, i, t)} U(x, i, T(x, i, k) - 1) \\
&= \sum_{i = 1}^{n} \sum_{k = 1}^{F(x, i, t)} (k - 1)\, u_i \\
&= \frac{1}{2} \sum_{i = 1}^{n} u_i\, F(x, i, t)\,[F(x, i, t) - 1] .
\end{aligned}$$

Recall that $f(x, i, t)$ is the relative frequency of $i$ in $x$ up to time $t$. Thus,

$$\frac{1}{t^2} Q(x_t) = \frac{1}{2} \sum_{i = 1}^{n} u_i\, f(x, i, t) \left[ f(x, i, t) - \frac{1}{t} \right] .$$

For a point $f = (f_1, \ldots, f_n)$ in the $(n - 1)$-dimensional simplex, define the normalized potential to be

$$q(f) = \frac{1}{2} \sum_{i = 1}^{n} u_i f_i^2 .$$

Then, for large enough $t$,

$$\frac{Q(x_t)}{t^2} \approx q(f(x, t)) .$$

At any given time $t \geq 0$, the decision maker has a value for the normalized potential, and behaves as if she were trying to (approximately) maximize it. To be precise, the decision maker chooses an alternative so as to maximize $Q(x_{t+1})$, or, equivalently, to maximize $Q(x_{t+1}) / (t + 1)^2$. However, in the long run this is approximately equivalent to maximization of the normalized potential $q$. Considering the optimization problem

$$\max\; q(f) \quad \text{s.t.} \quad \sum_{i = 1}^{n} f_i = 1, \quad f_i \geq 0 ,$$

it is straightforward to check that, should it have an interior solution (relative to the simplex), the solution must satisfy

$$f_i u_i = \text{const} .$$

Indeed, Proposition 6.1 shows that, if $u_i < 0$ (for all $i \in A$), there is an interior solution satisfying the above condition.
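The closed form for the potential and its normalization can be verified on small examples. The Python sketch below is illustrative: the function names are hypothetical, and the initial $U$-values are assumed to be zero. It computes $Q$ directly from the definition as a running sum of $U$-values, compares it with $\frac{1}{2} \sum_i u_i F_i (F_i - 1)$, and evaluates the normalized potential $q$.

```python
def potential(sequence, u):
    """Q(x_t): sum over periods of the U-value of the chosen alternative just before the choice."""
    counts = [0] * len(u)
    total = 0
    for choice in sequence:
        total += counts[choice] * u[choice]  # U(x, choice, tau - 1) = F * u
        counts[choice] += 1
    return total


def potential_closed_form(counts, u):
    """Q as a function of the counts only: (1/2) * sum_i u_i * F_i * (F_i - 1)."""
    return sum(ui * c * (c - 1) for ui, c in zip(u, counts)) / 2


def normalized_potential(f, u):
    """q(f) = (1/2) * sum_i u_i * f_i ** 2 for a frequency vector f."""
    return 0.5 * sum(ui * fi * fi for ui, fi in zip(u, f))


if __name__ == "__main__":
    u = [-1, -2, -4]
    print(potential([0, 1, 0, 2, 0, 1, 0], u), potential_closed_form([4, 2, 1], u))
    counts, t = [4000, 2000, 1000], 7000
    print(potential_closed_form(counts, u) / t**2,
          normalized_potential([c / t for c in counts], u))
```

In the second comparison the frequencies are $(4/7, 2/7, 1/7)$, for which $f_i u_i = -4/7$ for every $i$, so the first-order condition holds at that point.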
Furthermore, the proposition shows that the “greedy”, or “hill climbing” algorithm implemented by our decision maker converges to this solution.

At this point it is tempting to identify the (normalized) potential with the neoclassical utility. However, some distinctions should be borne in mind. First, the neoclassical utility is assumed to be globally maximized by a one-shot decision. The potential is only locally improved at every stage. Should the potential be convex, for instance, our decision maker may be satisficed without optimizing it. Second, the neoclassical utility is assumed to be maximized given the budget constraint. By contrast, the local improvement of the potential is unconstrained. In Gilboa and Schmeidler (1997b) we introduce prices into the model in a way that retains this feature, namely, that lets the consumer follow the gradient of greatest improvement, where prices and income are internalized into the evaluation of possible small changes.

Yet, as long as prices and income are ignored, and if the potential is concave, our model may be viewed as a dynamic derivation of neoclassical utility maximization. Specifically, a neoclassical consumer who happens to have a concave, quadratic, and additively separable utility function over the simplex of frequencies will end up making the same aggregate choice as the corresponding unsatisficed cumulative utility consumer. Our theory thus provides an account of how such a consumer gets to maximize her utility.

Starting with a neoclassical utility, and adopting the identification of quantities with frequencies, one may suggest a hill-climbing algorithm as a reasonable dynamic model of consumer optimization, independently of our theory. From this viewpoint, the cumulative utility model provides a cognitive interpretation of the gradients considered by the optimizing consumer.

17.3 Substitution and Complementarity

Consider a consumer who chooses a daily meal out of {beef, chicken, fish}.
Suppose that she likes them to the same degree, but that she seeks change. Say,

$$u_{\text{beef}} = u_{\text{chicken}} = u_{\text{fish}} = -1 .$$

If the consumer judges beef as closer to chicken than fish is, after having chicken she might prefer fish to beef. To capture these effects, we introduce a similarity function between products, $s_A : A \times A \to [0, 1]$, that reflects the fraction of one product’s instantaneous utility that is added to another product’s cumulative satisfaction index. Specifically, every time product $i$ is consumed, $s_A(j, i)\,u_i$ is added to $U_j$, for every product $j$, with $s_A(i, i) = 1$. For instance, suppose that

$$s_A(\text{beef}, \text{chicken}) = s_A(\text{chicken}, \text{beef}) = 1 ,$$

where the other off-diagonal $s_A$ values are zero. For these similarity values, when the choice set is {chicken, fish} or {beef, fish}, the relative frequencies of consumption are (1/2, 1/2). However, for the set {beef, chicken, fish}, they may be (1/4, 1/4, 1/2) rather than (1/3, 1/3, 1/3), as would be the case in the absence of product similarities. In fact, in this example beef and chicken are practically identical, and we can only say that the frequencies of {beef-or-chicken, fish} will converge to (1/2, 1/2).

One interpretation of the product-similarity function is that it measures substitutability: the closer two products are to being perfect substitutes, the higher is the similarity between them, and the greater will be the impact of consuming one on boredom with another. If the similarity between two products is zero, they are substitutes in the sense that both can still be chosen in the same problem, but the consumption of one of them does not affect the desirability of the other.

Next suppose that the recurring choice in our model is a purchase decision, rather than a consumption decision. In each period the consumer chooses a single product, but she also has a supermarket cart, or kitchen shelves, which allow her eventually to consume several products together.
Having bought tomatoes yesterday and cucumbers today, the consumer may have a salad. Following this interpretation, it is only natural to extend the product-similarity function to negative values in order to model complementarities. If the product-similarity between, say, tea and sugar is negative (and so are their utilities), having just purchased the former would make the consumer more likely to purchase the latter next.

Recall that the instantaneous utility function $u$ may be interpreted as the consumption-derivative of $U$:

$$\frac{\partial U_i}{\partial i} = u_i .$$

In the presence of product-similarity, when product $i$ is chosen, $U_j$ is changed by $s_A(j, i)\,u_i$. That is,

$$\frac{\partial U_j}{\partial i} = s_A(j, i)\, u_i .$$

Combining the above, the similarity function is the ratio of derivatives of the $U$ functions:

$$s_A(j, i) = \frac{\partial U_j / \partial i}{\partial U_i / \partial i} .$$

Using the potential function, we may write

$$\frac{\partial^2 Q}{\partial j\, \partial i} = \frac{\partial U_j}{\partial i} = s_A(j, i)\, u_i = \frac{\partial U_j / \partial i}{\partial U_i / \partial i} \cdot \frac{\partial U_i}{\partial i} ,$$

$$s_A(j, i) = \frac{\partial^2 Q / \partial j\, \partial i}{\partial^2 Q / \partial i^2} .$$

The neoclassical theory defines the substitution index between two products as the cross derivative of the utility function (with respect to product quantities). By comparison, the product similarity function is the cross derivative of the potential function, normalized by its second derivative with respect to one of the two. Furthermore, if we define substitutability between two products $i$ and $j$ as the impact consumption of $i$ has on the desirability of $j$, namely $s_A(j, i)\,u_i$, it precisely coincides with the cross derivative of the potential. This reinforces the analogy between the potential in our model and the neoclassical utility.

In the presence of product similarity, the consumer’s choice given a sequence of past choices $x_t$ will not depend on the order of the products in $x_t$, though the potential generally will. The $U$-value of product $i$ is

$$U(x, i, t) = \sum_{\tau = 1}^{t} s_A(i, x(\tau))\, u_{x(\tau)} = \sum_{j = 1}^{n} F(x, j, t)\, s_A(i, j)\, u_j ,$$

which depends only on the number of appearances of each product in $x_t$.
However, the value of the potential is

$$Q(x_t) = \sum_{\tau = 1}^{t} U(x, x(\tau), \tau - 1) = \sum_{\tau = 1}^{t} \sum_{v = 1}^{\tau - 1} s_A(x(\tau), x(v))\, u_{x(v)} .$$

It is easy to see that the potential will be invariant with respect to permutations of $x_t$ iff

$$s_A(j, i)\, u_i = s_A(i, j)\, u_j \qquad \forall i, j \in A ,$$

that is, iff

$$\frac{\partial^2 Q}{\partial j\, \partial i} = \frac{\partial^2 Q}{\partial i\, \partial j} \qquad \forall i, j \in A .$$

Note that this is the appropriate notion of symmetry in this model: the product similarity function itself may be symmetric without guaranteeing that the impact of consumption of $i$ on desirability of $j$ equals that of $j$ on $i$. Rather, we define symmetry by $s_A(j, i)\,u_i = s_A(i, j)\,u_j$, that is, by the equality of the cross-derivatives of the potential. Under this assumption,

$$Q(x_t) = \frac{1}{2} \sum_{i = 1}^{n} u_i\, F(x, i, t)\,[F(x, i, t) - 1] + \sum_{i = 2}^{n} \sum_{j = 1}^{i - 1} s_A(i, j)\, u_j\, F(x, i, t)\, F(x, j, t) .$$

Since $s_A(i, i) = 1$ for all $i \in A$ we get

$$\frac{1}{t^2} Q(x_t) = \frac{1}{2} \sum_{i, j \in A} s_A(i, j)\, u_j\, f(x, i, t)\, f(x, j, t) - \frac{1}{2t} \sum_{i = 1}^{n} u_i\, f(x, i, t) .$$

Hence $Q(x_t) / t^2$ can be approximated by the normalized potential defined on the simplex by

$$q(f) = \frac{1}{2} f S f^{\top} ,$$

where $f = (f_1, \ldots, f_n)$ is a frequency vector and the matrix $S$ is defined by $S_{ij} = s_A(i, j)\, u_j$.

Suppose that $S$ is negative definite. In this case the hill-climbing algorithm implemented by a cumulative utility consumer will result in maximization of the normalized potential. Hence, if we start out with a quadratic and concave neoclassical utility $u$, it defines a matrix $S$ for which the corresponding cumulative utility consumer behaves as if she were locally maximizing $u$. Furthermore, for any function $u$ that can be locally approximated by a quadratic $q$, local $u$-maximization would result in similar choices to those of the cumulative utility consumer characterized by $q$.

Since $S$ is symmetric, it can be diagonalized by an orthonormal matrix. That is, there exists an $(n \times n)$ matrix $P$ with $P^{\top} = P^{-1}$ such that $P^{\top} S P$ is diagonal, with the eigenvalues of $S$ along its main diagonal.
Since the matrix $P$ can be thought of as rotation in the bundle space, we may offer the following interpretation: the consumer is deriving utility from $n$ “basic commodities”, which are the eigenvectors of $S$, and their $u$-values are the corresponding eigenvalues. (These would be negative for a negative definite $S$.) One such commodity may be, for instance, a certain combination of tea and sugar. However, the consumer can only purchase the products tea and sugar separately. By choosing the right mix of the products, it is as if the basic commodity was directly consumed.³ According to this interpretation, there is zero similarity between the basic commodities; that is, there are no substitution or complementarity effects between them. These effects among the actual products are a result of the fact that these products are ingredients in the desired basic commodities.

³ Note that the analogy to the additively separable case is not perfect. Specifically, the constraint that the sum of frequencies be 1 is also rotated. Hence the negative eigenvalues are not related to frequencies as directly as in Section 16.

Chapter 7 Learning and Induction

This chapter studies several ways in which case-based decision makers learn from their experience. First, we discuss aspiration level adjustments and their impact on behavior. We then proceed to discuss examples in which the similarity function is also learned. Finally, we conclude with some comments on inductive reasoning.

18 Learning to Maximize Expected Payoff

18.1 Aspiration level adjustment

The reader may find decision makers who maximize the function $U$ less rational than expected utility maximizers. The former satisfice rather than optimize. They count on their experience rather than attempt to figure out what will be the outcomes of available choices. They follow what appears to be the best alternative in the short run rather than plan for the long run.
They use whatever information they happened to acquire rather than intentionally experiment and learn from a growing experience.

There are many applications for which we find this boundedly rational image of a decision maker quite plausible, and it can even pass the rationality test proposed in Section 2. Especially in novel situations, people often find it hard to specify the space of states of the world in a satisfactory way, let alone to form a prior over it. Many if not most of the decisions taken by governments, for instance, are made in complex environments that have not been encountered before in precisely the same way. It is therefore not surprising that the commandments of EUT, appealing as they are, are hard to follow. One does not have enough trials to figure out all the possible eventualities, not to mention their frequencies. Further, there is little value in investing in learning and experimentation, because the knowledge acquired will be obsolete before it gets to be used. Similarly, planning carefully for the long run may prove a futile endeavor.

However, when the environment is more or less fixed, and the situation may be modeled as a repeated choice problem, case-based decision makers do appear to be a little too naive and myopic. While we argue (see Section 12) that CBDT is not designed for these situations to begin with, we show here that the irrational and shortsighted aspect of CBDT is, for the most part, an artifact of the implicit assumption that aspiration levels do not change. Indeed, the basic formulation of CBDT in Section 4 normalized the utility function in such a way that the utility ascribed to an untried alternative, namely, the aspiration level, was zero.
While this normalization can be performed for each choice situation separately, there is no reason to assume that it can be done for all of them simultaneously, that is, that the aspiration level does not change over time, or as a result of past experiences.

In this section we propose two properties of aspiration-level adjustment rules, which we find descriptively plausible in general. We show that in the special case of a repeated choice problem, these properties also guarantee optimality. Thus, these properties can also be supported on normative grounds. While there are many rules that may guarantee optimal choice in this special case, we hope to convince the reader that the rules discussed here are also fairly intuitive.

18.2 Realism and ambitiousness

We assume that the aspiration level is adjusted according to a rule that is both realistic and ambitious. Realism means that the aspiration level is set closer to the best average performance so far experienced. Ambitiousness can take one of two forms: it may simply imply that the initial aspiration level is high; alternatively, it may be modeled as suggesting that the aspiration level is increased in certain periods.

More precisely, we model realism by assuming an adaptive rule that sets the new aspiration level at a certain weighted average of the old one and the maximal average performance so far encountered. Thus, if all the acts that were attempted in the past failed to perform up to expectations, the latter will have to be scaled down. Conversely, if the aspiration level is exceeded by the performance of certain acts, it is gradually increased.¹

While we do not provide an axiomatic derivation of this property, we would like to motivate it from several distinct viewpoints. First, suppose that we read the aspiration level as an answer to the question, What can you reasonably hope for in the current problem?
If a moderately rational decision maker is to provide the answer, some adaptation of the aspiration level to actual performance seems inevitable. Second, realism appears to be a plausible description of people’s emotional reactions: people seem to be able to adapt to circumstances. For simplicity, we do not distinguish here between scaling-up and scaling-down. For our purposes it suffices that over the long run the aspiration level is adjusted. Third, one might argue for the optimality of realism. As we show in this section, adjusting aspiration levels in a realistic and ambitious way leads to expected payoff maximizing choices. That is, we show that the combination of realism and ambitiousness is behaviorally optimal. But realism has an emotional justification as well. If we interpret the aspiration level emotionally, when it is set too high the decision maker is bound to be disappointed. Thus, if one were to choose an aspiration level (say, for one’s child), one would like to set it at a realistic level.²

¹ As will be clear in the sequel, the specific adjustment rule is immaterial; it is crucial, however, that it is gradually pushing the aspiration level towards the actual best average performance.

The ambitiousness property may have two separate (though compatible) meanings: static ambitiousness simply states that the initial aspiration level is relatively high. A high initial aspiration level reflects the fact that our decision maker is aggressive, and entertains great expectations. Whether the decision maker’s initial aspiration level is high enough will depend on a variety of psychological, sociological, and perhaps also biological factors. While our optimism assumption may not be universally true, it is not blatantly implausible either. (See Shepperd, Ouellette, and Fernandez (1996) for related empirical evidence.)
The second meaning that the ambitiousness assumption may take is dynamic: that is, that the decision maker never quite loses hope. Specifically, we will assume that in certain decision periods, the aspiration level is set to exceed the best average performance by a certain constant. In order to make this compatible with realism, we will allow these decision periods to become more and more infrequent. (As a matter of fact, for the optimality result we will require that the update periods have a limit frequency of zero.) That is, the longer one’s memory is, the less one tends to increase the aspiration level in this somewhat arbitrary manner. However, dynamic ambitiousness requires that these update periods never end. Regardless of the memory’s length, a dynamically ambitious decision maker still sometimes stops to ask, Why can’t I do better than that?

As in the case of static ambitiousness, the claim of this assumption to descriptive validity can be qualified at best. Indeed, we are not trying to claim that all decision makers are realistic but ambitious, just as we do not believe that all people necessarily choose optimal acts in repeated problems. The main point is that the properties of realism and ambitiousness correspond to some general intuition, and they make sense beyond the special case in which a certain problem is encountered over and over again. It is reassuring to know that in this special case these general properties also guarantee optimal choice.

² Continuing this line of reasoning, it would seem that lowering aspirations well below what could realistically be expected may increase happiness even further. But this would result in sub-optimal choices, as will be clear in the sequel. Moreover, this option may not be psychologically feasible in the long run. For a more detailed discussion of these issues, see Gilboa and Schmeidler (1996b).
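The interplay of realism and dynamic ambitiousness can be sketched in a few lines of Python. Everything specific here is an assumption chosen for illustration, since the text only describes the rules verbally: payoffs are deterministic, the realistic update uses a fixed weight of 0.9 on the old aspiration level, the ambitious periods are the perfect squares (so their limit frequency is zero), the ambitious increment is 1, and ties go to the lowest index.

```python
import math


def simulate_aspirations(payoffs, steps, initial_aspiration=0.0,
                         weight=0.9, bonus=1.0, dynamic=True):
    """Count choices of each act under a realistic (and optionally ambitious) rule.

    Each act i yields payoffs[i] deterministically. An act's U-value is the sum of
    (payoff - current aspiration) over its past uses, so untried acts score zero.
    """
    n = len(payoffs)
    counts = [0] * n
    sums = [0.0] * n
    aspiration = initial_aspiration
    for t in range(1, steps + 1):
        averages = [sums[i] / counts[i] for i in range(n) if counts[i] > 0]
        # Dynamic ambitiousness (assumed schedule): at perfect-square periods the
        # aspiration level is pushed above the best average performance so far.
        if dynamic and averages and math.isqrt(t) ** 2 == t:
            aspiration = max(averages) + bonus
        scores = [sums[i] - counts[i] * aspiration for i in range(n)]
        choice = max(range(n), key=lambda i: scores[i])  # ties: lowest index
        sums[choice] += payoffs[choice]
        counts[choice] += 1
        # Realism: move the aspiration level toward the best average so far.
        best_average = max(sums[i] / counts[i] for i in range(n) if counts[i] > 0)
        aspiration = weight * aspiration + (1 - weight) * best_average
    return counts


if __name__ == "__main__":
    print(simulate_aspirations([1.0, 2.0, 3.0], 10000, dynamic=False))
    print(simulate_aspirations([1.0, 2.0, 3.0], 10000, dynamic=True))
```

Starting from a low aspiration level, the purely realistic decision maker in this sketch stays with the first act she tries, whereas the dynamically ambitious one eventually tries every act and chooses the best one most of the time.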
18.3 Highlights

The repeated problem we discuss in this section is akin to the multi-armed bandit problem (see Gittins (1979)): the decision maker is repeatedly faced with finitely many options (“arms”), each of which is guaranteed to yield an independent realization of a certain random variable (with finite expectation and variance). Our results are as follows: first assume that the decision maker is realistic and statically ambitious. Then, given the distributions governing the various arms, and any probability lower than 1, there exists a high enough initial aspiration level such that, with this probability at least, the limit frequency of the expected-utility maximizing acts will be 1. Thus the initial aspiration level depends both on the given distributions and on the desired probability with which this frequency will indeed be 1.

Our second result assumes a decision maker who is realistic and dynamically ambitious. We prove that for any given distributions, and any initial aspiration level, the limit frequency of the optimal acts will be 1 with probability 1. This is a stronger statement than the previous one. It is guaranteed that almost always the best acts will be almost-always chosen, and that the same algorithm obtains optimality for any given distributions. Thus, dynamic ambitiousness is safer than static ambitiousness. Roughly, it is more important not to lose hope than to have great expectations.

The intuition behind both results can be easily explained in the deterministic case. Suppose that every time an act is chosen, it yields the same outcome. For the first theorem, assume that the decision maker starts out with a very high aspiration level. Thus all options seem unsatisfactory, and the decision maker switches from one to another, as prescribed by CBDT in case of negative utility values.
Specifically, in this case the frequencies of choice are inversely proportional to the utility values, as shown in Chapter 6. Hence, a high aspiration level prods the decision maker to experiment with all options with approximately equal frequencies. But the aspiration level cannot remain high for long. As time goes by it is updated towards the best average performance so far encountered. In the deterministic case, the average performance of an act is simply its instantaneous utility value. Thus, in the long run the aspiration level tends to the maximal utility value. Correspondingly, an act that achieves this value is almost satisficing, and the set consisting of all these acts will be chosen with a limit frequency of 1.

Next consider a dynamically-ambitious decision maker in a similar setup. Such a decision maker may have started out with too low an aspiration level; thus she may be choosing a sub-optimal act, while the aspiration level is being adjusted upward towards the utility value of this act. However, if at some point the aspiration level is set above this utility value, this act is no longer satisficing, and the decision maker will try a new one. In the long run all acts will be tried out, and the aspiration level will be realistically adjusted towards the maximal utility value. As opposed to the case of static ambitiousness, the aspiration level does not converge to this value, since it is pushed above it from time to time. Yet, these periods are assumed to have zero limit frequency, and thus the optimality result holds.

The general cases of both results, in which the available acts yield stochastic payoffs, are naturally more involved, but the proofs follow the same basic intuition.

We note here that both realism and ambitiousness are crucial for the optimality results. If our decision maker is realistic but not ambitious, she may well choose a sub-optimal act forever.
In this case her choice is random in the following sense: an act is randomly selected at the first stage, and then it is chosen for ever. On the other hand, if the decision maker is, say, statically ambitious but not realistic, then all choices seem to her almost equally unsatisfactory. In this case the choice is close to random in the sense that all acts will be chosen with approximately the same frequency. (See Chapter 6.) By contrast, the combination of the two guarantees that all acts will be experimented with, but also that in the long run experimentation will give way to optimal choice. In a sense, our results may be viewed as explaining the evolution of optimal (expected-utility maximizing) choice: a case-based decision maker who is both realistic and ambitious will learn to be an expected-utility maximizer. These results only hold if the decision problem is repeated long enough in the same form. But this is precisely the case in which EUT seems most plausible, that is, when history is long enough to enable the decision maker to figure out what are the states of the world, and to form a prior over them based on their observed frequencies. Furthermore, a case-based decision maker is more open minded than an expected utility maximizer. While the latter may have a-priori beliefs whose support fails to contain the true distribution, the former does not entertain prior beliefs, and thus cannot be wrong about them. In the context of optimization problems, one may view our results as reinforcing a general principle by which global optimization may be obtained by local optimization coupled with the introduction of noise. The annealing algorithms (Kirkpatrick, Gellatt, and Vecchi (1982)) are probably the most explicit manifestation of this principle. Genetic algorithms (Holland (1975)) are another example, in which the adaptive process leads to a local optimum of sorts, and the cross-over process allows the algorithm to explore new horizons. 
Yet another example of the same principle may be found in evolutionary models in game theory such as Foster and Young (1990), Kandori, Mailath, and Rob (1993) and Young (1993). In these models, a myopic best-response rule may lead to equilibria that are Pareto dominated (“local optima”), even in pure-coordination games. But the introduction of mutations provides the “noise” that guarantees (in such games) a high probability of a Pareto-dominating equilibrium (a “global optimum”).

From this viewpoint, one may interpret our results as follows: the realistic nature of the aspiration-level adjustment rules induces convergence to a local optimum, namely, to a high frequency of choice of the best acts among those that were tried often enough. Ambitiousness plays the role of the noise, which prods the decision maker to choose seemingly sub-optimal acts, and, in the long run, to converge to a global optimum. Annealing algorithms simulate physical phenomena; genetic algorithms and evolutionary game theory models are inspired by biological metaphors; by contrast, our process is motivated by psychological intuition. As mentioned above, we find this intuition valid beyond the specific model at hand.

18.4 Model

We now turn to the formal model. Let $A = \{1, 2, \ldots, n\}$ be a set of acts ($n \ge 1$). For $i \in A$ let there be given a distribution $F_i$ on $\mathbb{R}$ (endowed with the Borel σ-algebra), to be interpreted as the (conditional) distribution of the utility yielded by act $i$ whenever it is chosen. We assume that $F_i$ has finite expectation and variance, denoted $\mu_i$ and $\sigma_i$, respectively. The underlying state space will be a subset of $S_0 = (\mathbb{R} \times A \times \mathbb{R})^{\mathbb{N}}$, where $\mathbb{N}$ denotes the natural numbers. A state $\omega = ((H_1, a_1, x_1), (H_2, a_2, x_2), \ldots) \in S_0$ is interpreted as follows: for all $t \ge 1$, in period $t$ the aspiration level is $H_t$ at the beginning of the period, an act $a_t$ is chosen, and it yields a payoff of $x_t$.
It will be convenient to define, for every $t \ge 1$, the projection functions $H_t, x_t : S_0 \to \mathbb{R}$ and $a_t : S_0 \to A$ with the obvious meaning. Next we define a function $C : S_0 \times A \times \mathbb{N} \to 2^{\mathbb{N}}$ to be the set of periods, up to a given period, in which a given act was chosen, at a given state. That is,

$$C(\omega, i, t) = \{\, j < t \mid a_j(\omega) = i \,\}.$$

We are also interested in the number of times a certain act was chosen up to a given period. Therefore, we define a function $K : S_0 \times A \times \mathbb{N} \to \mathbb{N} \cup \{0\}$ by

$$K(\omega, i, t) = \# C(\omega, i, t).$$

We are mostly interested in the relative frequencies of the decision maker’s choices. It will be convenient to define a function $f : S_0 \times A \times \mathbb{N} \to [0, 1]$ to measure relative frequency up to a given time, that is,

$$f(\omega, i, t) = \frac{K(\omega, i, t)}{t}.$$

Dropping the period index will refer to the limit:

$$f(\omega, i) = \lim_{t \to \infty} f(\omega, i, t).$$

Finally, we extend this notation to subsets of $A$: for $D \subset A$ we define

$$f(\omega, D, t) = \sum_{i \in D} f(\omega, i, t), \qquad f(\omega, D) = \lim_{t \to \infty} f(\omega, D, t).$$

We now turn to define the CBDT functions in this context. Let $U : S_0 \times A \times \mathbb{N} \to \mathbb{R}$ be defined by

$$U(\omega, i, t) = \sum_{j \in C(\omega, i, t)} [x_j(\omega) - H_t(\omega)].$$

That is, $U$ is the cumulative payoff of an act, measured relative to the current aspiration level $H_t$. Observe that past outcomes $x_j$ (for $j < t$) are re-evaluated in light of the new aspiration level $H_t$. Since the similarity function is assumed to be identically 1, this definition coincides with the function $U$ of Chapter 2. We will also use the notation

$$V(\omega, i, t) = \frac{U(\omega, i, t)}{K(\omega, i, t)}.$$

(Thus, “$V(\omega, i, t)$ is well defined” means “$K(\omega, i, t)$ is positive.”) As in the case of $U$, this definition coincides with the function $V$ of Chapter 2 because the similarity of any two problems is assumed to be 1.

Since the values of both $U$ and $V$ depend on the aspiration level $H_t$, it is convenient to have a separate notation for the absolute average utility of each act. Thus we denote

$$X(\omega, i, t) = \frac{\sum_{j \in C(\omega, i, t)} x_j(\omega)}{K(\omega, i, t)}.$$

Note that $X(\omega, i, t)$ is well defined whenever $V(\omega, i, t)$ is and that

$$X(\omega, i, t) = V(\omega, i, t) + H_t(\omega).$$

We now wish to express the fact that the decision maker considered is a $U$-maximizer. We do it by restricting the state space as follows: define $S_1 \subset S_0$ by

$$S_1 = \{\, \omega \in S_0 \mid a_t(\omega) \in \arg\max_{i \in A} U(\omega, i, t) \ \ \forall t \ge 1 \,\}.$$

Similarly, we further restrict the state space to reflect the fact that the aspiration level is updated in an adaptive manner. First define, for $t \ge 2$ and $\omega \in S_0$, the relative and absolute maximal average performance to be, respectively,

$$V(\omega, t) = \max\{\, V(\omega, i, t) \mid i \in A,\ K(\omega, i, t) > 0 \,\}$$
$$X(\omega, t) = \max\{\, X(\omega, i, t) \mid i \in A,\ K(\omega, i, t) > 0 \,\}.$$

Next, for a given $\alpha \in (0, 1)$ and $H_1 \in \mathbb{R}$ we finally define the state space to be

$$\Omega = \Omega(\alpha, H_1) = \left\{\, \omega \in S_1 \;\middle|\; H_1(\omega) = H_1 \text{ and for } t \ge 2,\ H_t(\omega) = \alpha H_{t-1}(\omega) + (1 - \alpha) X(\omega, t) \,\right\}.$$

Endow $S_0$ with the σ-algebra generated by the Borel σ-algebra on (each copy of) $\mathbb{R}$ and by the algebra $2^A$ on (each copy of) $A$. Let $\Sigma = \Sigma(\alpha, H_1)$ be the induced σ-algebra on $\Omega$.

Finally, we turn to define the underlying probability measure. Given $\Omega$ and $\Sigma$, a probability measure $P$ on $\Sigma$ is consistent with $(F_i)_{i \in A}$ if for every $t \ge 1$ and $i \in A$, the conditional distribution of $x_t$ that it induces, given that $a_t = i$, is $F_i$, and, furthermore, $x_t$ is independent (according to $P$) of the random variables $H_1, a_1, x_1, \ldots, H_{t-1}, a_{t-1}, x_{t-1}, H_t$. Notice that distinct measures on $\Sigma$ that are consistent with $(F_i)_{i \in A}$ can only disagree regarding the choice of an act $a_t$ where $\arg\max_{i \in A} U(\omega, i, t)$ is not a singleton.

18.5 Results

Our first result is:

Theorem 7.1 Let there be given $A = \{1, \ldots, n\}$, $(F_i)_{i \in A}$ as above, $\alpha \in (0, 1)$ and $\varepsilon > 0$. There exists $H_0 \in \mathbb{R}$ such that for every $H_1 \ge H_0$ and every measure $P$ on $(\Omega(\alpha, H_1), \Sigma(\alpha, H_1))$ that is consistent with $(F_i)_{i \in A}$,

$$P(\{\, \omega \in \Omega \mid \exists f(\omega, \arg\max_{i \in A} \mu_i) = 1 \,\}) \ge 1 - \varepsilon.$$

The theorem states that the set of states $\omega$ at which the set of expected utility maximizers has a limit choice frequency, and this frequency equals 1, is measurable and has arbitrarily high probability, provided that the initial aspiration level is high enough.

Note that Theorem 7.1 cannot guarantee an aspiration level that is uniformly high enough for all given distributions $(F_i)_{i \in A}$. Indeed, it is obvious that any initial aspiration level may turn out to be too low. By contrast, our second result guarantees optimality for all given distributions, regardless of the initial aspiration level and with probability 1. The assumption that drives this much stronger conclusion is that the aspiration level is “pushed up” every so often. That is, in a certain set of periods, which is infinite but sparse (i.e., has a limit frequency of zero), the aspiration level is not adjusted by averaging its previous value and the best average performance value; rather, in these periods it is set to be at some level above the best average performance value, regardless of the previous aspiration level.

Formally, we define a new probability space as follows. Let there be given $H_1 \in \mathbb{R}$ and $\alpha \in (0, 1)$ as above. Assume that $N_A \subset \mathbb{N}$ and $h > 0$ are given. $N_A$ is interpreted as the set of periods in which the decision maker is ambitious. The number $h$ should be thought of as the increase in the aspiration level. Define

$$\Omega = \Omega(\alpha, H_1, N_A, h) = \left\{\, \omega \in S_1 \;\middle|\; \begin{array}{l} H_1(\omega) = H_1 \text{ and for } t \ge 2: \\ H_t(\omega) = X(\omega, t) + h \quad \text{if } t \in N_A \\ H_t(\omega) = \alpha H_{t-1}(\omega) + (1 - \alpha) X(\omega, t) \quad \text{if } t \notin N_A \end{array} \,\right\}.$$

Next define $\Sigma = \Sigma(\alpha, H_1, N_A, h)$ to be the corresponding σ-algebra. Similarly, a measure $P$ on $\Sigma$ is defined to be consistent with $(F_i)_{i \in A}$ as above. We can now state:

Theorem 7.2 Let there be given $A = \{1, \ldots, n\}$, $(F_i)_{i \in A}$ as above, $\alpha \in (0, 1)$, $H_1 \in \mathbb{R}$, $N_A \subset \mathbb{N}$ and $h > 0$.
If $N_A$ is infinite but sparse, then for every measure $P$ on $(\Omega(\alpha, H_1, N_A, h), \Sigma(\alpha, H_1, N_A, h))$ that is consistent with $(F_i)_{i \in A}$,

$$P(\{\, \omega \in \Omega \mid \exists f(\omega, \arg\max_{i \in A} \mu_i) = 1 \,\}) = 1.$$

18.6 Comments

Hybrid of Summation and Average

As briefly mentioned above, the decision rule we use here is a hybrid: our decision makers choose acts by U-maximization; however, when it comes to adjusting their aspiration levels, they use the maximal V value. This apparent inconsistency calls for an explanation.

Following the discussion in Section 4 and Chapter 6, we would like to suggest the following interpretation. In general, memory affects one’s decisions in two ways: first, as a source of information, which is especially crucial for decisions under uncertainty; second, as a primary effect in a dynamic choice situation. Memory helps one to reason about the world, but it also changes one’s tastes. Thus, there are two fundamental questions to which memory is key: first, “What do I want to do now?” and second, “What do I think of this act?” In answering the first question, memory plays a dual role: as a source of information and as a factor affecting preferences. In answering the second, memory only serves as a source of information. Correspondingly, we would like to suggest that U offers an answer to the first question, whereas V offers an answer to the second.

Consider the following example: every day Mary has to choose a restaurant. She chooses a U-maximizer. Assume that her aspiration level is high and that she therefore exhibits change-seeking behavior. Assume next that Mary has a guest, and he asks her which is the best restaurant in town, namely, which restaurant one should go to if one has only one day to spend there (with no memory). Then, according to this interpretation, Mary will recommend a V-maximizing, rather than a U-maximizing, act. Asked why she is not choosing this restaurant herself, Mary might say, “Oh, I was there just yesterday.”
Having visited it recently, Mary attaches to the restaurant a relatively low U value. But the very fact that the restaurant was recently chosen need not change its V value.

The optimality rule discussed in this paper is therefore not as inconsistent as it may appear at first glance: our decision makers are U-maximizers in their choices. This means that memory enters their decision considerations not only as a source of information. With a high aspiration level, this also allows them to keep switching among the alternative acts, and to continue trying acts whose past average performance happened to be poor. On the other hand, asking themselves, “What can I reasonably hope for?” or “What choice would I recommend to someone who hasn’t tried any of the options?” they base their answers on V-maximizing acts. As we have shown, adjusting their aspiration level based on the maximal V value also colors past experiences differently. In the long run, the dissatisfaction with V-maximizing acts decreases, and thus their relative frequency tends to 1.

The above notwithstanding, it certainly makes sense to consider two simpler alternatives. The first, the “U-rule”, prescribes that decisions will be made so as to maximize U, and that the aspiration level will be adjusted according to U as well. The second is the corresponding “V-rule”. Both rules seem sensible, but neither of them guarantees optimal choice in the long run. Using the U-rule, the aspiration level need not converge. (As a matter of fact, it is not obvious what the right way is to define the aspiration level adjustment rule in this case.) Using the V-rule, the decision maker may never retry certain alternatives that happened to have particularly low realizations in the first few periods. We omit the simple examples.

Infinitely Many Acts

The discussion in this section focused on finitely many alternatives.
Considering generalizations to large sets of acts, one would like to take into account the fact that infinitely many acts are typically endowed with additional structure. For instance, prices and quantities may be modeled as continua, which also have a metrizable topology. It is natural to reflect this topology in an act similarity function. For instance, having set a price at $20, a seller may have some idea about the outcomes that are likely to result from a price of $20.01. Since these two acts are similar, past experience with one of them enters the evaluation of the other. Given a compact metric topological space of acts and a similarity function that is monotone with respect to the metric, and assuming continuity of u, one would expect a similar optimality result to hold.

Other Adjustment Rules

Our results do not hinge on the specific aspiration level adjustment rule. First, the aspiration level need not be adjusted in every period, and the adjustment periods do not have to be deterministically set. All that is required is that there be infinitely many of them with a high enough probability. Similarly, the realistic adjustment need not be done by a weighted average with fixed weights. The conclusion of Theorem 7.1 will hold whenever the following two conditions are satisfied. (i) The adjustment process guarantees convergence, that is, for all $a, b \in \mathbb{R}$ and $\varepsilon > 0$, if $X(\omega, t) \in (a, b)$ for all $t \ge T_1$ for some $T_1$, then there exists $T_2$ such that for almost all $t \ge T_2$, $H_t(\omega) \in (a - \varepsilon, b + \varepsilon)$. (ii) The adjustment is not too fast: for all $R \in \mathbb{R}$ and all $T_0 \ge 1$ there is a number $H_0$ such that for all $H_1 > H_0$ and all $t \le T_0$, $H_t > R$.

Similarly, the conclusion of Theorem 7.2 holds under the following conditions: (i) the adjustment process guarantees convergence as above; and (ii) the aspiration level increases over an infinite but sparse set of periods.
Neither the increase $h$ nor the set of increase periods $N_A$ need be deterministic or exogenously given. Both may depend on the state $\omega$, on past acts, and on their results. It is essential, however, that, for each $\omega$, $h$ be bounded away from zero (and not too large), and that $N_A$ be infinite but sparse. Finally, one may assume that in the periods of ambitiousness, the aspiration level is set so as to exceed (by $h$) its own previous value, rather than the maximal average performance level.

Retrospective Evaluation

Note that when the aspiration level is updated in our model, the u value of past experiences is also updated. That is, outcomes that have been obtained in the past are re-evaluated according to the newly defined aspiration level. Thus we implicitly assume that the decision maker can reflect upon the outcomes themselves, sometimes realizing that they were not as unsatisfactory as they seemed at the time. Alternatively, one may assume that only the utility value of past experiences is retained in memory, and that the original evaluation of an outcome will be forever used to judge the act that led to it. Our first result does not hold in this case, since a very high initial aspiration level may make an expected utility maximizing act have a very low U value, to the extent that it may never be chosen again. While one may argue for the psychological plausibility of the alternative assumption, it seems that it is “more rational” to re-evaluate outcomes based on an adjusted aspiration level, rather than compare each outcome to a possibly different aspiration level. At any rate, the second result holds under the alternative assumption as well: having infinitely many periods in which the expected utility of any act is a negative number bounded away from zero guarantees that all acts will be chosen infinitely often with probability 1.
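The model of Section 18.4, with the retrospective re-evaluation just described, can be simulated in a few lines. The following is only a sketch under stated assumptions: the names `simulate` and `freq` are hypothetical, payoffs are normal draws (the text only requires finite expectation and variance), ties among U-maximizers are broken in favor of the lowest index, and the sparse set of ambitious periods is taken to be the powers of two. None of these choices comes from the text.

```python
import random


def simulate(mu, sigma, H1, alpha, T, ambitious=frozenset(), h=0.0, seed=0):
    """Simulate T periods of a U-maximizer with an adaptive aspiration level.

    mu, sigma : per-act mean and standard deviation of the payoff
                (normal draws are an illustrative assumption)
    ambitious : periods t >= 2 in which H_t = X(t) + h (the set N_A)
    Returns the list of chosen acts (0-based indices).
    """
    rng = random.Random(seed)
    n = len(mu)
    S = [0.0] * n   # sum of payoffs obtained from each act
    K = [0] * n     # number of times each act has been chosen
    H = H1
    choices = []
    for t in range(1, T + 1):
        if t >= 2:
            # X(t): best average performance among acts tried so far
            best_avg = max(S[i] / K[i] for i in range(n) if K[i] > 0)
            if t in ambitious:
                H = best_avg + h                          # ambitious update
            else:
                H = alpha * H + (1 - alpha) * best_avg    # realistic update
        # U(i, t) = sum of (x_j - H_t) over past choices of act i:
        # all past payoffs are re-evaluated against the current H_t.
        U = [S[i] - K[i] * H for i in range(n)]
        a = max(range(n), key=lambda i: U[i])  # ties go to the lowest index
        x = rng.gauss(mu[a], sigma[a]) if sigma[a] > 0 else mu[a]
        S[a] += x
        K[a] += 1
        choices.append(a)
    return choices


def freq(choices, act):
    """Relative frequency with which `act` was chosen."""
    return choices.count(act) / len(choices)


# Deterministic illustration with three acts; act 2 has the highest payoff.
mu, sigma = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
high = simulate(mu, sigma, H1=100.0, alpha=0.9, T=5000)   # high initial aspiration
stuck = simulate(mu, sigma, H1=0.0, alpha=0.9, T=5000)    # realistic, not ambitious
N_A = {2 ** k for k in range(1, 13)}                      # sparse ambitious periods
pushed = simulate(mu, sigma, H1=0.0, alpha=0.9, T=5000, ambitious=N_A, h=0.5)
```

In these deterministic runs, `stuck` selects act 0 in every period, in line with the remark that a realistic but unambitious decision maker may choose a sub-optimal act forever, whereas `high` and `pushed` end up choosing act 2 most of the time. Setting positive entries in `sigma` gives the stochastic case, where the outcome depends on the seed.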
Games

Pazgal (1997) has shown that if realistic but ambitious case-based decision makers repeatedly play a game of common interest, they will converge to the optimal play. His definition of realism is slightly different from the one we use here, in that the players are assumed to update their aspiration levels toward the best payoff encountered. Thus it implicitly assumes that the players know that the game is of common interest.

18.7 Proofs

Proof of Theorem 7.1:

A few words on the strategy of the proof are probably in order. The general idea is very similar to the deterministic case described above: let the initial aspiration level be high enough so that each act is chosen a large enough number of times, and then notice that the aspiration level tends to the maximal expected utility. In the deterministic case, each act should be chosen at least once in order to get its average performance $X$ equal to its expectation. In the stochastic case, more choices are needed, and a law of large numbers will be invoked for a similar conclusion. Thus the initial aspiration level should be high enough to guarantee that each of the acts is chosen enough times to get the average close to the expectation.

If the supports of the given distributions $F_i$ were bounded, one could find high enough aspiration levels such that all possible realizations of all possible choices seem similarly unsatisfactory. This would guarantee that, as long as the aspiration level is beyond a certain bound, all acts are chosen with similar frequencies, and therefore all of them will be chosen enough times for the law of large numbers to apply. However, these distributions need not have a bounded support. They are only known to have a finite variance. Thus the proof is slightly more involved, as we explain below.

Let us first assume, without loss of generality, that for some $r \le n$,

$$\mu_1 = \mu_2 = \cdots = \mu_r > \mu_{r+1} \ge \cdots \ge \mu_n.$$

Furthermore, we assume that $r < n$.
(The theorem is trivially true otherwise.) Next denote

$$I = \arg\max_{i \in A} \mu_i = \{1, 2, \ldots, r\}, \qquad \delta = \frac{\mu_1 - \mu_{r+1}}{3}.$$

The number $\delta$ is so chosen that, if the average values are $\delta$-close to the corresponding expectations, then the maximal average value is obtained by a maximizer of the expectation.

We now turn to find the number of times that is needed to guarantee, with high enough probability, that the averages are, indeed, $\delta$-close to the expectations. Given $\varepsilon > 0$ as in the theorem and $i \in A$, let $K_i \ge 1$ be such that: for every $k \ge K_i$ and every sequence of i.i.d. random variables $X_i^1, X_i^2, \ldots, X_i^k$, each with distribution $F_i$,

$$\Pr\left( \left| \frac{1}{k} \sum_{j=1}^{k} (X_i^j - \mu_i) \right| \le \delta \right) \ge (1 - \varepsilon)^{1/(2n)}$$

where $\Pr$ is the measure induced by the distribution $F_i$. Notice that such $K_i$ exists by the strong law of large numbers. (See, for instance, Halmos (1950).) Let $K = \max_{i \in A} K_i$.

We now turn to the construction of the initial aspiration level. As explained above, we would like to be able to assume that the $F_i$’s have bounded supports, in order to guarantee that each act is chosen at least $K$ times. We will therefore find an event with a high enough probability, on which the random variables $x_t$ are, indeed, bounded. We start by finding, for each $i \in A$, bounds $\underline{b}_i, \bar{b}_i \in \mathbb{R}$ such that, for any random variable $X_i$ distributed according to $F_i$,

$$\Pr(\underline{b}_i \le X_i \le \bar{b}_i) \ge (1 - \varepsilon)^{1/(4nK)}$$

where $\Pr$ is some probability measure that agrees with $F_i$. Notice that such bounds exist since $F_i$ has a finite variance. Without loss of generality assume also that $\bar{b}_i > \mu_i + 2\delta$ for all $i \in A$. Next define

$$\underline{b} = \min_{i \in A} \underline{b}_i \qquad \text{and} \qquad \bar{b} = \max_{i \in A} \bar{b}_i.$$

The critical lower bound on the aspiration level (for the “experimentation” period, during which every act is chosen at least $K$ times) is chosen to be

$$R = 2\bar{b} - \underline{b}.$$

Let us define, for every $T \ge 1$, the event

$$B_T = \{\, \omega \in \Omega \mid \forall t \le T,\ \underline{b} \le x_t(\omega) \le \bar{b} \,\}.$$

Notice that, since the given measure $P$ is consistent with $(F_i)_{i \in A}$,

$$P(B_T) \ge (1 - \varepsilon)^{T/(4nK)}.$$
Hence, provided that $T$ is not too large, $B_T$ will have a high enough probability. In order to show that $T$ need not be too large to get enough ($\ge K$) observations of each act, we first show that, on $B_T$ and with a sufficiently high aspiration level, the first $T$ choices are more or less evenly distributed among the acts:

Claim 7.1 Let there be given $T \ge n$, and $\omega \in B_T$. Assume that for all $t \le T$, $H_t(\omega) > R$. Then for all $i, j \in A$ and all $t$ such that $n \le t \le T$,

$$K(\omega, i, t) \le 2K(\omega, j, t).$$

Proof: Assume the contrary, and let $t_0$ be the minimal time $t$ such that $n \le t \le T$ and $K(\omega, i, t_0) > 2K(\omega, j, t_0)$ for some $i, j \in A$. Notice that $K(\omega, a, n) = 1$ for all $a \in A$, hence $t_0 > n$. It follows from minimality of $t_0$ that $a_{t_0 - 1}(\omega) = i$, i.e., that $i$ was the last act chosen. Consider the following bounds on the U values of the two acts:

$$U(\omega, i, t_0 - 1) \le K(\omega, i, t_0 - 1)\,(\bar{b} - H_{t_0 - 1}(\omega))$$
$$U(\omega, j, t_0 - 1) \ge K(\omega, j, t_0 - 1)\,(\underline{b} - H_{t_0 - 1}(\omega)).$$

The optimality of $i$ at stage $t_0 - 1$ implies

$$U(\omega, i, t_0 - 1) \ge U(\omega, j, t_0 - 1),$$
$$K(\omega, i, t_0 - 1)\,(\bar{b} - H_{t_0 - 1}(\omega)) \ge K(\omega, j, t_0 - 1)\,(\underline{b} - H_{t_0 - 1}(\omega)).$$

Recalling that $H_{t_0 - 1}(\omega) > R \ge \bar{b} \ge \underline{b}$, this is equivalent to

$$\frac{K(\omega, j, t_0 - 1)}{K(\omega, i, t_0 - 1)} \ge \frac{\bar{b} - H_{t_0 - 1}(\omega)}{\underline{b} - H_{t_0 - 1}(\omega)}.$$

By minimality of $t_0$ we know that

$$\frac{K(\omega, j, t_0 - 1)}{K(\omega, i, t_0 - 1)} = \frac{1}{2}.$$

We therefore obtain

$$\underline{b} - H_{t_0 - 1}(\omega) \le 2\,(\bar{b} - H_{t_0 - 1}(\omega))$$

which implies

$$H_{t_0 - 1}(\omega) \le 2\bar{b} - \underline{b} = R,$$

a contradiction. □

We now set $T_0 = 2nK$, and will prove that, as long as the aspiration level is kept above $R$, after $T_0$ stages, each act will be chosen at least $K$ times on the event $B_{T_0}$. Formally,

Claim 7.2 Let there be given $\omega \in B_{T_0}$ and assume that $H_t(\omega) > R$ for all $t \le T_0$. Then for $i \in A$,

$$K(\omega, i, T_0) \ge K.$$

Proof: If $K(\omega, i, T_0) < K$ for some $i \in A$, then by Claim 7.1, $K(\omega, j, T_0) < 2K$ for all $j \in A$. Then we get

$$T_0 = \sum_{j \in A} K(\omega, j, T_0) < 2nK = T_0,$$

which is impossible. □
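Claims 7.1 and 7.2 can be checked numerically on a small example. The sketch below is an illustration only, with assumptions that are not in the text: the function name `experiment` is hypothetical, payoffs are drawn uniformly from $[\underline{b}, \bar{b}]$, and the aspiration level is held fixed at a value above $R$ rather than updated adaptively (the claims only require $H_t > R$ for all $t \le T$). Counts are recorded after each period's choice.

```python
import random


def experiment(n, b_lo, b_hi, H, T, seed=0):
    """Run T periods of U-maximization with a fixed aspiration level H.

    Payoffs are drawn uniformly from [b_lo, b_hi] (an illustrative choice).
    Returns, for each period, the tuple of choice counts after that period.
    """
    rng = random.Random(seed)
    S = [0.0] * n   # sum of payoffs obtained from each act
    K = [0] * n     # number of times each act has been chosen
    counts = []
    for _ in range(T):
        U = [S[i] - K[i] * H for i in range(n)]   # cumulative payoff relative to H
        a = max(range(n), key=lambda i: U[i])     # ties go to the lowest index
        S[a] += rng.uniform(b_lo, b_hi)
        K[a] += 1
        counts.append(tuple(K))
    return counts


n, K_target = 4, 25           # number of acts and required observations per act
b_lo, b_hi = 0.0, 1.0         # payoff bounds
R = 2 * b_hi - b_lo           # critical bound on the aspiration level
T0 = 2 * n * K_target         # length of the experimentation period
counts = experiment(n, b_lo, b_hi, H=R + 1.0, T=T0)

# Claim 7.1: once every act has been tried, no act is chosen more than
# twice as often as any other.
balanced = all(max(c) <= 2 * min(c) for c in counts[n - 1:])
# Claim 7.2: after T0 periods every act has been chosen at least K_target times.
enough = min(counts[-1]) >= K_target
```

Both `balanced` and `enough` hold for any seed, since the argument in the proofs does not depend on the realized payoffs, only on their bounds.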
We finally turn to choose the required initial aspiration level. Choose a value

$$H_0 = H_0(\varepsilon) > \underline{b} + 2\left(\frac{1}{\alpha}\right)^{T_0} (\bar{b} - \underline{b})$$

and let us assume for the rest of the proof that $H_1 \ge H_0$. We verify that this bound is sufficiently high in the following:

Claim 7.3 Let there be given $\omega \in B_{T_0}$, and assume that $H_1 \ge H_0$. Then for all $t \le T_0$, $H_t(\omega) > R$.

Proof: For all $1 < t \le T_0$,

$$H_t(\omega) \ge \alpha H_{t-1}(\omega) + (1 - \alpha)\underline{b}$$

or

$$H_t(\omega) - \underline{b} \ge \alpha\,(H_{t-1}(\omega) - \underline{b}).$$

Hence

$$H_t(\omega) - \underline{b} \ge \alpha^t (H_1 - \underline{b}) > 2\alpha^t \left(\frac{1}{\alpha}\right)^{T_0} (\bar{b} - \underline{b}) \ge 2(\bar{b} - \underline{b})$$
$$H_t(\omega) > 2\bar{b} - \underline{b} = R. \qquad \square$$

Combining the above, we conclude that, for $H_1 \ge H_0$, $K(\omega, i, T_0) \ge K$ for all $\omega \in B_{T_0}$ and all $i \in A$. Furthermore, for a measure $P$, consistent with $(F_i)_{i \in A}$,

$$P(B_{T_0}) \ge (1 - \varepsilon)^{T_0/(4nK)} = (1 - \varepsilon)^{1/2}.$$

We now define the event on which the limit frequency of the expected-utility maximizing acts is 1: let $B \subset B_{T_0}$ be defined by

$$B = \{\, \omega \in B_{T_0} \mid \forall t \ge T_0,\ \forall i \in A,\ |X(\omega, i, t) - \mu_i| < \delta \,\}.$$

By the choice of $K$ and the independence assumption, we conclude that $P(B \mid B_{T_0}) \ge (1 - \varepsilon)^{1/2}$. Hence $P(B) \ge 1 - \varepsilon$. The proof of the theorem will therefore be complete if we prove the following:

Claim 7.4 Assume that $H_1 \ge H_0$ and let $P$ be a measure on $(\Omega(\alpha, H_1), \Sigma(\alpha, H_1))$ that is consistent with $(F_i)_{i \in A}$. Then, for $P$-almost all $\omega$ in $B$,

$$\exists f(\omega, I) = 1.$$

(Recall that $I = \arg\max_{i \in A} \mu_i$.)

Proof: Given $\omega \in B$ and $\xi > 0$, we wish to show that, unless $\omega$ is in a certain $P$-null event (to be specified later), there exists a $T = T(\omega, \xi)$ such that for all $t \ge T$,

$$f(\omega, I, t) \ge 1 - \xi.$$

It is sufficient to find a $T = T(\omega, \xi)$ such that for some $i \in I$, for all $t \ge T$, and for all $j \notin I$,

$$\frac{K(\omega, j, t)}{K(\omega, i, t)} = \frac{f(\omega, j, t)}{f(\omega, i, t)} \le \frac{\xi}{n(1 - \xi)}.$$

We remind the reader that for all $t \ge T_0$ and all $a \in A$ we have

$$|X(\omega, a, t) - \mu_a| \le \delta.$$

Also, since $H_{T_0}(\omega) > R > \mu_1$, for all $t \ge T_0$ we have

$$H_t(\omega) > \mu_1 - \delta = \mu_{r+1} + 2\delta.$$
That is, the aspiration level will be adjusted towards the average performance of one of the expected-utility maximizing acts, and will be bounded away from the expected utility and from the average performance value of sub-optimal acts.

We will need a uniform bound on $H_t(\omega)$. To this end, note that for all $a \in A$ and $t \le T_0$, $X(\omega, a, t) < R$, by definition of the set $B_{T_0}$. For $t \ge T_0$, the same inequality holds since $X(\omega, a, t) < \mu_a + \delta < \bar{b}_a \le R$. Since $H_{t+1}(\omega)$ is a convex combination of $H_t(\omega)$ and $X(\omega, t) = \max_{a \in A} X(\omega, a, t) < R$, we conclude that for all $t \ge 1$, $H_{t+1}(\omega) \le \max\{H_t(\omega), R\}$. By induction, it follows that for all $t \ge 1$, $H_t(\omega) \le H_1$.

Let $O(\omega) \subset A$ be the set of acts that are chosen infinitely often at $\omega$. That is,

$$O(\omega) = \{\, a \in A \mid K(\omega, a, t) \to \infty \text{ as } t \to \infty \,\}.$$

We would first like to establish the fact that some expected utility maximizing acts are indeed chosen infinitely often. Formally,

Claim 7.5 $O(\omega) \cap I \ne \emptyset$.

Proof: Let $\tilde{T} \ge T_0$ be such that for all $t \ge \tilde{T}$, $a_t(\omega) \in O(\omega)$. Assume the contrary, i.e., that $O(\omega) \cap I = \emptyset$. (In particular, $a_t(\omega) \notin I$ for all $t \ge \tilde{T}$.) For all $t \ge \tilde{T} \ge T_0$ we also know that $X(\omega, j, t) < H_t(\omega) - \delta$ for all $j \notin I$. Hence, for $j \notin I$,

$$U(\omega, j, t) = K(\omega, j, t)\,V(\omega, j, t) = K(\omega, j, t)\,[X(\omega, j, t) - H_t(\omega)] < -\delta K(\omega, j, t).$$

This implies that $U(\omega, j, t) \to -\infty$ as $K(\omega, j, t) \to \infty$. Thus, for all $j \in O(\omega) \setminus I$,

$$U(\omega, j, t) \to -\infty \quad \text{as } t \to \infty.$$

On the other hand, consider some $i \in I \subset (O(\omega))^c$. Let $L$ satisfy $L > K(\omega, i, t)$ for all $t \ge 1$. Then

$$U(\omega, i, t) = K(\omega, i, t)\,V(\omega, i, t) = K(\omega, i, t)\,[X(\omega, i, t) - H_t(\omega)] > L\,(\underline{b} - H_1).$$

It is therefore impossible that only members of $I^c$ would be U-maximizers from some $\tilde{T}$ on. □

We now assume that for all $a \in O(\omega)$, $X(\omega, a, t) \to \mu_a$ as $t \to \infty$. By the strong law of large numbers, this is the case for all $\omega \in B$ apart from a $P$-null set. Choose $\zeta > 0$ such that

$$\zeta < \frac{\xi \delta}{6n(1 - \xi)}$$

and let $T_1 \ge T_0$ be such that for all $t \ge T_1$ and all $i \in O(\omega) \cap I$,

$$|X(\omega, i, t) - \mu_1| < \zeta.$$
For all $t \ge T_1$ we also conclude that

$$|X(\omega, t) - \mu_1| < \zeta$$

(where, as above, $X(\omega, t) = \max_{a \in A} X(\omega, a, t)$). It follows that the aspiration level, $H_{t+1}(\omega)$, which is adjusted to be some average of its previous value $H_t(\omega)$ and $X(\omega, t)$, will also converge to $\mu_1$. To be precise, there is $T_2 \ge T_1$ such that for all $t \ge T_2$,

$$|H_t(\omega) - \mu_1| < 2\zeta.$$

We wish to show that there exists $T(\omega, \xi)$ such that for all $t \ge T(\omega, \xi)$, all $i \in O(\omega) \cap I$ and all $j \notin I$ the following holds:

$$\frac{K(\omega, j, t)}{K(\omega, i, t)} \le \frac{\xi}{n(1 - \xi)}.$$

It will be helpful to start with:

Claim 7.6 For all $t \ge T_2$, all $i \in O(\omega) \cap I$ and all $j \notin I$, if $a_t(\omega) = j$, then

$$K(\omega, j, t) < \frac{\xi}{2n(1 - \xi)}\,K(\omega, i, t).$$

Proof: Let there be given $t$, $i$ and $j$ as above. Observe that

$$U(\omega, i, t) = K(\omega, i, t)\,V(\omega, i, t) = K(\omega, i, t)\,[X(\omega, i, t) - H_t(\omega)] \ge -3K(\omega, i, t)\,\zeta$$

while

$$U(\omega, j, t) = K(\omega, j, t)\,V(\omega, j, t) = K(\omega, j, t)\,[X(\omega, j, t) - H_t(\omega)] \le -K(\omega, j, t)\,\delta.$$

The fact that $a_t(\omega) = j$ implies that $U(\omega, j, t) \ge U(\omega, i, t)$. Hence

$$-K(\omega, j, t)\,\delta \ge -3K(\omega, i, t)\,\zeta$$
$$K(\omega, j, t) \le \frac{3\zeta}{\delta}\,K(\omega, i, t).$$

However, the choice of $\zeta$ (as smaller than $\frac{\xi \delta}{6n(1 - \xi)}$) implies that

$$\frac{3\zeta}{\delta} < \frac{\xi}{2n(1 - \xi)}.$$

We have thus established that

$$K(\omega, j, t) < \frac{\xi}{2n(1 - \xi)}\,K(\omega, i, t)$$

for any $t$ at which $j$ is chosen (i.e., $a_t(\omega) = j$). □

We proceed as follows: let $T_3 \ge T_2$ be such that for all $t \ge T_3$, $a_t(\omega) \in O(\omega)$. Let $T_4 \ge T_3$ be large enough so that for all $t \ge T_4$, $a \in O(\omega)$ and $c \notin O(\omega)$,

$$K(\omega, c, t) \le \frac{\xi}{n(1 - \xi)}\,K(\omega, a, t).$$

Finally, let $T_5 > T_4$ be such that for all $a \in O(\omega)$, $K(\omega, a, T_5) > K(\omega, a, T_2)$. We now have

Claim 7.7 For all $t \ge T_5$, all $i \in O(\omega) \cap I$ and all $j \notin I$,

$$K(\omega, j, t) \le \frac{\xi}{n(1 - \xi)}\,K(\omega, i, t).$$

Proof: Let there be given $t$, $i$ and $j$ as above. If $j \notin O(\omega)$, the choice of $T_4$ concludes the proof. Assume, then, that $j \in O(\omega)$. Then, by choice of $T_5$, $j$ has been chosen since $T_2$. That is,

$$T_j^t \equiv \{\, s \mid T_2 \le s < t,\ a_s(\omega) = j \,\} \ne \emptyset.$$
Let $s$ be the last time at which $j$ was chosen before time $t$, i.e., $s = \max T_j^t$. Note that

$$K(\omega, j, t) = K(\omega, j, s + 1) \qquad \text{and} \qquad K(\omega, i, t) \ge K(\omega, i, s + 1).$$

Hence it suffices to show that

$$K(\omega, j, s + 1) \le \frac{\xi}{n(1 - \xi)}\,K(\omega, i, s + 1).$$

By Claim 7.6 we know that

$$K(\omega, j, s) \le \frac{\xi}{2n(1 - \xi)}\,K(\omega, i, s).$$

Since $s \ge T_2 \ge T_0$, $K(\omega, j, s) \ge K \ge 1$. This implies $\frac{\xi}{2n(1 - \xi)}\,K(\omega, i, s) \ge 1$. Next, observe that

$$K(\omega, j, s + 1) = K(\omega, j, s) + 1 \le \frac{\xi}{2n(1 - \xi)}\,K(\omega, i, s) + 1$$
$$\le \frac{\xi}{2n(1 - \xi)}\,K(\omega, i, s) + \frac{\xi}{2n(1 - \xi)}\,K(\omega, i, s)$$
$$= \frac{\xi}{n(1 - \xi)}\,K(\omega, i, s) = \frac{\xi}{n(1 - \xi)}\,K(\omega, i, s + 1).$$

This concludes the proof of Claim 7.7. □

Thus $T_5$ may serve as the required $T(\omega, \xi)$. As a matter of fact, our claim regarding $T_5$ is slightly stronger than what we need to prove regarding $T(\omega, \xi)$. The latter should have the inequality of Claim 7.7 satisfied for some $i \in I$, while the former satisfies it for all $i \in O(\omega) \cap I$, and Claim 7.5 guarantees that this set indeed contains some $i \in I$. At any rate, Claim 7.7 completes the proof of Claim 7.4, which, in turn, completes the proof of the theorem. □

Proof of Theorem 7.2:

The general idea of the proof, as well as the proof itself, is quite simple: as long as the aspiration level is close to the average performance of an expected utility maximizing act, the proof mimics that of Theorem 7.1. The problem is that the decision maker may “lock in” on sub-optimal acts, which may be almost-satisficing or even satisficing, and not try the optimal acts frequently enough. However, the fact that the decision maker is ambitious infinitely often (in the sense of setting the aspiration level beyond the maximal average performance) guarantees that this will not be the case. Thus, the fact that $N_A$ is infinite ensures that every act will be chosen infinitely often. On the other hand, the fact that it is sparse implies that these periods of ambitiousness will not change the limit frequencies obtained in the proof of Theorem 7.1.

In the formal proof it will prove convenient to take the following steps: we will restrict our attention to the event at which all acts that are chosen infinitely often have a limit average performance close to their expectation. On this event we will show that the expected utility maximizers among those acts have a limit choice frequency of 1. Finally, we will show that all acts are chosen infinitely often, whence the result follows.

We adopt some notation from the proof of Theorem 7.1. In particular, assume that for some $r < n$,

$$\mu_1 = \mu_2 = \cdots = \mu_r > \mu_{r+1} \ge \mu_{r+2} \ge \cdots \ge \mu_n$$

and denote

$$I = \arg\max_{i \in A} \mu_i = \{1, 2, \ldots, r\}.$$

We will also use

$$O(\omega) = \{\, a \in A \mid K(\omega, a, t) \to \infty \text{ as } t \to \infty \,\}$$

and the new notation

$$I(\omega) = \arg\max\{\, \mu_i \mid i \in O(\omega) \,\}.$$

We would like to focus on the event

$$B = \{\, \omega \in \Omega \mid \forall i \in O(\omega),\ X(\omega, i, t) \to \mu_i \text{ as } t \to \infty \,\}.$$

Since $A$ is finite, the strong law of large numbers guarantees that $P(B) = 1$ for any consistent $P$. Thus it suffices to show that for every $\omega \in B$, $f(\omega, I) = 1$. We do this in two steps: we first show that $f(\omega, I(\omega)) = 1$, and then that $I(\omega) = I$.

Claim 7.8 For all $\omega \in B$, $\exists f(\omega, I(\omega)) = 1$.

Proof: Let there be given $\omega \in B$, and denote $\mu = \mu_i$ for some $i \in I(\omega)$. Given the proof of Claim 7.4 in Theorem 7.1, it suffices to show that for every $\zeta > 0$,

$$|H_t(\omega) - \mu| < \zeta$$

holds for all $t \notin N_0$, where $N_0 \subset \mathbb{N}$ is sparse.

Let $\zeta > 0$ be given, and assume without loss of generality that $\zeta < \delta = \frac{\mu - \mu_i}{3}$ for all $i \notin I(\omega)$ and that $\zeta < h$. Let $T_1$ be such that for all $t \ge T_1$ and all $i \in O(\omega)$,

$$|X(\omega, i, t) - \mu_i| < \zeta/2.$$

Let $T_2 \ge T_1$ be such that for all $t \ge T_2$, $i \in O(\omega)$ and $j \notin O(\omega)$, $X(\omega, i, t) > X(\omega, j, t)$. Thus, for $t \ge T_2$, if $t \notin N_A$, $H_t(\omega)$ is adjusted towards $X(\omega, t)$, which equals $X(\omega, i, t)$ for some $i \in O(\omega)$, where the latter is close to $\mu$.
Since for $t \in N_A$, $H_t(\omega)$ is set to $X(\omega, t) + h$, there exists $T_3 \ge T_2$ such that for all $t \ge T_3$,

$$|H_t(\omega) - \mu| < 2h.$$

We now wish to choose a number $k$ such that any sequence of $k$ periods following $T_3$, at which $H_t(\omega)$ is adjusted “realistically”, that is, as an average of $H_{t-1}(\omega)$ and $X(\omega, t)$, will guarantee that it ends up $\zeta$-close to $\mu$. Let $k > \log_\alpha(\zeta/4h)$. Define

$$N_A \oplus k = \{\, t \in \mathbb{N} \mid t = t_1 + t_2 \text{ where } t_1 \in N_A \text{ and } 0 \le t_2 \le k \,\}.$$

Note that for $t \ge T_3$, if $t \notin N_A \oplus k$, i.e., if $t$ is at least $k$ periods after the most recent “ambitious” update, we have

$$|H_t(\omega) - \mu| < \zeta.$$

Setting $N_0 = (N_A \oplus k) \cup \{1, \ldots, T_2\}$ (and noting that it is sparse) completes the proof. □

Claim 7.9 For all $\omega \in B$, $I(\omega) = I$.

Proof: It suffices to show that $O(\omega) = A$ for all $\omega \in B$. Assume, to the contrary, that for some $\omega \in B$, $j \in A$ and $L \ge 1$, $K(\omega, j, t) \le L$ for all $t \ge 1$. Let $i \in O(\omega)$. For any $t \in N_A$,

$$U(\omega, i, t) = K(\omega, i, t)\,V(\omega, i, t) = K(\omega, i, t)\,[X(\omega, i, t) - H_t(\omega)] < -hK(\omega, i, t).$$

Let $T_4 \ge T_3$ be such that for all $t \ge T_4$, $a_t(\omega) \ne j$. Recall that for all $t \ge T_4$, $H_t(\omega) < \mu + 2h$. Consider $t \in N_A$ such that $t \ge T_4$. Then

$$U(\omega, j, t) = K(\omega, j, t)\,V(\omega, j, t) = K(\omega, j, t)\,[X(\omega, j, t) - H_t(\omega)] > -LC$$

where $C = \mu + 2h - X(\omega, j, T_3)$. That is, $U(\omega, j, t)$ is bounded from below. Since for a large enough $t \in N_A$, $U(\omega, i, t)$ is arbitrarily small (that is, arbitrarily negative) for all $i \in O(\omega)$, we obtain a contradiction to U-maximization. Thus we conclude that $O(\omega) = A$. This concludes the proof of the claim and of the theorem. □

19 Learning the Similarity Function

19.1 Examples

Sarah wonders which of two new movies, a and b, she should see tonight. Let us consider her choice given two memories. In the first, $M_1$, there is only one case: Jim saw movie a and liked it. In the second, $M_2$, there are several cases involving other movies (but not a or b).
It turns out that there are many movies that, in M2, were seen by both Sarah and Jim, but none that they both liked. Given memory M1, Sarah is likely to decide to see movie a. After all, all that she knows is that Jim liked it, and this is more than can be said of movie b. Given memory M2, in which neither a nor b appears, Sarah is likely to be indifferent between the two movies. But given the union of M1 and M2, Sarah may conclude that she has no chance of liking movie a, precisely because Jim did like it, and opt for movie b.

This is a direct violation of the combination axiom of Chapter 3.³ It is therefore a phenomenon that cannot be captured by W-maximization. The reason is that the combination axiom implicitly assumes that the algorithm by which one learns from cases and the support that a case c provides to choice a do not change from one memory to another. But in the example above one learns how to learn from experience. Specifically, memory M2 suggests that Sarah and Jim have very different tastes, and the fact that he liked a movie (M1) should detract from its desirability in Sarah’s eyes, rather than enhance it.

³ This example is based on an example suggested to us by Sujoy Mukerji.

To consider another example, assume that Mary has to choose a car. Cars are either green or blue, and they are of make 1 or 2. Mary had a good experience with a green car of make 1 (G1), and a bad experience with a blue car of make 2 (B2). Now she has to choose between a green car of make 2 (G2) and a blue one of make 1 (B1). Both are similar to some degree to both past cases. Specifically, each new choice shares the color attribute with one case and the make attribute with the other. Hence the choice between G2 and B1 entails an implicit ranking of the attributes in the similarity judgment. Indeed, if Mary is three years old, she is likely to base her similarity judgment mostly on color.
But if Mary is thirty years old, she is likely to put more weight on the make of the car than on its color in judging similarity. Namely, she would consider G2 similar to B2 and B1 similar to G1, and prefer B1 to G2. The reason that the adult Mary puts more weight on the make attribute than on the color attribute is that past experience shows that the make of the car is a better predictor of quality than the car’s color. In other words, past cases are used to define the similarity function, which, in turn, is used to learn from these cases about the future.

For concreteness, assume that Mary has six cases in her memory. In the first four she has used red (R) and white (W) cars both of make 3 and of make 4. The last two are cases G1 and B2 discussed above. We would like to consider her preference between G2 and B1 given three different memories. In the table below, each column corresponds to a problem-act pair, and each row describes a possible history, specifying the utility of the outcome that was experienced in each case. Observe that Mary cannot choose among rows in this table. Rather, given a possible memory, represented by a row, she is asked to choose between G2 and B1.

Possible memories (rows) and problem-act pairs (columns):

|       | R3 | R4 | W3 | W4 | G1 | B2 |
| x     |  1 | -1 |  1 | -1 |  1 | -1 |
| y     |  1 |  1 | -1 | -1 | -1 |  1 |
| x + y |  2 |  0 |  0 | -2 |  0 |  0 |

In memory x, past cases with other makes (3 and 4) and other colors (R and W) can be summarized by the observations that cars of make 3 are good, and that cars of make 4 are bad, regardless of color. This indicates that make is a more important attribute than is color. Further, this example corresponds to the history we started with: G1 was a good experience, while B2 was a bad one. Thus it makes sense that history x would give rise to preferring B1 to G2.

Memory y, by contrast, differs in two ways. First, the first four cases are consistent with the simple theory that all red cars are good, and all white cars are not, regardless of the make.
Second, the experiences in G1 and in B2 are the opposite of their counterparts in the story above: now G1 ended up in a bad experience while B2 resulted in success. Thus memory y supports the child’s similarity function, and this, in turn, favors B1 over G2, since B1 is of the same color as the car involved in the good experience B2.

Consider now a hypothetical memory generated by the sum of the utility vectors, x + y. This summation corresponds to a memory from which one cannot infer much about the relative importance of the color and make attributes. Moreover, the history of G1 and B2 makes them completely equivalent. Hence, even if one knew what similarity function to use, by symmetry one would have to be indifferent between G2 and B1.

Thus, under both x and y, a reasonable decision maker such as Mary would prefer B1 to G2. But given the memory x + y, she is indifferent between them. Assuming that the payoff numbers are measured on a utility scale, this is a contradiction to U′-maximization. Specifically, the formula

\[ U'(a) = U'_{p,M}(a) = \sum_{(q,b,r) \in M} s((p,a),(q,b))\, u(r) \qquad (\bullet) \]

implies that if a is preferred to b given a set of cases $\{(q_i, b_i, r_i)\}_i$, and if a is preferred to b given the cases $\{(q_i, b_i, r_i')\}_i$, then the same preference should be observed given the cases $\{(q_i, b_i, r_i'')\}_i$ when $u(r_i'') = u(r_i) + u(r_i')$. This additivity condition trivially holds when the similarity function is fixed, but it need not hold when the similarity function is learned from experience.

Generally, we distinguish between two levels of inductive reasoning in the context of CBDT. “First-order” induction is the process by which similar past cases are implicitly generalized to bear upon future cases, and, in particular, to affect the decision in a new problem. CBDT, as presented above, attempts to model this process, if only in a rudimentary way.
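The additivity condition implied by formula (•) can be checked numerically on Mary's three memories. The following Python sketch is only an illustration: the attribute-matching similarity function and all numerical weights are assumptions introduced here, not part of the theory as stated. It shows that, for a fixed similarity function, the advantage of B1 over G2 adds up across memories (so B1 cannot be strictly preferred under both x and y), whereas memory-dependent weights can yield a strict preference for B1 under x and under y together with indifference under x + y.

```python
# Cases in Mary's memory and the utilities recorded in each possible memory
# (rows x, y and x+y of the table of problem-act pairs).
CASES = ["R3", "R4", "W3", "W4", "G1", "B2"]
MEMORIES = {
    "x":   [1, -1,  1, -1,  1, -1],
    "y":   [1,  1, -1, -1, -1,  1],
    "x+y": [2,  0,  0, -2,  0,  0],
}


def similarity(car_a, car_b, w_color, w_make):
    """Hypothetical similarity: one weight per shared attribute (color, make)."""
    same_color = car_a[0] == car_b[0]
    same_make = car_a[1] == car_b[1]
    return w_color * same_color + w_make * same_make


def u_prime(act, memory, w_color, w_make):
    """U'(act): similarity-weighted sum of the utilities of past cases."""
    return sum(
        similarity(act, case, w_color, w_make) * utility
        for case, utility in zip(CASES, MEMORIES[memory])
    )


def advantage_of_b1(memory, w_color, w_make):
    """U'(B1) - U'(G2); a positive value means B1 is preferred to G2."""
    return (u_prime("B1", memory, w_color, w_make)
            - u_prime("G2", memory, w_color, w_make))


# Fixed similarity (assumed weights): the advantage is additive across memories.
fixed = {m: advantage_of_b1(m, w_color=0.3, w_make=0.7) for m in MEMORIES}

# Learned similarity (assumed weights): x suggests that make matters,
# y suggests that color matters, and x+y is uninformative.
LEARNED_WEIGHTS = {"x": (0.2, 0.8), "y": (0.8, 0.2), "x+y": (0.5, 0.5)}
learned = {m: advantage_of_b1(m, *LEARNED_WEIGHTS[m]) for m in MEMORIES}

print("fixed similarity:  ", fixed)
print("learned similarity:", learned)
```

With the fixed weights the advantages under x and y have opposite signs, which is exactly why a single similarity function cannot rationalize preferring B1 in both memories.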
“Second-order” induction refers to the process by which the decision maker learns how to conduct first-order induction, namely, how to judge similarity of problems. The current version of CBDT does not model this process. Moreover, it implicitly assumes that it does not take place at all.

19.2 Counter-example to U-maximization

It may be insightful to explicitly see how second-order induction may also violate U-maximization. Consider the following example: this afternoon, John wants to order tickets to a concert, and he can do it either by phone (P) or by accessing a Web site (W). He would like to minimize the time and hassle of the process. John’s experience is as follows. He has once ordered tickets to a ballet by phone (at a different hall) in the afternoon, and the procedure was short and efficient. He has also ordered tickets to a concert (at the same hall) by the Web, and this experience was also painless, but then John made the transaction at night.

John is aware of the fact that different halls vary in terms of the phone and internet services they offer. He is also aware of the fact that the time of day may greatly affect his waiting time. Thus, the decision problem boils down to the following question: should John assume that ordering tickets for the concert hall in the afternoon is more similar to doing so at night, or to ordering tickets for a ballet in the afternoon? That is, which feature of the problem is more important: the hall where the show is performed or the time of day?

In the past, John has also ordered tickets, both by phone and through the Web, for movie theaters. He has plenty of experience with movie theaters 1 and 2, where he ordered tickets in the morning and in the evening, with varying degrees of success. John does not view his experience with movie theaters as relevant to the concert hall. But this experience might change the way he evaluates similarities.
For concreteness, let us compare two scenarios, both represented by the matrix below. In this matrix, columns stand for decision problems encountered in the past, whereas rows – for acts chosen. The problems are assumed distinct. “M1” stands for ordering tickets to movie theater 1 in the morning; “E2” – for movie theater 2 in the evening, and so forth. “AB” stands for ordering tickets to the ballet in the afternoon, and “NC” – to the concert hall at night. The numerical entries in the matrix represent the payoffs obtained when an act was chosen in a given problem, that is, the values of the function u, and blank entries mean that the act corresponding to the row was not chosen in the problem corresponding to the column. For the purposes of U -maximization, we may think of the blank entries as zero payoffs. In the first scenario, John’s experience with phone orders is represented by the vector x, where 1 stands for short wait, −1 – for long wait, and his experience with Web orders is represented by the vector y. In the second scenario, the history of phone orders is given by the vector z, and that of Web orders – by w. The vector d does not represent an act in our story and will only be used to facilitate the algebraic description. Observe that it would be incoherent to compare, say, the vectors x and z, since both have non-blank values in certain columns, whereas only one act can be chosen in a given past problem. By contrast, vectors x and y can be the histories of two different acts in a given memory, and so can vectors z and w. Consider the first scenario, where John has to choose between using the phone, with history x, and using the Web, with history y. John’s experience with movie theaters clearly indicates that the morning is a better time to place an order than is the evening. This particular fact is not of great help to John, who has to order tickets in the afternoon. 
Moreover, John is now ordering tickets to a concert, rather than to a movie, and it may well be the case that movie fans are on a different schedule than are concert audiences. Still, John might infer that the time of day is a more important attribute than is the particular establishment.

John’s experience with halls includes a success with a phone call made in the afternoon for ballet tickets, and a success with a Web order made at night for concert tickets. The former might be taken to suggest that the afternoon is a good time to call, favoring P over W. The latter tentatively offers the generalization that the concert hall Web system works fine, favoring W over P. Which of the two recommendations is more convincing? Given the experience with movie theaters, which supports generalizations based on the time of day rather than on the establishment, John is likely to give more weight to the afternoon order at the ballet and prefer a phone call, namely, choose an act with history x over one with history y.

By contrast, in the second scenario John faces a choice between an act with history z and an act with history w. Both histories suggest that movie theater 1 has an efficient ordering service, whereas movie theater 2 does not. Again, this information per se is not very useful when ordering tickets to a concert, which is in a hall having nothing to do with either movie theater. But this piece of information also indicates that establishments do differ from each other, while the time of day is of little or no importance. This conclusion would imply that John should rely on his night experience at the concert hall more than on his afternoon experience at the ballet, and opt for a Web order, namely, choose an act with history w over one with history z.

Act profiles (rows) and problems (columns):

|       | M1 | M2 | E1 | E2 | M1 | M2 | E1 | E2 | AB | NC |
| x (P) |  1 |  1 |    |    |    |    | -1 | -1 |  1 |    |
| y (W) |    |    | -1 | -1 |  1 |  1 |    |    |    |  1 |
| z (P) |  1 |    |  1 |    |    | -1 |    | -1 |  1 |    |
| w (W) |    | -1 |    | -1 |  1 |    |  1 |    |    |  1 |
| d     |    | -1 |  1 |    |    | -1 |  1 |    |    |    |
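The role of the auxiliary vector d in the matrix above can be checked numerically. The sketch below is illustrative only: blank entries are encoded as zeros, the placement of the non-blank entries follows one reading of the matrix (an assumption), the suffixes "a" and "b" on the column names are hypothetical labels that distinguish the two groups of movie-theater problems, and the weight vector s stands in for an arbitrary fixed similarity between each past problem and the current one.

```python
import random

# Past problems (columns). The "a"/"b" suffixes are hypothetical labels for
# the two groups of distinct movie-theater problems.
PROBLEMS = ["M1a", "M2a", "E1a", "E2a", "M1b", "M2b", "E1b", "E2b", "AB", "NC"]

# Act histories, with blank entries encoded as 0 (assumed placement).
x = [1,  1,  0,  0, 0,  0, -1, -1, 1, 0]   # phone, first scenario
y = [0,  0, -1, -1, 1,  1,  0,  0, 0, 1]   # Web, first scenario
z = [1,  0,  1,  0, 0, -1,  0, -1, 1, 0]   # phone, second scenario
w = [0, -1,  0, -1, 1,  0,  1,  0, 0, 1]   # Web, second scenario
d = [0, -1,  1,  0, 0, -1,  1,  0, 0, 0]   # auxiliary vector


def difference(u, v):
    """Entry-wise difference of two history vectors."""
    return [a - b for a, b in zip(u, v)]


def U(history, s):
    """Similarity-weighted sum of past payoffs for a fixed weight vector s."""
    return sum(weight * payoff for weight, payoff in zip(s, history))


# The two scenarios differ by the same vector d for both acts.
assert difference(z, x) == d and difference(w, y) == d

# Hence, for any fixed similarity weights, the phone-versus-Web gap is the
# same in both scenarios.
random.seed(0)
for _ in range(1000):
    s = [random.random() for _ in PROBLEMS]
    gap_first = U(x, s) - U(y, s)
    gap_second = U(z, s) - U(w, s)
    assert abs(gap_first - gap_second) < 1e-9

print("For every sampled s, U(x) - U(y) equals U(z) - U(w).")
```

Because U is linear in the history vector once s is fixed, the check succeeds for every sampled s; only a similarity function that changes with the memory can separate the two scenarios.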
However, this preference pattern is inconsistent with U-maximization for any fixed similarity function s. Indeed, for every s, since z − x = w − y = d, we have

\[ U(z) - U(x) = U(w) - U(y) = U(d), \]

implying

\[ U(x) - U(y) = U(z) - U(w). \]

That is, x is preferred to y if and only if z is preferred to w, in contradiction to the preference pattern we motivated above.

Similar examples may be constructed, in which the violation of U-maximization stems from learning that certain values of an attribute are similar, rather than that the attribute itself is of importance. For example, instead of learning that the establishment is an important factor, John may simply learn that the concert hall is similar to the ballet theater.

In the previous subsection we generated a counter-example to W-maximization by a direct violation of the combination axiom. For concreteness we have also constructed counter-examples to U′- and to U-maximization. Indeed, all these functions are additively separable across cases, whereas second-order induction renders the weight of a set of cases a non-additive set function. Several cases, taken in conjunction, may implicitly suggest a rule. Hence the effect of all of them together may differ from the sum of effects of each one, considered separately. Differently put, the “marginal contribution” of a case to preferences depends not only on the case itself, but also on the other cases it is lumped together with.⁴

⁴ A possible generalization of CBDT functions that may account for this phenomenon involves the use of non-additive measures, where aggregation of utility is done by the Choquet integral. (See Schmeidler (1989).)

19.3 Learning and expertise

The distinction between the two levels of induction may also help classify types of learning and of expertise.
A case-based decision maker learns in two ways: first, by introducing more cases into memory; second, by refining the similarity function based on past experiences. Knowing more is but one aspect of learning. Making better use of known cases is another. Correspondingly, the notion of expertise combines knowledge of cases and the ability to focus on the relevant analogies.

Knowledge of cases is relatively objective. By contrast, knowledge of the similarity function is inherently subjective.⁵ Correspondingly, it is easier to compare people’s knowledge of cases than it is to compare their knowledge of similarity. It is natural to define a relation “knows more than” when cases are considered: a person (or a system) knows more cases than another if the memory of the former is a superset of that of the latter. The concept of “knowing more” is more elusive when applied to the similarity function. In hindsight, a similarity function that leads to better decision making can be viewed as better or more precise. But it is difficult to provide an a priori formal definition of “having a better similarity function”. It is relatively easy to find that an expert has a vast database to draw upon. It is harder to know whether she makes the “right” analogies when using it.

⁵ True, even “objective” cases and alleged “facts” may be described, construed, and interpreted in different ways. But they are still closer to objectivity than are similarity judgments.

In Section 13 we discussed the two roles that rules may play in a case-based knowledge representation system. These two roles correspond to the two types of knowledge and to the two levels of induction. The first role, of summarizing many cases, conveys knowledge of cases. First-order induction suffices to formulate such rules: given a similarity function, one generates a rule by lumping similar cases together.⁶ The second role, of drawing attention to similarity among cases, conveys knowledge of similarity. One needs to engage in second-order induction to formulate such rules: the similarity has to be learned in order to observe the regularity that the rule should express.

⁶ Notice, however, that this is first-order explicit induction, that is, a process that generates a general rule, as opposed to the implicit induction performed by CBDT.

These distinctions may have implications for the implementation of computerized systems. A case-based expert system has to represent both types of knowledge. As a result of programming necessity, cases will typically be represented by a database, whereas similarity judgments might be implicit in the software. The discussion above suggests that such a distinction is also conceptually desirable. For instance, one may wish to use one expert’s knowledge of cases with another expert’s similarity judgments. It might therefore be useful to represent cases in a way that is independent of the similarity function.

Finally, we remind the reader that in Section 13 we argued that case-based systems enjoy a theoretical advantage when compared to rule-based systems: case-based knowledge representation incorporates modifications in a smooth way. This advantage exists also in the presence of second-order induction. Second-order induction may induce updates of similarity values, and this may lead to different decisions based on the same set of cases. But this process does not pose any theoretical difficulties such as those entailed by explicit induction.

20 Two Views of Induction: CBDT and Simplicism

20.1 Wittgenstein and Hume

Case-based decision making involves implicit first- and second-order induction. To what extent can it serve as a model of inductive reasoning or inductive inference? How does it compare with explicit induction, namely, with the formation of rules or general theories?
This section is devoted to a few preliminary thoughts on these questions.

How do people use past cases to extrapolate the outcomes of present circumstances? Wittgenstein (1922, 6.363) suggested that people tend to engage in explicit induction: “The procedure of induction consists in accepting as true the simplest law that can be reconciled with our experiences.”⁷ The notion of simplicity is rather vague and subjective. (See, for instance, Sober (1975) and Gärdenfors (1990).) Gilboa (1994) suggests employing Kolmogorov’s complexity measure for the definition of simplicity. In this model, a theory is a program that can generate a prediction for every instance it accepts as input. It is then argued, as a descriptive theory of inductive inference, that people tend to choose the shortest program that conforms to their observations. This theory is referred to as “simplicism”. Its prediction is well defined given a programming language. Yet the choice of language is no less subjective than the notion of simplicity. Indeed, simplicism merely translates the choice of a complexity measure to the choice of the appropriate programming language.

⁷ Observe that this statement has a descriptive flavor, as opposed to the normative flavor of Occam’s razor.

By contrast, our starting point is Hume’s claim that “from similar causes we expect similar effects”. That is, Hume offers an alternative descriptive theory of inductive reasoning, namely, that the process of implicit induction is based on the notion of similarity. Thus, Hume may be described as referring to implicit induction, based on the (vague and subjective) notion of similarity, whereas Wittgenstein may be viewed as defining explicit induction, based on the (vague and subjective) notion of simplicity.

It is tempting to contrast the two views of induction. One way to do so is by choosing a formal representation of each theory.
As an exercise, we here take Wittgenstein’s view of explicit induction as modeled by simplicism, and Hume’s view of implicit induction as modeled by CBDT, and study their implications in a few simple examples. Two caveats are in order: first, any formal model of an informal claim is bound to commit to a certain interpretation thereof; thus the particular models we discuss may not do justice to the original views. Second, both similarity and language are inherently subjective notions. Hence, the formal models we discuss here by no means resolve all ambiguity. Even with the use of these models, much freedom is left in the way the two views are brought to bear on a particular example. Yet, we hope that the analysis of a few simple examples might indicate some of the advantages of both views as theories of human thinking.

20.2 Examples

Consider a simple learning problem. Every item has two observable attributes, color and shape. Each attribute might take one of two values. Color might be red (R) or blue (B) whereas shape might be a square (S) or a circle (C). The task is to learn a concept, say, of “nice” items, Σ, that is fully determined by the attributes. Formally, Σ is a subset of {RS, RC, BS, BC}. We are given positive and/or negative examples to learn from. An example states that one of RS, RC, BS, or BC is in Σ (+) or that it is not (−). We are asked to extrapolate whether a new item is in Σ, based on its observable attributes (again, one of the pairs RS, RC, BS, or BC). The set of examples we have observed, namely, our memory, may be summarized by a matrix, in which the symbol “+” stands for “such items are in Σ”, the symbol “−” stands for “such items are not in Σ”, and a blank space is read as “such items have not yet been observed”. Finally, a “?” would indicate the next item we are asked about. For instance, the matrix

| 1 | S | C |
| R | + |   |
| B |   | ? |

describes a database in which a red square (RS) is known to be nice, and we are asked about a blue circle (BC).

Let us proceed with this example. Knowing only that a red square is nice, would a blue circle be nice as well? Not having observed any negative examples, the simplest theory in any reasonable language is likely to be “All items are nice”, predicting “+” for BC. Correspondingly, if we assume that all items bear some resemblance to each other, a case-based extrapolator will also come up with this prediction.

Next consider the matrices

| 2 | S | C |
| R | + | − |
| B |   | ? |

| 3 | S | C |
| R | − | + |
| B |   | ? |

In matrix 2, the simplest theory would probably be “an item is nice if and only if it is a square”, predicting that BC is not in Σ. The same prediction would be generated by CBDT: since BC is more similar to RC than it is to RS, the former (non-nice) example would outweigh the latter (nice) one. Similarly, simplicism and CBDT will concur in their prediction for matrix 3, predicting that BC is nice. We also get similar predictions for the following two matrices:

| 4 | S | C |
| R | + |   |
| B | − | ? |

| 5 | S | C |
| R | − |   |
| B | + | ? |

However, the two methods of extrapolation might also be differentiated. Consider the matrices

| 6 | S | C |
| R | + | ? |
| B |   | − |

| 7 | S | C |
| R | + |   |
| B | ? | − |

The observations in both matrices are identical. The simplest rule that accounts for them is not uniquely defined: the theory “an item is nice if and only if it is red”, as well as the theory “an item is nice if and only if it is a square” are both consistent with evidence, and both would be minimizers of Kolmogorov’s complexity measure in a language that has R and S as primitives. (As opposed to, say, their conjunction.) Each of these simplest theories would yield a positive prediction in one matrix and a negative one in the other. By contrast, assuming that color and shape are symmetric, the CBDT classification would leave us undecided between a positive and a negative answer in both matrices.
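The comparison between the two extrapolation methods can be sketched in a few lines of Python. Everything specific in this sketch is an assumption made for illustration: the similarity function (a positive baseline plus one unit per shared attribute), the toy list of theories with hand-assigned complexity levels (a stand-in for Kolmogorov complexity in some language with R and S as primitives), and the encoding of “+” and “−” as +1 and −1.

```python
def similarity(a, b, w_color=1.0, w_shape=1.0, base=0.5):
    """Hypothetical similarity between items such as "RS" and "BC"."""
    return base + w_color * (a[0] == b[0]) + w_shape * (a[1] == b[1])


def cbdt_score(memory, query):
    """Similarity-weighted vote: positive means nice, zero means undecided."""
    return sum(similarity(query, item) * label for item, label in memory.items())


# A toy language of theories: (complexity, name, rule). Purely illustrative.
THEORIES = [
    (0, "all items are nice", lambda item: True),
    (0, "no item is nice", lambda item: False),
    (1, "nice iff red", lambda item: item[0] == "R"),
    (1, "nice iff blue", lambda item: item[0] == "B"),
    (1, "nice iff square", lambda item: item[1] == "S"),
    (1, "nice iff circle", lambda item: item[1] == "C"),
]


def simplest_predictions(memory, query):
    """Predictions for `query` by the simplest theories consistent with memory."""
    consistent = [
        (cost, rule) for cost, _name, rule in THEORIES
        if all(rule(item) == (label > 0) for item, label in memory.items())
    ]
    if not consistent:
        return set()
    lowest = min(cost for cost, _rule in consistent)
    return {rule(query) for cost, rule in consistent if cost == lowest}


# (observed examples, item asked about) for matrices 1, 2, 6 and 7.
MATRICES = {
    1: ({"RS": +1}, "BC"),
    2: ({"RS": +1, "RC": -1}, "BC"),
    6: ({"RS": +1, "BC": -1}, "RC"),
    7: ({"RS": +1, "BC": -1}, "BS"),
}

for number, (memory, query) in MATRICES.items():
    print(number, query, cbdt_score(memory, query),
          simplest_predictions(memory, query))
```

Under these assumptions the two methods agree on matrices 1 and 2, while on matrices 6 and 7 the case-based score is exactly zero and the set of simplest predictions contains both answers.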
In matrices 6 and 7 the prediction provided by CBDT appears more satisfactory than that of simplicism. In both matrices the evidence for and against a positive prediction is precisely balanced. Simplicism comes up with two equally simple but very different answers, whereas CBDT reflects the balance between them. Since CBDT uses quantitative similarity judgments, and produces quantitative evaluation functions, it deals with indifference more gracefully than do simplest theories or rules. In a way that parallels our discussion in Section 13, we find that CBDT behaves more smoothly at the transition between different rules.

It is natural to suggest that simplicism be extended to a random choice among all simplest theories, or an expected prediction of a theory chosen randomly. Indeed, in matrices 6 and 7 above, if we were to quantify the predictions and take an average of the predictions of the two simplest theories, we would also be indifferent between a positive and a negative prediction. But once we allow weighted aggregation of theories, we would probably not want to restrict it to cases of indifference. For instance, suppose that a hundred theories, each of which has Kolmogorov complexity of 1,001 (say, bits), agree on a prediction, but disagree with the unique simplest theory, whose complexity is 1,000. It would be natural to extend the aggregated prediction method to this case as well, allowing many almost-simplest theories to outweigh a single simplest one. But then we are led down the slippery path leading to a Bayesian prior over all theories, which is a far cry from simplicism.

An example of more pronounced disagreement between CBDT and simplicism is provided by the following matrix.

| 8 | S | C |
| R | + | + |
| B | − | ? |

The simplest theory that accounts for the data is “an item is nice if and only if it is red”, predicting that a blue circle is not nice. What will be the CBDT prediction?
Had RS not been observed, the situation would have been symmetric to matrices 6 and 7, leaving a CBDT predictor indifferent between a positive and a negative answer. However, the similarity between RS and BC is positive, as argued above. (If it were not, CBDT would not yield a positive prediction in matrix 1.) Hence the additional observation tilts the balance in favor of a positive answer.

Observe, however, that the derivation of the CBDT prediction in matrix 8 relies on additive separability. That is, we only allow CBDT to employ first-order induction. But if we extend CBDT to incorporate second-order induction, the three observations in matrix 8 do indicate that color is a more important attribute than is shape. If this is reflected in the weight that the similarity function puts on the two attributes, BC will be more similar to BS than to RC. If this difference is large enough, the fact that a blue square is not nice may outweigh the two nice red examples.

Second-order induction can also be defined in the context of simplicism. A simplicistic predictor chooses simplest theories given a language, but she can also learn the language, in an analogous way to learning the similarity function in CBDT.

Consider Mary of Section 19 again. We have argued that, as compared with children, most adults learn to put less emphasis on color as a defining feature of a car’s quality. In both models, this type of learning falls under the category of second-order induction: in CBDT, as we have seen, it would be modeled by changing the similarity function so that the weight attached to color in similarity judgments is reduced. In simplicism, it would be captured by including in the language other predicates, and perhaps dispensing with “color” altogether. Observe that in this respect, too, the quantitative nature of CBDT may provide more flexibility than the qualitative choice of language in simplicism.
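Returning to matrix 8, the effect of attribute weights on the CBDT prediction can be illustrated with a small Python sketch. The similarity function (a positive baseline plus a weight for each shared attribute) and the particular weights are assumptions chosen for illustration; nothing here prescribes how second-order induction should actually set the weights.

```python
def similarity(a, b, w_color, w_shape, base=0.5):
    """Hypothetical similarity: baseline resemblance plus weighted shared attributes."""
    return base + w_color * (a[0] == b[0]) + w_shape * (a[1] == b[1])


# Matrix 8: red square and red circle are nice (+1), blue square is not (-1).
MATRIX_8 = {"RS": +1, "RC": +1, "BS": -1}


def cbdt_score(memory, query, w_color, w_shape):
    """Similarity-weighted sum of labels; positive predicts that query is nice."""
    return sum(
        similarity(query, item, w_color, w_shape) * label
        for item, label in memory.items()
    )


# First-order induction only: color and shape are weighted symmetrically.
symmetric = cbdt_score(MATRIX_8, "BC", w_color=1.0, w_shape=1.0)

# After (assumed) second-order learning that color matters more than shape.
color_heavy = cbdt_score(MATRIX_8, "BC", w_color=3.0, w_shape=1.0)

print("symmetric weights:  ", symmetric)
print("color-heavy weights:", color_heavy)
```

In this parametrization the score equals the baseline plus the shape weight minus the color weight, so the prediction for BC flips from positive to negative once the color weight exceeds the shape weight by more than the baseline similarity.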
To sum, CBDT appears to be more flexible than does simplicism. Implicit induction seems to avoid some of the difficulties posed by explicit induction. Yet, modeling only first order induction is unsatisfactory. It remains a challenge to find a formal model that will capture Hume’s intuition and allow quantitative aggregation of cases, without excluding second-order induction and refinements of the similarity function. Bibliography [A] Akaike, H. (1954), “An Approximation to the Density Function”, Annals of the Institute of Statistical Mathematics, 6: 127-132. [All] Allais, M. (1953), “Le Comportement de L’Homme Rationel devant le Risque: critique des Postulates et Axioms de l’Ecole Americaine”, Econometrica, 21: 503-546. [Al] Alt, F., (1936), “On the Measurability of Utility”, in J. S. Chipman, L. Hurwicz, M. K. Richter, H. F. Sonnenschein (eds.): Preferences, Utility, and Demand, A. Minnesota Symposium, pp. 424-431, ch. 20. New York: Harcourt Brace Jovanovich, Inc. [Translation of “Uber die Messbarheit des Nutzens”, Zeitschrift fuer Nationaloekonomie 7: 161-169.] [An] Anscombe, F. J., and R. J. Aumann (1963), “A Definition of Subjective Probability”, The Annals of Mathematics and Statistics, 34: 199-205. [Ar] Aragones, E. (1997), “Negativity Effect and the Emergence of Ideologies”, Journal of Theoretical Politics, 9: 189-210. [Arr1] Arrow, K. (1986), “Rationality of Self and Others in an Economic System”, Journal of Business, 59: S385-S399. [Arr2] Arrow, K. and G. Debreu (1954), “Existence of an Equilibrium for a Competitive Economy”, Econometrica, 22: 265-290. 201 202 BIBLIOGRAPHY [AL] Ashkenazi, G. and E. Lehrer (2000), “Well-Being Indices”, mimeo. [Au] Aumann, R. J. (1962), “Utility Theory without the Completeness Axiom”, Econometrica 30: 445-462. [BG] Beja, A. and I. Gilboa (1992), “Numerical Representations of Imperfectly Ordered Preferences (A Unified Geometric Exposition)”, Journal of Mathematical Psychology, 36: 426-449. [B] Bewley, T. 
(1986), “Knightian Decision Theory: Part I”, Discussion Paper 807, Cowles Foundation. [BM] Bush, R. R., and F. Mosteller (1955), Stochastic Models for Learning. New York: Wiley. [CW] Camerer, C. and M. Weber (1992), “Recent Developments in Modeling Preferences: Uncertainty and Ambiguity”, Journal of Risk and Uncertainty, 5: 325-370. [Ca] Carnap, R. (1923), “Uber die Aufgabe der Physik und die Andwednung des Grundsatze der Einfachstheit”, Kant-Studien, 28: 90-107. [CH] Cover, T. and P. Hart (1967), “Nearest Neighbor Pattern Classification”, IEEE Transactions on Information Theory 13: 21-27. [C] Cross, J. G. (1983), A Theory of Adaptive Economic Behavior. New York: Cambridge University Press. [deF] de Finetti, B. (1937), “La Prevision: Ses Lois Logiques, Ses Sources Subjectives”, Annales de l’Institute Henri Poincare, 7: 1-68. [DeG] DeGroot, M. H.(1975), Probability and Statistics. Reading, MA: Addison-Wesley. BIBLIOGRAPHY 203 [DGL] Devroye, L., L. Gyorfi, and G. Lugosi (1996), A Probabilistic Theory of Pattern Recognition, New York: Springer-Verlag. [E] Ellsberg, D. (1961), “Risk, Ambiguity and the Savage Axioms”, Quarterly Journal of Economics, 75: 643-669. [Et] Etzioni, A. (1986), “Rationality is Anti-Entropic”, Journal of Economic Psychology, 7: 17-36. [Fi1] Fishburn, P. C. (1970), “Intransitive Indifference in Preference Theory: A Survey”, Opertaions Research, 18: 207-228. [Fi2] Fishburn, P. C. (1985), Interval Orders and Interval Graphs. New York: Wiley and Sons. [FH1] Fix, E. and J. Hodges (1951), “Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties”. Technical Report 4, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field, TX. [FH2] Fix, E. and J. Hodges (1952), ”Discriminatory Analysis: Small Sample Performance”. Technical Report 21-49-004, USAF School of Aviation Medicine, Randolph Field, TX. [FFG] Flakenhainer, B., K. D. Forbus, and D. 
Gentner (1989), “The Structure-Mapping Engine: Algorithmic Example”, Artificial Intelligence, 41: 1-63. [FY] Foster, D. and H. P. Young (1990), “Stochastic Evolutionary Game Dynamics”, Theoretical Population Biology, 38: 219-232. [G] Gärdenfors, P. (1990), “Induction, Conceptual Spaces and AI”, Philosophy of Science, 57: 78-95. [GV] Gibbard, A. and H. Varian (1978), “Economic Models”, Journal of Philosophy, 75: 664-677. 204 BIBLIOGRAPHY [GH1] Gick, M. L. and K. J. Holyoak (1980), “Analogical Problem Solving”, Cognitive Psychology, 12: 306-355. [GH2] Gick, M. L. and K. J. Holyoak (1983), “Schema Induction and Analogical Transfer”, Cognitive Psychology, 15: 1-38. [G87] Gilboa, I. (1987), “Expected Utility with Purely Subjective NonAdditive Probabilities,” Journal of Mathematical Economics 16: 6588. [Gi] Gilboa, I. (1994), “Philosophical Applications of Kolmogorov’s Complexity Measure”, in Logic and Philosophy of Science in Uppsala, D. Prawitz and D. Westerstahl (eds.), Synthese Library, Vol. 236, Kluwer Academic Press, pp. 205-230. [GL] Gilboa, I. and R. Lapson (1995), “Aggregation of Semi-Orders: Intransitive Indifference Makes a Difference”, Economic Theory, 5: 109126. [GP] Gilboa, I. and A. Pazgal (1996), “History-Dependent Brand Switching: Theory and Evidence”, Northwestern University Discussion Paper. [GP2] Gilboa, I. and A. Pazgal (2000), “Cumulative Discrete Choice”, Marketing Letters, forthcoming. [GS1] Gilboa, I. and D. Schmeidler (1989), “Maxmin Expected Utility with a Non-Unique Prior”, Journal of Mathematical Economics, 18: 141153. [GS2] Gilboa, I. and D. Schmeidler (1995), “Case-Based Decision Theory”, The Quarterly Journal of Economics, 110: 605-639. [GS4] Gilboa, I. and D. Schmeidler (1996a), “Case-Based Optimization”, Games and Economic Behavior, 15: 1-26. BIBLIOGRAPHY 205 [GS5] Gilboa, I. and D. Schmeidler (1996b), “A Cognitive Model of Wellbeing”, Social Choice and Welfare, forthcoming. [GS6] Gilboa, I. and D. 
Schmeidler (1997a), “Act Similarity in Case-Based Decision Theory”, Economic Theory, 9: 47-61.
[GS7] Gilboa, I. and D. Schmeidler (1997b), “Cumulative Utility Consumer Theory”, International Economic Review, 38: 737-761.
[GS3] Gilboa, I. and D. Schmeidler (2000a), “Case-Based Knowledge and Induction”, IEEE Transactions on Systems, Man, and Cybernetics.
[GS8] Gilboa, I. and D. Schmeidler (2000b), “Cognitive Foundations of Inductive Inference and Probability: An Axiomatic Approach”, mimeo.
[GS9] Gilboa, I. and D. Schmeidler (2000c), “They’re All Nearest Neighbors”, mimeo.
[GSW] Gilboa, I., D. Schmeidler, and P. Wakker (1999), “Utility in Case-Based Decision Theory”, mimeo.
[Git] Gittins, J. C. (1979), “Bandit Processes and Dynamic Allocation Indices”, Journal of the Royal Statistical Society, B, 41: 148-164.
[H] Halmos, P. R. (1950), Measure Theory. Princeton, NJ: Van Nostrand.
[Ha] Hanson, N. R. (1958), Patterns of Discovery. Cambridge, England: Cambridge University Press.
[HC] Harless, D. and C. Camerer (1994), “The Utility of Generalized Expected Utility Theories”, Econometrica, 62: 1251-1289.
[Har1] Harsanyi, J. C. (1953), “Cardinal Utility in Welfare Economics and in the Theory of Risk-Taking”, Journal of Political Economy, 61: 434-435.
[Har2] Harsanyi, J. C. (1955), “Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility”, Journal of Political Economy, 63: 309-321.
[HP] Herrnstein, R. J., and D. Prelec (1991), “Melioration: A Theory of Distributed Choice”, Journal of Economic Perspectives, 5: 137-156.
[Ho] Holland, J. H. (1975), Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
[Hu] Hume, D. (1748), An Enquiry Concerning Human Understanding. Oxford: Clarendon Press.
[KT] Kahneman, D. and A. Tversky (1979), “Prospect Theory: An Analysis of Decision Under Risk”, Econometrica, 47: 263-291.
[KMR] Kandori, M., G. J. Mailath, and R.
Rob (1993), “Learning, Mutation and Long-Run Equilibria in Games”, Econometrica, 61: 29-56.
[KVS] Karni, E., D. Schmeidler, and K. Vind (1983), “On state dependent preferences and subjective probabilities”, Econometrica, 51: 1021-1031.
[KM] Karni, E. and P. Mongin (2000), “On the Determination of Subjective Probability by Choices”, Management Science, 46: 233-248.
[KS] Karni, E. and D. Schmeidler (1991), “Utility theory with uncertainty”, in Handbook of Mathematical Economics 4, W. Hildenbrand and H. Sonnenschein (eds.), North Holland, pp. 1763-1831.
[K] Keynes, J. M. (1921), A Treatise on Probability. London: Macmillan and Co.
[KG] Kirkpatrick, S., C. D. Gelatt, et al. (1982), “Optimization by Simulated Annealing”, IBM Thomas J. Watson Research Center, Yorktown Heights, NY.
[Kn] Knight, F. H. (1921), Risk, Uncertainty, and Profit. Boston, New York: Houghton Mifflin.
[Ko] Kolodner, J. L., Ed. (1988), Proceedings of the First Case-Based Reasoning Workshop. Los Altos, CA: Morgan Kaufmann Publishers.
[KR] Kolodner, J. L. and C. K. Riesbeck (1986), Experience, Memory, and Reasoning. Hillsdale, NJ: Lawrence Erlbaum Associates.
[L] Levi, I. (1980), The Enterprise of Knowledge. Cambridge, MA: MIT Press.
[Le] Lewin, S. (1986), “Economics and Psychology: Lessons For Our Own Day, From the Early Twentieth Century”, Journal of Economic Literature, 34: 1293-1323.
[Lu] Luce, R. D. (1956), “Semiorders and a Theory of Utility Discrimination”, Econometrica, 24: 178-191.
[M] Machina, M. (1987), “Choice Under Uncertainty: Problems Solved and Unsolved”, Economic Perspectives, 1: 121-154.
[MS] March, J. G. and H. A. Simon (1958), Organizations. New York: Wiley.
[Ma] Matsui, A. (2000), “Expected Utility and Case-Based Reasoning”, Mathematical Social Sciences, 39: 1-12.
[MD] McDermott, D. and J. Doyle (1980), “Non-Monotonic Logic I”, Artificial Intelligence, 25: 41-72.
[Mo] Moser, P. K., Ed. (1986), Empirical Knowledge. Rowman and Littlefield Publishers.
[My] Myerson, R.
B. (1995), “Axiomatic Derivation of Scoring Rules Without the Ordering Assumption”, Social Choice and Welfare, 12: 59-74.
[Na] Nash, J. F. (1951), “Non-Cooperative Games”, Annals of Mathematics, 54: 286-295.
[Pa] Parzen, E. (1962), “On the Estimation of a Probability Density Function and the Mode”, Annals of Mathematical Statistics, 33: 1065-1076.
[P] Pazgal, A. (1997), “Satisficing Leads to Cooperation in Mutual Interests Games”, International Journal of Game Theory, 26: 439-453.
[Po] Popper, K. R. (1934), Logik der Forschung; English edition (1958), The Logic of Scientific Discovery. London: Hutchinson and Co. Reprinted (1961), New York: Science Editions.
[Q1] Quine, W. V. (1953), “Two Dogmas of Empiricism”, in From a Logical Point of View. Cambridge, MA: Harvard University Press.
[Q2] Quine, W. V. (1969a), “Epistemology Naturalized”, in Ontological Relativity and Other Essays. New York: Columbia University Press.
[Q3] Quine, W. V. (1969b), “Natural Kinds”, in Essays in Honor of Carl G. Hempel, N. Rescher (ed.), Dordrecht, Holland: D. Reidel Publishing Company.
[R] Ramsey, F. P. (1931), “Truth and Probability”, in The Foundations of Mathematics and Other Logical Essays. New York: Harcourt, Brace and Co.
[Ra] Rawls, J. (1971), A Theory of Justice. Cambridge, MA: Belknap.
[Re] Reiter, R. (1980), “A Logic for Default Reasoning”, Artificial Intelligence, 13: 81-132.
[RS] Riesbeck, C. K. and R. C. Schank (1989), Inside Case-Based Reasoning. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
[Ros] Rosenblatt, M. (1956), “Remarks on Some Nonparametric Estimates of a Density Function”, Annals of Mathematical Statistics, 27: 832-837.
[Ro] Royall, R. (1966), A Class of Nonparametric Estimators of a Smooth Regression Function. Ph.D. Thesis, Stanford University, Stanford, CA.
[S] Savage, L. J. (1954), The Foundations of Statistics. New York: John Wiley and Sons.
[Sc] Schank, R. C.
(1986), Explanation Patterns: Understanding Mechanically and Creatively. Hillsdale, NJ: Lawrence Erlbaum Associates.
[Schm] Schmeidler, D. (1989), “Subjective Probability and Expected Utility without Additivity”, Econometrica, 57: 571-587.
[Se] Selten, R. (1978), “The Chain-Store Paradox”, Theory and Decision, 9: 127-158.
[Sh] Shapley, L. S. (1953), “A Value for n-Person Games”, in Contributions to the Theory of Games II, H. W. Kuhn and A. W. Tucker (eds.), Princeton: Princeton University Press, pp. 307-317.
[SOF] Shepperd, J. A., A. J. Ouellette, and J. K. Fernandez (1996), “Abandoning Unrealistic Optimism: Performance, Estimates and the Temporal Proximity of Self-Relevant Feedback”, Journal of Personality and Social Psychology, 70: 844-855.
[Si1] Simon, H. A. (1957), Models of Man. New York: John Wiley and Sons.
[Si2] Simon, H. A. (1986), “Rationality in Psychology and Economics”, Journal of Business, 59: S209-S224.
[Sk] Skinner, B. F. (1938), The Behavior of Organisms. New York: Appleton-Century-Crofts.
[So] Sober, E. (1975), Simplicity. Oxford: Clarendon Press.
[Su] Suppe, F. (1974), The Structure of Scientific Theories (edited with a critical introduction by F. Suppe). Urbana, Chicago, London: University of Illinois Press.
[T1] Tversky, A. (1977), “Features of Similarity”, Psychological Review, 84: 327-352.
[T2] Tversky, A., and D. Kahneman (1981), “The Framing of Decisions and the Psychology of Choice”, Science, 211: 453-458.
[vNM] von Neumann, J. and O. Morgenstern (1944), Theory of Games and Economic Behavior. Princeton, NJ: Princeton University Press.
[Wk] Wakker, P. P. (1989), Additive Representations of Preferences. Dordrecht, Boston, London: Kluwer Academic Publishers.
[W1] Watson, J. B. (1913), “Psychology as the Behaviorist Views It”, Psychological Review, 20: 158-177.
[W2] Watson, J. B. (1930), Behaviorism (Revised Edition). New York: Norton.
[Wi] Wittgenstein, L. (1922), Tractatus Logico-Philosophicus.
London, Routledge and Kegan Paul.
[Y1] Young, P. H. (1975), “Social Choice Scoring Function”, SIAM Journal of Applied Mathematics, 28: 824-838.
[Y2] Young, P. H. (1993), “The Evolution of Conventions”, Econometrica, 61: 57-84.