Robust Bayes Classifiers: Research Note
Abstract
Naive Bayes classifiers provide an efficient and scalable approach to supervised classification
problems. When some entries in the training set are missing, methods exist to learn these classifiers
under some assumptions about the pattern of missing data. Unfortunately, reliable information about
the pattern of missing data may not be readily available, and recent experimental results show that
the enforcement of an incorrect assumption about the pattern of missing data produces a dramatic
decrease in accuracy of the classifier. This paper introduces a Robust Bayes Classifier (RBC) able
to handle incomplete databases with no assumption about the pattern of missing data. In order to
avoid assumptions, the RBC bounds all the possible probability estimates within intervals using a
specialized estimation method. These intervals are then used to classify new cases by computing
intervals on the posterior probability distributions over the classes given a new case and by ranking
the intervals according to some criteria. We provide two scoring methods to rank intervals and a
decision theoretic approach to trade off the risk of an erroneous classification and the choice of not
classifying unequivocally a case. This decision theoretic approach can also be used to assess the
opportunity of adopting assumptions about the pattern of missing data. The proposed approach is
evaluated on twenty publicly available databases. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Bayes classifier; Missing data; Probability intervals
1. Introduction
Supervised classification is the task of assigning a class label to unclassified cases
described as a set of attribute values. This task is typically performed by first training
a classifier on a set of classified cases and then using it to label unclassified cases. The
supervisory component of this classifier resides in the training signal, which provides the
classifier with a way to assess a dependency measure between attributes and classes. Naive
Bayes classifiers (NBCs) [4,11] have been among the first supervised classification methods
and, during the past few years, they have enjoyed a renewed interest and consideration [6].
The training step for a NBC consists of estimating the conditional probability distributions
of each attribute given the class from a training data set. Once trained, the NBC classifies
a case by computing the posterior probability distribution over the classes via Bayes
Theorem and assigning the case to the class with the highest posterior probability.
NBCs assume that the attributes are conditionally independent given the class, and this
assumption makes both training and classification very efficient. Unfortunately, when the
training set is incomplete, that is, some attribute values or the class itself are reported
as unknown, both efficiency and accuracy of the classifier can be lost. Simple solutions
to handle missing data are either to ignore the cases including unknown entries or to
ascribe these entries to an ad hoc dummy state of the respective variables [15]. Both these
solutions are known to introduce potentially dangerous biases in the estimates, see [9] for a
discussion. In order to overcome this problem, Friedman et al. [6] suggest the use of the EM
algorithm [3], gradient descent [20] or, we add, Gibbs sampling [7]. All these methods rely
on the assumption that data are Missing at Random (MAR) [13], that is, the database is left
with enough information to infer the missing entries from the recorded ones. Unfortunately,
there is no way to verify that data are actually MAR in a particular database and, when this
assumption is violated, these estimation methods suffer a dramatic decrease in accuracy
with the consequence of jeopardizing the performance of the resulting classifier [21].
This paper introduces a new type of NBC, called Robust Bayes Classifier (RBC), which
does not rely on any assumption about the missing data mechanism. The RBC is based on
the Robust Bayes Estimator (RBE) [18], an estimator that returns intervals containing all
the estimates that could be induced from all the possible completions of an incomplete
database. The intuition behind the RBE is that, even with no information about the missing
data mechanism, an incomplete data set can still constrain the set of estimates that can
be induced from all its possible completions. However, in this situation, the estimator can
only bound the posterior probability of the classes. The first contribution of this paper is
to provide a specialized closed-form, interval-based estimation procedure for NBCs, which
takes full advantage of their conditional independence assumptions. Once trained, these
classifiers are used to classify unlabeled cases. Unfortunately, Bayes Theorem cannot
be straightforwardly extended from standard point-valued probabilities to interval-valued
probabilities. Nonetheless, the conditional independence assumptions underlying the NBC
allow for a closed-form solution for the classification task, too. The second contribution
of this paper is a new propagation algorithm to compute posterior probability intervals
containing all the class posterior probabilities that could be obtained from the exact
computation of all possible completions of the training set. These intervals are then ranked
according to a score and a new case is assigned to the class associated with the highest
ranked interval. We provide two scoring methods: the first, based on the strong dominance
criterion [10], assigns a case to the class whose minimum posterior probability is higher
than the maximum posterior probability for all other classes. This criterion preserves the
robustness of the classifier but may leave some cases unclassified and hence we provide a
weaker criterion to improve the coverage. We also introduce a general decision-theoretic
framework to select the most appropriate criterion by trading off accuracy and coverage.
As a by-product, this decision-theoretic approach provides a principled way to assess the
viability of the MAR assumption for a given training set. We also show that, when the
database is complete, the RBC estimates reduce to the standard Bayesian estimates and
therefore the RBC subsumes the standard NBC as a special case. This approach is evaluated
on twenty publicly available databases.
Given a training database D, the Bayesian estimates of the conditional probability distributions {p(a_ik | c_j)} and of the marginal distribution {p(c_j)} defining an NBC are

$$p(a_{ik} \mid c_j) = \frac{\alpha_{ijk} + n(a_{ik}, c_j)}{\sum_h [\alpha_{ijh} + n(a_{ih}, c_j)]} \qquad\text{and}\qquad p(c_j) = \frac{\alpha_j + n(c_j)}{\sum_l [\alpha_l + n(c_l)]}, \tag{1}$$

respectively. The quantities α_ijk and α_j can be regarded as frequencies of the pair (a_ik, c_j) and of the class c_j, respectively, in an imaginary sample, representing the prior information about the distributions of the attributes and the classes. The size of this imaginary sample is called the global prior precision. Further details are in [17].

Fig. 1. A Bayesian network representing an NBC with attributes A1, ..., A9 and a set C of classes.

Once the classifier has been trained, we can use it for the classification of new cases. If we represent a case as a set
of attribute values e = {a_1k, ..., a_mk}, Bayes theorem yields the posterior probability of a class c_j given e as

$$p(c_j \mid e) = \frac{p(c_j) \prod_{i=1}^{m} p(a_{ik} \mid c_j)}{\sum_{h=1}^{q} p(c_h) \prod_{i=1}^{m} p(a_{ik} \mid c_h)} \tag{2}$$

and the case is assigned to the class with the highest posterior probability.
From the computational point of view, the training of the classifier reduces to
summarizing the whole database D into m contingency tables T_i of dimension (q × s_i),
each cell (j, k) of the table Ti collecting the frequency of the pair (aik , cj ). In this way
(i) the estimation of the q probability distributions of each attribute Ai conditional on
the classes c1 , . . . , cq can be done locally using the frequencies n(aik , cj ) in the
table Ti , as the frequencies n(ahk , cj ) in all other tables Th are irrelevant;
(ii) the estimation of each probability distribution of the attribute Ai conditional on the
class cj can be done independently of the other classes, by using the frequencies
n(aik , cj ) in the row j , and
(iii) the estimation of the marginal distribution of the classes can be done in any one of
the tables Ti , by using its row totals n(cj ).
In other words, the estimation procedure can be performed table by table and, within each
table, row by row. These properties were termed global and local parameter independence
by [22] and they are the source of the computational efficiency of the training process.
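As a concrete illustration of Eqs. (1) and (2) and of the contingency-table bookkeeping just described, the following sketch trains and applies a point-valued NBC. This is our own illustration, not the paper's code: the toy data set, the function names, and the uniform splitting of the prior precision among cells are all invented for the example.

```python
from collections import Counter
from math import prod

def train_nbc(cases, num_states, classes, alpha=1.0):
    """Summarize complete cases into contingency tables T_i and return the
    point-valued estimators of Eq. (1). The prior frequencies alpha_ijh are
    taken uniform here (one common choice; the paper's alphas are general)."""
    m = len(num_states)
    tables = [Counter() for _ in range(m)]   # T_i: (c_j, a_ik) -> n(a_ik, c_j)
    n_class = Counter()                      # row totals n(c_j)
    for attrs, c in cases:
        n_class[c] += 1
        for i, a in enumerate(attrs):
            tables[i][(c, a)] += 1

    def p_attr(i, a, c):                     # p(a_ik | c_j), Eq. (1)
        num = alpha / num_states[i] + tables[i][(c, a)]
        den = sum(alpha / num_states[i] + tables[i][(c, v)]
                  for v in range(num_states[i]))
        return num / den

    def p_class(c):                          # p(c_j), Eq. (1)
        num = alpha / len(classes) + n_class[c]
        den = sum(alpha / len(classes) + n_class[l] for l in classes)
        return num / den

    return p_attr, p_class

def classify(p_attr, p_class, classes, e):
    """Posterior distribution over the classes for a case e, Eq. (2)."""
    joint = {c: p_class(c) * prod(p_attr(i, a, c) for i, a in enumerate(e))
             for c in classes}
    z = sum(joint.values())
    return {c: v / z for c, v in joint.items()}

# Toy complete data set: two binary attributes, two classes.
cases = [((0, 0), 0), ((0, 1), 0), ((1, 1), 1), ((1, 0), 1), ((1, 1), 1)]
p_attr, p_class = train_nbc(cases, num_states=[2, 2], classes=[0, 1])
post = classify(p_attr, p_class, [0, 1], (1, 1))
```

Note how the estimation touches one table row at a time, which is exactly the global and local parameter independence exploited above.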
When some entries in the training set D are missing, both accuracy and efficiency
of the NBC are under threat. The reasons for this situation become clear if we regard
the incomplete database as the result of a deletion process that occurred on a complete
(yet unknown) database. The received view on missing data [13] is based on the
characterization of the deletion process. According to this approach, data are Missing
Completely at Random (MCAR), if the probability that an entry is missing is independent
of both observed and unobserved values. They are Missing at Random (MAR), if this
probability is at most a function of the observed values in the database. In all other cases,
data are Informatively Missing. Under the assumption that data are either MAR or MCAR,
the values of the unknown entries can be estimated from the observed ones and the deletion
process is called ignorable. This property guarantees that the available data are sufficient
to train the classifier but, unfortunately, the estimation no longer enjoys the properties of
global and local parameter independence. Indeed, unknown entries induce three types of
incomplete cases:
(i) cases in which the attribute Ai is observed and the class is missing;
(ii) cases in which the class cj is observed and the value of the attribute Ai is missing;
(iii) cases in which both the value of the attribute Ai and the class are missing.
We denote the frequency of these cases by n(aik , ?), n(?, cj ) and n(?, ?), respectively.
Suppose now we had some estimation method able to compute the estimates in Eq. (1) by
assigning a proportion of the frequencies n(aik , ?), n(?, cj ) and n(?, ?) to the cell (j, k) in
each contingency table Ti . As the reconstructed marginal frequency of each class needs to
be equal in all tables, the estimation cannot be done locally any longer, and the properties
of local and global parameter independence are lost. One exception arises when the class
is observed in all cases.
Theorem 1. Suppose that the class is observed in all cases of the training set D, and that
the entries are MAR. Then, the estimates in Eq. (1), with n(aik , cj ) being the frequency of
fully observed pairs aik , cj and n(cj ) being the class frequency, are the exact Bayesian
estimates.
The proof appears in [22]. When some classes are also missing, we can use one of the
approximate methods mentioned in the Introduction to compute the estimates in Eq. (1).
However, these methods require the deletion process to be ignorable. When data are
informatively missing, the available entries are no longer sufficient to train the classifier.
Furthermore, there is no way, yet, to check whether the deletion process responsible for the
missing data is actually ignorable. These are the motivations behind the introduction of the
Robust Bayesian Estimator (RBE) [18] and its application, in this paper, to the development
of a robust version of the NBC.
3. Robust estimation
Recall that an NBC is trained by estimating the conditional probability distributions
{p(a_ik | c_j)} and {p(c_j)} from a database D. This section describes how to perform this
task when the database D is incomplete. We need the following definitions.
Definition 1 (Consistency). Let D be an incomplete data set and let p(x) be a probability
that we wish to estimate from D.
(1) A consistent completion of D is any complete data set Dc from which D is obtained
via some deletion process.
(2) A consistent estimate of p(x) is an estimate computed in a consistent completion of
D.
(3) A consistent probability interval for p(x) is an interval [p_inf(x), p_sup(x)] containing all consistent estimates. A consistent interval is non-trivial if p_inf(x) > 0 and p_sup(x) < 1.
(4) A consistent probability interval is tight when it is the smallest consistent probability interval $[\underline{p}(x), \overline{p}(x)]$ for p(x).
The difference between a consistent and a tight consistent probability interval is that, in
the former, the interval extreme points are lower and upper bounds for the set of consistent
estimates, while in the latter, the extreme points are reached in some consistent completion
of the database. The rest of this section is devoted to the construction of tight, consistent
probability intervals for the quantities p(aik | cj ) and p(cj ) defining an NBC.
In order to estimate the conditional probability p(aik | cj ) from an incomplete training
set D, the RBE collects the frequencies n(a_ik, ?), n(?, c_j) and n(?, ?) of incomplete cases into the virtual frequencies $\overline{n}(a_{ik}, c_j)$ and $\underline{n}(a_{ik}, c_j)$. These frequencies are then used to compute the extreme points of the tight consistent probability interval for p(a_ik | c_j).
The quantity $\overline{n}(a_{ik}, c_j)$ is the maximum number of incomplete cases (A_i, C) that can be completed as (a_ik, c_j) and it is given by

$$\overline{n}(a_{ik}, c_j) = n(?, c_j) + n(a_{ik}, ?) + n(?, ?). \tag{3}$$

On the other hand, the virtual frequency $\underline{n}(a_{ik}, c_j)$ is the maximum number of incomplete cases (A_i, C) that can be ascribed to c_j without increasing the frequency n(a_ik, c_j) and it is

$$\underline{n}(a_{ik}, c_j) = n(?, c_j) + \sum_{h \neq k} n(a_{ih}, ?) + n(?, ?). \tag{4}$$

The virtual frequencies are used to compute the values $\underline{p}(a_{ik} \mid c_j)$ and $\overline{p}(a_{ik} \mid c_j)$ that are, respectively, the minimum and the maximum estimate of p(a_ik | c_j) that can be found in the consistent completions of D, and they are

$$\underline{p}(a_{ik} \mid c_j) = \frac{\alpha_{ijk} + n(a_{ik}, c_j)}{\sum_h [\alpha_{ijh} + n(a_{ih}, c_j)] + \underline{n}(a_{ik}, c_j)}, \qquad \overline{p}(a_{ik} \mid c_j) = \frac{\alpha_{ijk} + n(a_{ik}, c_j) + \overline{n}(a_{ik}, c_j)}{\sum_h [\alpha_{ijh} + n(a_{ih}, c_j)] + \overline{n}(a_{ik}, c_j)}. \tag{5}$$
It has been shown [18] that the interval $[\underline{p}(a_{ik} \mid c_j), \overline{p}(a_{ik} \mid c_j)]$ is tight and consistent.
We now consider the estimation of p(c_j) and note that the virtual frequencies $\overline{n}(c_j)$ and $\underline{n}(c_j)$ are both equal to the number n(?) of cases in D in which the class is not observed. We obtain tight consistent probability intervals for p(c_j) by setting:

$$\underline{p}(c_j) = \frac{\alpha_j + n(c_j)}{\sum_l [\alpha_l + n(c_l)] + n(?)}, \qquad \overline{p}(c_j) = \frac{\alpha_j + n(c_j) + n(?)}{\sum_l [\alpha_l + n(c_l)] + n(?)}. \tag{6}$$
When the training set is complete, Eqs. (5) and (6) reduce to Eq. (1). Each set given by the maximum probability for the class c_j and the minimum probabilities of the other classes, say $\{\overline{p}(c_j), \underline{p}(c_h), h \neq j\}$, defines a probability distribution

$$\overline{p}(c_j) + \sum_{h \neq j} \underline{p}(c_h) = 1 \quad\text{for all } j, \tag{7}$$

so that the probability intervals $[\underline{p}(c_j), \overline{p}(c_j)]$ are reachable, as defined by [2]. By definition, if the probability p(a_ik | c_j) is at its maximum value $\overline{p}(a_{ik} \mid c_j)$, then the virtual counter $\overline{n}(a_{ik}, c_j)$ absorbs the frequencies n(a_ik, ?) and n(?, ?), so that $p(c_j) = \overline{p}(c_j)$ and, for any other class c_h, we have that $p(a_{ik} \mid c_h) < \overline{p}(a_{ik} \mid c_h)$ and $p(c_h) = \underline{p}(c_h)$. Similarly, if the probability p(a_ik | c_j) is at its minimum value $\underline{p}(a_{ik} \mid c_j)$, then for any other class c_h we have that $p(a_{ik} \mid c_h) > \underline{p}(a_{ik} \mid c_h)$. However, if the class is always observed, the virtual frequencies $\overline{n}(a_{ik}, c_j)$ and $\underline{n}(a_{ik}, c_j)$ are both equal to n(?, c_j), because n(a_ik, ?) = n(?, ?) = 0 for all i and k. In this case, the probabilities p(a_ik | c_j) can vary independently and maxima and minima can be reached at the same time, for different classes c_j.
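The virtual frequencies and the interval extremes of Eqs. (3)-(6) can be sketched in a few lines. This is our own illustration, not the paper's code: the function and argument names and the small frequency vectors are invented, and the raw frequencies are passed in directly rather than read from a database.

```python
def attr_interval(k, n, n_attr_q, n_q_cls, n_qq, alpha):
    """Tight consistent interval for p(a_ik | c_j), Eqs. (3)-(5).
    n[h]       : n(a_ih, c_j), fully observed pairs for class c_j
    n_attr_q[h]: n(a_ih, ?), attribute observed but class missing
    n_q_cls    : n(?, c_j), class observed but attribute missing
    n_qq       : n(?, ?), both missing
    alpha[h]   : prior frequencies alpha_ijh."""
    n_over = n_q_cls + n_attr_q[k] + n_qq                         # Eq. (3)
    n_under = (n_q_cls + n_qq
               + sum(x for h, x in enumerate(n_attr_q) if h != k))  # Eq. (4)
    base = sum(a + x for a, x in zip(alpha, n))
    low = (alpha[k] + n[k]) / (base + n_under)                    # Eq. (5), minimum
    high = (alpha[k] + n[k] + n_over) / (base + n_over)           # Eq. (5), maximum
    return low, high

def class_interval(j, n_cls, n_q, alpha):
    """Tight consistent interval for p(c_j), Eq. (6); n_q = n(?)."""
    base = sum(a + x for a, x in zip(alpha, n_cls))
    return ((alpha[j] + n_cls[j]) / (base + n_q),
            (alpha[j] + n_cls[j] + n_q) / (base + n_q))

# With no missing entries the interval collapses to the point estimate of Eq. (1).
lo_c, hi_c = attr_interval(0, n=[3, 1], n_attr_q=[0, 0], n_q_cls=0, n_qq=0,
                           alpha=[0.5, 0.5])
# With missing entries the interval widens around that point estimate.
lo_m, hi_m = attr_interval(0, n=[3, 1], n_attr_q=[1, 0], n_q_cls=2, n_qq=0,
                           alpha=[0.5, 0.5])
```

Note that no assumption about the deletion process is made anywhere: the bounds come purely from counting how the incomplete cases could be completed.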
4. Robust classification
Once trained, the classifier can be used to label unclassified cases. Given a new case, an
NBC performs this task in two steps: first it computes the posterior probability of each class
given the attribute values, and then it assigns the case to the class with the highest posterior
probability. In this section, we first show how to compute posterior probability intervals of
each class and then how to rank these intervals to classify new cases.
4.1. Posterior probability intervals
Let e = {a1k , . . . , amk } be attribute values of a case e that we wish to classify. With
point-valued probabilities, the expression of the posterior probability of a class cj , given e,
is given in Eq. (2). The next theorem identifies non-trivial consistent probability intervals for the classes. The result generalizes the solution provided by [18] for Boolean classes. We then show that, when the training set D always reports the class, these consistent intervals are also tight.
Theorem 2. Let D be an incomplete data set. Then, the probability interval [p_inf(c_j | e), p_sup(c_j | e)] with

$$p_{\sup}(c_j \mid e) = \frac{\overline{p}(c_j) \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_j)}{\overline{p}(c_j) \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_j) + \sum_{h \neq j} \underline{p}(c_h) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_h)} \tag{8}$$

and

$$p_{\inf}(c_j \mid e) = \frac{\underline{p}(c_j) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_j)}{\underline{p}(c_j) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_j) + \max\{f_g,\ g \neq j\}}, \tag{9}$$

where

$$f_g = \overline{p}(c_g) \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_g) + \sum_{l \neq j, g} \underline{p}(c_l) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_l),$$

is a non-trivial consistent probability interval for p(c_j | e).
The posterior probability p(c_j | e), regarded as a function f(x_j, y_j) of the quantities x_h and y_h, is maximized by maximizing x_j and by minimizing $\sum_{h \neq j} x_h$, and it is minimized by minimizing x_j and by maximizing $\sum_{h \neq j} x_h$. This argument grounds the intuition of the proof: we will find maxima and minima of the function f(x_j, y_j) in a hyper-rectangle containing the region of definition of the variables x_j, for y_j fixed, and these maxima and minima induce upper and lower bounds for the function f(x_j, y_j). We then maximize and minimize these bounds with respect to y_j. The first step is to find this hyper-rectangle.
If the probabilities p(a_ik | c_j) could vary independently within the intervals $[\underline{p}(a_{ik} \mid c_j), \overline{p}(a_{ik} \mid c_j)]$, then the variables x_j would vary independently in the Cartesian product C of the intervals

$$[\underline{x}_j, \overline{x}_j] = \left[ \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_j),\ \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_j) \right].$$

Thus, setting $x_j = \prod_{i=1}^{m} p(a_{ik} \mid c_j)$ and $x_h = \prod_{i=1}^{m} p(a_{ik} \mid c_h)$, we can bound the function f(x_j, y_j) in the hyper-rectangle C, for y_j fixed.
Theorem 3. If the class is reported for every case in the training set D, the probability intervals defined by

$$\overline{p}(c_j \mid e) = \frac{\overline{p}(c_j) \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_j)}{\overline{p}(c_j) \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_j) + \sum_{l \neq j} \underline{p}(c_l) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_l)} \tag{11}$$

and by

$$\underline{p}(c_j \mid e) = \frac{\underline{p}(c_j) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_j)}{\underline{p}(c_j) \prod_{i=1}^{m} \underline{p}(a_{ik} \mid c_j) + \sum_{h \neq j} \overline{p}(c_h) \prod_{i=1}^{m} \overline{p}(a_{ik} \mid c_h)} \tag{12}$$

are tight consistent probability intervals for p(c_j | e).
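The propagation of Theorem 3 reduces to ratios of products of interval extremes, which the following sketch makes explicit. This is our own illustration, not the paper's code: the data structures and the toy interval values are invented, and it covers only the always-observed-class case of Eqs. (11)-(12).

```python
from math import prod

def posterior_interval(j, e, classes, attr_lo, attr_hi, cls_lo, cls_hi):
    """Return (p_low(c_j | e), p_high(c_j | e)) per Eqs. (11)-(12).
    attr_lo[c][i][a] / attr_hi[c][i][a]: interval extremes for p(a_i = a | c),
    cls_lo[c] / cls_hi[c]: interval extremes for p(c), all from Eqs. (5)-(6)."""
    x_lo = {c: cls_lo[c] * prod(attr_lo[c][i][a] for i, a in enumerate(e))
            for c in classes}
    x_hi = {c: cls_hi[c] * prod(attr_hi[c][i][a] for i, a in enumerate(e))
            for c in classes}
    # Eq. (12): own factors at their minimum, every rival at its maximum.
    lo = x_lo[j] / (x_lo[j] + sum(x_hi[h] for h in classes if h != j))
    # Eq. (11): own factors at their maximum, every rival at its minimum.
    hi = x_hi[j] / (x_hi[j] + sum(x_lo[l] for l in classes if l != j))
    return lo, hi

# Toy example: one binary attribute, two classes, hypothetical intervals.
classes = [0, 1]
cls_lo, cls_hi = {0: 0.4, 1: 0.3}, {0: 0.7, 1: 0.6}
attr_lo = {0: [{0: 0.5, 1: 0.3}], 1: [{0: 0.2, 1: 0.6}]}
attr_hi = {0: [{0: 0.7, 1: 0.5}], 1: [{0: 0.4, 1: 0.8}]}
lo0, hi0 = posterior_interval(0, (1,), classes, attr_lo, attr_hi, cls_lo, cls_hi)
```

The cost of each bound is linear in the number of attributes, so robustness is obtained without giving up the computational efficiency of the NBC.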
Theorem 2 ensures that the interval [pinf (cj | e), psup(cj | e)] contains all possible
conditional probabilities p(cj | e) that can be computed from the consistent completions
of the training set D, and the variability within the intervals is due to the uncertainty about
the missing data mechanism. A conservative score derived from the strong dominance
criterion [10] provides a classification rule that does not require any assumption about the
missing data mechanism.
The interval-based classification rule induced by the strong dominance score classifies a new case as c_j if and only if the probability p_inf(c_j | e) is larger than the probability p_sup(c_h | e), for any h ≠ j. Strong dominance is a safe criterion since it returns the classification that we would obtain from all consistent completions of the training set D. However, when the probability intervals are overlapping, the strong dominance score is not defined and we face a situation of undecidability. Moreover, the strong dominance score is too conservative because the condition p_inf(c_j | e) > p_sup(c_h | e), for all h ≠ j, is sufficient to yield the classification we would obtain were the complete training set known, but it is not necessary. In order to increase the coverage of the classifier, we can weaken this criterion by making the minimal assumption that all missing data mechanisms are equally possible, thus making all values within the intervals [p_inf(c_j | e), p_sup(c_j | e)] equally likely. In this way, we summarize the interval into an average point by defining the score

$$s_u(c_j \mid e) = p_{\sup}(c_j \mid e) - k \left[ p_{\sup}(c_j \mid e) - p_{\inf}(c_j \mid e) \right] = (1 - k)\, p_{\sup}(c_j \mid e) + k\, p_{\inf}(c_j \mid e),$$

where k is chosen so that the scores {s_u(c_j | e)} satisfy the properties of Eq. (13). Hence,

$$k = \frac{1 - \sum_h p_{\inf}(c_h \mid e)}{\sum_h \left( p_{\sup}(c_h \mid e) - p_{\inf}(c_h \mid e) \right)}.$$
A consequence of the consistency of the probability intervals [p_inf(c_j | e), p_sup(c_j | e)] is that the extreme probabilities p_inf(c_j | e) and p_sup(c_j | e) and the probability p(c_j | e) that we would compute from a complete training set D are in the relationship p_inf(c_j | e) ≤ p(c_j | e) ≤ p_sup(c_j | e). It follows that Σ_j p_inf(c_j | e) ≤ 1 ≤ Σ_j p_sup(c_j | e) and, hence, that the quantity k is in the open interval (0, 1). This last finding guarantees that the score s_u(c_j | e) is in the interior of the interval [p_inf(c_j | e), p_sup(c_j | e)] and, consequently, that it cannot produce a classification rule that does not correspond to any E-admissible classification rule compatible with the intervals [p_inf(c_j | e), p_sup(c_j | e)] [12]. Note that Hurwicz's Optimism-Pessimism criterion, the usual solution for these circumstances [14,16], does not guarantee this property. As the score s_u(c_j | e) always leads to a decision, we term it a complete-admissible score.
Definition 4 (Complete-admissible score). Given a set of q consistent posterior probability intervals [p_inf(c_j | e), p_sup(c_j | e)], we call the quantity

$$s_u(c_j \mid e) = p_{\sup}(c_j \mid e) - \frac{\left( p_{\sup}(c_j \mid e) - p_{\inf}(c_j \mid e) \right) \left( 1 - \sum_h p_{\inf}(c_h \mid e) \right)}{\sum_h \left( p_{\sup}(c_h \mid e) - p_{\inf}(c_h \mid e) \right)}$$

a complete-admissible score.
It is worth noting that the classification based on the complete-admissible score subsumes the one based on the strong dominance score because, if p_inf(c_j | e) > p_sup(c_h | e) for all h ≠ j, then s_u(c_j | e) > s_u(c_h | e) for all h ≠ j. When the condition to apply the strong dominance score does not hold, the complete-admissible score lets us classify the cases left unclassified by the strong dominance score. This strategy may result in an increased classification coverage at the price of a lower accuracy.
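The two interval scores can be sketched directly from their definitions. This is our own illustration, not the paper's code: the class labels and interval values are invented, and `intervals` maps each class to its posterior interval (p_inf, p_sup).

```python
def strong_dominance(intervals):
    """Return the class whose lower bound exceeds every other class's upper
    bound, or None when the intervals overlap (the undecidable case)."""
    for j, (lo_j, _) in intervals.items():
        if all(lo_j > hi_h for h, (_, hi_h) in intervals.items() if h != j):
            return j
    return None

def complete_admissible(intervals):
    """Complete-admissible scores of Definition 4:
    s_u = p_sup - k (p_sup - p_inf), k = (1 - sum p_inf) / sum of widths.
    Consistency of the intervals guarantees k in (0, 1), so every score
    falls strictly inside its interval."""
    k = ((1.0 - sum(lo for lo, _ in intervals.values()))
         / sum(hi - lo for lo, hi in intervals.values()))
    return {c: (1 - k) * hi + k * lo for c, (lo, hi) in intervals.items()}

intervals = {"a": (0.55, 0.80), "b": (0.10, 0.35), "c": (0.05, 0.20)}
winner = strong_dominance(intervals)     # "a" dominates: 0.55 > 0.35 and 0.20
scores = complete_admissible(intervals)  # "a" also gets the highest score
```

When strong dominance applies, as here, both rules agree; the complete-admissible score simply keeps producing a ranking when the intervals overlap.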
4.3. Which score?
Both the strong dominance and the complete-admissible score provide a sensible basis
for robust classification. Strong dominance is safe at the price of leaving cases unclassified
while the complete-admissible score increases the classification coverage by losing
robustness. The choice of an interval-scoring method depends on the features of the
problem at hand and, in this section, we provide a principled way to choose the best
interval-based classification strategy.
A classification system is typically evaluated on the basis of its classification accuracy γ and its coverage τ. The former is the probability of correctly classifying a case, while the latter is the probability of classifying a case at all. Let γ_d and τ_d be, respectively, the accuracy and the coverage of an RBC with the strong dominance score (RBC_d). The accuracy γ_d is independent of the missing data mechanism. Similarly, let γ_u be the accuracy of the RBC with the complete-admissible score, say RBC_u. The accuracy γ_u of the RBC_u is given by two components. The first component is the probability of correctly classifying a case when we can use the strong dominance score and, hence, it is weighted by the coverage τ_d. The second component is the probability of correctly classifying a case when we cannot use the strong dominance score and, therefore, it is weighted by 1 − τ_d. Thus,

$$\gamma_u = \gamma_d \tau_d + \gamma_{ul} (1 - \tau_d), \tag{14}$$

where γ_ul is the classification accuracy of the RBC_u on the cases left unclassified by the RBC_d and we term it residual accuracy. Residual accuracy provides a measure of the gain/loss of classification accuracy achieved by the RBC_u when one relaxes the strong dominance criterion to increase coverage.
The decomposition in Eq. (14) provides a first basis to choose the scoring method. For example, a simple rule could be to adopt the complete-admissible score if γ_ul is greater than 1/q, so that the cases left unclassified by the strong dominance score are classified by the complete-admissible score better than at random. The intuition behind this rule is that accuracy is more valuable than coverage and, hence, we would not prefer a method that classifies randomly just because it always classifies a case. The rationale is that we expect the consequence of a wrong classification to be worse than the inability to classify a case. This argument can be used formally to choose between the strong dominance and the complete-admissible score by introducing misclassification costs and costs incurred for the inability to classify a case. Suppose that the cost incurred for not being able to classify a case with attribute values e is a quantity C_i, while the cost for a wrong classification is C_w. Since the former event occurs with probability 1 − τ_d and the latter occurs with probability (1 − γ_d) τ_d, the expected cost incurred on using the RBC_d is

$$C(\mathrm{RBC}_d) = C_w (1 - \gamma_d) \tau_d + C_i (1 - \tau_d)$$
if correct classification yields no cost. On the other hand, the expected cost incurred by an RBC_u achieving 100% coverage with accuracy γ_u is

$$C(\mathrm{RBC}_u) = C_w (1 - \gamma_u).$$

In order to minimize the cost, RBC_d is to be preferred to RBC_u when C(RBC_d) ≤ C(RBC_u). This is true if and only if γ_u − γ_d τ_d = γ_ul (1 − τ_d) ≤ (1 − τ_d)(1 − C_i/C_w), and it yields the decision rule given in the next theorem.
Theorem 4. Let C_w and C_i denote, respectively, the cost of a wrong classification and the cost of not being able to classify a case. The interval-based classification rule which uses the strong dominance score yields minimum expected cost if and only if

$$\gamma_{ul} \leq 1 - C_i/C_w,$$

where γ_ul is the accuracy of the RBC_u on the cases left unclassified by the RBC_d.
For example, if C_i = C_w, the best decision is to choose the RBC_u whenever γ_ul > 0. Compared to the simpler rule described above, the decision now takes into account the trade-off between accuracy and coverage. In practical applications, the quantities γ_d, γ_u and τ_d can be estimated from the available data using cross-validation, as shown in the next section. Suppose now the quantity γ_a is the accuracy of any other NBC_a trained
on an incomplete data set under some assumption about the missing data mechanism. For example, γ_a could be the accuracy of an NBC trained on an incomplete data set under the assumption that data are MAR. We can use the same decision rule to help one decide whether the RBC with the strong dominance or the complete-admissible score yields minimum expected costs. As a by-product, the decision rule can be interpreted as an evaluation of the consequences of enforcing the MAR assumption. The comparison between the accuracy measures γ_a and γ_u is cost-independent, as we compare C(RBC_u) = C_w (1 − γ_u) and C(NBC_a) = C_w (1 − γ_a), and the minimum expected cost is achieved by the system having the highest accuracy. If we now compare the expected costs of the RBC_d and the NBC_a and apply the decision rule in Theorem 4, we have that the NBC_a is to be preferred to the RBC_d whenever γ_al > 1 − C_i/C_w, and the quantity (1 − τ_d)[γ_al − (1 − C_i/C_w)] is the cost incurred in enforcing the assumption about the missing data mechanism. This solution can be easily extended to cases in which misclassification costs vary with the classes.
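The expected-cost comparison of this section can be sketched numerically. This is our own illustration, not the paper's code: the helper names are invented, γ is written `gamma` and τ is written `tau`, and the figures reproduce the Annealing row of Tables 1 and 2.

```python
def cost_rbc_d(gamma_d, tau_d, c_w, c_i):
    """Expected cost of RBC_d: wrong classifications on the covered cases
    plus the penalty C_i on the 1 - tau_d cases left unclassified."""
    return c_w * (1 - gamma_d) * tau_d + c_i * (1 - tau_d)

def cost_full_coverage(gamma, c_w):
    """Expected cost of a 100%-coverage classifier (RBC_u or NBC_a)."""
    return c_w * (1 - gamma)

def prefer_rbc_d(gamma_ul, c_w, c_i):
    """Theorem 4: RBC_d minimizes expected cost iff gamma_ul <= 1 - C_i/C_w."""
    return gamma_ul <= 1 - c_i / c_w

# Annealing: RBC_d accuracy 0.9753 with coverage 0.4912; the residual
# accuracy of the complete-admissible score on the remaining cases is 0.9596.
gamma_d, tau_d, gamma_ul = 0.9753, 0.4912, 0.9596
gamma_u = gamma_d * tau_d + gamma_ul * (1 - tau_d)   # Eq. (14)
```

With C_i/C_w = 0.05, for instance, the threshold 1 − C_i/C_w = 0.95 falls below γ_ul = 0.9596, so RBC_u is the cheaper choice on this data set.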
5. Evaluation
This section reports the results of an experimental evaluation of the RBC on twenty
incomplete data sets. The aim of the evaluation is to compare the performance of the
RBC with that of two NBCs using the most common solutions to handle missing data [6,15]: remove the missing entries (NBC_m) and assign the missing entries to a dummy state (NBC_?). Since all data sets always report the classes for every case, by Theorem 1 NBC_m is a faithful implementation of the MAR assumption. NBC_?, on the other hand, assumes some knowledge of the missing data mechanism, since the missing data are treated as a category "other" not reported in the observed data.
Table 1
Accuracy of NBC_m, NBC_?, RBC_d and RBC_u. Maximum values are reported in boldface

|    | Database       | NBC_m Accuracy | NBC_? Accuracy | RBC_d Accuracy | RBC_d Coverage | RBC_u Accuracy |
|----|----------------|----------------|----------------|----------------|----------------|----------------|
| 1  | Adult          | 81.74 ± 0.23   | 81.22 ± 0.22   | **86.51 ± 0.21** | 81.72 ± 0.18 | 82.50 ± 0.20 |
| 2  | Annealing      | 86.54 ± 2.88   | 80.88 ± 3.32   | **97.53 ± 1.51** | 49.12 ± 4.87 | 96.73 ± 1.24 |
| 3  | Arythmia       | 64.40 ± 2.25   | 61.05 ± 2.76   | **76.09 ± 3.25** | 39.82 ± 2.30 | 66.19 ± 2.33 |
| 4  | Audiology      | 58.34 ± 3.49   | 55.50 ± 3.51   | **63.41 ± 5.32** | 34.78 ± 3.48 | 55.50 ± 3.51 |
| 5  | Automobile     | 60.48 ± 3.41   | 58.05 ± 3.45   | **68.49 ± 3.84** | 71.22 ± 3.16 | 61.96 ± 3.39 |
| 6  | B.Cancer       | 97.42 ± 0.66   | 97.42 ± 0.66   | **97.49 ± 0.67** | 99.65 ± 5.23 | 97.23 ± 0.67 |
| 7  | Bridge         | 67.62 ± 4.57   | 64.76 ± 4.66   | **80.00 ± 4.78** | 66.67 ± 4.60 | 69.52 ± 4.49 |
| 8  | Credit         | 84.88 ± 1.30   | 84.88 ± 1.30   | **87.48 ± 1.72** | 95.40 ± 5.21 | 84.70 ± 1.31 |
| 9  | Cylinder       | 73.70 ± 3.71   | 73.00 ± 3.74   | **91.71 ± 4.14** | 31.30 ± 6.97 | 74.26 ± 0.67 |
| 10 | Echocardiogram | 87.23 ± 2.94   | 88.54 ± 2.78   | **93.58 ± 2.35** | 83.21 ± 3.27 | 88.54 ± 2.78 |
| 11 | Heart-C        | 54.13 ± 2.86   | 53.80 ± 2.86   | **58.97 ± 2.89** | 95.71 ± 1.16 | 58.07 ± 2.83 |
| 12 | Heart-H        | 83.33 ± 2.00   | 81.29 ± 2.27   | **85.88 ± 2.11** | 86.73 ± 1.98 | 83.67 ± 2.11 |
| 13 | Heart-S        | 38.29 ± 4.38   | 36.59 ± 4.34   | **47.37 ± 11.45** | 15.44 ± 3.26 | 42.28 ± 4.45 |
| 14 | Hepatitis      | 85.03 ± 2.09   | 85.16 ± 2.08   | **90.50 ± 2.84** | 76.45 ± 9.73 | 85.55 ± 2.08 |
| 15 | Horse Colic    | 75.79 ± 1.62   | 75.79 ± 1.63   | **92.56 ± 0.59** | 6.51 ± 2.05  | 77.73 ± 1.61 |
| 16 | KDD99          | 84.68 ± 0.52   | 84.80 ± 0.50   | **89.22 ± 0.67** | 45.42 ± 0.70 | 84.85 ± 0.52 |
| 17 | L.Cancer       | 43.75 ± 17.88  | 43.75 ± 17.88  | **46.67 ± 17.85** | 93.75 ± 8.66 | 43.75 ± 17.88 |
| 18 | Mushrooms      | 98.53 ± 0.12   | 98.40 ± 0.12   | **99.04 ± 0.15** | 98.88 ± 1.53 | 98.70 ± 0.11 |
| 19 | Sick           | 91.60 ± 0.51   | 90.87 ± 0.53   | **97.53 ± 0.34** | 86.30 ± 2.29 | 92.46 ± 0.49 |
| 20 | Vote           | 90.02 ± 1.05   | 90.21 ± 1.04   | **92.05 ± 1.75** | 94.94 ± 6.47 | 90.21 ± 1.04 |
Given a classification system with estimated accuracy γ_a, its residual accuracy on the cases left unclassified by RBC_d is

$$\gamma_{al} = \frac{\gamma_a - \gamma_d \tau_d}{1 - \tau_d},$$

where γ_a is the (estimated) accuracy of NBC_m, NBC_? or RBC_u, while γ_d and τ_d are the estimated accuracy and coverage of RBC_d. Table 2 reports these accuracy values for NBC_m, NBC_?, and RBC_u in the data sets used in this experiment. The sixth column reports the maximum cost ratio C_i/C_w that makes RBC_d the best classification system in terms of minimum expected costs and, for reference, the last two columns note the proportion of cases left unclassified by RBC_d and the size of the database. If the cost ratio is higher than the reported value, then the best system is the one with the highest accuracy γ_al, and it is reported in boldface in the table.
In the data sets B.Cancer and Credit, RBC_d is the best choice if C_w > 4.44 C_i and C_w > 1.45 C_i, respectively. If these conditions are not satisfied, then NBC_m or, equivalently, NBC_? are the best systems. In the B.Cancer data set, the complete-admissible score
Table 2
Residual accuracy of NBC_m, NBC_? and RBC_u. The sixth column reports the maximum value of the cost ratio C_i/C_w that makes RBC_d the classification system with minimum expected cost. If the cost ratio C_i/C_w is superior to this value, then the system corresponding to the boldfaced accuracy is the best choice. The last two columns report the percentage of cases left unclassified by RBC_d and the database size

|    | Database       | γ_ml       | γ_?l       | γ_ul       | C_i/C_w | (1 − τ_d) · 100 | Size  |
|----|----------------|------------|------------|------------|---------|-----------------|-------|
| 1  | Adult          | 0.6040     | 0.5757     | **0.6457** | 0.3543  | 18.28 | 48842 |
| 2  | Annealing      | 0.7593     | 0.6481     | **0.9596** | 0.0404  | 50.88 | 798   |
| 3  | Arythmia       | 0.5666     | 0.5100     | **0.5964** | 0.4036  | 60.18 | 452   |
| 4  | Audiology      | **0.5564** | 0.5128     | 0.5128     | 0.4436  | 65.22 | 200   |
| 5  | Automobile     | 0.4066     | 0.3221     | **0.4580** | 0.5420  | 28.78 | 205   |
| 6  | B.Cancer       | **0.7749** | **0.7749** | 0.2320     | 0.2251  | 0.35  | 699   |
| 7  | Bridge         | 0.4286     | 0.3428     | **0.4856** | 0.5144  | 33.33 | 105   |
| 8  | Credit         | **0.3096** | **0.3096** | 0.2705     | 0.6904  | 4.60  | 598   |
| 9  | Cylinder       | 0.6448     | 0.6549     | **0.6631** | 0.3369  | 68.70 | 512   |
| 10 | Echocardiogram | 0.5576     | **0.6356** | **0.6356** | 0.3644  | 16.79 | 131   |
| 11 | Heart-C        | 0.0000     | 0.0000     | **0.3799** | 0.6201  | 4.29  | 303   |
| 12 | Heart-H        | 0.6667     | 0.5129     | **0.6923** | 0.3077  | 13.27 | 294   |
| 13 | Heart-S        | 0.3663     | 0.3462     | **0.4135** | 0.5865  | 84.56 | 123   |
| 14 | Hepatitis      | 0.6782     | 0.6727     | **0.6948** | 0.3052  | 23.55 | 155   |
| 15 | Horse Colic    | 0.7462     | 0.7462     | **0.7670** | 0.2330  | 93.49 | 368   |
| 16 | L.Cancer       | 0.0000     | 0.0000     | 0.0000     | 1.0005  | 6.25  | 32    |
| 17 | KDD99          | 0.8090     | 0.8112     | **0.8121** | 0.1878  | 54.58 | 4704  |
| 18 | Mushrooms      | 0.4190     | 0.5350     | **0.6868** | 0.3132  | 1.22  | 8124  |
| 19 | Sick           | 0.4892     | 0.5424     | **0.6052** | 0.3948  | 15.70 | 2800  |
| 20 | Vote           | **0.5569** | 0.5193     | **0.5569** | 0.4431  | 5.06  | 435   |
performs very poorly on the cases left unclassified by the strong dominance score, while the enforcement of the MAR assumption allows the standard NBC to exploit the information provided by the available data and reach a residual accuracy of 77.49%. This data set has, however, only 6 cases with missing entries. In the Credit data set, NBC_m, NBC_? and RBC_u achieve a residual accuracy lower than 50%, so that, if the misclassification cost is lower than 1.45 C_i, a random assignment of the cases left unclassified by RBC_d is preferable. In Audiology, RBC_d is the minimum expected cost system if C_w > 2.25 C_i; when the condition on the cost ratio is not satisfied, NBC_m is the classification system to adopt. In the data set L.Cancer, the residual accuracy γ_al is null for all systems, and hence the choice of RBC_d is never under discussion. This is also confirmed by the fact that the maximum value of the cost ratio C_i/C_w, which makes RBC_d the system with minimum expected cost, is 1.005. Hence, RBC_d is the best whenever C_w > 0.995 C_i. As this data set is of medical nature, one can
hardly imagine a situation in which not making an automatic analysis is less costly than making a wrong one. In the remaining data sets, RBC u is always the second best choice if the cost ratio Ci/Cw is superior to the value reported in the sixth column of the table. In the data set Annealing, for example, if the cost for not classifying a case is smaller than about 25 times (given by 1/0.0404) the cost for a wrong classification, RBC u is the best choice and achieves an accuracy of 0.9596 on the cases left unclassified by RBC d. This is a gain of about 20% compared to NBC m. Again, this result confirms that the MAR assumption has a negative effect on the accuracy for this data set. A similar result is shown in the Mushrooms data set, in which either assigning the missing entries to a dummy value or enforcing the MAR assumption yields an essentially random classification of the cases left unclassified by RBC d, while the use of the complete-admissible score raises the residual accuracy to 68.68%. The Sick data set reports a similar result, while the accuracy of RBC u is only slightly superior to that of the NBC in the data sets Automobile, Cylinder, Hepatitis, and Horse Colic, and shows no gain in the Vote data set.
These results suggest that the RBC based on the strong dominance criterion delivers the highest accuracy, at the risk of decreased coverage. The use of the complete-admissible score improves coverage at the price of some accuracy, and it appears to achieve better results than the standard solutions, except when the proportion of missing data is small. There does not, however, seem to be a consistently superior classifier, and the solution to adopt needs to take into account features of the data at hand. Nonetheless, our decision theoretic approach provides a principled way to choose the most appropriate solution.
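The cost comparison underlying Table 2 can be sketched in a few lines of code. This is an illustrative reading of the rule, not the paper's implementation: it assumes that refusing to classify a case costs Ci, that labelling it with a fallback classifier of residual accuracy a has expected cost (1 - a)Cw, and that refusal is preferable exactly when Ci/Cw < 1 - a. The function name is ours; the numbers reproduce the Annealing row.

```python
# Illustrative sketch of the cost-ratio rule suggested by Table 2.
# Assumption: refusing to classify costs Ci; labelling a residual case
# with a fallback classifier of residual accuracy `a` has expected
# cost (1 - a) * Cw. Refusal wins when Ci < (1 - a) * Cw, i.e. when
# the cost ratio Ci/Cw falls below the threshold 1 - a.

def refusal_is_cheaper(residual_accuracy, ci, cw):
    """True when leaving a case unclassified beats labelling it."""
    return ci < (1.0 - residual_accuracy) * cw

# Annealing: best residual accuracy is RBC u at 0.9596, so the
# threshold on Ci/Cw is 1 - 0.9596 = 0.0404, i.e. Cw > ~25 * Ci.
print(refusal_is_cheaper(0.9596, ci=1.0, cw=30.0))  # True: keep refusals
print(refusal_is_cheaper(0.9596, ci=1.0, cw=10.0))  # False: use RBC u
```

With Cw equal to 30 times Ci the ratio Ci/Cw = 0.033 lies below the 0.0404 threshold, so refusing is cheaper; with Cw only 10 times Ci it is not, matching the discussion of the Annealing row above.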
6. Conclusions
This paper introduced the RBC: a generalization of the standard NBC that is robust with respect to the missing data mechanism. The RBC performs the training step from an incomplete data set, resulting in a classification system quantified by tight consistent probability intervals. The RBC then classifies new cases by reasoning with probability intervals. We provided an interval propagation algorithm to identify bounds on the set of class posterior probabilities that can be computed from all possible completions of the data, and two scoring methods for interval-based classification. The choice of the scoring method that best suits the problem at hand is based on a decision-theoretic rule that takes into account the cost of mis-classification and the cost incurred for not being able to classify a case, and can be extended to a cost analysis of the implications of the MAR assumption for the classification accuracy. The experimental evaluation showed the gain in accuracy that can be achieved by the RBC compared to standard solutions. However, the results also showed that there is no uniformly better classification strategy when the data are incomplete, and we expect that a principled way to choose the solution that best suits the problem at hand will become common practice in real applications.
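The strong dominance criterion used throughout the experiments can be illustrated with a small sketch. This is not the paper's code: under the usual credal-classification reading that we assume here, a class is returned only when its posterior lower bound exceeds the upper bound of every rival class, and the case is otherwise left unclassified. The function name and interval values are illustrative.

```python
# Illustrative sketch of interval-based classification by strong
# dominance (an assumption about the criterion, not the paper's code):
# a class wins only if its posterior lower bound beats every other
# class's upper bound; with overlapping intervals the case is refused.

def strong_dominance(intervals):
    """intervals: dict mapping class label -> (lower, upper) bounds."""
    for c, (lo, _) in intervals.items():
        if all(lo > hi for other, (_, hi) in intervals.items() if other != c):
            return c
    return None  # overlapping intervals: leave the case unclassified

print(strong_dominance({"A": (0.6, 0.8), "B": (0.2, 0.4)}))  # A
print(strong_dominance({"A": (0.3, 0.7), "B": (0.4, 0.6)}))  # None
```

The second call shows how the criterion trades coverage for accuracy: whenever the posterior intervals overlap, no class dominates and the case stays unclassified.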
Although the robust solution that we presented in this paper is limited to the NBC, it is straightforward to extend it to tree-structured classification systems in which attributes are binary and the classification problem is to choose between two classes. This can be done by training the classifier with the RBE and by computing bounds on the posterior probability of the classes using the 2U algorithm of [5]. The classification can then be performed by choosing one of the interval-based classification rules that we presented here, in the same principled way. The extension to more general classification models is the real challenge, and it essentially requires the development of interval propagation algorithms that return sufficiently tight bounds on the class posterior probabilities. The methods described in this paper have been implemented in a computer program 1 distributed, to date, in over 2000 copies.
Acknowledgements
The authors are very grateful to Paul Snow for his invaluable contribution to the development of the complete-admissible score, and to the anonymous referees and the editor for
their helpful suggestions.
References
[1] C. Blake, E. Keogh, C.J. Merz, UCI Repository of machine learning databases, University of California, Irvine, Department of Information and Computer Sciences, 1998.
[2] L. Campos, J. Huete, S. Moral, Probability intervals: A tool for uncertain reasoning, Internat. J. Uncertainty, Fuzziness and Knowledge-Based Systems 2 (1994) 167–196.
[3] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. Royal Statist. Soc. Ser. B 39 (1977) 1–38.
[4] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[5] E. Fagiuoli, M. Zaffalon, 2U: An exact interval propagation algorithm for polytrees with binary variables, Artificial Intelligence 106 (1998) 77–108.
[6] N. Friedman, D. Geiger, M. Goldszmidt, Bayesian network classifiers, Machine Learning 29 (1997) 131–163.
[7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1984) 721–741.
[8] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proc. IJCAI-95, Montreal, Quebec, Morgan Kaufmann, San Francisco, CA, 1995, pp. 1146–1151.
[9] R. Kohavi, B. Becker, D. Sommerfield, Improving simple Bayes, in: M. van Someren, G. Widmer (Eds.), Poster Papers of the ECML-97, Charles University, Prague, 1997, pp. 78–87.
[10] H.E. Kyburg, Rational belief, Behavioral and Brain Sciences 6 (1983) 231–273.
[11] P. Langley, W. Iba, K. Thompson, An analysis of Bayesian classifiers, in: Proc. AAAI-92, San Jose, CA, AAAI Press, Menlo Park, CA, 1992, pp. 223–228.
[12] I. Levi, On indeterminate probabilities, J. Philos. 71 (1974) 391–418.
[13] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, 1987.
[14] M. Pittarelli, An algebra for probabilistic databases, IEEE Transactions on Knowledge and Data Engineering 6 (2) (1994) 293–303.
[15] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, 1993.
[16] M. Ramoni, Ignorant influence diagrams, in: Proc. IJCAI-95, Montreal, Quebec, Morgan Kaufmann, San Francisco, CA, 1995, pp. 1808–1814.
[17] M. Ramoni, P. Sebastiani, Bayesian methods, in: M. Berthold, D.J. Hand (Eds.), Intelligent Data Analysis. An Introduction, Springer, New York, 1999, pp. 129–166.
[18] M. Ramoni, P. Sebastiani, Robust learning with missing data, Machine Learning (2000), to appear.
[19] R.T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[20] S. Russell, J. Binder, D. Koller, K. Kanazawa, Local learning in probabilistic networks with hidden variables, in: Proc. IJCAI-95, Montreal, Quebec, Morgan Kaufmann, San Francisco, CA, 1995, pp. 1146–1151.
[21] D.J. Spiegelhalter, R.G. Cowell, Learning in probabilistic expert systems, in: J. Bernardo, J. Berger, A.P. Dawid, A.F.M. Smith (Eds.), Bayesian Statistics 4, Oxford University Press, Oxford, UK, 1992, pp. 447–466.
[22] D.J. Spiegelhalter, S.L. Lauritzen, Sequential updating of conditional probabilities on directed graphical structures, Networks 20 (1990) 157–224.

1 Available from Bayesware Limited (www.bayesware.com).