A Bayesian Interpretation of The Confusion Matrix
DOI 10.1007/s10472-017-9564-8
Olivier Caelen (olivier.caelen@worldline.com)
1 Introduction
The confusion matrix [1, 9] is typically used in machine learning to evaluate or to visualize
the behavior of models in supervised classification contexts [7]. It is a square matrix in
which the rows represent the actual class of the instances and the columns their predicted
class. If we are handling a binary classification task, then the confusion matrix is a 2 × 2
matrix that reports the number of true positives (#TP), true negatives (#TN), false positives
(#FP), and false negatives (#FN) as follows:
$$\begin{pmatrix} \#TP & \#FN \\ \#FP & \#TN \end{pmatrix}. \tag{1}$$
This matrix contains all the raw information about the predictions done by a classification
model on a given data set. To evaluate the generalization accuracy of a model, it is common
to use a testing data set which was not used during the learning process of said model.
Many synthetic one-dimensional performance indicators can be extracted from a confusion
matrix. The performance indicator can be, for example, the precision, the recall, or the F-score.
When different kinds of errors are not assumed to be equal, in association with a 2×2 cost
matrix, cost-sensitive performance indicators [3] can also be computed from the confusion
matrix. The choice of the suitable performance indicator is directly linked to the objective
of the learning problem.
Let us assume that we have two models and we want to select the best one according to
a given indicator. Performance indicators are scalar numbers computed from the confusion
matrix. Assume that the F-scores of the two models are respectively 0.6 and 0.65. Classical
methods give no information about the confidence we can have about these values. As
performance indicators are intrinsically generated by a random process, we can’t be sure
that the model with the highest indicator (i.e. 0.65) is really the best one. We would like
to find a way to quantify this uncertainty. One of the known techniques to estimate the
variability of an indicator is the bootstrap method [2].
Note that we are not trying to estimate the generalization accuracy of a whole learning
algorithm. Instead, we study the generalization accuracy of a given model. In this
paper, we assume that we don’t have access to the training set or to the testing set. We only
have access to the confusion matrix and, on this basis, we try to deduce some properties of
the underlying distribution of the model’s performance indicators.
In this paper, we propose to use Bayesian techniques [5] on the confusion matrix. For
this, we will assume that the values in the confusion matrix are coming from a multinomial
distribution [4]. In this parametric context, Bayesian techniques allow us to take into account
the intrinsic variability of the unknown parameters of the multinomial distribution. The
variability of these parameters will be measured by a probability function, and we will see
that this function can be modeled by a Dirichlet distribution [4]. This Dirichlet distribution
can be used in the multinomial distribution to obtain information about the distribution of
the values in the confusion matrix. As the performance indicators come from the confusion
matrix, it gives us information about the indicators’ distribution. This makes it possible to
compute metrics about the uncertainty that concerns any performance indicator. The use of
a Bayesian framework also allows us to inject a priori knowledge in the confusion matrix.
We will see that injecting prior knowledge can have a positive impact on the a posteriori
distribution associated to an indicator, which is especially true when the number of measures
in the confusion matrix is low.
To the best of the author’s knowledge, [6] is the only paper which proposes to use
Bayesian methods to assess the confidence of indicators computed from the confusion
matrix. However, in [6], the framework may not be applicable to arbitrary performance
measures. In our paper, we extend the work by using Dirichlet distributions, which gives us
a way to generalize the method for any performance indicator extracted from a confusion
matrix. We study the impact on the a posteriori distribution when a priori knowledge is
injected, and we also compare our method with the bootstrap techniques.
This paper is structured as follows. The learning problem is first formalized in Section 2.
We start Section 3 by assuming that the data in the confusion matrix are generated from a
multinomial distribution, and we end this section by showing how to compute the a posteriori
distribution of any performance indicator generated from this confusion matrix. We
illustrate our method on a simple synthetic example in Section 4. By using Bayesian techniques,
we have the possibility of injecting prior knowledge in the confusion matrix. This
is the topic of Section 5. In Section 6, the bootstrap method is theoretically compared to our
method based on Bayesian techniques. Section 7 contains experimental results on real and
synthetic data sets. We end with the conclusions in Section 8.
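– recall $= \dfrac{\#TP}{\#TP + \#FN}$
– $F_\beta$-score $= (1 + \beta^2)\,\dfrac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$, where $\beta \in \mathbb{R}^+$
– G-score $= \sqrt{\text{precision} \cdot \text{recall}}$
– MCC $= \dfrac{\#TP \cdot \#TN - \#FP \cdot \#FN}{\sqrt{(\#TP + \#FP)(\#TP + \#FN)(\#TN + \#FP)(\#TN + \#FN)}}$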
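The indicators listed above can be sketched as plain functions of the four counts. This is a minimal illustration and not code from the paper: the function names are ours, the counts are assumed to be ordered as (#TP, #TN, #FP, #FN), the precision is taken as #TP/(#TP + #FP), the G-score is taken as the geometric mean of precision and recall, and no guard against zero denominators is included.

```python
import math

def precision(v):
    tp, tn, fp, fn = v
    return tp / (tp + fp)

def recall(v):
    tp, tn, fp, fn = v
    return tp / (tp + fn)

def f_beta(v, beta=1.0):
    # Weighted harmonic mean of precision and recall.
    p, r = precision(v), recall(v)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def g_score(v):
    # Assumption: geometric mean of precision and recall.
    return math.sqrt(precision(v) * recall(v))

def mcc(v):
    # Matthews Correlation Coefficient.
    tp, tn, fp, fn = v
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

# Illustrative counts, ordered (tp, tn, fp, fn).
v_example = (65, 30, 35, 15)
indicators = {f.__name__: f(v_example) for f in (precision, recall, f_beta, g_score, mcc)}
```

Keeping every indicator as a function of the same count vector makes it easy to evaluate several of them on one sampled confusion matrix.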
The testing set T contains NT independent and identically distributed random samples.
As the classifier h is tested on T , the scalar I (V ) is itself a random variable from which
we want to deduce some properties of the underlying distribution. To do that, we propose
to adopt a Bayesian framework on the confusion matrix V . We will first consider that the
values in the confusion matrix are coming from a multinomial distribution with unknown
parameters. The use of a Bayesian framework gives us the opportunity to assume that
the unknown parameters are generated from a Dirichlet distribution. We can add a priori
knowledge in the Dirichlet probability function to determine an a posteriori probability
distribution of the unknown parameters. This distribution can then be used to compute an
a posteriori distribution of the one-dimensional performance indicators derived from the
confusion matrix.
The output of the loss function $\ell$ can be seen as the result of a random experiment
with {TP, TN, FP, FN} as support set. It is a generalization of a Bernoulli trial in which,
rather than only two outputs for each trial, we have four. As the data in T are independent
and identically distributed random variables, we know that the $N_T$ outputs of $\ell$ are also
independent and identically distributed. After a series of $N_T$ independent random trials, the
vector V can be interpreted as a random vector where the elements #TP, #TN, #FP and
#FN contain the number of times that we observe TP, TN, FP and FN, respectively. The
binomial distribution is the discrete probability distribution of the number of successes in a
sequence of independent Bernoulli trials. The multinomial distribution is a generalization
of a binomial distribution when there are more than two possible outputs at each trial. As
the vector V counts the number of times that we observe TP, TN, FP and FN, the vector
V follows a multinomial distribution where there are four possible outputs at each of the
$N_T$ independent trials.
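In this context, the vector V follows a multinomial distribution
$$V \sim \mathrm{Mult}(N_T, \theta)$$
with the following probability mass function:
$$P(V = v) = \frac{N_T!}{v_1! \cdot v_2! \cdot v_3! \cdot v_4!} \cdot \theta_{tp}^{v_1} \cdot \theta_{tn}^{v_2} \cdot \theta_{fp}^{v_3} \cdot \theta_{fn}^{v_4} \cdot I\!\left(\sum_{i=1}^{4} v_i = N_T\right) \tag{2}$$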
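where $\theta = (\theta_{tp}, \theta_{tn}, \theta_{fp}, \theta_{fn}) \in S_\theta \subset \mathbb{R}^4$ are the unknown parameters of the multinomial
distribution and $v = (v_1, v_2, v_3, v_4)$ is a realization of V with four numbers in $\mathbb{N}$. The set
$S_\theta$ is called the probability simplex and contains all the possible values of θ:
$$S_\theta = \{\, \theta \mid \theta_{tp} \ge 0,\ \theta_{tn} \ge 0,\ \theta_{fp} \ge 0,\ \theta_{fn} \ge 0 \ \text{and}\ \theta_{tp} + \theta_{tn} + \theta_{fp} + \theta_{fn} = 1 \,\}.$$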
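To make the multinomial model concrete, the following sketch evaluates the probability mass function (2) and draws one confusion matrix from it. It is only an illustration: the function name and the value chosen for θ are hypothetical, and the counts are assumed to be ordered as (tp, tn, fp, fn).

```python
import math
import numpy as np

def multinomial_pmf(v, n_t, theta):
    """P(V = v) as in eq. (2), for counts v and parameters theta."""
    if sum(v) != n_t:                    # indicator I(sum_i v_i = N_T)
        return 0.0
    coef = math.factorial(n_t)
    for v_i in v:
        coef //= math.factorial(v_i)     # exact: every partial quotient is an integer
    prob = float(coef)
    for v_i, t_i in zip(v, theta):
        prob *= t_i ** v_i
    return prob

# Hypothetical parameters on the probability simplex, ordered (tp, tn, fp, fn).
theta = (0.45, 0.20, 0.25, 0.10)

# One simulated confusion matrix with N_T = 145 trials.
rng = np.random.default_rng(0)
v_draw = rng.multinomial(145, theta)

# Small case that can be checked by hand: 2!/(1! 1! 0! 0!) * 0.25^2.
p_small = multinomial_pmf((1, 1, 0, 0), 2, (0.25, 0.25, 0.25, 0.25))
```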
In (2), it is assumed that θ is a vector with four fixed unknown parameters. Adopting
a Bayesian point of view, we can consider that θ is a realization of an unknown random
variable Θ. In this Bayesian setting, the left part of (2) becomes $P(V = v \mid \Theta = \theta)$.
The Bayes rule and the law of total probability tell us that
$$\begin{aligned} f_{\Theta|V}(\theta|v) &= \frac{P(V = v \mid \Theta = \theta) \cdot f_\Theta(\theta)}{P(V = v)} \\ &= \frac{P(V = v \mid \Theta = \theta) \cdot f_\Theta(\theta)}{\int_{S_\theta} P(V = v \mid \Theta = \theta) \cdot f_\Theta(\theta) \, d\theta} \end{aligned} \tag{3}$$
where $f_{\Theta|V}(\theta|v)$ is the conditional density function of Θ, given a confusion matrix v.
Note that in (3), the denominator is there to ensure that $\int_{S_\theta} f_{\Theta|V}(\theta|v)\,d\theta = 1$. Consequently,
we only have to evaluate the numerator and to normalize the results so that the
integral at the end equals 1. In this setting, (3) becomes
$$\underbrace{f_{\Theta|V}(\theta|v)}_{\text{A posteriori}} \propto \underbrace{P(V = v \mid \Theta = \theta)}_{\text{Likelihood}} \cdot \underbrace{f_\Theta(\theta)}_{\text{A priori}}. \tag{4}$$
Thanks to the Bayes rule, which allows us to interpret the parameter vector θ as a realization
of a random variable, (4) gives the conditional density function of this variable Θ.
This conditional density is proportional to the product of two terms. The likelihood returns
a plausibility score that the values in the confusion matrix are generated from a multinomial
distribution with parameters θ. The second term is the a priori distribution and, as we will
see, it allows us to inject prior knowledge about the accuracy of the classifier model h. The
a posteriori probability $f_{\Theta|V}(\theta|v)$ is a compromise between the a priori and the likelihood.
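The Dirichlet distribution is the conjugate distribution of the multinomial distribution
[5]. It ensures that if the likelihood follows a multinomial function and the a priori follows
a Dirichlet function, then the a posteriori will also follow a Dirichlet function. The
Dirichlet distribution is commonly used in Bayesian statistics to model the parameters of
a multinomial distribution. In this context, we say that the vector Θ follows a Dirichlet
distribution
$$\Theta \sim \mathrm{Dir}(\alpha) = \mathrm{Dir}((\alpha_1, \alpha_2, \alpha_3, \alpha_4))$$
with the following density function:
$$\begin{aligned} f_\Theta(\theta) &= \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \cdot \theta_{tp}^{\alpha_1 - 1} \cdot \theta_{tn}^{\alpha_2 - 1} \cdot \theta_{fp}^{\alpha_3 - 1} \cdot \theta_{fn}^{\alpha_4 - 1} \cdot I(\theta \in S_\theta) \\ &\propto \theta_{tp}^{\alpha_1 - 1} \cdot \theta_{tn}^{\alpha_2 - 1} \cdot \theta_{fp}^{\alpha_3 - 1} \cdot \theta_{fn}^{\alpha_4 - 1} \cdot I(\theta \in S_\theta) \end{aligned} \tag{5}$$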
Prior knowledge can be injected in the a priori distribution via α = (α1, α2, α3, α4). This topic
will be covered in Section 5.
If in (4), the likelihood and the a priori are respectively replaced by (2) and (5), we have
$$f_{\Theta|V}(\theta|v) \propto \theta_{tp}^{v_1 + \alpha_1 - 1} \cdot \theta_{tn}^{v_2 + \alpha_2 - 1} \cdot \theta_{fp}^{v_3 + \alpha_3 - 1} \cdot \theta_{fn}^{v_4 + \alpha_4 - 1} \cdot I(\theta \in S_\theta).$$
This is the a posteriori probability density function of the unknown parameters. We can
identify that, given the confusion matrix v and the prior α, this a posteriori density function
of Θ follows a Dirichlet distribution
$$\Theta \mid \omega \sim \mathrm{Dir}((v_1 + \alpha_1, v_2 + \alpha_2, v_3 + \alpha_3, v_4 + \alpha_4)) = \mathrm{Dir}(\omega) \tag{6}$$
where $\omega = (\omega_1, \omega_2, \omega_3, \omega_4) = (v_1 + \alpha_1, v_2 + \alpha_2, v_3 + \alpha_3, v_4 + \alpha_4)$ are the parameters of
the a posteriori density function.
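The conjugate update in (6) amounts to an element-wise sum. The sketch below shows it together with the mean of the resulting Dirichlet distribution; the function names are ours, and the counts are borrowed from the example of Section 4 purely for illustration.

```python
import numpy as np

def posterior_parameters(v, alpha):
    """Return omega = v + alpha, the Dirichlet parameters of eq. (6)."""
    return np.asarray(v, dtype=float) + np.asarray(alpha, dtype=float)

def dirichlet_mean(omega):
    """Mean of Dir(omega): omega_i / sum_j omega_j."""
    omega = np.asarray(omega, dtype=float)
    return omega / omega.sum()

v = (65, 30, 35, 15)   # observed counts, ordered (tp, tn, fp, fn)
alpha = (1, 1, 1, 1)   # prior parameters (here: uniform on the simplex)

omega = posterior_parameters(v, alpha)
theta_mean = dirichlet_mean(omega)
```

Because the prior enters only through this sum, α can be read as a vector of pseudo-counts added to the observed confusion matrix.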
In (6), we have an analytical definition of the a posteriori distribution of the unknown
parameters. In the following, we will see how we can use this definition to compute the a
posteriori distribution of V, given the observed confusion matrix v and the prior knowledge
α. Let $\tilde{v}$ be an arbitrary new confusion matrix. Before the data in v are considered, the
probability to observe this unknown matrix $\tilde{v}$ is
$$P(V = \tilde{v}) = \int_{S_\theta} P(V = \tilde{v} \mid \Theta = \theta)\, f_\Theta(\theta)\, d\theta = E_{f_\Theta}\!\left[ P(V = \tilde{v} \mid \Theta) \right].$$
It is often called the a priori predictive distribution. After the matrix v has been observed, we
can compute an a posteriori predictive distribution by replacing $f_\Theta(\theta)$ with the a posteriori
distribution $f_{\Theta|V}(\theta|v)$ as follows:
$$P(V = \tilde{v} \mid v) = \int_{S_\theta} P(V = \tilde{v} \mid \Theta = \theta)\, f_{\Theta|V}(\theta|v)\, d\theta = E_{f_{\Theta|V}}\!\left[ P(V = \tilde{v} \mid \Theta) \right]. \tag{7}$$
The distribution of V|v synthesizes all the relevant information we can have about the confusion
matrix. It shows that the a posteriori predictive function is the expectation of the
conditional probability $P(V = \tilde{v} \mid \Theta = \theta)$ over the a posteriori distribution of Θ. In (7),
$P(V = \tilde{v} \mid \Theta = \theta)$ is obtained by the multinomial model for a given value of the parameters
and $f_{\Theta|V}(\theta|v)$ is the a posteriori distribution of said parameters.
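For the multinomial/Dirichlet pair used here, the integral in (7) has a standard closed form, the Dirichlet-multinomial distribution. It is not written out in the text above and is given here only as a reference point, with $\omega_0 = \sum_i \omega_i$ and $N_T = \sum_i \tilde{v}_i$ introduced as shorthand:

$$P(V = \tilde{v} \mid v) = \frac{N_T!\;\Gamma(\omega_0)}{\Gamma(N_T + \omega_0)} \prod_{i=1}^{4} \frac{\Gamma(\tilde{v}_i + \omega_i)}{\tilde{v}_i!\;\Gamma(\omega_i)}.$$

This expression describes the distribution of the counts only; the distribution of a nonlinear indicator computed from those counts generally still has to be approximated.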
A performance indicator I is a function that maps the values of a confusion matrix
into real number space. Let G = I (V |v) be a random variable associated with the values
taken by the performance indicator function. The density function of G can theoretically
be computed by a reparameterization [10] of the probability function in (7). But most of
the time, the analytical calculation of the distribution of G may be very complex, or even
impossible. Instead, we propose to estimate this distribution by the following Monte Carlo
simulation process, described in algorithm 1.
The inputs of this algorithm are: (i) a performance indicator function I, (ii) a vector α with
prior knowledge, (iii) a vector v with the values of the observed confusion matrix and (iv)
the number M of samples we want to generate. The main loop appears between lines 2 and
6. In line 3, a sample $\theta_m$ is generated from a Dirichlet distribution (see footnote 2) where ω is a vector
containing the sum of the vectors α and v. This sample $\theta_m$ is used in line 4 to extract a
sample $\tilde{v}_m$ from a multinomial distribution. In line 5, $\tilde{v}_m$ is injected in the performance
indicator function I to compute $g_m$. Note that if a correlation study must be done between
the indicators, it should be at this line that the other performance indicators are computed
on the same $\tilde{v}_m$. In Section 4, we will use this technique to evaluate the correlation between
the precision and recall on a synthetic example. At line 7, the algorithm returns a vector
with M samples from G. These M samples can then be used to estimate any statistics about
the distribution of the indicator, such as the mean, variance, skewness, or quantiles.
Note that to generate the M samples from the random variable associated with the performance
indicator G, we don’t need to know the learning set D or the testing set T. Since
the total number of samples has to be known in order to obtain a multinomial distribution,
we only need one observed realization v of the random vector V associated to the confusion
matrix.
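The simulation process described above can be sketched as follows. This is our reading of the textual description of algorithm 1, not the author's code: the function names, the use of NumPy's random generator and the choice of accuracy as the example indicator are assumptions.

```python
import numpy as np

def sample_indicator(indicator, alpha, v, M, rng=None):
    """Draw M Monte Carlo samples of I(V | v), following the description of algorithm 1."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    omega = v + np.asarray(alpha, dtype=float)    # posterior parameters, eq. (6)
    n_t = int(v.sum())                            # size of the testing set
    g = np.empty(M)
    for m in range(M):
        theta_m = rng.dirichlet(omega)            # sample the unknown parameters
        v_m = rng.multinomial(n_t, theta_m)       # sample a new confusion matrix
        g[m] = indicator(v_m)                     # evaluate the indicator on it
    return g

def accuracy(v):
    # Counts are assumed to be ordered (tp, tn, fp, fn).
    tp, tn, fp, fn = v
    return (tp + tn) / (tp + tn + fp + fn)

g = sample_indicator(accuracy, alpha=(0, 0, 0, 0), v=(65, 30, 35, 15),
                     M=2000, rng=np.random.default_rng(0))
```

Note that `rng.dirichlet` requires strictly positive parameters, so with α = (0, 0, 0, 0) this sketch only runs when every cell of v is non-zero.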
4 Example
To illustrate our method, let us consider an example with two classifiers A and B, respectively
producing the two following confusion matrices on the same testing data set T where
$N_T = 145$:
$$\begin{pmatrix} 65 & 15 \\ 35 & 30 \end{pmatrix} \Rightarrow v^A = (65, 30, 35, 15)$$
$$\begin{pmatrix} 50 & 30 \\ 30 & 35 \end{pmatrix} \Rightarrow v^B = (50, 35, 30, 30).$$
The superscript above v indicates the model from which the confusion matrix comes.
If I is the Matthews Correlation Coefficient (MCC), then in our example we have
$I(v^A) = 0.2946$ and $I(v^B) = 0.1635$. Based on this criterion, it seems that the classifier A
outperforms the other one. But we don’t have any information to decide whether this is really
the case or is due to chance.
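For reference, the two MCC values can be checked by hand from the definition of the MCC, reading each vector as (#TP, #TN, #FP, #FN):

$$I(v^A) = \frac{65 \cdot 30 - 35 \cdot 15}{\sqrt{(65 + 35)(65 + 15)(30 + 35)(30 + 15)}} = \frac{1425}{\sqrt{100 \cdot 80 \cdot 65 \cdot 45}} \approx 0.2946$$

$$I(v^B) = \frac{50 \cdot 35 - 30 \cdot 30}{\sqrt{(50 + 30)(50 + 30)(35 + 30)(35 + 30)}} = \frac{850}{80 \cdot 65} \approx 0.1635$$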
Let us now adopt the Bayesian framework with α = (0, 0, 0, 0). From (6), the a
posteriori distributions of the unknown parameters θ are:
$$\alpha = (0, 0, 0, 0) \Longrightarrow \begin{cases} (\Theta \mid v^A, \alpha) \sim \mathrm{Dir}(\omega = (65, 30, 35, 15)) \\ (\Theta \mid v^B, \alpha) \sim \mathrm{Dir}(\omega = (50, 35, 30, 30)) \end{cases}$$
Footnote 2: To generate random samples from a Dirichlet distribution, we can use the following property:
$\forall i \in \{1, \dots, N\},\ X_i \sim \mathrm{gamma}(\varphi_i, 1) \Rightarrow (X_1/X, \dots, X_N/X) \sim \mathrm{Dir}(\varphi_1, \dots, \varphi_N)$ where $X = \sum_i X_i$.
In the left part of Fig. 1, we used $g^A$ and $g^B$ to show the estimated distributions of
$I(V|v^A)$ and $I(V|v^B)$. As expected, model A seems better than model B. Adopting the
Bayesian point of view, $g^A$ and $g^B$ can also be used to infer other quantities such as:
$$P(I(V|v^B) > 0) \approx \frac{1}{M} \sum_{m=1}^{M} I(g_m^B > 0) \approx 0.92$$
$$P(I(V|v^A) > I(V|v^B)) \approx \frac{1}{M} \sum_{m=1}^{M} I(g_m^A > g_m^B) \approx 0.79$$
$$CI_{0.95}(I(V|v^B)) \approx \mathrm{hpd}(g_m^B) \approx [-0.07, 0.39].$$
where CI stands for credible interval. The first value means that we have 92% plausibility
that the MCC of model B is positive. The second value shows that we have a high degree of
plausibility that model A really outperforms model B in terms of MCC. The last equation
gives a 95% credible interval for the MCC of model B. To compute the credible interval,
we extract the highest posterior density (hpd) interval [5]. The hpd is a more robust way to extract
intervals when the distribution is asymmetrical.
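The three quantities above can be estimated from Monte Carlo samples along the following lines. This is a sketch under our own assumptions: the helper names are hypothetical, the MCC is set to 0 when its denominator vanishes, and the hpd interval is approximated by the shortest interval containing 95% of the sorted samples. With a different seed or sample size, the estimates will vary slightly around the values reported in the text.

```python
import numpy as np

def mcc(v):
    tp, tn, fp, fn = (float(x) for x in v)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0 when the MCC is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def sample_mcc(v, alpha, M, rng):
    omega = np.asarray(v, dtype=float) + np.asarray(alpha, dtype=float)
    n_t = int(np.sum(v))
    return np.array([mcc(rng.multinomial(n_t, rng.dirichlet(omega)))
                     for _ in range(M)])

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples."""
    x = np.sort(np.asarray(samples))
    n = x.size
    k = int(np.ceil(mass * n))               # number of points inside the interval
    widths = x[k - 1:] - x[:n - k + 1]       # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(0)
M = 20_000
g_a = sample_mcc((65, 30, 35, 15), (0, 0, 0, 0), M, rng)
g_b = sample_mcc((50, 35, 30, 30), (0, 0, 0, 0), M, rng)

p_b_positive = np.mean(g_b > 0)       # estimate of P(I(V|v^B) > 0)
p_a_beats_b = np.mean(g_a > g_b)      # estimate of P(I(V|v^A) > I(V|v^B))
ci_b = hpd_interval(g_b, 0.95)        # 95% credible interval for model B
```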
It is also possible to study the joint distribution of more than one performance indicator.
As already mentioned, in line 5 of algorithm 1, we can compute the value of two indicators
instead of only one. In the right part of Fig. 1, we display 1,000 samples of the recall and
precision generated from the confusion matrix of model A. Many statistics can be extracted
from this scatter plot. We can, for example, infer that the Pearson correlation has a 95%
chance of being in the interval [0.21, 0.32].
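A joint study of two indicators only requires evaluating both on the same sampled confusion matrix. The sketch below does this for the precision and recall of model A and computes their empirical Pearson correlation; the variable names, the seed and the number of samples are our own choices, and the counts are assumed to be ordered (tp, tn, fp, fn).

```python
import numpy as np

rng = np.random.default_rng(1)
v_a = np.array([65, 30, 35, 15])     # confusion matrix of model A
omega = v_a + np.zeros(4)            # posterior parameters with alpha = (0, 0, 0, 0)
n_t = int(v_a.sum())
M = 20_000

prec = np.empty(M)
rec = np.empty(M)
for m in range(M):
    tp, tn, fp, fn = rng.multinomial(n_t, rng.dirichlet(omega))
    # Both indicators are computed on the same sampled matrix.
    prec[m] = tp / (tp + fp)
    rec[m] = tp / (tp + fn)

pearson = np.corrcoef(prec, rec)[0, 1]
```

The pairs `(rec[m], prec[m])` are the kind of points shown in a precision/recall scatter plot.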
Fig. 1 Left: We compare the estimated distributions $I(V|v^A)$ and $I(V|v^B)$ and we see that the distribution
of model A’s MCC is better. Right: Empirical distribution of the precision and recall
As already mentioned, the Bayesian framework allows us to inject prior knowledge into
the a posteriori. Two distinct situations can occur: either we are in a situation of complete
uncertainty about the distribution, or we have some prior knowledge to inject in the a
posteriori distribution.
In the first case, the Dirichlet probability function $f_\Theta(\theta)$ achieves maximum entropy
when α = (1, 1, 1, 1). From (5), we can see that the a priori distribution $f_\Theta(\theta) \propto I(\theta \in S_\theta)$
is then a uniform distribution on the probability simplex $S_\theta$. This means that if
there is no prior knowledge to inject in the a posteriori, we can simply set all the α values
to one.
In (6), the values α = (α1, α2, α3, α4) and v = (v1, v2, v3, v4) are coming from the a
priori knowledge and the confusion matrix, respectively. The vector ω with the parameters
of the distribution $f_{\Theta|V}(\theta|v)$ is the sum of α and v. The duality between α and v in (6)
suggests that this prior conveys the same information as a fictitious pilot test where the model
would have been tested four times, with α = (1, 1, 1, 1) containing the number of times
TP, TN, FP and FN appear during this test. So, by using this prior, we inject a small
bias in the a posteriori distribution. Therefore, someone choosing a uniform prior is not
in a state of complete ignorance as (s)he discards the possibility of having a value equal
to zero in the confusion matrix. This suggests that we take as prior an improper (see footnote 3) Dirichlet
distribution where α = (0, 0, 0, 0), as it is equivalent to the absence of a pilot study [5].
This α has the advantage of injecting no bias in the a posteriori, but the drawback is that by
using this improper a priori distribution $f_\Theta(\theta)$, we are not guaranteed to obtain a proper a
posteriori distribution $f_{\Theta|V}(\theta|v)$. If one of the elements in the observed confusion matrix v
is a zero, then the a posteriori distribution will also be improper.
Let us take from Section 4 the same confusion matrix $v^A = (65, 30, 35, 15)$, and let us
compare the impact on the a posteriori where α = (0, 0, 0, 0) or α = (1, 1, 1, 1) are used
as prior. From (6), the a posteriori distributions of the unknown parameters θ are:
$$\begin{cases} \alpha^0 = (0, 0, 0, 0) \Longrightarrow (\Theta \mid v^A, \alpha^0) \sim \mathrm{Dir}(\omega = (65, 30, 35, 15)) \\ \alpha^1 = (1, 1, 1, 1) \Longrightarrow (\Theta \mid v^A, \alpha^1) \sim \mathrm{Dir}(\omega = (66, 31, 36, 16)) \end{cases}$$
Algorithm 1 is used to generate M = 1,000,000 samples from both random variables
$I(V|v^A, \alpha^0)$ and $I(V|v^A, \alpha^1)$. In the left part of Fig. 2, we used $g^{A0}$ and $g^{A1}$ to display the
curve of the two distributions. As we can see, there is a minor impact when $\alpha^1 = (1, 1, 1, 1)$
is injected as prior knowledge.
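The comparison between the two priors can be reproduced approximately with far fewer samples. The following sketch (hypothetical helper names, 20,000 samples instead of 1,000,000, MCC set to 0 when undefined) compares the empirical means and standard deviations of the MCC under $\alpha^0$ and $\alpha^1$.

```python
import numpy as np

def mcc(v):
    tp, tn, fp, fn = (float(x) for x in v)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0 when the MCC is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def sample_mcc(v, alpha, M, rng):
    omega = np.asarray(v, dtype=float) + np.asarray(alpha, dtype=float)
    n_t = int(np.sum(v))
    return np.array([mcc(rng.multinomial(n_t, rng.dirichlet(omega)))
                     for _ in range(M)])

rng = np.random.default_rng(0)
v_a = (65, 30, 35, 15)
g_a0 = sample_mcc(v_a, (0, 0, 0, 0), 20_000, rng)   # improper prior alpha^0
g_a1 = sample_mcc(v_a, (1, 1, 1, 1), 20_000, rng)   # uniform prior alpha^1

mean_shift = abs(g_a0.mean() - g_a1.mean())
std_ratio = g_a1.std() / g_a0.std()
```

A small `mean_shift` and a `std_ratio` close to one are what one would expect when four pseudo-counts are added to 145 observations.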
A scientist is often not in a state of perfect ignorance with respect to the performance of
the models that (s)he is using. This prior knowledge can come from the scientist’s experience
or from previous studies done in the same context. The prior knowledge can be injected
directly via α = (α1, α2, α3, α4). For example, as α1 and α2 contain the prior information
about #TP and #TN respectively and if prior studies show that we should have an accurate
model, we can use α = (α1, α2, 1, 1) and set α1, α2 to a high value. This is a very
approximate and imprecise way of adding prior knowledge.
Note that adding prior knowledge could have a negative impact if the knowledge that
was injected was wrong. In this case, a bias would be added in the a posteriori distribution,
but its impact would decline when the number of elements in the testing set NT increases.
Footnote 3: Roughly speaking, improper means that we lose the fundamental property of a probability density function
that its integral must be one.
Fig. 2 Left: we compare the distributions of $I(V|v^A, \alpha^0)$ and $I(V|v^A, \alpha^1)$. We observe that the impact
of using $\alpha^1$ is relatively low. Right: Evolution of the distribution when prior knowledge is injected in the
distribution
Injecting prior knowledge directly via α = (α1, α2, α3, α4) is sometimes difficult, even
for experts in the application area. Therefore, it is often easier to compute the values in
α in an indirect way. Rather than information on α, the scientist often has more knowledge/intuition
about the values of some classical accuracy measures. If we define p =
#TP/(#TP + #FP), r = #TP/(#TP + #FN) and a = (#TP + #TN)/(#TP + #FP +
#FN + #TN) as the precision, the recall and the accuracy respectively, then we can see that
$$\begin{cases} \alpha_1 = \#TP = \dfrac{m(1-a)\,r\,p}{p + r - 2rp} \\[2mm] \alpha_2 = \#TN = am - \#TP \\[2mm] \alpha_3 = \#FP = \dfrac{m(1-a)\,r\,(1-p)}{p + r - 2rp} \\[2mm] \alpha_4 = \#FN = \dfrac{m(1-a)(1-r)\,p}{p + r - 2rp} \end{cases} \tag{8}$$
$$\begin{cases} \#TP = 0.3319149 \times m \\ \#TN = 0.2680851 \times m \\ \#FP = 0.2212766 \times m \\ \#FN = 0.1787234 \times m \end{cases}$$
where m is a parameter used to put a weight on the prior. We tested the two following
values for m: $m = N_T/2 = 72.5$ and $m = N_T \times 5 = 725$, and obtained the following
distribution for the parameters:
$$(\Theta \mid v^B, \alpha^0) \sim \mathrm{Dir}(\omega = (50, 35, 30, 30))$$
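The mapping (8) from (p, r, a, m) to α is easy to sketch in code. The function name below is ours. The values p = 0.6, r = 0.65 and a = 0.6 are not stated in the surviving text; we inferred them because they reproduce the four coefficients listed above, so they should be read as an assumption.

```python
import numpy as np

def alpha_from_rates(p, r, a, m):
    """Prior pseudo-counts (tp, tn, fp, fn) from precision p, recall r, accuracy a, weight m (eq. 8)."""
    denom = p + r - 2 * r * p
    tp = m * (1 - a) * r * p / denom
    tn = a * m - tp
    fp = m * (1 - a) * r * (1 - p) / denom
    fn = m * (1 - a) * (1 - r) * p / denom
    return tp, tn, fp, fn

# Assumed rates (inferred from the coefficients in the text, not quoted from it).
coeffs = alpha_from_rates(p=0.6, r=0.65, a=0.6, m=1.0)

# Posterior parameters for model B with the two prior weights tested in the text.
v_b = np.array([50, 35, 30, 30], dtype=float)
omega_half = v_b + np.array(alpha_from_rates(0.6, 0.65, 0.6, 72.5))
omega_five = v_b + np.array(alpha_from_rates(0.6, 0.65, 0.6, 725.0))
```

By construction the four pseudo-counts add up to m, so m plays the role of the size of a fictitious pilot test.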
The bootstrap method is another technique that can be used to approximate the distribution
of I(V|v) in an easy and accurate way. Note that the bootstrap is often used in machine
learning to estimate the generalization accuracy of a learning algorithm by bootstrapping the
training set. But this is not what we will do here. As we assume in this paper that we don’t
have access to the training set, we will do the bootstrap directly on the confusion matrix.
Given a testing set T and a model h, the vector $\mathcal{M} = (\ell_1, \dots, \ell_{N_T})$ contains the
$N_T$ values of the loss function computed on T and the vector V is an aggregation of $\mathcal{M}$
where we count in #TP, #TN, #FP and #FN the number of times we have observed
TP, TN, FP and FN, respectively. For bootstrap, B samples $\mathcal{M}^{*(1)}, \dots, \mathcal{M}^{*(B)}$ are
generated where $\mathcal{M}^{*(b)} = (\ell_1^{*(b)}, \dots, \ell_{N_T}^{*(b)})$ is obtained from $\mathcal{M}$ by doing $N_T$ random
samplings with replacement. After aggregating $\mathcal{M}^{*(b)}$ in $V^{*(b)}$, the B vectors $V^{*(b)}$ are used
to compute the B values $I(V^{*(b)})$. These B values can then be used to infer properties about
the distribution of the statistic I(V|v). The following algorithm formalizes the process.
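The bootstrap procedure described in this paragraph can be sketched as follows. It is our reconstruction from the prose, not the paper's algorithm listing: the function names are hypothetical, the four outcomes are encoded as the integers 0 to 3 in the order (tp, tn, fp, fn), and accuracy is used only as an example indicator.

```python
import numpy as np

def bootstrap_indicator(indicator, v, B, rng=None):
    """Draw B bootstrap replicates of the indicator from a confusion matrix v."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=int)
    # Rebuild the vector of loss outputs: one label (0..3) per test instance.
    losses = np.repeat(np.arange(4), v)
    n_t = losses.size
    out = np.empty(B)
    for b in range(B):
        resampled = rng.choice(losses, size=n_t, replace=True)   # N_T draws with replacement
        v_star = np.bincount(resampled, minlength=4)              # aggregate into a confusion matrix
        out[b] = indicator(v_star)
    return out

def accuracy(v):
    tp, tn, fp, fn = v
    return (tp + tn) / (tp + tn + fp + fn)

boot = bootstrap_indicator(accuracy, (65, 30, 35, 15), B=2000,
                           rng=np.random.default_rng(0))
```

Only the confusion matrix is needed: the order of the loss outputs is irrelevant for resampling with replacement, so the vector can be rebuilt from the four counts.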
In Section 3, we have assumed that the output of the loss function $\ell$ could be interpreted
as the result of a random experiment with {TP, TN, FP, FN} as support set. Let $\mathcal{P} =
\{P(\ell = k)\}$ with k ∈ {TP, TN, FP, FN} be a set with the four unknown probabilities
associated to the four values in support of the loss function $\ell$. Given the $N_T$ observed
values of the loss function, the maximum likelihood estimator of these probabilities is
$\hat{\mathcal{P}} = \{\hat{P}(\ell = k)\}$ where $\hat{P}(\ell = k) = \#k/N_T$. In the bootstrap method, $\ell^{*(b)}$ is obtained by
doing a random sampling with replacement in $\mathcal{M}$, which is equivalent to doing a sampling
Compared to the Bayesian method studied in this paper, the previous equation shows
that the bootstrap method on M produces samples of the confusion matrix that follow a
multinomial distribution with four fixed parameters. In the Bayesian approach, we assume
that these parameters themselves are uncertain and follow a Dirichlet distribution. Roughly
speaking, we can say that, in bootstrap, these parameters follow a Dirac distribution.
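As a consequence, by replacing the fixed parameters with random values, we can see
that the Bayesian approach has a tendency to have more variability in the distribution of
V. Let $I(V|v)$ and $I(V^*)$ be the random variables for the performance indicator functions
obtained by the Bayesian approach and the bootstrap approach, respectively. By the law of
total variance, we have
$$\begin{aligned} \mathbb{V}\!\left[I(V^*)\right] &= \mathbb{E}\!\left[\mathbb{V}\!\left[I(V^*) \mid \Theta\right]\right] + \mathbb{V}\!\left[\mathbb{E}\!\left[I(V^*) \mid \Theta\right]\right] \\ \mathbb{V}\!\left[I(V|v)\right] &= \mathbb{E}\!\left[\mathbb{V}\!\left[I(V|v) \mid \Theta\right]\right] + \mathbb{V}\!\left[\mathbb{E}\!\left[I(V|v) \mid \Theta\right]\right]. \end{aligned}$$
As there is no variability in the parameters for bootstrap, we have $\mathbb{V}\!\left[\mathbb{E}\!\left[I(V^*) \mid \Theta\right]\right] = 0$
and therefore $\mathbb{V}\!\left[\mathbb{E}\!\left[I(V|v) \mid \Theta\right]\right] > \mathbb{V}\!\left[\mathbb{E}\!\left[I(V^*) \mid \Theta\right]\right]$. As a consequence, $\mathbb{V}\!\left[I(V|v)\right]$
tends to be higher than $\mathbb{V}\!\left[I(V^*)\right]$, and accordingly, the variability of $I(V|v)$ obtained by
the Bayesian approach is often higher than the variability of $I(V^*)$ obtained by the bootstrap
approach.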
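The variance argument can be checked numerically. The sketch below relies on the statement made earlier in this section that bootstrap samples of the confusion matrix follow a multinomial distribution with fixed parameters, which we take to be the observed frequencies v/N_T. The variable names are ours and accuracy is used as the indicator for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([65, 30, 35, 15])      # observed confusion matrix, ordered (tp, tn, fp, fn)
n_t = int(v.sum())
M = 20_000

def accuracy(counts):
    # counts has shape (M, 4); returns one accuracy value per row.
    return (counts[:, 0] + counts[:, 1]) / counts.sum(axis=1)

# Bootstrap: multinomial draws with parameters fixed at the observed frequencies.
boot = rng.multinomial(n_t, v / n_t, size=M)

# Bayesian approach with alpha = (0, 0, 0, 0): parameters drawn from Dir(v) first.
thetas = rng.dirichlet(v.astype(float), size=M)
bayes = np.array([rng.multinomial(n_t, t) for t in thetas])

var_boot = accuracy(boot).var()
var_bayes = accuracy(bayes).var()
```

In this setting `var_bayes` should come out clearly larger than `var_boot`, while the two empirical means remain close.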
We will now compare our strategy with the bootstrap method, and see what happens
when the number of samples in the testing set T increases. Let $e(\lambda) = \lambda v^A =
(65\lambda, 30\lambda, 35\lambda, 15\lambda)$ be a vector where λ > 0. To simulate an increase of $N_T$, we will take
the values of e(λ) where λ ∈ (1, 2, 3, . . . , 500). For each value of λ, we generate 1,000,000
values of $g^{e(\lambda)}$ with algorithm 1 where $\alpha^0$ is used as prior. The same number of values of
$g^{*e(\lambda)}$ is generated by the bootstrap method described in algorithm 2. To compare our strategy
with the bootstrap method, we will compare the distributions of $g^{e(\lambda)}$ and $g^{*e(\lambda)}$ when
λ increases.
The left part of Fig. 3 shows the evolution of the interquartile range of $g^{e(\lambda)}$ and $g^{*e(\lambda)}$.
Both curves are decreasing, meaning that the variability decreases when more knowledge
is available. As expected, we can observe that the variability of the samples generated by
the Bayesian method is always higher than that of the samples generated by the bootstrap
method. This can be explained by the fact that the Bayesian method takes the variability
of the unknown parameters into account, whereas the bootstrap assumes them to be fixed.
But we can also notice that the relative distance between the two curves reduces when $N_T$
increases. This is due to the fact that when more information is available, the variability of
the unknown parameters declines. Therefore, if $N_T$ becomes sufficiently large, the variability of
the unknown parameters will become almost null and the two curves will join. Thus, the
Bayesian approach tends to produce distributions with higher variance, but one must take
into account the ignorance regarding the parameters θ when $N_T$ is small.
The right part of Fig. 3 shows, for each λ, the difference between the empirical mean of
the samples in $g^{e(\lambda)}$ and $g^{*e(\lambda)}$. As we can observe, the curve oscillates around the value 0.
This means that, on average, the two methods return samples with the same empirical mean.
This is confirmed by the fact that a t-test cannot exclude the null hypothesis that the mean
of the 500 values, in this figure, equals 0 (p-val = 0.67).
Fig. 3 Left: evolution of the interquartile range of the two methods when the number of samples in the
testing set T increases. Right: for each value of λ, the difference between the average of the samples
generated by bootstrap and the average of the samples generated by our method
7 Experimentation
In this section, we propose two types of experiments. First, we will use real data sets to
assess our method’s ability to construct accurate credible intervals for different performance
indicators. In the second part, we will evaluate our method in a context where the testing
sets are sent to the model sequentially, in a series of small chunks. This will allow us to
evaluate the possibility of injecting prior knowledge sequentially in the distribution.
For the first experiment, we will use random forest models and the data sets described in
Table 1. As performance indicators, we will use the accuracy, the G-score and the F1 -score.
Algorithm 3 describes the experimental design.
The inputs of the experimental design are a data set L, a performance indicator function
I and a probability δ for the credible interval. In line 2, a variable score is initialized at
0. The value of this variable will be returned at the end of the function and used, as the
coverage probability, in Table 2. The main loop is between lines 3 – 13 and is repeated
2,000 times. In this loop, the data set is first split into three subsets. It is a stratified random
sampling, and each subset has the same number of elements. In line 5, the learning data set
D is used with a random forest algorithm (see footnote 4) to generate the model R. This model is used in
line 6 to compute the confusion matrix v on the testing data set T. Algorithm 1, described
in Section 3, is then used on v to generate a set with 1,000 samples. As some data sets are
very small, to avoid introducing too many biases, we use $\alpha^0 = (0, 0, 0, 0)$ as prior in line
7. In line 8, the credible interval C is computed. To get C, we extract the highest posterior
density (hpd) from the samples in g. Now that we have C, the model R is reused on S to
obtain a new confusion matrix $v_s$ in line 9. Note that as T and S are independent, so are the
two confusion matrices v and $v_s$. In lines 10 – 11, we check whether the new performance
indicator is in the credible interval C. At line 14, the function returns the proportion of times
that $I(v_s)$ was in C.
Footnote 4: We used R’s randomForest package [8] with the default learning parameters.
Table 1 This table lists the 16 well known real data sets extracted from the UCI Machine Learning Repository
and used for the experiments. As we are considering binary classification tasks, in the last column (ϑ1)
we have indicated the number of the variable used as output and the value for which the output equals true
Data set number Full data set name nb obs. nb var input ϑ1
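The logic of this coverage experiment can be illustrated with a simplified synthetic analogue. The sketch below is not the experimental design of the paper: there is no data set and no random forest, the model is replaced by a fixed, hypothetical vector of outcome probabilities `theta_true`, and the two confusion matrices v and $v_s$ are drawn independently from the same multinomial distribution. All names and settings are assumptions.

```python
import numpy as np

def accuracy(counts):
    counts = np.asarray(counts, dtype=float)
    return (counts[..., 0] + counts[..., 1]) / counts.sum(axis=-1)

def hpd_interval(samples, mass):
    """Shortest interval containing a fraction `mass` of the samples."""
    x = np.sort(np.asarray(samples))
    n = x.size
    k = int(np.ceil(mass * n))
    i = int(np.argmin(x[k - 1:] - x[:n - k + 1]))
    return x[i], x[i + k - 1]

def coverage(theta_true, n_t, delta, n_loops, n_samples, alpha, rng):
    score = 0
    for _ in range(n_loops):
        v = rng.multinomial(n_t, theta_true)       # stands in for testing the model on T
        v_s = rng.multinomial(n_t, theta_true)     # stands in for testing the model on S
        thetas = rng.dirichlet(v + alpha, size=n_samples)
        g = accuracy(np.array([rng.multinomial(n_t, t) for t in thetas]))
        lo, hi = hpd_interval(g, delta)            # credible interval C
        score += int(lo <= accuracy(v_s) <= hi)    # is I(v_s) inside C?
    return score / n_loops

rng = np.random.default_rng(0)
theta_true = np.array([0.45, 0.20, 0.25, 0.10])    # hypothetical outcome probabilities
cov = coverage(theta_true, n_t=145, delta=0.95, n_loops=100,
               n_samples=500, alpha=np.zeros(4), rng=rng)
```

With α = (0, 0, 0, 0) the Dirichlet draw fails if a sampled v contains a zero, which is very unlikely for the probabilities chosen here but would have to be handled for small or very unbalanced test sets.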
The experimental results are to be found in Table 2. This table describes the experimental
results obtained on the 16 well-known real data sets and the three measures (accuracy,
G-score and F1-score). Coverage probability gives the proportion of times that $I(v_s)$ was
in the interval, and Average length of CI gives the average length of the 2,000 credible
intervals.
As expected, we can see that for each case, when the confidence δ increases, so does the
average length of the intervals. We also notice that a high coverage can sometimes be reached
with short intervals. That is the case for the data set number 10,
with F1-score as performance measure. In this case, a coverage probability of 0.976 is
obtained with an average interval length of only 0.03.
The last line of each sub-table in Table 2 provides a t-based confidence interval (with
α = 0.05) of the mean of the 16 measures mentioned above. Concerning the coverage probability,
we can see that for each case, the confidence interval contains the target confidence
δ. However, in some instances, the coverage probability stays far from the target value. That
is the case for the F1-score with data sets number 6 and 9, where the coverage probability is
respectively too small and too high.
Figure 4 displays the density estimation of I (vs ) and g computed for the 2,000 loops in
algorithm 3. I (vs ) is the value of the performance indicator computed on the independent
testing set S . g is computed at line 7 of algorithm 3. g is a vector with 1,000 values, and the
2,000 vectors are put together to compute the density of g in Fig. 4. In Fig. 4, we display
the curve of F1 -score for data sets 6 and 13.
Table 2 Experimental results with the Bayesian approach

Accuracy
Data set  avg(I(v))  Coverage     Average       Coverage     Average       Coverage     Average
number               probability  length of CI  probability  length of CI  probability  length of CI
1         0.95       0.805        0.1058        0.901        0.1333        0.9515       0.1802
2         0.97       0.785        0.0599        0.8015       0.0735        0.8105       0.1048
3         0.76       0.947        0.1234        0.9755       0.1463        0.9975       0.1901

G-score
Data set  avg(I(v))  Coverage     Average       Coverage     Average       Coverage     Average
number               probability  length of CI  probability  length of CI  probability  length of CI
1         0.89       0.8315       0.0836        0.898        0.1019        0.948        0.1423
2         0.94       0.774        0.0442        0.812        0.0564        0.8645       0.0814
3         0.25       0.955        0.0874        0.9835       0.1034        0.998        0.1346
4         0.44       0.933        0.1011        0.9755       0.1199        0.9955       0.1562
5         0.64       0.903        0.0992        0.961        0.1175        0.99         0.1524
6         0.99       0.8305       0.0115        0.8275       0.0145        0.8805       0.0226
7         0.07       0.993        0.1297        0.9995       0.156         1            0.207
8         0.91       0.8435       0.0696        0.923        0.084        0.9765       0.1113
9         0.07       1            0.0266        1            0.0315        1            0.0409
10        0.97       0.8435       0.019         0.9095       0.0227        0.9715       0.0301
11        0.73       0.9305       0.0359        0.9635       0.0426        0.9965       0.0554
12        0.95       0.8755       0.022         0.9425       0.0262        0.986        0.0341
13        0.57       0.886        0.2764        0.948        0.329         0.99         0.4293
14        0.37       0.945        0.0897        0.9745       0.1062        0.9955       0.1379
15        0.18       0.974        0.1433        0.984        0.1694        0.999        0.2209
16        0.48       0.9005       0.2604        0.959        0.3085        0.9875       0.4004
Conf.Int.            [0.87, 0.94] [0.05, 0.14]  [0.91, 0.97] [0.06, 0.16]  [0.95, 1.00] [0.08, 0.21]

F1-score
Data set  avg(I(v))  Coverage     Average       Coverage     Average       Coverage     Average
number               probability  length of CI  probability  length of CI  probability  length of CI
1         0.96       0.8355       0.0847        0.9105       0.1057        0.9495       0.1492
2         0.98       0.78         0.0449        0.815        0.0574        0.814        0.0832
3         0.85       0.962        0.0884        0.9855       0.1048        0.998        0.1365
For data set number 6 with the F1-score in Table 2, the coverage probability is lower than
the target δ. In this instance, I(v_s) only takes a small set of distinct values.5 That is why
we observe waves in the density estimation of I(v_s) in the left part of Fig. 4. In these
conditions, the Bayesian method tends to generate too many samples with g = 1; in the
figure, we observe a high density at the value one (hatched lines). As the credible intervals
C are taken from g, C tends to be too small, and therefore the coverage probabilities are
lower than δ. We empirically observe that this phenomenon happens when the distribution
is close to the boundary of the values the performance indicator can take.6 In the right part
of Fig. 4, we see that g follows I(v_s) more closely. That is why, for data set 13, the coverage
probabilities are almost equal to δ in the F1-score part of Table 2.
We also did a numerical comparison with the bootstrap approach. To achieve this, we
reused the same experimental design described in algorithm 3 where, at line 7, algorithm 1
is replaced by algorithm 2. The experimental results are to be found in Table 3, where the
Coverage probability and the Average length of CI are only computed for δ = 0.90.
Compared to the previous results, the average lengths of the intervals C produced by the
bootstrap are always smaller than those produced by the proposed approach.7 This has
already been observed in the left part of Fig. 3. The fact that the bootstrap approach tends
to produce narrower intervals could explain why the Coverage probability is lower in
Table 3. We can even see that the t-based confidence interval of the Coverage probability
at the last line of Table 3 never contains the target confidence δ = 0.90. This means that
the bootstrap intervals are too small. By taking into account the ignorance regarding the
parameters of the multinomial distribution, the Bayesian approach tends to produce
distributions with higher variance.
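The difference in variance can be illustrated with a small simulation. The sketch below is not the code used for the experiments: it assumes that the bootstrap redraws a confusion matrix from a multinomial distribution with the observed frequencies as parameters, and that the Bayesian sampler first draws the parameters from a Dirichlet distribution with the observed counts as parameters (prior α = 0) before drawing the confusion matrix. The confusion matrix v, the function names and the number of draws K are hypothetical.

```python
import random
import statistics

def sample_multinomial(rng, n, probs):
    """Draw one multinomial count vector with n trials and cell probabilities probs."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts

def sample_dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalising independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def accuracy(v):
    """Accuracy of a confusion matrix v = (#TP, #FN, #FP, #TN)."""
    tp, fn, fp, tn = v
    return (tp + tn) / (tp + fn + fp + tn)

rng = random.Random(0)
v = [40, 10, 5, 45]   # hypothetical confusion matrix; all cells > 0 so Dirichlet(v) is proper
n = sum(v)
K = 5000              # number of simulated confusion matrices per method

# Bootstrap (assumed form): parameters fixed at the observed frequencies.
boot = [accuracy(sample_multinomial(rng, n, [c / n for c in v])) for _ in range(K)]

# Bayesian (assumed form): parameters drawn from the Dirichlet posterior first.
bayes = [accuracy(sample_multinomial(rng, n, sample_dirichlet(rng, v))) for _ in range(K)]

var_boot = statistics.variance(boot)
var_bayes = statistics.variance(bayes)
print(f"mean: bootstrap {statistics.mean(boot):.3f}, Bayesian {statistics.mean(bayes):.3f}")
print(f"variance: bootstrap {var_boot:.5f}, Bayesian {var_bayes:.5f}")
```

Under these assumptions, the two sets of samples have nearly the same mean, while the extra Dirichlet draw inflates the variance of the Bayesian samples. This is consistent with the narrower bootstrap intervals discussed above.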
5 Over the 2,000 loops, the function I(v_s) was equal to one 1,395 times.
6 We did the experiment with the bootstrap method presented in Section 6, and we observed that bootstrap
also has difficulties with these cases.
7 For instance, in Table 2 with δ = 0.90, the average length of credible intervals for the accuracy is 0.1058
when the Bayesian approach is used on the data set number 1. The corresponding average length in Table 3
is 0.0842.
Fig. 4 Left: The distribution of I(v_s) and g for the F1-score on data set 6. Right: The distribution of
I(v_s) and g for the F1-score on data set 13. In both panels, the horizontal axis is the value of the
F1-score and the vertical axis is the density.
We will now evaluate the possibility of using the Bayesian interpretation of the confusion
matrix in an online context.
In the following experiment, the model is evaluated sequentially on a series of small
testing data sets (T1, . . . , Tl, . . . , TL) and, at each step l, a confusion matrix vl is computed
from Tl. The fact that the same model is tested sequentially gives the possibility of
improving at each step l the knowledge that we have about its performance. Thanks to
the Bayesian framework, at step l, we have an elegant way of injecting the prior knowledge
that we have accumulated during the previous steps. At each step, a credible interval will
be computed and we will check over time whether it contains the true value or not. We
deliberately use very narrow intervals, obtained by setting the confidence level to 10%, to
see whether even a small credible interval can contain the true value of the performance
indicator.
We don’t have access to the true performance of the models. To get a good estimation of
the true value of the performance indicator, we need a very extensive data set, and that is
why we will synthetically generate the data from the following equation:
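$$
Y = \begin{cases}
\vartheta_1, & \text{if } \sin(X_1^2 + X_2^2) + \cos(X_1^2) + \cos(X_2^2) + \varepsilon > 0.5 \\
\vartheta_0, & \text{else}
\end{cases}
\tag{9}
$$
where
$X_1 \sim \mathrm{Unif}(0, 3)$, $X_2 \sim \mathrm{Unif}(0, 3)$ and $\varepsilon \sim \mathcal{N}(0, 1)$.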
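A generator for (9) can be sketched as follows. This is an illustration only: it reads the exponents in (9) as squares of X1 and X2 and treats the additive term as standard Gaussian noise. The function name, the seed and the encoding of ϑ1 and ϑ0 as True and False are our own choices.

```python
import math
import random

def draw_observation(rng):
    """Draw one (x1, x2, y) triple following our reading of equation (9)."""
    x1 = rng.uniform(0.0, 3.0)   # X1 ~ Unif(0, 3)
    x2 = rng.uniform(0.0, 3.0)   # X2 ~ Unif(0, 3)
    eps = rng.gauss(0.0, 1.0)    # additive noise ~ N(0, 1)
    score = math.sin(x1 ** 2 + x2 ** 2) + math.cos(x1 ** 2) + math.cos(x2 ** 2) + eps
    y = score > 0.5              # True stands for ϑ1, False for ϑ0
    return x1, x2, y

rng = random.Random(42)          # arbitrary seed, for reproducibility only
sample = [draw_observation(rng) for _ in range(1000)]
positive_rate = sum(y for _, _, y in sample) / len(sample)
print(f"share of observations with Y = ϑ1: {positive_rate:.3f}")
```

Because the noise has unit variance, both classes overlap over the whole input square, so no classifier can reach a perfect score on data generated this way.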
The experimental design is described in the algorithm 4, where the MCC is used as
performance indicator I .
The model that we will evaluate sequentially is built between lines 2 and 3. The learning
data set D is created at line 2 with 1,000 samples generated from (9), and a random forest
algorithm is used at line 3 to generate the model. This model is stored in R. We will now
try to estimate the true performance of the model, and for that purpose, another data set is
generated at line 4 from the same (9). This data set contains twenty million samples and is
used at line 5 to compute a confusion matrix vbig. We use twenty million samples to obtain
a good estimation of the (unknown) true performance of R. This confusion matrix vbig will be
returned as output of the function at line 14. At line 6, we initialize the sequential tests by
putting α = (0, 0, 0, 0). This stands for complete ignorance at the first loop (that is l = 1).
Fig. 5 Left: A run of algorithm 4 (vertical axis: MCC; horizontal axis: step l, from 0 to 300). The
horizontal line is the value of I(v_big). Right: After 200 runs of algorithm 4, we count the number of
times that the true value is in the credible interval (vertical axis: % in CI; horizontal axis: step l,
from 0 to 300)
The sequential tests are done between lines 7 and 13. At line 8, a small data set with only
100 observations is generated from (9). This data set is used at line 9 to compute a confusion
matrix. The sampling algorithm 1 is used at line 10. At line 11, the credible interval Cl is
obtained by extracting the hpd from the samples in g. We want small credible intervals,
which is why the confidence is set to 10%. At line 12, the prior is updated with the new
values for the next step. At line 14, the list of the 300 credible intervals is returned, together
with the value of the indicator function computed on the big data set.
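The sequential loop described above can be sketched as follows. This is not the code of algorithm 4, and several points are assumptions. The random forest and the data from (9) are replaced by a hypothetical fixed vector theta_true of cell probabilities. The sampler is assumed to draw the multinomial parameters from a Dirichlet distribution and then a confusion matrix of the size of the current test set. A tiny constant EPS is added to the Dirichlet parameters so that the prior α = (0, 0, 0, 0) still gives valid gamma shapes. The hpd is taken as the shortest interval containing the requested fraction of the samples.

```python
import math
import random

def mcc(v):
    """Matthews correlation coefficient of v = (#TP, #FN, #FP, #TN); 0 if undefined."""
    tp, fn, fp, tn = v
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def sample_multinomial(rng, n, probs):
    """Draw one multinomial count vector with n trials."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts

def sample_indicator(rng, dirichlet_params, n, k):
    """Assumed sampler: theta ~ Dirichlet, v ~ Multinomial(n, theta), g = MCC(v)."""
    g = []
    for _ in range(k):
        gammas = [rng.gammavariate(a, 1.0) for a in dirichlet_params]
        total = sum(gammas)
        theta = [x / total for x in gammas]
        g.append(mcc(sample_multinomial(rng, n, theta)))
    return g

def hpd(samples, level):
    """Shortest interval containing a fraction `level` of the samples."""
    s = sorted(samples)
    m = max(1, round(level * len(s)))
    start = min(range(len(s) - m + 1), key=lambda i: s[i + m - 1] - s[i])
    return s[start], s[start + m - 1]

rng = random.Random(1)
theta_true = [0.40, 0.10, 0.10, 0.40]  # hypothetical stand-in for the model on data from (9)
EPS = 1e-6                             # keeps gamma shapes positive when a count is zero
N_TEST, N_STEPS, N_SAMPLES = 100, 30, 300

alpha = [0.0, 0.0, 0.0, 0.0]           # complete ignorance at the first step
intervals = []
for step in range(N_STEPS):
    v = sample_multinomial(rng, N_TEST, theta_true)  # confusion matrix of the small test set
    posterior = [a + c for a, c in zip(alpha, v)]    # prior plus new counts
    g = sample_indicator(rng, [p + EPS for p in posterior], N_TEST, N_SAMPLES)
    intervals.append(hpd(g, 0.10))                   # 10% credible interval
    alpha = posterior                                # posterior becomes the next prior

print("last interval:", intervals[-1], "MCC under theta_true:", mcc(theta_true))
```

The key design point is the last line of the loop: the counts accumulated so far are carried over as Dirichlet parameters, which is how the knowledge from earlier test sets enters the next step.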
The result of a run of algorithm 4 is shown in the left part of Fig. 5. We observe that
the Bayesian method provides a way to add the prior knowledge from the previous loops
and that, after 150 loops, the true value is always within the boundaries defined by the
credible intervals.
We have repeated this process 200 times and counted, for each step l, the number of times
the credible interval Cl contained the true value. The result is shown in the right part of
Fig. 5. Although the credible intervals are very small, we see that after 150 loops, the true
values are very often within their boundaries (approximately 90% of the time).
8 Conclusion
In this paper, we have proposed a new way of dealing with the uncertainties a scientist can
have about the performance indicators of a classifier (s)he can extract from a confusion
matrix. In our work, we assume that the scientist does not have access to the learning or
testing set. (S)he only has access to the confusion matrix. We have shown that the values of
said matrix can be assumed to be generated from a random vector following a multinomial
distribution, and by taking the Bayesian point of view, we have assumed that the unknown
parameters of the multinomial distribution themselves are generated from a random vector
following a Dirichlet distribution. We made a theoretical and empirical comparison between
the bootstrap method and our method, based on the Bayesian framework. With α = (0, 0, 0, 0) used
as prior, we showed that both methods returned samples with the same mean and that the
variance of the distribution generated by bootstrap was lower. This can be explained by the
fact that the bootstrap method does not assume uncertainty about the unknown parameters of
the multinomial distribution. Thanks to the Bayesian framework, we have shown that prior
knowledge can easily be injected in the distributions, and that it reduces the uncertainty we
can have about the distribution of the performance indicators. Experimental results show
that our method can be used to construct accurate confidence intervals for the unknown
performance indicator.
References
1. Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the
23rd International Conference on Machine Learning, pp. 233–240. ACM, New York (2006)
2. Efron, B.: Bootstrap methods: another look at the jackknife. In: Breakthroughs in Statistics, pp. 569–593.
Springer, Berlin (1992)
3. Elkan, C.: The foundations of cost-sensitive learning. In: International Joint Conference on Artificial
Intelligence, vol. 17, pp. 973–978. Lawrence Erlbaum Associates Ltd (2001)
4. Forbes, C., Evans, M., Hastings, N., Peacock, B.: Statistical distributions. Wiley, Hoboken (2011)
5. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian data analysis, vol. 2. Chapman & Hall/CRC
Boca Raton, FL (2014)
6. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall and f-score, with implication
for evaluation. In: Advances in Information Retrieval, pp. 345–359. Springer, Berlin (2005)
7. James, G., Witten, D., Hastie, T., Tibshirani, R.: An introduction to statistical learning, vol. 6. Springer,
Berlin (2013)
8. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002).
http://CRAN.R-project.org/doc/Rnews/
9. Powers, D.M.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and
correlation (2011)
10. Wackerly, D., Mendenhall, W., Scheaffer, R.: Mathematical statistics with applications. Cengage
Learning, Boston (2008)