Classification: Evaluation: Data Mining and Text Mining (UIC 583 at Politecnico Di Milano)
Lecture Outline
Issues in classifier evaluation
Metrics for performance evaluation
Methods for performance evaluation
Comparing models
How reliable are the predicted results?
How much should we believe in what was learned?
Error on the training data is not a good indicator of performance on future data: the classifier was computed from the very same training data, so any estimate based on that data will be optimistic. In addition, new data will probably not be exactly the same as the training data!
How to evaluate the performance of a model?
How to obtain reliable estimates?
How to compare the relative performance among competing models?
Given two equally performing models, which one should we prefer?
Confusion Matrix

                        PREDICTED CLASS
                        Yes         No
ACTUAL CLASS   Yes      a (TP)      b (FN)
               No       c (FP)      d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Limitation of accuracy: consider a 2-class problem where the number of Class 0 examples is 9990 and the number of Class 1 examples is 10. If the model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%, even though it never detects a single Class 1 example.
Cost Matrix

                        PREDICTED CLASS
                        Yes             No
ACTUAL CLASS   Yes      C(Yes|Yes)      C(No|Yes)
               No       C(Yes|No)       C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Example: computing the cost of classification

Cost matrix:

C(i|j)                  PREDICTED CLASS
                        +        -
ACTUAL CLASS   +        -1       100
               -         1         0

Model M1:

                        PREDICTED CLASS
                        +        -
ACTUAL CLASS   +        150      40
               -        60       250

Accuracy = 80%, Cost = 3910

Model M2:

                        PREDICTED CLASS
                        +        -
ACTUAL CLASS   +        250      45
               -         5       200

Accuracy = 90%, Cost = 4255
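As a sanity check, the accuracy and cost figures above can be reproduced with a few lines of Python; this is a minimal sketch that simply reuses the confusion matrices and cost matrix shown on this slide:

```python
import numpy as np

# Cost matrix: rows = actual (+, -), columns = predicted (+, -)
cost = np.array([[-1, 100],
                 [ 1,   0]])

# Confusion matrices for the two models, same row/column layout
m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in [("M1", m1), ("M2", m2)]:
    accuracy = np.trace(cm) / cm.sum()        # (TP + TN) / total
    total_cost = (cm * cost).sum()            # element-wise cost-weighted counts
    print(name, f"accuracy={accuracy:.0%}", f"cost={total_cost}")
# M1: accuracy=80%, cost=3910
# M2: accuracy=90%, cost=4255
```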
Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F1-measure = 2rp / (r + p) = 2a / (2a + b + c)

The higher the F1, the lower the false positives and false negatives.
Precision is biased towards C(Yes|Yes) and C(Yes|No); Recall is biased towards C(Yes|Yes) and C(No|Yes); the F1-measure is biased towards all except C(No|No).
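These measures follow directly from the confusion-matrix counts. The sketch below just illustrates the formulas on hypothetical counts a, b, c, d (not taken from the slides):

```python
def precision_recall_f1(a, b, c, d):
    """a=TP, b=FN, c=FP, d=TN, as in the confusion matrix above.
    Note that d (true negatives) is never used: F1 ignores C(No|No)."""
    p = a / (a + c)             # precision
    r = a / (a + b)             # recall
    f1 = 2 * r * p / (r + p)    # equivalently 2a / (2a + b + c)
    return p, r, f1

# Hypothetical example: 40 TP, 10 FN, 20 FP, 30 TN
print(precision_recall_f1(40, 10, 20, 30))   # (0.667, 0.8, 0.727)
```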
Classification Step 1: Split the Data Between Train and Test Sets
[Diagram: the data is split into a training set and a testing set]
Classification Step 2: Compute a Model using the Data in the Training Set
[Diagram: a model is computed from the training set; the testing set is kept aside]
Classification Step 3: Evaluate the Model using the Data in the Testing Set

[Diagram: the learned model is applied to the testing set to estimate its performance]
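A minimal scikit-learn sketch of these steps; the iris dataset and the decision tree are just placeholders for any labeled dataset and any classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # any labeled dataset

# Step 1: split the data between train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Step 2: compute a model using the data in the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate the model on the testing set
print("test accuracy:", model.score(X_test, y_test))
```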
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
The test data cannot be used for parameter tuning!
The proper procedure uses three sets: training data, validation data, and test data. The validation data is used to optimize the parameters.
Generally, the larger the training set, the better the classifier (although returns diminish).
The larger the test set, the more accurate the error estimate.
[Diagram: the data is split into a training set, a validation set, and a testing set; the model builder learns from the training set, its predictions on the validation set are used to evaluate and tune the model, and the final evaluation is performed on the testing set]
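A hedged sketch of the three-set procedure; the dataset, the decision tree, and the parameter grid are placeholders chosen only to make the example runnable:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split into training, validation, and testing sets (e.g. 60/20/20)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Use the validation set (never the test set) to pick a parameter
best_depth, best_score = None, -1.0
for depth in [1, 2, 3, 5, None]:                  # hypothetical parameter grid
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = model.fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# Final evaluation on the untouched test set
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen max_depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```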
Methods of Estimation
Holdout: reserve 2/3 for training and 1/3 for testing
Random subsampling: repeated holdout
Cross-validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
Stratified sampling: oversampling vs. undersampling
Bootstrap: sampling with replacement
Holdout

Reserves a certain amount for testing and uses the remainder for training; usually 1/3 for testing and 2/3 for training.
For small or unbalanced datasets, the samples might not be representative; for instance, the training or testing set might contain few or no instances of some classes.
Stratified sample: make sure that each class is represented with approximately equal proportions in both subsets.
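In scikit-learn, a stratified holdout split can be obtained by passing the labels to `stratify`. A minimal sketch on a synthetic imbalanced dataset (the 90/10 labels and the dummy feature are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90% class 0, 10% class 1
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)          # dummy feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Both subsets preserve the 90/10 class proportions
print(np.bincount(y_tr) / len(y_tr))   # ~[0.9, 0.1]
print(np.bincount(y_te) / len(y_te))   # ~[0.9, 0.1]
```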
Repeated Holdout
The holdout estimate can be made more reliable by repeating the process with different random subsamples and averaging the error rates over the iterations.
Cross-Validation
Avoids overlapping test sets.
First step: the data is split into k subsets of equal size.
Second step: each subset in turn is used for testing and the remainder for training.
This is called k-fold cross-validation.
Often the subsets are stratified before cross-validation is performed.
Cross-Validation
The standard method for evaluation is stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
Stratification reduces the variance of the estimate.
Even better: repeated stratified cross-validation, e.g., ten-fold cross-validation repeated ten times with the results averaged (further reduces the variance).
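A minimal sketch of repeated stratified ten-fold cross-validation; the dataset and the decision tree are placeholders for whatever data and classifier are being evaluated:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stratified 10-fold CV, repeated 10 times with different random splits
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Averaging over 10 x 10 = 100 folds reduces the variance of the estimate
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```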
Leave-One-Out Cross-Validation
Leave-one-out is a particular form of cross-validation:
Set the number of folds to the number of training instances, i.e., for n training instances, build the classifier n times.
Makes the best use of the data and involves no random subsampling.
Computationally expensive.
A drawback of Leave-One-Out CV is that stratification is not possible (each test set contains a single instance). Extreme example: a random dataset split equally into two classes. The best inducer predicts the majority class, which gives 50% accuracy on fresh data, yet the Leave-One-Out CV estimate is 100% error! In each fold, the held-out instance belongs to the class that is in the minority of the remaining training data, so the prediction is always wrong.
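This pitfall is easy to reproduce; a sketch with a majority-class predictor on a perfectly balanced random dataset (features and labels invented for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # random, uninformative features
y = np.array([0, 1] * 50)          # perfectly balanced classes

# Majority-class predictor: ~50% accuracy on genuinely fresh data
clf = DummyClassifier(strategy="most_frequent")

# With Leave-One-Out, the held-out instance is always in the minority
# of the remaining data, so every prediction is wrong
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores.mean())   # 0.0, i.e. a 100% error estimate
```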
Bootstrapping

Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances n times with replacement to form a new dataset of n instances, and use this data as the training set.
Use the instances from the original dataset that do not occur in the new training set for testing.
An instance has a probability of 1 - 1/n of not being picked at each draw; thus its probability of ending up in the test data is

(1 - 1/n)^n ≈ e^(-1) ≈ 0.368

This means the training data will contain approximately 63.2% of the distinct original instances.
The error estimate on the test data will be rather pessimistic, since the model is trained on only about 63% of the instances; therefore it is combined with the resubstitution error:

err = 0.632 × e_test + 0.368 × e_training

The resubstitution error gets less weight than the error on the test data. The process is repeated several times with different replacement samples and the results are averaged.
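A minimal sketch of a single 0.632-weighted bootstrap estimate as described above (in practice the resampling is repeated and the estimates averaged); the dataset and classifier are placeholders:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

n = len(y)
train_idx = rng.integers(0, n, size=n)              # sample n instances with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)    # out-of-bag instances form the test set

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
e_train = 1 - model.score(X[train_idx], y[train_idx])   # resubstitution error
e_test = 1 - model.score(X[test_idx], y[test_idx])      # error on unseen (out-of-bag) data

err = 0.632 * e_test + 0.368 * e_train
print(f"0.632 bootstrap error estimate: {err:.3f}")
```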
More on Bootstrap
The bootstrap estimate, however, has some problems. Consider a random dataset with a 50% class distribution. A model that memorizes all the training data will achieve 0% resubstitution error and about 50% error on the test data, so the bootstrap estimate for this classifier is

err = 0.632 × 50% + 0.368 × 0% = 31.6%

which is misleadingly optimistic: the true expected error is 50%.
How to compare the relative performance among competing models? (Model Comparison)
Two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
How much confidence can we place on the accuracy of M1 and M2?
Can the difference in performance be explained as the result of random fluctuations in the test set?
Each prediction on a test instance can be regarded as a Bernoulli trial (correct or wrong), and a collection of Bernoulli trials has a Binomial distribution. Given the number of correct test predictions x and the number of test instances N, the accuracy is acc = x/N.
For large test sets (N > 30), the accuracy acc has a normal distribution with mean p and variance p(1-p)/N:

P( Z_{α/2} < (acc - p) / sqrt(p(1-p)/N) < Z_{1-α/2} ) = 1 - α

Confidence interval for p:

p = ( 2·N·acc + Z_{α/2}² ± Z_{α/2} · sqrt( Z_{α/2}² + 4·N·acc - 4·N·acc² ) ) / ( 2·(N + Z_{α/2}²) )
Consider a testing set containing 1000 examples (N = 1000), of which 750 have been correctly classified (x = 750, acc = 75%). If we want an 80% confidence level, then the true performance p is between 73.2% and 76.7%.
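This interval can be reproduced with the confidence-interval formula above; a small sketch using scipy only for the normal quantile:

```python
from math import sqrt
from scipy.stats import norm

def accuracy_interval(acc, n, confidence):
    """Confidence interval for the true accuracy p, given acc observed on n test instances."""
    z = norm.ppf(1 - (1 - confidence) / 2)          # e.g. ~1.28 for 80% confidence
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    return (center - spread) / (2 * (n + z**2)), (center + spread) / (2 * (n + z**2))

print(accuracy_interval(acc=0.75, n=1000, confidence=0.80))   # ~(0.732, 0.767)
```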
Values of Z for some confidence levels:

1 - α    0.99    0.98    0.95    0.90
Z        2.58    2.33    1.96    1.65

Confidence interval for a model with acc = 80%, at 95% confidence, for different test set sizes:

N        p(lower)    p(upper)
50       0.670       0.888
100      0.711       0.866
500      0.763       0.833
1000     0.774       0.824
5000     0.789       0.811
Two models, say M1 and M2: which is better?
M1 is tested on D1 (size = n1) with observed error rate e1.
M2 is tested on D2 (size = n2) with observed error rate e2.
Assume D1 and D2 are independent.
If n1 and n2 are sufficiently large, then

e1 ~ N(μ1, σ1²)    e2 ~ N(μ2, σ2²)

with the approximation

σi² ≈ ei · (1 - ei) / ni
To test whether the performance difference is statistically significant, consider d = e1 - e2, where d ~ N(d_t, σ_t²) and d_t is the true difference. Since D1 and D2 are independent, the variances add up:

σ_t² ≈ σ1² + σ2² = e1·(1 - e1)/n1 + e2·(1 - e2)/n2

At the (1 - α) confidence level,

d_t = d ± Z_{α/2} · σ̂_t
Example
Given M1 and M2 above: e1 = 0.15 with n1 = 30, and e2 = 0.25 with n2 = 5000, so d = |e2 - e1| = 0.1.

σ̂_d² = 0.15 × 0.85 / 30 + 0.25 × 0.75 / 5000 ≈ 0.0043

At the 95% confidence level, d_t = 0.1 ± 1.96 × sqrt(0.0043) = 0.1 ± 0.128. The interval contains 0, so the difference may not be statistically significant.
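A sketch of this comparison using the formulas above, applied to M1 and M2 from the earlier slide:

```python
from math import sqrt
from scipy.stats import norm

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    """Confidence interval for the true difference in error rates of two independent models."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    z = norm.ppf(1 - (1 - confidence) / 2)
    return d - z * sqrt(var), d + z * sqrt(var)

lo, hi = difference_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
print(lo, hi)   # ~(-0.028, 0.228): the interval contains 0, so the difference
                # is not statistically significant at the 95% confidence level
```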
The ROC curve plots the true positive rate against the false positive rate, i.e., TPR (on the y-axis) vs. FPR (on the x-axis), where TPR = TP/(TP+FN) and FPR = FP/(FP+TN).
ROC Curve

[Figures: examples of ROC curves for different classifiers]
Using ROC for model comparison: no model consistently outperforms the other; M1 is better for small FPR, while M2 is better for large FPR. The Area Under the ROC Curve (AUC) summarizes the curve in a single number: for an ideal model the area is 1, for a random guess the area is 0.5.
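A minimal sketch of how the TPR/FPR points and the area under the curve are obtained from predicted scores; the dataset and the logistic regression are placeholders for any scorer:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # score of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))      # 1.0 = ideal, 0.5 = random guess
```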
Lift Charts
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model.
Cumulative gains and lift charts are visual aids for measuring model performance.
Both charts consist of a lift curve and a baseline: the greater the area between the lift curve and the baseline, the better the model.
Example: a mass mailout of promotional offers to 1,000,000 households. The proportion who normally respond is 0.1%, that is, 1000 responses.
A data mining tool can identify a subset of 100,000 households for which the response rate is 0.4%, that is, 400 responses; in marketing terminology, this increase in response rate is known as the lift factor yielded by the model (here, a lift factor of 4).
The same tool may be able to identify 400,000 households for which the response rate is 0.2%, that is, 800 respondents, corresponding to a lift factor of 2.
The overall goal is to find subsets of test instances that have a high proportion of true positives.
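The lift factors in this example follow directly from the response rates; a tiny sketch using the numbers above:

```python
def lift(subset_responses, subset_size, total_responses, total_size):
    """Ratio between the response rate in the selected subset and the baseline rate."""
    return (subset_responses / subset_size) / (total_responses / total_size)

print(lift(400, 100_000, 1000, 1_000_000))   # 4.0 -> lift factor of 4
print(lift(800, 400_000, 1000, 1_000_000))   # 2.0 -> lift factor of 2
```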
There is a tradeoff between:
The complexity of a model
Its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data.
This is also known as Occam's Razor: the best theory is the smallest one that describes all the facts.
William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
MDL stands for minimum description length.
The description length is defined as: the space required to describe a theory plus the space required to describe the theory's mistakes.
In our case, the theory is the classifier and the mistakes are the errors on the training data.
The MDL principle is related to data compression: the best theory is the one that compresses the data the most, i.e., to compress a dataset we generate a model and then store the model and its mistakes.
We therefore need to compute (a) the size of the model and (b) the space needed to encode the errors.
(b) is easy: use the informational loss function.
(a) requires a method to encode the model.
Among the several algorithms, which one is the best?
Some algorithms have a lower computational complexity.
Different algorithms provide different representations.
Some algorithms allow the specification of prior knowledge.
If we are interested in generalization performance, are there any reasons to prefer one classifier over another?
Can we expect any classification method to be superior or inferior overall?
According to the No Free Lunch Theorem, the answer to all these questions is known.
If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one classification method over another.
If one algorithm seems to outperform another in a certain situation, it is a consequence of its fit to the particular problem, not of the general superiority of the algorithm.
When confronting a new problem, the theorem suggests that we should focus on the aspects that matter most: prior information, data distribution, amount of training data, and cost or reward.
The theorem also justifies skepticism regarding studies that demonstrate the overall superiority of a certain algorithm.
"[A]ll algorithms that search for an extremum of a cost [objective] function perform exactly the same, when averaged over all possible cost functions." [1]

"[T]he average performance of any pair of algorithms across all possible problems is identical." [2]

[1] Wolpert, D.H., Macready, W.G. (1995), No Free Lunch Theorems for Search
[2] Wolpert, D.H., Macready, W.G. (1997), No Free Lunch Theorems for Optimization