Data Mining Lab 02 - Truong Quang Tuong - ITITIU20130
Lab 2: Evaluation
2.1. Be a classifier
In the second class, we learn how to use datasets to evaluate data mining algorithms in
Weka. (See the lecture of class 2 by Ian H. Witten [1].)
- Follow the instructions in [1] to see how decision trees are created for different combinations of
attributes in a dataset. First, a dataset and a training set are selected. Second, we choose and
run UserClassifier to see a decision tree in the Tree Visualizer. Third, in the Data Visualizer,
the attributes to use for X and Y are selected; we then select the instances in a region of the
graph and submit them. At this point, the Tree Visualizer shows the tree.
- Examine the segment-challenge dataset to draw a decision tree for the following pair of attributes
by selecting and submitting classes one by one, then note how many instances are predicted
correctly.

[Figure: the resulting decision tree]
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Remark
Out of 810 instances, 634 were correctly classified, an accuracy of 78.2746%.
Build a tree: what strategy did you use? The UserClassifier method, selecting regions in the Data Visualizer and submitting them class by class.
Can you build a “perfect” tree? Yes: by continuing to split until every region contains only one class, a tree that is perfect on the training data can be built, though such a tree overfits.
Follow the instructions in [1]-2.3: use J48 to analyse the segment dataset, and record what accuracy it
achieves with different random-number seeds. (If a random-number seed is provided, the dataset is shuffled
before the subset is extracted.)
Random seed   Accuracy      Random seed   Accuracy
1             0.967         6             0.967
2             0.940         7             0.920
3             0.940         8             0.940
4             0.967         9             0.933
5             0.953         10            0.947
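The role of the random seed (shuffling the dataset before the split) can be illustrated outside Weka. Below is a minimal sketch in plain Python; the toy dataset and the majority-class baseline are assumptions made for the example, standing in for the segment dataset and J48:

```python
import random

def seeded_holdout(data, seed, train_frac=0.66):
    """Shuffle a copy of the data with the given seed, then split it."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy labelled instances: (feature, class); not the segment dataset
data = [(i, "A" if i % 3 else "B") for i in range(30)]

for seed in (1, 2, 3):
    train, test = seeded_holdout(data, seed)
    # Majority-class baseline (like ZeroR): predict the commonest training class
    labels = [c for _, c in train]
    majority = max(set(labels), key=labels.count)
    acc = sum(c == majority for _, c in test) / len(test)
    print(f"seed={seed}  test accuracy={acc:.3f}")
```

Because each seed produces a different shuffle, the train/test split, and therefore the measured accuracy, changes from seed to seed, which is what the table above shows for J48.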
Remark: random seeds 1, 4, 5, and 6 gave the most accurate results, while the remaining seeds were slightly less accurate.
Classifier Accuracy
J48 76%
NaiveBayes 77%
IBk 73%
PART 74%
ZeroR 67.8%
Classifier Accuracy
ZeroR 64%
J48 63%
NaiveBayes 63%
IBk 38%
PART 63%
2.4. Cross-validation
The holdout procedure: a certain amount is held over for testing and the remainder used for training.
Stratification: each class is properly represented in both training and test sets.
The repeated holdout method of error rate estimation: In each iteration a certain proportion, say
two-thirds, of the data is randomly selected for training (using different random-number seeds),
possibly with stratification, and the remainder is used for testing. The error rates on the different
iterations are averaged to yield an overall error rate.
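The repeated holdout method just described can be sketched in plain Python. This is an illustrative sketch, not Weka's implementation: the Gaussian toy data and the 1-nearest-neighbour scorer are assumptions made for the example.

```python
import random
from statistics import mean

def stratified_holdout(data, seed, train_frac=2/3):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(c for _, c in data)):
        group = [d for d in data if d[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test

def error_rate(train, test):
    """Illustrative 1-nearest-neighbour error on a single numeric feature."""
    errors = sum(min(train, key=lambda d: abs(d[0] - x))[1] != c
                 for x, c in test)
    return errors / len(test)

# Toy data: two classes drawn from Gaussians centred at 0 and 3
rng = random.Random(42)
data = [(rng.gauss(0 if c == "A" else 3, 1), c)
        for c in "AB" for _ in range(50)]

# Repeated holdout: average the error over runs with different seeds
rates = [error_rate(*stratified_holdout(data, seed)) for seed in range(10)]
print(f"mean error rate over 10 runs: {mean(rates):.3f}")
```

Averaging over several differently-seeded splits smooths out the seed-to-seed variation visible in the holdout columns of the tables below.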
The lecture covers cross-validation, 10-fold cross-validation, and stratified cross-validation (see [1]-2.5).
In cross-validation, you decide on a fixed number of folds, or partitions, of the data. Suppose we
use three. Then the data is split into three approximately equal partitions; each in turn is used
for testing and the remainder is used for training. That is, use two-thirds of the data for training
and one-third for testing, and repeat the procedure three times so that in the end, every instance
has been used exactly once for testing. This is called three-fold cross-validation, and if
stratification is adopted as well—which it often is—it is stratified three-fold cross-validation.
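The three-fold procedure just described can be written out directly. In this sketch a hypothetical majority-class learner stands in for a real classifier; the point is that every instance is used exactly once for testing:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k approximately equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, k=3):
    accs = []
    for fold in k_fold_indices(len(data), k):
        test = [data[i] for i in fold]
        train = [d for i, d in enumerate(data) if i not in set(fold)]
        # Majority-class learner (like ZeroR) as a stand-in classifier
        labels = [c for _, c in train]
        majority = max(set(labels), key=labels.count)
        accs.append(sum(c == majority for _, c in test) / len(test))
    # Average the k per-fold accuracies into one overall estimate
    return sum(accs) / k

data = [(i, "yes" if i % 4 else "no") for i in range(12)]
print(f"3-fold CV accuracy: {cross_validate(data):.3f}")
```

Stratified cross-validation would additionally balance the class proportions within each fold, in the same way stratification balances the holdout split.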
Random seed   Holdout (10%) accuracy      Random seed   10-fold CV accuracy
2             0.779                       2             0.750
3             0.805                       3             0.755
4             0.740                       4             0.755
5             0.714                       5             0.744
6             0.701                       6             0.756
7             0.792                       7             0.736
8             0.714                       8             0.740
9             0.805                       9             0.745
10            0.675                       10            0.730
Random seed   Holdout (10%) accuracy (%)   Random seed   10-fold CV accuracy (%)
2             75.3                         2             73.1
3             71.4                         3             72.8
4             72.7                         4             74.9
5             77.9                         5             74.2
6             71.4                         6             73.0
7             74.0                         7             73.4
8             68.8                         8             71.9
9             75.3                         9             74.6
10            66.2                         10            71.4
Examine NaiveBayes on the Diabetes dataset:
Random seed   Holdout (10%) accuracy (%)   Random seed   10-fold CV accuracy (%)
2             75.3                         2             75.2
3             72.7                         3             76.1
4             68.8                         4             75.5
5             80.5                         5             75.1
6             76.6                         6             75.8
7             76.6                         7             76.2
8             74.0                         8             75.3
9             76.6                         9             76.0
10            71.4                         10            75.9