Data Mining Lab 02 - Truong Quang Tuong - ITITIU20130
Lab 2: Evaluation
2.1. Be a classifier
In the second class, we learn how to use datasets to evaluate data mining algorithms in
Weka. (See the lecture of class 2 by Ian H. Witten [1].)
- Follow the instructions in [1] to see how decision trees are created for different combinations of
attributes in a dataset. First, a dataset and a training set are selected. Second, we choose and
run UserClassifier to see a decision tree in the Tree Visualizer. Third, in the Data Visualizer,
the attributes to use for X and Y are selected; we then select the instances in a region of the
graph and submit them. At this point, the Tree Visualizer shows the tree.
- Examine the segment-challenge dataset to draw a decision tree for the following pair of attributes
by selecting and submitting classes one by one, then note how many instances are predicted
correctly.

[Figure: the resulting decision tree]
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Remark
Out of 810 instances, 634 were correctly classified, an accuracy of 78.2746%.
Build a tree: what strategy did you use? The UserClassifier method, selecting regions in the Data Visualizer and submitting them class by class.
Can you build a “perfect” tree? Yes: by continuing to split until every region contains only one class, a tree that is perfect on the training data can be built, though such a tree overfits.
Follow the instructions in [1]-2.3: use J48 to analyse the segment dataset, and record what accuracy it
achieves with different random-number seeds. (If a random-number seed is provided, the dataset is shuffled
before the subset is extracted.)
Random seed   Accuracy      Random seed   Accuracy
1             0.967         6             0.967
2             0.940         7             0.920
3             0.940         8             0.940
4             0.967         9             0.933
5             0.953         10            0.947
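The role of the random seed (shuffling the dataset before the split) can be illustrated outside Weka. Below is a minimal sketch in plain Python; the toy dataset and the majority-class baseline are assumptions made for the example, standing in for the segment dataset and J48:

```python
import random

def seeded_holdout(data, seed, train_frac=0.66):
    """Shuffle a copy of the data with the given seed, then split it."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy labelled instances: (feature, class); not the segment dataset
data = [(i, "A" if i % 3 else "B") for i in range(30)]

for seed in (1, 2, 3):
    train, test = seeded_holdout(data, seed)
    # Majority-class baseline (like ZeroR): predict the commonest training class
    labels = [c for _, c in train]
    majority = max(set(labels), key=labels.count)
    acc = sum(c == majority for _, c in test) / len(test)
    print(f"seed={seed}  test accuracy={acc:.3f}")
```

Because each seed produces a different shuffle, the train/test split, and therefore the measured accuracy, changes from seed to seed, which is what the table above shows for J48.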
Remark: random seeds 1, 4, 5, and 6 gave the most accurate results, while the remaining seeds were slightly less accurate.
Classifier Accuracy
J48 76%
NaiveBayes 77%
IBk 73%
PART 74%
ZeroR 67.8%
Classifier Accuracy
ZeroR 64%
J48 63%
NaiveBayes 63%
IBk 38%
PART 63%
2.4. Cross-validation
The holdout procedure: a certain amount is held over for testing and the remainder used for training.
Stratification: each class is properly represented in both training and test sets.
The repeated holdout method of error rate estimation: In each iteration a certain proportion, say
two-thirds, of the data is randomly selected for training (using different random-number seeds),
possibly with stratification, and the remainder is used for testing. The error rates on the different
iterations are averaged to yield an overall error rate.
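The repeated holdout method just described can be sketched in plain Python. This is an illustrative sketch, not Weka's implementation: the Gaussian toy data and the 1-nearest-neighbour scorer are assumptions made for the example.

```python
import random
from statistics import mean

def stratified_holdout(data, seed, train_frac=2/3):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(c for _, c in data)):
        group = [d for d in data if d[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test

def error_rate(train, test):
    """Illustrative 1-nearest-neighbour error on a single numeric feature."""
    errors = sum(min(train, key=lambda d: abs(d[0] - x))[1] != c
                 for x, c in test)
    return errors / len(test)

# Toy data: two classes drawn from Gaussians centred at 0 and 3
rng = random.Random(42)
data = [(rng.gauss(0 if c == "A" else 3, 1), c)
        for c in "AB" for _ in range(50)]

# Repeated holdout: average the error over runs with different seeds
rates = [error_rate(*stratified_holdout(data, seed)) for seed in range(10)]
print(f"mean error rate over 10 runs: {mean(rates):.3f}")
```

Averaging over several differently-seeded splits smooths out the seed-to-seed variation visible in the holdout columns of the tables below.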
The lecture covers cross-validation, 10-fold cross-validation, and stratified cross-validation (see [1]-2.5).
In cross-validation, you decide on a fixed number of folds, or partitions, of the data. Suppose we
use three. Then the data is split into three approximately equal partitions; each in turn is used
for testing and the remainder is used for training. That is, use two-thirds of the data for training
and one-third for testing, and repeat the procedure three times so that in the end, every instance
has been used exactly once for testing. This is called three-fold cross-validation, and if
stratification is adopted as well—which it often is—it is stratified three-fold cross-validation.
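The three-fold procedure just described can be written out directly. In this sketch a hypothetical majority-class learner stands in for a real classifier; the point is that every instance is used exactly once for testing:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k approximately equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, k=3):
    accs = []
    for fold in k_fold_indices(len(data), k):
        test = [data[i] for i in fold]
        train = [d for i, d in enumerate(data) if i not in set(fold)]
        # Majority-class learner (like ZeroR) as a stand-in classifier
        labels = [c for _, c in train]
        majority = max(set(labels), key=labels.count)
        accs.append(sum(c == majority for _, c in test) / len(test))
    # Average the k per-fold accuracies into one overall estimate
    return sum(accs) / k

data = [(i, "yes" if i % 4 else "no") for i in range(12)]
print(f"3-fold CV accuracy: {cross_validate(data):.3f}")
```

Stratified cross-validation would additionally balance the class proportions within each fold, in the same way stratification balances the holdout split.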
Random seed   Holdout (10%) accuracy      Random seed   10-fold CV accuracy
2             0.779                       2             0.750
3             0.805                       3             0.755
4             0.740                       4             0.755
5             0.714                       5             0.744
6             0.701                       6             0.756
7             0.792                       7             0.736
8             0.714                       8             0.740
9             0.805                       9             0.745
10            0.675                       10            0.730
Random seed   Holdout (10%) accuracy (%)   Random seed   10-fold CV accuracy (%)
2             75.3                         2             73.1
3             71.4                         3             72.8
4             72.7                         4             74.9
5             77.9                         5             74.2
6             71.4                         6             73.0
7             74.0                         7             73.4
8             68.8                         8             71.9
9             75.3                         9             74.6
10            66.2                         10            71.4
Examine NaiveBayes on the Diabetes dataset:
Random seed   Holdout (10%) accuracy (%)   Random seed   10-fold CV accuracy (%)
2             75.3                         2             75.2
3             72.7                         3             76.1
4             68.8                         4             75.5
5             80.5                         5             75.1
6             76.6                         6             75.8
7             76.6                         7             76.2
8             74.0                         8             75.3
9             76.6                         9             76.0
10            71.4                         10            75.9