Data Mining Lab 02 - Truong Quang Tuong - ITITIU20130


Introduction to Data Mining

Lab 2: Evaluation

2.1. Be a classifier

In the second class, we learn how to use datasets to evaluate data mining algorithms in Weka (see the Class 2 lectures by Ian H. Witten [1]).

Interactive decision tree construction

• Follow the instructions in [1] to see how decision trees are created for different combinations of attributes in a dataset. First, a training dataset and a separate test set are selected. Second, we choose and run UserClassifier to see a decision tree in the Tree Visualizer. Third, in the Data Visualizer, we select the attributes to use for X and Y, then select the instances in a region of the graph and submit them. At this point, the Tree Visualizer shows the updated tree.
• Examine the segment-challenge dataset to draw a decision tree for the following pair of attributes by selecting and submitting classes one by one, then note how many instances are predicted correctly.

Attributes      Split on region-centroid-row and intensity-mean
Decision tree   (tree built in the Tree Visualizer; screenshot not reproduced here)
[1] Ian H. Witten, Data Mining with Weka (MOOC), http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

Remark

Out of 810 instances, 634 were correctly classified, an accuracy of 78.27%.

What strategy did you use to build the tree? – Interactively selecting and submitting class regions with UserClassifier.

Can you build a “perfect” tree? – Yes: by repeatedly splitting until every region is pure, a tree that classifies all training instances correctly can be built. Such a tree, however, overfits the training data and is unlikely to do as well on new instances.
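UserClassifier itself is interactive, but the train-then-evaluate loop behind the numbers above can be scripted with the Weka Java API. The following minimal sketch uses J48 as a stand-in for the hand-built tree, and assumes segment-challenge.arff (training) and segment-test.arff (supplied test set) are in the working directory:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSet {
    public static void main(String[] args) throws Exception {
        // Load the training data and the supplied test set.
        Instances train = DataSource.read("segment-challenge.arff");
        Instances test = DataSource.read("segment-test.arff");
        // The class attribute is the last one in both files.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // J48 stands in here for the interactively built UserClassifier tree.
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate on the supplied test set and report correct/total.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.printf("%d of %d instances correct (%.2f%%)%n",
                (int) eval.correct(), test.numInstances(), eval.pctCorrect());
    }
}
```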

2.2. Training and testing


The lecture on evaluation (see [1]-2.2)

Follow the instructions in [1]-2.3: use J48 to analyze the segment-challenge dataset with a percentage split, and write down what accuracy it achieves with different random seeds. (If a random number seed is provided, the dataset is shuffled before the training subset is extracted.)

Random seed   Accuracy      Random seed   Accuracy
1             0.967         6             0.967
2             0.940         7             0.920
3             0.940         8             0.940
4             0.967         9             0.933
5             0.953         10            0.947

Sample mean: 0.949
Standard deviation: 0.018
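For reference, the sample mean and standard deviation in the last two rows follow the usual definitions over the $n = 10$ per-seed accuracies (assuming the $n-1$ sample variance):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$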

Remark? – Seeds 1, 4, 5, and 6 gave the highest accuracies (0.953–0.967), while the remaining seeds produced slightly lower values. The spread across seeds shows that a single train/test split gives only an estimate of the classifier's true accuracy.
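The whole seed experiment can be scripted as well. This sketch assumes a 90% percentage split, as in the lecture, and segment-challenge.arff in the working directory; it prints each per-seed accuracy and then the sample mean and standard deviation:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int runs = 10;
        double sum = 0, sumSq = 0;
        for (int seed = 1; seed <= runs; seed++) {
            // Shuffle with this seed, then train on 90% and test on the rest.
            Instances copy = new Instances(data);
            copy.randomize(new Random(seed));
            int trainSize = (int) Math.round(copy.numInstances() * 0.9);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test = new Instances(copy, trainSize,
                    copy.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);

            double acc = eval.pctCorrect() / 100.0;
            System.out.printf("seed %2d: %.3f%n", seed, acc);
            sum += acc;
            sumSq += acc * acc;
        }
        double mean = sum / runs;
        double var = (sumSq - runs * mean * mean) / (runs - 1); // sample variance
        System.out.printf("mean %.3f, std dev %.3f%n", mean, Math.sqrt(var));
    }
}
```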

2.3. Baseline accuracy


Follow the instructions in [1]-2.4 to run some classifiers on the diabetes dataset:

Classifier    Accuracy
J48           76%
NaiveBayes    77%
IBk           73%
PART          74%
ZeroR         67.8%

What is the baseline accuracy? – 65.1%: the accuracy of ZeroR, which always predicts the majority class (500 of the 768 diabetes instances).
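A comparison like this can also be scripted. The sketch below assumes diabetes.arff is in the working directory; it uses 10-fold cross-validation rather than whatever split produced the table above, so its numbers will differ slightly:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.PART;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {
            new J48(), new NaiveBayes(), new IBk(), new PART(), new ZeroR()
        };
        for (Classifier c : classifiers) {
            // Fresh evaluation per classifier, same folds via the same seed.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s %.1f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```

The same code pointed at supermarket.arff reproduces the second table.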

For the supermarket dataset:

Classifier    Accuracy
ZeroR         64%
J48           63%
NaiveBayes    63%
IBk           38%
PART          63%

Why do the classifiers achieve lower accuracy? – In the supermarket dataset the attributes carry almost no information about the class, so none of the learned models can beat the ZeroR baseline; the patterns they find in the training data are mostly noise and do not generalize.

2.4. Cross-validation
The holdout procedure: a certain amount of the data is held over for testing, and the remainder is used for training.

Stratification: each class is properly represented in both training and test sets.

The repeated holdout method of error rate estimation: In each iteration a certain proportion, say
two-thirds, of the data is randomly selected for training (using different random-number seeds),
possibly with stratification, and the remainder is used for testing. The error rates on the different
iterations are averaged to yield an overall error rate.

The lecture covers cross-validation, 10-fold cross-validation, and stratified cross-validation (see [1]-2.5).

In cross-validation, you decide on a fixed number of folds, or partitions, of the data. Suppose we
use three. Then the data is split into three approximately equal partitions; each in turn is used
for testing and the remainder is used for training. That is, use two-thirds of the data for training
and one-third for testing, and repeat the procedure three times so that in the end, every instance
has been used exactly once for testing. This is called three-fold cross-validation, and if
stratification is adopted as well—which it often is—it is stratified three-fold cross-validation.

Weka does stratified cross-validation by default.
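The holdout-versus-cross-validation comparison in the tables below can be reproduced with a sketch like this, assuming diabetes.arff in the working directory and a 90%/10% split for the holdout column (crossValidateModel randomizes and stratifies the folds internally):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldoutVsCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 10; seed++) {
            // Holdout: shuffle, train on 90%, test on the remaining 10%.
            Instances copy = new Instances(data);
            copy.randomize(new Random(seed));
            int trainSize = (int) Math.round(copy.numInstances() * 0.9);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test = new Instances(copy, trainSize,
                    copy.numInstances() - trainSize);
            J48 holdoutTree = new J48();
            holdoutTree.buildClassifier(train);
            Evaluation holdout = new Evaluation(train);
            holdout.evaluateModel(holdoutTree, test);

            // Stratified 10-fold cross-validation on the full dataset.
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(new J48(), data, 10, new Random(seed));

            System.out.printf("seed %2d: holdout %.3f, 10-fold CV %.3f%n",
                    seed, holdout.pctCorrect() / 100, cv.pctCorrect() / 100);
        }
    }
}
```

Swapping J48 for PART or NaiveBayes gives the other two tables.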

Follow the instructions in [1]-2.5, and examine J48 on the diabetes dataset.

Holdout (10%)        Accuracy      10-fold cross-validation   Accuracy
Random seed: 1       0.753         Random seed: 1             0.738
-- 2                 0.779         -- 2                       0.750
-- 3                 0.805         -- 3                       0.755
-- 4                 0.740         -- 4                       0.755
-- 5                 0.714         -- 5                       0.744
-- 6                 0.701         -- 6                       0.756
-- 7                 0.792         -- 7                       0.736
-- 8                 0.714         -- 8                       0.740
-- 9                 0.805         -- 9                       0.745
-- 10                0.675         -- 10                      0.730

Sample mean          0.748         Sample mean                0.745
Standard deviation   0.046         Standard deviation         0.009

Examine PART on the diabetes dataset:

Holdout (10%)        Accuracy      10-fold cross-validation   Accuracy
Random seed: 1       0.753         Random seed: 1             0.753
-- 2                 0.753         -- 2                       0.731
-- 3                 0.714         -- 3                       0.728
-- 4                 0.727         -- 4                       0.749
-- 5                 0.779         -- 5                       0.742
-- 6                 0.714         -- 6                       0.730
-- 7                 0.740         -- 7                       0.734
-- 8                 0.688         -- 8                       0.719
-- 9                 0.753         -- 9                       0.746
-- 10                0.662         -- 10                      0.714

Sample mean          0.728         Sample mean                0.735
Standard deviation   0.033         Standard deviation         0.012

Examine NaiveBayes on the diabetes dataset:

Holdout (10%)        Accuracy      10-fold cross-validation   Accuracy
Random seed: 1       0.779         Random seed: 1             0.763
-- 2                 0.753         -- 2                       0.752
-- 3                 0.727         -- 3                       0.761
-- 4                 0.688         -- 4                       0.755
-- 5                 0.805         -- 5                       0.751
-- 6                 0.766         -- 6                       0.758
-- 7                 0.766         -- 7                       0.762
-- 8                 0.740         -- 8                       0.753
-- 9                 0.766         -- 9                       0.760
-- 10                0.714         -- 10                      0.759

Sample mean          0.750         Sample mean                0.757
Standard deviation   0.032         Standard deviation         0.004

For all three classifiers, the 10-fold cross-validation estimates vary far less across seeds than the single 10% holdout estimates, which is why cross-validation is the preferred evaluation method when data is limited.
