More Data Mining With Weka: Ian H. Witten
Class 1 Lesson 1
Introduction
Ian H. Witten
Department of Computer Science
University of Waikato
New Zealand
weka.waikato.ac.nz
Use Weka on your own data and understand what you're doing!
Experimenter interface
Using the Experimenter to compare classifiers
Knowledge Flow interface
Simple Command Line interface
Working with big data
Explorer: 1 million instances, 25 attributes
Command line interface: effectively unlimited
In the Activity you will process a multi-million-instance dataset
Course organization
Class 1 Exploring Weka's interfaces; working with big data
Lesson 1.1, Activity 1
Lesson 1.2, Activity 2
Lesson 1.3, Activity 3
Lesson 1.4, Activity 4
Lesson 1.5, Activity 5
Lesson 1.6, Activity 6
Class 4 Selecting attributes and counting the cost
1/3
Post-class assessment
2/3
Weka 3.6.11
the latest stable version of Weka
includes datasets for the course
it's important to get the right version!
Textbook
This textbook discusses data mining, and Weka, in depth:
Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten, Eibe Frank and Mark A. Hall. Morgan Kaufmann, 2011
World Map by David Niblack, licensed under a Creative Commons Attribution 3.0 Unported License
Trying out classifiers/filters
Performance comparisons
Graphical interface
Command-line interface
Training data
ML algorithm
Classifier
Evaluation results
Basic assumption: training and test sets produced by
independent sampling from an infinite population
Deploy!
With segment-challenge.arff
and J48 (trees>J48)
Set percentage split to 90%
Run it: 96.7% accuracy
Repeat
[More options] Repeat with seed 2, 3, 4, 5, 6, 7, 8, 9, 10
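Why the seed changes the result can be pictured with a short sketch. The Python below is an illustration only, not Weka's code: the function name `percentage_split` and the placeholder dataset of 1500 instance indices are assumptions. It shuffles the data with a seeded generator and holds out the final 10%, so each seed gives a different train/test partition and therefore a different accuracy estimate.

```python
import random

def percentage_split(instances, train_fraction, seed):
    """Shuffle with a seeded generator, then cut into train and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder data: instance indices only (the size is an assumption).
data = range(1500)

for seed in range(1, 11):
    train, test = percentage_split(data, 0.9, seed)
    # Show the split sizes and a few of the held-out indices for each seed.
    print(seed, len(train), len(test), sorted(test)[:5])
```

The same seed always reproduces the same split, which is what makes a run repeatable.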
0.967
0.940
0.940
0.967
0.953
0.967
0.920
0.947
0.933
0.947
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Variance: $\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation: $\sigma$
$\bar{x} = 0.949$, $\sigma = 0.018$
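The sample mean and standard deviation can be recomputed from the ten accuracies with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative. The variance divides by n - 1, as in the formula above. Because it works from the three-decimal figures as listed, its output need not match the slide's quoted values exactly.

```python
import math

# The ten accuracies from the repeated 90% split runs (seeds 1 to 10).
accuracies = [0.967, 0.940, 0.940, 0.967, 0.953,
              0.967, 0.920, 0.947, 0.933, 0.947]

n = len(accuracies)
mean = sum(accuracies) / n
# Sample variance: sum of squared deviations divided by n - 1.
variance = sum((x - mean) ** 2 for x in accuracies) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.3f}, std dev = {std_dev:.3f}")
```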
Stratified cross-validation
Ensure that each fold has the right
proportion of each class value
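One simple way to obtain stratified folds is to deal the instances of each class round-robin across the folds. The sketch below illustrates that idea and is not Weka's implementation: `stratified_folds` and the example labels are hypothetical, and a practical version would shuffle the data first.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Return a fold number (0..k-1) for each instance, dealing each class
    round-robin so class proportions are roughly equal across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    next_fold = 0
    for indices in by_class.values():
        for index in indices:
            fold_of[index] = next_fold
            next_fold = (next_fold + 1) % k
    return fold_of

# Hypothetical class values in a 2:1 ratio.
labels = ["a"] * 20 + ["b"] * 10
folds = stratified_folds(labels, k=5)
for f in range(5):
    # Count the class values that landed in fold f.
    print(f, Counter(l for l, g in zip(labels, folds) if g == f))
```

Each fold keeps the 2:1 class ratio of the full dataset, which is the property stratification is meant to preserve.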
Save/Load an experiment
Save the results in an ARFF file or in a database
Preserve order in Train/Test split (can't do repetitions)
Use several datasets, and several classifiers
Advanced mode
Run panel
Analyse panel
Load results from a .csv or ARFF file or from a database
Many options
The Experimenter
v significantly better
* significantly worse
Instance or dataset
test set, training set
classifier
output, text or chart
Choose an ArffLoader (Toolbar: DataSources); Configure to set the file iris.arff
Connect up a ClassAssigner (Toolbar: Evaluation) to select the class
Connect the result to a CrossValidationFoldMaker (Toolbar: Evaluation)
Connect this to J48 (Toolbar: Classifiers)
Make two connections, one for trainingSet and the other for testSet
Connect J48 to ClassifierPerformanceEvaluator (Toolbar: Evaluation)
Connect this to a TextViewer (Toolbar: Visualization)
Add a ModelPerformanceChart
Connect the visualizableError output of ClassifierPerformanceEvaluator to it
Show chart (need to run again)
instance connection
updateable classifier
incremental evaluator
StripChart visualization
General options
-h print help info
-t <name of training file>
-T <name of test file>
class
Preprocess panel, Generate, choose LED24; show text: 100 instances, 25 attributes
100,000 examples (use % split!): NaiveBayes 74%, J48 73%
1,000,000 examples: NaiveBayes 74%, J48 runs out of memory
2,000,000 examples: Generate process grinds to a halt
Use NaiveBayesUpdateable
java weka.classifiers.bayes.NaiveBayesUpdateable -t train.arff -T test.arff
74%; 4 mins
Note: if no test file is specified, Weka will do cross-validation, which will fail (non-incremental)
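An updateable classifier copes with very large files because it can learn one instance at a time and never needs the whole dataset in memory. The sketch below illustrates that idea for nominal attributes. It is not Weka's NaiveBayesUpdateable: the class `IncrementalNaiveBayes`, its method names, the Laplace smoothing, and the four-instance stream are all assumptions made for illustration.

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes for nominal attributes that stores only counts."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # value_counts[(attribute index, value, class)] -> count
        self.value_counts = defaultdict(int)
        # values_seen[attribute index] -> set of distinct values
        self.values_seen = defaultdict(set)

    def update(self, attributes, label):
        """Absorb one training instance; the instance can then be discarded."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        """Return the class with the highest log-probability score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for i, value in enumerate(attributes):
                # Laplace smoothing so unseen values do not give log(0).
                numerator = self.value_counts.get((i, value, label), 0) + 1
                denominator = count + max(len(self.values_seen.get(i, ())), 1)
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical stream of (attribute values, class) pairs.
stream = [(("1", "0"), "x"), (("1", "1"), "x"),
          (("0", "0"), "y"), (("0", "1"), "y")]

model = IncrementalNaiveBayes()
for attributes, label in stream:
    model.update(attributes, label)

print(model.predict(("1", "0")))
```

With a real file, the loop would read one line at a time, so memory use depends on the number of distinct attribute values and classes rather than on the number of instances.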
creativecommons.org/licenses/by/3.0/