Module 3
Supervised Learning:
Linear Regression & Classification
Textbook: An Introduction to Statistical Learning with Applications in R,
Sections 3.1 (3.1.1, 3.1.2, 3.1.3) and 3.2 (3.2.1)
Linear Regression: Simple Linear Regression: Estimating the
Coefficients, Assessing the Accuracy of the Coefficient
Estimates, Assessing the Accuracy of the Model, Multiple
Linear Regression: Estimating the Regression Coefficients
(Assessing the accuracy of the coefficient estimates involves testing H0: β1 = 0, no relationship between X and Y, against Ha: β1 ≠ 0.)
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 8 —
Classification
(Figure: using the model for classification. A classification algorithm learns a classifier from the training data; the classifier is then applied to testing data (below) and to unseen data, e.g., (Jeff, Professor, 4) → Tenured?)

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Chapter 8. Classification: Basic Concepts
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm; a sketch follows this list)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  – There are no samples left
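A minimal Python sketch of this greedy, recursive induction, assuming tuples are stored as dicts of categorical attribute values with a "class" key (the data layout and function names are illustrative, not from the textbook):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    """Information gain of splitting the examples on attr."""
    labels = [ex[target] for ex in examples]
    info_a = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        info_a += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - info_a

def induce_tree(examples, attributes, target="class"):
    """Top-down, recursive, divide-and-conquer induction."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:           # all samples in one class
        return labels[0]
    if not attributes:                  # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    subtree = {}
    for value in {ex[best] for ex in examples}:  # only non-empty partitions
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        subtree[(best, value)] = induce_tree(subset, rest, target)
    return subtree
```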
Brief Review of Entropy
Entropy measures the impurity (uncertainty) of a data set. For m = 2 classes, entropy ranges from 0 (all tuples belong to one class) to 1 bit (the two classes are equally likely).
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -Σ (i=1..m) pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ (j=1..v) (|Dj|/|D|) × Info(Dj)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
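These formulas are easy to check numerically. A small sketch in Python; the 9/5 class split and the age partition counts are taken from the worked example that follows:

```python
import math

def info(counts):
    """Info(D) = -sum p_i log2(p_i) for a vector of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

info_d = info([9, 5])                                   # I(9,5)
# age partitions: <=30 -> (2 yes, 3 no); 31...40 -> (4, 0); >40 -> (3, 2)
info_age = (5/14)*info([2, 3]) + (4/14)*info([4, 0]) + (5/14)*info([3, 2])
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# 0.94 0.694 0.247  (the slides round Gain(age) to 0.246)
```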
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Training data D:

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Splitting on age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

(5/14) I(2,3) means partition "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

age has the highest information gain, so it is selected as the splitting attribute.
Decision Tree Generation
(Figure: the decision tree produced by recursively splitting on the selected attributes)
Computing Information Gain for Continuous-Valued Attributes
• Let A be a continuous-valued attribute
• Must determine the best split point for A (see the sketch below)
  – Sort the values of A in increasing order
  – Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
  – The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
  – D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
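A sketch of this split-point search, assuming the attribute arrives as parallel lists of values and class labels (the example ages and labels at the end are hypothetical, for illustration only):

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the midpoint minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values give no usable midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2   # (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= split]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > split]   # D2: A > split-point
        info_a = len(left) / n * info(left) + len(right) / n * info(right)
        best = min(best, (info_a, split))
    return best   # (expected information, chosen split point)

# Hypothetical ages and class labels, for illustration only.
print(best_split_point([25, 32, 41, 47], ["no", "yes", "yes", "no"]))
```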
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased toward attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (it normalizes information gain):

    SplitInfo_A(D) = -Σ (j=1..v) (|Dj|/|D|) log2(|Dj|/|D|)

  – GainRatio(A) = Gain(A)/SplitInfo_A(D)
• Ex. For income (4 high, 6 medium, and 4 low tuples in the table above):

    SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
    GainRatio(income) = 0.029/1.557 = 0.019

• The attribute with the maximum gain ratio is selected as the splitting attribute
• Understandability vs. accuracy is a trade-off; alternatives to simple univariate trees include multivariate splits and if-then rules
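A quick numeric check of the gain-ratio example (the income partition sizes come from the 14-tuple training table, and Gain(income) = 0.029 from the earlier slide):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) * log2(|Dj|/|D|)."""
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s)

si = split_info([4, 6, 4])          # income: 4 high, 6 medium, 4 low tuples
gain_income = 0.029                 # from the information-gain example
print(round(si, 3), round(gain_income / si, 3))   # 1.557 0.019
```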
Enhancements to Basic Decision Tree Induction
RainForest: Training Set and Its AVC-Sets
(AVC stands for Attribute-Value, Classlabel: for each attribute at a node, an AVC-set records the class-label counts per attribute value, so trees can be induced over training sets that do not fit in memory.)
Presentation of Classification Results
(Figure: screenshot of a classification-results presentation)
Visualization of a Decision Tree in SGI/MineSet 3.0
(Figure: decision tree visualization in SGI/MineSet 3.0)
Interactive Visual Mining by Perception-Based Classification (PBC)
Interactive approaches to decision tree induction allow us to visualize the data and the tree as it is being constructed; using knowledge of our data helps in building the tree.
Perception-based classification (PBC) is an interactive approach based on multidimensional visualization techniques that allows the user to incorporate background knowledge about the data when building a decision tree.
PBC uses a pixel-oriented approach to view multidimensional data with its class label information.
The circle segments approach is adapted: it maps d-dimensional data objects to a circle that is partitioned into d segments, each representing one attribute.
The PBC system displays a split screen, consisting of a Data Interaction window and a Knowledge Interaction window.
Chapter 8. Classification: Basic Concepts
Naïve Bayes Classifier
Bayes' theorem gives the posterior probability of class Ci given tuple X:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

The naïve assumption of class-conditional independence (attributes are independent given the class) simplifies the likelihood to

    P(X|Ci) = Π (k=1..n) P(xk|Ci)

If Ak is categorical, P(xk|Ci) is the fraction of tuples of class Ci having value xk for Ak. If Ak is continuous-valued, P(xk|Ci) is usually estimated with a Gaussian density with mean μCi and standard deviation σCi:

    P(xk|Ci) = g(xk, μCi, σCi)
Naïve Bayes Classifier: Training Dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example
Priors: P(buys_computer = 'yes') = 9/14 = 0.643; P(buys_computer = 'no') = 5/14 = 0.357

Conditional probabilities for X = (age <=30, income = medium, student = yes, credit_rating = fair):

P(age <=30 | yes) = 2/9 = 0.222             P(age <=30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444      P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667        P(student = yes | no) = 1/5 = 0.200
P(credit_rating = fair | yes) = 6/9 = 0.667 P(credit_rating = fair | no) = 2/5 = 0.400

    P(X|yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X|no)  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
    P(X|yes) P(yes) = 0.044 × 0.643 = 0.028
    P(X|no)  P(no)  = 0.019 × 0.357 = 0.007

Therefore, X belongs to class buys_computer = 'yes'.
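A minimal naïve Bayes sketch over the training table above, reproducing the posterior scores for X. The data layout is illustrative; probabilities are estimated as simple frequency ratios, with no Laplacian correction:

```python
from collections import Counter

# The 14-tuple training set: (age, income, student, credit_rating, class).
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")       # tuple to classify

class_counts = Counter(row[-1] for row in data)   # yes: 9, no: 5
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                 # prior P(Ci)
    for k, value in enumerate(x):           # P(X|Ci) = product of P(xk|Ci)
        match = sum(1 for row in data if row[-1] == c and row[k] == value)
        score *= match / n_c
    scores[c] = score
print(scores)                               # scores ≈ {'no': 0.007, 'yes': 0.028}
print(max(scores, key=scores.get))          # 'yes'
```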
Classifier Evaluation Metrics: Confusion Matrix

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)
Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive
    precision = TP / (TP + FP)
• Recall (completeness): what % of positive tuples the classifier labeled as positive
    recall = TP / (TP + FN)
• F-measure (F1): the harmonic mean of precision and recall (see the helper below)
    F = 2 × precision × recall / (precision + recall)
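A small helper computing all three metrics directly from the confusion-matrix counts (the counts in the example call are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, for illustration only.
print(precision_recall_f1(tp=90, fp=140, fn=210))
```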
Classifier Evaluation Metrics: Example
Model Evaluation Measures
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
  – The given data is randomly partitioned into two independent sets
    • Training set (e.g., 2/3) for model construction
    • Test set (e.g., 1/3) for accuracy estimation
  – Random subsampling: a variation of holdout
    • Repeat holdout k times; accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular; see the sketch below)
  – Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
  – At the i-th iteration, use Di as the test set and the others as the training set
  – Leave-one-out: k folds where k = # of tuples, for small-sized data
  – Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
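A sketch of k-fold cross-validation, assuming a `train_and_test(train, test)` callback that builds a model on the training tuples and returns its accuracy on the test fold (the callback is a placeholder, not a library API):

```python
import random

def k_fold_split(n, k=10, seed=0):
    """Randomly partition n tuple indices into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]   # folds of approximately equal size

def cross_validate(data, train_and_test, k=10):
    """At the i-th iteration, fold i is the test set; the rest is training."""
    folds = k_fold_split(len(data), k)
    accs = []
    for i in range(k):
        test_idx = set(folds[i])
        train = [d for j, d in enumerate(data) if j not in test_idx]
        test = [data[j] for j in folds[i]]
        accs.append(train_and_test(train, test))  # accuracy on the test fold
    return sum(accs) / k                          # average over the k folds
```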
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
  – Works well with small data sets
  – Samples the given training tuples uniformly with replacement
    • i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
• There are several bootstrap methods; a common one is the .632 bootstrap (sketched below)
  – A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data ends up in the bootstrap sample, and the remaining 36.8% forms the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368)
  – Repeat the sampling procedure k times; the overall accuracy of the model averages the per-round accuracies:

    Acc(M) = (1/k) Σ (i=1..k) (0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set)
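A sketch of the .632 bootstrap, again assuming a hypothetical `train_and_test` callback, here returning a pair (accuracy on the test set, accuracy on the training set):

```python
import random

def bootstrap_632(data, train_and_test, k=10, seed=0):
    """.632 bootstrap: train on d tuples drawn with replacement, test out-of-bag."""
    rng = random.Random(seed)
    d = len(data)
    total = 0.0
    for _ in range(k):
        picked = [rng.randrange(d) for _ in range(d)]  # d samples, with replacement
        in_bag = set(picked)
        train = [data[i] for i in picked]
        test = [data[i] for i in range(d) if i not in in_bag]  # ~36.8% of the data
        acc_test, acc_train = train_and_test(train, test)
        total += 0.632 * acc_test + 0.368 * acc_train
    return total / k
```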
Estimating Confidence Intervals: Classifier Models M1 vs. M2
Suppose we have two models, M1 and M2. Perform 10-fold cross-validation to obtain a mean error rate for each. These mean error rates are only estimates of error on the true population of future cases, so we use a test of statistical significance to decide whether the observed difference between them is real.

Estimating Confidence Intervals: Null Hypothesis
Null hypothesis: M1 and M2 are the same, i.e., their mean error rates are equal. Apply the t-test; if the null hypothesis can be rejected, we conclude that the difference between M1 and M2 is statistically significant and can choose the model with the lower error rate.
Estimating Confidence Intervals: t-test
• If only one test set is available: pairwise comparison
  – For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
  – Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
  – The t-test computes the t-statistic with k-1 degrees of freedom:

    t = (err(M1) - err(M2)) / sqrt(var(M1 - M2) / k)

    where var(M1 - M2) = (1/k) Σ (i=1..k) [err(M1)_i - err(M2)_i - (err(M1) - err(M2))]²
Estimating Confidence Intervals: Table for t-distribution
• The t-distribution is symmetric
• Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
• Confidence limit: z = sig/2
• If t falls outside the interval (-z table value, +z table value), it lies in the rejection region, and the hypothesis that M1 and M2 are the same is rejected
Estimating Confidence Intervals: Statistical Significance
• Are M1 & M2 significantly different?
  – Compute t. Select a significance level (e.g., sig = 5%)
  – Consult the table for the t-distribution: find the t-value corresponding to k-1 degrees of freedom (here, 9)
  – The t-distribution is symmetric: typically only the upper percentage points of the distribution are shown, so look up the value for confidence limit z = sig/2 (here, 0.025)
  – If t > z or t < -z, then the t-value lies in the rejection region:
    • Reject the null hypothesis that the mean error rates of M1 & M2 are the same
    • Conclude: there is a statistically significant difference between M1 & M2
  – Otherwise, conclude that any difference is due to chance
(a sketch of this pairwise comparison follows below)
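A sketch of the procedure, using the 1/k variance from the t-test slide and SciPy only for the t-distribution table lookup (the per-fold error rates are hypothetical):

```python
import math
from scipy import stats   # used only to look up the t-distribution table value

def paired_t_test(err1, err2, sig=0.05):
    """Pairwise t-test on per-fold error rates of models M1 and M2."""
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / k   # 1/k variance, as above
    t = mean / math.sqrt(var / k)
    z = stats.t.ppf(1 - sig / 2, df=k - 1)          # table value for z = sig/2
    return t, z, abs(t) > z   # True -> reject "M1 and M2 are the same"

# Hypothetical per-fold error rates over one run of 10-fold cross-validation.
err_m1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.14, 0.10, 0.12, 0.13, 0.11]
err_m2 = [0.14, 0.13, 0.16, 0.14, 0.15, 0.16, 0.13, 0.14, 0.16, 0.14]
print(paired_t_test(err_m1, err_m2))
```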
Model Selection: Cost-Benefit Analysis
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
Model Selection: ROC Curves
• TPR = TP/P, which is sensitivity
• FPR = FP/N, which is 1 - specificity
• The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The plot also shows a diagonal line (the expected curve for random guessing)
• A model with perfect accuracy will have an area of 1.0
Tuple No.  Class  TP  FP  TN  FN  TPR = TP/(TP+FN)  FPR = FP/(FP+TN)
1          P      1   0   5   4   0.2               0
2          P      2   0   5   3   0.4               0
3          N      2   1   4   3   0.4               0.2
4          P      3   1   4   2   0.6               0.2
5          P      4   1   4   1   0.8               0.2
6          N      4   2   3   1   0.8               0.4
7          N      4   3   2   1   0.8               0.6
8          N      4   4   1   1   0.8               0.8
9          P      5   4   1   0   1.0               0.8
10         N      5   5   0   0   1.0               1.0

(The test tuples are ranked in decreasing order of the classifier's estimated probability of the positive class; each row moves one more tuple from "predicted negative" to "predicted positive".)
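A sketch that rebuilds the (FPR, TPR) points from a ranking like the one in the table and integrates the area under the curve with the trapezoidal rule:

```python
def roc_points(ranked_classes):
    """(FPR, TPR) after each prefix of tuples ranked by P(positive), descending."""
    p = ranked_classes.count("P")
    n = ranked_classes.count("N")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for cls in ranked_classes:
        if cls == "P":
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))   # (FPR, TPR)
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points(list("PPNPPNNNPN"))      # the ranking from the table above
print(auc(pts))                           # 0.76 for this ranking
```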
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
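Bagging trains k classifiers on bootstrap samples of D and classifies a new tuple by majority vote. A minimal sketch, assuming a hypothetical `learn(sample)` callback that trains a base classifier on a sample and returns it as a callable:

```python
import random
from collections import Counter

def bagging_predict(data, x, learn, k=9, seed=0):
    """Train k classifiers on bootstrap samples of D; majority vote on x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        # Bootstrap sample: |D| tuples drawn with replacement.
        sample = [data[rng.randrange(len(data))] for _ in range(len(data))]
        model = learn(sample)         # `learn` returns a classifier function
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]
```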
Boosting
• Analogy: consult several doctors, combining their weighted diagnoses, with each weight assigned based on previous diagnosis accuracy
• How does boosting work?
  – Weights are assigned to each training tuple
  – A series of k classifiers is iteratively learned
  – After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  – The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
• The boosting algorithm can be extended for numeric prediction
• Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di
of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
  – If a tuple is misclassified, its weight is increased; otherwise, it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if Xj was misclassified, 0 otherwise). The error rate of classifier Mi is the sum of the weights of the misclassified tuples:

    error(Mi) = Σ (j=1..d) wj × err(Xj)

• The weight of classifier Mi's vote is log((1 - error(Mi)) / error(Mi))
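A sketch of one AdaBoost round, implementing the error formula above together with the weight update (correctly classified tuples have their weights multiplied by error/(1 - error), then all weights are normalized so the misclassified tuples gain weight):

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost weight update; assumes 0 < error(Mi) < 0.5,
    as the algorithm otherwise discards Mi and resamples."""
    # error(Mi) = sum of the weights of the misclassified tuples
    err = [int(p != y) for p, y in zip(predictions, labels)]
    error_mi = sum(w * e for w, e in zip(weights, err))
    # Classifier vote weight: log((1 - error) / error)
    alpha = math.log((1 - error_mi) / error_mi)
    # Shrink the weights of correctly classified tuples, then normalize.
    new_w = [w * (error_mi / (1 - error_mi)) if e == 0 else w
             for w, e in zip(weights, err)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha

# Hypothetical predictions vs. labels for 5 tuples with uniform weights.
w, a = adaboost_round([0.2] * 5, [1, 1, -1, 1, -1], [1, 1, 1, 1, -1])
print(w, a)
```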
Summary (II)
• Significance tests and ROC curves are useful for model selection.
• There have been numerous comparisons of the different
classification methods; the matter remains a research topic
• No single method has been found to be superior over all others
for all data sets
• Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 9 —
Classification: Advanced Methods
Bayesian Belief Networks
• Bayesian belief networks (also known as Bayesian networks,
probabilistic networks): allow class conditional independencies
between subsets of variables
• A (directed acyclic) graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of the joint probability distribution:

    P(x1, …, xn) = Π (i=1..n) P(xi | Parents(xi))

• Nodes: random variables
• Links: dependency
• (Example graph: X → Z, Y → Z, Y → P. X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P. The graph has no loops/cycles.)
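A minimal sketch of this factorization for the small four-node network above; every conditional probability table entry below is made up purely for illustration:

```python
# Hypothetical conditional probability tables, keyed by parent assignments.
cpt = {
    "X": {(): {True: 0.3, False: 0.7}},
    "Y": {(): {True: 0.6, False: 0.4}},
    "Z": {(True, True):  {True: 0.9, False: 0.1},
          (True, False): {True: 0.5, False: 0.5},
          (False, True): {True: 0.4, False: 0.6},
          (False, False): {True: 0.1, False: 0.9}},
    "P": {(True,):  {True: 0.8, False: 0.2},
          (False,): {True: 0.2, False: 0.8}},
}
parents = {"X": (), "Y": (), "Z": ("X", "Y"), "P": ("Y",)}

def joint(assignment):
    """P(x1, ..., xn) = product over nodes of P(xi | Parents(xi))."""
    prob = 1.0
    for node, pars in parents.items():
        key = tuple(assignment[p] for p in pars)
        prob *= cpt[node][key][assignment[node]]
    return prob

print(joint({"X": True, "Y": True, "Z": True, "P": False}))   # 0.0324
```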
Bayesian Belief Network: An Example
Classification: A Mathematical Mapping
• Classification predicts categorical class labels
  – E.g., personal homepage classification
    • xi = (x1, x2, x3, …), yi = +1 or -1
    • x1: # of occurrences of the word "homepage"
    • x2: # of occurrences of the word "welcome"
• Mathematically, x ∈ X = ℝⁿ, y ∈ Y = {+1, -1}
  – We want to derive a function f: X → Y
• Linear classification
  – Binary classification problem
  – (Figure: a 2-D scatter plot of 'x' and 'o' points separated by a red line.) Data above the red line belongs to class 'x'; data below the red line belongs to class 'o'
  – Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers
• Advantages
– Prediction accuracy is generally high
• As compared to Bayesian methods – in general
– Robust, works when training examples contain errors
– Fast evaluation of the learned target function
• Bayesian networks are normally slow
• Criticism
– Long training time
– Difficult to understand the learned function (weights)
• Bayesian networks can be used easily for pattern discovery
– Not easy to incorporate domain knowledge
• Easy in the form of priors on the data or distributions
SVM—Support Vector Machines
• A relatively new classification method for both linear and
nonlinear data
• It uses a nonlinear mapping to transform the original training
data into a higher dimension
• With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
• SVM finds this hyperplane using support vectors (“essential”
training tuples) and margins (defined by the support vectors)
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to
their ability to model complex nonlinear decision boundaries
(margin maximization)
• Used for: classification and numeric prediction
• Applications:
– handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests
SVM—General Philosophy
SVM—Margins and Support Vectors
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi its associated class label
There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplanes defining the sides of the margin:
    H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
    H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → Quadratic Programming (QP) → Lagrangian multipliers (see the sketch below)
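A minimal sketch of the separating hyperplane used as a classifier; the 2-D weight vector and bias below are hypothetical stand-ins, not values fitted by QP:

```python
import math

w = [2.0, 1.0]      # W = {w1, w2}, hypothetical
b = -5.0            # bias w0, hypothetical

def classify(x):
    """Predict +1 or -1 from the side of the hyperplane W.X + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# The margin between H1 and H2 is 2 / ||W||; SVM training maximizes it.
margin = 2 / math.sqrt(sum(wi * wi for wi in w))
print(classify([3.0, 1.0]), round(margin, 3))   # 1 0.894
```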
Why Is SVM Effective on High-Dimensional Data?
SVM—Linearly Inseparable
SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
SVM vs. Neural Network
• SVM
  – Deterministic algorithm
  – Nice generalization properties
  – Hard to learn: learned in batch mode using quadratic programming techniques
  – Using kernels, can learn very complex functions
• Neural Network
  – Nondeterministic algorithm
  – Generalizes well but doesn't have a strong mathematical foundation
  – Can easily be learned in incremental fashion
  – To learn complex functions: use a multilayer perceptron (nontrivial)