8 Classification
◼ Classification Techniques
◼ Decision Tree
◼ Naïve Bayes
Supervised Learning | Unsupervised Learning
Input data is labelled. | Input data is unlabelled.
Data is classified based on a training dataset. | Uses properties of the given data to cluster it.
Used for prediction. | Used for analysis.
Regression & Classification | Clustering & Association
Known number of classes | Unknown number of classes
Off-line analysis of data | Real-time analysis of data
https://www.lotus-qa.com/blog/data-annotation-guide/
Classification vs. Numeric Prediction
Classification | Regression
The output variable is discrete. | The output variable is continuous.
The dependent variable (label) is one of two or more classes. | The dependent variable is real-valued (a quantity).
Ex. classifying mail as spam or not spam, web page categorization. | Ex. predicting the value of a house, the amount of sales, or a student’s mark.
Algorithms: Logistic Regression, K-Nearest Neighbours, Support Vector Machines, Naïve Bayes, Decision Tree, and Random Forest. | Algorithms: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression, Decision Tree Regression, Random Forest Regression, and Support Vector Regression.
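For a concrete feel of the discrete-vs-continuous distinction, here is a minimal scikit-learn sketch (added for illustration, not part of the original slides; the toy house-size data is made up):

```python
# Same features; a classifier predicts a discrete label,
# a regressor predicts a continuous quantity.  Toy data only.
from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[1500], [2000], [2500], [3000]]    # house size in sq. ft (hypothetical)
y_class = [0, 0, 1, 1]                  # discrete label: 0 = cheap, 1 = expensive
y_value = [200.0, 260.0, 330.0, 400.0]  # continuous target: price in $1000s

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_value)

print(clf.predict([[2200]]))  # a class label (0 or 1)
print(reg.predict([[2200]]))  # a real value (roughly 290)
```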
Classification: Definition
◼ Task:
◼ Learn a model that maps each attribute set x into one
of the predefined class labels y
General Approach for Building a Classification Model
Classification—A Two-Step Process
◼ Model Construction: describing a set of predetermined classes
◼ Each tuple is assumed to belong to a predefined class, as determined by
the class label attribute
◼ The set of tuples used for model construction is the training set
◼ The model is represented as classification rules, decision trees, or
mathematical formulae
◼ Model Usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of a test sample is compared with the classified result from the model.
◼ Accuracy rate is the percentage of test set samples that are correctly
classified by the model.
◼ The test set is independent of the training set (otherwise overfitting occurs)
◼ If the accuracy is acceptable, use the model to classify new data (a sketch of both steps follows).
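To make the two steps concrete, here is a minimal scikit-learn sketch (added for illustration, not from the original slides; the Iris dataset and the 70/30 split are arbitrary choices):

```python
# Minimal sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 1 -- Model construction: learn from labelled training tuples.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 -- Model usage: estimate accuracy on unseen test samples,
# then (if acceptable) classify new data.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```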
Classification—A Two-Step Process
[Figure: training data is fed to a classification algorithm to learn a model]
◼ Base Classifiers
◼ Decision Tree based Methods
◼ Rule-based Methods
◼ Nearest-neighbor
◼ Naïve Bayes
◼ Support Vector Machines
◼ Neural Networks, Deep Neural Nets
◼ Ensemble Classifiers
◼ Boosting, Bagging, Random Forests
Decision Tree Induction
◼ A decision tree is a tree-like
structure that is used as a
model for classifying data.
◼ It decomposes the data into
sub-trees made of other sub-
trees and/or leaf nodes.
◼ A decision tree is made up of three types of nodes (a code sketch follows this list):
◼ Decision Nodes: nodes that have two or more branches
◼ Leaf Nodes: the lowest nodes, each representing a final decision (class label)
◼ Root Node: also a decision node, but at the topmost level of the tree
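For a concrete view of these node types, here is a small scikit-learn sketch (added for illustration; the Iris data and the depth limit are arbitrary) that induces a tree and prints it as text:

```python
# Induce a decision tree and print its root, decision, and leaf nodes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree.fit(iris.data, iris.target)

# Internal lines are decision nodes (tests on a feature); "class: ..."
# lines are leaf nodes; the very first test is the root node.
print(export_text(tree, feature_names=list(iris.feature_names)))
```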
Brief Review of Entropy
◼ Entropy is a measure of the impurity of a node. By impurity, we mean the heterogeneity at a particular node.
◼ It is a measure of the uncertainty associated with a random variable.
◼ Entropy = 0 implies a pure node, meaning all samples belong to the same category. The higher the entropy, the lower the purity of the node (high heterogeneity).
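The slides do not print the formula; the standard definition is Entropy = −Σ pᵢ log₂(pᵢ), summed over the class proportions pᵢ. A small Python sketch (added for illustration):

```python
# Entropy of a node from its class counts: -sum(p_i * log2(p_i)).
from math import log2

def entropy(counts):
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]  # skip empty classes
    return sum(-p * log2(p) for p in ps)

print(entropy([5, 0]))  # 0.0 -> pure node, all samples in one class
print(entropy([5, 5]))  # 1.0 -> maximally impure for two classes
```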
Information Gain
◼ Information Gain measures the reduction in entropy by
splitting a dataset according to a given value of a random
variable.
◼ Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Info. Gain = Entropy(before splitting) − Entropy(after splitting)

where the entropy after splitting is the weighted average of the entropies of the resulting branches.
3. Split on the feature with the maximum information gain
◼ Choose the attribute with the largest information gain as the decision node, and divide the dataset by its branches (a worked sketch follows the steps below).
4.a A branch with entropy of 0 is a leaf
node
4.b A branch with entropy more than 0
needs further splitting
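Putting entropy and information gain together, here is an illustrative Python sketch (not from the slides; the yes/no labels are made up) that scores candidate splits:

```python
# Information gain = entropy(parent) - weighted entropy of the branches.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, branches):
    """branches: one list of labels per branch of the candidate split."""
    total = len(parent)
    after = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent) - after

parent = ["yes"] * 5 + ["no"] * 5
# A perfect split yields pure branches (entropy 0) and maximal gain.
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))           # 1.0
# A weak split leaves the branches almost as mixed as the parent.
print(information_gain(parent, [["yes", "yes", "yes", "no", "no"],
                                ["yes", "yes", "no", "no", "no"]]))  # ~0.03
```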
Bayesian Classification
◼ Let A and B be two events whose probabilities P(A) and P(B) are known. Then:

P(A|B) = P(B|A) · P(A) / P(B)
◼ Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum a posteriori class, i.e., the class Ci with the maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem:

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

◼ Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized (see the sketch below).
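A tiny sketch of this maximum-a-posteriori rule (added for illustration; the priors and likelihoods are hypothetical numbers):

```python
# Pick the class Ci maximizing P(X|Ci) * P(Ci); P(X) is the same for
# every class, so it can be ignored.  All numbers are hypothetical.
priors = {"C1": 0.7, "C2": 0.3}          # P(Ci)
likelihoods = {"C1": 0.10, "C2": 0.40}   # P(X|Ci) for the observed X

best = max(priors, key=lambda c: likelihoods[c] * priors[c])
print(best)  # C2, since 0.40 * 0.3 = 0.12 > 0.10 * 0.7 = 0.07
```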
Naïve Bayes Classifier
◼ Naïve Bayes applies Bayes’ theorem with a “naïve” assumption: the attributes are conditionally independent given the class, so P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
Naïve Bayes Classifier: An Example
[Figure: scatter plot of customers with a new point to classify]
◼ To classify the new data point as either a default or a non-default customer:
I. What is the probability that this person is a default, given his features?

P(Def.|X) = P(X|Def.) · P(Def.) / P(X)
◼ Step 3: Likelihood

P(X|Def.) = (similar observations among those who are Def.) / (total number of Def. observations) = 2/8 = 0.25
◼ The test data consists of data points that have not been
seen by the model before.
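To complete the calculation end to end, here is an illustrative Python sketch; every count except the 2/8 likelihood above is hypothetical (e.g., 30 customers in total, 8 of them defaulters, 4 observations similar to the new point):

```python
# Naive Bayes posterior for the new point, following the slide's steps.
n_total = 30        # all observed customers (assumed)
n_def = 8           # defaulters (assumed, consistent with the 2/8 above)
n_similar = 4       # observations similar to the new point (assumed)
n_similar_def = 2   # similar observations among defaulters (from the slide)

prior = n_def / n_total             # P(Def.)
marginal = n_similar / n_total      # P(X)
likelihood = n_similar_def / n_def  # P(X|Def.) = 2/8 = 0.25

posterior = likelihood * prior / marginal  # P(Def.|X)
print(posterior)  # 0.25 * (8/30) / (4/30) = 0.5
```

Intuitively, 2 of the 4 observations similar to the new point are defaulters, so the posterior comes out to 0.5.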
Confusion Matrix
◼ A matrix that demonstrates the number of test cases
correctly and incorrectly classified.
◼ TP (True Positive): predicted that the event would happen, and it actually happened.
◼ FP (False Positive, Type I Error): predicted that the event would happen, but it did not.
◼ FN (False Negative, Type II Error): predicted that the event would not happen, but it did.
◼ TN (True Negative): predicted that the event would not happen, and it did not.
https://www.researchgate.net/figure/Confusion-matrix-and-performance-evaluation-metrics_fig5_346062755
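A quick illustrative sketch of computing the four cells with scikit-learn (the labels are made up):

```python
# Confusion matrix from true vs. predicted labels (toy data).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the actual class, columns the predicted: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```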
Confusion Matrix: Example
https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
Classifier Evaluation Metrics
Recall (Completeness)
◼ The ability of a model to find all the relevant cases within a data set.
◼ What proportion of actual positives was identified correctly?
Precision (Exactness)
◼ The ability of a classification model to identify only the relevant data points.
◼ What proportion of positive identifications was actually correct?
Classifier Evaluation Metrics
◼ Sensitivity = TP/P, where P is the total number of actual positives (TP + FN)
◼ Specificity = TN/N, where N is the total number of actual negatives (TN + FP)
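The metrics above, computed from the four confusion-matrix cells (an illustrative sketch reusing the toy counts from the earlier example):

```python
# Evaluation metrics from confusion-matrix counts (toy values).
tp, fp, fn, tn = 3, 1, 1, 3

precision = tp / (tp + fp)    # exactness
recall = tp / (tp + fn)       # completeness; identical to sensitivity
specificity = tn / (tn + fp)  # TN / N
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, specificity, accuracy)  # 0.75 0.75 0.75 0.75
```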
Accuracy Paradox
◼ Consider a model that predicts a rare event (occurring in about 1.5% of cases) with 98% accuracy. If it is replaced by a trivial model that always predicts "no event", the accuracy jumps from 98% to 98.5%, even though the model has effectively stopped working.
◼ Model evaluation should not be based only on the accuracy rate, as it is sometimes misleading (a numeric sketch follows).
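A small numeric illustration (the 1.5% event rate is an assumption chosen to match the 98.5% figure above):

```python
# Accuracy paradox: an always-"no" model beats a working model
# on imbalanced data while detecting nothing at all.
n = 1000
n_pos = 15                       # rare events: 1.5% of cases (assumed)

working_model_acc = 0.98         # scenario 1: a model that actually works
always_no_acc = (n - n_pos) / n  # scenario 2: predict "no event" for all

print(always_no_acc)  # 0.985 > 0.98, yet nothing is ever detected
```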
CAP: The Cumulative Accuracy Profile
◼ Used to visualize the discriminative power of a model: it plots the cumulative fraction of positive outcomes captured against the fraction of cases considered, with cases ordered by the model’s predicted probability.
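A minimal sketch of drawing a CAP curve (added for illustration; the outcomes and model scores are synthetic):

```python
# CAP sketch: rank cases by model score, accumulate positives captured.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)               # synthetic true outcomes
scores = y * 0.5 + rng.random(200) * 0.7  # synthetic, loosely informative scores

order = np.argsort(-scores)               # best-scored cases first
captured = np.cumsum(y[order]) / y.sum()  # fraction of positives found
x = np.arange(1, len(y) + 1) / len(y)     # fraction of cases considered

plt.plot(x, captured, label="model CAP")
plt.plot([0, 1], [0, 1], "--", label="random selection")
plt.xlabel("Fraction of cases considered")
plt.ylabel("Fraction of positives captured")
plt.legend()
plt.show()
```

The better the model separates the classes, the faster its curve rises above the random-selection diagonal.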