8 Classification
◼ Classification Techniques
◼ Decision Tree
◼ Naïve Bayes
Supervised Learning | Unsupervised Learning
Input data is labelled. | Input data is unlabelled.
Data is classified based on a training dataset. | Uses properties of the given data to cluster it.
Used for prediction. | Used for analysis.
Regression & Classification | Clustering & Association
Known number of classes | Unknown number of classes
Off-line analysis of data | Real-time analysis of data
https://www.lotus-qa.com/blog/data-annotation-guide/
Classification vs. Numeric Prediction
Classification | Regression
The output variable is discrete. | The output variable is continuous.
The dependent variable (label) is one of two or more classes. | The dependent variable is real-valued (a quantity).
Ex. classifying mail as spam or not spam, web page categorization. | Ex. predicting the value of a house, the amount of sales, or a student’s mark.
Algorithms: Logistic Regression, K-Nearest Neighbours, Support Vector Machines, Naïve Bayes, Decision Tree, and Random Forest. | Algorithms: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression, Decision Tree Regression, Random Forest Regression, and Support Vector Regression.
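For a concrete feel of the discrete-vs-continuous distinction, here is a minimal scikit-learn sketch (added for illustration, not part of the original slides; the toy house-size data is made up):

```python
# Same features; a classifier predicts a discrete label,
# a regressor predicts a continuous quantity.  Toy data only.
from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[1500], [2000], [2500], [3000]]    # house size in sq. ft (hypothetical)
y_class = [0, 0, 1, 1]                  # discrete label: 0 = cheap, 1 = expensive
y_value = [200.0, 260.0, 330.0, 400.0]  # continuous target: price in $1000s

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_value)

print(clf.predict([[2200]]))  # a class label (0 or 1)
print(reg.predict([[2200]]))  # a real value (roughly 290)
```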
Classification: Definition
◼ Task:
◼ Learn a model that maps each attribute set x into one
of the predefined class labels y
General Approach for Building a Classification Model
Classification—A Two-Step Process
◼ Model Construction: describing a set of predetermined classes
◼ Each tuple is assumed to belong to a predefined class, as determined by
the class label attribute
◼ The set of tuples used for model construction is the training set
◼ The model is represented as classification rules, decision trees, or
mathematical formulae
◼ Model Usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of a test sample is compared with the classified result from the model.
◼ Accuracy rate is the percentage of test set samples that are correctly
classified by the model.
◼ The test set is independent of the training set (otherwise overfitting occurs)
◼ If the accuracy is acceptable, use the model to classify new data (a sketch of both steps follows).
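To make the two steps concrete, here is a minimal scikit-learn sketch (added for illustration, not from the original slides; the Iris dataset and the 70/30 split are arbitrary choices):

```python
# Minimal sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 1 -- Model construction: learn from labelled training tuples.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 -- Model usage: estimate accuracy on unseen test samples,
# then (if acceptable) classify new data.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```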
Classification—A Two-Step Process
[Figure: training data is fed to a classification algorithm to learn a model]
◼ Base Classifiers
◼ Decision Tree based Methods
◼ Rule-based Methods
◼ Nearest-neighbor
◼ Naïve Bayes
◼ Support Vector Machines
◼ Neural Networks, Deep Neural Nets
◼ Ensemble Classifiers
◼ Boosting, Bagging, Random Forests
Decision Tree Induction
◼ A decision tree is a tree-like
structure that is used as a
model for classifying data.
◼ It decomposes the data into
sub-trees made of other sub-
trees and/or leaf nodes.
◼ A decision tree is made up of three types of nodes (a code sketch follows this list):
◼ Decision Nodes: nodes that have two or more branches
◼ Leaf Nodes: the lowest nodes, each representing a final decision (class label)
◼ Root Node: also a decision node, but at the topmost level of the tree
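For a concrete view of these node types, here is a small scikit-learn sketch (added for illustration; the Iris data and the depth limit are arbitrary) that induces a tree and prints it as text:

```python
# Induce a decision tree and print its root, decision, and leaf nodes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree.fit(iris.data, iris.target)

# Internal lines are decision nodes (tests on a feature); "class: ..."
# lines are leaf nodes; the very first test is the root node.
print(export_text(tree, feature_names=list(iris.feature_names)))
```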
Brief Review of Entropy
◼ Entropy is a measure of the impurity of a node. By impurity, we mean the heterogeneity at a particular node.
◼ It is a measure of the uncertainty associated with a random variable.
◼ Entropy = 0 implies a pure node, meaning all samples belong to the same category. The higher the entropy, the lower the purity of the node (high heterogeneity).
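The slides do not print the formula; the standard definition is Entropy = −Σ pᵢ log₂(pᵢ), summed over the class proportions pᵢ. A small Python sketch (added for illustration):

```python
# Entropy of a node from its class counts: -sum(p_i * log2(p_i)).
from math import log2

def entropy(counts):
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]  # skip empty classes
    return sum(-p * log2(p) for p in ps)

print(entropy([5, 0]))  # 0.0 -> pure node, all samples in one class
print(entropy([5, 5]))  # 1.0 -> maximally impure for two classes
```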
Information Gain
◼ Information Gain measures the reduction in entropy by
splitting a dataset according to a given value of a random
variable.
◼ Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Info. Gain = Entropy(before splitting) − Entropy(after splitting)

where the entropy after splitting is the weighted average of the entropies of the resulting branches.
3. Split on the feature with the maximum information gain
◼ Choose the attribute with the largest information gain as the decision node, and divide the dataset by its branches (a worked sketch follows the steps below).
4.a A branch with entropy of 0 is a leaf
node
4.b A branch with entropy more than 0
needs further splitting
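Putting entropy and information gain together, here is an illustrative Python sketch (not from the slides; the yes/no labels are made up) that scores candidate splits:

```python
# Information gain = entropy(parent) - weighted entropy of the branches.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, branches):
    """branches: one list of labels per branch of the candidate split."""
    total = len(parent)
    after = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent) - after

parent = ["yes"] * 5 + ["no"] * 5
# A perfect split yields pure branches (entropy 0) and maximal gain.
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))           # 1.0
# A weak split leaves the branches almost as mixed as the parent.
print(information_gain(parent, [["yes", "yes", "yes", "no", "no"],
                                ["yes", "yes", "no", "no", "no"]]))  # ~0.03
```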
Bayesian Classification
◼ Let A and B be two events whose probabilities P(A) and P(B) are known. Then:

P(A|B) = P(B|A) · P(A) / P(B)
◼ Let D be a training set of tuples and their associated class labels, and
each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
◼ Suppose there are m classes C1, C2, …, Cm.
◼ Classification is to derive the maximum a posteriori class, i.e., the class Ci with the maximal P(Ci|X)
◼ This can be derived from Bayes’ theorem:

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

◼ Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized (see the sketch below).
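A tiny sketch of this maximum-a-posteriori rule (added for illustration; the priors and likelihoods are hypothetical numbers):

```python
# Pick the class Ci maximizing P(X|Ci) * P(Ci); P(X) is the same for
# every class, so it can be ignored.  All numbers are hypothetical.
priors = {"C1": 0.7, "C2": 0.3}          # P(Ci)
likelihoods = {"C1": 0.10, "C2": 0.40}   # P(X|Ci) for the observed X

best = max(priors, key=lambda c: likelihoods[c] * priors[c])
print(best)  # C2, since 0.40 * 0.3 = 0.12 > 0.10 * 0.7 = 0.07
```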
Naïve Bayes Classifier
◼ Naïve Bayes applies Bayes’ theorem with a “naïve” assumption: the attributes are conditionally independent given the class, so P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
Naïve Bayes Classifier: An Example
[Figure: scatter plot of customers with a new point to classify]
◼ To classify the new data point as either a default or a non-default customer:
I. What is the probability that this person is a default, given his features?

P(Def.|X) = P(X|Def.) · P(Def.) / P(X)
◼ Step 3: Likelihood

P(X|Def.) = (similar observations among those who are Def.) / (total number of Def. observations) = 2/8 = 0.25
◼ The test data consists of data points that have not been
seen by the model before.
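To complete the calculation end to end, here is an illustrative Python sketch; every count except the 2/8 likelihood above is hypothetical (e.g., 30 customers in total, 8 of them defaulters, 4 observations similar to the new point):

```python
# Naive Bayes posterior for the new point, following the slide's steps.
n_total = 30        # all observed customers (assumed)
n_def = 8           # defaulters (assumed, consistent with the 2/8 above)
n_similar = 4       # observations similar to the new point (assumed)
n_similar_def = 2   # similar observations among defaulters (from the slide)

prior = n_def / n_total             # P(Def.)
marginal = n_similar / n_total      # P(X)
likelihood = n_similar_def / n_def  # P(X|Def.) = 2/8 = 0.25

posterior = likelihood * prior / marginal  # P(Def.|X)
print(posterior)  # 0.25 * (8/30) / (4/30) = 0.5
```

Intuitively, 2 of the 4 observations similar to the new point are defaulters, so the posterior comes out to 0.5.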
Confusion Matrix
◼ A matrix that demonstrates the number of test cases
correctly and incorrectly classified.
◼ TP (True Positive): predicted that the event would happen, and it actually happened.
◼ FP (False Positive, Type I Error): predicted that the event would happen, but it did not.
◼ FN (False Negative, Type II Error): predicted that the event would not happen, but it did.
◼ TN (True Negative): predicted that the event would not happen, and it did not.
https://www.researchgate.net/figure/Confusion-matrix-and-performance-evaluation-metrics_fig5_346062755
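A quick illustrative sketch of computing the four cells with scikit-learn (the labels are made up):

```python
# Confusion matrix from true vs. predicted labels (toy data).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the actual class, columns the predicted: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```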
Confusion Matrix: Example
https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
Classifier Evaluation Metrics
Recall (Completeness)
◼ The ability of a model to find all the relevant cases within a data set.
◼ What proportion of actual positives was identified correctly?
Precision (Exactness)
◼ The ability of a classification model to identify only the relevant data points.
◼ What proportion of positive identifications was actually correct?
Classifier Evaluation Metrics
◼ Sensitivity = TP/P, where P is the total number of actual positives (TP + FN)
◼ Specificity = TN/N, where N is the total number of actual negatives (TN + FP)
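The metrics above, computed from the four confusion-matrix cells (an illustrative sketch reusing the toy counts from the earlier example):

```python
# Evaluation metrics from confusion-matrix counts (toy values).
tp, fp, fn, tn = 3, 1, 1, 3

precision = tp / (tp + fp)    # exactness
recall = tp / (tp + fn)       # completeness; identical to sensitivity
specificity = tn / (tn + fp)  # TN / N
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(precision, recall, specificity, accuracy)  # 0.75 0.75 0.75 0.75
```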
Accuracy Paradox
◼ Consider a model that predicts a rare event (occurring in about 1.5% of cases) with 98% accuracy. If it is replaced by a trivial model that always predicts "no event", the accuracy jumps from 98% to 98.5%, even though the model has effectively stopped working.
◼ Model evaluation should not be based only on the accuracy rate, as it is sometimes misleading (a numeric sketch follows).
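A small numeric illustration (the 1.5% event rate is an assumption chosen to match the 98.5% figure above):

```python
# Accuracy paradox: an always-"no" model beats a working model
# on imbalanced data while detecting nothing at all.
n = 1000
n_pos = 15                       # rare events: 1.5% of cases (assumed)

working_model_acc = 0.98         # scenario 1: a model that actually works
always_no_acc = (n - n_pos) / n  # scenario 2: predict "no event" for all

print(always_no_acc)  # 0.985 > 0.98, yet nothing is ever detected
```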
CAP: The Cumulative Accuracy Profile
◼ Used to visualize the discriminative power of a model: it plots the cumulative fraction of positive outcomes captured against the fraction of cases considered, with cases ordered by the model’s predicted probability.
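A minimal sketch of drawing a CAP curve (added for illustration; the outcomes and model scores are synthetic):

```python
# CAP sketch: rank cases by model score, accumulate positives captured.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)               # synthetic true outcomes
scores = y * 0.5 + rng.random(200) * 0.7  # synthetic, loosely informative scores

order = np.argsort(-scores)               # best-scored cases first
captured = np.cumsum(y[order]) / y.sum()  # fraction of positives found
x = np.arange(1, len(y) + 1) / len(y)     # fraction of cases considered

plt.plot(x, captured, label="model CAP")
plt.plot([0, 1], [0, 1], "--", label="random selection")
plt.xlabel("Fraction of cases considered")
plt.ylabel("Fraction of positives captured")
plt.legend()
plt.show()
```

The better the model separates the classes, the faster its curve rises above the random-selection diagonal.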