Unit IV


Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
What is Classification?

● Classification is a type of supervised learning.
● It assigns data elements to one of a set of predefined classes (labels).
● Given an input, the model predicts which class it belongs to.
● There are two types of classification:
  ○ Binary (binomial)
  ○ Multi-class
Classification: Use Cases

● To decide whether a received email is spam or not spam (ham)
● To predict whether a customer will buy a product
● To decide whether a bank loan should be granted
● To predict whether a student will pass or fail an examination
● To analyse sentiment from product reviews
Types of Classification / Classifiers

● Linear models
  ○ Logistic Regression
  ○ Support Vector Machines (with a linear kernel)
● Nonlinear models
  ○ K-Nearest Neighbors (KNN)
  ○ Support Vector Machines (with a nonlinear kernel)
  ○ Naïve Bayes
  ○ Decision Tree Classification
  ○ Random Forest Classification
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Decision trees - Overview

● A decision tree is a supervised learning method used for prediction (classification).
● The model is trained on a training dataset in which each record has input variables and a known class label; the trained model is then applied to test (unseen) data to predict the class, e.g. Yes or No.
Decision trees - Overview

● A decision tree uses a tree structure to specify sequences of decisions and their consequences.
● Given input variables {X1, X2, ..., Xn}, the goal is to predict an output variable (class) Y.

https://medium.com/cracking-the-data-science-interview/an-introduction-to-big-data-decision-trees-aae6a3587f59
Decision trees - Overview

Example tree: the root node tests Gender. The Female branch leads to a test on Income (<=45k -> Yes, >45k -> No) and the Male branch leads to a test on Age (<=40 -> Yes, >40 -> No).

● Root: the top internal node (Gender)
● Branch: the outcome of a test (Female / Male); the Income and Age nodes sit at depth 1
● Internal node: a decision (test) on a variable (Income, Age)
● Leaf node: a class label (Yes / No)
Decision trees- Overview
● Internal nodes are the decision or test points. Each internal node refers to an input variable or an
attribute.
● The top internal node is called the Root Node.
● The decision tree in Figure is a binary tree in that each internal node has no more than two branches.
The branching of a node is referred to as a split
● The depth of a node is the minimum number of steps required to reach the node from the root.
● Leaf nodes are at the end of the last branches on the tree. They represent class labels—the outcome
of all the prior decisions.
● The path from the root to a leaf node contains a series of decisions made at various internal
nodes.
● The simplest short tree is called a decision stump: a decision tree whose root is connected directly to the leaf nodes (a single split).
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Decision trees - General algorithm

1. The objective of a decision tree algorithm is to construct a tree T from a training set S.
2. The algorithm recursively constructs subtrees T1, T2, ... for subsets of S until one of the following criteria is met:
   a. All the leaf nodes in the tree satisfy the minimum purity threshold.
   b. The tree cannot be split further.
   c. Any other stopping criterion is satisfied (such as the maximum depth of the tree).
Decision trees - Example

Construct a decision tree from the training dataset (train the model), then use it to predict the class (Yes or No) for unseen data (test the model).
Decision trees - General algorithm

Step 1: Choose the most informative attribute.

How is the most informative attribute chosen? With entropy-based methods:
● Entropy measures the randomness or uncertainty (impurity) of an attribute.
● Information Gain measures the purity of an attribute, i.e. how much a split on it reduces entropy.
Decision trees - General algorithm

Entropy equation (the sum runs over the possible values x of the class variable):

H(S) = - Σ P(x) log2 P(x)

● Entropy = 0 when every P(x) is 0 or 1 (the set is pure).
● Entropy = 1 when the class labels are split 50/50 (maximum uncertainty for two classes).
Decision trees - General algorithm

Information Gain of attribute A:

IG(S, A) = base entropy of S - conditional entropy of S given attribute A
         = H(S) - Σv (|Sv| / |S|) * H(Sv)

where Sv is the subset of S for which attribute A takes the value v.
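As a concrete illustration of the two formulas above, here is a minimal Python sketch (not from the slides; the function names are illustrative) that computes entropy and information gain from class counts:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    # A zero count contributes 0 (the limit of p*log2(p) as p -> 0).
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(base_counts, partitions):
    """IG = entropy of the full set minus the weighted entropy of its partitions.

    base_counts : class counts for the whole set S, e.g. [9, 5]
    partitions  : class counts per attribute value, e.g. [[2, 3], [4, 0], [2, 3]]
    """
    total = sum(base_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(base_counts) - weighted
```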


Decision Tree Example

Training dataset: 14 records with the input attributes Age, Income, Student, and Credit, and the class label Buys_Computer (Yes / No).
Decision Tree Example

● Two classes: Yes and No.
● Calculate the probability of each class:
  ○ P(Buys_Computer = Yes) = 9/14
  ○ P(Buys_Computer = No) = 5/14
● Calculate the base entropy of S:
  Entropy(S) = I(9, 5) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
Decision Tree Example

Find the Information Gain of every attribute: Age, Income, Student, Credit.

Split on Age:

Age       Records   Yes   No
<=30         5       2     3
31..40       4       4     0
>40          5       2     3
Decision Tree Example

E(<=30)   = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
E(31..40) = - 4/4 log2(4/4) - 0/4 log2(0/4) = 0
E(>40)    = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

IG(S, Age) = 0.94 - ((0.35 * 0.97) + (0.28 * 0) + (0.35 * 0.97))
           = 0.94 - 0.68 = 0.26


Decision Tree Example

IG(S, Age)     = 0.26
IG(S, Student) = 0.15
IG(S, Credit)  = 0.048
IG(S, Income)  = 0.029

Age gives the highest gain, so the Age attribute becomes the root node.
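Reusing the entropy and information_gain helpers sketched after the information-gain formula, the counts from this example reproduce the figures above. Note that the slides round the branch weights (5/14 ≈ 0.35, 4/14 ≈ 0.28); exact fractions give IG(S, Age) ≈ 0.25, which does not change the ranking of the attributes.

```python
# Reuses entropy() and information_gain() from the earlier sketch.
base = [9, 5]                              # Buys_Computer: 9 Yes, 5 No
age_partitions = [[2, 3], [4, 0], [2, 3]]  # class counts for Age <=30, 31..40, >40

print(round(entropy(base), 2))                           # 0.94
print(round(information_gain(base, age_partitions), 2))  # ~0.25 (0.26 with rounded weights)
```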
Decision Tree Example

Growing the tree from the Age root:
● Age = 31..40: all 4 records are Yes (0 No), so this branch becomes a leaf labelled Yes.
● Age <=30: split further on Student (Student = Yes -> Yes, Student = No -> No).
● Age >40: split further on Credit (Credit = Excellent -> No, Credit = Fair -> Yes).

Final tree:

Age <=30   -> Student -> (Yes -> Yes, No -> No)
Age 31..40 -> Yes
Age >40    -> Credit  -> (Excellent -> No, Fair -> Yes)
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Decision Tree Algorithms

Three widely used decision tree algorithms are ID3, C4.5, and CART.
ID3 Algorithm

● ID3 (Iterative Dichotomiser 3) is one of the first decision tree algorithms; it was developed by John Ross Quinlan.
● Let A be a set of categorical input variables, P be the output variable (the predicted class), and T be the training set.
● The ID3 algorithm is shown below.
ID3 Algorithm

ID3(A, P, T)
  if T is empty
      return ∅
  if all records in T have the same value for P
      return a single-node tree with that value
  if A is empty
      return a single-node tree with the most frequent value of P in T
  compute the information gain for each attribute in A relative to T
  pick the attribute D with the largest gain
  let {d1, d2, ..., dm} be the values of attribute D
  partition T into {T1, T2, ..., Tm} according to the values of D
  return a tree with root D and branches labeled d1, d2, ..., dm going respectively to the
      subtrees ID3(A - {D}, P, T1), ID3(A - {D}, P, T2), ..., ID3(A - {D}, P, Tm)
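The pseudocode above translates almost line for line into Python. Below is a minimal sketch (not from the slides): it assumes the training set is a list of dictionaries and that the class label is stored under the key passed as target; all names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def id3(attributes, target, records):
    """Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    if not records:                                  # T is empty
        return None
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:                        # all records share one class
        return labels[0]
    if not attributes:                               # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):                                  # information gain of one attribute
        remainder = 0.0
        for value in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == value]
            remainder += len(subset) / len(records) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                 # attribute D with the largest gain
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(remaining, target,
                              [r for r in records if r[best] == value])
                   for value in set(r[best] for r in records)}}

# Example call (attribute names are hypothetical):
# tree = id3(["Age", "Income", "Student", "Credit"], "Buys_Computer", records)
```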
C4.5
● The C4.5 algorithm introduces a number of improvements over the original ID3 algorithm.
● The C4.5 algorithm can handle missing data: if the training records contain unknown attribute values, C4.5 evaluates the gain for an attribute using only the records where that attribute is defined.
● Both categorical and continuous attributes are supported by C4.5.
● Values of a continuous variable are sorted and partitioned.
● For the corresponding records of each partition, the gain is calculated, and the partition that maximizes
the gain is chosen for the next split.
● The ID3 algorithm may construct a deep and complex tree, which would cause overfitting
● The C4.5 algorithm addresses the overfitting problem in ID3 by using a bottom-up technique called
pruning to simplify the tree by removing the least visited nodes and branches.
CART

● CART (or Classification And Regression Trees) is often used as a generic acronym for the decision
tree, although it is a specific implementation.
● Similar to C4.5, CART can handle continuous attributes.
● Whereas C4.5 uses entropy-based criteria to rank tests, CART uses the Gini diversity index, defined as Gini(T) = 1 - Σi P(i)^2, where P(i) is the proportion of records in T belonging to class i.
● Whereas C4.5 employs stopping rules, CART constructs a sequence of subtrees, uses
cross-validation to estimate the misclassification cost of each subtree, and chooses the one with the
lowest cost.
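For reference, scikit-learn's DecisionTreeClassifier is a CART-style implementation (Gini criterion by default, with minimal cost-complexity pruning exposed through ccp_alpha). A minimal sketch on a built-in dataset; the parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Gini impurity for ranking splits; ccp_alpha > 0 enables cost-complexity pruning
clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```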
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Evaluate Decision Tree

• Decision trees are greedy algorithms: they pick the best split at each step, which may not be the best choice overall.
• This weakness is addressed by ensemble methods such as random forests.
• The model might overfit the training data.
Evaluate Decision Tree

• Decision trees partition the input space into rectangular (axis-aligned) decision regions.
Evaluate Decision Tree

● Advantages of decision trees


○ Computationally inexpensive
○ Outputs are easy to interpret – sequence of tests
○ Show importance of each input variable
○ Decision trees handle
■ Both numerical and categorical attributes
■ Categorical attributes with many distinct values
■ Variables with nonlinear effect on outcome
■ Variable interactions

Evaluate Decision Tree

● Disadvantages of decision trees


○ Sensitive to small variations in the training data
○ Overfitting can occur because each split reduces training data
for subsequent splits
○ Poor if dataset contains many irrelevant variables
Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
Naïve Bayes

Bayes' Algorithm

Naïve Bayes Classifier

Smoothing

Diagnostics
Naïve Bayes Applications

● Text Classification
○ Spam Filtering
○ Document classification based on Topic (Technology, Sports etc)
○ Sentiment Analysis
○ Information Retrieval
● Speech Recognition
● Image Classification (e.g. Face detection)
● Target Marketing
● Medical Field
○ Disease detection
○ Treatment Detection
Naïve Bayes - Overview

● Naïve Bayes is a supervised learning method: the model is trained on a training dataset and then evaluated on test (unseen) data.
Naïve Bayes - Classifier

Bayes' Theorem:

P(C | A) = P(A | C) * P(C) / P(A)

where C is the class label and A is the set of observed attributes.

● P(C | A): posterior probability
● P(A | C): likelihood
● P(C): class prior probability
● P(A): predictor prior probability
Naïve Bayes - Classifier

● P(C|A) is the posterior probability of the class (target) given the predictor (attributes).
● P(C) is the prior probability of the class.
● P(A|C) is the likelihood: the probability of the predictor given the class.
● P(A) is the prior probability of the predictor.
Naïve Bayes - Classifier

The Naïve Bayes classifier calculates the probability of an event in the following steps:
Step 1: Calculate the prior probability for each class label.
Step 2: Find the likelihood of each attribute value for each class.
Step 3: Put these values into Bayes' formula and calculate the posterior probability for each class.
Step 4: Assign the input to the class with the higher posterior probability.
NAIVE BAYES EXAMPLE

Training dataset (input attributes: Outlook, Temp, Humidity, Windy; class label: Play Golf)

Outlook    Temp   Humidity   Windy   Play Golf
Rainy      Hot    High       F       No
Rainy      Hot    High       T       No
Overcast   Hot    High       F       Yes
Sunny      Mild   Normal     F       Yes
Sunny      Cool   Normal     F       Yes
Sunny      Cool   Normal     T       No
Overcast   Cool   Normal     T       Yes
Rainy      Cool   High       F       Yes
Rainy      Mild   Normal     F       No
Sunny      Cool   Normal     F       Yes
Rainy      Mild   Normal     T       Yes
Overcast   Mild   High       T       Yes
Overcast   Hot    Normal     F       Yes
Sunny      Mild   High       T       No
NAIVE BAYES EXAMPLE

Frequency table for the class label (14 records):

Play Golf    Yes = 9 (9/14)    No = 5 (5/14)
NAIVE BAYES EXAMPLE

Frequency table: Outlook vs. Play Golf

Outlook      Yes   No
Sunny         3     2
Overcast      4     0
Rainy         2     3
Total         9     5
NAIVE BAYES EXAMPLE

Frequency and likelihood tables: Outlook vs. Play Golf

            Frequency        Likelihood
Outlook     Yes    No        Yes    No
Sunny        3      2        3/9    2/5
Overcast     4      0        4/9    0/5
Rainy        2      3        2/9    3/5
Total        9      5
NAIVE BAYES EXAMPLE - Frequency & Likelihood Tables

Outlook      Yes    No        Temp     Yes    No
Sunny        3/9    2/5       Hot      2/9    2/5
Overcast     4/9    0/5       Mild     4/9    2/5
Rainy        2/9    3/5       Cool     3/9    1/5

Humidity     Yes    No        Windy    Yes    No
High         3/9    4/5       F        6/9    2/5
Normal       6/9    1/5       T        3/9    3/5
Unseen Data X (Test Data X)

Outlook   Temp   Humidity   Windy   Play Golf
Rainy     Cool   High       T       ?

P(Yes | X) = P(Rainy | Yes) * P(Cool | Yes) * P(High | Yes) * P(T | Yes) * P(Yes)
           = 2/9 * 3/9 * 3/9 * 3/9 * 9/14
           = 0.0052

P(No | X) = P(Rainy | No) * P(Cool | No) * P(High | No) * P(T | No) * P(No)
          = 3/5 * 1/5 * 4/5 * 3/5 * 5/14
          = 0.020
Unseen Data X (Test Data X)

P(Yes | X) = 0.0052
P(No | X)  = 0.020

Since P(Yes | X) < P(No | X), the predicted class is No:

Outlook   Temp   Humidity   Windy   Play Golf
Rainy     Cool   High       T       NO
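The two values above are unnormalised products (proportional to the true posteriors), so comparing them directly is enough to pick the class. A short Python check of the arithmetic, using the likelihoods from the tables; small rounding differences against the slide figures are expected:

```python
p_yes, p_no = 9/14, 5/14

# X = (Outlook=Rainy, Temp=Cool, Humidity=High, Windy=T)
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * p_yes   # ~0.0053
score_no = (3/5) * (1/5) * (4/5) * (3/5) * p_no     # ~0.0206

print("Play Golf = Yes" if score_yes > score_no else "Play Golf = No")   # Play Golf = No
```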
Naïve Bayes - Smoothing

● A smoothing technique assigns a small nonzero probability to rare events that are missing in the training data.
● E.g., Laplace smoothing assumes every outcome occurs once more than it actually occurs in the dataset.
● Smoothing is essential: without it, a single zero conditional probability makes the whole product zero, so P(cj | A) = 0.
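For instance, in the Play Golf example P(Outlook = Overcast | No) = 0/5, which would zero out the product for the No class on any record with Outlook = Overcast. A minimal sketch of Laplace (add-one) smoothing; the function name is illustrative:

```python
def laplace_smoothed(value_count, class_count, n_distinct_values):
    # Pretend every attribute value occurred once more than it actually did.
    return (value_count + 1) / (class_count + n_distinct_values)

# Unsmoothed: P(Outlook = Overcast | No) = 0/5 = 0.0
# Smoothed:   (0 + 1) / (5 + 3) = 0.125, since Outlook takes 3 distinct values
print(laplace_smoothed(0, 5, 3))
```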
Naïve Bayes - Diagnostics

● Naïve Bayes advantages
○ Handles missing values
○ Robust to irrelevant variables
○ Simple to implement
○ Computationally efficient
○ Handles high-dimensional data efficiently
○ Often competitive with other learning algorithms
○ Reasonably resistant to overfitting

● Naïve Bayes disadvantages


○ Assumes variables are conditionally independent
○ Therefore, sensitive to double counting correlated variables

○ In its simplest form, used only for categorical variables


Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
Diagnostics of Classifier

Confusion matrix:

                                Actual Value
                           Positive (1)       Negative (0)
Predicted   Positive (1)   True Positive      False Positive
Value       Negative (0)   False Negative     True Negative

Example - predicting whether a customer subscribed:

                                Actual Value
                           Subscribed         Not Subscribed
Predicted   Subscribed     True Positive      False Positive
Value       Not Subscribed False Negative     True Negative
Diagnostics of Classifier

                                Actual Value
                           Subscribed   Not Subscribed   Total
Predicted   Subscribed      TP = 3        FP = 2            5
Value       Not Subscribed  FN = 8        TN = 87          95
            Total               11            89          100
Diagnostics of Classifier

Classifier performance is measured with three common metrics: Accuracy, Precision, and Recall.
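Using the counts from the confusion matrix above (TP = 3, FP = 2, FN = 8, TN = 87), the three measures can be computed directly; a short Python check:

```python
TP, FP, FN, TN = 3, 2, 8, 87

accuracy = (TP + TN) / (TP + TN + FP + FN)   # (3 + 87) / 100 = 0.90
precision = TP / (TP + FP)                   # 3 / 5 = 0.60
recall = TP / (TP + FN)                      # 3 / 11 ≈ 0.27

print(accuracy, precision, recall)
```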
Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
Additional Classification Methods

● Bagging
● Boosting
● Random Forest
● Support Vector Machine (SVM)
Additional Classifier Method

Bagging
● Bagging (or bootstrap aggregating) uses the bootstrap technique that repeatedly
samples with replacement from a dataset according to a uniform probability
distribution.
● “With replacement” means that when a sample is selected for a training or testing set,
the sample is still kept in the dataset and may be selected again.
● Because the sampling is with replacement, some samples may appear several times
in a training or testing set, whereas others may be absent.
● A model or base classifier is trained separately on each bootstrap sample, and a test
sample is assigned to the class that received the highest number of votes.
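A minimal scikit-learn sketch of bagging (dataset and parameter values are illustrative, not from the slides). BaggingClassifier's default base estimator is a decision tree; each tree is fit on a bootstrap sample and predictions are combined by voting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# 50 base classifiers, each trained on a bootstrap sample drawn with replacement
bag = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```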
Additional Classifier Method

Boosting
● Similar to bagging, boosting (or AdaBoost) uses votes for classification to combine
the output of individual models.
● In addition, it combines models of the same type.
● However, boosting is an iterative procedure where a new model is influenced by the
performances of those models built previously.
● Furthermore, boosting assigns a weight to each training sample that reflects its
importance, and the weight may adaptively change at the end of each boosting round.
● Bagging and boosting have been shown to perform better than a single decision tree.
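A comparable AdaBoost sketch (parameters again illustrative). Models are built sequentially, and the sample weights are adapted after each boosting round so that later models focus on the examples earlier ones misclassified:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Default base estimator is a decision stump; sample weights adapt each round
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())
```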
Additional Classifier Method

Random Forest

● Random forest is a class of ensemble methods using decision tree classifiers.


● It is a combination of tree predictors such that each tree depends on the values of a
random vector sampled independently and with the same distribution for all trees in
the forest.
● A special case of random forest uses bagging on decision trees, where samples are
randomly chosen with replacement from the original training set.
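A minimal random forest sketch (illustrative parameters): each tree is grown on a bootstrap sample and considers only a random subset of the features at every split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# max_features="sqrt": each split examines a random subset of features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```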
Additional Classifier Method

Support Vector
Machine (SVM)
● SVM is another common classification method that combines linear models with
instance-based learning techniques.
● Support vector machines select a small number of critical boundary instances called
support vectors from each class and build a linear decision function that separates
them as widely as possible.
● SVM by default can efficiently perform linear classifications and can be configured to
perform nonlinear classifications as well.
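A minimal SVM sketch with scikit-learn (illustrative parameters): kernel="linear" gives a linear decision boundary, while the default RBF kernel allows nonlinear classification; features are standardised first, which SVMs generally benefit from:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(linear_svm, X, y, cv=5).mean())
print(cross_val_score(rbf_svm, X, y, cv=5).mean())
```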
How to choose a suitable classifier among
Decision trees, naïve Bayes, & logistic regression
