Unit IV


Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
What is Classification?

● Classification is a type of supervised learning.
● It assigns data elements to one of a set of predefined classes (labels).
● Given an input, the model predicts which class it belongs to.
● There are two types of classification:
  ○ Binary (binomial)
  ○ Multi-class
Classification: Use Cases

● To decide whether a received email is spam or not spam (ham)
● To predict whether a customer will buy a product
● To decide whether a bank loan should be granted
● To predict whether a student will pass or fail an examination
● To analyse sentiment from product reviews
Types of Classification / Classifiers

● Linear models
  ○ Logistic Regression
  ○ Support Vector Machines (with a linear kernel)
● Nonlinear models
  ○ K-Nearest Neighbors (KNN)
  ○ Support Vector Machines (with a nonlinear kernel)
  ○ Naïve Bayes
  ○ Decision Tree Classification
  ○ Random Forest Classification
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Decision trees - Overview

● A decision tree is a supervised learning method used for prediction (classification).
● The model is trained on a training dataset in which each record has input variables and a known class label; the trained model is then applied to test (unseen) data to predict the class, e.g. Yes or No.
Decision trees - Overview

● A decision tree uses a tree structure to specify sequences of decisions and their consequences.
● Given input variables {X1, X2, ..., Xn}, the goal is to predict an output variable (class) Y.

https://medium.com/cracking-the-data-science-interview/an-introduction-to-big-data-decision-trees-aae6a3587f59
Decision trees - Overview

Example tree: the root node tests Gender. The Female branch leads to a test on Income (<=45k -> Yes, >45k -> No) and the Male branch leads to a test on Age (<=40 -> Yes, >40 -> No).

● Root: the top internal node (Gender)
● Branch: the outcome of a test (Female / Male); the Income and Age nodes sit at depth 1
● Internal node: a decision (test) on a variable (Income, Age)
● Leaf node: a class label (Yes / No)
Decision trees- Overview
● Internal nodes are the decision or test points. Each internal node refers to an input variable or an
attribute.
● The top internal node is called the Root Node.
● The decision tree in Figure is a binary tree in that each internal node has no more than two branches.
The branching of a node is referred to as a split
● The depth of a node is the minimum number of steps required to reach the node from the root.
● Leaf nodes are at the end of the last branches on the tree. They represent class labels—the outcome
of all the prior decisions.
● The path from the root to a leaf node contains a series of decisions made at various internal
nodes.
● The simplest short tree is called a decision stump: a decision tree whose root is connected directly to the leaf nodes (a single split).
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Decision trees - General algorithm

1. The objective of a decision tree algorithm is to construct a tree T from a training set S.
2. The algorithm recursively constructs subtrees T1, T2, ... for subsets of S until one of the following criteria is met:
   a. All the leaf nodes in the tree satisfy the minimum purity threshold.
   b. The tree cannot be split further.
   c. Any other stopping criterion is satisfied (such as the maximum depth of the tree).
Decision trees - Example

Construct a decision tree from the training dataset (train the model), then use it to predict the class (Yes or No) for unseen data (test the model).
Decision trees - General algorithm

Step 1: Choose the most informative attribute.

How is the most informative attribute chosen? With entropy-based methods:
● Entropy measures the randomness or uncertainty (impurity) of an attribute.
● Information Gain measures the purity of an attribute, i.e. how much a split on it reduces entropy.
Decision trees - General algorithm

Entropy equation (the sum runs over the possible values x of the class variable):

H(S) = - Σ P(x) log2 P(x)

● Entropy = 0 when every P(x) is 0 or 1 (the set is pure).
● Entropy = 1 when the class labels are split 50/50 (maximum uncertainty for two classes).
Decision trees - General algorithm

Information Gain of attribute A:

IG(S, A) = base entropy of S - conditional entropy of S given attribute A
         = H(S) - Σv (|Sv| / |S|) * H(Sv)

where Sv is the subset of S for which attribute A takes the value v.
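As a concrete illustration of the two formulas above, here is a minimal Python sketch (not from the slides; the function names are illustrative) that computes entropy and information gain from class counts:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    # A zero count contributes 0 (the limit of p*log2(p) as p -> 0).
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(base_counts, partitions):
    """IG = entropy of the full set minus the weighted entropy of its partitions.

    base_counts : class counts for the whole set S, e.g. [9, 5]
    partitions  : class counts per attribute value, e.g. [[2, 3], [4, 0], [2, 3]]
    """
    total = sum(base_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(base_counts) - weighted
```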


Decision Tree Example

Training dataset: 14 records with the input attributes Age, Income, Student, and Credit, and the class label Buys_Computer (Yes / No).
Decision Tree Example

● Two classes: Yes and No.
● Calculate the probability of each class:
  ○ P(Buys_Computer = Yes) = 9/14
  ○ P(Buys_Computer = No) = 5/14
● Calculate the base entropy of S:
  Entropy(S) = I(9, 5) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
Decision Tree Example

Find the Information Gain of every attribute: Age, Income, Student, Credit.

Split on Age:

Age       Records   Yes   No
<=30         5       2     3
31..40       4       4     0
>40          5       2     3
Decision Tree Example

E(<=30)   = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
E(31..40) = - 4/4 log2(4/4) - 0/4 log2(0/4) = 0
E(>40)    = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

IG(S, Age) = 0.94 - ((0.35 * 0.97) + (0.28 * 0) + (0.35 * 0.97))
           = 0.94 - 0.68 = 0.26


Decision Tree Example

IG(S, Age)     = 0.26
IG(S, Student) = 0.15
IG(S, Credit)  = 0.048
IG(S, Income)  = 0.029

Age gives the highest gain, so the Age attribute becomes the root node.
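Reusing the entropy and information_gain helpers sketched after the information-gain formula, the counts from this example reproduce the figures above. Note that the slides round the branch weights (5/14 ≈ 0.35, 4/14 ≈ 0.28); exact fractions give IG(S, Age) ≈ 0.25, which does not change the ranking of the attributes.

```python
# Reuses entropy() and information_gain() from the earlier sketch.
base = [9, 5]                              # Buys_Computer: 9 Yes, 5 No
age_partitions = [[2, 3], [4, 0], [2, 3]]  # class counts for Age <=30, 31..40, >40

print(round(entropy(base), 2))                           # 0.94
print(round(information_gain(base, age_partitions), 2))  # ~0.25 (0.26 with rounded weights)
```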
Decision Tree Example

Growing the tree from the Age root:
● Age = 31..40: all 4 records are Yes (0 No), so this branch becomes a leaf labelled Yes.
● Age <=30: split further on Student (Student = Yes -> Yes, Student = No -> No).
● Age >40: split further on Credit (Credit = Excellent -> No, Credit = Fair -> Yes).

Final tree:

Age <=30   -> Student -> (Yes -> Yes, No -> No)
Age 31..40 -> Yes
Age >40    -> Credit  -> (Excellent -> No, Fair -> Yes)
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Decision Tree Algorithms

Three widely used decision tree algorithms are ID3, C4.5, and CART.
ID3 Algorithm

● ID3 (Iterative Dichotomiser 3) is one of the first decision tree algorithms; it was developed by John Ross Quinlan.
● Let A be a set of categorical input variables, P be the output variable (the predicted class), and T be the training set.
● The ID3 algorithm is shown below.
ID3 Algorithm

ID3(A, P, T)
  if T is empty
      return ∅
  if all records in T have the same value for P
      return a single-node tree with that value
  if A is empty
      return a single-node tree with the most frequent value of P in T
  compute the information gain for each attribute in A relative to T
  pick the attribute D with the largest gain
  let {d1, d2, ..., dm} be the values of attribute D
  partition T into {T1, T2, ..., Tm} according to the values of D
  return a tree with root D and branches labeled d1, d2, ..., dm going respectively to the
      subtrees ID3(A - {D}, P, T1), ID3(A - {D}, P, T2), ..., ID3(A - {D}, P, Tm)
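The pseudocode above translates almost line for line into Python. Below is a minimal sketch (not from the slides): it assumes the training set is a list of dictionaries and that the class label is stored under the key passed as target; all names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def id3(attributes, target, records):
    """Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    if not records:                                  # T is empty
        return None
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:                        # all records share one class
        return labels[0]
    if not attributes:                               # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):                                  # information gain of one attribute
        remainder = 0.0
        for value in set(r[attr] for r in records):
            subset = [r[target] for r in records if r[attr] == value]
            remainder += len(subset) / len(records) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                 # attribute D with the largest gain
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3(remaining, target,
                              [r for r in records if r[best] == value])
                   for value in set(r[best] for r in records)}}

# Example call (attribute names are hypothetical):
# tree = id3(["Age", "Income", "Student", "Credit"], "Buys_Computer", records)
```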
C4.5
● The C4.5 algorithm introduces a number of improvements over the original ID3 algorithm.
● The C4.5 algorithm can handle missing data: if the training records contain unknown attribute values, C4.5 evaluates the gain for an attribute using only the records where that attribute is defined.
● Both categorical and continuous attributes are supported by C4.5.
● Values of a continuous variable are sorted and partitioned.
● For the corresponding records of each partition, the gain is calculated, and the partition that maximizes
the gain is chosen for the next split.
● The ID3 algorithm may construct a deep and complex tree, which would cause overfitting
● The C4.5 algorithm addresses the overfitting problem in ID3 by using a bottom-up technique called
pruning to simplify the tree by removing the least visited nodes and branches.
CART

● CART (or Classification And Regression Trees) is often used as a generic acronym for the decision
tree, although it is a specific implementation.
● Similar to C4.5, CART can handle continuous attributes.
● Whereas C4.5 uses entropy-based criteria to rank tests, CART uses the Gini diversity index, defined as Gini(T) = 1 - Σi P(i)^2, where P(i) is the proportion of records in T belonging to class i.
● Whereas C4.5 employs stopping rules, CART constructs a sequence of subtrees, uses
cross-validation to estimate the misclassification cost of each subtree, and chooses the one with the
lowest cost.
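For reference, scikit-learn's DecisionTreeClassifier is a CART-style implementation (Gini criterion by default, with minimal cost-complexity pruning exposed through ccp_alpha). A minimal sketch on a built-in dataset; the parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Gini impurity for ranking splits; ccp_alpha > 0 enables cost-complexity pruning
clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```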
Decision trees

Overview

General algorithm

Decision tree algorithm

Evaluating a decision tree


Evaluate Decision Tree

• Decision trees are greedy algorithms: they pick the best split at each step, which may not be the best choice overall.
• This weakness is addressed by ensemble methods such as random forests.
• The model might overfit the training data.
Evaluate Decision Tree

• Decision trees partition the input space into rectangular (axis-aligned) decision regions.
Evaluate Decision Tree

● Advantages of decision trees


○ Computationally inexpensive
○ Outputs are easy to interpret – sequence of tests
○ Show importance of each input variable
○ Decision trees handle
■ Both numerical and categorical attributes
■ Categorical attributes with many distinct values
■ Variables with nonlinear effect on outcome
■ Variable interactions

Evaluate Decision Tree

● Disadvantages of decision trees


○ Sensitive to small variations in the training data
○ Overfitting can occur because each split reduces training data
for subsequent splits
○ Poor if dataset contains many irrelevant variables
Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
Naïve Bayes

Bayes' Algorithm

Naïve Bayes Classifier

Smoothing

Diagnostics
Naïve Bayes Applications

● Text Classification
○ Spam Filtering
○ Document classification based on Topic (Technology, Sports etc)
○ Sentiment Analysis
○ Information Retrieval
● Speech Recognition
● Image Classification (e.g. Face detection)
● Target Marketing
● Medical Field
○ Disease detection
○ Treatment Detection
Naïve Bayes - Overview

● Naïve Bayes is a supervised learning method: the model is trained on a training dataset and then evaluated on test (unseen) data.
Naïve Bayes - Classifier

Bayes' Theorem:

P(C | A) = P(A | C) * P(C) / P(A)

where C is the class label and A is the set of observed attributes.

● P(C | A): posterior probability
● P(A | C): likelihood
● P(C): class prior probability
● P(A): predictor prior probability
Naïve Bayes - Classifier

● P(C|A) is the posterior probability of the class (target) given the predictor (attributes).
● P(C) is the prior probability of the class.
● P(A|C) is the likelihood: the probability of the predictor given the class.
● P(A) is the prior probability of the predictor.
Naïve Bayes - Classifier

The Naïve Bayes classifier calculates the probability of an event in the following steps:
Step 1: Calculate the prior probability for each class label.
Step 2: Find the likelihood of each attribute value for each class.
Step 3: Put these values into Bayes' formula and calculate the posterior probability for each class.
Step 4: Assign the input to the class with the higher posterior probability.
NAIVE BAYES EXAMPLE

Training dataset (input attributes: Outlook, Temp, Humidity, Windy; class label: Play Golf)

Outlook    Temp   Humidity   Windy   Play Golf
Rainy      Hot    High       F       No
Rainy      Hot    High       T       No
Overcast   Hot    High       F       Yes
Sunny      Mild   Normal     F       Yes
Sunny      Cool   Normal     F       Yes
Sunny      Cool   Normal     T       No
Overcast   Cool   Normal     T       Yes
Rainy      Cool   High       F       Yes
Rainy      Mild   Normal     F       No
Sunny      Cool   Normal     F       Yes
Rainy      Mild   Normal     T       Yes
Overcast   Mild   High       T       Yes
Overcast   Hot    Normal     F       Yes
Sunny      Mild   High       T       No
NAIVE BAYES EXAMPLE

Frequency table for the class label (14 records):

Play Golf    Yes = 9 (9/14)    No = 5 (5/14)
NAIVE BAYES EXAMPLE

Frequency table: Outlook vs. Play Golf

Outlook      Yes   No
Sunny         3     2
Overcast      4     0
Rainy         2     3
Total         9     5
NAIVE BAYES EXAMPLE

Frequency and likelihood tables: Outlook vs. Play Golf

            Frequency        Likelihood
Outlook     Yes    No        Yes    No
Sunny        3      2        3/9    2/5
Overcast     4      0        4/9    0/5
Rainy        2      3        2/9    3/5
Total        9      5
NAIVE BAYES EXAMPLE - Frequency & Likelihood Tables

Outlook      Yes    No        Temp     Yes    No
Sunny        3/9    2/5       Hot      2/9    2/5
Overcast     4/9    0/5       Mild     4/9    2/5
Rainy        2/9    3/5       Cool     3/9    1/5

Humidity     Yes    No        Windy    Yes    No
High         3/9    4/5       F        6/9    2/5
Normal       6/9    1/5       T        3/9    3/5
Unseen Data X (Test Data X)

Outlook   Temp   Humidity   Windy   Play Golf
Rainy     Cool   High       T       ?

P(Yes | X) = P(Rainy | Yes) * P(Cool | Yes) * P(High | Yes) * P(T | Yes) * P(Yes)
           = 2/9 * 3/9 * 3/9 * 3/9 * 9/14
           = 0.0052

P(No | X) = P(Rainy | No) * P(Cool | No) * P(High | No) * P(T | No) * P(No)
          = 3/5 * 1/5 * 4/5 * 3/5 * 5/14
          = 0.020
Unseen Data X (Test Data X)

P(Yes | X) = 0.0052
P(No | X)  = 0.020

Since P(Yes | X) < P(No | X), the predicted class is No:

Outlook   Temp   Humidity   Windy   Play Golf
Rainy     Cool   High       T       NO
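The two values above are unnormalised products (proportional to the true posteriors), so comparing them directly is enough to pick the class. A short Python check of the arithmetic, using the likelihoods from the tables; small rounding differences against the slide figures are expected:

```python
p_yes, p_no = 9/14, 5/14

# X = (Outlook=Rainy, Temp=Cool, Humidity=High, Windy=T)
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * p_yes   # ~0.0053
score_no = (3/5) * (1/5) * (4/5) * (3/5) * p_no     # ~0.0206

print("Play Golf = Yes" if score_yes > score_no else "Play Golf = No")   # Play Golf = No
```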
Naïve Bayes - Smoothing

● A smoothing technique assigns a small nonzero probability to rare events that are missing in the training data.
● E.g., Laplace smoothing assumes every outcome occurs once more than it actually occurs in the dataset.
● Smoothing is essential: without it, a single zero conditional probability makes the whole product zero, so P(cj | A) = 0.
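For instance, in the Play Golf example P(Outlook = Overcast | No) = 0/5, which would zero out the product for the No class on any record with Outlook = Overcast. A minimal sketch of Laplace (add-one) smoothing; the function name is illustrative:

```python
def laplace_smoothed(value_count, class_count, n_distinct_values):
    # Pretend every attribute value occurred once more than it actually did.
    return (value_count + 1) / (class_count + n_distinct_values)

# Unsmoothed: P(Outlook = Overcast | No) = 0/5 = 0.0
# Smoothed:   (0 + 1) / (5 + 3) = 0.125, since Outlook takes 3 distinct values
print(laplace_smoothed(0, 5, 3))
```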
Naïve Bayes - Diagnostics

● Naïve Bayes advantages
○ Handles missing values
○ Robust to irrelevant variables
○ Simple to implement
○ Computationally efficient
○ Handles high-dimensional data efficiently
○ Often competitive with other learning algorithms
○ Reasonably resistant to overfitting

● Naïve Bayes disadvantages


○ Assumes variables are conditionally independent
○ Therefore, sensitive to double counting correlated variables

○ In its simplest form, used only for categorical variables


Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
Diagnostics of Classifier

Confusion matrix:

                                Actual Value
                           Positive (1)       Negative (0)
Predicted   Positive (1)   True Positive      False Positive
Value       Negative (0)   False Negative     True Negative

Example - predicting whether a customer subscribed:

                                Actual Value
                           Subscribed         Not Subscribed
Predicted   Subscribed     True Positive      False Positive
Value       Not Subscribed False Negative     True Negative
Diagnostics of Classifier

                                Actual Value
                           Subscribed   Not Subscribed   Total
Predicted   Subscribed      TP = 3        FP = 2            5
Value       Not Subscribed  FN = 8        TN = 87          95
            Total               11            89          100
Diagnostics of Classifier

Classifier performance is measured with three common metrics: Accuracy, Precision, and Recall.
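Using the counts from the confusion matrix above (TP = 3, FP = 2, FN = 8, TN = 87), the three measures can be computed directly; a short Python check:

```python
TP, FP, FN, TN = 3, 2, 8, 87

accuracy = (TP + TN) / (TP + TN + FP + FN)   # (3 + 87) / 100 = 0.90
precision = TP / (TP + FP)                   # 3 / 5 = 0.60
recall = TP / (TP + FN)                      # 3 / 11 ≈ 0.27

print(accuracy, precision, recall)
```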
Syllabus

● Decision trees - Overview, general algorithm, decision tree algorithm, evaluating a decision tree.
● Naïve Bayes - Bayes' algorithm, Naïve Bayes classifier, smoothing, diagnostics.
● Diagnostics of classifiers.
● Additional classification methods.
Additional Classification Methods

● Bagging
● Boosting
● Random Forest
● Support Vector Machine (SVM)
Additional Classifier Method

Bagging
● Bagging (or bootstrap aggregating) uses the bootstrap technique that repeatedly
samples with replacement from a dataset according to a uniform probability
distribution.
● “With replacement” means that when a sample is selected for a training or testing set,
the sample is still kept in the dataset and may be selected again.
● Because the sampling is with replacement, some samples may appear several times
in a training or testing set, whereas others may be absent.
● A model or base classifier is trained separately on each bootstrap sample, and a test
sample is assigned to the class that received the highest number of votes.
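A minimal scikit-learn sketch of bagging (dataset and parameter values are illustrative, not from the slides). BaggingClassifier's default base estimator is a decision tree; each tree is fit on a bootstrap sample and predictions are combined by voting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# 50 base classifiers, each trained on a bootstrap sample drawn with replacement
bag = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())
```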
Additional Classifier Method

Boosting
● Similar to bagging, boosting (or AdaBoost) uses votes for classification to combine
the output of individual models.
● In addition, it combines models of the same type.
● However, boosting is an iterative procedure where a new model is influenced by the
performances of those models built previously.
● Furthermore, boosting assigns a weight to each training sample that reflects its
importance, and the weight may adaptively change at the end of each boosting round.
● Bagging and boosting have been shown to perform better than a single decision tree.
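A comparable AdaBoost sketch (parameters again illustrative). Models are built sequentially, and the sample weights are adapted after each boosting round so that later models focus on the examples earlier ones misclassified:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Default base estimator is a decision stump; sample weights adapt each round
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())
```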
Additional Classifier Method

Random Forest

● Random forest is a class of ensemble methods using decision tree classifiers.


● It is a combination of tree predictors such that each tree depends on the values of a
random vector sampled independently and with the same distribution for all trees in
the forest.
● A special case of random forest uses bagging on decision trees, where samples are
randomly chosen with replacement from the original training set.
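A minimal random forest sketch (illustrative parameters): each tree is grown on a bootstrap sample and considers only a random subset of the features at every split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# max_features="sqrt": each split examines a random subset of features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```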
Additional Classifier Method

Support Vector
Machine (SVM)
● SVM is another common classification method that combines linear models with
instance-based learning techniques.
● Support vector machines select a small number of critical boundary instances called
support vectors from each class and build a linear decision function that separates
them as widely as possible.
● SVM by default can efficiently perform linear classifications and can be configured to
perform nonlinear classifications as well.
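A minimal SVM sketch with scikit-learn (illustrative parameters): kernel="linear" gives a linear decision boundary, while the default RBF kernel allows nonlinear classification; features are standardised first, which SVMs generally benefit from:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(linear_svm, X, y, cv=5).mean())
print(cross_val_score(rbf_svm, X, y, cv=5).mean())
```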
How to choose a suitable classifier among
Decision trees, naïve Bayes, & logistic regression
