
COMMON CLASSIFICATION ALGORITHMS

Let us now delve into some common classification algorithms. Following are the most common classification algorithms, out of which we have already learnt about the Naïve Bayes classifier in Chapter 6. We will cover the details of the other algorithms in this chapter.
1. k-Nearest Neighbour (kNN)
2. Decision tree
3. Random forest
4. Support Vector Machine (SVM)
5. Naïve Bayes classifier
11.2 Decision trees
Do you remember the "Guess What" game that we used to play in our childhood? One of us would think of something, and the others had to guess what it was. For that, they could ask questions, the answers to which would be either "Yes" or "No"; based on these clues, the answer is arrived at. For example, consider the tree given below (note that it is just a sample tree, not a general tree). According to the rules specified, if you are thinking of an animal that is a herbivore and does not live in the forest, then it is a rabbit (Figure 11.3).

Figure 11.3 Decision Tree to identify the animal (the tree first asks whether the item is an animal; further questions such as "Herbivore?", "Bird?", "Lives in forest?", "Gives milk?", "Gives wool?" and "Known for singing well?" lead to leaves such as Cow, Sheep, Deer, Giraffe, Rabbit, Cuckoo, Crow and Peacock)


The same concept is used to create a Decision Tree for any Data Science problem. Now let us study in detail the math behind it.

11.2.1 Basic introduction


Decision Tree is one of the most intuitive and popular techniques in data mining. It provides explicit rules for classification, works well with heterogeneous data, and predicts the target value of an item by mapping observations about the item.
A decision tree represents choices and their results in the form of a tree. The nodes represent an event or choice and the edges of the graph represent the decision rules or conditions. Some common applications of decision trees include:
> predicting an email as spam or not spam
> predicting if a tumor is cancerous or not
> predicting a loan as good or bad
> predicting credit risk based on certain factors
Basically, decision trees can perform either classification or regression. For example, identifying a credit card transaction as genuine or fraudulent is an example of classification, but using a decision tree to forecast prices of stock would be a regression task.

Unlike linear models, decision trees map non-linear relationships quite well.

11.2.2 Principle of decision trees


Decision trees can be used to divide a set of items into n predetermined classes based on a specified criterion. They belong to a class of recursive partitioning algorithms that are simple to describe and implement. To create a decision tree:
Step 1: Select the variable which best separates the classes. Set this variable as the root node.
Step 2: Divide each individual variable into the given classes, thereby generating new nodes. Note that decision trees are based on a forward selection mechanism; so after a split is created, it cannot be re-visited.
Step 3: Again select a variable which best separates the classes.
Step 4: Repeat steps 2 and 3 for each node generated until further separation of individuals is not possible.
Now each terminal (or leaf) node consists of individuals of a single class. That is, once the tree is constructed and the division criterion is specified, each individual can be assigned to exactly one leaf. This is determined by the values of the independent variables for that individual.
Moreover, a leaf is assigned to a class if the cost of assigning that leaf to any other class is higher than assigning it to the current class. After all the leaf nodes are assigned to a class, we can calculate the error rate of the tree (also known as the total cost or risk of the tree) by starting with the error rate of each leaf.
Thus, we see that a Decision Tree is a supervised learning predictive model that uses rules to calculate a target value. This target value can be either:
> a continuous variable, for regression trees. Such trees are also known as Continuous Variable Decision Trees. Example: predict the price of a product or the income of a customer (based on his occupation, experience, etc.).
> a categorical variable, for classification trees. Such trees are also known as Categorical Variable Decision Trees. Example: to check if the target value is "Yes" or "No", i.e., whether a customer will buy a product or not.
Programming Tip: Decision trees work for both categorical and continuous input and output variables.
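As a minimal illustration of the two flavours (a sketch assuming scikit-learn is installed; the tiny datasets below are made up purely for demonstration), a classification tree fits a categorical target while a regression tree fits a continuous one:

# A minimal sketch, assuming scikit-learn is available; the toy data is illustrative only.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical target ("buys product" = 1/0)
X_cls = [[22, 20000], [35, 60000], [48, 90000], [29, 30000]]   # [age, income]
y_cls = [0, 1, 1, 0]
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_cls, y_cls)
print(clf.predict([[40, 70000]]))      # predicted class label

# Regression tree: continuous target (price of a product)
X_reg = [[1], [2], [3], [4], [5]]      # e.g., size of the product
y_reg = [10.0, 20.5, 31.0, 39.5, 50.0]
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_reg, y_reg)
print(reg.predict([[3.5]]))            # predicted continuous value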
Pruning the Tree: When the decision tree becomes very deep, it must be pruned, as it may contain some irrelevant nodes in the leaves. Pruning therefore avoids creating very small nodes with no real statistical significance.
An algorithm based on decision trees is said to be good if it creates the largest-sized tree and automatically prunes it after detecting the optimal pruning threshold. The algorithm should also use the cross-validation technique and combine the error rates found for all possible sub-trees to choose the best sub-tree.
For example, if the Information Gain of a node is -10 (a loss of 10) and the next split on that node gives us an IG of 20, then a simple decision tree will stop at step 1. But in pruning, the overall gain would be taken as +10 and both the leaves will be retained.
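As a hedged, concrete illustration: scikit-learn's DecisionTreeClassifier supports cost-complexity pruning, where larger values of ccp_alpha prune more aggressively. The sketch below shows how candidate pruning thresholds could be inspected (the choice of dataset is arbitrary):

# A minimal sketch of cost-complexity pruning with scikit-learn (dataset chosen only for illustration).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0)

# Candidate pruning thresholds (alphas) for sub-trees of the fully grown tree
path = tree.cost_complexity_pruning_path(iris.data, iris.target)
print(path.ccp_alphas)

# Refit with one of the thresholds: a larger ccp_alpha gives a smaller, more heavily pruned tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
pruned.fit(iris.data, iris.target)
print(pruned.get_n_leaves())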

A tree with each node having no more than two child nodes is called a binary tree.

11.2.3 Key terminology


Some important terms that are frequently used in decision trees include (Figure 11.4):
Figure 11.4 Structure of a Decision Tree (Root Node, Splitting, Decision Nodes, Terminal Nodes, and a Branch/Sub-Tree)

Root Node: This is the node that performs the first split.
Terminal Nodes/Leaves: These nodes predict the outcome.
Branches: They are depicted by arrows that connect nodes and show the flow from question to answer. Technically, a branch is a sub-section of the entire tree.
Splitting: The process of dividing a node into two or more sub-nodes. In a decision tree, splitting is done until a user-defined stopping criterion is reached. For example, the programmer may specify that the algorithm should stop once the number of items per node becomes less than 30.
Decision Node: A sub-node that splits into further sub-nodes.
Terminal or Leaf Node: A sub-node that does not split further.
Parent Node: A node which splits into sub-nodes is called the parent node of those sub-nodes (which are the child nodes of the parent node).
Note: The Decision Tree Classifier is capable of both binary classification (where the labels are [-1, 1]) and multiclass classification (where the labels are [0, ..., K-1]).
The decision of making splits has a great impact on the accuracy of decision trees. Moreover, the decision criterion is different for classification and regression trees. Multiple algorithms can be used to decide how to split a node into two or more sub-nodes. The split should increase the homogeneity of the resultant sub-nodes. For this, the decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
Some of the algorithms used for splitting a node are given below. The algorithm selection depends on the type of the target variable.
Gini Index: The Gini index measures inequality of distribution. It is a measurement of the purity of nodes and is used for all dependent variables. It says that if we randomly select two items from a population (with replacement), then they must be of the same class. According to this measure:
> For a pure population, the probability of two items belonging to the same class is 1.
> The Gini index works when the target value to be predicted is categorical.
> This measure performs only two splits at each node.
> A lower value indicates higher similarity.
Programming Tip: Decision trees come under greedy algorithms because the algorithm looks for the best variable available at the current split, and not at future splits that may result in a better tree.
The formula for computing the Gini index can be given as,
Gini = 1 - Σ p(j)²
where p(j) is the probability of class j.


An alternate way to calculate Gini for a split can be given as:
Step 1: Calculate Gini for each sub-node using the sum of squares of the probabilities of success and failure (p² + q²).
Step 2: Calculate Gini for the split using the weighted Gini score of each node of that split.
For example, consider Figure 11.5 given below and perform a split producing more homogeneous sub-nodes using the Gini index.

Total Students = 50
Female Students = 20 (Plays Soccer = 5, i.e., 25%)
Male Students = 30 (Plays Soccer = 20, i.e., 67%)
Gini for sub-node Female = (0.25)*(0.25) + (0.75)*(0.75) = 0.0625 + 0.5625 = 0.625
Gini for sub-node Male = (0.67)*(0.67) + (0.33)*(0.33) = 0.4489 + 0.1089 = 0.5578
Weighted Gini for Split = (20/50)*0.625 + (30/50)*0.558 = 0.25 + 0.335 = 0.585
Figure 11.5 Computing weighted Gini score of each node of that split
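The arithmetic in Figure 11.5 can be reproduced with a few lines of Python; this is only a sketch of the sum-of-squares variant of the Gini score used above, not a library routine:

# A small sketch reproducing the weighted Gini computation of Figure 11.5.
def gini_score(p_success):
    """Sum-of-squares form used above: p^2 + q^2 (higher means purer)."""
    q = 1.0 - p_success
    return p_success ** 2 + q ** 2

# (count in node, fraction playing soccer) for each sub-node
female = (20, 5 / 20)   # 25% play soccer
male   = (30, 20 / 30)  # 67% play soccer
total  = female[0] + male[0]

weighted_gini = sum(n / total * gini_score(p) for n, p in (female, male))
print(round(weighted_gini, 3))   # about 0.58 (small differences from Figure 11.5 are due to rounding)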

Entropy: Entropy measures impurity. It is used for a categorical target value. The higher the impurity, the more information is required to describe it, and vice versa. Therefore, if a population is completely homogeneous, then the entropy is zero. Correspondingly, if the value of entropy is one, then the items are equally divided in the population.
Entropy = -Σ p(i) log2 p(i)
The entropy of a node can also be calculated using the formula:
Entropy = -p log2 p - q log2 q
where p and q are the probabilities of success and failure, respectively, in that node. The split with the lowest entropy compared to the parent node and other splits is chosen, as the lesser the entropy, the better it is.
Steps to calculate entropy for a split:
1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split.
3. Calculate the weighted average of all sub-nodes available in the split.
Total Students = 50
Female Students = 20 (Plays Soccer = 5, i.e., 25%)
Male Students = 30 (Plays Soccer = 20, i.e., 67%)
Entropy for parent node = -(25/50) log2 (25/50) - (25/50) log2 (25/50) = 1
Entropy for Female node = -(5/20) log2 (5/20) - (15/20) log2 (15/20)
= -(0.25)(-2) - (0.75)(-0.415) = 0.5 + 0.31 = 0.81
Entropy for Male node = -(20/30) log2 (20/30) - (10/30) log2 (10/30)
= -(0.67)(-0.585) - (0.33)(-1.585) = 0.39 + 0.52 = 0.92
Entropy for split on Gender = Weighted entropy of sub-nodes = (20/50)*0.81 + (30/50)*0.92
= (0.4)(0.81) + (0.6)(0.92) = 0.32 + 0.55 = 0.88
Figure 11.6 Computing Entropy of a split
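The same numbers can be checked with a short Python sketch (again, just a reproduction of the worked example, not a library function):

# A small sketch reproducing Figure 11.6: weighted entropy of the Gender split.
from math import log2

def entropy(p):
    """Binary entropy: -p*log2(p) - q*log2(q), with the 0*log(0) = 0 convention."""
    q = 1.0 - p
    return sum(-x * log2(x) for x in (p, q) if x > 0)

female = (20, 5 / 20)    # (students in node, fraction playing soccer)
male   = (30, 20 / 30)
total  = female[0] + male[0]

print(round(entropy(25 / 50), 2))                                    # parent node: 1.0
split_entropy = sum(n / total * entropy(p) for n, p in (female, male))
print(round(split_entropy, 2))                                        # about 0.88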


Similarly, calculate the entropy for the other splits at the same level and select the split with the lowest entropy.
Chi-Square: The Chi-Square test is used to deduce the statistical significance of the differences between the sub-nodes and the parent node. It is calculated as the sum of squares of the standardized differences between the observed and expected frequencies of the target variable. The advantage of using this test is that it works well with a categorical target variable. A higher value of Chi-Square indicates higher statistical significance of the differences between a sub-node and the parent node.
Note: A tree generated by using the Chi-Square test is known as CHAID (Chi-square Automatic Interaction Detector).
It is calculated using the formula,
Chi-square = ((Actual - Expected)² / Expected)^(1/2)
Steps to calculate Chi-square for a split:
1. Calculate Chi-square for each individual node by calculating the deviation for both Success and Failure.
2. Calculate Chi-square of the split as the sum of the Chi-square values of Success and Failure of each node of the split.
For example, let us form a table (Table 11.1) containing the actual and expected values for "Play Soccer" and "Not Play Soccer". We have kept the expected value for both "Play Soccer" and "Not Play Soccer" as 10 for the Female node because the parent node has a probability of 50%. Calculate the Chi-square of the node for "Play Soccer" and "Not Play Soccer" using the formula ((Actual - Expected)² / Expected)^(1/2). Similarly, calculate the Chi-square values for the Male node. Finally, add all the Chi-square values to calculate the Chi-square for the split on Gender (Male and Female).
Table 11.1 Finding the Chi-square for the split on Gender

NODE     PLAY     NOT PLAY   TOTAL   EXPECTED       EXPECTED          CHI-SQUARE      CHI-SQUARE
         SOCCER   SOCCER             PLAY SOCCER    NOT PLAY SOCCER   (PLAY SOCCER)   (NOT PLAY SOCCER)
Female   5        15         20      10             10                1.58            1.58
Male     20       10         30      15             15                1.29            1.29

TOTAL CHI-SQUARE VALUE FOR GENDER = 5.74
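A short sketch reproducing the arithmetic of Table 11.1 (using the square-root form of the standardized deviation given above; this is illustrative code, not a statistics-library call):

# A small sketch reproducing the Chi-square values in Table 11.1.
from math import sqrt

def chi_value(actual, expected):
    """Square root of (actual - expected)^2 / expected, as in the formula above."""
    return sqrt((actual - expected) ** 2 / expected)

# node -> (plays soccer, does not play soccer); parent probability of playing is 50%
nodes = {"Female": (5, 15), "Male": (20, 10)}

total_chi = 0.0
for name, (play, not_play) in nodes.items():
    expected = (play + not_play) / 2          # 50% of the node total
    chi_play, chi_not = chi_value(play, expected), chi_value(not_play, expected)
    total_chi += chi_play + chi_not
    print(name, round(chi_play, 2), round(chi_not, 2))

print("Chi-square for split on Gender:", round(total_chi, 2))   # about 5.74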

Information Gain: Consider Figure 11.7 and think about which node you can describe most easily. The first node, right? All the individuals in the first node are exactly the same, so if you know the features of one individual, then you can easily describe the others. Now if you look at the second node, a majority of the elements are the same but a few of them are different, so you need some additional information to describe all the individuals. When we look at the third node, we see that we need the maximum information there.
Figure 11.7 Role of Information
Technically, we can say that the first node is a pure node, the second node is less impure and the third is more impure. A less impure node requires less information to describe it. Correspondingly, a more impure node requires more information. Information Gain is calculated as 1 - Entropy.
Thus, Information Gain captures the amount of information one gains by selecting a particular attribute. We start at the root node of the tree and split the data on the feature that results in the largest information gain (IG). Therefore, IG tells us how important a given attribute is.
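In practice, the gain of a split is often measured as the entropy of the parent node minus the weighted entropy of its children; the sketch below applies that common formulation to the Gender example above (stated here as an illustration, not as this book's exact definition):

# A small sketch: information gain of the Gender split as parent entropy minus weighted child entropy.
from math import log2

def entropy(p):
    q = 1.0 - p
    return sum(-x * log2(x) for x in (p, q) if x > 0)

parent = entropy(25 / 50)                                    # 1.0
children = [(20, 5 / 20), (30, 20 / 30)]                     # (size, fraction playing soccer)
total = sum(n for n, _ in children)
weighted = sum(n / total * entropy(p) for n, p in children)  # about 0.88

print("Information gain of split on Gender:", round(parent - weighted, 2))   # about 0.12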
11.2.5 Best split for continuous variables
Reduction in Variance is an excellent technique for continuous target variables. It uses the formula of variance to choose the best split. The split with the lower variance is chosen as the best split to divide the individuals.
Variance = Σ(X - X̄)² / n
where X̄ is the mean of the values, X is an actual value, and n is the number of values.

Steps to calculate Variance:
1. Calculate the variance for each node.
2. Calculate the variance for each split as the weighted average of each node's variance.
Example: Let us use the same example and assign the numerical value 1 for playing soccer and 0 for not playing soccer.

Total Students = 50
Female Students = 20 (Plays Soccer = 5, i.e., 25%)
Male Students = 30 (Plays Soccer = 20, i.e., 67%)
Mean of Female node = (5*1 + 15*0)/20 = 0.25 and Variance = (5*(1-0.25)² + 15*(0-0.25)²)/20
= (5*0.5625 + 15*0.0625)/20 = (2.81 + 0.94)/20 = 0.19
Mean of Male node = (20*1 + 10*0)/30 = 0.67 and Variance = (20*(1-0.67)² + 10*(0-0.67)²)/30
= (20*0.1089 + 10*0.4489)/30 = (2.18 + 4.49)/30 = 0.22
Mean of Root node = (25*1 + 25*0)/50 = 0.5 and Variance = (25*(1-0.5)² + 25*(0-0.5)²)/50 = 0.25
Variance for split on Gender = Weighted variance of sub-nodes = (20/50)*0.19 + (30/50)*0.22
= (0.4*0.19) + (0.6*0.22) = 0.076 + 0.132 = 0.21
Since the weighted variance of the split (0.21) is lower than the variance of the root node (0.25), the split on Gender reduces the variance.
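The same calculation in Python (a sketch that simply mirrors the arithmetic above):

# A small sketch: reduction in variance for the Gender split (1 = plays soccer, 0 = does not).
def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

female = [1] * 5 + [0] * 15
male = [1] * 20 + [0] * 10
root = female + male

weighted = (len(female) * variance(female) + len(male) * variance(male)) / len(root)
print(round(variance(root), 2), round(weighted, 2))   # 0.25 vs about 0.21: the split reduces variance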

Points to Remember
Programming Tip: As the purity of a node increases, the Gini index decreases.
> To choose the best separation, the χ² test is used, which tests the independence of two variables X and Y.
> The number of degrees of freedom can be calculated as, p = (number of rows - 1) × (number of columns - 1).
> Decision trees are biased toward selecting categorical variables with large numbers of levels. So, if there are categorical variables with high numbers of levels, then try to reduce them to fewer levels.
> It is possible to set various options while constructing a decision tree (see the sketch after this list). These options include:
  o Maximum Depth, which specifies the number of levels deep a tree can go. This limits the complexity of a tree.
  o Minimum Number of Records in Terminal Nodes. This is important because if splitting a terminal node would result in fewer records than the number specified, the split is not made and the tree ceases to grow to the next level. This parameter is basically used to control overfitting problems.
  o Minimum Number of Records in Parent Node, to determine where a split can occur. For example, if the records in a node are fewer than the number specified, then it is not split further. This parameter is also used to control overfitting. A too-high value prevents the model from learning relations which might be important for a particular sample and may thus result in underfitting.
  o Bonferroni Correction, which should be used as it adjusts for many comparisons when the Chi-Square statistic is used for a categorical input variable rather than a categorical target variable.
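A hedged sketch of how these options roughly map onto scikit-learn's DecisionTreeClassifier parameters (max_depth, min_samples_leaf and min_samples_split are real scikit-learn parameters; the values below are arbitrary, and scikit-learn itself does not expose a Bonferroni option):

# A rough, illustrative mapping of the options above onto scikit-learn parameters.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,            # Maximum Depth: limits how many levels deep the tree can go
    min_samples_leaf=30,    # Minimum Number of Records in Terminal Nodes
    min_samples_split=60,   # Minimum Number of Records in Parent Node required to attempt a split
    random_state=0,
)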

11.2.6 Advantages of decision trees


Decision trees are one of the most popular statistical techniques because they provide the following advantages to the users:
> They are easy to implement.
> Results can be easily understood. They do not require any statistical knowledge to read and interpret them.
> Programmers can easily code the resulting model.
> They execute faster when the model is applied to new individuals.
> Outliers or extreme individuals do not affect decision trees.
> Some decision tree algorithms can deal with missing data.
> A decision tree allows users to visualize the results and represents all factors that are important in decision making.
> Decision trees provide a clear picture of the underlying structure in the data and the relationships between variables.
> They can also be used to identify the most significant variables in the data-set.
> A decision tree closely mirrors human decision-making compared to other regression and classification approaches.
> Decision trees require less data cleaning as compared to other techniques.
> To a fair degree, they are not influenced by missing values.
> They can handle both numerical as well as categorical variables.
> Decision trees employ a non-parametric method, as they make no assumptions about the space distribution and the classifier structure.

OUTLIERS
An outlier is an individual data item or observation that lies at an abnormal distance from other data values in a random sample of a data set. It is the work of the data analyst to decide what will be considered abnormal. Outliers should be carefully investigated as they may contain valuable information about the data being processed. The data analyst must be able to answer questions like the reasons for the presence of outliers in the data set, the probability that such values will continue to appear, etc. For example, if an examination was held for maximum marks of 100, then one or more students getting more than 100 will be treated as an outlier.
Outlier analysis is extensively used for detecting fraudulent cases. In some cases, outliers may be contextual outliers. This means that a particular data value is an outlier only under a specific condition. For example, in the hot summer season when the temperature is 40°C and above, an unexpected windstorm and rain may bring it down to 32°C. Another example could be that a bank customer may not withdraw more than 50K in a month but, because of a family wedding, may withdraw ₹2.5 lakhs in a day.

11.2.7 Limitations of decision trees


The disadvantages of using decision trees are as follows:
> The tree detects local, and not global, optima.
> All independent variables are evaluated sequentially, not simultaneously.
> The modification of a single variable may change the whole tree if this variable is located near the top of the tree.
> Decision trees lack robustness. Though this can be overcome by resampling, in which one can construct the trees on many successive samples and aggregate them by mean, this leads to a complex tree that is difficult to read and interpret.
> Since each split reduces the number of remaining records, later splits based on very few records have less statistical power.
> The property of forward variable selection and constant splitting of nodes makes prediction by trees not as accurate as that done by other algorithms.
> Overfitting is a key challenge in modeling decision trees. If there is no limit set for a decision tree, it will give 100% accuracy on the training set because, in the worst case, there will be a single leaf for each observation. The problem of overfitting can be overcome by tree pruning and by setting constraints on tree size.

11.2.8 Overfitting and underfitting problems in ML algorithms


In statistics and ML algorithms, a fit refers to how well a target function is approximated. In ML algorithms, the target function is used to make predictions. Thus, it is important to know whether the target function is suffering from overfitting or underfitting problems.
Overfitting: Overfitting means that a model fits the training data too well. It occurs when a model learns the details and noise in the training data to the extent that it negatively impacts the performance of the model on new data. In such a situation, outliers in the training data are picked up and learned as concepts by the model; the overfitted model is, in effect, trained too closely on the data.
Overfitting is more likely with non-parametric (like decision trees) and non-linear models that have more flexibility when learning a target function. To rectify this problem, the decision tree must be pruned after it has learned, so that some of the unnecessary details it has learnt can be removed.
Underfitting: Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit ML model also gives poor performance on the training data.
An ML model is said to be underfitted when it cannot capture the underlying trend of the data. Such a model does not fit the data well enough. This usually happens when we have too little data to build an accurate model, and also when we try to build a linear model with non-linear data. An underfitted model can be avoided by using more data and by performing dimensionality reduction.
Good Fit Model: An ML model is said to be good if it generalizes to any new input data from the problem domain in a proper way. This helps us to make predictions on future data that the model has never seen. Ideally, the model should be neither underfitted nor overfitted. Such a model is at the sweet spot between underfitting and overfitting. If we measure the performance of a model over time, the error decreases as the model learns. But if we train a model for too long, the error of the model on the training dataset may continue to decrease due to the problem of overfitting, i.e., learning the irrelevant detail and outliers in the training dataset. Simultaneously, errors in the test data start to rise as the model's ability to generalize decreases. A good-fit model makes predictions with low error. In order to get a good fit, we will stop at the point just before the test error starts increasing. At this point, the model has good skill on the training dataset as well as on the testing dataset.
Limiting Overfitting Issues: While both overfitting and underfitting can result in poor performance, by far the most common problem in applied machine learning is overfitting. Overfitting can be limited by using a resampling technique like k-fold cross validation to estimate model accuracy. The k-fold cross validation technique allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of an ML model on unseen data.
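A brief sketch of k-fold cross validation with scikit-learn (cross_val_score is a real scikit-learn helper; the dataset and k = 5 are chosen only for illustration):

# A minimal sketch of estimating a decision tree's accuracy with 5-fold cross validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

scores = cross_val_score(tree, iris.data, iris.target, cv=5)   # 5 train/test folds
print(scores.mean())                                           # average accuracy across folds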

Figure 11.8 Relationship between Bias and Model Fitting (under-fitting: too simple to explain the variance; appropriate fitting; over-fitting: force-fitting, too good to be true)
Figure 11.8 shows that an overfitted model exhibits low bias but high variance. An underfitted model, on the other hand, shows low variance but high bias. While overfitting results from a complex model, underfitting is a result of a very simple model. Both overfitting and underfitting lead to poor predictions on new datasets.

11.2.9 Improving performance of decision trees


The performance of decision trees can be enhanced by aggregating many decision trees, using methods like bagging, random forests, and boosting. In this section, we will discuss all three of them but implement only random forest.

Bagging: Decision trees suffer from high variance. That is, if the training data is randomly divided into two parts and a decision tree is created for each part, then the two decision trees will be quite different. Ideally, we should get similar results when the algorithm is applied repeatedly to distinct datasets.

Figure 11.9 Bagging (Step 1: Create Multiple Data Sets from the original training data; Step 2: Build Multiple Classifiers; Step 3: Combine Classifiers)


Bagging, also known as bootstrap aggregation, is used to reduce the variance of predictions by combining the results of multiple classifiers modeled on different sub-samples of the same dataset. This technique can be mathematically represented as,
f_bag(x) = (1/B) Σ (b = 1 to B) f_b(x)
where B is the number of bootstrapped training sets and f_b is the model fitted on the b-th set.
Look at Figure 11.9 to understand how bagging is done.


Programming Tip: Besides decision trees, bagging can also be used to improve predictions for many other regression and classification methods.
Step 1: The original dataset is divided into multiple datasets, where each dataset has a fraction of the original dataset's data.
Step 2: A classifier is built on each dataset.
Step 3: Take the average of the predictions made by all the classifiers.
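These three steps are packaged in scikit-learn's BaggingClassifier; the following is only a sketch of how it could be combined with a decision tree (the parameter values are illustrative):

# A minimal sketch of bagging decision trees with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# 50 trees, each trained on a bootstrap sample of the training data;
# predictions are combined by majority vote.
# Note: older scikit-learn versions name this parameter base_estimator instead of estimator.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=0,
)
bagged.fit(iris.data, iris.target)
print(bagged.score(iris.data, iris.target))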
Implementing Decision Tree Algorithm

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

iris = load_iris()

decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(iris.data, iris.target)
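The excerpt ends here, but since export_text is imported above, a natural continuation (shown as a hedged sketch, not necessarily the book's exact next lines) is to print the learned rules:

# Print the decision rules learned by the fitted tree (uses the export_text imported above).
rules = export_text(decision_tree, feature_names=list(iris.feature_names))
print(rules)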
