[Figure: An example decision tree for classifying animals. Starting from the root, YES/NO branches lead through questions such as "Bird?", "Herbivore?" and "Gives Wool?" to leaves such as Cuckoo, Crow, Cow, Giraffe, Rabbit, Sheep and Deer.]
Unlike linear models, decision trees map non-linear relationships quite well.
A tree in which each node has no more than two child nodes is called a binary tree.
Root Node: This is the node that performs the first split.
Terminal Nodes/Leaves: These nodes predict the outcome.
Branches: They are depicted by arrows that connect nodes and show the flow from question to answer. Technically, a branch is a sub-section of the entire tree.
Splitting: The process of dividing a node into two or more sub-nodes. In a decision tree, splitting is done until a user-defined stopping criterion is reached. For example, the programmer may specify that the algorithm should stop once the number of items per node becomes less than 30.
Decision Node: A sub-node that splits into further sub-nodes.
Terminal or Leaf Node: A sub-node that does not split further into sub-nodes is called a terminal or leaf node.
Parent Node: A node which splits into sub-nodes is called the parent node of those sub-nodes (the sub-nodes are the children of the parent node).
Note: The Decision Tree Classifier is capable of both binary classification (where the labels are [-1, 1]) and multiclass classification (where the labels are [0, ..., K-1]).
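To make these terms concrete, the following minimal sketch (assuming scikit-learn is installed; variable names are illustrative) fits a Decision Tree Classifier on the multiclass Iris data, where the labels are 0, 1 and 2, and inspects the root node, depth and terminal nodes of the fitted tree.

# A minimal sketch: fit a decision tree on a multiclass dataset and
# inspect its structure (root node, depth, terminal nodes).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)              # labels 0, 1, 2 (multiclass)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

tree = clf.tree_                                # low-level tree structure
print("Total nodes   :", tree.node_count)
print("Tree depth    :", clf.get_depth())
print("Terminal nodes:", clf.get_n_leaves())

# Node 0 is the root node (the node that performs the first split);
# a node whose left child is -1 is a terminal (leaf) node.
print("Root node splits further:", tree.children_left[0] != -1)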
The decision of where to make splits has a great impact on the accuracy of decision trees. Moreover, the decision criterion is different for classification and regression trees. Multiple algorithms can be used to decide how to split a node into two or more sub-nodes. The split should increase the homogeneity of the resultant sub-nodes. For this, the decision tree splits the node on all available variables and then selects the split which results in the most homogeneous sub-nodes. Some of the algorithms used for splitting a node are given below. The algorithm selected depends on the type of the target variable.
Gini Index: The Gini index measures inequality of distribution. It is a measurement of the purity of nodes and is used for categorical dependent variables. It says that if we randomly select two items from a population (with replacement), then they must be of the same class; for a pure population, the probability of two items belonging to the same class is 1. According to this measure:
> The Gini index works when the target variable to be predicted is categorical.
> It performs only binary splits at each node.
> A lower value indicates higher similarity (purity) within a node.
The formula for computing the Gini index can be given as,
Gini = 1 - \sum_{i=1}^{n} p_i^2
Programming Tip: Decision trees come under greedy algorithms because the algorithm looks for the best variable available at the current split, and not at future splits that may result in a better tree.
Total Students = 50
Gini for sub-node Female - (0.25)* (0.25)+(0.7S)*(0. 75) = 0.0625 + 0.5625 = 0.625
Ginifor sub-node Male = (0.67)*(0.67)+(0.33)"(0.33) = 0.4489 + 0.1089 = 0.5578
Weighted Ginifor Spit -(20/s0)*0.63 (30/s0)*0.56 =0.252 + 0.336 =0.588
Figure 11.5 Computing weighted Gini score of each node of that split.
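The arithmetic in Figure 11.5 can be reproduced in a few lines of Python. The sketch below (the node proportions and sizes are read off the figure; the helper name gini_score is illustrative) computes the Gini score of each sub-node as the sum of squared class proportions and then weights the scores by node size.

# Reproduce the weighted Gini computation of Figure 11.5 (a sketch).
def gini_score(proportions):
    """Sum of squared class proportions in a node."""
    return sum(p * p for p in proportions)

female = gini_score([0.25, 0.75])                 # 0.625
male = gini_score([0.67, 0.33])                   # 0.5578

# Weight each sub-node by its share of the 50 students (20 female, 30 male).
weighted = (20 / 50) * female + (30 / 50) * male
print(female, male, round(weighted, 4))           # 0.625 0.5578 0.5847

# The figure rounds the sub-node scores to 0.63 and 0.56 before weighting,
# which gives the value 0.588 shown above.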
Entropy: Entropy measures impurity. It is used for categorical target values. The higher the impurity, the more information is required to describe it, and vice versa. Therefore, if a population is completely homogeneous, then the entropy is zero. Correspondingly, if the value of entropy is one, then it indicates that the items are equally divided in the population.
Entropy = -\sum_{i=1}^{n} p_i \log_2 p_i
The entropy of a node can also be calculated using the formula:
Entropy = -p \log_2 p - q \log_2 q
where p and q are the probabilities of success and failure, respectively, in that node. The split with the lowest entropy compared to the parent node and the other splits is chosen, as the lower the entropy, the better the split.
Steps to calculate entropy for a split:
1. Calculate entropy of parent node.
2. Calculate entropy of each individual node of split.
3. Calculate weighted average of all sub-nodes available in split.
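These three steps translate directly into code. A minimal sketch in Python (the function names entropy and weighted_entropy and the example labels are illustrative):

import math

def entropy(labels):
    """Steps 1 and 2: entropy of a single node from its class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def weighted_entropy(sub_nodes):
    """Step 3: weighted average entropy of the sub-nodes of a split."""
    total = sum(len(node) for node in sub_nodes)
    return sum(len(node) / total * entropy(node) for node in sub_nodes)

parent = ["yes"] * 15 + ["no"] * 15                            # illustrative labels
split = [["yes"] * 13 + ["no"] * 2, ["yes"] * 2 + ["no"] * 13]

print(entropy(parent))          # 1.0 -> items equally divided in the parent
print(weighted_entropy(split))  # about 0.57 -> the split reduces entropy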
Information Gain: Consider Figure 11.7 and think about which node you can describe most easily. The first node, right? Since all the individuals in the first node are exactly the same, if you know the features of one individual, you can easily describe the others. Now if you look at the second node, a majority of the elements are the same but a few of them are different, so you need some additional information to describe all the individuals. When we look at the third node, we see that we need maximum information here.
Figure 11.7 Role of Information
Technically, we can say that the first node is a pure node, the second node is less impure and the third is more impure. A less impure node requires less information to describe it. Correspondingly, a more impure node requires more information. Information Gain is calculated as 1 - Entropy.
Thus, Information Gain captures the amount of information one gains by selecting a particular attribute. We start at the root node of the tree and split the data on the feature that results in the largest information gain (IG). Therefore, IG tells us how important a given attribute is.
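In practice, the information gain of a candidate split is usually computed as the entropy of the parent node minus the weighted entropy of its sub-nodes, and the feature with the largest gain is chosen. The following sketch illustrates that common definition (an assumption here, since the text defines IG as 1 - Entropy); it reuses the entropy helpers sketched earlier, and the feature names and label counts are illustrative.

# Assumes the entropy() and weighted_entropy() helpers sketched earlier.
def information_gain(parent_labels, sub_nodes):
    """Reduction in entropy achieved by a candidate split."""
    return entropy(parent_labels) - weighted_entropy(sub_nodes)

parent = ["play"] * 9 + ["no"] * 5                             # illustrative labels

candidate_splits = {
    "outlook": [["play"] * 4,
                ["play"] * 3 + ["no"] * 2,
                ["play"] * 2 + ["no"] * 3],
    "windy": [["play"] * 6 + ["no"] * 2,
              ["play"] * 3 + ["no"] * 3],
}

gains = {feature: information_gain(parent, split)
         for feature, split in candidate_splits.items()}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)                              # largest IG wins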
11.2.5 Best Split for Continuous Variables
Reduction in Variance is an excellent technique for continuous target variables. It uses the standard formula of variance to choose the best split. The split with the lower variance is chosen as the best split to divide the individuals.
Variance = \frac{\sum (X - \bar{X})^2}{n}
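A small sketch of this idea in plain Python (the helper names and the target values are illustrative): the variance of each sub-node is weighted by its size, and the candidate split with the lowest weighted variance, i.e. the largest reduction from the parent's variance, is preferred.

def variance(values):
    """Population variance: sum((x - mean)^2) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def weighted_variance(sub_nodes):
    total = sum(len(node) for node in sub_nodes)
    return sum(len(node) / total * variance(node) for node in sub_nodes)

parent = [10, 12, 14, 30, 32, 34]            # illustrative continuous target values
split = [[10, 12, 14], [30, 32, 34]]         # one candidate split

print(variance(parent))                      # about 102.7 in the parent node
print(weighted_variance(split))              # about 2.7 -> a very good split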
Points to Remember
> To choose the best separation, the chi-square (X²) test is used, which tests the independence of two variables X and Y.
> The number of degrees of freedom can be calculated as, p = (number of rows - 1) x (number of columns - 1)
> Decision trees are biased toward selecting categorical variables with large numbers of levels. So, if there are categorical variables with high numbers of levels, then try to remove them to have fewer levels.
> It is possible to set various options while constructing a decision tree (a code sketch of these options follows this list). These options include:
  o Maximum Depth, which specifies the number of levels deep a tree can go. This limits the complexity of a tree.
  o Minimum Number of Records in Terminal Nodes. This is important because if splitting a terminal node would result in fewer records than that specified, the split is not made and the tree ceases to grow to the next level. This parameter is basically used to control overfitting problems.
  o Minimum Number of Records in Parent Node, which determines where a split can occur. For example, if the records in a node are fewer than the number specified, then it is not split further. This parameter is also used to control overfitting. A higher value prevents a model from learning relations which might be important for a particular sample; thus, a too-high value may result in underfitting.
  o Bonferroni Correction, which should be used as it adjusts for many comparisons when the Chi-Square statistic is used for a categorical input variable rather than a categorical target variable.
Programming Tip: As the purity of a node increases, the Gini index decreases.
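These options map directly onto constructor parameters of scikit-learn's decision trees. A minimal sketch (the parameter values are illustrative; scikit-learn calls the last two options min_samples_split and min_samples_leaf):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=3,           # Maximum Depth: how many levels deep the tree may go
    min_samples_split=30,  # Minimum Number of Records in Parent Node before splitting
    min_samples_leaf=10,   # Minimum Number of Records in Terminal Nodes after a split
    random_state=0,
)
clf.fit(X, y)
print("depth:", clf.get_depth(), "terminal nodes:", clf.get_n_leaves())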
OUTLIERS
An outlier is an individual data item or observation that lies at an abnormal distance from other data values in a random sample of a data set. It is the job of the data analyst to decide what will be considered abnormal. Outliers should be carefully investigated as they may contain valuable information about the data being processed. The data analyst must be able to answer questions like the reasons for the presence of outliers in the data set, the probability that such values will continue to appear, and so on. For example, if an examination was held with maximum marks of 100, then one or more students getting more than 100 will be treated as outliers.
Outlier analysis is extensively used for detecting fraudulent cases. In some cases, outliers may be contextual outliers. This means that a particular data value is an outlier only under a specific condition. For example, in the hot summer season when the temperature is 40°C or above, an unexpected windstorm and rain may bring it down to 32°C. Another example could be that a bank customer may not withdraw more than 50K in a month, but because of a family wedding may withdraw ₹2.5 lakhs in a day.
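What counts as "abnormal" is ultimately the analyst's decision; one common rule of thumb flags values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule (the marks data is invented for illustration):

import numpy as np

marks = np.array([55, 58, 66, 67, 69, 72, 74, 80, 120])   # 120 exceeds the maximum of 100

q1, q3 = np.percentile(marks, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = marks[(marks < lower) | (marks > upper)]
print(outliers)                                            # [120] is flagged as abnormal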
The problem of overfitting can be overcome by tree pruning and by setting constraints on tree size.
Figure 11.8 Relationship between Bias and Model Fitting (three panels plotting Price against Size: Under-fitting, too simple to explain the variance; Appropriate-fitting; Over-fitting, force-fitting that is too good to be true)
Figure 11.8 shows that an overfitted model exhibits low bias but high variance. An underfitted model, on the other hand, exhibits low variance but high bias. While overfitting results from an overly complex model, underfitting is the result of a very simple model. Both overfitting and underfitting lead to poor predictions on new datasets.
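A quick way to see the effect shown in Figure 11.8 is to compare an unconstrained tree with a size-constrained one on held-out data. A minimal sketch (the synthetic dataset and parameter values are illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)                # grows until the leaves are pure
small = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)  # constrained tree size

# The deep tree fits the training data almost perfectly; the gap between its
# training and test scores shows how much of that fit is overfitting.
print("deep :", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("small:", small.score(X_tr, y_tr), small.score(X_te, y_te))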
Bagging: Decision trees suffer from high variance. That is, if the training data is randomly divided into two parts and a decision tree is created for each part, the two decision trees will be quite different. Ideally, we should get similar results when the algorithm is applied repeatedly to distinct datasets.
[Figure: Bagging. Step 1: Create multiple data sets D1, D2, ..., DB by sampling from the original training data D. Step 2: Build multiple classifiers, one per data set. Step 3: Combine the classifiers.]
The combined (bagged) prediction is the average of the predictions of the B individual models:
\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x)
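Bagging reduces this variance by training many trees on bootstrapped copies of the training data and combining their predictions. A minimal sketch using scikit-learn, whose BaggingClassifier uses a decision tree as its default base estimator (the dataset and number of estimators are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=50, random_state=0)   # 50 bootstrapped trees

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())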