DWDM final5
Classification
1. Confusion Matrix
Definition:
A confusion matrix is a table summarizing the performance of a classification model. It shows counts
for True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Steps:
1. Predict outcomes for all instances.
2. Compare predicted and actual values.
3. Fill the confusion matrix with counts of TP, TN, FP, FN.
Formula:
No specific formula. It is directly tabulated based on comparisons.
Example:
Suppose a classifier is tested on 5 instances (3 actual positives, 2 actual negatives) and correctly predicts 2 of the positives and both negatives. The resulting counts are TP = 2, TN = 2, FP = 0, FN = 1; these counts are reused in the examples below.
Graph:
Not applicable.
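As a rough illustration (the labels below are assumed, chosen to give the TP = 2, TN = 2, FP = 0, FN = 1 counts used in the examples that follow), a confusion matrix can be tabulated with scikit-learn:

```python
# A minimal sketch: tabulating a confusion matrix from assumed labels.
from sklearn.metrics import confusion_matrix

actual    = [1, 1, 1, 0, 0]   # 1 = positive class, 0 = negative class (assumed)
predicted = [1, 1, 0, 0, 0]   # model predictions for the same instances

# Rows = actual class, columns = predicted class.
# With labels=[1, 0] the layout is [[TP, FN], [FP, TN]].
cm = confusion_matrix(actual, predicted, labels=[1, 0])
print(cm)   # [[2 1]
            #  [0 2]]  -> TP=2, FN=1, FP=0, TN=2
```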
2. Accuracy
Definition:
The proportion of correct predictions over total predictions.
Steps:
1. Count TP and TN.
2. Add FP and FN for the total predictions.
3. Divide correct predictions by total.
Formula:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Example:
Using the confusion matrix above:
\text{Accuracy} = \frac{2 + 2}{2 + 2 + 0 + 1} = 0.8 \ (80\%)
Graph:
Not applicable.
3. Precision
Definition:
The proportion of true positive predictions among all positive predictions.
Steps:
1. Count TP and FP.
2. Divide TP by (TP + FP).
Formula:
\text{Precision} = \frac{TP}{TP + FP}
Example:
\text{Precision} = \frac{2}{2 + 0} = 1.0 \ (100\%)
Graph:
Not applicable.
4. Recall (Sensitivity)
Definition:
The proportion of true positive predictions among all actual positive instances.
Steps:
1. Count TP and FN.
2. Divide TP by (TP + FN).
Formula:
\text{Recall} = \frac{TP}{TP + FN}
Example:
\text{Recall} = \frac{2}{2 + 1} = 0.67 \ (67\%)
Graph:
Not applicable.
5. F1 Score
Definition:
The harmonic mean of Precision and Recall, balancing the two metrics.
Steps:
1. Calculate Precision and Recall.
2. Use the F1 Score formula.
Formula:
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
Example:
\text{F1 Score} = 2 \times \frac{1.0 \times 0.67}{1.0 + 0.67} = 0.8 \ (80\%)
Graph:
Not applicable.
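A minimal sketch that recomputes the four metrics above directly from the example counts:

```python
# Accuracy, Precision, Recall and F1 from the confusion-matrix counts above.
TP, TN, FP, FN = 2, 2, 0, 1

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.80
print(f"Precision: {precision:.2f}")  # 1.00
print(f"Recall:    {recall:.2f}")     # 0.67
print(f"F1 Score:  {f1:.2f}")         # 0.80
```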
6. Cross-Validation
Definition:
A method to assess model performance by training and testing it on different subsets of the data.
Steps:
1. Split the data into k folds.
2. Train the model on k − 1 folds, test on the remaining fold.
3. Repeat k times, then average the performance metrics.
Formula:
\text{Cross-Validation Metric} = \frac{\sum_{i=1}^{k} \text{Metric on Fold}_i}{k}
Example:
5-fold cross-validation accuracies: [0.85, 0.88, 0.90, 0.86, 0.87].
\text{Average Accuracy} = \frac{0.85 + 0.88 + 0.90 + 0.86 + 0.87}{5} = 0.87 \ (87\%)
Graph:
Not applicable.
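A minimal sketch of k-fold cross-validation with scikit-learn; the iris dataset and decision tree classifier are stand-ins, not part of the example above:

```python
# 5-fold cross-validation and averaging of the per-fold accuracies.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # stand-in dataset
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)                                # one accuracy per fold
print("Average accuracy:", scores.mean())    # the cross-validation metric
```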
7. ROC-AUC
Definition:
ROC (Receiver Operating Characteristic) is a curve plotting the True Positive Rate (Recall) vs. the
False Positive Rate (1-Specificity). AUC measures the area under the ROC curve.
Steps:
1. Calculate Recall (TPR) and False Positive Rate (FPR) at different thresholds.
2. Plot ROC curve.
3. Calculate AUC.
Formula:
No explicit formula; AUC is the integral under the ROC curve.
Example:
Model predicts spam with probabilities [0.1, 0.8, 0.9, 0.4, 0.95].
o Threshold 0.5: Recall = 67%, FPR = 0%
o Threshold 0.8: Recall = 33%, FPR = 0%
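A minimal sketch of computing the ROC curve and AUC for the probabilities above; the true labels are assumed here, since the example does not list them:

```python
# ROC curve and AUC with scikit-learn (assumed ground-truth labels).
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 1, 1, 0, 1]              # assumed actual classes (1 = spam)
y_score = [0.1, 0.8, 0.9, 0.4, 0.95]   # predicted spam probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)   # false positive rate at each threshold
print("TPR:", tpr)   # true positive rate (recall) at each threshold
print("AUC:", auc)   # 1.0 for these assumed labels (perfect ranking)
```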
Decision Tree Algorithms
1. Decision Tree Induction
Definition
A decision tree is built by recursively selecting the attribute that best separates the classes and splitting the dataset on it, using entropy and information gain as the splitting measures.
Formulas
1. Entropy:
\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i
Where p_i is the proportion of instances belonging to class i.
2. Information Gain:
\text{Gain}(S, A) = \text{Entropy}(S) - \text{Entropy}_A
Where \text{Entropy}_A = \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v) is the weighted entropy after splitting on attribute A.
Steps
1. Compute the entropy of the dataset.
2. Calculate the information gain for each attribute.
3. Choose the attribute with the highest information gain to split the data.
4. Repeat recursively for the subsets until all records belong to the same class or no attributes
remain.
Example
Dataset: Predict whether a customer will Buy a Laptop based on Income, Student Status, and Credit Rating. The dataset has 4 records (2 Yes, 2 No); two of them are shown below:
Income | Student | Credit Rating | Buys Laptop
High   | No      | Fair          | No
Low    | No      | Excellent     | No
1. Entropy(S):
S = [2 Yes, 2 No]:
\text{Entropy}(S) = -\left(\frac{2}{4} \log_2 \frac{2}{4} + \frac{2}{4} \log_2 \frac{2}{4}\right) = 1
2. Information Gain for Income:
Split by {High, Medium, Low}:
o High: S = [0 Yes, 1 No], Entropy = 0
o Medium: S = [1 Yes, 0 No], Entropy = 0
o Low: S = [1 Yes, 1 No], Entropy = 1
Weighted Entropy:
\text{Entropy}_{\text{Income}} = \frac{1}{4}(0) + \frac{1}{4}(0) + \frac{2}{4}(1) = 0.5
Information Gain:
\text{Gain}(S, \text{Income}) = 1 - 0.5 = 0.5
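A minimal sketch of the entropy and information-gain calculation above; the Medium = [1 Yes, 0 No] and Low = [1 Yes, 1 No] counts are the ones implied by the weighted-entropy terms:

```python
# Entropy and information gain for the 4-record example above.
from math import log2

def entropy(counts):
    """Entropy of a class-count list, with 0*log2(0) treated as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """Gain = Entropy(parent) - weighted entropy of the subsets."""
    n = sum(parent_counts)
    weighted = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

S = [2, 2]                                   # [Yes, No]
income_subsets = [[0, 1], [1, 0], [1, 1]]    # High, Medium, Low

print(entropy(S))                            # 1.0
print(information_gain(S, income_subsets))   # 0.5
```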
2. ID3
Definition
ID3 (Iterative Dichotomiser 3) selects attributes based on Information Gain and splits the dataset
recursively.
Formulas
Same as Decision Tree Induction formulas for Entropy and Information Gain.
Steps
1. Calculate the entropy of the dataset and each attribute.
2. Compute the information gain for all attributes.
3. Split the dataset on the attribute with the highest information gain.
4. Repeat recursively for each subset.
Example
Using the same dataset as Decision Tree Induction, ID3 chooses the Income attribute for the first split
because it has the highest gain.
3. C4.5
Definition
C4.5 improves ID3 by using Gain Ratio and handling continuous data and missing values.
Formulas
1. Split Information:
\text{SplitInfo}(A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
2. Gain Ratio:
\text{GainRatio}(A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(A)}
Steps
1. Calculate the Information Gain for each attribute.
2. Compute the Split Information for each attribute.
3. Select the attribute with the highest Gain Ratio.
4. Handle continuous data by converting it into intervals.
Example
1. Split Info for Income:
\text{SplitInfo}(\text{Income}) = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{1}{4}\log_2\frac{1}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) \approx 1.5
2. Gain Ratio:
\text{GainRatio}(\text{Income}) = \frac{0.5}{1.5} \approx 0.333
C4.5 would choose the attribute with the highest Gain Ratio, and proceed similarly.
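A minimal sketch of the SplitInfo and Gain Ratio calculation above, using the partition sizes {1, 1, 2} for High, Medium and Low and the Gain(S, Income) = 0.5 value computed earlier:

```python
# SplitInfo and Gain Ratio for the Income split.
from math import log2

def split_info(subset_sizes):
    """SplitInfo = entropy of the partition sizes themselves."""
    n = sum(subset_sizes)
    return -sum((s / n) * log2(s / n) for s in subset_sizes if s > 0)

gain = 0.5                          # Gain(S, Income) computed earlier
si = split_info([1, 1, 2])          # sizes of the High, Medium, Low partitions
print(si)                           # 1.5
print(gain / si)                    # Gain Ratio ≈ 0.333
```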
4. CART
Definition
CART (Classification and Regression Trees) splits data using Gini Index and produces binary trees.
Formulas
1. Gini Index:
\text{Gini}(S) = 1 - \sum_{i=1}^{n} p_i^2
Where p_i is the proportion of instances in class i.
2. Split Gini:
\text{Gini}_A = \sum_{i=1}^{k} \frac{|S_i|}{|S|} \, \text{Gini}(S_i)
Steps
1. Calculate the Gini Index for the dataset.
2. Compute the Split Gini for each attribute.
3. Select the attribute with the lowest Split Gini for splitting.
4. Repeat recursively for each subset.
Example
1. Gini Index for Income (High):
\text{Gini}(\text{High}) = 1 - \left(\frac{0}{1}\right)^2 - \left(\frac{1}{1}\right)^2 = 0
2. Weighted Gini for Income:
Combine Gini values for subsets:
\text{Gini}_{\text{Income}} = \frac{1}{4}(0) + \frac{1}{4}(0) + \frac{2}{4}(0.5) = 0.25
CART would choose the attribute with the lowest Gini for splitting.
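A minimal sketch of the Gini index and weighted (split) Gini calculation above for the same four-record example:

```python
# Gini index and weighted Gini for the Income attribute.
def gini(counts):
    """Gini(S) = 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_gini(subsets):
    """Weighted Gini of the subsets produced by a split."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * gini(s) for s in subsets)

income_subsets = [[0, 1], [1, 0], [1, 1]]    # High, Medium, Low as [Yes, No]
print(gini([0, 1]))                          # 0.0  (pure subset)
print(split_gini(income_subsets))            # 0.25
```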
Summary Table
Method | Splitting Metric   | Key Formula                                                          | Example Output
ID3    | Information Gain   | \text{Gain}(S, A) = \text{Entropy}(S) - \text{Entropy}_A             | Multiway Decision Tree
C4.5   | Gain Ratio         | \text{GainRatio}(A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(A)}  | Pruned Multiway Decision Tree
CART   | Gini Index         | \text{Gini}(S) = 1 - \sum p_i^2                                      | Binary Decision Tree
Prediction: Prediction is also used to estimate unknown or missing attribute values. It relies on models such as neural networks, if-then rules, and other mechanisms to predict these attributes.
Predictive Model: First, it is used as a predictive model to classify previously unlabelled instances.
A good classification model must provide accurate predictions with a fast response time.
Descriptive Model: Second, it serves as a descriptive model to identify the characteristics that
distinguish instances from different classes. This is particularly useful for critical applications, such as
medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying
how it reaches such a decision.
Every leaf node in the decision tree is associated with a class label. The non-terminal nodes, which
include the root and internal nodes, contain attribute test conditions that are typically defined using a
single attribute. Each possible outcome of the attribute test condition is associated with exactly one
child of this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body
Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in
two child nodes.
Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we
apply its attribute test condition and follow the appropriate branch based on the outcome of the test.
This will lead us either to another internal node, for which a new attribute test condition is applied, or
to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the
test instance.
Figure traces the path used to predict the class label of a flamingo. The path terminates at a leaf node
labeled as Non-mammals.
In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single
root node that is associated with all the training instances. If a node is associated with instances from
more than one class, it is expanded using an attribute test condition that is determined using a splitting
criterion. A child leaf node is created for each outcome of the attribute test condition and the instances
associated with the parent node are distributed to the children based on the test outcomes. This node
expansion step can then be recursively applied to each child node, as long as it has labels of more than
one class. If all the instances associated with a leaf node have identical class labels, then the node is
not expanded any further. Each leaf node is assigned a class label that occurs most frequently in the
training instances associated with the node.
To illustrate how the algorithm works, consider the training set shown in Table for the loan borrower
classification problem. Suppose we apply Hunt’s algorithm to fit the training data. The tree initially
contains only a single leaf node as shown in Figure (a). This node is labelled as Defaulted = No, since
the majority of the borrowers did not default on their loan payments. The training error of this tree is
30% as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can
therefore be further expanded because it contains training instances from more than one class.
Let Home Owner be the attribute chosen to split the training instances. The justification for choosing
this attribute as the attribute test condition will be discussed later. The resulting binary split on the
Home Owner attribute is shown in Figure (b). All the training instances for which Home Owner = Yes
are propagated to the left child of the root node and the rest are propagated to the right child. Hunt’s
algorithm is then recursively applied to each child. The left child becomes a leaf node labelled
Defaulted = No, since all instances associated with this node have identical class label Defaulted =
No. The right child has instances from each class label. Hence, we split it further. The resulting
subtrees after recursively expanding the right child are shown in Figures (c) and (d). Hunt’s algorithm,
as described above, makes some simplifying assumptions that are often not true in practice.
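A minimal, self-contained sketch of Hunt's recursive tree growth for categorical attributes. The toy loan-style records are illustrative, and the Gini index is assumed as the splitting criterion (the passage above leaves the criterion unspecified):

```python
# Hunt's algorithm: grow the tree recursively, stopping when a node is pure.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose multiway split gives the lowest weighted Gini."""
    def weighted_gini(attr):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)

def hunt(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop expanding when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": majority}
    attr = best_attribute(rows, labels, attributes)
    node = {"attribute": attr, "majority": majority, "children": {}}
    # Distribute the instances to children based on the test outcome, then recurse.
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr], ([], []))
        partitions[row[attr]][0].append(row)
        partitions[row[attr]][1].append(y)
    remaining = [a for a in attributes if a != attr]
    for value, (sub_rows, sub_labels) in partitions.items():
        node["children"][value] = hunt(sub_rows, sub_labels, remaining)
    return node

# Toy loan-borrower style data (attribute values are illustrative):
rows = [{"HomeOwner": "Yes", "Marital": "Single"},
        {"HomeOwner": "No",  "Marital": "Married"},
        {"HomeOwner": "No",  "Marital": "Single"}]
labels = ["No", "No", "Yes"]          # Defaulted?
print(hunt(rows, labels, ["HomeOwner", "Marital"]))
```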
Types of attributes:
This is the first step of data preprocessing. We differentiate between the different types of attributes and then preprocess the data. The attribute types are:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes: describe categories and support no meaningful arithmetic; they include nominal, ordinal, and binary attributes.
Quantitative Attributes: are numeric and support arithmetic; they may be discrete (countable values) or continuous (real-valued measurements).
Example: hair colour (nominal), shirt size (ordinal), smoker yes/no (binary), number of children (discrete numeric), temperature (continuous numeric).
Binary Attributes: The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes: Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split.
Multiway split: The number of outcomes depends on the number of distinct values for the
corresponding attribute.
For example, if an attribute such as marital status has three distinct values—single, married, or
divorced—its test condition will produce a three-way split.
Binary split : partitioning all values taken by the nominal attribute into two groups.
For example, some decision tree algorithms, such as CART, produce only binary splits by
considering all 2^{k-1} − 1 ways of creating a binary partition of k attribute values. Figure illustrates
three different ways of grouping the attribute values for marital status into two subsets.
Ordinal Attributes:
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped
as long as the grouping does not violate the order property of the attribute values.
Figure illustrates various ways of splitting training records based on the Shirt Size attribute. The
groupings shown in Figures (a) and (b) preserve the order among the attribute values, whereas the
grouping shown in Figure (c) violates this property because it combines the attribute values Small and
Large into the same partition while Medium and Extra Large are combined into another partition.
Continuous Attributes: For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form v_i ≤ A < v_{i+1}, for i = 1, ..., k, producing a multiway split.
For the binary split, any possible value v between the minimum and maximum attribute values in the training data can be used for constructing the comparison test A < v. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions.
For the multiway split, any possible collection of attribute value ranges can be used, as long as they
are mutually exclusive and cover the entire range of attribute values between the minimum and
maximum values observed in the training set.
One approach for constructing multiway splits is to apply the discretization strategies. After
discretization, a new ordinal value is assigned to each discretized interval, and the attribute test
condition is then defined using this newly constructed ordinal attribute.
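A minimal sketch of choosing a binary split A < v for a continuous attribute by trying each distinct value as a candidate threshold; the income values and labels are illustrative, and the weighted Gini index is assumed as the scoring measure:

```python
# Evaluate candidate binary splits A < v for a continuous attribute.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    best_v, best_score = None, float("inf")
    for v in sorted(set(values)):                       # distinct values as candidates
        left  = [y for x, y in zip(values, labels) if x < v]
        right = [y for x, y in zip(values, labels) if x >= v]
        if not left or not right:
            continue                                    # skip degenerate splits
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

incomes   = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]     # illustrative values
defaulted = ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]
print(best_threshold(incomes, defaulted))   # best v and its weighted Gini
```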
Measures for Selecting an Attribute Test Condition
There are many measures that can be used to determine the goodness of an attribute test condition.
These measures try to give preference to attribute test conditions that partition the training instances
into purer subsets in the child nodes, which mostly have the same class labels. Having purer nodes is
useful since a node that has all of its training instances from the same class does not need to be
expanded further. In contrast, an impure node containing training instances from multiple classes is
likely to require several levels of node expansions, thereby increasing the depth of the tree
considerably.
\text{Entropy}(t) = -\sum_{i=1}^{c} p_i(t) \log_2 p_i(t), \quad \text{Gini}(t) = 1 - \sum_{i=1}^{c} p_i(t)^2, \quad \text{Classification error}(t) = 1 - \max_i [p_i(t)]
where p_i(t) is the relative frequency of training instances that belong to class i at node t, c is the total number of classes, and 0 \log_2 0 = 0 in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has an equal proportion of instances from multiple classes.
Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This
example, along with Figure, shows the consistency among the impurity measures, i.e., if a node N1
has lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than that
of N2. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can
still be different.
Entropy
It is used to measure the impurity or randomness of a dataset and serves as the basis for computing information gain.
Information Gain
The feature with the largest information gain should be chosen as the root node when building the decision tree.
The ID3 algorithm uses information gain for constructing the decision tree.
Gini Index
The classic CART algorithm uses the Gini Index for constructing the
decision tree.
\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j) = I(\text{parent}) - I(\text{children})
where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity measure after splitting. It can be shown that the gain is non-negative, since I(parent) ≥ I(children) for any reasonable measure such as those presented above. The higher the gain, the purer the classes in the child nodes are relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children, since I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as information gain, Δ_info.
If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes N1 and N2 is 0 and 0.490, respectively. The weighted average Gini index for the children is
\frac{3}{10}(0) + \frac{7}{10}(0.490) = 0.343
where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as the splitting attribute is 0.420 − 0.343 = 0.077.
Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is
a nominal attribute with three outcomes, there are three possible ways to group the attribute values
into a binary split. The weighted average Gini index of the children for each candidate binary split is
shown in Figure. Based on these results, Home Owner and the last binary split using Marital Status
are clearly the best candidates, since they both produce the lowest weighted average Gini index.
Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values
does not violate the ordering property of the values.
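A minimal sketch verifying the Home Owner split above, assuming the class counts implied by the quoted figures (10 borrowers of whom 3 defaulted; 3 home owners with no defaults; 7 non-owners with 3 defaults):

```python
# Weighted Gini and gain for the Home Owner split.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = gini([3, 7])                   # ≈ 0.420 (3 defaulted, 7 did not)
n1 = gini([0, 3])                       # Home Owner = Yes -> 0.0
n2 = gini([3, 4])                       # Home Owner = No  -> ≈ 0.490
weighted = 3 / 10 * n1 + 7 / 10 * n2    # ≈ 0.343
print(round(parent - weighted, 3))      # gain ≈ 0.077
```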
Gain Ratio
Gain Ratio is a modification of information gain that reduces its bias. It overcomes the problem with information gain by taking into account the number of branches that would result before making the split. It corrects information gain by taking the intrinsic information of a split into account; in other words, Gain Ratio adds a penalty to information gain.
One potential limitation of impurity measures such as entropy and Gini index is that they tend to favor
qualitative attributes with large number of distinct values. Figure shows three candidate attributes for
partitioning the data set. As previously mentioned, the attribute Marital Status is a better choice than
the attribute Home Owner, because it provides a larger information gain. However, if we compare
them against Customer ID, the latter produces the purest partitions with the maximum information
gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is
not a good attribute for splitting because it has a unique value for each instance. Even though a test
condition involving Customer ID will accurately classify every instance in the training data, we
cannot use such a test condition on new test instances with Customer ID values that haven’t been seen
before during training. This example suggests that having a low impurity value alone is insufficient to find
a good attribute test condition for a node. The number of children produced by the splitting attribute
should also be taken into consideration while deciding the best attribute test condition. There are two
ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the
difficulty of handling attributes with a varying number of partitions. This strategy is employed by
decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into
account the number of partitions produced by the attribute. For example, in the C4.5 decision tree
algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large
number of child nodes. This measure is computed as follows:
\text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}}, \quad \text{Split Info} = -\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}
where N(v_i) is the number of instances assigned to node v_i and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates whether the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then ∀i : N(v_i)/N = 1/k and the split information would be equal to log_2 k. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn reduces the gain ratio.
2. The find_best_split() function determines the attribute test condition for partitioning the training
instances associated with a node. The splitting attribute chosen depends on the impurity measure used.
The popular measures include entropy and the Gini index.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf
node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The
label assigned to the leaf node is typically the one that occurs most frequently in the training instances
that are associated with this node.
leaf.label = \operatorname{argmax}_i \, p(i|t)
where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information
needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the
probability that an instance assigned to the leaf node t belongs to class i.
4. The stopping_cond() function is used to terminate the tree-growing process by checking
whether all the instances have identical class label or attribute values. Since decision tree
classifiers employ a top-down, recursive partitioning approach for building a model, the
number of training instances associated with a node decreases as the depth of the tree
increases. As a result, a leaf node may contain too few training instances to make a
statistically significant decision about its class label. This is known as the data fragmentation
problem. One way to avoid this problem is to disallow splitting of a node when the number of
instances associated with the node falls below a certain threshold.
Tree Pruning
Pruning is a procedure that decreases the size of a decision tree. It can reduce the risk of overfitting by limiting the size of the tree or eliminating branches that provide little predictive power. Pruning helps by trimming the branches that reflect anomalies in the training data caused by noise or outliers, and it adjusts the original tree in a way that improves its generalization performance.
Pruning methods generally use statistical measures to remove the least reliable branches, frequently resulting in faster classification and an improvement in the tree's ability to correctly classify independent test data.
There are two approaches to tree pruning, which are as follows:
Prepruning Approach: tree construction is halted early, i.e., a node is not split further if the goodness of the best available split falls below a chosen threshold, and the node becomes a leaf.
Postpruning Approach: the tree is first grown fully, and subtrees are then removed and replaced by leaf nodes if doing so improves performance on separate validation data.
Confusion Matrix
A Confusion matrix is an N x N matrix used for evaluating the performance of a
classification model, where N is the number of target classes. The matrix compares
the actual target values with those predicted by the machine learning model. This
gives us a holistic view of how well our classification model is performing and
what kinds of errors it is making.
True Positive (TP) = 560; meaning 560 positive class data points were correctly
classified by the model
True Negative (TN) = 330; meaning 330 negative class data points were correctly
classified by the model
False Positive (FP) = 60; meaning 60 negative class data points were incorrectly
classified as belonging to the positive class by the model
False Negative (FN) = 50; meaning 50 positive class data points were incorrectly
classified as belonging to the negative class by the model
This turned out to be a pretty decent classifier for our dataset considering the
relatively larger number of true positive and true negative values.
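A minimal sketch computing the usual metrics from the counts quoted above:

```python
# Metrics for TP = 560, TN = 330, FP = 60, FN = 50.
TP, TN, FP, FN = 560, 330, 60, 50

accuracy   = (TP + TN) / (TP + TN + FP + FN)   # 890 / 1000 = 0.89
precision  = TP / (TP + FP)                    # ≈ 0.903
recall     = TP / (TP + FN)                    # ≈ 0.918
error_rate = (FP + FN) / (TP + TN + FP + FN)   # 0.11

print(accuracy, precision, recall, error_rate)
```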
Example
Suppose we are trying to create a model that predicts whether or not a person has a particular disease. The confusion matrix for this is given as (the counts follow from the totals listed below):

              Predicted: Yes   Predicted: No
Actual: Yes        24 (TP)          3 (FN)
Actual: No          8 (FP)         65 (TN)

o The table is given for a two-class classifier with the two predictions "Yes" and "No". Here, Yes means the patient has the disease, and No means the patient does not have the disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are
true predictions, and 11 are incorrect predictions.
o The model has given prediction "yes" for 32 times, and "No" for 68 times. Whereas
the actual "Yes" was 27, and actual "No" was 73 times.
o Misclassification rate: It is also termed the Error rate, and it defines how often the model gives wrong predictions. It is calculated as the ratio of the number of incorrect predictions to the total number of predictions made by the classifier. The formula is given below:
Error rate = (FP + FN) / (TP + TN + FP + FN)
o Precision: It is the fraction of positive predictions that are actually correct, i.e., out of all instances the model predicted as positive, how many were actually true. It can be calculated using the formula below:
Precision = TP / (TP + FP)
o Recall: It is the fraction of the actual positive instances that the model predicts correctly. The recall should be as high as possible. It is calculated as:
Recall = TP / (TP + FN)
Recall vs Precision
Recall is the number of relevant documents retrieved by a search divided by the total number
of existing relevant documents, while precision is the number of relevant documents retrieved
by a search divided by the total number of documents retrieved by that search.
o F-measure: If two models have low precision and high recall or vice versa, it is difficult to compare them. For this purpose we can use the F-score, which evaluates recall and precision at the same time. The F-score is maximum when recall equals precision. It can be calculated using the formula below:
F-measure = (2 × Recall × Precision) / (Recall + Precision)
What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the
performance of a model on unseen data. It involves dividing the available
data into multiple folds or subsets, using one of these folds as a validation
set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation
set. Finally, the results from each validation step are averaged to produce
a more robust estimate of the model’s performance. Cross validation is an
important step in the machine learning process and helps to ensure that
the model selected for deployment is robust and generalizes well to new
data.
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold
cross validation, leave-one-out cross validation, and Holdout
validation, Stratified Cross-Validation. The choice of technique
depends on the size and nature of the data, as well as the specific
requirements of the modeling problem.
1. Holdout Validation
In Holdout Validation, we perform training on 50% of the given dataset and the remaining 50% is used for testing. It is a simple and quick way to evaluate a model. The major drawback of this method is that, because we train on only 50% of the dataset, it is possible that the remaining 50% contains important information that the model never sees, i.e., higher bias.
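A minimal sketch of holdout validation with a 50/50 split; the iris dataset and decision tree classifier are stand-ins:

```python
# Holdout validation: 50% train, 50% test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```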
2. LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole dataset but leave out only one data point, and we iterate this for each data point. In LOOCV, the model is trained on n − 1 samples and tested on the one omitted sample, repeating this process for each data point in the dataset. It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, and hence it has low bias.
The major drawback is that it leads to higher variation in the testing estimate, because we test against a single data point each time; if that data point is an outlier, it can lead to higher variation. Another drawback is that it takes a lot of execution time, as it iterates n times (once per data point).
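A minimal sketch of LOOCV with scikit-learn (same stand-in dataset and model as before):

```python
# Leave-one-out cross-validation: train on n-1 samples, test on the held-out one.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())   # average over n single-sample tests
```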
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the
cross-validation process maintains the same class distribution as the
entire dataset. This is particularly important when dealing with imbalanced
datasets, where certain classes may be underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
2. During each iteration, one-fold is used for testing, and the remaining
folds are used for training.
3. The process is repeated k times, with each fold serving as the test set
exactly once.
Stratified Cross-Validation is essential when dealing with classification
problems where maintaining the balance of class distribution is crucial for
the model to generalize well to unseen data.
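A minimal sketch of stratified 5-fold cross-validation with scikit-learn (same stand-in dataset and model):

```python
# Stratified k-fold: each fold preserves the class proportions of the full dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
print("Per-fold accuracy:", scores)
```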
4. K-Fold Cross Validation
In K-Fold Cross Validation, we split the dataset into k subsets (known as folds), then train on k − 1 of the subsets and leave one subset out for evaluating the trained model. We iterate k times, with a different subset reserved for testing each time.
Note: A value of k = 10 is commonly suggested, since a lower value of k behaves more like simple holdout validation (higher bias), while a higher value of k approaches the LOOCV method.
Example of K Fold Cross Validation
The listing below shows the training and evaluation subsets generated in k-fold cross-validation. Here we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances 0-4 for testing and 5-24 for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances 5-9 for testing and 0-4 plus 10-24 for training), and so on.
Total instances: 25
Value of k: 5

Iteration | Training set observations                                   | Testing set observations
1         | [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]   | [0 1 2 3 4]
2         | [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]   | [5 6 7 8 9]
3         | [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24]        | [10 11 12 13 14]
4         | [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24]        | [15 16 17 18 19]
5         | [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]        | [20 21 22 23 24]
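A minimal sketch that reproduces the fold listing above using scikit-learn's KFold (no shuffling, so the test folds are consecutive blocks):

```python
# Generate the train/test index splits for 25 instances with k = 5.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)          # 25 instances, identified by index 0..24
kf = KFold(n_splits=5)     # no shuffling -> consecutive test blocks

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: test {list(test_idx)}")
```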
1. The first step of the algorithm is the selection of the attributes that will
become nodes of the decision tree. As already discussed, there are two measures, entropy and information gain, that are used as the basis for attribute selection.
2. Once the attribute is selected for the current node, the child nodes are
generated, one for each possible value of the selected attribute.
3. The examples are then partitioned using the possible values of this attribute
and these partitioned subsets of the examples are assigned to the appropriate child
node.
4. Steps 1 to 3 are repeated for each child node until all examples associated with a node are either all positive or all negative.