
UNIT-3

Classification

1. Confusion Matrix
Definition:
A confusion matrix is a table summarizing the performance of a classification model. It shows counts
for True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Steps:
1. Predict outcomes for all instances.
2. Compare predicted and actual values.
3. Fill the confusion matrix with counts of TP, TN, FP, FN.
Formula:
No specific formula. It is directly tabulated based on comparisons.
Example:

Predicted: No (0) Predicted: Yes (1)

Actual: No (0) 2 (TN) 0 (FP)

Actual: Yes (1) 1 (FN) 2 (TP)

Graph:
Not applicable.
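The counts above can be reproduced in a few lines; a minimal sketch using scikit-learn, where the label vectors are illustrative assumptions chosen to yield 2 TN, 0 FP, 1 FN, and 2 TP:

from sklearn.metrics import confusion_matrix

# Labels chosen to match the example matrix above (2 TN, 0 FP, 1 FN, 2 TP);
# the specific vectors are illustrative assumptions.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1]

# With labels ordered [0, 1], rows are actual classes and columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 0 1 2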

2. Accuracy
Definition:
The proportion of correct predictions over total predictions.
Steps:
1. Count TP and TN.
2. Add FP and FN for the total predictions.
3. Divide correct predictions by total.
Formula:
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
Example:
Using the confusion matrix above:
\text{Accuracy} = \frac{2 + 2}{2 + 2 + 0 + 1} = 0.8 \ (80\%)
Graph:
Not applicable.

3. Precision
Definition:
The proportion of true positive predictions among all positive predictions.
Steps:
1. Count TP and FP.
2. Divide TP by (TP + FP).
Formula:
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
Example:
\text{Precision} = \frac{2}{2 + 0} = 1.0 \ (100\%)
Graph:
Not applicable.

4. Recall (Sensitivity)
Definition:
The proportion of true positive predictions among all actual positive instances.
Steps:
1. Count TP and FN.
2. Divide TP by (TP + FN).
Formula:
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
Example:
\text{Recall} = \frac{2}{2 + 1} \approx 0.67 \ (67\%)
Graph:
Not applicable.

5. F1 Score
Definition:
The harmonic mean of Precision and Recall, balancing the two metrics.
Steps:
1. Calculate Precision and Recall.
2. Use the F1 Score formula.
Formula:
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
Example:
\text{F1 Score} = 2 \times \frac{1.0 \times 0.67}{1.0 + 0.67} = 0.8 \ (80\%)
Graph:
Not applicable.
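The same numbers can be obtained from scikit-learn; a minimal sketch reusing the illustrative labels from the confusion matrix example:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1]   # illustrative labels matching the confusion matrix above
y_pred = [0, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))    # 0.8
print(precision_score(y_true, y_pred))   # 1.0
print(recall_score(y_true, y_pred))      # 0.666...
print(f1_score(y_true, y_pred))          # 0.8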

6. Cross-Validation
Definition:
A method to assess model performance by training and testing it on different subsets of the data.
Steps:
1. Split the data into k folds.
2. Train the model on k−1 folds, test on the remaining fold.
3. Repeat k times, then average the performance metrics.
Formula:
\text{Cross-Validation Metric} = \frac{1}{k} \sum_{i=1}^{k} \text{Metric on Fold}_i
Example:
5-fold cross-validation accuracies: [0.85, 0.88, 0.90, 0.86, 0.87].
\text{Average Accuracy} = \frac{0.85 + 0.88 + 0.90 + 0.86 + 0.87}{5} = 0.87 \ (87\%)
Graph:
Not applicable.
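In practice, the folding, training, and averaging are usually delegated to a library; a minimal sketch using scikit-learn's cross_val_score on the built-in iris data (the dataset and classifier are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one accuracy score per fold, then the average
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)
print(scores.mean())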

7. ROC-AUC
Definition:
ROC (Receiver Operating Characteristic) is a curve plotting the True Positive Rate (Recall) vs. the
False Positive Rate (1-Specificity). AUC measures the area under the ROC curve.
Steps:
1. Calculate Recall (TPR) and False Positive Rate (FPR) at different thresholds.
2. Plot ROC curve.
3. Calculate AUC.
Formula:
No explicit formula; AUC is the integral under the ROC curve.
Example:
Model predicts spam with probabilities [0.1, 0.8, 0.9, 0.4, 0.95].
 Threshold 0.5: Recall = 67%, FPR = 0%
 Threshold 0.8: Recall = 33%, FPR = 0%
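The example gives only the predicted probabilities, so the true labels below are hypothetical, added only so the snippet runs; a minimal sketch using scikit-learn to trace the ROC curve and compute AUC:

from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 1, 1, 0, 1]                 # hypothetical labels (not given in the example)
y_score = [0.1, 0.8, 0.9, 0.4, 0.95]      # predicted spam probabilities from the example

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, tpr, fpr)))    # TPR (recall) and FPR at each threshold
print(roc_auc_score(y_true, y_score))     # 1.0 for these perfectly separated toy labels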

The following sections summarize the main decision tree induction methods (Decision Tree Induction, ID3, C4.5, and CART) in the same format of definitions, formulas, steps, and examples.

1. Decision Tree Induction


Definition
Decision Tree Induction is a supervised learning technique that splits the dataset recursively using
attributes that maximize the reduction in impurity (e.g., entropy or Gini index).

Formulas
1. Entropy:
\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i
where p_i is the proportion of instances belonging to class i.
2. Information Gain:

\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in A} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)
where S_v is the subset of S for which attribute A = v.

Steps
1. Compute the entropy of the dataset.
2. Calculate the information gain for each attribute.
3. Choose the attribute with the highest information gain to split the data.
4. Repeat recursively for the subsets until all records belong to the same class or no attributes
remain.

Example
Dataset: Predict whether a customer will Buy a Laptop based on Income, Student Status, and
Credit Rating.

Income Student Credit Rating Buys Laptop?

High No Fair No

Medium Yes Fair Yes

Low Yes Excellent Yes

Low No Excellent No

1. Entropy(S):
S = [2 Yes, 2 No]:
\text{Entropy}(S) = -\left(\frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) = 1
2. Information Gain for Income:
Split by {High, Medium, Low}:
o High: S = [0 Yes, 1 No], Entropy = 0
o Medium: S = [1 Yes, 0 No], Entropy = 0
o Low: S = [1 Yes, 1 No], Entropy = 1

Weighted Entropy:
\text{Entropy}_{\text{Income}} = \frac{1}{4}(0) + \frac{1}{4}(0) + \frac{2}{4}(1) = 0.5
Information Gain:
\text{Gain}(S, \text{Income}) = 1 - 0.5 = 0.5
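The entropy and information gain above can be verified with a short script; a minimal Python sketch over the four-row example, keeping only the Income attribute and the class label:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels (0 * log2(0) never arises: counts are > 0)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (Income, Buys Laptop?) pairs from the example dataset
rows = [("High", "No"), ("Medium", "Yes"), ("Low", "Yes"), ("Low", "No")]
labels = [y for _, y in rows]

def info_gain(rows, labels):
    n = len(labels)
    gain = entropy(labels)
    for value in {v for v, _ in rows}:
        subset = [y for v, y in rows if v == value]
        gain -= (len(subset) / n) * entropy(subset)   # subtract weighted child entropy
    return gain

print(entropy(labels))          # 1.0
print(info_gain(rows, labels))  # 0.5, matching the hand calculation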

2. ID3
Definition
ID3 (Iterative Dichotomiser 3) selects attributes based on Information Gain and splits the dataset
recursively.

Formulas
Same as Decision Tree Induction formulas for Entropy and Information Gain.
Steps
1. Calculate the entropy of the dataset and each attribute.
2. Compute the information gain for all attributes.
3. Split the dataset on the attribute with the highest information gain.
4. Repeat recursively for each subset.

Example
Using the same dataset as Decision Tree Induction, ID3 chooses the Income attribute for the first split
because it has the highest gain.

3. C4.5
Definition
C4.5 improves ID3 by using Gain Ratio and handling continuous data and missing values.

Formulas
1. Split Information:
\text{SplitInfo}(A) = -\sum_{v \in A} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
2. Gain Ratio:
\text{GainRatio}(A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(A)}

Steps
1. Calculate the Information Gain for each attribute.
2. Compute the Split Information for each attribute.
3. Select the attribute with the highest Gain Ratio.
4. Handle continuous data by converting it into intervals.

Example
1. Split Info for Income:
\text{SplitInfo}(\text{Income}) = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{1}{4}\log_2\frac{1}{4} + \frac{2}{4}\log_2\frac{2}{4}\right) = 1.5
2. Gain Ratio:
\text{GainRatio}(\text{Income}) = \frac{0.5}{1.5} \approx 0.333
C4.5 would choose the attribute with the highest Gain Ratio, and proceed similarly.
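The same arithmetic can be checked with a short script; a minimal Python sketch, reusing the information gain of 0.5 computed earlier:

import math

def split_info(sizes):
    # Split information for a partition, given the subset sizes
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes)

info_gain = 0.5                      # Gain(S, Income) from the earlier example
si = split_info([1, 1, 2])           # Income splits 4 records into subsets of size 1, 1, 2
print(si)                            # 1.5
print(info_gain / si)                # gain ratio ~= 0.333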

4. CART
Definition
CART (Classification and Regression Trees) splits data using Gini Index and produces binary trees.

Formulas
1. Gini Index:
\text{Gini}(S) = 1 - \sum_{i=1}^{n} p_i^2
where p_i is the proportion of instances in class i.
2. Split Gini:
\text{Gini}_A = \sum_{v \in A} \frac{|S_v|}{|S|} \, \text{Gini}(S_v)

Steps
1. Calculate the Gini Index for the dataset.
2. Compute the Split Gini for each attribute.
3. Select the attribute with the lowest Split Gini for splitting.
4. Repeat recursively for each subset.

Example
1. Gini Index for Income (High):
\text{Gini}(\text{High}) = 1 - \left(\frac{0}{1}\right)^2 - \left(\frac{1}{1}\right)^2 = 0
2. Weighted Gini for Income:
Combine the Gini values for the subsets:
\text{Gini}_{\text{Income}} = \frac{1}{4}(0) + \frac{1}{4}(0) + \frac{2}{4}(0.5) = 0.25
CART would choose the attribute with the lowest Gini for splitting.
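A minimal Python sketch of the weighted Gini computation for the same Income split:

def gini(labels):
    # Gini index of a list of class labels
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Class labels in each Income subset of the laptop example: High, Medium, Low
subsets = [["No"], ["Yes"], ["Yes", "No"]]
n = sum(len(s) for s in subsets)

weighted = sum(len(s) / n * gini(s) for s in subsets)
print(weighted)   # 0.25, matching the hand calculation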

Summary Table

Method   Splitting Metric    Key Formula                                                          Example Output
ID3      Information Gain    \text{Gain}(S, A) = \text{Entropy}(S) - \text{Entropy}_A             Multiway Decision Tree
C4.5     Gain Ratio          \text{GainRatio}(A) = \frac{\text{Gain}(S, A)}{\text{SplitInfo}(A)}  Pruned Multiway Decision Tree
CART     Gini Index          \text{Gini}(S) = 1 - \sum p_i^2                                      Binary Decision Tree


Classification: Basic Concepts, General Approach to solving a classification problem,


Decision Tree Induction: Working of Decision Tree, building a decision tree, methods for
expressing an attribute test conditions, measures for selecting the best split, Algorithm for
decision tree induction.
Basic Concepts:
What is classification? What is prediction?
Classification: It is the task of learning a target function f that maps each attribute set X
to one of the predefined class labels Y.
 In data mining, classification is used for prediction (future analysis): unknown attribute values are determined using classifier algorithms and decision trees.
 It constructs models (such as decision trees) that are then used to classify instances.
 Classification can work on both categorical and numerical attributes.

Prediction: Prediction is also used to determine unknown or missing values.
 It likewise uses models in order to predict attribute values.
 Models include neural networks, if-then rules, and other mechanisms.

Classification and prediction are used in applications such as


 Credit approval
 Target marketing
 Medical diagnosis

Classification—A Two-Step Process


• Model construction: describing a set of predetermined classes
1. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
2. The set of tuples used for model construction: training set
3. The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects – Estimate accuracy of the model
1. The known label of test sample is compared with the classified result from the model
2. Accuracy rate is the percentage of test set samples that are correctly classified by the model
3. Test set is independent of training set, otherwise over-fitting will occur

Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning :


Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations.
– New data is classified based on the training set.

Unsupervised learning (clustering)


– The class labels of the training data are unknown.
– Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data.

A classification model serves two important roles in data mining.

Predictive Model: First, it is used as a predictive model to classify previously unlabelled instances.
A good classification model must provide accurate predictions with a fast response time.

Descriptive Model: Second, it serves as a descriptive model to identify the characteristics that
distinguish instances from different classes. This is particularly useful for critical applications, such as
medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying
how it reaches such a decision.

General Approach To Solve Classification Problem


Classification is the task of assigning labels to unlabelled data instances and a classifier is used to
perform such a task. The model is created using a given set of instances, known as the training set,
which contains attribute values as well as class labels for each instance.
The systematic approach for learning a classification model given a training set is known as a learning
algorithm. The process of using a learning algorithm to build a classification model from the training
data is known as induction. This process is also often described as “learning a model” or “building a
model.” This process of applying a classification model on unseen test instances to predict their class
labels is known as deduction.
Thus, the process of classification involves two steps: applying a learning algorithm to training data
to learn a model, and then applying the model to assign labels to unlabelled instances.
Confusion Matrix:
The performance of a model (classifier) can be evaluated by comparing the predicted labels against
the true labels of instances. This information can be summarized in a table called a confusion matrix.
Table depicts the confusion matrix for a binary classification problem. Each entry fij denotes the
number of instances from class i predicted to be of class j.
For example, f01 is the number of instances from class 0 incorrectly predicted as class 1. The
number of correct predictions made by the model is (f11 + f00) and the number of incorrect
predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a classification
model performs, summarizing this information into a single number makes it more convenient to
compare the relative performance of different models. This can be done using an evaluation metric
such as accuracy, which is computed in the following way:

For binary classification problems, the accuracy of a model is given by

\text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}

Error rate is defined as follows for binary classification problems:

\text{Error rate} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}

Decision Tree Classifier


To illustrate how a decision tree works, consider the classification problem of distinguishing
mammals from non-mammals using the vertebrate data set.
Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-
mammal? One approach is to pose a series of questions about the characteristics of the species. The
first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it
is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to
ask a follow-up question: Do the females of the species give birth to their young? Those that do give
birth are definitely mammals, while those that do not are likely to be nonmammals (with the exception
of egg-laying mammals such as the platypus and spiny anteater).
The previous example illustrates how we can solve a classification problem by asking a series of
carefully crafted questions about the attributes of the test instance. Each time we receive an answer,
we could ask a follow-up question until we can conclusively decide on its class label. The series of
questions and their possible answers can be organized into a hierarchical structure called a decision
tree.
The tree has three types of nodes:
• A root node, with no incoming links and zero or more outgoing links.
• Internal nodes, each of which has exactly one incoming link and two or more outgoing links.
• Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing links.
Every leaf node in the decision tree is associated with a class label. The non-terminal nodes, which
include the root and internal nodes, contain attribute test conditions that are typically defined using a
single attribute. Each possible outcome of the attribute test condition is associated with exactly one
child of this node.

For example, the root node of the tree shown in Figure 3.4 uses the attribute Body
Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in
two child nodes.
Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we
apply its attribute test condition and follow the appropriate branch based on the outcome of the test.
This will lead us either to another internal node, for which a new attribute test condition is applied, or
to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the
test instance.
Figure traces the path used to predict the class label of a flamingo. The path terminates at a leaf node
labeled as Non-mammals.

A Basic Algorithm to Build a Decision Tree


Hunt’s algorithm, which is the basis for many current implementations of decision tree classifiers,
including ID3, C4.5, and CART

In Hunt’s algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single
root node that is associated with all the training instances. If a node is associated with instances from
more than one class, it is expanded using an attribute test condition that is determined using a splitting
criterion. A child leaf node is created for each outcome of the attribute test condition and the instances
associated with the parent node are distributed to the children based on the test outcomes. This node
expansion step can then be recursively applied to each child node, as long as it has labels of more than
one class. If all the instances associated with a leaf node have identical class labels, then the node is
not expanded any further. Each leaf node is assigned a class label that occurs most frequently in the
training instances associated with the node.

To illustrate how the algorithm works, consider the training set shown in Table for the loan borrower
classification problem. Suppose we apply Hunt’s algorithm to fit the training data. The tree initially
contains only a single leaf node as shown in Figure (a). This node is labelled as Defaulted = No, since
the majority of the borrowers did not default on their loan payments. The training error of this tree is
30% as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can
therefore be further expanded because it contains training instances from more than one class.
Let Home Owner be the attribute chosen to split the training instances. The justification for choosing
this attribute as the attribute test condition will be discussed later. The resulting binary split on the
Home Owner attribute is shown in Figure (b). All the training instances for which Home Owner = Yes
are propagated to the left child of the root node and the rest are propagated to the right child. Hunt’s
algorithm is then recursively applied to each child. The left child becomes a leaf node labelled
Defaulted = No, since all instances associated with this node have identical class label Defaulted =
No. The right child has instances from each class label. Hence, we split it further. The resulting
subtrees after recursively expanding the right child are shown in Figures (c) and (d). Hunt’s algorithm,
as described above, makes some simplifying assumptions that are often not true in practice.

Assumptions and the ways to handle them


1. Some of the child nodes created in Hunt’s algorithm can be empty if none of the training instances
have the particular attribute values. One way to handle this is by declaring each of them as a leaf node
with a class label that occurs most frequently among the training instances associated with their parent
nodes.
2. If all training instances associated with a node have identical attribute values but different class
labels, it is not possible to expand this node any further. One way to handle this case is to declare it a
leaf node and assign it the class label that occurs most frequently in the training instances associated
with this node.
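In practice, a tree like the one Hunt's algorithm grows on the loan data can also be produced with an off-the-shelf learner; a minimal sketch using scikit-learn, where the encoded feature values are assumptions made for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy loan-borrower data in the spirit of the running example (values assumed):
# features are [Home Owner (1 = Yes), Annual Income in $1000s], target is Defaulted
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95],
     [0, 60], [1, 220], [0, 85], [0, 75], [0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["HomeOwner", "AnnualIncome"]))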

Design Issues of Decision Tree Induction


Hunt’s algorithm is a generic procedure for growing decision trees in a greedy fashion. To implement
the algorithm, there are two key design issues that must be addressed.
1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition
the training instances associated with a node into smaller subsets associated with its child nodes. The
splitting criterion determines which attribute is chosen as the test condition and how the training
instances should be distributed to the child nodes.
2. What is the stopping criterion? The basic algorithm stops expanding a node only when all the
training instances associated with the node have the same class labels or have identical attribute
values. Although these conditions are sufficient, there are reasons to stop expanding a node much
earlier even if the leaf node contains training instances from more than one class. This process is
called early termination and the condition used to determine when a node should be stopped from
expanding is called a stopping criterion

Types of attributes:
Identifying attribute types is the first step of data preprocessing. We differentiate
between the different types of attributes and then preprocess the data accordingly.
The attribute types are described below.
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)
Qualitative Attributes:

1. Nominal Attributes – related to names: The values of a
nominal attribute are names of things or symbols. They
represent some category or state, which is
why nominal attributes are also referred to as categorical attributes;
there is no order (rank, position) among the values of a nominal
attribute.
Example :
2. Binary Attributes: Binary data have only two values or states, for
example yes or no, affected or unaffected, true or false.
 Symmetric: Both values are equally important (Gender).
 Asymmetric: Both values are not equally important (Result).

3. Ordinal Attributes: Ordinal attributes contain values that
have a meaningful sequence or ranking (order) between them, but
the magnitude between values is not actually known; the order of
the values shows what is more important but does not indicate by how
much.

Quantitative Attributes:

1. Numeric: A numeric attribute is quantitative because it is a
measurable quantity, represented by integer or real values.
Numeric attributes are of two types: interval and ratio.
 An interval-scaled attribute has values whose differences are
interpretable, but it lacks a true zero point. Values on an interval
scale can be added and subtracted but cannot be meaningfully
multiplied or divided. Consider temperature in degrees
Centigrade: if one day's temperature is twice that of another in
degrees, we cannot say that one day is twice as hot as the other.
 A ratio-scaled attribute is a numeric attribute with a fixed zero
point. If a measurement is ratio-scaled, a value can be described as
a multiple (or ratio) of another value. The values are
ordered, differences between values can be computed,
and the mean, median, mode, quantile range, and five-number
summary can be given.

2. Discrete: Discrete data have finite values; they can be numerical
and can also be in categorical form. These attributes have a finite or
countably infinite set of values.

Example:

3. Continuous: Continuous data have an infinite number of possible values.
Continuous data are typically represented as floating-point values; there can
be many values between 2 and 3.
Example :

Methods for Expressing Attribute Test Conditions


Decision tree induction algorithms must provide a method for expressing an attribute test condition
and its corresponding outcomes for different attribute types.

Binary Attributes: The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes : Since a nominal attribute can have many values, its attribute test
condition can be expressed in two ways, as a multiway split or a binary split.

Multiway split: The number of outcomes depends on the number of distinct values for the
corresponding attribute.

For example, if an attribute such as marital status has three distinct values—single, married, or
divorced—its test condition will produce a three-way split.

Binary split : partitioning all values taken by the nominal attribute into two groups.
For example, some decision tree algorithms, such as CART, produce only binary splits by
considering all 2^{k-1} − 1 ways of creating a binary partition of k attribute values. Figure illustrates
three different ways of grouping the attribute values for marital status into two subsets.

Ordinal Attributes :
Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped
as long as the grouping does not violate the order property of the attribute values.
Figure illustrates various ways of splitting training records based on the Shirt Size attribute. The
groupings shown in Figures (a) and (b) preserve the order among the attribute values, whereas the
grouping shown in Figure c) violates this property because it combines the attribute values Small and
Large into the same partition while Medium and Extra Large are combined into another partition.

Continuous Attributes For continuous attributes, the attribute test condition can be expressed as a
comparison test (e.g., A < v) producing a binary split, or as a range query of the form v_i ≤ A < v_{i+1}, for
i = 1, ..., k, producing a multiway split.
For the binary split, any possible value v between the minimum and maximum attribute values in the
training data can be used for constructing the comparison test A < v. However, it is sufficient to only
consider distinct attribute values in the training set as candidate split positions.

For the multiway split, any possible collection of attribute value ranges can be used, as long as they
are mutually exclusive and cover the entire range of attribute values between the minimum and
maximum values observed in the training set.
One approach for constructing multiway splits is to apply the discretization strategies. After
discretization, a new ordinal value is assigned to each discretized interval, and the attribute test
condition is then defined using this newly constructed ordinal attribute.
Measures for Selecting an Attribute Test Condition
There are many measures that can be used to determine the goodness of an attribute test condition.
These measures try to give preference to attribute test conditions that partition the training instances
into purer subsets in the child nodes, which mostly have the same class labels. Having purer nodes is
useful since a node that has all of its training instances from the same class does not need to be
expanded further. In contrast, an impure node containing training instances from multiple classes is
likely to require several levels of node expansions, thereby increasing the depth of the tree
considerably.

Impurity Measure for a Single Node


The impurity of a node measures how dissimilar the class labels are for the data instances belonging
to a common node. Following are examples of measures that can be used to evaluate the impurity of a
node t:

\text{Entropy}(t) = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)
\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} [p_i(t)]^2
\text{Classification error}(t) = 1 - \max_i [p_i(t)]

where p_i(t) is the relative frequency of training instances that belong to class i at node t, c is the total
number of classes, and 0log2 0 = 0 in entropy calculations. All three measures give a zero impurity
value if a node contains instances from a single class and maximum impurity if the node has equal
proportion of instances from multiple classes.
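A minimal sketch of the three impurity measures as Python functions, evaluated on a few assumed class distributions:

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)     # 0 * log2(0) treated as 0

def gini(p):
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    return 1.0 - max(p)

# Class distributions at three hypothetical nodes (proportions assumed for illustration)
for dist in [(0.0, 1.0), (0.1, 0.9), (0.5, 0.5)]:
    print(dist, entropy(dist), gini(dist), classification_error(dist))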

Comparison among the impurity measures for binary classification problems.


Since there are only two classes, p0(t)+p1(t) = 1. The horizontal axis p refers to the fraction of
instances that belong to one of the two classes. Observe that all three measures attain their maximum
value when the class distribution is uniform (i.e., p0(t)=p1(t)= 0.5) and minimum value when all the
instances belong to a single class (i.e., either p_0(t) or p_1(t) equals 1). The following examples
illustrate how the values of the impurity measures vary as we alter the class distribution.

Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This
example, along with Figure, shows the consistency among the impurity measures, i.e., if a node N1
has lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than that
of N2. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can
still be different.

Collective Impurity of Child Nodes:


Consider an attribute test condition that splits a node containing N training instances into k children,
{v1,v2,···,vk}, where every child node represents a partition of the data resulting from one of the k
outcomes of the attribute test condition. Let N(vj) be the number of training instances associated with
a child node vj, whose impurity value is I(vj). Since a training instance in the parent node reaches
node vj for a fraction of N(vj)/N times, the collective impurity of the child nodes can be computed by
taking a weighted sum of the impurities of the child nodes, as follows:

I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N} \, I(v_j)
Example: [Weighted Entropy]
Consider the candidate attribute test condition shown in Figures (a) and (b) for the loan borrower
classification problem. Splitting on the Home Owner attribute will generate two child nodes

Entropy
It is used to measure the impurity or randomness of a dataset and serves as the
baseline for the information gain calculation.

Information Gain

 The feature with the largest information gain should become the root node
when building the decision tree.

The ID3 algorithm uses information gain for constructing the decision tree.

Gini Index

It is calculated by subtracting the sum of squared probabilities of each
class from one. It favors larger partitions and is easy to implement,
whereas information gain favors smaller partitions with distinct values.
A feature with a lower Gini index is chosen for a split.
The classic CART algorithm uses the Gini index for constructing the
decision tree.

Identifying the best attribute test condition


To determine the goodness of an attribute test condition, we need to compare the degree of impurity
of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after
splitting). The larger their difference, the better the test condition. This difference, Δ, also termed as
the gain in purity of an attribute test condition, can be defined as follows:

\Delta = I(\text{parent}) - I(\text{children})

where I(parent) is the impurity of a node before splitting and I(children) is the weighted impurity
measure after splitting. It can be shown that the gain is non-negative since I(parent) ≥ I(children) for
any reasonable measure such as those presented above. The higher the gain, the purer are the classes
in the child nodes relative to the parent node. The splitting criterion in the decision tree learning
algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the
gain at a given node is equivalent to minimizing the weighted impurity measure of its children since
I(parent) is the same for all candidate attribute test conditions. Finally, when entropy is used as the
impurity measure, the difference in entropy is commonly known as information gain, Δinfo.

Splitting of Qualitative Attributes


Consider the first two candidate splits shown in Figure involving qualitative attributes Home Owner
and Marital Status. The initial class distribution at the parent node is (0.3,0.7), since there are 3
instances of class Yes and 7 instances of class No in the training data. Thus, the entropy of the parent node is

\text{Entropy}(\text{parent}) = -0.3 \log_2 0.3 - 0.7 \log_2 0.7 = 0.881
Binary Splitting of Qualitative Attributes
Consider building a decision tree using only binary splits and the Gini index as the impurity measure.
Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital
Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid
their loan (see Table in Figure 3.13), the Gini index of the parent node before splitting is

\text{Gini}(\text{parent}) = 1 - 0.3^2 - 0.7^2 = 0.420

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes N1 and N2 are 0
and 0.490, respectively. The weighted average Gini index for the children is

\frac{3}{10} \times 0 + \frac{7}{10} \times 0.490 = 0.343

where the weights represent the proportion of training instances assigned to each child. The gain using
Home Owner as splitting attribute is 0.420 - 0.343 = 0.077.
Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is
a nominal attribute with three outcomes, there are three possible ways to group the attribute values
into a binary split. The weighted average Gini index of the children for each candidate binary split is
shown in Figure. Based on these results, Home Owner and the last binary split using Marital Status
are clearly the best candidates, since they both produce the lowest weighted average Gini index.
Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values
does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes


Consider the problem of identifying the best binary split Annual Income ≤ τ for the preceding loan
approval classification problem. As discussed previously,
even though τ can take any value between the minimum and maximum values of annual income in the
training set, it is sufficient to only consider the annual income values observed in the training set as
candidate split positions. For each candidate τ, the training set is scanned once to count the number of
borrowers with annual income less than or greater than τ along with their class proportions. We can
then compute the Gini index at each candidate split position and choose the τ that produces the lowest
value. Computing the Gini index at each candidate split position requires O(N) operations, where N is
the number of training instances. Since there are at most N possible candidates, the overall complexity
of this brute-force method is O(N^2). It is possible to reduce the complexity of this problem to
O(N log N) by using a method described as follows (see illustration in Figure 3.14). In this method, we
first sort the training instances based on their annual income, a one-time cost that requires O(N log N)
operations. The candidate split positions are given by the midpoints between every two adjacent
sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the
instances has an annual income less than or equal to $55,000, the Gini index for the child node with
Annual Income < $55,000 is equal to zero. In contrast, there are 3 training instances of class Yes and
7 instances of class No with annual income greater than $55,000. The Gini index for this node is
0.420. The weighted average Gini index for the first candidate split position, τ = $55,000, is equal to
0×0+1×0.420 = 0.420. For the next candidate, τ = $65,000, the class distribution of its child nodes can
be obtained with a simple update of the distribution for the previous candidate. This is because, as τ
increases from $55,000 to $65,000, there is only one training instance affected by the change. By
examining the class label of the affected training instance, the new class distribution is obtained. For
example, as τ increases to $65,000, there is only one borrower in the training set, with an annual
income of $60,000, affected by this change. Since the class label for the borrower is No, the count for
class No increases from 0 to 1 (for Annual Income ≤ $65,000) and decreases from 7 to 6 (for Annual
Income > $65,000), as shown in Figure 3.14. The distribution for the Yes class remains unaffected.
The updated Gini index for this candidate split position is 0.400. This procedure is repeated until the
Gini index for all candidates are found. The best split position corresponds to the one that produces
the lowest Gini index, which occurs at τ = $97,500. Since the Gini index at each candidate split
position can be computed in O(1) time, the complexity of finding the best split position is O(N) once
all the values are kept sorted, a one-time operation that takes O(N log N) time. The overall complexity
of this method is thus O(N log N), which is much smaller than the O(N^2) time taken by the brute-force
method. The amount of computation can be further reduced by considering only candidate split
positions located between two adjacent sorted instances with different class labels. For example, we
do not need to consider candidate split positions located between $60,000 and $75,000 because all
three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class
labels. Choosing a split position within this range only increases the degree of impurity, compared to
a split position located outside this range. Therefore, the candidate split positions at τ = $65,000 and τ
= $72,500 can be ignored. Similarly, we do not need to consider the candidate split positions at
$87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent
instances with the same labels. This strategy reduces the number of candidate split positions to
consider from 9 to 2 (excluding the two boundary cases τ = $55,000 and τ = $230,000).
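The incremental scan described above can be sketched in a few lines; a rough Python version, where the income values and class labels are assumed to match the running loan example:

def gini_from_counts(counts):
    # Gini index from a dict mapping class -> count
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_split(values, labels):
    # Scan midpoints between adjacent sorted values, updating class counts
    # incrementally, and return the threshold with the lowest weighted Gini.
    data = sorted(zip(values, labels))
    classes = set(labels)
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n = len(labels)
    best_tau, best_gini = None, float("inf")
    for i in range(n - 1):
        v, c = data[i]
        left[c] += 1                       # move one instance from right to left
        right[c] -= 1
        if v == data[i + 1][0]:
            continue                       # no valid split between identical values
        tau = (v + data[i + 1][0]) / 2
        w = (i + 1) / n * gini_from_counts(left) + (n - i - 1) / n * gini_from_counts(right)
        if w < best_gini:
            best_tau, best_gini = tau, w
    return best_tau, best_gini

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]                    # in $1000s (assumed)
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_numeric_split(income, labels))                                 # (97.5, 0.3)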

Gain Ratio

Gain ratio is a modification of information gain that reduces its bias. It overcomes the
problem with information gain by taking into account the number of branches that would
result before making the split; that is, it corrects information gain by taking the intrinsic information
of a split into account. In other words, the gain ratio adds a penalty to information gain.

We already know how to calculate Gain or Information Gain

One potential limitation of impurity measures such as entropy and Gini index is that they tend to favor
qualitative attributes with large number of distinct values. Figure shows three candidate attributes for
partitioning the data set. As previously mentioned, the attribute Marital Status is a better choice than
the attribute Home Owner, because it provides a larger information gain. However, if we compare
them against Customer ID, the latter produces the purest partitions with the maximum information
gain, since the weighted entropy and Gini index is equal to zero for its children. Yet, Customer ID is
not a good attribute for splitting because it has a unique value for each instance. Even though a test
condition involving Customer ID will accurately classify every instance in the training data, we
cannot use such a test condition on new test instances with Customer ID values that haven’t been seen
before during training. This example suggests having a low impurity value alone is insufficient to find
a good attribute test condition for a node. The number of children produced by the splitting attribute
should also be taken into consideration while deciding the best attribute test condition. There are two
ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the
difficulty of handling attributes with varying number of partitions. This strategy is employed by
decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into
account the number of partitions produced by the attribute. For example, in the C4.5 decision tree
algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large
number of child nodes. This measure is computed as follows:

\text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} \frac{N(v_i)}{N} \log_2 \frac{N(v_i)}{N}

where N(v_i) is the number of instances assigned to node v_i and k is the total number of splits. The
split information measures the entropy of splitting a node into its child nodes and evaluates if the split
results in a larger number of equally-sized child nodes or not. For example, if every partition has the
same number of instances, then ∀i : N(v_i)/N = 1/k and the split information would be equal to log_2 k.

Thus, if an attribute produces a large number of splits, its split information is also large, which in turn,
reduces the gain ratio.

Example: [Gain Ratio]


Consider the data set. We want to select the best attribute test condition among the following three
attributes: Gender, Car Type, and Customer ID. The entropy before splitting is
Gini Index is used by the CART (Classification And Regression Trees) algorithm.

The ID3 (Iterative Dichotomiser 3) decision tree algorithm uses Information Gain.

Gain Ratio is used as an attribute selection criterion in algorithms such as C4.5.

Algorithm for Decision Tree Induction


Algorithm presents a pseudocode for decision tree induction algorithm. The input to this algorithm is
a set of training instances E along with the attribute set F. The algorithm works by recursively
selecting the best attribute to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and
12) until the stopping criterion is met (Step 1).
The details of this algorithm are:
1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as
node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training
instances associated with a node. The splitting attribute chosen depends on the impurity measure used.
The popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf
node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The
label assigned to the leaf node is typically the one that occurs most frequently in the training instances
that are associated with this node.

\text{leaf.label} = \arg\max_i \, p(i|t)

where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information
needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the
probability that an instance assigned to the leaf node t belongs to class i.

4. The stopping_cond() function is used to terminate the tree-growing process by checking
whether all the instances have identical class label or attribute values. Since decision tree
classifiers employ a top-down, recursive partitioning approach for building a model, the
number of training instances associated with a node decreases as the depth of the tree
increases. As a result, a leaf node may contain too few training instances to make a
statistically significant decision about its class label. This is known as the data fragmentation
problem. One way to avoid this problem is to disallow splitting of a node when the number of
instances associated with the node fall below a certain threshold.
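As a rough illustration of this skeleton, here is a compact Python sketch that mirrors the roles of createNode(), find_best_split(), Classify(), and stopping_cond(), using entropy as the impurity measure; the data representation (a list of (attribute-dict, label) pairs E and a set of attribute names F) is an assumption made for the sketch:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(E, a):
    # Weighted impurity of the children produced by splitting on attribute a
    n, total = len(E), 0.0
    for v in {x[a] for x, _ in E}:
        sub = [y for x, y in E if x[a] == v]
        total += len(sub) / n * entropy(sub)
    return total

class Node:
    def __init__(self):
        self.test_cond, self.label, self.children = None, None, {}

def stopping_cond(E, F):
    return len({y for _, y in E}) <= 1 or not F             # pure node or no attributes left

def classify(E):
    return Counter(y for _, y in E).most_common(1)[0][0]    # majority class label

def find_best_split(E, F):
    return min(F, key=lambda a: weighted_entropy(E, a))     # maximize information gain

def tree_growth(E, F):
    node = Node()                                           # createNode()
    if stopping_cond(E, F):
        node.label = classify(E)
        return node
    a = node.test_cond = find_best_split(E, F)
    for v in {x[a] for x, _ in E}:                          # one child per test outcome
        node.children[v] = tree_growth([(x, y) for x, y in E if x[a] == v], F - {a})
    return node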

Tree Pruning
Pruning is the procedure that decreases the size of decision trees. It can decrease the risk of
overfitting by limiting the size of the tree or eliminating sections of the tree that provide little
predictive power. Pruning works by trimming branches that reflect anomalies in the training
data due to noise or outliers, and it helps the tree generalize better.
Such methods generally use statistical measures to remove the least reliable branches,
frequently resulting in faster classification and an improvement in the capability of the tree
to correctly classify independent test data.
There are two approaches to tree pruning which are as follows –
Prepruning Approach

In the prepruning approach, a tree is “pruned” by halting its construction


early (e.g., by deciding not to further split or partition the subset
of training tuples at a given node). Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples or the
probability distribution of those tuples.
When constructing a tree, measures such as statistical significance,
information gain, Gini index, and so on, can be used to assess the
goodness of a split. If partitioning the tuples at a node would result in a
split that falls below a prespecified threshold, then further partitioning of
the given subset is halted. There are difficulties, however, in choosing an
appropriate threshold. High thresholds could result in oversimplified trees,
whereas low thresholds could result in very little simplification.
Post Pruning Approach
The second and more common approach is postpruning, which removes
subtrees from a “fully grown” tree. A subtree at a given node is pruned by
removing its branches and replacing it with a leaf. The leaf is labeled with
the most frequent class among the subtree being replaced. For example,
notice the subtree at node “A3?” in the unpruned tree of Figure 8.6.
Suppose that the most common class within this subtree is “class B.” In
the pruned version of the tree, the subtree in question is pruned by
replacing it with the leaf “class B.”
The cost complexity pruning algorithm used in CART is an example of
the postpruning approach. This approach considers the cost complexity of
a tree to be a function of the number of leaves in the tree and the error rate
of the tree (where the error rate is the percentage of tuples misclassified
by the tree). It starts from the bottom of the tree. For each internal node, N,
it computes the cost complexity of the subtree at N, and the cost
complexity of the subtree at N if it were to be pruned (i.e., replaced by a
leaf node). The two values are compared. If pruning the subtree at
node N would result in a smaller cost complexity, then the subtree is
pruned. Otherwise, it is kept.
A pruning set of class-labeled tuples is used to estimate cost complexity.
This set is independent of the training set used to build the unpruned tree
and of any test set used for accuracy estimation. The algorithm generates a
set of progressively pruned trees. In general, the smallest decision tree that
minimizes the cost complexity is preferred.
C4.5 uses a method called pessimistic pruning, which is similar to the
cost complexity method in that it also uses error rate estimates to make
decisions regarding subtree pruning. Pessimistic pruning, however, does
not require the use of a prune set. Instead, it uses the training set to
estimate error rates. Recall that an estimate of accuracy or error based on
the training set is overly optimistic and, therefore, strongly biased. The
pessimistic pruning method therefore adjusts the error rates obtained from
the training set by adding a penalty, so as to counter the bias incurred.
Rather than pruning trees based on estimated error rates, we can prune
trees based on the number of bits required to encode them. The “best”
pruned tree is the one that minimizes the number of encoding bits.
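scikit-learn exposes a related postpruning mechanism, minimal cost-complexity pruning, via the ccp_alpha parameter. The sketch below is illustrative; unlike the prune-set procedure described above, it selects the pruned subtree using the training data and a chosen alpha:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas along the cost-complexity pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Larger ccp_alpha => heavier pruning => fewer leaves
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(round(alpha, 4), tree.get_n_leaves(), round(tree.score(X_te, y_te), 3))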
Scalability and Decision Tree Induction
The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well
established for relatively small data sets. Efficiency becomes an issue of concern when these
algorithms are applied to the mining of very large real-world databases.
Techniques
1. RainForest – AVC-set (Attribute-Value, Class label)
2. BOAT (Bootstrapped Optimistic Algorithm for Tree construction)

Confusion Matrix
A Confusion matrix is an N x N matrix used for evaluating the performance of a
classification model, where N is the number of target classes. The matrix compares
the actual target values with those predicted by the machine learning model. This
gives us a holistic view of how well our classification model is performing and
what kinds of errors it is making.

For a binary classification problem, we would have a 2 x 2 matrix as shown below


with 4 values:

 The target variable has two values: Positive or Negative


 The columns represent the actual values of the target variable
 The rows represent the predicted values of the target variable

True Positive (TP)

 The predicted value matches the actual value


 The actual value was positive and the model predicted a positive value

True Negative (TN)

 The predicted value matches the actual value


 The actual value was negative and the model predicted a negative value

False Positive (FP) – Type 1 error


 The predicted value was falsely predicted
 The actual value was negative but the model predicted a positive value
 Also known as the Type 1 error

False Negative (FN) – Type 2 error

 The predicted value was falsely predicted


 The actual value was positive but the model predicted a negative value
 Also known as the Type 2 error

Let me give you an example to better understand this. Suppose we had a


classification dataset with 1000 data points. We fit a classifier on it and get the
below confusion matrix:

The different values of the Confusion matrix would be as follows:

 True Positive (TP) = 560; meaning 560 positive class data points were correctly
classified by the model
 True Negative (TN) = 330; meaning 330 negative class data points were correctly
classified by the model
 False Positive (FP) = 60; meaning 60 negative class data points were incorrectly
classified as belonging to the positive class by the model
 False Negative (FN) = 50; meaning 50 positive class data points were incorrectly
classified as belonging to the negative class by the model

This turned out to be a pretty decent classifier for our dataset considering the
relatively larger number of true positive and true negative values.
Example

Suppose we are trying to create a model that can predict the result for the disease that is either
a person has that disease or not. So, the confusion matrix for this is given as:

From the above example, we can conclude that:

o The table is given for a two-class classifier, which has two predictions, "Yes" and
"No." Here, Yes means that the patient has the disease, and No means that the patient does
not have that disease.
o The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are
true predictions, and 11 are incorrect predictions.
o The model has given prediction "yes" for 32 times, and "No" for 68 times. Whereas
the actual "Yes" was 27, and actual "No" was 73 times.

Calculations using Confusion Matrix:


We can perform various calculations for the model, such as the model's accuracy, using this
matrix. These calculations are given below:

o Classification Accuracy: It is one of the important parameters for determining the
accuracy of a classification problem. It defines how often the model predicts the
correct output. It can be calculated as the ratio of the number of correct predictions
made by the classifier to the total number of predictions made by the classifier. The
formula is given below:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

o Misclassification rate: It is also termed the error rate, and it defines how often the
model gives wrong predictions. The error rate can be calculated as the
number of incorrect predictions divided by the total number of predictions made by the
classifier. The formula is given below:

\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}

o Precision: Out of all the instances the model predicted as positive, precision measures
how many were actually positive. It can be calculated using the below formula:

\text{Precision} = \frac{TP}{TP + FP}

o Recall: Out of all the actual positive instances, recall measures how many the model
predicted correctly as positive. Recall should be as high as possible.

\text{Recall} = \frac{TP}{TP + FN}

Recall vs Precision

Recall is the number of relevant documents retrieved by a search divided by the total number
of existing relevant documents, while precision is the number of relevant documents retrieved
by a search divided by the total number of documents retrieved by that search.

o F-measure: If one model has low precision and high recall, or vice versa, it is
difficult to compare the models. For this purpose, we can use the F-score, which
evaluates recall and precision at the same time. The F-score is
maximum when recall equals precision. It can be calculated using the below
formula:

\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}

What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the
performance of a model on unseen data. It involves dividing the available
data into multiple folds or subsets, using one of these folds as a validation
set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation
set. Finally, the results from each validation step are averaged to produce
a more robust estimate of the model’s performance. Cross validation is an
important step in the machine learning process and helps to ensure that
the model selected for deployment is robust and generalizes well to new
data.
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold
cross validation, leave-one-out cross validation, and Holdout
validation, Stratified Cross-Validation. The choice of technique
depends on the size and nature of the data, as well as the specific
requirements of the modeling problem.
1. Holdout Validation
In holdout validation, we perform training on 50% of the given dataset
and the remaining 50% is used for testing. It's a simple and quick way
to evaluate a model. The major drawback of this method is that, because we
train on only 50% of the dataset, the remaining 50% may contain
important information that we leave out while training our model, i.e., higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, we train on the whole dataset except for a single left-out
data point and iterate this process for each data point. In LOOCV, the model is trained on n−1
samples and tested on the one omitted sample, repeating this process for each data point in the
dataset. It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points,
and hence it has low bias.
The major drawback is that it leads to higher variation in
the test estimate, as we are testing against a single data point. If the data
point is an outlier, it can lead to higher variation. Another drawback is
that it takes a lot of execution time, as it iterates once per data point.
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the
cross-validation process maintains the same class distribution as the
entire dataset. This is particularly important when dealing with imbalanced
datasets, where certain classes may be underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
2. During each iteration, one-fold is used for testing, and the remaining
folds are used for training.
3. The process is repeated k times, with each fold serving as the test set
exactly once.
Stratified Cross-Validation is essential when dealing with classification
problems where maintaining the balance of class distribution is crucial for
the model to generalize well to unseen data.
4. K-Fold Cross Validation
In k-fold cross validation, we split the dataset into k subsets
(known as folds), perform training on k−1 of the subsets, and leave
one subset out for evaluating the trained model. In this method, we
iterate k times, with a different subset reserved for testing each
time.
Note: It is commonly suggested that the value of k should be 10, as a lower
value of k tends toward holdout validation and a higher value of k leads to the LOOCV
method.
Example of K Fold Cross Validation
The listing below shows an example of the training subsets and
evaluation subsets generated in k-fold cross-validation. Here, we have
25 instances in total. In the first iteration we use the first 20 percent of the data for
evaluation and the remaining 80 percent for training (instances [0-4] for testing and [5-24]
for training), while in the second iteration we use the second subset of 20
percent for evaluation and the remaining four subsets for
training (instances [5-9] for testing and [0-4] and [10-24] for training), and so on.

Total instances: 25
Value of k : 5
No. Iteration Training set observations
Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21
22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21
22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21
22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19] [20 21 22 23 24]
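The same fold assignments can be generated with scikit-learn's KFold; a minimal sketch (no shuffling, so the folds are contiguous as in the table above):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)                 # 25 instances, indexed 0..24 as in the table above
kf = KFold(n_splits=5)            # k = 5, no shuffling, so folds are contiguous

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(i, "test:", test_idx)   # fold 1 tests on [0 1 2 3 4], fold 2 on [5 6 7 8 9], ...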

Other important terms used in Confusion Matrix:


o Null Error rate: It defines how often our model would be incorrect if it always
predicted the majority class. As per the accuracy paradox, the best classifier for a particular
application can sometimes have a higher error rate than the null error rate.
o ROC Curve: The ROC (Receiver Operating Characteristic) curve is a graph displaying a
classifier's performance for all possible thresholds. The graph is plotted with the
true positive rate on the y-axis and the false positive rate on the x-axis.
o PRC area: the area under the Precision-Recall curve.
o MCC: Matthews Correlation Coefficient.

Visual Mining for Decision Tree Induction

Perception based Classification(PBC)

“Are there any interactive approaches to decision tree induction that


allow us to visualize the data and the tree as it is being constructed? Can
we use any knowledge of our data to help in building the tree?” In this
section, you will learn about an approach to decision tree induction that
supports these options. Perception-based classification (PBC) is an
interactive approach based on multidimensional visualization techniques
and allows the user to incorporate background knowledge about the data
when building a decision tree. By visually interacting with the data, the
user is also likely to develop a deeper understanding of the data. The
resulting trees tend to be smaller than those built using traditional decision
tree induction methods and so are easier to interpret, while achieving about
the same accuracy.
“How can the data be visualized to support interactive decision tree
construction?” PBC uses a pixel-oriented approach to view
multidimensional data with its class label information. The circle segments
approach is adapted, which maps d-dimensional data objects to a circle
that is partitioned into d segments, each representing one attribute (Section
2.3.1). Here, an attribute value of a data object is mapped to one colored
pixel, reflecting the object's class label. This mapping is done for each
attribute–value pair of each data object. Sorting is done for each attribute
to determine the arrangement order within a segment. For example,
attribute values within a given segment may be organized so as to display
homogeneous (with respect to class label) regions within the same
attribute value. The amount of training data that can be visualized at one
time is approximately determined by the product of the number of
attributes and the number of data objects.
The PBC system displays a split screen, consisting of a Data Interaction
window and a Knowledge Interaction window (Figure 8.9). The Data
Interaction window displays the circle segments of the data under
examination, while the Knowledge Interaction window displays the
decision tree constructed so far. Initially, the complete training set is
visualized in the Data Interaction window, while the Knowledge
Interaction window displays an empty decision tree.
Traditional decision tree algorithms allow only binary splits for numeric
attributes. PBC, however, allows the user to specify multiple split-points,
resulting in multiple branches to be grown from a single tree node.
A tree is interactively constructed as follows. The user visualizes the
multidimensional data in the Data Interaction window and selects a
splitting attribute and one or more split-points. The current decision tree in
the Knowledge Interaction window is expanded. The user selects a node of
the decision tree. The user may either assign a class label to the node
(which makes the node a leaf) or request the visualization of the training
data corresponding to the node. This leads to a new visualization of every
attribute except the ones used for splitting criteria on the same path from
the root. The interactive process continues until a class has been assigned
to each leaf of the decision tree.
The trees constructed with PBC were compared with trees generated by
the CART, C4.5, and SPRINT algorithms from various data sets. The trees
created with PBC were of comparable accuracy with the tree from
the algorithmic approaches, yet were significantly smaller and, thus, easier
to understand. Users can use their domain knowledge in building a
decision tree, but also gain a deeper understanding of their data during the
construction process.
Steps of the ID3 Algorithm

1. The first step of the algorithm is the selection of the attributes that will
become nodes of the decision tree. As already discussed there are two terms
entropy and information gain that are used as the basis for attribute selection.

2. Once the attribute is selected for the current node, the child nodes are
generated, one for each possible value of the selected attribute.

3. The examples are then partitioned using the possible values of this attribute
and these partitioned subsets of the examples are assigned to the appropriate child
node

4. The steps 1 to 3 are repeated for each child node until all examples associated
with a node are either all positive or all negative
