Unit-3
Classification vs. Prediction

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.

Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
can be done with IF-THEN rules, decision trees, neural networks, genetic algorithms, the rough set approach, or the fuzzy set approach

Prediction
predicts a missing/unknown element (a continuous value) of a dataset, i.e. predicts unknown or missing values
a prediction is a statement or claim that a particular event will occur in the future, in more certain terms than a forecast
can be done with regression analysis techniques, log-linear models, or regression trees

Typical applications (classification):
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
Classification vs. Prediction: Examples

1. A bank loan officer needs to analyze her data in order to learn which loan applicants are "safe" and which are "risky" for the bank (classification). If we instead wanted to predict the amount (in Rupees) that would be "safe" for the bank to loan an applicant, that is prediction.
2. A marketing manager at All Electronics needs data analysis to help guess whether a customer with a given profile will "buy" a new laptop computer or "not buy" (classification). He can predict the expenditure in Rs. of potential customers on computer equipment, given their income and occupation (prediction).
3. A medical researcher wants to analyze lung cancer data in order to predict which one of three specific treatments a patient should receive: "treatment A", "treatment B" or "treatment C" (classification).
In each of the above examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A", "treatment B" or "treatment C" for the medical data. These categories can be represented by discrete values, where the ordering among the values has no meaning.
When the model instead predicts a continuous-valued function or ordered value (such as the loan amount or the expected expenditure), the data analysis task is numeric prediction and the model is called a predictor.
Classification—A Two-Step Process

Model construction (classifier): describing a set of predetermined classes
  Each tuple/sample/example/instance/data point or object is assumed to belong to a predefined class, as determined by the class label attribute (discrete-valued and unordered).
  The set of tuples used for model construction is the training set (training tuples).
  The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects
  Estimate the accuracy of the model:
    The known label of each test sample is compared with the classified result from the model.
    The accuracy rate is the percentage of test-set samples that are correctly classified by the model.
    The test set is independent of the training set, otherwise over-fitting will occur.
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
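A minimal sketch of the two-step process in Python, assuming scikit-learn and its built-in iris dataset (library and dataset are illustrative choices, not part of the original slides): step 1 constructs the classifier from a training set, step 2 estimates accuracy on an independent test set before the model is applied to tuples whose labels are unknown.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Keep the test set independent of the training set to detect over-fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)       # step 1: model construction
accuracy = accuracy_score(y_test, clf.predict(X_test))     # step 2: estimate accuracy
print(f"Accuracy on the independent test set: {accuracy:.2f}")
if accuracy >= 0.90:                                       # if acceptable, classify new tuples
    print(clf.predict(X_test[:3]))

Any other classifier (rules, neural network, etc.) could be substituted for the decision tree; the two-step structure stays the same.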
Process (1): Model Construction

[Figure: Training Data → Classification Algorithm → Classifier (Model)]

Training data:

NAME   AGE          INCOME   LOAN_DECISION
Mike   young        low      risky
Mary   young        low      risky
Bill   middle_aged  high     safe
Jim    middle_aged  low      risky
Dave   senior       low      safe
Anne   senior       medium   safe
Joe    middle_aged  high     safe
...

Learned classification rules (the model):
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky
Process (2): Using the Model in Prediction

[Figure: Testing Data → Classifier; the classifier is then applied to unseen data, e.g. (Jeff, middle_aged, low) → loan_decision = ?]

Testing data:

NAME     AGE          INCOME   LOAN_DECISION
Tom      senior       low      safe
Merlisa  middle_aged  low      risky
George   middle_aged  high     safe
Joseph   young        medium   risky
Supervised vs. Unsupervised Learning

Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data is classified based on the training set.
Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data.
Uses of Classification

Concept maps in education
E-commerce (B2B, B2C)
Corporate portals
Agent-to-agent communication
Database schema correlation for interoperability
Text summarization and other NLP applications
The Semantic Web
Ontologies, taxonomies
Use of Prediction

Using prediction in web site management: Your site developer uses the Predictor resource in Commerce Server to add predictive capabilities to your site. The Predictor resource enables you to perform the following profiling and targeting activities:
Provide intelligent cross-sell support
Segment users based on their behavior or profile
Fill in missing values in your user profile data

Analysis models: The Predictor resource builds analysis models from data. An analysis model is a set of statistical relationships based on known properties of past site users, such as their purchase history, click history, or other behavior. The model contains information about the types of users who visit your site, but it does not contain information about specific users. The detailed information used to create an analysis model is stored in the Commerce Server Data Warehouse or another data source.
You can use analysis models to perform implicit profiling and to target content to users:
Implicit profiling
Targeting content

Creating analysis models: Each analysis model is based on a model configuration, which is a description of the data to be used to build the model. A model configuration specifies:
A data source for building the analysis model
Attributes for the model
Use of Prediction

Informal prediction from hypothesis
Opinion polls (political forecasting)
Supernatural prophecy (water divining, astrology, numerology, and fortune telling)
Anticipatory science forecasts (quantum physics, mathematical models, microprocessors, branch prediction, software reliability, natural disasters, pandemics, demography, population dynamics and meteorology)
Examples of scientific hypothesis and prediction
Finance (stock market, stock investors, prediction games)
Vision and prophecy
Prediction in fiction (fantasy, forecasting and science fiction)
Issues: Data Preparation

Data cleaning
Preprocess the data in order to reduce noise and handle missing values, e.g. by applying smoothing techniques
Relevance analysis (feature selection)
Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.
Remove irrelevant or redundant attributes by attribute subset selection
Data transformation and reduction
Generalize and/or normalize the data
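A small illustrative sketch of these preparation steps, assuming pandas and hypothetical column names (age, income, income_copy): fill in missing values, drop a redundant attribute found by correlation analysis, and min-max normalize.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 47, 52], "income": [30, 35, None, 80],
                   "income_copy": [30, 35, None, 80], "label": [0, 0, 1, 1]})

df = df.fillna(df.mean(numeric_only=True))           # data cleaning: handle missing values
corr = df["income"].corr(df["income_copy"])           # relevance analysis: correlation
if corr > 0.95:
    df = df.drop(columns=["income_copy"])             # remove the redundant attribute
cols = ["age", "income"]
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())  # normalize to [0, 1]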
Issues: Evaluating Classification Methods
Accuracy
classifier accuracy: predicting class label
predictor accuracy: guessing value of predicted attributes
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Decision Tree Learning

Decision tree learning is a method for approximating discrete-valued target functions in which the learned function is represented by a decision tree.
Learned trees can also be re-represented as sets of if-then rules to improve human readability.
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
Classification by Decision Tree Induction (DTI)

DTI is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node is the root node.
Why are DT classifiers so popular?
The construction of DT classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
DTs can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate.
They have good accuracy.
They may be used in medicine, manufacturing, production, financial analysis, astronomy and molecular biology.
Decision Tree Induction: Training Dataset

This follows an example of Quinlan's ID3 ("buys_computer").

age          income  student  credit_rating  buys_computer
Youth        high    no       fair           no
Youth        high    no       excellent      no
Middle_aged  high    no       fair           yes
Senior       medium  no       fair           yes
Senior       low     yes      fair           yes
Senior       low     yes      excellent      no
Middle_aged  low     yes      excellent      yes
Youth        medium  no       fair           no
Youth        low     yes      fair           yes
Senior       medium  yes      fair           yes
Youth        medium  yes      excellent      yes
Middle_aged  medium  no       excellent      yes
Middle_aged  high    yes      fair           yes
Senior       medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

[Figure: decision tree with root node "age?". The youth (<=30) branch leads to a "student?" test (no → no, yes → yes), the middle_aged (31..40) branch leads directly to the leaf yes, and the senior (>40) branch leads to a "credit_rating?" test (excellent → no, fair → yes).]
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
The tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
If attribute A is categorical or discrete-valued, the outcomes of the test at node N correspond directly to the known values of A.
If attribute A is continuous-valued, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, where split_point is the split point returned by Attribute_selection_method as part of the splitting criterion.
Examples are partitioned recursively based on the selected attributes.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf.
There are no samples left.
Algorithm for Decision Tree Induction

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
Data partition D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset.
Output: A decision tree.
Algorithm for Decision Tree Induction: Method

1. Create a node N;
2. If the tuples in D are all of the same class C, then return N as a leaf node labeled with the class C;
3. If attribute_list is empty then
4.    return N as a leaf node labeled with the majority class in D;
5. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
6. Label node N with splitting_criterion;
7. If splitting_attribute is discrete-valued and multiway splits are allowed then
8.    attribute_list ← attribute_list − splitting_attribute;
9. For each outcome j of splitting_criterion
10.    let Dj be the set of data tuples in D satisfying outcome j;
11.    if Dj is empty then
12.       attach a leaf labeled with the majority class in D to node N;
13.    else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    endfor
14. Return N;
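A compact Python sketch of the method above (ID3-style, multiway splits on discrete-valued attributes). The attribute selection measure is passed in as a function; information gain, defined later in this unit, is the usual choice. The data layout (a list of dicts) is an assumption made for illustration, and the step numbers in the comments refer to the method above.

from collections import Counter

def majority_class(rows, label):
    return Counter(r[label] for r in rows).most_common(1)[0][0]

def generate_decision_tree(rows, attribute_list, label, select_attribute):
    classes = {r[label] for r in rows}
    if len(classes) == 1:                                  # step 2: all tuples in the same class C
        return classes.pop()
    if not attribute_list:                                 # steps 3-4: attribute list empty ->
        return majority_class(rows, label)                 #   leaf labeled with the majority class
    best = select_attribute(rows, attribute_list, label)   # step 5: "best" splitting criterion
    node = {best: {}}                                      # step 6: label node N with the criterion
    remaining = [a for a in attribute_list if a != best]   # steps 7-8: remove splitting attribute
    for value in {r[best] for r in rows}:                  # step 9: each outcome j of the criterion
        subset = [r for r in rows if r[best] == value]     # step 10: partition Dj
        # steps 11-12 (empty Dj -> majority-class leaf) cannot arise here, because we
        # only branch on attribute values that actually occur in D
        node[best][value] = generate_decision_tree(subset, remaining, label, select_attribute)
    return node                                            # step 14: return N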
Attribute Selection Measure: Information Gain (ID3/C4.5)

1. Iterative Dichotomiser 3 (ID3): This algorithm uses information gain to decide which attribute is to be used to classify the current subset of the data. For each level of the tree, information gain is calculated for the remaining data recursively.
2. C4.5: This algorithm is the successor of ID3. It uses either information gain or gain ratio to decide upon the classifying attribute. It is a direct improvement over ID3, as it can handle both continuous and missing attribute values.
3. Classification and Regression Tree (CART): A learning algorithm which can produce a regression tree as well as a classification tree, depending upon the dependent variable.

Select the attribute with the highest information gain.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - Σ (i = 1..m) pi log2(pi)
Information needed (after using A to split D into v partitions) to classify D:
    InfoA(D) = Σ (j = 1..v) (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - InfoA(D)
Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age       pi  ni  I(pi, ni)
<=30      2   3   0.971
31...40   4   0   0
>40       3   2   0.971

Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that "age <= 30" has 5 out of the 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = Info(D) - Infoage(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(The training data is the buys_computer table shown earlier, with the age values <=30, 31...40 and >40 corresponding to youth, middle_aged and senior.)
The attribute age has the highest information gain & therefore becomes the splitting
attribute at the root node of the decision tree. Branches are grown for each outcome of
age. The tuples are shown partitioned accordingly.
Age?

Branch: youth
income  student  credit_rating  buys_computer
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

Branch: middle_aged
income  student  credit_rating  buys_computer
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes

Branch: senior
income  student  credit_rating  buys_computer
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no
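To make the numbers above concrete, here is a small Python sketch (an addition, not part of the original slides) that recomputes Info(D) and the gains for the buys_computer data; the printed values match the slide up to rounding in the last digit.

from math import log2
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle_aged","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle_aged","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle_aged","medium","no","excellent","yes"),
    ("middle_aged","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(rows):                       # Info(D) = - sum of pi * log2(pi) over the class labels
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(rows, attr):                 # Gain(A) = Info(D) - InfoA(D)
    i, n, parts = ATTRS[attr], len(rows), {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)            # partition D by the values of A
    info_a = sum(len(p) / n * info(p) for p in parts.values())
    return info(rows) - info_a

print(round(info(data), 3))           # about 0.940
for a in ATTRS:                       # age 0.247, income 0.029, student 0.152, credit_rating 0.048
    print(a, round(gain(data, a), 3))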
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute.
We must determine the best split point for A:
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split point:
    (ai + ai+1) / 2 is the midpoint between the values ai and ai+1.
The point with the minimum expected information requirement for A is selected as the split point for A.
Split:
D1 is the set of tuples in D satisfying A ≤ split_point, and
D2 is the set of tuples in D satisfying A > split_point.
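An illustrative sketch of this procedure in Python (the attribute values and labels below are made up): sort the values, consider the midpoints of adjacent pairs, and keep the split point with the minimum expected information requirement InfoA(D).

from math import log2
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))                  # sort the values of A in increasing order
    best = None
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        split = (a_i + a_next) / 2                       # midpoint (ai + ai+1) / 2
        left = [l for v, l in pairs if v <= split]       # D1: A <= split_point
        right = [l for v, l in pairs if v > split]       # D2: A > split_point
        info_a = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if best is None or info_a < best[1]:
            best = (split, info_a)
    return best                                          # (split point, expected information)

print(best_split_point([25, 32, 46, 51, 60], ["no", "no", "yes", "yes", "no"]))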
Gain Ratio for Attribute Selection (C4.5)

Information gain is biased towards attributes with many values. C4.5 instead uses the gain ratio, which normalizes information gain by the split information of the attribute:
    SplitInfoA(D) = - Σ (j = 1..v) (|Dj| / |D|) × log2(|Dj| / |D|)
    GainRatio(A) = Gain(A) / SplitInfoA(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
Comparing Attribute Selection Measures

The common measures generally give good results, but each has its bias: information gain is biased towards multivalued attributes; gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others; and the Gini index (used by CART) is also biased towards multivalued attributes and has difficulty when the number of classes is large.
Underfitting / Overfitting

If we have an underfitted model, this means that we do not have enough parameters to capture the trends in the underlying system. Imagine, for example, that we have data that is parabolic in nature, but we try to fit it with a linear function. Because the linear function does not have the required complexity to fit the data, we end up with a poor predictor. In this case the model has high bias: we get consistent answers, but consistently wrong ones.
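A small sketch of this under-fitting scenario, assuming NumPy and synthetic parabolic data: a straight line (high bias) versus a degree-2 polynomial fitted to the same points.

import numpy as np

x = np.linspace(-3, 3, 30)
y = x ** 2 + np.random.normal(0, 0.2, size=x.shape)       # parabolic data with a little noise

linear = np.poly1d(np.polyfit(x, y, deg=1))               # under-fitted model
quadratic = np.poly1d(np.polyfit(x, y, deg=2))            # model with adequate complexity

print("linear SSE:   ", np.sum((y - linear(x)) ** 2))     # consistently large error
print("quadratic SSE:", np.sum((y - quadratic(x)) ** 2))  # much smaller error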
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if this would result in the
goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned
tree”
Decision Trees suffer from Repetition and Replication
Repetition: It occurs when an attribute is repeatedly tested along a given branch of the
tree (such as “age<60 ?” followed by “age<45 ?” and so on).
Replication: In replication, duplicate subtrees exist within the trees
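One concrete realization of the postpruning approach described above is cost-complexity pruning; the sketch below uses scikit-learn (an assumption, since the slides do not prescribe a specific pruning algorithm or library). A validation set, kept separate from the training data, decides which of the progressively pruned trees is the "best pruned tree".

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
scores = []
for alpha in path.ccp_alphas:                      # a sequence of progressively pruned trees
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    scores.append((tree.score(X_val, y_val), alpha))

best_score, best_alpha = max(scores)               # validation data picks the best pruned tree
print(f"best pruned tree: alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")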
Unpruned Decision Tree vs. Pruned Decision Tree

[Figure: an unpruned decision tree whose internal nodes test attributes A1–A5 and whose leaves are labeled Class A or Class B, shown next to the pruned version of the same tree, in which lower subtrees have been replaced by single leaf nodes labeled with the majority class.]
Repetition
(Where an attribute is repeatedly tested along a given branch of the tree, e.g. age)

[Figure: a branch that tests "A1 < 60 ?", then "A1 < 45 ?", then "A1 < 50 ?" in succession before finally reaching the leaves Class A and Class B.]
Replication
(Where duplicate subtrees exist within a tree)

[Figure: a tree with root test "Age = youth ?" whose branches lead to a "Student ?" subtree and a "Credit_rating ?" subtree (fair / excellent); the same "Credit_rating ?" subtree reappears under an "Income ?" test (low / med / high) elsewhere in the tree, i.e. duplicate subtrees exist within the tree.]
Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are
sparsely represented
This reduces fragmentation, repetition, and replication
Issues in Decision Tree Learning

Practical issues in learning decision trees include:
Determining how deeply to grow the decision tree
Handling continuous attributes
Choosing an appropriate attribute selection measure
Handling training data with missing attribute values
Handling attributes with differing costs, and improving computational efficiency
Avoiding overfitting the data (e.g., by reduced-error pruning)
Instance-Based Learning

Machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. The approach is called instance-based because it builds its hypotheses from the training instances themselves. It is also known as memory-based learning or lazy learning. The time complexity depends upon the size of the training data: the worst-case time complexity of classifying a new instance is O(n), where n is the number of training instances.
Advantages:
Instead of estimating the target function over the entire instance space, local approximations can be made to the target function.
The approach can easily adapt to new data, which is collected as we go.
Disadvantages:
Classification costs are high.
A large amount of memory is required to store the data, and each query involves building a local model from scratch.
Some instance-based learning algorithms are:
K-Nearest Neighbor (KNN)
Self-Organizing Map (SOM)
Learning Vector Quantization (LVQ)
Locally Weighted Learning (LWL)
K-Nearest Neighbor (KNN)

K-Nearest Neighbor is one of the simplest machine learning algorithms, based on the supervised learning technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
At the training phase the KNN algorithm just stores the dataset, and when it gets new data it classifies that data into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?

The working of K-NN can be explained by the following algorithm (a small code sketch follows the steps):
Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance from the new point to the training data points.
Step 3: Take the K nearest neighbors as per the calculated Euclidean distances.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
Step 6: Our model is ready.
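A minimal sketch of these steps in plain Python on a tiny two-feature dataset (the points below are made up), with k = 5, Euclidean distance, and majority voting.

from collections import Counter
from math import dist    # Euclidean distance (Python 3.8+)

training = [((1.0, 1.2), "A"), ((1.5, 1.8), "A"), ((2.0, 1.0), "A"),
            ((6.0, 6.5), "B"), ((6.5, 7.0), "B"), ((7.0, 6.0), "B"),
            ((2.5, 2.0), "A"), ((5.5, 6.0), "B")]

def knn_classify(x_new, training, k=5):
    # Steps 2-3: compute Euclidean distances and take the k nearest neighbors
    neighbours = sorted(training, key=lambda t: dist(x_new, t[0]))[:k]
    # Steps 4-5: count the neighbors per category and assign the majority category
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_classify((2.2, 1.5), training))   # 4 of the 5 nearest neighbors are "A" -> "A"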
Suppose we have a new data point and we need to put it in the required category.
• Firstly, we will choose the number of neighbors; here we choose k = 5.
By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B.
As 3 of the 5 nearest neighbors are from category A, this new data point must belong to category A.
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is
large.
LOCALLY WEIGHTED REGRESSION

The phrase "locally weighted regression" is called local because the function is approximated based only on data near the query point, weighted because the contribution of each training example is weighted by its distance from the query point, and regression because this is the term used widely in the statistical learning community for the problem of approximating real-valued functions.
Given a new query instance xq, the general approach in locally weighted regression is to construct an approximation f^ that fits the training examples in the neighborhood surrounding xq. This approximation is then used to calculate the value f^(xq), which is output as the estimated target value for the query instance.
Locally Weighted Linear Regression

Consider locally weighted regression in which the target function f is approximated near xq using a linear function of the form

    f^(x) = w0 + w1 a1(x) + ... + wn an(x)

where ai(x) denotes the value of the ith attribute of the instance x, and w0, w1, ..., wn are the coefficients (weights) of the function.
Gradient descent is used to choose weights that minimize the squared error summed over the set D of training examples,

    E = (1/2) Σ (x ∈ D) (f(x) - f^(x))^2

which leads to the gradient descent training rule

    Δwj = η Σ (x ∈ D) (f(x) - f^(x)) aj(x)

where f(x) is the target output for training example x and f^(x) is the value predicted by the current linear approximation.
Here η is a constant learning rate.
• We need to modify this procedure to derive a local approximation rather than a global one. The simple way is to redefine the error criterion E to emphasize fitting the local training examples. Three possible criteria are given below.

1. Minimize the squared error over just the k nearest neighbors of xq:

    E1(xq) = (1/2) Σ (x ∈ k nearest neighbors of xq) (f(x) - f^(x))^2
2. Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:

    E2(xq) = (1/2) Σ (x ∈ D) (f(x) - f^(x))^2 K(d(xq, x))

where d(xq, x) is the distance of training example x from the query point xq and K is the decreasing kernel function.

3. Combine 1 and 2:

    E3(xq) = (1/2) Σ (x ∈ k nearest neighbors of xq) (f(x) - f^(x))^2 K(d(xq, x))
If we choose criterion 3 and re-derive the gradient descent rule, we obtain the following training rule:

    Δwj = η Σ (x ∈ k nearest neighbors of xq) K(d(xq, x)) (f(x) - f^(x)) aj(x)

A common choice is to let each kernel function Ku(d(xu, x)) be a Gaussian function centered at the point xu with some variance σu². A target function of the form

    f^(x) = w0 + Σ (u = 1..k) wu Ku(d(xu, x))        ... (1)

can then be viewed as describing a two-layer network, where the first layer of units computes the values of the various Ku(d(xu, x)) and the second layer computes a linear combination of these first-layer unit values.
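A sketch of locally weighted linear regression in Python, under stated assumptions: a Gaussian kernel K(d) = exp(-d² / (2τ²)) with bandwidth τ, and a closed-form weighted least-squares fit around each query point (used here in place of the gradient descent rule above; the synthetic sine data is purely illustrative).

import numpy as np

def lwlr_predict(x_q, X, y, tau=0.5):
    Xb = np.column_stack([np.ones(len(X)), X])            # add a0(x) = 1 for the intercept w0
    xq = np.array([1.0, x_q])
    d = np.abs(X - x_q)                                    # distance of each training point from xq
    K = np.diag(np.exp(-(d ** 2) / (2 * tau ** 2)))        # decreasing (Gaussian) kernel weights
    # Weighted least squares: w = (Xb^T K Xb)^-1 Xb^T K y
    w = np.linalg.solve(Xb.T @ K @ Xb, Xb.T @ K @ y)
    return xq @ w                                          # estimated target value f^(xq)

X = np.linspace(0, 6, 40)
y = np.sin(X) + np.random.normal(0, 0.05, size=X.shape)   # synthetic non-linear data
print(lwlr_predict(3.0, X, y))                             # close to sin(3.0) ≈ 0.141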