
UNIT 3

Decision Tree Learning/ Instance Based Learning

BCS 055: Machine Learning Introduction

Dr. Neelaksh Sheel


(Associate Prof. CS & E)

CS & E Dept. M.I.T Moradabad


B.Tech CSE V sem
Recommended Books:

1. Tom M. Mitchell, Machine Learning, McGraw-Hill Education (India) Private Limited, 2013.
2. Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2004.
3. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009.
4. Christopher Bishop, Pattern Recognition and Machine Learning, Berlin: Springer-Verlag.

Classification vs. Prediction
 Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label is unknown. This derived
model is based on the analysis of a set of training data
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
 Can be done by IF-THEN rules, Decision Tree, Neural Network, Genetic
Algorithm, Rough Set Approach, Fuzzy Set Approach
 Prediction
 Prediction is about predicting a missing/unknown element (a continuous value) of a dataset, i.e. it predicts unknown or missing values.
 A prediction is a statement or claim that a particular event will occur in the future in more certain terms than a forecast.
 Can be done by regression analysis techniques, log-linear models, regression trees
 Typical applications (classification): credit approval, target marketing, medical diagnosis, fraud detection

Classification vs. Prediction :Examples
1. A bank loans officer needs analysis of her data in order to learn which loan
applicants are “safe” and which are “risky” for the bank (classification). If we
instead wanted to predict the amount (in rupees) that would be “safe” for the bank
to loan an applicant, the task would be numeric prediction.
2. A marketing manager at All Electronics needs data analysis to help guess whether a
customer with a given profile will “buy” a new laptop computer or “not buy”
(classification). He can predict the expenditure in Rs. of potential customers on
computer equipment given their income and occupation (prediction).
3. A medical researcher wants to analyze lung cancer data in order to predict “which
one” of three specific treatments a patient should receive: “treatment A”,
“treatment B” or “treatment C”.
 In each of the above examples, the data analysis task is “classification”, where a model
or classifier is constructed to predict categorical labels, such as “safe” or “risky”
for the loan application data; “yes” or “no” for the marketing data; or “treatment
A”, “treatment B” or “treatment C” for the medical data. These categories can be
represented by discrete values, where the ordering among the values has no
meaning.
 In the loan-amount and expenditure examples, the data analysis task is numeric
prediction, where the model constructed predicts a continuous-valued function or
ordered value. This model is a predictor.

Classification—A Two-Step Process
 Model construction(Classifier): describing a set of predetermined
classes
 Each tuple/sample/example/instance/data point or object is
assumed to belong to a predefined class, as determined by the class
label attribute (discrete-valued and unordered)
 The set of tuples (training tuples) used for model construction is the
training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of a test sample is compared with the classified
result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set, otherwise over-fitting will
occur
 If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known

Process (1): Model Construction
Training Data → Classification Algorithm → Classifier (Model)

Training data:
NAME   AGE          INCOME   LOAN_DECISION
Mike   young        low      risky
Mary   young        low      risky
Bill   middle_age   high     safe
Jim    middle_age   low      risky
Dave   senior       low      safe
Anne   senior       medium   safe
Joe    middle_age   high     safe
……..

Learned classification rules (the model):
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = ‘middle_age’ AND income = low THEN loan_decision = ‘risky’
Process (2): Using the Model in Prediction

Testing Data → Classifier → predicted class for unseen data

Test data:
NAME     AGE          INCOME   LOAN_DECISION
Tom      senior       low      safe
Merlisa  middle_aged  low      risky
George   middle_aged  high     safe
Joseph   young        medium   risky

Unseen data: (Jeff, middle_aged, low) → loan_decision?
Supervised vs. Unsupervised Learning

 Supervised learning (classification)


 Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 No new class is generated
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
 New classes can be generated.

Uses of Classification
 concept maps in education
 e-commerce (B2B, B2C)
 Corporate portals
 Agent-to-agent communication
 Database schema correlation for interoperability
 Text summarization and other NLP applications
 The Semantic Web
 Ontologies, Taxonomies

Use of Prediction
 Using Prediction in Web Site Management: Your site developer uses the Predictor
resource in Commerce Server to add predictive capabilities to your site. The Predictor
resource enables you to perform the following profiling and targeting activities
 Provide intelligent cross-sell support
 Segment users based on their behavior or profile
 Fill in missing values in your user profile data
 Analysis Models: The Predictor resource builds analysis models from data. An analysis
model is a set of statistical relationships based on known properties of past site users,
their purchase history, click history, or other behavior. The model contains information
about the types of users who visit your site, but it does not contain information about
specific users. The detailed information used to create an analysis model is stored in the
Commerce Server Data Warehouse, or another data source.
 You can use analysis models to perform implicit profiling and to target content to users:
 Implicit profiling
 Targeting content.
 Creating Analysis Models: Each analysis model is based on a model configuration,
which is a description of the data to be used to build the model. A model configuration
specifies:
 A data source for building the analysis model.
 Attributes for the model.

Use of Prediction
 Informal prediction from hypothesis
 Opinion Polls (Political Forecasting , opinion polls )
 Supernatural (prophecy) (water divining, astrology, numerology,
and fortune telling)
 Anticipatory science forecasts (Quantum physics , Mathematical
models , microprocessors, branch prediction , software reliability,
natural disasters, pandemics, demography, population dynamics and
meteorology)
 Example of scientific hypothesis and prediction
 Finance (stock market ,stock investors , prediction games)
 Vision and prophecy
 Prediction in fiction (fantasy, forecasting and science fiction)

Issues: Data Preparation
 Data cleaning
 Preprocess data in order to reduce noise and handle missing values
by applying smoothing techniques
 Relevance analysis (feature selection)
 Many of the attributes in the data may be redundant. Correlation
analysis can be used to identify whether any two given attributes are
statistically related.
 Remove the irrelevant or redundant attributes by Attribute subset
selection
 Data transformation and reduction
 Generalize and/or normalize data

Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules

Decision Tree Learning
 Decision Tree learning is a method for approximating
discrete-valued target functions in which the learned
function is represented by a decision tree.
 Learned trees can also be re-represented as sets of if-then
rules to improve human readability.
 Decision Trees classify instances by sorting them down
the tree from root to some leaf node which provides the
classification of the instance.

Classification by Decision Tree Induction (DTI)
 DTI is the learning of decision trees from class_labeled training tuples.
 A decision tree is a flowchart-like tree structure, where each internal node
(non-leaf node) denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (or terminal node) holds a class
label. The topmost node is the root node.
 Why are DT classifiers so popular?
 The construction of DT classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery.
 DT can handle high dimensional data.
 Their representation of acquired knowledge in tree form is intuitive
and generally easy to assimilate by humans
 They have good accuracy.
 They may be used in medicine, manufacturing and production, financial
analysis, astronomy and molecular biology.

Decision Tree Induction: Training Dataset
This follows an example of Quinlan’s ID3 (buys_computer).

age          income   student   credit_rating   buys_computer
Youth        high     no        fair            no
Youth        high     no        excellent       no
Middle_aged  high     no        fair            yes
Senior       medium   no        fair            yes
Senior       low      yes       fair            yes
Senior       low      yes       excellent       no
Middle_aged  low      yes       excellent       yes
Youth        medium   no        fair            no
Youth        low      yes       fair            yes
Senior       medium   yes       fair            yes
Youth        medium   yes       excellent       yes
Middle_aged  medium   no        excellent       yes
Middle_aged  high     yes       fair            yes
Senior       medium   no        excellent       no

Output: A Decision Tree for “buys_computer”

age?
  youth (<=30): student?
      no  → buys_computer = no
      yes → buys_computer = yes
  middle_aged (31..40) → buys_computer = yes
  senior (>40): credit_rating?
      excellent → buys_computer = no
      fair      → buys_computer = yes
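As a minimal sketch (not from the slides), this learned tree can be written as nested IF-THEN tests in Python; the attribute values and class labels follow the buys_computer table above.

```python
def predict_buys_computer(age: str, student: str, credit_rating: str) -> str:
    """Classify a tuple by walking the decision tree from the root (age?) to a leaf."""
    if age == "youth":                 # root test on age
        return "yes" if student == "yes" else "no"
    elif age == "middle_aged":         # pure branch: all such training tuples are "yes"
        return "yes"
    else:                              # age == "senior"
        return "no" if credit_rating == "excellent" else "yes"

# Example: a young student with a fair credit rating is classified as a buyer.
print(predict_buys_computer("youth", "yes", "fair"))   # -> "yes"
```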

Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer manner
 At start, all the training examples are at the root
 If attribute A is categorical or discrete-valued: the outcomes of the
test at node N correspond directly to the known values of A.
 If attribute A is continuous-valued: the test at node N has two
possible outcomes, corresponding to the conditions A <= split_point and A > split_point,
where split_point is the split point returned by
Attribute_selection_method as part of the splitting criterion.
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
 There are no samples left

Algorithm for Decision Tree Induction
 Algorithm: Generate_decision_Tree. Generate a
decision tree from the training tuples of data partition D.
 Input:
 Data partition, D, which is a set of training tuples and
their associated class labels;
 Attribute_list, the set of candidate attributes;
 Attribute_selection_method, a procedure to
determine the splitting criterion that “best” partitions
the data tuples into individual classes. This criterion
consists of a splitting_attribute and, possibly, either a
split point or a splitting subset.
 Output: A decision tree.
Algorithm for Decision Tree Induction: Method
1. Create a node N;
2. If the tuples in D are all of the same class C, then return N as a leaf node labeled
with the class C;
3. If attribute_list is empty then
4. return N as a leaf node labeled with the majority class in D;
5. Apply Attribute_selection_method(D, attribute_list) to find the “best”
splitting_criterion;
6. Label node N with splitting_criterion;
7. If splitting_attribute is discrete-valued and multiway splits are allowed then
8. attribute_list ← attribute_list – splitting_attribute;
9. For each outcome j of splitting_criterion
10. let Dj be the set of data tuples in D satisfying outcome j;
11. if Dj is empty then
12. attach a leaf labeled with the majority class in D to node N;
13. else attach the node returned by Generate_decision_tree(Dj, attribute_list)
to node N;
endfor
14. Return N;
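A compact Python sketch of this recursive procedure (illustrative, assuming the training tuples are given as a list of dicts; select_attribute stands in for Attribute_selection_method):

```python
from collections import Counter

def generate_decision_tree(rows, attribute_list, target, select_attribute):
    """Recursive divide-and-conquer tree induction, mirroring the method above."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                    # all tuples in the same class
        return classes[0]
    if not attribute_list:                        # no attributes left: majority voting
        return Counter(classes).most_common(1)[0][0]
    best = select_attribute(rows, attribute_list, target)   # "best" splitting attribute
    node = {best: {}}
    remaining = [a for a in attribute_list if a != best]    # discrete, multiway split
    for value in {r[best] for r in rows}:         # one branch per outcome
        subset = [r for r in rows if r[best] == value]
        node[best][value] = generate_decision_tree(subset, remaining, target, select_attribute)
    return node
```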

Attribute Selection Measure:
Information Gain (ID3/C4.5)
 1. Iterative Dichotomiser 3 (ID3): This algorithm uses Information Gain to decide which attribute is to be
used to classify the current subset of the data. For each level of the tree, information gain is calculated for the
remaining data recursively.
 2. C4.5: This algorithm is the successor of the ID3 algorithm. This algorithm uses either Information gain or
Gain ratio to decide upon the classifying attribute. It is a direct improvement from the ID3 algorithm as it can
handle both continuous and missing attribute values.
 3. Classification and Regression Tree(CART): It is a dynamic learning algorithm which can produce a
regression tree as well as a classification tree depending upon the dependent variable.
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
 Information needed (after using A to split D into v partitions) to classify D:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
 Information gained by branching on attribute A:
$Gain(A) = Info(D) - Info_A(D)$

Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples);  Class N: buys_computer = “no” (5 tuples)

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

Class distribution within each value of age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

Here $\frac{5}{14} I(2,3)$ means that “age <= 30” covers 5 of the 14 samples, with 2 yes’es and 3 no’s. Hence

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$

Similarly,
$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$

(The training data is the buys_computer table of the earlier slide, with age coded as <=30, 31…40 and >40.)
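A quick numeric check of these values (a standalone sketch; I(p, n) is the slide's notation for the entropy of a two-class count):

```python
from math import log2

def I(*counts):
    """Entropy of a class-count vector, e.g. I(9, 5) or I(2, 3)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = I(9, 5)                                                  # 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)       # 0.694
print(round(info_D - info_age, 3))                                # Gain(age) = 0.246
```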
The attribute age has the highest information gain & therefore becomes the splitting
attribute at the root node of the decision tree. Branches are grown for each outcome of
age. The tuples are shown partitioned accordingly.

Age?

Branch: youth
income   student   credit_rating   buys_computer
high     no        fair            no
high     no        excellent       no
medium   no        fair            no
low      yes       fair            yes
medium   yes       excellent       yes

Branch: middle_aged
income   student   credit_rating   buys_computer
high     no        fair            yes
low      yes       excellent       yes
medium   no        excellent       yes
high     yes       fair            yes

Branch: senior
income   student   credit_rating   buys_computer
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no

Computing Information-Gain for Continuous-
Value Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
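A minimal sketch of this procedure (illustrative names; values are numeric attribute readings paired with class labels):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Midpoint split of a continuous attribute with the minimum expected information."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for (a, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a == a_next:
            continue
        midpoint = (a + a_next) / 2                 # (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= midpoint]
        right = [lab for v, lab in pairs if v > midpoint]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if expected < best_info:
            best_point, best_info = midpoint, expected
    return best_point

# Toy usage: a clean break between 27 and 30 gives split point 28.5.
print(best_split_point([22, 25, 27, 30, 35, 40], ["no", "no", "no", "yes", "yes", "yes"]))
```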
Gain Ratio for Attribute Selection (C4.5)

 Information gain measure is biased towards attributes with a


large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)$
 GainRatio(A) = Gain(A)/SplitInfo_A(D)
 Ex.: splitting D on income produces three partitions of sizes 4 (low), 6 (medium) and 4 (high), so
$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$
 gain_ratio(income) = 0.029/1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the
splitting attribute
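A quick check of the income example (a sketch; the partition sizes 4, 6 and 4 come from the buys_computer data and Gain(income) = 0.029 from the earlier slide):

```python
from math import log2

split_info_income = -sum(n / 14 * log2(n / 14) for n in (4, 6, 4))
gain_ratio_income = 0.029 / split_info_income
print(round(split_info_income, 3), round(gain_ratio_income, 3))   # 1.557 0.019
```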
Gini index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index gini(D) is defined as
$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where pj is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index giniA(D) is defined as
$gini_A(D) = \frac{|D_1|}{|D|}\,gini(D_1) + \frac{|D_2|}{|D|}\,gini(D_2)$
 Reduction in impurity:
$\Delta gini(A) = gini(D) - gini_A(D)$
 The attribute that provides the smallest ginisplit(D) (or the largest reduction in
impurity) is chosen to split the node (need to enumerate all the possible
splitting points for each attribute)
Gini index (CART, IBM IntelligentMiner)
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
$gini_{income \in \{low,medium\}}(D) = \frac{10}{14}\,Gini(D_1) + \frac{4}{14}\,Gini(D_2)$

Similarly, the Gini index values for the remaining splits are 0.315 ({low, high} and {medium}) and
0.300 ({medium, high} and {low}); gini{medium,high} is 0.30 and thus the best since it is the
lowest. Likewise, 0.375 ({youth, senior} and {middle_aged}) is the best split for age; the
attributes student and credit_rating are both binary, with Gini index values 0.367
and 0.429 respectively.
The attribute income and splitting subset {medium, high} therefore give the minimum gini index
overall, with a reduction in impurity of 0.459 − 0.300 = 0.159.
This binary split results in the maximum reduction in impurity of the tuples in D and is thus used
as the splitting criterion (the Gini index selects income instead of age at the root node).
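A small sketch of the impurity computation (the class counts 9 and 5 come from the buys_computer data; the helper name is illustrative):

```python
def gini(counts):
    """Gini impurity for a tuple of class counts, e.g. gini((9, 5))."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini((9, 5)), 3))   # gini(D) = 0.459 for the full data set D
```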

Comparing Attribute Selection Measures

 The three measures, in general, return good results but


 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one partition
is much smaller than the others
 Gini index:
 biased to multivalued attributes
 has difficulty when # of classes is large
 tends to favor tests that result in equal-sized partitions
and purity in both partitions
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistics: has a close approximation to χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others

Underfitting / Overfitting
 If we have an underfitted model, this means that we do not have enough
parameters to capture the trends in the underlying system. Imagine for
example that we have data that is parabolic in nature, but we try to fit this
with a linear function, with just one parameter. Because the function does
not have the required complexity to fit the data (two parameters), we end up
with a poor predictor. In this case the model will have high bias. This means
that we will get consistent answers, but consistently wrong answers

 If we have overfitted, this means that we have too many parameters to


be justified by the actual underlying data and therefore build an overly
complex model. Again imagine that the true system is a parabola, but
we used a higher order polynomial to fit to it. Because we have natural
noise in the data used to fit (deviations from the perfect parabola), the
overly complex model treats these fluctuations and noise as if they were
intrinsic properties of the system and attempts to fit to them. The result is a model that has
high variance. This means that we will not get consistent predictions of future results.
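A tiny illustration of this bias/variance contrast (not from the slides; the polynomial degrees, noise level and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 20)
y = x ** 2 + rng.normal(0, 1.0, x.size)       # a parabola plus natural noise

underfit = np.polyfit(x, y, deg=1)             # too few parameters  -> high bias
overfit = np.polyfit(x, y, deg=9)              # too many parameters -> high variance

x_new = 2.5                                     # compare predictions at an unseen point
print(np.polyval(underfit, x_new), np.polyval(overfit, x_new), x_new ** 2)
```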

Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a node if this would result in the
goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
 Use a set of data different from the training data to decide which is the “best pruned
tree”
 Decision Trees suffer from Repetition and Replication
 Repetition: It occurs when an attribute is repeatedly tested along a given branch of the
tree (such as “age<60 ?” followed by “age<45 ?” and so on).
 Replication: In replication, duplicate subtrees exist within the trees

Unpruned Decision Tree vs. Pruned Decision Tree

(Figure) The unpruned tree tests attributes A1, A2, A3, A4 and A5 before reaching Class A / Class B
leaves; in the pruned tree some of these subtrees are replaced by single leaf nodes (Class A or
Class B), giving a smaller tree.

Repetition

Where an attribute is repeatedly tested along a given branch of the tree, e.g. age:

(Figure) One branch successively tests A1 < 60 ?, then A1 < 45 ?, then A1 < 50 ?, ending in
Class A and Class B leaves.

Replication
( Where duplicate subtrees exist within a tree)
(Figure) The tree is rooted at “Age = youth ?”; one branch tests Student ? and the other tests
Credit_rating ? (fair / excellent), but both branches contain an identical subtree that tests
Income ? (low / med / high) and leads to Class A / Class B / Class C leaves. This duplicated
Income subtree is the replication.

Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
Issues In Decision Tree Learning
Practical issues in learning decision trees include
 Determining how deeply to grow the decision tree
 Handling continuous attributes
 Choosing an appropriate attribute selection measure
 Handling training data with missing attribute values, handling
Attributes with differing costs, and improving computational
efficiency.
 Avoiding overfitting the data
 Reduced-error pruning

Instance Based Learning
 The machine learning systems that are categorized as instance-based
learning are systems that learn the training examples by heart and
then generalize to new instances based on some similarity measure. It is
called instance-based because it builds the hypotheses from the training
instances. It is also known as memory-based learning or lazy
learning. The time complexity of this algorithm depends upon the size of the
training data. The worst-case time complexity of this algorithm is O(n),
where n is the number of training instances.

 For example, if we were to create a spam filter with an instance-based
learning algorithm, instead of just flagging emails that are already
marked as spam emails, our spam filter would be programmed to also
flag emails that are very similar to them. This requires a measure of
resemblance between two emails. A similarity measure between two
emails could be the same sender or the repetitive use of the same
keywords or something else.

Advantages:
 Instead of estimating for the entire instance set, local
approximations can be made to the target function.
 This algorithm can adapt easily to new data, which is
collected as we go.
Disadvantages:
 Classification costs are high
 Large amount of memory required to store the data, and each
query involves starting the identification of a local model from
scratch.
Some of the instance-based learning algorithms are :
 K Nearest Neighbor (KNN)
 Self-Organizing Map (SOM)
 Learning Vector Quantization (LVQ)
 Locally Weighted Learning (LWL)

K Nearest Neighbor (KNN)
 K-Nearest Neighbor is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
 K-NN algorithm assumes the similarity between the new case/data and the
available cases and puts the new case into the category that is most similar to the
available categories.
 K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be easily
classified into a well-suited category by using the K-NN algorithm.
 K-NN algorithm can be used for regression as well as for classification, but
mostly it is used for classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset and, at the time of
classification, performs an action on the dataset.
 The KNN algorithm at the training phase just stores the dataset, and when it gets new
data, it classifies that data into the category that is most similar to the new
data.
Example: Suppose we have an image of a creature that
looks similar to a cat and a dog, but we want to know whether it
is a cat or a dog. For this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our
KNN model will find the features of the new image that are
similar to the cat and dog images, and based on the most
similar features it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and
Category B, and we have a new data point x1. In which of
these categories will this data point lie? To solve this
type of problem, we need a K-NN algorithm. With the help
of K-NN, we can easily identify the category or class of a
particular data point.

How does K-NN work?
 The K-NN working can be explained on the basis of the below
algorithm:
 Step-1: Select the number K of the neighbors
 Step-2: Calculate the Euclidean distance of K number of
neighbors
 Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
 Step-4: Among these k neighbors, count the number of data
points in each category.
 Step-5: Assign the new data point to the category for which
the number of neighbors is maximum.
 Step-6: Our model is ready.
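A minimal sketch of these steps in Python (the toy points, labels and k = 5 are illustrative, not from the slides):

```python
from collections import Counter
from math import dist   # Euclidean distance between two points (Python 3.8+)

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage: with k = 5 the query gets 3 Category A and 2 Category B neighbors -> "A".
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=5))
```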

Suppose we have a new data point and we need to
put it in the required category.
• Firstly, we will choose the number of neighbors, so we choose k = 5.

• Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:
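For two points $(x_1, y_1)$ and $(x_2, y_2)$ in the plane, the standard formula is:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$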

By calculating the Euclidean distance we got the
nearest neighbors, as three nearest neighbors in
category A and two nearest neighbors in category B

As we can see the 3 nearest neighbors are from
category A, hence this new data point must belong
to category A.

Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is
large.

Disadvantages of KNN Algorithm:


• Always needs to determine the value of K, which
may sometimes be complex.
• The computation cost is high because of
calculating the distance between the data points
for all the training samples.

LOCALLY WEIGHTED REGRESSION
 The phrase "locally weighted regression" is called local because
the function is approximated based only on data near the query
point, weighted because the contribution of each training example
is weighted by its distance from the query point, and regression
because this is the term used widely in the statistical learning
community for the problem of approximating real-valued
functions.
 Given a new query instance Xq, the general approach in locally
weighted regression is to construct an approximation that fits the
training examples in the neighborhood surrounding Xq. This
approximation is then used to calculate the value 𝑓(Xq), which is
output as the estimated target value for the query instance.

Locally Weighted Linear Regression
 Consider locally weighted regression in which the target function
f is approximated near Xq using a linear function of the form shown
below, where ai(x) denotes the value of the ith attribute of the instance x, and w0, w1, …,
wn are the coefficients (weights) of the function.
 Gradient descent is used to choose the weights that minimize the squared error
summed over the set D of training examples, which leads to the gradient descent training
rule shown below, where f(x) is the target output and 𝑓(x) is the current approximation
produced by the learned linear function.
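The corresponding equations (the standard forms from Mitchell's treatment; the slide images are not reproduced here) are:

$$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$$

$$E = \frac{1}{2}\sum_{x \in D}\left(f(x) - \hat{f}(x)\right)^2$$

$$\Delta w_j = \eta \sum_{x \in D}\left(f(x) - \hat{f}(x)\right) a_j(x)$$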
Where, η is a constant learning rate
• Need to modify this procedure to derive a local
approximation rather than a global one. The simple
way is to redefine the error criterion E to emphasize
fitting the local training examples. Three possible
criteria are given below.

1. Minimize the squared error over just the k nearest neighbors:
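In Mitchell's notation (a reconstruction; the original slide shows this as an image):

$$E_1(x_q) = \frac{1}{2}\sum_{x \,\in\, k\text{ nearest nbrs of } x_q}\left(f(x) - \hat{f}(x)\right)^2$$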

2. Minimize the squared error over the entire set D
of training examples, while weighting
the error of each training example by some
decreasing function K of its distance from xq,
where d is the distance of a training instance from the query point xq and
K is the decreasing (kernel) function.

3. Combine 1 and 2; this will become our error term.
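Written out (again following the standard formulation, since the slide images are missing):

$$E_2(x_q) = \frac{1}{2}\sum_{x \in D}\left(f(x) - \hat{f}(x)\right)^2 K\!\left(d(x_q, x)\right)$$

$$E_3(x_q) = \frac{1}{2}\sum_{x \,\in\, k\text{ nearest nbrs of } x_q}\left(f(x) - \hat{f}(x)\right)^2 K\!\left(d(x_q, x)\right)$$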

If we choose criterion three and re-derive the gradient
descent rule, we obtain the following training rule
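In its standard form (reconstructed, as the slide shows it as an image):

$$\Delta w_j = \eta \sum_{x \,\in\, k\text{ nearest nbrs of } x_q} K\!\left(d(x_q, x)\right)\left(f(x) - \hat{f}(x)\right) a_j(x)$$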

Here we are modifying the weights, i.e. $w_j \leftarrow w_j + \Delta w_j$.

The differences between this new rule and the
global gradient descent training rule given earlier are that the
contribution of instance x to the weight update is
now multiplied by the distance penalty K(d(xq,
x)), and that the error is summed over only the k
nearest training examples.
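As a small end-to-end sketch (not from the slides), locally weighted linear regression can also be solved in closed form with weighted least squares instead of gradient descent; the function name, the Gaussian kernel width tau and the toy data are illustrative:

```python
import numpy as np

def locally_weighted_prediction(X, y, x_query, tau=0.5):
    """Fit a weighted linear model around x_query and return its prediction there.

    X: (n, d) training inputs, y: (n,) targets; the Gaussian kernel
    K(d) = exp(-d^2 / (2 * tau^2)) plays the role of the distance penalty.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend a column for w0
    qb = np.concatenate([[1.0], x_query])
    dists = np.linalg.norm(X - x_query, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * tau ** 2))
    W = np.diag(weights)
    # Weighted least squares: w = (Xb^T W Xb)^+ Xb^T W y
    w = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return qb @ w

# Toy usage on a noisy parabola: the local linear fit tracks the curve near the query.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.1, 50)
print(locally_weighted_prediction(X, y, np.array([1.0])))   # local estimate near x = 1
```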
https://www.youtube.com/watch?v=38kNPkeGoR4
RADIAL BASIS FUNCTIONS
 One approach to function approximation that is
closely related to distance-weighted regression
and also to artificial neural networks is learning
with radial basis functions
 In this approach, the learned hypothesis is a
function of the form
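(The standard form, referred to as equation (1) below; reconstructed since the slide shows it as an image:)

$$\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u\!\left(d(x_u, x)\right) \qquad (1)$$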

Where each xu is an instance from X and where the
kernel function Ku(d(xu, x)) is defined so that it
decreases as the distance d(xu, x) increases.
• Here k is a user-provided constant that specifies the number of kernel functions to be included.

• 𝑓 is a global approximation to f(x); the contribution from each of the Ku(d(xu, x)) terms is
localized to a region nearby the point xu.

• Choose each function Ku(d(xu, x)) to be a Gaussian function centered at the point xu with
some variance 𝜎u2.

• The functional form of equation (1) can approximate any function with arbitrarily small error,
provided a sufficiently large number k of such Gaussian kernels is used and provided the
width 𝜎2 of each kernel can be separately specified.
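With a Gaussian kernel this is commonly written as (a reconstruction, not taken from the slide):

$$K_u\!\left(d(x_u, x)\right) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$$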

The function given by equation (1) can be viewed as
describing a two-layer network, where the first layer of
units computes the values of the various Ku(d(xu, x))
and the second layer computes a linear
combination of these first-layer unit values.

