
Lecture Slides for
Machine Learning and Web Scraping
LP BDB
April 2023
Classification Algorithm in Machine Learning
As we know, Supervised Machine Learning algorithms can be broadly classified into Regression and Classification algorithms. Regression algorithms predict the output for continuous values, but to predict categorical values, we need Classification algorithms.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.

Unlike regression, the output variable of Classification is a category, not a numeric value, such as "Green or Blue" or "fruit or animal". Since the Classification algorithm is a Supervised Learning technique, it takes labeled input data, meaning each input comes with a corresponding output.
In a classification algorithm, a discrete output function y is mapped to the input variable x: y = f(x).
The best-known example of an ML classification algorithm is an email spam detector.
The main goal of the Classification algorithm is to identify the category of a given data point; these algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the below diagram. In the diagram, there are two classes, Class A and Class B. Points within a class have features similar to each other and dissimilar to points in the other class.

The algorithm which implements the classification on a dataset is known as a classifier.


There are two types of Classifications:
•Binary Classifier: If the classification problem has only two possible outcomes, it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
•Multi-class Classifier: If a classification problem has more than two outcomes, it is called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.
Learners in Classification Problems:
In classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.

Examples: K-NN algorithm, case-based reasoning

• Eager Learners: Eager learners develop a classification model from the training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction. Examples: Decision Trees, Naïve Bayes, ANN.
Types of ML Classification Algorithms:
Classification algorithms can be mainly divided into two categories:
•Linear Models
• Logistic Regression
• Support Vector Machines
•Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Evaluating a Classification model:
Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. For evaluating a Classification model, we have the following ways:

Log Loss or Cross-Entropy Loss:


•It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
•For a good binary classification model, the value of log loss should be near 0.
•The value of log loss increases as the predicted probability deviates from the actual label.
•A lower log loss represents higher accuracy of the model.
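As a hedged illustration, the sketch below computes log loss for a handful of made-up labels and predicted probabilities, assuming scikit-learn is available:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0, 1]            # actual binary labels (illustrative)
y_prob = [0.1, 0.9, 0.8, 0.3, 0.6]  # predicted probability of class 1

# Lower is better; a perfect classifier would score 0.
print(log_loss(y_true, y_prob))  # ~0.26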

Confusion Matrix:
•The confusion matrix gives us a matrix/table as output that describes the performance of the model.
•It is also known as the error matrix.
•The matrix summarizes the prediction results, showing the total numbers of correct and incorrect predictions. It looks like the table below:

                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative
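A minimal sketch of producing this matrix with scikit-learn; the example labels are made up, and note that sklearn's convention puts actual classes on the rows (the transpose of the table above):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (illustrative)

# sklearn convention: rows are actual classes, columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[3 1]
                                         #  [1 3]]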
AUC-ROC curve:
•ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
•It is a graph that shows the performance of the classification model at different classification thresholds.
•The AUC-ROC curve is also used to visualize the performance of multi-class classification models.
•The ROC curve plots the TPR (True Positive Rate) on the Y-axis against the FPR (False Positive Rate) on the X-axis.
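A minimal sketch of computing the ROC points and the AUC with scikit-learn; the four labels and scores below are illustrative values, not from the lecture:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]            # actual labels (illustrative)
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted scores for class 1

# TPR and FPR at every threshold, plus the single AUC summary number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # 0.75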

Use cases of Classification Algorithms


Classification algorithms can be used in many domains. Below are some popular use cases of classification algorithms:
•Email spam detection
•Speech recognition
•Identification of cancer tumor cells
•Drug classification
•Biometric identification, etc.
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
About KNN
•K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
•The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
•The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
•The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
•K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
•It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at classification time.
•At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
•Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point belong to? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?


The working of K-NN can be explained with the following algorithm:
•Step-1: Select the number K of neighbors.
•Step-2: Calculate the Euclidean distance from the new data point to the existing data points.
•Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
•Step-4: Among these K neighbors, count the number of data points in each category.
•Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
•Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
•Firstly, we will choose the number of neighbors, so we will choose K=5.
•Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:

d = √((x₂ − x₁)² + (y₂ − y₁)²)

•By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:
•As we can see, the 3 nearest neighbors are from Category A; hence this new data point must belong to Category A.
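These steps translate almost line for line into code. Below is a minimal from-scratch sketch; the sample points, labels, and query point are illustrative assumptions, not data from the slides:

import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points (the formula above)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(points, labels, query, k=5):
    # Steps 2-3: compute distances and keep the K nearest neighbors.
    nearest = sorted(zip(points, labels),
                     key=lambda pl: euclidean(pl[0], query))[:k]
    # Steps 4-5: majority vote among the K neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: Category A clusters low, Category B clusters high.
points = [(1, 2), (2, 1), (2, 3), (6, 5), (7, 7), (8, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (3, 3), k=5))  # "A": 3 of the 5 nearest are A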
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
•There is no particular way to determine the best value of "K", so we need to try several values and pick the best among them. The most preferred value for K is 5.
•A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
•Large values for K reduce the effect of noise, but they may run into some difficulties, such as blurring the boundary between categories.
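As a hedged illustration of trying several values, the sketch below scores each candidate K with 5-fold cross-validation on scikit-learn's built-in Iris dataset (the candidate list and dataset are illustrative choices, not from the lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a few odd values of K and report the mean cross-validated accuracy.
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"K={k}: mean accuracy={score:.3f}")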

Advantages of KNN Algorithm:


•It is simple to implement.
•It is robust to noisy training data.
•It can be more effective when the training data is large.

Disadvantages of KNN Algorithm:


•The value of K always needs to be determined, which may sometimes be complex.
•The computation cost is high because the distance to every training sample must be calculated for each prediction.
Python implementation of the KNN algorithm

Steps and source code are shared in another file
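The lecture's shared source file is not reproduced here; as a stand-in, here is a minimal scikit-learn sketch of the same workflow on the built-in Iris dataset (an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # K=5, the slides' preferred default
knn.fit(X_train, y_train)                  # "training" just stores the data
print(knn.score(X_test, y_test))           # accuracy on the held-out set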


Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, used for Classification as well as Regression problems. However, it is primarily used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog; such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. As the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will see the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

Applications of SVM:
• Face detection
• Image classification
• Text categorization
• …
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data; if a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.
The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood using an example. Suppose we have a dataset with two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
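A minimal sketch of a linear SVM, assuming scikit-learn and two illustrative, linearly separable blobs standing in for the green/blue dataset:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points (illustrative data).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear").fit(X, y)

# The support vectors are the points closest to the maximum-margin hyperplane.
print(len(clf.support_vectors_), "support vectors")
print(clf.predict(X[:5]))  # predicted labels for the first few points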
Non-Linear SVM:
If the data is linearly separable, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions, x and y, so for non-linear data we will add a third dimension, z. It can be calculated as:

z = x² + y²

So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in a 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z=1, it becomes a circle of radius 1. Consider the below image:
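A hedged sketch of this lifting trick, assuming scikit-learn: the concentric-circles dataset below stands in for the non-linear data, and the extra z = x² + y² column makes it separable by a plain linear SVM. In practice, a kernel SVM (e.g., with the RBF kernel) performs an equivalent lift implicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in 2-d.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension from the slide: z = x1^2 + x2^2.
z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, z])

# After the lift, a plain linear SVM separates the two classes.
clf = SVC(kernel="linear").fit(X3, y)
print(clf.score(X3, y))  # close to 1.0 on this toy data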
