
Lecture Slides for
Machine Learning and Web Scraping
LP BDB
April 2023
Classification Algorithm in Machine Learning
As we know, Supervised Machine Learning algorithms can be broadly classified into Regression and Classification algorithms. Regression algorithms predict the output for continuous values, but to predict categorical values, we need Classification algorithms.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.

Unlike regression, the output variable of Classification is a category, not a numeric value, such as "Green or Blue" or "fruit or animal". Since the Classification algorithm is a Supervised Learning technique, it takes labeled input data, meaning each input comes with a corresponding output.
In a classification algorithm, a discrete output function y is mapped to the input variable x: y = f(x).
The best-known example of an ML classification algorithm is an email spam detector.
The main goal of the Classification algorithm is to identify the category of a given data point; these algorithms are mainly used to predict the output for categorical data.
Classification algorithms can be better understood using the below diagram. In the diagram, there are two classes, Class A and Class B. Points within a class have features similar to each other and dissimilar to points in the other class.

The algorithm which implements the classification on a dataset is known as a classifier.


There are two types of Classifications:
•Binary Classifier: If the classification problem has only two possible outcomes, it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
•Multi-class Classifier: If a classification problem has more than two outcomes, it is called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.
Learners in Classification Problems:
In classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.

Examples: K-NN algorithm, case-based reasoning

• Eager Learners: Eager learners develop a classification model from the training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction. Examples: Decision Trees, Naïve Bayes, ANN.
Types of ML Classification Algorithms:
Classification algorithms can be mainly divided into two categories:
•Linear Models
• Logistic Regression
• Support Vector Machines
•Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Evaluating a Classification model:
Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification or a Regression model. For evaluating a Classification model, we have the following ways:

Log Loss or Cross-Entropy Loss:


•It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
•For a good binary classification model, the value of log loss should be near 0.
•The value of log loss increases as the predicted probability deviates from the actual label.
•A lower log loss represents higher accuracy of the model.
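As a hedged illustration, the sketch below computes log loss for a handful of made-up labels and predicted probabilities, assuming scikit-learn is available:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0, 1]            # actual binary labels (illustrative)
y_prob = [0.1, 0.9, 0.8, 0.3, 0.6]  # predicted probability of class 1

# Lower is better; a perfect classifier would score 0.
print(log_loss(y_true, y_prob))  # ~0.26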

Confusion Matrix:
•The confusion matrix gives us a matrix/table as output that describes the performance of the model.
•It is also known as the error matrix.
•The matrix summarizes the prediction results, showing the total numbers of correct and incorrect predictions. It looks like the table below:

                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative
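A minimal sketch of producing this matrix with scikit-learn; the example labels are made up, and note that sklearn's convention puts actual classes on the rows (the transpose of the table above):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (illustrative)

# sklearn convention: rows are actual classes, columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[3 1]
                                         #  [1 3]]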
AUC-ROC curve:
•ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area Under the Curve.
•It is a graph that shows the performance of the classification model at different classification thresholds.
•The AUC-ROC curve is also used to visualize the performance of multi-class classification models.
•The ROC curve plots the TPR (True Positive Rate) on the Y-axis against the FPR (False Positive Rate) on the X-axis.
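A minimal sketch of computing the ROC points and the AUC with scikit-learn; the four labels and scores below are illustrative values, not from the lecture:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]            # actual labels (illustrative)
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted scores for class 1

# TPR and FPR at every threshold, plus the single AUC summary number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # 0.75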

Use cases of Classification Algorithms


Classification algorithms can be used in many domains. Below are some popular use cases of classification algorithms:
•Email spam detection
•Speech recognition
•Identification of cancer tumor cells
•Drug classification
•Biometric identification, etc.
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
About KNN
•K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
•The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
•The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
•The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for Classification problems.
•K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
•It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at classification time.
•At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
•Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point belong to? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?


The working of K-NN can be explained with the following algorithm:
•Step-1: Select the number K of neighbors.
•Step-2: Calculate the Euclidean distance from the new data point to the existing data points.
•Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
•Step-4: Among these K neighbors, count the number of data points in each category.
•Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
•Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
•Firstly, we will choose the number of neighbors, so we will choose K=5.
•Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:

d = √((x₂ − x₁)² + (y₂ − y₁)²)

•By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:
•As we can see, the 3 nearest neighbors are from Category A; hence this new data point must belong to Category A.
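These steps translate almost line for line into code. Below is a minimal from-scratch sketch; the sample points, labels, and query point are illustrative assumptions, not data from the slides:

import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points (the formula above)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(points, labels, query, k=5):
    # Steps 2-3: compute distances and keep the K nearest neighbors.
    nearest = sorted(zip(points, labels),
                     key=lambda pl: euclidean(pl[0], query))[:k]
    # Steps 4-5: majority vote among the K neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: Category A clusters low, Category B clusters high.
points = [(1, 2), (2, 1), (2, 3), (6, 5), (7, 7), (8, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (3, 3), k=5))  # "A": 3 of the 5 nearest are A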
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
•There is no particular way to determine the best value of "K", so we need to try several values and pick the best among them. The most preferred value for K is 5.
•A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
•Large values for K reduce the effect of noise, but they may run into some difficulties, such as blurring the boundary between categories.
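As a hedged illustration of trying several values, the sketch below scores each candidate K with 5-fold cross-validation on scikit-learn's built-in Iris dataset (the candidate list and dataset are illustrative choices, not from the lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a few odd values of K and report the mean cross-validated accuracy.
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"K={k}: mean accuracy={score:.3f}")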

Advantages of KNN Algorithm:


•It is simple to implement.
•It is robust to noisy training data.
•It can be more effective when the training data is large.

Disadvantages of KNN Algorithm:


•The value of K always needs to be determined, which may sometimes be complex.
•The computation cost is high because the distance to every training sample must be calculated for each prediction.
Python implementation of the KNN algorithm

Steps and source code are shared in another file
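The lecture's shared source file is not reproduced here; as a stand-in, here is a minimal scikit-learn sketch of the same workflow on the built-in Iris dataset (an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # K=5, the slides' preferred default
knn.fit(X_train, y_train)                  # "training" just stores the data
print(knn.score(X_test, y_test))           # accuracy on the held-out set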


Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, used for Classification as well as Regression problems. However, it is primarily used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog; such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. As the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will see the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

Applications of SVM:
• Face detection
• Image classification
• Text categorization
• …
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data; if a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.
The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood using an example. Suppose we have a dataset with two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the line. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
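A minimal sketch of a linear SVM, assuming scikit-learn and two illustrative, linearly separable blobs standing in for the green/blue dataset:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points (illustrative data).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear").fit(X, y)

# The support vectors are the points closest to the maximum-margin hyperplane.
print(len(clf.support_vectors_), "support vectors")
print(clf.predict(X[:5]))  # predicted labels for the first few points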
Non-Linear SVM:
If the data is linearly separable, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions, x and y, so for non-linear data we will add a third dimension, z. It can be calculated as:

z = x² + y²

So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in a 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z=1, it becomes a circle of radius 1. Consider the below image:
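A hedged sketch of this lifting trick, assuming scikit-learn: the concentric-circles dataset below stands in for the non-linear data, and the extra z = x² + y² column makes it separable by a plain linear SVM. In practice, a kernel SVM (e.g., with the RBF kernel) performs an equivalent lift implicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in 2-d.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the third dimension from the slide: z = x1^2 + x2^2.
z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, z])

# After the lift, a plain linear SVM separates the two classes.
clf = SVC(kernel="linear").fit(X3, y)
print(clf.score(X3, y))  # close to 1.0 on this toy data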
