Unit - II
Supervised Learning
(Classification/Regression)
Distance-based methods:
Distance-based algorithms are machine learning algorithms that classify queries by
computing distances between these queries and a number of internally stored exemplars.
Exemplars that are closest to the query have the largest influence on the classification
assigned to the query.
→ K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion detection.
→ K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
→ Nearest-neighbour classification is the classic distance-based supervised learning method: the label y for a query x ∈ R^D is the label of its nearest neighbour in the training data. This special case is known as one-nearest-neighbour (1-NN), and Euclidean distance is typically used to find the nearest neighbour.
→ It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data.
→ The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
→ K-NN algorithm stores all the available data and classifies a new data point based on the
similarity.
→ This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
→ K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
→ K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
→ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it only at classification time.
→ KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance between the new data point and every training point.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Step-6: Our model is ready.
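As a concrete illustration of these steps, here is a minimal sketch of K-NN in plain Python/NumPy; the feature values and labels below are made-up toy data, not from the text:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step-2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step-3: indices of the K nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step-4/5: majority vote among the K neighbours decides the category
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: two features, two categories 'A' and 'B' (made-up values)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                    [6.0, 9.0], [1.2, 0.5], [5.5, 8.5]])
y_train = np.array(['A', 'A', 'B', 'B', 'A', 'B'])

print(knn_predict(X_train, y_train, np.array([1.4, 1.6]), k=3))  # expected: 'A'
```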
→ Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
→ Firstly, we will choose the number of neighbours, so we will choose k = 5.
→ Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. Between two points (x1, y1) and (x2, y2) it can be calculated as:

d = √[(x2 − x1)² + (y2 − y1)²]
→ By calculating the Euclidean distance, we get the nearest neighbours: three nearest neighbours in category A and two nearest neighbours in category B. Consider the below image:
→ As we can see, the majority of the 5 nearest neighbours (3 of them) are from category A; hence this new data point must belong to category A.
→ There is no particular way to determine the best value for "K", so we need to try some values to find the best among them. The most preferred value for K is 5.
→ A very low value of K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
→ Large values of K reduce noise, but they can smooth over genuine local patterns and increase the computation required.
→ The computation cost is high because the distance to every training sample must be calculated for each query.

Decision Tree Classification Algorithm:
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o The below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Need for Decision Trees:
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
> Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
> Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
> Step-3: Divide S into subsets that contain possible values for the best attribute.
> Step-4: Generate the decision tree node, which contains the best attribute.
> Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes.
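In practice, the CART procedure described above is available ready-made. Here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the job-offer features, values and labels are hypothetical, chosen only to mirror the example that follows:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical job-offer data: [salary_in_lakhs, distance_km, cab_facility(0/1)]
X = [[12, 5, 1], [15, 30, 0], [8, 10, 1], [20, 25, 1], [7, 40, 0], [18, 8, 0]]
y = ["Accept", "Decline", "Decline", "Accept", "Decline", "Accept"]  # made-up labels

# criterion="gini" corresponds to the CART algorithm mentioned above
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned decision rules as text (root node, branches, leaves)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[16, 12, 1]]))  # classify a new candidate's offer
```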
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
➔ Information Gain
➔ Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of
a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:
Where,
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σj (Pj)²
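To make the two measures concrete, here is a small sketch (with made-up class counts) that computes entropy, the information gain of a candidate split, and the Gini index:

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum p_i * log2(p_i) over the classes present
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    # Gini Index = 1 - sum p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def information_gain(parent, subsets):
    # IG = Entropy(parent) - weighted average entropy of the subsets
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy labels: a parent node split into two children (made-up values)
parent = ["yes"] * 9 + ["no"] * 5
left = ["yes"] * 6 + ["no"] * 1
right = ["yes"] * 3 + ["no"] * 4

print(round(entropy(parent), 3), round(gini(parent), 3))        # 0.94 0.459
print(round(information_gain(parent, [left, right]), 3))        # ~0.151
```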
Pruning:
Pruning is a process of deleting unnecessary nodes from a tree in order to obtain the optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is therefore known as Pruning. There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning
Naïve Bayes Classifier Algorithm:
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and corresponding target variable "Play".
So using this dataset we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:

Weather    Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total      10     4

Likelihood table of the weather conditions:

Weather    No           Yes
Overcast   0            5            5/14 = 0.35
Rainy      2            2            4/14 = 0.29
Sunny      2            3            5/14 = 0.35
All        4/14 = 0.29  10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 5/14 = 0.35
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.30 × 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 4/14 = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 × 0.29 / 0.35 = 0.41

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny). Hence, on a sunny day, the player can play the game.
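The same computation can be reproduced in code. Below is a minimal sketch that encodes the dataset above and applies Bayes' theorem directly (exact fractions give 0.6 and 0.4; the 0.41 above comes from rounding intermediate values):

```python
from collections import Counter

# Dataset from the table above: (outlook, play)
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"),  ("Rainy", "Yes"), ("Sunny", "Yes"),   ("Overcast", "Yes"),
        ("Rainy", "No"),  ("Sunny", "No"),  ("Sunny", "Yes"),   ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
play_counts = Counter(play for _, play in data)   # counts for Yes and No
joint = Counter(data)                             # counts of (outlook, play) pairs

def posterior(play, outlook="Sunny"):
    prior = play_counts[play] / n                            # P(play)
    likelihood = joint[(outlook, play)] / play_counts[play]  # P(outlook|play)
    evidence = sum(1 for o, _ in data if o == outlook) / n   # P(outlook)
    return likelihood * prior / evidence                     # Bayes' theorem

print(round(posterior("Yes"), 2), round(posterior("No"), 2))  # 0.6 0.4
```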
Logistic Regression:
o Logistic Regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image shows the logistic function:
Note: Logistic regression uses the concept of predictive modelling, like regression, which is why it is called logistic regression; however, because it is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output value of logistic regression must be between 0 and 1; it cannot go beyond this limit, so it forms a curve like the letter "S". This S-shaped curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between the outputs 0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0.
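A minimal sketch of the sigmoid and the threshold rule follows; the 0.5 cut-off used here is the conventional default, an assumption rather than something fixed by the text:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1), giving the S-shaped curve
    return 1.0 / (1.0 + np.exp(-z))

def classify(z, threshold=0.5):
    # Values at or above the threshold map to class 1, below it to class 0
    return (sigmoid(z) >= threshold).astype(int)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(z).round(3))  # [0.018 0.269 0.5 0.731 0.982]
print(classify(z))          # [0 0 1 1 1]
```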
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 − y):

y / (1 − y)   ; 0 for y = 0, and infinity for y = 1

o But we need a range between −[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:

log [ y / (1 − y) ] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
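As a concrete illustration of the binomial case, here is a minimal sketch using scikit-learn; the single feature (hours studied) and the pass/fail labels are made-up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied -> pass (1) / fail (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(fail), P(pass)]; predict applies the 0.5 threshold
print(model.predict_proba([[2.2]]).round(3))
print(model.predict([[2.2]]))
```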
Linear Regression:
➔ Linear Regression is a machine learning algorithm based on supervised learning. It
performs a regression task.
➔ Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting. Different regression models differ based on the kind of relationship between the dependent and independent variables they consider, and on the number of independent variables used.
➔ There are many names for a regression's dependent variable. It may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.
➔ Linear regression is used in many different fields, including finance, economics, and
psychology, to understand and predict the behaviour of a particular variable.
➔ For example, in finance, linear regression might be used to understand the
relationship between a company’s stock price and its earnings, or to predict the
future value of a currency based on its past performance.
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name Linear Regression. In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.

Hypothesis function for Linear Regression:

y = θ1 + θ2 x
➔ While training the model we are given x, the input training data (univariate – one input variable/parameter), and y, the labels for the data (supervised learning). When training, the model fits the best line to predict the value of y for a given value of x.
➔ The model gets the best regression fit line by finding the best θ1 and θ2 values, where θ1 is the intercept and θ2 is the coefficient of x. Once we find the best θ1 and θ2 values, we get the best-fit line.
➔ So when we are finally using our model for prediction, it will predict the value
of y for the input value of x.
➔ The θ1 and θ2 values are updated iteratively (for example, by gradient descent, discussed below) to reach the best-fit line. Linear regression is a powerful tool for understanding and predicting the behaviour of a variable, but it has some limitations.
➔ One limitation is that it assumes a linear relationship between the
independent variables and the dependent variable, which may not always be
the case. In addition, linear regression is sensitive to outliers, or data points
that are significantly different from the rest of the data.
➔ These outliers can have a disproportionate effect on the fitted line, leading to
inaccurate predictions.
->Linear regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc.
-> Linear regression is one of the easiest and most popular Machine Learning algorithms. It
is a statistical method that is used for predictive analysis.
-> The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression.
-> Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
->The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value)
ε = Random error
The values for x and y variables are training datasets for Linear Regression model
representation.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Multiple Linear Regression.
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.
When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to do this we use a cost function.
Cost function-
o The cost function is used to estimate the values of the coefficients (a0, a1) for the best-fit line.
o The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:

MSE = (1/N) Σi (Yi − (a1xi + a0))²

Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high and so will the cost function. If the scatter points are close to the regression line, the residuals will be small and hence the cost function will be small as well.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting initial values for the coefficients and then iteratively updating those values to reach the minimum of the cost function.
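A minimal sketch of gradient descent for the two coefficients a0 and a1, minimizing the MSE defined above; the training data and learning rate are made-up:

```python
import numpy as np

# Made-up training data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

a0, a1 = 0.0, 0.0   # initial coefficient values
lr = 0.01           # learning rate (step size)

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # Gradients of MSE = (1/N) * sum((a1*x + a0 - y)^2) w.r.t. a0 and a1
    grad_a0 = 2.0 * error.mean()
    grad_a1 = 2.0 * (error * x).mean()
    a0 -= lr * grad_a0   # move against the gradient
    a1 -= lr * grad_a1

print(round(a0, 3), round(a1, 3))  # approx. intercept and slope of the best-fit line
```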
Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the below method:
1. R-squared method:
R-squared is a statistical measure of goodness of fit, also called the coefficient of determination. It represents the proportion of the variance in the dependent variable explained by the model, on a scale of 0 to 1; the higher the R-squared, the better the model fits the data.
Assumptions of Linear Regression:
Below are some important assumptions of Linear Regression. These are formal checks to perform while building a Linear Regression model, which ensure we get the best possible result from the given dataset.
o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity assumption:
Homoscedasticity is a situation in which the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow the normal distribution. If the error terms are not normally distributed, then the confidence intervals will become either too wide or too narrow, which may cause difficulties in finding the coefficients. This can be checked using a q-q plot: if the plot shows a straight line without any deviation, the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between the residual errors.
Linear Regression vs. Logistic Regression:
-> Linear Regression and Logistic Regression are two famous Machine Learning algorithms which come under the supervised learning technique.
-> Since both algorithms are supervised in nature, they use labelled datasets to make predictions.
-> But the main difference between them is how they are being used. The Linear Regression
is used for solving Regression problems whereas Logistic Regression is used for solving the
Classification problems.
-> The description of both algorithms is given below, along with a difference table.
Linear Regression:
o Linear Regression is one of the simplest Machine Learning algorithms; it comes under the supervised learning technique and is used for solving regression problems.
o It is used for predicting the continuous dependent variable with the help of
independent variables.
o The goal of the Linear regression is to find the best fit line that can accurately predict
the output for the continuous dependent variable.
o If a single independent variable is used for prediction, then it is called Simple Linear Regression, and if more than one independent variable is used, such regression is called Multiple Linear Regression.
o By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variable(s), and this relationship should be linear in nature.
o The output for Linear regression should only be the continuous values such as price,
age, salary, etc. The relationship between the dependent variable and independent
variable can be shown in below image:
In the above image, the dependent variable is on the Y-axis (salary) and the independent variable is on the X-axis (experience). The regression line can be written as:

y = a0 + a1x + ε
Logistic Regression:
o Logistic regression is one of the most popular Machine Learning algorithms that comes under the supervised learning technique.
o It can be used for Classification as well as for Regression problems, but mainly used for
Classification problems.
o Logistic regression is used to predict the categorical dependent variable with the help
of independent variables.
o The output of a Logistic Regression problem can only be between 0 and 1.
o Logistic regression can be used where probabilities between two classes are required, such as whether it will rain today or not: either 0 or 1, true or false, etc.
o Logistic regression is based on the concept of Maximum Likelihood estimation.
According to this estimation, the observed data should be most probable.
o In logistic regression, we pass the weighted sum of inputs through an activation function that maps values between 0 and 1. This activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve. Consider the below image:
o The equation for logistic regression is:

log [ y / (1 − y) ] = b0 + b1x1 + b2x2 + ... + bnxn
Difference between Linear Regression and Logistic Regression:

o Linear Regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.
o Linear Regression is used for solving regression problems. | Logistic Regression is used for solving classification problems.
o In Linear Regression, we predict the value of continuous variables. | In Logistic Regression, we predict the values of categorical variables.
o In Linear Regression, we find the best-fit line, by which we can easily predict the output. | In Logistic Regression, we find the S-curve, by which we can classify the samples.
o The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
o The output of Linear Regression must be a continuous value, such as price, age, etc. | The output of Logistic Regression must be a categorical value, such as 0 or 1.
o In Linear Regression, the relationship between the dependent and independent variables is required to be linear. | In Logistic Regression, a linear relationship between the dependent and independent variables is not required.
o In Linear Regression, there may be collinearity between the independent variables. | In Logistic Regression, there should not be collinearity between the independent variables.

Applications of Decision Trees:
1. Select a flight to travel: Decision trees are very good at classification and hence can
be used to select which flight would yield the best “bang-for-the-buck”. There are a
lot of parameters to consider, such as if the flight is connecting or non-stop, or how
reliable is the service record of the given airliner, etc.
2. Selecting alternative products: Often in companies, it is important to determine
which product will be more profitable at launch. Given the sales attributes such as
market conditions, competition, price, availability of raw materials, demand, etc. a
Decision Tree classifier can be used to accurately determine which of the products
would maximize the profits.
3. Sentiment Analysis: Sentiment Analysis is the determination of the overall opinion
of a given piece of text and is especially used to determine if the writer’s comment
towards a given product/service is positive, neutral or negative. Decision trees are
very versatile classifiers and are used for sentiment analysis in many Natural
Language Processing (NLP) applications.
4. Energy Consumption: It is very important for electricity supply boards to correctly
predict the amount of energy consumption in the near future for a particular region.
This is to make sure that un-used power can be diverted towards an area with a
higher demand to keep a regular and uninterrupted supply of power throughout the
grid. Decision Trees are often used to determine which region is expected to require
more or less power in the up-coming time-frame.
5. Fault Diagnosis: In the engineering domain, one of the widely used applications of decision trees is the determination of faults.
Advantages of Decision Trees:
3. Easy to use: Decision Trees are among the simplest, yet most versatile, algorithms in Machine Learning. They are based on simple math and no complex formulas, and they are easy to visualize, understand and explain.
4. Versatile: A lot of business problems can be solved using Decision Trees. They find
their applications in the field of Engineering, Management, Medicine, etc. basically,
any situation where data is available and a decision needs to be taken in uncertain
conditions.
5. Resistant to data abnormalities: Data is never perfect, and there are always abnormalities in a dataset; some of the most common are outliers, missing data and noise. Decision Trees handle such abnormalities comparatively well.
6. Visualization of the decision taken: Often in Machine Learning models, data scientists struggle to explain why a certain model gives a certain set of outputs. Decision Trees make this reasoning visible, since every prediction can be traced along the path from the root node to a leaf.
Generalized Linear Models (GLMs):
Generalized linear models (GLMs) explain how Linear Regression and Logistic Regression are members of a much broader class of models.
-> GLMs can be used to construct the models for regression and classification problems by
using the type of distribution which best describes the data or labels given for training the
model.
-> Below are some types of datasets and the corresponding distributions which help us in constructing the model for a particular type of data (the term data here refers to the output data, or the labels of the dataset):
1. Real-valued data – Gaussian distribution
2. Binary classification data – Bernoulli distribution
3. Count data – Poisson distribution
To understand GLMs we will begin by defining exponential families. Exponential families are a class of distributions whose probability density function (PDF) can be moulded into the following form:

p(y; η) = b(y) exp(η^T T(y) − a(η))

where η is the natural parameter, T(y) is the sufficient statistic, a(η) is the log-partition function, and b(y) is the base measure.

Note: As mentioned above, the fact that φ (which is the same as the activation or sigmoid function for Logistic Regression) appears here is not a coincidence.
Support Vector Machine Algorithm:
➔ Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
➔ The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
➔ SVM chooses the extreme points/vectors that help in creating the hyperplane.
➔ These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
-> We will first train our model with lots of images of cats and dogs so that it can learn about their different features, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), so it will see the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
➔ SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Support Vectors:
➔ The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
Since this is a 2-D space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
-> The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.
-> The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
-> The hyperplane with the maximum margin is called the optimal hyperplane.
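A minimal sketch of a linear SVM on two made-up, linearly separable clusters, using scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up clusters: class 0 around (1, 1), class 1 around (5, 5)
X = np.array([[1, 1], [1.5, 2], [2, 1], [5, 5], [5.5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # maximum-margin linear hyperplane
clf.fit(X, y)

print(clf.support_vectors_)            # the extreme points defining the margin
print(clf.predict([[2, 2], [5, 4]]))   # classify new points
```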
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
➔ Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle:
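The following sketch illustrates this trick with made-up points on two concentric rings: in (x, y) no straight line separates them, but after adding z = x² + y² a simple threshold on z does:

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 20)

# Made-up data: inner ring (class 0, radius ~1) and outer ring (class 1, radius ~3)
inner = np.c_[np.cos(angles[:10]), np.sin(angles[:10])]
outer = 3 * np.c_[np.cos(angles[10:]), np.sin(angles[10:])]

# Third dimension z = x^2 + y^2 (the squared distance from the origin)
z_inner = (inner ** 2).sum(axis=1)  # ~1 for every inner point
z_outer = (outer ** 2).sum(axis=1)  # ~9 for every outer point

# In the new dimension a single threshold (e.g. z = 4) separates the classes
print(z_inner.max() < 4 < z_outer.min())  # True
```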
Binary Classification:
It is a classification task in which the given data is classified into two classes; it is basically a prediction of which of two groups an item belongs to.
➔ In multi-output classification, the model gives two or more outputs after making any prediction, whereas in other types of classification the model usually predicts only a single output.
➔ In multi-label classification, zero or more labels are required as output for each input sample, and the outputs are required simultaneously. The assumption is that the output labels are a function of the inputs.
➔ Let us suppose, two emails are sent to you, one is sent by an insurance company that
keeps sending their ads, and the other is from your bank regarding your credit card bill.
The email service provider will classify the two emails, the first one will be sent to the
spam folder and the second one will be kept in the primary one.
-> This process is known as binary classification, as there are two discrete classes: one is spam and the other is primary. So, this is a problem of binary classification.
-> Binary classification uses certain algorithms to do the task; some of the most common algorithms used for binary classification are listed below.

Binary vs Multiclass Classification:

Binary classification — popular algorithms include:
• k-Nearest Neighbors
• Naive Bayes

Examples of binary classification include:
• Email spam detection (spam or not)
• Churn prediction (churn or not)
• Conversion prediction (buy or not)

Multi-class classification — popular algorithms include:
• Random Forest

Examples of multi-class classification include:
• Face classification
• Plant species classification
• Optical character recognition
->Multilabel classification is an important subfield of structured output prediction where
multiple labels must be assigned that respect semantic relationships such as subsumption,
mutual exclusion or weak forms of correlation.
MNIST:
> The MNIST database (Modified National Institute of Standards and Technology
database) is a large database of handwritten digits that is commonly used for
training various image processing systems.
> The database is also widely used for training and testing in the field of machine
learning.
➔ MNIST provides a baseline for testing image processing systems; you could consider it the "hello world" of machine learning. Data scientists will train an algorithm on the MNIST dataset simply to test a new architecture or framework, to ensure that it works.
➔ MNIST is an acronym for the Modified National Institute of Standards and Technology dataset.
➔ It is a dataset of 60,000 small square 28×28 pixel grayscale training images of handwritten single digits between 0 and 9, along with a test set of 10,000 images.
➔ A simple benchmark model for MNIST consists of a convolutional layer with a max-pooling layer, applied twice, followed by two fully connected layers with a softmax output over the ten classes. After training for 30 epochs, such a model reached a training accuracy of 99.98% and a dev-set accuracy of 99.05%.
➔ It was created by "re-mixing" the samples from NIST's original datasets. The creators
felt that since NIST's training dataset was taken from
American CensusBureau employees, while the testing dataset was taken
from American high school students, it was not well-suited for machine learning
experiments.
➔ Furthermore, the black and white images from NIST were normalized to fit into a
28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
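As a quick way to experiment, the dataset can be fetched directly. Here is a minimal sketch using scikit-learn's OpenML loader (network access is assumed for the first download):

```python
from sklearn.datasets import fetch_openml

# Downloads all 70,000 MNIST images as 784-dimensional rows (28x28, flattened)
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

print(X.shape, y.shape)                   # (70000, 784) (70000,)
print(X[0].reshape(28, 28).shape, y[0])   # one digit image and its label
```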
Ranking:
➔ Ranking is a machine learning technique to rank items. Ranking is useful for many
applications in information retrieval such as e-commerce, social networks,
recommendation systems, and so on. For example, a user searches for an article or an
item to buy online.
➔ The ranking method is also a method of performance appraisal. It is the oldest and most conventional form of appraisal method: all employees are compared on the basis of their worth and ranked from best to worst.
➔ In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, if the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively.
➔ In this method, one employee is compared to another employee. The end result is an
ordering of employees from best to worst.
➔ For example, in a group of 'n' employees, performance of employee-1 is compared
with performance of 'n-1' employees. Performance of employee-2 is compared with
performance of 'n-1' employees.
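A minimal sketch of the statistical rank transformation described above, using scipy's standard rankdata helper:

```python
from scipy.stats import rankdata

data = [3.4, 5.1, 2.6, 7.3]        # the example values from the text
print(rankdata(data).astype(int))  # [2 3 1 4]
```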
Applications:
-> Search Engines — Given a user profile (location, age, sex, …) and a textual query, sort web pages by their relevance to that query and user.
-> Recommender Systems — Given a user profile and purchase history, sort the other items by how likely the user is to be interested in them.
-> Travel Agencies — Given a user profile and filters (check-in/check-out dates, number of travellers, and so on), sort the available offers by how well they match the user's requirements.