Unit-I

Supervised Machine Learning

Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict the
output. Labelled data means that some input data is already tagged with the correct
output.

In supervised learning, the training data provided to the machines works as a supervisor
that teaches the machines to predict the output correctly. It applies the same concept as
a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.

Difference between Regression and Classification

o In Regression, the output variable must be of a continuous nature or a real value; in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to the continuous output variable (y); the task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
o Regression algorithms are used with continuous data; classification algorithms are used with discrete data.
o In Regression, we try to find the best fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
o Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction; classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
o Regression algorithms can be further divided into Linear and Non-linear Regression; classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
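To illustrate the distinction in the comparison above, here is a minimal sketch in Python with scikit-learn: a regressor predicting a continuous value and a classifier predicting a discrete label. The tiny datasets and their values are made up purely for illustration.

# Contrast a regressor (continuous output) and a classifier (discrete output).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., a price) from one feature.
X_reg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([1.5, 3.1, 4.4, 6.2])            # continuous targets
regressor = LinearRegression().fit(X_reg, y_reg)
print(regressor.predict([[5.0]]))                 # a real-valued prediction

# Classification: predict a discrete label (e.g., spam / not spam).
X_clf = np.array([[0.2], [0.4], [1.8], [2.5]])
y_clf = np.array([0, 0, 1, 1])                    # discrete class labels
classifier = LogisticRegression().fit(X_clf, y_clf)
print(classifier.predict([[1.0]]))                # a class label, 0 or 1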
Regression:

Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting continuous variables, such as the prediction of market
trends, the prediction of house prices, etc.

The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the
Regression algorithm. In weather prediction, the model is trained on the past data, and
once the training is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want to
predict or understand is called the dependent variable. It is also called the target
variable.
o Independent Variable: The factors which affect the dependent variable, or
which are used to predict the values of the dependent variable, are called
independent variables, also known as predictors.
o Outliers: An outlier is an observation with either a very low or a very high
value in comparison to the other observed values. An outlier may distort the
result, so it should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each
other, then such a condition is called multicollinearity. It should not be present
in the dataset, because it creates problems when ranking the most influential
variables.
o Underfitting and Overfitting: If our algorithm works well with the training
dataset but not with the test dataset, the problem is called overfitting.
And if our algorithm does not perform well even with the training dataset, the
problem is called underfitting.

Types of regression:

There are various types of regression which are used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all
the regression methods analyze the effect of the independent variables on the dependent
variable. Here we discuss some important types of regression, which are given
below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the simplest and easiest algorithms; it works on regression and
shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence it is called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between the variables in a linear regression model can be
illustrated by a plot in which, for example, we predict the salary of an
employee on the basis of years of experience.
The mathematical equation for linear regression is:
Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
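As a brief illustration of Y = aX + b, here is a minimal sketch in Python using scikit-learn; the years-of-experience and salary numbers are made up purely for illustration.

# Simple linear regression: one predictor, one continuous target.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])     # years of experience (independent)
y = np.array([30, 35, 41, 46, 52])          # salary in thousands (dependent)

model = LinearRegression().fit(X, y)
print("a (slope):", model.coef_[0])
print("b (intercept):", model.intercept_)
print("predicted salary for 6 years:", model.predict([[6]])[0])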

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Polynomial regression:
Polynomial Regression is a regression algorithm that models the relationship between
a dependent variable (y) and an independent variable (x) as an nth-degree polynomial:
y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n

o It is also called a special case of Multiple Linear Regression in ML, because
we add some polynomial terms to the Multiple Linear Regression equation to
convert it into Polynomial Regression.
o It is a linear model with some modifications made in order to increase the accuracy.
o The dataset used in Polynomial Regression for training is of a non-linear nature.
o It makes use of a linear regression model to fit complicated, non-linear
functions and datasets.
o Hence, "In Polynomial Regression, the original features are converted into
polynomial features of the required degree (2, 3, ..., n) and then modelled using a
linear model." A brief sketch of this follows the list.
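A minimal sketch of this idea in Python, assuming a small made-up, roughly quadratic dataset: PolynomialFeatures expands x into polynomial features of the chosen degree, and a plain linear model is then fitted on those features.

# Polynomial regression = polynomial feature expansion + linear regression.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1, 36.2])   # roughly quadratic in x

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[7]]))   # prediction from the degree-2 polynomial fit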

Multiple Linear Regression

In the previous topic, we learned about Simple Linear Regression, where a single
independent/predictor variable (X) is used to model the response variable (Y). But there
may be various cases in which the response variable is affected by more than one
predictor variable; for such cases, the Multiple Linear Regression algorithm is used.
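As a rough sketch, multiple linear regression with two made-up predictors (for example, house size and number of rooms; the numbers are purely illustrative) looks like this in Python:

# Multiple linear regression: several predictors, one continuous response.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds two predictor values, e.g. [size, number_of_rooms].
X = np.array([[50, 1], [70, 2], [90, 2], [120, 3]])
y = np.array([150, 200, 240, 310])          # response, e.g. house price

model = LinearRegression().fit(X, y)
print(model.coef_)                          # one coefficient per predictor
print(model.intercept_)
print(model.predict([[100, 2]]))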

Overfitting and Underfitting in Machine Learning:

Overfitting and Underfitting are the two main problems that occur in machine learning
and degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well.

Here, generalization defines the ability of an ML model to provide a suitable output
when adapting to a given set of unknown inputs. It means that after being trained on the
dataset, the model can produce reliable and accurate output. Hence, underfitting and
overfitting are the two terms that need to be checked to see whether the model is
performing and generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms
that will help to understand this topic well:

o Signal: It refers to the true underlying pattern of the data that helps the
machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of
the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference
between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the training
dataset, but does not perform well with the test dataset, then variance occurs.

Overfitting:

Overfitting occurs when our machine learning model tries to cover all the data points,
or more than the required data points, present in the given dataset. Because of this, the
model starts capturing the noise and inaccurate values present in the dataset, and all these
factors reduce the efficiency and accuracy of the model. An overfitted model has low
bias and high variance.

The chances of overfitting increase the more training we provide to our model: the more
we train our model, the higher the chances of ending up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of overfitting can be understood from the graph of the
linear regression output below.

How to avoid Overfitting in the Model

Both overfitting and underfitting degrade the performance of the machine learning
model. But the main problem is overfitting, so there are several ways by which we can
reduce its occurrence in our model (a short cross-validation sketch follows the list below).

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
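As a minimal sketch of the first technique above, k-fold cross-validation estimates how well a model generalizes by repeatedly training on part of the data and scoring on the held-out part. The synthetic dataset below is made up purely for illustration.

# 5-fold cross-validation of a linear regression model on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5)   # R^2 per fold
print(scores.mean())   # average validation score across the folds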

Underfitting

Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting in the model, the feeding of training
data can be stopped at an early stage, due to which the model may not learn enough
from the training data. As a result, it may fail to find the best fit of the dominant trend
in the data.

In the case of underfitting, the model is not able to learn enough from the training data,
and hence its accuracy is reduced and it produces unreliable predictions.
An underfitted model has high bias and low variance.

Example: We can understand underfitting using the output of the linear regression
model below:

How to avoid underfitting:

o By increasing the training time of the model.
o By increasing the number of features.

Least Squares Regression Method:

The least-squares regression method is a technique commonly used in Regression
Analysis. It is a mathematical method used to find the best fit line that represents
the relationship between an independent and a dependent variable.

To understand the least-squares regression method, let's get familiar with the concepts
involved in formulating the line of best fit.
Line of Best Fit:

The line of best fit is drawn to represent the relationship between two or more variables.
To be more specific, the best fit line is drawn across a scatter plot of data points in order
to represent the relationship between those data points.

If we were to plot the best fit line that depicts the sales of a company over a
period of time, it would look something like this:

Notice that the line is as close as possible to all the scattered data points. This is what
an ideal best fit line looks like.

Steps to calculate the Line of Best Fit

To start constructing the line that best depicts the relationship between the variables in the
data, we first need to get our basics right. Take a look at the equation below:

y = mx + c

Surely, you’ve come across this equation before. It is a simple equation that represents
a straight line in two-dimensional data, i.e. along the x-axis and y-axis. To better understand
this, let’s break down the equation:

• y: dependent variable
• m: the slope of the line
• x: independent variable
• c: y-intercept

So the aim is to calculate the values of the slope and the y-intercept, and substitute the
corresponding ‘x’ values into the equation in order to derive the value of the dependent
variable.
Let’s see how this can be done.

As an assumption, let’s consider that there are ‘n’ data points.

Step 1: Calculate the slope ‘m’ by using the following formula:

m = (n Σxy − Σx Σy) / (n Σx^2 − (Σx)^2)

Step 2: Compute the y-intercept c (the value of y at the point where the line crosses the
y-axis):

c = (Σy − m Σx) / n

Step 3: Substitute the values of m and c into the final equation y = mx + c.

Simple, isn’t it?
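As a rough sketch, the three steps above can be implemented directly in Python with NumPy; the x and y values below are made up purely for illustration.

# Least-squares slope and intercept computed from the formulas above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(x)

# Step 1: slope m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)

# Step 2: intercept c = (sum(y) - m*sum(x)) / n
c = (np.sum(y) - m * np.sum(x)) / n

# Step 3: substitute into y = mx + c to predict for a new x
print(m, c, m * 6.0 + c)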

Now let’s look at an example and see how you can use the least-squares regression
method to compute the line of best fit.

Least Squares Regression Example:

Consider an example. Tom, who is the owner of a retail shop, recorded the price of different
T-shirts versus the number of T-shirts sold at his shop over a period of one week.

He tabulated this as shown below:


Let us use the concept of least squares regression to find the line of best fit for the above
data.

Step 1: Calculate the slope ‘m’ by using the slope formula above.

After you substitute the respective values, m = 1.518 approximately.

Step 2: Compute the y-intercept value

After you substitute the respective values, c = 0.305 approximately.

Step 3: Substitute the values into the final equation.

Once you substitute the values, it should look something like this:

y = 1.518x + 0.305

Let’s construct a graph that represents this y = mx + c line of best fit:
Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can
sell at the retail shop.

y = 1.518 × 8 + 0.305 = 12.45 T-shirts

This comes to about 13 T-shirts! That’s how simple it is to make predictions using
Linear Regression.

Now let’s try to understand on what basis we can confirm that the above line is
the line of best fit.

The least squares regression method works by making the sum of the squares of the
errors as small as possible, hence the name least squares. Basically, the distance between
the line of best fit and each data point (the error) must be minimized as much as possible.
This is the basic idea behind the least squares regression method.

A few things to keep in mind before implementing the least squares regression method:

• The data must be free of outliers, because they might lead to a biased and
erroneous line of best fit.
• The line of best fit can be drawn iteratively until you get a line with the minimum
possible sum of squared errors.
• This method works well even with non-linear data.
• Technically, the difference between the actual value of ‘y’ and the predicted
value of ‘y’ is called the residual (it denotes the error).
Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression, in which
a small amount of bias is introduced so that we can get better long-term
predictions.
o The amount of bias added to the model is known as the ridge regression penalty.
We can compute this penalty term by multiplying lambda by the squared
weight of each individual feature.
o The cost function for ridge regression will be:

Cost = Σ(y − ŷ)^2 + λ Σ b^2, i.e. the sum of squared errors plus λ times the sum of the squared coefficients.

o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables; to solve such problems, ridge regression
can be used.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples. A short
sketch follows the list.
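A minimal sketch of ridge regression in Python with scikit-learn; the alpha parameter plays the role of lambda in the penalty term, and the small dataset with correlated features is made up purely for illustration.

# Ridge regression: L2 penalty shrinks coefficients toward zero.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1, 2], [2, 4.1], [3, 5.9], [4, 8.2]])   # highly correlated features
y = np.array([3, 6, 9, 12])

model = Ridge(alpha=1.0).fit(X, y)   # larger alpha => stronger L2 penalty
print(model.coef_)                   # coefficients are shrunk toward zero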

Lasso Regression:

o Lasso regression is another regularization technique used to reduce the complexity of
the model.
o It is similar to Ridge Regression, except that the penalty term contains only the
absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The cost function for lasso regression will be
Cost = Σ(y − ŷ)^2 + λ Σ |b|, i.e. the sum of squared errors plus λ times the sum of the
absolute values of the coefficients. A short sketch follows the list.
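A minimal sketch of lasso regression in Python with scikit-learn. As above, alpha corresponds to lambda; the synthetic data (with only two informative features) is made up to show that some coefficients can become exactly zero.

# Lasso regression: L1 penalty can drive some coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)  # 2 informative features

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # coefficients of the uninformative features shrink to 0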

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are formal
checks to perform while building a Linear Regression model, which ensure that we get
the best possible result from the given dataset.
o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables.
Due to multicollinearity, it may be difficult to find the true relationship between the
predictors and the target variable. Or, we can say, it is difficult to determine which
predictor variable is affecting the target variable and which is not. So, the model
assumes either little or no multicollinearity between the features or independent
variables.

o Homoscedasticity Assumption:
Homoscedasticity is the situation in which the variance of the error term is the same
for all values of the independent variables. With homoscedasticity, there should be no
clear pattern in the distribution of the data in the scatter plot.
o Normal distribution of error terms:

Linear regression assumes that the error terms should follow the normal
distribution pattern. If the error terms are not normally distributed, then the confidence
intervals will become either too wide or too narrow, which may cause
difficulties in finding the coefficients.
This can be checked using a q-q plot: if the plot shows a straight line without
any deviation, the errors are normally distributed (a short residual-check sketch
follows this list).

o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there
is any correlation in the error terms, it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between the residual errors.
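As a rough sketch, two of the assumptions above can be checked on the residuals of a fitted linear model: normality via a q-q plot, and homoscedasticity via a residuals-vs-fitted-values plot. The synthetic dataset is made up purely for illustration, and this is only one possible way to perform the checks.

# Check residual normality (q-q plot) and homoscedasticity (residuals vs fitted).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Q-Q plot: points close to a straight line suggest normally distributed errors.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Residuals vs. fitted values: no clear pattern suggests homoscedasticity.
plt.scatter(model.predict(X), residuals)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()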
