DA Unit-3


UNIT-III

Regression
Regression Concepts:
Regression analysis is a form of predictive modelling technique which investigates
the relationship between a dependent (target) variable and one or more independent
(predictor) variables. This technique is used for forecasting, time-series modelling
and finding the causal-effect relationship between variables. For example, the
relationship between rash driving and the number of road accidents by a driver is
best studied through regression.
• Dependent variable: the target variable, e.g., test score
• Independent variable: the predictor or explanatory variable, e.g., age

Regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:

Let’s say, you want to estimate growth in sales of a company based on current
economic conditions. You have the recent company data which indicates that the
growth in sales is around two and a half times the growth in the economy. Using this
insight, we can predict future sales of the company based on current & past
information.

There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the significant relationships between the dependent variable and
the independent variables.
2. It indicates the strength of impact of multiple independent variables on a
dependent variable.

There are various kinds of regression techniques available to make predictions.


These techniques are mostly driven by three metrics (number of independent
variables, type of dependent variables and shape of regression line). We’ll discuss
them in detail in the following sections.
For the creative ones, you can even cook up new regressions, if you feel the need to
use a combination of the parameters above, which people haven’t used before. But
before you start that, let us understand the most commonly used regressions:

• Linear Regression
• Logistic Regression
• Polynomial Regression
• Stepwise Regression
• Ridge Regression
• Lasso Regression

1. Linear Regression

It is one of the most widely known modelling techniques. Linear regression is usually
among the first few topics which people pick while learning predictive modelling. In
this technique, the dependent variable is continuous, the independent variable(s) can
be continuous or discrete, and the nature of the regression line is linear.

The relationship between two variables can be of three types:

(i) Linear Relationship

• [Figure: scatter plot showing a linear relationship between two variables]

(ii) Non-Linear Relationship

• [Figure: scatter plot showing a non-linear relationship between two variables]

(iii) No Relationship

• [Figure: scatter plot showing no relationship between two variables]


Linear Regression establishes a relationship between a dependent variable (Y) and
one or more independent variables (X) using a best-fit straight line (also known
as the regression line):

y = a + bx

Where
• y is the dependent variable
• x is the independent variable
• b is the slope --> how much the line rises for each unit increase in x
• a is the y-intercept --> the value of y when x = 0

Simple Linear Regression: It represents the relationship between two variables:
one independent variable X and one dependent variable Y.

Multiple Linear Regression: When there are multiple independent variables, the
technique is called Multiple Linear Regression.

Assumptions of linear regression:

• There must be a linear relation between the independent and dependent variables.
• There should not be any outliers present.
• No heteroscedasticity.
• Sample observations should be independent.
• Error terms should be normally distributed with mean 0 and constant variance.
• Absence of multicollinearity and auto-correlation.
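
As a minimal sketch of the idea (the age/test-score numbers below are made up
purely for illustration), a simple linear regression line can be fitted with
scikit-learn as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustrative data: x = age (independent), y = test score (dependent).
x = np.array([[18], [20], [22], [24], [26], [28]])
y = np.array([52, 58, 61, 67, 70, 75])

model = LinearRegression().fit(x, y)
print("slope b:", model.coef_[0])        # rise in y per unit increase in x
print("intercept a:", model.intercept_)  # value of y when x = 0
print("prediction at x = 25:", model.predict([[25]])[0])
```

With multiple independent variables, the same call works unchanged; x simply
gains extra columns, which gives Multiple Linear Regression.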

Logistic Regression
Logistic Regression is used to solve classification problems, so it is called a
classification algorithm; it models the probability of the output class.
• It is used for classification problems, where the target variable is categorical.
• Unlike Linear Regression, in Logistic Regression the required output is represented
in discrete values like binary 0 and 1.
• It estimates the relationship between a dependent variable (target) and one or more
independent variables (predictors) where the dependent variable is categorical/nominal.
• Logistic regression is a supervised learning classification algorithm used to predict
the probability of a dependent variable.
• The nature of the target or dependent variable is dichotomous (binary), which means
there are only two possible classes.
• In simple words, the dependent variable is binary in nature, having data coded as
either 1 (stands for success/yes) or 0 (stands for failure/no); but instead of giving
the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how they are
used: Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
• In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function, which predicts two maximum values (0 or 1).

Sigmoid Function:
• It is the logistic function used in Logistic Regression.
• The sigmoid function converts the straight line of linear regression into a curve
whose outputs can be thresholded into discrete values like binary 0 and 1.
• In this section let's see how a continuous linear regression can be manipulated and
converted into a logistic classifier.
• The sigmoid function is a mathematical function used to map the predicted values
to probabilities.
• It maps any real value into another value within the range of 0 and 1.
• The output of logistic regression must be between 0 and 1 and cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides
between 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0.

P = 1 / (1 + e^(−Y))

Where,
• P represents the probability of the output class
• Y represents the predicted output
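
A small sketch of the sigmoid in code, showing how raw linear outputs are mapped
into (0, 1) and then thresholded; the input values are arbitrary and illustrative:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])  # raw linear-regression outputs
p = sigmoid(z)                             # probabilities between 0 and 1
labels = (p >= 0.5).astype(int)            # values above the 0.5 threshold tend to 1

print(p)       # ≈ [0.0025 0.1192 0.5 0.8808 0.9975]
print(labels)  # [0 0 1 1 1]
```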

Assumptions for Logistic Regression:

• The dependent variable must be categorical in nature.
• The independent variables should not have multi-collinearity.

Logistic Regression Equation:

log(P / (1 − P)) = b0 + b1x1 + b2x2 + ... + bnxn

where P / (1 − P) is the odds of the output class being 1.

Example

Admission (dependent variable)    CGPA (independent variable)
0                                 4.2
0                                 5.1
0                                 5.5
1                                 8.2
1                                 9.0
1                                 9.9

Logistic regression can be binomial, ordinal or multinomial.

• Binomial or binary logistic regression deals with situations in which the
observed outcome for a dependent variable can have only two possible
types, "0" and "1" (which may represent, for example, "dead" vs. "alive"
or "win" vs. "loss").
• Multinomial logistic regression deals with situations where the outcome
can have three or more possible types (e.g., "disease A" vs. "disease B"
vs. "disease C") that are not ordered.
• Ordinal logistic regression deals with dependent variables that are
ordered.
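
As a hedged sketch, the admission/CGPA table above can be fitted with
scikit-learn's LogisticRegression (a six-row dataset like this is for
illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Data from the example table above.
cgpa = np.array([[4.2], [5.1], [5.5], [8.2], [9.0], [9.9]])  # independent variable
admitted = np.array([0, 0, 0, 1, 1, 1])                      # dependent variable (binary)

clf = LogisticRegression().fit(cgpa, admitted)

# Probabilistic output between 0 and 1, then a 0.5-threshold class label.
print(clf.predict_proba([[7.0]])[0, 1])  # P(admission = 1) for a CGPA of 7.0
print(clf.predict([[7.0]]))              # class label after thresholding
```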
Differences Between Linear and Logistic Regression

• Linear regression predicts a continuous dependent variable; logistic regression
predicts a categorical (binary) one.
• Linear regression fits a best-fit straight line; logistic regression fits an
S-shaped sigmoid curve.
• Linear regression is used for solving regression problems; logistic regression
is used for solving classification problems.

Polynomial Regression
o Polynomial Regression is a type of regression which models a non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between
the value of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a non-linear
fashion; in such a case, linear regression will not fit those datapoints well.
To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of a given degree and then modelled using a linear
model, which means the datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is also derived from the linear regression
equation; that is, the linear regression equation Y = b0 + b1x is transformed
into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression
coefficients; x is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though
the features (x², x³, ...) are not.

Need for Polynomial Regression:

• If we apply a linear model on a linear dataset, it provides a good result,
as we have seen in Simple Linear Regression; but if we apply the same model
without any modification on a non-linear dataset, it will produce poor output:
the loss function will increase, the error rate will be high, and accuracy
will decrease.
• So for such cases, where data points are arranged in a non-linear fashion,
we need the Polynomial Regression model. We can understand it better by
comparing a linear dataset and a non-linear dataset.
• For a dataset which is arranged non-linearly, if we try to cover it with a
linear model, we can clearly see that the line hardly covers any data point.
On the other hand, a curve, which is what the Polynomial model fits, is
suitable to cover most of the data points.
• Hence, if the datasets are arranged in a non-linear fashion, then we should
use the Polynomial Regression model instead of Simple Linear Regression.

Compare the following three equations:

Simple Linear:   Y = b0 + b1x
Multiple Linear: Y = b0 + b1x1 + b2x2 + ... + bnxn
Polynomial:      Y = b0 + b1x + b2x² + ... + bnxⁿ

All three are polynomial equations, but they differ by the degree of their
variables. The Simple and Multiple Linear equations are polynomial equations
of degree one, and the Polynomial regression equation is a linear equation
(in the coefficients) of degree n.

So if we add degrees to our linear equation, it is converted into a
Polynomial Linear equation.
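
A minimal sketch of this transform-then-fit idea, assuming scikit-learn and a
made-up, roughly quadratic dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])  # roughly quadratic, for illustration

# Transform the original feature into polynomial features of degree 2.
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)             # columns: x, x²

model = LinearRegression().fit(x_poly, y)  # still linear in the coefficients
print(model.coef_, model.intercept_)       # b1, b2 and b0
print(model.predict(poly.transform([[6]])))  # prediction at x = 6
```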

Stepwise Regression
• This form of regression is used when we deal with multiple independent
variables. In this technique, the selection of independent variables is done
with the help of an automatic process, which involves no human intervention.
• Stepwise regression basically fits the regression model by adding/dropping
co-variates one at a time based on a specified criterion. Some of the most
commonly used Stepwise regression methods are listed below:

Standard stepwise regression does two things: it adds and removes predictors
as needed at each step.

➢ Forward selection starts with the most significant predictor in the model and
adds a variable at each step.
➢ Backward elimination starts with all predictors in the model and removes
the least significant variable at each step.
➢ The aim of this modelling technique is to maximize the prediction power with
the minimum number of predictor variables. It is one of the methods to handle
higher dimensionality of a data set; a code sketch follows this list.
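
As an illustrative sketch, scikit-learn's SequentialFeatureSelector provides a
greedy, cross-validation-based variant of forward selection and backward
elimination (classical stepwise methods instead add/drop variables based on
significance tests); the built-in diabetes dataset is used here only as an
example:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)  # 10 predictors, bundled with scikit-learn

# Forward selection: start with no predictors, add the best one at each step.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
).fit(X, y)
print("forward  keeps:", forward.get_support())

# Backward elimination: start with all predictors, drop the weakest at each step.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward"
).fit(X, y)
print("backward keeps:", backward.get_support())
```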

Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in
which a small amount of bias is introduced so that we can get better long-term
predictions.
o The amount of bias added to the model is known as the Ridge Regression
penalty. We can compute this penalty term by multiplying lambda by the
squared weight of each individual feature.
o The equation (cost function) for ridge regression will be:

Cost = Σ (yi − ŷi)² + λ Σ (bj)²

o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables; to solve such problems, Ridge
regression can be used.
o Ridge regression is a regularization technique which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity
of the model.
o It is similar to Ridge Regression except that the penalty term contains the
absolute weights instead of the squares of the weights.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas
Ridge Regression can only shrink it near to 0.
o It is also called L1 regularization. The equation (cost function) for Lasso
regression will be:

Cost = Σ (yi − ŷi)² + λ Σ |bj|
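
A minimal sketch comparing the two penalties, assuming scikit-learn; alpha plays
the role of lambda above, and the synthetic data has only two truly informative
features:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only features 0 and 1 carry signal; the other three are noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: penalty on squared weights
lasso = Lasso(alpha=0.5).fit(X, y)  # L1: penalty on absolute weights

print("ridge:", ridge.coef_)  # shrinks all weights toward 0, but not exactly 0
print("lasso:", lasso.coef_)  # can shrink irrelevant weights to exactly 0
```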
