The Five Assumptions of Multiple Linear Regression

Multiple linear regression is a statistical method
we can use to understand the relationship
between multiple predictor variables and a
response variable.

However, before we perform multiple linear
regression, we must first make sure that five
assumptions are met:

1. Linear relationship: There exists a linear
relationship between each predictor variable and
the response variable.

2. No Multicollinearity: None of the predictor
variables are highly correlated with each other.

3. Independence: The observations are
independent.

4. Homoscedasticity: The residuals have
constant variance at every point in the linear
model.

5. Multivariate Normality: The residuals of the
model are normally distributed.

If one or more of these assumptions are violated,
then the results of the multiple linear regression
may be unreliable.

In this article, we provide an explanation for each
assumption, how to determine if the assumption
is met, and what to do if the assumption is
violated.

Assumption 1: Linear Relationship

Multiple linear regression assumes that there is a
linear relationship between each predictor
variable and the response variable.

How to Determine if this Assumption is Met

The easiest way to determine if this assumption
is met is to create a scatter plot of each predictor
variable and the response variable.

This allows you to visually see if there is a linear
relationship between the two variables.

If the points in the scatter plot roughly fall along a
straight diagonal line, then there likely exists a
linear relationship between the variables.

For example, the points in the plot below look
like they fall on roughly a straight line, which
indicates that there is a linear relationship
between this particular predictor variable (x) and
the response variable (y):

[Figure: scatter plot of x versus y]

What to Do if this Assumption is Violated

If there is not a linear relationship between one
or more of the predictor variables and the
response variable, then we have a few options:
1. Apply a nonlinear transformation to the
predictor variable such as taking the log or the
square root. This can often transform the
relationship to be more linear.

2. Add another predictor variable to the model.
For example, if the plot of x vs. y has a parabolic
shape then it might make sense to add x² as an
additional predictor variable in the model.

3. Drop the predictor variable from the model. In
the most extreme case, if there exists no linear
relationship between a certain predictor variable
and the response variable then the predictor
variable may not be useful to include in the
model.
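
The second option can be sketched in Python with
NumPy. This is only an illustration on synthetic,
noise-free data; the helper name r_squared and the
data values are assumptions made for this example
and are not part of the original tutorial. It
compares the fit of a model that uses x alone with
one that also includes x².

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Synthetic data with a parabolic relationship (illustration only).
x = np.linspace(-3, 3, 25)
y = 1.0 + 0.5 * x + 2.0 * x**2

r2_linear = r_squared(x, y)                              # y ~ x
r2_quadratic = r_squared(np.column_stack([x, x**2]), y)  # y ~ x + x^2

print(f"R^2 with x only:    {r2_linear:.3f}")
print(f"R^2 with x and x^2: {r2_quadratic:.3f}")
```

On data like this, the model with the squared term
should explain far more of the variation in y than
the model with x alone.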

Assumption 2: No Multicollinearity

Multiple linear regression assumes that none of
the predictor variables are highly correlated with
each other.

When two or more predictor variables are highly
correlated with each other, the regression model
suffers from multicollinearity, which causes the
coefficient estimates in the model to become
unreliable.

How to Determine if this Assumption is Met

The easiest way to determine if this assumption
is met is to calculate the variance inflation factor
(VIF) for each predictor variable.

VIF values start at 1 and have no upper limit. As
a general rule of thumb, VIF values greater than
5* indicate potential multicollinearity.

The following tutorials show how to calculate VIF
in various statistical software:

How to Calculate VIF in R
How to Calculate VIF in Python
How to Calculate VIF in Excel

* Sometimes researchers use a VIF value of 10
instead, depending on the field of study.
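
As a rough sketch of what the VIF measures, the
Python code below computes it directly with NumPy.
The VIF of a predictor equals 1 / (1 - R²), where
R² comes from regressing that predictor on all of
the other predictors. The function name vif and the
synthetic data are assumptions made for this
example; dedicated statistical packages provide
their own VIF functions.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (n_samples x n_predictors)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    values = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        values.append(1.0 / (1.0 - r2))
    return values

# Synthetic predictors (illustration only): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

v = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], v):
    flag = "potential multicollinearity" if value > 5 else "ok"
    print(f"{name}: VIF = {value:.2f} ({flag})")
```

Here x1 and x2 should be flagged by the rule of
thumb, while x3 should have a VIF close to 1.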

What to Do if this Assumption is Violated

If one or more of the predictor variables has a
VIF value greater than 5, the easiest way to
resolve this issue is to simply remove the
predictor variable(s) with the high VIF values.

Alternatively, if you want to keep each predictor
variable in the model then you can use a
different statistical method such as ridge
regression, lasso regression, or partial least
squares regression that is designed to handle
predictor variables that are highly correlated.
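
To give a feel for one of these alternatives, here
is a minimal sketch of ridge regression using its
closed-form solution in NumPy. The function name
ridge_fit, the penalty value alpha = 10, and the
synthetic data are all assumptions chosen for
illustration. In practice the predictors are usually
standardized first and alpha is chosen by
cross-validation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Ridge coefficients on centered data; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Closed-form solution: (Xc'Xc + alpha * I)^-1 Xc'yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Synthetic data (illustration only): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=100)

_, coef_ols = ridge_fit(X, y, alpha=0.0)     # alpha = 0 is ordinary least squares
_, coef_ridge = ridge_fit(X, y, alpha=10.0)  # arbitrary illustrative penalty

print("OLS coefficients:  ", np.round(coef_ols, 2))
print("Ridge coefficients:", np.round(coef_ridge, 2))
```

The penalty shrinks the coefficient vector, which
tends to stabilize the estimates when predictors
are highly correlated, at the cost of some bias.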

Assumption 3: Independence

Multiple linear regression assumes that each
observation in the dataset is independent.

How to Determine if this Assumption is Met

The simplest way to determine if this assumption
is met is to perform a Durbin-Watson test, which
is a formal statistical test that tells us whether or
not the residuals (and thus the observations)
exhibit autocorrelation.
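
The statistic behind this test is simple to compute
by hand. The sketch below, in plain Python,
calculates the Durbin-Watson statistic from a list
of residuals ordered in time. The statistic ranges
from 0 to 4: values near 2 suggest no first-order
autocorrelation, values toward 0 suggest positive
autocorrelation, and values toward 4 suggest
negative autocorrelation. The two residual series
are made up for illustration, and a formal test
additionally requires critical values, which this
sketch does not provide.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residual series, ordered in time (illustration only).
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # neighbors share a sign
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbors flip sign

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)

print(f"{dw_trending:.2f}")     # 0.67, toward 0: positive autocorrelation
print(f"{dw_alternating:.2f}")  # 3.33, toward 4: negative autocorrelation
```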

What to Do if this Assumption is Violated

Depending on the nature of the way this
assumption is violated, you have a few options:

For positive serial correlation, consider adding
lags of the dependent and/or independent
variable to the model.

For negative serial correlation, check to make
sure that none of your variables
are overdifferenced.

For seasonal correlation, consider adding
seasonal dummy variables to the model.
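
As one illustration of the first option, the sketch
below builds a design matrix that includes a lag of
the dependent variable. It assumes the rows are
ordered in time; the function name
add_lagged_response and the tiny dataset are
hypothetical and serve only to show the bookkeeping.

```python
import numpy as np

def add_lagged_response(X, y, lag=1):
    """Append y shifted by `lag` steps as a predictor, dropping the first rows."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    y_lagged = y[:-lag]  # value of y observed `lag` steps earlier
    X_new = np.column_stack([X[lag:], y_lagged])
    return X_new, y[lag:]

# Tiny made-up series, ordered in time (illustration only).
X = [10.0, 20.0, 30.0, 40.0]
y = [1.0, 2.0, 3.0, 4.0]

X_new, y_new = add_lagged_response(X, y, lag=1)
print(X_new)  # each row: [x_t, y_(t-1)]
print(y_new)  # y_t for t = 2, 3, 4
```

The first observation is dropped because it has no
earlier value of y to use as a predictor.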

Assumption 4: Homoscedasticity

Multiple linear regression assumes that the
residuals have constant variance at every point
in the linear model. When this is not the case,
the residuals are said to suffer from
heteroscedasticity.

When heteroscedasticity is present in a
regression analysis, the results of the regression
model become unreliable.

Specifically, heteroscedasticity increases the
variance of the regression coefficient estimates,
but the regression model doesn’t pick up on this.
This makes it much more likely for a regression
model to declare that a term in the model is
statistically significant, when in fact it is not.

How to Determine if this Assumption is Met

The simplest way to determine if this assumption
is met is to create a plot of standardized
residuals versus predicted values.

Once you fit a regression model to a dataset, you
can then create a scatter plot that shows the
predicted values for the response variable on the
x-axis and the standardized residuals of the
model on the y-axis.

If the spread of the points in the scatter plot
changes systematically across the predicted
values, then heteroscedasticity is present.
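
The quantities on the two axes of this plot can be
computed as in the sketch below, which uses NumPy.
The function name standardized_residuals, the
synthetic data, and the informal "spread ratio"
comparison at the end are assumptions made for this
example. The ratio is only a crude numeric stand-in
for looking at the plot, not a formal test.

```python
import numpy as np

def standardized_residuals(X, y):
    """Return fitted values and standardized residuals from an OLS fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    resid = y - fitted
    n, k = A.shape
    sigma2 = (resid @ resid) / (n - k)                  # residual variance
    leverage = np.sum(A * np.linalg.pinv(A).T, axis=1)  # hat matrix diagonal
    return fitted, resid / np.sqrt(sigma2 * (1.0 - leverage))

# Synthetic data (illustration only): the noise grows with x.
x = np.arange(1, 41, dtype=float)
sign = np.where(np.arange(40) % 2 == 0, 1.0, -1.0)
y = 5.0 + 2.0 * x + 0.5 * x * sign

fitted, std_resid = standardized_residuals(x, y)

# Informal check: compare residual spread for low and high predicted values.
order = np.argsort(fitted)
half = len(order) // 2
low, high = std_resid[order[:half]], std_resid[order[half:]]
ratio = high.std() / low.std()
print(f"spread ratio (high / low fitted values): {ratio:.2f}")
```

A ratio well above 1 corresponds to residuals that
fan out as the predicted values grow.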

The following plot shows an example of a
regression model where heteroscedasticity is not
a problem:

[Figure: standardized residuals versus predicted
values, with no clear pattern]

Notice that the standardized residuals are
scattered about zero with no clear pattern.

The following plot shows an example of a
regression model where heteroscedasticity is a
problem:

[Figure: standardized residuals versus predicted
values, spreading out in a cone shape]

Notice how the standardized residuals become
much more spread out as the predicted values
get larger. This “cone” shape is a classic sign of
heteroscedasticity.

What to Do if this Assumption is Violated

There are three common ways to
fix heteroscedasticity:

1. Transform the response variable. The most
common way to deal with heteroscedasticity is to
transform the response variable by taking the
log, square root, or cube root of all of the values
of the response variable. This often causes
heteroscedasticity to go away.

2. Redefine the response variable. One way to
redefine the response variable is to use a rate,
rather than the raw value. For example, instead
of using the population size to predict the
number of flower shops in a city, we may instead
use population size to predict the number of
flower shops per capita.

In most cases, this reduces the variability that
naturally occurs among larger populations since
we’re measuring the number of flower shops per
person, rather than the sheer number of flower
shops.

3. Use weighted regression. Another way to fix
heteroscedasticity is to use weighted regression,
which assigns a weight to each data point based
on the variance of its fitted value.

Essentially, this gives small weights to data
points that have higher variances, which shrinks
their squared residuals. When the proper weights
are used, this can eliminate the problem of
heteroscedasticity.
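
A minimal sketch of the third option, weighted
regression, is shown below using NumPy. It assumes
that each weight is inversely proportional to the
variance of the corresponding observation. The
function name weighted_least_squares and the
synthetic data, in which the noise variance grows
with x², are assumptions made for illustration. In
real data the variances are unknown and have to be
estimated.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Minimize sum_i w_i * (y_i - prediction_i)^2 by rescaling rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    # Scaling each row by sqrt(w_i) turns the problem into ordinary least squares.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope(s)]

# Synthetic data (illustration only): noise standard deviation grows with x.
x = np.arange(1, 21, dtype=float)
sign = np.where(np.arange(20) % 2 == 0, 1.0, -1.0)
y = 3.0 + 2.0 * x + 0.3 * x * sign

coef_ols = weighted_least_squares(x, y, np.ones_like(x))  # equal weights
coef_wls = weighted_least_squares(x, y, 1.0 / x**2)       # weight = 1 / variance

print("Equal weights:  ", np.round(coef_ols, 3))
print("Inverse variance:", np.round(coef_wls, 3))
```

With equal weights the function reduces to ordinary
least squares, which makes it easy to compare the
two fits on the same data.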

Related: How to Perform Weighted Regression in R

Assumption 5: Multivariate Normality

Multiple linear regression assumes that the
residuals of the model are normally distributed.

How to Determine if this Assumption is Met
There are two common ways to check if this
assumption is met:

1. Check the assumption visually using Q-Q
plots.

A Q-Q plot, short for quantile-quantile plot, is a
type of plot that we can use to determine
whether or not the residuals of a model follow a
normal distribution. If the points on the plot
roughly form a straight diagonal line, then the
normality assumption is met.

The following Q-Q plot shows an example of
residuals that roughly follow a normal
distribution:

[Figure: Q-Q plot with points close to a straight
diagonal line]

However, the Q-Q plot below shows an example
of when the residuals clearly depart from a
straight diagonal line, which indicates that they
do not follow a normal distribution:

[Figure: Q-Q plot with points departing from a
straight diagonal line]

2. Check the assumption using a formal
statistical test like Shapiro-Wilk,
Kolmogorov-Smirnov, Jarque-Bera, or
D’Agostino-Pearson.

Keep in mind that these tests are sensitive to
large sample sizes: they often conclude that the
residuals are not normal when your sample size
is extremely large, even if the departure from
normality is small. This is why it’s often easier to
use graphical methods like a Q-Q plot to check
this assumption.
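
Both checks can be sketched with the Python
standard library alone. The code below computes the
coordinates of a normal Q-Q plot and the Jarque-Bera
statistic with its asymptotic p-value. The function
names, the plotting positions (i - 0.5) / n, and the
two made-up residual sets are assumptions chosen
for this example; statistical packages may use
slightly different conventions.

```python
import math
from statistics import NormalDist

def qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(residuals)))

def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Under normality, jb is approximately chi-square with 2 degrees of freedom.
    return jb, math.exp(-jb / 2.0)

# Made-up residuals (illustration only): one symmetric set, one right-skewed set.
symmetric = [NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
skewed = [math.exp(z) for z in symmetric]

jb_symmetric, p_symmetric = jarque_bera(symmetric)
jb_skewed, p_skewed = jarque_bera(skewed)
print(f"symmetric: JB = {jb_symmetric:.2f}, p = {p_symmetric:.3f}")
print(f"skewed:    JB = {jb_skewed:.2f}, p = {p_skewed:.3g}")

qq = qq_points([3.0, 1.0, 2.0])  # (theoretical quantile, sorted residual) pairs
```

Plotting the pairs returned by qq_points gives the
Q-Q plot; a small p-value from jarque_bera points
to a departure from normality.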

What to Do if this Assumption is Violated

If the normality assumption is violated, you have
a couple of options:

1. First, verify that there are no extreme outliers
present in the data that cause the normality
assumption to be violated.

2. Next, you can apply a nonlinear transformation
to the response variable such as taking the
square root, the log, or the cube root of all of the
values of the response variable. This often
causes the residuals of the model to become
more normally distributed.

Additional Resources

The following tutorials provide additional
information about multiple linear regression and
its assumptions:

Introduction to Multiple Linear Regression
A Guide to Heteroscedasticity in Regression
A Guide to Multicollinearity & VIF in Regression


Zach Bobbitt
Hey there. My name is Zach Bobbitt. I
have a Master of Science degree in
Applied Statistics and I’ve worked on
machine learning algorithms for
professional businesses in both healthcare and retail.
I’m passionate about statistics, machine learning, and
data visualization and I created Statology to be a
resource for both students and teachers alike. My
goal with this site is to help you learn statistics through
using simple terms, plenty of real-world examples, and
helpful illustrations.

© 2023 Statology