The Five Assumptions of Multiple Linear Regression

 BY ZACH BOBBITT  NOVEMBER 16, 2021

Multiple linear regression is a statistical method we can use to understand the relationship between multiple predictor variables and a response variable.

However, before we perform multiple linear regression, we must first make sure that five assumptions are met:

1. Linear relationship: There exists a linear relationship between each predictor variable and the response variable.

2. No Multicollinearity: None of the predictor variables are highly correlated with each other.

3. Independence: The observations are independent.

4. Homoscedasticity: The residuals have constant variance at every point in the linear model.

5. Multivariate Normality: The residuals of the model are normally distributed.

If one or more of these assumptions are violated, then the results of the multiple linear regression may be unreliable.

In this article, we provide an explanation for each assumption, how to determine if the assumption is met, and what to do if the assumption is violated.

Assumption 1: Linear Relationship


Multiple linear regression assumes that there is a
linear relationship between each predictor
variable and the response variable.

How to Determine if this Assumption is Met

The easiest way to determine if this assumption is met is to create a scatter plot of each predictor variable and the response variable.

This allows you to visually see if there is a linear relationship between the two variables.

If the points in the scatter plot roughly fall along a straight diagonal line, then there likely exists a linear relationship between the variables.

For example, if the points in a scatter plot of a particular predictor variable (x) and the response variable (y) fall roughly along a straight line, this indicates that there is a linear relationship between those two variables.
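
As a quick illustration, here is a minimal R sketch of how such a scatter plot could be created. The data frame df and the columns x and y are hypothetical placeholders, not from the original article:

# Minimal sketch (assumed data frame `df` with predictor `x` and response `y`)
plot(df$x, df$y,
     xlab = "Predictor (x)",
     ylab = "Response (y)",
     main = "Response vs. Predictor")

# Optional reference line to help judge linearity
abline(lm(y ~ x, data = df), col = "red")
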
What to Do if this Assumption is Violated

If there is not a linear relationship between one or more of the predictor variables and the response variable, then we have a couple options:

1. Apply a nonlinear transformation to the predictor variable such as taking the log or the square root. This can often transform the relationship to be more linear.

2. Add another predictor variable to the model. For example, if the plot of x vs. y has a parabolic shape then it might make sense to add x² as an additional predictor variable in the model (see the sketch after this list).

3. Drop the predictor variable from the model. In the most extreme case, if there exists no linear relationship between a certain predictor variable and the response variable then the predictor variable may not be useful to include in the model.
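
To make the first two options concrete, here is a minimal R sketch of each fix. The data frame df and the columns x1, x2, and y are hypothetical placeholders:

# Option 1: apply a nonlinear transformation to a predictor (assumes x1 > 0)
model_log <- lm(y ~ log(x1) + x2, data = df)

# Option 2: add a squared term when the plot of x1 vs. y looks parabolic
model_poly <- lm(y ~ x1 + I(x1^2) + x2, data = df)

summary(model_poly)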

Assumption 2: No Multicollinearity
Multiple linear regression assumes that none of
the predictor variables are highly correlated with
each other.

When one or more predictor variables are highly correlated, the regression model suffers from multicollinearity, which causes the coefficient estimates in the model to become unreliable.

How to Determine if this Assumption is Met

The easiest way to determine if this assumption is met is to calculate the VIF value for each predictor variable.

VIF values start at 1 and have no upper limit. As a general rule of thumb, VIF values greater than 5* indicate potential multicollinearity.

The following tutorials show how to calculate VIF in various statistical software:

How to Calculate VIF in R
How to Calculate VIF in Python
How to Calculate VIF in Excel

* Sometimes researchers use a VIF value of 10 instead, depending on the field of study.
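
For reference, here is a minimal R sketch of this check using the vif() function from the car package. The data frame df and the predictors x1, x2, and x3 are hypothetical placeholders:

# install.packages("car")  # if the package is not already installed
library(car)

# Fit the multiple linear regression model
model <- lm(y ~ x1 + x2 + x3, data = df)

# VIF for each predictor; values above 5 (or 10) suggest multicollinearity
vif(model)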

What to Do if this Assumption is Violated

If one or more of the predictor variables has a VIF value greater than 5, the easiest way to resolve this issue is to simply remove the predictor variable(s) with the high VIF values.

Alternatively, if you want to keep each predictor variable in the model then you can use a different statistical method such as ridge regression, lasso regression, or partial least squares regression that is designed to handle predictor variables that are highly correlated.
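
As one example of these alternatives, here is a minimal ridge regression sketch using the glmnet package. The data frame df and its columns are hypothetical placeholders, and the choice of ridge (rather than lasso or partial least squares) is just for illustration:

# install.packages("glmnet")  # if the package is not already installed
library(glmnet)

# glmnet expects a numeric predictor matrix and a response vector
X <- model.matrix(y ~ x1 + x2 + x3, data = df)[, -1]  # drop the intercept column
y <- df$y

# alpha = 0 gives ridge regression; cv.glmnet chooses lambda by cross-validation
cv_fit <- cv.glmnet(X, y, alpha = 0)
coef(cv_fit, s = "lambda.min")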

Assumption 3: Independence
Multiple linear regression assumes that each
observation in the dataset is independent.

How to Determine if this Assumption is Met

The simplest way to determine if this assumption is met is to perform a Durbin-Watson test, which is a formal statistical test that tells us whether or not the residuals (and thus the observations) exhibit autocorrelation.
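
For reference, here is a minimal R sketch of this test using the dwtest() function from the lmtest package. The data frame df and the predictors x1 and x2 are hypothetical placeholders:

# install.packages("lmtest")  # if the package is not already installed
library(lmtest)

model <- lm(y ~ x1 + x2, data = df)

# Durbin-Watson test: a statistic near 2 suggests no first-order autocorrelation
dwtest(model)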

What to Do if this Assumption is Violated

Depending on how this assumption is violated, you have a few options:

For positive serial correlation, consider adding lags of the dependent and/or independent variable to the model (see the sketch after this list).
For negative serial correlation, check to make sure that none of your variables are overdifferenced.
For seasonal correlation, consider adding seasonal dummy variables to the model.
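
As a small illustration of the first option, here is a hedged R sketch that adds a one-period lag of the response variable. It assumes a hypothetical data frame df that is already sorted in time order:

# Create a lagged copy of the response (NA for the first observation)
df$y_lag1 <- c(NA, head(df$y, -1))

# Refit the model with the lag included; lm() drops the NA row by default
model_lag <- lm(y ~ y_lag1 + x1 + x2, data = df)
summary(model_lag)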

Assumption 4: Homoscedasticity
Multiple linear regression assumes that the
residuals have constant variance at every point
in the linear model. When this is not the case,
the residuals are said to suffer from
heteroscedasticity.

When heteroscedasticity is present in a regression analysis, the results of the regression model become unreliable.

Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn’t pick up on this. This makes it much more likely for a regression model to declare that a term in the model is statistically significant, when in fact it is not.

How to Determine if this Assumption is Met

The simplest way to determine if this assumption is met is to create a plot of standardized residuals versus predicted values.

Once you fit a regression model to a dataset, you can then create a scatter plot that shows the predicted values for the response variable on the x-axis and the standardized residuals of the model on the y-axis.
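
Here is a minimal R sketch of that plot, assuming a hypothetical fitted model on a data frame df:

model <- lm(y ~ x1 + x2, data = df)

# Standardized residuals vs. predicted (fitted) values
plot(fitted(model), rstandard(model),
     xlab = "Predicted values",
     ylab = "Standardized residuals")
abline(h = 0, lty = 2)  # reference line at zero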

If the points in the scatter plot exhibit a pattern, then heteroscedasticity is present.

In a regression model where heteroscedasticity is not a problem, the standardized residuals are scattered about zero with no clear pattern.

In a regression model where heteroscedasticity is a problem, the standardized residuals become much more spread out as the predicted values get larger. This “cone” shape is a classic sign of heteroscedasticity.

What to Do if this Assumption is Violated

There are three common ways to fix heteroscedasticity:

1. Transform the response variable. The most common way to deal with heteroscedasticity is to transform the response variable by taking the log, square root, or cube root of all of the values of the response variable. This often causes heteroscedasticity to go away.
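
For example, here is a minimal R sketch of a log transformation of the response, assuming a hypothetical data frame df whose response y is strictly positive:

# Fit the model on the log of the response; sqrt(y) or y^(1/3) work similarly
model_logy <- lm(log(y) ~ x1 + x2, data = df)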

2. Redefine the response variable. One way to redefine the response variable is to use a rate, rather than the raw value. For example, instead of using the population size to predict the number of flower shops in a city, we may instead use population size to predict the number of flower shops per capita.

In most cases, this reduces the variability that naturally occurs among larger populations since we’re measuring the number of flower shops per person, rather than the sheer number of flower shops.

3. Use weighted regression. Another way to fix heteroscedasticity is to use weighted regression, which assigns a weight to each data point based on the variance of its fitted value.

Essentially, this gives smaller weights to data points that have higher variances, which shrinks their squared residuals. When the proper weights are used, this can eliminate the problem of heteroscedasticity.
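
Here is a hedged R sketch of one common weighting heuristic, where the weights come from an initial unweighted fit. The data frame df and its columns are hypothetical placeholders:

# Initial (unweighted) fit
ols_fit <- lm(y ~ x1 + x2, data = df)

# Model the absolute residuals as a function of the fitted values, then use
# the squared fitted values of that auxiliary model as variance estimates
wts <- 1 / lm(abs(residuals(ols_fit)) ~ fitted(ols_fit))$fitted.values^2

# Refit with weights: observations with higher estimated variance get less weight
wls_fit <- lm(y ~ x1 + x2, data = df, weights = wts)
summary(wls_fit)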

Related: How to Perform Weighted Regression in R
Assumption 5: Multivariate Normality
Multiple linear regression assumes that the
residuals of the model are normally distributed.

How to Determine if this Assumption is Met

There are two common ways to check if this assumption is met:

1. Check the assumption visually using Q-Q plots.

A Q-Q plot, short for quantile-quantile plot, is a type of plot that we can use to determine whether or not the residuals of a model follow a normal distribution. If the points on the plot roughly form a straight diagonal line, then the normality assumption is met.

When the residuals roughly follow a normal distribution, the points in the Q-Q plot fall close to the straight diagonal line. When the points clearly depart from that line, the residuals do not follow a normal distribution.
2. Check the assumption using a formal statistical test like Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, or D’Agostino-Pearson.
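
Here is a minimal R sketch of both checks, assuming a hypothetical fitted model on a data frame df:

model <- lm(y ~ x1 + x2, data = df)
res <- residuals(model)

# Visual check: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Formal check: a small p-value suggests the residuals are not normally distributed
shapiro.test(res)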

Keep in mind that these tests are sensitive to large sample sizes – that is, they often conclude that the residuals are not normal when your sample size is extremely large. This is why it’s often easier to use graphical methods like a Q-Q plot to check this assumption.

What to Do if this Assumption is Violated

If the normality assumption is violated, you have a couple options:

1. First, verify that there are no extreme outliers present in the data that cause the normality assumption to be violated.

2. Next, you can apply a nonlinear transformation to the response variable such as taking the square root, the log, or the cube root of all of the values of the response variable. This often causes the residuals of the model to become more normally distributed.

Additional Resources

The following tutorials provide additional information about multiple linear regression and its assumptions:

Introduction to Multiple Linear Regression
A Guide to Heteroscedasticity in Regression Analysis
A Guide to Multicollinearity & VIF in Regression

The following tutorials provide step-by-step examples of how to perform multiple linear regression using different statistical software:

How to Perform Multiple Linear Regression in Excel
How to Perform Multiple Linear Regression in R
How to Perform Multiple Linear Regression in SPSS
How to Perform Multiple Linear Regression in Stata
