Chapter 4: Predict and Forecast
Part I: Linear Regression
1 Econometrics
1.1 Why “regression” and “econometrics”
• You can show that a mean matches a target, you know how to compare two groups, and you can check whether two variables are related. That is important, but far from sufficient.
• As a manager you will need to predict values. But real products are complicated: a car is a combination of multiple interacting factors. As a marketer you must know which ones matter to customers and what relative value those customers give to each feature. Even more important: you need to create a model to predict how desirable a new car would be.
• We also want to measure the quality of our models and check when they are usable.
A model that predicts the price of a laptop with a margin of error close to €45K is completely useless.
A model that predicts only how much "men, students, in Lille, going to the cinema twice a month" are willing to pay for this computer is equally useless, as that is only a tiny part of the potential market.
• Historically, methods to "predict" come from a field of science called econometrics. Its original purpose was to test economic theories against real data.
• It turns out that tools developed for econometrics are incredibly interesting from a business point of view: they are efficient, scientifically validated and well known by a large community of professionals. Some might even say "90% of market finance is econometrics".
• Methods created for econometrics are also the most basic foundations of Big Data and Artificial Intelligence: basically, the question there was to develop even more efficient tools able to manipulate mountains of unstructured data.
• In this lecture we cover the first important method, linear regression; more accurately, multivariate regression, whose objective is to predict a quantitative value thanks to one or many predictors.
1.2 Example
You work for Tata Group. The Tata group comprises over 90 operating companies in seven business sectors, with operations in more than 80 countries across six continents. Today, Tata Group wants to advertise and promote its image in France but does not know the French advertising market. The company especially wants to avoid mistakes such as wasting money on low-impact media: you need to find the most cost-effective media.
Specifically, the company is interested in the effectiveness of advertising for selling cars, foodstuff or IT products. A sample of 98 other companies has been selected for study over a period of one month, and you have been given their data.
• We defined a problem statement:
"How can we create the best marketing campaign?"
• We want to check assumptions:
"A well-known actor is going to change the effectiveness of an ad on TV."
"YouTube is more efficient for cars."
• We want a model:
"Change in Brand Name Fame = 10 + 8 × Investment on TV + 5 × Famous Actor"
• We must check the accuracy of this model:
"If we spend €1 on TV, can we say that Fame is going to increase by 8 units? Is this value accurate? Is it exactly 8, or anywhere from 2 to 14?"
"Can we demonstrate that this model allows us to forecast the effect of a new marketing campaign?"
1.3 Definitions
• We use regression analysis to predict the value of a dependent variable based on one or more independent variables.
The dependent variable is called Y. It is a continuous quantitative variable (a price, a temperature…). It is also called the explained variable.
The independent variable is called X. It might be a quantitative or a qualitative variable. It is also called the explicative or explanatory variable. When we face more than one independent variable, they are called X1, X2, …, Xk.
Ex: we want to predict the price of a house (Y) using its surface (X1) and its age (X2).
• The link between variables is a model. It is just an estimate of the true relationship between those variables, and it probably neglects the effect of many other variables.
We want a correct specification of the model, i.e. we selected all the right variables to explain Y.
We want a correct shape for the model, i.e. we created the right function to link X and Y.
Ex: Y is a function of surface and age; we assume that other variables have no effect. This specification is probably not correct, as many other variables influence the price. We also assume that the shape, the "link between variables", is linear: no exponential shape and so on. It would be close to a function like Y = a + bX1 + cX2.
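To make this concrete, here is a minimal sketch in Python (not the course's SPSS workflow; the data are synthetic, so all names and numbers are purely illustrative) of fitting such a linear model of price on surface and age:

```python
# Minimal sketch: fit Y = a + b*X1 + c*X2 on synthetic house data.
# All values are made up for illustration; SPSS is the tool used in class.
import numpy as np

rng = np.random.default_rng(0)
surface = rng.uniform(50, 350, size=100)   # X1, in m2
age = rng.uniform(0, 80, size=100)         # X2, in years
# "True" relationship plus noise: a model only ever approximates this.
price = 10 + 2.0 * surface - 0.5 * age + rng.normal(0, 30, size=100)

# Least-squares fit: design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones_like(surface), surface, age])
a, b, c = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```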
• A scatter plot typically displays an explicative variable on the x-axis and the explained variable on the y-axis.
Ex: we want to predict the price of a house (K€) using its surface (m²). For clarity, we neglect "age". The scatter plot shows, as expected, that a larger house tends to be more expensive.
[Scatter plot: surface (m², 50–350) on the x-axis, price (K€, 0–800) on the y-axis]
• The relationship is $Y = f(X) + \varepsilon$, where f is the function which links X to Y and ε is the error, as a model is never perfect.
[Scatter plot: surface of house i on the x-axis, price on the y-axis, with the fitted line labeled "Model"]
• ε is needed: some observed values of X or Y in the database may be faulty (errors in coding…), Y may be explained by other variables, the function may under-estimate the effect of X on Y…
• Obviously, we would like this error to be as small as possible!
• The equation of the model is the function that links X and Y.
• When we work on the whole population, the equation of this model is $Y = \beta_0 + \beta_1 X + \varepsilon$.
[Scatter plot with the fitted line: surface (x-axis) vs price (K€, y-axis); the Y-axis intercept and the slope of the line are highlighted]
• The Y-axis intercept is β0. The slope is β1. Let's assume that the equation of the line is Y = 10 + 2X (price in K€, surface in m²).
When X, the surface, increases by 1, the price of the house increases by 2 K€, i.e. €2,000.
When X is equal to 0, the price of the house is €10,000 (?). Well… there is no such house, or it might be taxes or the land itself?
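A quick worked check of how the line is read, using the illustrative equation above:

$$\hat{Y}(100) = 10 + 2 \times 100 = 210 \text{ K€}, \qquad \hat{Y}(101) = 10 + 2 \times 101 = 212 \text{ K€}$$

One extra square metre adds exactly the slope, 2 K€ = €2,000, to the predicted price.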
• For observation i, the model reads $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$: $Y_i$ is the dependent variable, $\beta_0 + \beta_1 X_i$ the linear component and $\varepsilon_i$ the random error term.
• Don’t forget that Y is the REAL value, it’s not “on” the line.
• The distance between the line (the predicted value) and the real value is the error.
• The error or statistical error ε is the amount by which a real value differs from the predicted value when we work on the whole population. It should be as small as possible. It may be explained by:
a specification error: a faulty selection of explicative variables, or we forgot important ones.
Ex: predicting the price of a house without taking the location into consideration.
a mis-specified function: we created a linear relationship while the real shape is non-linear.
Ex: modeling COVID with a linear shape when it is exponential.
faulty data or outliers in the database.
Ex: the surface of a house is coded 100,000 m². Probably an error in the dataset!
• Practically, the error is NEVER observed, as it assumes that the model has been created on the whole population… while you probably only collected a sample. What is observed is the residual, called e. It is an observable estimate of ε created thanks to a sample.
Ex: the mean age in the population is 20. We select a sample from this population; the average age in this sample is 17. If a man is 21, then the residual is 21 − 17 = 4 while the error is 21 − 20 = 1.
The error ε is obtained when the model is created from the population.
The residual e is obtained when the model is created from a sample.
Ch 4. Prediction and Regression
13
1 Econometrics
1.3 Definitions
• As the sample might not behave exactly like the population, errors and residuals are not the same.
• We would like them to be the same, as in that case the sample is a good representation of the population.
[Scatter plot showing, for house i, both the residual (distance to the line estimated from the sample) and the error (distance to the population line)]
• In the population the model is (Greek letters for the population): $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$.
• We say that b0 and b1 are the estimated values of β0 and β1. They are also called the estimate of the intercept and the estimate of the slope. If we're lucky (if the sample behaves exactly like the population), estimated and true values are equal.
• Y is the REAL value.
• All econometrics can be summarized in this simple sentence: how confident can we be that the estimated coefficients are good estimations of the real but unknown ones?
• In the sample, the estimated model is $\hat{Y}_i = b_0 + b_1 X_i$, where $\hat{Y}_i$ is the predicted or estimated Y value for observation i. Note that there is no error term.
• By definition, the predicted value is on the line. The error is the gap between the line and the real value… but we are not discussing the real value here, so we don't write any "error".
• Now let's move to some practice. At the beginning, for pedagogical purposes, we assume that we face a single explicative variable X and that the relationship between the variables is indeed linear (we remove this limitation in the last part).
2 Your first model (one explicative variable) [Book: chapter 13]
2.1 Global method
2.1.1 Steps
What should be done to create a "good" model?
• As the error is the distance between the observed value and the predicted value, we want to minimize $e_i = y_i - \hat{y}_i$,
• which is equal to $e_i = y_i - b_0 - b_1 x_i$.
• We want to minimize the sum of squared errors. Thus the name "OLS", for ordinary least squares method, or "MCO" for moindres carrés ordinaires.
2.2 Estimation of coefficients
2.2.1 Logic
We look for b0 and b1 that minimize this equation:

$$\min \sum_{i=1}^{n} e_i^2 = \min \sum_{i=1}^{n} \left( y_i - b_1 x_i - b_0 \right)^2 \qquad \text{(equation 1)}$$

Setting the derivative with respect to b0 to zero gives $b_0 = \bar{y} - b_1 \bar{x}$ (equation 2), and we can replace it in equation 1 to obtain

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

But you will never have to derive them by hand.
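As a sketch of what SPSS computes for you (the five data points below are made up; only the formulas matter), the closed-form estimates can be checked in a few lines of Python:

```python
# Sketch: OLS estimates from the closed-form formulas, checked against numpy.
import numpy as np

x = np.array([60.0, 90.0, 120.0, 150.0, 200.0])    # made-up surfaces
y = np.array([140.0, 210.0, 260.0, 300.0, 420.0])  # made-up prices

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()                      # equation 2
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

# Same result from a generic least-squares routine:
slope, intercept = np.polyfit(x, y, deg=1)
print(f"check: b0 = {intercept:.3f}, b1 = {slope:.3f}")
```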
2.2.2 Example
We come back to SPSS and our dataset and select Analyze / Regression / Linear Regression.
[SPSS output: the coefficients table shows the explicative variable and its estimated value, called b1]
• Problem 13.9 has been solved on video (with all details). You can watch it at home and practice on the same problem.
• You can practice on:
Problem 13.4 (file "cars"). Don't forget that answers are available at the end of the book.
Problem 13.6 (file "FTMBA").
• The most important part is being able to interpret the meaning of the coefficients.
• You can, of course, already try to predict any quantitative variable thanks to another quantitative variable; a Python equivalent of the SPSS steps is sketched below.
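If you want to reproduce the SPSS dialog outside SPSS, a minimal Python equivalent would look like this ("data.csv", "price" and "surface" are placeholders for whichever dataset you practice on):

```python
# Sketch: the "Analyze / Regression / Linear Regression" steps in Python.
# File and column names are placeholders, not the course files.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")
model = smf.ols("price ~ surface", data=df).fit()
print(model.summary())  # coefficient table, r-squared, standard error...
```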
3 Quality of a model (one explicative variable) [Book: chapter 13]
3.1 Concept
• A model without an easy-to-read measure of its quality is completely worthless.
• Quality can be defined as "on average, a close proximity between the predicted value and the observed value".
• When there is no relationship between X and Y, we can observe two typical shapes:
a random cluster of dots without any specific orientation;
for any value of X, Y remains the same.
[Two scatter plots: no relationship between X and Y]
[Figure: for a house with surface Xi, three values on the Y axis: the real price, the predicted price (on the line) and the average price]
• A good model has to explain why the price of your house is not equal to the price of the average house in the city.
The distance A = (real price − average price) is the quantity to explain.
The distance B = (real price − predicted price) is the distance between the observed value and the predicted value. Its other name is the residual e. It is what the model fails to explain.
The distance C = (predicted price − average price) is what the model can actually explain.
• Let's define quality as the ratio C/A, the share of the quantity to explain that the model actually explains.
When the model is really good, the error B is close to 0, so C is close to A and the ratio is close to 1.
When the model is really bad, the error B is large and close to A, so C is close to 0 and the ratio is close to 0.
• But we worked with only one house, and the definition of quality has to integrate all houses.
The ratio remains the same:

$$\text{Quality} = \frac{C}{A}$$

But we take into consideration all houses:

$$\text{Quality} = \frac{\sum_i C_i}{\sum_i A_i}$$

Unfortunately, some real values might be above or below the predicted values (the line), so we might have issues with the sign depending on the relative location of each dot. An idea is to remove this problem with squared values:

$$\text{Quality} = \frac{\sum_i C_i^2}{\sum_i A_i^2}$$

The quality is a ratio between "the sum of all the squared quantities that the model explains" and "the sum of all the squared quantities to be explained". Let's call this ratio "r²", the "coefficient of determination".
Well, this explanation just gives you the intuition. If we want to be accurate…
3.2 Definition
[Figure: for house i, the vertical distances from the observed price to the average price and to the predicted price]
• Summing the distances of the previous slides over all houses, we define

$$SST = \sum_i (Y_i - \bar{Y})^2, \qquad SSR = \sum_i (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum_i (Y_i - \hat{Y}_i)^2$$

• We can demonstrate that:

$$SST = SSR + SSE \qquad \text{and} \qquad r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• Quality, called the coefficient of determination, is the proportion of the total variation (variance) in the dependent variable that is explained by the variation (variance) in the independent variable.
• The coefficient of determination is called r-squared and is denoted r².
[Three sets of scatter plots: r² = 1 (all dots exactly on the line), 0 < r² < 1 (dots scattered around the line), r² = 0 (no linear relationship)]
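A minimal sketch of the decomposition on made-up data (the fitted values come from an OLS line, which is what makes SST = SSR + SSE hold exactly):

```python
# Sketch: SST = SSR + SSE and r2 = SSR/SST, on made-up data.
import numpy as np

x = np.array([60.0, 90.0, 120.0, 150.0, 200.0])
y = np.array([140.0, 210.0, 260.0, 300.0, 420.0])
b1, b0 = np.polyfit(x, y, deg=1)        # OLS slope and intercept
y_hat = b0 + b1 * x                     # predicted values, on the line

sse = np.sum((y - y_hat) ** 2)          # what the model fails to explain
sst = np.sum((y - y.mean()) ** 2)       # total quantity to explain
ssr = np.sum((y_hat - y.mean()) ** 2)   # what the model explains
print(f"SST = {sst:.1f}, SSR + SSE = {ssr + sse:.1f}, r2 = {ssr / sst:.3f}")
```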
• There is no exact consensus on what a "good" r² is. The larger it is, the better the explanation of the variance of the explained variable.
• An r² close to 0.97 in physics is probably the mark of a faulty theory. An r² close to 0.6 in sociology might be excellent news! A (subjective) scale, from low r² to high, might read: low quality, limited, medium, good, very good, and finally "excellent model, or danger".
• The danger comes from a "too good" model. Have we really created a model, or estimated something obvious and useless? Have we cheated with the data?
• r² just measures the strength of the link in the data and does not care about causality, plausibility or relevancy.
3.3 Example
We come back to SPSS and our dataset and select Analyze / Regression / Linear Regression.
[SPSS output: the model summary reports R²; the ANOVA table gives SSR = 18,934 and SST = 32,600]

$$r^2 = \frac{SSR}{SST} = \frac{18{,}934}{32{,}600} = 0.581$$
3.4 Standard error of the estimate
• The standard error of the estimate is

$$s_{YX} = \sqrt{\frac{SSE}{n - k - 1}}$$

n is the sample size and k the number of predictors. In this part, as we have a single explicative variable, k = 1, so the denominator is (n − 2).
In the numerator, SSE is the sum of squared errors and is measured in $²; the square root brings us back to $.
The denominator is the number of degrees of freedom (we will come back to that). In this context, it is the "quantity of usable information to increase the accuracy of the model, over the minimum quantity of information needed to estimate the unknown coefficients". The sample size is n, while two unknown values, b0 and b1, are estimated: the amount of usable information left to increase accuracy is thus (n − 2), as we "need" two values to estimate two unknowns.
• In the example, the estimated value is 41.330: the average distance between the true price of a house and its estimated value is $41,330 … and the model is not that accurate!
• If SSE = 0, the model is obviously perfect, and thus the distance to the line is equal to 0!
• The magnitude of s_YX should always be judged relative to the size of the Y values in the sample data.
• The lower the value, the more interesting the model is.
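A minimal sketch of the computation (same made-up fit as before; k = 1 because there is a single explicative variable):

```python
# Sketch: standard error of the estimate, s_YX = sqrt(SSE / (n - k - 1)).
import numpy as np

x = np.array([60.0, 90.0, 120.0, 150.0, 200.0])
y = np.array([140.0, 210.0, 260.0, 300.0, 420.0])
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

n, k = len(y), 1
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(f"s_YX = {s_yx:.3f}  (same units as Y)")
```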
• Problem 13.21 has been solved on video (with all details). You can watch it at home and practice on the same
problem.
4 Multiple Linear Regression [Book: chapter 14]
4.1 The logic
• In the population, the model becomes

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

with Y the explained variable, X1, …, Xk the explicative variables, β0, …, βk the coefficients and ε the error term.
• There are k explicative variables but (k + 1) coefficients to estimate: β0, β1, …, βk. It's a common trap.
• As with simple linear regression, the coefficients are estimated thanks to a sample. They are called b0, b1, …, bk, and the predicted value for individual i is

$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki}$$

• The dataset in SPSS is exactly like the ones you are used to manipulating: one row per case and as many columns as variables.
• There are k explicative variables but (k + 1) estimated coefficients.
• With one explicative variable, we estimate a line. With two, we estimate a plane ($\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$), and so on.
[3D plot: the fitted plane over the (X1, X2) plane, with one slope for variable X1 and one slope for variable X2]
4.2 Practical example
• During your internship, you work for Baker&Pies, which produces and sells pies (how surprising), but shelf life is short (typically no more than one day).
• Results: [SPSS regression output for pie sales on price and advertising]
• As a manager, and given this model, should you invest in ads? We will check…
• If we plan the price to be $5.50 and to spend $350 on advertising, predicted sales are obtained by plugging these values into the estimated equation shown in the output.
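A minimal sketch of the same analysis in Python ("pies.csv" and the column names are placeholders; the coefficients and the predicted value must be read from your own output):

```python
# Sketch: multiple regression of sales on price and advertising,
# then a prediction for the planned campaign (price $5.5, advertising $350).
# File and column names are placeholders, not the course dataset.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pies.csv")
model = smf.ols("sales ~ price + advertising", data=df).fit()
print(model.summary())

new = pd.DataFrame({"price": [5.5], "advertising": [350.0]})
print(model.predict(new))  # predicted sales for the planned campaign
```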
• 52.1% of the variation in sales is explained by the variation in price and advertising.
• The quality of this model is limited.
• The coefficient of determination R² is a limited tool: practically, it takes into consideration neither the number of explicative variables nor the sample size.
Example: which model is preferable?
Model A: R² = 80%, n = 100, k = 2
Model B: R² = 81%, n = 100, k = 90
Technically, the higher R² of the second model should make you select it. But Model B requires 90 explicative variables while the first one only needs 2: 88 more explicative variables increase quality by only 1%. Many of them are probably useless.
The quality of the model, once we take into consideration the sample size and the number of variables, is about 79% for the first model but negative for the second one. The first model is much better from a practical point of view.
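These corrected "quality" figures come from the adjusted R², defined in the next section; as a quick check of both models:

$$\text{Model A: } 1 - (1 - 0.80)\,\frac{99}{97} \approx 0.796 \qquad\qquad \text{Model B: } 1 - (1 - 0.81)\,\frac{99}{9} = -1.09$$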
4.3 Adjusted R²
• The adjusted R²,

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1},$$

penalizes the excessive use of unimportant independent variables;
is smaller than R²;
can be negative (but that is not desirable!);
tends to be close to R² when all explicative variables have a significant effect on Y. A small difference between R² and adjusted R² is desirable.
• In the example, 44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and the number of independent variables.
4.4 Practice
• Problems 14.4 and 14.14 have been solved on video (with all details). You can watch them at home and practice on the same problems.
For 14.4, only questions a, b and c (you cannot solve d or e yet).
For 14.14, only questions c and d.
• The most important part is being able to interpret the coefficients, the r² and the adjusted r².