Chapter 4: Predict and Forecast
Part I: Linear Regression
1 Econometrics
1.1 Why “regression” and “econometrics”
• You can show that a mean matches a target, you know how to compare two groups, and you can check whether two variables are related. That is important, but far from sufficient.
• As a manager you will need to predict values. But real products are complicated: a car is a combination of multiple interacting factors. As a marketer you must know which ones matter to customers and what relative value those customers give to each feature. Even more important: you need to create a model to predict how desirable a new car would be.
• We also want to measure the quality of our models and check when they are usable.
A model that predicts the price of a laptop with a margin of error close to €45K is completely useless.
A model that predicts only how much "men, students, in Lille, going to the cinema twice a month" are willing to pay for this computer is equally useless, as that is only a tiny part of the potential market.
• Historically, methods to "predict" come from a field of science called econometrics. Its original purpose was to test economic theories against real data.
• It turns out that tools developed for econometrics are incredibly interesting from a business point of view: they are efficient, scientifically validated and well known by a large community of professionals. Some might even say "90% of market finance is econometrics".
• Methods created for econometrics are also the most basic foundations of Big Data and Artificial Intelligence: basically, the question there was to develop even more efficient tools able to manipulate mountains of unstructured data.
• In this lecture we cover the first important method, linear regression; more accurately, multivariate regression, whose objective is to predict a quantitative value thanks to one or many predictors.
1.2 Example
You work for Tata Group. The Tata group comprises over 90 operating companies in seven business sectors, with operations in more than 80 countries across six continents. Today, Tata Group wants to advertise and promote its image in France but does not know the French advertising market. The company especially wants to avoid mistakes such as wasting money on low-impact media: you need to find the most cost-effective media.
Specifically, the company is interested in the effectiveness of advertising for selling cars, foodstuff or IT products. A sample of 98 other companies has been selected for study over a period of one month, and you have been given their data.
• We defined a problem statement:
"How can we create the best marketing campaign?"
• We want to check assumptions:
"A well-known actor is going to change the effectiveness of an ad on TV."
"YouTube is more efficient for cars."
• We want a model:
"Change in Brand Name Fame = 10 + 8 × Investment on TV + 5 × Famous Actor"
• We must check the accuracy of this model:
"If we spend €1 on TV, can we say that Fame is going to increase by 8 units? Is this value accurate? Is it exactly 8, or anywhere from 2 to 14?"
"Can we demonstrate that this model allows us to forecast the effect of a new marketing campaign?"
1.3 Definitions
• We use regression analysis to predict the value of a dependent variable based on one or more independent variables.
The dependent variable is called Y. It is a continuous quantitative variable (a price, a temperature…). It is also called the explained variable.
The independent variable is called X. It might be a quantitative or a qualitative variable. It is also called the explicative or explanatory variable. When we face more than one independent variable, they are called X1, X2, …, Xk.
Ex: we want to predict the price of a house (Y) using its surface (X1) and its age (X2).
• The link between variables is a model. It is just an estimate of the true relationship between those variables, and it probably neglects the effect of many other variables.
We want a correct specification of the model, i.e. we selected all the right variables to explain Y.
We want a correct shape for the model, i.e. we created the right function to link X and Y.
Ex: Y is a function of surface and age; we assume that other variables have no effect. This specification is probably not correct, as many other variables influence the price. We also assume that the shape, the "link between variables", is linear: no exponential shape and so on. It would be close to a function like Y = a + bX1 + cX2.
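To make this concrete, here is a minimal sketch in Python (not the course's SPSS workflow; the data are synthetic, so all names and numbers are purely illustrative) of fitting such a linear model of price on surface and age:

```python
# Minimal sketch: fit Y = a + b*X1 + c*X2 on synthetic house data.
# All values are made up for illustration; SPSS is the tool used in class.
import numpy as np

rng = np.random.default_rng(0)
surface = rng.uniform(50, 350, size=100)   # X1, in m2
age = rng.uniform(0, 80, size=100)         # X2, in years
# "True" relationship plus noise: a model only ever approximates this.
price = 10 + 2.0 * surface - 0.5 * age + rng.normal(0, 30, size=100)

# Least-squares fit: design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones_like(surface), surface, age])
a, b, c = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```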
• A scatter plot typically displays an explicative variable on the x-axis and the explained variable on the y-axis.
Ex: we want to predict the price of a house (K€) using its surface (m²). For clarity, we neglect "age". The scatter plot shows, as expected, that a larger house tends to be more expensive.
[Scatter plot: surface (m², 50–350) on the x-axis, price (K€, 0–800) on the y-axis]
• The relationship is $Y = f(X) + \varepsilon$, where f is the function which links X to Y and ε is the error, as a model is never perfect.
[Scatter plot: surface of house i on the x-axis, price on the y-axis, with the fitted line labeled "Model"]
• ε is needed: some observed values of X or Y in the database may be faulty (errors in coding…), Y may be explained by other variables, the function may under-estimate the effect of X on Y…
• Obviously, we would like this error to be as small as possible!
• The equation of the model is the function that links X and Y.
• When we work on the whole population, the equation of this model is $Y = \beta_0 + \beta_1 X + \varepsilon$.
[Scatter plot with the fitted line: surface (x-axis) vs price (K€, y-axis); the Y-axis intercept and the slope of the line are highlighted]
• The Y-axis intercept is β0. The slope is β1. Let's assume that the equation of the line is Y = 10 + 2X (price in K€, surface in m²).
When X, the surface, increases by 1, the price of the house increases by 2 K€, i.e. €2,000.
When X is equal to 0, the price of the house is €10,000 (?). Well… there is no such house, or it might be taxes or the land itself?
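A quick worked check of how the line is read, using the illustrative equation above:

$$\hat{Y}(100) = 10 + 2 \times 100 = 210 \text{ K€}, \qquad \hat{Y}(101) = 10 + 2 \times 101 = 212 \text{ K€}$$

One extra square metre adds exactly the slope, 2 K€ = €2,000, to the predicted price.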
• For observation i, the model reads $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$: $Y_i$ is the dependent variable, $\beta_0 + \beta_1 X_i$ the linear component and $\varepsilon_i$ the random error term.
• Don’t forget that Y is the REAL value, it’s not “on” the line.
• The distance between the line (the predicted value) and the real value is the error.
• The error or statistical error ε is the amount by which a real value differs from the predicted value when we work on the whole population. It should be as small as possible. It may be explained by:
a specification error: a faulty selection of explicative variables, or we forgot important ones.
Ex: predicting the price of a house without taking the location into consideration.
a mis-specified function: we created a linear relationship while the real shape is non-linear.
Ex: modeling COVID with a linear shape when it is exponential.
faulty data or outliers in the database.
Ex: the surface of a house is coded 100,000 m². Probably an error in the dataset!
• Practically, the error is NEVER observed, as it assumes that the model has been created on the whole population… while you probably only collected a sample. What is observed is the residual, called e. It is an observable estimate of ε created thanks to a sample.
Ex: the mean age in the population is 20. We select a sample from this population; the average age in this sample is 17. If a man is 21, then the residual is 21 − 17 = 4 while the error is 21 − 20 = 1.
The error ε is obtained when the model is created from the population.
The residual e is obtained when the model is created from a sample.
Ch 4. Prediction and Regression
13
1 Econometrics
1.3 Definitions
• As the sample might not behave exactly like the population, errors and residuals are not the same.
• We would like them to be the same, as in that case the sample is a good representation of the population.
[Scatter plot showing, for house i, both the residual (distance to the line estimated from the sample) and the error (distance to the population line)]
• In the population the model is (Greek letters for the population): $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$.
• We say that b0 and b1 are the estimated values of β0 and β1. They are also called the estimate of the intercept and the estimate of the slope. If we're lucky (if the sample behaves exactly like the population), estimated and true values are equal.
• Y is the REAL value.
• All econometrics can be summarized in this simple sentence: how confident can we be that the estimated coefficients are good estimations of the real but unknown ones?
• In the sample, the estimated model is $\hat{Y}_i = b_0 + b_1 X_i$, where $\hat{Y}_i$ is the predicted or estimated Y value for observation i. Note that there is no error term.
• By definition, the predicted value is on the line. The error is the gap between the line and the real value… but we are not discussing the real value here, so we don't write any "error".
• Now let's move to some practice. At the beginning, for pedagogical purposes, we assume that we face a single explicative variable X and that the relationship between the variables is indeed linear (we remove this limitation in the last part).
2 Your first model (one explicative variable) [Book: chapter 13]
2.1 Global method
2.1.1 Steps
What should be done to create a "good" model?
• As the error is the distance between the observed value and the predicted value, we want to minimize $e_i = y_i - \hat{y}_i$,
• which is equal to $e_i = y_i - b_0 - b_1 x_i$.
• We want to minimize the sum of squared errors. Thus the name "OLS", for ordinary least squares method, or "MCO" for moindres carrés ordinaires.
2.2 Estimation of coefficients
2.2.1 Logic
We look for b0 and b1 that minimize this equation:

$$\min \sum_{i=1}^{n} e_i^2 = \min \sum_{i=1}^{n} \left( y_i - b_1 x_i - b_0 \right)^2 \qquad \text{(equation 1)}$$

Setting the derivative with respect to b0 to zero gives $b_0 = \bar{y} - b_1 \bar{x}$ (equation 2), and we can replace it in equation 1 to obtain

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

But you will never have to derive them by hand.
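As a sketch of what SPSS computes for you (the five data points below are made up; only the formulas matter), the closed-form estimates can be checked in a few lines of Python:

```python
# Sketch: OLS estimates from the closed-form formulas, checked against numpy.
import numpy as np

x = np.array([60.0, 90.0, 120.0, 150.0, 200.0])    # made-up surfaces
y = np.array([140.0, 210.0, 260.0, 300.0, 420.0])  # made-up prices

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()                      # equation 2
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

# Same result from a generic least-squares routine:
slope, intercept = np.polyfit(x, y, deg=1)
print(f"check: b0 = {intercept:.3f}, b1 = {slope:.3f}")
```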
2.2.2 Example
We come back to SPSS and our dataset and select Analyze / Regression / Linear Regression.
[SPSS output: the coefficients table shows the explicative variable and its estimated value, called b1]
• Problem 13.9 has been solved on video (with all details). You can watch it at home and practice on the same problem.
• You can practice on:
Problem 13.4 (file "cars"). Don't forget that answers are available at the end of the book.
Problem 13.6 (file "FTMBA").
• The most important part is being able to interpret the meaning of the coefficients.
• You can, of course, already try to predict any quantitative variable thanks to another quantitative variable; a Python equivalent of the SPSS steps is sketched below.
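If you want to reproduce the SPSS dialog outside SPSS, a minimal Python equivalent would look like this ("data.csv", "price" and "surface" are placeholders for whichever dataset you practice on):

```python
# Sketch: the "Analyze / Regression / Linear Regression" steps in Python.
# File and column names are placeholders, not the course files.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")
model = smf.ols("price ~ surface", data=df).fit()
print(model.summary())  # coefficient table, r-squared, standard error...
```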
3 Quality of a model (one explicative variable) [Book: chapter 13]
3.1 Concept
• A model without an easy-to-read measure of its quality is completely worthless.
• Quality can be defined as "on average, a close proximity between the predicted value and the observed value".
• When there is no relationship between X and Y, we can observe two typical shapes:
a random cluster of dots without any specific orientation;
for any value of X, Y remains the same.
[Two scatter plots: no relationship between X and Y]
[Figure: for a house with surface Xi, three values on the Y axis: the real price, the predicted price (on the line) and the average price]
• A good model has to explain why the price of your house is not equal to the price of the average house in the city.
The distance A = (real price − average price) is the quantity to explain.
The distance B = (real price − predicted price) is the distance between the observed value and the predicted value. Its other name is the residual e. It is what the model fails to explain.
The distance C = (predicted price − average price) is what the model can actually explain.
• Let's define quality as the ratio C/A, the share of the quantity to explain that the model actually explains.
When the model is really good, the error B is close to 0, so C is close to A and the ratio is close to 1.
When the model is really bad, the error B is large and close to A, so C is close to 0 and the ratio is close to 0.
• But we worked with only one house, and the definition of quality has to integrate all houses.
The ratio remains the same:

$$\text{Quality} = \frac{C}{A}$$

But we take into consideration all houses:

$$\text{Quality} = \frac{\sum_i C_i}{\sum_i A_i}$$

Unfortunately, some real values might be above or below the predicted values (the line), so we might have issues with the sign depending on the relative location of each dot. An idea is to remove this problem with squared values:

$$\text{Quality} = \frac{\sum_i C_i^2}{\sum_i A_i^2}$$

The quality is a ratio between "the sum of all the squared quantities that the model explains" and "the sum of all the squared quantities to be explained". Let's call this ratio "r²", the "coefficient of determination".
Well, this explanation just gives you the intuition. If we want to be accurate…
3.2 Definition
[Figure: for house i, the vertical distances from the observed price to the average price and to the predicted price]
• Summing the distances of the previous slides over all houses, we define

$$SST = \sum_i (Y_i - \bar{Y})^2, \qquad SSR = \sum_i (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum_i (Y_i - \hat{Y}_i)^2$$

• We can demonstrate that:

$$SST = SSR + SSE \qquad \text{and} \qquad r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

• Quality, called the coefficient of determination, is the proportion of the total variation (variance) in the dependent variable that is explained by the variation (variance) in the independent variable.
• The coefficient of determination is called r-squared and is denoted r².
[Three sets of scatter plots: r² = 1 (all dots exactly on the line), 0 < r² < 1 (dots scattered around the line), r² = 0 (no linear relationship)]
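A minimal sketch of the decomposition on made-up data (the fitted values come from an OLS line, which is what makes SST = SSR + SSE hold exactly):

```python
# Sketch: SST = SSR + SSE and r2 = SSR/SST, on made-up data.
import numpy as np

x = np.array([60.0, 90.0, 120.0, 150.0, 200.0])
y = np.array([140.0, 210.0, 260.0, 300.0, 420.0])
b1, b0 = np.polyfit(x, y, deg=1)        # OLS slope and intercept
y_hat = b0 + b1 * x                     # predicted values, on the line

sse = np.sum((y - y_hat) ** 2)          # what the model fails to explain
sst = np.sum((y - y.mean()) ** 2)       # total quantity to explain
ssr = np.sum((y_hat - y.mean()) ** 2)   # what the model explains
print(f"SST = {sst:.1f}, SSR + SSE = {ssr + sse:.1f}, r2 = {ssr / sst:.3f}")
```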
• There is no exact consensus on what a "good" r² is. The larger it is, the better the explanation of the variance of the explained variable.
• An r² close to 0.97 in physics is probably the mark of a faulty theory. An r² close to 0.6 in sociology might be excellent news! A (subjective) scale, from low r² to high, might read: low quality, limited, medium, good, very good, and finally "excellent model, or danger".
• The danger comes from a "too good" model. Have we really created a model, or estimated something obvious and useless? Have we cheated with the data?
• r² just measures the strength of the link in the data and does not care about causality, plausibility or relevancy.
3.3 Example
We come back to SPSS and our dataset and select Analyze / Regression / Linear Regression.
[SPSS output: the model summary reports R²; the ANOVA table gives SSR = 18,934 and SST = 32,600]

$$r^2 = \frac{SSR}{SST} = \frac{18{,}934}{32{,}600} = 0.581$$
3.4 Standard error of the estimate
• The standard error of the estimate is

$$s_{YX} = \sqrt{\frac{SSE}{n - k - 1}}$$

n is the sample size and k the number of predictors. In this part, as we have a single explicative variable, k = 1, so the denominator is (n − 2).
In the numerator, SSE is the sum of squared errors and is measured in $²; the square root brings us back to $.
The denominator is the number of degrees of freedom (we will come back to that). In this context, it is the "quantity of usable information to increase the accuracy of the model, over the minimum quantity of information needed to estimate the unknown coefficients". The sample size is n, while two unknown values, b0 and b1, are estimated: the amount of usable information left to increase accuracy is thus (n − 2), as we "need" two values to estimate two unknowns.
• In the example, the estimated value is 41.330: the average distance between the true price of a house and its estimated value is $41,330 … and the model is not that accurate!
• If SSE = 0, the model is obviously perfect, and thus the distance to the line is equal to 0!
• The magnitude of s_YX should always be judged relative to the size of the Y values in the sample data.
• The lower the value, the more interesting the model is.
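A minimal sketch of the computation (same made-up fit as before; k = 1 because there is a single explicative variable):

```python
# Sketch: standard error of the estimate, s_YX = sqrt(SSE / (n - k - 1)).
import numpy as np

x = np.array([60.0, 90.0, 120.0, 150.0, 200.0])
y = np.array([140.0, 210.0, 260.0, 300.0, 420.0])
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

n, k = len(y), 1
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(f"s_YX = {s_yx:.3f}  (same units as Y)")
```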
• Problem 13.21 has been solved on video (with all details). You can watch it at home and practice on the same
problem.
4 Multiple Linear Regression [Book: chapter 14]
4.1 The logic
• In the population, the model becomes

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

with Y the explained variable, X1, …, Xk the explicative variables, β0, …, βk the coefficients and ε the error term.
• There are k explicative variables but (k + 1) coefficients to estimate: β0, β1, …, βk. It's a common trap.
• As with simple linear regression, the coefficients are estimated thanks to a sample. They are called b0, b1, …, bk, and the predicted value for individual i is

$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki}$$

• The dataset in SPSS is exactly like the ones you are used to manipulating: one row per case and as many columns as variables.
• There are k explicative variables but (k + 1) estimated coefficients.
• With one explicative variable, we estimate a line. With two, we estimate a plane ($\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$), and so on.
[3D plot: the fitted plane over the (X1, X2) plane, with one slope for variable X1 and one slope for variable X2]
4.2 Practical example
• During your internship, you work for Baker&Pies, which produces and sells pies (how surprising), but shelf life is short (typically no more than one day).
• Results: [SPSS regression output for pie sales on price and advertising]
• As a manager, and given this model, should you invest in ads? We will check…
• If we plan the price to be $5.50 and to spend $350 on advertising, predicted sales are obtained by plugging these values into the estimated equation shown in the output.
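A minimal sketch of the same analysis in Python ("pies.csv" and the column names are placeholders; the coefficients and the predicted value must be read from your own output):

```python
# Sketch: multiple regression of sales on price and advertising,
# then a prediction for the planned campaign (price $5.5, advertising $350).
# File and column names are placeholders, not the course dataset.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pies.csv")
model = smf.ols("sales ~ price + advertising", data=df).fit()
print(model.summary())

new = pd.DataFrame({"price": [5.5], "advertising": [350.0]})
print(model.predict(new))  # predicted sales for the planned campaign
```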
• 52.1% of the variation in sales is explained by the variation in price and advertising.
• The quality of this model is limited.
• The coefficient of determination R² is a limited tool: practically, it takes into consideration neither the number of explicative variables nor the sample size.
Example: which model is preferable?
Model A: R² = 80%, n = 100, k = 2
Model B: R² = 81%, n = 100, k = 90
Technically, the higher R² of the second model should make you select it. But Model B requires 90 explicative variables while the first one only needs 2: 88 more explicative variables increase quality by only 1%. Many of them are probably useless.
The quality of the model, once we take into consideration the sample size and the number of variables, is about 79% for the first model but negative for the second one. The first model is much better from a practical point of view.
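These corrected "quality" figures come from the adjusted R², defined in the next section; as a quick check of both models:

$$\text{Model A: } 1 - (1 - 0.80)\,\frac{99}{97} \approx 0.796 \qquad\qquad \text{Model B: } 1 - (1 - 0.81)\,\frac{99}{9} = -1.09$$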
4.3 Adjusted R²
• The adjusted R²,

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1},$$

penalizes the excessive use of unimportant independent variables;
is smaller than R²;
can be negative (but that is not desirable!);
tends to be close to R² when all explicative variables have a significant effect on Y. A small difference between R² and adjusted R² is desirable.
• In the example, 44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and the number of independent variables.
4.4 Practice
• Problems 14.4 and 14.14 have been solved on video (with all details). You can watch them at home and practice on the same problems.
For 14.4, only questions a, b and c (you cannot solve d or e yet).
For 14.14, only questions c and d.
• The most important part is being able to interpret the coefficients, the r² and the adjusted r².