Lecture 13 (2)

Lecture 13: Statistical Inference by Dr. Javed Iqbal
Multiple Regression (2):
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.967530
R Square            0.936115
Adjusted R Square   0.920144
Standard Error      880.505444
Observations        11

ANOVA
             df   SS            MS         F          Significance F
Regression    2   90883135.85   45441568   58.61236   1.67E-05
Residual      8   6202318.693   775289.8
Total        10   97085454.55

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    18303.5208     1134.76186       16.12983   2.19E-07   15686.76    20920.29
Age(years)   -950.4270      387.4188755      -2.45323   0.039736   -1843.82    -57.0375
Miles        -0.0821        0.025520666      -3.21889   0.01226    -0.141      -0.0233

For the Orion car data, the estimated model in equation form is:
𝑦̂ = 18303.5 − 950.4 𝑥1 − 0.0821 𝑥2
where y = price of the car ($), 𝑥1 = age of the car (years), 𝑥2 = miles driven.

Coefficient of determination R²: the proportion of the total variation in the dependent variable (y)
that is explained by the independent (x) variables of the model.
SST = SSR + SSE
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
where SST = total sum of squares, SSR = sum of squares due to regression (explained SS),
SSE = sum of squares due to error (unexplained SS).
For the Orion data R² = 0.936. This shows that 93.6% of the variation in the prices of used Orion
cars is explained, through this model, by the age of the car and the number of miles driven.
What explains the remaining 6.4% of the variation in car prices? Other factors not considered in
the model, e.g. colour, condition of the car, number of damages, etc.
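As a quick check, R² can be recomputed from the sums of squares in the ANOVA table of the Orion output. The following is a minimal sketch in R; the variable names are illustrative, and the adjusted R² formula used here is the standard one, which these notes report but do not state.

```r
# Sums of squares copied from the ANOVA table of the Orion output
ssr <- 90883135.85    # regression (explained) sum of squares
sse <- 6202318.693    # residual (unexplained) sum of squares
sst <- ssr + sse      # total sum of squares

n <- 11               # number of observations
k <- 2                # number of predictors (age, miles)

r2 <- ssr / sst       # equivalently 1 - sse / sst

# Standard adjusted R-squared (assumed formula, not stated in the notes)
adj_r2 <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(round(c(R2 = r2, Adjusted_R2 = adj_r2), 4))
```

Both values should agree with the "R Square" and "Adjusted R Square" rows of the Excel output up to rounding.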
Statistical Significance of Coefficients:
For the hypothesis H0: β1 = 0 (age of the car is not a useful predictor of car price)
against H1: β1 ≠ 0 (age of the car is a useful predictor of car price),
the test statistic is:
$$t = \frac{b_1 - \beta_1}{SE(b_1)}$$
Here b1 is the sample estimate of the parameter, β1 is the value of the parameter under the null
hypothesis, and SE(b1) is the standard error of b1.
The statistic has a Student's t distribution with n − (k+1) degrees of freedom (n = number of
observations or sample size, k + 1 = number of model parameters including the intercept).
The t statistic is −2.45 and the critical values are t(0.025, 11 − 3 = 8 df) = ±2.306. Since
|−2.45| > 2.306, the null hypothesis is rejected and we conclude that the age of the car is a
useful predictor of its price.
(Alternatively, p-value = 0.0397 < 0.05, so we reject the null hypothesis and reach the same conclusion.)
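The two-tail test for the age coefficient can be reproduced from the Excel output with base R's t-distribution functions `qt()` and `pt()`. This is a sketch under the assumption that only the coefficient and its standard error are available; the variable names are illustrative.

```r
# Values copied from the Age(years) row of the Orion output
b1    <- -950.4270      # estimated coefficient of age
se_b1 <- 387.4188755    # standard error of b1

n  <- 11                # observations
k  <- 2                 # predictors
df <- n - (k + 1)       # degrees of freedom = 8

t_stat  <- (b1 - 0) / se_b1             # hypothesised value of beta1 is 0
t_crit  <- qt(1 - 0.05 / 2, df)         # upper 2.5% critical value
p_value <- 2 * pt(-abs(t_stat), df)     # two-tail p-value

reject_h0 <- abs(t_stat) > t_crit
print(c(t = t_stat, critical = t_crit, p = p_value))
```

Comparing `abs(t_stat)` with `t_crit` and comparing `p_value` with 0.05 are two routes to the same decision.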
Ex: Test the hypothesis that there is a negative relationship between the age and the price of a car.
For the hypothesis H0: β1 ≥ 0 (no or positive relationship)
against H1: β1 < 0 (as age increases, car price decreases):
The t statistic is −2.45 and the critical value is t(0.05, 8 df) = −1.860. Since −2.45 < −1.860,
the null hypothesis is rejected and we conclude that there is indeed a negative relationship
between the age and the price of a car.
[Alternatively, the p-value of the test = reported (software) two-tail p-value / 2 =
0.03974/2 = 0.01987 < 0.05. Hence the null hypothesis is rejected in favor of the alternative at
the 5% significance level.]
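The one-tail version of the test can be sketched in R in the same way. The left-tail p-value is computed directly with `pt()` and compared with half of the two-tail p-value; the variable names are illustrative.

```r
# t statistic for the age coefficient, from the Orion output
t_stat <- -950.4270 / 387.4188755
df     <- 8

t_crit_left <- qt(0.05, df)              # lower 5% critical value
p_left      <- pt(t_stat, df)            # P(T <= t_stat), the left-tail p-value
p_two_tail  <- 2 * pt(-abs(t_stat), df)  # what the software reports by default

reject_h0 <- t_stat < t_crit_left
print(c(critical = t_crit_left, p_left = p_left, half_two_tail = p_two_tail / 2))
```

The two p-values coincide here because the estimated coefficient is negative, which is the direction stated in the alternative hypothesis.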
For the hypothesis H0: β2 = 0 (number of miles driven is not a useful predictor of car price)
against H1: β2 ≠ 0 (number of miles driven is a useful predictor of car price):
The t statistic is −3.22 and the critical values are t(0.025, 11 − 3 = 8 df) = ±2.306. Since
|−3.22| > 2.306, the null hypothesis is rejected and we conclude that the number of miles driven
is indeed a useful predictor of car price.
(Alternatively, p-value = 0.012 < 0.05, so we reject the null hypothesis and reach the same conclusion.)
[Note: The two-tail test of the hypothesis that a regression parameter is zero is reported by
default by Excel as well as other software. A one-tail p-value can be obtained by dividing the
two-tail p-value by 2, provided the sign of the estimated coefficient agrees with the alternative
hypothesis.]
Prediction from the model: Suppose we want to predict the price of an Orion that is 4 years
old and has already been driven 50,000 miles.
𝑦̂ = 18303.5 − 950.4 (4) − 0.0821 (50000) = $10,396.9
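The same prediction can be written as a small R function built from the rounded coefficients of the estimated equation. The function name `predict_price` is hypothetical and used only for illustration.

```r
# Predicted price ($) of an Orion from the estimated equation (rounded coefficients)
predict_price <- function(age_years, miles) {
  18303.5 - 950.4 * age_years - 0.0821 * miles
}

y_hat <- predict_price(age_years = 4, miles = 50000)
print(y_hat)
```

When a fitted `lm` object is available, `predict()` with a `newdata` data frame gives the same kind of prediction using the unrounded coefficients, so the result may differ slightly in the decimals.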
Estimation of multiple regression in Excel:

Go to Data tab > Data Analysis > Regression.
Input the y range and the x range (the x variables must be in adjacent columns). Tick Labels if
the row of variable names is also selected.
Note: The Analysis ToolPak add-in must be enabled in Excel. To do this within Excel:
File > Options > Add-ins > select Analysis ToolPak > Go > tick Analysis ToolPak > OK
The 'Data Analysis' command then becomes visible in the Data tab.
# Multiple regression in R
orion <- read.csv(file.choose())   # choose the orion1.csv data file
head(orion)                        # inspect the first few rows
model1 <- lm(price ~ age + miles, data = orion)   # data = orion makes attach() unnecessary
summary(model1)
# Present the coefficient table with 5 decimals (avoids scientific notation)
round(summary(model1)$coefficients, 5)
Anderson Ex 4, 5, pdf p-769:

[Ex 5: Estimated equation: $\widehat{\text{Revenue}} = 83.23 + 2.29\,\text{TV} + 1.30\,\text{NewsPaper}$]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.958663
R Square            0.919036
Adjusted R Square   0.88665
Standard Error      0.642587
Observations        8

ANOVA
             df   SS         MS         F          Significance F
Regression    2   23.43541   11.7177    28.37777   0.001865
Residual      5   2.064592   0.412918
Total         7   25.5

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%
Intercept    83.23009       1.573869         52.88248   4.57E-08   79.18433    87.27585    79.18433
TV           2.290184       0.304065         7.531899   0.000653   1.508561    3.071806    1.508561
NewsPaper    1.300989       0.320702         4.056697   0.009761   0.476599    2.125379    0.476599

Note that in any regression study we generally expect to consider 4 aspects:

(1) Estimation of the parameters and their real-life interpretation
(2) Prediction of the value of the dependent variable from the estimated model, given the predictors
(3) Interpretation of R² in practical terms
(4) Testing hypotheses on individual parameters / all parameters of the model
[F test for overall significance in multiple regression:
H0: β1 = β2 = ⋯ = βk = 0 vs H1: at least one βi ≠ 0 (i = 1, 2, …, k)
$$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)}$$
This F statistic has k and n − (k+1) degrees of freedom and is reported by Excel (in the ANOVA
section) and by all statistical software.
For the Orion case, F = 58.61 with p-value = 0.0000167, so the null hypothesis is rejected and we
conclude that at least one variable (age, miles, or both) has a significant impact on price.
Note that the ANOVA portion of Excel's regression output cannot be used for testing equality of
means (as required in assignments).]
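The overall F test for the Orion model can be recomputed from the ANOVA sums of squares using base R's `pf()` and `qf()`. This is a sketch with illustrative variable names, not a replacement for the software output.

```r
# Sums of squares from the ANOVA table of the Orion output
ssr <- 90883135.85
sse <- 6202318.693
n   <- 11
k   <- 2

msr <- ssr / k              # mean square due to regression
mse <- sse / (n - k - 1)    # mean square error

f_stat  <- msr / mse
f_crit  <- qf(0.95, df1 = k, df2 = n - k - 1)                      # 5% critical value
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

reject_h0 <- f_stat > f_crit
print(c(F = f_stat, critical = f_crit, p = p_value))
```

The computed F statistic and p-value should agree with the "F" and "Significance F" entries of the ANOVA table up to rounding.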
Use of Qualitative / Categorical Independent Variables:
(i) Qualitative x variable with binary (two) categories
How can we measure the effect of gender on wage, i.e. are the wages of males and females the
same on average? Here we want to explain y = wage with the help of x = gender.
But gender is a qualitative variable. We can define a dummy variable, e.g. Female = 1 if the
person is female and Female = 0 if male. We use this 0/1 code as the x variable in the
regression.
Example: $\widehat{\text{Wage}}$ (Rs.) = 20,000 − 3,500 Female
The intercept: the average wage of a male is Rs. 20,000.
The slope: the average wage of a female is Rs. 3,500 lower than that of a male.
(ii) Qualitative x variable with k categories: use k − 1 dummy variables
Consider house prices ($1000) in three different city zones: East, West and South.
We code and include any two dummy variables, e.g.
East = 1 if the house is in the East Zone, 0 otherwise
West = 1 if the house is in the West Zone, 0 otherwise
keeping the South Zone as the reference category. The estimated model may look like:
$\widehat{\text{Price}}$ = 200 + 50 East − 75 West
Interpretation:
200: The average house price in the South Zone is $200,000.
50: The house price in the East Zone is on average $50,000 higher than in the South Zone.
-75: The house price in the West Zone is on average $75,000 lower than in the South Zone.
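In R, the k − 1 dummy coding can be generated from a factor instead of being typed by hand. The sketch below uses a small made-up vector of zones (hypothetical data, purely for illustration) together with the illustrative coefficients 200, 50 and −75 from the example; `relevel()` sets South as the reference category and `model.matrix()` builds the dummy columns.

```r
# Hypothetical zone labels for four houses (illustration only)
zone <- factor(c("East", "West", "South", "East"))
zone <- relevel(zone, ref = "South")      # South becomes the reference category

# Design matrix: intercept plus k - 1 = 2 dummy columns (zoneEast, zoneWest)
X <- model.matrix(~ zone)
print(X)

# Illustrative coefficients from the example: 200 + 50 East - 75 West
b <- c(200, 50, -75)
price_hat <- as.vector(X %*% b)           # predicted price in $1000 for each house
print(price_hat)
```

A house in the South Zone has both dummies equal to 0, so its prediction is just the intercept.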
Anderson Example, pdf p-786:
In Table 15.5, y = repair time (in hours) of a water filtration unit is explained by
x = months since the last service.
𝑦̂ = 2.15 + 0.304 𝑥, R² = 0.534
Interpret the intercept, the slope and R².
The repair time y also depends on whether the defect is electrical or mechanical.
Again, this is a qualitative variable, so we can code one of the categories as 1 (e.g. electrical)
and the other as 0 (mechanical).
𝑦̂ = 0.93 + 0.388 𝑥1 + 1.26 𝑥2, R² = 0.859
Here x1 = months since the last service and x2 = 1 if the defect is electrical and 0 if
mechanical.
Interpret the coefficients and R². Predict the repair time for a mechanical defect when the
last service was 6 months earlier.
Look at the regression output (Fig 15.7): Is each variable individually significant, and is the
overall regression significant, at the 5% level?
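The prediction asked for above can be sketched in R from the rounded coefficients of the second equation. The function name `predict_repair` is hypothetical; the dummy argument is 1 for an electrical defect and 0 for a mechanical one.

```r
# Predicted repair time (hours) from y-hat = 0.93 + 0.388 x1 + 1.26 x2
predict_repair <- function(months_since_service, electrical) {
  0.93 + 0.388 * months_since_service + 1.26 * electrical
}

mech_6 <- predict_repair(6, electrical = 0)   # mechanical defect, 6 months since service
elec_6 <- predict_repair(6, electrical = 1)   # electrical defect, same service gap
print(c(mechanical = mech_6, electrical = elec_6))
```

The difference between the two predictions equals the dummy coefficient, which is how that coefficient is interpreted.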
Some Exercises from Anderson:
Ex 4 pdf p-769, Ex 5 pdf p-769 (check Rev = 83.23 + 2.29 TVAd + 1.30 NPAd)
Ex 14 pdf p-775 (only parts d and f), Ex 34 pdf p-791, Ex 38 pdf p-793
Some further exercises (especially to illustrate dummy x variables)
Ex1: Consider factors such as the number of megapixels, weight (oz.), and overall score
(ranging from 0 to 100) used to explain the prices of a sample of Canon and Nikon cameras.
The last column is a brand dummy: 1 = Canon, 0 = Nikon.
Observation Brand Price_$ Megapixels Weight_oz Score Brand_dummy
1 Canon 330 10 7 66 1
2 Canon 200 12 5 66 1
3 Canon 300 12 7 65 1
4 Canon 200 10 6 62 1
5 Canon 180 12 5 62 1
6 Canon 200 12 7 61 1
7 Canon 200 14 5 60 1
8 Canon 130 10 7 60 1
9 Canon 130 12 5 59 1
10 Canon 110 16 5 55 1
11 Canon 90 14 5 52 1
12 Canon 100 10 6 51 1
13 Canon 90 12 7 46 1
14 Nikon 270 16 5 65 0
15 Nikon 300 16 7 63 0
16 Nikon 200 14 6 61 0
17 Nikon 400 14 7 59 0
18 Nikon 120 14 5 57 0
19 Nikon 170 16 6 56 0
20 Nikon 150 12 5 56 0
21 Nikon 230 14 6 55 0
22 Nikon 180 12 6 53 0
23 Nikon 130 12 6 53 0
24 Nikon 80 12 7 52 0
25 Nikon 80 14 7 50 0
26 Nikon 100 12 4 46 0
27 Nikon 110 12 5 45 0
28 Nikon 130 14 4 42 0
Estimate the regression model, write down the estimated equation, and interpret the coefficients.
Predict the price of a Nikon camera with 14 megapixels, a weight of 6 oz and a score of 55.
Interpret R². Test the hypothesis (at the 5% level) that the average price of Canon cameras is
significantly lower than that of Nikon cameras.
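One possible way to set up this exercise in R is sketched below. The data are typed in from the table above; the column names (`price`, `megapixels`, `weight`, `score`, `canon`) are chosen here for illustration, and no output is quoted because the results should be read from `summary()`.

```r
# Camera data typed in from the table (rows 1-13 Canon, rows 14-28 Nikon)
cameras <- data.frame(
  price = c(330, 200, 300, 200, 180, 200, 200, 130, 130, 110, 90, 100, 90,
            270, 300, 200, 400, 120, 170, 150, 230, 180, 130, 80, 80, 100, 110, 130),
  megapixels = c(10, 12, 12, 10, 12, 12, 14, 10, 12, 16, 14, 10, 12,
                 16, 16, 14, 14, 14, 16, 12, 14, 12, 12, 12, 14, 12, 12, 14),
  weight = c(7, 5, 7, 6, 5, 7, 5, 7, 5, 5, 5, 6, 7,
             5, 7, 6, 7, 5, 6, 5, 6, 6, 6, 7, 7, 4, 5, 4),
  score = c(66, 66, 65, 62, 62, 61, 60, 60, 59, 55, 52, 51, 46,
            65, 63, 61, 59, 57, 56, 56, 55, 53, 53, 52, 50, 46, 45, 42),
  canon = rep(c(1, 0), times = c(13, 15))   # brand dummy: 1 = Canon, 0 = Nikon
)

fit <- lm(price ~ megapixels + weight + score + canon, data = cameras)
print(summary(fit))   # coefficients, R-squared, t and F tests

# Predicted price of a Nikon (canon = 0) with 14 megapixels, 6 oz, score 55
new_cam <- data.frame(megapixels = 14, weight = 6, score = 55, canon = 0)
pred <- predict(fit, newdata = new_cam)
print(pred)

# One-tail test of H1: coefficient of canon < 0 (Canon cheaper than Nikon)
t_canon <- summary(fit)$coefficients["canon", "t value"]
p_left  <- pt(t_canon, df.residual(fit))   # left-tail p-value
print(p_left)
```

The null hypothesis is rejected at the 5% level if `p_left` is below 0.05.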
Ex2: Consider the sales prices of 176 houses, to be explained by the value of land, the value
of improvements (all three variables in $1000) and the city area where the house is located
(CHEVAL is the base area). The estimated regression is as follows:
$\widehat{\text{Sales}}$ = −16.93 + 1.594 Land + 1.301 Imp − 82.97 DAVISISLES + 10.187 HUNTERSGREE − 47.28 HYDEPARK

Term          Coefficient   SE
Intercept     -16.93        20.33
Land          1.594         0.091
Imp           1.301         0.0468
DAVISISLES    -82.97        32.536
HUNTERSGREE   10.187        22.731
HYDEPARK      -47.28        28.396

Interpret each coefficient. Predict the price of a house located in Cheval whose land and
improvement values are 100 and 200 (thousands of dollars). Test the hypothesis (at the 5% level)
that average prices in the Hydepark area are significantly lower than in the Cheval area.
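The prediction and the one-tail test in Ex2 can be worked through in R from the reported coefficients and standard errors alone. This is a sketch: the vector names are illustrative, and the degrees of freedom are taken as n minus the number of estimated parameters (176 − 6), following the rule given earlier in these notes.

```r
# Reported coefficients and standard errors (sales price in $1000)
coefs <- c(intercept = -16.93, land = 1.594, imp = 1.301,
           davisisles = -82.97, huntersgree = 10.187, hydepark = -47.28)
se    <- c(intercept = 20.33, land = 0.091, imp = 0.0468,
           davisisles = 32.536, huntersgree = 22.731, hydepark = 28.396)

# House in Cheval (base area): all area dummies are 0
x_cheval <- c(1, 100, 200, 0, 0, 0)
pred <- sum(coefs * x_cheval)            # predicted price in $1000
print(pred)

# One-tail test of H1: coefficient of HYDEPARK < 0
n  <- 176
df <- n - length(coefs)                  # 176 - 6 = 170
t_hyde <- unname(coefs["hydepark"] / se["hydepark"])
t_crit <- qt(0.05, df)                   # lower 5% critical value
p_left <- pt(t_hyde, df)                 # left-tail p-value

reject_h0 <- t_hyde < t_crit
print(c(t = t_hyde, critical = t_crit, p = p_left))
```

The test statistic lies close to the critical value, so the conclusion is sensitive to the chosen significance level.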
