Lecture 13 (2)


Lecture 13: Statistical Inference by Dr. Javed Iqbal

Multiple Regression (2):
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.967530
R Square            0.936115
Adjusted R Square   0.920144
Standard Error      880.505444
Observations        11

ANOVA
             df   SS            MS         F          Significance F
Regression   2    90883135.85   45441568   58.61236   1.67E-05
Residual     8    6202318.693   775289.8
Total        10   97085454.55

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    18303.5208     1134.76186       16.12983   2.19E-07   15686.76    20920.29
Age(years)   -950.4270      387.4188755      -2.45323   0.039736   -1843.82    -57.0375
Miles        -0.0821        0.025520666      -3.21889   0.01226    -0.141      -0.0233

For the Orion car data: The estimated model in equation form is:
𝑦̂ = 18303.5 − 950.4 𝑥1 − 0.0821𝑥2

where y = price of the car ($), 𝑥1 = age of the car (years), and 𝑥2 = miles driven.


Coefficient of determination R²: the proportion of the total variation in the dependent
variable (y) that is explained by the independent (x) variables of the model.
SST = SSR + SSE
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]
SST = total sum of squares, SSR = sum of squares due to regression (explained SS),
SSE = sum of squares due to error (unexplained SS)
For the Orion data R² = 0.936. This shows that 93.6% of the variation in the prices of used
Orion cars is explained by the age of the car and the number of miles driven.
What explains the remaining 6.4% of the variation in car prices? Other factors not included
in the model, e.g., colour, condition of the car, extent of any damage, etc.
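The R² figure can be checked directly from the sums of squares in the ANOVA table. The following is a minimal sketch in R that only re-uses the SSR and SSE values printed in the Orion output; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA section of the Orion output
ssr <- 90883135.85   # regression (explained) sum of squares
sse <- 6202318.693   # residual (unexplained) sum of squares
sst <- ssr + sse     # total sum of squares: SST = SSR + SSE

# Two equivalent ways of computing R-squared
r2_from_ssr <- ssr / sst
r2_from_sse <- 1 - sse / sst

print(round(c(R2_from_SSR = r2_from_ssr, R2_from_SSE = r2_from_sse), 4))
```

Both expressions give about 0.936, matching the "R Square" line of the output.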
Statistical Significance of Coefficients:
For the hypothesis H0: β1 = 0 (age of the car is not a useful predictor of car price)
against H1: β1 ≠ 0 (age of the car is a useful predictor of car price),
the test statistic is:
\[ t = \frac{b_1 - \beta_1}{SE(b_1)} \]

Here b1 is the sample estimate of the parameter, β1 is the value of the parameter under
the null hypothesis, and SE(b1) is the standard error of b1.
The statistic has a Student's t distribution with n − (k+1) degrees of freedom (n = number of
observations or sample size, k + 1 = number of model parameters including the intercept).
The t statistic is −2.45 and the critical values are t(0.025, 11 − 3 = 8 df) = ±2.306. Since
|−2.45| > 2.306, the null hypothesis is rejected and we conclude that the age of the car is a
useful predictor of its price.
(Alternatively, p-value = 0.0397 < 0.05, so we reject the null hypothesis and reach the same conclusion.)
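The two-tail test for the age coefficient can be reproduced in R from the coefficient and standard error reported in the output. This is a sketch under the assumption that only those two printed numbers are available; qt() and pt() are base R functions for the t distribution, and the variable names are illustrative.

```r
b1 <- -950.4270        # estimated coefficient of Age (from the output)
se_b1 <- 387.4188755   # standard error of b1 (from the output)
beta1_null <- 0        # value of beta1 under H0
n <- 11                # number of observations
k <- 2                 # number of x variables
df <- n - (k + 1)      # degrees of freedom = 8

t_stat <- (b1 - beta1_null) / se_b1
t_crit <- qt(1 - 0.05 / 2, df)           # upper 2.5% critical value
p_two_tail <- 2 * pt(-abs(t_stat), df)   # two-tail p-value
reject_h0 <- abs(t_stat) > t_crit

print(round(c(t = t_stat, critical = t_crit, p_value = p_two_tail), 4))
print(reject_h0)
```

The decision from the critical value and the decision from the p-value always agree, so either one can be reported.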
Ex: Test the hypothesis that there is a negative relationship between the age and the price of a car.
For the hypothesis H0: β1 ≥ 0 (no or positive relationship)
against H1: β1 < 0 (as age increases, car price decreases):
The t statistic is −2.45 and the critical value is t(0.05, 8 df) = −1.860. Since −2.45 < −1.860,
the null hypothesis is rejected and we conclude that there is indeed a negative relationship
between the age and the price of a car.
[Alternatively, the p-value of the test = (software-reported two-tail p-value)/2 =
0.03974/2 = 0.01987 < 0.05. Hence the null hypothesis is rejected in favor of the alternative
at the 5% significance level.]
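For the one-tail version, the lower-tail area can be computed directly rather than by halving. The sketch below, again using only the printed coefficient and standard error for age, shows that the two routes give the same p-value here because the estimate is negative, as the alternative hypothesis states.

```r
t_stat <- -950.4270 / 387.4188755   # b1 / SE(b1) when testing beta1 = 0
df <- 11 - 3                        # n - (k + 1)

t_crit_lower <- qt(0.05, df)              # lower-tail 5% critical value
p_one_tail <- pt(t_stat, df)              # P(T <= t_stat)
p_two_tail <- 2 * pt(-abs(t_stat), df)    # what Excel and R report by default

print(round(c(critical = t_crit_lower,
              one_tail = p_one_tail,
              half_of_two_tail = p_two_tail / 2), 5))
```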
For the hypothesis H0: β2 = 0 (number of miles driven is not a useful predictor of car price)
against H1: β2 ≠ 0 (number of miles driven is a useful predictor of car price):
The t statistic is −3.22 and the critical values are t(0.025, 11 − 3 = 8 df) = ±2.306. Since
|−3.22| > 2.306, the null hypothesis is rejected and we conclude that the number of miles
driven is indeed a useful predictor of car price.
(Alternatively, p-value = 0.012 < 0.05, so we reject the null hypothesis and reach the same conclusion.)
[Note: The two-tail test of the hypothesis that a regression parameter is zero is reported by
default by Excel as well as other software. The one-tail p-value can be obtained by dividing
the two-tail p-value by 2, provided the sign of the estimate agrees with the alternative hypothesis.]
Prediction from the model: Suppose we want to predict the price of an Orion that is 4 years
old and has already been driven 50,000 miles.
𝑦̂ = 18303.5 − 950.4 (4) − 0.0821(50000) = $10,396.9
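The same prediction can be written as a small R function of the estimated equation. This is a sketch using the rounded coefficients quoted above; predict_price is a hypothetical helper name, not something from the lecture.

```r
# Estimated Orion equation with the rounded coefficients from the notes
predict_price <- function(age, miles) {
  18303.5 - 950.4 * age - 0.0821 * miles
}

price_hat <- predict_price(age = 4, miles = 50000)
print(price_hat)

# With a fitted lm object (for example model1 from the R section of these
# notes, assuming columns named age and miles) the equivalent call would be:
# predict(model1, newdata = data.frame(age = 4, miles = 50000))
```

Because the function uses rounded coefficients, its result can differ slightly from what predict() returns on the fitted model.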

Estimation of multiple regression in Excel:

Go to Data tab > Data Analysis > Regression.
Input the y range and the x range (the x variables must be in adjacent columns). Check the
Labels box if the row of variable names is included in the selection.

Note: The Analysis ToolPak must be installed in Excel. To do this within Excel:
File > Options > Add-ins > Manage: Excel Add-ins > Go > check Analysis ToolPak > OK
The ToolPak then appears as 'Data Analysis' in the Data tab.
# Multiple regression in R
orion <- read.csv(file.choose())  # choose the orion1.csv data file
head(orion)                       # inspect the first rows and the column names
# attach(orion) is not needed because lm() is given data = orion
model1 <- lm(price ~ age + miles, data = orion)
summary(model1)
# present the coefficient table with 5 decimals (avoids scientific notation)
round(summary(model1)$coefficients, 5)

Anderson Ex 4, 5, pdf p-769:


[Ex 5: Estimated equation: \( \widehat{Revenue} = 83.23 + 2.29\,TV + 1.30\,NewsPaper \)]
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.958663
R Square            0.919036
Adjusted R Square   0.88665
Standard Error      0.642587
Observations        8

ANOVA
             df   SS         MS         F          Significance F
Regression   2    23.43541   11.7177    28.37777   0.001865
Residual     5    2.064592   0.412918
Total        7    25.5

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   83.23009       1.573869         52.88248   4.57E-08   79.18433    87.27585
TV          2.290184       0.304065         7.531899   0.000653   1.508561    3.071806
NewsPaper   1.300989       0.320702         4.056697   0.009761   0.476599    2.125379

Note that in any regression study we generally expect to consider four aspects:

(1) Estimation of the parameters and their real-life interpretation
(2) Prediction of the value of the dependent variable from the estimated model, given the predictors
(3) Interpretation of R² in practical terms
(4) Testing hypotheses on individual parameters / all parameters of the model
[F test for overall significance in multiple regression:
H0: β1 = β2 = ⋯ = βk = 0 vs H1: at least one βi ≠ 0 (i = 1, 2, ..., k)

\[ F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)} \]
This F statistic has k and n − (k+1) degrees of freedom and is reported by Excel (in the
ANOVA section) and by all statistical software.
For the Orion case, F = 58.61, p-value = 0.0000167, so the null hypothesis is rejected, and we
conclude that at least one variable (age or miles or both) has a significant impact on price.
Note that the ANOVA portion of the Excel regression output cannot be used for testing
equality of means (as required in the assignments).]
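The overall F statistic and its p-value can be rebuilt from the ANOVA sums of squares. The following is a minimal R sketch that uses only the SSR, SSE, n and k values of the Orion output; pf() and qf() are base R functions for the F distribution.

```r
ssr <- 90883135.85   # regression sum of squares (Orion ANOVA table)
sse <- 6202318.693   # residual sum of squares (Orion ANOVA table)
n <- 11              # observations
k <- 2               # x variables (age, miles)

msr <- ssr / k                 # mean square due to regression
mse <- sse / (n - k - 1)       # mean square error
f_stat <- msr / mse

# Upper-tail area of F with (k, n - k - 1) degrees of freedom
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
f_crit <- qf(0.95, df1 = k, df2 = n - k - 1)   # 5% critical value

print(c(F = f_stat, critical = f_crit, p_value = p_value))
```

The computed F exceeds the 5% critical value by a wide margin, consistent with the tiny "Significance F" in the output.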
Use of Qualitative / Categorical Independent Variables:
(i) Qualitative X variable with binary (two) categories
How can we measure the effect of gender on wage, i.e., are the wages of males and females
the same on average? Here we want to explain y = wage with the help of x = gender.
But gender is a qualitative variable. We can define a dummy variable, e.g., Female = 1 if the
person is female and Female = 0 if male. We use this 0/1 code as the x variable in the
regression.
Example: \( \widehat{Wage} \) (Rs.) = 20,000 − 3,500 Female
The intercept: The average wage of a male is Rs. 20,000.
The slope: The average wage of a female is Rs. 3,500 lower than that of a male.
(ii) Qualitative X variable with k categories: use k − 1 dummy variables
Consider house prices ($1000) in three different city zones: East, West and South.
We code and include any two dummy variables, e.g.
East = 1 if the house is in the East Zone, 0 otherwise
West = 1 if the house is in the West Zone, 0 otherwise
keeping the South Zone as the reference category. The estimated model may look like:
\( \widehat{Price} \) = 200 + 50 East − 75 West

Interpretation:
200: Average house price in the South Zone is $200,000
50: House price in East Zone is on average $50,000 higher than in the South Zone.
-75: House price in West Zone is on average $75,000 lower than in the South Zone.
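In R the k − 1 dummy variables do not have to be typed by hand: a factor with a chosen reference level produces them automatically. The sketch below uses a tiny made-up data set (six hypothetical houses, chosen only so that the zone averages are 200, 250 and 125) to show the coding and how the coefficients line up with the interpretation above.

```r
# Hypothetical data, invented for illustration only (prices in $1000)
houses <- data.frame(
  price = c(190, 210, 240, 260, 120, 130),
  zone  = c("South", "South", "East", "East", "West", "West")
)

# Make South the reference category; R then creates East and West dummies
houses$zone <- relevel(factor(houses$zone), ref = "South")

# The design matrix shows the 0/1 coding that lm() uses internally
X <- model.matrix(~ zone, data = houses)
print(X)

fit <- lm(price ~ zone, data = houses)
coefs <- coef(fit)
print(coefs)
```

With dummy variables only, the intercept equals the average of the reference group and each dummy coefficient equals the difference between that group's average and the reference average.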
Anderson example, pdf p-786:
In Table 15.5, y = time to repair (in hours) a water filtration unit is explained by
x = months since the last service.
𝑦̂ = 2.15 + 0.304 𝑥, R² = 0.534
Interpret the intercept, the slope and R².
The time to repair y also depends on whether the defect is electrical or mechanical.
Again, this is a qualitative variable, so we can code one of the categories as 1 (e.g. electrical)
and the other as 0 (mechanical).
𝑦̂ = 0.93 + 0.388 𝑥1 + 1.26 𝑥2, R² = 0.859
Here x1 = months since the last service and x2 = 1 if the defect is electrical and 0 if
mechanical.
Interpret the coefficients and R². Predict the service time for a mechanical repair when the
last service was 6 months earlier.
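As a worked illustration of how the dummy variable enters a prediction, the estimated equation can be evaluated in R for both repair types. This sketch uses the rounded coefficients quoted above; predict_repair_time is a hypothetical helper name.

```r
# Estimated equation: y-hat = 0.93 + 0.388 * x1 + 1.26 * x2
# x1 = months since last service, x2 = 1 for electrical, 0 for mechanical
predict_repair_time <- function(months, electrical) {
  0.93 + 0.388 * months + 1.26 * electrical
}

time_mechanical <- predict_repair_time(months = 6, electrical = 0)
time_electrical <- predict_repair_time(months = 6, electrical = 1)

print(c(mechanical = time_mechanical, electrical = time_electrical))
print(time_electrical - time_mechanical)   # equals the dummy coefficient
```

For the same number of months since service, the two predictions differ by exactly the coefficient of the dummy variable, which is how that coefficient is interpreted.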
Look at the regression output (Fig 15.7): Is each of the variables individually significant,
and is the overall regression significant, at the 5% level?
Some Exercises from Anderson:
Ex 4 pdf p-769, Ex 5 pdf p-769 (check Rev = 83.23 + 2.29 TVAd + 1.30 NPAd)
Ex 14 pdf p-775 (only parts d and f), Ex 34 pdf p-791, Ex 38 pdf p-793
Some further exercises (especially to illustrate dummy x variables)
Ex1: Consider factors such as the number of megapixels, weight (oz.), and overall score
(ranges from 0 to 100) of a sample of Canon and Nikon cameras, used to explain their prices.

Observation Brand Price_$ Megapixels Weight_oz Score Canon_dummy
1 Canon 330 10 7 66 1
2 Canon 200 12 5 66 1
3 Canon 300 12 7 65 1
4 Canon 200 10 6 62 1
5 Canon 180 12 5 62 1
6 Canon 200 12 7 61 1
7 Canon 200 14 5 60 1
8 Canon 130 10 7 60 1
9 Canon 130 12 5 59 1
10 Canon 110 16 5 55 1
11 Canon 90 14 5 52 1
12 Canon 100 10 6 51 1
13 Canon 90 12 7 46 1
14 Nikon 270 16 5 65 0
15 Nikon 300 16 7 63 0
16 Nikon 200 14 6 61 0
17 Nikon 400 14 7 59 0
18 Nikon 120 14 5 57 0
19 Nikon 170 16 6 56 0
20 Nikon 150 12 5 56 0
21 Nikon 230 14 6 55 0
22 Nikon 180 12 6 53 0
23 Nikon 130 12 6 53 0
24 Nikon 80 12 7 52 0
25 Nikon 80 14 7 50 0
26 Nikon 100 12 4 46 0
27 Nikon 110 12 5 45 0
28 Nikon 130 14 4 42 0
Estimate the regression model, write down the estimated equation, and interpret the
coefficients. Predict the price of a Nikon camera with 14 megapixels, a weight of 6 oz and a
score of 55. Interpret R².
Test the hypothesis (at 5%) that the average price of a Canon is significantly less than that of a Nikon.
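One possible way to set this exercise up in R is sketched below. The data are typed in from the table above; the column names (price, megapixels, weight, score, canon) are illustrative choices, with canon = 1 for Canon and 0 for Nikon. No results are quoted here, so the output should be read from your own run.

```r
cameras <- data.frame(
  price = c(330, 200, 300, 200, 180, 200, 200, 130, 130, 110, 90, 100, 90,
            270, 300, 200, 400, 120, 170, 150, 230, 180, 130, 80, 80, 100,
            110, 130),
  megapixels = c(10, 12, 12, 10, 12, 12, 14, 10, 12, 16, 14, 10, 12,
                 16, 16, 14, 14, 14, 16, 12, 14, 12, 12, 12, 14, 12, 12, 14),
  weight = c(7, 5, 7, 6, 5, 7, 5, 7, 5, 5, 5, 6, 7,
             5, 7, 6, 7, 5, 6, 5, 6, 6, 6, 7, 7, 4, 5, 4),
  score = c(66, 66, 65, 62, 62, 61, 60, 60, 59, 55, 52, 51, 46,
            65, 63, 61, 59, 57, 56, 56, 55, 53, 53, 52, 50, 46, 45, 42),
  canon = c(rep(1, 13), rep(0, 15))   # dummy: 1 = Canon, 0 = Nikon
)

fit <- lm(price ~ megapixels + weight + score + canon, data = cameras)
print(summary(fit))   # coefficients, t statistics, R-squared, overall F

# Predicted price of a Nikon with 14 megapixels, 6 oz and a score of 55
nikon <- data.frame(megapixels = 14, weight = 6, score = 55, canon = 0)
pred <- predict(fit, newdata = nikon)
print(pred)

# One-tail test of H1: coefficient of canon < 0 (Canon cheaper on average)
t_canon <- summary(fit)$coefficients["canon", "t value"]
p_lower <- pt(t_canon, df = fit$df.residual)   # lower-tail p-value
print(c(t = t_canon, p_lower_tail = p_lower))
```

Computing the lower-tail area with pt() avoids having to check the sign of the estimate before halving the two-tail p-value.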
Ex2: Consider data on the sales prices of 176 houses, to be explained by the value of land,
the value of improvements (all three variables in $1000) and the city area where the house is
located (CHEVAL is the base area). The estimated regression is as follows.
\( \widehat{Sales} \) = −16.93 + 1.594 Land + 1.301 Imp − 82.97 DAVISISLES + 10.187 HUNTERSGREE − 47.28 HYDEPARK

Term          Coefficient   SE
Intercept     −16.93        20.33
Land          1.594         0.091
Imp           1.301         0.0468
DAVISISLES    −82.97        32.536
HUNTERSGREE   10.187        22.731
HYDEPARK      −47.28        28.396

Interpret each coefficient. Predict the price of a house located in Cheval whose land and
improvements are valued at 100 and 200 (thousands of dollars), respectively. Test the
hypothesis (at 5%) that average prices in the Hydepark area are significantly lower than in
the Cheval area.
