Lecture 13 (2)
Javed Iqbal
Multiple Regression (2):
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.967530
R Square            0.936115
Adjusted R Square   0.920144
Standard Error      880.505444
Observations        11

ANOVA
             df   SS            MS         F          Significance F
Regression   2    90883135.85   45441568   58.61236   1.67E-05
Residual     8    6202318.693   775289.8
Total        10   97085454.55

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    18303.5208     1134.76186       16.12983   2.19E-07   15686.76    20920.29
Age(years)   -950.4270      387.4188755      -2.45323   0.039736   -1843.82    -57.0375
Miles        -0.0821        0.025520666      -3.21889   0.01226    -0.141      -0.0233
For the Orion car data, the estimated model in equation form is:
𝑦̂ = 18303.5 − 950.4 𝑥1 − 0.0821 𝑥2
where y = price ($), x1 = age (years) and x2 = miles driven.
For the hypothesis: 𝐻0 ∶ 𝛽1 = 0 (Age is not a useful predictor of car price)
Against 𝐻1 ∶ 𝛽1 ≠ 0 (Age is a useful predictor of car price)
The test statistic is t = (𝑏1 − 𝛽1) / SE(𝑏1).
Here 𝑏1 is the sample estimate of the parameter, 𝛽1 is the value of the parameter under the null hypothesis, and SE(𝑏1) is the standard error of 𝑏1.
The statistic has a Student's t distribution with n − (k + 1) degrees of freedom (n = number of observations or sample size, k + 1 = number of model parameters including the intercept).
The t statistic is −2.45 and the critical values are t(0.025, 11 − 3 = 8 df) = ±2.306. Since |−2.45| > 2.306, the null hypothesis is rejected and we conclude that the age of a car is a useful predictor of its price.
(Alternatively, p-value = 0.0397 < 0.05, so we reject the null hypothesis and reach the same conclusion.)
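The two-tail test above can be reproduced in R from the numbers in the Excel output alone. The following is only a minimal sketch; the variable names (b1, se1, dof, t_crit, p_two) are chosen here for illustration and do not come from the Excel output.

```r
# Two-tail t test for the Age coefficient, using values from the Excel output
b1         <- -950.4270     # estimated coefficient of Age(years)
se1        <- 387.4188755   # standard error of b1
beta1_null <- 0             # value of beta1 under the null hypothesis
n          <- 11            # number of observations
k          <- 2             # number of independent variables (age, miles)
alpha      <- 0.05          # significance level

dof    <- n - (k + 1)                  # degrees of freedom
t_stat <- (b1 - beta1_null) / se1      # test statistic
t_crit <- qt(1 - alpha / 2, dof)       # upper critical value of the t distribution
p_two  <- 2 * pt(-abs(t_stat), dof)    # two-tail p-value
reject <- abs(t_stat) > t_crit         # TRUE means reject the null hypothesis

cat("t =", round(t_stat, 3),
    " critical = +/-", round(t_crit, 3),
    " p-value =", round(p_two, 4),
    " reject H0:", reject, "\n")
```

Replacing b1 and se1 with the values in the Miles row gives the corresponding test for 𝛽2.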
Ex: Test the hypothesis that there is a negative relationship between the age and the price of a car.
For the hypothesis: 𝐻0 ∶ 𝛽1 = 0 (No or positive relationship)
Against: 𝐻1 ∶ 𝛽1 < 0 (As age increases, car price decreases)
The t statistic is −2.45 and the critical value is t(0.05, 8 df) = −1.860. Since −2.45 < −1.860, the null hypothesis is rejected and we conclude that there is indeed a negative relationship between the age and the price of a car.
[Alternatively, the p-value of the test = software-reported two-tail p-value / 2 = 0.03974/2 = 0.01987 < 0.05. Hence the null hypothesis is rejected in favor of the alternative at the 5% significance level.]
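The one-tail calculation can be checked in R as follows. This is a sketch based only on the t statistic and two-tail p-value reported by Excel; the variable names are illustrative. It also shows why halving works here: the estimated coefficient is negative, which is the direction stated in 𝐻1.

```r
# One-tail (lower-tail) test for H1: beta1 < 0, using values from the Excel output
t_stat <- -2.45323   # t Stat for Age(years)
p_two  <- 0.039736   # two-tail P-value reported by Excel
dof    <- 8          # n - (k + 1) = 11 - 3
alpha  <- 0.05

t_crit_lower <- qt(alpha, dof)    # lower-tail critical value
p_lower      <- pt(t_stat, dof)   # exact p-value for H1: beta1 < 0
p_halved     <- p_two / 2         # shortcut: valid because t_stat is negative
p_upper      <- 1 - p_two / 2     # p-value if H1 were beta1 > 0 instead

cat("critical value:", round(t_crit_lower, 3), "\n")
cat("p (lower tail):", round(p_lower, 5), " halved two-tail p:", round(p_halved, 5), "\n")
cat("reject H0:", t_stat < t_crit_lower, "\n")
```

If the alternative were 𝛽1 > 0, the one-tail p-value would be p_upper, which is close to 1, so the halving shortcut must not be applied blindly.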
For the hypothesis: 𝐻0 ∶ 𝛽2 = 0 (Number of miles driven is not a useful predictor of car price)
Against 𝐻1 ∶ 𝛽2 ≠ 0 (Number of miles driven is a useful predictor of car price)
The t statistic is −3.22 and the critical values are t(0.025, 11 − 3 = 8 df) = ±2.306. Since |−3.22| > 2.306, the null hypothesis is rejected and we conclude that the number of miles driven is indeed a useful predictor of the car's price.
(Alternatively, p-value = 0.012 < 0.05, so we reject the null hypothesis and reach the same conclusion.)
[Note: The two-tail test of the hypothesis that a regression parameter is zero is reported by default by Excel as well as other software. The one-tail p-value can be obtained by dividing the two-tail p-value by 2, provided the sign of the estimated coefficient agrees with the alternative hypothesis.]
Prediction from the model: Suppose we want to predict the price of an Orion that is 4 years old and has already been driven 50,000 miles.
𝑦̂ = 18303.5 − 950.4 (4) − 0.0821(50000) = $10,396.9
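The same prediction can be computed in R from the rounded coefficients of the estimated equation. This is a small sketch; the helper name predict_price is hypothetical and introduced only for illustration.

```r
# Rounded coefficients from the estimated equation
b <- c(intercept = 18303.5, age = -950.4, miles = -0.0821)

# Hypothetical helper: predicted price ($) for a given age (years) and mileage
predict_price <- function(age, miles) {
  unname(b["intercept"] + b["age"] * age + b["miles"] * miles)
}

price_hat <- predict_price(age = 4, miles = 50000)
cat("Predicted price: $", round(price_hat, 1), "\n")
```

If the fitted model object is available in R, predict(model1, newdata = data.frame(age = 4, miles = 50000)) should give nearly the same value; small differences would come from the rounding of the coefficients used here.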
Note: The Analysis ToolPak must be installed in Excel. To do this within Excel:
File > Options > Add-Ins > Manage: Excel Add-ins > Go > check Analysis ToolPak > OK
The Analysis ToolPak then appears as 'Data Analysis' in the Data tab.
# Multiple regression in R
orion <- read.csv(file.choose())   # choose the orion1.csv data file
head(orion)                        # inspect the first few rows
# attach(orion) is not needed because lm() is given data = orion
model1 <- lm(price ~ age + miles, data = orion)
summary(model1)
# present the coefficient table with 5 decimals (avoids scientific notation)
round(summary(model1)$coefficients, 5)
Regression Statistics
Multiple R          0.958663
R Square            0.919036
Adjusted R Square   0.88665
Standard Error      0.642587
Observations        8

ANOVA
             df   SS         MS         F          Significance F
Regression   2    23.43541   11.7177    28.37777   0.001865
Residual     5    2.064592   0.412918
Total        7    25.5
𝐹 = 𝑀𝑆𝑅 / 𝑀𝑆𝐸 = (𝑆𝑆𝑅/𝑘) / (𝑆𝑆𝐸/(𝑛 − 𝑘 − 1))
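This F test, with k and n − (k + 1) degrees of freedom, tests the null hypothesis that all slope coefficients are zero (𝐻0 ∶ 𝛽1 = 𝛽2 = … = 𝛽k = 0) against the alternative that at least one of them is not zero. It is reported by Excel (in the ANOVA section) and by all statistical software.
For the Orion case, F = 58.61 and p-value = 0.0000167 < 0.05, so the null hypothesis is rejected, and we conclude that at least one variable (age or miles or both) has a significant impact on price.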
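The F statistic and its p-value for the Orion case can be rebuilt in R from the sums of squares in the ANOVA table. This is a sketch using only the reported SS values; the variable names are illustrative.

```r
# Overall F test for the Orion regression, from the ANOVA sums of squares
ssr <- 90883135.85    # regression sum of squares (SSR)
sse <- 6202318.693    # residual sum of squares (SSE)
n   <- 11             # number of observations
k   <- 2              # number of independent variables

msr     <- ssr / k                    # mean square regression
mse     <- sse / (n - k - 1)          # mean square error
f_stat  <- msr / mse                  # F statistic
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
r2      <- ssr / (ssr + sse)          # R Square = SSR / SST

cat("F =", round(f_stat, 2), " p-value =", signif(p_value, 3),
    " R-squared =", round(r2, 4), "\n")
```

The same code with SSR = 23.43541, SSE = 2.064592, n = 8 and k = 2 reproduces the F value in the second ANOVA table above.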
Note that the ANOVA portion of the Excel regression output cannot be used for testing equality of means (as required in assignments).
Use of Qualitative/Categorical Independent Variables:
(i) Qualitative X variable with binary (two) categories
How can we measure the effect of gender on wage, i.e. are the wages of males and females the same on average? Here we want to explain y = wage with the help of x = gender.
But gender is a qualitative variable. We can define a dummy variable, e.g. Female = 1 if the person is female and Female = 0 if the person is male. We use this 0/1 code as the x variable in the regression.
Example: 𝑊𝑎𝑔𝑒̂ (Rs.) = 20,000 − 3,500 Female
The intercept: Average wage for a male person is Rs. 20,000.
The slope: The average wage of a female is Rs. 3,500 lower than that of a male.
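A short R sketch shows how a 0/1 dummy is built and why the intercept and slope have the interpretation given above. The six wage values are made up purely for illustration (chosen so that the group means are Rs. 20,000 and Rs. 16,500); they are not real data.

```r
# Made-up illustrative data (NOT real wages): three males, three females
wage   <- c(19000, 21000, 20000, 16000, 17000, 16500)
gender <- c("male", "male", "male", "female", "female", "female")

# Dummy variable: Female = 1 if the person is female, 0 if male
female <- as.integer(gender == "female")

fit <- lm(wage ~ female)
print(coef(fit))

# The intercept equals the male mean; the slope equals the female mean minus the male mean
mean_male   <- mean(wage[female == 0])
mean_female <- mean(wage[female == 1])
cat("male mean:", mean_male, " difference:", mean_female - mean_male, "\n")
```

With a single dummy regressor, least squares simply reproduces the two group means, which is why the slope is read as the average difference between the groups.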
(ii) Qualitative X variable with k categories. Use k-1 dummy variables:
Consider house prices (in $1000) in three different city zones: East, West and South.
We code and include any two dummy variables, e.g.
East = 1 if house is in East Zone, 0 otherwise
West = 1 if house is in West Zone, 0 otherwise
Keeping the South Zone as the reference, the estimated model may look like:
𝑃𝑟𝑖𝑐𝑒̂ = 200 + 50 East − 75 West
Interpretation:
200: Average house price in the South Zone is $200,000
50: House price in East Zone is on average $50,000 higher than in the South Zone.
-75: House price in West Zone is on average $75,000 lower than in the South Zone.
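The k − 1 rule can be illustrated in R. The sketch below uses made-up prices (two houses per zone, chosen so that the zone means are 200, 250 and 125); it is not data from the lecture. It codes the East and West dummies by hand, shows that a factor with South as reference gives the same fit, and shows what happens if all three dummies are included together with the intercept.

```r
# Made-up illustrative data: two houses per zone, prices in $1000
zone  <- c("South", "South", "East", "East", "West", "West")
price <- c(190, 210, 240, 260, 120, 130)

# Hand-coded dummies; South is the reference category
East  <- as.integer(zone == "East")
West  <- as.integer(zone == "West")
South <- as.integer(zone == "South")

fit <- lm(price ~ East + West)
print(coef(fit))      # intercept = South mean; slopes = differences from South

# Same model using a factor whose first level (South) is the reference
zone_f <- factor(zone, levels = c("South", "East", "West"))
fit_f  <- lm(price ~ zone_f)
print(coef(fit_f))

# Including all k = 3 dummies with an intercept is perfectly collinear:
# lm() cannot estimate the last one and reports NA for it
fit_all <- lm(price ~ East + West + South)
print(coef(fit_all))
```

Choosing a different reference zone changes the individual coefficients but not the fitted zone means.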
Anderson example, pdf p-786:
In Table 15.5, y = time to repair (in hours) a water filtration unit is explained by x = months since the last service.
𝑦̂ = 2.15 + 0.304 𝑥, R² = 0.534
Interpret the intercept, the slope and R².
The time to repair y also depends on whether the defect is electrical or mechanical.
Again, this is a qualitative variable, so we can code one of the categories as 1 (e.g. electrical) and the other as 0 (mechanical).
𝑦̂ = 0.93 + 0.388 𝑥1 + 1.26 𝑥2, R² = 0.859
Here x1 = months since the last service and x2 = 1 if the defect is electrical and 0 if it is mechanical.
Interpret the coefficients and R². Predict the service time for a mechanical repair issue when the last service was 6 months earlier.
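As a check on the prediction exercise, the estimated equation can be evaluated in R. This sketch uses the rounded coefficients quoted above; the function name predict_repair is hypothetical.

```r
# Hypothetical helper based on the rounded estimated equation
# months: months since the last service; electrical: 1 = electrical, 0 = mechanical
predict_repair <- function(months, electrical) {
  0.93 + 0.388 * months + 1.26 * electrical
}

mech_hours <- predict_repair(months = 6, electrical = 0)   # mechanical defect
elec_hours <- predict_repair(months = 6, electrical = 1)   # electrical defect

cat("mechanical:", mech_hours, "hours; electrical:", elec_hours, "hours\n")
```

The two predictions differ by exactly the dummy coefficient, 1.26 hours, whatever the number of months since the last service.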
Look at the regression output (Fig 15.7): Is each variable individually significant, and is the overall regression significant, at the 5% level?
Some Exercises from Anderson:
Ex 4 pdf p-769, Ex 5 pdf p-769 (check Rev = 83.23 + 2.29 TVAd + 1.30 NPAd)
Ex 14 pdf p-775 (only parts d and f), Ex 34 pdf p-791, Ex 38 pdf p-793
Some further exercises (especially to illustrate dummy x variables)
Ex1: Consider factors such as the number of megapixels, weight (oz.), and overall score (ranging from 0 to 100) of a sample of Canon and Nikon cameras, used to explain their prices.
Interpret each coefficient. Predict the price of a house located in Cheval whose land and improvement values are 100 and 200 (thousands of dollars), respectively. Test the hypothesis (at the 5% level) that average prices in the Hydepark area are significantly lower than in the Cheval area.