Regression and Correlation


Recall: Covariance

cov(x, y) = Σi (xi − x̄)(yi − ȳ) / (n − 1),   summing over i = 1, …, n
Interpreting Covariance
cov(X,Y) > 0  →  X and Y are positively correlated
cov(X,Y) < 0  →  X and Y are negatively correlated
cov(X,Y) = 0  →  X and Y are uncorrelated (no linear relationship; independence implies zero covariance, but zero covariance does not imply independence)

Correlation coefficient

• Pearson’s correlation coefficient is a standardized covariance (unitless):

  r = cov(x, y) / √( var(x) · var(y) )
Correlation
• Measures the relative strength of the linear
relationship between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship
Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots, one per panel: r = −1, r = −0.6, r = 0, r = +1, r = +0.3, and r = 0.]
Linear Correlation
[Scatter plots contrasting linear relationships with curvilinear relationships.]
Linear Correlation
[Scatter plots contrasting strong linear relationships with weak linear relationships.]
Linear Correlation
[Scatter plot showing no relationship between X and Y.]
Calculating by hand…

r̂ = cov(x, y) / √( var(x) · var(y) )
  = [ Σi (xi − x̄)(yi − ȳ) / (n − 1) ] / √{ [ Σi (xi − x̄)² / (n − 1) ] · [ Σi (yi − ȳ)² / (n − 1) ] }
Simpler calculation formula…

The (n − 1) terms cancel, leaving only the numerator of the covariance and the numerators of the variances:

  r̂ = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ] = SSxy / √(SSx · SSy)

where SSxy = Σi (xi − x̄)(yi − ȳ) is the numerator of the covariance, and SSx = Σi (xi − x̄)², SSy = Σi (yi − ȳ)² are the numerators of the variances.
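As a check on these formulas, here is a minimal R sketch (not from the original slides) that computes SSxy, SSx, and SSy on small made-up vectors and compares the hand-built r̂ with R's built-in cor():

# Made-up illustrative data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

SSxy <- sum((x - mean(x)) * (y - mean(y)))   # numerator of the covariance
SSx  <- sum((x - mean(x))^2)                 # numerator of var(x)
SSy  <- sum((y - mean(y))^2)                 # numerator of var(y)

r_hat <- SSxy / sqrt(SSx * SSy)              # Pearson correlation by hand
r_hat
cor(x, y)                                    # built-in check; should agree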
Correlation
Regression: Introduction
Basic idea:
Use data to identify relationships
among variables and use these
relationships to make predictions.
Linear regression
•Linear dependence: constant rate of increase of one variable
with respect to another (as opposed to, e.g., diminishing
returns).
•Regression analysis describes the relationship between two
(or more) variables.
•Examples:
– Income and educational level
– Demand for electricity and the weather
– Home sales and interest rates
•Our focus:
–Gain some understanding of
• the regression line
• regression error
– Learn how to interpret and use the results.
– Learn how to set up a regression analysis.
Two main questions:
•Prediction and Forecasting
– Predict home sales for December given the interest rate for this month.
– Use time series data (e.g., sales vs. year) to forecast future performance
(next year sales).
– Predict the selling price of houses in some area.
• Collect data on several houses (# of BR, #R, sq.ft, lot size, property tax) and
their selling price.
• Can we use this data to predict the selling price of a specific house?
•Quantifying causality
– Determine factors that relate to the variable to be predicted; e.g., predict
growth for the economy in the next quarter: use past history on
quarterly growth, index of leading economic indicators, and others.
– Want to determine advertising expenditure and promotion for 2020.
•Sales over a quarter might be influenced by: ads in print, ads in radio, ads in
TV, and other promotions.
Motivating Example
• Predict the selling prices of houses in the region.
• Collect recent historical data on selling prices, and a number of characteristics about each house sold (size, age, style, etc.).
• One of the factors that cause houses in the data set to sell for different amounts of money is the fact that houses come in various sizes.
• A preliminary model might posit that the average value per square meter of a new house is $400 and that the average lot sells for $20,000. The predicted selling price of a house of size X (in square meters) would be: 20,000 + 400X.
• A house of 200 square meters would be estimated to sell for 20,000 + 400(200) = $100,000.
Motivating Example
• Probability Model:
  – We know, however, that this is just an approximation, and the selling price of this particular house of 200 square meters is not likely to be exactly $100,000.
  – Prices for houses of this size may actually range from $50,000 to $150,000.
  – In other words, the deterministic model is not really suitable. We should therefore consider a probabilistic model.
• Let Y be the actual selling price of the house. Then
  Y = 20,000 + 400x + ε,
  where ε (the Greek letter epsilon) represents a random error term (which might be positive or negative).
  – If the error term ε is usually small, then we can say the model is a good one.
  – The random term, in theory, accounts for all the variables that are not part of the model (for instance, lot size, neighborhood, etc.).
  – The value of ε will vary from sale to sale, even if the house size remains constant. That is, houses of the exact same size may sell for different prices.
Regression Model
• The variable we are trying to predict (Y) is called the dependent (or response) variable.
• The variable X is called the independent (or predictor, or explanatory) variable.
• Our model assumes that
  E(Y | X = x) = β0 + β1x   (the “population line”)   (1)
  The interpretation is as follows:
  – When X (house size) is fixed at a level x, then we assume the mean of Y (selling price) to be linear around the level x, where β0 is the (unknown) intercept and β1 is the (unknown) slope, or incremental change in Y per unit change in X.
  β0 and β1 are not known exactly, but are estimated from sample data. Their estimates are denoted b0 and b1.
• A simple regression model: a model with only one independent variable.
• A multiple regression model: a model with multiple independent variables.
Sample: 15 houses from the region.

House Number   Y: Actual Selling Price ($1,000s)   X: House Size (10s m2)
 1              89.5                                20.0
 2              79.9                                14.8
 3              83.1                                20.5
 4              56.9                                12.5
 5              66.6                                18.0
 6              82.5                                14.3
 7             126.3                                27.5
 8              79.3                                16.5
 9             119.9                                24.3
10              87.6                                20.2
11             112.6                                22.0
12             120.8                                19.0
13              78.5                                12.3
14              74.3                                14.0
15              74.8                                16.7
Averages        88.84                               18.17
Least Squares Estimation
price <- c(89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3, 119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8)
size  <- c(20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5, 24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7)
plot(size, price, xlab = "House size (10s of sq m)", ylab = "Selling price ($1,000s)",
     main = "House Size (X) vs Selling Price (Y)")
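A minimal sketch of fitting the least-squares line to these data with R's lm(); the fitted intercept and slope should match the values quoted on the later slides (about 18.354 and 3.879):

fit <- lm(price ~ size)   # least-squares fit of selling price on house size
summary(fit)              # coefficients, standard errors, t-tests, R-squared
abline(fit)               # add the fitted line to the scatter plot above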
Assumptions
• These data do not form a perfect line. This is not surprising, considering that our data are random. In other words, if we assume equation (1), then our line predicts the mean for any given level x. However, when we actually take a measurement (i.e., observe the data), we observe:
  Yi = β0 + β1Xi + εi,   for i = 1, 2, …, n = 15,
  where εi is the random error associated with the ith observation.
  – Since we don't know the true values of β0 and β1, it is clear that we do not observe the actual errors (εi) precisely either.
• Assumptions about the Error
  – E(εi) = 0 for i = 1, 2, …, n.
  – σ(εi) = σε, where σε is unknown.
  – The errors are independent; that is, the error in the ith observation is independent of the error observed in the jth observation.
  – The εi are normally distributed (with mean 0 and standard deviation σε).
Least Squares Estimation
• Recall β0 and β1 are (unknown) population parameters.
  – From the sample data, we will calculate numbers β̂0 and β̂1 that are estimates of the population parameters.
  – How should these numbers be chosen? For any choice of β̂0 and β̂1, we can write the following prediction equation:
    ŷ = β̂0 + β̂1X.
  – The “hat” is used to denote a value estimated from the model, as opposed to one that is actually observed.
  – For each house in our sample of 15 we could check to see how well this equation works at predicting the actual selling prices. Define ei to be the error associated with the ith observation. That is:
    ei = yi − ŷi   (observed minus estimated selling price).
    These are sometimes called the residuals or simply errors.
• We will pick the values of β̂0 and β̂1 that minimize Σi ei², the sum of the squares of the residuals. This method is often called Least Squares Regression.
Linear Equations

Y = mX + b
m = slope = change in Y / change in X
b = Y-intercept
[Figure: a straight line on X–Y axes illustrating slope and intercept.]
Linear Regression Model
• 1. The relationship between the variables is a linear function:

  Yi = β0 + β1Xi + εi

  β0: population Y-intercept; β1: population slope; εi: random error.
  Yi: dependent (response) variable (e.g., CD4+ count); Xi: independent (explanatory) variable (e.g., years since seroconversion).
Population & Sample Regression Models

Population: the unknown relationship is
  Yi = β0 + β1Xi + εi

Random sample: from the sample we estimate
  Yi = β̂0 + β̂1Xi + ε̂i
Population Linear Regression Model

  Yi = β0 + β1Xi + εi   (observed value)
  E(Y) = β0 + β1Xi      (population line)
  εi = random error

[Figure: observed values scattered around the population line.]
Sample Linear Regression Model

  Yi = β̂0 + β̂1Xi + ε̂i   (observed value)
  Ŷi = β̂0 + β̂1Xi        (sample regression line)
  ε̂i = random error (residual)

[Figure: sampled observations around the fitted line, with an unsampled observation shown for contrast.]
Estimating Parameters:
Least Squares Method
Scatter plot
• 1. Plot of all (Xi, Yi) pairs
• 2. Suggests how well the model will fit
[Scatter plot of the data points on X–Y axes.]
Thinking Challenge
How would you draw a line through the points? How do you determine which line ‘fits best’?
[The same scatter plot is shown repeatedly with different candidate lines: slope changed with intercept unchanged, intercept changed with slope unchanged, and both changed.]
Least Squares
• 1. ‘Best fit’ means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors:

  Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i²

• 2. LS minimizes the Sum of the Squared Differences (errors), SSE.
Least Squares Graphically

LS minimizes Σi ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²

  Y2 = β̂0 + β̂1X2 + ε̂2
  Ŷi = β̂0 + β̂1Xi

[Figure: four observed points with their residuals measured vertically from the fitted line.]
Coefficient Equations
• Prediction equation
  ŷi = β̂0 + β̂1xi
• Sample slope
  β̂1 = SSxy / SSxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
• Sample Y-intercept
  β̂0 = ȳ − β̂1x̄
Derivation of Parameters (1)
• Least Squares (L-S): minimize the squared error

  Σi=1..n εi² = Σi=1..n (yi − β0 − β1xi)²

Setting the partial derivative with respect to β0 to zero:

  0 = ∂Σεi²/∂β0 = −2 Σ (yi − β0 − β1xi) = −2 (nȳ − nβ0 − nβ1x̄)

  ⇒ β̂0 = ȳ − β̂1x̄
Derivation of Parameters (2)
• Least Squares (L-S): minimize the squared error

  Σ εi² = Σ (yi − β0 − β1xi)²

Setting the partial derivative with respect to β1 to zero:

  0 = ∂Σεi²/∂β1 = −2 Σ xi (yi − β0 − β1xi) = −2 Σ xi (yi − ȳ + β1x̄ − β1xi)

  ⇒ β1 Σ xi (xi − x̄) = Σ xi (yi − ȳ)
  ⇒ β1 Σ (xi − x̄)(xi − x̄) = Σ (xi − x̄)(yi − ȳ)

  ⇒ β̂1 = SSxy / SSxx
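A short sketch of these closed-form estimates in R, using the price and size vectors entered earlier; lm() is included only as a check:

SSxy <- sum((size - mean(size)) * (price - mean(price)))
SSxx <- sum((size - mean(size))^2)

b1 <- SSxy / SSxx                      # slope estimate (beta1-hat)
b0 <- mean(price) - b1 * mean(size)    # intercept estimate (beta0-hat)
c(intercept = b0, slope = b1)
coef(lm(price ~ size))                 # should match the hand-computed values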
Using the Equation
• The method of least squares gives an intercept of 18.354 and a slope of 3.879.
  – How do we predict the selling price of a house of 165 square meters (x = 16.5)?
    • Plug the value 16.50 into the regression equation to get a predicted selling price of 18.354 + 3.879 × 16.50 = 82.357.
    • Translate to a dollar amount, i.e., $82,357. This is the estimate of the selling price of this house, that is, without any further information about the house (e.g., neighborhood, number of rooms, lot size, age, etc.).
• Analyzing a Regression
• Estimating the Standard Error
  – From the assumptions about the error, the magnitude of σε should be a good guide to the accuracy of a prediction.
  – The number σε is a population parameter, so we cannot know for certain what its value is.
  – We therefore use an estimate s that is provided in the regression output under the name “standard error of the estimate” or just “standard error.”
Making Predictions
• The estimate s is calculated as s = (SSE/(n − 2))^(1/2).
  – The reason why we divide by n − 2 and not n − 1 has to do with degrees of freedom.
  – The value of s gives us some idea of the standard deviation of the errors if the model is used to estimate selling prices. In addition, we will make use of the normality assumption to help us make assessments of a prediction.
• Suppose a house occupies 200 sq m. How do we predict the selling price?
  – Prediction interval: used if our goal is to determine a 95% interval on the actual selling price of the house. A 95% prediction interval for the actual selling price is given by
    (18.354 + 3.879 × 20) ± t(n − 2, 0.025)·s = 95.94 ± 28.07.
  – Confidence interval: used if our goal is to determine a 95% confidence interval on the mean selling price of all houses of this size (200 sq m), i.e., E[Y | X = x]. It is
    95.94 ± t(n − 2, 0.025)·s/√n = 95.94 ± 7.25.
  – In the above examples, use the t distribution with n − 2 degrees of freedom. If n − 2 ≥ 30, the standard normal distribution can be used instead.
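The same two intervals can be obtained with R's predict(); note that predict() uses the exact interval formulas (including the (x − x̄)² term), so its numbers will be close to, but not identical to, the simplified ±t·s and ±t·s/√n values above:

fit <- lm(price ~ size)
new_house <- data.frame(size = 20)   # 200 sq m, i.e., x = 20

# 95% interval for the actual selling price of one such house
predict(fit, new_house, interval = "prediction", level = 0.95)

# 95% interval for the mean selling price of all houses of this size
predict(fit, new_house, interval = "confidence", level = 0.95)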
Making Inferences about Coefficients
• Assessing the model involves determining whether a particular variable, like house size, has any effect on the selling price.
  – Suppose that when a regression line is drawn it produces a horizontal line. This means the selling price of the house is unaffected by the size of the house.
  – A horizontal line has a slope of 0, so when no linear relationship exists between an independent variable and the dependent variable we should expect to get β1 = 0.
  – But of course, we only observe an estimate of β1, which might only be “close” to zero. To systematically determine when β1 might in fact be zero, we will make inferences about it using our estimate β̂1; specifically, we will do hypothesis tests and build confidence intervals.
• Testing β1, we can test any of the following:
  – H0: β1 = 0 versus HA: β1 ≠ 0
  – H0: β1 ≥ 0 versus HA: β1 < 0
  – H0: β1 ≤ 0 versus HA: β1 > 0
• In each case, the null hypothesis can be reduced to H0: β1 = 0. The test statistic in each case is T = (β̂1 − 0) / s_β̂1.
Example
• Can we conclude at the 1% level of significance that the size of a house is linearly related to its selling price? Test H0: β1 = 0 versus HA: β1 ≠ 0.
  – Note this is a two-sided test; we are interested in whether there is any relationship at all between price and size.
  – Calculate T = (3.879 − 0) / 0.794 = 4.88.
  – That is, we are 4.88 standard deviations from 0. So at the 1% level (corresponding to thresholds ±t(13, 0.005) = ±3.012), we reject H0.
  – There is sufficient evidence to conclude that house size does linearly affect selling price.
• To get a p-value on this we would need to look up 4.88 inside the t-table.
  – It is about 0.00024, or 0.024%; very small indeed.
• A 95% confidence interval for β1 is given by β̂1 ± t(n − 2, 0.025)·s_β̂1.
  – For this example: 3.879 ± (2.160)(0.794) = 3.879 ± 1.715.
  – Using the 15 data points, we are 95% confident that every extra 10 square meters of house size increases the price by anywhere from $2,164 to $5,594.
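The same test and interval can be read off standard R output; a minimal sketch:

fit <- lm(price ~ size)
summary(fit)$coefficients   # slope row: estimate, std. error, t value, p-value
confint(fit, level = 0.95)  # 95% confidence intervals for the intercept and slope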
Method III: Measuring the Strength of the Linear Relationship
• Consider the following equation:
  Yi − Ȳ = (Ŷi − Ȳ) + ei.
  – Squaring both sides and summing over all data points, and after a little algebra, we get:
    Σi (Yi − Ȳ)² = Σi (Ŷi − Ȳ)² + Σi ei², which we usually rewrite as:
    SST = SSR + SSE,   (2)
    where SST = Σi (Yi − Ȳ)², SSR = Σi (Ŷi − Ȳ)², and SSE = Σi ei².
–Interpretation:
•SST stands for the “total sum of squares” - this is essentially the total variation in
the data set, i.e., the total variation of selling prices.
•SSR stands for “sum of squares due to regression” - this is the squared variation
around the mean of the estimated selling prices. This is sometimes called the total
variation explained by the regression.
•SSE stands for “sum of squares due to error” - this is simply the sum of the squared
residuals, and it is the variation in the Y variable that remains unexplained after
taking into account the variable X.
–The interpretation of equation (2) is that the total variation in Y (SST) is made
up of two parts: the total variation explained by the regression (SSR) and the
remaining unexplained variation (SSE).
Regression Statistics
• Define R² = SSR/SST = 1 − SSE/SST
  – The fraction of the total variation explained by the regression.
  – R² is a measure of the explanatory power of the model.
  – Multiple-R = (R²)^(1/2) (in the one-variable case, this equals |rXY|).
•According to the definition of R2, adding extraneous
explanatory variables will artificially inflate the R2.
–We must be careful in interpreting this number.
–Introducing extra variables can lead to spurious results and can
interfere with the proper estimation of slopes for the important
variables.
• In order to penalize an excess of variables, we consider the adjusted R², which is
  adjusted R² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)].
  Here n is the number of data points and k is the number of explanatory variables.
  – The adjusted R² thus divides the numerator and denominator by their degrees of freedom.
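A small sketch computing R² and adjusted R² from the sums of squares for the house-price regression, checked against summary(lm):

fit <- lm(price ~ size)
sst <- sum((price - mean(price))^2)    # total sum of squares (SST)
sse <- sum(resid(fit)^2)               # error sum of squares (SSE)
ssr <- sst - sse                       # regression sum of squares (SSR)

n <- length(price); k <- 1             # one explanatory variable
r2     <- ssr / sst
r2_adj <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))

c(R2 = r2, adjR2 = r2_adj)
summary(fit)$r.squared                 # should match r2
summary(fit)$adj.r.squared             # should match r2_adj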
How to determine the value of used cars that
customers trade in when purchasing new cars?
• Car dealers across North America use the “Red Book” to help them
determine the value of used cars that their customers trade in when
purchasing new cars.
–The book, which is published monthly, lists average trade-in values for all
basic models of North American, Japanese and European cars.
–These averages are determined on the basis of the amounts paid at recent
used-car auctions.
–The book indicates alternative values of each car model according to its
condition and optional features, but it does not inform dealers how the
odometer reading affects the trade in value.
• Question: In an experiment to determine whether the odometer
reading should be included in the Red Book, an interested buyer of
used cars randomly selects ten 3-year-old cars of the same make,
condition, and optional features.
–The trade-in value and mileage for each car are shown in the following table.
Data
Odometer Reading (1,000s of miles):  59  92  61  72  52  67  88  62  95  83
Trade-in Value ($100s):              37  41  43  39  41  39  35  40  29  33
• Run the regression, with Trade-in Value as the dependent variable
(Y) and Odometer Reading as the independent variable (X). The
output appears on the following page.
• Regression Statistics
  – Multiple R = 0.893, R² = 0.798, Adjusted R² = 0.773, Standard Error = 2.178
  – Analysis of Variance
                   df     SS       MS       F       Significance F
    Regression      1    150.14   150.14   31.64    0.000
    Residual        8     37.96     4.74
    Total           9    188.10
  – Coefficients
                   Coeff.      Stnd Error   t-Stat   P-value
    Intercept      56.205      3.535        15.90    0.000
    x              -0.26682    0.04743      -5.63    0.000
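A sketch of how this output could be reproduced in R from the data two slides back (the numbers should agree up to rounding):

odometer <- c(59, 92, 61, 72, 52, 67, 88, 62, 95, 83)   # 1,000s of miles
value    <- c(37, 41, 43, 39, 41, 39, 35, 40, 29, 33)   # $100s

car_fit <- lm(value ~ odometer)
summary(car_fit)   # coefficients, R-squared, standard error of the estimate
anova(car_fit)     # regression and residual sums of squares, F test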
F and F-significance
•F is a test statistic testing whether the estimated model is
meaningful; i.e., statistically significant.
–F =MSR/MSE
–A large F or a small p-value (or F-significance) implies that the model
is significant.
–It is unusual not to reject this null hypothesis.
Questions
• What does the regression line tell us about the relationship between
the two variables?
• Can we conclude at the 5% significance level that, for all cars of
the type described in the experiment, higher mileage results in a
lower trade-in value?
• Predict with 95% confidence the trade-in value of such a car that
has been driven 60,000 miles.
• A large national courier company has a policy of selling its cars
when the odometer reading reaches 75,000 miles. The company is
about to sell a large number of 3-year-old cars, each equipped with
the same optional features and in the same condition as the 10 cars
described in the experiment. The company president would like to
know the cars' mean trade-in price. Determine the 95% confidence
interval estimate of the expected value of all cars that have been
driven 75,000 miles.
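A hedged sketch of how the last two questions could be answered with predict(), reusing car_fit from the sketch above and keeping the data's odometer units (1,000s of miles):

# Trade-in value of one car driven 60,000 miles: 95% prediction interval
predict(car_fit, data.frame(odometer = 60), interval = "prediction", level = 0.95)

# Mean trade-in value of all cars driven 75,000 miles: 95% confidence interval
predict(car_fit, data.frame(odometer = 75), interval = "confidence", level = 0.95)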
Salary-budget Example
•A large corporation is concerned about maintaining parity in
salary levels across different divisions.
–As a rough guide, it determines that managers responsible for
comparable budgets in different divisions should have comparable
compensation.
•Data Analysis: The following is a list of salary levels for 20
managers and the sizes of the budgets they manage.
salary <- c(59.0, 67.4, 50.4, 83.2, 105.6, 86.0, 74.4, 52.2, 82.6, 59.0, 44.8, 111.4,
            122.4, 82.6, 57.0, 70.8, 54.6, 111.0, 86.2, 79.0)
budget <- c(3.5, 5.0, 2.5, 6.0, 7.5, 4.5, 6.0, 4.0, 4.5, 5.0, 2.5, 12.5, 9.0, 7.5, 6.0,
            5.0, 3.0, 8.5, 7.5, 6.5)
# Salary Y ($1,000s); Budget X ($100,000s)
Salary-budget Example
• Want to fit a straight line to this data.
– The slope of this line gives the marginal increase in salary with respect to increase
in budget responsibility.
– The regression equation is SALARY = 31.9 + 7.73 BUDGET
– Each additional $100,000 of budget responsibility translates to an expected
additional salary of $7,730.
– If we wanted to know the average salary corresponding to a budget of 6.0, we get
a salary of 31.9 + 7.73(6.0) = 78.28.
• Why is the least squares criterion the correct principle to follow?
• Assumptions Underlying Least Squares
  – The errors ε1, …, εn are independent of the values of X1, …, Xn.
  – The errors have expected value zero; i.e., E[εi] = 0.
  – All the errors have the same variance: Var[εi] = σ², for all i = 1, …, n.
  – The errors are uncorrelated; i.e., Corr[εi, εj] = 0 if i ≠ j.
• The first two assumptions imply that E[Y | X = x] = β0 + β1x.
– Do we necessarily believe that the variability in salary levels among managers
with large budgets is the same as the variability among managers with small
budgets?
How do we evaluate and use the regression line?
• Evaluate the explanatory power of a model.
– Without using X, how do we predict Y?
– Determine how much of the variability in Y values is explained by the X.
• Measure variability using sums of squared quantities.
• The ANOVA table.
– ANOVA is short for analysis of variance.
– This table breaks down the total variability into the explained and
unexplained parts.
– Total SS (9535.8) measures the total variability in the salary levels.
• Without using x, we will use sample mean to do prediction.
– The Regression SS (6884.7) is the explained variation.
• It measures how much variability is explained by differences in budgets.
– Error SS (2651.1) is the unexplained variation.
• This reflects differences in salary levels that cannot be attributed to differences
in budget responsibilities.
– The explained and unexplained variation sum to the Total SS.
• R-squared: R² = SSR/SST = 6884.7/9535.8 = 72.2%
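A short sketch of this decomposition in R, using the salary and budget vectors above; the coefficients and sums of squares should be close to the figures quoted:

sal_fit <- lm(salary ~ budget)
coef(sal_fit)                 # intercept and slope (about 31.9 and 7.73)
anova(sal_fit)                # Regression SS and Error SS; Total SS is their sum
summary(sal_fit)$r.squared    # about 0.722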
Multiple Regression
[ Cross-Sectional Data ]
Learning Objectives
• Explain the linear multiple regression
model [for cross-sectional data]
• Interpret linear multiple regression
computer output
• Explain multicollinearity
• Describe the types of multiple regression
models
Regression Modeling Steps
• Define problem or question
• Specify model
• Collect data
• Do descriptive data analysis
• Estimate unknown parameters
• Evaluate model
• Use model for prediction
Simple vs. Multiple
• Simple: β1 represents the unit change in Y per unit change in X. It does not take into account any other variable besides the single independent variable.
• Multiple: βi represents the unit change in Y per unit change in Xi. It takes into account the effect of the other independent variables (the “net regression coefficient”).
Assumptions
• Linearity - the Y variable is linearly related to the value of the X variable.
• Independence of Error - the error (residual) is independent for each value of X.
• Homoscedasticity - the variation around the line of regression is constant for all values of X.
• Normality - the values of Y are normally distributed at each value of X.
Goal
Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.
Simple Regression
A statistical model that utilizes one quantitative independent variable “X” to predict the quantitative dependent variable “Y.”
Multiple Regression
A statistical model that utilizes two
or more quantitative and
qualitative explanatory variables
(x1,..., xp) to predict a quantitative
dependent variable Y.
Caution: have at least two or more quantitative
explanatory variables (rule of thumb)
Multiple Regression Model
[Figure: a response surface with Y plotted against X1 and X2.]
Hypotheses
• H0: β1 = β2 = β3 = ... = βP = 0
• H1: At least one regression coefficient is not equal to zero

Hypotheses (alternate format)
• H0: βi = 0
• H1: βi ≠ 0
Types of Models
• Positive linear relationship
• Negative linear relationship
• No relationship between X and Y
• Positive curvilinear relationship
• U-shaped curvilinear
• Negative curvilinear relationship
Multiple Regression Models
• Linear models: Linear, Dummy Variable, Interaction, Polynomial
• Non-linear models: Square Root, Log, Reciprocal, Exponential
Linear Model
The relationship between one dependent and two or more independent variables is a linear function:

  Y = β0 + β1X1 + β2X2 + ... + βPXP + ε

  β0: population Y-intercept; β1, ..., βP: population slopes; ε: random error.
  Y: dependent (response) variable; X1, ..., XP: independent (explanatory) variables.
Method of Least Squares
• The straight line that best fits the data.
• Determine the straight line for which the differences between the actual values (Y) and the values that would be predicted from the fitted line of regression (Ŷ) are as small as possible.
Measures of Variation
• Explained variation (sum of
squares due to regression)
• Unexplained variation (error sum
of squares)
• Total sum of squares
Coefficient of Multiple Determination
When the null hypothesis is rejected, a relationship between Y and the X variables exists. Its strength is measured by R² [several types].

Coefficient of Multiple Determination
R²y.123---P
The proportion of the variation in Y that is explained by the set of explanatory variables selected.
Standard Error of the Estimate
sy.x — the measure of variability around the line of regression.

Confidence interval estimates
• True mean: μY.X
• Individual: Ŷi

Interval Bands [from simple regression]
[Figure: confidence and prediction bands around the fitted line, narrowest near X̄ and widening toward a given X.]
Multiple Regression Equation
Y-hat = b0 + b1x1 + b2x2 + ... + bPxP
where:
  b0 = Y-intercept {a constant value}
  b1 = slope of Y with variable x1, holding the effects of variables x2, x3, ..., xP constant
  bP = slope of Y with variable xP, holding all other variables’ effects constant
Mini-Case
Predict the consumption of home heating oil during January for homes located around Screne Lakes. Two explanatory variables are selected: average daily atmospheric temperature (°F) and the amount of attic insulation (inches).
Mini-Case
Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (gal)   Temp (°F)   Insulation (in)
275.30      40          3
363.80      27          3
164.30      40          10
 40.80      73          6
 94.30      64          6
230.90      34          6
366.70       9          6
300.60       8          10
237.80      23          10
121.40      63          3
 31.40      65          10
203.50      41          6
441.10      21          3
323.00      38          3
 52.50      58          10
Mini-Case
• What preliminary conclusions can home owners draw from the data?
• What could a home owner expect heating oil consumption (in gallons) to be if the outside temperature is 15 °F when the attic insulation is 10 inches thick?
Multiple Regression Equation [mini-case]
Dependent variable: Gallons Consumed
--------------------------------------------------------------------------------
Parameter      Estimate     Standard Error    T Statistic    P-Value
--------------------------------------------------------------------------------
CONSTANT       562.151      21.0931           26.6509        0.0000
Insulation     -20.0123     2.34251           -8.54313       0.0000
Temperature    -5.43658     0.336216          -16.1699       0.0000
--------------------------------------------------------------------------------
R-squared = 96.561 percent
R-squared (adjusted for d.f.) = 95.9879 percent
Standard Error of Est. = 26.0138
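A sketch of how this fit could be reproduced in R by entering the mini-case data; the coefficients should be close to the output above, and predict() gives the consumption estimate used on a later slide:

oil        <- c(275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5)
temp       <- c(40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58)
insulation <- c(3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10)

oil_fit <- lm(oil ~ temp + insulation)
summary(oil_fit)   # coefficients, R-squared, adjusted R-squared, standard error

# Predicted consumption at 15 degrees F with 10 inches of insulation (about 280 gallons)
predict(oil_fit, data.frame(temp = 15, insulation = 10))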
Multiple Regression Equation [mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
where: x1 = temperature [degrees F]
       x2 = attic insulation [inches]
Multiple Regression Equation [mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
thus:
• For a home with zero inches of attic insulation and an outside temperature of 0 °F, 562.15 gallons of heating oil would be consumed. [caution: data boundaries, extrapolation]
Extrapolation
[Figure: predictions within the relevant range of the X data are interpolation; predictions outside that range are extrapolation.]
Multiple Regression Equation [mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
• For each incremental increase in degrees F of temperature, for a given amount of attic insulation, heating oil consumption drops 5.44 gallons.
• For each incremental increase in inches of attic insulation, at a given temperature, heating oil consumption drops 20.01 gallons.
Multiple Regression Prediction [mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
with x1 = 15 °F and x2 = 10 inches:
Y-hat = 562.15 - 5.44(15) - 20.01(10) = 280.45 gallons consumed
Coefficient of Multiple Determination [mini-case]
R²y.12 = 0.9656
96.56 percent of the variation in heating oil can be explained by the variation in temperature and insulation.
Coefficient of Multiple Determination
• Proportion of variation in Y ‘explained’ by all X variables taken together
• R²Y.12 = explained variation / total variation = SSR / SST
• Never decreases when a new X variable is added to the model
  – Only Y values determine SST
  – Disadvantage when comparing models
Coefficient of Multiple Determination (Adjusted)
• Proportion of variation in Y ‘explained’ by all X variables taken together
• Reflects
  – Sample size
  – Number of independent variables
• Smaller [more conservative] than R²Y.12
• Used to compare models
Coefficient of Multiple Determination (adjusted)
R²(adj) y.123---P
The proportion of the variation in Y that is explained by the set of independent [explanatory] variables selected, adjusted for the number of independent variables and the sample size.
Coefficient of Multiple Determination (adjusted) [Mini-Case]
R²adj = 0.9599
95.99 percent of the variation in heating oil consumption can be explained by the model, adjusted for the number of independent variables and the sample size.
Coefficient of Partial Determination
• Proportion of variation in Y ‘explained’ by variable XP, holding all others constant
• Must estimate separate models
• Denoted R²Y1.2 in the two-X-variable case
  – Coefficient of partial determination of X1 with Y, holding X2 constant
• Useful in selecting X variables
Coefficient of Partial Determination
R²y1.234---P
The coefficient of partial determination of variable Y with x1, holding constant the effects of variables x2, x3, x4, ..., xP.
Coefficient of Partial Determination
[Mini-Case]
R2y1.2 = 0.9561
For a fixed (constant) amount of
insulation, 95.61 percent of the variation
in heating oil can be explained by the
variation in average atmospheric
temperature.
Coefficient of Partial Determination [Mini-Case]
R²y2.1 = 0.8588
For a fixed (constant) temperature, 85.88 percent of the variation in heating oil can be explained by the variation in amount of insulation.
Testing Overall Significance
• Shows if there is a linear relationship between all X variables together & Y
• Uses p-value
• Hypotheses
  – H0: β1 = β2 = ... = βP = 0
    • No linear relationship
  – H1: At least one coefficient is not 0
    • At least one X variable affects Y
Testing Model Portions
• Examines the contribution of a set of X variables to the relationship with Y
• Null hypothesis:
  – Variables in the set do not significantly improve the model when all other variables are included
• Must estimate separate models
• Used in selecting X variables
Diagnostic Checking
• H0: retain or reject
  – If reject: p-value ≤ 0.05
• R²adj
• Correlation matrix
• Partial correlation matrix
Multicollinearity
• High correlation between X variables
• Coefficients measure combined effect
• Leads to unstable coefficients depending on X
variables in model
• Always exists; matter of degree
• Example: Using both total number of rooms
and number of bedrooms as explanatory
variables in same model
Detecting Multicollinearity
• Examine correlation matrix
– Correlations between pairs of X variables are
more than with Y variable
• Few remedies
– Obtain new sample data
– Eliminate one correlated X variable
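A minimal sketch of these checks in R, using the mini-case variables from the earlier sketch; cor() is base R, while vif() assumes the car package is installed:

cor(cbind(oil, temp, insulation))   # pairwise correlations among the variables

library(car)                        # assumed available; install.packages("car") if not
vif(oil_fit)                        # variance inflation factors; values near 1 suggest little multicollinearity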
Evaluating Multiple Regression Model Steps
• Examine variation measures
• Do residual analysis
• Test parameter significance
– Overall model
– Portions of model
– Individual coefficients
• Test for multicollinearity
Multiple Regression Models
[Model-type roadmap repeated.]
Dummy-Variable Regression Model
• Involves a categorical X variable with two levels
  – e.g., female-male, employed-not employed, etc.
• Variable levels coded 0 & 1
• Assumes only the intercept is different
  – Slopes are constant across categories
Dummy-Variable Model Relationships
[Figure: two parallel lines with the same slope b1; the female line has intercept b0 + b2 and the male line has intercept b0.]
Dummy Variables
• Permit use of qualitative data (e.g., seasonal, class standing, location, gender); coded 0, 1 (nominal data).
• As part of diagnostic checking, they can incorporate outliers (i.e., large residuals) and influence measures.
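An illustrative sketch with made-up data (the variable names are hypothetical): R codes the two-level factor as 0/1 automatically, giving one common slope and two intercepts:

set.seed(1)
experience <- runif(40, 0, 20)
gender     <- factor(rep(c("female", "male"), each = 20))   # 0/1 coding handled by R
pay        <- 30 + 2 * experience + 5 * (gender == "female") + rnorm(40, sd = 2)

dummy_fit <- lm(pay ~ experience + gender)
coef(dummy_fit)   # 'gendermale' is the intercept shift relative to the female baseline (about -5 here)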
Multiple Regression Models
[Model-type roadmap repeated.]
Interaction Regression Model
• Hypothesizes interaction between pairs of X variables
  – Response to one X variable varies at different levels of another X variable
• Contains two-way cross-product terms
  Y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• Can be combined with other models, e.g., dummy-variable models (see the sketch below)
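A minimal sketch with made-up data; in R the formula x1 * x2 expands to x1 + x2 + x1:x2, so the cross-product term is included automatically:

set.seed(2)
x1 <- runif(50)
x2 <- runif(50)
y  <- 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + rnorm(50, sd = 0.3)

int_fit <- lm(y ~ x1 * x2)   # same as y ~ x1 + x2 + x1:x2
summary(int_fit)             # the x1:x2 row tests whether the interaction term is needed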
Effect of Interaction
• Given:
  Yi = β0 + β1X1i + β2X2i + β3X1iX2i + εi
• Without the interaction term, the effect of X1 on Y is measured by β1
• With the interaction term, the effect of X1 on Y is measured by β1 + β3X2
  – The effect increases as X2i increases
Interaction Example
Y = 1 + 2X1 + 3X2 + 4X1X2
When X2 = 0:  Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
When X2 = 1:  Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
[Figure: the two lines plotted for X1 between 0 and 1.5 have different slopes.]
Effect (slope) of X1 on Y does depend on the X2 value.
Multiple Regression Models
[Model-type roadmap repeated.]
Inherently Linear Models
• Non-linear models that can be expressed in linear form
  – Can be estimated by least squares in linear form
• Require data transformation
Curvilinear Model Relationships
[Four panels showing different curvilinear relationships between Y and X1.]
Logarithmic Transformation
Y = β0 + β1 ln x1 + β2 ln x2 + ε
[Figure: the curve rises with x1 when β1 > 0 and falls when β1 < 0.]
Square-Root Transformation
Yi = β0 + β1√X1i + β2√X2i + εi
[Figure: the curve rises when β1 > 0 and falls when β1 < 0.]
Reciprocal Transformation
Yi = β0 + β1(1/X1i) + β2(1/X2i) + εi
[Figure: the curve approaches an asymptote, falling toward it when β1 > 0 and rising toward it when β1 < 0.]
Exponential Transformation
Yi = e^(β0 + β1X1i + β2X2i) · εi
[Figure: the curve rises when β1 > 0 and falls when β1 < 0.]
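A sketch of how these inherently linear models are fit in practice: transform the variables, then use ordinary least squares. The data below are made up from an exponential model; the other fits only illustrate the mechanics of the transformations:

set.seed(3)
x1 <- runif(60, 1, 10)
x2 <- runif(60, 1, 10)
y  <- exp(0.5 + 0.3 * x1 - 0.1 * x2) * exp(rnorm(60, sd = 0.1))

log_fit  <- lm(y ~ log(x1) + log(x2))    # logarithmic transformation of the Xs
sqrt_fit <- lm(y ~ sqrt(x1) + sqrt(x2))  # square-root transformation
exp_fit  <- lm(log(y) ~ x1 + x2)         # exponential model becomes linear after log(Y)
coef(exp_fit)                            # estimates of beta0, beta1, beta2 on the log scale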
Overview
• Explained the linear multiple regression
model
• Interpreted linear multiple regression
computer output
• Explained multicollinearity
• Described the types of multiple regression
models
Regression Analysis
[Multiple Regression]

*** End of Presentation ***

Questions?