Chapter 15
Chapter 15
Chapter 15
Multiple Regression
15.1 - Multiple Regression Model
15.2 - Least Squares Method
15.3 - Multiple Coefficient of Determination
15.4 - Model Assumptions
15.5 - Testing for Significance
15.6 - Using the Estimated Regression Equation for Estimation and Prediction
15.7 - Categorical Independent Variables
15.8 - Residual Analysis
15.9 - Logistic Regression
1
Multiple Regression
In this chapter we continue our study of regression analysis by considering situations
involving two or more independent variables.
This subject area, called multiple regression analysis, enables us to consider more
factors and thus obtain better estimates than are possible with simple linear regression.
2
Multiple Regression Equation
The equation that describes how the mean value of ! is related to "# , "% , … , "' is:
3
Estimation Process
4
Least Squares Method
The formulas for the regression coefficients !" , !# , !$ ,…, !% involve the use of
matrix algebra. We will rely on computer software packages to perform the
calculations.
The emphasis will be on how to interpret the computer output rather than on how
to make the multiple regression computations.
5
Example: Programmer Salary Survey
A software firm collected data for a sample of 20 computer programmers. A
suggestion was made that regression analysis could be used to determine if salary
was related to the years of experience and the score on the firm’s programmer
aptitude test.
The years of experience, score on the aptitude test, and corresponding annual salary
($1000s) for a sample of 20 programmers is shown on the next slide.
6
Example
Exper. (Yrs.) Test score Salary ($1000s) Exper. (Yrs.) Test score Salary ($1000s)
4 78 24.0 9 88 38.0
7 100 43.0 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29.0
10 84 38.0 8 87 34.0
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30.0 3 70 28.2
6 91 33.0 3 89 30.0
7
Answer
Suppose we believe that salary (!) is related to the years of experience ("#) and the
score on the programmer aptitude test ("$) by the following regression model:
where
! = annual salary ($1000s)
"# = years of experience
"$ = score on programmer aptitude test
8
Solving for the Estimates of !" , !# , !$
9
Using Excel To Solve for the Estimates of !" , !# , !$
Activate and set the Confidence Level, Residuals and Standardized Residuals.
10
Solving for the Estimates of !" , !# , !$
11
Interpreting the Coefficients
!" = 1.404 Salary is expected to increase by $1,404 for each additional year of
experience (when the variable score on programmer attitude test is held constant).
!* = 0.251 Salary is expected to increase by $251 for each additional point scored
on the programmer aptitude test (when the variable years of experience
is held constant).
12
Multiple Coefficient of Determination
Relationship Among SST, SSR, SSE
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
ANOVA Output
13
Adjusted Multiple Coefficient of Determination
Adding independent variables, even ones that are not statistically significant, causes
the prediction errors to become smaller, thus reducing the sum of squares due to
error, SSE.
Because SSR = SST – SSE, when SSE becomes smaller, SSR becomes larger,
causing ! " = SSR/SST to increase.
14
Assumptions About the Error Term !
• The error ! is a random variable with mean of zero.
• The variance of !, denoted by σ2, is the same for all values of the independent
variables.
• The values of ε are independent.
15
Testing for Significance
In simple linear regression, the ! and " test provide the same conclusion.
In multiple regression, the ! and " test have different purposes.
The F test is referred to as the test for overall significance.
If the ! test shows an overall significance, the " test is used to determine whether
each of the individual independent variables is significant.
A separate " test is conducted for each of the independent variables in the model.
16
Testing for Significance: ! Test (Steps)
17
Example:
ANOVA Output
18
Using Excel To Perform F Test for Multiple Regression
19
Testing for Individual Significance: t Test (Steps)
20
Example
21
Testing for Significance: t Test Individual Parameters (Steps)
Regression Equation Output
22
Using Excel To Perform t Test for Multiple Regression
23
Testing for Significance: Multicollinearity
• Every attempt should be made to avoid including independent variables that are
highly correlated.
24
Using the Estimated Regression Equation for Estimation and Prediction
• The procedures for estimating the mean value of ! and predicting an individual value
of ! in multiple regression are similar to those in simple regression.
• The formulas required to develop interval estimates for the mean value of !" and for an
individual value of ! are beyond the scope of the textbook.
• Software packages for multiple regression will often provide these interval estimates.
Residual Analysis
For simple linear regression the residual plot against !" and the residual plot against #
provide the same information.
In multiple regression analysis it is preferable to use the residual plot against !" to
determine if the model assumptions are satisfied.
25
Standardized Residual Plot Against "!
• Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < –2 or > +2)
• Providing insight about the assumption that the error term # has a normal distribution
27
Standardized Residual Plot Against "!
28
Categorical Independent Variables
In many situations we must work with categorical independent variables such as
gender (male, female), method of payment (cash, check, credit card), etc.
For example, !" might represent gender where !" = 0 indicates male and !" = 1
indicates female.
29
Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary survey,
suppose that management also believes that the annual salary is related to whether
the individual has a graduate degree in computer science or information systems.
The years of experience, the score on the programmer aptitude test, whether the
individual has a relevant graduate degree, and the annual salary ($1000) for each of
the sampled 20 programmers are shown on the next slide.
30
Example: Programmer Salary Survey
Exper. Test Score Degr. Salary Exper. Test Score Degr. Salary
(Yrs.) ($1000) (Yrs.) ($1000)
4 78 No 24.0 9 88 Yes 38.0
7 100 Yes 43.0 2 73 No 26.6
1 86 No 23.7 10 75 Yes 36.2
5 82 Yes 34.3 5 81 No 31.6
8 86 Yes 35.8 6 74 No 29.0
10 84 Yes 38.0 8 87 Yes 34.0
0 75 No 22.2 4 79 No 30.1
1 80 No 23.1 6 94 Yes 33.9
6 83 No 30.0 3 70 No 28.2
6 91 Yes 33.0 3 89 No 30.0
31
Example: Programmer Salary Survey
Regression Equation
where:
"! = annual salary ($1000)
#$ = years of experience
#% = score on programmer aptitude test
#& = 0 if individual does not have a graduate degree
1 if individual does have a graduate degree
(#& is a dummy variable)
32
Example: Programmer Salary Survey
ANOVA Output
169.299
33
Example: Programmer Salary Survey
34
More Complex Categorical Variables
If a categorical variable has ! levels, ! − 1 dummy variables are required, with
each dummy variable being coded as 0 or 1.
For example, a variable with levels A, B, and C could be represented by $% and $&
values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
Care must be taken in defining and interpreting the dummy variables.
For example, a variable indicating level of education could be represented by $% and
$& values as follows:
salespersons.
36
Answer
Excel’s Chart tools can be used to develop a scatter diagram and fit a straight
line to bivariate data.
The estimated regression equation and the coefficient of determination for
simple linear regression can also be developed.
Insert > Scatter Chart > Scatter > Select Data
Click OK.
37
Answer
The scatter diagram indicates a possible curvilinear relationship between the
length of time employed and the number of scales sold.
So, we develop a multiple regression model with two independent variables:
! and ! " .
38
Modeling Curvilinear Relationships
Excel’s Chart tools can be used to fit a
polynomial curve to the data. (Dialog box is
shown.)
39
Modeling Curvilinear Relationships
40
Modeling Curvilinear Relationships
Excel’s Chart tools output does not provide any means for testing the significance of
the results, so we need to use Excel’s Regression tool.
We will treat the values of ! " as a second independent variable (called MonthSq).
41
Answer
We should be pleased with the fit provided by the
estimated multiple regression equation.
42
Excel’s Regression Tool Output
43