Chapter 15


Multiple Regression
15.1 - Multiple Regression Model
15.2 - Least Squares Method
15.3 - Multiple Coefficient of Determination
15.4 - Model Assumptions
15.5 - Testing for Significance
15.6 - Using the Estimated Regression Equation for Estimation and Prediction
15.7 - Categorical Independent Variables
15.8 - Residual Analysis
15.9 - Logistic Regression
Multiple Regression
In this chapter we continue our study of regression analysis by considering situations
involving two or more independent variables.
This subject area, called multiple regression analysis, enables us to consider more
factors and thus obtain better estimates than are possible with simple linear regression.

Multiple Regression Model

y = β0 + β1x1 + β2x2 + … + βpxp + ε

Multiple Regression Equation
The equation that describes how the mean value of y is related to x1, x2, …, xp is:

E(y) = β0 + β1x1 + β2x2 + … + βpxp

Estimated Multiple Regression Equation

ŷ = b0 + b1x1 + b2x2 + … + bpxp

A simple random sample is used to compute the sample statistics b0, b1, …, bp that serve as point estimators of the parameters β0, β1, …, βp.

Estimation Process
Least Squares Method

Least Squares Criterion
The coefficient estimates are chosen to minimize the sum of squared residuals:

min Σ(yi − ŷi)²

Computation of Coefficient Values
The formulas for the regression coefficients !" , !# , !$ ,…, !% involve the use of
matrix algebra. We will rely on computer software packages to perform the
calculations.
The emphasis will be on how to interpret the computer output rather than on how
to make the multiple regression computations.
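The matrix-algebra step the slides allude to can be stated compactly; in standard notation (not shown in the original), the least squares estimates are:

```latex
% Least squares estimates in matrix form:
% X is the n x (p+1) design matrix (first column all 1s), y is the response vector.
\mathbf{b} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
```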
Example: Programmer Salary Survey
A software firm collected data for a sample of 20 computer programmers. A
suggestion was made that regression analysis could be used to determine if salary
was related to the years of experience and the score on the firm’s programmer
aptitude test.

The years of experience, score on the aptitude test, and corresponding annual salary
($1000s) for a sample of 20 programmers are shown on the next slide.
Example

Exper. (Yrs.) Test score Salary ($1000s) Exper. (Yrs.) Test score Salary ($1000s)
4 78 24.0 9 88 38.0
7 100 43.0 2 73 26.6
1 86 23.7 10 75 36.2
5 82 34.3 5 81 31.6
8 86 35.8 6 74 29.0
10 84 38.0 8 87 34.0
0 75 22.2 4 79 30.1
1 80 23.1 6 94 33.9
6 83 30.0 3 70 28.2
6 91 33.0 3 89 30.0

Answer
Suppose we believe that salary (y) is related to the years of experience (x1) and the
score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1x1 + β2x2 + ε

where
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
Solving for the Estimates of b0, b1, b2
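As a cross-check on the software output, the estimates can also be obtained by solving the normal equations (X′X)b = X′y directly. A minimal pure-Python sketch using the 20-programmer sample (this code is not part of the original slides):

```python
# Solving the normal equations (X'X)b = X'y by Gaussian elimination
# for the programmer salary data.

data = [  # (years of experience, aptitude test score, salary in $1000s)
    (4, 78, 24.0), (7, 100, 43.0), (1, 86, 23.7), (5, 82, 34.3),
    (8, 86, 35.8), (10, 84, 38.0), (0, 75, 22.2), (1, 80, 23.1),
    (6, 83, 30.0), (6, 91, 33.0), (9, 88, 38.0), (2, 73, 26.6),
    (10, 75, 36.2), (5, 81, 31.6), (6, 74, 29.0), (8, 87, 34.0),
    (4, 79, 30.1), (6, 94, 33.9), (3, 70, 28.2), (3, 89, 30.0),
]

X = [[1.0, x1, x2] for x1, x2, _ in data]  # design matrix with intercept column
y = [row[2] for row in data]

# Normal equations: A = X'X (3x3), v = X'y (3x1)
A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
v = [sum(X[n][i] * y[n] for n in range(len(X))) for i in range(3)]

def solve(A, v):
    # Gaussian elimination with partial pivoting.
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (M[r][n] - sum(M[r][c] * b[c] for c in range(r + 1, n))) / M[r][r]
    return b

b0, b1, b2 = solve(A, v)
print(f"SALARY = {b0:.3f} + {b1:.3f}(EXPER) + {b2:.3f}(SCORE)")
# close to the slide's result: SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
```

In practice the matrix inversion is left to software, exactly as the slides recommend; the sketch only makes the computation concrete.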

Using Excel To Solve for the Estimates of b0, b1, b2

Activate Data Analysis


Windows (File > Options > Add-ins > Manage: Excel Add-ins > Go > Analysis ToolPak)
Mac (Tools > Excel Add-ins > Analysis ToolPak)

Click Data > Data Analysis > Regression

Select your Y Range values and X Range values

Activate and set the Confidence Level, Residuals and Standardized Residuals.

Set Location for the results and Click OK.

Solving for the Estimates of b0, b1, b2

Regression Equation Output:

Predictor     Coef      SE Coef    T        p
Constant      3.174     6.15607    0.5156   0.61279
Experience    1.4039    0.19857    7.0702   1.9E-09
Test Score    0.25089   0.07735    3.2433   0.00478

The estimated regression equation is:


SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
(Note: Predicted salary will be in thousands of dollars.)

Interpreting the Coefficients

In multiple regression analysis, we interpret each regression coefficient as follows:

bi represents an estimate of the change in y corresponding to a one-unit
increase in xi when all other independent variables are held constant.

b1 = 1.404: Salary is expected to increase by $1,404 for each additional year of
experience (when the score on the programmer aptitude test is held constant).

b2 = 0.251: Salary is expected to increase by $251 for each additional point scored
on the programmer aptitude test (when the years of experience are held constant).

Multiple Coefficient of Determination

Relationship Among SST, SSR, SSE

SST = SSR + SSE

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

The multiple coefficient of determination is R² = SSR/SST.

ANOVA Output
Adjusted Multiple Coefficient of Determination

Adding independent variables, even ones that are not statistically significant, causes
the prediction errors to become smaller, thus reducing the sum of squares due to
error, SSE.

Because SSR = SST − SSE, when SSE becomes smaller, SSR becomes larger,
causing R² = SSR/SST to increase.

The adjusted multiple coefficient of determination compensates for the number of
independent variables in the model:

Ra² = 1 − (1 − R²)(n − 1)/(n − p − 1)
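The adjustment is simple arithmetic; a quick sketch of the formula follows (the R² value plugged in below is an assumed illustration in the ballpark of the salary example, not a number taken from the slides):

```python
# Adjusted R-squared: penalizes R-squared for the number of independent
# variables p relative to the sample size n.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Assumed illustration: R-squared of 0.8342 with n = 20 observations, p = 2 variables.
print(round(adjusted_r2(0.8342, 20, 2), 4))  # → 0.8147
```

Note that for a fixed R², adding another variable (larger p) lowers the adjusted value, which is exactly the compensation the slide describes.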
Assumptions About the Error Term ε
• The error ε is a random variable with a mean of zero.
• The variance of ε, denoted by σ², is the same for all values of the independent
variables.
• The values of ε are independent.
• The error ε is a normally distributed random variable.
Testing for Significance

In simple linear regression, the F and t tests provide the same conclusion.
In multiple regression, the F and t tests have different purposes.
The F test is referred to as the test for overall significance.

If the F test shows overall significance, the t test is used to determine whether
each of the individual independent variables is significant.
A separate t test is conducted for each of the independent variables in the model.

We refer to each of these t tests as a test for individual significance.
Testing for Significance: F Test (Steps)

Hypotheses:       H0: β1 = β2 = … = βp = 0
                  Ha: One or more of the parameters is not equal to zero
Test statistic:   F = MSR/MSE
Rejection rule:   Reject H0 if the p-value ≤ α or if F ≥ Fα, where Fα is based on an
                  F distribution with p degrees of freedom in the numerator and
                  n − p − 1 degrees of freedom in the denominator.
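The arithmetic behind the F statistic can be sketched as follows; the sums of squares below are made-up illustrative numbers, not output from the salary example:

```python
# Illustrative sketch of the F statistic computation.

def f_statistic(sst, ssr, n, p):
    """F = MSR/MSE, with MSR = SSR/p and MSE = SSE/(n - p - 1)."""
    sse = sst - ssr           # SST = SSR + SSE
    msr = ssr / p             # mean square due to regression
    mse = sse / (n - p - 1)   # mean square due to error
    return msr / mse

# Assumed numbers for demonstration only.
F = f_statistic(sst=100.0, ssr=80.0, n=20, p=2)
print(F)  # about 34.0 with these assumed values
```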
Example:

ANOVA Output

Using Excel To Perform F Test for Multiple Regression

Activate Data Analysis


Windows (File > Options > Add-ins > Manage: Excel Add-ins > Go > Analysis ToolPak)
Mac (Tools > Excel Add-ins > Analysis ToolPak)

Click Data > Data Analysis > Regression

Select your Y Range values and X Range values

Set Location for the results and Click OK.

Testing for Individual Significance: t Test (Steps)

Hypotheses:       H0: βi = 0
                  Ha: βi ≠ 0
Test statistic:   t = bi / sbi
Rejection rule:   Reject H0 if the p-value ≤ α or if t ≤ −tα/2 or t ≥ tα/2, where
                  tα/2 is based on a t distribution with n − p − 1 degrees of freedom.
Example

Testing for Significance: t Test for Individual Parameters (Steps)

Regression Equation Output

The t statistic and p-value in the output are used to test for the individual
significance of “Experience”.
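Using the coefficient and standard error for Experience from the regression output, the t statistic can be checked directly:

```python
# t statistic for an individual coefficient: t = b_i / s_bi.
# Values are the "Experience" row of the regression output.

b1 = 1.4039      # estimated coefficient for Experience
se_b1 = 0.19857  # standard error of the coefficient

t = b1 / se_b1
print(round(t, 4))  # matches the T column of the output (about 7.07)
```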
Using Excel To Perform t Test for Multiple Regression

Activate Data Analysis


Windows (File > Options > Add-ins > Manage: Excel Add-ins > Go > Analysis ToolPak)
Mac (Tools > Excel Add-ins > Analysis ToolPak)

Click Data > Data Analysis > Regression

Select your Y Range values and X Range values

Activate and set the Confidence Level

Set Location for the results and Click OK.

Testing for Significance: Multicollinearity

• The term multicollinearity refers to the correlation among the independent
variables.

• If the estimated regression equation is to be used only for predictive purposes,
multicollinearity is usually not a serious problem.

• Every attempt should be made to avoid including independent variables that are
highly correlated.
Using the Estimated Regression Equation for Estimation and Prediction
• The procedures for estimating the mean value of y and predicting an individual value
of y in multiple regression are similar to those in simple regression.

• The formulas required to develop interval estimates for the mean value of y and for an
individual value of y are beyond the scope of the textbook.
• Software packages for multiple regression will often provide these interval estimates.

Residual Analysis
For simple linear regression the residual plot against ŷ and the residual plot against x
provide the same information.
In multiple regression analysis it is preferable to use the residual plot against ŷ to
determine if the model assumptions are satisfied.
Standardized Residual Plot Against ŷ
• Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < −2 or > +2)
• Providing insight about the assumption that the error term ε has a normal distribution

• The computation of the standardized residuals in multiple regression analysis is too
complex to be done by hand. Excel’s Regression tool can be used.

Observation   Predicted Y   Residuals    Standard Residuals
1             27.89626      −3.89626     −1.771707
2             37.95204       5.047957     2.295406
3             26.02901      −2.32901     −1.059048
4             32.11201       2.187986     0.992921
5             36.34251      −0.54251     −0.246689
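The outlier rule above can be applied to the Standard Residuals column of the table; a small sketch:

```python
# Flag observations whose standardized residual falls outside [-2, +2],
# using the Standard Residuals values from the output table.

std_residuals = [-1.771707, 2.295406, -1.059048, 0.992921, -0.246689]

outliers = [i + 1 for i, r in enumerate(std_residuals) if abs(r) > 2]
print(outliers)  # observation 2 is flagged as a possible outlier
```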
Using Excel To Produce Standardized Residuals

Activate Data Analysis


Windows (File > Options > Add-ins > Manage: Excel Add-ins > Go > Analysis ToolPak)
Mac (Tools > Excel Add-ins > Analysis ToolPak)

Click Data > Data Analysis > Regression

Select your Y Range values and X Range values

Activate and set the Residuals and Standardized Residuals.

Set Location for the results and Click OK.

Standardized Residual Plot Against ŷ

Categorical Independent Variables
In many situations we must work with categorical independent variables such as
gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1
indicates female.

In this case, x2 is called a dummy or indicator variable.
Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary survey,
suppose that management also believes that the annual salary is related to whether
the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the
individual has a relevant graduate degree, and the annual salary ($1000) for each of
the sampled 20 programmers are shown on the next slide.

Example: Programmer Salary Survey

Exper. (Yrs.)  Test Score  Degr.  Salary ($1000s)  Exper. (Yrs.)  Test Score  Degr.  Salary ($1000s)
4 78 No 24.0 9 88 Yes 38.0
7 100 Yes 43.0 2 73 No 26.6
1 86 No 23.7 10 75 Yes 36.2
5 82 Yes 34.3 5 81 No 31.6
8 86 Yes 35.8 6 74 No 29.0
10 84 Yes 38.0 8 87 Yes 34.0
0 75 No 22.2 4 79 No 30.1
1 80 No 23.1 6 94 Yes 33.9
6 83 No 30.0 3 70 No 28.2
6 91 Yes 33.0 3 89 No 30.0

Example: Programmer Salary Survey

Regression Equation

E(y) = β0 + β1x1 + β2x2 + β3x3

where:
y = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if the individual does not have a graduate degree,
     1 if the individual does have a graduate degree
(x3 is a dummy variable)
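The extended model can be fit the same way as before, with the degree dummy added as a third column; a pure-Python sketch using the sample data (not part of the original slides):

```python
# Refitting the salary model with the graduate-degree dummy (x3 = 1 if the
# programmer has a relevant graduate degree, 0 otherwise) via normal equations.

data = [  # (experience, test score, degree 0/1, salary $1000s)
    (4, 78, 0, 24.0), (7, 100, 1, 43.0), (1, 86, 0, 23.7), (5, 82, 1, 34.3),
    (8, 86, 1, 35.8), (10, 84, 1, 38.0), (0, 75, 0, 22.2), (1, 80, 0, 23.1),
    (6, 83, 0, 30.0), (6, 91, 1, 33.0), (9, 88, 1, 38.0), (2, 73, 0, 26.6),
    (10, 75, 1, 36.2), (5, 81, 0, 31.6), (6, 74, 0, 29.0), (8, 87, 1, 34.0),
    (4, 79, 0, 30.1), (6, 94, 1, 33.9), (3, 70, 0, 28.2), (3, 89, 0, 30.0),
]
X = [[1.0, x1, x2, x3] for x1, x2, x3, _ in data]
y = [row[3] for row in data]
k = 4  # number of coefficients, including the intercept

A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
v = [sum(X[n][i] * y[n] for n in range(len(X))) for i in range(k)]

def solve(A, v):
    # Gaussian elimination with partial pivoting.
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (M[r][n] - sum(M[r][c] * b[c] for c in range(r + 1, n))) / M[r][r]
    return b

b = solve(A, v)
print([round(c, 3) for c in b])  # close to the output: 7.945, 1.148, 0.197, 2.280
```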
Example: Programmer Salary Survey

ANOVA Output
Example: Programmer Salary Survey

Regression Equation Output

Predictor     Coef    SE Coef   T       p
Constant      7.945   7.382     1.076   0.298
Experience    1.148   0.298     3.856   0.001
Test Score    0.197   0.090     2.191   0.044
Grad. Degr.   2.280   1.987     1.148   0.268

More Complex Categorical Variables
If a categorical variable has k levels, k − 1 dummy variables are required, with
each dummy variable being coded as 0 or 1.
For example, a variable with levels A, B, and C could be represented by x1 and x2
values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
Care must be taken in defining and interpreting the dummy variables.
For example, a variable indicating level of education could be represented by x1 and
x2 values as follows:

Highest Degree   x1   x2
Bachelor’s        0    0
Master’s          1    0
Ph.D.             0    1
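The k − 1 coding scheme above can be sketched as a small helper (a hypothetical function, not from the slides):

```python
# Hypothetical helper: encode a k-level categorical variable as k - 1 dummy
# variables, using the first level as the baseline (all zeros).

def dummy_encode(value, levels):
    """Return the k - 1 dummy values for `value`, given the ordered `levels`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value}")
    return [1 if value == lvl else 0 for lvl in levels[1:]]

levels = ["Bachelor's", "Master's", "Ph.D."]  # baseline: Bachelor's
print(dummy_encode("Bachelor's", levels))  # [0, 0]
print(dummy_encode("Master's", levels))    # [1, 0]
print(dummy_encode("Ph.D.", levels))       # [0, 1]
```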
Modeling Curvilinear Relationships

Example: Sales of Laboratory Scales

A manufacturer of laboratory scales wants to investigate the relationship between
the length of employment of their salespeople and the number of scales sold.

The table gives the number of months each salesperson has been employed by the
firm (x) and the number of scales sold (y) by 15 randomly selected salespersons.

Months   Sales   Months   Sales
41       275     40       189
106      296     51       235
76       317     9        83
104      376     12       112
22       162     6        67
12       150     56       325
85       367     19       325
111      308
Answer

Excel’s Chart tools can be used to develop a scatter diagram and fit a straight
line to bivariate data.
The estimated regression equation and the coefficient of determination for
simple linear regression can also be developed.

Insert > Scatter Chart > Scatter > Select Data

Enter a Name for your chart

Highlight your Y values
Highlight your X values
Connect data points with a line by:
a. Right-click any data point
b. Click “Add Trendline”
c. Choose “Linear”
d. Check “Display Equation on chart” and “Display R-squared value on chart”

Click OK.
Answer

The scatter diagram indicates a possible curvilinear relationship between the
length of time employed and the number of scales sold.
So, we develop a multiple regression model with two independent variables:
x and x².

• This model is often referred to as a second-order polynomial or a quadratic model.
Modeling Curvilinear Relationships
Excel’s Chart tools can be used to fit a
polynomial curve to the data. (Dialog box is
shown.)

To get the dialog box, position the mouse
pointer over any data point in the scatter
diagram, right-click, select “Format
Trendline”, and choose “Polynomial”.

The estimated multiple regression equation


and multiple coefficient of determination for
this second-order model are also obtained.

Modeling Curvilinear Relationships

Chart Tools Output

Modeling Curvilinear Relationships
Excel’s Chart tools output does not provide any means for testing the significance of
the results, so we need to use Excel’s Regression tool.

We will treat the values of x² as a second independent variable (called MonthsSq).

Months   MonthsSq   Sales   Months   MonthsSq   Sales
41       1681       275     40       1600       189
106      11236      296     51       2601       235
76       5776       317     9        81         83
104      10816      376     12       144        112
22       484        162     6        36         67
12       144        150     56       3136       325
85       7225       367     19       361        325
111      12321      308
Answer

We should be pleased with the fit provided by the
estimated multiple regression equation.

Excel’s Regression Tool Output

The overall model is significant (the p-value for the F test is 8.75E-07).
Excel’s Regression Tool Output

We can conclude that adding MonthsSq to the model is significant.
