Chapter 1 PDF
Modeling
Abstract
Applied Regression and Modeling: A Computer Integrated Approach creates a
balance between the theory, practical applications, and computer implementation behind Regression, one of the most widely used techniques
in analyzing and solving real-world problems. The book begins with a
thorough explanation of prerequisite knowledge with a discussion of
Simple Regression Analysis, including the computer applications. This is
followed by Multiple Regression, a widely used tool to predict a response
variable using two or more predictors. Since the analyses of regression
models involve tedious and complex computations, a complete computer
analysis, including the interpretation of multiple regression problems
along with the model adequacy tests and residual analysis using widely
used computer software, is presented. The use of computers relieves the
analyst of tedious, repetitive calculations and allows one to focus on
creating and interpreting successful models.
Finally, the book extends the concepts to Regression and Modeling.
Different models that provide a good fit to a set of data and provide a
good prediction of the response variable are discussed. Among models
discussed are the nonlinear, higher order, and interaction models,
Keywords
coefficient of correlation, correlation, dependent variable, dummy variable, independent variable, interaction model, least squares estimates,
least squares prediction equation, linear regression, multiple coefficient
of determination, multiple regression and modeling, nonlinear models,
regression line, residual analysis, scatterplot, second-order model, stepwise regression
Contents
Preface
Acknowledgments
Computer Software Integration, Computer Instructions, and Data Files
Chapter 1 Introduction to Regression and Correlation Analysis
Chapter 2 Regression, Covariance, and Coefficient of Correlation
Chapter 3 Illustration of Least Squares Regression Method
Chapter 4 Regression Analysis Using a Computer
Chapter 5 Multiple Regression: Computer Analysis
Chapter 6 Model Building and Computer Analysis
Chapter 7 Models with Qualitative Independent (Dummy) Variables, Interaction Models, All Subset and Stepwise Regression Models with Computer Analysis
Chapter 8 Notes on Implementation of Regression Models
Bibliography
Index
Preface
This book is about regression and modeling, one of the most widely
used techniques in analyzing and solving real-world problems. Regression analysis is used to investigate the relationship between two or more
variables. Often we are interested in predicting one variable using one
or more other variables. For example, we might be interested in the relationship between two variables: sales and profit for a chain of stores, the number
of hours required to produce a certain number of products, the number of
accidents versus blood alcohol level, advertising expenditures and sales,
or the height of parents compared to their children. In all these cases,
regression analysis can be applied to investigate the relationship between
the variables.
The book is divided into three parts: (1) prerequisites to regression
analysis followed by a discussion of simple regression, (2) multiple regression analysis with applications, and (3) regression and modeling, including
second-order models, nonlinear regression, regression using qualitative or
dummy variables, and interaction models in regression. All these sections
provide examples with complete computer analysis and instructions commonly used in modeling and analyzing these problems. The book deals
with detailed analysis and interpretation of computer results. This will
help readers appreciate the power of computers in applying regression
models. Readers will find that understanding computer results is
critical to implementing regression and modeling in real-world situations.
The purpose of simple regression analysis is to develop a statistical
model that can be used to predict the value of a response or dependent
variable using an independent variable. In a simple linear regression
method, we study the linear relationship between two variables. For
example, suppose that a power utility company is interested in developing a model that will enable it to predict the home heating cost based
on the size of homes in two of the Western states that it serves. This
model involves two variables: the heating cost and the size of the homes.
The first part of the book shows how to model and analyze this type of
problem.
In the second part of the book, we expand the concept of simple regression to include multiple regression analysis. Multiple linear regression
involves one dependent or response variable and two or more independent
variables or predictors. The concepts of simple regression discussed in the
previous chapter are also applicable to multiple regression. We provide
graphical analyses known as matrix plots that are very useful in analyzing
multiple regression problems. A complete computer analysis, including
the interpretation of multiple regression problems along with the model
adequacy tests and residual analysis, is presented.
In the third part of the book, we discuss different types of models
using regression analysis. By model building, we mean selecting the model
that will provide a good fit to a set of data and a
good prediction of the response or dependent variable. In experimental situations, we often encounter both quantitative and qualitative
variables. In the model building examples, we will show how to deal with
qualitative independent variables. The model building part also discusses
nonlinear models, including second-order, higher order, and interaction models. Complete computer analysis and interpretation of computer
results are presented with real-world applications. We also explain how to
model a regression problem using dummy variables. Finally, we discuss all
subset regression and stepwise regression and their applications.
The book is written for juniors, seniors, and graduate students in
business, MBAs, professional MBAs, and working people in business and
industry. Managers, practitioners, professionals, quality professionals,
quality engineers, and anyone involved in data analysis, business analytics, and quality and Six Sigma will find the book to be a valuable resource.
The book presents an in-depth treatment of regression and modeling in a concise form. Readers will find the book easy to read and
comprehend. The book takes the approach of organizing and presenting
the material in a way that allows the reader to understand the concepts
easily. The use of computers in modeling and analyzing simple, multiple, and higher order regression problems is emphasized throughout the
book. The book uses the computer software most widely used for data
analysis and quality in industry and academia. Readers interested in
Acknowledgments
I would like to thank the reviewers who took the time to provide excellent
insights, which helped shape this book.
I would especially like to thank Mr. Karun Mehta, a friend and
engineer. His expertise and tireless efforts in helping to prepare this text
are greatly appreciated.
I am very thankful to Prof. Edward Engh for reviewing the book
and providing thoughtful advice. Ed has been a wonderful friend and
colleague. I have learned a great deal from him.
I would like to thank Dr. Roger Lee, a senior professor and colleague,
for reading the initial draft and offering invaluable advice and
suggestions.
Thanks to all of my students for their input in making this book
possible. They have helped me pursue a dream filled with lifelong learning.
This book would not be a reality without them.
I am indebted to senior acquisitions editor, Scott Isenberg; director
of production, Charlene Kronstedt; marketing manager; all the reviewers and collection editors; and the publishing team at Business Expert
Press for their counsel and support during the preparation of this book.
I acknowledge the help and support of the Exeter Premedia Services
team in Chennai, India, for reviewing and editing the manuscript.
I would like to thank my parents, who always emphasized the importance of what education brings to the world. Lastly, I would like to express
a special appreciation to my wife Nilima, to my daughter Neha and her
husband David, my daughter Smita, and my son Rajeev for their love,
support, and encouragement.
Computer Software Integration, Computer Instructions, and Data Files
We wrote the book so that it is not dependent on any particular software
package; however, we have used the most widely used packages in
regression analysis and modeling. We have also included the materials
with the computer instructions in Appendix A of the book. The computer
instructions are provided for both Excel and MINITAB to facilitate
using the book. The following supplementary materials and
data files are included in separate folders:
Excel Data Files
MINITAB Data Files
APPENDIX_A: Computer Instructions for Excel and MINITAB
APPENDIX_B: Statistical Concepts for Regression Analysis
All of the preceding materials can be downloaded from the Web using
the following link:
URL: http://businessexpertpress.com/books/applied-regression-and-modeling-computer-integrated-approach
CHAPTER 1
Introduction to Regression and Correlation Analysis
Introduction
In the real world, managers are always faced with massive amounts of data
involving several different variables. For example, they may have data
on sales, advertising, or the demand for one of the several products their
company markets. The data on each of these categories (sales,
advertising, and demand) is a variable. Any time we collect data on any
entity, we call it a variable, and statistics is used to study the variation in
the data. Using statistical tools, we can also extract relationships between
different variables of interest. In dealing with different variables, often
a question arises regarding the relationship between the variables being
studied. In order to make effective decisions, it is important to know and
understand how the variables in question are related. Sometimes, when
faced with data having numerous variables, the decision-making process
is even more complicated. The objective of this text is to explore the tools
that will help the managers investigate the relationship between different
variables. The relationships are critical to making effective decisions. They
also help to predict one variable using the other variable or variables of
interest.
The relationship between two or more variables is investigated using
one of the most widely used tools: regression and correlation analysis.
Regression analysis is used to study and explain the mathematical relationship between two or more variables. By mathematical relationship we
mean whether the relationship between the variables is linear or nonlinear. Sometimes we may be interested in only two variables. For example,
we may be interested in the relationship between sales and advertising.
Companies spend millions of dollars in advertising and expect that an
[Figure 1.1: scatterplot of Sales ($) versus Advertisement ($)]
[Figure 1.2: scatterplot of Heating cost versus Avg. temp.]
positive relationship between the two variables where we can see a positive
trend. This means that an increase in one variable leads to an increase in
the other.
Figure 1.2 shows the relationship between the home heating cost and
the average outside temperature (Data File: HEAT.MTW). This plot
shows a tendency for the points to follow a straight line with a negative
slope. This means that there is an inverse or negative relationship between
the heating cost and the average temperature. As the average outside temperature increases, the home heating cost goes down. Figure 1.3 shows
a weak or no relationship between quality rating and material cost of a
product (Data File: RATING.MTW).
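The direction and strength seen in these scatterplots can also be summarized with a single number, the coefficient of correlation, which the book develops in a later chapter. The following is a minimal sketch only: the data values are hypothetical (they are not taken from HEAT.MTW or RATING.MTW), and the function name `correlation` is our own choice for illustration.

```python
import math

def correlation(x, y):
    """Pearson coefficient of correlation r for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values chosen only to mimic the trends described in the text.
avg_temp = [10, 20, 30, 40, 50, 60]
heating_cost = [380, 330, 270, 210, 160, 90]
advertising = [5.0, 7.5, 10.0, 12.5, 15.0, 17.5]
sales = [35, 48, 55, 70, 78, 95]

r_heat = correlation(avg_temp, heating_cost)   # expected to be negative
r_sales = correlation(advertising, sales)      # expected to be positive
print(f"r(avg. temp., heating cost) = {r_heat:.3f}")
print(f"r(advertising, sales)       = {r_sales:.3f}")
```

A value of r near +1 or -1 corresponds to points that fall close to a straight line with a positive or negative slope, while a value near 0 corresponds to a weak or no linear relationship.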
[Figure 1.3: scatterplot of Quality rating versus Material cost]
[Figure 1.4: scatterplot of Electricity used versus Summer temperature]
[Figure 1.5: scatterplot of Electricity used versus Summer temperature, with the regression line]
These plots demonstrate the relationship between two variables visually. The plots are very helpful in explaining the types of relationship
between the two variables and are usually the first step in studying such
relationships. The regression line shown in Figure 1.5 is known as the line
of best fit. This is the best-fitting line through the data points and is
uniquely determined using a mathematical technique known as the least
squares method. We will explain the least squares method in detail in
the subsequent chapters. In regression, the least squares method is used
to determine the best-fitting line or curve through the data points in the
scatterplot and provides the equation of the line or curve that is used in
predicting the dependent variable. For example, the electricity used for a
particular summer temperature in Figure 1.5 can be predicted using the
equation of the line.
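As a rough illustration of how the line of best fit is obtained and then used for prediction, the sketch below applies the standard least squares formulas for a straight line. The temperature and electricity values are hypothetical stand-ins for the data behind Figure 1.5, and the function name `least_squares_line` is our own.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical data: summer temperature and electricity used.
summer_temp = [75, 80, 85, 90, 95, 100, 105]
electricity = [20.5, 22.0, 24.1, 25.8, 27.9, 29.6, 31.8]

b0, b1 = least_squares_line(summer_temp, electricity)
predicted = b0 + b1 * 92  # predicted electricity used at a temperature of 92
print(f"electricity = {b0:.3f} + {b1:.3f} * temperature")
print(f"predicted electricity used at 92 degrees: {predicted:.2f}")
```

The slope and intercept are the values that minimize the sum of squared vertical distances between the data points and the line, which is why the line is uniquely determined by the data.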
In these examples, we demonstrated some cases where the relationships between the two variables of interest were linear: positive (direct)
linear and negative (inverse) linear. In a direct linear or positive relationship, the increase in the value of one variable leads to an increase in the
other. An example of this was shown earlier using the sales and advertising expenditure for a company (Figure 1.1). The inverse relationship
between the two variables shows that the increase in the value of one of
the variables leads to a decrease in the value of the other. This was demonstrated in Figure 1.2, which shows that as the average outside temperature
increases, the heating cost for homes decreases.
[Figure: plot of Yield versus Temp.; S = 897.204, R-Sq = 97.8%]
be used to predict the yield (y) for a particular temperature (x). This is an
example of nonlinear regression. The detailed analysis and explanation of
such regression models will be discussed in subsequent chapters.
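To give a feel for what fitting a curve rather than a straight line involves, the sketch below fits a second-order model y = b0 + b1*x + b2*x^2 by least squares. This is an assumption made for illustration: the text does not state which nonlinear form was fitted to the yield data, and the temperature and yield values here are hypothetical. The function name `fit_quadratic` is our own.

```python
def fit_quadratic(x, y):
    """Least squares fit of y = b0 + b1*x + b2*x^2 using the normal equations."""
    # s[k] = sum of x^k, t[k] = sum of y * x^k
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented 3x3 system: row i is [s[i], s[i+1], s[i+2] | t[i]].
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        rest = sum(a[i][j] * b[j] for j in range(i + 1, 3))
        b[i] = (a[i][3] - rest) / a[i][i]
    return b

# Hypothetical data: yield rises faster and faster as temperature increases.
temp = [50, 100, 150, 200, 250, 300]
crop_yield = [1200, 3100, 6000, 9800, 14500, 20100]

b0, b1, b2 = fit_quadratic(temp, crop_yield)
predicted = b0 + b1 * 180 + b2 * 180 ** 2  # predicted yield at temp = 180
print(f"yield = {b0:.2f} + {b1:.4f}*temp + {b2:.5f}*temp^2")
print(f"predicted yield at temp 180: {predicted:.1f}")
```

In practice a statistical package performs these computations; the sketch only shows that the same least squares idea used for a line carries over to a curve.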
Matrix Plots
[Matrix plot panels: Heating cost versus Avg. temp., House size, and Age of furnace]
Figure 1.7 Matrix plot of heating cost (y) and each of the
independent variables
The matrix plot in Figure 1.7 was developed using each Y versus
each X. From this plot, it is evident that there is a negative relationship
between the heating cost and the average temperature. This means that
an increase in the average temperature leads to decreased heating cost.
Similarly, the relationship between the heating cost and the other two
independent variables, house size and age of the furnace, is evident from
this matrix plot. Figure 1.8 shows another form of matrix plot depicting
the relationship between the home heating cost and the average outside temperature, size of the house (in thousands of square feet), and the
life of the furnace (years) by creating an array of scatterplots.
Using Figure 1.8, the relationships between heating cost and the three
independent variables can be assessed simultaneously. The plot has three columns
and three rows.
The first column and the first row in Figure 1.8 show the relationship
between the heating cost (the response variable) and one of the independent variables, average temperature. The second row shows the relationship between the heating cost and two of the independent variables, the
average temperature and the house size, while the third row in the plot
shows the relationship between the heating cost and the three independent variables. The previous visual displays are very useful in studying
the relationships among variables and creating the appropriate regression
models.
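A matrix plot shows every pairwise relationship as a picture. A numeric companion, sketched below, is a table of pairwise correlation coefficients for the same variables. This is our own illustration and not part of the book's MINITAB output; the data values are hypothetical and the helper name `pearson` is an assumption.

```python
import math

def pearson(x, y):
    """Pearson coefficient of correlation for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical observations for eight homes.
data = {
    "Heating cost": [400, 340, 300, 250, 190, 140, 90, 60],
    "Avg. temp.": [5, 15, 20, 30, 40, 45, 55, 60],
    "House size": [4.5, 3.8, 4.0, 3.0, 2.8, 2.5, 1.8, 2.0],   # thousands of sq. ft.
    "Age of furnace": [12, 9, 10, 6, 8, 4, 5, 3],             # years
}

names = list(data)
# matrix[a][b] holds the correlation between variables a and b.
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    row = "  ".join(f"{matrix[a][b]:6.2f}" for b in names)
    print(f"{a:<15}{row}")
```

Each row of the printed table plays the role of one row of scatterplots in the matrix plot: the sign of an entry matches the direction of the trend in the corresponding panel.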
[Matrix plot panels: Heating cost, Avg. temp., House size, and Age of furnace]
Figure 1.8 A matrix plot of average temp, house size, furnace age,
and heating cost
Summary
This chapter introduced a class of decision-making tools known as regression and correlation analysis. Regression models are widely used in the
real world in explaining the relationship between two or more variables.
The relationships among the variables in question are critical to making
effective decisions. They also help to predict one variable using the other
variable or variables of interest. Another tool often used in conjunction
with regression analysis is known as correlation analysis. The correlation
explains the degree of association between the two variables; that is, it
explains how strong or weak the relationship between the two variables
is. The simplest form of regression explores the relationship between two
variables and is studied using the technique of simple regression analysis. Problems involving many variables are studied using the technique
of multiple regression analysis. The objective in simple regression is to
predict one variable using the other. The variable to be predicted is known
as the dependent or response variable and the other one is known as
the independent variable or predictor. The multiple regression problem
involves one dependent and two or more independent variables. The
relationship between two quantitative variables is called a bivariate
relationship. The chapter also introduced and presented several scatterplots and matrix plots. These plots are critical in investigating the relationships between two or more variables and are very helpful in the initial
stages of constructing the correct regression models. Computer software
is almost always used in building and analyzing regression models. We
introduced some of these widely used computer packages in this chapter.
Index
Adjusted coefficient of multiple determination, 96-98
All subset regression, 176-178
Best fitting line equation, 31-32, 57-58
Bivariate relationship, 4
Coefficient of correlation
  calculating, 25
  definition, 8, 24
  examples of, 25-26
  least squares method, 38
  regression testing, 47-48
  using MINITAB, 25
Coefficient of determination, 36-38, 53-54, 58-59
Coefficient of multiple determination, 95-96
Confidence intervals, 39-40, 59-62, 109-114
COOK1 (Cook's Distance), 74-76
Correlation analysis, 2
Correlation coefficient. See Coefficient of correlation
Covariance
  calculating, 25
  definition, 8, 23
  interpretation of, 23-24
  limitations of, 24
  using MINITAB, 25
Dependent variable, 2, 4, 13-14
Dummy variables, 149-150, 154-166
Durbin-Watson statistic tests, 76-79
Estimated multiple regression equation, 83
Estimated regression equation, 16-17, 64
Excel, 52-55, 140-141, 159, 163
  normal equations, 22
  regression equation, 15
  simple linear regression method, 15-17
Regression model assumptions
  equality of variance assumption, 65, 67
  independence of errors, 65-66
  linearity assumption, 66-67
  normality assumption, 65, 67
  population regression model, 65
  residual analysis, 67-72
Regression testing
  correlation coefficient, 47-48
  F-test, 45-47
  t-test, 43-45
Residual analysis
  calculating and storing residuals and standardized residuals using MINITAB, 68-70
  using MINITAB residual plots, 70-72
Scatterplots
  best-fitting curve, 89
  with correlations, 26
  definition, 4
  heating cost vs. temperature, 5
  hours vs. units, 31
  multiple regression computer analysis, 88-91
  quality rating vs. material cost, 6
  with regression line, 7
  sales vs. advertisement, 5
  sales vs. advertisement expenditures, 18-19
  summer temperature vs. electricity used, 6
  x and y, nonlinear relationship, 89
Second-order model
  computer results, 137, 142-143