PREDICTIVE ANALYTICS
USING REGRESSION
Sumeet Gupta
Associate Professor
Indian Institute of Management Raipur
Outline
Basic Concepts
Applications of Predictive Modeling
Linear Regression in One Variable using OLS
Multiple Linear Regression
Assumptions in Regression
Explanatory Vs Predictive Modeling
Performance Evaluation of Predictive Models
Practical Exercises
Case: Nils Baker
Case: Pedigree Vs Grit
BASIC CONCEPTS
Predictive Modeling: Applications
Predicting customer activity on credit cards from their
demographic and historical activity patterns
Predicting the time to failure of equipment based on
utilization and environmental conditions
Predicting expenditures on vacation travel based on
historical frequent flyer data
Predicting staffing requirements at help desks based on
historical data and product and sales information
Predicting sales from cross selling of products from
historical information
Predicting the impact of discounts on sales in retail outlets
Basic Concept: Relationships
Examples of relationships:
Sales and earnings
Cost and number produced
Microsoft and the stock market
Effort and results
Scatterplot
A picture to explore the relationship in bivariate data
Correlation r
Measures strength of the relationship (from −1 to 1)
Regression
Predicting one variable from the other
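A minimal Python sketch (not from the original slides) tying these three ideas together on hypothetical data; numpy and matplotlib are assumed to be available:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g., effort
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # e.g., results

r = np.corrcoef(x, y)[0, 1]               # correlation r, between -1 and 1
slope, intercept = np.polyfit(x, y, 1)    # least-squares line for predicting y from x

plt.scatter(x, y)                         # scatterplot of the bivariate data
plt.plot(x, intercept + slope * x)        # fitted regression line
plt.xlabel("X"); plt.ylabel("Y")
print(f"r = {r:.3f}; predicted Y = {intercept:.2f} + {slope:.2f} X")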
Basic Concept: Correlation
r = 1
A perfect straight line
tilting up to the right
r = 0
No overall tilt
No relationship?
r = −1
A perfect straight line
tilting down to the right
Basic Concepts: Simple Linear Model
Linear Model for the Population
The foundation for statistical inference in regression
Observed Y is a straight line, plus randomness
Y = α + βX + ε
where α + βX is the population relationship (on average), and ε is the randomness of individuals
Basic Concepts: Simple Linear Model
Time Spent vs. Internet Pages Viewed
Two measures of the abilities of 25 Internet sites
At the top right are eBay, Yahoo!, and MSN
Correlation is r = 0.964
Linear relationship
Straight line
with scatter
Increasing relationship
Tilts up and to the right
Very strong positive association (since r is close to 1)
[Scatterplot: Pages per person (x) vs. Minutes per person (y); eBay, Yahoo!, and MSN at the top right]
Basic Concepts: Simple Linear Model
Dollars vs. Deals
For mergers and acquisitions by investment bankers
244 deals worth $756 billion by Goldman Sachs
Correlation is r = 0.419
Positive association
Straight line
with scatter
Increasing relationship
Tilts up and to the right
[Scatterplot: Deals (x) vs. Dollars in billions (y), linear relationship]
Basic Concepts: Simple Linear Model
Interest Rate vs. Loan Fee
For mortgages
If the interest rate is lower, does the bank make it up with a higher loan
fee?
Correlation is r = −0.890
Linear relationship
Straight line
with scatter
Decreasing relationship
Tilts down and to the right
Strong negative association
[Scatterplot: Loan fee (x) vs. Interest rate (y)]
Basic Concepts: Simple Linear Model
Today's vs. Yesterday's Percent Change
Is there momentum?
If the market was up yesterday, is it more likely to be up today? Or is
each day's performance independent?
Correlation is r = 0.11
No relationship? Tilt is neither up nor down
A weak relationship?
[Scatterplot: Yesterday's change (x) vs. Today's change (y)]
Basic Concepts: Simple Linear Model
Call Price vs. Strike Price
For stock options
Call Price is the price of the option contract to buy stock at the
Strike Price
The right to buy at a lower strike price has more value
A nonlinear relationship
Not a straight line: a curved relationship
Correlation r = −0.895
A negative relationship: higher strike price goes with lower call price
[Scatterplot: Strike Price (x) vs. Call Price (y)]
Basic Concepts: Simple Linear Model
Output Yield vs. Temperature
For an industrial process
With a best (optimal) temperature setting
A nonlinear relationship
Not a straight line: a curved relationship
Correlation r = 0.0155
r suggests no relationship, but the relationship is strong
It tilts neither up nor down
[Scatterplot: Temperature (x) vs. Yield of process (y)]
Basic Concepts: Simple Linear Model
Circuit Miles vs. Investment (lower left)
For telecommunications firms
A relationship with unequal variability
More vertical variation at the right than at the left
Variability is stabilized by taking logarithms (lower right)
Correlation r = 0.820 for circuit miles vs. investment; r = 0.957 after taking logarithms of both variables
[Two scatterplots: Investment ($millions) vs. Circuit miles (millions), and log of investment vs. log of miles]
Basic Concepts: Simple Linear Model
Price vs. Coupon Payment
For trading in the bond market
Bonds paying a higher coupon generally cost more
Two clusters are visible
Ordinary bonds (value is from coupon)
Inflation-indexed bonds (payout rises with inflation)
Correlation r = 0.950 for all bonds; r = 0.994 for ordinary bonds only
[Scatterplot: Coupon rate (x) vs. Bid price (y), showing the two clusters]
Basic Concepts: Simple Linear Model
Cost vs. Number Produced
For a production facility
It usually costs more to produce more
An outlier is visible
A disaster (a fire at the factory)
The outlier has high cost, but few produced
With the outlier included, r = 0.623
Outlier removed: more detail is visible, r = 0.869
[Two scatterplots: Number produced (x) vs. Cost (y), with and without the outlier]
Basic Concepts: OLS Modeling
Salary vs. Years Experience
For n = 6 employees
Linear (straight line) relationship
Increasing relationship
higher salary generally goes with higher experience
Data (Experience in years, Salary in $thousand):
Experience: 15, 10, 20, 5, 15, 5
Salary:     30, 35, 55, 22, 40, 27
Correlation r = 0.8667
[Scatterplot: Experience (x) vs. Salary ($thousand) (y)]
Basic Concepts: OLS Modeling
Summarizes bivariate data: Predicts Y from X
with smallest errors (in vertical direction, for Y axis)
Intercept is 15.32 salary (at 0 years of experience)
Slope is 1.673 salary (for each additional year of experience, on
average)
[Scatterplot with least-squares line: Experience (X) vs. Salary (Y, $thousand)]
Basic Concepts: OLS Modeling
Predicted Value comes from Least-Squares Line
For example, Mary (with 20 years of experience)
has predicted salary 15.32+1.673(20) = 48.8
So does anyone with 20 years of experience
Residual is actual Y minus predicted Y
Mary's residual is 55 − 48.8 = 6.2
She earns about $6,200 more than the predicted salary for a person
with 20 years of experience
A person who earns less than predicted will have a negative residual
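A minimal Python sketch (numpy assumed) reproducing the salary example above: fitting the least-squares line to the six data points and computing Mary's predicted salary and residual:

import numpy as np

experience = np.array([15, 10, 20, 5, 15, 5], dtype=float)
salary = np.array([30, 35, 55, 22, 40, 27], dtype=float)   # $thousand

b, a = np.polyfit(experience, salary, 1)   # slope b ≈ 1.673, intercept a ≈ 15.32
predicted_mary = a + b * 20                # ≈ 48.8 for 20 years of experience
residual_mary = 55 - predicted_mary        # Mary earns 55, so residual ≈ 6.2
print(f"a = {a:.2f}, b = {b:.3f}, predicted = {predicted_mary:.1f}, residual = {residual_mary:.1f}")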
Basic Concepts: OLS Modeling
Mary's residual is 6.2
[Scatterplot with least-squares line: Mary earns 55 thousand; her predicted value is 48.8; Experience (x) vs. Salary (y)]
Basic Concepts: OLS Modeling
Standard Error of Estimate
Se = SY √[(1 − r²)(n − 1)/(n − 2)]
Approximate size of prediction errors (residuals)
Actual Y minus predicted Y: Y − [a + bX]
Example (Salary vs. Experience)
Se = 11.686 × √[(1 − 0.8667²)(6 − 1)/(6 − 2)] = 6.52
Predicted salaries are about 6.52 (i.e., $6,520) away from actual
salaries
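A short sketch (numpy assumed) checking the standard error of estimate two equivalent ways for the same salary data: via the formula above and directly from the residuals:

import numpy as np

x = np.array([15, 10, 20, 5, 15, 5], dtype=float)
y = np.array([30, 35, 55, 22, 40, 27], dtype=float)
n = len(y)

r = np.corrcoef(x, y)[0, 1]
s_y = y.std(ddof=1)                                          # sample standard deviation of Y (11.686)
se_formula = s_y * np.sqrt((1 - r**2) * (n - 1) / (n - 2))   # ≈ 6.52

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)                                  # actual Y minus predicted Y
se_residuals = np.sqrt((residuals**2).sum() / (n - 2))       # same value, ≈ 6.52
print(se_formula, se_residuals)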
Basic Concepts: OLS Modeling
Interpretation: similar to standard deviation
Can move Least-Squares Line up and down by Se
About 68% of the data are within one standard error of estimate
of the least-squares line
(For a bivariate normal distribution)
[Scatterplot: least-squares line with bands one Se above and below; Experience (x) vs. Salary (y)]
Multiple Linear Regression
Linear Model for the Population
Y = (α + β1 X1 + β2 X2 + … + βk Xk) + ε
= (Population relationship) + Randomness
where ε has a normal distribution with mean 0 and constant
standard deviation σ, and this randomness is independent from one
case to another
An assumption needed for statistical inference
Multiple Linear Regression: Results
Intercept: a
Predicted value for Y when every X is 0
Regression Coefficients: b1, b2, …, bk
The effect of each X on Y, holding all other X variables constant
Prediction Equation or Regression Equation
(Predicted Y) = a + b1 X1 + b2 X2 + … + bk Xk
The predicted Y, given the values for all X variables
Prediction Errors or Residuals
(Actual Y) − (Predicted Y)
Multiple Linear Regression: Results
t Tests for Individual Regression Coefficients
Significant or not significant, for each X variable
Tests whether a particular X variable has an effect on Y, holding the
other X variables constant
Should be performed only if the F test is significant
Standard Errors of the Regression Coefficients
Sb1, Sb2, …, Sbk
(with n − k − 1 degrees of freedom)
Indicates the estimated sampling standard deviation of each
regression coefficient
Used in the usual way to find confidence intervals and hypothesis
tests for individual regression coefficients
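A minimal sketch of how these quantities appear when fitting a multiple regression in Python; the statsmodels library and the synthetic data here are assumptions, not part of the original slides:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 55, 3
X = rng.normal(size=(n, k))                        # k X variables
y = 4.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=2.0, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()        # add_constant supplies the intercept a
print(model.params)                                # a, b1, ..., bk
print(model.bse)                                   # standard errors Sb1, ..., Sbk (n - k - 1 df)
print(model.tvalues, model.pvalues)                # t tests for individual coefficients
residuals = y - model.predict(sm.add_constant(X))  # (Actual Y) - (Predicted Y)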
Multiple Linear Regression: Results
Predicted Page Costs for Audubon
= a + b1 X1 + b2 X2 + b3 X3
= $4,043 + 3.79(Audience) − 124(Percent Male)
+ 0.903(Median Income)
= $4,043 + 3.79(1,645) − 124(51.1) + 0.903(38,787)
= $38,966
Actual Page Costs are $25,315
Residual is $25,315 − $38,966 = −$13,651
Audubon has Page Costs $13,651 lower than you would expect for
a magazine with its characteristics (Audience, Percent Male, and
Median Income)
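The Audubon calculation above, written out as a small Python check (values taken from this slide; the negative sign on the Percent Male coefficient follows from the arithmetic):

a, b1, b2, b3 = 4043, 3.79, -124, 0.903
audience, percent_male, median_income = 1645, 51.1, 38787

predicted_page_costs = a + b1 * audience + b2 * percent_male + b3 * median_income
residual = 25315 - predicted_page_costs                # actual minus predicted
print(round(predicted_page_costs), round(residual))    # about 38966 and about -13651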
Standard Error
Standard Error of Estimate Se
Indicates the approximate size of the prediction errors
About how far are the Y values from their predictions?
For the magazine data
Se = S = $21,578
Actual Page Costs are about $21,578 from their predictions for this
group of magazines (using regression)
Compare to SY = $45,446: Actual Page Costs are about $45,446 from
their average (not using regression)
Using the regression equation to predict Page Costs (instead of simply
using the average, Ȳ), the typical error is reduced from $45,446 to $21,578
Coeff. of Determination
The strength of association is measured by the square of the multiple
correlation coefficient, R2, which is also called the coefficient of
multiple determination.
R² = SSreg / SSy
R2 is adjusted for the number of independent variables and the sample
size by using the following formula:
Adjusted R² = R² − k(1 − R²)/(n − k − 1)
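A one-function sketch of the adjusted R² formula above, checked against the magazine-data values used on the surrounding slides (R² = 0.787 with k = 3 predictors and n = 55 magazines):

def adjusted_r2(r2, n, k):
    # Adjusted R² = R² - k(1 - R²) / (n - k - 1)
    return r2 - k * (1 - r2) / (n - k - 1)

print(adjusted_r2(0.787, n=55, k=3))   # ≈ 0.774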
Coeff. of Determination
Coefficient of Determination R2
Indicates the percentage of the variation in Y that is explained by
(or attributed to) all of the X variables
How well do the X variables explain Y?
For the magazine data
R2 = 0.787 = 78.7%
The X variables (Audience, Percent Male, and Median Income) taken
together explain 78.7% of the variance of Page Costs
This leaves 100% − 78.7% = 21.3% of the variation in Page Costs
unexplained
The F test
Is the regression significant?
Do the X variables, taken together, explain a significant amount of
the variation in Y?
The null hypothesis claims that, in the population, the X variables
do not help explain Y; all coefficients are 0
H0: β1 = β2 = … = βk = 0
The research hypothesis claims that, in the population, at least
one of the X variables does help explain Y
H1: At least one of β1, β2, …, βk ≠ 0
The F test
H0: R²pop = 0
This is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = … = βk = 0
The overall test can be conducted by using an F statistic:
F = [SSreg / k] / [SSres / (n − k − 1)] = [R² / k] / [(1 − R²) / (n − k − 1)]
which has an F distribution with k and (n − k − 1) degrees of freedom.
Performing the F test
Three equivalent methods for performing the F test; they
always give the same result
Use the p-value
If p < 0.05, then the test is significant
Same interpretation as p-values in Chapter 10
Use the R2 value
If R2 is larger than the value in the R2 table, then the result is significant
Do the X variables explain more than just randomness?
Use the F statistic
If the F statistic is larger than the value in the F table, then the result is
significant
Example: F test
For the magazine data, the X variables (Audience, Percent
Male, and Median Income) explain a very highly significant
percentage of the variation in Page Costs
The p-value, listed as 0.000, is less than 0.0005, and is therefore
very highly significant (since it is less than 0.001)
The R2 value, 78.7%, is greater than 27.1% (from the R2 table at
level 0.1% with n = 55 and k = 3), and is therefore very highly
significant
The F statistic, 62.84, is greater than the value (between 7.054
and 6.171) from the F table at level 0.1%, and is therefore very
highly significant
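A small sketch computing the F statistic from R², as in the formula two slides back; using the rounded R² = 0.787 reproduces the reported F up to rounding (62.8 vs. 62.84). scipy is assumed to be available for the p-value:

from scipy import stats

r2, k, n = 0.787, 3, 55
F = (r2 / k) / ((1 - r2) / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)   # upper-tail p-value of the F distribution
print(round(F, 1), p_value)             # ≈ 62.8, p far below 0.0005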
t Tests
A t test for each regression coefficient
To be used only if the F test is significant
If F is not significant, you should not look at the t tests
Does the jth X variable have a significant effect on Y, holding the
other X variables constant?
Hypotheses are
H0: βj = 0
H1: βj ≠ 0
Test using the confidence interval
bj ± t Sbj
use the t table with n − k − 1 degrees of freedom
Or use the t statistic
t statistic = bj / Sbj
compare to the t table value with n − k − 1 degrees of freedom
Example: t Tests
Testing b1, the coefficient for Audience
b1 = 3.79, t = 13.5, p = 0.000
Audience has a very highly significant effect on Page Costs, after
adjusting for Percent Male and Median Income
Testing b2, the coefficient for Percent Male
b2 = −124, t = −0.90, p = 0.374
Percent Male does not have a significant effect on Page Costs, after
adjusting for Audience and Median Income
Testing b3, the coefficient for Median Income
b3 = 0.903, t = 2.44, p = 0.018
Median Income has a significant effect on Page Costs, after adjusting
for Audience and Percent Male
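A small sketch of the confidence-interval form of the t test for the Audience coefficient; the standard error is backed out as Sb1 = b1 / t from the values on this slide, and scipy supplies the t table value:

from scipy import stats

b1, t_stat = 3.79, 13.5
sb1 = b1 / t_stat                        # ≈ 0.28, standard error of the coefficient
df = 55 - 3 - 1                          # n - k - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df)          # two-sided 95% critical value, ≈ 2.01
ci = (b1 - t_crit * sb1, b1 + t_crit * sb1)
print(ci)                                # interval excludes 0, so the effect is significant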
Assumptions in Regression
Assumptions underlying the statistical techniques
should be tested twice
First for the separate variables
Second for the multivariate model's variate, which acts
collectively for the variables in the analysis and thus must
meet the same assumptions as the individual variables; the details
differ across multivariate techniques
Assumptions in Regression
Linearity
The independent variable has a linear relationship with the dependent
variable
Normality
The residuals or the dependent variable follow a normal distribution
Multicollinearity
When some X variables are too similar to one another
Homoskedasticity
The variability in Y values for a given set of predictors is the same
regardless of the values of the predictors
Independence among cases (Absence of correlated errors)
The cases are independent of each other
Assumptions in Regression
Normality
The residuals or the dependent variable follow a normal
distribution
If the variation from normality is significant then all
statistical tests are invalid
Graphical Analysis
Histogram and Normal probability plot
Peaked and Skewed distribution result in non-normality
Statistical Analysis
If Z value exceeds critical value, then the distribution is non-
normal
Kolmogorov-Smirnov test; Shapiro-Wilk test
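A minimal sketch of the two statistical tests named above, applied to (synthetic) residuals; scipy is assumed to be available:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(size=200)               # replace with the regression residuals

w_stat, p_shapiro = stats.shapiro(residuals)   # Shapiro-Wilk test
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, p_ks = stats.kstest(z, "norm")        # Kolmogorov-Smirnov test (approximate when
                                               # mean and sd are estimated from the data)
print(p_shapiro, p_ks)                         # small p-values suggest non-normality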
[Figure: Normality]
Assumptions in Regression
Homoskedasticity
Assumption related primarily to dependence
relationships between variables
Assumption that the dependent variable(s) exhibit
equal levels of variance across the range of predictor
variable(s).
The variance of the dependent variable should not
be concentrated in only a limited range of the
independent values
Common sources: the type of variable, or a skewed distribution of the variable
Assumptions in Regression
Homoskedasticity
Graphical Analysis
Analysis of residuals in case of Regression
Statistical Analysis
Variances within groups formed by non-metric variables
Levene Test
Box's M Test
Remedy
Data Transformation
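A minimal sketch of the Levene test mentioned above: comparing the variance of a metric variable across groups formed by a non-metric variable (hypothetical data; scipy assumed):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=50, scale=12, size=40)   # deliberately larger spread

stat, p_value = stats.levene(group_a, group_b)
print(p_value)   # a small p-value indicates unequal variances (heteroskedasticity)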
[Figure: Homoskedasticity, graphical analysis of residuals]
Assumptions in Regression
Linearity
Assumption for all multivariate techniques based on
correlational measures such as
multiple regression,
logistic regression,
factor analysis, and
structural equation modeling
Correlation represents only the linear association
between variables
Identification
Scatterplots or examination of residuals using regression
Remedy
Data Transformations
[Figure: Linearity]
Assumptions in Regression
Absence of Correlated Errors
Prediction errors should not be correlated with each
other
Identification
The most likely cause is the data collection process, for example
data collected from two separate groups
Remedy
Include the omitted causal factor in the multivariate analysis
Assumptions in Regression
Multicollinearity
Multicollinearity arises when intercorrelations among the predictors
are very high.
Multicollinearity can result in several problems, including:
The partial regression coefficients may not be estimated precisely.
The standard errors are likely to be high.
The magnitudes as well as the signs of the partial regression
coefficients may change from sample to sample.
It becomes difficult to assess the relative importance of the
independent variables in explaining the variation in the dependent
variable.
Predictor variables may be incorrectly included or removed in
stepwise regression.
Assumptions in Regression
Multicollinearity
The ability of an independent variable to improve the prediction of the
dependent variable is related not only to its correlation to the
dependent variable, but also to the correlation(s) of the additional
independent variable to the independent variable(s) already in the
regression equation
Collinearity is the association, measured as the correlation,
between two independent variables
Multicollinearity refers to the correlation among three or more
independent variables
Impact
Reduces any single IV's predictive power by the extent to which it is
associated with the other independent variables
Assumptions in Regression
Multicollinearity
Measuring Multicollinearity
Tolerance
Amount of variability of the selected independent variable not explained
by the other independent variables
Tolerance Values should be high
Cut-off is 0.1 but greater than 0.5 gives better results
VIF
Inverse of Tolerance
Should be low (typically below 2.0 and usually below 10)
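A minimal sketch of tolerance and VIF; statsmodels is assumed, and two of the three predictors are made deliberately collinear to show a high VIF:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)     # nearly a copy of x1
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for j in range(1, X.shape[1]):                # skip the constant column
    vif = variance_inflation_factor(X, j)
    print(f"x{j}: VIF = {vif:.1f}, tolerance = {1.0 / vif:.3f}")   # tolerance = 1 / VIF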
Assumptions in Regression
Multicollinearity
Remedy for Multicollinearity
A simple procedure for adjusting for multicollinearity consists of
using only one of the variables in a highly correlated set of
variables.
Omit highly correlated independent variables and identify other
independent variables to help the prediction
Alternatively, the set of independent variables can be transformed into
a new set of predictors that are mutually independent by using
techniques such as principal components analysis.
More specialized techniques, such as ridge regression and latent root
regression, can also be used.
Assumptions in Regression
Data Transformations
To correct violations of the statistical assumptions
underlying the multivariate techniques
To improve the relationship between variables
Transformation to achieve Normality and
Homoscedasticity
Flat distribution → inverse transformation
Negatively skewed distribution → square root transformation
Positively skewed distribution → logarithmic transformation
If the residuals in regression are cone shaped:
Cone opens to the right → inverse transformation
Cone opens to the left → square root transformation
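A small sketch of the transformations listed above, applied with numpy to a hypothetical positively skewed variable; which transformation to use depends on the distribution or residual pattern as described:

import numpy as np

y = np.array([1.2, 2.5, 3.1, 8.4, 20.0, 55.0])   # hypothetical skewed data

log_y = np.log(y)      # positively skewed distribution -> logarithmic transformation
sqrt_y = np.sqrt(y)    # negatively skewed distribution -> square root transformation
inv_y = 1.0 / y        # flat distribution, or residual cone opening to the right -> inverse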
[Figure: Data transformations to achieve linearity]
[Figure: Data transformations]
Assumptions in Regression
General guidelines for transformation
For a noticeable effect of transformation, the ratio of a variable's
mean to its standard deviation should be less than 4.0
When the transformation can be performed on either of the two
variables, select the variable with the smaller mean-to-standard-deviation ratio
Transformation should be applied to independent variables
except in case of heteroscedasticity
Heteroscedasticity can only be remedied by transformation of
the dependent variable in a dependent relationship
If the heteroscedastic relationship is also non-linear the
dependent variable and perhaps the independent variables must
be transformed
Transformations may change the interpretation of the variables
Issues in Regression
Variable Selection
How to choose from a long list of X variables?
Too many: waste the information in the data
Too few: risk ignoring useful predictive information
Model Misspecification
Perhaps the multiple regression linear model is wrong
Unequal variability? Nonlinearity? Interaction?
EXPLANATORY VS
PREDICTIVE MODELING
Explanatory Vs Predictive Modeling
A good explanatory model fits the existing data closely, whereas a good
predictive model predicts new cases accurately
Explanatory models use the entire dataset to estimate the
best-fit model and to maximize explained variance (R²)
Predictive models are estimated on a training set and
assessed on new, unobserved data
Performance measures for explanatory models assess how closely
the model fits the data, whereas for predictive models
performance is measured by predictive accuracy on new data
Performance Evaluation
Prediction error for observation i = actual y value − predicted y value
Popular numerical measures of predictive accuracy
MAE or MAD (Mean absolute error / deviation)
Average Error
MAPE (Mean Absolute Percentage Error)
Performance Evaluation
RMSE (Root mean squared error)
Total SSE (total sum of squared errors)
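A minimal sketch (numpy assumed) computing these accuracy measures on a hypothetical holdout set of actual vs. predicted values:

import numpy as np

actual = np.array([120.0, 95.0, 140.0, 80.0, 110.0])
predicted = np.array([112.0, 101.0, 128.0, 86.0, 115.0])
errors = actual - predicted                        # prediction error per observation

mae = np.mean(np.abs(errors))                      # mean absolute error / deviation
average_error = np.mean(errors)                    # signed average error (bias)
mape = np.mean(np.abs(errors / actual)) * 100      # mean absolute percentage error
rmse = np.sqrt(np.mean(errors**2))                 # root mean squared error
sse = np.sum(errors**2)                            # total sum of squared errors
print(mae, average_error, mape, rmse, sse)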
CASE
Case: Pedigree Vs Grit
Why does a low R² not make the regression useless?
Describe a situation in which a useless regression has a high R2.
Check the validity of the linear regression model assumptions.
Estimate the excess returns of Bob's and Putney's funds. Between them, who
is expected to obtain higher returns at their current funds, and by how much?
If hired by the firm, who is expected to obtain higher returns and by how
much?
Can you prove at the 5% level of significance that Bob would get higher
expected returns if he had attended Princeton instead of Ohio State?
Can you prove at the 10% level of significance that Bob would get at least 1%
higher expected returns by managing a growth fund?
Is there strong evidence that fund managers with an MBA perform worse than
fund managers without an MBA? What is held constant in this comparison?
Based on your analysis of the case, which candidate do you support for
AMBTPM's job opening: Bob or Putney? Discuss.
Case: Nils Baker
Is the presence of a physical Bank Branch creating
demand for checking accounts?
Thank You