Linear Regression in R
Danilo Bianchi
Introduction
Learning Objectives
Problem Definition
Exploratory Data Analysis (EDA)
Univariate Model
Residuals Diagnostics
Residuals vs Fitted
Normal Q-Q Plot
Scale-Location
Residuals vs Leverage
Multi-variate Model
Collinearity Test
Model Search
Exercises
Introduction
Learning Objectives
1. Apply a linear regression in R
2. Interpret the linear model outputs in R
3. Assess the linear model validity (perform diagnostic tests)
4. Choose significant variables for the model
5. Compare and choose the best alternative from a selection of models
Problem Definition
We have the credit data for several clients of a bank.
We’d like to know what characteristics of a client bring in high revenue, so we know whom to target.
It would also be nice to understand what kinds of clients bring in low revenue, so we can develop strategies to
increase it.
We first load the data and get a high-level summary (remember to set the right working directory with setwd).
The summary command gives an overview of all the variables and their distribution.
For numeric variables, summary will give the minimum and maximum values, as well as the mean, the median, and
the first and third quartiles.
For categorical variables, summary will give the number of occurrences in each category.
# Load data
credit <- read.csv("Credit_Data.csv")
# Get summary for each variable
summary(credit)
We see that the variable Default is numeric although it should be categorical (Yes/No). Let’s fix this.
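The conversion chunk is not shown here; a minimal sketch, assuming Default is coded as 0/1 in the raw file (the DefaultYes coefficient later in the document suggests the labels No/Yes):

# Recode Default from numeric to a categorical factor
# (assumption: 0 = No, 1 = Yes in the raw data)
credit$Default <- factor(credit$Default, levels = c(0, 1), labels = c("No", "Yes"))
summary(credit$Default)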
Exploratory Data Analysis (EDA)
While the summary helped us understand our data, it won’t get us far in understanding what drives
revenue. For this, it’s helpful to plot every variable against revenue to see if we can find any visual pattern.
Moreover, because a linear model is sensitive to multicollinearity (high correlation between explanatory variables)
it would also be valuable to see the correlation between each variable and all the others. This need is so common
that R already has a function to do it for us.
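The plotting chunk is not shown here; a minimal sketch using base R (pairs for the scatterplot matrix is an assumption about the original approach):

# Scatterplot matrix of every variable against every other
pairs(credit)
# Correlation table for the numeric columns only
cor(credit[sapply(credit, is.numeric)])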
The credit amount seems to have a fairly good correlation to Revenue. Let’s look at both more closely and build a
linear regression model between them.
Univariate Model
Let’s create a linear regression using “Revenue” as the dependent variable and “Credit Amount” as the
explanatory (independent) variable.
The R function lm lets us do this by specifying the dependent variable, followed by a tilde (~), and then the
explanatory variables. To simplify, the function lets us specify the data frame we want to use as a source so we
don’t need to repeat it for every variable.
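We fit the model, store it in fit1 (the object name used throughout this document), and print its summary:

# Fit a univariate linear model: Revenue explained by Credit Amount
fit1 <- lm(Revenue ~ CredAmt, data = credit)
summary(fit1)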
##
## Call:
## lm(formula = Revenue ~ CredAmt, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.530 -12.946 0.615 13.730 39.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.490e+01 8.623e-01 17.28 <2e-16 ***
## CredAmt 2.975e-02 1.996e-04 149.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.81 on 998 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9569
## F-statistic: 2.221e+04 on 1 and 998 DF, p-value: < 2.2e-16
Now that we have the model, we need to know how to interpret the output.
Call shows the formula used to fit the model. This is very useful when dealing with several models.
Residuals (y - yhat) are the differences between the real values of the dependent variable (y) and the modeled
(aka predicted) values obtained from the independent variables (yhat). Because we have one residual for each data
point, the output shows the min, max, and quartiles of the residuals distribution. This should give us a notion of
whether the residuals are normally distributed. Note that the mean of the residuals should be zero
< mean(resid(fit1)) = 1.9850688e-16 >!
Coefficients: Gives information on the value (Estimate), error (Std. Error), and significance (Pr(>|t|)) for each of
the coefficients considered in the model.
Signif. codes is the legend for the markup next to the significance of the coefficients.
Residual standard error (RSE) is the estimate of the standard deviation of the residuals. That is, in a linear
regression, the residuals are assumed to be normally distributed with mean zero (see Residuals above) and
some standard deviation. The RSE is an estimate of that standard deviation < sd(resid(fit1)) = 17.8006252 >.
The smaller the RSE, the lower the variance between the real and the modeled values, i.e. the better the
model.
Multiple R-squared is the measure of how much variance is explained by the model. If the R-squared is 1, all the
variance is explained by the model. The multiple R-squared always increases as we add more variables into the
model.
Adjusted R-squared is the R-squared adjusted for the complexity of the model; it takes into account the number of
variables in the model. We should use this measure to compare models with different numbers of variables. It may
decrease if an additional variable does not contribute to explaining the variance.
F-statistic gives the significance of the model as a whole rather than of a single variable. It tests the null hypothesis
that all the model coefficients are zero. The F-statistic is the ratio of the explained to the unexplained variance (< anova(fit1) >).
(For more help interpreting the output, check the Stack Exchange discussion on R output
(http://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output) and UCLA’s Annotated
Regression Output in SAS (http://www.ats.ucla.edu/stat/sas/output/reg.htm).)
Perhaps it’s better to understand these concepts in a plot. We can use the lm object (fit1) we created to retrieve
the relevant information from the model and create the plot (for more on plot formats click here
(http://www.statmethods.net/advgraphs/parameters.html)).
# assign axes
y <- credit$Revenue
x <- credit$CredAmt
yhat <- fitted(fit1) # modeled (predicted) values from fit1
# Generate scatterplot
plot(x, y, # x vs y axes
xlab = "Independent (Credit Amount)", # x label
ylab = "Dependent (Revenue)", # y label
pch = 21, # point type (21 = filled circle)
col = "black", # circle color
bg = "gray", # fill color
cex = 1, # size (x default)
frame = TRUE)
# Add the fitted regression line
abline(fit1, col = "red", lwd = 2)
# Plot residuals as vertical segments between each point and its fitted value
n <- length(y)
for (i in 1:n)
lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "blue", lwd = 0.5, lty = 1)
From the chart above, we can see some of the relevant elements in a linear model: the fitted line and the residuals
as vertical distances from each point to that line.
Now that we can understand and interpret what our model is telling us, we need to assess whether it is a valid model.
Residuals Diagnostics
In a model diagnosis, we check the assumptions underlying the linear regression. We’d also like to verify whether
there are any influential points in our dataset.
There’s a quick command that will generate four diagnostic plots to let us do both.
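That command is plot applied directly to the lm object; arranging the four plots in a single 2x2 grid is our addition:

# Generate the four standard diagnostic plots for the model
par(mfrow = c(2, 2)) # arrange plots in a 2x2 grid
plot(fit1)
par(mfrow = c(1, 1)) # reset the plotting layout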
Residuals vs Fitted
We assumed that the residuals were homogeneously distributed around zero (homoscedasticity). That is to say, there
shouldn’t be any pattern or dependence of the residuals on the predicted values. We can verify this by plotting the
residuals against the predicted values.
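This is the first of the four diagnostic plots and can also be drawn on its own:

# generate residuals vs fitted plot
plot(fit1, which = 1)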
In this graph, we see that the residuals are fairly evenly distributed around zero and that there’s no pattern in how
they’re distributed.
Normal Q-Q Plot
We also assumed that the residuals are normally distributed. The normal Q-Q plot checks this by comparing the
quantiles of the standardized residuals against the quantiles of a theoretical normal distribution.
# generate qq-plot
plot(fit1,which=2)
In our model, the Q-Q plot lies fairly well on a straight line, deviating at the extremes. This indicates the residuals
are normally distributed with slightly long tails. We can compare the Q-Q plot with a histogram of the residuals.
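The histogram chunk is not shown here; a minimal sketch:

# Histogram of the residuals; it should look roughly bell-shaped
hist(resid(fit1), breaks = 30, main = "Histogram of Residuals",
xlab = "Residuals")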
If the normality assumption fails, the R-squared and the coefficient estimates would still be right. However, we
wouldn’t be able to trust the standard errors and the p-values.
Scale-Location
This graph is similar to the Residuals vs. Fitted plot. The difference is that the residuals are standardized, so the
y-axis is always positive and measures (on a square-root scale) how many standard deviations a residual falls from
the center of the residuals distribution.
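Following the pattern of the other diagnostics, this plot can be drawn on its own:

# generate scale-location plot
plot(fit1, which = 3)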
In this plot, we can detect outliers, which would show large standardized residuals (far from zero). It also
enables us to check again for heteroscedasticity by looking for any pattern. In this case, the red line seems fairly
well-behaved.
Residuals vs Leverage
With this plot, we have a much better view of the most influential points in our dataset.
To understand what an influential point is, we first need to define leverage and outliers.
Leverage is a measure of how much weight a point has on the linear fit. In other words, it measures how much
the fitted value at that point would change given a one-unit change in the y-value of that point. High-leverage
points fall far from the mean of the predictor (i.e. very high/low x compared to the rest of the dataset).
Outliers are data points with large residuals (i.e. very high/low y compared to similar x’s).
Influence is the measure of how much the coefficients would change if a certain point were removed. Highly
influential points are outliers with high leverage.
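The Residuals vs Leverage plot is the fifth of the lm diagnostics and can be drawn on its own:

# generate residuals vs leverage plot
plot(fit1, which = 5)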
The Residuals vs Leverage plot helps us identify influential points by showing their leverage and residuals.
Outliers are points with a high standardized residual (measured in standard deviations); these are shown with a
y-value far from zero. Points with high leverage are shown with an x-value far from zero. Thus, influential points are
those with both high x- and y-values, located in the top/bottom right-hand corners of the graph.
To help us identify influential points, this plot also shows the Cook’s distance (a measure of influence) for those
points as dashed red lines. Points leaving the boundaries marked by those lines can be considered as highly
influential and deserve a closer look and special treatment. In our case, observation 916 appears to be the most
influential point, but it still falls within the boundary of an acceptable Cook’s distance (the dashed lines are not
visible because no point comes near them).
Multi-variate Model
When building a model, we’re not constrained to using only one variable. We can try to enhance our predictive
power by adding more explanatory variables to the model.
Let us build a model that considers both Credit Amount and Credit History.
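We store the new model in fit2 (the name the exercises refer to) and print its summary:

# Fit a model with two explanatory variables
fit2 <- lm(Revenue ~ CredAmt + CredHist, data = credit)
summary(fit2)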
##
## Call:
## lm(formula = Revenue ~ CredAmt + CredHist, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.213 -12.169 -0.693 12.684 31.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1700383 2.6588002 -3.449 0.000586 ***
## CredAmt 0.0299477 0.0001794 166.957 < 2e-16 ***
## CredHistA31 7.5728079 3.3643153 2.251 0.024608 *
## CredHistA32 20.1922327 2.6064696 7.747 2.32e-14 ***
## CredHistA33 28.9940478 2.9995753 9.666 < 2e-16 ***
## CredHistA34 33.4208933 2.6764075 12.487 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.7 on 994 degrees of freedom
## Multiple R-squared: 0.9667, Adjusted R-squared: 0.9665
## F-statistic: 5772 on 5 and 994 DF, p-value: < 2.2e-16
It looks like this model is better than the previous one. That is, the Credit History helps us explain a little more
variance than using only the Credit Amount (higher adjusted R-squared).
Collinearity Test
However, we should now check another assumption: the explanatory variables should be uncorrelated. To
check that, let us create a correlation table.
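The table chunk is not shown here; a minimal sketch for the numeric columns (CredHist is a factor, so it cannot enter a plain correlation table):

# Correlation table for the numeric explanatory variables
cor(credit[sapply(credit, is.numeric)])

Since CredHist is categorical, we can instead inspect its relationship with CredAmt visually; plotting a factor against a numeric variable in base R produces box plots.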
# plot
plot(credit$CredHist, credit$CredAmt, # box plots: factor vs numeric
xlab = "Credit History",
ylab = "Credit Amount")
But do we have the best model? Adding one variable increased the predictive power of the model, and we still have
several other variables to test. How can we select a better model?
For that, we’d have to compare all the bi-variate models using all variables, then all multi-variate models with 3
variables, 4 variables, and so on…
Model Search
As you guessed, there’s an automatic way to select good models. We will touch on two of them:
Forward selection: in this method, we test all univariate models and select the best one. Then, we test all the
remaining variables to build a bi-variate model. We go on until adding another variable won’t
improve the model by more than a certain threshold (see the sketch after this list).
Backward elimination: the idea is similar to forward selection, but in backward elimination we
start with a model that has all the variables and remove the worst one. We test the model again and again
and stop when all the remaining variables are significant beyond a certain limit.
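The selection chunk is not shown here; a sketch reconstructed from the trace below, starting from the intercept-only model and considering every candidate variable listed in the first step (note that step uses the AIC rather than p-values as its criterion):

# Forward selection: start from the intercept-only model and add variables
fit_null <- lm(Revenue ~ 1, data = credit)
fit_for <- step(fit_null,
scope = ~ CredAmt + Duration + Purpose + Job + Property + House +
CredHist + PerStat + Default + StatExCheAcc + SavBal +
Othdebt + EmpTen + NumbCred + Age + PresRes,
direction = "forward")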
## Start: AIC=8905.81
## Revenue ~ 1
##
## Df Sum of Sq RSS AIC
## + CredAmt 1 7043426 316545 5761.5
## + Duration 1 2772613 4587359 8435.1
## + Purpose 9 953875 6406097 8785.0
## + Job 3 775594 6584378 8800.5
## + Property 3 657661 6702311 8818.2
## + House 2 221244 7138728 8879.3
## + CredHist 4 242672 7117300 8880.3
## + PerStat 3 217711 7142261 8881.8
## + Default 1 164672 7195300 8885.2
## + StatExCheAcc 3 131926 7228046 8893.7
## + SavBal 4 117499 7242473 8897.7
## + Othdebt 2 76497 7283475 8899.4
## + EmpTen 4 102600 7257372 8899.8
## + NumbCred 1 25337 7334635 8904.4
## <none> 7359972 8905.8
## + Age 1 3432 7356540 8907.3
## + PresRes 1 2181 7357790 8907.5
##
## Step: AIC=5761.47
## Revenue ~ CredAmt
##
## Df Sum of Sq RSS AIC
## + CredHist 4 71485 245060 5513.5
## + Age 1 21166 295379 5694.3
## + NumbCred 1 10818 305727 5728.7
## + House 2 4513 312033 5751.1
## + EmpTen 4 4579 311967 5754.9
## + Property 3 3673 312873 5755.8
## + PresRes 1 905 315641 5760.6
## <none> 316545 5761.5
## + StatExCheAcc 3 1426 315119 5763.0
## + Duration 1 68 316477 5763.3
## + Default 1 24 316521 5763.4
## + PerStat 3 1259 315286 5763.5
## + Othdebt 2 380 316166 5764.3
## + Job 3 366 316180 5766.3
## + SavBal 4 751 315794 5767.1
## + Purpose 9 2107 314439 5772.8
##
## Step: AIC=5513.5
## Revenue ~ CredAmt + CredHist
##
## Df Sum of Sq RSS AIC
## + Age 1 33482 211579 5368.6
## + House 2 5111 239949 5496.4
## + EmpTen 4 5981 239080 5496.8
## + Default 1 3365 241695 5501.7
## + Property 3 3503 241557 5505.1
## + PerStat 3 3458 241602 5505.3
The search continues in the same way, adding variables one at a time, and stops when no addition lowers the AIC
further (the remaining steps of the trace are omitted here). Let’s inspect the model that forward selection chose.
summary(fit_for)
##
## Call:
## lm(formula = Revenue ~ CredAmt + CredHist + Age + Default + EmpTen,
## data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.0489 -11.9743 0.6923 12.0077 27.9325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3603090 3.6153166 2.589 0.009765 **
## CredAmt 0.0299103 0.0001686 177.383 < 2e-16 ***
## CredHistA31 8.2618879 3.1197936 2.648 0.008221 **
## CredHistA32 20.6640166 2.4282225 8.510 < 2e-16 ***
## CredHistA33 30.5193670 2.7925681 10.929 < 2e-16 ***
## CredHistA34 36.4678345 2.5235479 14.451 < 2e-16 ***
## Age -0.5137518 0.0446453 -11.507 < 2e-16 ***
## DefaultYes -3.5336624 1.0524095 -3.358 0.000816 ***
## EmpTenA72 -0.4102247 2.2082353 -0.186 0.852663
## EmpTenA73 0.2801253 2.0419229 0.137 0.890911
## EmpTenA74 3.7774582 2.1747553 1.737 0.082706 .
## EmpTenA75 1.3238294 2.0697135 0.640 0.522568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 988 degrees of freedom
## Multiple R-squared: 0.9718, Adjusted R-squared: 0.9715
## F-statistic: 3093 on 11 and 988 DF, p-value: < 2.2e-16
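The backward run’s chunk and trace are not shown here; a sketch that matches the Call in the output below:

# Backward elimination: start from the full model and drop variables
fit_back <- step(lm(Revenue ~ ., data = credit), direction = "backward")
summary(fit_back)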
##
## Call:
## lm(formula = Revenue ~ CredHist + CredAmt + EmpTen + Age + Default,
## data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.0489 -11.9743 0.6923 12.0077 27.9325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3603090 3.6153166 2.589 0.009765 **
## CredHistA31 8.2618879 3.1197936 2.648 0.008221 **
## CredHistA32 20.6640166 2.4282225 8.510 < 2e-16 ***
## CredHistA33 30.5193670 2.7925681 10.929 < 2e-16 ***
## CredHistA34 36.4678345 2.5235479 14.451 < 2e-16 ***
## CredAmt 0.0299103 0.0001686 177.383 < 2e-16 ***
## EmpTenA72 -0.4102247 2.2082353 -0.186 0.852663
## EmpTenA73 0.2801253 2.0419229 0.137 0.890911
## EmpTenA74 3.7774582 2.1747553 1.737 0.082706 .
## EmpTenA75 1.3238294 2.0697135 0.640 0.522568
## Age -0.5137518 0.0446453 -11.507 < 2e-16 ***
## DefaultYes -3.5336624 1.0524095 -3.358 0.000816 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 988 degrees of freedom
## Multiple R-squared: 0.9718, Adjusted R-squared: 0.9715
## F-statistic: 3093 on 11 and 988 DF, p-value: < 2.2e-16
For this dataset, both methods returned the same model. However, the two methods can return
different results.
Exercises
1. Select 3 variables to build a model to explain the Revenue (fit3).
2. At first glance, is it better or worse than our previous models (fit1 and fit2)? Why?
3. Is it a valid model? Why?