Homework #6: Student: Mario Perez
Homework #6: Student: Mario Perez
Homework #6: Student: Mario Perez
Question 9.1
Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and
then create a regression model using the first few principal components. Specify your new model in terms of
the original variables (not the principal components), and compare its quality to that of your solution to
Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include
scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new
city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)
Answer:
For this question, I used the function prcomp as well as the function lm in order to make a prediction for
the new city. The steps involved in the process were as follow:
# Clear environment
rm(list = ls())
# Load libraries
library(DAAG)
set.seed(1)
## M So Ed Po1
## Min. :11.90 Min. :0.0000 Min. : 8.70 Min. : 4.50
## 1st Qu.:13.00 1st Qu.:0.0000 1st Qu.: 9.75 1st Qu.: 6.25
## Median :13.60 Median :0.0000 Median :10.80 Median : 7.80
## Mean :13.86 Mean :0.3404 Mean :10.56 Mean : 8.50
## 3rd Qu.:14.60 3rd Qu.:1.0000 3rd Qu.:11.45 3rd Qu.:10.45
## Max. :17.70 Max. :1.0000 Max. :12.20 Max. :16.60
## Po2 LF M.F Pop
## Min. : 4.100 Min. :0.4800 Min. : 93.40 Min. : 3.00
## 1st Qu.: 5.850 1st Qu.:0.5305 1st Qu.: 96.45 1st Qu.: 10.00
## Median : 7.300 Median :0.5600 Median : 97.70 Median : 25.00
## Mean : 8.023 Mean :0.5612 Mean : 98.30 Mean : 36.62
## 3rd Qu.: 9.700 3rd Qu.:0.5930 3rd Qu.: 99.20 3rd Qu.: 41.50
## Max. :15.700 Max. :0.6410 Max. :107.10 Max. :168.00
## NW U1 U2 Wealth
## Min. : 0.20 Min. :0.07000 Min. :2.000 Min. :2880
## 1st Qu.: 2.40 1st Qu.:0.08050 1st Qu.:2.750 1st Qu.:4595
## Median : 7.60 Median :0.09200 Median :3.400 Median :5370
## Mean :10.11 Mean :0.09547 Mean :3.398 Mean :5254
## 3rd Qu.:13.25 3rd Qu.:0.10400 3rd Qu.:3.850 3rd Qu.:5915
## Max. :42.30 Max. :0.14200 Max. :5.800 Max. :6890
## Ineq Prob Time Crime
## Min. :12.60 Min. :0.00690 Min. :12.20 Min. : 342.0
## 1st Qu.:16.55 1st Qu.:0.03270 1st Qu.:21.60 1st Qu.: 658.5
## Median :17.60 Median :0.04210 Median :25.80 Median : 831.0
## Mean :19.40 Mean :0.04709 Mean :26.60 Mean : 905.1
## 3rd Qu.:22.75 3rd Qu.:0.05445 3rd Qu.:30.45 3rd Qu.:1057.5
## Max. :27.60 Max. :0.11980 Max. :44.00 Max. :1993.0
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.56729 0.55444 0.48493 0.44708 0.41915 0.35804
## Proportion of Variance 0.02145 0.02049 0.01568 0.01333 0.01171 0.00855
## Cumulative Proportion 0.92142 0.94191 0.95759 0.97091 0.98263 0.99117
## PC13 PC14 PC15
## Standard deviation 0.26333 0.2418 0.06793
## Proportion of Variance 0.00462 0.0039 0.00031
## Cumulative Proportion 0.99579 0.9997 1.00000
# Looking at the scree plot, we can see that only the first 4 or 5 factors may b
e relevant
# in order to retain them for an exploratory factor analysis.
# Another approach is to use the function screeplot and pick the # of items with
stdev > 1
# We can see that the first 4 or 5 principal components are significant (PCA 5 b
arely touches the line)
screeplot(PCA, type = "line")
abline(h=1, col="green")
##
## Call:
## lm(formula = V6 ~ ., data = as.data.frame(PCcrime))
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.79 -185.01 12.21 146.24 447.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.09 35.59 25.428 < 2e-16 ***
## PC1 65.22 14.67 4.447 6.51e-05 ***
## PC2 -70.08 21.49 -3.261 0.00224 **
## PC3 25.19 25.41 0.992 0.32725
## PC4 69.45 33.37 2.081 0.04374 *
## PC5 -229.04 36.75 -6.232 2.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared: 0.6452, Adjusted R-squared: 0.6019
## F-statistic: 14.91 on 5 and 41 DF, p-value: 2.446e-08
# Calculate intercept b0
beta0 <- lm.model$coefficients[1]
# Calculate b1, b2, etc... coefficients
betas <- lm.model$coefficients[2:(k+1)]
# Calculate implied regression coefficients a1, a2, etc...
alpha <- PCA$rotation[,1:k] %*% betas
# Calculating original alpha and beta values
mu <- sapply(UScrime[,1:15],mean)
sigma <- sapply(UScrime[,1:15],sd)
original.alpha <- alpha/sigma
original.beta0 <- beta0 - sum(alpha*mu /sigma)
# Calculate R^2 and R^2 adjusted value in order to get model's accuracy
estimates <- as.matrix(UScrime[,1:15]) %*% original.alpha + original.beta0
SSE = sum((estimates - UScrime[,16])^2)
SStot = sum((UScrime[,16] - mean(UScrime[,16]))^2)
R2 = 1 - SSE/SStot
R2.adjusted = R2 - (1-R2)*k/(nrow(UScrime)-k-1)
R2
## [1] 0.6451941
R2.adjusted
## [1] 0.601925
## 1
## 1388.926