Homework #6: Student: Mario Perez

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Homework #6

Student: Mario Perez / m.perez@gatech.edu

Question 9.1

Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and
then create a regression model using the first few principal components. Specify your new model in terms of
the original variables (not the principal components), and compare its quality to that of your solution to
Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include
scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new
city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)

Answer:

For this question, I used the function prcomp as well as the function lm in order to make a prediction for
the new city. The steps involved in the process were as follow:

• Loaded uscrime.txt data into R


o UScrime = read.table("uscrime.txt",header=TRUE)
• Performed Exploratory Data Analysis using several functions such as str(), summary(), and
head() in order to understand the structure, statistical summary of the data, and a sample view of
the first few rows of the data.
• Initialized a new variable named datapoint to store the data frame with the following provided
values:
o M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F
= 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200,
Ineq = 20.1, Prob = 0.04, Time = 39.0
• Ran prcomp on the matrix of scaled predictors in order to perform Principal Component Analysis.
o PCA = prcomp(UScrime[,1:15], scale. = TRUE, center = TRUE)
• Plotted PCA results as a scree plot in order to determine the relevant Principal Components (PCs)
o Looking at the scree plot, we can see that only the first 4 or 5 factors may be relevant in
order to retain them for an exploratory factor analysis.
• Another approach was to use the function screeplot and pick the # of items with stdev > 1
o screeplot(PCA, type = "line")
abline(h=1, col="green")
o We can see that the first 5 principal components are significant (PCA 5 barely touches the
line)
o Based on the screeplot, I decided to use 5 principal components for my analysis
 k = 5
• I then proceeded to combine principal components 1:5 with original crime dataframe
o PCcrime = cbind(PCA$x[,1:k],UScrime[,16])
• The next step was to create a linear regression model with the combined dataset.
o lm.model <- lm(V6~., data = as.data.frame(PCcrime))
o The model returned an R-squared value of 0.6452, and an Adjusted R-squared of 0.6019
• I then proceeded to calculate intercept b0, as well as the rest of the beta coefficients (b1, b2, …, bk),
and the implied regression alpha coefficients (a1, a2, … , aj)
• In order to make a prediction for the new city, we need to unscale the coefficients. Using the mean
and the standard deviation, I made the scaling calculation in reverse in order to obtain the original
alpha and beta values.
o mu <- sapply(UScrime[,1:15],mean)
sigma <- sapply(UScrime[,1:15],sd)
original.alpha <- alpha/sigma
original.beta0 <- beta0 - sum(alpha*mu /sigma)
• After unscaling the coefficients, I then calculated R2 and adjusted-R2 in order to get model's accuracy
o The results were the following:
o R2 = 0.6451941
o Adjusted-R2 = 0.601925
• Finally, I proceeded to run the prediction using the above linear model with the datapoint
provided. The predicted observed crime rate in a city with the above provided data resulted in a
value of Crime = 1388.926
• In summary
o Based on the above steps, our linear model using PCA yielded a prediction of Crime =
1388.926 with an R2 = 0.645 and an Adjusted-R2 = 0.602
o The 6-factor, cross-validated linear model from my previous assignment yielded a prediction
of Crime = 1304.245 with an R2 = 0.638 and an Adjusted-R2 = 0.584
o Based on the above, the new linear model using PCA yielded a slightly better model in terms
of quality.
R code for Question 9.1:

# Homework 6 - Advanced Data Preparation


# Student: Mario Luis Perez

# Clear environment
rm(list = ls())

# Set my working directory (where data is located)


setwd("C:/Users/malperez/OneDrive/HPE/GATech/_ISYE-6501/R")

# Load libraries
library(DAAG)

## Loading required package: lattice

# --- Question 9.1 ---


# Using the same crime data set uscrime.txt as in Question 8.2,
# apply Principal Component Analysis and then create a regression
# model using the first few principal components. Specify your
# new model in terms of the original variables (not the principal
# components), and compare its quality to that of your solution
# to Question 8.2. You can use the R function prcomp for PCA.
# (Note that to first scale the data, you can include scale. =
# TRUE to scale as part of the PCA function. Don't forget that,
# to make a prediction for the new city, you'll need to unscale
# the coefficients (i.e., do the scaling calculation in reverse)!)

set.seed(1)

# Load crime data


UScrime = read.table("uscrime.txt",header=TRUE)

# *** Perform Exploratory Data Analysis ***


# View structure of dataset
str(UScrime)

## 'data.frame': 47 obs. of 16 variables:


## $ M : num 15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
## $ So : int 1 0 1 0 0 0 1 1 1 0 ...
## $ Ed : num 9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
## $ Po1 : num 5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
## $ Po2 : num 5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
## $ LF : num 0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ..
.
## $ M.F : num 95 101.2 96.9 99.4 98.5 ...
## $ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ NW : num 30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
## $ U1 : num 0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
## $ U2 : num 4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
## $ Wealth: int 3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
## $ Ineq : num 26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
## $ Prob : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ Time : num 26.2 25.3 24.3 29.9 21.3 ...
## $ Crime : int 791 1635 578 1969 1234 682 963 1555 856 705 ...

# View statistical summary of dataset


summary(UScrime)

## M So Ed Po1
## Min. :11.90 Min. :0.0000 Min. : 8.70 Min. : 4.50
## 1st Qu.:13.00 1st Qu.:0.0000 1st Qu.: 9.75 1st Qu.: 6.25
## Median :13.60 Median :0.0000 Median :10.80 Median : 7.80
## Mean :13.86 Mean :0.3404 Mean :10.56 Mean : 8.50
## 3rd Qu.:14.60 3rd Qu.:1.0000 3rd Qu.:11.45 3rd Qu.:10.45
## Max. :17.70 Max. :1.0000 Max. :12.20 Max. :16.60
## Po2 LF M.F Pop
## Min. : 4.100 Min. :0.4800 Min. : 93.40 Min. : 3.00
## 1st Qu.: 5.850 1st Qu.:0.5305 1st Qu.: 96.45 1st Qu.: 10.00
## Median : 7.300 Median :0.5600 Median : 97.70 Median : 25.00
## Mean : 8.023 Mean :0.5612 Mean : 98.30 Mean : 36.62
## 3rd Qu.: 9.700 3rd Qu.:0.5930 3rd Qu.: 99.20 3rd Qu.: 41.50
## Max. :15.700 Max. :0.6410 Max. :107.10 Max. :168.00
## NW U1 U2 Wealth
## Min. : 0.20 Min. :0.07000 Min. :2.000 Min. :2880
## 1st Qu.: 2.40 1st Qu.:0.08050 1st Qu.:2.750 1st Qu.:4595
## Median : 7.60 Median :0.09200 Median :3.400 Median :5370
## Mean :10.11 Mean :0.09547 Mean :3.398 Mean :5254
## 3rd Qu.:13.25 3rd Qu.:0.10400 3rd Qu.:3.850 3rd Qu.:5915
## Max. :42.30 Max. :0.14200 Max. :5.800 Max. :6890
## Ineq Prob Time Crime
## Min. :12.60 Min. :0.00690 Min. :12.20 Min. : 342.0
## 1st Qu.:16.55 1st Qu.:0.03270 1st Qu.:21.60 1st Qu.: 658.5
## Median :17.60 Median :0.04210 Median :25.80 Median : 831.0
## Mean :19.40 Mean :0.04709 Mean :26.60 Mean : 905.1
## 3rd Qu.:22.75 3rd Qu.:0.05445 3rd Qu.:30.45 3rd Qu.:1057.5
## Max. :27.60 Max. :0.11980 Max. :44.00 Max. :1993.0

# View top rows of dataset


head(UScrime)

## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq


## 1 15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1
## 2 14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4
## 6 12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6
## Prob Time Crime
## 1 0.084602 26.2011 791
## 2 0.029599 25.2999 1635
## 3 0.083401 24.3006 578
## 4 0.015801 29.9012 1969
## 5 0.041399 21.2998 1234
## 6 0.034201 20.9995 682

# Initializing new data point


datapoint = data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF =
0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ine
q = 20.1, Prob = 0.04, Time = 39.0)

# Run PCA on the matrix of scaled predictors


PCA = prcomp(UScrime[,1:15], scale. = TRUE, center = TRUE)
summary(PCA)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.56729 0.55444 0.48493 0.44708 0.41915 0.35804
## Proportion of Variance 0.02145 0.02049 0.01568 0.01333 0.01171 0.00855
## Cumulative Proportion 0.92142 0.94191 0.95759 0.97091 0.98263 0.99117
## PC13 PC14 PC15
## Standard deviation 0.26333 0.2418 0.06793
## Proportion of Variance 0.00462 0.0039 0.00031
## Cumulative Proportion 0.99579 0.9997 1.00000

# Print PCA results


print(PCA)

## Standard deviations (1, .., p=15):


## [1] 2.45335539 1.67387187 1.41596057 1.07805742 0.97892746 0.74377006
## [7] 0.56729065 0.55443780 0.48492813 0.44708045 0.41914843 0.35803646
## [13] 0.26332811 0.24180109 0.06792764
##
## Rotation (n x k) = (15 x 15):
## PC1 PC2 PC3 PC4 PC5
## M -0.30371194 0.06280357 0.1724199946 -0.02035537 -0.35832737
## So -0.33088129 -0.15837219 0.0155433104 0.29247181 -0.12061130
## Ed 0.33962148 0.21461152 0.0677396249 0.07974375 -0.02442839
## Po1 0.30863412 -0.26981761 0.0506458161 0.33325059 -0.23527680
## Po2 0.31099285 -0.26396300 0.0530651173 0.35192809 -0.20473383
## LF 0.17617757 0.31943042 0.2715301768 -0.14326529 -0.39407588
## M.F 0.11638221 0.39434428 -0.2031621598 0.01048029 -0.57877443
## Pop 0.11307836 -0.46723456 0.0770210971 -0.03210513 -0.08317034
## NW -0.29358647 -0.22801119 0.0788156621 0.23925971 -0.36079387
## U1 0.04050137 0.00807439 -0.6590290980 -0.18279096 -0.13136873
## U2 0.01812228 -0.27971336 -0.5785006293 -0.06889312 -0.13499487
## Wealth 0.37970331 -0.07718862 0.0100647664 0.11781752 0.01167683
## Ineq -0.36579778 -0.02752240 -0.0002944563 -0.08066612 -0.21672823
## Prob -0.25888661 0.15831708 -0.1176726436 0.49303389 0.16562829
## Time -0.02062867 -0.38014836 0.2235664632 -0.54059002 -0.14764767
## PC6 PC7 PC8 PC9 PC10
## M -0.449132706 -0.15707378 -0.55367691 0.15474793 -0.01443093
## So -0.100500743 0.19649727 0.22734157 -0.65599872 0.06141452
## Ed -0.008571367 -0.23943629 -0.14644678 -0.44326978 0.51887452
## Po1 -0.095776709 0.08011735 0.04613156 0.19425472 -0.14320978
## Po2 -0.119524780 0.09518288 0.03168720 0.19512072 -0.05929780
## LF 0.504234275 -0.15931612 0.25513777 0.14393498 0.03077073
## M.F -0.074501901 0.15548197 -0.05507254 -0.24378252 -0.35323357
## Pop 0.547098563 0.09046187 -0.59078221 -0.20244830 -0.03970718
## NW 0.051219538 -0.31154195 0.20432828 0.18984178 0.49201966
## U1 0.017385981 -0.17354115 -0.20206312 0.02069349 0.22765278
## U2 0.048155286 -0.07526787 0.24369650 0.05576010 -0.04750100
## Wealth -0.154683104 -0.14859424 0.08630649 -0.23196695 -0.11219383
## Ineq 0.272027031 0.37483032 0.07184018 -0.02494384 -0.01390576
## Prob 0.283535996 -0.56159383 -0.08598908 -0.05306898 -0.42530006
## Time -0.148203050 -0.44199877 0.19507812 -0.23551363 -0.29264326
## PC11 PC12 PC13 PC14 PC15
## M 0.39446657 0.16580189 -0.05142365 0.04901705 0.0051398012
## So 0.23397868 -0.05753357 -0.29368483 -0.29364512 0.0084369230
## Ed -0.11821954 0.47786536 0.19441949 0.03964277 -0.0280052040
## Po1 -0.13042001 0.22611207 -0.18592255 -0.09490151 -0.6894155129
## Po2 -0.13885912 0.19088461 -0.13454940 -0.08259642 0.7200270100
## LF 0.38532827 0.02705134 -0.27742957 -0.15385625 0.0336823193
## M.F -0.28029732 -0.23925913 0.31624667 -0.04125321 0.0097922075
## Pop 0.05849643 -0.18350385 0.12651689 -0.05326383 0.0001496323
## NW -0.20695666 -0.36671707 0.22901695 0.13227774 -0.0370783671
## U1 -0.17857891 -0.09314897 -0.59039450 -0.02335942 0.0111359325
## U2 0.47021842 0.28440496 0.43292853 -0.03985736 0.0073618948
## Wealth 0.31955631 -0.32172821 -0.14077972 0.70031840 -0.0025685109
## Ineq -0.18278697 0.43762828 -0.12181090 0.59279037 0.0177570357
## Prob -0.08978385 0.15567100 -0.03547596 0.04761011 0.0293376260
## Time -0.26363121 0.13536989 -0.05738113 -0.04488401 0.0376754405

# Plot PCA results as a scree plot


plot(PCA)

# Looking at the scree plot, we can see that only the first 4 or 5 factors may b
e relevant
# in order to retain them for an exploratory factor analysis.
# Another approach is to use the function screeplot and pick the # of items with
stdev > 1
# We can see that the first 4 or 5 principal components are significant (PCA 5 b
arely touches the line)
screeplot(PCA, type = "line")
abline(h=1, col="green")

# Number of principal components to use


k = 5

# Combining principal components 1:5 with original crime dataframe


PCcrime = cbind(PCA$x[,1:k],UScrime[,16])

# Creating linear regression model with the combined dataset


lm.model <- lm(V6~., data = as.data.frame(PCcrime))
summary(lm.model)

##
## Call:
## lm(formula = V6 ~ ., data = as.data.frame(PCcrime))
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.79 -185.01 12.21 146.24 447.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.09 35.59 25.428 < 2e-16 ***
## PC1 65.22 14.67 4.447 6.51e-05 ***
## PC2 -70.08 21.49 -3.261 0.00224 **
## PC3 25.19 25.41 0.992 0.32725
## PC4 69.45 33.37 2.081 0.04374 *
## PC5 -229.04 36.75 -6.232 2.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared: 0.6452, Adjusted R-squared: 0.6019
## F-statistic: 14.91 on 5 and 41 DF, p-value: 2.446e-08

# Calculate intercept b0
beta0 <- lm.model$coefficients[1]
# Calculate b1, b2, etc... coefficients
betas <- lm.model$coefficients[2:(k+1)]
# Calculate implied regression coefficients a1, a2, etc...
alpha <- PCA$rotation[,1:k] %*% betas
# Calculating original alpha and beta values
mu <- sapply(UScrime[,1:15],mean)
sigma <- sapply(UScrime[,1:15],sd)
original.alpha <- alpha/sigma
original.beta0 <- beta0 - sum(alpha*mu /sigma)

# Calculate R^2 and R^2 adjusted value in order to get model's accuracy
estimates <- as.matrix(UScrime[,1:15]) %*% original.alpha + original.beta0
SSE = sum((estimates - UScrime[,16])^2)
SStot = sum((UScrime[,16] - mean(UScrime[,16]))^2)
R2 = 1 - SSE/SStot
R2.adjusted = R2 - (1-R2)*k/(nrow(UScrime)-k-1)

R2

## [1] 0.6451941

R2.adjusted

## [1] 0.601925

# Calculate lm() prediction with the new data point


prediction_df <- data.frame(predict(PCA, datapoint))
lm.prediction = predict(lm.model, prediction_df)
lm.prediction

## 1
## 1388.926

You might also like