STAT630 Slides: Advanced Data Analysis
Lecturer
Samuel Iddi, PhD
Department of Statistics
University of Ghana
siddi@ug.edu.gh
Identifying outliers
Multicollinearity: VIF
Multivariate analysis
Teaching Assistants
TA:
Tutorial:
Phone number:
E-mail:
Grading
Homework Assignments (20%), Interim Assessment (30%), and
Final Exam (50%).
Guidelines
Homework should be submitted within one week of the day it is
assigned.
R code will be submitted via email.
Late submission will not be accepted.
Duplicate solutions will not be graded.
The TA will discuss solutions during tutorial hours.
The interim assessment could take the form of a project,
presentation, and defense.
Computing
The main software package for this course is regular R version
2.3.1 (or higher). Install R by visiting the website:
www.r-project.org. You may also use RStudio, but it works
only after R has been installed.
Regression
Establish a relationship (often linear) between dependent
(or response) variable Y and independent (or predictor)
variable X .
We observe pairs (X , Y ).
A statistical relationship between the two variables is given
by
Y = f (X ) + ε
where ε is the error committed by using f (X ) to
approximate Y .
If the scatterplot of (X, Y) is approximately linear,
f (X ) = β0 + β1 X is chosen.
Y = β0 + β1 X + ε is called a linear regression model used
to describe the relationship between Y and X .
Parameters β0 and β1 are the intercept and slope of the
straight line respectively.
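As a minimal sketch (the data frame dat and its columns x and y are hypothetical names, not from the course data), the model can be fit in R with lm():

dat <- data.frame(x = c(1, 2, 3, 4, 5),            # hypothetical predictor values
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1)) # hypothetical responses
fit <- lm(y ~ x, data = dat)   # least squares fit of Y = beta0 + beta1*X + error
coef(fit)                      # estimated intercept and slope
summary(fit)                   # standard errors, t tests, R-squared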
Regression
If f (X ) is nonlinear, then
Y = f (X ) + ε.
The least squares estimates are β̂1 = SSXY/SSXX and
β̂0 = Ȳ − β̂1 X̄,
where β̂0 and β̂1 are unbiased estimators of β0 and β1,
respectively.
Unbiased estimator of σ²: s² = MSE = SSE/(n − 2), where
SSE = SSYY − (SSXY)²/SSXX.
β̂0 and β̂1 are random and are normally distributed.
Thus, E(β̂0) = β0, Var(β̂0) = σ²(1/n + X̄²/SSXX),
E(β̂1) = β1, and Var(β̂1) = σ²/SSXX.
Hypothesis Testing and Confidence Intervals
(β̂1 − c) / √( s² / Σ(Xi − X̄)² ) ∼ tn−2
E(Ŷh) = β0 + β1 xh
and
Var(Ŷh) = σ²( 1/n + (xh − X̄)²/SSXX ).
The CI of the mean response:
Ŷh ± t1−α/2,n−2 √( s²( 1/n + (xh − X̄)²/SSXX ) )
Prediction of new observations
Var(Ŷh) = σ²( 1 + 1/n + (xh − X̄)²/SSXX ).
A CI can also be constructed based on the t-distribution
with n − 2 df.
Such an interval is called a prediction interval.
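In R both intervals are available through predict(); a sketch, assuming the sand data read in later in these slides has already been loaded:

reg <- lm(mgco3 ~ temp, data = sand)                          # model used later in the slides
newdat <- data.frame(temp = 16)                               # the new value xh
predict(reg, newdat, interval = "confidence", level = 0.95)   # CI for the mean response
predict(reg, newdat, interval = "prediction", level = 0.95)   # prediction interval for a new observation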
Analysis of Variance
Another tool for interpreting results from a linear
regression is the ANOVA.
SOV         df    SS     MS                E(MS)            F
Regression  1     SSR    MSR = SSR/1       σ² + β1²SSXX     MSR/MSE
Error       n−2   SSE    MSE = SSE/(n−2)   σ²
Total       n−1   SSTO
The coefficient of determination is given by
r² = SSR/SSTO = (SSTO − SSE)/SSTO = 1 − SSE/SSTO
r = SSXY / √(SSXX SSYY) is the correlation coefficient.
Coefficient of determination
Note that
0 ≤ r 2 ≤ 1. Why?
If β1 = 0 then r 2 = 0.
Multiple regression
Matrix Notation: Y = X β + ε
◦ Y is n × 1
◦ X is n × p
◦ β is p × 1
◦ ε ∼ N(0, σ 2 I) is n × 1
Categorical and Continuous Covariates
Consider the following example:
◦ Y length of stay in hospital ⇒ continuous response
◦ Age ⇒ continuous variable
◦ Gender ⇒ two-level categorical variable
◦ Disability status ⇒ three-level categorical variable
Create DUMMY variables for categorical variables (an R
sketch follows below).
Let X1 = Age
X2 = 1 if male, 0 if female
X3 = 1 if not disabled, 0 otherwise
X4 = 1 if partially disabled, 0 otherwise
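In R, declaring a categorical variable as a factor generates such dummy variables automatically; a small sketch with hypothetical hospital data (note that R's default reference level is alphabetical, so the coding may differ from X2-X4 above):

hosp <- data.frame(stay   = c(5, 8, 3, 10),                     # hypothetical length of stay
                   age    = c(40, 65, 30, 70),
                   gender = factor(c("male", "female", "male", "female")),
                   disab  = factor(c("not disabled", "partially disabled",
                                     "fully disabled", "not disabled")))
model.matrix(stay ~ age + gender + disab, data = hosp)          # shows the 0/1 dummy columns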
β̂ = (X′X)⁻¹X′Y
SOV         df    SS     MS                F
Regression  p−1   SSR    MSR = SSR/(p−1)   MSR/MSE
Error       n−p   SSE    MSE = SSE/(n−p)
Total       n−1   SSTO
ANOVA
H0 : β1 = β2 = · · · = βp−1 = 0
HA : βk ≠ 0 for at least one k
Test statistic
F* = MSR/MSE ∼ Fp−1,n−p
Decision rule
if F* ≤ F(1 − α; p − 1, n − p), conclude H0
if F* > F(1 − α; p − 1, n − p), conclude HA
Inference about the regression parameters
E(β̂) = β and Cov(β̂) = MSE(X′X)⁻¹
Test for βk
◦ Hypothesis, H0 : βk = 0 vs HA : βk ≠ 0
◦ Test statistic, t* = β̂k / se(β̂k)
◦ Decision rule: if |t*| ≤ t1−α/2,n−p conclude H0, and if
|t*| > t1−α/2,n−p conclude HA
Interval estimation of β̂k : β̂k ± t1−α/2,n−p se(β̂k)
Joint inferences (an R sketch follows below)
◦ Bonferroni joint confidence interval: β̂k ± B se(β̂k),
where B = t1−α/2g,n−p
◦ Scheffé joint confidence interval: β̂k ± S se(β̂k), where
S² = gF1−α;g,n−p (preferred for large g)
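A sketch of the Bonferroni limits in R, assuming the body fat data read in later in these slides and taking g = 2 coefficients of interest:

fit <- lm(y ~ x1 + x2, data = body)                # body fat model used later in the slides
g <- 2; alpha <- 0.05
B <- qt(1 - alpha/(2*g), df.residual(fit))         # Bonferroni multiplier t_{1-alpha/2g, n-p}
est <- coef(fit)[c("x1", "x2")]
se  <- sqrt(diag(vcov(fit)))[c("x1", "x2")]
cbind(lower = est - B*se, upper = est + B*se)      # joint Bonferroni intervals
S <- sqrt(g * qf(1 - alpha, g, df.residual(fit)))  # Scheffe multiplier, for comparison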
Inference about mean response
E(Yh) = Xh′β and E(Ŷh) = Xh′β̂
σ²{Ŷh} = Xh′σ²{β̂}Xh and s²{Ŷh} = Xh′s²{β̂}Xh
(1 − α)100% confidence limits for Ŷh
Ŷh ± t1−α/2,n−p se(Ŷh )
Confidence region for regression surface
(Working-Hotelling band)
Ŷh ± Wse(Ŷh )
where W 2 = pF (1 − α; p, n − p)
Simultaneous CI for several mean response
◦ use working-hotelling band
◦ use Bonferroni (g intervals),
Ŷh ± Bse(Ŷh )
where B = t1−α/2g,n−p
Coefficient of multiple determination
Multiple determination
R² = SSR/SSTO = 1 − SSE/SSTO
R² usually gets larger as more predictors are included.
Multiple correlation: R = √R²
Departures from the model
Models must be checked for departures from the assumptions.
Standardized residuals:
e*i = (ei − ē)/√MSE,
where ē = 0. This makes it easier to detect outliers; values
greater than 3 or less than −3 are regarded as "unusual"
observations.
Boxplots, stem-and-leaf plots, histograms, dot plots, and time
plots can be used.
Residual analysis
Test statistic (the Durbin-Watson statistic):
D = Σ_{i=2}^{n} (ei − ei−1)² / Σ_{i=1}^{n} ei²
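In R this statistic is available as durbinWatsonTest() in the car package (or dwtest() in lmtest); a sketch using the sand example model fitted later in these slides:

library(car)
reg <- lm(mgco3 ~ temp, data = sand)   # model from the sand example
durbinWatsonTest(reg)                  # D statistic with a p-value for autocorrelation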
Brown-Forsythe (modified Levene) test: compute
di1 = |ei1 − ẽ1| and di2 = |ei2 − ẽ2|, with ẽ1 and ẽ2 the group
medians. Then
t* = (d̄1 − d̄2) / ( sp √(1/n1 + 1/n2) ) ∼ tn−2
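A manual sketch of this test in R, where splitting the residuals into two groups by the median of X is an illustrative choice, not a rule from these slides:

reg <- lm(mgco3 ~ temp, data = sand)              # model from the sand example
grp <- sand$temp <= median(sand$temp)             # group 1: low X; group 2: high X
e1 <- resid(reg)[grp]; e2 <- resid(reg)[!grp]
d1 <- abs(e1 - median(e1)); d2 <- abs(e2 - median(e2))
t.test(d1, d2, var.equal = TRUE)                  # pooled two-sample t test on the absolute deviations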
Procedure (Breusch-Pagan test):
◦ use the relationship ln σi² = γ0 + γ1 Xi
◦ H0 : γ1 = 0 versus HA : γ1 > 0
◦ Regress ei² against Xi and obtain SSR*
Statistic:
χ²BP = (SSR*/2) / (SSE/n)² ∼ χ²1
For q predictors, χ²BP = (SSR*/q) / (SSE/n)² ∼ χ²q
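In R a test for constant error variance is available as ncvTest() in the car package (used later in these slides) and as bptest() in lmtest; a minimal sketch:

library(car)
reg <- lm(mgco3 ~ temp, data = sand)   # model from the sand example
ncvTest(reg)                           # score test against non-constant variance
# lmtest::bptest(reg)                  # alternative Breusch-Pagan implementation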
Test for Normality
Procedure
Fit a regression line and compute the residuals
Testing method
Test to check if the linear regression model is appropriate
Linearity
◦ modify the regression model
◦ use transformation on X and Y
Homoscedasticity
◦ methods of weighted least squares
◦ variance stabilization transformation
Independence
◦ use time series regression model
◦ use generalized least squares
Remedial measures
Outliers
◦ discard outliers (be careful!)
◦ add interaction with independent variables
◦ use robust estimation method (median or absolute
deviation)
Normality
◦ use generalized linear model
◦ use transformation on Y
◦ rank regression
◦ non-parametric regression
Lack of fit
◦ include more independent variables
Transformation
Linearizing transforms
plot the data
(sand<-read.table("./Datasets/sand.txt", header=T))
## mgco3 temp
## 1 9.2 17.5
## 2 9.2 21.0
## 3 9.4 20.0
## 4 9.0 15.3
## 5 8.5 14.0
## 6 8.5 13.1
## 7 8.8 13.3
## 8 8.5 13.0
## 9 9.3 19.0
## 10 9.0 18.7
Demonstration in R
##
## Call:
## lm(formula = mgco3 ~ temp, data = sand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.192410 -0.145541 0.008722 0.147426 0.178326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.30034 0.30597 23.859 1.01e-08 ***
## temp 0.09943 0.01827 5.443 0.000614 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1697 on 8 degrees of freedom
## Multiple R-squared: 0.7874,Adjusted R-squared: 0.7608
## F-statistic: 29.63 on 1 and 8 DF, p-value: 0.0006136
Diagnostics in R
reg <- lm(mgco3 ~ temp, data = sand)   # the fit summarized above
par(mfrow=c(1,2))
plot(reg$fitted.values, reg$residuals, xlab="Fitted Values",
ylab="Residuals")
library(car)
par(mfrow=c(1,2));plot(reg,1:2)
[Figures: Residuals vs Fitted, Normal Q-Q, and Scale-Location plots for lm(mgco3 ~ temp)]
Understanding diagnosis plot
Residual vs Fitted
◦ Shows if there are non-linear patterns that are not
captured by the model
◦ If there is no distinctive pattern in the plot,
non-linearity is not a problem
Normal Q-Q Plot
◦ Shows if residuals are normally distributed
◦ If the residuals follow a straight line (it need not be
perfect), there is no serious deviation from normality
◦ Severe deviation from straight dashed-line must be of
concern
Understanding diagnosis plot
Scale-Location/Spread-Location Plot
◦ Shows if residuals are spread equally along the
ranges of predictor.
◦ Used to check for homoscedasticity assumption
◦ A horizontal line with equally (randomly) spread points
is indicative of equal variance
Formal test in R
Formal Test for Normality and Constancy of Variance
shapiro.test(reg$residuals) ##Test for Normality
##
## Shapiro-Wilk normality test
##
## data: reg$residuals
## W = 0.82427, p-value = 0.02854
Normality is suspect
##
## Shapiro-Wilk normality test
##
## data: reg2$residuals
## W = 0.8351, p-value = 0.03854
##
## Shapiro-Wilk normality test
##
## data: reg3$residuals
## W = 0.82978, p-value = 0.03326
##
## Shapiro-Wilk normality test
##
## data: reg4$residuals
## W = 0.84463, p-value = 0.05012
print(p)
[Figure: scatterplot of 1/mgco3 against temp with the fitted regression line]
Scatterplot of regression fit in R
Alternatively, we can use the base graphics.
plot(sand$temp,1/sand$mgco3)
abline(reg4)
CI of mean and PI for new observation in R
[Figure: fitted line for 1/mgco3 against temp with pointwise confidence and prediction bands]
An example with categorical independent variable
##
## Call:
## lm(formula = failure ~ locf, data = ex2)
##
## Coefficients:
## (Intercept) locfLoc 2 locfLoc 3
## 50.37 -28.24 70.84
ANOVA results from R
anova(ex2reg)
[Figures: Residuals vs Fitted and Normal Q-Q plots for ex2reg]
Diagnostics plots in R
par(mfrow=c(1,2))
plot(ex2reg)[1:2]
[Figures: Scale-Location and Constant Leverage (Residuals vs Factor Levels) plots for ex2reg]
Formal diagnostics in R
shapiro.test(ex2reg$residuals)
##
## Shapiro-Wilk normality test
##
## data: ex2reg$residuals
## W = 0.8065, p-value = 0.004461
library(nortest)
lillie.test(ex2reg$residuals);
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: ex2reg$residuals
## D = 0.25084, p-value = 0.01185
Remedial measures
tapply(ex2$failure, ex2$locf, var)/tapply(ex2$failure, ex2$locf, mean) # Square root
[Figure: qqPlot of the studentized residuals of ex2reg2 against t quantiles]
Remedial measures in R
shapiro.test(ex2reg2$residuals)
##
## Shapiro-Wilk normality test
##
## data: ex2reg2$residuals
## W = 0.978, p-value = 0.954
ncvTest(ex2reg2)
[Figure: Box-Cox log-likelihood profile for the transformation parameter with 95% confidence interval]
Diagnostics after transformation
The graph seems to suggest the natural log transformation
since 0 is included in the 95% confidence interval. However, a
transformation of Y^0.1 appears to be better.
ex2reg3 <- lm(failure^0.1 ~ locf, data = ex2)
qqPlot(ex2reg3)
[Figure: qqPlot of the studentized residuals of ex2reg3 against t quantiles]
Diagnostics after transformation
shapiro.test(ex2reg3$residuals)
##
## Shapiro-Wilk normality test
##
## data: ex2reg3$residuals
## W = 0.97697, p-value = 0.9446
ncvTest(ex2reg3)
Influence?
Identifying outlying Y observations
Residuals
ei = Yi − Ŷi
Semistudentized Residuals
ei* = ei/√MSE
Studentized Residuals
ri = ei/s(ei)
where s(ei) = √(MSE(1 − hii)) and H = X(X′X)⁻¹X′ is the hat
matrix.
Deleted Residuals
di = Yi − Ŷi(i) and di = ei/(1 − hii)
Studentized Deleted Residuals
ti = di/s(di)
Identifying outlying X observations
Let H be the hat matrix
0 ≤ hii ≤ 1 and Σ_{i=1}^{n} hii = p
Criteria
DFFITSi = (Ŷi − Ŷi(i)) / √(MSE(i) hii)
Rule of Thumb
◦ F (Di , p, n − p) < 10% or 20% not influential case.
◦ F (Di , p, n − p) > 50% influential case.
Influential cases
Rule of Thumb
◦ |DFBETASk(i)| > 1: influential case for small data sets
◦ |DFBETASk(i)| > 2/√n for large data sets (see the R sketch below).
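A sketch of applying these cut-offs in R, using the body fat model fitted on the next slides:

model <- lm(y ~ x1 + x2, data = body)          # body fat model from the next slides
db <- dfbetas(model)                           # DFBETAS for every observation and coefficient
n <- nrow(body)
which(abs(db) > 2/sqrt(n), arr.ind = TRUE)     # cases flagged by the large-data rule of thumb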
Example: Body Fats Data
Variables
n = 20
X1 skinfold thickness
X2 thigh circumference
X3 midarm circumference
Y amount of fat
Some Implementation in R
body<-read.table("./Datasets/body.txt", header=T)
head(body)
## x1 x2 x3 y
## 1 19.5 43.1 29.1 11.9
## 2 24.7 49.8 28.2 22.8
## 3 30.7 51.9 37.0 18.7
## 4 29.8 54.3 31.1 20.1
## 5 19.1 42.2 30.9 12.9
## 6 25.6 53.9 23.7 21.7
(model<-lm(y~x1+x2, data=body))
##
## Call:
## lm(formula = y ~ x1 + x2, data = body)
##
## Coefficients:
## (Intercept) x1 x2
## -19.1742 0.2224 0.6594
Some Implementation in R
par(mfrow=c(1,2));plot(model, 4:5)
[Figures: Cook's distance plot and Residuals vs Leverage plot for lm(y ~ x1 + x2)]
Some Implementation in R
model$residuals
## 1 2 3 4 5
## -1.6827093112 3.6429311788 -3.1759701405 -3.1584651200 -0.0002886579
## 6 7 8 9 10
## -0.3608155187 0.7161991891 4.0147327554 2.6551057360 -2.4748115410
## 11 12 13 14 15
## 0.3358063798 2.2255110139 -3.9468613463 3.4474561945 0.5705871038
## 16 17 18 19 20
## 0.6422984777 -0.8509464751 -0.7829198812 -2.8572887647 1.0404487275
Understanding diagnosis plot
rstandard(model)
## 1 2 3 4 5
## -0.7402261049 1.4765806027 -1.5757913418 -1.3171518783 -0.0001308889
## 6 7 8 9 10
## -0.1519867578 0.3064529233 1.6606069555 1.1095479783 -1.0316491806
## 11 12 13 14 15
## 0.1407848595 0.9272163634 -1.7121512617 1.4686083237 0.2747599004
## 16 17 18 19 20
## 0.2655244112 -0.3538020409 -0.3435016352 -1.1631291289 0.4197625143
rstudent(model)
## 1 2 3 4 5
## -0.7299854027 1.5342541325 -1.6543295725 -1.3484842072 -0.0001269809
## 6 7 8 9 10
## -0.1475490938 0.2981276214 1.7600924916 1.1176487404 -1.0337284208
## 11 12 13 14 15
## 0.1366610657 0.9231785040 -1.8259027246 1.5247630510 0.2671500921
## 16 17 18 19 20
## 0.2581323416 -0.3445090997 -0.3344080836 -1.1761712768 0.4093564171
Some Implementation in R
influence(model)$hat
## 1 2 3 4 5 6
## 0.20101253 0.05889478 0.37193301 0.11094009 0.24801034 0.12861620
## 7 8 9 10 11 12
## 0.15551745 0.09628780 0.11463564 0.11024435 0.12033655 0.10926629
## 13 14 15 16 17 18
## 0.17838181 0.14800684 0.33321201 0.09527739 0.10559466 0.19679280
## 19 20
## 0.06695419 0.05008526
dffits(model)
## 1 2 3 4 5
## -3.661472e-01 3.838103e-01 -1.273067e+00 -4.763483e-01 -7.292347e-05
## 6 7 8 9 10
## -5.668650e-02 1.279371e-01 5.745212e-01 4.021649e-01 -3.638725e-01
## 11 12 13 14 15
## 5.054583e-02 3.233366e-01 -8.507812e-01 6.355141e-01 1.888521e-01
## 16 17 18 19 20
## 8.376829e-02 -1.183735e-01 -1.655265e-01 -3.150707e-01 9.399706e-02
Some Implementation in R
(cd<-cooks.distance(model))
## 1 2 3 4 5
## 4.595055e-02 4.548118e-02 4.901567e-01 7.216190e-02 1.883399e-09
## 6 7 8 9 10
## 1.136518e-03 5.764939e-03 9.793853e-02 5.313352e-02 4.395704e-02
## 11 12 13 14 15
## 9.037986e-04 3.515436e-02 2.121502e-01 1.248925e-01 1.257530e-02
## 16 17 18 19 20
## 2.474925e-03 4.926142e-03 9.636470e-03 3.236006e-02 3.096787e-03
Some Implementation in R
dfbeta(model)
## (Intercept) x1 x2
## 1 -2.5873120410 -4.045756e-02 6.851256e-02
## 2 1.3885856142 3.359106e-02 -3.996603e-02
## 3 -6.7460811783 -3.417891e-01 2.959198e-01
## 4 -0.8298018102 -8.699579e-02 5.576707e-02
## 5 -0.0005491464 -9.548311e-06 1.507863e-05
## 6 0.3417033826 1.252797e-02 -1.327784e-02
## 7 -0.6662800194 -4.869870e-03 1.625790e-02
## 8 2.0621886739 1.119746e-01 -9.133441e-02
## 9 -1.2562286444 -8.876192e-02 7.137567e-02
## 10 1.9837372912 7.407218e-02 -7.811621e-02
## 11 -0.0776962164 5.331751e-03 -7.452907e-04
## 12 -1.0957798277 6.844439e-03 2.047112e-02
## 13 0.9361849968 1.685643e-01 -1.063493e-01
## 14 3.6377705378 3.307612e-02 -8.349511e-02
## 15 -0.0258331550 -3.893438e-02 2.059515e-02
## 16 0.0800533575 1.345694e-02 -7.525586e-03
## 17 0.6827037689 1.715290e-02 -2.275412e-02
## 18 1.1340627287 2.347924e-02 -3.472625e-02
## 19 -1.0715507527 -1.221911e-03 1.855296e-02
## 20 0.0873645353 7.127872e-04 -9.895689e-04
Some Implementation in R
influence.measures(model)
## Influence measures of
## lm(formula = y ~ x1 + x2, data = body) :
##
## dfb.1_ dfb.x1 dfb.x2 dffit cov.r cook.d hat inf
## 1 -3.05e-01 -1.31e-01 2.32e-01 -3.66e-01 1.361 4.60e-02 0.2010
## 2 1.73e-01 1.15e-01 -1.43e-01 3.84e-01 0.844 4.55e-02 0.0589
## 3 -8.47e-01 -1.18e+00 1.07e+00 -1.27e+00 1.189 4.90e-01 0.3719 *
## 4 -1.02e-01 -2.94e-01 1.96e-01 -4.76e-01 0.977 7.22e-02 0.1109
## 5 -6.37e-05 -3.05e-05 5.02e-05 -7.29e-05 1.595 1.88e-09 0.2480 *
## 6 3.97e-02 4.01e-02 -4.43e-02 -5.67e-02 1.371 1.14e-03 0.1286
## 7 -7.75e-02 -1.56e-02 5.43e-02 1.28e-01 1.397 5.76e-03 0.1555
## 8 2.61e-01 3.91e-01 -3.32e-01 5.75e-01 0.780 9.79e-02 0.0963
## 9 -1.51e-01 -2.95e-01 2.47e-01 4.02e-01 1.081 5.31e-02 0.1146
## 10 2.38e-01 2.45e-01 -2.69e-01 -3.64e-01 1.110 4.40e-02 0.1102
## 11 -9.02e-03 1.71e-02 -2.48e-03 5.05e-02 1.359 9.04e-04 0.1203
## 12 -1.30e-01 2.25e-02 7.00e-02 3.23e-01 1.152 3.52e-02 0.1093
## 13 1.19e-01 5.92e-01 -3.89e-01 -8.51e-01 0.827 2.12e-01 0.1784
## 14 4.52e-01 1.13e-01 -2.98e-01 6.36e-01 0.937 1.25e-01 0.1480
## 15 -3.00e-03 -1.25e-01 6.88e-02 1.89e-01 1.775 1.26e-02 0.3332 *
## 16 9.31e-03 4.31e-02 -2.51e-02 8.38e-02 1.309 2.47e-03 0.0953
## 17 7.95e-02 5.50e-02 -7.61e-02 -1.18e-01 1.312 4.93e-03 0.1056
## 18 1.32e-01 7.53e-02 -1.16e-01 -1.66e-01 1.462 9.64e-03 0.1968
## 19 -1.30e-01 -4.07e-03 6.44e-02 -3.15e-01 1.002 3.24e-02 0.0670
## 20 1.02e-02 2.29e-03 -3.31e-03 9.40e-02 1.224 3.10e-03 0.0501
Some Implementation in R
library(car)
outlierTest(model) # Bonferroni p-value for most extreme obs
Model selection
Model validation
Procedures for variable reduction
General case X1 , X2 , . . . , Xn
2^n − 1 different models
Criteria for comparing the regression models
Different criteria can be used
Rp2
MSEp
Cp
PRESSp
Rp² = 1 − SSEp/SSTO
Given that SSTO is constant we will look for the smallest
SSEp.
Ra2 or MSEp criterion
If we want to account for the number of parameters
Ra² = 1 − ((n − 1)/(n − p)) (SSE/SSTO) = 1 − MSE/(SSTO/(n − 1))
Data
X1 blood clotting score
X2 prognostic index
X3 enzyme function test score
X4 liver function test
Y survival
Objective
Predict survival of patients with a particular type of liver
operation.
Implementation in R
surg<-read.table("./Datasets/surgical.txt", header=T)
cor(surg)
## x1 x2 x3 x4 y logy
## x1 1.00000000 0.09011973 -0.14963411 0.5024157 0.3725187 0.3464042
## x2 0.09011973 1.00000000 -0.02360544 0.3690256 0.5539760 0.5928888
## x3 -0.14963411 -0.02360544 1.00000000 0.4164245 0.5802438 0.6651216
## x4 0.50241567 0.36902563 0.41642451 1.0000000 0.7223266 0.7262058
## y 0.37251865 0.55397598 0.58024382 0.7223266 1.0000000 0.9130965
## logy 0.34640419 0.59288884 0.66512160 0.7262058 0.9130965 1.0000000
Implementation in R
pairs(surg[,-6])
[Figure: scatterplot matrix (pairs plot) of x1, x2, x3, x4, and y for the surgical data]
Implementation in R: Diagnostics
model<-lm(y~x1+x2+x3+x4, data=surg)
library(car)
par(mfrow=c(1,2))
qqnorm(model$residuals)
qqline(model$residuals)
plot(model$fitted,model$residuals,xlab="Fitted",ylab="Residuals",
main="Time")
abline(h=0)
[Figures: normal Q-Q plot and residuals-vs-fitted plot for the model on y (Time) and for the model on log y (LogTime)]
#install.packages('leaps')
library(leaps)
modelsel<-regsubsets(logy~x1+x2+x3+x4,data=surg,nbest=2)
summary(modelsel)
[Figure: regsubsets plot of R² for the best subsets of x1-x4]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='rsq',legend=F)
[Figure: subsets plot of R² (rsq) against subset size]
Implementation in R: Variable Selection
plot(modelsel, scale="adjr2")
[Figure: regsubsets plot of adjusted R²]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='adjr2', legend=F)
[Figure: subsets plot of adjusted R² against subset size]
Implementation in R: Variable Selection
plot(modelsel, scale="Cp")
[Figure: regsubsets plot of Cp]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='cp',legend=F)
[Figure: subsets plot of Cp against subset size]
Implementation in R: Variable Selection
plot(modelsel) ##scale=bic
[Figure: regsubsets plot of BIC]
Implementation in R: Variable Selection
car::subsets(modelsel, legend=F) ##statistics=bic
[Figure: subsets plot of BIC against subset size]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='adjr2',
legend=c(3.5, -37)) ##statistics=bic
[Figure: subsets plot of adjusted R² against subset size, with the legend placed inside the plotting region]
Implementation in R: Variable Selection
mod<-lm(logy~x1+x2+x3+x4,data=surg)
library(DAAG)
press(mod)
## [1] 0.1455741
Implementation in R: Comparing Two Models
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:DAAG':
##
##     hills
fit1 <- lm(logy ~ x1 + x2 + x3 + x4, data = surg)   # full model
dropterm(fit1, test = "F")
stepAIC(fit1)
## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Implementation in R: Automatic Search Procedure
stepAIC(fit1,direction="backward")
## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Implementation in R: Automatic Search Procedure
stepAIC(fit1,direction="both")
## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## + x4 1 0.00009 0.10977 -324.71
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Multicollinearity
Multicollinearity affects
◦ regression coefficients
◦ s{β̂}
◦ fitted values and predictors
◦ tests and confidence intervals
crew<-read.table("./Datasets/crew.txt", header=T)
cor(crew[,-3])
## x1 x2
## x1 1 0
## x2 0 1
Some Implementation in R
summary(lm(y~x1,data=crew))
##
## Call:
## lm(formula = y ~ x1, data = crew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.750 -3.750 0.125 4.500 6.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.500 10.111 2.324 0.0591 .
## x1 5.375 1.983 2.711 0.0351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.609 on 6 degrees of freedom
## Multiple R-squared: 0.5505,Adjusted R-squared: 0.4755
## F-statistic: 7.347 on 1 and 6 DF, p-value: 0.03508
Implementation in R
summary(lm(y~x2,data=crew))
##
## Call:
## lm(formula = y ~ x2, data = crew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.000 -4.688 -0.250 5.250 7.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.250 11.608 2.348 0.0572 .
## x2 9.250 4.553 2.032 0.0885 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 6 degrees of freedom
## Multiple R-squared: 0.4076,Adjusted R-squared: 0.3088
## F-statistic: 4.128 on 1 and 6 DF, p-value: 0.08846
Implementation in R
summary(lm(y~x1+x2,data=crew))
##
## Call:
## lm(formula = y ~ x1 + x2, data = crew)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 1.625 -1.375 -1.625 1.375 -2.125 1.875 0.625 -0.375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3750 4.7405 0.079 0.940016
## x1 5.3750 0.6638 8.097 0.000466 ***
## x2 9.2500 1.3276 6.968 0.000937 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.877 on 5 degrees of freedom
## Multiple R-squared: 0.958,Adjusted R-squared: 0.9412
## F-statistic: 57.06 on 2 and 5 DF, p-value: 0.000361
Implementation in R
vif(lm(y~x1+x2,data=crew))
## x1 x2
## 1 1
sqrt(vif(lm(y~x1+x2,data=crew)))>2 #problem
## x1 x2
## FALSE FALSE
Body Fat Example: Implementation in R
body<-read.table("./Datasets/body.txt", header=T)
summary(lm(y~x1,data=body))
##
## Call:
## lm(formula = y ~ x1, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1195 -2.1904 0.6735 1.9383 3.8523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4961 3.3192 -0.451 0.658
## x1 0.8572 0.1288 6.656 3.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.82 on 18 degrees of freedom
## Multiple R-squared: 0.7111,Adjusted R-squared: 0.695
## F-statistic: 44.3 on 1 and 18 DF, p-value: 3.024e-06
Implementation in R
summary(lm(y~x2,data=body))
##
## Call:
## lm(formula = y ~ x2, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4949 -1.5671 0.1241 1.3362 4.4084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.6345 5.6574 -4.178 0.000566 ***
## x2 0.8565 0.1100 7.786 3.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 18 degrees of freedom
## Multiple R-squared: 0.771,Adjusted R-squared: 0.7583
## F-statistic: 60.62 on 1 and 18 DF, p-value: 3.6e-07
Implementation in R
summary(lm(y~x1+x2,data=body))
##
## Call:
## lm(formula = y ~ x1 + x2, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9469 -1.8807 0.1678 1.3367 4.0147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.1742 8.3606 -2.293 0.0348 *
## x1 0.2224 0.3034 0.733 0.4737
## x2 0.6594 0.2912 2.265 0.0369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.543 on 17 degrees of freedom
## Multiple R-squared: 0.7781,Adjusted R-squared: 0.7519
## F-statistic: 29.8 on 2 and 17 DF, p-value: 2.774e-06
Implementation in R
summary(lm(y~x1+x2+x3,data=body))
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7263 -1.6111 0.3923 1.4656 4.1277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.085 99.782 1.173 0.258
## x1 4.334 3.016 1.437 0.170
## x2 -2.857 2.582 -1.106 0.285
## x3 -2.186 1.595 -1.370 0.190
##
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared: 0.8014,Adjusted R-squared: 0.7641
## F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
Implementation in R
cor(body[,-4])
## x1 x2 x3
## x1 1.0000000 0.9238425 0.4577772
## x2 0.9238425 1.0000000 0.0846675
## x3 0.4577772 0.0846675 1.0000000
vif(lm(y~x1+x2+x3,data=body))
## x1 x2 x3
## 708.84 564.34 104.61
sqrt(vif(lm(y~x1+x2+x3,data=body)))>2 #problem
## x1 x2 x3
## TRUE TRUE TRUE
Preliminary
Let i = 1, . . . , n denote subjects, individuals, items,
experimental units, etc., and j = 1, 2, . . . , p denote the variables
or characteristics measured for each individual. The jth
measurement for the ith subject will be denoted by xij . For n
measurements on p variables, the data are set up as follows:
Example
A selection of four receipts from the university bookstore was
obtained in order to investigate the nature of book sales. Each
receipt provided, among other things, the number of books sold
and the total amount of each sale. Let the first variable be the
total Cedi sales and the second variable be the number of
books sold. Then we can regard the corresponding numbers on
the receipts as four measurements on two variables.
Example (cont.)
Suppose the data, in tabular form, are
Solution
Using the notation just introduced, we have
yij = axij + b, i = 1, 2, . . . , n
yik = cxik + d, i = 1, 2, . . . , n
provided a and c have the same sign (ac > 0). What
happens if ac < 0?
Do not overestimate the importance of linear association
measures.
◦ They do not capture other kinds of association (eg.
quadratic relationship between variables)
◦ Correlation coefficient is very sensitive to outliers.
Matrix representation:
Sample mean: x̄ = (1/n) Σ_{i=1}^{n} xi = (x̄1, x̄2, . . . , x̄p)′
Sample covariance matrix: Sn = (sjk), with
sjk = (1/n) Σ_{i=1}^{n} (xij − x̄j)(xik − x̄k), i.e.
Sn =
s11 s12 . . . s1p
s21 s22 . . . s2p
 .   .        .
sp1 sp2 . . . spp
Sample correlation matrix: R = (rjk) with rjj = 1, i.e.
R =
1   r12 . . . r1p
r21 1   . . . r2p
 .   .        .
rp1 rp2 . . . 1
Sample Mean, Covariance, and Correlation as Matrix
Operations
x̄ = (1/n) X′1n
S = (1/(n − 1)) X′(In − (1/n)Jn)X
R = D^(−1/2) S D^(−1/2)
S = D^(1/2) R D^(1/2)
Examples
Find x̄, S, R.
TRY: Using the IRIS data set, find x̄, S, R for the entire
data and for each species.
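A sketch for the TRY exercise (the iris data ship with R; the per-species summaries use split() and lapply()):

data(iris)
colMeans(iris[, -5])    # overall sample mean vector x-bar
cov(iris[, -5])         # overall sample covariance matrix S
cor(iris[, -5])         # overall sample correlation matrix R
# the same summaries within each species
lapply(split(iris[, -5], iris$Species),
       function(d) list(mean = colMeans(d), S = cov(d), R = cor(d)))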
R Demonstration
data<-c(42, 52, 48, 58, 4, 5, 4, 3)
(x<-matrix(data,4,2))
## [,1] [,2]
## [1,] 42 4
## [2,] 52 5
## [3,] 48 4
## [4,] 58 3
colnames(x)<-c("x1","x2")
colMeans(x)
## x1 x2
## 50 4
##Descriptive Statistics
var(x) # or cov(x)
## x1 x2
## x1 45.33333 -2.0000000
## x2 -2.00000 0.6666667
apply(x,2,var)
R Demonstration
cor(x)
## x1 x2
## x1 1.0000000 -0.3638034
## x2 -0.3638034 1.0000000
(n <- nrow(x))          # number of observations
## [1] 4
(I <- diag(n))          # n x n identity matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
(J <- matrix(1, n, n))  # n x n matrix of ones
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 1 1 1 1
## [3,] 1 1 1 1
## [4,] 1 1 1 1
R Demonstration
#Mean#
(MeanX=1/n*t(x)%*%J[,1])
## [,1]
## x1 50
## x2 4
#Covariance#
(CovX=(1/n)*t(x)%*%(I-(1/n)*J)%*%x)
## x1 x2
## x1 34.0 -1.5
## x2 -1.5 0.5
#Correlation#
Sjj<-diag(CovX)
(Dhalf<-diag(sqrt(Sjj)))
## [,1] [,2]
## [1,] 5.830952 0.0000000
## [2,] 0.000000 0.7071068
R Demonstration
(DInvHalf<-solve(Dhalf))
## [,1] [,2]
## [1,] 0.1714986 0.000000
## [2,] 0.0000000 1.414214
(R <- DInvHalf %*% CovX %*% DInvHalf)   # correlation matrix R = D^(-1/2) S D^(-1/2)
## [,1] [,2]
## [1,] 1.0000000 -0.3638034
## [2,] -0.3638034 1.0000000
Introduction
A generalization of the familiar bell-shaped density to
several dimension plays a fundamental role in multivariate
analysis.
While real data are never exactly multivariate normal, the
normal density is often a useful approximation to the ‘true’
population distribution.
One advantage of the multivariate normal distribution is that it
has mathematically elegant properties.
The normal distributions are useful in practice for two
reasons:
◦ it is a bona fide model for practice
◦ the sampling distribution of a lot of statistics is
approximately normal (central limit effect)
The univariate normal distribution
f(x) = (1/√(2πσ²)) exp( −(1/2)((x − µ)/σ)² )
Introduction
X ∼ Np (µ, Σ).
Assessing normality
Many of the multivariate techniques depend on the
assumption of (multivariate) normality (except some large
sample theory results)
Observation i:  1     2     3     4     5     6     7     8     9     10
Value xi:      -1    -0.1  0.16  2.30  1.71  1.54  0.41  0.62  0.80  1.26
R Demonstration
qqnorm(sample)
qqline(sample)
[Figure: normal Q-Q plot of the sample with the qqline reference line]
The pairs of points lie very nearly along a straight line, and
we will not reject the notion that these data are normally
distributed, particularly with a sample size as small as
n = 10.
As a rule of thumb: do not trust this procedure in samples
of size n ≤ 20
There can be quite a bit of variability in the straightness of
the QQ-plot for small samples even when the observations
are known to come from a normal population.
Is the QQ-plot straight?
If it is straight, the correlation between the pairs would be 1.
We can test for correlation coefficient.
Test for Correlation Coefficient
Correlation coefficient
rQ = Σ_{j=1}^{n} (x(j) − x̄)(q(j) − q̄) / ( √(Σ_{j=1}^{n} (x(j) − x̄)²) √(Σ_{j=1}^{n} (q(j) − q̄)²) )
rQ = 8.584 / (√8.472 √8.795) = 0.994
The test for normality at the 10% level of significance is
provided by referring rQ = 0.994 to the table entry
corresponding to n = 10 and α = 0.10. This entry is
0.9351.
Since rQ > 0.9351, we do not reject the hypothesis of
normality.
As an alternative, the Shapiro-Wilk test statistics may be
computed.
For large sample sizes, the two statistics are nearly the
same so either can be used to judge lack of normality.
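rQ can be computed directly in R as the correlation between the ordered observations and the corresponding normal quantiles; a sketch using the sample vector from the qqnorm code above:

qq <- qqnorm(sample, plot.it = FALSE)   # x: theoretical quantiles, y: the data
cor(qq$x, qq$y)                         # rQ, to be compared with the table value
shapiro.test(sample)                    # the closely related Shapiro-Wilk test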
Multivariate: Gamma plot or Chi-square plot
Warnings:
◦ the χ2p is only an approximation, an F distribution can
be used instead.
◦ the quantities are not independent.
Construction of the Gamma Plot
Order the squared generalized distances:
d²(1) ≤ d²(2) ≤ · · · ≤ d²(n)
Graph the pairs
( χ²p((i − 1/2)/n), d²(i) )
Tabulate i, (i − 1/2)/n, d²(i), and the corresponding χ² quantiles.
[Figure: chi-square plot of the ordered d²(i) against χ²2 quantiles]
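A sketch of this chi-square plot in R using Mahalanobis distances, shown here for the four iris measurements (p = 4); any multivariate data matrix could be substituted:

X <- as.matrix(iris[, -5])
d2 <- mahalanobis(X, colMeans(X), cov(X))     # squared generalized distances
n <- nrow(X); p <- ncol(X)
q <- qchisq((1:n - 0.5)/n, df = p)            # chi-square quantiles at (i - 1/2)/n
plot(q, sort(d2), xlab = "Chi-square quantile", ylab = "Ordered squared distance")
abline(0, 1)                                  # points near this line support multivariate normality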
Construction of the Gamma Plot
X1<-c(6, 10, 8)
X2<-c(9, 6, 3)
(X<-cbind(X1, X2))
## X1 X2
## [1,] 6 9
## [2,] 10 6
## [3,] 8 3
mu0<-t(c(9,5))
n<-nrow(X)
p<-ncol(X)
alpha<-0.1
R Demonstration
xbar <- matrix(colMeans(X), p, 1)          # sample mean vector
S <- cov(X)                                # sample covariance matrix
dev <- xbar - t(mu0)                       # deviation from the hypothesized mean
(T2 <- n * t(dev) %*% solve(S) %*% dev)    # Hotelling's T2
## [,1]
## [1,] 0.7777778
F<-qf((1-alpha),p,n-p)
(CV<-(n-1)*p*F/(n-p))
## [1] 198
drop(T2) > CV   # compare Hotelling's T2 with the critical value
## [1] FALSE
R Demonstration
##Create a function to perform Hotelling's T^2###
"Hotelling" <- function(dat,mu,alpha){
# The data matrix is "dat".
# The mean vector is mu. (a list of numbers).
if(!is.matrix(dat)) dat=as.matrix(dat)
Hotelling=NULL
n = nrow(dat)
p = ncol(dat)
cm=matrix(colMeans(dat),p,1)
S = cov(dat)
sinv=solve(S)
mu0=matrix(mu,p,1)
dev=cm-mu0
T2 = n*(t(dev)%*%sinv%*%dev)
d2=n-p
tt=T2*d2/(p*(n-1))
pvalue=1-pf(tt,p,d2)
F=qf((1-alpha),p,n-p)
CV=(n-1)*p*F/(n-p)
Hotelling=cbind(Hotelling,c(T2,CV, pvalue))
row.names(Hotelling)=c("Hotelling-T2","Critical Value","p.value")
Hotelling
}
Example
## [,1]
## Hotelling-T2 9.73877286
## Critical Value 10.71860470
## p.value 0.06492834
The (1 − α)100% confidence region for µ consists of all µ satisfying
n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ ((n − 1)p/(n − p)) Fp,n−p(α)
with critical constant ((n − 1)p/(n − p)) Fp,n−p(α).
The ratio of their lengths, for αj = α/m, is
tn−1(α/2m) / √( (p(n − 1)/(n − p)) Fp,n−p(α) )
contain (µj , µk ).
Example[College Data]
Bonferroni intervals:
d̄j − tn−1(α/2g) √(s²dj/n) ≤ δj ≤ d̄j + tn−1(α/2g) √(s²dj/n)
Example: Measurement of biochemical oxygen demand (BoD)
and a suspended solids(SS) were obtained for n = 11 sample
splits from two laboratories. The data are displayed in the table
below.
We use the Effluent data set ignoring the pairs. The R script
’Behrens.R’ is used to perform the Behrens-Fisher Test.
source("./Scripts/Behrens.R")
x1<-effluent[,1:2]
x2<-effluent[,3:4]
Behrens(x1,x2)
Xℓi = µ + τℓ + eℓi
with i = 1, 2, . . . , nℓ and ℓ = 1, 2, . . . , g.
We take eℓi ∼ N(0, Σ) and E(Xℓi) = µ + τℓ, with µ the
overall mean and τℓ the group effect.
The model is overspecified and so we impose the
identifiability constraint Σ_{ℓ=1}^{g} nℓτℓ = 0. That is, all group
effects add up to zero and there are nℓ individuals in each
group.
The following terminology is sometimes useful
◦ systematic part: µ + τℓ
◦ random part: eℓi
MANOVA
Remark:
◦ The "between" matrix B is often denoted as H, the
"hypothesis" matrix.
◦ The "within" matrix W is often denoted as E, the
"error" matrix
Test Statistics
There are four statistics in common use.
Wilk's Lambda:
Λ = |W| / |B + W| = det(W(B + W)⁻¹)
  = Π_{j=1}^{p} λj = Π_{j=1}^{p} θj = Π_{j=1}^{p} 1/(1 + φj)
Pillai's Trace: tr(B(B + W)⁻¹) = Σ_{j=1}^{p} θj
Hotelling-Lawley Trace:
tr(BW⁻¹) = Σ_{j=1}^{p} φj = Σ_{j=1}^{p} θj/(1 − θj)
Note that T² is defined slightly differently here.
Roy's Greatest Root: the largest root of the equation
|B − φW| = 0:
φmax = θmax/(1 − θmax)
MANOVA Table for Comparing Population Mean
Vectors
H0 : τ 1 = τ 2 = · · · = τ g = 0
and is called the Wilk’s Lambda.
Example
anova2<-aov(x2~as.factor(trt))
summary(anova2)
## MANOVA
m<-manova(x[,2:3]~as.factor(trt))
summary(m) #Pillai's trace default
## MANOVA
summary(m,test='Wilks')
## Call:
## manova(resp ~ Fact1 + Fact2 + Fact1 * Fact2)
##
## Terms:
## Fact1 Fact2 Fact1:Fact2 Residuals
## resp 1 1.7405 0.7605 0.0005 1.7640
## resp 2 1.3005 0.6125 0.5445 2.6280
## resp 3 0.4205 4.9005 3.9605 64.9240
## Deg. of Freedom 1 1 1 16
##
## Residual standard errors: 0.3320392 0.4052777 2.014386
## Estimated effects may be unbalanced
summary(m1)
H0 : Σ1 = Σ2 = · · · = Σg = Σ
H1 : Σi ≠ Σj for some 1 ≤ i ≠ j ≤ g
Spooled = (1 / Σ_{ℓ=1}^{g} (nℓ − 1)) [(n1 − 1)S1 + (n2 − 1)S2 + · · · + (ng − 1)Sg]
v = (1/2) p(p + 1)(g − 1)
u = [ Σ_{ℓ=1}^{g} 1/(nℓ − 1) − 1/(Σ_{ℓ=1}^{g} (nℓ − 1)) ] (2p² + 3p − 1)/(6(p + 1)(g − 1))
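This equality-of-covariance hypothesis can be checked in R with Box's M test; a sketch assuming the heplots package is installed (the biotools package also provides a boxM() function):

library(heplots)                      # assumed available; provides boxM()
boxM(iris[, 1:4], iris$Species)       # Box's M test of equal covariance matrices across species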
PCA
Consider a data set with p response variables. The
summary information of this data will consist of the
following pieces:
◦ p sample means
◦ p sample standard deviations; equivalently, p sample
variances (i.e., the diagonal of the covariance matrix)
◦ p(p − 1)/2 sample covariances (i.e., the off-diagonal elements
of the covariance matrix); equivalently, the correlations
could be computed from this covariance matrix.
Principal Component are used for several tasks:
◦ Data Reduction: When the data can be reduced from
p to a fewer variables, the manipulation of the data
becomes easier. This is true when the data have to be
used as input to other procedures such as regression
model.
◦ Display of data: Of course, p > 3 variables cannot be
plotted at once. Usually pairs of variables are plotted
against each other. At best, fancy graphical systems
allow rotating graphs of three variables.
◦ Interpretation: It is often hoped that the original p
variables can be replaced by, for example, a smaller set of k
new variables that have an intuitive interpretation. When
the interpretation is of greater scientific interest than the
original variables, insight might be gained.
◦ Inference: The above tasks refer to manipulating a sample
of n objects. Often, one wants to make statements
about the population. Then, inferential tools have to be
invoked. Of course, data reduction will always lead to
information reduction or information loss.
In summary, one wants to reduce p original variables to k
new variables, where k is a compromise between:
◦ minimum number of variables
◦ maximum amount of information retained
PCA
Do not
◦ overestimate the importance of principal components
analysis.
◦ over-interpret principal components analysis.
They can be useful:
◦ in exploratory phase of a data analysis (to learn about
the structure of the data)
◦ as input to other statistical procedures.
Population principal component
Concepts can be developed for both the population level (ie
based on the theoretical distribution) or for the sample level
(ie based on the simple random sample to be studied).
In virtually all cases, the two theories are dual to each
other, although the assumptions involved and the
generality of the conclusion can be different.
Consider a random vector X = (X1, . . . , Xp) with
covariance matrix Σ and correlation matrix ρ.
We want to apply a linear transformation of X in order to
obtain Y = (Y1, . . . , Yp):
Y = LX or Yj = ℓj′X (j = 1, 2, . . . , p)
Population PCA
We require that:
◦ the Yj are uncorrelated.
◦ the Yj have maximal variance.
◦ the coefficient vectors ℓj have unit length.
The last requirement is formalized as follows:
||ℓj||² = ℓj′ℓj = Σ_{k=1}^{p} ℓjk² = 1
These requirements can be translated into an iterative
procedure (a numerical sketch follows below):
Y1 = ℓ1′X with maximal variance, subject to ℓ1′ℓ1 = 1
Y2 = ℓ2′X with maximal variance, subject to
ℓ2′ℓ2 = 1 and cov(Y1, Y2) = 0
. . .
Yj = ℓj′X with maximal variance, subject to
ℓj′ℓj = 1 and cov(Y1, Yj) = · · · = cov(Yj−1, Yj) = 0
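This construction amounts to an eigendecomposition of the covariance (or correlation) matrix; a small numerical sketch, using the sample covariance of the iris measurements as a stand-in for Σ, showing that the ℓj are the eigenvectors and the component variances are the eigenvalues:

S <- cov(iris[, -5])       # sample covariance matrix
e <- eigen(S)              # eigendecomposition of S
e$values                   # Var(Y1) >= Var(Y2) >= ... : the component variances
e$vectors                  # columns are the coefficient vectors l_j
colSums(e$vectors^2)       # each equals 1: the unit-length constraint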
Mathematical argument
Covariance or Correlation?
◦ The original variables could have different units. Then
we are comparing apples and oranges.
◦ Original variables could have widely varying standard
deviation.
In both cases, the analysis is driven by the variable with
large standard deviation.
kilometer or millimeter (×106 ) ⇒ principal components will
be pulled towards the variable in millimeters.
A solution is provided by using the correlation matrix
instead. This means that all variables are replaced by
their standardized versions.
Each variance then equals 1 and the total variability is equal to p.
Example
##
## Attaching package: ’psych’
## The following object is masked from ’package:DAAG’:
##
## cities
## The following objects are masked from ’package:ggplot2’:
##
## %+%, alpha
## The following object is masked from ’package:car’:
##
## logit
attach(iris)
sapply(iris[,-5],mean)*10
sapply(iris[,-5],sd)*10
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 20.5626888 4.92616228 2.79659615 1.543861813
## Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
## Cumulative Proportion 0.9246187 0.97768521 0.99478782 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 0.657 0.582 0.315
## Sepal.Width 0.730 -0.598 -0.320
## Petal.Length 0.857 -0.173 -0.480
## Petal.Width 0.358 -0.546 0.754
(eigval<-(pcacov$sdev)^2)
[Figures: scree plots (bar and line versions) of the component variances of pcacov against component number]
R Demonstration
loadings(pcacov)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 0.657 0.582 0.315
## Sepal.Width 0.730 -0.598 -0.320
## Petal.Length 0.857 -0.173 -0.480
## Petal.Width 0.358 -0.546 0.754
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
sum(eigval)   # total variance = sum of the component variances
## [1] 457.2957
sum((pcacov$loadings[,1])^2)
## [1] 1
Observations
pca<-princomp(iris[,-5]*10,scores=T,cor=F)
head(pca$scores)
[Figure: pairs plot of the principal component scores (Comp.1-Comp.4) with observation labels and the original variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width overlaid]
R Demonstration
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497
## Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
## Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000
#pairs(pca_cor$scores)
#correlation between original variables and principal components
cor(iris[,-5], pca_cor$scores)
## [1] 4
R Demonstration: PCA based on correlation
par(mfrow=c(1,2))
screeplot(pca_cor,type="lines",main="Scree Plot")
biplot(pca_cor,choices=1:2,main="Plot of PC1 and PC2")
[Figures: scree plot of pca_cor and biplot of the first two principal components (PC1 and PC2)]
Exploration of the Principal Components
One major observation from the graphs is that there is a "gap"
between the observations: they give a bimodal
impression.
This is explained by the fact that the principal components
were calculated on all 150 observations, ignoring the
SPECIES variable.
When no formal inference is needed, but the principal
components are only applied to summarize and simplify
the data, the procedure is valid.
When formal inference is required, such a MARGINAL
PCA procedure would be very much unacceptable.
SOLUTIONS: Calculate PCs for each species separately
or calculate PCs on transformed data (PCA by groups),
that are corrected for the effect of SPECIES (“partialled
out" PCA).
Exercise
(1) In the analysis above (based on correlation matrix)
(a) plot the third and fourth principal components. Take a
look at the scales on the axes of this and the plot
based on covariance matrix. What do you observe?
(b) Plot Sepal.Length versus the first principal
component. Is the orientation expected?
(2) Perform principal components analysis for each of the
species separately using
(a) covariance matrix
(b) correlation matrix