
STAT630: Advanced Data Analysis

Lecture Notes/ Part II

Lecturer
Samuel Iddi, PhD

Department of Statistics
University of Ghana
siddi@ug.edu.gh

February 21, 2019


Learning Objectives:
Develop a deeper understanding of the theory of the linear regression model.

List and explain the assumptions underlying linear regression models.

Fit regression models to real-life data.

Understand estimation strategies for the linear model.

Perform uncertainty assessments.

Learning Objectives:
Derive confidence and prediction intervals.

Assess the suitability of a regression model by performing model diagnostics.

Apply remedial measures to correct departures from the model.

Become familiar with techniques for binary and count data.

Implement and apply the above procedures in R.

Interpret results from computer output.


Learning Objectives:
 Understand problems associated with multi-dimensional
data.
 Study basic multivariate distribution theory and methods.
 Discuss some fundamental and important multivariate
statistical techniques.
Understand inference and how to apply it to real-life problems arising from various scientific fields.
 Learn some methods for data reduction.
 Implement and apply the techniques in R.
Course Syllabus:
 Introduction: general context, linear algebra.

 Linear regression models: simple, multiple regression.

Estimation methods: least squares and likelihood estimators.

Testing: hypothesis testing and the ANOVA test.

Diagnosis: formal and informal tests for linearity, homoscedasticity, Gaussianity, and independence (residual analysis).
Course Syllabus:
Remedial measures: transformations, Box-Cox.

Model selection: sequential (backward/forward/stepwise), information criteria (AIC/BIC/Cp), best subset procedure.

Identifying outliers.

Multicollinearity: VIF.

 Multivariate analysis

 Comparison of mean vectors

 Multivariate analysis of variance

 Principal component analysis


Textbook:
1 Kutner, M. H., Nachtsheim, C. J., Neter, J. and Li, W.
(2005). Applied Linear Statistical Models. 5th Ed. New
York: McGraw-Hill Irwin.
2 Rencher, A. C. and Schaalje, G. B. (2008). Linear Models
in Statistics. 2nd Ed. New Jersey: John Wiley & Sons, Inc.
3 Johnson, R. A. and Wichern, D. W. (2007). Applied
Multivariate Statistical Analysis. 3rd Ed. Prentice-Hall.
Reference:
1 Maindonald, J. and Braun, J. (2003). Data Analysis and
Graphics Using R. New York: Cambridge University Press.
2 Chambers (2008). Software for Data Analysis. Springer.
3 Venables W.N. and Ripley B.D (1997). Modern Applied
Statistics with S-PLUS. 2nd Ed. New York: Springer.
Office hour:
Wednesday: 9:30am to 1:00pm or by appointment.
Phone Number: 0267603243, Office: 208
E-mail: siddi@ug.edu.gh; isamuel.gh@gmail.com
(Sending an email is the easiest way to contact me)
Personal Website: www.samueliddi.com

Teaching Assistants
TA:
Tutorial:
Phone number:
E-mail:
Grading
Homework Assignments (20%), Interim Assessment (30%) and
Final Exams (50%).
Guidelines
Homework should be submitted one week from the day it is assigned.
R program code will be submitted via email.
Late submissions will not be accepted.
Duplicate solutions will not be graded.
The TA will discuss solutions during tutorial hours.
Interim assessment could take the form of a project, presentation and defense.
Computing
The main software package for this course is regular R version
2.3.1 (or higher). Install R by visiting the website:
www.r-project.org. You may also use RStudio, but only after you have first installed R.
Regression
 Establish a relationship (often linear) between dependent
(or response) variable Y and independent (or predictor)
variable X .
 We observe pairs (X , Y ).
 A statistical relationship between the two variables is given
by
Y = f (X ) + ε
where ε is the error committed by using f (X ) to
approximate Y .
 If scatterplot of (X , Y ) is approximately linear,
f (X ) = β0 + β1 X is chosen.
 Y = β0 + β1 X + ε is called a linear regression model used
to describe the relationship between Y and X .
 Parameters β0 and β1 are the intercept and slope of the
straight line respectively.
Regression
If f(X) is nonlinear, then

Y = f(X) + ε

is a nonlinear regression model.

For example, the growth model

Y = a(1 − exp(bX))^c + ε

Formally, suppose we observe the pairs (Yi , Xi) for i = 1, 2, . . . , n. The simple linear regression model is given by

Yi = β0 + β1 Xi + εi .
 Model assumptions:
◦ The values Xi are precisely known.
◦ Yi is a continuous variable and is random.
Regression
Model assumptions:
◦ β0 , β1 and σ are parameters, which are
† unknown,
† constant and not random,
† not dependent on the trial number.
◦ εi is a random error term and is not observable.
◦ εi ∼ N(0, σ²) for all i. That is, E(εi) = 0 and Var(εi) = σ².
◦ For any two different trials i and j, εi and εj are independent.
 Since β0 , β1 and σ 2 are unknown, we estimate them from
the data.
 This gives Ŷi = βˆ0 + βˆ1 Xi where β̂0 estimates β0 and β̂1
estimates β1 .
 Also, σ̂ 2 estimates σ 2 .
Estimation
Least squares estimators:

β̂1 = SSXY / SSXX = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
    = [ΣXiYi − (ΣXi)(ΣYi)/n] / [ΣXi² − (ΣXi)²/n]

β̂0 = Ȳ − β̂1 X̄

where β̂0 and β̂1 are unbiased estimators of β0 and β1 respectively.

Unbiased estimator of σ²: s² = MSE = SSE/(n − 2), where

SSE = SSYY − (SSXY)² / SSXX

β̂0 and β̂1 are random variables and are normally distributed.

Thus, E(β̂0) = β0 , Var(β̂0) = σ² (1/n + X̄²/SSXX), E(β̂1) = β1 and Var(β̂1) = σ²/SSXX .
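As a minimal sketch, these estimates can be computed directly in R (hypothetical x and y vectors) and checked against lm():

x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)   # hypothetical predictor values
y <- c(2.1, 4.0, 4.9, 7.6, 8.1, 9.8)   # hypothetical responses
n    <- length(y)
SSxx <- sum((x - mean(x))^2)
SSxy <- sum((x - mean(x)) * (y - mean(y)))
b1   <- SSxy / SSxx                    # slope estimate
b0   <- mean(y) - b1 * mean(x)         # intercept estimate
MSE  <- sum((y - (b0 + b1 * x))^2) / (n - 2)   # unbiased estimate of sigma^2
c(b0 = b0, b1 = b1, MSE = MSE)
coef(lm(y ~ x))                        # should agree with b0 and b1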
Hypothesis Testing and Confidence Intervals

If β1 = 0, there is no (linear) relationship between X and Y.

To test the hypothesis H0 : β1 = c, we use the test statistic

(β̂1 − c) / √( s² / Σ(Xi − X̄)² ) ∼ tn−2 .

The (1 − α)100% CI for β1 is given by

β̂1 ± t1−α/2,n−2 √( s² / Σ(Xi − X̄)² )

where t1−α/2,n−2 is the upper α/2 quantile of the t-distribution with n − 2 degrees of freedom.
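Continuing the sketch above, the t-test and CI for β1 are available from lm() — a minimal sketch, assuming the hypothetical x and y vectors already defined:

fit <- lm(y ~ x)                 # x, y from the sketch above
summary(fit)$coefficients        # estimates, standard errors, t statistics, p-values
confint(fit, level = 0.95)       # CIs based on t with n - 2 degrees of freedom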
Hypothesis Testing and Confidence Intervals
Similarly, we can construct a CI and test hypotheses for β0 .
The value of the fitted line at X = xh is Ŷh = β̂0 + β̂1 xh .
Ŷh estimates the mean response, with

E(Ŷh) = β0 + β1 xh

and

Var(Ŷh) = σ² ( 1/n + (xh − X̄)² / SSXX ).

The CI for the mean response is

Ŷh ± t1−α/2,n−2 √( s² ( 1/n + (xh − X̄)² / SSXX ) ).
Prediction of new observations

To predict a new observation at X = xh , use

Ŷh = β̂0 + β̂1 xh

with prediction variance

σ² ( 1 + 1/n + (xh − X̄)² / SSXX ).

An interval can again be constructed based on the t-distribution with n − 2 df; such an interval is called a prediction interval.
Analysis of Variance
Another tool for interpreting results from a linear regression is the ANOVA.

The total variation of Y,

SSYY = Σ(Yi − Ȳ)²,

is sometimes called the "sum of squares total" (SSTO) or the "sum of squares total (corrected)" ["corrected" means the Y values are corrected to have mean zero].

SSTO is partitioned into different sources, i.e.

SSTO = SSE + SSR

Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)²

where SSE is the "sum of squares of residuals" (unexplained variability) and SSR is the "sum of squares due to regression" (explained variability accounted for by the regression).
Summary of Partition

Table: ANOVA Table

SOV          df      SS     MS                  E(MS)             F
Regression   1       SSR    MSR = SSR/1         σ² + β1² SSXX     MSR/MSE
Error        n − 2   SSE    MSE = SSE/(n − 2)   σ²
Total        n − 1   SSTO

 SOV = Source of Variation


 df = degrees of freedom
 SS = Sum of Squares
 MS = Mean Sum of Squares
F ∼ F1,n−2 under HO (i.e. when β1 = 0).
The F-test of HO : β1 = 0 versus HA : β1 ≠ 0 tests the same hypothesis as the t-test, with F = t².
Analysis of Variance
Observe that E(MSR)/E(MSE) = (σ² + β1² SSXX) / σ². Thus,
◦ if β1 is close to 0, this ratio is close to 1;
◦ conversely, if β1 ≠ 0, the ratio is larger than 1.

Note that the F-test is automatically two-sided and cannot be used to test HA : β1 < 0 or HA : β1 > 0.

General linear model test: we can also test HO : β1 = 0 using a test based on the reduction in the sum of squares.

How much is a model improved by adding another explanatory variable?

Full model: Yi = β0 + β1 Xi + εi gives

SSE(F) = Σ(Yi − Ŷi)² = SSE.

Reduced model: Yi = β0 + εi yields

SSE(R) = Σ(Yi − Ȳ)² = SSTO.
General linear model test
To test whether the full model does better than the reduced model, we can use a different form of the F-test,

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ] ∼ FdfR−dfF, dfF .

If F* is too large, we conclude that the full model fits significantly better than the reduced model; thus, β1 ≠ 0.
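A minimal sketch of this test in R with anova(), assuming a hypothetical data frame dat with columns y and x:

full    <- lm(y ~ x, data = dat)     # full model
reduced <- lm(y ~ 1, data = dat)     # reduced (intercept-only) model
anova(reduced, full)                 # F* compares the reduction in SSE

# the same statistic computed from the definitions
SSE_F <- deviance(full);    dfF <- df.residual(full)
SSE_R <- deviance(reduced); dfR <- df.residual(reduced)
Fstar <- ((SSE_R - SSE_F) / (dfR - dfF)) / (SSE_F / dfF)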

Coefficient of determination R²: measures the proportion of the total variability in Y that is explained by the linear regression model.

It is given by

r² = SSR/SSTO = (SSTO − SSE)/SSTO = 1 − SSE/SSTO

r = SSXY / √(SSXX SSYY) is the correlation coefficient.
Coefficient of determination

Note that
 0 ≤ r 2 ≤ 1. Why?

If all points fall exactly on a line (with β1 ≠ 0) then r² = 1.

 If β1 = 0 then r 2 = 0.
Multiple regression

 For more than one predictor variables.

 Let X1 , X2 , . . . , Xp−1 be predictor variables

Model: Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βp−1 Xi,p−1 + εi
◦ β0 , β1 , . . . , βp−1 are unknown parameters
◦ Xi1 , Xi2 , . . . , Xi,p−1 are known constants
◦ εi ∼ N(0, σ²) independently, i = 1, 2, . . . , n

 Matrix Notation: Y = X β + ε
◦ Y is n × 1
◦ X is n × p
◦ β is p × 1
◦ ε ∼ N(0, σ 2 I) is n × 1
Categorical and Continuous Covariates
 Consider the following example:
◦ Y length of stay in hospital ⇒ continuous response
◦ Age ⇒ continuous variable
◦ Gender ⇒ two-level categorical variable
◦ Disability status ⇒ three-level categorical variable
 Create DUMMY variables for categorical variables.
Let X1 = Age,

X2 = 1 if male, 0 if female

X3 = 1 if not disabled, 0 otherwise

X4 = 1 if partially disabled, 0 otherwise

Model: Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 Xi4 + εi
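A minimal sketch in R, assuming a hypothetical data frame hosp with columns stay, age, gender and disability:

# explicit dummy variables, matching X2-X4 above
hosp$X2 <- as.numeric(hosp$gender == "male")                    # dummy for gender
hosp$X3 <- as.numeric(hosp$disability == "not disabled")        # first disability dummy
hosp$X4 <- as.numeric(hosp$disability == "partially disabled")  # second disability dummy
fit_dummy <- lm(stay ~ age + X2 + X3 + X4, data = hosp)

# equivalently, declare the variables as factors and let lm() build the dummies;
# the omitted (reference) level is absorbed into the intercept
fit_factor <- lm(stay ~ age + factor(gender) + factor(disability), data = hosp)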


Estimation of regression coefficients
Least squares and maximum likelihood lead to the same estimator,

β̂ = (X′X)−1 X′Y

Fitted values and residuals:
◦ Ŷ = X β̂
◦ Residuals: ei = Yi − Ŷi , or e = Y − Ŷ
◦ Ŷ = HY, where H = X(X′X)−1 X′ is the hat matrix
◦ e = (I − H)Y
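A minimal sketch of these matrix formulas in R, assuming hypothetical numeric vectors y, x1 and x2 of equal length:

X    <- cbind(1, x1, x2)                    # design matrix with an intercept column
beta <- solve(t(X) %*% X) %*% t(X) %*% y    # beta-hat = (X'X)^(-1) X'y
H    <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
yhat <- H %*% y                             # fitted values Y-hat = H y
e    <- (diag(length(y)) - H) %*% y         # residuals e = (I - H) y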

Table: ANOVA Table

SOV          df      SS     MS                  F
Regression   p − 1   SSR    MSR = SSR/(p − 1)   MSR/MSE
Error        n − p   SSE    MSE = SSE/(n − p)
Total        n − 1   SSTO
ANOVA

Test for the regression coefficients

HO : β1 = β2 = · · · = βp−1 = 0
HA : βk ≠ 0 for at least one k

Test statistic

F* = MSR/MSE ∼ Fp−1,n−p

Decision rule

if F* ≤ F(1 − α; p − 1, n − p), conclude HO
if F* > F(1 − α; p − 1, n − p), conclude HA
Inference about the regression parameters
E(β̂) = β and the estimated covariance matrix is s²{β̂} = MSE (X′X)−1
Test for βk
◦ Hypotheses: HO : βk = 0 vs HA : βk ≠ 0
◦ Test statistic: t* = β̂k / se(β̂k)
◦ Decision rule: if |t*| ≤ t1−α/2,n−p conclude HO ; if |t*| > t1−α/2,n−p conclude HA
Interval estimation of βk :

β̂k ± t1−α/2,n−p se(β̂k)

Joint inferences
◦ Bonferroni joint confidence intervals: β̂k ± B se(β̂k), where B = t1−α/2g,n−p
◦ Scheffé joint confidence intervals: β̂k ± S se(β̂k), where S² = g F1−α;g,n−p (preferred for large g)
Inference about mean response
E(Yh) = Xh′ β and the estimator is Ŷh = Xh′ β̂
σ²{Ŷh} = Xh′ σ²{β̂} Xh and s²{Ŷh} = Xh′ s²{β̂} Xh
(1 − α)100% confidence limits for the mean response:
Ŷh ± t1−α/2,n−p se(Ŷh)
Confidence region for the regression surface (Working-Hotelling band):
Ŷh ± W se(Ŷh), where W² = p F(1 − α; p, n − p)
Simultaneous CIs for several mean responses:
◦ use the Working-Hotelling band, or
◦ use Bonferroni (g intervals), Ŷh ± B se(Ŷh), where B = t1−α/2g,n−p
Coefficient of multiple determination

Coefficient of multiple determination:

R² = SSR/SSTO = 1 − SSE/SSTO

R² usually gets larger as more predictors are included.

A modified measure is the adjusted coefficient of multiple determination,

Ra² = 1 − [ SSE/(n − p) ] / [ SSTO/(n − 1) ] = 1 − [ (n − 1)/(n − p) ] (SSE/SSTO)

Multiple correlation: R = √R²
Departures from the model
Models must be checked against their assumptions.

Linearity: the regression function is not linear.

Homoscedasticity: the error terms do not have constant variance.

Independence: the error terms are not independent.

Outliers: the fit is right except for some outliers.

Normality: the error terms are not normally distributed.

Model extension: important variables are not in the model.

How do we check the assumptions?
◦ residual analysis
◦ formal tests
◦ lack-of-fit test
Residual analysis
Residuals: ei = Yi − Ŷi . Why?

Standardized residuals:

ei* = (ei − ē) / √MSE ,

where ē = 0. These make it easier to detect outliers; values greater than 3 or less than −3 are regarded as "unusual" observations.

Diagnostic plots are important to tell whether regression is even appropriate.

Univariate plots of X and Y
◦ look for outliers
◦ examine the shape of the distribution

Boxplots, stem-and-leaf plots, histograms, dot plots, and time plots can be used.
Residual analysis

Bivariate plot of X versus Y
◦ is the relationship linear? nonlinear?
◦ are there two-dimensional outliers?
◦ is the assumption of constant variance reasonable?

Bivariate plot of residuals versus X
◦ useful for detecting nonlinearity
◦ a nonlinear pattern is an indication of a nonlinear relationship between X and Y
◦ a "random" pattern suggests a linear relationship and no outliers
◦ a megaphone pattern may be an indication of nonconstant variance

Bivariate plot of residuals versus Y?


Residual analysis

Plot of residuals versus time of collection or spatial arrangement → any pattern indicates a lack of independence.

Histogram, normal QQ-plot, or normal PP-plot of the residuals
◦ are the residuals normally distributed?
◦ if they are, the points lie approximately on a straight line in both the QQ-plot and the PP-plot.
◦ non-normality shows up when observations in the tails of the distribution fall far from the straight line.
Formal Test

Test for randomness (independence): Durbin-Watson test

Consider the simple linear regression Yi = β0 + β1 xi + εi with εi = ρ εi−1 + µi , |ρ| < 1, µi ∼ N(0, σ²), where ρ is the autocorrelation parameter.

Test the hypothesis HO : ρ = 0 versus HA : ρ > 0.

Test statistic:

D = Σi=2..n (ei − ei−1)² / Σi=1..n ei²

Decision: if D > dU , do not reject HO ; if D < dL , reject HO ; if dL ≤ D ≤ dU , the test is inconclusive.
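A minimal sketch in R, assuming a fitted lm object reg and the car package (used elsewhere in these notes):

library(car)
durbinWatsonTest(reg)            # H0: rho = 0 against positive autocorrelation

# the statistic computed directly from the residuals
e <- residuals(reg)
D <- sum(diff(e)^2) / sum(e^2)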
Formal test

Test for constancy of variance: (1) Levene test

Split the sample into two groups of sizes n1 and n2 (e.g. at the median of X).

Compute di1 = |ei1 − ẽ1| and di2 = |ei2 − ẽ2|, where ẽ1 and ẽ2 are the medians of the residuals in the two groups.

Test statistic: a two-sample t-test,

t* = (d̄1 − d̄2) / ( sp √(1/n1 + 1/n2) ) ∼ tn−2

where sp² = [ Σ(di1 − d̄1)² + Σ(di2 − d̄2)² ] / (n − 2).

Decision: if |t*| > t1−α/2,n−2 , reject constancy of variance at significance level α; otherwise, do not reject.
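A minimal sketch in R, assuming a fitted simple regression reg with predictor x; car::leveneTest() (used later in these notes for grouped data) is an alternative:

e  <- residuals(reg)
g  <- x <= median(x)                      # split the cases at the median of x
d1 <- abs(e[g]  - median(e[g]))           # absolute deviations, group 1
d2 <- abs(e[!g] - median(e[!g]))          # absolute deviations, group 2
t.test(d1, d2, var.equal = TRUE)          # pooled two-sample t-test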
Formal test
Test for constancy of variance: (2) Breusch-Pagan test

Assumes the εi are independent and normally distributed.

Requires a large sample size.

Procedure:
◦ use the relationship ln σi² = γ0 + γ1 Xi
◦ HO : γ1 = 0 versus HA : γ1 ≠ 0
◦ regress ei² against Xi and obtain the regression sum of squares SSR*

Statistic:

χ²BP = (SSR*/2) / (SSE/n)² ∼ χ²1 ,

where SSE is the error sum of squares from the original regression of Y on X. With q predictors in the variance function, the statistic is referred to a χ²q distribution.
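A minimal sketch in R, assuming a fitted lm object reg; car::ncvTest() is used later in these notes, and lmtest::bptest() is an alternative (assuming the lmtest package is installed):

library(car)
ncvTest(reg)                              # score test for non-constant variance

library(lmtest)                           # assumes lmtest is installed
bptest(reg, studentize = FALSE)           # classical Breusch-Pagan form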
Test for Normality
Procedure
Fit the regression line and compute the residuals.

Calculate the expected residuals under normality,

√MSE · Φ⁻¹( (i − 0.375) / (n + 0.25) ),

where i is the rank of the residual and Φ⁻¹ is the standard normal quantile function.

Compute the correlation coefficient between the ordered model residuals and the expected residuals under normality.

Compare it with the table of critical points available in most statistical textbooks.
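A minimal sketch of this correlation test in R, assuming a fitted lm object reg:

e   <- residuals(reg)
n   <- length(e)
MSE <- sum(e^2) / df.residual(reg)
# expected residual for each observation, matched by rank
exp_e <- sqrt(MSE) * qnorm((rank(e) - 0.375) / (n + 0.25))
cor(e, exp_e)   # compare with tabulated critical values (e.g. in Kutner et al.)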
Other tests
Shapiro-Wilk
Kolmogorov-Smirnov (widely used)
Lilliefors
Lack of fit

Testing method
A test to check whether the linear regression model is appropriate.

That is, HO : E(Y) = β0 + β1 X versus HA : E(Y) ≠ β0 + β1 X.

This test procedure is only possible if repeated responses are available at some values of X.

The responses Y at a given X are assumed to be
1. independent,
2. normally distributed,
3. of the same variance, σ².
Remedial measures

If the linear model is not appropriate, then
◦ use a more appropriate model, or
◦ employ some transformation and analyze the transformed data.

 Linearity
◦ modify the regression model
◦ use transformation on X and Y

 Homoscedasticity
◦ methods of weighted least squares
◦ variance stabilization transformation

 Independence
◦ use time series regression model
◦ use generalized least squares
Remedial measures

 Outliers
◦ discard outliers (be careful!)
◦ add interaction with independent variables
◦ use robust estimation method (median or absolute
deviation)

 Normality
◦ use generalized linear model
◦ use transformation on Y
◦ rank regression
◦ non-parametric regression

 Lack of fit
◦ include more independent variables
Transformation
Linearizing transforms
Plot the data.

If a non-linear relationship exists between X and Y, pick a transformation that tries to produce a straight line, e.g.

◦ X′ = log(X)    X′ = √X
◦ X′ = X²        X′ = exp(X)
◦ X′ = 1/X       X′ = exp(−X)
Non-normality or Unequal error variance
Often nonconstant variance occurs along with nonlinearity.
Transform the response using, e.g.,

◦ Y′ = √Y
◦ Y′ = log(Y)
◦ Y′ = 1/Y
Special cases
Count data

√Y = β0 + β1 X is often a good starting point.
A slightly better version of this is the Freeman-Tukey transform for stabilizing the variance,

√Y + √(Y + 1) = β0 + β1 X

You must be able to interpret your scientific question in terms of parameters for the transformed data.
Proportion data

Use the arcsine-square-root transform: arcsin(√Y).
A more modern approach is the logit transformation,

logit(Y) = log( Y / (1 − Y) )
Box-Cox Family of Transformations
Corrects for skewness, non-normality, unequal variance, and nonlinearity.
Procedure: transform the response to

Y′ = Y^λ       if λ ≠ 0
Y′ = log(Y)    if λ = 0 (by definition)

Estimates β̂0 , β̂1 , σ̂² and λ̂ are obtained by maximizing the likelihood function

∏i=1..n f(Yi^λ ; xi , β0 , β1 , σ²) = (2πσ²)^(−n/2) exp[ −(1/(2σ²)) Σi=1..n (Yi^λ − β0 − β1 Xi)² ]

One can show that the ML estimate λ̂ equals

λ̂ML = arg minλ SSE(W | X)

where

W = K1 (Y^λ − 1)   if λ ≠ 0
W = K2 log(Y)      if λ = 0

for the specific constants K1 = 1 / (λ K2^(λ−1)) and K2 = (∏i=1..n Yi)^(1/n).
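A minimal sketch of estimating λ in R, assuming a fitted lm object reg with a strictly positive response; car::boxCox() is demonstrated later in these notes:

library(MASS)
bc <- boxcox(reg, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
(lambda_hat <- bc$x[which.max(bc$y)])   # lambda maximizing the profile log-likelihood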
Demonstration in R
Example: MgCO3 content of sand
The skeletal chemical composition is believed to be related to
the mean water temperature. A simple linear regression is fitted
to the following data.

(sand<-read.table("./Datasets/sand.txt", header=T))

## mgco3 temp
## 1 9.2 17.5
## 2 9.2 21.0
## 3 9.4 20.0
## 4 9.0 15.3
## 5 8.5 14.0
## 6 8.5 13.1
## 7 8.8 13.3
## 8 8.5 13.0
## 9 9.3 19.0
## 10 9.0 18.7
Demonstration in R

reg<-lm(mgco3 ~ temp, data = sand)


summary(reg)

##
## Call:
## lm(formula = mgco3 ~ temp, data = sand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.192410 -0.145541 0.008722 0.147426 0.178326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.30034 0.30597 23.859 1.01e-08 ***
## temp 0.09943 0.01827 5.443 0.000614 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1697 on 8 degrees of freedom
## Multiple R-squared: 0.7874,Adjusted R-squared: 0.7608
## F-statistic: 29.63 on 1 and 8 DF, p-value: 0.0006136
Diagnostics in R
par(mfrow=c(1,2))
plot(reg$fitted.values, reg$residuals, xlab="Fitted Values",
ylab="Residuals")
library(car)

## Loading required package: carData

qqPlot(reg$residuals,xlab="Normal Quantiles", ylab="Residuals")


[Figure: residuals vs fitted values (left) and normal QQ-plot of the residuals (right)]


Diagnostics plots in R
We can also use the plot() function on the output of the
lm().

par(mfrow=c(1,2));plot(reg,1:2)

[Figure: Residuals vs Fitted (left) and Normal Q-Q (right) plots for reg]


Diagnostics plots in R
par(mfrow=c(1,1));plot(reg, 3)

[Figure: Scale-Location plot for lm(mgco3 ~ temp)]
Understanding diagnosis plot

Residuals vs Fitted
◦ Shows whether there are non-linear patterns not captured by the model.
◦ If there is no distinctive pattern in the plot, non-linearity is not a problem.
Normal Q-Q plot
◦ Shows whether the residuals are normally distributed.
◦ If the residuals follow an approximately straight line (it need not be perfect), there is no serious deviation from normality.
◦ Severe deviation from the straight dashed line is cause for concern.
Understanding diagnosis plot

Scale-Location/Spread-Location plot
◦ Shows whether the residuals are spread equally along the range of the predictors.
◦ Used to check the homoscedasticity assumption.
◦ A horizontal line with equally (randomly) spread points is indicative of equal variance.
Formal test in R
Formal Test for Normality and Constancy of Variance
shapiro.test(reg$residuals) ##Test for Normality

##
## Shapiro-Wilk normality test
##
## data: reg$residuals
## W = 0.82427, p-value = 0.02854

ncvTest(reg) #Breusch-Pagan test Constant Variance

## Non-constant Variance Score Test


## Variance formula: ~ fitted.values
## Chisquare = 0.004297782, Df = 1, p = 0.94773

 Normality is suspect

 Homoscedasticity seems plausible


Some remedial measures in R

Let us attempt some remedial measures by transformation on the response.
reg2<-lm(log(mgco3) ~ temp, data = sand)
#qPlot(reg2$residuals,xlab="Normal Quantiles", ylab="Residuals")
shapiro.test(reg2$residuals) ##Test for Normality

##
## Shapiro-Wilk normality test
##
## data: reg2$residuals
## W = 0.8351, p-value = 0.03854

ncvTest(reg2) #Breusch-Pagan test Constant Variance

## Non-constant Variance Score Test


## Variance formula: ~ fitted.values
## Chisquare = 0.0009718099, Df = 1, p = 0.97513
Some remedial measures in R

reg3<-lm(sqrt(mgco3) ~ temp, data = sand)


#qqPlot(reg3$residuals,xlab="Normal Quantiles", ylab="Residuals")
shapiro.test(reg3$residuals) ##Test for Normality

##
## Shapiro-Wilk normality test
##
## data: reg3$residuals
## W = 0.82978, p-value = 0.03326

ncvTest(reg3) #Breusch-Pagan test Constant Variance

## Non-constant Variance Score Test


## Variance formula: ~ fitted.values
## Chisquare = 0.0002827769, Df = 1, p = 0.98658
Remedial measures in R

reg4<-lm(1/mgco3 ~ temp, data = sand)


#qqPlot(reg4$residuals,xlab="Normal Quantiles", ylab="Residuals")
shapiro.test(reg4$residuals) ##Test for Normality

##
## Shapiro-Wilk normality test
##
## data: reg4$residuals
## W = 0.84463, p-value = 0.05012

ncvTest(reg4) #Breusch-Pagan test Constant Variance

## Non-constant Variance Score Test


## Variance formula: ~ fitted.values
## Chisquare = 0.01555509, Df = 1, p = 0.90075
Scatterplot of regression fit in R

Create a scatterplot with the regression fit. We use the ggplot() function in the package ggplot2.
library(ggplot2)
p <- ggplot(sand, aes(x = temp, y = 1/mgco3))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = FALSE)
Scatterplot of regression fit in R

print(p)

[Figure: ggplot scatterplot of 1/mgco3 versus temp with fitted least-squares line]
Scatterplot of regression fit in R
Alternatively, we can use the base graphics.
plot(sand$temp,1/sand$mgco3)
abline(reg4)
[Figure: base-graphics scatterplot of 1/mgco3 versus temp with the fitted line from reg4]
CI of mean and PI for new observation in R

To obtain the CI for the mean response and the prediction interval for a new observation at temp = 15, we use the predict() function.

predict(reg4, data.frame(temp=15), interval = "confidence",
        level = 0.95)

##        fit       lwr       upr
## 1 0.113878 0.1121042 0.1156518

predict(reg4, data.frame(temp=15), interval = "prediction",
        level = 0.95)

##        fit     lwr       upr
## 1 0.113878 0.10857 0.1191859
CI of mean and PI for new observation in R

Observe that the prediction interval is wider than the CI for the mean response. This is reasonable because you are less confident in predicting an individual response than the mean response for all individuals.
The CI for the mean response and the prediction interval for an individual response become wider as xh moves away from x̄. That is, you get a narrower, more precise CI and prediction interval for xh 's near the center of the data.
Confidence and prediction bands in R
Now we construct confidence and prediction bands along with
the fitted least square line.
##Confidence Band
p <- ggplot(sand, aes(x = temp, y = 1/mgco3))
p <- p + geom_point()
p <- p + geom_smooth(method = lm, se = T)
print(p)

[Figure: scatterplot of 1/mgco3 versus temp with fitted line and pointwise confidence band]
Confidence and prediction bands in R
#using Base graphics
plot(sand$temp,1/sand$mgco3)
abline(reg4)
x.pred <- data.frame(temp = seq(min(sand$temp), max(sand$temp),
length = 20))
lines(x.pred$temp, predict(reg4, x.pred, interval = "confidence")[
, "upr"], col = "blue")
lines(x.pred$temp, predict(reg4, x.pred, interval = "confidence")[
, "lwr"], col = "blue")
[Figure: base-graphics scatterplot with fitted line and confidence band]
An example with categorical independent variable

An example with categorical covariate.


 A company operates computers at three different locations.
 Each location has slightly different working conditions.
 The variable of interest is the length of time between
computer failures.
 Is there a difference in failure times between locations?
An example with categorical independent variable

ex2 <- read.table("./Datasets/CH18TA05.txt",
                  col.names = c("failure", "loc", "observation"))
ex2 <- within(ex2, locf <- factor(loc, levels = c(1:3),
                                  labels = c("Loc 1", "Loc 2", "Loc 3")))
(ex2reg <- lm(failure ~ locf, data = ex2))

##
## Call:
## lm(formula = failure ~ locf, data = ex2)
##
## Coefficients:
## (Intercept) locfLoc 2 locfLoc 3
## 50.37 -28.24 70.84
ANOVA results from R
anova(ex2reg)

## Analysis of Variance Table


##
## Response: failure
## Df Sum Sq Mean Sq F value Pr(>F)
## locf 2 26053 13026.6 2.0504 0.1714
## Residuals 12 76239 6353.2

boxplot(failure ~ locf, data = ex2)


[Figure: boxplots of failure time by location (Loc 1, Loc 2, Loc 3)]


Diagnostics plots in R
par(mfrow=c(1,2))
plot(ex2reg, 1:2)

[Figure: Residuals vs Fitted (left) and Normal Q-Q (right) plots for ex2reg]
Diagnostics plots in R
par(mfrow=c(1,2))
plot(ex2reg, c(3, 5))

[Figure: Scale-Location (left) and Constant Leverage: Residuals vs Factor Levels (right) plots for ex2reg]
Formal diagnostics in R
shapiro.test(ex2reg$residuals)

##
## Shapiro-Wilk normality test
##
## data: ex2reg$residuals
## W = 0.8065, p-value = 0.004461

library(nortest)
lillie.test(ex2reg$residuals);

##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: ex2reg$residuals
## D = 0.25084, p-value = 0.01185

leveneTest(ex2reg, center = mean) # Levene Test

## Levene's Test for Homogeneity of Variance (center = mean)


## Df F value Pr(>F)
## group 2 2.5796 0.117
## 12
Remedial measures in R

Remedial measures
tapply(ex2$failure, ex2$locf, var)/tapply(ex2$failure, ex2$locf, mean) # Square root

## Loc 1 Loc 2 Loc 3


## 35.51206 49.86237 133.38597

tapply(ex2$failure, ex2$locf, sd)/tapply(ex2$failure, ex2$locf, mean) # Natural Log

## Loc 1 Loc 2 Loc 3


## 0.8396571 1.5010521 1.0490337

tapply(ex2$failure, ex2$locf, sd)/tapply(ex2$failure, ex2$locf, mean)^2 # 1/Y

## Loc 1 Loc 2 Loc 3


## 0.016669785 0.067828835 0.008654822

The most consistent set of values is the standard deviation over the mean, so the natural log transformation is suggested.
Remedial measures in R
boxplot(log(failure) ~ locf, data = ex2)
[Figure: boxplots of log(failure) by location]


Remedial measures in R
ex2reg2 <- lm(log(failure) ~ locf, data = ex2)
qqPlot(ex2reg2)

[Figure: QQ-plot of studentized residuals from ex2reg2 against t quantiles]
Remedial measures in R

shapiro.test(ex2reg2$residuals)

##
## Shapiro-Wilk normality test
##
## data: ex2reg2$residuals
## W = 0.978, p-value = 0.954

ncvTest(ex2reg2)

## Non-constant Variance Score Test


## Variance formula: ~ fitted.values
## Chisquare = 0.7178845, Df = 1, p = 0.39684
Box-Cox analysis in R
A more complete analysis of transformations can be done with
a Box-Cox analysis.
boxCox(ex2reg)

[Figure: Box-Cox profile log-likelihood for ex2reg, with 95% confidence interval for λ]
Diagnostics after transformation
The graph seems to suggest the natural log transformation, since 0 is included in the 95% confidence interval. However, a transformation Y^0.1 appears to be better.
ex2reg3 <- lm(failure^0.1 ~ locf, data = ex2)
qqPlot(ex2reg3)
[Figure: QQ-plot of studentized residuals from ex2reg3 against t quantiles]
Diagnostics after transformation

shapiro.test(ex2reg3$residuals)

##
## Shapiro-Wilk normality test
##
## data: ex2reg3$residuals
## W = 0.97697, p-value = 0.9446

ncvTest(ex2reg3)

## Non-constant Variance Score Test


## Variance formula: ~ fitted.values
## Chisquare = 0.2171432, Df = 1, p = 0.64123

Both normality and homoscedasticity seem to be fulfilled.


Outlying cases

 Observations separated from the remainder of the data

 Should we retain or eliminate outliers?

 Outlying with respect to Y or X or both Y and X

 Influence?
Identifying outlying Y observations
Residuals:

ei = Yi − Ŷi

Semistudentized residuals:

ei* = ei / √MSE

Studentized residuals:

ri = ei / s(ei), where s(ei) = √( MSE (1 − hii) ) and H = X(X′X)−1 X′ is the hat matrix.

Deleted residuals:

di = Yi − Ŷi(i) , which can be computed as di = ei / (1 − hii)

Studentized deleted residuals:

ti = di / s(di)
Identifying outlying X observations
Let H be the hat matrix, with diagonal elements (leverages) hii satisfying

0 ≤ hii ≤ 1 and Σi=1..n hii = p,

where p is the number of parameters in the model.

If case i is outlying in X, then hii > 2p/n.

Criteria:

0.2 ≤ hii ≤ 0.5 ⇒ moderate leverage
hii > 0.5 ⇒ high leverage
Test for Outliers


We can perform a formal test for outliers using the Bonferroni procedure: if

|ti| ≤ t(1 − α/(2n); n − p − 1),

the observation is not declared an outlier.


Influential cases

Influence on a single fitted value:

DFFITSi = ( Ŷi − Ŷi(i) ) / √( MSE(i) hii )

It measures how much the fitted value Ŷi changes if we do not consider the ith case.
Rule of thumb
◦ |DFFITS|i > 1 for small to medium data sets.
◦ |DFFITS|i > 2√(p/n) for large data sets.
Influential cases

Influence on all fitted values:

Cook's distance considers the influence of the ith case on all n fitted values,

Di = Σj=1..n ( Ŷj − Ŷj(i) )² / (p MSE)

Rule of thumb
◦ F(Di ; p, n − p) < 10% or 20% ⇒ not an influential case.
◦ F(Di ; p, n − p) > 50% ⇒ influential case.
Influential cases

Influence on the regression coefficients:

DFBETAS considers the influence of the ith case on the kth regression coefficient,

DFBETASk(i) = ( β̂k − β̂k(i) ) / √( MSE(i) Ckk )

where Ckk is the kth diagonal element of (X′X)−1.

Rule of thumb
◦ |DFBETAS|k(i) > 1 indicates an influential case for small data sets.
◦ |DFBETAS|k(i) > 2/√n for large data sets.
Example: Body Fats Data

Variables
 n = 20

 X1 skinfold thickness

 X2 thigh circumference

 X3 midarm circumference

 Y amount of fat
Some Implementation in R

body<-read.table("./Datasets/body.txt", header=T)
head(body)

## x1 x2 x3 y
## 1 19.5 43.1 29.1 11.9
## 2 24.7 49.8 28.2 22.8
## 3 30.7 51.9 37.0 18.7
## 4 29.8 54.3 31.1 20.1
## 5 19.1 42.2 30.9 12.9
## 6 25.6 53.9 23.7 21.7

(model<-lm(y~x1+x2, data=body))

##
## Call:
## lm(formula = y ~ x1 + x2, data = body)
##
## Coefficients:
## (Intercept) x1 x2
## -19.1742 0.2224 0.6594
Some Implementation in R

par(mfrow=c(1,2));plot(model, 4:5)

[Figure: Cook's distance (left) and Residuals vs Leverage (right) plots for the body fat model]


Some Implementation in R
par(mfrow=c(1,1));plot(model, 6)

[Figure: Cook's distance vs Leverage hii/(1 − hii) plot for lm(y ~ x1 + x2)]
Some Implementation in R

model$residuals

## 1 2 3 4 5
## -1.6827093112 3.6429311788 -3.1759701405 -3.1584651200 -0.0002886579
## 6 7 8 9 10
## -0.3608155187 0.7161991891 4.0147327554 2.6551057360 -2.4748115410
## 11 12 13 14 15
## 0.3358063798 2.2255110139 -3.9468613463 3.4474561945 0.5705871038
## 16 17 18 19 20
## 0.6422984777 -0.8509464751 -0.7829198812 -2.8572887647 1.0404487275
Understanding diagnosis plot

Cook's distance plot
◦ Used to find influential outliers.
◦ Plots observations against their respective Cook's distances.
◦ The largest Cook's distances are labelled.
◦ We expect Cook's distances less than 1, with no single Cook's distance 'significantly' larger than the rest.
Understanding diagnosis plot
Cook's distance vs Leverage
◦ Used to detect highly influential cases.
◦ A Cook's distance above 1 indicates a highly influential point.
◦ Leverage values greater than 2p/n indicate high-leverage observations.
◦ The contours are standardized residuals labelled with their magnitude.
Residuals vs Leverage
◦ Helps to find influential subjects.
◦ Outlying values are found at the upper-right or lower-right corners of the plot.
◦ Cases outside the dashed line (Cook's distance) are influential on the results of the model.
Some Implementation in R

rstandard(model)

## 1 2 3 4 5
## -0.7402261049 1.4765806027 -1.5757913418 -1.3171518783 -0.0001308889
## 6 7 8 9 10
## -0.1519867578 0.3064529233 1.6606069555 1.1095479783 -1.0316491806
## 11 12 13 14 15
## 0.1407848595 0.9272163634 -1.7121512617 1.4686083237 0.2747599004
## 16 17 18 19 20
## 0.2655244112 -0.3538020409 -0.3435016352 -1.1631291289 0.4197625143

rstudent(model)

## 1 2 3 4 5
## -0.7299854027 1.5342541325 -1.6543295725 -1.3484842072 -0.0001269809
## 6 7 8 9 10
## -0.1475490938 0.2981276214 1.7600924916 1.1176487404 -1.0337284208
## 11 12 13 14 15
## 0.1366610657 0.9231785040 -1.8259027246 1.5247630510 0.2671500921
## 16 17 18 19 20
## 0.2581323416 -0.3445090997 -0.3344080836 -1.1761712768 0.4093564171
Some Implementation in R

influence(model)$hat

## 1 2 3 4 5 6
## 0.20101253 0.05889478 0.37193301 0.11094009 0.24801034 0.12861620
## 7 8 9 10 11 12
## 0.15551745 0.09628780 0.11463564 0.11024435 0.12033655 0.10926629
## 13 14 15 16 17 18
## 0.17838181 0.14800684 0.33321201 0.09527739 0.10559466 0.19679280
## 19 20
## 0.06695419 0.05008526

dffits(model)

## 1 2 3 4 5
## -3.661472e-01 3.838103e-01 -1.273067e+00 -4.763483e-01 -7.292347e-05
## 6 7 8 9 10
## -5.668650e-02 1.279371e-01 5.745212e-01 4.021649e-01 -3.638725e-01
## 11 12 13 14 15
## 5.054583e-02 3.233366e-01 -8.507812e-01 6.355141e-01 1.888521e-01
## 16 17 18 19 20
## 8.376829e-02 -1.183735e-01 -1.655265e-01 -3.150707e-01 9.399706e-02
Some Implementation in R

(cd<-cooks.distance(model))

## 1 2 3 4 5
## 4.595055e-02 4.548118e-02 4.901567e-01 7.216190e-02 1.883399e-09
## 6 7 8 9 10
## 1.136518e-03 5.764939e-03 9.793853e-02 5.313352e-02 4.395704e-02
## 11 12 13 14 15
## 9.037986e-04 3.515436e-02 2.121502e-01 1.248925e-01 1.257530e-02
## 16 17 18 19 20
## 2.474925e-03 4.926142e-03 9.636470e-03 3.236006e-02 3.096787e-03
Some Implementation in R

dfbeta(model)

## (Intercept) x1 x2
## 1 -2.5873120410 -4.045756e-02 6.851256e-02
## 2 1.3885856142 3.359106e-02 -3.996603e-02
## 3 -6.7460811783 -3.417891e-01 2.959198e-01
## 4 -0.8298018102 -8.699579e-02 5.576707e-02
## 5 -0.0005491464 -9.548311e-06 1.507863e-05
## 6 0.3417033826 1.252797e-02 -1.327784e-02
## 7 -0.6662800194 -4.869870e-03 1.625790e-02
## 8 2.0621886739 1.119746e-01 -9.133441e-02
## 9 -1.2562286444 -8.876192e-02 7.137567e-02
## 10 1.9837372912 7.407218e-02 -7.811621e-02
## 11 -0.0776962164 5.331751e-03 -7.452907e-04
## 12 -1.0957798277 6.844439e-03 2.047112e-02
## 13 0.9361849968 1.685643e-01 -1.063493e-01
## 14 3.6377705378 3.307612e-02 -8.349511e-02
## 15 -0.0258331550 -3.893438e-02 2.059515e-02
## 16 0.0800533575 1.345694e-02 -7.525586e-03
## 17 0.6827037689 1.715290e-02 -2.275412e-02
## 18 1.1340627287 2.347924e-02 -3.472625e-02
## 19 -1.0715507527 -1.221911e-03 1.855296e-02
## 20 0.0873645353 7.127872e-04 -9.895689e-04
Some Implementation in R

influence.measures(model)

## Influence measures of
## lm(formula = y ~ x1 + x2, data = body) :
##
## dfb.1_ dfb.x1 dfb.x2 dffit cov.r cook.d hat inf
## 1 -3.05e-01 -1.31e-01 2.32e-01 -3.66e-01 1.361 4.60e-02 0.2010
## 2 1.73e-01 1.15e-01 -1.43e-01 3.84e-01 0.844 4.55e-02 0.0589
## 3 -8.47e-01 -1.18e+00 1.07e+00 -1.27e+00 1.189 4.90e-01 0.3719 *
## 4 -1.02e-01 -2.94e-01 1.96e-01 -4.76e-01 0.977 7.22e-02 0.1109
## 5 -6.37e-05 -3.05e-05 5.02e-05 -7.29e-05 1.595 1.88e-09 0.2480 *
## 6 3.97e-02 4.01e-02 -4.43e-02 -5.67e-02 1.371 1.14e-03 0.1286
## 7 -7.75e-02 -1.56e-02 5.43e-02 1.28e-01 1.397 5.76e-03 0.1555
## 8 2.61e-01 3.91e-01 -3.32e-01 5.75e-01 0.780 9.79e-02 0.0963
## 9 -1.51e-01 -2.95e-01 2.47e-01 4.02e-01 1.081 5.31e-02 0.1146
## 10 2.38e-01 2.45e-01 -2.69e-01 -3.64e-01 1.110 4.40e-02 0.1102
## 11 -9.02e-03 1.71e-02 -2.48e-03 5.05e-02 1.359 9.04e-04 0.1203
## 12 -1.30e-01 2.25e-02 7.00e-02 3.23e-01 1.152 3.52e-02 0.1093
## 13 1.19e-01 5.92e-01 -3.89e-01 -8.51e-01 0.827 2.12e-01 0.1784
## 14 4.52e-01 1.13e-01 -2.98e-01 6.36e-01 0.937 1.25e-01 0.1480
## 15 -3.00e-03 -1.25e-01 6.88e-02 1.89e-01 1.775 1.26e-02 0.3332 *
## 16 9.31e-03 4.31e-02 -2.51e-02 8.38e-02 1.309 2.47e-03 0.0953
## 17 7.95e-02 5.50e-02 -7.61e-02 -1.18e-01 1.312 4.93e-03 0.1056
## 18 1.32e-01 7.53e-02 -1.16e-01 -1.66e-01 1.462 9.64e-03 0.1968
## 19 -1.30e-01 -4.07e-03 6.44e-02 -3.15e-01 1.002 3.24e-02 0.0670
## 20 1.02e-02 2.29e-03 -3.31e-03 9.40e-02 1.224 3.10e-03 0.0501
Some Implementation in R

library(car)
outlierTest(model) # Bonferonni p-value for most extreme obs

## No Studentized residuals with Bonferonni p < 0.05


## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 13 -1.825903 0.086586 NA

Repeat by setting Y = 50.4 for observation 28. What changes do you observe?
Model Building Process

 Data collection and preparation

 Reduction of explanatory or predictor variables

 Model selection

 Model validation
Procedures for variable reduction

 Let X be a set of predictor variables.

 How can we identify a good subset of predictor variables?

 Need some criterion

 If we have four variables, let’s say X1 , X2 , X3 , X4

How many models can we fit? 2⁴ − 1 = 15

In the general case X1 , X2 , . . . , Xn there are

2ⁿ − 1 different models.
Criteria for comparing the regression models
Different criteria can be used
 Rp2

 MSEp

 Cp

 PRESSp

 AIC and SBC(BIC)


Notation and Remarks
p − 1 is the number of potential predictor variables in the pool.
All models considered include the intercept β0 , hence up to p parameters.
Assume that n > p.


Rp² or SSEp criterion
Use the coefficient of multiple determination R².
Identify subsets with high Rp².
The Rp² criterion is equivalent to the SSEp criterion, since

Rp² = 1 − SSEp / SSTO

Given that SSTO is constant, we look for the smallest SSEp .
Ra² or MSEp criterion
If we want to account for the number of parameters,

Ra² = 1 − [ (n − 1)/(n − p) ] (SSE/SSTO) = 1 − MSE / [ SSTO/(n − 1) ]

Ra² increases iff MSE decreases.


Mallows Cp criterion
This criterion tries to identify models with small bias in the regression,

Cp = SSEp / MSE(X1 , X2 , . . . , Xp−1) − (n − 2p)

MSE(X1 , X2 , . . . , Xp−1), the MSE from the model with all predictors, is an unbiased estimator of σ².
SSEp is the error sum of squares for the fitted subset model.
No bias implies E(Cp) ≈ p.
We search for models with (1) Cp small and (2) Cp near p.
PRESSp criterion
The idea is to see how well the fitted values from a specific subset model predict the observed Yi .
Yi : observed values
Ŷi(i) : fitted value for case i from the model fitted without case i
Prediction sum of squares: PRESSp = Σi=1..n (Yi − Ŷi(i))², with small values preferred.
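A minimal sketch in R, assuming a fitted lm object fit: since the deleted residuals equal ei /(1 − hii), PRESS needs no refitting.

h      <- hatvalues(fit)
PRESSp <- sum((residuals(fit) / (1 - h))^2)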
Surgical Dataset

Data
 X1 blood clotting score
 X2 prognostic index
 X3 enzyme function test score
 X4 liver function test
 Y survival

Objective
Predict survival of patients with a particular type of liver
operation.
Implementation in R

surg<-read.table("./Datasets/surgical.txt", header=T)

cor(surg)

## x1 x2 x3 x4 y logy
## x1 1.00000000 0.09011973 -0.14963411 0.5024157 0.3725187 0.3464042
## x2 0.09011973 1.00000000 -0.02360544 0.3690256 0.5539760 0.5928888
## x3 -0.14963411 -0.02360544 1.00000000 0.4164245 0.5802438 0.6651216
## x4 0.50241567 0.36902563 0.41642451 1.0000000 0.7223266 0.7262058
## y 0.37251865 0.55397598 0.58024382 0.7223266 1.0000000 0.9130965
## logy 0.34640419 0.59288884 0.66512160 0.7262058 0.9130965 1.0000000
Implementation in R
pairs(surg[,-6])

[Figure: scatterplot matrix (pairs) of x1, x2, x3, x4, and y]
Implementation in R: Diagnostics
model<-lm(y~x1+x2+x3+x4, data=surg)
library(car)
par(mfrow=c(1,2))
qqnorm(model$residuals)
qqline(model$residuals)
plot(model$fitted,model$residuals,xlab="Fitted",ylab="Residuals",
main="Time")
abline(h=0)

[Figure: normal Q-Q plot of residuals (left) and residuals vs fitted values (right) for the model on y]


Implementation in R: Diagnostics
model2<-lm(logy~x1+x2+x3+x4, data=surg)
par(mfrow=c(1,2))
#qqPlot(model2$residuals,main="LogTime",ylab="Residuals")
qqnorm(model2$residuals,main="LogTime")
qqline(model2$residuals)
plot(model2$fitted,model2$residuals,xlab="Fitted",ylab="Residuals",
main="LogTime")
abline(h=0)

[Figure: normal Q-Q plot of residuals (left) and residuals vs fitted values (right) for the model on logy]


Implementation in R: Variable Selection

#install.packages('leaps')
library(leaps)
modelsel<-regsubsets(logy~x1+x2+x3+x4,data=surg,nbest=2)

summary(modelsel)

## Subset selection object


## Call: regsubsets.formula(logy ~ x1 + x2 + x3 + x4, data = surg, nbest = 2)
## 4 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## 2 subsets of each size up to 4
## Selection Algorithm: exhaustive
## x1 x2 x3 x4
## 1 ( 1 ) " " " " " " "*"
## 1 ( 2 ) " " " " "*" " "
## 2 ( 1 ) " " "*" "*" " "
## 2 ( 2 ) " " " " "*" "*"
## 3 ( 1 ) "*" "*" "*" " "
## 3 ( 2 ) " " "*" "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
Implementation in R: Variable Selection
plot(modelsel, scale="r2")

[Figure: regsubsets plot of variable membership by R²]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='rsq',legend=F)

[Figure: best-subsets R² versus subset size]
Implementation in R: Variable Selection
plot(modelsel, scale="adjr2")

[Figure: regsubsets plot of variable membership by adjusted R²]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='adjr2', legend=F)

[Figure: best-subsets adjusted R² versus subset size]
Implementation in R: Variable Selection
plot(modelsel, scale="Cp")

[Figure: regsubsets plot of variable membership by Cp]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='cp',legend=F)

[Figure: best-subsets Cp versus subset size]
Implementation in R: Variable Selection
plot(modelsel) ##scale=bic

[Figure: regsubsets plot of variable membership by BIC]
Implementation in R: Variable Selection
car::subsets(modelsel, legend=F) ##statistics=bic

[Figure: best-subsets BIC versus subset size]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='adjr2',
             legend=c(3.5, -37)) ## legend placed at the given plot coordinates

[Figure: best-subsets adjusted R² versus subset size, with legend inside the plot]
Implementation in R: Variable Selection

The function press() in package DAAG returns the PRESS statistic for a model object produced by lm().

mod<-lm(logy~x1+x2+x3+x4,data=surg)
library(DAAG)

## Loading required package: lattice


##
## Attaching package: ’DAAG’
## The following object is masked from ’package:car’:
##
## vif

press(mod)

## [1] 0.1455741
Implementation in R: Comparing Two Models

fit1 <- lm(logy ~ x1 + x2 + x3 + x4, data=surg)
fit2 <- lm(logy ~ x1 + x2 + x3, data=surg)
anova(fit1, fit2)

## Analysis of Variance Table


##
## Model 1: logy ~ x1 + x2 + x3 + x4
## Model 2: logy ~ x1 + x2 + x3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 0.10977
## 2 50 0.10986 -1 -8.8092e-05 0.0393 0.8436

fit1 <- lm(logy ~ x1 + x2 + x3 + x4, data=surg)


Implementation in R: Comparing Two Models
library(MASS)

##
## Attaching package: ’MASS’
## The following object is masked from ’package:DAAG’:
##
## hills

dropterm(fit1,test="F")

## Single term deletions


##
## Model:
## logy ~ x1 + x2 + x3 + x4
## Df Sum of Sq RSS AIC F Value Pr(F)
## <none> 0.10977 -324.71
## x1 1 0.35544 0.46521 -248.73 158.66 <2e-16 ***
## x2 1 1.00583 1.11560 -201.50 448.99 <2e-16 ***
## x3 1 1.28075 1.39052 -189.60 571.71 <2e-16 ***
## x4 1 0.00009 0.10986 -326.67 0.04 0.8436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Implementation in R: Comparing Two Models

fit2<-update(fit1, .~. -x3)


dropterm(fit2,test="F")

## Single term deletions


##
## Model:
## logy ~ x1 + x2 + x4
## Df Sum of Sq RSS AIC F Value Pr(F)
## <none> 1.3905 -189.60
## x1 1 0.00163 1.3921 -191.54 0.0586 0.8097443
## x2 1 0.48530 1.8758 -175.44 17.4502 0.0001181 ***
## x4 1 0.84196 2.2325 -166.04 30.2750 1.285e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Implementation in R: Automatic Search Procedure

stepAIC(fit1)

## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Implementation in R: Automatic Search Procedure

stepAIC(fit1,direction="backward")

## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Implementation in R: Automatic Search Procedure
stepAIC(fit1,direction="both")

## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## + x4 1 0.00009 0.10977 -324.71
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Multicollinearity

A problem arising when the predictor variables are correlated among themselves.
In practice
◦ perfect correlation is difficult to find
◦ multicollinearity does not inhibit our ability to obtain a good fit
◦ but the usual interpretation of the regression coefficients is not fully applicable
Multicollinearity

Multicollinearity affects
◦ the regression coefficients
◦ s{β̂}
◦ fitted values and predictions
◦ tests and confidence intervals

This information can be used to detect multicollinearity:
◦ correlations exceeding 0.8 (rule of thumb)
◦ drastic changes in estimated parameters
◦ large values of s{β̂}
◦ etc.
Multicollinearity Diagnostics

Adding or deleting variables changes the regression coefficients drastically.

The standard deviations of the regression coefficients become large.

The estimated regression coefficients are not statistically significant even though a relationship exists.

Large correlations between pairs of predictors.


Multicollinearity Diagnostics

Variance Inflation Factor

(VIF)k = (1 − Rk²)⁻¹ , k = 1, 2, . . . , p − 1

Rk² is the coefficient of multiple determination when Xk is regressed on the other p − 2 predictor variables.

Rk² = 1 ⇒ (VIF)k = ∞; Rk² = 0 ⇒ (VIF)k = 1.

A maximum VIF > 10 indicates multicollinearity.

A mean VIF, Σ (VIF)k / (p − 1), considerably greater than 1 also indicates multicollinearity.
Some Implementation in R

crew<-read.table("./Datasets/crew.txt", header=T)
cor(crew[,-3])

## x1 x2
## x1 1 0
## x2 0 1
Some Implementation in R

summary(lm(y~x1,data=crew))

##
## Call:
## lm(formula = y ~ x1, data = crew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.750 -3.750 0.125 4.500 6.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.500 10.111 2.324 0.0591 .
## x1 5.375 1.983 2.711 0.0351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.609 on 6 degrees of freedom
## Multiple R-squared: 0.5505,Adjusted R-squared: 0.4755
## F-statistic: 7.347 on 1 and 6 DF, p-value: 0.03508
Implementation in R

summary(lm(y~x2,data=crew))

##
## Call:
## lm(formula = y ~ x2, data = crew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.000 -4.688 -0.250 5.250 7.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.250 11.608 2.348 0.0572 .
## x2 9.250 4.553 2.032 0.0885 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 6 degrees of freedom
## Multiple R-squared: 0.4076,Adjusted R-squared: 0.3088
## F-statistic: 4.128 on 1 and 6 DF, p-value: 0.08846
Implementation in R

summary(lm(y~x1+x2,data=crew))

##
## Call:
## lm(formula = y ~ x1 + x2, data = crew)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 1.625 -1.375 -1.625 1.375 -2.125 1.875 0.625 -0.375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3750 4.7405 0.079 0.940016
## x1 5.3750 0.6638 8.097 0.000466 ***
## x2 9.2500 1.3276 6.968 0.000937 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.877 on 5 degrees of freedom
## Multiple R-squared: 0.958,Adjusted R-squared: 0.9412
## F-statistic: 57.06 on 2 and 5 DF, p-value: 0.000361
Implementation in R

vif(lm(y~x1+x2,data=crew))

## x1 x2
## 1 1

sqrt(vif(lm(y~x1+x2,data=crew)))>2 #problem

## x1 x2
## FALSE FALSE
Body Fat Example: Implementation in R

body<-read.table("./Datasets/body.txt", header=T)
summary(lm(y~x1,data=body))

##
## Call:
## lm(formula = y ~ x1, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1195 -2.1904 0.6735 1.9383 3.8523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4961 3.3192 -0.451 0.658
## x1 0.8572 0.1288 6.656 3.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.82 on 18 degrees of freedom
## Multiple R-squared: 0.7111,Adjusted R-squared: 0.695
## F-statistic: 44.3 on 1 and 18 DF, p-value: 3.024e-06
Implementation in R

summary(lm(y~x2,data=body))

##
## Call:
## lm(formula = y ~ x2, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4949 -1.5671 0.1241 1.3362 4.4084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.6345 5.6574 -4.178 0.000566 ***
## x2 0.8565 0.1100 7.786 3.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 18 degrees of freedom
## Multiple R-squared: 0.771,Adjusted R-squared: 0.7583
## F-statistic: 60.62 on 1 and 18 DF, p-value: 3.6e-07
Implementation in R

summary(lm(y~x1+x2,data=body))

##
## Call:
## lm(formula = y ~ x1 + x2, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9469 -1.8807 0.1678 1.3367 4.0147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.1742 8.3606 -2.293 0.0348 *
## x1 0.2224 0.3034 0.733 0.4737
## x2 0.6594 0.2912 2.265 0.0369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.543 on 17 degrees of freedom
## Multiple R-squared: 0.7781,Adjusted R-squared: 0.7519
## F-statistic: 29.8 on 2 and 17 DF, p-value: 2.774e-06
Implementation in R

summary(lm(y~x1+x2+x3,data=body))

##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7263 -1.6111 0.3923 1.4656 4.1277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.085 99.782 1.173 0.258
## x1 4.334 3.016 1.437 0.170
## x2 -2.857 2.582 -1.106 0.285
## x3 -2.186 1.595 -1.370 0.190
##
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared: 0.8014,Adjusted R-squared: 0.7641
## F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
Implementation in R

cor(body[,-4])

## x1 x2 x3
## x1 1.0000000 0.9238425 0.4577772
## x2 0.9238425 1.0000000 0.0846675
## x3 0.4577772 0.0846675 1.0000000

vif(lm(y~x1+x2+x3,data=body))

## x1 x2 x3
## 708.84 564.34 104.61

sqrt(vif(lm(y~x1+x2+x3,data=body)))>2 #problem

## x1 x2 x3
## TRUE TRUE TRUE
Preliminary
Let i = 1, . . . , n denote subjects, individuals, items, experimental units, etc., and let j = 1, 2, . . . , p denote the variables, i.e. the characteristics measured on each individual. The jth measurement on the ith subject is denoted xij . For n measurements on p variables, the data are set up as follows:

Table: Data Setup

          Variable 1   Variable 2   ...   Variable j   ...   Variable p
Item 1    x11          x12          ...   x1j          ...   x1p
Item 2    x21          x22          ...   x2j          ...   x2p
...       ...          ...                ...                ...
Item i    xi1          xi2          ...   xij          ...   xip
...       ...          ...                ...                ...
Item n    xn1          xn2          ...   xnj          ...   xnp
The data can be arranged in matrix form as

X = (xij)i,j =
[ x11  x12  ...  x1p ]
[ x21  x22  ...  x2p ]
[  ⋮    ⋮          ⋮ ]
[ xn1  xn2  ...  xnp ]

Example
A selection of four receipts from the university bookstore was
obtained in order to investigate the nature of book sales. Each
receipt provided, among other things, the number of books sold
and the total amount of each sale. Let the first variable be the
total Cedi sales and the second variable be the number of
books sold. Then we can regard the corresponding numbers on
the receipts as four measurements on two variables.
Example (cont.)
Suppose the data, in tabular form, are

Variable 1 (Cedi sales):        42  52  48  58
Variable 2 (number of books):    4   5   4   3

Solution
Using the notation just introduced, we have

x11 = 42, x21 = 52, x31 = 48, x41 = 58
x12 = 4,  x22 = 5,  x32 = 4,  x42 = 3

and the data array X is

X =
[ 42  4 ]
[ 52  5 ]
[ 48  4 ]
[ 58  3 ]
Descriptive statistics
Sample mean:

x̄j = (1/n) Σi=1..n xij ,  j = 1, . . . , p.

Mean vector: x̄ = (x̄1 , x̄2 , . . . , x̄p)′

Sample variance:

sj² = (1/n) Σi=1..n (xij − x̄j)².

A very common alternative definition is

sj² = (1/(n − 1)) Σi=1..n (xij − x̄j)².

When the sample variances are placed along the main diagonal of a matrix, they are denoted sjj = sj². Standard deviation: sj = √sj².
Linear association
◦ Sample covariance:

sjk = (1/n) Σi=1..n (xij − x̄j)(xik − x̄k)

We denote the p × p symmetric covariance matrix by S n = (sjk)j,k and the matrix of sums of squares and cross products by W = n S n = (n − 1) S n−1 .
◦ Sample correlation (Pearson's product-moment correlation):

rjk = sjk / √(sjj skk) = Σi=1..n (xij − x̄j)(xik − x̄k) / √( Σi=1..n (xij − x̄j)² · Σi=1..n (xik − x̄k)² )
Remarks on correlation:
 unitless
 covariance of standardized variables
 range −1 ≤ rij ≤ 1
 rjk = 0 implies no linear association
rjk > 0 implies the two variables tend to deviate from their respective means in the same direction.
 The correlation is invariant under the following transformations
    y_{ij} = a x_{ij} + b, \quad i = 1, 2, \dots, n
    y_{ik} = c x_{ik} + d, \quad i = 1, 2, \dots, n
  provided a and c have the same sign (ac > 0). What happens if ac < 0? (See the short R sketch after this list.)
 Do not overestimate the importance of linear association
measures.
◦ They do not capture other kinds of association (eg.
quadratic relationship between variables)
◦ Correlation coefficient is very sensitive to outliers.
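R Demonstration
As a quick numerical illustration of the invariance property, here is a minimal sketch with simulated (made-up) data; only base R is used.

set.seed(1)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)
cor(x, y)
cor(2 * x + 3, 5 * y - 1)   # a = 2, c = 5, ac > 0: correlation unchanged
cor(-2 * x + 3, 5 * y - 1)  # ac < 0: correlation changes sign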
Matrix representation:
 Sample mean: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)'
 Sample covariance matrix: S_n = (s_{jk}) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})', i.e.

   S_n = \begin{pmatrix} s_{11} & s_{12} & \dots & s_{1p} \\ s_{21} & s_{22} & \dots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \dots & s_{pp} \end{pmatrix}

 Sample correlation matrix: R = (r_{jk}) with r_{jj} = 1, i.e.

   R = \begin{pmatrix} 1 & r_{12} & \dots & r_{1p} \\ r_{21} & 1 & \dots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \dots & 1 \end{pmatrix}

Sample Mean, Covariance, and Correlation as Matrix Operations

   \bar{x} = \frac{1}{n} X' 1_n
   S = \frac{1}{n-1} X'\left(I_n - \frac{1}{n} J_n\right) X

 Let D = diag(s_{jj}); then
   R = D^{-1/2} S D^{-1/2}
   S = D^{1/2} R D^{1/2}
Examples
 Consider the data array

   X = \begin{pmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{pmatrix}

   Find \bar{x}, S, R.
 TRY: Using the IRIS data set, find \bar{x}, S, R for the entire data and for each species.
R Demonstration
data<-c(42, 52, 48, 58, 4, 5, 4, 3)
(x<-matrix(data,4,2))

## [,1] [,2]
## [1,] 42 4
## [2,] 52 5
## [3,] 48 4
## [4,] 58 3

colnames(x)<-c("x1","x2")
colMeans(x)

## x1 x2
## 50 4

##Descriptive Statistics
var(x) # or cov(x)

## x1 x2
## x1 45.33333 -2.0000000
## x2 -2.00000 0.6666667

apply(x,2,var)
R Demonstration
cor(x)

## x1 x2
## x1 1.0000000 -0.3638034
## x2 -0.3638034 1.0000000

#####Sample Mean, Covariance and Correlation


## as Matrix Operations####
n<-nrow(x)
I<-diag(n)
J<-matrix(1,n,n)
n;I;J

## [1] 4
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 1 1 1 1
## [3,] 1 1 1 1
## [4,] 1 1 1 1
R Demonstration
#Mean#
(MeanX=1/n*t(x)%*%J[,1])

## [,1]
## x1 50
## x2 4

#Covariance#
(CovX=(1/n)*t(x)%*%(I-(1/n)*J)%*%x)

## x1 x2
## x1 34.0 -1.5
## x2 -1.5 0.5

#Correlation#
Sjj<-diag(CovX)
(Dhalf<-diag(sqrt(Sjj)))

## [,1] [,2]
## [1,] 5.830952 0.0000000
## [2,] 0.000000 0.7071068
R Demonstration

(DInvHalf<-solve(Dhalf))

## [,1] [,2]
## [1,] 0.1714986 0.000000
## [2,] 0.0000000 1.414214

(R<-DInvHalf %*% CovX %*% DInvHalf)

## [,1] [,2]
## [1,] 1.0000000 -0.3638034
## [2,] -0.3638034 1.0000000
Introduction
 A generalization of the familiar bell-shaped density to
several dimension plays a fundamental role in multivariate
analysis.
 While real data are never exactly multivariate normal, the
normal density is often a useful approximation to the ‘true’
population distribution.
 One advantage of the multivariate normal distribution is that it has mathematically elegant properties.
 The normal distributions are useful in practice for two
reasons:
◦ it is a bona fide model for practice
◦ the sampling distribution of a lot of statistics is
approximately normal (central limit effect)
 The univariate normal distribution has density
   f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}
Introduction

 The corresponding multivariate normal density function is

   f(x) = f(x; \mu, \Sigma) = \frac{1}{(\sqrt{2\pi})^{p}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right\},

   where -\infty < x < \infty.


 Often the notation φp is used to denote the multivariate
normal density, while Φp denotes the multivariate
cumulative distribution function.
 A multivariate normal random variable will be indicated by

X ∼ Np (µ, Σ).
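R Demonstration
As a small illustration (a sketch, assuming the MASS package is installed), we can draw a sample from a bivariate normal and evaluate the density formula above at a point.

library(MASS)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
set.seed(123)
X <- mvrnorm(n = 200, mu = mu, Sigma = Sigma)   # simulate from N_2(mu, Sigma)
dmvn <- function(x, mu, Sigma) {
  # density of N_p(mu, Sigma) evaluated at the vector x
  p <- length(mu)
  q <- t(x - mu) %*% solve(Sigma) %*% (x - mu)
  as.numeric(exp(-q / 2) / ((2 * pi)^(p / 2) * sqrt(det(Sigma))))
}
dmvn(c(0, 0), mu, Sigma)   # density at the mean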
Assessing normality
 Many of the multivariate techniques depend on the
assumption of (multivariate) normality (except some large
sample theory results)

 Even in large samples, the approximation will be better if the distribution is closer to normality.
 The detection of departures from normality is crucial.
 Three properties of the normal distribution will be important, e.g.
  ◦ marginal distributions are normal
  ◦ linear combinations are normal
  ◦ density contours are ellipsoids
 The detection of OUTLIERS (wild observations) will play an important role as well.
 Checks in univariate (bivariate) margins are usually not difficult.
Quantile-Quantile Plot (QQ-Plot)

 Special plots called QQ-plots can be used to assess the assumption of normality.
 They are plots of the sample quantiles versus the quantiles one would expect to observe if the observations actually were normally distributed.
 When the points lie very nearly along a straight line, the normality assumption remains tenable. Normality is suspect if the points deviate from a straight line.
 Moreover, the pattern of the deviations can provide clues
about the nature of the nonnormality.
 Once the reasons for nonnormality are identified,
corrective action is often possible.
QQ-Plot
 Let x_1, x_2, \dots, x_n be the n observations and denote their ordered values by
   x_{(1)} \le x_{(2)} \le \dots \le x_{(n)},
   where the x_{(j)}'s are the sample quantiles.
 When the x_{(j)}'s are distinct, exactly j observations are less than or equal to x_{(j)}. The proportion j/n of the sample at or to the left of x_{(j)} is often approximated by (j - 1/2)/n for analytical convenience (continuity correction).
 On the other hand, we can compute the quantile q_{(j)} of the standard normal distribution from
   P(Z \le q_{(j)}) = \int_{-\infty}^{q_{(j)}} \phi(z)\,dz = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \frac{j - 1/2}{n}
 Solve this equation for q_{(j)} and plot the pairs (q_{(j)}, x_{(j)}). If normality holds, the sample quantiles should satisfy approximately x_{(j)} \approx \sigma q_{(j)} + \mu.
Example (Constructing a QQ-Plot)

Radiation Data: A sample of n = 10 observations gives the


values in the following table.

1 2 3 4 5 6 7 8 9 10
-1 -0.1 0.16 2.30 1.71 1.54 0.41 0.62 0.80 1.26
R Demonstration

##QQ plot for radiation


sample<-as.matrix(c(-1,-0.1, 0.16, 2.30, 1.71, 1.54,
0.41, 0.62, 0.80, 1.26))
ProbLevel<-(1:nrow(sample) - 1/2) / nrow(sample)
qStdNormal<-qnorm(ProbLevel)
Table<-matrix(c(1:nrow(sample),sort(sample),round(ProbLevel,2),
               round(qStdNormal,3)),nrow(sample),4)
R Demonstration
par(mfrow=c(1,2))
plot(qStdNormal,sort(sample),xlab="Standard Normal Quantiles,
q(i)",ylab="Sample Quantiles, x(i)", main="QQ-Plot")
qqline(sample)
abline(h=0,v=0)

qqnorm(sample)
qqline(sample)

[Figure: side-by-side QQ-plots of the radiation data; left panel "QQ-Plot" of sample quantiles x(i) against standard normal quantiles q(i), right panel the qqnorm()/qqline() output.]
Example (Constructing a QQ-Plot)

 The pairs of points lie very nearly along a straight line, and we will not reject the notion that these data are normally distributed, particularly with a sample size as small as n = 10.
 As a rule of thumb: do not trust this procedure in samples
of size n ≤ 20
 There can be quite a bit of variability in the straightness of
the QQ-plot for small samples even when the observations
are known to come from a normal population.
 Is the QQ-plot straight?
 If it is straight, the correlation between the pairs would be 1.
 We can test for correlation coefficient.
Test for Correlation Coefficient
 Correlation coefficient
   r_Q = \frac{\sum_{j=1}^{n}(x_{(j)} - \bar{x})(q_{(j)} - \bar{q})}{\sqrt{\sum_{j=1}^{n}(x_{(j)} - \bar{x})^2}\,\sqrt{\sum_{j=1}^{n}(q_{(j)} - \bar{q})^2}}

 We compare rQ to a table of critical points for the QQ-plot


correlation test for normality.
 If rQ > the critical value, we do not reject the hypothesis of
normality.
 From our example
   \sum_{j=1}^{n}(x_{(j)} - \bar{x})\,q_{(j)} = 8.584, \qquad \sum_{j=1}^{n}(x_{(j)} - \bar{x})^2 = 8.472,
   \sum_{j=1}^{n} q_{(j)}^2 = 8.795 \quad \text{since always } \bar{q} = 0.
Table of Critical Points

Critical Points for the QQ-Plot


Correlation Coefficient Test for Normality
Sample Size Significance level α
n 0.01 0.05 0.10
5 0.8299 0.8788 0.9032
10 0.8801 0.9198 0.9351
15 0.9126 0.9389 0.9503
20 0.9269 0.9508 0.9604
25 0.9410 0.9591 0.9665
30 0.9479 0.9652 0.9795
Test for Correlation Coefficient

   r_Q = \frac{8.584}{\sqrt{8.472}\,\sqrt{8.795}} = 0.994
 The test for normality at the 10% level of significance is carried out by referring r_Q = 0.994 to the entry in the table corresponding to n = 10 and \alpha = 0.10. This entry is 0.9351.
 Since rQ > 0.9351, we do not reject the hypothesis of
normality.
 As an alternative, the Shapiro-Wilk test statistics may be
computed.
 For large sample sizes, the two statistics are nearly the
same so either can be used to judge lack of normality.
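R Demonstration
A minimal sketch of the r_Q computation for the radiation data above, together with the Shapiro-Wilk alternative; the resulting r_Q can be compared with the tabled critical point 0.9351 (n = 10, alpha = 0.10).

rad <- c(-1, -0.1, 0.16, 2.30, 1.71, 1.54, 0.41, 0.62, 0.80, 1.26)
x <- sort(rad)
n <- length(x)
q <- qnorm((1:n - 1/2) / n)   # standard normal quantiles
(rQ <- cor(x, q))             # QQ-plot correlation coefficient
shapiro.test(rad)             # Shapiro-Wilk test as an alternative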
Multivariate: Gamma plot or Chi-square plot

 A somewhat more formal method for judging the joint normality of a dataset is based on the squared generalized distances.
 Compute d_i^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}). These squared distances are approximately \chi^2_p
  ◦ if, say, n - p \ge 25, and
  ◦ if the population is multivariate normal.

 Warnings:
◦ the χ2p is only an approximation, an F distribution can
be used instead.
◦ the quantities are not independent.
Construction of the Gamma Plot

 Order the squared distances:
   d^2_{(1)} \le \dots \le d^2_{(n)}
 Graph the pairs
   \left(\chi^2_p\!\left(\frac{i - \frac{1}{2}}{n}\right),\; d^2_{(i)}\right)
 The plot should be a straight line.
 If a systematic pattern is detected, then a deviation from normality is likely.
Example:
The ordered distances and the corresponding chi-square
percentiles of the previous example (n = 10, p = 2) are listed in
the table below.
Construction of the Gamma Plot

   i     (i - 1/2)/n    d²_(i)    χ²₂-quantile at (i - 1/2)/n
   1     0.05           0.59      0.10
   2     0.15           0.81      0.33
   3     0.25           0.83      0.58
   4     0.35           0.97      0.86
   5     0.45           1.01      1.20
   6     0.55           1.02      1.60
   7     0.65           1.20      2.10
   8     0.75           1.88      2.77
   9     0.85           4.34      3.79
   10    0.95           5.33      5.99
R Demonstration

##Gamma Plot or Chi-square Plot


sales<-c(126974, 96933, 86656, 63438,55264,50976,39069,
36156,35209,32416)
profits<-c(4224,3835,3510,3758, 3939, 1809, 2946,359, 2480,2413)
x<-as.matrix(cbind(sales,profits))
cm <- colMeans(x)
S <- cov(x)
d <- apply(x, 1, function(x) t(x - cm) %*% solve(S) %*% (x - cm))
qc <- qchisq((1:nrow(x) - 1/2) / nrow(x), df = ncol(x))
sd <- sort(d)
R Demonstration
##Gamma Plot or Chi-square Plot
plot(qc, sd, main="Chi-square plot of the ordered distances",
xlab = expression(paste(chi[2]^2, "-quantile")),
ylab = "Ordered distances",
xlim = range(qc)*c(1, 1.1))
abline( a= 0, b = 1)

[Figure: chi-square plot of the ordered distances against the χ²₂ quantiles, with the 45-degree reference line abline(a = 0, b = 1).]
Construction of the Gamma Plot

 The points do not lie along a straight line with slope 1.

 This data do not appear to be bivariate normal.


Transformations
What if the assumption of normality is not tenable?
 Apply techniques for non-normal data
 Apply transformations: re-express data in other units which
are sometimes more natural than the original units.
Some well-known examples

Original scale        Transformed scale
count y               \sqrt{y}
proportion p          logit(p) = \log\left(\frac{p}{1-p}\right)
correlation r         Fisher's z(r) = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right)
 Another important transformation is the log-transform,


log(y ) which is often used in eg. growth data.
 Transformations that reduce the tail of the distribution are
the log-transform and the power-transform y = x 1/n
Box and Cox Transforms

 Allow data to suggest transformation.

 A family of transformations proposed by Box and Cox is
   y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \ne 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}

 Note that the form λ = 0 follows as the limit for λ → 0

 The parameter λ is to be estimated from the data.

 λ is estimated by maximum likelihood or by minimizing the


sample variance of the variables.
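R Demonstration
A sketch of how \lambda can be estimated in R, assuming the MASS package: boxcox() profiles the log-likelihood over a grid of \lambda values for a positive response (the variable x below is simulated purely for illustration).

library(MASS)
set.seed(42)
x <- rexp(100, rate = 1) + 0.1                      # a positive, right-skewed variable
bc <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, 0.05))  # profile log-likelihood (also plotted)
(lambda.hat <- bc$x[which.max(bc$y)])               # value of lambda maximizing the likelihood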
Plausible Values for a Normal Population Mean

 To determine whether a p × 1 vector \mu_0 is a plausible value for the mean of a multivariate normal distribution, the following test statistic (analogous to the univariate case) is used:
   T^2 = (\bar{X} - \mu_0)'\left(\tfrac{1}{n} S\right)^{-1}(\bar{X} - \mu_0) = n(\bar{X} - \mu_0)' S^{-1} (\bar{X} - \mu_0),
   where \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad S = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})' \quad \text{and} \quad \mu_0 = (\mu_{10}, \mu_{20}, \dots, \mu_{p0})'.
 The test statistic T^2 is called Hotelling's T^2. Here \frac{1}{n}S is the estimated covariance matrix of \bar{X}.
Hotelling’s T 2
 If the statistical distance T^2 is too large, i.e. if \bar{X} is "too far" from \mu_0, the hypothesis H_0: \mu = \mu_0 is rejected.
 T^2 is distributed as \frac{(n-1)p}{(n-p)} F_{p,n-p}, where F_{p,n-p} denotes a random variable with an F-distribution with p and n - p degrees of freedom.
 So we test the hypothesis H_0: \mu = \mu_0 versus H_1: \mu \ne \mu_0. At the \alpha level of significance, we reject H_0 in favor of H_1 if the observed
   T^2 > \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha).
Example
Let the data matrix for a random sample of size n = 3 from a bivariate normal population be
   X = \begin{pmatrix} 6 & 9 \\ 10 & 6 \\ 8 & 3 \end{pmatrix}.
Evaluate T^2 for \mu_0' = [9, 5]. What is the sampling distribution of T^2 in this case?
R Demonstration

X1<-c(6, 10, 8)
X2<-c(9, 6, 3)
(X<-cbind(X1, X2))

## X1 X2
## [1,] 6 9
## [2,] 10 6
## [3,] 8 3

mu0<-t(c(9,5))
n<-nrow(X)
p<-ncol(X)
alpha<-0.1
R Demonstration

####Calculating T^2 ######


Xbar<-t(colMeans(X))
S<-cov(X)
SInv<-solve(S)
(T2<-nrow(X)%*%(Xbar-mu0)%*%SInv%*%t(Xbar-mu0))

## [,1]
## [1,] 0.7777778

F<-qf((1-alpha),p,n-p)
(CV<-(n-1)*p*F/(n-p))

## [1] 198

T2>CV

## [1] FALSE
R Demonstration
##Create a function to perform Hotelling's T^2###
"Hotelling" <- function(dat,mu,alpha){
# The data matrix is "dat".
# The mean vector is mu. (a list of numbers).

if(!is.matrix(dat)) dat=as.matrix(dat)
Hotelling=NULL
n = nrow(dat)
p = ncol(dat)
cm=matrix(colMeans(dat),p,1)
S = cov(dat)
sinv=solve(S)
mu0=matrix(mu,p,1)
dev=cm-mu0
T2 = n*(t(dev)%*%sinv%*%dev)
d2=n-p
tt=T2*d2/(p*(n-1))
pvalue=1-pf(tt,p,d2)
F=qf((1-alpha),p,n-p)
CV=(n-1)*p*F/(n-p)
Hotelling=cbind(Hotelling,c(T2,CV, pvalue))
row.names(Hotelling)=c("Hotelling-T2","Critical Value","p.value")
Hotelling
}
Example

Consider the Sweat data of 20 observations. There are three


measurements, namely sweat rate (X1), sodium (X2), and potassium (X3). Test H_0: \mu = (4, 50, 10)' versus H_1: \mu \ne (4, 50, 10)'.
Using the above function "Hotelling"
##Example with SWEAT Data Set
sweat<-read.csv("./Datasets/Sweat.csv")
X<-as.matrix(sweat[,c("X1","X2","X3")])
mu0<-t(c(4, 50, 10))
alpha<-0.05
Hotelling(X,mu0,alpha)

## [,1]
## Hotelling-T2 9.73877286
## Critical Value 10.71860470
## p.value 0.06492834

We cannot reject the null hypothesis at α = 0.05


Confidence region

 To make inferences from a sample, we can extend the concept of a univariate confidence interval to a multivariate confidence region. Given data X = (X_1, X_2, \dots, X_n)', indicate a confidence region R(X) \subset \Re^p.
 The region R(X) is called a 100(1 - \alpha)% confidence region if the probability that it covers the true parameter \Theta is 1 - \alpha. We call \alpha the level of significance.
 Formally,
   P(\Theta \in R(X)) = 1 - \alpha
 The confidence region for the mean \mu of a p-dimensional normal population is given by
   P\left(n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu) \le \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)\right) = 1 - \alpha
Confidence region

 For a given sample, we can compute \bar{x} and S to obtain the confidence region:
   n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)
 To determine whether a given \mu_0 lies within the confidence region, compute the statistical distance between \bar{x} and \mu_0.
 The vector \mu_0 lies within R(\bar{x}) if
   n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) \le \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha),
   where (\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) is the squared Mahalanobis distance between \bar{x} and \mu_0.
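R Demonstration
A minimal sketch of this check using base R's mahalanobis(); it reuses the data matrix X and the vector mu0 from the sweat data example above.

in.region <- function(X, mu0, alpha = 0.05) {
  # TRUE if mu0 lies inside the 100(1 - alpha)% confidence region for mu
  n <- nrow(X); p <- ncol(X)
  d2 <- mahalanobis(colMeans(X), center = as.numeric(mu0), cov = cov(X))
  crit <- (n - 1) * p / (n - p) * qf(1 - alpha, p, n - p)
  n * d2 <= crit
}
# e.g. in.region(X, c(4, 50, 10))   # sweat data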
Simultaneous Confidence Statement

 Let X have a N_p(\mu, \Sigma) distribution; then a linear combination Z = \ell' X has a normal distribution with mean \mu_z = \ell'\mu and variance \sigma_z^2 = \ell'\Sigma\ell. The sample mean \bar{z} and variance s_z^2 are computed accordingly.
Result [T^2-interval, Scheffé's interval]:
Let X_1, X_2, \dots, X_n be a random sample from a N_p(\mu, \Sigma) population with \Sigma positive definite. Then the probability that the interval
   \left[\ell'\bar{x} - \sqrt{\tfrac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)}\sqrt{\tfrac{\ell' S \ell}{n}},\;\; \ell'\bar{x} + \sqrt{\tfrac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)}\sqrt{\tfrac{\ell' S \ell}{n}}\right]
contains \ell'\mu, simultaneously for all \ell, is equal to 1 - \alpha.


Bonferroni Intervals

 Consider the set of p confidence intervals for the mean components, and consider a partitioning of the overall significance level
   \alpha = \alpha_1 + \alpha_2 + \dots + \alpha_p
 An obvious choice is \alpha_j = \alpha/p.
 The confidence intervals become
   \bar{x}_j - t_{n-1}\!\left(\tfrac{\alpha}{2p}\right)\sqrt{\tfrac{s_{jj}}{n}} \;\le\; \mu_j \;\le\; \bar{x}_j + t_{n-1}\!\left(\tfrac{\alpha}{2p}\right)\sqrt{\tfrac{s_{jj}}{n}}
 Bonferroni and T^2 intervals both have the structure
   \ell'\bar{x} - (\text{critical value})\sqrt{\tfrac{\ell' S \ell}{n}} \;\le\; \ell'\mu \;\le\; \ell'\bar{x} + (\text{critical value})\sqrt{\tfrac{\ell' S \ell}{n}}
Bonferroni Intervals

 The ratio of their lengths for \alpha_j = \frac{\alpha}{m} is
   \frac{t_{n-1}\!\left(\frac{\alpha}{2m}\right)}{\sqrt{\frac{p(n-1)}{(n-p)} F_{p,n-p}(\alpha)}},
   which does not depend on the random quantities \bar{x} and S.
 For a small number of comparisons, e.g. when m = p, the Bonferroni intervals are more precise (shorter). Because of this, intervals are often based on the Bonferroni method.
Large sample inference
 All large-sample inferences about \mu are based on a \chi^2-distribution.
 We know that
   (\bar{X} - \mu)'\left(n^{-1} S\right)^{-1}(\bar{X} - \mu) = n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu)
   is approximately \chi^2 with p df and thus
   P\left[n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu) \le \chi^2_p(\alpha)\right] = 1 - \alpha,
   where \chi^2_p(\alpha) is the upper (100\alpha)th percentile of the \chi^2_p distribution.


 Hypothesis Testing: X 1 , X 2 , . . . , X n be a random sample
from a population with mean µ and positive definite
covariance matrix Σ.
 When (n − p) is large, the hypothesis H0 : µ = µ0 is
rejected in favor H1 : µ 6= µ0 , at a level of significance
approximately α,
Large sample inference
 if the observed
   n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) > \chi^2_p(\alpha),
   where \chi^2_p(\alpha) is the upper (100\alpha)th percentile of a chi-square distribution with p df.
 Note: The test statistic yields essentially the same results as T^2 in situations where the \chi^2-test is appropriate. This follows from the fact that \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha) and \chi^2_p(\alpha) are approximately equal for n large relative to p.
 Confidence Statement: Let X_1, X_2, \dots, X_n be a random sample from a population with mean \mu and positive definite covariance \Sigma. If (n - p) is large,
   \ell'\bar{X} \pm \sqrt{\chi^2_p(\alpha)}\sqrt{\frac{\ell' S \ell}{n}}
   will contain \ell'\mu for every \ell, with probability approximately 1 - \alpha.
Large sample inference

 Consequently, we can make 100(1 - \alpha)% simultaneous confidence statements:
   \bar{x}_j \pm \sqrt{\chi^2_p(\alpha)}\sqrt{\frac{s_{jj}}{n}}
   contains \mu_j;
 and for all pairs (\mu_j, \mu_k), j, k = 1, 2, \dots, p, the ellipses
   n\,[\bar{x}_j - \mu_j,\; \bar{x}_k - \mu_k]\begin{pmatrix} s_{jj} & s_{jk} \\ s_{jk} & s_{kk} \end{pmatrix}^{-1}\begin{pmatrix} \bar{x}_j - \mu_j \\ \bar{x}_k - \mu_k \end{pmatrix} \le \chi^2_p(\alpha)
   contain (\mu_j, \mu_k).
Example[College Data]

The scores obtained on the College Level Examination Program (CLEP) subtest X1 and the College Qualification Test (CQT) subtests X2 and X3 are given in the 'college' data set, where X1 = social science and history, X2 = verbal, and X3 = science.
(a) Compute the 95 percent simultaneous confidence intervals for \mu_1, \mu_2 and \mu_3 using the Scheffé, Bonferroni, and asymptotic procedures.
R Demonstration

We will make use of the R script ’cregion’ written by Ruey S.


Tsay.
source("./Scripts/cregion.R")
college<-read.table("./Datasets/college.DAT",header=T)
head(college)

## Social Verbal Science


## 1 468 41 26
## 2 428 39 26
## 3 514 53 21
## 4 547 67 33
## 5 614 61 27
## 6 501 67 29
R Demonstration
cregion(college)

## [1] "C.I. based on T^2"


## [,1] [,2]
## [1,] 502.99940 550.17302
## [2,] 51.21484 58.16447
## [3,] 23.63855 26.61432
## [1] "CI based on individual t"
## [,1] [,2]
## [1,] 510.34352 542.82889
## [2,] 52.29678 57.08253
## [3,] 24.10183 26.15105
## [1] "CI based on Bonferroni"
## [,1] [,2]
## [1,] 506.63589 546.53652
## [2,] 51.75057 57.62874
## [3,] 23.86794 26.38493
## [1] "Asymp. simu. CI"
## [,1] [,2]
## [1,] 503.74533 549.42709
## [2,] 51.32473 58.05458
## [3,] 23.68560 26.56727
Paired comparisons
 Let X_{ijt} be the outcome for unit i on the jth variable under treatment (experimental condition) t = 1, 2; j = 1, \dots, p; i = 1, 2, \dots, n.
 Then define D_{ij} = X_{ij1} - X_{ij2}, and let D_i = (D_{i1}, D_{i2}, \dots, D_{ip})' be the difference vector for the ith unit/pair.
 Denote E(D_i) = \delta and covariance cov(D_i) = \Sigma_d. Again, the T^2 statistic plays a critical role:
   T^2 = n(\bar{D} - \delta)' S_d^{-1} (\bar{D} - \delta).
 Assume D_i \sim N_p(\delta, \Sigma_d); then (for a test of hypothesis H_0: \delta = 0 against H_1: \delta \ne 0)
   T^2 \sim \frac{(n-1)p}{(n-p)} F_{p,n-p}
Paired comparisons

 Furthermore, if n - p is large, then T^2 \sim \chi^2_p regardless of the distribution of D_i.
 Test of H_0: \delta = 0: reject if T^2 = n\bar{d}' S_d^{-1}\bar{d} > \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)
 T^2 interval (Scheffé interval):
   \bar{d}_j - \sqrt{\frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{d_j}^2}{n}} \;\le\; \delta_j \;\le\; \bar{d}_j + \sqrt{\frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha)}\sqrt{\frac{s_{d_j}^2}{n}}
 Bonferroni intervals:
   \bar{d}_j - t_{n-1}\!\left(\frac{\alpha}{2g}\right)\sqrt{\frac{s_{d_j}^2}{n}} \;\le\; \delta_j \;\le\; \bar{d}_j + t_{n-1}\!\left(\frac{\alpha}{2g}\right)\sqrt{\frac{s_{d_j}^2}{n}},
   where g denotes the number of intervals considered simultaneously.
Example: Measurements of biochemical oxygen demand (BOD) and suspended solids (SS) were obtained for n = 11 sample splits from two laboratories. The data are displayed in the table below.

              Commercial Lab                  State Lab of Hygiene
Sample i      Xi11 (BOD)    Xi21 (SS)         Xi12 (BOD)    Xi22 (SS)
1 6 27 25 15
2 6 23 28 13
3 18 64 36 22
4 8 44 35 29
5 11 30 15 31
6 34 75 44 64
7 28 26 42 30
8 71 124 54 64
9 43 54 34 56
10 33 30 29 20
11 20 14 39 21
R Demonstration

 Do the two laboratories' chemical analyses agree? If differences exist, what is their nature?
 The T^2 statistic for testing H_0: \delta' = [\delta_1, \delta_2] = [0, 0] is constructed from the differences of the paired observations:
effluent<-read.table("./Datasets/effluent.dat",header=T)
attach(effluent)
d1<-x11-x21
d2<-x12-x22
d<-cbind(d1,d2)
alpha<-0.05
Hotelling(d,rep(0,2),alpha)
## [,1]
## Hotelling-T2 13.63931214
## Critical Value 9.45887718
## p.value 0.02082779
R Demonstration
Confidence statement are obtained from the differences.
cregion(d)

## [1] "C.I. based on T^2"


## [,1] [,2]
## [1,] -22.453272 3.72600
## [2,] -5.700119 32.24557
## [1] "CI based on individual t"
## [,1] [,2]
## [1,] -18.8467298 0.119457
## [2,] -0.4725958 27.018050
## [1] "CI based on Bonferroni"
## [,1] [,2]
## [1,] -20.573107 1.845835
## [2,] -2.974903 29.520358
## [1] "Asymp. simu. CI"
## [,1] [,2]
## [1,] -19.781395 1.054122
## [2,] -1.827351 28.372806
Compare mean vectors from two populations
 Now we will have two (independent) sets of responses: n_1 subjects under experimental condition 1 and n_2 subjects under experimental condition 2.
 The data setting is as follows. Let
  ◦ X_{\ell 1}, X_{\ell 2}, \dots, X_{\ell n_\ell} be a random sample from population \ell (\ell = 1, 2), which has mean \mu_\ell and covariance \Sigma_\ell;
  ◦ both samples be independent;
  ◦ the observed values be x_{\ell 1}, \dots, x_{\ell n_\ell} with sample mean \bar{x}_\ell and covariance S_\ell;
  ◦ (small samples only) both populations be multivariate normal;
  ◦ (small samples only) \Sigma_1 = \Sigma_2.
 Question of interest
◦ µ1 = µ2 ?
◦ statements about the components of the mean
vectors.
Common Variance

 When the covariances are equal (\Sigma_1 = \Sigma_2 = \Sigma), it is common practice to construct a "pooled" estimate of the covariance matrix:
   S_{pooled} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n_1 + n_2 - 2} = \frac{\sum_{i=1}^{n_1}(x_{1i} - \bar{x}_1)(x_{1i} - \bar{x}_1)' + \sum_{i=1}^{n_2}(x_{2i} - \bar{x}_2)(x_{2i} - \bar{x}_2)'}{n_1 + n_2 - 2}
 Why do we have the factor n_1 + n_2 - 2 in the denominator?
  ◦ we know that when adding Wishart variables, the degrees of freedom add as well;
  ◦ there is likelihood support for this procedure.
 An important hypothesis is H0 : µ1 − µ2 = δ 0 . Obviously,
E(X̄ 1 − X̄ 2 ) = µ1 − µ2
Common Variance
 The samples are independent
   ⇒ \bar{X}_1 and \bar{X}_2 are independent
   ⇒ cov(\bar{X}_1, \bar{X}_2) = 0
   cov(\bar{X}_1 - \bar{X}_2) = cov(\bar{X}_1) + cov(\bar{X}_2) = \frac{1}{n_1}\Sigma_1 + \frac{1}{n_2}\Sigma_2 = \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma,
   which is estimated by \left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{pooled}.
 Now the confidence region follows from
   T^2 = (\bar{x}_1 - \bar{x}_2 - \delta_0)'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{pooled}\right]^{-1}(\bar{x}_1 - \bar{x}_2 - \delta_0) > c^2
Common Variance
where c^2 follows from the distribution of T^2, which gives
   c^2 = \frac{(n_1 + n_2 - 2)p}{(n_1 + n_2 - p - 1)} F_{p,\, n_1 + n_2 - p - 1}(\alpha)
 Simultaneous confidence intervals for all \ell'(\mu_1 - \mu_2) are
   \ell'(\bar{X}_1 - \bar{X}_2) \pm c\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\ell' S_{pooled}\,\ell}
 In particular, a simultaneous confidence interval for \mu_{1j} - \mu_{2j} is
   (\bar{X}_{1j} - \bar{X}_{2j}) \pm c\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) s_{jj,pooled}}
 The Bonferroni confidence interval for \mu_{1j} - \mu_{2j}, when we consider the p population mean components, is
   (\bar{X}_{1j} - \bar{X}_{2j}) \pm t_{n_1+n_2-2}\!\left(\frac{\alpha}{2p}\right)\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) s_{jj,pooled}}
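R Demonstration
The notes do not include an R demonstration for the pooled two-sample T^2; a minimal sketch under the equal-covariance assumption could look as follows (x1 and x2 stand for the two data matrices and are placeholder names here).

T2.two.sample <- function(x1, x2, delta0 = rep(0, ncol(x1)), alpha = 0.05) {
  x1 <- as.matrix(x1); x2 <- as.matrix(x2)
  n1 <- nrow(x1); n2 <- nrow(x2); p <- ncol(x1)
  Sp <- ((n1 - 1) * cov(x1) + (n2 - 1) * cov(x2)) / (n1 + n2 - 2)  # pooled covariance
  d  <- colMeans(x1) - colMeans(x2) - delta0
  T2 <- drop(t(d) %*% solve((1/n1 + 1/n2) * Sp) %*% d)
  crit <- (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * qf(1 - alpha, p, n1 + n2 - p - 1)
  pval <- 1 - pf(T2 * (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p), p, n1 + n2 - p - 1)
  c(T2 = T2, critical.value = crit, p.value = pval)
}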
Two-sample situation when Σ1 6= Σ2

 Problem if samples are small.


 Large sample results available.
 Let n_1 - p and n_2 - p be large. An approximate 100(1 - \alpha)% confidence ellipsoid for \mu_1 - \mu_2 is
   \left[(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)\right]'\left[\frac{S_1}{n_1} + \frac{S_2}{n_2}\right]^{-1}\left[(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)\right] \le \chi^2_p(\alpha)
 The simultaneous confidence interval for \ell'(\mu_1 - \mu_2) is
   \ell'(\bar{X}_1 - \bar{X}_2) \pm \sqrt{\chi^2_p(\alpha)}\sqrt{\ell'\left(\frac{S_1}{n_1} + \frac{S_2}{n_2}\right)\ell}
The Behrens-Fisher Problem

 When the two populations are normally distributed but


covariance matrices are different
 Sample sizes are also small, i.e. n_1 - p and n_2 - p are small.
 A recommended method in the literature is to approximate the distribution of T^2 as
   T^2 \sim \frac{\nu p}{\nu - p + 1} F_{p,\,\nu - p + 1},
   where \min(n_1, n_2) \le \nu \le n_1 + n_2 and
   \nu = \frac{p + p^2}{\sum_{i=1}^{2}\frac{1}{n_i}\left\{ tr\!\left[\left(\frac{1}{n_i} S_i\left(\frac{1}{n_1} S_1 + \frac{1}{n_2} S_2\right)^{-1}\right)^{2}\right] + \left( tr\!\left[\frac{1}{n_i} S_i\left(\frac{1}{n_1} S_1 + \frac{1}{n_2} S_2\right)^{-1}\right]\right)^{2}\right\}}
R Demonstration

We use the Effluent data set ignoring the pairs. The R script
’Behrens.R’ is used to perform the Behrens-Fisher Test.
source("./Scripts/Behrens.R")
x1<-effluent[,1:2]
x2<-effluent[,3:4]
Behrens(x1,x2)

## [1] "Estimate of v: "


## [1] 18.7012
## [1] "Test result:"
## [,1]
## Test-T2 12.6648
## p.value 0.0103
Comparison of g populations

 Let sample `(` = 1, 2, . . . , g) be X `1 , X `2 , . . . , X `n`


 Our interest are
◦ µ1 = µ2 = · · · = µg ?
◦ statements about components
 The data setting is as follows: Let
◦ X `1 , X `2 , . . . , X `n` be a random sample from
population `, with mean µ` and common covariance Σ
◦ all samples are independent
◦ observed values x `1 , x `2 , . . . , x `n` with mean x¯` and
covariance S `
◦ (small samples) each population is multivariate normal
Multivariate Analysis of Variance

 The MANOVA model for comparing g population mean vectors is
   X_{\ell i} = \mu + \tau_\ell + e_{\ell i},
   with i = 1, 2, \dots, n_\ell and \ell = 1, 2, \dots, g.
 We take e_{\ell i} \sim N_p(0, \Sigma) and E(X_{\ell i}) = \mu + \tau_\ell, with \mu = overall mean and \tau_\ell = group effect.
 The model is overspecified and so we impose the identifiability constraint \sum_{\ell=1}^{g} n_\ell \tau_\ell = 0. That is, the (weighted) group effects add up to zero, with n_\ell individuals in group \ell.
 The following terminology is sometimes useful
◦ systematic part: µ + τ `
◦ random part: e `i
MANOVA

 It can easily be shown that
   (Total (corrected) sum of squares and cross products) = (Treatment (Between) sum of squares and cross products) + (Residual (Within) sum of squares and cross products),
   i.e.  T = B + W.
How to construct Test Statistics
 where
   W = \sum_{\ell=1}^{g}\sum_{i=1}^{n_\ell}(x_{\ell i} - \bar{x}_\ell)(x_{\ell i} - \bar{x}_\ell)'
   B = \sum_{\ell=1}^{g} n_\ell(\bar{x}_\ell - \bar{x})(\bar{x}_\ell - \bar{x})'
   B + W = \sum_{\ell=1}^{g}\sum_{i=1}^{n_\ell}(x_{\ell i} - \bar{x})(x_{\ell i} - \bar{x})'

 Remark:
◦ The "between" matrix B is often denoted as H, the
"hypothesis" matrix.
◦ The "within" matrix W is often denoted as E, the
"error" matrix
Test Statistics
There are four statistics in common use. Let \phi_j denote the eigenvalues of BW^{-1}, \theta_j = \phi_j/(1 + \phi_j) the eigenvalues of B(B+W)^{-1}, and \lambda_j = 1 - \theta_j the eigenvalues of W(B+W)^{-1}.
 Wilks' Lambda:
   \Lambda = \frac{|W|}{|B + W|} = \det(W(B+W)^{-1}) = \prod_{j=1}^{p}\lambda_j = \prod_{j=1}^{p}(1 - \theta_j) = \prod_{j=1}^{p}\frac{1}{1 + \phi_j}
 Pillai's Trace: tr(B(B+W)^{-1}) = \sum_{j=1}^{p}\theta_j
 Hotelling-Lawley Trace: tr(BW^{-1}) = \sum_{j=1}^{p}\phi_j = \sum_{j=1}^{p}\frac{\theta_j}{1 - \theta_j}
   Note that T^2 is defined slightly differently here.
 Roy's Greatest Root: the largest root \phi_{max} of the equation |B - \phi W| = 0:
   \phi_{max} = \frac{\theta_{max}}{1 - \theta_{max}}
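R Demonstration
The four statistics can be recovered directly from B and W; the following is a minimal sketch (the argument names x, for the response matrix, and grp, for the grouping factor, are illustrative). Its output should agree with the statistics reported by summary(manova(...)).

manova.stats <- function(x, grp) {
  x <- as.matrix(x); grp <- factor(grp)
  # within-group SSCP matrix W
  W <- Reduce(`+`, lapply(split.data.frame(x, grp),
        function(xi) crossprod(sweep(xi, 2, colMeans(xi)))))
  # total SSCP matrix, then B = (B + W) - W
  Tot <- crossprod(sweep(x, 2, colMeans(x)))
  B <- Tot - W
  phi <- Re(eigen(B %*% solve(W))$values)   # eigenvalues of B W^{-1}
  c(Wilks = prod(1 / (1 + phi)),
    Pillai = sum(phi / (1 + phi)),
    Hotelling.Lawley = sum(phi),
    Roy = max(phi))
}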
MANOVA Table for Comparing Population Mean
Vectors

Table: MANOVA Table

Source of Variation              Matrix of SS and CP   Degrees of Freedom              Test Statistic
Treatment                        B                      g - 1
Residual (Error)                 W                      \sum_{\ell=1}^{g} n_\ell - g    \Lambda^* = \frac{|W|}{|B + W|}
Total (corrected for the mean)   B + W                  \sum_{\ell=1}^{g} n_\ell - 1

where B = \sum_{\ell=1}^{g} n_\ell(\bar{x}_\ell - \bar{x})(\bar{x}_\ell - \bar{x})' and W = \sum_{\ell=1}^{g}\sum_{i=1}^{n_\ell}(x_{\ell i} - \bar{x}_\ell)(x_{\ell i} - \bar{x}_\ell)'.

 The quantity \Lambda^* = \frac{|W|}{|B + W|} is convenient for testing
   H_0: \tau_1 = \tau_2 = \dots = \tau_g = 0
   and is called Wilks' Lambda.
Example

Exercise: Observations on two responses (x_1, x_2)' are collected for three treatments. The observed vectors are

   Treatment 1: \begin{pmatrix} 9 \\ 3 \end{pmatrix}, \begin{pmatrix} 6 \\ 2 \end{pmatrix}, \begin{pmatrix} 9 \\ 7 \end{pmatrix}
   Treatment 2: \begin{pmatrix} 0 \\ 4 \end{pmatrix}, \begin{pmatrix} 2 \\ 0 \end{pmatrix}
   Treatment 3: \begin{pmatrix} 3 \\ 8 \end{pmatrix}, \begin{pmatrix} 1 \\ 9 \end{pmatrix}, \begin{pmatrix} 2 \\ 7 \end{pmatrix}

(i) Perform univariate ANOVA on each response.


(ii) Do a MANOVA
R Demonstration

We use here the R functions anova() and manova().


##One-Way ANOVA and MANOVA##
trt<-c(rep(1,3),rep(2,2),rep(3,3))
x1<-c(9,6,9,0,2,3,1,2)
x2<-c(3,2,7,4,0,8,9,7)
x<-cbind(trt,x1,x2)
R Demonstration
We use here the R functions anova() and manova().
##ANOVA
anova1<-aov(x1~as.factor(trt))
summary(anova1)

## Df Sum Sq Mean Sq F value Pr(>F)


## as.factor(trt) 2 78 39 19.5 0.00435 **
## Residuals 5 10 2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova2<-aov(x2~as.factor(trt))
summary(anova2)

## Df Sum Sq Mean Sq F value Pr(>F)


## as.factor(trt) 2 48 24.0 5 0.0642 .
## Residuals 5 24 4.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R Demonstration

## MANOVA
m<-manova(x[,2:3]~as.factor(trt))
summary(m) #Pillai's trace default

## Df Pillai approx F num Df den Df Pr(>F)


## as.factor(trt) 2 1.5408 8.3882 4 10 0.003096 **
## Residuals 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R Demonstration

## MANOVA
summary(m,test='Wilks')

## Df Wilks approx F num Df den Df Pr(>F)


## as.factor(trt) 2 0.038455 8.1989 4 8 0.006234 **
## Residuals 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(m,test='Roy')#options test="Hotelling-Lawley" is available

## Df Roy approx F num Df den Df Pr(>F)


## as.factor(trt) 2 8.0764 20.191 2 5 0.004029 **
## Residuals 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example: Two-way MANOVA

 For two factor variables.


 Plastic Film Data: The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. In the course of the study, three responses, X1 = tear resistance, X2 = gloss, and X3 = opacity, were measured at two levels of the factors rate of extrusion and amount of an additive. The data are stored in the data set "plastic_film.dat".
Codes:
plastic<-read.table("./Datasets/plastic_film.dat", header=T)
resp<-as.matrix(plastic[,3:5])
Fact1<-as.factor(plastic$Fac1)
Fact2<-as.factor(plastic$Fac2)
m1<-manova(resp~Fact1+Fact2+Fact1*Fact2)
Example: Two-way MANOVA
m1

## Call:
## manova(resp ~ Fact1 + Fact2 + Fact1 * Fact2)
##
## Terms:
## Fact1 Fact2 Fact1:Fact2 Residuals
## resp 1 1.7405 0.7605 0.0005 1.7640
## resp 2 1.3005 0.6125 0.5445 2.6280
## resp 3 0.4205 4.9005 3.9605 64.9240
## Deg. of Freedom 1 1 1 16
##
## Residual standard errors: 0.3320392 0.4052777 2.014386
## Estimated effects may be unbalanced

summary(m1)

## Df Pillai approx F num Df den Df Pr(>F)


## Fact1 1 0.61814 7.5543 3 14 0.003034 **
## Fact2 1 0.47697 4.2556 3 14 0.024745 *
## Fact1:Fact2 1 0.22289 1.3385 3 14 0.301782
## Residuals 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Box’s M Test
 Setup: g populations and p variables.

 We want to test the hypothesis
   H_0: \Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \Sigma
   H_1: \Sigma_i \ne \Sigma_j \text{ for some } 1 \le i \ne j \le g
 A commonly used test statistic is Box's M test.
 The test is based on the likelihood ratio statistic for testing equality of the covariance matrices,
   \Lambda = \prod_{\ell=1}^{g}\left(\frac{|S_\ell|}{|S_{pooled}|}\right)^{(n_\ell - 1)/2},
   where n_\ell is the sample size of the \ell th population, S_\ell is the sample covariance matrix of the \ell th population, and
Box’s M Test

   S_{pooled} = \frac{1}{\sum_{\ell=1}^{g}(n_\ell - 1)}\left[(n_1 - 1)S_1 + (n_2 - 1)S_2 + \dots + (n_g - 1)S_g\right]
 The test statistic is given by
   M = -2\ln(\Lambda) = \left(\sum_{\ell=1}^{g}(n_\ell - 1)\right)\ln(|S_{pooled}|) - \sum_{\ell=1}^{g}\left[(n_\ell - 1)\ln(|S_\ell|)\right]
 Approximately, (1 - u)M \sim \chi^2_v, where
   v = \frac{1}{2}p(p + 1)(g - 1)
   u = \left(\sum_{\ell=1}^{g}\frac{1}{n_\ell - 1} - \frac{1}{\sum_{\ell=1}^{g}(n_\ell - 1)}\right)\frac{2p^2 + 3p - 1}{6(p + 1)(g - 1)}
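R Demonstration
A sketch of the computation, written directly from the formulas above (each group must have more observations than variables so that the determinants are positive); the function and argument names are illustrative.

boxM.test <- function(x, grp) {
  x <- as.matrix(x); grp <- factor(grp)
  p <- ncol(x); g <- nlevels(grp)
  ns <- tapply(grp, grp, length)                        # group sample sizes
  Sl <- lapply(split.data.frame(x, grp), cov)           # group covariance matrices
  Sp <- Reduce(`+`, Map(function(S, n) (n - 1) * S, Sl, ns)) / sum(ns - 1)
  M  <- sum(ns - 1) * log(det(Sp)) -
        sum((ns - 1) * sapply(Sl, function(S) log(det(S))))
  u  <- (sum(1 / (ns - 1)) - 1 / sum(ns - 1)) *
        (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
  v  <- p * (p + 1) * (g - 1) / 2
  stat <- (1 - u) * M
  c(M = M, statistic = stat, df = v, p.value = 1 - pchisq(stat, v))
}
# e.g. boxM.test(iris[, 1:4], iris$Species)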

PCA
 Consider a data set with p response variables. The
summary information of this data will consist of the
following pieces:
◦ p sample means
◦ p sample standard deviation;equivalently p sample
variance (ie, the diagonal of the covariance matrix)
◦ p(p-1)/2 sample covariances (i.e. the off-diagonal elements of the covariance matrix); equivalently the correlations could be computed from this covariance matrix.
 Principal Component are used for several tasks:
◦ Data Reduction: When the data can be reduced from
p to a fewer variables, the manipulation of the data
becomes easier. This is true when the data have to be
used as input to other procedures such as regression
model.
◦ Display of data: Of course, p > 3 variables cannot be
plotted at once. Usually pairs of variables are plotted
against each other. At best, fancy graphical systems
allow rotating graphs of three variables.
◦ Interpretation: It is often hoped that the original p variables can be replaced by, for example, a smaller number k of new variables which have an intuitive interpretation. When the interpretation is of greater scientific interest than the original variables, insight might be gained.
◦ Inference: The above tasks refer to manipulating a sample of n objects. Often, one wants to make statements about the population. Then, inferential tools have to be invoked. Of course, data reduction will always lead to information reduction or information loss.
 In summary, one wants to reduce p original variables to k
new variables, where k is a compromise between:
◦ minimum number of variables
◦ maximum amount of information retained
PCA

 Do not
◦ overestimate the importance of principal components analysis.
◦ over-interpret principal components analysis.
 They can be useful:
◦ in exploratory phase of a data analysis (to learn about
the structure of the data)
◦ as input to other statistical procedures.
Population principal component
 Concepts can be developed for both the population level (ie
based on the theoretical distribution) or for the sample level
(ie based on the simple random sample to be studied).
 In virtually all cases, the two theories are dual to each
other, although the assumptions involved and the
generality of the conclusion can be different.
 Consider a random vector X = (X_1, \dots, X_p)' with covariance matrix \Sigma and correlation matrix \rho.
 We want to apply a linear transformation of X in order to obtain Y = (Y_1, \dots, Y_p)':
   Y_1 = \ell_{11} X_1 + \ell_{12} X_2 + \dots + \ell_{1p} X_p
   Y_2 = \ell_{21} X_1 + \ell_{22} X_2 + \dots + \ell_{2p} X_p
   \vdots
   Y_p = \ell_{p1} X_1 + \ell_{p2} X_2 + \dots + \ell_{pp} X_p
 In matrix notation,
   \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{pmatrix} = \begin{pmatrix} \ell_{11} & \dots & \ell_{1p} \\ \vdots & & \vdots \\ \ell_{p1} & \dots & \ell_{pp} \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}
 Two shorthand notations:
   Y = LX \quad \text{or} \quad Y_j = \ell_j' X \; (j = 1, 2, \dots, p)
Population PCA
 We require that
◦ the Y_j are uncorrelated;
◦ the Y_j have maximal variance;
◦ the coefficient vectors \ell_j have unit length.
 The last requirement is formalized as follows:
   \|\ell_j\|^2 = \ell_j'\ell_j = \sum_{k=1}^{p}\ell_{jk}^2 = 1
 These requirements can be translated into an iterative procedure:
   Y_1 = \ell_1' X with maximal variance, with condition \ell_1'\ell_1 = 1
   Y_2 = \ell_2' X with maximal variance, with conditions \ell_2'\ell_2 = 1, cov(Y_1, Y_2) = 0
   ...
   Y_j = \ell_j' X with maximal variance and conditions \ell_j'\ell_j = 1, cov(Y_1, Y_j) = \dots = cov(Y_{j-1}, Y_j) = 0
Mathematical argument

 Theorem: Let \Sigma be the covariance matrix of X with ordered eigenvalue-eigenvector pairs (\lambda_j, e_j), j = 1, 2, \dots, p. Then
◦ the principal components are Y_j = e_j' X = e_{j1} X_1 + \dots + e_{jp} X_p
◦ var(Y_j) = \lambda_j
◦ cov(Y_j, Y_k) = e_j'\Sigma e_k = 0, \; j \ne k
 For distinct eigenvalues, the choice of the principal components is unique.
Procedure

The procedure, deduced from this theorem is as follows:


(1) Determine the covariance matrix of your population.
(2) Calculate the eigenvectors e j , ordered such that they
correspond to decreasing eigenvalues (e 1 corresponding
to λ1 , the largest eigenvalue)
(3) Calculate the transformed variables:

Y1 = e11 X1 + e12 X2 + · · · + e1p Xp


Y2 = e21 X1 + e22 X2 + · · · + e2p Xp
..
.
Yp = ep1 X1 + ep2 X2 + · · · + epp Xp
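R Demonstration
The procedure maps directly onto eigen() in R; a small sketch with the iris measurements (the same data used in the demonstrations below). Note that the signs of the eigenvectors, and hence of the component scores, are arbitrary.

S <- cov(iris[, 1:4])
e <- eigen(S)                          # eigenvalues returned in decreasing order
e$values / sum(e$values)               # proportion of variance per component
Xc <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
scores <- Xc %*% e$vectors             # principal component scores Y_j = e_j'(x - xbar)
head(scores)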
Properties
The following properties ensure that the new variables are very
desirable:
(1) The variance of the new variables is determined by the
eigenvalues λj :
var(Yj ) = λj
(2) The covariance between pairs of distinct new variables is
zero, and hence also the correlation:
Corr(Yj , Yk ) = 0, j 6= k
(3) The total variability of the original variables is recorded by
the new variables. More precisely, the sum of the diagonal
elements of the original covariance matrix equals the sum
of the eigenvalues.
(4) The total population variance is \lambda_1 + \dots + \lambda_p. The jth principal component "explains" \frac{\lambda_j}{\lambda_1 + \dots + \lambda_p} of the variance.
(5) It is common practice to choose the first few components so that approximately 80-90% of the total variation is retained, such that the remaining components can be discarded with little loss of information.
Sample PCs
 By replacing the population covariance matrix by the
sample covariance matrix, the results carry over.
 From the sample version, inferences can be drawn about the population level. We may be led to believe that a principal component is made up of all variables except those with considerably smaller coefficients.
 However, this is merely jumping to conclusions, and one should be guided by the correlations between the principal components and each of their constituents.
 A formula is given in the following property.
   Property: Let X have covariance \Sigma, with Y_j = e_j' X the principal components. Then
   \rho_{Y_j, X_k} = \frac{e_{jk}\sqrt{\lambda_j}}{\sqrt{\sigma_{kk}}}
PCA based on correlation

 Covariance or Correlation?
◦ The original variables could have different units. Then
we are comparing apples and oranges.
◦ Original variables could have widely varying standard
deviation.
 In both cases, the analysis is driven by the variable with
large standard deviation.
 kilometer or millimeter (×106 ) ⇒ principal components will
be pulled towards the variable in millimeters.
 A solution is provided by using the correlation matrix
instead. Thus means that all variables are replaced by
their standardized versions.
 The variance changes to 1 and total variability is equal to p.
Example

 Let us consider the popular Iris Data Set. It consist of 5


variables
◦ Species (with three levels: Setosa, Versicolor,
Virginica)
◦ Petal Length (in mm)
◦ Petal Width (in mm)
◦ Sepal Length (in mm)
◦ Sepal Width (in mm)
 There are 50 observations in each of the three species,
yielding a total sample size of 150.
 First, lets focus on the four outcomes (petal and sepal
length and width).
 We will require the R packages, psych and pcaMethods.
R Demonstration
Summary information are obtained from the following R codes.
library(psych)

##
## Attaching package: ’psych’
## The following object is masked from ’package:DAAG’:
##
## cities
## The following objects are masked from ’package:ggplot2’:
##
## %+%, alpha
## The following object is masked from ’package:car’:
##
## logit

attach(iris)
sapply(iris[,-5],mean)*10

## Sepal.Length Sepal.Width Petal.Length Petal.Width


## 58.43333 30.57333 37.58000 11.99333

sapply(iris[,-5],sd)*10

## Sepal.Length Sepal.Width Petal.Length Petal.Width


PCA

 The summary information consist of 4 sample means, 4


sample standard deviations; equivalently 4 variances and 6
sample covariances; equivalently 6 correlations.
 PCA based on covariances or correlation.
 Scree plot.
 Total variances.
 Normalized coefficients.
 We will use the princomp function in R to obtain the PCs.
 Covariance matrix or original data as input.
 To do PCA using correlation matrix, we use the option
cor=TRUE in princomp.
R Demonstration
cov_mat<-cov(iris[,-5]*10)
pcacov<-princomp(covmat=cov_mat)
summary(pcacov, loadings = TRUE)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 20.5626888 4.92616228 2.79659615 1.543861813
## Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
## Cumulative Proportion 0.9246187 0.97768521 0.99478782 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 0.657 0.582 0.315
## Sepal.Width 0.730 -0.598 -0.320
## Petal.Length 0.857 -0.173 -0.480
## Petal.Width 0.358 -0.546 0.754

(eigval<-(pcacov$sdev)^2)

## Comp.1 Comp.2 Comp.3 Comp.4


## 422.824171 24.267075 7.820950 2.383509
R Demonstration
Ways to do a Scree Plot
par(mfrow=c(1,2))
screeplot(pcacov,type="lines")
screeplot(pcacov,npcs=4, type="lines")

[Figure: two scree plots of pcacov (component variances against component number), the second restricted to npcs = 4.]


R Demonstration
par(mfrow=c(1,2))
plot(pcacov$sdev^2, xlab = "Component number",
ylab = "Component variance", type = "l",
main = "Scree diagram")
plot(pcacov,type="l")

[Figure: scree diagram of the component variances (left) and plot(pcacov, type = "l") (right).]
R Demonstration
loadings(pcacov)

##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 0.657 0.582 0.315
## Sepal.Width 0.730 -0.598 -0.320
## Petal.Length 0.857 -0.173 -0.480
## Petal.Width 0.358 -0.546 0.754
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00

sum((pcacov$sdev)^2) #Total Variance

## [1] 457.2957

sum((pcacov$loadings[,1])^2)

## [1] 1
Observations

 It is easy to check that the eigenvalues add up to 457.2957


as required by the total variance of the original covariance
matrix.
 A more desirable situation is the one where the first
eigenvalue is “large" with the remaining ones very “small".
 This is hard to judge, based on the absolute numbers.
Hence, the proportion can be used.
 The cumulative proportion indicates how much of the
variability is captured by the first few principal components.
 We see the first principal component captures 92% of the
total variability. Adding the second leads to almost 98%
variability explained.
 Both the 80-90% criteria and the scree plot suggests using
the first principal component.
R Demonstration

 To obtain scores, PCA is performed on the original data.


 Plot of principal components.
 The correlation between the PCA and original variables is
very important.

pca<-princomp(iris[,-5]*10,scores=T,cor=F)
head(pca$scores)

## Comp.1 Comp.2 Comp.3 Comp.4


## [1,] -2.264703 0.4800266 0.12770602 0.02416820
## [2,] -2.080961 -0.6741336 0.23460885 0.10300677
## [3,] -2.364229 -0.3419080 -0.04420148 0.02837705
## [4,] -2.299384 -0.5973945 -0.09129011 -0.06595556
## [5,] -2.389842 0.6468354 -0.01573820 -0.03592281
## [6,] -2.075631 1.4891775 -0.02696829 0.00660818
R Demonstration
par(mfrow=c(2,2))
biplot(pca,choices=1:2,main="Plot of First and
Second Principal Components")
biplot(pca,choices=2:3)
biplot(pca,choices=3:4)
biplot(pca,choices=c(1,4))

[Figure: biplots of the principal component pairs (1,2), (2,3), (3,4) and (1,4), with observation labels and arrows for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width; the first panel is titled "Plot of First and Second Principal Components".]
R Demonstration

#pairs(pca$scores) #scatter matrix


# correlation between original variables and principal components
corr <- cor(iris[,-5], pca$scores)
corr

## Comp.1 Comp.2 Comp.3 Comp.4


## Sepal.Length 0.8901688 0.36082989 0.27565767 0.03760602
## Sepal.Width -0.4601427 0.88271627 -0.09361987 -0.01777631
## Petal.Length 0.9915552 0.02341519 -0.05444699 -0.11534978
## Petal.Width 0.9649790 0.06399985 -0.24298265 0.07535950
Observations

 From the results of the PCA, we were led to believe that PC1 is made up of all variables but Sepal.Width, since that coefficient was considerably smaller (R suppresses the printing of small loadings).

 This is jumping into conclusions, and one must be guided


by the correlation between PCs and the original variables.

 We observe that, PC1 is highly correlated with PETAL


variables and Sepal.Length, but much less with
Sepal.Width.

 Sepal.Length, Petal.Length and Petal.Width convey


essentially the same information, while there is some
additional information put forward by Sepal.Width.
R Demonstration: PCA based on correlation
#summary(pcacor, loadings = TRUE)
##To output Scores, use original data
pca_cor<-princomp(iris[,-5]*10,scores=T,cor=T)
summary(pca_cor)

## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497
## Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
## Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000

#pairs(pca_cor$scores)
#correlation between original variables and principal components
cor(iris[,-5], pca_cor$scores)

## Comp.1 Comp.2 Comp.3 Comp.4


## Sepal.Length 0.8901688 0.36082989 0.27565767 0.03760602
## Sepal.Width -0.4601427 0.88271627 -0.09361987 -0.01777631
## Petal.Length 0.9915552 0.02341519 -0.05444699 -0.11534978
## Petal.Width 0.9649790 0.06399985 -0.24298265 0.07535950

sum((pca_cor$sdev)^2) #Total Variance

## [1] 4
R Demonstration: PCA based on correlation

par(mfrow=c(1,2))
screeplot(pca_cor,type="lines",main="Scree Plot")
biplot(pca_cor,choices=1:2,main="Plot of PC1 and PC2")

[Figure: scree plot of pca_cor (left) and biplot of the first two principal components of the correlation-based PCA (right).]
Exploration of the Principal Components
 One major observation from the graphs is that there is a "gap" between the observations: the scores show a bimodal pattern.
 This is explained by the fact that the principal components
were calculated on all 150 observations, ignoring the
SPECIES variable.
 When no formal inference is needed, but the principal
components are only applied to summarize and simplify
the data, the procedure is valid.
 When formal inference is required, such a MARGINAL
PCA procedure would be very much unacceptable.
 SOLUTIONS: Calculate PCs for each species separately
or calculate PCs on transformed data (PCA by groups),
that are corrected for the effect of SPECIES (“partialled
out" PCA).
Exercise
(1) In the analysis above (based on correlation matrix)
(a) plot the third and fourth principal components. Take a
look at the scales on the axes of this and the plot
based on covariance matrix. What do you observe?
(b) Plot Sepal.Length versus the first principal
component. Is the orientation expected?
(2) Perform principal components analysis for each of the
species separately using
(a) covariance matrix
(b) correlation matrix
