STAT630 Slides: Advanced Data Analysis
Lecturer
Samuel Iddi, PhD
Department of Statistics
University of Ghana
siddi@ug.edu.gh
Identifying outliers
Multicollinearity: VIF
Multivariate analysis
Teaching Assistants
TA:
Tutorial:
Phone number:
E-mail:
Grading
Homework Assignments (20%), Interim Assessment (30%), and
Final Exam (50%).
Guidelines
Homework should be submitted within one week of the day it is
assigned.
R code will be submitted via email.
Late submission will not be accepted.
Duplicate solutions will not be graded.
The TA will discuss solutions during tutorial hours.
The interim assessment could take the form of a project,
presentation, and defense.
Computing
The main software package for this course is regular R version
2.3.1 (or higher). Install R by visiting the website:
www.r-project.org. You may also use RStudio, but it works
only after R has been installed.
Regression
Establish a relationship (often linear) between dependent
(or response) variable Y and independent (or predictor)
variable X .
We observe pairs (X , Y ).
A statistical relationship between the two variables is given
by
Y = f (X ) + ε
where ε is the error committed by using f (X ) to
approximate Y .
If the scatterplot of (X, Y) is approximately linear,
f (X ) = β0 + β1 X is chosen.
Y = β0 + β1 X + ε is called a linear regression model used
to describe the relationship between Y and X .
Parameters β0 and β1 are the intercept and slope of the
straight line respectively.
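As a minimal sketch (the data frame dat and its columns x and y are hypothetical names, not from the course data), the model can be fit in R with lm():

dat <- data.frame(x = c(1, 2, 3, 4, 5),            # hypothetical predictor values
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1)) # hypothetical responses
fit <- lm(y ~ x, data = dat)   # least squares fit of Y = beta0 + beta1*X + error
coef(fit)                      # estimated intercept and slope
summary(fit)                   # standard errors, t tests, R-squared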
Regression
If f (X ) is nonlinear, then
Y = f (X ) + ε.
The least squares estimates are β̂1 = SSXY/SSXX and
β̂0 = Ȳ − β̂1 X̄,
where β̂0 and β̂1 are unbiased estimators of β0 and β1,
respectively.
Unbiased estimator of σ²: s² = MSE = SSE/(n − 2), where
SSE = SSYY − (SSXY)²/SSXX.
β̂0 and β̂1 are random and are normally distributed.
Thus, E(β̂0) = β0, Var(β̂0) = σ²(1/n + X̄²/SSXX),
E(β̂1) = β1, and Var(β̂1) = σ²/SSXX.
Hypothesis Testing and Confidence Intervals
(β̂1 − c) / √( s² / Σ(Xi − X̄)² ) ∼ tn−2
E(Ŷh) = β0 + β1 xh
and
Var(Ŷh) = σ²( 1/n + (xh − X̄)²/SSXX ).
The CI of the mean response:
Ŷh ± t1−α/2,n−2 √( s²( 1/n + (xh − X̄)²/SSXX ) )
Prediction of new observations
Var(Ŷh) = σ²( 1 + 1/n + (xh − X̄)²/SSXX ).
A CI can also be constructed based on the t-distribution
with n − 2 df.
Such an interval is called a prediction interval.
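In R both intervals are available through predict(); a sketch, assuming the sand data read in later in these slides has already been loaded:

reg <- lm(mgco3 ~ temp, data = sand)                          # model used later in the slides
newdat <- data.frame(temp = 16)                               # the new value xh
predict(reg, newdat, interval = "confidence", level = 0.95)   # CI for the mean response
predict(reg, newdat, interval = "prediction", level = 0.95)   # prediction interval for a new observation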
Analysis of Variance
Another tool for interpreting results from a linear
regression is the ANOVA.
SOV         df    SS     MS                E(MS)            F
Regression  1     SSR    MSR = SSR/1       σ² + β1²SSXX     MSR/MSE
Error       n−2   SSE    MSE = SSE/(n−2)   σ²
Total       n−1   SSTO
The coefficient of determination is given by
r² = SSR/SSTO = (SSTO − SSE)/SSTO = 1 − SSE/SSTO
r = SSXY / √(SSXX SSYY) is the correlation coefficient.
Coefficient of determination
Note that
0 ≤ r 2 ≤ 1. Why?
If β1 = 0 then r 2 = 0.
Multiple regression
Matrix Notation: Y = X β + ε
◦ Y is n × 1
◦ X is n × p
◦ β is p × 1
◦ ε ∼ N(0, σ 2 I) is n × 1
Categorical and Continuous Covariates
Consider the following example:
◦ Y length of stay in hospital ⇒ continuous response
◦ Age ⇒ continuous variable
◦ Gender ⇒ two-level categorical variable
◦ Disability status ⇒ three-level categorical variable
Create DUMMY variables for categorical variables (an R
sketch follows below).
Let X1 = Age
X2 = 1 if male, 0 if female
X3 = 1 if not disabled, 0 otherwise
X4 = 1 if partially disabled, 0 otherwise
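In R, declaring a categorical variable as a factor generates such dummy variables automatically; a small sketch with hypothetical hospital data (note that R's default reference level is alphabetical, so the coding may differ from X2-X4 above):

hosp <- data.frame(stay   = c(5, 8, 3, 10),                     # hypothetical length of stay
                   age    = c(40, 65, 30, 70),
                   gender = factor(c("male", "female", "male", "female")),
                   disab  = factor(c("not disabled", "partially disabled",
                                     "fully disabled", "not disabled")))
model.matrix(stay ~ age + gender + disab, data = hosp)          # shows the 0/1 dummy columns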
β̂ = (X′X)⁻¹X′Y
SOV         df    SS     MS                F
Regression  p−1   SSR    MSR = SSR/(p−1)   MSR/MSE
Error       n−p   SSE    MSE = SSE/(n−p)
Total       n−1   SSTO
ANOVA
H0 : β1 = β2 = · · · = βp−1 = 0
HA : βk ≠ 0 for at least one k
Test statistic
F* = MSR/MSE ∼ Fp−1,n−p
Decision rule
if F* ≤ F(1 − α; p − 1, n − p), conclude H0
if F* > F(1 − α; p − 1, n − p), conclude HA
Inference about the regression parameters
E(β̂) = β and Cov(β̂) = MSE(X′X)⁻¹
Test for βk
◦ Hypothesis, H0 : βk = 0 vs HA : βk ≠ 0
◦ Test statistic, t* = β̂k / se(β̂k)
◦ Decision rule: if |t*| ≤ t1−α/2,n−p conclude H0, and if
|t*| > t1−α/2,n−p conclude HA
Interval estimation of β̂k : β̂k ± t1−α/2,n−p se(β̂k)
Joint inferences (an R sketch follows below)
◦ Bonferroni joint confidence interval: β̂k ± B se(β̂k),
where B = t1−α/2g,n−p
◦ Scheffé joint confidence interval: β̂k ± S se(β̂k), where
S² = gF1−α;g,n−p (preferred for large g)
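A sketch of the Bonferroni limits in R, assuming the body fat data read in later in these slides and taking g = 2 coefficients of interest:

fit <- lm(y ~ x1 + x2, data = body)                # body fat model used later in the slides
g <- 2; alpha <- 0.05
B <- qt(1 - alpha/(2*g), df.residual(fit))         # Bonferroni multiplier t_{1-alpha/2g, n-p}
est <- coef(fit)[c("x1", "x2")]
se  <- sqrt(diag(vcov(fit)))[c("x1", "x2")]
cbind(lower = est - B*se, upper = est + B*se)      # joint Bonferroni intervals
S <- sqrt(g * qf(1 - alpha, g, df.residual(fit)))  # Scheffe multiplier, for comparison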
Inference about mean response
E(Yh) = Xh′β and E(Ŷh) = Xh′β̂
σ²{Ŷh} = Xh′σ²{β̂}Xh and s²{Ŷh} = Xh′s²{β̂}Xh
(1 − α)100% confidence limits for Ŷh
Ŷh ± t1−α/2,n−p se(Ŷh )
Confidence region for regression surface
(Working-Hotelling band)
Ŷh ± Wse(Ŷh )
where W 2 = pF (1 − α; p, n − p)
Simultaneous CI for several mean response
◦ use working-hotelling band
◦ use Bonferroni (g intervals),
Ŷh ± Bse(Ŷh )
where B = t1−α/2g,n−p
Coefficient of multiple determination
Multiple determination
R² = SSR/SSTO = 1 − SSE/SSTO
R² usually gets larger as more predictors are included.
Multiple correlation: R = √R²
Departures from the model
Models must be checked for departures from the assumptions.
Standardized residuals:
e*i = (ei − ē)/√MSE,
where ē = 0. This makes it easier to detect outliers; values
greater than 3 or less than −3 are regarded as "unusual"
observations.
Boxplots, stem-and-leaf plots, histograms, dot plots, and time
plots can be used.
Residual analysis
Test statistic (the Durbin-Watson statistic):
D = Σ_{i=2}^{n} (ei − ei−1)² / Σ_{i=1}^{n} ei²
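In R this statistic is available as durbinWatsonTest() in the car package (or dwtest() in lmtest); a sketch using the sand example model fitted later in these slides:

library(car)
reg <- lm(mgco3 ~ temp, data = sand)   # model from the sand example
durbinWatsonTest(reg)                  # D statistic with a p-value for autocorrelation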
Brown-Forsythe (modified Levene) test: compute
di1 = |ei1 − ẽ1| and di2 = |ei2 − ẽ2|, with ẽ1 and ẽ2 the group
medians. Then
t* = (d̄1 − d̄2) / ( sp √(1/n1 + 1/n2) ) ∼ tn−2
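A manual sketch of this test in R, where splitting the residuals into two groups by the median of X is an illustrative choice, not a rule from these slides:

reg <- lm(mgco3 ~ temp, data = sand)              # model from the sand example
grp <- sand$temp <= median(sand$temp)             # group 1: low X; group 2: high X
e1 <- resid(reg)[grp]; e2 <- resid(reg)[!grp]
d1 <- abs(e1 - median(e1)); d2 <- abs(e2 - median(e2))
t.test(d1, d2, var.equal = TRUE)                  # pooled two-sample t test on the absolute deviations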
Procedure (Breusch-Pagan test):
◦ use the relationship ln σi² = γ0 + γ1 Xi
◦ H0 : γ1 = 0 versus HA : γ1 > 0
◦ Regress ei² against Xi and obtain SSR*
Statistic:
χ²BP = (SSR*/2) / (SSE/n)² ∼ χ²1
For q predictors, χ²BP = (SSR*/q) / (SSE/n)² ∼ χ²q
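In R a test for constant error variance is available as ncvTest() in the car package (used later in these slides) and as bptest() in lmtest; a minimal sketch:

library(car)
reg <- lm(mgco3 ~ temp, data = sand)   # model from the sand example
ncvTest(reg)                           # score test against non-constant variance
# lmtest::bptest(reg)                  # alternative Breusch-Pagan implementation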
Test for Normality
Procedure
Fit a regression line and compute the residuals
Testing method
Test to check if the linear regression model is appropriate
Linearity
◦ modify the regression model
◦ use transformation on X and Y
Homoscedasticity
◦ methods of weighted least squares
◦ variance stabilization transformation
Independence
◦ use time series regression model
◦ use generalized least squares
Remedial measures
Outliers
◦ discard outliers (be careful!)
◦ add interaction with independent variables
◦ use robust estimation method (median or absolute
deviation)
Normality
◦ use generalized linear model
◦ use transformation on Y
◦ rank regression
◦ non-parametric regression
Lack of fit
◦ include more independent variables
Transformation
Linearizing transforms
plot the data
(sand<-read.table("./Datasets/sand.txt", header=T))
## mgco3 temp
## 1 9.2 17.5
## 2 9.2 21.0
## 3 9.4 20.0
## 4 9.0 15.3
## 5 8.5 14.0
## 6 8.5 13.1
## 7 8.8 13.3
## 8 8.5 13.0
## 9 9.3 19.0
## 10 9.0 18.7
Demonstration in R
##
## Call:
## lm(formula = mgco3 ~ temp, data = sand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.192410 -0.145541 0.008722 0.147426 0.178326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.30034 0.30597 23.859 1.01e-08 ***
## temp 0.09943 0.01827 5.443 0.000614 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1697 on 8 degrees of freedom
## Multiple R-squared: 0.7874,Adjusted R-squared: 0.7608
## F-statistic: 29.63 on 1 and 8 DF, p-value: 0.0006136
Diagnostics in R
reg <- lm(mgco3 ~ temp, data = sand)   # the fit summarized above
par(mfrow=c(1,2))
plot(reg$fitted.values, reg$residuals, xlab="Fitted Values",
ylab="Residuals")
library(car)
par(mfrow=c(1,2));plot(reg,1:2)
[Figures: Residuals vs Fitted, Normal Q-Q, and Scale-Location plots for lm(mgco3 ~ temp)]
Understanding diagnosis plot
Residual vs Fitted
◦ Shows if there are non-linear patterns that are not
captured by the model
◦ If there is no distinctive pattern in the plot,
non-linearity is not a problem
Normal Q-Q Plot
◦ Shows if residuals are normally distributed
◦ If the residuals follow a straight line (it need not be
perfect), there is no serious deviation from normality
◦ Severe deviation from straight dashed-line must be of
concern
Understanding diagnosis plot
Scale-Location/Spread-Location Plot
◦ Shows if residuals are spread equally along the
ranges of predictor.
◦ Used to check for homoscedasticity assumption
◦ A horizontal line with equally (randomly) spread points
is indicative of equal variance
Formal test in R
Formal Test for Normality and Constancy of Variance
shapiro.test(reg$residuals) ##Test for Normality
##
## Shapiro-Wilk normality test
##
## data: reg$residuals
## W = 0.82427, p-value = 0.02854
Normality is suspect
##
## Shapiro-Wilk normality test
##
## data: reg2$residuals
## W = 0.8351, p-value = 0.03854
##
## Shapiro-Wilk normality test
##
## data: reg3$residuals
## W = 0.82978, p-value = 0.03326
##
## Shapiro-Wilk normality test
##
## data: reg4$residuals
## W = 0.84463, p-value = 0.05012
print(p)
[Figure: scatterplot of 1/mgco3 against temp with the fitted regression line]
Scatterplot of regression fit in R
Alternatively, we can use the base graphics.
plot(sand$temp,1/sand$mgco3)
abline(reg4)
CI of mean and PI for new observation in R
[Figure: fitted line for 1/mgco3 against temp with pointwise confidence and prediction bands]
An example with categorical independent variable
##
## Call:
## lm(formula = failure ~ locf, data = ex2)
##
## Coefficients:
## (Intercept) locfLoc 2 locfLoc 3
## 50.37 -28.24 70.84
ANOVA results from R
anova(ex2reg)
[Figures: Residuals vs Fitted and Normal Q-Q plots for ex2reg]
Diagnostics plots in R
par(mfrow=c(1,2))
plot(ex2reg)[1:2]
[Figures: Scale-Location and Constant Leverage (Residuals vs Factor Levels) plots for ex2reg]
Formal diagnostics in R
shapiro.test(ex2reg$residuals)
##
## Shapiro-Wilk normality test
##
## data: ex2reg$residuals
## W = 0.8065, p-value = 0.004461
library(nortest)
lillie.test(ex2reg$residuals);
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: ex2reg$residuals
## D = 0.25084, p-value = 0.01185
Remedial measures
tapply(ex2$failure, ex2$locf, var)/tapply(ex2$failure, ex2$locf, mean) # Square root
[Figure: qqPlot of the studentized residuals of ex2reg2 against t quantiles]
Remedial measures in R
shapiro.test(ex2reg2$residuals)
##
## Shapiro-Wilk normality test
##
## data: ex2reg2$residuals
## W = 0.978, p-value = 0.954
ncvTest(ex2reg2)
[Figure: Box-Cox log-likelihood profile for the transformation parameter with 95% confidence interval]
Diagnostics after transformation
The graph seems to suggest the natural log transformation
since 0 is included in the 95% confidence interval. However, a
transformation of Y^0.1 appears to be better.
ex2reg3 <- lm(failure^0.1 ~ locf, data = ex2)
qqPlot(ex2reg3)
[Figure: qqPlot of the studentized residuals of ex2reg3 against t quantiles]
Diagnostics after transformation
shapiro.test(ex2reg3$residuals)
##
## Shapiro-Wilk normality test
##
## data: ex2reg3$residuals
## W = 0.97697, p-value = 0.9446
ncvTest(ex2reg3)
Influence?
Identifying outlying Y observations
Residuals
ei = Yi − Ŷi
Semistudentized Residuals
ei* = ei/√MSE
Studentized Residuals
ri = ei/s(ei)
where s(ei) = √(MSE(1 − hii)) and H = X(X′X)⁻¹X′ is the hat
matrix.
Deleted Residuals
di = Yi − Ŷi(i) and di = ei/(1 − hii)
Studentized Deleted Residuals
ti = di/s(di)
Identifying outlying X observations
Let H be the hat matrix
0 ≤ hii ≤ 1 and Σ_{i=1}^{n} hii = p
Criteria
DFFITSi = (Ŷi − Ŷi(i)) / √(MSE(i) hii)
Rule of Thumb
◦ F (Di , p, n − p) < 10% or 20% not influential case.
◦ F (Di , p, n − p) > 50% influential case.
Influential cases
Rule of Thumb
◦ |DFBETASk(i)| > 1: influential case for small data sets
◦ |DFBETASk(i)| > 2/√n for large data sets (see the R sketch below).
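A sketch of applying these cut-offs in R, using the body fat model fitted on the next slides:

model <- lm(y ~ x1 + x2, data = body)          # body fat model from the next slides
db <- dfbetas(model)                           # DFBETAS for every observation and coefficient
n <- nrow(body)
which(abs(db) > 2/sqrt(n), arr.ind = TRUE)     # cases flagged by the large-data rule of thumb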
Example: Body Fats Data
Variables
n = 20
X1 skinfold thickness
X2 thigh circumference
X3 midarm circumference
Y amount of fat
Some Implementation in R
body<-read.table("./Datasets/body.txt", header=T)
head(body)
## x1 x2 x3 y
## 1 19.5 43.1 29.1 11.9
## 2 24.7 49.8 28.2 22.8
## 3 30.7 51.9 37.0 18.7
## 4 29.8 54.3 31.1 20.1
## 5 19.1 42.2 30.9 12.9
## 6 25.6 53.9 23.7 21.7
(model<-lm(y~x1+x2, data=body))
##
## Call:
## lm(formula = y ~ x1 + x2, data = body)
##
## Coefficients:
## (Intercept) x1 x2
## -19.1742 0.2224 0.6594
Some Implementation in R
par(mfrow=c(1,2));plot(model, 4:5)
[Figures: Cook's distance plot and Residuals vs Leverage plot for lm(y ~ x1 + x2)]
Some Implementation in R
model$residuals
## 1 2 3 4 5
## -1.6827093112 3.6429311788 -3.1759701405 -3.1584651200 -0.0002886579
## 6 7 8 9 10
## -0.3608155187 0.7161991891 4.0147327554 2.6551057360 -2.4748115410
## 11 12 13 14 15
## 0.3358063798 2.2255110139 -3.9468613463 3.4474561945 0.5705871038
## 16 17 18 19 20
## 0.6422984777 -0.8509464751 -0.7829198812 -2.8572887647 1.0404487275
Understanding diagnosis plot
rstandard(model)
## 1 2 3 4 5
## -0.7402261049 1.4765806027 -1.5757913418 -1.3171518783 -0.0001308889
## 6 7 8 9 10
## -0.1519867578 0.3064529233 1.6606069555 1.1095479783 -1.0316491806
## 11 12 13 14 15
## 0.1407848595 0.9272163634 -1.7121512617 1.4686083237 0.2747599004
## 16 17 18 19 20
## 0.2655244112 -0.3538020409 -0.3435016352 -1.1631291289 0.4197625143
rstudent(model)
## 1 2 3 4 5
## -0.7299854027 1.5342541325 -1.6543295725 -1.3484842072 -0.0001269809
## 6 7 8 9 10
## -0.1475490938 0.2981276214 1.7600924916 1.1176487404 -1.0337284208
## 11 12 13 14 15
## 0.1366610657 0.9231785040 -1.8259027246 1.5247630510 0.2671500921
## 16 17 18 19 20
## 0.2581323416 -0.3445090997 -0.3344080836 -1.1761712768 0.4093564171
Some Implementation in R
influence(model)$hat
## 1 2 3 4 5 6
## 0.20101253 0.05889478 0.37193301 0.11094009 0.24801034 0.12861620
## 7 8 9 10 11 12
## 0.15551745 0.09628780 0.11463564 0.11024435 0.12033655 0.10926629
## 13 14 15 16 17 18
## 0.17838181 0.14800684 0.33321201 0.09527739 0.10559466 0.19679280
## 19 20
## 0.06695419 0.05008526
dffits(model)
## 1 2 3 4 5
## -3.661472e-01 3.838103e-01 -1.273067e+00 -4.763483e-01 -7.292347e-05
## 6 7 8 9 10
## -5.668650e-02 1.279371e-01 5.745212e-01 4.021649e-01 -3.638725e-01
## 11 12 13 14 15
## 5.054583e-02 3.233366e-01 -8.507812e-01 6.355141e-01 1.888521e-01
## 16 17 18 19 20
## 8.376829e-02 -1.183735e-01 -1.655265e-01 -3.150707e-01 9.399706e-02
Some Implementation in R
(cd<-cooks.distance(model))
## 1 2 3 4 5
## 4.595055e-02 4.548118e-02 4.901567e-01 7.216190e-02 1.883399e-09
## 6 7 8 9 10
## 1.136518e-03 5.764939e-03 9.793853e-02 5.313352e-02 4.395704e-02
## 11 12 13 14 15
## 9.037986e-04 3.515436e-02 2.121502e-01 1.248925e-01 1.257530e-02
## 16 17 18 19 20
## 2.474925e-03 4.926142e-03 9.636470e-03 3.236006e-02 3.096787e-03
Some Implementation in R
dfbeta(model)
## (Intercept) x1 x2
## 1 -2.5873120410 -4.045756e-02 6.851256e-02
## 2 1.3885856142 3.359106e-02 -3.996603e-02
## 3 -6.7460811783 -3.417891e-01 2.959198e-01
## 4 -0.8298018102 -8.699579e-02 5.576707e-02
## 5 -0.0005491464 -9.548311e-06 1.507863e-05
## 6 0.3417033826 1.252797e-02 -1.327784e-02
## 7 -0.6662800194 -4.869870e-03 1.625790e-02
## 8 2.0621886739 1.119746e-01 -9.133441e-02
## 9 -1.2562286444 -8.876192e-02 7.137567e-02
## 10 1.9837372912 7.407218e-02 -7.811621e-02
## 11 -0.0776962164 5.331751e-03 -7.452907e-04
## 12 -1.0957798277 6.844439e-03 2.047112e-02
## 13 0.9361849968 1.685643e-01 -1.063493e-01
## 14 3.6377705378 3.307612e-02 -8.349511e-02
## 15 -0.0258331550 -3.893438e-02 2.059515e-02
## 16 0.0800533575 1.345694e-02 -7.525586e-03
## 17 0.6827037689 1.715290e-02 -2.275412e-02
## 18 1.1340627287 2.347924e-02 -3.472625e-02
## 19 -1.0715507527 -1.221911e-03 1.855296e-02
## 20 0.0873645353 7.127872e-04 -9.895689e-04
Some Implementation in R
influence.measures(model)
## Influence measures of
## lm(formula = y ~ x1 + x2, data = body) :
##
## dfb.1_ dfb.x1 dfb.x2 dffit cov.r cook.d hat inf
## 1 -3.05e-01 -1.31e-01 2.32e-01 -3.66e-01 1.361 4.60e-02 0.2010
## 2 1.73e-01 1.15e-01 -1.43e-01 3.84e-01 0.844 4.55e-02 0.0589
## 3 -8.47e-01 -1.18e+00 1.07e+00 -1.27e+00 1.189 4.90e-01 0.3719 *
## 4 -1.02e-01 -2.94e-01 1.96e-01 -4.76e-01 0.977 7.22e-02 0.1109
## 5 -6.37e-05 -3.05e-05 5.02e-05 -7.29e-05 1.595 1.88e-09 0.2480 *
## 6 3.97e-02 4.01e-02 -4.43e-02 -5.67e-02 1.371 1.14e-03 0.1286
## 7 -7.75e-02 -1.56e-02 5.43e-02 1.28e-01 1.397 5.76e-03 0.1555
## 8 2.61e-01 3.91e-01 -3.32e-01 5.75e-01 0.780 9.79e-02 0.0963
## 9 -1.51e-01 -2.95e-01 2.47e-01 4.02e-01 1.081 5.31e-02 0.1146
## 10 2.38e-01 2.45e-01 -2.69e-01 -3.64e-01 1.110 4.40e-02 0.1102
## 11 -9.02e-03 1.71e-02 -2.48e-03 5.05e-02 1.359 9.04e-04 0.1203
## 12 -1.30e-01 2.25e-02 7.00e-02 3.23e-01 1.152 3.52e-02 0.1093
## 13 1.19e-01 5.92e-01 -3.89e-01 -8.51e-01 0.827 2.12e-01 0.1784
## 14 4.52e-01 1.13e-01 -2.98e-01 6.36e-01 0.937 1.25e-01 0.1480
## 15 -3.00e-03 -1.25e-01 6.88e-02 1.89e-01 1.775 1.26e-02 0.3332 *
## 16 9.31e-03 4.31e-02 -2.51e-02 8.38e-02 1.309 2.47e-03 0.0953
## 17 7.95e-02 5.50e-02 -7.61e-02 -1.18e-01 1.312 4.93e-03 0.1056
## 18 1.32e-01 7.53e-02 -1.16e-01 -1.66e-01 1.462 9.64e-03 0.1968
## 19 -1.30e-01 -4.07e-03 6.44e-02 -3.15e-01 1.002 3.24e-02 0.0670
## 20 1.02e-02 2.29e-03 -3.31e-03 9.40e-02 1.224 3.10e-03 0.0501
Some Implementation in R
library(car)
outlierTest(model) # Bonferroni p-value for most extreme obs
Model selection
Model validation
Procedures for variable reduction
General case X1 , X2 , . . . , Xn
2^n − 1 different models
Criteria for comparing the regression models
Different criteria can be used
Rp2
MSEp
Cp
PRESSp
Rp² = 1 − SSEp/SSTO
Given that SSTO is constant we will look for the smallest
SSEp.
Ra2 or MSEp criterion
If we want to account for the number of parameters
Ra² = 1 − ((n − 1)/(n − p)) (SSE/SSTO) = 1 − MSE/(SSTO/(n − 1))
Data
X1 blood clotting score
X2 prognostic index
X3 enzyme function test score
X4 liver function test
Y survival
Objective
Predict survival of patients with a particular type of liver
operation.
Implementation in R
surg<-read.table("./Datasets/surgical.txt", header=T)
cor(surg)
## x1 x2 x3 x4 y logy
## x1 1.00000000 0.09011973 -0.14963411 0.5024157 0.3725187 0.3464042
## x2 0.09011973 1.00000000 -0.02360544 0.3690256 0.5539760 0.5928888
## x3 -0.14963411 -0.02360544 1.00000000 0.4164245 0.5802438 0.6651216
## x4 0.50241567 0.36902563 0.41642451 1.0000000 0.7223266 0.7262058
## y 0.37251865 0.55397598 0.58024382 0.7223266 1.0000000 0.9130965
## logy 0.34640419 0.59288884 0.66512160 0.7262058 0.9130965 1.0000000
Implementation in R
pairs(surg[,-6])
[Figure: scatterplot matrix (pairs plot) of x1, x2, x3, x4, and y for the surgical data]
Implementation in R: Diagnostics
model<-lm(y~x1+x2+x3+x4, data=surg)
library(car)
par(mfrow=c(1,2))
qqnorm(model$residuals)
qqline(model$residuals)
plot(model$fitted,model$residuals,xlab="Fitted",ylab="Residuals",
main="Time")
abline(h=0)
[Figures: normal Q-Q plot and residuals-vs-fitted plot for the model on y (Time) and for the model on log y (LogTime)]
#install.packages('leaps')
library(leaps)
modelsel<-regsubsets(logy~x1+x2+x3+x4,data=surg,nbest=2)
summary(modelsel)
[Figure: regsubsets plot of R² for the best subsets of x1-x4]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='rsq',legend=F)
[Figure: subsets plot of R² (rsq) against subset size]
Implementation in R: Variable Selection
plot(modelsel, scale="adjr2")
[Figure: regsubsets plot of adjusted R²]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='adjr2', legend=F)
[Figure: subsets plot of adjusted R² against subset size]
Implementation in R: Variable Selection
plot(modelsel, scale="Cp")
[Figure: regsubsets plot of Cp]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='cp',legend=F)
[Figure: subsets plot of Cp against subset size]
Implementation in R: Variable Selection
plot(modelsel) ##scale=bic
[Figure: regsubsets plot of BIC]
Implementation in R: Variable Selection
car::subsets(modelsel, legend=F) ##statistics=bic
[Figure: subsets plot of BIC against subset size]
Implementation in R: Variable Selection
car::subsets(modelsel, statistic='adjr2',
legend=c(3.5, -37)) ##statistics=bic
[Figure: subsets plot of adjusted R² against subset size, with the legend placed inside the plotting region]
Implementation in R: Variable Selection
mod<-lm(logy~x1+x2+x3+x4,data=surg)
library(DAAG)
press(mod)
## [1] 0.1455741
Implementation in R: Comparing Two Models
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:DAAG':
##
##     hills
fit1 <- lm(logy ~ x1 + x2 + x3 + x4, data = surg)   # full model
dropterm(fit1, test = "F")
stepAIC(fit1)
## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Implementation in R: Automatic Search Procedure
stepAIC(fit1,direction="backward")
## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Implementation in R: Automatic Search Procedure
stepAIC(fit1,direction="both")
## Start: AIC=-324.71
## logy ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 0.00009 0.10986 -326.67
## <none> 0.10977 -324.71
## - x1 1 0.35544 0.46521 -248.73
## - x2 1 1.00583 1.11560 -201.50
## - x3 1 1.28075 1.39052 -189.60
##
## Step: AIC=-326.67
## logy ~ x1 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 0.10986 -326.67
## + x4 1 0.00009 0.10977 -324.71
## - x1 1 0.63315 0.74301 -225.45
## - x2 1 1.29732 1.40718 -190.96
## - x3 1 2.12263 2.23249 -166.04
##
## Call:
## lm(formula = logy ~ x1 + x2 + x3, data = surg)
##
## Coefficients:
## (Intercept) x1 x2 x3
## 0.483621 0.069225 0.009295 0.009524
Multicollinearity
Multicollinearity affects
◦ regression coefficients
◦ s{β̂}
◦ fitted values and predictors
◦ tests and confidence intervals
crew<-read.table("./Datasets/crew.txt", header=T)
cor(crew[,-3])
## x1 x2
## x1 1 0
## x2 0 1
Some Implementation in R
summary(lm(y~x1,data=crew))
##
## Call:
## lm(formula = y ~ x1, data = crew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.750 -3.750 0.125 4.500 6.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.500 10.111 2.324 0.0591 .
## x1 5.375 1.983 2.711 0.0351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.609 on 6 degrees of freedom
## Multiple R-squared: 0.5505,Adjusted R-squared: 0.4755
## F-statistic: 7.347 on 1 and 6 DF, p-value: 0.03508
Implementation in R
summary(lm(y~x2,data=crew))
##
## Call:
## lm(formula = y ~ x2, data = crew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.000 -4.688 -0.250 5.250 7.250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.250 11.608 2.348 0.0572 .
## x2 9.250 4.553 2.032 0.0885 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 6 degrees of freedom
## Multiple R-squared: 0.4076,Adjusted R-squared: 0.3088
## F-statistic: 4.128 on 1 and 6 DF, p-value: 0.08846
Implementation in R
summary(lm(y~x1+x2,data=crew))
##
## Call:
## lm(formula = y ~ x1 + x2, data = crew)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 1.625 -1.375 -1.625 1.375 -2.125 1.875 0.625 -0.375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3750 4.7405 0.079 0.940016
## x1 5.3750 0.6638 8.097 0.000466 ***
## x2 9.2500 1.3276 6.968 0.000937 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.877 on 5 degrees of freedom
## Multiple R-squared: 0.958,Adjusted R-squared: 0.9412
## F-statistic: 57.06 on 2 and 5 DF, p-value: 0.000361
Implementation in R
vif(lm(y~x1+x2,data=crew))
## x1 x2
## 1 1
sqrt(vif(lm(y~x1+x2,data=crew)))>2 #problem
## x1 x2
## FALSE FALSE
Body Fat Example: Implementation in R
body<-read.table("./Datasets/body.txt", header=T)
summary(lm(y~x1,data=body))
##
## Call:
## lm(formula = y ~ x1, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1195 -2.1904 0.6735 1.9383 3.8523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4961 3.3192 -0.451 0.658
## x1 0.8572 0.1288 6.656 3.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.82 on 18 degrees of freedom
## Multiple R-squared: 0.7111,Adjusted R-squared: 0.695
## F-statistic: 44.3 on 1 and 18 DF, p-value: 3.024e-06
Implementation in R
summary(lm(y~x2,data=body))
##
## Call:
## lm(formula = y ~ x2, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4949 -1.5671 0.1241 1.3362 4.4084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.6345 5.6574 -4.178 0.000566 ***
## x2 0.8565 0.1100 7.786 3.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.51 on 18 degrees of freedom
## Multiple R-squared: 0.771,Adjusted R-squared: 0.7583
## F-statistic: 60.62 on 1 and 18 DF, p-value: 3.6e-07
Implementation in R
summary(lm(y~x1+x2,data=body))
##
## Call:
## lm(formula = y ~ x1 + x2, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9469 -1.8807 0.1678 1.3367 4.0147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.1742 8.3606 -2.293 0.0348 *
## x1 0.2224 0.3034 0.733 0.4737
## x2 0.6594 0.2912 2.265 0.0369 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.543 on 17 degrees of freedom
## Multiple R-squared: 0.7781,Adjusted R-squared: 0.7519
## F-statistic: 29.8 on 2 and 17 DF, p-value: 2.774e-06
Implementation in R
summary(lm(y~x1+x2+x3,data=body))
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7263 -1.6111 0.3923 1.4656 4.1277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.085 99.782 1.173 0.258
## x1 4.334 3.016 1.437 0.170
## x2 -2.857 2.582 -1.106 0.285
## x3 -2.186 1.595 -1.370 0.190
##
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared: 0.8014,Adjusted R-squared: 0.7641
## F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
Implementation in R
cor(body[,-4])
## x1 x2 x3
## x1 1.0000000 0.9238425 0.4577772
## x2 0.9238425 1.0000000 0.0846675
## x3 0.4577772 0.0846675 1.0000000
vif(lm(y~x1+x2+x3,data=body))
## x1 x2 x3
## 708.84 564.34 104.61
sqrt(vif(lm(y~x1+x2+x3,data=body)))>2 #problem
## x1 x2 x3
## TRUE TRUE TRUE
Preliminary
Let i = 1, . . . , n denote subjects, individuals, items,
experimental units, etc., and j = 1, 2, . . . , p denote the variables
or characteristics measured for each individual. The jth
measurement for the ith subject will be denoted by xij . For n
measurements on p variables, the data are set up as follows:
Example
A selection of four receipts from the university bookstore was
obtained in order to investigate the nature of book sales. Each
receipt provided, among other things, the number of books sold
and the total amount of each sale. Let the first variable be the
total Cedi sales and the second variable be the number of
books sold. Then we can regard the corresponding numbers on
the receipts as four measurements on two variables.
Example (cont.)
Suppose the data, in tabular form, are
Solution
Using the notation just introduced, we have
yij = axij + b, i = 1, 2, . . . , n
yik = cxik + d, i = 1, 2, . . . , n
provided a and c have the same sign (ac > 0). What
happens if ac < 0?
Do not overestimate the importance of linear association
measures.
◦ They do not capture other kinds of association (eg.
quadratic relationship between variables)
◦ Correlation coefficient is very sensitive to outliers.
Matrix representation:
Sample mean: x̄ = (1/n) Σ_{i=1}^{n} xi = (x̄1, x̄2, . . . , x̄p)′
Sample covariance matrix: Sn = (sjk), with
sjk = (1/n) Σ_{i=1}^{n} (xij − x̄j)(xik − x̄k), i.e.
Sn =
s11 s12 . . . s1p
s21 s22 . . . s2p
 .   .        .
sp1 sp2 . . . spp
Sample correlation matrix: R = (rjk) with rjj = 1, i.e.
R =
1   r12 . . . r1p
r21 1   . . . r2p
 .   .        .
rp1 rp2 . . . 1
Sample Mean, Covariance, and Correlation as Matrix
Operations
x̄ = (1/n) X′1n
S = (1/(n − 1)) X′(In − (1/n)Jn)X
R = D^(−1/2) S D^(−1/2)
S = D^(1/2) R D^(1/2)
Examples
Find x̄, S, R.
TRY: Using the IRIS data set, find x̄, S, R for the entire
data and for each species.
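A sketch for the TRY exercise (the iris data ship with R; the per-species summaries use split() and lapply()):

data(iris)
colMeans(iris[, -5])    # overall sample mean vector x-bar
cov(iris[, -5])         # overall sample covariance matrix S
cor(iris[, -5])         # overall sample correlation matrix R
# the same summaries within each species
lapply(split(iris[, -5], iris$Species),
       function(d) list(mean = colMeans(d), S = cov(d), R = cor(d)))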
R Demonstration
data<-c(42, 52, 48, 58, 4, 5, 4, 3)
(x<-matrix(data,4,2))
## [,1] [,2]
## [1,] 42 4
## [2,] 52 5
## [3,] 48 4
## [4,] 58 3
colnames(x)<-c("x1","x2")
colMeans(x)
## x1 x2
## 50 4
##Descriptive Statistics
var(x) # or cov(x)
## x1 x2
## x1 45.33333 -2.0000000
## x2 -2.00000 0.6666667
apply(x,2,var)
R Demonstration
cor(x)
## x1 x2
## x1 1.0000000 -0.3638034
## x2 -0.3638034 1.0000000
(n <- nrow(x))          # number of observations
## [1] 4
(I <- diag(n))          # n x n identity matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
(J <- matrix(1, n, n))  # n x n matrix of ones
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 1 1 1 1
## [3,] 1 1 1 1
## [4,] 1 1 1 1
R Demonstration
#Mean#
(MeanX=1/n*t(x)%*%J[,1])
## [,1]
## x1 50
## x2 4
#Covariance#
(CovX=(1/n)*t(x)%*%(I-(1/n)*J)%*%x)
## x1 x2
## x1 34.0 -1.5
## x2 -1.5 0.5
#Correlation#
Sjj<-diag(CovX)
(Dhalf<-diag(sqrt(Sjj)))
## [,1] [,2]
## [1,] 5.830952 0.0000000
## [2,] 0.000000 0.7071068
R Demonstration
(DInvHalf<-solve(Dhalf))
## [,1] [,2]
## [1,] 0.1714986 0.000000
## [2,] 0.0000000 1.414214
(R <- DInvHalf %*% CovX %*% DInvHalf)   # correlation matrix R = D^(-1/2) S D^(-1/2)
## [,1] [,2]
## [1,] 1.0000000 -0.3638034
## [2,] -0.3638034 1.0000000
Introduction
A generalization of the familiar bell-shaped density to
several dimension plays a fundamental role in multivariate
analysis.
While real data are never exactly multivariate normal, the
normal density is often a useful approximation to the ‘true’
population distribution.
One advantage of the multivariate normal distribution is that it
has mathematically elegant properties.
The normal distributions are useful in practice for two
reasons:
◦ it is a bona fide model for practice
◦ the sampling distribution of a lot of statistics is
approximately normal (central limit effect)
The univariate normal distribution
f(x) = (1/√(2πσ²)) exp( −(1/2)((x − µ)/σ)² )
Introduction
X ∼ Np (µ, Σ).
Assessing normality
Many of the multivariate techniques depend on the
assumption of (multivariate) normality (except some large
sample theory results)
Observation i:  1     2     3     4     5     6     7     8     9     10
Value xi:      -1    -0.1  0.16  2.30  1.71  1.54  0.41  0.62  0.80  1.26
R Demonstration
qqnorm(sample)
qqline(sample)
[Figure: normal Q-Q plot of the sample with the qqline reference line]
The pairs of points lie very nearly along a straight line, and
we will not reject the notion that these data are normally
distributed, particularly with a sample size as small as
n = 10.
As a rule of thumb: do not trust this procedure in samples
of size n ≤ 20
There can be quite a bit of variability in the straightness of
the QQ-plot for small samples even when the observations
are known to come from a normal population.
Is the QQ-plot straight?
If it is straight, the correlation between the pairs would be 1.
We can test for correlation coefficient.
Test for Correlation Coefficient
Correlation coefficient
rQ = Σ_{j=1}^{n} (x(j) − x̄)(q(j) − q̄) / ( √(Σ_{j=1}^{n} (x(j) − x̄)²) √(Σ_{j=1}^{n} (q(j) − q̄)²) )
rQ = 8.584 / (√8.472 √8.795) = 0.994
The test for normality at the 10% level of significance is
provided by referring rQ = 0.994 to the table entry
corresponding to n = 10 and α = 0.10. This entry is
0.9351.
Since rQ > 0.9351, we do not reject the hypothesis of
normality.
As an alternative, the Shapiro-Wilk test statistics may be
computed.
For large sample sizes, the two statistics are nearly the
same so either can be used to judge lack of normality.
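rQ can be computed directly in R as the correlation between the ordered observations and the corresponding normal quantiles; a sketch using the sample vector from the qqnorm code above:

qq <- qqnorm(sample, plot.it = FALSE)   # x: theoretical quantiles, y: the data
cor(qq$x, qq$y)                         # rQ, to be compared with the table value
shapiro.test(sample)                    # the closely related Shapiro-Wilk test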
Multivariate: Gamma plot or Chi-square plot
Warnings:
◦ the χ2p is only an approximation, an F distribution can
be used instead.
◦ the quantities are not independent.
Construction of the Gamma Plot
Order the squared generalized distances:
d²(1) ≤ d²(2) ≤ · · · ≤ d²(n)
Graph the pairs
( χ²p((i − 1/2)/n), d²(i) )
Tabulate i, (i − 1/2)/n, d²(i), and the corresponding χ² quantiles.
[Figure: chi-square plot of the ordered d²(i) against χ²2 quantiles]
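A sketch of this chi-square plot in R using Mahalanobis distances, shown here for the four iris measurements (p = 4); any multivariate data matrix could be substituted:

X <- as.matrix(iris[, -5])
d2 <- mahalanobis(X, colMeans(X), cov(X))     # squared generalized distances
n <- nrow(X); p <- ncol(X)
q <- qchisq((1:n - 0.5)/n, df = p)            # chi-square quantiles at (i - 1/2)/n
plot(q, sort(d2), xlab = "Chi-square quantile", ylab = "Ordered squared distance")
abline(0, 1)                                  # points near this line support multivariate normality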
Construction of the Gamma Plot
X1<-c(6, 10, 8)
X2<-c(9, 6, 3)
(X<-cbind(X1, X2))
## X1 X2
## [1,] 6 9
## [2,] 10 6
## [3,] 8 3
mu0<-t(c(9,5))
n<-nrow(X)
p<-ncol(X)
alpha<-0.1
R Demonstration
xbar <- matrix(colMeans(X), p, 1)          # sample mean vector
S <- cov(X)                                # sample covariance matrix
dev <- xbar - t(mu0)                       # deviation from the hypothesized mean
(T2 <- n * t(dev) %*% solve(S) %*% dev)    # Hotelling's T2
## [,1]
## [1,] 0.7777778
F<-qf((1-alpha),p,n-p)
(CV<-(n-1)*p*F/(n-p))
## [1] 198
drop(T2) > CV   # compare Hotelling's T2 with the critical value
## [1] FALSE
R Demonstration
##Create a function to perform Hotelling's T^2###
"Hotelling" <- function(dat,mu,alpha){
# The data matrix is "dat".
# The mean vector is mu. (a list of numbers).
if(!is.matrix(dat)) dat=as.matrix(dat)
Hotelling=NULL
n = nrow(dat)
p = ncol(dat)
cm=matrix(colMeans(dat),p,1)
S = cov(dat)
sinv=solve(S)
mu0=matrix(mu,p,1)
dev=cm-mu0
T2 = n*(t(dev)%*%sinv%*%dev)
d2=n-p
tt=T2*d2/(p*(n-1))
pvalue=1-pf(tt,p,d2)
F=qf((1-alpha),p,n-p)
CV=(n-1)*p*F/(n-p)
Hotelling=cbind(Hotelling,c(T2,CV, pvalue))
row.names(Hotelling)=c("Hotelling-T2","Critical Value","p.value")
Hotelling
}
Example
## [,1]
## Hotelling-T2 9.73877286
## Critical Value 10.71860470
## p.value 0.06492834
The (1 − α)100% confidence region for µ consists of all µ satisfying
n(x̄ − µ)′S⁻¹(x̄ − µ) ≤ ((n − 1)p/(n − p)) Fp,n−p(α)
with critical constant ((n − 1)p/(n − p)) Fp,n−p(α).
The ratio of their lengths, for αj = α/m, is
tn−1(α/2m) / √( (p(n − 1)/(n − p)) Fp,n−p(α) )
contain (µj , µk ).
Example[College Data]
Bonferroni intervals:
d̄j − tn−1(α/2g) √(s²dj/n) ≤ δj ≤ d̄j + tn−1(α/2g) √(s²dj/n)
Example: Measurement of biochemical oxygen demand (BoD)
and a suspended solids(SS) were obtained for n = 11 sample
splits from two laboratories. The data are displayed in the table
below.
We use the Effluent data set ignoring the pairs. The R script
’Behrens.R’ is used to perform the Behrens-Fisher Test.
source("./Scripts/Behrens.R")
x1<-effluent[,1:2]
x2<-effluent[,3:4]
Behrens(x1,x2)
Xℓi = µ + τℓ + eℓi
with i = 1, 2, . . . , nℓ and ℓ = 1, 2, . . . , g.
We take eℓi ∼ N(0, Σ) and E(Xℓi) = µ + τℓ, with µ the
overall mean and τℓ the group effect.
The model is overspecified and so we impose the
identifiability constraint Σ_{ℓ=1}^{g} nℓτℓ = 0. That is, all group
effects add up to zero and there are nℓ individuals in each
group.
The following terminology is sometimes useful
◦ systematic part: µ + τℓ
◦ random part: eℓi
MANOVA
Remark:
◦ The "between" matrix B is often denoted as H, the
"hypothesis" matrix.
◦ The "within" matrix W is often denoted as E, the
"error" matrix
Test Statistics
There are four statistics in common use.
Wilk's Lambda:
Λ = |W| / |B + W| = det(W(B + W)⁻¹)
  = Π_{j=1}^{p} λj = Π_{j=1}^{p} θj = Π_{j=1}^{p} 1/(1 + φj)
Pillai's Trace: tr(B(B + W)⁻¹) = Σ_{j=1}^{p} θj
Hotelling-Lawley Trace:
tr(BW⁻¹) = Σ_{j=1}^{p} φj = Σ_{j=1}^{p} θj/(1 − θj)
Note that T² is defined slightly differently here.
Roy's Greatest Root: the largest root of the equation
|B − φW| = 0:
φmax = θmax/(1 − θmax)
MANOVA Table for Comparing Population Mean
Vectors
H0 : τ 1 = τ 2 = · · · = τ g = 0
and is called the Wilk’s Lambda.
Example
anova2<-aov(x2~as.factor(trt))
summary(anova2)
## MANOVA
m<-manova(x[,2:3]~as.factor(trt))
summary(m) #Pillai's trace default
## MANOVA
summary(m,test='Wilks')
## Call:
## manova(resp ~ Fact1 + Fact2 + Fact1 * Fact2)
##
## Terms:
## Fact1 Fact2 Fact1:Fact2 Residuals
## resp 1 1.7405 0.7605 0.0005 1.7640
## resp 2 1.3005 0.6125 0.5445 2.6280
## resp 3 0.4205 4.9005 3.9605 64.9240
## Deg. of Freedom 1 1 1 16
##
## Residual standard errors: 0.3320392 0.4052777 2.014386
## Estimated effects may be unbalanced
summary(m1)
H0 : Σ1 = Σ2 = · · · = Σg = Σ
H1 : Σi ≠ Σj for some 1 ≤ i ≠ j ≤ g
Spooled = (1 / Σ_{ℓ=1}^{g} (nℓ − 1)) [(n1 − 1)S1 + (n2 − 1)S2 + · · · + (ng − 1)Sg]
v = (1/2) p(p + 1)(g − 1)
u = [ Σ_{ℓ=1}^{g} 1/(nℓ − 1) − 1/(Σ_{ℓ=1}^{g} (nℓ − 1)) ] (2p² + 3p − 1)/(6(p + 1)(g − 1))
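This equality-of-covariance hypothesis can be checked in R with Box's M test; a sketch assuming the heplots package is installed (the biotools package also provides a boxM() function):

library(heplots)                      # assumed available; provides boxM()
boxM(iris[, 1:4], iris$Species)       # Box's M test of equal covariance matrices across species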
PCA
Consider a data set with p response variables. The
summary information of this data will consist of the
following pieces:
◦ p sample means
◦ p sample standard deviations; equivalently, p sample
variances (i.e., the diagonal of the covariance matrix)
◦ p(p − 1)/2 sample covariances (i.e., the off-diagonal elements
of the covariance matrix); equivalently, the correlations
could be computed from this covariance matrix.
Principal Component are used for several tasks:
◦ Data Reduction: When the data can be reduced from
p to a fewer variables, the manipulation of the data
becomes easier. This is true when the data have to be
used as input to other procedures such as regression
model.
◦ Display of data: Of course, p > 3 variables cannot be
plotted at once. Usually pairs of variables are plotted
against each other. At best, fancy graphical systems
allow rotating graphs of three variables.
◦ Interpretation: It is often hoped that the original p
variables can be replaced by, for example, a smaller set of k
new variables that have an intuitive interpretation. When
the interpretation is of greater scientific interest than the
original variables, insight might be gained.
◦ Inference: The above tasks refer to manipulating a sample
of n objects. Often, one wants to make statements
about the population. Then, inferential tools have to be
invoked. Of course, data reduction will always lead to
information reduction or information loss.
In summary, one wants to reduce p original variables to k
new variables, where k is a compromise between:
◦ minimum number of variables
◦ maximum amount of information retained
PCA
Do not
◦ overestimate the importance of principal components
analysis.
◦ over-interpret principal components analysis.
They can be useful:
◦ in exploratory phase of a data analysis (to learn about
the structure of the data)
◦ as input to other statistical procedures.
Population principal component
Concepts can be developed for both the population level (ie
based on the theoretical distribution) or for the sample level
(ie based on the simple random sample to be studied).
In virtually all cases, the two theories are dual to each
other, although the assumptions involved and the
generality of the conclusion can be different.
Consider a random vector X = (X1, . . . , Xp) with
covariance matrix Σ and correlation matrix ρ.
We want to apply a linear transformation of X in order to
obtain Y = (Y1, . . . , Yp):
Y = LX or Yj = ℓj′X (j = 1, 2, . . . , p)
Population PCA
We require that:
◦ the Yj are uncorrelated.
◦ the Yj have maximal variance.
◦ the coefficient vectors ℓj have unit length.
The last requirement is formalized as follows:
||ℓj||² = ℓj′ℓj = Σ_{k=1}^{p} ℓjk² = 1
These requirements can be translated into an iterative
procedure (a numerical sketch follows below):
Y1 = ℓ1′X with maximal variance, subject to ℓ1′ℓ1 = 1
Y2 = ℓ2′X with maximal variance, subject to
ℓ2′ℓ2 = 1 and cov(Y1, Y2) = 0
. . .
Yj = ℓj′X with maximal variance, subject to
ℓj′ℓj = 1 and cov(Y1, Yj) = · · · = cov(Yj−1, Yj) = 0
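This construction amounts to an eigendecomposition of the covariance (or correlation) matrix; a small numerical sketch, using the sample covariance of the iris measurements as a stand-in for Σ, showing that the ℓj are the eigenvectors and the component variances are the eigenvalues:

S <- cov(iris[, -5])       # sample covariance matrix
e <- eigen(S)              # eigendecomposition of S
e$values                   # Var(Y1) >= Var(Y2) >= ... : the component variances
e$vectors                  # columns are the coefficient vectors l_j
colSums(e$vectors^2)       # each equals 1: the unit-length constraint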
Mathematical argument
Covariance or Correlation?
◦ The original variables could have different units. Then
we are comparing apples and oranges.
◦ Original variables could have widely varying standard
deviation.
In both cases, the analysis is driven by the variable with
large standard deviation.
kilometer or millimeter (×106 ) ⇒ principal components will
be pulled towards the variable in millimeters.
A solution is provided by using the correlation matrix
instead. This means that all variables are replaced by
their standardized versions.
Each variance then equals 1 and the total variability is equal to p.
Example
##
## Attaching package: ’psych’
## The following object is masked from ’package:DAAG’:
##
## cities
## The following objects are masked from ’package:ggplot2’:
##
## %+%, alpha
## The following object is masked from ’package:car’:
##
## logit
attach(iris)
sapply(iris[,-5],mean)*10
sapply(iris[,-5],sd)*10
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 20.5626888 4.92616228 2.79659615 1.543861813
## Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
## Cumulative Proportion 0.9246187 0.97768521 0.99478782 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 0.657 0.582 0.315
## Sepal.Width 0.730 -0.598 -0.320
## Petal.Length 0.857 -0.173 -0.480
## Petal.Width 0.358 -0.546 0.754
(eigval<-(pcacov$sdev)^2)
[Figures: scree plots (bar and line versions) of the component variances of pcacov against component number]
R Demonstration
loadings(pcacov)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.361 0.657 0.582 0.315
## Sepal.Width 0.730 -0.598 -0.320
## Petal.Length 0.857 -0.173 -0.480
## Petal.Width 0.358 -0.546 0.754
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
sum(eigval)   # total variance = sum of the component variances
## [1] 457.2957
sum((pcacov$loadings[,1])^2)
## [1] 1
Observations
pca<-princomp(iris[,-5]*10,scores=T,cor=F)
head(pca$scores)
[Figure: pairs plot of the principal component scores (Comp.1-Comp.4) with observation labels and the original variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width overlaid]
R Demonstration
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497
## Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
## Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000
#pairs(pca_cor$scores)
#correlation between original variables and principal components
cor(iris[,-5], pca_cor$scores)
## [1] 4
R Demonstration: PCA based on correlation
par(mfrow=c(1,2))
screeplot(pca_cor,type="lines",main="Scree Plot")
biplot(pca_cor,choices=1:2,main="Plot of PC1 and PC2")
[Figures: scree plot of pca_cor and biplot of the first two principal components (PC1 and PC2)]
Exploration of the Principal Components
One major observation from the graphs is that there is a "gap"
between the observations: they give a bimodal
impression.
This is explained by the fact that the principal components
were calculated on all 150 observations, ignoring the
SPECIES variable.
When no formal inference is needed, but the principal
components are only applied to summarize and simplify
the data, the procedure is valid.
When formal inference is required, such a MARGINAL
PCA procedure would be very much unacceptable.
SOLUTIONS: Calculate PCs for each species separately
or calculate PCs on transformed data (PCA by groups),
that are corrected for the effect of SPECIES (“partialled
out" PCA).
Exercise
(1) In the analysis above (based on correlation matrix)
(a) plot the third and fourth principal components. Take a
look at the scales on the axes of this and the plot
based on covariance matrix. What do you observe?
(b) Plot Sepal.Length versus the first principal
component. Is the orientation expected?
(2) Perform principal components analysis for each of the
species separately using
(a) covariance matrix
(b) correlation matrix