Math 1060 - Lecture 7
The χ2 (“chi square”) distribution is the distribution of the random variable
$$\chi^2 = \sum_{i=1}^{k} z_i^2$$
where the $z_i$'s are the z-scores of independent, normally distributed random variables. As it deals only with standardized scores, the distribution has only one parameter, the degree of freedom k (or df), which
represents the number of z‐scores we are squaring and adding together.
The distribution has a number of properties tied to the degree of freedom
Median $\approx k\left(1 - \dfrac{2}{9k}\right)^3$
Variance $2k$
We will look at the first two
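As a quick check, these properties can be verified by simulation in R; this is a small sketch, with rchisq() drawing random values from a chi-square distribution:
set.seed(1)                     # for reproducibility
k = 5
x = rchisq(100000, df = k)      # 100,000 simulated chi-square values with df = 5
mean(x)                         # close to k (the mean of the distribution)
var(x)                          # close to 2k
median(x)                       # close to k*(1 - 2/(9*k))^3
k*(1 - 2/(9*k))^3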
1) Confidence interval of a Variance/Standard Deviation
Just as with the mean, it is useful to be able to create a confidence interval for the variance or, more typically, a standard deviation. We can show that the quantity $\dfrac{(n-1)s^2}{\sigma^2}$ follows a $\chi^2$ distribution with degree of freedom df = n − 1.
So to get a $(1-\alpha)$ confidence interval for $\sigma^2$ we set $\dfrac{(n-1)s^2}{\sigma^2} = \chi^2$ and solve for the range
$$\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \;\le\; \sigma^2 \;\le\; \frac{(n-1)s^2}{\chi^2_{\alpha/2}}$$
where $\chi^2_{p}$ denotes the quantile of the $\chi^2_{n-1}$ distribution with lower-tail area $p$.
We can get these $\chi^2$ values using R and the command qchisq(c(alpha/2, 1 - alpha/2), df).
Example : The mercury concentration in tap water was measured in randomly sampled specimens from
32 different locations. It was found that the mean concentration was 26.2 ng/L and the standard
deviation 5.15 ng/L. Construct a 90% confidence interval for the standard deviation for the population
sampled.
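A minimal R sketch of this calculation, using qchisq() as above (n = 32, s = 5.15, α = 0.10):
n = 32
s = 5.15
alpha = 0.10
chi_vals = qchisq(c(alpha/2, 1 - alpha/2), df = n - 1)   # lower and upper chi-square critical values
var_CI = (n - 1)*s^2 / rev(chi_vals)                     # divide by the larger value first to get the lower limit
sqrt(var_CI)                                             # 90% CI for sigma, roughly (4.3, 6.5) ng/L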
Goodness‐of‐Fit Test
A goodness‐of‐fit test compares frequency distributions. It uses expected frequencies and compares
them to observed frequencies. Then we do a hypothesis test based on the test statistic
$$\chi^2_{test} = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i}$$
where
• Oi: Observed frequency
• Ei: Expected frequency
• k: Number of different categories
Note that we require $E_i \ge 5$ for all $i$ for this test statistic to actually follow a $\chi^2$ distribution.
Example: Say we are inspecting a casino and are given the job of testing whether the dice are fair.
We roll a single die a total of 600 times and record the results:
Face        1    2    3    4    5    6   Total
Observed   92  108   84   97  118  101     600
Hypothesis test
$H_0$: $O_i = E_i$ for all $i$,
$H_A$: $O_i \ne E_i$ for some $i$.
Degrees of freedom =k−1
Goodness‐of‐fit tests are always upper‐tailed tests. This is because χ2 does not discriminate
between Oi>Ei and Oi<Ei. Note that χ2=0 if Oi=Ei for all i.
Use chisq.test() or pchisq() to get your p‐values
With $E_i = 600/6 = 100$ for every face, the contributions to the test statistic are:
Face                    1     2     3     4     5     6   Total
$(O_i - E_i)^2/E_i$   0.64  0.64  2.56  0.09  3.24  0.01    7.18
x = seq(0, 20, 0.1)
plot(x, dchisq(x, 5), type = "l", xlab = "Chi Square", ylab = "Prob Density")
abline(v = 7.18)
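The same test can be run directly in R; a short sketch for the dice data above (the fair-die null puts probability 1/6 on each face):
observed = c(92, 108, 84, 97, 118, 101)
chisq.test(observed, p = rep(1/6, 6))            # goodness-of-fit test against equal probabilities
pchisq(7.18, df = 5, lower.tail = FALSE)         # same p-value from the statistic computed above
Either way the p-value is well above 0.05, so there is no evidence that the die is unfair.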
Chi‐Square Test of Independence
At times we are interested in researching if two categorical variables are related or associated (i.e.
dependent). This can be done by Chi‐square Test of Independence:
$H_0$: In the population, the two categorical variables are independent.
$H_A$: In the population, the two categorical variables are dependent.
Once we have gathered our data, we summarize it in a two-way contingency table. This table
represents the observed counts and is called the Observed Table. The contingency table given here
contains the observed counts of the party affiliation and opinion for those surveyed.
Observed Table (source: https://onlinecourses.science.psu.edu/stat500/node/56)
favor indifferent opposed total
democrat 138 83 64 285
republican 64 67 84 215
total 202 150 148 500
Finding Expected Counts from Observed Counts: $E = \dfrac{(\text{row total})\times(\text{column total})}{\text{grand total}}$
Expected table:
Now calculate the test statistic: $\chi^2_{test} = \sum \dfrac{(O - E)^2}{E}$, where the sum runs over every cell of the table.
The degree of freedom is $(r-1)(c-1)$, where $r$ and $c$ are the numbers of rows and columns. The test is right sided and the critical value can be found by:
qchisq(alpha, df, lower.tail = FALSE)
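A sketch of this test in R for the party-affiliation table above; chisq.test() applied to a matrix carries out the test of independence, and its $expected component gives the expected counts from the formula above:
observed = matrix(c(138, 83, 64,
                     64, 67, 84),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("democrat", "republican"),
                                  c("favor", "indifferent", "opposed")))
res = chisq.test(observed)
res$expected    # expected counts E = (row total)(column total)/(grand total)
res             # chi-square statistic, df = (2-1)(3-1) = 2, and p-value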
Example: In the built‐in data set survey, the Smoke column records the students’ smoking habit,
while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul"
(regularly), "Occas" (occasionally) and "Never". As for Exer, they are "Freq" (frequently), "Some" and
"None".
We can tally the students’ smoking habit against the exercise level with the table function in R. The
result is called the contingency table of the two variables.
library(MASS) # load the MASS package
tbl = table(survey$Smoke, survey$Exer)
tbl # the contingency table
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7
In order to test the hypothesis that the students' smoking habit is independent of their exercise level at the 0.05 significance level:
chisq.test(tbl)
Pearson's Chi‐squared test
data: tbl
X‐squared = 5.4885, df = 6, p‐value = 0.4828
Decision: As the p‐value 0.4828 is greater than the .05 significance level, we do not reject the null
hypothesis that the smoking habit is independent of the exercise level of the students.
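It is also worth checking the requirement that all expected counts are at least 5. chisq.test() stores them; in this particular table some expected counts are small, so R may warn that the chi-square approximation could be inaccurate:
res = chisq.test(tbl)
res$expected    # expected counts under independence; ideally all at least 5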
Inferential Statistics continued: Two Sample Hypothesis Testing
Hypothesis Test for Two Sample means – Independent sample
Often studies are done with no known population mean to compare against; instead we have two sample data sets whose means we want to compare.
Example: Sulfide materials from piles of waste rock are known to cause leaching of metals into nearby
aquatic environments. A mining site (#1) has tested out a new method of waste rock disposal and
compares the impact to a similar nearby mining site. Samples of size 50 were taken from both
environments, and the results were as follows (all in ppm):
Operation #1: mean of 0.2671 and standard deviation 0.0221
Operation #2: mean of 0.2818 and standard deviation 0.0278
Is the leaching from Operation #1 less than that from Operation #2?
Notation
$\mu_1, \sigma_1$: population mean and standard deviation for population 1; $\bar{x}_1, s_1, n_1$: sample mean, sample standard deviation, and sample size for sample 1 (and similarly $\mu_2, \sigma_2, \bar{x}_2, s_2, n_2$ for the second population and sample).
Requirements
1. The two samples are independent.
2. Both samples are simple random samples.
3. Both samples are large, or they come from normally distributed populations.
Test statistics for hypothesis testing two INDEPENDENT samples
a) $$t_{test} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
with degree of freedom $df = \min(n_1 - 1,\, n_2 - 1)$ or, more technically,
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$
b) If we think $\sigma_1 = \sigma_2$ then we can use
$$t_{test} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} \quad\text{with}\quad df = n_1 + n_2 - 2 \quad\text{and}\quad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Confidence intervals
$$(\mu_1 - \mu_2) \in (\bar{x}_1 - \bar{x}_2) \pm E \quad\text{where}\quad E = z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} \;\text{ or }\; E = t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} \;\text{ as appropriate}$$
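Because the mining example gives only summary statistics (t.test() needs the raw data), here is a minimal sketch that computes the test statistic by hand, using the conservative df = min(n1 − 1, n2 − 1) rule from above:
x1 = 0.2671; s1 = 0.0221; n1 = 50                 # Operation #1
x2 = 0.2818; s2 = 0.0278; n2 = 50                 # Operation #2
t_test = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2)      # H0: mu1 = mu2 vs HA: mu1 < mu2
df = min(n1 - 1, n2 - 1)
t_test                                            # about -2.9
pt(t_test, df)                                    # left-tailed p-value, well below 0.05
So the data support the claim that the leaching from Operation #1 is lower than that from Operation #2.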
Inferences about 2 means: dependent samples (matched pairs)
Case study
You have just upgraded your company’s e‐commerce site. The following is a list of conversion rates
before and after the upgrade. Can we say that the conversion rates increased? The conversion rate is the rate at which a customer buys something on their visit.
User ID 1 2 3 4 5 6
Difference
(After‐Before)
Requirements
1) The sample data consist of matched pairs.
2) Both samples are simple random samples.
3) Both samples are large, or they come from normally distributed populations.
This is really just a single-sample test of the mean of the differences. If we define $\bar{x}_d$ and $s_d$ as the mean and standard deviation of the $n$ paired differences, then
$$t_{test} = \frac{\bar{x}_d - \mu_d}{s_d/\sqrt{n}}$$
with degree of freedom $df = n - 1$.
Example: Test the claim in the above example at α = 0.01.
User ID 1 2 3 4 5 6
Listed below are the costs in US dollars of flights from New York to San Francisco for different airlines.
Use a 0.01 significance level to test the claim that flights booked one day in advance cost more than
ones booked 30 days in advance.
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
Airlines               A    B    C     D    E    F    G
1 day in advance     356  614  628  1088  943  367  536
30 days in advance   344  230  264   464  278  318  280
Manual solution:
Mean(d) = 336.2857
Sd(d) = 254.3021
$$t = \frac{336.2857 - 0}{254.3021/\sqrt{7}} = 3.498706$$
$$t_{critical} = qt(0.99, 6) = 3.142668$$
Decision: $t > t_{critical}$, thus we reject the null, which means the evidence supports the claim that the 1-day booking price mean is higher than the 30-day booking price mean.
R solution:
A = data.frame(oneday   = c(356, 614, 628, 1088, 943, 367, 536),
               onemonth = c(344, 230, 264, 464, 278, 318, 280))
t.test(A$oneday, A$onemonth, conf.level = 0.99, alternative = "greater", paired = TRUE)
Paired t‐test
data: A$oneday and A$onemonth
t = 3.4987, df = 6, p‐value = 0.006423
alternative hypothesis: true difference in means is greater than 0
99 percent confidence interval:
34.22132 Inf
sample estimates:
mean of the differences
336.2857
ANalysis Of VAriance (ANOVA)
The method we use to compare the means of two or more groups is called one‐way Analysis of
Variance. “Analysis of Variance” actually refers to a general approach to analysis (many different tests
require that you do an ANOVA decomposition), but most often, when people refer to ANOVA, they
mean the one‐way ANOVA we will discuss here.
Let us consider two possible datasets that might result from the experiment described (applying 4
possible fertilizer treatments with 3 replicates of each).
Dataset 1:
Dataset 2:
The two example datasets have the same group means, but the variability of the data is different for the
two datasets. Which of the two datasets do you think is more likely to allow us to conclude that at least
one fertilizer results in a significantly different dried plant weight after our specified growing period?
Why?
The key element of one‐way ANOVA is to compare the size of the within groups variation to the size of
the between groups variation.
Within groups variation describes the variability in the results for each treatment level, within that
group.
Between groups variation describes the group‐to‐group variability. That is, how different the results
are for different treatment levels.
If there is a lot of variation between groups, compared to the amount of variation within groups, then
we conclude that at least one group has a significantly different mean, in other words the treatment
made a difference. If the amount of between group variation is similar to the within group variation,
then we conclude that there is no significant difference between the means of the groups. This is why it
is called analysis of variance, even though we are doing a test for the mean!
Notation
Data is in k groups or k factor levels. In the above example k = 4.
$Y_{ij}$ = jth measured value of the variable for factor level i, where $i = 1,\dots,k$ and $j = 1,\dots,n_i$. E.g.: $Y_{23} = 38$ for dataset 1.
$n_i$ represents the number of repeat measurements (or sample size) for factor level i. E.g.: $n_1 = n_2 = n_3 = n_4 = 3$.
Sample mean for each group: $\bar{Y}_i = \dfrac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$. E.g.: $\bar{Y}_1 = \dfrac{30 + 32 + 28}{3} = 30$, $\bar{Y}_2 = \dfrac{33 + 34 + 38}{3} = 35$, and so on.
The hypotheses are $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ versus $H_A$: at least one mean is different, where $\mu_i$ is the population mean for the ith group.
This is tested using a new test statistic called $F_{test}$, which follows a new distribution, the F-distribution (or Fisher distribution, named after statistician Ronald Fisher). To calculate $F_{test}$ we will need the quantities developed below (the grand mean, sums of squares, and mean squares).
The requirements for ANOVA tests are very similar to those for the two‐sample t‐test (for independent
samples with equal variance).
1. Independence of values within groups and between groups
Observations within groups must be independent, and the groups must also be independent. So
ANOVA cannot be used with paired data.
2. Populations we are sampling from must be Normal (Normality)
As in the t‐test, it is not critical that this assumption be met exactly, but it is still important to test.
Long tails and skewness are only a problem if samples are of different sizes, or if sample sizes are
small. If different samples are skewed in opposite directions, or if there are severe outliers, then
there might be problems with this assumption being violated. To check for normality, we can
construct side‐by‐side boxplots (i.e. boxplots on the same axes), normal Q‐Q plots for each sample,
or (recommended) normal Q‐Q plots for residuals
3. Equal Standard Deviations / Equal Variances / Homoscedasticity
This is an important assumption for ANOVA. If it is violated, then it may be necessary to transform
the data or consider some other type of test. To assess whether standard deviations are equal, we
use a plot of residuals, $e_{i,j}$, against fitted values, and look to see if variability is different for
different fitted values. Remember that the fitted value for one‐way ANOVA is the group mean, so
the residuals for each group will be plotted together.
A “residual” is the difference between an observed data value and the best point estimate of a data value in the same group. For ANOVA, the best point estimate is the group mean $\bar{Y}_i$. So the residual is $e_{ij} = Y_{ij} - \bar{Y}_i$.
4. Samples are simple random samples
5. Samples are categorized in only one way.
For our example, the samples are only categorized by fertilizer.
F‐distribution
The F test statistic is always larger when groups are more different, so, like the $\chi^2$ distribution, we
always think of p‐values and critical values in the right tail of the distribution, even though we are
essentially doing a two‐tailed type of test. Note: The F‐statistic always has two associated degrees of
freedom (numerator and denominator). Any time you perform an F‐test, you should report both
degrees of freedom!
To find p-values we will use R and the command pf(Ftest, num df, denom df, lower.tail = FALSE), which gives the upper-tail area.
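For example, for the Dataset 2 results shown later (F = 1.63 with 3 and 8 degrees of freedom):
pf(1.63, df1 = 3, df2 = 8, lower.tail = FALSE)    # upper-tail p-value, about 0.26 (matches the aov output below)
qf(0.95, df1 = 3, df2 = 8)                        # critical value for alpha = 0.05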
The grand mean of all the data is
$$\bar{\bar{Y}} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij} \qquad\text{where } n = \sum_{i=1}^{k} n_i \text{ is the total number of data values in all } k \text{ groups.}$$
This is basically computing a sample mean for all the recorded values as if they were from one sample
(which they would be if the null hypothesis of equal means were true).
For Data Set 1 let’s find all the means and the grand mean. Note that the grand mean is not necessarily
the same as the mean of group means.
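A small sketch with made-up numbers (unequal group sizes make the point clearest) showing that the grand mean is the sample-size-weighted average of the group means, not simply the mean of the group means:
g1 = c(10, 12, 14)             # hypothetical group 1: n1 = 3, mean 12
g2 = c(20, 22)                 # hypothetical group 2: n2 = 2, mean 21
mean(c(g1, g2))                # grand mean = 78/5 = 15.6
mean(c(mean(g1), mean(g2)))    # mean of the group means = 16.5 -- not the same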
Sums of Squares
If we consider our entire dataset as one sample the variance can be written as:
$$s^2 = \frac{\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{\bar{Y}}\right)^2}{n - 1}$$
Total Sum of Squares
The numerator of this expression is called the total sum of squares, as it is the sum of all the squared
differences of the values from the grand mean. It is a measure of the total variability in the entire
dataset. It is written as $SS_{total}$:
$$SS_{total} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{\bar{Y}}\right)^2$$
Visually
For Dataset 1:
$SS_{total}$ can be separated into two different sums of squares: one that measures the between group
variation (how different are the group means from the grand mean) and one that measures natural
within group variation (how different are the individual measurements within each group from the
group mean).
Between Groups Variability
This is assessed based on the difference between each of the group means and the grand mean. It is
weighted by the number of replicates in a group, so that a group with more replicates has more
influence on the size of this sum of squares than a group with fewer replicates. It is described as the
“sum of squares treatment” or the “sum of squares between groups”, and so is denoted as $SS_{groups}$:
$$SS_{groups} = \sum_{i=1}^{k} n_i\left(\bar{Y}_i - \bar{\bar{Y}}\right)^2$$
Visually:
For Dataset 1:
Within Group Variability
This is a measure of the variation we would expect to see for a fixed treatment (how variable are results
when the same treatment is applied to different experimental units). It is sometimes described as the
“unexplained” variation, because it is the variability in the data that is not explained by the changing
factor between groups (treatment). It has many names: “within‐groups sum of squares”, “sum of
squares error (SSE)”, and “sum of squares residual”. It is denoted as $SS_{within}$. It is computed as:
$$SS_{within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2$$
Visually:
For Dataset 1:
Combined relationships
If we combine the between‐groups sum of squares and the within‐groups sum of squares, we obtain the
total sum of squares. The same applies to the corresponding degrees of freedom.
$$SS_{total} = SS_{groups} + SS_{within}$$
$$df_{total} = df_{groups} + df_{within}$$
Note that:
$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{\bar{Y}}\right)^2 = \sum_{i=1}^{k} n_i\left(\bar{Y}_i - \bar{\bar{Y}}\right)^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(Y_{ij} - \bar{Y}_i\right)^2$$
and
$$n - 1 = (k - 1) + (n - k)$$
We would like to compare the between group variation (SSgroups) to the within group variation (SSwithin).
However, these values must be “standardized” by dividing by the associated number of degrees of
freedom first to obtain the mean squares. You can think of this as just a way to make sure that the
sums are averaged out appropriately so that they’re on the same scale. The calculations are as follows:
$$MS_{groups} = \frac{SS_{groups}}{df_{groups}} = \frac{SS_{groups}}{k - 1}$$
$$MS_{within} = \frac{SS_{within}}{df_{within}} = \frac{SS_{within}}{n - k}$$
Finally we have
$$F_{test} = \frac{MS_{groups}}{MS_{within}} \;\sim\; F_{k-1,\;n-k}$$
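As a sketch of these formulas in action, here is the calculation using the Dataset 2 sums of squares that appear in the aov() output on the next page (SSgroups ≈ 188.2, SSwithin = 308.0, k = 4 groups, n = 12 observations):
SSgroups = 188.2; SSwithin = 308.0
k = 4; n = 12
MSgroups = SSgroups/(k - 1)                     # mean square between groups
MSwithin = SSwithin/(n - k)                     # mean square within groups
Ftest = MSgroups/MSwithin                       # about 1.63
pf(Ftest, k - 1, n - k, lower.tail = FALSE)     # about 0.26, matching Pr(>F) below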
Generally, the information about this test is presented in an ANOVA table, laid out along these lines:
Source     df        SS              MS                          F
Groups     k − 1     $SS_{groups}$   $SS_{groups}/(k-1)$         $MS_{groups}/MS_{within}$
Within     n − k     $SS_{within}$   $SS_{within}/(n-k)$
Total      n − 1     $SS_{total}$
The ANOVA test requires a constant variance for each group. $MS_{within} = \hat{\sigma}^2$ is our estimate of this constant variance, and the value used to calculate the standard deviation and to create confidence intervals for the differences between groups. If we perform an ANOVA test when there are only two groups, then $\hat{\sigma}^2 = MS_{within}$ is the same as the pooled variance $s_p^2$ we discussed in the two-sample t-test.
To perform an ANOVA in R, use code like:
modl=lm(Resp~Pred,data=MyData)
resaov=aov(modl)
where Resp is the response variable, and Pred is the categorical predictor variable.
Data Set 2:
Using R to do ANOVA on Dataset 2
> modl=lm(driedweight~fertilizer,data=dataset2)
> resaov=aov(modl)
> summary(resaov)
Df Sum Sq Mean Sq F value Pr(>F)
fertilizer 3 188.2 62.75 1.63 0.258
Residuals 8 308.0 38.50
Comparing Differences Between Groups
1. Tukey‐Kramer / Tukey / Tukey’s HSD (Honestly Significant Difference)
This is based on something called the studentized range distribution rather than the t‐distribution.
It is relatively straight‐forward to compute Tukey’s HSD confidence intervals once you have created
your ANOVA model. To obtain confidence intervals and p‐values for each pair of comparisons, run:
TukeyHSD(resaov) where resaov is the result of running aov() on the model.
The output will look something like this for the Dataset 1 data:
TukeyHSD(resaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = modl)
$fertilizer
    diff        lwr        upr     p adj
B-A    5  -1.404704  11.404704 0.1344163
C-A    7   0.595296  13.404704 0.0329735
D-A   -3  -9.404704   3.404704 0.4800522
C-B    2  -4.404704   8.404704 0.7538411
D-B   -8 -14.404704  -1.595296 0.0166535
D-C  -10 -16.404704  -3.595296 0.0046245

Note: summary(resaov) for Data Set 1 gives
            Df Sum Sq Mean Sq F value
fertilizer   3  188.2   62.75   10.46
Residuals    8   48.0    6.00
Why are we not doing Data Set 2?
2. Others
There are several other methods, including Bonferroni Correction, Scheffé, Least Significant
Difference, Newman Keuls, Duncan’s multiple range, and Dunnett’s intervals. We will not explore
these in this course.
Example: The pH values of waste water from factories A, B, and C are shown in the table. Are there any
differences among the mean pH values from the three factories?
A B C
7 6 8
8 6.1 7.5
6 5.5 6
6.5 4 7.5
The relationship we are interested in is ph~site.
To see a plot which has all the data on it type
stripchart(phData$ph~phData$site,vertical=TRUE, ylab="pH")
You should get a plot that looks like the one to the right. This should
give you an idea if the variance is the same for all data but we will
also plot the residuals to see this.
Next we want to see if the data is normal. There is quite a bit of
theory behind this and there are even hypothesis tests for normality
but we will simply do a normal Q-Q plot. This plot arranges the data points in such a way that, if the data came from a normal distribution, they tend to fall in a straight line. This works best with more data, and it is not unusual for the tails to not fall on the line. To see
the effect try
qqnorm(rnorm(100,10,2))
Probably many things will produce a roughly straight line but one
thing is certain.
Try making a plot for all the site data together:
qqnorm(phData$ph)
It should look like the one below. Notice it is not a great straight line, suggesting that perhaps we should not be doing ANOVA with this data.
If we have fitted the data we can also put a theoretical line on the plot. Let's go through the steps and actually plot the residuals rather than just the raw data.
Create a fitted model of the relationship, indicating the variables, and then do the ANOVA and check the F and p-value.
modl=lm(phData$ph~phData$site)
resaov=aov(modl)
summary(resaov)
Df Sum Sq Mean Sq F value Pr(>F)
phData$site 2 7.652 3.826 4.744 0.0392 *
Residuals 9 7.258 0.806
‐‐‐
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A “residual” is the difference between an observed data value and the best point estimate of a data value in the same group. For ANOVA, the best point estimate is the group mean $\bar{Y}_i$, so the residual is $e_{ij} = Y_{ij} - \bar{Y}_i$. We extract the residuals from the fitted model and make a normal Q-Q plot of them:
resids=resid(modl)
qqnorm(resids)
qqline(resids)
summary(resids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐1.4000 ‐0.5000 0.1875 0.0000 0.6250 1.1250
To assess whether standard deviations are equal, we use a plot of residuals against fitted
values, and look to see if variability is different for different fitted values. The fitted value for
one‐way ANOVA is the group mean, so the residuals for each group will be plotted together. To
create a plot of residuals against fitted values, extract the residuals (as above), extract the fitted
values, and then create a scatterplot. The scatterplot should hopefully show residuals (on the y‐
axis) approximately evenly scattered for each group, and each group should be approximately
centred on zero.
fits=fitted(modl)
plot(resids~fits)
Now, since our data luckily has a significant p-value, we can run TukeyHSD() to see which means are statistically different.
> TukeyHSD(resaov)
Tukey multiple comparisons of means
95% family‐wise confidence level
Fit: aov(formula = modl)
$`phData$site`
diff lwr upr p adj
#2‐#1 ‐1.475 ‐3.24785639 0.2978564 0.1033693
#3‐#1 0.375 ‐1.39785639 2.1478564 0.8285008
#3‐#2 1.850 0.07714361 3.6228564 0.0413676
Looking at this, it appears that the means of sites #2 and #3 are statistically different from each other.
So while there are a number of steps involved, the work to do an ANOVA is easy enough with R.
On the topic of Data Entry
Somewhat out of the flow of this lab but relevant. If you want to enter data and it is not too
much data you can use the following code:
dataname=edit(data.frame())
This opens up a window where you can set the name of each column and whether it is numeric
or character. You can also enter data at this time. Weirdly there are no “done” buttons. You just
click the window close “x” at the top corner.
Now more importantly say you realize you want to edit the data again to make some changes:
dataname=edit(data.frame(dataname))
This opens up the data and lets you make changes and then saves it back to the same name.
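Alternatively, small data sets can be typed straight into a data frame. For example, the pH data used above (factories A, B, C labelled #1, #2, #3 to match the TukeyHSD output) could be entered as:
phData = data.frame(ph   = c(7, 8, 6, 6.5,  6, 6.1, 5.5, 4,  8, 7.5, 6, 7.5),
                    site = rep(c("#1", "#2", "#3"), each = 4))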
Lastly as a quick note, you can do boxplots using the response~predictor format.
boxplot(phData$ph~phData$site)