ST102 Elementary Statistical Theory
Dr James Abdey
Department of Statistics
London School of Economics and Political Science
LT 2014

2. Point estimation
3. Interval estimation and sampling distributions
4. Hypothesis testing
5. Other statistical tests
6. Linear regression
7. Analysis of variance (ANOVA) (time permitting)

7.1: Testing for 3 means: an introductory example
Goal: To test the hypothesis that k population means are the same.
The data below are 6 examination marks from each of 3 classes. Can we infer from these data that there is no significant difference in the examination marks among all 3 classes?

Class 1:  85  75  82  76  71  85
Class 2:  71  75  73  74  69  82
Class 3:  59  64  62  69  75  67

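Transposing, we get the table below. The data form a 6 × 3 array. Denote the data point at the (i, j)-th position as $X_{ij}$. We compute the column means first, leading to $\bar X_{\cdot 1} = 79$, $\bar X_{\cdot 2} = 74$, $\bar X_{\cdot 3} = 66$.
If $H_0$ (the three population means are equal) is true, the three sample means $\bar X_{\cdot 1}$, $\bar X_{\cdot 2}$, $\bar X_{\cdot 3}$ should be very close to each other, i.e. all of them should be close to the overall mean $\bar X = (\bar X_{\cdot 1} + \bar X_{\cdot 2} + \bar X_{\cdot 3})/3 = 73$.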

Observation   Class 1   Class 2   Class 3
1             85        71        59
2             75        75        64
3             82        73        62
4             76        74        69
5             71        69        75
6             85        82        67
Mean          79        74        66

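One way to measure how far the sample means are from the overall mean is $\sum_{j=1}^{3} (\bar X_{\cdot j} - \bar X)^2$. However, it is not scale-invariant, so a standardised test statistic $T$ is used instead. (Note $T = 0$ if $\bar X_{\cdot 1} = \bar X_{\cdot 2} = \bar X_{\cdot 3}$.) What is the distribution of $T$ under $H_0$?

One-way analysis of variance
A general setting: k samples available from k Normal distributions $N(\mu_j, \sigma^2)$, j = 1, ..., k. Denote by $X_{1j}, X_{2j}, \ldots, X_{n_j j}$ the sample with the sample size $n_j$ from $N(\mu_j, \sigma^2)$, j = 1, ..., k.

The j-th sample mean:
$$\bar X_{\cdot j} = \frac{1}{n_j} \sum_{i=1}^{n_j} X_{ij}.$$

The overall sample mean:
$$\bar X = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar X_{\cdot j}, \qquad \text{where } n = \sum_{j=1}^{k} n_j.$$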
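
ANOVA decomposition:
$$\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X)^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2 + \sum_{j=1}^{k} n_j (\bar X_{\cdot j} - \bar X)^2,$$
with degrees of freedom (n − 1) = (n − k) + (k − 1).

Total variation: $\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X)^2$, with n − 1 degrees of freedom.
Between-treatments variation: $B = \sum_{j=1}^{k} n_j (\bar X_{\cdot j} - \bar X)^2$, with k − 1 degrees of freedom.
Within-treatments variation: $W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2$, with $n - k = \sum_{j=1}^{k} (n_j - 1)$ degrees of freedom.

Remarks:
The decomposition follows from the identity $\sum_{i=1}^{m} (a_i - b)^2 = \sum_{i=1}^{m} (a_i - \bar a)^2 + m(\bar a - b)^2$, where $\bar a = \sum_{i=1}^{m} a_i / m$.
Formulae for computations: $n = \sum_{j=1}^{k} n_j$, $\bar X_{\cdot j} = \sum_{i=1}^{n_j} X_{ij}/n_j$, $\bar X = \sum_{j=1}^{k} n_j \bar X_{\cdot j}/n$, and
$$\text{Total variation} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}^2 - n \bar X^2,$$
$$B = \sum_{j=1}^{k} n_j \bar X_{\cdot j}^2 - n \bar X^2,$$
$$\text{Residual (Error) SS} = W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}^2 - \sum_{j=1}^{k} n_j \bar X_{\cdot j}^2 = \sum_{j=1}^{k} (n_j - 1) S_j^2,$$
where $S_j^2$ is the j-th sample variance.

Theorem:
i. $W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2$ and $B = \sum_{j=1}^{k} n_j (\bar X_{\cdot j} - \bar X)^2$ are independent of each other.
ii. Also, $\dfrac{W}{\sigma^2} = \dfrac{1}{\sigma^2} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2 \sim \chi^2_{n-k}$.
iii. When $\mu_1 = \cdots = \mu_k$, $\dfrac{B}{\sigma^2} = \dfrac{1}{\sigma^2} \sum_{j=1}^{k} n_j (\bar X_{\cdot j} - \bar X)^2 \sim \chi^2_{k-1}$.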
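
A short derivation of the ANOVA decomposition may help here. It is a sketch written for a single group j and then summed over groups; the only fact used is that deviations from a group mean sum to zero, $\sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j}) = 0$.

```latex
\begin{align*}
\sum_{i=1}^{n_j} (X_{ij} - \bar X)^2
  &= \sum_{i=1}^{n_j} \bigl\{ (X_{ij} - \bar X_{\cdot j}) + (\bar X_{\cdot j} - \bar X) \bigr\}^2 \\
  &= \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2
     + 2 (\bar X_{\cdot j} - \bar X) \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})
     + n_j (\bar X_{\cdot j} - \bar X)^2 \\
  &= \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2 + n_j (\bar X_{\cdot j} - \bar X)^2 .
\end{align*}
```

Summing both sides over j = 1, ..., k gives Total variation = W + B.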
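
$$F = \frac{B/(k-1)}{W/(n-k)} = \frac{\sum_{j=1}^{k} n_j (\bar X_{\cdot j} - \bar X)^2 / (k-1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2 / (n-k)}$$

Source   DF      SS      MS          F                        P
Factor   k − 1   B       B/(k − 1)   [B/(k − 1)]/[W/(n − k)]  p-value
Error    n − k   W       W/(n − k)
Total    n − 1   B + W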
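
For the examination marks of the 3 classes, k = 3, $n_1 = n_2 = n_3 = 6$ and n = 18, with $\bar X_{\cdot 1} = 79$, $\bar X_{\cdot 2} = 74$, $\bar X_{\cdot 3} = 66$. Hence
$$\bar X = \frac{1}{n} \sum_{j=1}^{3} n_j \bar X_{\cdot j} = \frac{1}{3} \sum_{j=1}^{3} \bar X_{\cdot j} = 73,$$
$$B = \sum_{j=1}^{3} 6 (\bar X_{\cdot j} - \bar X)^2 = 6 \left\{ (79 - 73)^2 + (74 - 73)^2 + (66 - 73)^2 \right\} = 516,$$
$$W = \sum_{j=1}^{3} \sum_{i=1}^{6} (X_{ij} - \bar X_{\cdot j})^2 = \sum_{j=1}^{3} \sum_{i=1}^{6} X_{ij}^2 - 6 \sum_{j=1}^{3} \bar X_{\cdot j}^2 = 430,$$
$$F = \frac{B/(k-1)}{W/(n-k)} = \frac{516/2}{430/15} = 9.$$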

Source   DF   SS    MS      F   P
Factor   2    516   258     9   0.003
Error    15   430   28.67
Total    17   946
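
The ANOVA table above can be reproduced in a few lines of code. The following is a minimal sketch in plain Python (not part of the original Minitab workflow); the function name `one_way_anova` is an illustrative choice, and it computes B, W and F directly from their definitions.

```python
# Examination marks for the three classes (from the table above).
marks = {
    "Class 1": [85, 75, 82, 76, 71, 85],
    "Class 2": [71, 75, 73, 74, 69, 82],
    "Class 3": [59, 64, 62, 69, 75, 67],
}


def one_way_anova(samples):
    """Return (B, W, F, df_between, df_within) for a list of samples."""
    k = len(samples)                              # number of groups
    n = sum(len(s) for s in samples)              # total sample size
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]    # group sample means
    # Between-treatments variation: sum_j n_j (mean_j - grand_mean)^2
    b = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within-treatments variation: sum_j sum_i (x_ij - mean_j)^2
    w = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    f = (b / (k - 1)) / (w / (n - k))
    return b, w, f, k - 1, n - k


B, W, F, df_b, df_w = one_way_anova(list(marks.values()))
print(f"B = {B:.0f}, W = {W:.0f}, F = {F:.2f} on ({df_b}, {df_w}) d.f.")
# prints: B = 516, W = 430, F = 9.00 on (2, 15) d.f.
```

The p-value is not computed here because it needs the F distribution function, which is not in the Python standard library.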
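
An old problem: Compare two Normal means with the same, but unknown, variance.

When k = 2, $n = n_1 + n_2$, and $\bar X = (n_1 \bar X_{\cdot 1} + n_2 \bar X_{\cdot 2})/n$. Hence
$$\bar X_{\cdot 1} - \bar X = n_2 (\bar X_{\cdot 1} - \bar X_{\cdot 2})/n, \qquad \bar X_{\cdot 2} - \bar X = n_1 (\bar X_{\cdot 2} - \bar X_{\cdot 1})/n.$$
Therefore
$$B = \sum_{j=1}^{2} n_j (\bar X_{\cdot j} - \bar X)^2 = \frac{n_1 n_2^2 + n_1^2 n_2}{n^2} (\bar X_{\cdot 1} - \bar X_{\cdot 2})^2 = \frac{n_1 n_2}{n} (\bar X_{\cdot 1} - \bar X_{\cdot 2})^2.$$
Hence
$$F = \frac{B/(2-1)}{W/(n-2)} = \frac{n_1 n_2 (n-2)}{n} \cdot \frac{(\bar X_{\cdot 1} - \bar X_{\cdot 2})^2}{\sum_{j=1}^{2} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2} = \frac{n_1 + n_2 - 2}{1/n_1 + 1/n_2} \cdot \frac{(\bar X_{\cdot 1} - \bar X_{\cdot 2})^2}{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2} = T^2,$$
where $S_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2$, and
$$T = \sqrt{\frac{n_1 + n_2 - 2}{1/n_1 + 1/n_2}} \cdot \frac{\bar X_{\cdot 1} - \bar X_{\cdot 2}}{\sqrt{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}}$$
is the two-sample t statistic based on the pooled variance estimate.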
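
The identity $F = T^2$ for k = 2 can be checked numerically. The sketch below is an illustration only: it reuses the marks of Class 1 and Class 2 from the introductory example as the two samples, which is an assumption made here for convenience rather than something done in the original slides.

```python
import math

x1 = [85, 75, 82, 76, 71, 85]   # Class 1 marks, used as sample 1
x2 = [71, 75, 73, 74, 69, 82]   # Class 2 marks, used as sample 2

n1, n2 = len(x1), len(x2)
n = n1 + n2
m1, m2 = sum(x1) / n1, sum(x2) / n2
ss1 = sum((x - m1) ** 2 for x in x1)    # equals (n1 - 1) * S1^2
ss2 = sum((x - m2) ** 2 for x in x2)    # equals (n2 - 1) * S2^2

# Pooled two-sample t statistic.
T = math.sqrt((n1 + n2 - 2) / (1 / n1 + 1 / n2)) * (m1 - m2) / math.sqrt(ss1 + ss2)

# One-way ANOVA F statistic with k = 2 groups.
grand = (n1 * m1 + n2 * m2) / n
B = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
W = ss1 + ss2
F = (B / (2 - 1)) / (W / (n - 2))

print(f"T^2 = {T ** 2:.4f}, F = {F:.4f}")
```

Both printed values agree, as the algebra above predicts.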

Example: the numbers of "uhs" or "ahs" said by professors in 3 departments (English, Mathematics and Political Science), with 100 observations per department.
(You may copy the file uhah.mtw from the ST102 Moodle site into your document folder, and double-click the file to start a Minitab session.)

MTB > desc c1 c2 c3

Variable            N  N*   Mean  SE Mean  StDev      Minimum     Q1  Median     Q3  Maximum
English           100   0  5.810    0.249  2.493  0.000000000  4.000   5.000  8.000   11.000
Mathematics       100   0  5.300    0.201  2.013  0.000000000  4.000   5.000  7.000    9.000
Political Scienc  100   0  5.330    0.197  1.975  0.000000000  4.000   5.000  7.000    9.000

Source   DF    SS       MS    F     P
Factor   2     16.38    8.19  1.73  0.178
Error    297   1402.50  4.72
Total    299   1418.88

S = 2.173   R-Sq = 1.15%   R-Sq(adj) = 0.49%

                                     Individual 95% CIs For Mean Based on
                                     Pooled StDev
Level               N   Mean  StDev  -+---------+---------+---------+--------
English           100  5.810  2.493                 (-----------*-----------)
Mathematics       100  5.300  2.013  (-----------*------------)
Political Scienc  100  5.330  1.975   (-----------*------------)
                                     -+---------+---------+---------+--------
                                    4.90      5.25      5.60      5.95

Since the p-value for the F test is 0.178, we cannot reject the hypothesis
H0: μ1 = μ2 = μ3,
i.e. there is no significant evidence that the mean numbers of "uhs" or "ahs" said by professors in the 3 departments differ.
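
The summary figures printed under the ANOVA table (S, R-Sq, R-Sq(adj)) and the width of the individual 95% confidence intervals can be recovered from the table entries alone. This is a minimal sketch: the value `t_crit = 1.968` is an approximate table value for $t_{0.025, 297}$ entered by hand, an assumption made here because the Python standard library has no t-distribution quantile function.

```python
import math

# Entries taken from the Minitab ANOVA table above.
B, W = 16.38, 1402.50        # Factor SS and Error SS
k, n = 3, 300                # number of groups and total sample size
n_j = 100                    # observations per department
total_ss = B + W

S = math.sqrt(W / (n - k))                            # pooled estimate of sigma
r_sq = B / total_ss                                   # R-Sq
r_sq_adj = 1 - (W / (n - k)) / (total_ss / (n - 1))   # R-Sq(adj)

t_crit = 1.968               # ASSUMPTION: approximate t_{0.025, 297}
half_width = t_crit * S / math.sqrt(n_j)

print(f"S = {S:.3f}, R-Sq = {100 * r_sq:.2f}%, R-Sq(adj) = {100 * r_sq_adj:.2f}%")
for name, mean in [("English", 5.810), ("Mathematics", 5.300), ("Political Science", 5.330)]:
    print(f"{name}: ({mean - half_width:.2f}, {mean + half_width:.2f})")
```

The three intervals overlap heavily, which is consistent with the large p-value of the F test.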

$$R^2 = \frac{B}{\text{Total SS}} = \frac{16.38}{1{,}418.88} = 0.0115 = 1.15\%,$$
$$R^2_{adj} = 1 - \frac{W/(n-k)}{(\text{Total SS})/(n-1)} = 1 - \frac{1{,}402.50/297}{1{,}418.88/299} = 0.0049 = 0.49\%.$$
Estimate for σ: $S = \sqrt{W/(n-k)} = \sqrt{1{,}402.50/297} = 2.173$.
Individual 95% confidence intervals for $\mu_j$:
$$\bar X_{\cdot j} \pm t_{0.025,\, n-k} \frac{S}{\sqrt{n_j}}, \qquad j = 1, \ldots, k.$$

Example: In early 2001, the American economy was slowing down and companies were laying off workers. A poll conducted during 9-11 February 2001 asked a random sample of workers how long (in months) it would be before they had significant financial hardship if they lost their jobs. The workers are classified into 4 groups according to their incomes. Below is part of the Minitab output of the descriptive statistics of the classified data. Can we infer that income has a significant impact on the length of time before facing financial hardship?

Variable      N    Mean    SE Mean  StDev
Over $50K     39   22.21   1.77     11.03
$30 to 50K    114  18.456  0.890    9.507
$20 to 30K    81   15.49   1.03     9.23
Under $20K    67   9.313   0.988    8.087
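
We have $n = \sum_{j=1}^{k} n_j = 39 + 114 + 81 + 67 = 301$, $\bar X_{\cdot 1} = 22.21$, $\bar X_{\cdot 2} = 18.456$, $\bar X_{\cdot 3} = 15.49$, $\bar X_{\cdot 4} = 9.313$, and
$$\bar X = \frac{1}{n} \sum_{j=1}^{k} n_j \bar X_{\cdot j} = 16.109.$$
Now
$$B = \sum_{j=1}^{k} n_j (\bar X_{\cdot j} - \bar X)^2 = 39(22.21 - 16.109)^2 + 114(18.456 - 16.109)^2 + 81(15.49 - 16.109)^2 + 67(9.313 - 16.109)^2 = 5{,}205.097,$$
$$W = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2 = \sum_{j=1}^{k} (n_j - 1) S_j^2 = 25{,}968.24.$$
Consequently,
$$F = \frac{B/(k-1)}{W/(n-k)} = \frac{5{,}205.097/3}{25{,}968.24/(301 - 4)} = 19.84.$$
Under $H_0$, $F \sim F_{k-1,\, n-k} = F_{3,\, 297}$. Since $F_{0.01,\, 3,\, 297} = 3.848 < 19.84$, we reject $H_0$ at the 1% significance level, i.e. income has a significant impact on the length of time before facing financial hardship.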
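
Because only group sizes, means and standard deviations are available in this example, B and W have to be built from summary statistics. The sketch below mirrors that calculation in plain Python; the variable names are illustrative, and small differences from the figures in the text (for example in the last digits of W) are due to rounding in the reported means and standard deviations.

```python
# Summary statistics from the Minitab descriptive output above.
n_j = [39, 114, 81, 67]                  # group sizes
mean_j = [22.21, 18.456, 15.49, 9.313]   # group sample means
sd_j = [11.03, 9.507, 9.23, 8.087]       # group sample standard deviations

k = len(n_j)
n = sum(n_j)

# Overall mean as the size-weighted average of the group means.
grand = sum(nj * m for nj, m in zip(n_j, mean_j)) / n

# Between-treatments variation from the group means.
B = sum(nj * (m - grand) ** 2 for nj, m in zip(n_j, mean_j))

# Within-treatments variation from the group variances: sum (n_j - 1) S_j^2.
W = sum((nj - 1) * s ** 2 for nj, s in zip(n_j, sd_j))

F = (B / (k - 1)) / (W / (n - k))
print(f"n = {n}, overall mean = {grand:.3f}")
print(f"B = {B:.1f}, W = {W:.1f}, F = {F:.2f}")
```

The resulting F is far above the 1% critical value 3.848 quoted in the text, so the conclusion does not depend on the rounding.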

Source   DF    SS       MS      F      P
Factor   3     5202.1   1734.0  19.83  0.000
Error    297   25973.3  87.5
Total    300   31175.4

S = 9.352   R-Sq = 16.69%   R-Sq(adj) = 15.84%

Level         N    Mean    StDev
Over $50K     39   22.205  11.029
$30 to 50K    114  18.456  9.507
$20 to 30K    81   15.494  9.233
Under $20K    67   9.313   8.087
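
One-way ANOVA: the model is
$$X_{ij} = \mu_j + \varepsilon_{ij}, \qquad i = 1, \ldots, n_j, \; j = 1, \ldots, k,$$
and the hypothesis of interest is $H_0: \mu_1 = \cdots = \mu_k$.

Two-way ANOVA: the model is
$$X_{ij} = \mu + \gamma_i + \beta_j + \varepsilon_{ij}, \qquad i = 1, \ldots, r, \; j = 1, \ldots, c,$$
where
$\mu$ represents the overall mean level,
$\beta_1, \ldots, \beta_c$ represent c different treatment (column) levels,
$\gamma_1, \ldots, \gamma_r$ represent r different block (row) levels,
$\varepsilon_{ij} \sim N(0, \sigma^2)$ and are independent.
There are n = rc observations.
Conditions to make parameters $\mu$, $\gamma_i$, $\beta_j$ identifiable:
$$\gamma_1 + \cdots + \gamma_r = 0, \qquad \beta_1 + \cdots + \beta_c = 0.$$
Hypotheses of interest: $H_0: \gamma_1 = \cdots = \gamma_r = 0$ (no block effect) and $H_0: \beta_1 = \cdots = \beta_c = 0$ (no treatment effect).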
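
Similar to the original model
$$X_{ij} = \mu + \gamma_i + \beta_j + \varepsilon_{ij},$$
we decompose the observations as follows:
$$X_{ij} = \bar X + (\bar X_{i \cdot} - \bar X) + (\bar X_{\cdot j} - \bar X) + (X_{ij} - \bar X_{i \cdot} - \bar X_{\cdot j} + \bar X)$$
for i = 1, ..., r and j = 1, ..., c, where
Mean at the i-th block level: $\bar X_{i \cdot} = \sum_{j=1}^{c} X_{ij}/c$, i = 1, ..., r,
Mean at the j-th treatment level: $\bar X_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r$, j = 1, ..., c,
Overall mean: $\bar X = \bar X_{\cdot \cdot} = \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}/(rc)$.

Point estimators: $\hat\mu = \bar X$, $\hat\gamma_i = \bar X_{i \cdot} - \bar X$, $\hat\beta_j = \bar X_{\cdot j} - \bar X$.
Residuals: $\hat\varepsilon_{ij} = X_{ij} - \bar X_{i \cdot} - \bar X_{\cdot j} + \bar X$.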
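
Two-way ANOVA decomposition:
$$\sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar X)^2 = c \sum_{i=1}^{r} (\bar X_{i \cdot} - \bar X)^2 + r \sum_{j=1}^{c} (\bar X_{\cdot j} - \bar X)^2 + \sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar X_{i \cdot} - \bar X_{\cdot j} + \bar X)^2.$$

Total variation: Total SS $= \sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar X)^2$, with rc − 1 degrees of freedom.
Between-blocks (rows) variation: $B_{row} = c \sum_{i=1}^{r} (\bar X_{i \cdot} - \bar X)^2$, with r − 1 degrees of freedom.
Between-treatments (columns) variation: $B_{col} = r \sum_{j=1}^{c} (\bar X_{\cdot j} - \bar X)^2$, with c − 1 degrees of freedom.
Residual (Error) variation: Residual SS $= \sum_{i=1}^{r} \sum_{j=1}^{c} (X_{ij} - \bar X_{i \cdot} - \bar X_{\cdot j} + \bar X)^2$, with (r − 1)(c − 1) degrees of freedom.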

Source   DF              SS           MS                             F                          P
Row      r − 1           B_row        B_row/(r − 1)                  (c − 1)B_row/Residual SS   p-value
Column   c − 1           B_col        B_col/(c − 1)                  (r − 1)B_col/Residual SS   p-value
Error    (r − 1)(c − 1)  Residual SS  Residual SS/[(r − 1)(c − 1)]
Total    rc − 1          Total SS

Computational formulae:
Row means: $\bar X_{i \cdot} = \sum_{j=1}^{c} X_{ij}/c$, i = 1, ..., r.
Column means: $\bar X_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r$, j = 1, ..., c.
Overall mean: $\bar X = \sum_{i=1}^{r} \bar X_{i \cdot}/r = \sum_{j=1}^{c} \bar X_{\cdot j}/c$.
$$\text{Total SS} = \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}^2 - rc \bar X^2,$$
$$B_{row} = c \sum_{i=1}^{r} \bar X_{i \cdot}^2 - rc \bar X^2,$$
$$B_{col} = r \sum_{j=1}^{c} \bar X_{\cdot j}^2 - rc \bar X^2,$$
$$\text{Residual SS} = (\text{Total SS}) - B_{row} - B_{col} = \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}^2 - c \sum_{i=1}^{r} \bar X_{i \cdot}^2 - r \sum_{j=1}^{c} \bar X_{\cdot j}^2 + rc \bar X^2.$$

Example:
The table below lists the percentage annual returns (calculated four times per annum) of the Common Stock Index at the New York Stock Exchange during 1981-85.

              1981  1982  1983  1984  1985
1st quarter   5.7   7.2   4.9   4.5   4.4
2nd quarter   6.0   7.0   4.1   4.9   4.2
3rd quarter   7.1   6.1   4.2   4.5   4.2
4th quarter   6.7   5.2   4.4   4.5   3.6

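Taking years as rows (blocks) and quarters as columns (treatments), r = 5, c = 4.
Row means: $\bar X_{i \cdot} = \sum_{j=1}^{c} X_{ij}/c$ which are, respectively, 6.375, 6.375, 4.4, 4.6, 4.1 for i = 1, ..., 5.
Column means: $\bar X_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r$ which are, respectively, 5.34, 5.24, 5.22, 4.88 for j = 1, ..., 4.
The overall mean $\bar X = \sum_{i=1}^{r} \bar X_{i \cdot}/r = 5.17$.
$\sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}^2 = 559.06$.
Hence Total SS $= \sum_{i=1}^{r} \sum_{j=1}^{c} X_{ij}^2 - rc \bar X^2 = 559.06 - 20 \times (5.17)^2 = 559.06 - 534.578 = 24.482$.
$B_{row} = c \sum_{i=1}^{r} \bar X_{i \cdot}^2 - rc \bar X^2 = 554.445 - 534.578 = 19.867$.
$B_{col} = r \sum_{j=1}^{c} \bar X_{\cdot j}^2 - rc \bar X^2 = 535.180 - 534.578 = 0.602$.
Residual SS $= 24.482 - 19.867 - 0.602 = 4.013$.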

Source    DF   SS      MS     F       P
Year      4    19.867  4.967  14.852  < 0.01
Quarter   3    0.602   0.201  0.600   > 0.10
Error     12   4.013   0.334
Total     19   24.482
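
The two-way ANOVA table for the stock returns can be reproduced from the definitions of Total SS, $B_{row}$, $B_{col}$ and Residual SS. The following is a minimal sketch in plain Python; the function name `two_way_anova` and the dictionary keys are illustrative choices, and the data are entered with years as rows and quarters as columns, as in the worked calculation.

```python
# Rows = years 1981..1985 (blocks), columns = quarters 1..4 (treatments).
returns = [
    [5.7, 6.0, 7.1, 6.7],
    [7.2, 7.0, 6.1, 5.2],
    [4.9, 4.1, 4.2, 4.4],
    [4.5, 4.9, 4.5, 4.5],
    [4.4, 4.2, 4.2, 3.6],
]


def two_way_anova(x):
    """Two-way ANOVA (one observation per cell) for an r x c list of lists."""
    r, c = len(x), len(x[0])
    grand = sum(map(sum, x)) / (r * c)
    row_means = [sum(row) / c for row in x]
    col_means = [sum(x[i][j] for i in range(r)) / r for j in range(c)]
    total_ss = sum((v - grand) ** 2 for row in x for v in row)
    b_row = c * sum((m - grand) ** 2 for m in row_means)
    b_col = r * sum((m - grand) ** 2 for m in col_means)
    resid_ss = total_ss - b_row - b_col
    ms_err = resid_ss / ((r - 1) * (c - 1))
    return {
        "total_ss": total_ss,
        "b_row": b_row,
        "b_col": b_col,
        "resid_ss": resid_ss,
        "f_row": (b_row / (r - 1)) / ms_err,
        "f_col": (b_col / (c - 1)) / ms_err,
    }


table = two_way_anova(returns)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The large F for years and the small F for quarters match the p-values in the table: the returns differ between years but show no clear quarterly pattern.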

Two-way ANOVA in Minitab is almost as easy as one-way ANOVA, except the data set to be analysed needs to be in a special format: 3 columns, each of length r × c:
Data column (c1): stack the original r × c data points column over column.
Row index column (c2): the row (year) number, 1 to 5, of each data point.
Column index column (c3): the column (quarter) number, 1 to 4, of each data point.

c1    c2   c3
5.7   1    1
7.2   2    1
4.9   3    1
4.5   4    1
4.4   5    1
6.0   1    2
7.0   2    2
4.1   3    2
4.9   4    2
4.2   5    2
7.1   1    3
6.1   2    3
4.2   3    3
4.5   4    3
4.2   5    3
6.7   1    4
5.2   2    4
4.4   3    4
4.5   4    4
3.6   5    4
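
The stacked three-column layout can be generated mechanically from the r × c array. The sketch below is an illustration in plain Python, not Minitab syntax; the list names `c1`, `c2`, `c3` simply mirror the Minitab column names used above.

```python
# Rows = years 1981..1985, columns = quarters 1..4.
returns = [
    [5.7, 6.0, 7.1, 6.7],
    [7.2, 7.0, 6.1, 5.2],
    [4.9, 4.1, 4.2, 4.4],
    [4.5, 4.9, 4.5, 4.5],
    [4.4, 4.2, 4.2, 3.6],
]
r, c = len(returns), len(returns[0])

# Stack column over column: all rows of column 1, then column 2, and so on.
c1 = [returns[i][j] for j in range(c) for i in range(r)]   # data values
c2 = [i + 1 for j in range(c) for i in range(r)]           # row (year) index
c3 = [j + 1 for j in range(c) for i in range(r)]           # column (quarter) index

for value, row, col in zip(c1, c2, c3):
    print(f"{value:<5} {row:<4} {col}")
```

Each of the three lists has length r × c = 20, and the printed rows match the listing above.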

MTB > twoway c1-c3;
SUBC> means c2 c3;
SUBC> gFourpack.

Source    DF   SS      MS       F      P
Year      4    19.867  4.96675  14.85  0.000
Quarter   3    0.602   0.20067  0.60   0.627
Error     12   4.013   0.33442
Total     19   24.482

S = 0.5783   R-Sq = 83.61%   R-Sq(adj) = 74.05%

Year   Mean
1      6.375
2      6.375
3      4.400
4      4.600
5      4.100

Quarter   Mean
1         5.34
2         5.24
3         5.22
4         4.88

Minitab may also give interval estimates for each block and treatment level.

The pooled estimator for σ:
$$S = \left\{ \frac{\text{Residual SS}}{(r-1)(c-1)} \right\}^{1/2} = (\text{Residual MS})^{1/2}.$$
$$R^2 = \frac{B_{row} + B_{col}}{\text{Total SS}}, \qquad R^2_{adj} = 1 - \frac{\text{Residual MS}}{(\text{Total SS})/(rc - 1)}.$$

Recall the model
$$X_{ij} = \mu + \gamma_i + \beta_j + \varepsilon_{ij}.$$
Hence we may also look at the residuals:
$$\hat\varepsilon_{ij} = X_{ij} - \hat\mu - \hat\gamma_i - \hat\beta_j, \qquad i = 1, \ldots, r, \; j = 1, \ldots, c.$$

[Figure: Minitab residual plots for the returns data, four panels: normal probability plot (Percent against Residual), Residual against Fitted Value, histogram (Frequency against Residual), and Residual against Observation Order.]
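
The residuals, the pooled estimate S and the two R-squared figures for the returns data can be computed directly from the row, column and overall means. The following is a minimal sketch in plain Python (no plotting); variable names are illustrative.

```python
import math

# Rows = years 1981..1985 (blocks), columns = quarters 1..4 (treatments).
x = [
    [5.7, 6.0, 7.1, 6.7],
    [7.2, 7.0, 6.1, 5.2],
    [4.9, 4.1, 4.2, 4.4],
    [4.5, 4.9, 4.5, 4.5],
    [4.4, 4.2, 4.2, 3.6],
]
r, c = len(x), len(x[0])
grand = sum(map(sum, x)) / (r * c)
row_mean = [sum(row) / c for row in x]
col_mean = [sum(x[i][j] for i in range(r)) / r for j in range(c)]

# Residual for cell (i, j): X_ij - row mean - column mean + overall mean.
resid = [[x[i][j] - row_mean[i] - col_mean[j] + grand for j in range(c)]
         for i in range(r)]

resid_ss = sum(e ** 2 for row in resid for e in row)
total_ss = sum((v - grand) ** 2 for row in x for v in row)
df_err = (r - 1) * (c - 1)

s = math.sqrt(resid_ss / df_err)                        # pooled estimate of sigma
r_sq = 1 - resid_ss / total_ss                          # equals (B_row + B_col) / Total SS
r_sq_adj = 1 - (resid_ss / df_err) / (total_ss / (r * c - 1))

print(f"S = {s:.4f}, R-Sq = {100 * r_sq:.2f}%, R-Sq(adj) = {100 * r_sq_adj:.2f}%")
```

By construction the residuals sum to zero within every row and every column, which is a quick sanity check on the calculation.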