Statistical Analysis Dr. Shamsuddin
Learning points/objectives:
1. Definition and meaning of a hypothesis and the construction of hypothesis tests.
2. How and why tests are used to make decisions about a population when the mean, variance, or probability of success is specified.
3. Physical meanings of Type I and Type II errors. How to calculate the probability of a Type II error when a value of the alternative hypothesis is given and the values of α, σ, and n are known.
4. For tests of a single mean, proportion, or variance, know how to compute the value of the decision criterion (or criteria) for different alternative hypotheses, and know how to use those values to determine whether the null hypothesis should be rejected.
5. Understand how the values of n, c, α, and β are related to one another and to the OC curve.
6. Meaning, scope, procedure, and analysis of design of experiments.
7. Regression analysis for developing statistical relationships between predictor variables and a response variable.
Defining Hypotheses
A statistical hypothesis is a statement or a claim about the parameters of a
probability distribution or the parameters of a model.
Ex. The mean tension bond strengths of two formulations are claimed to be unequal. So
H0: μ1 = μ2
H1: μ1 ≠ μ2
The four possible outcomes of the decision, for the hypotheses H0: p ≤ 0.5 vs. H1: p > 0.5, are:

True value of p | Decision | Outcome
p ≤ 0.5 | H0: p ≤ 0.5 is not rejected | No error is made
p > 0.5 | H0: p ≤ 0.5 is not rejected | Type II (β) error is made
p ≤ 0.5 | H0: p ≤ 0.5 is rejected | Type I (α) error is made
p > 0.5 | H0: p ≤ 0.5 is rejected | No error is made
Ex. Take the previous example and explain the physical meanings of the hypotheses.
H0: p ≤ 0.5
H1: p > 0.5 (the majority of the mixes of raw materials are improper)
Meaning: if a Type I error is made, H0 is rejected although H0 is true (the current mixture is OK). So one makes a mistake and concludes that new mixtures are needed, although this is not necessary. But if H0 cannot be rejected although H0 is not true, a Type II error is committed. That is, a new mixture is needed but one cannot conclude in its favor.
Therefore, making a Type I or Type II error is always a possibility; there is no way to avoid this dilemma. How do we deal with it?
The question is the amount of probability associated with each of them. Design a method for deciding whether or not to reject H0 so that the probability of making either error is minimized.
That is, say we agree to reject H0 in favor of H1 if the observed value of the test statistic is 14 or larger. Then split the possible values into two sets: A = {1, 2, ..., 12, 13} and its complement A' = {14, 15, 16, ..., 20}. If the observed value falls in A', reject H0; otherwise accept it. If the test is designed this way, the probability of committing a Type I error is the value obtained from the binomial tail expression, approximately 0.05 as desired. The same bound applies for other values of p that are < 0.5.
Now about the Type II error, β: it is possible that the observed value of the test statistic does not fall into the rejection region even though H0 is not true and should be rejected. A Type II error occurs in this situation. β depends on the alternative hypothesis, so a value of the alternative needs to be specified by the experimenter.
Ex. The critical region is A' = {14, 15, 16, ..., 20}. If the true proportion is 0.7, what is the probability that the test is unable to detect this situation? That is, find β for which H0 will not be rejected.
β = P[Type II error] = P[fail to reject H0 | p = 0.7] = P[X ≤ 13 | n = 20, p = 0.7]
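As a quick check of this calculation, here is a minimal sketch in Python (assuming, as above, a binomial test with n = 20 trials and rejection region {14, ..., 20}):

    from scipy.stats import binom

    n = 20            # number of trials in the sample
    reject_from = 14  # reject H0 if the count of successes is 14 or more

    # Type I error: P(X >= 14) when p is at the boundary value 0.5
    alpha = 1 - binom.cdf(reject_from - 1, n, 0.5)

    # Type II error: P(X <= 13) when the true proportion is 0.7
    beta = binom.cdf(reject_from - 1, n, 0.7)

    print(f"alpha ~ {alpha:.4f}")  # ~0.058, close to the desired 0.05
    print(f"beta  ~ {beta:.4f}")   # ~0.392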
Ex. An engine emits a toxic gas at 100 mg/s. A new engine design is proposed that would reduce the emission level. A sample of 50 values from a modified engine is taken and tested. The sample mean (x̄) is 92 mg/s and the sample standard deviation is 21 mg/s. Can we produce the new engine based on these sample statistics? The population mean may be more or less than 92 mg/s.
A smaller P-value indicates that the analyst is more certain not to accept H0 (rejecting the null hypothesis). The larger the P-value, the more plausible it is that H0 is true, but it is never certain that it is true.
A rule of thumb is to reject H0 whenever P ≤ 0.05 (not scientific). If P ≤ 0.05, the result is statistically significant at the 5% level. If P ≤ 0.01, the result is statistically significant at the 1% level, and so on.
A test is statistically significant at significance level α if the P-value is ≤ α; the test result is then significant at the 100α% level (the null hypothesis is rejected). So it is better to report the P-value itself than merely to compare it to 5% or 1%.
The P-value is not the probability that H0 is true. In other words, do not accept or reject H0 mechanically when P is close to α; leave borderline cases to subjective judgment.
General criteria:
Strong evidence against H0: P < 0.01
Moderately strong evidence against H0: 0.01 < P < 0.05
Relatively weak evidence against H0: 0.05 < P < 0.10
The P-value for a lower-tailed test is defined as
P-value = P(the sample mean would be as small as the observed value if H0 were true).
Software solutions report the observed P-value, and the judgment is up to the user.
For the emission example, the hypotheses are
H0: μ ≥ 100
H1: μ < 100
with n = 50, and the test statistic is

z = (x̄ − μ0)/(σ/√n) = (92 − 100)/(21/√50) = −8/2.970 ≈ −2.69

The decision can equivalently be stated with a critical value x̄c on the sample mean: reject H0 if x̄ < x̄c, otherwise accept H0 if x̄ ≥ x̄c.
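A minimal sketch of this one-sample z-test in Python (the numbers are those of the emission example above):

    import math
    from scipy.stats import norm

    xbar, mu0, s, n = 92.0, 100.0, 21.0, 50

    z = (xbar - mu0) / (s / math.sqrt(n))   # standardized test statistic
    p_value = norm.cdf(z)                   # lower-tailed: P(Z <= z)

    print(f"z = {z:.2f}, P-value = {p_value:.4f}")  # z ~ -2.69, P ~ 0.0036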
Performance of the test (how well the test controls errors) can be analyzed by:
Computing Type-I error and Type-II error probabilities, and
Finding the power function of the test.
For an upper-tailed test that rejects H0 when x̄ ≥ c:

α = P[x̄ ≥ c | μ = μ0] = P[Z ≥ (c − μ0)/(σ/√n)] = 1 − Φ((c − μ0)/(σ/√n))

β(μ) = P[x̄ < c | μ] = P[Z < (c − μ)/(σ/√n)] = Φ((c − μ)/(σ/√n))

The probabilities of the four possible correct and incorrect decisions can be listed as:

Decision | H0 is true (μ = μ0) | H1 is true (μ > μ0)
x̄ < c (accept H0) | 1 − α (correct) | β (Type II error)
x̄ ≥ c (reject H0) | α (Type I error) | 1 − β (correct)

In particular, P[x̄ ≥ c | H0 is true] = P[x̄ ≥ c | μ = μ0] = α.
Choose the value of c so that the Type I error is at most α (typically 10%, 5%, or 1%). Setting α = P[x̄ ≥ c | μ = μ0] gives

c = μ0 + z(α)(σ/√n)

so the probability of a Type I error is P[x̄ ≥ μ0 + z(α)(σ/√n) | μ = μ0] = α.

Now accept H0 if x̄ < μ0 + z(α)(σ/√n), or reject H0 if x̄ ≥ μ0 + z(α)(σ/√n).

The Type II error probability at a specified alternative μ is

β(μ) = P[x̄ < c | μ] = Φ( z(α) − (μ − μ0)√n/σ )

Solving for the sample size needed to achieve given α and β at the alternative μ:

n = σ²[z(α) + z(β)]² / (μ − μ0)²

Equivalent test: standardize the sample mean,

Z0 = (X̄ − μ0)/(σ/√n)

Reject H0 if Z0 ≥ z(α), or accept H0 if Z0 < z(α).
Ex. Suppose the current mean is unknown but the standard deviation is 6. The standard (hypothesized) mean is 50. From the sample data (n = 9), the mean is calculated and found to be 52.5.
Construct the hypotheses. Locate the acceptance and rejection regions for α = 5%. Make a decision on H0. Find the Type II error function and draw it. Determine the power function and draw its curve. Do the equivalent z-test.
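A minimal sketch of this example in Python (assuming, as the wording suggests, the upper-tailed test H0: μ = 50 vs. H1: μ > 50 at α = 0.05):

    import math
    from scipy.stats import norm

    mu0, sigma, n, xbar, alpha = 50.0, 6.0, 9, 52.5, 0.05

    z_crit = norm.ppf(1 - alpha)                 # z(alpha) ~ 1.645
    c = mu0 + z_crit * sigma / math.sqrt(n)      # critical value ~ 53.29
    z0 = (xbar - mu0) / (sigma / math.sqrt(n))   # observed statistic = 1.25

    print(f"c = {c:.2f}, z0 = {z0:.2f}")
    print("reject H0" if z0 >= z_crit else "fail to reject H0")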
The power function of the test is

π(μ) = 1 − β(μ) = 1 − P[X̄ < c | μ] = 1 − Φ((c − μ)/(σ/√n))

For the example, c = μ0 + z(α)(σ/√n) = 50 + 1.645(6/3) ≈ 53.29. Draw the power vs. μ curve.

Now, β(μ) = Φ((c − μ)/(σ/√n)).

Equivalent test in standardized form: accept H0 if (X̄ − μ0)/(σ/√n) < z(α), or reject H0 if (X̄ − μ0)/(σ/√n) ≥ z(α), since

P[X̄ ≥ c | H0 is true] = P[(X̄ − μ0)/(σ/√n) ≥ (c − μ0)/(σ/√n)] = α
The rejection regions for the three alternative hypotheses are:

H1: μ > μ0 — reject H0 if (x̄ − μ0)/(σ/√n) ≥ z(α)
H1: μ < μ0 — reject H0 if (x̄ − μ0)/(σ/√n) ≤ −z(α)
H1: μ ≠ μ0 — reject H0 if |x̄ − μ0|/(σ/√n) ≥ z(α/2)

For the two-sided test, with Z0 = (X̄ − μ0)/(σ/√n), accept H0 if |Z0| < z(α/2) and reject it if |Z0| ≥ z(α/2).

The power function is π(μ) = 1 − β(μ), with β(μ) = Φ((c − μ)/(σ/√n)).

The sample size required to achieve error probabilities α and β at the alternative μ is

n = σ²[z(α) + z(β)]² / (μ − μ0)²   (for lower- or upper-tailed tests)

n = σ²[z(α/2) + z(β)]² / (μ − μ0)²   (for two-sided tests)
OC Curve
β = f(n, α, μ). For a fixed value of α, there is a family of curves, usually called operating characteristic (OC) curves. The vertical and horizontal axes are, respectively, β and d = |mean − μ0|/σ.
OC curves are useful in determining the sample size (n) required to detect a specified difference with a particular probability.
Ex. You wish to determine the sample size necessary to have a 0.90 probability of rejecting H0: μ = 16.0 if the true mean is 16.05, at a known/specified α level. The population standard deviation is σ = 0.1.
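A minimal sketch of the sample-size calculation in Python, using the formula above (α = 0.05 one-sided is an assumption here, since the example leaves the level unspecified):

    import math
    from scipy.stats import norm

    mu0, mu1, sigma = 16.0, 16.05, 0.1
    alpha, beta = 0.05, 0.10          # power 0.90 means beta = 0.10

    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    n = (sigma * (z_a + z_b) / (mu1 - mu0)) ** 2

    print(f"n = {n:.1f}, round up to {math.ceil(n)}")  # ~34.3 -> 35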
When σ is known, (X̄ − μ0)/(σ/√n) has a standard normal distribution. When σ is unknown and n is large, replace σ with the sample standard deviation s and use (X̄ − μ0)/(s/√n).

Alternative hypothesis | Rejection region
μ > μ0 | (x̄ − μ0)/(s/√n) ≥ z(α)
μ < μ0 | (x̄ − μ0)/(s/√n) ≤ −z(α)
μ ≠ μ0 | |x̄ − μ0|/(s/√n) ≥ z(α/2)
Ex. n = 30, x̄ = 0.09, s = 0.03. If μ ≥ 0.10, the material cannot be used without additional processing.
Now, H0: μ ≥ 0.10 and H1: μ < 0.10.

z_observed = (x̄ − μ0)/(s/√n) = (0.09 − 0.10)/(0.03/√30) ≈ −1.83

Approximate P-value = P(Z ≤ z_observed) = P(Z ≤ −1.83) ≈ 0.03. Since P = 0.03 < 0.05, H0 is rejected at the 5% level.
When σ is unknown and the sample is small, the statistic

t = (X̄ − μ0)/(s/√n)

has a t distribution with n − 1 degrees of freedom under H0 (reference distribution). The critical value c on x̄ satisfies P[X̄ ≥ c | μ0] = P[t_(n−1) ≥ (c − μ0)/(s/√n)] = α, and the power function is π(μ) = P[(X̄ − μ0)/(s/√n) ≥ t(α, n−1) | μ], the probability of rejecting H0 when the true mean is μ.

In the upper one-sided case, that is, H0: μ = μ0 against H1: μ > μ0, reject H0 if

(x̄ − μ0)/(s/√n) ≥ t(α, n−1)
Ex. Say n = 6, x̄ = 0.17, and σ is unknown but s = 0.04. Test the hypothesis for a value of μ0 = 0.2 at the 5% significance level.
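A minimal sketch of this one-sample t-test in Python (taking the lower-tailed alternative H1: μ < 0.2, which is an assumption, since the direction is not stated):

    import math
    from scipy.stats import t

    n, xbar, s, mu0, alpha = 6, 0.17, 0.04, 0.2, 0.05

    t0 = (xbar - mu0) / (s / math.sqrt(n))   # ~ -1.84
    t_crit = t.ppf(alpha, df=n - 1)          # ~ -2.015

    print(f"t0 = {t0:.3f}, critical value = {t_crit:.3f}")
    print("reject H0" if t0 <= t_crit else "fail to reject H0")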
Tests comparing two means (variances σ1² and σ2² known)

H0: μ1 − μ2 = 0, with various possibilities for the alternative hypothesis: H1: μ1 − μ2 ≠ 0, or one-sided alternatives such as H1: μ1 − μ2 > 0.

If X1 and X2 are normally distributed as N(μ1, σ1²) and N(μ2, σ2²), then X1 − X2 is also a normally distributed random variable, N(μ1 − μ2, σ1² + σ2²). The distribution of (x̄1 − x̄2) is

N( μ1 − μ2, σ1²/n1 + σ2²/n2 )

Test statistic:

z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)   (under H0: μ1 − μ2 = 0)

Rejection regions:

H1: μ1 − μ2 > 0 — reject if (x̄1 − x̄2) ≥ c = z(α)√(σ1²/n1 + σ2²/n2)
H1: μ1 − μ2 < 0 — reject if (x̄1 − x̄2) ≤ c = −z(α)√(σ1²/n1 + σ2²/n2)
H1: μ1 − μ2 ≠ 0 — reject if (x̄1 − x̄2) ≤ c1 = −z(α/2)√(σ1²/n1 + σ2²/n2), or (x̄1 − x̄2) ≥ c2 = z(α/2)√(σ1²/n1 + σ2²/n2)
Ex. A new ball bearing is claimed to have lower frictional resistance under very heavy loading conditions. 36 of the old pieces and 25 of the new are placed on test. The results are x̄_old = 52 and x̄_new = 44, with σ²_old = 64 and σ²_new = 144. Test the claim at the 10% significance level, using

z = (x̄_old − x̄_new) / √(σ1²/n1 + σ2²/n2)

The sample size per group needed to achieve error probabilities α and β at a specified difference δ = μ1 − μ2 is

n = (σ1² + σ2²)[z(α) + z(β)]² / δ²
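A minimal sketch of the ball-bearing test in Python (upper-tailed, since the claim is that the new bearing has the lower mean):

    import math
    from scipy.stats import norm

    x_old, x_new = 52.0, 44.0
    var_old, var_new = 64.0, 144.0
    n_old, n_new = 36, 25
    alpha = 0.10

    z0 = (x_old - x_new) / math.sqrt(var_old / n_old + var_new / n_new)
    z_crit = norm.ppf(1 - alpha)   # ~ 1.282

    print(f"z0 = {z0:.2f}, z_crit = {z_crit:.3f}")  # z0 ~ 2.91
    print("reject H0" if z0 >= z_crit else "fail to reject H0")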
Tests comparing two means (variances unknown but assumed equal)

H0: μ1 − μ2 = 0. Now take samples (of sizes n1 and n2) from both populations, with sample means x̄1 and x̄2. Since the two populations share a common variance, a pooled estimator of σ², denoted s_p², is used:

s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

The estimated standard error is s(X̄1 − X̄2) = s_p √(1/n1 + 1/n2), and the test statistic

t0 = [(x̄1 − x̄2) − (μ1 − μ2)] / [s_p √(1/n1 + 1/n2)]

has a t distribution with n1 + n2 − 2 degrees of freedom as its reference distribution.

Alternative hypothesis | Rejection region
μ1 − μ2 ≠ 0 (two-tailed test) | t0 ≥ t(α/2, n1+n2−2) or t0 ≤ −t(α/2, n1+n2−2)
μ1 − μ2 > 0 (right-tailed test) | t0 ≥ t(α, n1+n2−2)
μ1 − μ2 < 0 (left-tailed test) | t0 ≤ −t(α, n1+n2−2)
Ex. A study of the tensile strength of ductile iron annealed at two different temperatures is conducted. It is thought that the lower temperature will yield the higher mean tensile strength. The data are:

1450°F: n1 = 10, x̄1 = 18,900, s1² = 1600
1650°F: n2 = 16, x̄2 = 17,500, s2² = 2500

When the variances cannot be assumed equal, each variance is estimated separately and they are not combined. The unequal-variance test statistic is

t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2)

This time, the number of degrees of freedom must be estimated from the data. One method for doing so is the Smith-Satterthwaite procedure:

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
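A minimal sketch of the unequal-variance test in Python, applied to the ductile iron data; scipy's ttest_ind_from_stats with equal_var=False uses this same Satterthwaite df:

    import math
    from scipy.stats import ttest_ind_from_stats

    # 1450F vs 1650F; std devs are the square roots of the given variances
    res = ttest_ind_from_stats(mean1=18900, std1=math.sqrt(1600), nobs1=10,
                               mean2=17500, std2=math.sqrt(2500), nobs2=16,
                               equal_var=False)   # Welch / Smith-Satterthwaite

    p_one_sided = res.pvalue / 2   # t > 0, so halve the two-sided P for H1: mu1 > mu2
    print(f"t = {res.statistic:.2f}, one-sided P = {p_one_sided:.3g}")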
Ex. Two types of materials are used in producing a product, and it is intended to compare the strength of one to the other. Is material 1, on average, better able to withstand a heavy load than material 2? The data are:

Material 1: n1 = 25, x̄1 = 380 lb, s1² = 100
Material 2: n2 = 16, x̄2 = 370 lb, s2² = 400

Write down the hypotheses and conduct the test to answer the question.
Tests comparing two means when data are paired
The random variables (X and Y) are not independent: each observation in one sample is naturally, or by design, paired with an observation in the other. For example, Population I consists of problems solved using the old method and Population II of the same problems solved using the new method; a sample of size n gives pairs (x1, y1), ..., (xn, yn), and the differences di = xi − yi are analyzed. The test statistic is

t = (d̄ − D0)/(sd/√n), where df = n − 1.
Ex. Data for 10 programs:

Old (x): 8.05, 24.74, 28.33, 8.45, 9.19, 25.20, 14.05, 20.33, 4.82, 8.54
New (y): 0.71, 0.74, 0.74, 0.77, 0.80, 0.83, 0.82, 0.77, 0.71, 0.72

Compute the difference d = x − y for each pair and test whether the mean difference is zero.
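A minimal sketch of the paired t-test on these data in Python:

    from scipy.stats import ttest_rel

    old = [8.05, 24.74, 28.33, 8.45, 9.19, 25.20, 14.05, 20.33, 4.82, 8.54]
    new = [0.71, 0.74, 0.74, 0.77, 0.80, 0.83, 0.82, 0.77, 0.71, 0.72]

    # Paired t-test on the differences d_i = old_i - new_i (df = n - 1 = 9)
    res = ttest_rel(old, new)
    print(f"t = {res.statistic:.2f}, P = {res.pvalue:.3g}")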
Tests comparing two variances

For a single normal sample, (n − 1)s²/σ² has a chi-square distribution with n − 1 degrees of freedom (reference distribution).

Let s1² and s2² be the sample variances corresponding to random samples of sizes n1 and n2 taken from two independent normal distributions. Then

F0 = { [(n1 − 1)s1²/σ1²]/(n1 − 1) } / { [(n2 − 1)s2²/σ2²]/(n2 − 1) } = (s1²/σ1²)/(s2²/σ2²)

has an F distribution with n1 − 1 and n2 − 1 degrees of freedom. Under H0: σ1² = σ2², the statistic is F0 = s1²/s2².

For the two-sided test, accept H0 if

F(1−α/2, n1−1, n2−1) = 1/F(α/2, n2−1, n1−1) ≤ F0 ≤ F(α/2, n1−1, n2−1)

and reject otherwise. For one-sided alternatives: reject H0: σ1² ≤ σ2² if s1²/s2² ≥ F(α, n1−1, n2−1), and reject H0: σ1² ≥ σ2² if s2²/s1² ≥ F(α, n2−1, n1−1).
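A minimal sketch of the two-sided variance-ratio test in Python, reusing the materials data from the example above (s1² = 100, n1 = 25; s2² = 400, n2 = 16):

    from scipy.stats import f

    s1_sq, n1, s2_sq, n2, alpha = 100.0, 25, 400.0, 16, 0.05

    F0 = s1_sq / s2_sq
    # f.ppf gives lower-tail quantiles, so these bracket the acceptance region
    lower = f.ppf(alpha / 2, n1 - 1, n2 - 1)
    upper = f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)

    print(f"F0 = {F0:.3f}, acceptance region = [{lower:.3f}, {upper:.3f}]")
    print("fail to reject H0" if lower <= F0 <= upper else "reject H0")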
Tests on a proportion

The sample proportion is p̂ = x/n (x is the number of success events). Provided p is not near 0 or 1, p̂ is approximately normal with mean p and standard deviation √(p(1−p)/n).

To test H0: p = p0, the test statistic is

z0 = (p̂ − p0) / √(p0(1−p0)/n)

The critical value on p̂ is c = p0 + z(α)√(p0(1−p0)/n), and the Type II error function is

β(p) = Φ( (c − p) / √(p(1−p)/n) )

with power function π(p) = 1 − β(p).

Alternative hypothesis | Rejection region
p > p0 | (p̂ − p0)/√(p0(1−p0)/n) ≥ z(α)
p < p0 | (p̂ − p0)/√(p0(1−p0)/n) ≤ −z(α)
p ≠ p0 | |p̂ − p0|/√(p0(1−p0)/n) ≥ z(α/2)
Ex. It is claimed that 90% of torsion springs will survive beyond the accepted maximum standard of performance. Investigation revealed that 168 springs in a sample of 200 exceed that standard. Is the claim valid at the 5% significance level?
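A minimal sketch of this one-proportion z-test in Python (lower-tailed, since the observed rate 168/200 = 0.84 falls below the claimed 0.90):

    import math
    from scipy.stats import norm

    x, n, p0 = 168, 200, 0.90
    p_hat = x / n                                          # 0.84

    z0 = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)       # ~ -2.83
    p_value = norm.cdf(z0)                                 # lower-tailed

    print(f"z0 = {z0:.2f}, P = {p_value:.4f}")
    print("reject the claim" if p_value < 0.05 else "claim not rejected")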
Tests comparing two proportions

The estimated standard error of p̂1 − p̂2 is

s(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

You can use the confidence interval approach to test the hypothesis: the 100(1 − α)% interval (L, U) has

L = p̂1 − p̂2 − z(α/2) √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )  and
U = p̂1 − p̂2 + z(α/2) √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 ).

Alternatively, with p̂1 = x1/n1 and p̂2 = x2/n2, the statistic

[(p̂1 − p̂2) − (p1 − p2)] / √( p1(1−p1)/n1 + p2(1−p2)/n2 )

is approximately N(0, 1). Under H0: p1 = p2, a pooled estimate p̂ = (x1 + x2)/(n1 + n2) is used, and the test statistic becomes

z0 = (p̂1 − p̂2) / √( p̂(1 − p̂)(1/n1 + 1/n2) )

Alternative hypothesis | Rejection region
p1 > p2 | (p̂1 − p̂2)/s(p̂1 − p̂2) ≥ z(α)
p1 < p2 | (p̂1 − p̂2)/s(p̂1 − p̂2) ≤ −z(α)
p1 ≠ p2 | |p̂1 − p̂2|/s(p̂1 − p̂2) ≥ z(α/2)
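A minimal sketch of the pooled two-proportion z-test in Python (the counts x1, n1, x2, n2 below are hypothetical illustration values, not taken from the notes):

    import math
    from scipy.stats import norm

    x1, n1, x2, n2 = 42, 100, 30, 100   # hypothetical counts

    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)      # pooled estimate under H0: p1 = p2

    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z0 = (p1 - p2) / se

    p_value = 2 * (1 - norm.cdf(abs(z0)))   # two-sided
    print(f"z0 = {z0:.2f}, P = {p_value:.3f}")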
DESIGN OF EXPERIMENTS

Inputs that can be set by the experimenter are called controllable variables (factors), x1, x2, ...; the output y is the observable response.

[Figure: General model of a process or system — controllable factors x1, x2, ..., xn enter the process, which produces the output y.]

Now, signal factors are set by the designer or operator to obtain the intended value of the response variable. Noise factors are not controlled, or are very expensive or difficult to control. Both of them are summarized by a single measure known as the signal-to-noise (S/N) ratio. Its mathematical expression depends on the type of orientation (e.g., larger-is-better or smaller-is-better).
Goals and objectives of experimental analysis
Goals:
Determine the input variables, and their magnitudes, that influence the response y.
Determine the levels for these variables.
Determine how to manipulate these variables to control the response.
Objectives may be noted as:
Determine which factors are most influential on the response y.
Locate where to set the influential x's so that y is near the nominal requirement.
Determine where to set the influential x's so that variability in y is small.
Locate where to set the influential x's so that the effects of uncontrollable/noise variables are minimized.
In manufacturing, experimental design techniques applied early in process design and development can result in:
1. improved yield
2. reduced variability and closer conformance to nominal or desired output
3. reduced development time
4. reduced overall costs
Experimental Approaches
1. One can combine factors arbitrarily, then test them and see the results. This strategy is called the best-guess approach.
2. One-factor-at-a-time approach. Select a starting point (a baseline set of levels) for each factor, and then change the level of one factor while keeping all others at their baseline levels. Do the test and record the response. But this method cannot detect interaction (of factors) effects.

[Figure: response plotted against factor x1 (low to high) and against factor x2 (low to high) for the one-factor-at-a-time approach.]

TC | x1 | x2 | x3 | x4 | x5 | Response
1 | 1 | 1 | 1 | 1 | 1 | y0 (benchmark or ref. value)
2 | 2 | 1 | 1 | 1 | 1 | yA
3 | 1 | 2 | 1 | 1 | 1 | yB
4 | 1 | 1 | 2 | 1 | 1 | yC
5 | 1 | 1 | 1 | 2 | 1 | yD
... | . | . | . | . | . | ...
h | 1 | 1 | 1 | 1 | 2 | yh

TC = treatment conditions; 1 = base level, 2 = changed level
Effects:

Factor | Level 1 | Level 2
Tire pressure | 28 psi | 35 psi
Speed | 55 mph | 65 mph
Weight | 30 | 40
Fuel grade | Regular | Premium

Response variable: gas mileage
[Figure: response plotted against factor x1 alone and against factors x1 and x2 jointly (low to high), illustrating an interaction between factors.]

[Residue of a 2³ factorial effect-sign table; only the contrast signs B −, C −, AB +, AC +, BC +, ABC − are recoverable from the extracted text.]

For a single-factor experiment with a levels of the factor and n replicate observations y11, y12, ..., y1n, etc., the data layout is:

Factor level | Observations | Totals | Averages
1 | y11, y12, ..., y1n | y1. | ȳ1.
2 | y21, y22, ..., y2n | y2. | ȳ2.
... | ... | ... | ...
a | ya1, ya2, ..., yan | ya. | ȳa.
 | | y.. | ȳ..
The model is

yij = μ + τi + εij

where yij = jth observation at the ith factor level (i = 1, 2, ..., a, and j = 1, 2, ..., n). The overall mean μ is common to all treatments. The treatment effect particular to the ith treatment is τi, and εij is the random error component. The model can also be written as

yij = μi + εij,  i = 1, 2, ..., a;  j = 1, 2, ..., n

Random errors are independently and normally distributed with mean zero and variance σ². So each treatment generates a normal population with mean μi and variance σ²; thus there are a population means. This is a completely randomized experimental design, and the corresponding analysis is a fixed-effects-model ANOVA. ANOVA is applied to test for equality of the treatment effects. The total of the treatment effects (deviations from the common mean μ) is zero: Σ(i=1..a) τi = 0.

Notation: yi. = Σ(j=1..n) yij, ȳi. = yi./n, and y.. = Σi Σj yij (the grand total over all N = an observations), ȳ.. = y../N.
The hypotheses are

H0: μ1 = μ2 = ... = μa  and  H1: μi ≠ μj for at least one pair (i, j)

If the null hypothesis is true, changing the level of the factor has no effect on the mean response.
The total variability in the data, i.e., the total sum of squares, is

SST = Σi Σj (yij − ȳ..)² = Σi Σj y²ij − y²../N

and it partitions as SST = SStreatments + SSE, where

SStreatments = n Σi (ȳi. − ȳ..)² = Σi y²i./n − y²../N

and

SSE = SST − SStreatments
Now we need to examine the expected values of SStreatments and SSE. This will lead us to an appropriate statistic for testing the null hypothesis.

The expected value of SStreatments is

E{SStreatments} = (a − 1)σ² + n Σ(i=1..a) τi²

If the null hypothesis is true (all τi = 0), E{SStreatments} = (a − 1)σ². If the alternative hypothesis is true,

E{SStreatments/(a − 1)} = σ² + n Σ τi²/(a − 1)

The ratio MStreatments = SStreatments/(a − 1) is called the mean square for treatments. If H0 is true, MStreatments is an unbiased estimator of σ². If H0 is not accepted, MStreatments estimates σ² plus a positive quantity that comes from variation due to systematic differences in treatment means.

Similarly, the expected value of SSE is a(n − 1)σ², so

MSE = SSE/(N − a) = SSE/[a(n − 1)]

is a pooled, unbiased estimator of σ² whether or not H0 is true. The test statistic is

F0 = MStreatments/MSE = [SStreatments/(a − 1)] / [SSE/(a(n − 1))]

If the null hypothesis is true, the numerator and denominator estimate the same quantity and F0 ≈ 1. If the alternative hypothesis is true, the numerator tends to be larger than the denominator. In other words, we should reject H0 for large values of F0, which also implies that the test is an upper-tail test. So reject H0 if

F0 > F(α, a − 1, a(n − 1))
Now you can prepare a summary ANOVA table for a single-factor experiment:

Source of variation | SS | df | MS | F0
Treatments | SStreatments | a − 1 | MStreatments | MStreatments/MSE
Error | SSE | a(n − 1) | MSE |
Total | SST | an − 1 | |
Ex. The data (a = 4 levels, n = 5 observations per level) are:

Level | Observations | Totals yi. | Averages ȳi.
1 | 575, 542, 530, 539, 570 | 2756 | 551.2
2 | 565, 593, 590, 579, 610 | 2937 | 587.4
3 | 600, 651, 610, 637, 629 | 3127 | 625.4
4 | 725, 700, 715, 685, 710 | 3535 | 707.0
 | | y.. = 12,355 | ȳ.. = 617.75

Conduct a test of whether the mean values at all 4 levels are equal, against the alternative that at least one mean is unequal. Show the results in an F-distribution diagram.
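A minimal sketch of this one-way ANOVA in Python, using the four treatment groups from the table above:

    from scipy.stats import f_oneway

    level1 = [575, 542, 530, 539, 570]
    level2 = [565, 593, 590, 579, 610]
    level3 = [600, 651, 610, 637, 629]
    level4 = [725, 700, 715, 685, 710]

    # One-way fixed-effects ANOVA: F0 = MS_treatments / MSE with (3, 16) df
    F0, p_value = f_oneway(level1, level2, level3, level4)
    print(f"F0 = {F0:.2f}, P = {p_value:.3g}")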
For Two Factors

For a two-factor factorial with a levels of factor A, b levels of factor B, and n replicates per cell (observations yijk, with N = abn in total), the row, column, and grand totals are yi.. = Σj Σk yijk, y.j. = Σi Σk yijk, and y... The sums of squares are

SST = Σi Σj Σk y²ijk − y²../N

SSA = Σi y²i../(bn) − y²../N,  SSB = Σj y².j./(an) − y²../N

SSAB = Σi Σj y²ij./n − y²../N − SSA − SSB

Then SSE = SST − SSA − SSB − SSAB.

Degrees of freedom: dfA = a − 1, dfB = b − 1, dfAB = (a − 1)(b − 1), dfE = N − ab.

Mean squares: MSA = SSA/(a − 1); MSB = SSB/(b − 1); MSAB = SSAB/dfAB; MSE = SSE/(N − ab).
ANOVA table for a two-factor experiment:

Source of variation | SS | df | MS | F0 | F critical (from table) | Significant (yes/no)
A | SSA | a − 1 | MSA | MSA/MSE | |
B | SSB | b − 1 | MSB | MSB/MSE | |
AB | SSAB | (a − 1)(b − 1) | MSAB | MSAB/MSE | |
Error | SSE | N − ab | MSE | | |
Total | SST | abn − 1 | | | |
[Residue of a seven-column, two-level orthogonal-array layout (treatment conditions with levels 1 and 2); the full OA8 array is tabulated below under "Interaction table".]

Step 2: Counting degrees of freedom. For a four-factor, two-level design (A, B, C, D): the main effects A, B, C, D contribute 4 df; the two-factor interactions AB, AC, AD, BC, BD, CD contribute 6 df; the three-factor interactions ABC, ABD, ACD, BCD contribute 4 df; the four-factor interaction contributes 1 df; and the overall mean contributes 1 df. Sum = 16 df.
Step 3: Selecting an OA
The number of TCs equals the number of rows in an OA, which must be at least the total df. The OAs now available run up to OA36 (see table).

OA | No. of rows | Max. no. of factors | Max. 2-level columns | Max. 3-level columns | Max. 4-level columns | Max. 5-level columns
OA4 | 4 | 3 | 3 | - | - | -
OA8 | 8 | 7 | 7 | - | - | -
OA9 | 9 | 4 | - | 4 | - | -
OA12 | 12 | 11 | 11 | - | - | -
OA16 | 16 | 15 | 15 | - | - | -
OA16' | 16 | 5 | - | - | 5 | -
OA18 | 18 | 8 | 1 | 7 | - | -
OA25 | 25 | 6 | - | - | - | 6
OA27 | 27 | 13 | - | 13 | - | -
OA32 | 32 | 31 | 31 | - | - | -
OA32' | 32 | 10 | 1 | - | 9 | -
OA36 | 36 | 23 | 11 | 12 | - | -
OA36' | 36 | 16 | 3 | 13 | - | -
... | ... | ... | ... | ... | ... | ...

So, if df = 13, the next available OA is OA16. Note the geometric progression for the 2-level arrays OA4, OA8, OA16, OA32, ..., which is 2², 2³, 2⁴, 2⁵, ..., and for the 3-level arrays OA9, OA27, OA81, ..., which is 3², 3³, 3⁴, ... OAs can be modified.
Interaction table
Try to avoid confounding (the inability to distinguish the effects of one factor from another factor and/or an interaction). Find which column to use for each factor via an interaction table.

Table: OA8 (two-level orthogonal array)

TC | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
2 | 1 | 1 | 1 | 2 | 2 | 2 | 2
3 | 1 | 2 | 2 | 1 | 1 | 2 | 2
4 | 1 | 2 | 2 | 2 | 2 | 1 | 1
5 | 2 | 1 | 2 | 1 | 2 | 1 | 2
6 | 2 | 1 | 2 | 2 | 1 | 2 | 1
7 | 2 | 2 | 1 | 1 | 2 | 2 | 1
8 | 2 | 2 | 1 | 2 | 1 | 1 | 2

Interaction table for OA8 (the entry at row i, column j is the column in which the interaction of columns i and j appears):

Column | 1 | 2 | 3 | 4 | 5 | 6 | 7
(1) | | 3 | 2 | 5 | 4 | 7 | 6
(2) | | | 1 | 6 | 7 | 4 | 5
(3) | | | | 7 | 6 | 5 | 4
(4) | | | | | 1 | 2 | 3
(5) | | | | | | 3 | 2
(6) | | | | | | | 1
(7) | | | | | | |
Inner array (OA9) for the controllable factors A, B, C, D, crossed with an outer array (OA4) for the noise factors K, L, M:

Inner array (OA9):
TC | A | B | C | D
1 | 1 | 1 | 1 | 1
2 | 1 | 2 | 2 | 2
3 | 1 | 3 | 3 | 3
4 | 2 | 1 | 2 | 3
5 | 2 | 2 | 3 | 1
6 | 2 | 3 | 1 | 2
7 | 3 | 1 | 3 | 2
8 | 3 | 2 | 1 | 3
9 | 3 | 3 | 2 | 1

Outer array (OA4):
Run | K | L | M
1 | 1 | 1 | 1
2 | 1 | 2 | 2
3 | 2 | 1 | 2
4 | 2 | 2 | 1

Each inner-array TC is run once at each of the four noise conditions; the responses are y1, y2, y3, y4 for TC 1, continuing up through y33, y34, y35, y36 for TC 9.

Nine treatment conditions are used for the controllable factors (OA9) and an OA4 is used for the noise factors. Four runs are made under TC 1 (all level 1s), one at each of the noise TCs; the results are y1, y2, y3, and y4. Repeat for all 9 TCs.
Ex. A design team identified high nonconformity, and average scrap cost far exceeding expectation, in the case of an air filter injection process. They sorted out the factors involved, related to materials, machines, mould (method), people, conditions, and environment, and prepared a cause-and-effect diagram and a correlation table between defect cause and phenomenon.

[Table: correlation between defect causes — folding paper; different altitude for cutting; incomplete cut-off in cut line; different altitude for folding paper; different degree of color of background; mould precision (big gear crevice); uncontrolled mould temperature; precision of injection m/c; inaccurate amount of glue; injection conditions (pressure, speed, temperature) — and defect phenomena (felt explosion; material fiber explosion; filter flange; broken/shortage paper; glue non-penetration; paper fabric). An X marks the causes linked to each phenomenon; the exact X placements are not recoverable from the extracted text.]
Table: OA8 design used for the experiment

TC | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
2 | 1 | 1 | 1 | 2 | 2 | 2 | 2
3 | 1 | 2 | 2 | 1 | 1 | 2 | 2
4 | 1 | 2 | 2 | 2 | 2 | 1 | 1
5 | 2 | 1 | 2 | 1 | 2 | 1 | 2
6 | 2 | 1 | 2 | 2 | 1 | 2 | 1
7 | 2 | 2 | 1 | 1 | 2 | 2 | 1
8 | 2 | 2 | 1 | 2 | 1 | 1 | 2

[Data table: each of the eight runs has multiple response observations; the row-wise y-averages are 71.875, 64.75, 69.75, 75.375, 77.125, 74.125, 76.625, and 72.75. The individual observations are not reliably recoverable from the extracted text.]

For the OA9 control-factor experiment, the S/N_L values per treatment condition of the response variable y are:

TC | A | B | C | D | S/N_L
1 | 1 | 1 | 1 | 1 | 36.9907
2 | 1 | 2 | 2 | 2 | 35.92
3 | 1 | 3 | 3 | 3 | 36.66
4 | 2 | 1 | 2 | 3 | 37.14
5 | 2 | 2 | 3 | 1 | 37.61
6 | 2 | 3 | 1 | 2 | 36.96
7 | 3 | 1 | 3 | 2 | 37.55
8 | 3 | 2 | 1 | 3 | 36.69
9 | 3 | 3 | 2 | 1 | 36.82

In the larger-is-better case: S/N_L = −10 log10[ (1/n) Σ(i=1..n) 1/y²i ]

In the smaller-is-better case: S/N_S = −10 log10[ (1/n) Σ(i=1..n) y²i ]
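A minimal sketch of the two S/N computations in Python (the sample y values here are hypothetical illustration data for a single TC):

    import math

    def sn_larger_is_better(y):
        # S/N_L = -10 * log10((1/n) * sum(1 / y_i^2))
        return -10 * math.log10(sum(1 / yi**2 for yi in y) / len(y))

    def sn_smaller_is_better(y):
        # S/N_S = -10 * log10((1/n) * sum(y_i^2))
        return -10 * math.log10(sum(yi**2 for yi in y) / len(y))

    y = [76, 81, 78, 72]   # hypothetical responses for one TC
    print(f"S/N_L = {sn_larger_is_better(y):.2f} dB")
    print(f"S/N_S = {sn_smaller_is_better(y):.2f} dB")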
[Figure: average response ȳ plotted against the levels (low, med, high) of each control variable; averages such as 70.17, 71.58, and 74.67 across the levels of A, and values such as 72.917 and 74.333 for C and D, are recoverable from the residue.]

Similarly, we can draw such diagrams for the other variables. Finally, we draw:

[Figure: the average response y for all control variables separately, and the average S/N_L for all control variables separately, each plotted against the levels (low, med, high), connecting the dots with straight lines.]

Then, looking at both of the last charts, for maximizing the response variable y the optimal levels of the control variables are tabulated:

Controllable variable | A | B | C | D
Level | | | |

You can do the confirmation test afterward and validate your experimental results.
REGRESSION ANALYSIS
A methodology to develop and utilize the relation between two or more quantitative variables in order to predict one variable from the other(s). It has a wide range of uses.
Difference between correlation and regression
Correlation shows the direction and strength of association between two random variables (both normally distributed). Here there is no need to consider which variable is independent or dependent, so correlation does not say how a one-unit change in X or Y changes the other.
Linear regression is about the dependency of one variable (Y) on one or more other variables (X's). It expresses the change in the response due to changes in the predictors, so regression analysis serves a model-building purpose.
A model is a representation of the world, but not a perfect one; a model is not a duplication.
Why? Error!
Data = model + error/residual. If data = model, the representation would be a perfect duplication.
Statistical relation
There is no perfect relation. Say the year-end evaluation of employees' performance is the dependent or response variable, Y, and the mid-year evaluation is the independent, explanatory, regressor, or predictor variable, X. Y given X is a conditional random variable with conditional mean μ_(Y|x). Its estimated model is

Yi = β0 + β1 xi + εi

where i represents the ith trial (i = 1, 2, 3, ..., n), the betas are regression parameters/coefficients, and εi is the random error term with mean E{εi} = 0 and variance σ²{εi} = σ². So the response Yi is also normally distributed and has the same constant variance: σ²{Yi} = σ². That is, the probability distributions of Y have the same variance σ², regardless of the level of X.
5. The error terms are uncorrelated: the outcome in any one trial has no effect on the error term for any other trial. So are the response terms, say Yi and Yj.
6. In summary, the responses Yi come from probability distributions whose means are β0 + β1 xi and whose variance is σ².
Ex. Suppose that the following regression model is appropriate in a certain case:

Yi = 9.5 + 2.1 xi + εi

An equivalent centered form of the model is Yi = β0* + β1(xi − x̄) + εi.
Data for Regression Analysis
Observational data: non-experimental, preselected, or uncontrolled data. Adequate care is needed before drawing cause-and-effect relationships.
Experimental data: from a controlled experiment, e.g., productivity vs. length of training. Take several workers, train them, and record performance for several weeks. The length of training is called a treatment. Completely randomized design: every experimental unit has an equal chance to receive any one of the treatments.
Overview of steps in Regression Analysis

[Flowchart: Start → exploratory data analysis → identify the most suitable model → is one or more of the regression models suitable for the data at hand? If no, revise and develop new models and repeat; if yes, make inferences on the basis of the regression model → Stop.]
Step 2. Does the predicted straight line pass through the origin (Yi = β1 xi)? Or does it fit an equation like Yi = β0 + β1 xi? Then, what are good estimated values of β0 and β1 for a best-fit line? To solve this problem, the method used is the least squares method.
Line fitting using the Least Squares Method
Suppose the line that best fits the data is Yi = β0 + β1 xi. The ith fitted/predicted value is defined by Ŷi = b0 + b1 xi.
The ith random error, or residual, is the difference between the ith observed value and the ith predicted value:

ei = Yi − (b0 + b1 xi)
[Figure: scatter plot of Y values against X values with the fitted line.]
The least squares criterion:

Q = SSE = Σ e²i = Σ (Yi − Ŷi)² = Σ(1≤i≤n) [Yi − (β0 + β1 xi)]²

It should be as small as possible. The distribution of the residuals is ei ~ N(0, σ²).
For the regression model, the estimators should have values that minimize Q for any given set of sample data. Using the rules of summation, the following normal equations are obtained:

n b0 + b1 Σ xi = Σ Yi  and  b0 Σ xi + b1 Σ x²i = Σ xi Yi

Solving,

b1 = [n Σ xi Yi − (Σ xi)(Σ Yi)] / [n Σ x²i − (Σ xi)²]

or, in simpler form,

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  and  b0 = ȳ − b1 x̄
The normal equations follow from setting the partial derivatives of Q = SSE to zero:

∂Q/∂β0 = −2 Σ [yi − (β0 + β1 xi)] = 0  and  ∂Q/∂β1 = −2 Σ [yi − (β0 + β1 xi)] xi = 0

Or,

n b0 + b1 Σ xi = Σ yi   (1)
b0 Σ xi + b1 Σ x²i = Σ xi yi   (2)

Then,

b1 = [n Σ xi yi − (Σ xi)(Σ yi)] / [n Σ x²i − (Σ xi)²]  and  b0 = ȳ − b1 x̄

These estimated values are logical and thus good estimators of β0 and β1.
Useful summation properties:

1. Σ (xi − x̄) = 0
2. Σ (xi − x̄)(yi − ȳ) = Σ (xi − x̄) yi
3. Σ (xi − x̄)(yi − ȳ) = [n Σ xi yi − (Σ xi)(Σ yi)] / n
4. Σ (xi − x̄)² = Σ (xi − x̄) xi
5. Σ (xi − x̄)² = Σ x²i − (Σ xi)²/n
Distribution of b1
To develop confidence intervals or test hypotheses about the slope of the regression line, we need to know its distribution. The least squares estimator b1 is an unbiased estimator of β1. Now

b1 = [n Σ xi yi − (Σ xi)(Σ yi)] / [n Σ x²i − (Σ xi)²]

Applying summation properties 2, 3, and 5,

b1 = Σ (xi − x̄) yi / Σ (xi − x̄)² = Σ cj Yj,  where cj = (xj − x̄) / Σ (xi − x̄)², j = 1, 2, ..., n

with the cj fixed quantities, and

b0 = ȳ − b1 x̄

Then variance{b1} = σ² / Σ (xi − x̄)², so

b1 ~ N( β1, σ² / Σ (xi − x̄)² )

Properties of cj: Σ cj = 0, Σ cj xj = 1, Σ c²j = 1 / Σ (xi − x̄)².
Distribution of b0

b0 ~ N( β0, σ² Σ x²i / [n Σ (xi − x̄)²] )

σ² denotes the variability of each of the random variables Yi about the true regression line. We use information concerning the variability of the data points about the fitted regression line: since a residual measures the unexplained random deviation of a data point from the estimated regression line, the residuals are used to estimate σ². The estimator for σ² is

s² = σ̂² = SSE/(n − 2)
Ex. Given the data

i: 1, 2, 3
x: 20, 55, 30
y: 5, 12, 10

apply the least squares SSE (also called Q) criterion to fit the regression line.
Ex. A company manufactures refrigerator equipment and many replacement parts. In the case of one part (produced periodically in lots of varying sizes), the company wishes to determine the optimum lot size. Production involves an unavoidable set-up of the process, plus machining and assembly operations. One key issue in determining the optimum lot size is the relationship between lot size and the labor hours required to produce the lot. Data are collected to develop that relationship under a stable production condition.

Production run i | Lot size xi | Work hours Yi
1 | 80 | 399
2 | 30 | 121
3 | 50 | 221
... | ... | ...
23 | 40 | 244
24 | 80 | 342
25 | 70 | 323
Total | 1,750 | 7,807
Mean | 70.0 | 312.28

Summary quantities: Σ(xi − x̄)(Yi − Ȳ) = 70,690; Σ(xi − x̄)² = 19,800; Σ(Yi − Ȳ)² = 307,203.
If you use software, you will get a table like the one below:

Predictor | Coefficient | St.dev | t-ratio | P
Constant | 62.37 | 26.18 | 2.38 | 0.026
x | 3.5702 | 0.3470 | 10.29 | 0.000

S = 48.82, R-sq = 82.2%, R-sq(adj) = 81.4%

Write the regression equation.
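A minimal sketch reproducing these estimates in Python from the summary quantities above (b1 = Sxy/Sxx, b0 = ȳ − b1·x̄):

    # Summary statistics from the lot size example
    S_xy, S_xx, S_yy = 70_690.0, 19_800.0, 307_203.0
    x_bar, y_bar, n = 70.0, 312.28, 25

    b1 = S_xy / S_xx                    # slope ~ 3.5702
    b0 = y_bar - b1 * x_bar             # intercept ~ 62.37

    sse = S_yy - b1 * S_xy              # SSE = Syy - b1*Sxy ~ 54,825
    s = (sse / (n - 2)) ** 0.5          # residual std. error ~ 48.8
    r_sq = 1 - sse / S_yy               # coefficient of determination ~ 0.822

    print(f"b0 = {b0:.2f}, b1 = {b1:.4f}, s = {s:.2f}, R-sq = {r_sq:.3f}")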
Point Estimation of Mean Response
Suppose the sample estimators of the parameters have been computed. The mean (expected) response of the regression function is then Ŷ = b0 + b1 x. Plot it alongside the scatter plot and comment on whether it represents a good relationship between lot size and work hours.
Production run i | Lot size xi | Work hours yi | Estimated mean response ŷi | Residual ei = yi − ŷi | Squared residual e²i
1 | 80 | 399 | 347.98 | 51.02 | 2,603.0
2 | 30 | 121 | 169.47 | ... | ...
3 | 50 | 221 | 240.88 | ... | ...
... | ... | ... | ... | ... | ...
23 | 40 | 244 | ... | ... | ...
24 | 80 | 342 | ... | ... | ...
25 | 70 | 323 | ... | ... | ...
Total | 1,750 | 7,807 | 7,807 | 0 | 54,825
Mean | 70.0 | 312.28 | | |
Residuals
The ith residual is ei = Yi − Ŷi = Yi − (b0 + b1 xi), the sample counterpart of the model error εi = Yi − E{Yi}. Properties of the residuals:

1. The residuals sum to zero: Σ ei = 0.
2. The sum of squared residuals, Σ e²i, is a minimum; this is the least squares criterion by which the line was estimated.
3. The sum of the observed values Yi equals the sum of the fitted values ŷi: Σ Yi = Σ ŷi.
4. The sum of the residuals weighted by xi is zero: Σ xi ei = 0.
5. The sum of the weighted residuals is zero when the residual in a trial is weighted by the fitted value of the response variable for that trial: Σ ŷi ei = 0.
Recall s² = Σ (yi − ȳ)²/(n − 1); the numerator is called a sum of squares. The variance of each observation yi for the above regression model is σ², the same as that of the error term εi. The yi come from different probability distributions with different means that depend upon the level of xi. The deviation, or residual, of an observation yi from the estimated mean is

ei = yi − ŷi

The sum of squares of the residuals, called the error (or residual) sum of squares, is defined as

SSE = Σ e²i = Σ (yi − ŷi)² = Σ(1≤i≤n) [yi − (b0 + b1 xi)]²

SSE is a measure of the deviation of the data points from the predicted line. The method of least squares finds those values of β0 and β1 that minimize SSE; SSE is a function of β0 and β1:

SSE = Σ(1≤i≤n) [yi − (β0 + β1 xi)]² = f(β0, β1)
SSE has n − 2 degrees of freedom: one degree of freedom is lost for each estimated beta. Hence the error mean square (or residual mean square) is defined as

MSE = SSE/(n − 2) = Σ e²i /(n − 2)

MSE is an unbiased estimator of σ²; that is, E{MSE} = σ².

Ex. Find the variance σ² in the previous example.

Define

Sxx = Σ (xi − x̄)² = Σ x²i − n x̄²
Syy = Σ (yi − ȳ)² = Σ y²i − n ȳ²
Sxy = Σ (xi − x̄)(yi − ȳ) = Σ xi yi − n x̄ ȳ
The sum of squares decomposes into a sum of three terms:

f(β0, β1) = Syy(1 − r²) + (β1 √Sxx − r √Syy)² + n[ȳ − (β0 + β1 x̄)]²

Here the second and third summands depend on the unknown parameters β0 and β1. Let's attempt to minimize f(β0, β1) by choosing these parameters so that the 2nd and 3rd terms equal zero. That is,

(β1 √Sxx − r √Syy)² = 0   ... (1)
ȳ − (β0 + β1 x̄) = 0   ....... (2)

Solve these two equations and find the values of β0 (intercept) and β1 (slope). The estimated values are

b1 = r √(Syy/Sxx) = Sxy/Sxx, which is an unbiased estimator of β1; this estimator is normally distributed with variance σ²/Sxx,

b0 = ȳ − b1 x̄, which is normally distributed with variance σ²(Σ x²i)/(n Sxx),

and s² = SSE/(n − 2) is the unbiased estimator of σ².
Note that the fitted line passes through (x̄, ȳ): ȳ = b0 + b1 x̄.

You can reach the same findings by differentiating SSE with respect to β0 and β1:

SSE = Σ(1≤i≤n) [yi − (β0 + β1 xi)]²
∂(SSE)/∂β0 = −2 Σ [yi − (β0 + β1 xi)] = 0  and  ∂(SSE)/∂β1 = −2 Σ [yi − (β0 + β1 xi)] xi = 0

Even though the regression equation actually estimates the mean of Y for a given value of x, it is used extensively to estimate the value of Y itself. It is an estimated average value:

ŷ = Ŷ = b0 + b1 x
Test statistic on β1, the slope of the regression line: if H0: β1 = 0 is true, then SSR = 0 (the regression model explains none of the variation). The statistic

t0 = (b1 − β1,0) / (s/√Sxx)

follows t_(n−2) under H0; reject H0 if |t0| ≥ t(α/2, n−2). Find the critical value from the t-table and make a conclusion. A confidence interval for the slope is

b1 ± t(α/2, n−2) s/√Sxx
Test statistic on β0, the intercept:

t0 = (b0 − β0,0) / ( s √(Σ x²i /(n Sxx)) ) ~ t_(n−2)

The estimated mean response at x, Ŷx = b0 + b1 x, is distributed as

Ŷx ~ N( μ_(Y|x), [1/n + (x − x̄)²/Sxx] σ² )

A confidence interval for the mean response at x is

b0 + b1 x ± t(α/2, n−2) s √( 1/n + (x − x̄)²/Sxx )

The prediction interval for the response variable at a given value of x is

b0 + b1 x ± t(α/2, n−2) s √( 1 + 1/n + (x − x̄)²/Sxx )
The total variation, or total sum of squares, is defined as SST = Σ (yi − ȳ)². The magnitude of SST indicates the level of uncertainty when the predictor variable X is not taken into account. SST can be decomposed as SST = SSE + SSR, where the regression sum of squares (the sum of squared deviations of the fitted values from the mean) is

SSR = Σ (ŷi − ȳ)² = SST − SSE

Breakdown of degrees of freedom: associated with SST, df = n − 1; with SSE, df = n − 2; with SSR, df = 1.
SST = SSE + SSR, and their degrees of freedom are also additive, i.e., n − 1 = (n − 2) + 1.
Mean Squares (MS)
An MS is the sum of squares divided by its respective df. The regression mean square is MSR = SSR/df = SSR/1, and the error mean square is MSE = SSE/(n − 2).
An important note is that the two mean squares MSR and MSE do not add to SST divided by its df; that is, mean squares are not additive.
Two important implications of the mean squares are:
The mean of the sampling distribution of MSE is σ² whether or not X and Y are linearly related, i.e., whether or not β1 = 0.
E{MSR} = σ² + β1² Σ(xi − x̄)², so if β1 ≠ 0, MSR tends to exceed MSE.

ANOVA table for regression:

Source of variation | SS | df | MS | E{MS}
Regression | SSR = Σ(ŷi − ȳ)² | 1 | MSR = SSR/1 | σ² + β1² Σ(xi − x̄)²
Error | SSE = Σ(yi − ŷi)² | n − 2 | MSE = SSE/(n − 2) | σ²
Total | SST = Σ(yi − ȳ)² | n − 1 | |

A modified table adds a correction for the mean: SS(correction) = nȳ², with the uncorrected total SSTU = Σ y²i, so that SST = SSTU − nȳ².
F Test of β1 = 0 versus β1 ≠ 0
For regression analysis, the ANOVA approach provides us with a test for
H0: β1 = 0
H1: β1 ≠ 0
Test statistic: for ANOVA it is denoted by F0. It compares MSR and MSE in the following fashion:

F0 = MSR/MSE

Large values of F0 support H1 and values of F0 near 1 support H0. That is, the appropriate test is an upper-tail test.
Coefficient of Determination
We have seen that SST is a measure of the uncertainty in predicting Y when no account of the predictor is taken, while SSE measures the variation in the Yi when a regression model utilizing the predictor variable X is employed. The reduction in the uncertainty in predicting Y due to X, expressed as a proportion of the total variation (SST − SSE = SSR), is

R² = SSR/SST = 1 − SSE/SST,  with 0 ≤ SSE ≤ SST so that 0 ≤ R² ≤ 1

We may interpret R² as the proportionate reduction of total variation associated with the use of the predictor variable X. The larger R² is, the more the total variation of Y is reduced by using X.
When the fitted regression line is horizontal, so that b1 = 0 and Ŷi = Ȳ, then SSE = SST and R² = 0; there is then no linear association between X and Y in the sample data, and the predictor variable X is of no help in reducing the variation in the observations Yi with linear regression.
In computer output, R² is labeled R-sq (in percent form), and R-sq(adj) is the adjusted coefficient of determination, in which each sum of squares is divided by its respective degrees of freedom.
R² and r carry different meanings for linear regression:
A high r does not necessarily mean that a useful prediction can be made.
A high r does not necessarily mean that the regression line is a good fit.
A zero value of r does not necessarily indicate that x and y are unrelated.
Adjusted R²
The reduced value of R² attempts to estimate the value of R² in the population:

Adj R² = 1 − (1 − R²)(N − 1)/(N − k − 1), where N is the number of observations and k is the number of predictors.
Lack-of-fit Tests
For experimental data, a regression model can be fitted even when the relationship was previously unknown, so check whether the model is correct with a lack-of-fit test.
To test, divide SSE as SSE = SS due to pure error (SSpe) + SS due to lack of fit (SSlof). You can calculate the first one directly, and then
SSlof = SSE − SSpe.
For SSpe, take repeated observations on y for at least one level/value of x.
Hypotheses: H0: the model adequately fits the data; H1: the model does not adequately fit the data.
Ex. Data:

x: 2.0, 2.0, 3.0, 4.4, 4.4, 5.1, 5.1, 5.1, 5.8, 6.0
y: 2.4, 2.9, 3.2, 4.9, 4.7, 5.7, 5.7, 6.0, 6.3, 6.5

There are repeated observations on y at the levels x = 2.0, 4.4, and 5.1. Construct the regression line.
For SSpe:

x | y values | ȳ | Σ(yi − ȳ)² | df
2.0 | 2.4, 2.9 | 2.65 | 0.125 | 2 − 1 = 1
4.4 | 4.9, 4.7 | 4.80 | 0.02 | 2 − 1 = 1
5.1 | 5.7, 5.7, 6.0 | 5.80 | 0.06 | 3 − 1 = 2

SSpe = Σ (yi − ȳ)², summed over the repeat groups.

Source of variation | SS | df | MS | F0
Regression | SSR | 1 | MSR | MSR/MSE
Error/residual | SSE | 8 | MSE |
Pure error | SSpe | 4 | MSpe |
Lack of fit | SSlof | 4 | MSlof | MSlof/MSpe

From the F-table, F(0.05, 4, 4) = 6.39. Do not reject H0 if F0 ≤ F(0.05, 4, 4); otherwise conclude lack of fit.
Assuming E{εi} = 0, the regression function for the two-predictor model is E{Y} = β0 + β1 Xi1 + β2 Xi2, with εi = Yi − E{Yi}.

First-order model with more than two predictor variables (the general linear model):

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + β(k−1) X(i,k−1) + εi

For the first observation this reads

Y1 = β0 + β1 x11 + β2 x21 + ... + βk xk1 + ε1

Now we can write down the corresponding equations for all the x data (i = 1, 2, ..., n), then write the matrices as above and solve.
Matrix Approach to the Least Squares Method
In a complex model having several X variables, the matrix form eases the regression analysis. We can:
a. Express the general linear model in matrix form.
b. Find a matrix expression for the normal equations.
c. Find expressions for the least squares estimates by solving the normal equations.
d. Apply the results obtained to polynomial and multiple linear regression models.

The general linear model: Yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi, i = 1, 2, ..., n.

Its expanded (matrix) forms are:

Random error vector: ε = [ε1, ε2, ..., εn]'

Design matrix X (n × (k+1)), one row per observation:

X = [ 1  x11  x21  ...  xk1
      1  x12  x22  ...  xk2
      ...
      1  x1n  x2n  ...  xkn ]

To change from one model to another, simply change this matrix. The product Xβ is an n × 1 matrix, so the multiple regression model in matrix form is

Y = Xβ + ε

To find the matrix formulation of the normal equations, consider the matrix X'X:

X'X = [ n        Σx1i       Σx2i      ...  Σxki
        Σx1i     Σx²1i      Σx1i x2i  ...  Σx1i xki
        Σx2i     Σx1i x2i   Σx²2i     ...  Σx2i xki
        ...
        Σxki     Σx1i xki   Σx2i xki  ...  Σx²ki ]

and

X'y = [ Σyi, Σx1i yi, Σx2i yi, ..., Σxki yi ]'

The normal equations are X'X b = X'y, and the least squares solution is b = (X'X)⁻¹ X'y.
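A minimal sketch of the normal-equation solution in Python with NumPy (the small x/y arrays are hypothetical illustration values):

    import numpy as np

    # Hypothetical data: two predictors, five observations
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    y  = np.array([3.1, 3.9, 7.2, 7.8, 10.9])

    # Design matrix with a leading column of ones for the intercept
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Solve the normal equations X'X b = X'y (lstsq is the stable way to do it)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"b0 = {b[0]:.3f}, b1 = {b[1]:.3f}, b2 = {b[2]:.3f}")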
Ex. An equation is to be developed for the performance of workers based on their skill levels and years of service. The data are as follows:

Worker no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Skill level | 1.35 | 1.90 | 1.70 | 1.80 | 1.30 | 2.05 | 1.60 | 1.80 | 1.85 | 1.40
Years in service | 17.9 | 16.5 | 16.4 | 16.8 | 18.8 | 15.5 | 17.5 | 16.4 | 15.9 | 18.3