Randomization Test For Multiple Regression
Multiple Regression
Justification of the Randomization Approach in Multiple Regression
1. From the assumption that the observations are a random sample from a
population of possible observations, in which Y may be independent of
the X variables;
2. From the random allocation of the X values to experimental units,
after which the Y values are observed as responses that may not be
affected by the X variables; or
3. From general considerations which suggest that, if Y is independent
of the X values, then the mechanism generating the data makes any of
the Y values equally likely to occur with any of the sets of X values.
Randomizing Observations
This method was proposed by Manly (1997), though there is some controversy
about it. It can be done as follows:
1. The variable Y is regressed on the full set of predictors to derive the OLS
estimates of the model parameters (the b's). The t-statistic for each
coefficient may also be calculated as ti = bi / SE(bi).
2. The Y values are randomly permuted over the cases and step 1 is repeated
on the permuted data.
3. Step 2 is repeated many times; the significance level for each statistic is
the proportion of randomizations (counting the observed ordering as one) that
give a value as extreme as or more extreme than the observed value.
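A minimal sketch of this procedure in Python, assuming plain OLS via numpy;
the helper names ols and randomize_observations are illustrative, not from
Manly (1997):

```python
# Sketch of Manly's "randomize the observations" test for multiple regression.
import numpy as np

def ols(X, y):
    """OLS fit with intercept; returns coefficients and t-statistics."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])            # design matrix with intercept
    beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - Xd.shape[1])           # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return beta, beta / se                           # estimates, t = b / SE(b)

def randomize_observations(X, y, n_perm=4999, rng=None):
    """p-values for each coefficient from permuting the Y values."""
    rng = np.random.default_rng(rng)
    _, t_obs = ols(X, y)
    count = np.zeros_like(t_obs)
    for _ in range(n_perm):
        _, t_perm = ols(X, rng.permutation(y))       # re-fit on shuffled Y
        count += np.abs(t_perm) >= np.abs(t_obs)
    # the observed ordering counts as one of the possible randomizations
    return (count + 1) / (n_perm + 1)
```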
Randomizing the Residuals under the Full Model
This method was proposed by Ter Braak (1992).
1. The variable Y is regressed on the full set of predictors to derive the OLS
estimates of the model parameters (the b's). The t-statistic for each
coefficient may also be calculated as ti = bi / SE(bi).
2. The residuals are computed as Ri = Yi - (b0 + b1*X1i + ... + bp*Xpi).
3. Randomize (permute) the residuals and regress the randomized R's on the
full set of X's to derive new parameter estimates (the b*'s) and their
t-statistics, which are compared with the observed values over many
randomizations.
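A matching sketch of Ter Braak's scheme, reusing the hypothetical ols()
helper from the previous block:

```python
# Sketch of Ter Braak's (1992) residual randomization under the full model.
import numpy as np

def randomize_residuals_full(X, y, n_perm=4999, rng=None):
    """p-values from permuting the residuals of the full model."""
    rng = np.random.default_rng(rng)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, t_obs = ols(X, y)
    resid = y - Xd @ beta                            # step 2: full-model residuals
    count = np.zeros_like(t_obs)
    for _ in range(n_perm):
        # step 3: regress permuted residuals on the full set of X's
        _, t_perm = ols(X, rng.permutation(resid))
        count += np.abs(t_perm) >= np.abs(t_obs)
    return (count + 1) / (n_perm + 1)
```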
Randomizing the Residuals under the Reduced Model (a)
Randomizing the Residuals under the Reduced Model (b)
Colony   Altitude  1/alt  Annual precip.  Max. temp  Min. temp  HK 1.00 freq
PD+SS    0.5       2      58              97         16         98
SB       0.8       1.25   20              92         32         36
WSB      0.57      1.75   28              98         26         72
JRC+JRH  0.55      1.82   28              98         26         67
SJ       0.38      2.63   15              99         28         82
CR       0.93      1.08   21              99         28         72
MI       0.48      2.08   24              101        27         65
UO+LO    0.63      1.59   10              101        27         1
DP       1.5       0.67   19              99         23         40
PZ       1.75      0.57   22              101        27         39
MC       2         0.5    58              100        18         9
HH       4.2       0.24   36              95         13         19
IF       2.5       0.4    34              102        16         42
AF       2         0.5    21              105        20         37
SL       6.5       0.15   40              83         0          16
GH       7.85      0.13   42              84         5          4
EP       8.95      0.11   57              79         -7         1
GL       10.5      0.1    50              81         -12        4
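For illustration, the data above can be analyzed with the two sketches. The
choice of predictors here (1/altitude, annual precipitation, maximum and
minimum temperature, with HK 1.00 frequency as the response) is an assumption
suggested by the four-predictor results below, not something the slides state:

```python
# Hypothetical usage of the sketched tests on the colony data.
import numpy as np

inv_alt = np.array([2, 1.25, 1.75, 1.82, 2.63, 1.08, 2.08, 1.59, 0.67,
                    0.57, 0.5, 0.24, 0.4, 0.5, 0.15, 0.13, 0.11, 0.1])
precip  = np.array([58, 20, 28, 28, 15, 21, 24, 10, 19, 22, 58, 36, 34,
                    21, 40, 42, 57, 50])
max_t   = np.array([97, 92, 98, 98, 99, 99, 101, 101, 99, 101, 100, 95,
                    102, 105, 83, 84, 79, 81])
min_t   = np.array([16, 32, 26, 26, 28, 28, 27, 27, 23, 27, 18, 13, 16,
                    20, 0, 5, -7, -12])
hk_freq = np.array([98, 36, 72, 67, 82, 72, 65, 1, 40, 39, 9, 19, 42,
                    37, 16, 4, 1, 4])

X = np.column_stack([inv_alt, precip, max_t, min_t])
print(randomize_observations(X, hk_freq, rng=1))     # p-values, Manly-style
print(randomize_residuals_full(X, hk_freq, rng=1))   # p-values, Ter Braak-style
```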
Results:

Method                     Statistic  B        p.val
Randomizing observations   t1         26.1237  0.014
                           t2         0.47199  0.357
                           t3         0.86677  0.435
                           t4         0.2503   0.793
                           f          5.957    0.007
Randomizing residuals      t1         26.1237  0.01
                           t2         0.47199  0.349
                           t3         0.86677  0.482
                           t4         0.2503   0.814
                           f          5.957    0.01
Using t- and F-tables      t1         26.1237  0.00982
                           t2         0.47199  0.35823
                           t3         0.86677  0.4729
                           t4         0.2503   0.80986
                           f          5.957    0.005984
Results:

Method                     Variable  F       p.val
Randomizing observations   x1        21.847  0.003
                           x2        0.1998  0.652
                           x3        1.72    0.206
                           x4        0.0603  0.793
Randomizing residuals      x1        21.847  0.002
                           x2        0.1998  0.671
                           x3        1.72    0.21
                           x4        0.0603  0.814
Using t- and F-tables      x1        21.847  0.000435
                           x2        0.1998  0.662218
                           x3        1.72    0.212387
                           x4        0.0603  0.809863
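The per-variable F values above read like partial (extra sum of squares) F
statistics comparing the full model with the model omitting one predictor; the
slides do not spell out how the F column was computed, so that reading is an
assumption. A minimal sketch of a permutation version of such a test:

```python
# Sketch of a permutation partial F-test for one predictor
# (full vs. reduced model, extra sum of squares).
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

def partial_f(X, y, j):
    """Partial F statistic for dropping column j from the full model."""
    n, p = X.shape
    rss_full, rss_red = rss(X, y), rss(np.delete(X, j, axis=1), y)
    return (rss_red - rss_full) / (rss_full / (n - p - 1))

def perm_partial_f(X, y, j, n_perm=999, rng=None):
    """p-value for predictor j from permuting the observations."""
    rng = np.random.default_rng(rng)
    f_obs = partial_f(X, y, j)
    f_perm = [partial_f(X, rng.permutation(y), j) for _ in range(n_perm)]
    return (1 + sum(f >= f_obs for f in f_perm)) / (n_perm + 1)
```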
case y x1 x2
1 99 33 2.09
2 5.94 1.97 0.87
3 103.45 2.65 2.83
4 8.33 2.72 0.42
5 3.83 0.7 2.8
6 2.82 0.94 1.93
7 4.19 0.76 0.82
8 2.86 0.95 0.31
9 4.18 1.39 1.2
10 3.47 0.69 1.17
11 38.09 1.28 1.79
12 25.62 1.5 1.53
13 7.54 2.33 2.35
14 5.51 0.82 0.19
15 10.13 2.8 0.72
16 4.66 1.55 2.74
17 4.15 0.76 2.29
18 19.97 0.46 2.93
19 4.23 0.7 2.02
20 8.83 1.98 1.94
The first 19 values of X1 are from (0, 3); the 20th value of X1 is 33, which
serves as an outlier. X2 is from (0, 3).
Y = 3*X1 + e, with e ~ exp(3)
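How data of this form could be generated, assuming "(0, 3)" means uniform on
(0, 3) and that the errors are exponential with mean 3 (the slide's "exp(3)"
does not say whether 3 is the mean or the rate):

```python
# Hypothetical recipe for Simulation A-style data.
import numpy as np

rng = np.random.default_rng(0)
x1 = np.append(rng.uniform(0, 3, size=19), 33.0)  # 20th value is the outlier
x2 = rng.uniform(0, 3, size=20)                   # X2 unrelated to Y
eps = rng.exponential(scale=3.0, size=20)         # assumed mean-3 exponential errors
y = 3 * x1 + eps                                  # Y = 3*X1 + e
```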
Simulation A
Results:

Method                     Statistic  B      p.val
Randomizing observations   t1         2.68   0.0304
                           t2         9.34   0.1242
                           f          9.254  0.0104
Randomizing residuals      t1         2.68   0.0454
                           t2         9.34   0.1014
                           f          9.254  0.0454
Using t- and F-tables      t1         2.68   0.00151
                           t2         9.34   0.11634
                           f          9.254  0.00191

Method                     Variable  F       p.val
Randomizing observations   x1        15.77   0.0162
                           x2        2.7379  0.1242
Randomizing residuals      x1        15.77   0.0454
                           x2        2.7379  0.1014
Using t- and F-tables      x1        15.77   0.000987
                           x2        2.7379  0.116337
Simulation B
A simulation experiment was carried out to investigate which method performs
best when there are outliers and highly non-normal errors. The 20 random
errors were generated as before and, using the same predictors, the values of
Y were calculated from Y = B1*X1 + B2*X2 + e for various choices of B1 and B2.
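A sketch of one cell of this power study, reusing the randomize_observations
helper sketched earlier; reject_rates is a hypothetical name, and n_sets and
n_perm are kept small for speed, unlike the 5000 data sets used on the slides:

```python
# One cell of the power study: generate n_sets data sets with
# Y = b1*X1 + b2*X2 + e and count rejections at the 5% level.
import numpy as np

def reject_rates(x1, x2, b1, b2, n_sets=200, seed=0):
    rng = np.random.default_rng(seed)
    X = np.column_stack([x1, x2])
    hits = np.zeros(2)                                   # rejections for t1, t2
    for _ in range(n_sets):
        eps = rng.exponential(scale=3.0, size=len(x1))   # assumed error model
        y = b1 * x1 + b2 * x2 + eps
        p = randomize_observations(X, y, n_perm=199, rng=rng)
        hits += p[1:] <= 0.05                            # drop the intercept p-value
    return 100 * hits / n_sets                           # percentage significant
```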
Results from:

           Randomizing observations   Randomizing residuals    Using t- and F-tables
B1   B2    t1     t2     f            t1     t2     f          t1     t2     f
0    0     5.1    5.3    5            7.1    4.1    5.6        5.2    2.4    5.4
2    0     25.3   3.8    18.5         46.9   4.2    36.5       94.8   2.1    42.5
4    0     91.8   3.8    91.1         49.6   4.3    49.8       100    2.5    100
0    5     4.9    18.9   8.5          6.2    18     8          4.7    14.3   8.3
0    10    5.3    43.4   16.5         6.7    44.9   17.9       5.1    39.3   24.2
2    5     21.8   15.2   24           47.5   16.7   40.5       94.8   13     54.5
4    10    93.2   40.3   94.9         48     44.6   50.2       100    38.6   100
Results of a simulation experiment to examine the properties of three methods
for analyzing multiple regression data. The tabulated values are the
percentages of results from 5000 sets of data where the test statistics t1
(for B1), t2 (for B2) and F (for the whole equation) were significant at the
5% level. Percentages shown in italics (outside the range 4.4% to 5.6%) are
significantly different at the 5% level from what is desired (also 5%). Bold
type indicates that the method concerned gave the lowest observed power when
the null hypothesis was not true.
Findings from the simulation experiment
Randomizing observations gave the best performance in terms of keeping the
percentage of significant results close to 5% when the null hypothesis was
true. There were 35,000 tests altogether under these conditions, for which
randomizing residuals gave 5.46% significant results and using the t- and
F-distributions gave 3.9% significant results.
In terms of power when the null hypothesis was not true, using the t- and
F-distributions gave the best results overall, although sometimes this method
gave the lowest power. There were 70,000 tests carried out under these
conditions. Of these, 43.1% were significant by randomizing observations,
37.08% were significant by randomizing the residuals, and 58.88% were
significant using the t- and F-distributions.
Simulation C
To examine the effect of correlation between predictors, another simulation
was carried out. Using the data in Simulation A, X2 was replaced by
X2' = 0.5*X1 + X2. The result is that the correlation of the two predictors
is 0.972.
case y x1 x2
1 99 33 18.59
2 5.94 1.97 1.855
3 103.45 2.65 4.155
4 8.33 2.72 1.78
5 3.83 0.7 3.15
6 2.82 0.94 2.4
7 4.19 0.76 1.2
8 2.86 0.95 0.785
9 4.18 1.39 1.895
10 3.47 0.69 1.515
11 38.09 1.28 2.43
12 25.62 1.5 2.28
13 7.54 2.33 3.515
14 5.51 0.82 0.6
15 10.13 2.8 2.12
16 4.66 1.55 3.515
17 4.15 0.76 2.67
18 19.97 0.46 3.16
19 4.23 0.7 2.37
20 8.83 1.98 2.93
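As a check on the construction, recomputing X2' = 0.5*X1 + X2 from the
Simulation A predictors reproduces the x2 column above and a correlation
close to the quoted 0.972:

```python
# Verify the Simulation C construction from the Simulation A predictors.
import numpy as np

x1 = np.array([33, 1.97, 2.65, 2.72, 0.7, 0.94, 0.76, 0.95, 1.39, 0.69,
               1.28, 1.5, 2.33, 0.82, 2.8, 1.55, 0.76, 0.46, 0.7, 1.98])
x2 = np.array([2.09, 0.87, 2.83, 0.42, 2.8, 1.93, 0.82, 0.31, 1.2, 1.17,
               1.79, 1.53, 2.35, 0.19, 0.72, 2.74, 2.29, 2.93, 2.02, 1.94])
x2_new = 0.5 * x1 + x2                        # e.g. 0.5*33 + 2.09 = 18.59
print(np.corrcoef(x1, x2_new)[0, 1])          # about 0.972
```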
Results from:

           Randomizing observations   Randomizing residuals    Using t- and F-tables
B1   B2    t1     t2     f            t1     t2     f          t1     t2     f
0    0     5.3    5.2    5.6          4.9    4.4    5.9        4.4    3.3    5.8
2    0     11.6   4.3    24.5         12.5   4.7    35.2       11.5   3.5    68.4
4    0     29.2   4.9    91.2         29.7   4.7    64.3       28.4   3.5    100.0
0    5     3.8    15.6   68.4         4.5    16.7   47.2       3.6    14.7   100.0
0    10    4.3    45.9   96.5         4.6    46.3   95.9       4.1    42.4   100.0
2    5     11.9   17.1   93.6         13.0   16.9   80.3       11.9   14.5   100.0
4    10    26.8   50.5   99.0         30.3   46.4   100.0      28.8   42.5   100.0
Results of a simulation experiment to examine the properties of three methods
for analyzing multiple regression data. The tabulated values are the
percentages of results from 5000 sets of data where the test statistics t1
(for B1), t2 (for B2) and F (for the whole equation) were significant at the
5% level. Percentages shown in italics (outside the range 4.4% to 5.6%) are
significantly different at the 5% level from what is desired (also 5%). Bold
type indicates that the method concerned gave the lowest observed power when
the null hypothesis was not true.
Findings
Using the t- and F-distributions for testing worked best here, because it gave
the highest power on average while tending to give significant results less
than 5% of the time when the null hypothesis was true.
Randomizing observations performed somewhat better on balance than randomizing
residuals, but the results are really quite mixed, with each method of
randomization appearing to be the best under some conditions.
A more extensive simulation study is needed.