Robust Nonparametric Statistical Methods
Second Edition
Thomas P. Hettmansperger
Penn State University
University Park, Pennsylvania, USA
Joseph W. McKean
Western Michigan University
Kalamazoo, Michigan, USA
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Contents
Preface xv
1 One-Sample Problems 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Location Model . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Geometry and Inference in the Location Model . . . . . . . . . 5
1.3.1 Computation . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Properties of Norm-Based Inference . . . . . . . . . . . . . . . 19
1.5.1 Basic Properties of the Power Function γS (θ) . . . . . 20
1.5.2 Asymptotic Linearity and Pitman Regularity . . . . . . 22
1.5.3 Asymptotic Theory and Efficiency Results for θ̂ . . . . 26
1.5.4 Asymptotic Power and Efficiency Results for the Test
Based on S(θ) . . . . . . . . . . . . . . . . . . . . . . . 27
1.5.5 Efficiency Results for Confidence Intervals Based on S(θ) 29
1.6 Robustness Properties of Norm-Based Inference . . . . . . . . 32
1.6.1 Robustness Properties of θ̂ . . . . . . . . . . . . . . . . 33
1.6.2 Breakdown Properties of Tests . . . . . . . . . . . . . . 35
1.7 Inference and the Wilcoxon Signed-Rank Norm . . . . . . . . 38
1.7.1 Null Distribution Theory of T (0) . . . . . . . . . . . . 39
1.7.2 Statistical Properties . . . . . . . . . . . . . . . . . . . 40
1.7.3 Robustness Properties . . . . . . . . . . . . . . . . . . 46
1.8 Inference Based on General Signed-Rank Norms . . . . . . . . 48
1.8.1 Null Properties of the Test . . . . . . . . . . . . . . . . 50
1.8.2 Efficiency and Robustness Properties . . . . . . . . . . 51
1.9 Ranked Set Sampling . . . . . . . . . . . . . . . . . . . . . . . 57
1.10 L1 Interpolated Confidence Intervals . . . . . . . . . . . . . . 61
1.11 Two-Sample Analysis . . . . . . . . . . . . . . . . . . . . . . . 65
1.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2 Two-Sample Problems 77
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.2 Geometric Motivation . . . . . . . . . . . . . . . . . . . . . . 78
2.2.1 Least Squares (LS) Analysis . . . . . . . . . . . . . . . 81
2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis . . . . . . 82
2.2.3 Computation . . . . . . . . . . . . . . . . . . . . . . . 84
2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.4 Inference Based on the Mann-Whitney-Wilcoxon . . . . . . . . 87
2.4.1 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.4.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . 97
2.4.3 Statistical Properties of the Inference Based on the MWW 97
2.4.4 Estimation of ∆ . . . . . . . . . . . . . . . . . . . . . . 102
2.4.5 Efficiency Results Based on Confidence Intervals . . . . 103
2.5 General Rank Scores . . . . . . . . . . . . . . . . . . . . . . . 105
2.5.1 Statistical Methods . . . . . . . . . . . . . . . . . . . . 109
2.5.2 Efficiency Results . . . . . . . . . . . . . . . . . . . . . 110
2.5.3 Connection between One- and Two-Sample Scores . . . 113
2.6 L1 Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.6.1 Analysis Based on the L1 Pseudo-Norm . . . . . . . . . 115
2.6.2 Analysis Based on the L1 Norm . . . . . . . . . . . . . 119
2.7 Robustness Properties . . . . . . . . . . . . . . . . . . . . . . 122
2.7.1 Breakdown Properties . . . . . . . . . . . . . . . . . . 122
2.7.2 Influence Functions . . . . . . . . . . . . . . . . . . . . 123
2.8 Proportional Hazards . . . . . . . . . . . . . . . . . . . . . . . 125
2.8.1 The Log Exponential and the Savage Statistic . . . . . 126
2.8.2 Efficiency Properties . . . . . . . . . . . . . . . . . . . 129
2.9 Two-Sample Rank Set Sampling (RSS) . . . . . . . . . . . . . 131
2.10 Two-Sample Scale Problem . . . . . . . . . . . . . . . . . . . 133
2.10.1 Appropriate Score Functions . . . . . . . . . . . . . . . 133
2.10.2 Efficacy of the Traditional F -Test . . . . . . . . . . . . 142
2.11 Behrens-Fisher Problem . . . . . . . . . . . . . . . . . . . . . 144
2.11.1 Behavior of the Usual MWW Test . . . . . . . . . . . . 144
2.11.2 General Rank Tests . . . . . . . . . . . . . . . . . . . . 146
2.11.3 Modified Mathisen’s Test . . . . . . . . . . . . . . . . . 147
2.11.4 Modified MWW Test . . . . . . . . . . . . . . . . . . . 149
2.11.5 Efficiencies and Discussion . . . . . . . . . . . . . . . . 150
2.12 Paired Designs . . . . . . . . . . . . . . . . . . . . . . . . . . 152
2.12.1 Behavior under Alternatives . . . . . . . . . . . . . . . 156
2.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6 Multivariate 377
6.1 Multivariate Location Model . . . . . . . . . . . . . . . . . . . 377
6.2 Componentwise Methods . . . . . . . . . . . . . . . . . . . . . 382
6.2.1 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 385
6.2.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
6.2.3 Componentwise Rank Methods . . . . . . . . . . . . . 390
6.3 Spatial Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 392
6.3.1 Spatial Sign Methods . . . . . . . . . . . . . . . . . . . 392
6.3.2 Spatial Rank Methods . . . . . . . . . . . . . . . . . . 399
6.4 Affine Equivariant and Invariant Methods . . . . . . . . . . . 403
6.4.1 Blumen’s Bivariate Sign Test . . . . . . . . . . . . . . 403
6.4.2 Affine Invariant Sign Tests . . . . . . . . . . . . . . . . 405
6.4.3 The Oja Criterion Function . . . . . . . . . . . . . . . 413
6.4.4 Additional Remarks . . . . . . . . . . . . . . . . . . . 418
6.5 Robustness of Estimates of Location . . . . . . . . . . . . . . 419
6.5.1 Location and Scale Invariance: Componentwise Methods 419
6.5.2 Rotation Invariance: Spatial Methods . . . . . . . . . . 420
6.5.3 The Spatial Hodges-Lehmann Estimate . . . . . . . . . 421
6.5.4 Affine Equivariant Spatial Median . . . . . . . . . . . . 421
6.5.5 Affine Equivariant Oja Median . . . . . . . . . . . . . 422
6.6 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
6.6.1 Test for Regression Effect . . . . . . . . . . . . . . . . 425
6.6.2 The Estimate of the Regression Effect . . . . . . . . . 431
6.6.3 Tests of General Hypotheses . . . . . . . . . . . . . . . 432
6.7 Experimental Designs . . . . . . . . . . . . . . . . . . . . . . . 439
6.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
References 495
Index 527
Preface
Basically, I’m not interested in doing research and I never have been. I’m
interested in understanding, which is quite a different thing. And often to
understand something you have to work it out yourself because no one else has
done it.
I don’t believe I can really do without teaching. The reason is, I have to have
something so that when I don’t have any ideas and I’m not getting anywhere
I can say to myself, “At least I’m living; at least I’m doing something; I’m
making some contribution" - it's just psychological.
Richard Feynman
additional models (including models with dependent error structure and nonlinear models) and methods, and extends significantly the possible analyses based on ranks.
In the second edition we have retained the material on one- and two-sample problems (Chapters 1 and 2) along with the basic development of rank methods in the linear model (Chapter 3) and fixed effects experimental designs (Chapter 4). Chapter 5, from the first edition, on high breakdown R estimates has been condensed and moved to Chapter 3. In addition, Chapter 3 now contains a new section on rank procedures for nonlinear models. Selected topics from the first four chapters provide a basic graduate course in rank-based methods. The methods are fully illustrated and the theory fully developed. The prerequisites are a basic course in mathematical statistics and some background in applied statistics. For a one-semester course, we suggest the first seven sections of Chapter 1, the first four sections of Chapter 2, the first seven sections plus Section 9 in Chapter 3, the first four sections of Chapter 4, and then a choice of topics depending on interest.
The new Chapter 5 deals with models with dependent error structure. New
material on rank methods for mixed models is included along with material
on general estimating equations, GEE. Finally, a section on time series has
been added. As in the first edition, this new material is illustrated on data
sets and R software is made available to the reader.
Chapter 6 in both editions deals with multivariate models. In the second edition we have added new material on the development of affine invariant/equivariant sign methods based on transform-retransform techniques. The new methods are computationally efficient, as opposed to the earlier affine invariant/equivariant methods.
The methods developed in the book can be computed using R libraries and functions. These libraries are discussed and illustrated in the relevant sections. Information on several of these packages and functions (including Robnp, ww, and Rfit) can be obtained at the web site http://www.stat.wmich.edu/mckean/index.html. Hence, we have again expanded significantly the available set of tools and inference methods based on ranks.
We have included the data sets for many of our examples in the book. For
others, the reader can obtain the data at the Chapman and Hall web site. See
also the site http://www.stat.wmich.edu/mckean/index.html for information
on the data sets used in this book.
We are indebted to many of our students and colleagues for valuable discussions, stimulation, and motivation. In particular, the first author would like to express his sincere thanks for many stimulating hours of discussion with Steve Arnold, Bruce Brown, and Hannu Oja, while the second author wants to express his sincere thanks for discussions over the years with Ash Abebe,
Kim Crimin, Brad Huitema, John Kapenga, John Kloke, Joshua Naranjo, M.
Rashid, Jerry Sievers, Jeff Terpstra, and Tom Vidmar. We both would like to
express our debt to Simon Sheather, our friend, colleague, and co-author on
many papers. We express our thanks to Rob Calver, Sarah Morris, and Michele
Dimont of Chapman & Hall/CRC for their assistance in the preparation of
this book.
Thomas P. Hettmansperger
Joseph W. McKean
Chapter 1
One-Sample Problems
1.1 Introduction
Traditional statistical procedures are widely used because they offer the user a unified methodology with which to attack a multitude of problems, from simple location problems to highly complex experimental designs. These procedures are based on least squares fitting. Once the problem has been cast into a model, least squares offers the user:

1. a method for estimating the parameters of the model;

2. diagnostic techniques that check the adequacy of the fit of the model, explore the quality of fit, and detect outlying and/or influential cases;

3. inferential procedures, including confidence intervals and tests of hypotheses; and

4. computational feasibility.
realized that these procedures are almost as efficient as the traditional methods when the errors follow a normal distribution and, furthermore, are often much more efficient relative to the traditional methods when the error distributions deviate from normality; see Hodges and Lehmann (1956). These procedures possess both robustness of validity and power. In recent years these
nonparametric methods have been extended to linear and nonlinear models.
In addition, from the perspective of modern robustness theory, contrary to
least squares estimates, these rank-based procedures have bounded influence
functions and positive breakdown points.
Often these nonparametric procedures are thought of as disjoint methods
that differ from one problem to another. In this text, we intend to show that
this is not the case. Instead, these procedures present a unified methodology
analogous to the traditional methods. The four items cited above for the traditional analysis hold for these procedures too. Indeed, the only operational
difference is that the Euclidean norm is replaced by another norm.
There are computational procedures available for the rank-based procedures discussed in this book. We offer the reader a collection of computational functions written in the software language R; see the site http://www.stat.wmich.edu/mckean/. We refer to these computational algorithms as robust nonparametric R algorithms or Robnp. For the chapters on
linear models we make use of the set of algorithms ww written by Terpstra
and McKean (2005) and the R package Rfit developed by Kloke and McKean
(2010). We discuss these functions throughout the text and use them in many
of the examples, simulation studies, and exercises. The programming language
R (see Ihaka and Gentleman, 1996) is freeware and can run on all (PC, Mac,
Linux) platforms. To download the R software and accompanying information, visit the site http://www.r-project.org/. The language R has intrinsic
functions for computation of some of the procedures discussed in this and the
next chapter.
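For example, since Rfit is distributed through CRAN, a rank-based fit of a linear model can be run in a fresh R session as in the following minimal sketch; the data frame mydata and the variables y and x are placeholders, not data from the text.

install.packages("Rfit")            # one-time install from CRAN
library(Rfit)
fit <- rfit(y ~ x, data = mydata)   # rank-based (Wilcoxon score) fit
summary(fit)                        # estimates, standard errors, and tests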
We say T(H) is a location functional if the following three conditions hold:

1. if G(x) ≤ F(x) for all x, then T(G) ≥ T(F);

2. if Y = aX + b with a > 0, then T(H_Y) = aT(H_X) + b;

3. T(H_{−X}) = −T(H_X).

Then, we call θ = T(H) a location parameter of H.
Note that if X has location parameter θ it follows from the second item in
the above definition that the random variable e = X −θ has location parameter
0. Suppose X1 , . . . , Xn is a random sample having the common distribution
function H(x) and θ = T (H) is a location parameter of interest. We express
this by saying that Xi follows the statistical location model,
Xi = θ + ei , i = 1, . . . , n , (1.2.1)
Example 1.2.1 (The Median Location Functional). First define the inverse of the cdf H(x) by H^{-1}(u) = inf{x : H(x) ≥ u}. Generally we suppose that H(x) is strictly increasing on its support; this eliminates ambiguities in the selection of the parameter. Now define θ₁ = T₁(H) = H^{-1}(1/2). This is the median functional. Note that if G(x) ≤ F(x) for all x, then G^{-1}(u) ≥ F^{-1}(u) for all u; and, in particular, G^{-1}(1/2) ≥ F^{-1}(1/2). Hence, T₁(H) satisfies the first condition for a location functional. Next let H*(x) = P(aX + b ≤ x) = H[a^{-1}(x − b)]. Then it follows at once that H*^{-1}(u) = aH^{-1}(u) + b and the second condition is satisfied. The third condition follows with an argument similar to the one for the second condition.
Example 1.2.2 (The Mean Location Functional). For the mean functional let θ₂ = T₂(H) = ∫ x dH(x), when the mean exists. Note that ∫ x dH(x) = ∫₀¹ H^{-1}(u) du. Now if G(x) ≤ F(x) for all x, then x ≤ G^{-1}(F(x)). Let x = F^{-1}(u) and we have F^{-1}(u) ≤ G^{-1}(F(F^{-1}(u))) ≤ G^{-1}(u). Hence, T₂(G) = ∫₀¹ G^{-1}(u) du ≥ ∫₀¹ F^{-1}(u) du = T₂(F) and the first condition is satisfied. The other two conditions follow easily from the definition of the integral.
Example 1.2.3 (The Pseudo-Median Location Functional). Assume that X₁ and X₂ are independent and identically distributed (iid) with distribution function H(x) and density h(x). Let H*(y) = P[(X₁ + X₂)/2 ≤ y] denote the distribution function of the average; the pseudo-median is defined as θ₃ = T₃(H) = H*^{-1}(1/2), the median of (X₁ + X₂)/2;
hence, as in Example 1.2.1, it follows that G*^{-1}(u) ≥ F*^{-1}(u) and, hence, that T₃(G) ≥ T₃(F). For the second property, let W = aX + b where X has distribution function H and a > 0. Then W has distribution function F_W(t) = H((t − b)/a). Then by the change of variable z = (x − b)/a, we have

$$F_W^*(y) = \int H\left(\frac{2y - x - b}{a}\right)\frac{1}{a}\,h\!\left(\frac{x-b}{a}\right)dx = \int H\left(2\,\frac{y-b}{a} - z\right)h(z)\,dz\,.$$

Thus the defining equation for T₃(F_W) is

$$\frac{1}{2} = \int H\left(2\,\frac{T_3(F_W) - b}{a} - z\right)h(z)\,dz\,,$$

which is satisfied for T₃(F_W) = aT₃(H) + b. For the third property, let V = −X where X has distribution function H. Then V has distribution function F_V(t) = 1 − H(−t). Hence, by the change of variable z = −x,

$$F_V^*(y) = \int\left(1 - H(-2y + x)\right)h(-x)\,dx = 1 - \int H(-2y - z)h(z)\,dz\,,$$

so that F_V^*(T₃(F_V)) = 1/2 forces ∫ H(−2T₃(F_V) − z)h(z) dz = 1/2; that is, T₃(F_V) = −T₃(H), as required.
The next theorem characterizes all the location functionals for a symmetric
distribution.
Theorem 1.2.1. Suppose that the pdf h(x) is symmetric about some point a.
If T (H) is a location functional, then T (H) = a.
Proof: Let the random variable X have pdf h(x) symmetric about a. Let
Y = X − a, then Y has pdf g(y) = h(y + a) which is symmetric about 0.
Hence Y and −Y have the same distribution. By the third property of location
functionals, this means that T(G_Y) = T(G_{−Y}) = −T(G_Y); i.e., T(G_Y) = 0. By the second property, 0 = T(G_Y) = T(H) − a; that is, a = T(H).
This theorem means that when we sample from a symmetric distribution
we can unambiguously define location as the center of symmetry. Then all
location functionals that we may wish to study specify the same location
parameter.
The minimum distance between the vector of observations x and the space Ω_F is D(θ̂). As shown in Exercise 1.12.3, D(θ) is a convex and continuous
function of θ which is differentiable almost everywhere. Actually the norms
discussed in this book are differentiable at all but at most a finite number of
points. We define the gradient process by the function
$$S(\theta) = -\frac{d}{d\theta}D(\theta)\,. \qquad (1.3.4)$$
As Exercise 1.12.3 shows, S(θ) is a nonincreasing function. Its discontinuities are the points where D(θ) is nondifferentiable. Furthermore, the minimizing value is a value where S(θ) is 0 or, due to a discontinuity, steps through 0. We express this by saying that θ̂ solves the equation

$$S(\hat\theta) \doteq 0\,. \qquad (1.3.5)$$
$$H_0:\ \theta = \theta_0 \quad \text{versus} \quad H_A:\ \theta \neq \theta_0\,, \qquad (1.3.6)$$
$$\hat\theta_L = \inf\{t : S(t) < c\} \quad \text{and} \quad \hat\theta_U = \sup\{t : S(t) > -c\}\,. \qquad (1.3.10)$$
Example 1.3.1 (L₁ Norm). Recall that the L₁ norm is defined as ‖x‖₁ = Σ|xᵢ|; hence, the associated dispersion and negative gradient functions are given respectively by D₁(θ) = Σ|Xᵢ − θ| and S₁(θ) = Σ sgn(Xᵢ − θ). Letting Hₙ denote the empirical cdf, we can write the estimating equation as

$$0 = n^{-1}\sum \operatorname{sgn}(x_i - \theta) = \int \operatorname{sgn}(x - \theta)\,dH_n(x)\,.$$

The corresponding equation at the population level is 0 = ∫ sgn(x − T(H)) dH(x) = 1 − 2H(T(H)); hence, H(T(H)) = 1/2 and solving for T(H) we find T(H) = H^{-1}(1/2) as expected.
As we show in Section 1.5,
and c1 satisfies
P [bin(n, 1/2) ≤ c1 ] = α/2 , (1.3.15)
where bin(n, 1/2) denotes a binomial random variable based on n trials and
with probability of success 1/2. Note that the critical value of the test can be
determined without specifying the shape of F . In this sense, the test based
on S₁ is distribution free or nonparametric. Using the asymptotic null distribution of S₁⁺, c₁ can be approximated as c₁ ≐ n/2 − n^{1/2}z_{α/2}/2 − .5, where Φ(−z_{α/2}) = α/2, Φ(·) is the standard normal cdf, and .5 is the continuity correction.
For the associated (1 − α)100% confidence interval, we follow the general development above, (1.3.12). Hence, we must find θ̂_L = inf{t : S₁⁺(t) < n − c₁}, where c₁ is given by (1.3.15). Note that S₁⁺(t) < n − c₁ if and only if the number of Xᵢ greater than t is less than n − c₁. But #{i : Xᵢ > X₍c₁+1₎} = n − c₁ − 1 and #{i : Xᵢ > X₍c₁+1₎ − ε} ≥ n − c₁ for any ε > 0. Hence, θ̂_L = X₍c₁+1₎. A similar argument shows that θ̂_U = X₍n−c₁₎. We can summarize this by saying that the (1 − α)100% L₁ confidence interval is the half open, half closed interval

$$\left[X_{(c_1+1)},\,X_{(n-c_1)}\right)\,. \qquad (1.3.16)$$
The critical value c1 can be determined from the binomial(n, 1/2) distribution
or from the normal approximation cited above. The interval developed here is
a distribution-free confidence interval since the confidence coefficient is deter-
mined from the binomial distribution without making any shape assumption
on the underlying model distribution.
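As a computational sketch, the interval and its critical value c₁ can be obtained in a few lines of R; here the vector x is assumed to hold the sample. Since qbinom returns the smallest value whose cumulative binomial probability is at least α/2, one is subtracted so that c₁ satisfies (1.3.15).

sign.ci <- function(x, alpha = 0.05) {
   n  <- length(x)
   c1 <- qbinom(alpha/2, n, 0.5) - 1   # P[bin(n,1/2) <= c1] <= alpha/2
   xs <- sort(x)
   c(lower = xs[c1 + 1], upper = xs[n - c1])
}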
Example 1.3.2 (L₂ Norm). Recall that the square of the L₂ norm is given by ‖x‖₂² = Σᵢ₌₁ⁿ xᵢ². As shown in Exercise 1.12.4, the estimate determined by this norm is the sample mean X̄ and the functional parameter is µ =
where R(|xᵢ|) denotes the rank of |xᵢ| among |x₁|, . . . , |xₙ|. As the next theorem shows, this function is a norm on Rⁿ. See Section 1.8 for a general weighted L₁ norm.

Theorem 1.3.2. The function ‖x‖₃ = Σ j|x|₍ⱼ₎ = Σ R(|xⱼ|)|xⱼ| is a norm, where R(|xⱼ|) is the rank of |xⱼ| among |x₁|, . . . , |xₙ| and |x|₍₁₎ ≤ · · · ≤ |x|₍ₙ₎ are the ordered absolute values.
Proof: The equality relating ‖x‖₃ to the ranks is clear. To show that we have a norm, we first note that ‖x‖₃ ≥ 0 and that ‖x‖₃ = 0 if and only if x = 0. Also clearly ‖ax‖₃ = |a|‖x‖₃ for any real a. Hence, to finish the proof, we must verify the triangle inequality. Now

$$\|x + y\|_3 = \sum j|x + y|_{(j)} = \sum R(|x_i + y_i|)|x_i + y_i| \le \sum R(|x_i + y_i|)|x_i| + \sum R(|x_i + y_i|)|y_i|\,. \qquad (1.3.18)$$

Consider the first term on the right side. By summing through another index we can write it as

$$\sum R(|x_i + y_i|)|x_i| = \sum b_j|x|_{(j)}\,,$$
We call this norm the weighted L₁ norm. In the next theorem, we offer an interesting identity satisfied by this norm. First, though, we need another representation of it. For a random sample X₁, . . . , Xₙ, define the anti-ranks to be the random variables D₁, . . . , Dₙ such that

$$Z_1 = |X_{D_1}| \le \cdots \le Z_n = |X_{D_n}|\,. \qquad (1.3.19)$$

For example, if D₁ = 2 then |X₂| is the smallest absolute value and Z₁ has rank 1. Note that the anti-rank function is just the inverse of the rank function. We can then write

$$\|x\|_3 = \sum_{j=1}^n j|x|_{(j)} = \sum_{j=1}^n j|x_{D_j}|\,. \qquad (1.3.20)$$
Theorem 1.3.3. For any x ∈ Rⁿ,

$$\|x\|_3 = \sum_{i=1}^n |x_i| + \sum_{i<j}\left\{\left|\frac{x_i + x_j}{2}\right| + \left|\frac{x_j - x_i}{2}\right|\right\}\,. \qquad (1.3.21)$$

Proof: Letting the index run through the anti-ranks, the right side of (1.3.21) is

$$\sum_{i=1}^n |x_i| + \sum_{i<j}\left\{\left|\frac{x_{D_i} + x_{D_j}}{2}\right| + \left|\frac{x_{D_j} - x_{D_i}}{2}\right|\right\}\,. \qquad (1.3.22)$$
There are four cases to consider: where x_{D_i} and x_{D_j} are both positive; where they are both negative; and the two cases where they have mixed signs. In all these cases, though, it is easy to show that

$$\left|\frac{x_{D_i} + x_{D_j}}{2}\right| + \left|\frac{x_{D_j} - x_{D_i}}{2}\right| = |x_{D_j}|\,.$$
Using this, we have that the right side of expression (1.3.22) is equal to

$$\sum_{i=1}^n |x_i| + \sum_{i<j}|x_{D_j}| = \sum_{j=1}^n |x_{D_j}| + \sum_{j=1}^n (j-1)|x_{D_j}| = \sum_{j=1}^n j|x_{D_j}| = \|x\|_3\,, \qquad (1.3.23)$$
The middle term is due to the fact that the ranks only change values at the
finite number of points determined by |Xi −θ| = |Xj −θ|; otherwise R(|Xi −θ|)
is constant. The third term is obtained immediately from the identity (1.3.21).
The n(n + 1)/2 pairwise averages {(Xᵢ + Xⱼ)/2 : 1 ≤ i ≤ j ≤ n} are called the Walsh averages. Hence, the estimate of θ is the median of the Walsh averages, which we denote as

$$\hat\theta_3 = \operatorname{med}_{i \le j}\left\{\frac{X_i + X_j}{2}\right\}\,, \qquad (1.3.25)$$
first discussed by Hodges and Lehmann (1963). Often θ̂₃ is called the Hodges-Lehmann estimate of location. In order to obtain the corresponding location functional, note that
$$R(|X_i - \theta|) = \#\{|X_j - \theta| \le |X_i - \theta|\} = \#\{\theta - |X_i - \theta| \le X_j \le \theta + |X_i - \theta|\} = nH_n(\theta + |X_i - \theta|) - nH_n^-(\theta - |X_i - \theta|)\,,$$

where Hₙ⁻ is the left limit of Hₙ. Hence (1.3.24) becomes

$$\int\{H_n(\theta + |x - \theta|) - H_n^-(\theta - |x - \theta|)\}\operatorname{sgn}(x - \theta)\,dH_n(x) = 0\,,$$

that is,

$$-\int_{-\infty}^{\theta}\{H(2\theta - x) - H(x)\}\,dH(x) + \int_{\theta}^{\infty}\{H(x) - H(2\theta - x)\}\,dH(x) = 0\,.$$

This simplifies to

$$\int_{-\infty}^{\infty} H(2\theta - x)\,dH(x) = \frac{1}{2}\,. \qquad (1.3.26)$$
Hence, the functional is the pseudo-median defined in Example 1.2.3. If the
density h(x) is symmetric then from (1.7.11)
The corresponding gradient test statistic for the hypotheses (1.3.6) is T⁺(0). In Section 1.7, provided that h(x) is symmetric, it is shown that T⁺(0) is distribution free under H₀ with null mean and variance n(n + 1)/4 and n(n + 1)(2n + 1)/24, respectively. This test is often referred to as the Wilcoxon signed-rank test. Thus the test for the hypotheses (1.3.6) is
Reject H₀ in favor of H_A if T⁺(0) ≤ k or T⁺(0) ≥ n(n + 1)/2 − k , (1.3.29)
−k , (1.3.29)
where P (T + (0) ≤ k) = α/2. An approximation for k is given in the next
paragraph.
Because of the similarity between the sign and signed-rank processes, the
confidence interval based on T + (θ) follows immediately from the argument
given in Example 1.3.1 for the sign process. Instead of the order statistics which
were used in the confidence interval based on the sign process, in this case
we use the ordered Walsh averages, which we denote as W(1) , . . . , W(n(n+1)/2) .
Hence a (1 − α)100% confidence interval for θ is given by
[W(k+1) , W((n(n+1)/2)−k) ) where k is such that α/2 = P (T + (0) ≤ k) . (1.3.30)
As with the sign process, k can be approximated using the asymptotic normal distribution of T⁺(0) by

$$k \doteq \frac{n(n+1)}{4} - z_{\alpha/2}\sqrt{\frac{n(n+1)(2n+1)}{24}} - .5\,,$$
where z_{α/2} is the (1 − α/2)-quantile of the standard normal distribution. Provided that h(x) is symmetric, this confidence interval is distribution free.
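The estimate and interval are also easy to compute directly. The following R sketch, assuming the sample is in x, forms the n(n + 1)/2 Walsh averages, takes their median for the estimate (1.3.25), and uses R's exact signed-rank distribution to obtain the cutoff k of (1.3.30).

walsh <- function(x) {
   s <- outer(x, x, "+") / 2
   s[upper.tri(s, diag = TRUE)]            # pairs with i <= j
}
hl.ci <- function(x, alpha = 0.05) {
   w <- sort(walsh(x))
   k <- qsignrank(alpha/2, length(x)) - 1  # P(T+(0) <= k) <= alpha/2
   list(estimate = median(w),
        interval = c(w[k + 1], w[length(w) - k]))
}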
1.3.1 Computation
The three procedures discussed in this section are easily computed in R. The R intrinsic functions t.test and wilcox.test compute the t- and Wilcoxon signed-rank tests, respectively. Our collection of R functions, Robnp, contains the functions onesampwil and onesampsgn, which compute the asymptotic versions of the Wilcoxon signed-rank and sign tests, respectively. These functions also compute the associated estimates, confidence intervals, and standard errors. Their use is discussed in the examples. Minitab (Ryan, Joiner, and Cryer,
2005) also can be used to compute these tests. At command line the Minitab
commands stest, wtest, and ttest compute the sign, Wilcoxon signed-rank,
and t-tests, respectively.
1.4 Examples
In applications, by convention, when testing the null hypothesis H₀ : θ = θ₀
using the sign test, any data point equal to θ0 is set aside and the sample size is
reduced. On the other hand, these values are not set aside for point estimation
or confidence intervals. The output of the Robnp functions onesampwil and
onesampsgn includes the test statistics T and S, respectively, and a continuity
corrected standardized value z. The p-values are approximated by computing
normal probabilities on z. Especially for small sample sizes, for the test based
on the signs, S, the approximate and exact p-values can be somewhat different.
In calculating the signed-ranks for the test statistic T, we use average ranks. For t-tests, we report the p-values and confidence intervals using the t-distribution with n − 1 degrees of freedom.
Example 1.4.1 (Cushny-Peebles Data). The data given in Table 1.4.1 are the average excess number of hours of sleep that each of 10 patients achieved from the use of two drugs. The third column gives the difference (Laevo-Dextro) in excesses across the two drugs. This is a famous data set. Gosset, writing under the pseudonym Student, published his landmark paper on the t-test in 1908 and used this data set for illustration. The differences, however, suggest that the L₂ methods may not be the methods of choice in this case.
The normal quantile plot, Panel A of Figure 1.4.1, shows that the tails may
be heavy and that there may be an outlier. A normal quantile plot has the
data (differences) on the vertical axis and the expected values of the standard
normal order statistics on the horizontal axis. When the data is consistent
with a normal assumption, the plot should be roughly linear. The boxplot,
with 95% L1 confidence interval, Panel B of Figure 1.4.1, further illustrates
the presence of an outlier. The box is defined by the quartiles and the shaded
notch represents the confidence interval.
For the sake of discussion and comparison of methods, we provide the
p-values for the sign test, the Wilcoxon signed-rank test, and the t-test. We
used the Robnp functions onesampwil, onesampsgn, and onesampt to compute
the results for the Wilcoxon signed-rank test, the sign test, and the t-test,
respectively. For each function, the following display shows the necessary R
code (these are preceded with the prompt >) to compute these functions,
which is then followed by the results. The standard errors (SE) for the sign and signed-rank estimates are given by (1.5.29) and (1.7.12), respectively, and are discussed in general in Section 1.5.5. These functions also produce a boxplot of the data.
The boxplot produced by the function onesampsgn is shown in Figure 1.4.1.
> onesampwil(diffs)
Table 1.4.1: Excess Hours of Sleep under the Influence of Two Drugs
Row Dextro Laevo Diff(L-D)
1 -0.1 -0.1 0.0
2 0.8 1.6 0.8
3 3.4 4.4 1.0
4 0.7 1.9 1.2
5 -0.2 1.1 1.3
6 -1.2 0.1 1.3
7 2.0 3.4 1.4
8 3.7 5.5 1.8
9 -1.6 0.8 2.4
10 0.0 4.6 4.6
> onesampsgn(diffs)
> temp=onesampt(diffs)
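For readers without the Robnp functions, the same three analyses can be reproduced, as a sketch, with base R functions applied to the differences of Table 1.4.1; these are cross-checks, not the Robnp output.

diffs <- c(0.0, 0.8, 1.0, 1.2, 1.3, 1.3, 1.4, 1.8, 2.4, 4.6)
wilcox.test(diffs, conf.int = TRUE)          # signed-rank test, HL estimate, CI
binom.test(sum(diffs > 0), sum(diffs != 0))  # sign test; the zero is set aside
t.test(diffs)                                # t-test, mean, t-interval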
[Figure 1.4.1: Panel A, normal quantile plot of the differences (Laevo-Dextro); Panel B, boxplot of the differences with the 95% L₁ confidence interval; Panel C, the t-test, and Panel D, the standardized sign test, each plotted against the value of the 10th difference.]
The confidence interval corresponding to the sign test is (0.8, 2.4) which is
shifted above 0. Hence, there is strong support for the alternative hypothesis
that the location of the difference distribution is not equal to zero. That is,
we reject H0 : θ = 0 in favor of HA : θ 6= 0 at α = .05. All three tests support
this conclusion. The estimates of location corresponding to the three tests are
the median (1.3), the median of the Walsh averages (1.3), and the mean of the
sample differences (1.58). Note that the outlier had an effect on the sample
mean.
In order to see how sensitive the test statistics are to outliers, we change
the value of the outlier (difference in the 10th row of Table 1.4.1) and plot the
value of the test statistic against the value of the difference in the 10th row
of Table 1.4.1; see Panel C of Figure 1.4.1. Note that as the value of the 10th
difference changes, the t-test changes quite rapidly. In fact, the t-test can be
pulled out of the rejection region by making the difference sufficiently small
or large. However, the sign test, Panel D of Figure 1.4.1, stays constant until
the difference crosses zero and then only changes by 2. This illustrates the
high sensitivity of the t-test to outliers and the relative resistance of the sign
test. A similar plot can be prepared for the Wilcoxon signed-rank test; see
Exercise 1.12.8. In addition, the corresponding p-values can be plotted to see
how sensitive the decision to reject the null hypothesis is to outliers. Sensitivity
plots are similar to influence functions. We discuss influence functions for
estimates in Section 1.6.
> onesampsgn(x,theta0=.618,alpha=.10)
[Figure: normal quantile plot of the width-to-length ratios.]
> onesampt(x,theta0=.618,alpha=.10)
$$\sup_{\theta \le 0} \gamma_S(\theta) = \alpha\,.$$
Assume the setup found at the beginning of this section; i.e., we are considering the location model (1.3.1) and we have specified a norm with gradient function S(θ). We first define a Pitman Regular process:
Definition 1.5.3. We say an estimating function S(θ) is Pitman Regular
if the following four conditions hold: first,
and

$$\sqrt{n}\,\frac{\overline{S}(0)}{\sigma(0)} \xrightarrow{\;D_{-b/\sqrt{n}}\;} Z - cb\,, \qquad (1.5.17)$$

where Z ∼ N(0, 1) and, so, Z − cb ∼ N(−cb, 1).
The second part of this theorem says that the limiting distribution of S(0),
when standardized by σ(0), and computed along a sequence of alternatives
−b/n1/2 is still normal with the same variance of one but with a new mean,
namely −cb. This result is useful in approximating the power near the null
hypothesis.
We find asymptotic linearity to be useful in establishing statistical properties. Our next result provides sufficient conditions for linearity.
Theorem 1.5.6. Let S̄(θ) = (1/n^γ)S(θ) for some γ > 0 such that the conditions (1.5.7), (1.5.9), and (1.5.11) of Definition 1.5.3 hold. Suppose for any b ∈ R,

$$n\operatorname{Var}_0\!\left(\overline{S}(n^{-1/2}b) - \overline{S}(0)\right) \to 0\,, \quad \text{as } n \to \infty\,. \qquad (1.5.18)$$

Then

$$\sup_{|b| \le B}\left|\sqrt{n}\,\overline{S}\!\left(\frac{b}{\sqrt{n}}\right) - \sqrt{n}\,\overline{S}(0) + \mu'(0)\,b\right| \xrightarrow{P} 0\,, \qquad (1.5.19)$$

for any B > 0.
Proof: First consider Uₙ(b) = [S̄(n^{-1/2}b) − S̄(0)]/(b/√n). By (1.5.9) we have

$$E_0(U_n(b)) = \frac{\sqrt{n}}{b}\,\mu\!\left(\frac{-b}{\sqrt{n}}\right) = -\frac{\sqrt{n}}{b}\,\frac{b}{\sqrt{n}}\,\mu'(\xi_n) \to -\mu'(0)\,. \qquad (1.5.20)$$

Furthermore,

$$\operatorname{Var}_0 U_n(b) = \frac{n}{b^2}\operatorname{Var}_0\!\left(\overline{S}\!\left(\frac{b}{\sqrt{n}}\right) - \overline{S}(0)\right) \to 0\,. \qquad (1.5.21)$$

As Exercise 1.12.9 shows, (1.5.20) and (1.5.21) imply that Uₙ(b) converges to −µ′(0) in probability, pointwise in b, i.e., Uₙ(b) = −µ′(0) + oₚ(1).
For the second part of the proof, let Wₙ(b) = √n[S̄(b/√n) − S̄(0) + µ′(0)b/√n]. Further let ε > 0 and γ > 0 and partition [−B, B] into −B = b₀ < b₁ < · · · < b_m = B so that bᵢ − bᵢ₋₁ ≤ ε/(2|µ′(0)|) for all i. There exists N such that n ≥ N implies P[maxᵢ |Wₙ(bᵢ)| > ε/2] < γ.

Now suppose that Wₙ(b) ≥ 0 (a similar argument can be given for Wₙ(b) < 0). Then, for bᵢ₋₁ ≤ b ≤ bᵢ,

$$|W_n(b)| = \sqrt{n}\left[\overline{S}\!\left(\frac{b}{\sqrt{n}}\right) - \overline{S}(0)\right] + b\mu'(0) \le \sqrt{n}\left[\overline{S}\!\left(\frac{b_{i-1}}{\sqrt{n}}\right) - \overline{S}(0)\right] + b_{i-1}\mu'(0) + (b - b_{i-1})\mu'(0) \le |W_n(b_{i-1})| + (b - b_{i-1})|\mu'(0)| \le \max_i |W_n(b_i)| + \varepsilon/2\,.$$

Hence,

$$P_0\!\left(\sup_{|b| \le B}|W_n(b)| > \varepsilon\right) \le P_0\!\left(\max_i |W_n(b_i)| + \varepsilon/2 > \varepsilon\right) < \gamma\,,$$

and

$$\sup_{|b| \le B}|W_n(b)| \xrightarrow{P} 0\,.$$
In the next three subsections we use these tools to handle the issues of
power and efficiency for a general norm-based inference, but first we show
that the L1 gradient function is Pitman Regular.
Example 1.5.2 (Pitman Regularity of the L₁ Process). Assume that the model pdf satisfies f(0) > 0. Recall that the L₁ gradient function is

$$S_1(\theta) = \sum_{i=1}^n \operatorname{sgn}(X_i - \theta)\,.$$
For future reference, we state the asymptotic linearity result for the L₁ process: if |√n θₙ| ≤ B, then

$$\sqrt{n}\,\overline{S}_1(\theta_n) = \sqrt{n}\,\overline{S}_1(0) - 2f(0)\sqrt{n}\,\theta_n + o_p(1)\,. \qquad (1.5.23)$$
Example 1.5.3 (Pitman Regularity of the L₂ Process). In Exercise 1.12.6 it is shown that, provided Xᵢ has finite variance, the L₂ gradient function is Pitman Regular and that the efficacy is simply c_{L₂} = 1/σ_f.
We are now in a position to investigate the efficiency and power properties
of the statistical methods based on the L1 norm relative to the statistical
methods based on the L2 norm. As we see in the next three subsections, these
properties depend only on the efficacies.
Solving, we have

$$\sqrt{n}\,\hat\theta = c^{-1}\sqrt{n}\,\overline{S}(0)/\sigma(0) + o_p(1)\,;$$

hence, the result follows because √n S̄(0)/σ(0) is asymptotically N(0, 1).
Definition 1.5.4. If we have two Pitman Regular estimates with efficacies c₁ and c₂, respectively, then the efficiency of θ̂₁ with respect to θ̂₂ is defined to be the reciprocal ratio of their asymptotic variances, namely, e(θ̂₁, θ̂₂) = c₁²/c₂².
The next example compares the L1 estimate to the L2 estimate.
Example 1.5.4 (Relative Efficiency between the L₁ and L₂ Estimates). In this example we compare the L₁ and L₂ estimates, namely, the sample median and mean. We have seen that their respective efficacies are 2f(0) and σ_f^{-1}, and their asymptotic variances are 1/(4f²(0)n) and σ_f²/n, respectively. Hence, the relative efficiency of the median with respect to the mean is

$$e(\dot{X}, \bar{X}) = \operatorname{asyvar}(\sqrt{n}\,\bar{X})/\operatorname{asyvar}(\sqrt{n}\,\dot{X}) = c_{\dot{X}}^2/c_{\bar{X}}^2 = 4f^2(0)\sigma_f^2\,, \qquad (1.5.24)$$

where Ẋ is the sample median and X̄ is the sample mean. The efficiency computation depends only on the Pitman efficacies. We illustrate the computation of the efficiency using the contaminated normal distribution. The pdf of the contaminated normal distribution consists of mixing the standard normal pdf with a normal pdf having mean zero and variance δ² > 1. For ε between 0 and 1, the pdf can be written

$$f_\varepsilon(x) = (1 - \varepsilon)\,\phi(x) + \frac{\varepsilon}{\delta}\,\phi(x/\delta)\,, \qquad (1.5.25)$$

where φ is the standard normal pdf, with σ_f² = 1 + ε(δ² − 1). This distribution has tails heavier than the standard
normal distribution and can be used to model data contamination; see Tukey
(1960) for more discussion. We can think of ǫ as the fraction of the data
that is contaminated. In Table 1.5.1 we provide values of the efficiencies for
various values of contamination and with δ = 3. Note that when we have 10% contamination the efficiency is 1. This indicates that, for this distribution,
the median and mean are equally effective. Finally, this example exhibits a
distribution for which the median is superior to the mean as an estimate of
the center. See Exercise 1.12.12 for other examples.
Table 1.5.1: Efficiencies of the Median Relative to the Mean for Contaminated
Normal Models
ǫ e(Ẋ, X̄)
.00 .637
.03 .758
.05 .833
.10 1.000
.15 1.134
this test is nondecreasing with upper limit one and that it is typically resolving.
Further, we showed that for a fixed alternative, the test is consistent. Thus the
power tends to one as the sample size increases. To offset this effect, we let the
alternative converge to the null value at a rate that stabilizes the power away
from one. This enables us to compare two tests along the same alternative
sequence. Consider the null hypothesis H₀ : θ = 0 versus H_{An} : θ = θₙ, where θₙ = θ*/√n and θ* > 0. Recall that the asymptotic size α test based on S(0) rejects H₀ if √n S̄(0)/σ(0) ≥ z_α, where 1 − Φ(z_α) = α.
The following theorem is called the asymptotic power lemma. Its proof
follows immediately from expression (1.5.14).
Theorem 1.5.8. Assume that S(0) is Pitman Regular with efficacy c. Then the asymptotic local power along the sequence θₙ = θ*/√n is

$$\gamma_S(\theta_n) = P_{\theta_n}\!\left(\sqrt{n}\,\overline{S}(0)/\sigma(0) \ge z_\alpha\right) = P_0\!\left(\sqrt{n}\,\overline{S}(-\theta_n)/\sigma(0) \ge z_\alpha\right) \to 1 - \Phi(z_\alpha - \theta^* c)\,, \quad \text{as } n \to \infty\,.$$
Note that larger values of the efficacy imply larger values of the asymptotic
local power.
Note that this is the same formula as the efficiency of one estimate relative
to another given in Definition 1.5.4. Therefore, the efficiency results discussed
in Example 1.5.4 between the L1 and L2 estimates apply for the sign and t-tests
also. Hence, we have an example in which the simple sign test is asymptotically
more powerful than the t-test.
We can also develop a sample size interpretation for the asymptotic power.
Suppose we specify a power γ < 1. Further, let z_γ be defined by 1 − Φ(z_γ) = γ. Then 1 − Φ(z_α − cn^{1/2}θₙ) = 1 − Φ(z_γ) and z_α − cn^{1/2}θₙ = z_γ. Solving for n yields

$$n \doteq (z_\alpha - z_\gamma)^2/(c^2\theta_n^2)\,. \qquad (1.5.26)$$
Typically we take θₙ = kₙσ with kₙ small. Now if S₁(0) and S₂(0) are two Pitman Regular asymptotically size α tests, then the ratio of sample sizes required to achieve the same asymptotic power along the same sequence of alternatives is given by the approximation n₂/n₁ ≐ c₁²/c₂². This provides additional motivation for the above definition of Pitman efficiency of two tests. The initial development of asymptotic efficiency was done by Pitman (1948) in an unpublished manuscript and later published by Noether (1955).
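As a numerical sketch of (1.5.26), suppose α = .05 (one-sided), the desired power is γ = .90, and the alternative is such that cθₙ = 0.5; the values of α, γ, and cθₙ here are assumptions for illustration only.

alpha <- 0.05; gamma <- 0.90
za <- qnorm(1 - alpha)            # z_alpha
zg <- qnorm(1 - gamma)            # z_gamma, from 1 - Phi(z_gamma) = gamma
ceiling((za - zg)^2 / 0.5^2)      # approximate required sample size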
When this is also done for θ̂_U and the difference is taken, we have

$$n^{1/2}(\hat\theta_U - \hat\theta_L) = 2z_{\alpha/2}/c + o_P(1)\,,$$
This simple estimate of the asymptotic standard error is based on the length
of the 95% confidence interval for the median. Sheather (1987) shows that
There are other approaches to the estimation of this standard error. For
example, we could estimate the density h(x) directly and then use hn (θ)b where
hn is the density estimate. Another possibility is to estimate the finite sample
standard error of the sample median directly. Sheather (1987) surveys these
approaches. We discuss one further possibility here, namely the bootstrap.
The bootstrap has gained wide attention recently because of its versatility in
estimation and testing in nonstandard situations. See Efron and Tibshirani
(1993) for a very readable account of the bootstrap.
If we know the underlying distribution H(x), then we could estimate the
standard error of the median by repeatedly drawing samples with a computer
from the distribution H. If we have B samples from H and have computed and
stored the B values of the sample median, then our estimate of the standard
error of the median is simply the sample standard deviation of these B values.
When H is unknown we replace it by Hₙ, the empirical distribution function, and proceed with the simulation. The bootstrap approach based on Hₙ is called the nonparametric bootstrap since nothing is assumed about the form of the underlying distribution H. In another version, called the parametric bootstrap, we suppose that we know the form of the underlying distribution H but there are some unknown parameters such as the mean and variance. We use the sample to estimate these unknown parameters, insert the values into H, and use this distribution to draw the B samples. In this book we are concerned mainly with the nonparametric bootstrap and we use the generic term
bootstrap to refer to this approach. In either case, ready access to high speed
computing makes this method appealing. The following example illustrates
the computations.
Example 1.5.6 (Generated Data). Using Minitab, the 30 data points in Table 1.5.2 were generated from a normal distribution with mean 0 and variance 1. Thus, we know that the asymptotic standard error should be about 1/[30^{1/2} · 2f(0)] = 0.23. We use this to check what happens if we try to estimate the standard error from the data.
Using expression (1.3.16), the 95% confidence interval for the median is
(−0.789, 0.331). Hence, the length of confidence interval estimate, given in
expression (1.5.29), is (0.331 + 0.789)/4 = 0.28. A simple R function was
written to bootstrap the sample; see Exercise 1.12.7. Using this function, we
obtained 1000 bootstrap samples and the resulting standard deviation of the
1000 bootstrap medians was 0.27. For this instance, the bootstrap procedure
essentially agrees with the length of confidence interval estimate.
Note that, from the data, the sample mean is −0.03575 and the sample
standard deviation is 1.04769. If we assume the underlying distribution H is
normal with unknown mean and variance, we would use the parametric bootstrap. Hence, instead of sampling from the empirical distribution function, we want to sample from a normal distribution with mean −0.03575 and standard deviation 1.04769. Using R (see Exercise 1.12.7), we obtained 1000 parametric bootstrapped samples. The sample standard deviation of the resulting medians was 0.23, just the value we would expect. You should not expect to get the precise value every time you bootstrap, either parametrically or nonparametrically. It is, however, a very versatile method to use to estimate such quantities as standard errors of estimates and p-values of tests.
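A sketch of both bootstrap computations of this example (cf. Exercise 1.12.7), assuming the vector x holds the 30 values of Table 1.5.2, is:

B <- 1000
npboot <- replicate(B, median(sample(x, replace = TRUE)))  # resample from Hn
sd(npboot)                     # nonparametric bootstrap SE of the median
n <- length(x)
pboot <- replicate(B, median(rnorm(n, mean(x), sd(x))))    # fitted normal
sd(pboot)                      # parametric bootstrap SE of the median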
single outlying value when it is introduced into the sample. Further, we would not like procedures that can be changed by arbitrary amounts by corrupting a small amount of the data. Response to outliers is measured by the influence curve and response to data corruption is measured by the breakdown value. We introduce finite sample versions of these concepts. They are easy to work with and, in the limit, they generally equal the more abstract versions based on the study of statistical functionals. We consider, first, the robustness properties of the estimates and, secondly, tests. As in the last section, the discussion is general but the L₁ and L₂ procedures are discussed as we proceed. The robustness properties of the procedures based on the weighted L₁ norm are covered in Sections 1.7 and 1.8. See Section A.5 of the Appendix for a development based on functionals.
where Ω(·) is the function needed in the representation. When we combine the above two statements we have

$$n^{1/2}\hat\theta_n = n^{-1/2}\sum_{i=1}^n \Omega(x_i) + o_P(1)\,. \qquad (1.6.4)$$
$$(n+1)\hat\theta_{n+1} - (n+1)\hat\theta_n \doteq \Omega(x^*)$$

and

$$\frac{\hat\theta_{n+1} - \hat\theta_n}{1/(n+1)} \approx \Omega(x^*)\,, \qquad (1.6.5)$$
and this reveals the differential character of the influence function. Hampel (1974) developed the influence function from the theory of von Mises differentiable functions. In Sections A.5 and A.5.2 of the Appendix, we use his formulation to derive several influence functions for later situations. Here, though, we identify influence functions for the estimates through the approximations described above. We now illustrate this approach.
$$\Omega(x) = \frac{\operatorname{sgn}(x)}{2f(0)}\,.$$
Note that the influence function is bounded but not continuous. Hence,
outlying observations cannot have an arbitrarily large effect on the estimate. It
is this feature along with the 50% breakdown property that makes the sample
median the prototype of resistant estimates. The sample mean, on the other
hand, has an unbounded influence function. It is easy to see that Ω(x) = x,
linear and unbounded. Hence, a single large outlier is sufficient to carry the
sample mean beyond any bound. The unbounded influence is connected to the
0 breakdown property. Hence, the L2 estimate is the prototype of an estimate
highly efficient at a specified model, the normal model in this case, but not
resistant. This means that quite close to the model for which the estimate is
optimal, the estimate may perform very poorly; recall Table 1.5.1.
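The contrast is easy to visualize with a sensitivity curve based on (1.6.5): add a single point x* to a fixed sample and plot (n + 1)(θ̂ₙ₊₁ − θ̂ₙ) against x*. The following R sketch uses a simulated base sample (an assumption, not data from the text).

set.seed(11)
x  <- rnorm(20)
xs <- seq(-6, 6, by = 0.1)
sc <- function(stat)
   sapply(xs, function(o) (length(x) + 1) * (stat(c(x, o)) - stat(x)))
plot(xs, sc(mean), type = "l", xlab = "x*", ylab = "sensitivity")  # linear, unbounded
lines(xs, sc(median), lty = 2)    # bounded step function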
where the sup is taken over all possible corruptions of m data points. Likewise
the acceptance breakdown is defined to be
Rejection breakdown is the smallest portion of the data that can be cor-
rupted to guarantee that the test rejects the null hypothesis. Acceptance
breakdown is interpreted as the smallest portion of the data that must be
corrupted to guarantee that the test statistic is not in the critical region; i.e.,
the test is guaranteed to fail to reject the null hypothesis. We turn immediately
to a comparison of the L1 and L2 tests.
Example 1.6.3 (Rejection Breakdown of the L1 ). We first consider the one-
sided sign test for testing H0 : θ = 0 versus HA : θ > 0. The asymptotically
size α test rejects the null hypothesis when n−1/2 S1 (0) ≥ zα , the upper α
quantile from a standard normal distribution.
P It is easier to see exactly what
+ 1/2
happens if we convert the test to S1 (0) = I(Xi > 0) ≥ n/2 + (n zα )/2.
Now each time we make an observation positive it makes S1+ (0) increase by
one. Hence, if we wish to guarantee that the test rejects the null hypothesis, we make m* observations positive, where m* = [n/2 + (n^{1/2}z_α)/2] + 1, [·] denoting the greatest integer function. Then the rejection breakdown is
$$\epsilon_n^*(\text{reject}) = m^*/n \doteq \frac{1}{2} + \frac{z_\alpha}{2n^{1/2}}\,.$$

Likewise,

$$\epsilon_n^*(\text{accept}) \doteq \frac{1}{2} - \frac{z_\alpha}{2n^{1/2}}\,.$$
Note that the rejection breakdown converges down to the estimation breakdown and the acceptance breakdown converges up to it.
We next turn to the one-sided Student’s t-test. Acceptance breakdown
for the t-test is simple. By making a single observation approach −∞, the
t-statistic can be made negative, hence we can always guarantee acceptance
with control of one observation. The rejection breakdown is more interesting.
If we increase an observation both the sample mean and the sample standard
deviation increase. Hence, it is not at all clear what happens to the t-statistic.
In fact it is not sufficient to increase a single observation in order to force the
t-statistic to move into the critical region. We now show that the rejection breakdown for the t-statistic is

$$\epsilon_n^*(\text{reject}) = \frac{t_\alpha^2}{n - 1 + t_\alpha^2} \to 0\,, \quad \text{as } n \to \infty\,,$$
as M → ∞. We now equate the limit to t_α and solve for m to get m = nt_α²/(n − 1 + t_α²) (actually we would take the greatest integer and add one). Then the
rejection breakdown is m divided by n as stated. Table 1.6.1 compares rejection
breakdown values for the sign and t-tests. We assume α = .05 and the sample
sizes are chosen so that the size of the sign test is quite close to .05. For further
discussion, see Ylvisaker (1977).
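The two breakdown formulas are simple to tabulate; the following R sketch computes them for α = .05 over a few illustrative sample sizes, in the spirit of Table 1.6.1.

n  <- c(10, 20, 30, 50, 100)
za <- qnorm(0.95)
sign.bd <- 0.5 + za / (2 * sqrt(n))   # sign test rejection breakdown
ta <- qt(0.95, n - 1)
t.bd <- ta^2 / (n - 1 + ta^2)         # t-test rejection breakdown
round(cbind(n, sign.bd, t.bd), 3)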
These definitions of breakdown assume a worst case scenario. They assume
that the test statistic is as far away from the critical region (for rejection
breakdown) as possible. In practice, however, it may be the case that a test
statistic is quite near the edge of the critical region and only one observation is
needed to change the decision from fail to reject to that of reject. An alternative
form of breakdown considers the average number of observations that must
be corrupted, conditional on the test statistic being in the acceptance region,
to force a rejection.
Let MR be the number of observations that must be corrupted to force a
rejection; then, M_R is a random variable. The expected rejection breakdown is defined to be
$$\operatorname{Exp}_n^*(\text{reject}) = E_{H_0}[M_R \mid M_R > 0]/n\,. \qquad (1.6.8)$$
Note that we condition on MR > 0 since MR = 0 is equivalent to a rejection.
It is left as Exercise 1.12.14 to show that the expected breakdown can be
computed with unconditional expectation as
$$\operatorname{Exp}_n^*(\text{reject}) = E_{H_0}[M_R]/[n(1 - \alpha)]\,. \qquad (1.6.9)$$
In the following example we illustrate this computation on the sign test and
show how it compares to the worst case breakdown introduced earlier.
Example 1.6.4 (Expected Rejection Breakdown of the Sign Test). Refer to Example 1.6.3. The one-sided sign test rejects when Σ I(Xᵢ > 0) ≥ n/2 + n^{1/2}z_α/2. Hence, given that we fail to reject the null hypothesis, we need to change (corrupt) n/2 + n^{1/2}z_α/2 − Σ I(Xᵢ > 0) negative observations into positive ones. This is precisely M_R, and E[M_R] = n^{1/2}z_α/2. It follows that Exp*ₙ(reject) = z_α/[2n^{1/2}(1 − α)] → 0 as n → ∞, rather than the .5 which obtains in the worst case breakdown. Table 1.6.2 compares the two types of rejection breakdown. This simple calculation clearly shows that even highly resistant tests such as the sign test may break down quite easily. This is contrary to what the worst case breakdown analysis would suggest. For additional reading on test breakdown see Coakley and Hettmansperger (1992). He, Simpson, and Portnoy (1990) discuss asymptotic test breakdown.
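A sketch comparing the worst-case and expected rejection breakdowns of the sign test, using the formulas of Examples 1.6.3 and 1.6.4 (the sample sizes are illustrative):

n  <- c(10, 20, 30, 50, 100)
za <- qnorm(0.95); alpha <- 0.05
worst    <- 0.5 + za / (2 * sqrt(n))
expected <- za / (2 * sqrt(n) * (1 - alpha))
round(cbind(n, worst, expected), 3)   # in the spirit of Table 1.6.2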
of Section 1.3. Recall that the norm and its associated gradient function are
given in expressions (1.3.17) and (1.3.24), respectively. Recall for a sample
X1 , . . . , Xn that the estimate of θ is the median of the Walsh averages given
by (1.3.25). As in Section 1.3, our hypotheses of interest are
H₀ : θ = 0 versus H_A : θ ≠ 0 . (1.7.1)

where c is such that P₀[|T(0)| ≥ c] = α. To complete the test we need to determine the null distribution of T(0), which is given by Theorems 1.7.1 and 1.7.2.
In order to develop the statistical properties, in addition to (1.2.1), we
assume that
the pdf h(x) is symmetric about θ . (1.7.3)
We refer to this as the symmetric location model. Under symmetry, by
Theorem 1.2.1, V (H) = θ, for all location functionals V .
In terms of the anti-ranks, the statistic can be written as T(0) = Σ_{j=1}^n j W_j, where W_j = sgn(X_{D_j}).
Lemma 1.7.1. Under H₀, |X₁|, . . . , |Xₙ| and sgn(X₁), . . . , sgn(Xₙ) are independent.
Proof: Since X1 , . . . , Xn is a random sample from H(x), it suffices to show
that P [|Xi| ≤ x, sgn(Xi ) = 1] = P [|Xi | ≤ x]P [sgn(Xi ) = 1]. But due to H0
and the symmetry of h(x) this follows from the following string of equalities:
$$P[|X_i| \le x,\ \operatorname{sgn}(X_i) = 1] = P[0 < X_i \le x] = H(x) - \frac{1}{2} = \frac{1}{2}[2H(x) - 1] = P[|X_i| \le x]\,P[\operatorname{sgn}(X_i) = 1]\,.$$
Based on this lemma, the vector of ranks and, hence, the vector of anti-ranks (D₁, . . . , Dₙ), are independent of the vector (sgn(X₁), . . . , sgn(Xₙ)). Based on these facts, we can obtain the distribution of (W₁, . . . , Wₙ), which we summarize in the following lemma; see Exercise 1.12.15 for its proof.
Lemma 1.7.2. Under H₀ and the symmetry of h(x), W₁, . . . , Wₙ are iid random variables with P[Wᵢ = 1] = P[Wᵢ = −1] = 1/2.
We can now easily derive the null distribution theory of T (0) which we
summarize in the following theorems. Details are given in Exercise 1.12.16.
Theorem 1.7.1. Under H0 and the symmetry of h(x),
Using this formula, algorithms can be developed which obtain the null distribution of the signed-rank test statistic. The moment generating function can also be inverted to find the null distribution; see Hettmansperger (1984a, Section 2.2). As discussed in Section 1.3.1, software is now available which computes critical values and p-values of the null distribution.
Theorem 1.7.1 justifies the confidence interval for θ in (1.3.30); i.e., the (1 − α)100% confidence interval given by [W₍k+1₎, W₍(n(n+1)/2)−k₎), where W₍i₎ denotes the ith ordered Walsh average and P(T⁺(0) ≤ k) = α/2. Based on (1.7.7), k can be approximated as k ≈ n(n + 1)/4 − .5 − z_{α/2}[n(n + 1)(2n + 1)/24]^{1/2}. As noted in Section 1.3.1, the computation of the estimate and confidence interval can be obtained with the Robnp R function onesampwil or the R intrinsic function wilcox.test.
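For instance, the cutoff k can be computed from R's exact signed-rank distribution and compared with the approximation above; a sketch for an assumed n = 15 and α = .05:

n <- 15; alpha <- 0.05
k <- qsignrank(alpha/2, n) - 1   # largest k with P(T+(0) <= k) < alpha/2
psignrank(k, n)                  # attained alpha/2
n*(n + 1)/4 - 0.5 - qnorm(1 - alpha/2) * sqrt(n*(n + 1)*(2*n + 1)/24)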
Pitman efficacy which determines the asymptotic variance of the estimate, the
asymptotic local power of the test, and the asymptotic length of the confidence
interval. In the following theorem we show that the weighted L1 gradient
function is Pitman Regular and determine the efficacy. Then we make some
preliminary efficiency comparisons with the L1 and L2 methods.
Theorem 1.7.3. Suppose that h is symmetric and that ∫ h²(x) dx < ∞. Let

$$\overline{T}(\theta) = \frac{2}{n(n+1)}\sum_{i \le j}\operatorname{sgn}\!\left(\frac{X_i + X_j}{2} - \theta\right)\,.$$

Then the conditions of Definition 1.5.3 are satisfied and, thus, T̄(θ) is Pitman Regular. Moreover, the Pitman efficacy is given by

$$c = \sqrt{12}\int_{-\infty}^{\infty} h^2(x)\,dx\,. \qquad (1.7.9)$$
Proof: Since we have the L₁ norm applied to the Walsh averages, the estimating function is a nonincreasing step function with steps at the Walsh averages. Hence, (1.5.7) holds. Next note that h(x) = h(−x) and, hence,

$$\mu(\theta) = E_\theta\,\overline{T}(0) = \frac{2}{n+1}\,E_\theta\operatorname{sgn}(X_1) + \frac{n-1}{n+1}\,E_\theta\operatorname{sgn}\!\left(\frac{X_1 + X_2}{2}\right)\,.$$

Now

$$E_\theta\operatorname{sgn}X_1 = \int \operatorname{sgn}(x + \theta)h(x)\,dx = 1 - 2H(-\theta)\,,$$

and

$$E_\theta\operatorname{sgn}\frac{X_1 + X_2}{2} = \int\!\!\int \operatorname{sgn}\!\left[\frac{x + y}{2} + \theta\right]h(x)h(y)\,dx\,dy = \int[1 - 2H(-2\theta - y)]h(y)\,dy\,.$$
The finiteness of the integral is sufficient to ensure that the derivative can
be passed through the integral; see Hodges and Lehmann (1961) or Olshen
(1967). Hence, (1.5.9) also holds. We next establish Condition (1.5.10). Since
$$\overline{T}(\theta) = \frac{2}{n(n+1)}\sum_{i=1}^n \operatorname{sgn}(X_i - \theta) + \frac{2}{n(n+1)}\sum_{i<j}\operatorname{sgn}\!\left(\frac{X_i + X_j}{2} - \theta\right)\,,$$
the first term is of smaller order and we need only consider the second term. Now, for b > 0, let

$$V^* = \frac{2}{n(n+1)}\sum_{i<j}\left[\operatorname{sgn}\!\left(\frac{X_i + X_j}{2} - n^{-1/2}b\right) - \operatorname{sgn}\!\left(\frac{X_i + X_j}{2}\right)\right] = \frac{-4}{n(n+1)}\sum_{i<j} I\!\left(0 < \frac{X_i + X_j}{2} < n^{-1/2}b\right)\,.$$

Hence,

$$n\operatorname{Var}(V^*) = \frac{16n}{n^2(n+1)^2}\,E\!\left\{\sum_{i<j}\sum_{s<t}(I_{ij}I_{st} - EI_{ij}EI_{st})\right\}\,,$$

where I_{ij} = I(0 < (x_i + x_j)/2 < n^{-1/2}b). This becomes
From our general discussion, a simple estimate of the standard error of the
median of the Walsh averages is proportional to the length of a distribution
free confidence interval. Consider the (1 − α)100% confidence interval given
by [W(k+1) , W(((n(n+1))/2)−k) ) where W(i) denotes the ith ordered Walsh average
and P (T + (0) ≤ k) = α/2. Then by expression (1.5.28), a consistent estimate
of the SE of the median of the Walsh averages (medWA) is
$$SE(\operatorname{medWA}) = \frac{W_{(n(n+1)/2 - k)} - W_{(k+1)}}{2z_{\alpha/2}}\,. \qquad (1.7.12)$$
Our R function onesampwil computes this standard error for general α (default α is set at 0.05). We have more to say about this particular c in the next
chapter where we encounter it in the two-sample location model and later in
the linear model, where a better estimator of this SE is presented.
From Example 1.5.3 and Definition 1.5.4, the asymptotic relative efficiency between the signed-rank Wilcoxon process and the L₂ process is given by

$$e(\text{Wilcoxon}, L_2) = 12\sigma_h^2\left(\int h^2(x)\,dx\right)^2\,, \qquad (1.7.13)$$
Table 1.7.1: Efficiencies of the Rank, L1 , and L2 Methods for the Contaminated
Normal Distribution
ǫ e(L1 , L2 ) e(R, L1 ) e(R, L2 )
.00 .637 1.500 .955
.01 .678 1.488 1.009
.03 .758 1.462 1.108
.05 .833 1.436 1.196
.10 1.000 1.373 1.373
.15 1.134 1.320 1.497
that replacing the values of the observations by their ranks (retaining only
the order information) does not affect the statistical properties of the test.
This was considered highly nonintuitive in the 1950’s since nonparametric
methods were thought of as quick and dirty. Now they must be considered
highly efficient competitors of the optimal methods and, in addition, they are
more robust than the optimal methods. This provides powerful motivation
for the continued study of rank methods in other statistical models such as
the two-sample location model and the linear model. The early work in the
area of efficiency of rank methods is due largely to Lehmann and his students.
See Hodges and Lehmann (1956, 1961) for two important early papers and
Lehmann (1975, Appendix) for more discussion.
We complete this example with a table of efficiencies of the rank methods
relative to the L1 and L2 methods for the contaminated normal model with
σ = 3. Table 1.7.1 shows these efficiencies and extends Table 1.5.1. As ǫ
increases the weight in the tails of the distribution also increases. Note that the
efficiencies of both the L1 and rank methods relative to L2 methods increase
with ǫ. On the other hand, the efficiency of the rank methods relative to
the L1 methods decreases slightly. The rank methods are still more efficient;
however, this illustrates the fact that the L1 methods are good for heavy
tailed distributions. The overall implication of this example is that the L2
methods, such as the sample mean, the t-test, and t-confidence interval, are not
particularly efficient once the underlying distribution departs from the normal
distribution. Further, the rank methods such as the Wilcoxon signed-rank test,
confidence interval, and the median of the Walsh averages are surprisingly
efficient, even at the normal distribution. Note that the rank methods are
more efficient than L2 methods even for 1% contamination.
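The last column of Table 1.7.1 can be verified directly from (1.7.13). The base R sketch below (the function eRL2 and its settings are ours, not the book's) uses the identity ∫ φ_{s1}(x)φ_{s2}(x) dx = 1/√(2π(s1² + s2²)) for normal densities:

eRL2 <- function(eps, sigc = 3) {
  # int h^2 for h = (1 - eps) N(0,1) + eps N(0, sigc^2)
  inth2 <- (1 - eps)^2 / sqrt(4 * pi) +
    2 * eps * (1 - eps) / sqrt(2 * pi * (1 + sigc^2)) +
    eps^2 / sqrt(4 * pi * sigc^2)
  sig2h <- 1 + eps * (sigc^2 - 1)    # variance of the contaminated normal
  12 * sig2h * inth2^2               # expression (1.7.13)
}
round(sapply(c(0, .01, .03, .05, .10, .15), eRL2), 3)
# 0.955 1.009 1.108 1.196 1.373 1.497, matching the e(R, L2) column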
Finally, the following theorem shows that the Wilcoxon signed-rank statis-
tic never loses much efficiency relative to the t-statistic. Let Fs denote the
family of distributions which have symmetric densities and finite Fisher infor-
mation; see Exercise 1.12.21.
First complete the square on the first term on the right side of (1.7.17) to get

∫_{|x|≤a} [h + b(x² − a²)]² − b² ∫_{|x|≤a} (x² − a²)² .   (1.7.18)
Now (1.7.17) is equal to the two terms of (1.7.18) plus the second term on the
right side of (1.7.17). We can now write the density that minimizes (1.7.16).
If |x| > a take h(x) = 0, since x² > a², and if |x| ≤ a take h(x) = b(a² − x²), since the integral in the first term of (1.7.18) is nonnegative. We can now determine the values of a and b from the side conditions. From ∫ h = 1 we have

∫_{−a}^{a} b(a² − x²) dx = 1 ,

which implies that a³b = 3/4. Further, from ∫ x²h = 1 we have
∫_{−a}^{a} x² b(a² − x²) dx = 1 ,

from which a⁵b = 15/4. Hence, solving for a and b yields a = √5 and b = 3√5/100. Now

∫ h² = ∫_{−√5}^{√5} [ (3√5/100)(5 − x²) ]² dx = 3√5/25 ,
and

θ̂_n ≈ T(0)/(c σ(0)) ,

where σ(0) = (4/3)^{1/2} and c = (12)^{1/2} ∫ h²(t) dt. Making these substitutions,

θ̂_n ≐ (1/(2n(n+1) ∫ h²(t) dt)) Σ_{i≤j} sgn((X_i + X_j)/2) .
Now introduce an outlier x_{n+1} = x* and take the difference between θ̂_{n+1} and θ̂_n. The result is

2 ∫ h²(t) dt [ (n+2)θ̂_{n+1} − nθ̂_n ] ≐ (1/(n+1)) Σ_{i=1}^{n+1} sgn((x_i + x*)/2) .
It now follows that (n+1)(θ̂_{n+1} − θ̂_n) ≐ Ω(x*), as given in the statement of the theorem; see the discussion of the influence function in Section 1.6.
Note that we have a bounded influence function since the cdf H is
a bounded function. Further, it is continuous, unlike the influence func-
tion of the median. Finally, as an additional check, note that EΩ²(X) = 1/(12[∫ h²(t) dt]²) = 1/c², the asymptotic variance of n^{1/2}θ̂.
Let θ̂_c = med_{i,j}{(X_i − cX_j)/(1 − c)} for −1 ≤ c < 1. This extension of the Hodges-Lehmann estimate, (1.3.25), has some very interesting robustness properties for c > 0. The influence function of θ̂_c is not only bounded but also redescending, similar to the most robust M-estimates. In addition, θ̂_c has 50%
breakdown. For a complete discussion of this estimate see Maritz, Wu, and
Staudte (1977) and Brown and Hettmansperger (1994).
In the next theorem we develop the test breakdown for the Wilcoxon
signed-rank test.
Theorem 1.7.6. The rejection breakdown, Definition 1.6.2, for the Wilcoxon signed-rank test is

ǫ*_n ≐ 1 − [ 1/2 − z_α/(3n)^{1/2} ]^{1/2} → 1 − 1/2^{1/2} ≐ .29 .
Proof: Consider the form T+(0) = ΣΣ I[(x_i + x_j)/2 > 0], where the double sum is over all i ≤ j. The asymptotically size α test rejects H0 : θ = 0 in favor of HA : θ > 0 when T+(0) ≥ c ≐ n(n+1)/4 + z_α[n(n+1)(2n+1)/24]^{1/2}. Now we must guarantee that T+(0) is in the critical region. This
requires at least c positive Walsh averages. Let x_(1) ≤ ... ≤ x_(n) be the ordered observations. Then contamination of x_(n) results in n contaminated Walsh averages, namely those Walsh averages that include x_(n). Contamination of x_(n−1) yields n − 1 additional contaminated Walsh averages. Proceeding in this way, contamination of the b ordered values x_(n), ..., x_(n−b+1) yields n + (n−1) + ... + (n−b+1) = [n(n+1)/2] − [(n−b)(n−b+1)/2] contaminated Walsh averages. We now set [n(n+1)/2] − [(n−b)(n−b+1)/2] ≐ c and solve the resulting quadratic for b. We must solve b² − (2n+1)b + 2c ≐ 0. The appropriate root in this case is
for the sign and t-tests, also. The rejection breakdown for the Wilcoxon test
converges from above to the estimation breakdown of .29. The Wilcoxon test
is more resistant than the t-test but not as resistant as the simple sign test. It
is interesting to note that from the discussion of efficiency, it is clear that we
can now achieve high efficiency and not pay the price in lack of robustness.
The rank-based methods seem to be a very attractive alternative to the highly
resistant but relatively inefficient (at the normal model) L1 methods and the
highly efficient (at the normal model) but nonrobust L2 methods.
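As a quick numerical illustration (our own sketch, not part of the text), the approximation of Theorem 1.7.6 for a one-sided level α = .05 test is easily evaluated in R:

eps_star <- function(n, alpha = 0.05) {
  1 - sqrt(1 / 2 - qnorm(1 - alpha) / sqrt(3 * n))  # Theorem 1.7.6
}
round(sapply(c(10, 20, 50, 100, 1000), eps_star), 3)
# decreases toward the estimation breakdown 1 - 1/sqrt(2) = .293 from above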
where the scores a+(i) are generated as a+(i) = ϕ+(i/(n+1)) for a positive valued, nondecreasing, square-integrable function ϕ+(u) defined on the interval (0, 1). The proof that ‖·‖_{ϕ+} is a norm on Rⁿ follows in the same way as in the weighted L1 case; see the proof of Theorem 1.3.2 and Exercise 1.12.23. The gradient function associated with this norm is

T_{ϕ+}(θ) = Σ_{i=1}^{n} a+(R|X_i − θ|) sgn(X_i − θ) .   (1.8.2)
which converges to

δ(θ) = ∫_{−∞}^{∞} ϕ+( H(|x − θ| + θ) − H(θ − |x − θ|) ) sgn(x − θ) dH(x) = 0 .   (1.8.6)
Tables can be constructed for the null distribution of Tϕ+ (0) from which critical
values, c, can be obtained to complete the test described in (1.8.9).
For the asymptotic null distribution of Tϕ+ (0), the following additional
assumption on the scores is sufficient:
max_j a+²(j) / Σ_{i=1}^{n} a+²(i) → 0 .   (1.8.13)
i.e., the left side is a Riemann sum of the integral. Under these assumptions
and the symmetric location model, Corollary A.1.1 of the Appendix can be
used to show that the null distribution of Tϕ+ (0) is asymptotically normal;
see, also, Exercise 1.12.16. Hence, an asymptotic level α test is
Reject H0 in favor of HA if | T_{ϕ+}(0) / (√n σ_{ϕ+}) | ≥ z_{α/2} ;   (1.8.15)
see (1.5.27). These equations can be solved by the simple tracing algorithm
discussed immediately following expression (1.8.5).
where

ϕ+_h(u) = − h′( H^{−1}((u+1)/2) ) / h( H^{−1}((u+1)/2) ) .   (1.8.18)
where the third equality in (1.8.19) follows from an integration by parts. Hence
the second Pitman Regularity condition holds.
For the third condition, (1.5.10), the asymptotic linearity for the process T_{ϕ+}(θ) is given in Theorem A.2.11 of the Appendix. We restate the result here for reference:

P0[ sup_{√n|θ|≤B} | (1/√n)T_{ϕ+}(θ) − (1/√n)T_{ϕ+}(0) + θγ_h | ≥ ǫ ] → 0 ,   (1.8.20)
for all ǫ > 0 and all B > 0. Finally, the fourth condition, (1.5.11), concerns the asymptotic null distribution, which was discussed above. The null variance of T_{ϕ+}(0)/√n is given by expression (1.8.12). Therefore the process T_{ϕ+}(θ) is Pitman Regular.
As our first result, we obtain the asymptotic power lemma for the process
Tϕ+ (θ). This, of course, follows immediately from Theorem 1.5.8 so we state
it as a corollary.
Using the general result of Theorem 1.5.9, the length of the confidence
interval for θ, (1.8.16), can be used to obtain a consistent estimate of τϕ+ .
This in turn can be used to obtain a consistent estimate of the standard error
of θbϕ+ ; see Exercise 1.12.19.
The asymptotic relative efficiency between two estimates or two tests based on score functions ϕ1+(u) and ϕ2+(u) is the ratio

e(ϕ1+, ϕ2+) = c²_{ϕ1+} / c²_{ϕ2+} = τ²_{ϕ2+} / τ²_{ϕ1+} .   (1.8.25)
This can be used to compare different tests. For a specific distribution we can determine the optimum scores. Such a score should make the scale parameter τ_{ϕ+} as small as possible. This scale parameter can be written as

c_{ϕ+} = τ_{ϕ+}^{−1} = [ ∫_0^1 ϕ+(u)ϕ+_h(u) du / ( σ_{ϕ+} √(∫_0^1 (ϕ+_h(u))² du) ) ] √( ∫_0^1 (ϕ+_h(u))² du ) .   (1.8.26)

The first factor on the right side is a correlation, so it is maximized by the choice

ϕ+(u) = ϕ+_h(u) ,
where ϕ+_h(u) is given by expression (1.8.18). The quantity √(∫_0^1 (ϕ+_h(u))² du) is the square root of Fisher information; see Exercise 1.12.24. Therefore, for this choice of scores, the estimate θ̂_{ϕ+_h} is asymptotically efficient. This is the reason for calling the score function ϕ+_h the optimal score function.
It is shown in Exercise 1.12.25 that the optimal scores are the normal
scores if h(x) is a normal density, the Wilcoxon weighted L1 scores if h(x) is a
logistic density, and the L1 scores if h(x) is a double exponential density. It is
further shown that the scores generated by (1.8.3) are optimal for symmetric
densities with a logistic center and exponential tails.
From Exercise 1.12.25, the efficiency of the normal scores methods relative
to the least squares methods is
e(NS, LS) = [ ∫_{−∞}^{∞} f²(x) / φ(Φ^{−1}(F(x))) dx ]² ,   (1.8.27)
This restrains the estimating function from crossing the horizontal axis provided

− Σ_{i=1}^{[(1−ǫ)n]} a+(i) + Σ_{i=[(1−ǫ)n]+1}^{n} a+(i) > 0 .
> onesampr(x,theta0=.618,alpha=.10,score=phinscp,grad=spnsc)
While not as sensitive to the outliers as the traditional analysis, the outliers
still had some influence on the normal scores analysis. The normal scores test
rejects the null hypothesis at level 0.06 while the 90% confidence interval just
misses the value 0.618.
h_(j)(t) = [ k! / ((j−1)!(k−j)!) ] H^{j−1}(t) [1 − H(t)]^{k−j} h(t) .
We suppose the measurements are distributed as H(x) = F (x − θ) and we
wish to make a statistical inference concerning θ, such as an estimate, test, or
confidence interval. We illustrate the ideas on the L1 methods since they are
simple to work with. We also wish to compute the efficiency of the RSS L1 methods relative to the SRS L1 methods. We see that there is a substantial increase in efficiency when using the RSS design. In particular, we compare the RSS methods to SRS methods based on a sample of size nk. The RSS method was first applied by McIntyre (1952) in measuring mean pasture yields. See Hettmansperger (1995) for a development of the RSS L1 methods. The most convenient form of the RSS sign statistic is the number of positive measurements, given by

S+_{RSS} = Σ_{j=1}^{k} Σ_{i=1}^{n} I(X_{(j)i} > 0) .   (1.9.1)
Now note that S+_{RSS} can be written as S+_{RSS} = Σ_j S+_{(j)}, where S+_{(j)} = Σ_i I(X_{(j)i} > 0) has a binomial distribution with parameters n and 1 − H_(j)(0). Further, S+_{(j)}, j = 1, ..., k, are stochastically independent. It follows at once that

E S+_{RSS} = n Σ_{j=1}^{k} (1 − H_(j)(0))   (1.9.2)

Var S+_{RSS} = n Σ_{j=1}^{k} (1 − H_(j)(0)) H_(j)(0) .

With k fixed and n → ∞, it follows from the independence of S+_{(j)}, j = 1, ..., k,
that

(nk)^{−1/2} { S+_{RSS} − n Σ_{j=1}^{k} (1 − H_(j)(0)) } →_D Z ∼ n(0, ξ²) ,   (1.9.3)

and, under H0 : θ = 0,

E S+_{RSS} = nk/2  and  Var S+_{RSS} = nk[ 1/4 − k^{−1} Σ (F_(j)(0) − 1/2)² ] .
Proof: Use the fact that k^{−1} Σ F_(j)(0) = F(0) = 1/2, and the expectation formula follows at once. Note that

F_(j)(0) = [ k! / ((j−1)!(k−j)!) ] ∫_{−∞}^{0} F(t)^{j−1} (1 − F(t))^{k−j} f(t) dt ,
Table 1.9.1: Values of F_(j)(0), j = 1, ..., k, and δ² = 1 − (4/k) Σ (F_(j)(0) − 1/2)²

 j \ k:   2     3     4     5     6     7     8     9    10
  1     .750  .875  .938  .969  .984  .992  .996  .998  .999
  2     .250  .500  .688  .813  .891  .938  .965  .981  .989
  3           .125  .313  .500  .656  .773  .856  .910  .945
  4                 .063  .188  .344  .500  .637  .746  .828
  5                       .031  .109  .227  .363  .500  .623
  6                             .016  .063  .145  .254  .377
  7                                   .008  .035  .090  .172
  8                                         .004  .020  .055
  9                                               .002  .011
 10                                                     .001
 δ²     .750  .625  .547  .490  .451  .416  .393  .371  .352
ratio of variances is Var S+_{RSS} / Var S+_{SRS} = δ² = 1 − (4/k) Σ (F_(j)(0) − 1/2)². The reduction in variance is given in the last row of Table 1.9.1 and can be quite large.
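As a check on Table 1.9.1 (our own sketch, not from the text), note that under the null F_(j)(0) = P(Beta(j, k − j + 1) ≤ 1/2), so the last row can be computed in base R:

delta2 <- function(k) {
  j <- 1:k
  Fj0 <- pbeta(0.5, j, k - j + 1)    # F_(j)(0) when F(0) = 1/2
  1 - (4 / k) * sum((Fj0 - 0.5)^2)
}
round(sapply(2:10, delta2), 3)
# .750 .625 .547 .490 .451 .416 .393 .371 .352, the last row of the table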
We next show that the parameter δ is an integral part of the efficacy of
the RSS L1 methods. It is straightforward using the methods of Section 1.5
and Example 1.5.2 to show that the RSS L1 estimating function is Pitman
Regular. To compute the efficacy we first note that
Regular. To compute the efficacy we first note that

S̄_{RSS} = (nk)^{−1} Σ_{j=1}^{k} Σ_{i=1}^{n} sgn(X_{(j)i}) = (nk)^{−1} [ 2S+_{RSS} − nk ] ,
and µ′(0) = 2f(0); see Exercise 1.12.28. See Babu and Koti (1996) for a development of the exact distribution. Hence, the efficacy of the RSS L1 methods is given by

c_{RSS} = 2f(0)/δ = 2f(0) / { 1 − (4/k) Σ_{j=1}^{k} (F_(j)(0) − 1/2)² }^{1/2} .
We now summarize the inference methods and their efficiency in the fol-
lowing:
2. The estimate. (nk)^{1/2}{ med X_{(j)i} − θ } →_D Z ∼ n(0, δ²/(4f²(0))).

3. The confidence interval. Let X*_(1), ..., X*_(nk) be the ordered values of X_{(j)i}, j = 1, ..., k and i = 1, ..., n. Then [X*_(m+1), X*_(nk−m)] is a (1 − α)100% confidence interval for θ, where P(S+_{RSS} ≤ m) = α/2. Using the normal approximation we have m ≐ (nk/2) − z_{α/2} δ(nk/4)^{1/2}.
4. Efficiency. The efficiency of the RSS methods with respect to the SRS methods is given by e(RSS, SRS) = c²_{RSS}/c²_{SRS} = δ^{−2}. Hence, the reciprocal of the last line of Table 1.9.1 provides the efficiency values, and they can be quite substantial. Recall from the discussion following Definition 1.5.5 that efficiency can be interpreted as the ratio of sample sizes needed to achieve the same approximate variances, the same approximate local power, and the same confidence interval length. Hence, we write (nk)_{RSS} ≐ δ²(nk)_{SRS}. This is really the point of the RSS design. Returning to the example of estimating the volume of wood in a forest, if we let k = 5, then from Table 1.9.1, we would need to destroy and measure only about one half as many trees using the RSS method rather than the SRS method.
As a final note, we mention the problem of assessing the effect of imperfect
ranking. Suppose that the expert makes a mistake when asked to identify the
jth ordered value in a set of k observations. As expected, there is less gain
from using the RSS method. The interesting point is that if the expert simply
identifies the supposed jth ordered value by random guess then δ 2 = 1 and the
two sign tests have the same information; see Hettmansperger (1995) for more
details. Also, see Presnell and Bohn (1999) for a careful analysis of imperfect
ranking.
with confidence coefficient γ_k and interval (x_(k+1), x_(n−k)) with confidence coefficient γ_{k+1}, where γ_{k+1} ≤ γ ≤ γ_k. Then the interpolated interval is [θ̂_L, θ̂_U], where

λ = (n − k)I / (k + (n − 2k)I)  and  I = (γ_k − γ) / (γ_k − γ_{k+1}) .   (1.10.2)

We call I the interpolation factor and note that if we were using linear interpolation then λ = I. Hence, we see that the interpolation is distinctly nonlinear.
As a simple example we take n = 10 and ask for a 95% confidence interval. For k = 2 we find γ_k = .9786 and γ_{k+1} = .8907. Then I = .325 and λ = .658. Hence, θ̂_L = .342x_(2) + .658x_(3) and θ̂_U = .342x_(9) + .658x_(8). Note that linear interpolation is almost the reverse of the recommended mixture, namely λ = I = .325, and this can make a substantial difference in small samples.
The method is based on the following theorem. This theorem highlights the
nonlinear relationship between the interpolation factor and λ. After proving
the theorem, we develop an approximate solution and then show that it works
in practice.
and

γ_{k+1} = P0(X_(k+1) ≤ 0 ≤ X_(n−k)) = P0(k < S1+(0) < n − k) .

Taking the difference, we have, using C(n, k) to denote the binomial coefficient,

γ_k − γ_{k+1} = P0(S1+(0) = k) + P0(S1+(0) = n − k) = C(n, k) (1/2)^{n−1} .   (1.10.3)
We now consider the lower tail probability associated with the confidence interval. First consider

P0(X_(k+1) > 0) = (1 − γ_{k+1})/2 = ∫_0^{∞} [ n! / (k!(n−k−1)!) ] F^k(t) (1 − F(t))^{n−k−1} f(t) dt
                = P0(S1+(0) ≥ n − k) = P0(S1+(0) ≤ k) .   (1.10.4)
Let k_n = n!/[(k − 1)!(n − k − 1)!]. Then the lower end of the interpolated interval satisfies

(1 − γ)/2 = P0( (1 − λ)X_(k) + λX_(k+1) > 0 )
          = k_n ∫_0^{∞} ∫_{−λy/(1−λ)}^{y} F^{k−1}(x) (1 − F(y))^{n−k−1} f(x)f(y) dx dy
          = k_n ∫_0^{∞} (1/k) [ F^k(y) − F^k(−λy/(1−λ)) ] (1 − F(y))^{n−k−1} f(y) dy
          = (1 − γ_{k+1})/2 − (k_n/k) ∫_0^{∞} F^k(−λy/(1−λ)) (1 − F(y))^{n−k−1} f(y) dy .
Use (1.10.4) in the last line above. Now with (1.10.3), substitute into the
formula for the interpolation factor and the result follows.
Clearly, not only is the relationship between I and λ nonlinear but it also depends on the underlying distribution F. Hence, the interpolated interval is not distribution-free. There is one interesting case, given in the following corollary, in which we have a distribution-free interval.

This shows that when we sample from a symmetric distribution, the interval that lies halfway between the available intervals does not depend on the underlying distribution. Other interpolated intervals are not distribution-free. Our next theorem shows how to approximate the solution, and the solution is essentially distribution-free. We show by example that the approximate solution works in many cases.
The integrand decreases rapidly for moderate powers; hence, we expand the integrand around y = 0. First take logarithms; then

k log F(−λy/(1−λ)) = k log F(0) − k (λ/(1−λ)) (f(0)/F(0)) y + o(y)
and

(n−k−1) log(1 − F(y)) = (n−k−1) log(1 − F(0)) − (n−k−1) (f(0)/(1 − F(0))) y + o(y) .
Substitute r = λk/(1−λ) and F(0) = 1 − F(0) = 1/2 into the above equations, and add the two equations together. Add and subtract r log(1/2), and group terms so that the right side of the second equation appears on the right side along with k log(1/2) − r log(1/2). Hence, we have

k log F(−λy/(1−λ)) + (n−k−1) log(1 − F(y)) = k log(1/2) − r log(1/2)
        + (n + r − k − 1) log(1 − F(y)) + o(y) .
Thus, ∫_0^{∞} F^k(−λy/(1−λ)) (1 − F(y))^{n−k−1} f(y) dy is approximately equal to

∫_0^{∞} 2^{−(k−r)} (1 − F(y))^{n+r−k−1} f(y) dy = 1 / (2^n (n + r − k)) .   (1.10.5)

Substitute this approximation into the formula for I(λ), use r = λk/(1−λ), and the result follows.
Note that the approximation agrees with Corollary 1.10.1. In addition Ex-
ercise 1.12.29 shows that the approximation formula is exact for the dou-
ble exponential (Laplace) distribution. In Table 1.10.1 we show how well the
approximation works for several other distributions. The exact results were
obtained by numerical integration of the integral in Theorem 1.10.1. Simi-
lar close results were found for asymmetric examples. For further reading see
Hettmansperger and Sheather (1986) and Nyblom (1992).
> tm=interpci(.05,x)
Estimation of Median
Sample Median is 1.3
Confidence Interval ( 1 , 1.8 ) 89.0625 %
Confidence Interval ( 0.9315 , 2.0054 ) 95 % Interpolated
Confidence Interval ( 0.8 , 2.4 ) 97.8516 %
Note the p-value of the test is .0039 and we would easily reject the null
hypothesis at any reasonable level of significance. The interpolated 95% con-
fidence interval for θ shows the reasonable set of values of θ to be between
.9315 and 2.0054, given the level of confidence.
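For readers who wish to experiment, the interpolation (1.10.2) can be coded directly; interpci is the book's Robnp function, and the bare-bones sketch below (its name and defaults are ours) is only illustrative:

interp_ci <- function(x, gam = 0.95) {
  x <- sort(x); n <- length(x)
  # gamma_k: confidence of (x_(k), x_(n-k+1)) under the sign statistic null
  gamk <- function(k) 1 - 2 * pbinom(k - 1, n, 0.5)
  k <- max(which(sapply(1:floor(n / 2), gamk) >= gam))
  I <- (gamk(k) - gam) / (gamk(k) - gamk(k + 1))
  lam <- (n - k) * I / (k + (n - 2 * k) * I)   # expression (1.10.2)
  c((1 - lam) * x[k] + lam * x[k + 1],
    (1 - lam) * x[n - k + 1] + lam * x[n - k])
}

For n = 10 and gam = .95 this reproduces the weights .342 and .658 of the example following (1.10.2).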
X_L ≐ S_x(0)/(2mf(0)) − z_x/(2m^{1/2}f(0))  and  Y_U ≐ S_y(0)/(2nf(0)) + z_y/(2n^{1/2}f(0)) .

Since m/N → λ,

N^{1/2} X_L →_D λ^{−1/2} Z_1 ,  Z_1 ∼ n(−z_x/(2f(0)), 1/(4f²(0))) ,

and

N^{1/2} Y_U →_D (1 − λ)^{−1/2} Z_2 ,  Z_2 ∼ n(z_y/(2f(0)), 1/(4f²(0))) .

Now α_c/2 = P(X_L > Y_U) = P(N^{1/2}(Y_U − X_L) < 0) and X_L, Y_U are independent; hence,

N^{1/2}(Y_U − X_L) →_D (1 − λ)^{−1/2} Z_2 − λ^{−1/2} Z_1 ,

and (1 − λ)^{−1/2} Z_2 − λ^{−1/2} Z_1 has the distribution

N( (1/(2f(0))) [ z_x/λ^{1/2} + z_y/(1 − λ)^{1/2} ] , (1/(4f²(0))) [ 1/λ + 1/(1 − λ) ] ) .

It then follows that

P(N^{1/2}(Y_U − X_L) < 0) → Φ( −[ z_x/λ^{1/2} + z_y/(1 − λ)^{1/2} ] / [ 1/(λ(1 − λ)) ]^{1/2} ) ,

which, when simplified, yields the result in the statement of the theorem.
1 − α_c = P_∆(Y_L − X_U ≤ ∆ ≤ Y_U − X_L)

and

N^{1/2}Λ / (2z_c) → 1 / ( [λ(1 − λ)]^{1/2} 2f(0) ) .

Proof: First note that Λ = Λ_x + Λ_y, the sum of the two lengths of the X and Y intervals, respectively. Further,
Table 1.11.1: Percentages of silver in coins of the first and fourth coinages.

First   5.9  6.8  6.4  7.0  6.6  7.7  7.2  6.9  6.2
Fourth  5.3  5.6  5.5  5.1  6.2  5.8  5.8
choices for zx and zy are discussed; for example, we could choose zx and zy so
that the asymptotic standardized lengths are equal. The corresponding con-
fidence coefficients for this choice are more sensitive to unequal sample sizes
than the method proposed here.
Example 1.11.1 (Hendy and Charles Coin Data). Hendy and Charles (1970)
study the change in silver content in Byzantine coins. During the reign of
Manuel I (1143-1180) there were several mintings. We consider the research
hypothesis that the silver content changed from the first to the fourth coinage.
The data consists of nine coins identified from the first coinage and seven coins
from the fourth. We suppose that they are realizations of random samples of
coins from the two populations. The percentage of silver in each coin is given
in Table 1.11.1. Let ∆ = θ1 − θ4 where the 1 and 4 indicate the coinage.
To test the null hypothesis H0 : ∆ = 0 versus HA : ∆ ≠ 0 at α = .05,
we construct two 84% L1 confidence intervals and reject the null hypothesis
if they are disjoint. The confidence intervals can be computed by using the
Robnp function onesampsgn with the value alph=.16. Results pertinent to the
confidence intervals are:
> onesampsgn(First,alpha=.16)
> onesampsgn(Fourth,alpha=.16)
Clearly, the 84% confidence intervals are disjoint, hence, we reject the
null hypothesis at a 5% significance level and claim that the emperor appar-
ently held back a little on the fourth coinage. A 95% confidence interval for
∆ = θ1 − θ4 is found by taking the differences in the ends of the confidence
intervals: (6.4 − 5.8, 7.0 − 5.3) = (0.6, 1.7). Hence, this analysis suggests that
the difference in median percentages is someplace between .6% and 1.7%, with
a point estimate of 6.8 − 5.6 = 1.2%.
Figure 1.11.1 provides a comparison boxplot of the data for the first and
fourth coinages. If one marks these 84% confidence intervals on the plot, the
relatively large gap between the confidence intervals is apparent. Hence, there
was a sharp reduction in silver content from the first to fourth coinage. In
addition, the box for the fourth coinage is a bit more narrow than the box for
the first coinage indicating that there may be less variation (as measured by
the interquartile range) in the fourth coinage. There are no apparent outliers
as indicated by the whiskers on the boxplot. Larson and Stroup (1976) analyze
this example with a two-sample t-test.
Figure 1.11.1: Comparison boxplots of the Hendy and Charles coin data (vertical axis: percentage of silver; groups: First and Fourth coinages).
1.12 Exercises

1.12.1. Show that if ‖·‖ is a norm, then there always exists a value of θ which minimizes ‖x − θ1‖ for any x_1, ..., x_n.
1.12.2. Figure 1.12.1 displays the graph of Z(θ) versus θ for n = 20 data points (count the steps), where

Z(θ) = (1/√n) Σ_{i=1}^{20} sign(X_i − θ) ,

i.e., the standardized sign (median) process. Use this plot to answer the following:
(a) What are the minimum and maximum values of the sample?
(c) Determine a 95% confidence interval for θ, (approximate, but show on the
graph).
(d) Determine the value of the test statistic and the associated p-value for
testing H0 : θ = 0 versus HA : θ > 0.
Figure 1.12.1: Graph of Z(θ) versus θ (horizontal axis: theta).
1.12.4. Consider the L2 norm. Show that θ̂ = x̄ and that S_2(0) = √n t/√(n − 1 + t²), where t = √n x̄/s and s is the sample standard deviation. Further, show S_2(0) is an increasing function of t, so the test based on t is equivalent to S_2(0).
1.12.5. Discuss the consistency of the t-test. Is the t-test resolving?
1.12.6. Discuss the Pitman Regularity in the L2 case.
1.12.7. The following R function computes a bootstrap distribution of the
sample median.
bootmed = function(x,nb){
# Sample is in x and nb is the number of bootstraps
n = length(x)
bootmed = rep(0,nb)
for(i in 1:nb){
y = sample(x,size=n,replace=T)
bootmed[i] = median(y)
}
bootmed
}
(a) Use this code to obtain 1000 bootstrapped medians for the Shoshoni data
of Example 1.4.2. Determine the standard error of this bootstrap sample
of medians and compare it with the estimate based on the length of the
confidence interval for the Shoshoni data.
(b) Now find the mean and variance of the Shoshoni data. Use these estimates
to perform a parametric bootstrap of the sample median, as discussed
in Example 1.5.6. Determine the standard error of this parametric boot-
strap sample of medians and compare it with estimates in Part (a).
1.12.8. Using languages such as Minitab or R, obtain a plot of the test sen-
sitivity curves based on the signed-rank Wilcoxon statistic for the Cushney-
Peebles data, Example 1.4.1, similar to the sensitivity curves based on the
t-test and the sign test as shown in Figure 1.4.1.
1.12.9. In the proof of Theorem 1.5.6, show that (1.5.20) and (1.5.21) imply
that Un (b) converges to −µ′ (0) in probability, pointwise in b, i.e., Un (b) =
−µ′ (0) + op (1).
1.12.10. Suppose we are sampling from the distribution with pdf

f(x) = (3/(4Γ(2/3))) exp{ −|x|^{3/2} } ,  −∞ < x < ∞ ,

and we are considering whether to use the Wilcoxon or sign test. Using the efficacies of these tests, determine which test to use.
1.12.12. Show that (1.5.24) is scale invariant. Hence, the efficiency does not change if X is multiplied by a positive constant.
1.12.13. Show that the finite sample breakdown of the Hodges-Lehmann estimate (1.3.25) is ǫ*_n = m/n, where m is the solution to the quadratic inequality 2m² − (4n + 2)m + n² + n ≤ 0. Tabulate ǫ*_n as a function of n and show that ǫ*_n converges to 1 − 1/√2 ≐ .29.
1.12.16. Prove Theorem 1.7.1. In particular, check the conditions of the Lin-
deberg Central Limit Theorem to verify (1.7.7).
1.12.18. For the general signed-rank norm given by (1.8.1), show that the function T_{ϕ+}(θ), (1.8.2), is a decreasing step function which steps down only at the Walsh averages. Hint: First show that the ranks of |X_i − θ| and |X_j − θ| switch for θ_1 < θ_2 if and only if

θ_1 < (X_i + X_j)/2 < θ_2

(replace ranks by signs if i = j).
1.12.19. Let ϕ+ (u) be a general score function. Using the general result of
Theorem 1.5.9, show that the length of the confidence interval for θ, (1.8.16),
can be used to obtain a consistent estimate of τϕ+ . Use this to obtain a stan-
dard error for the estimate of θ based on the score function ϕ+ (u).
1.12.20. Use the results of the last exercise to write in some detail the trac-
ing algorithm, described after expression (1.8.5), for obtaining the location
estimator θbϕ+ and its associated standard error.
1.12.26. Verify that the influence function of the normal score estimate is
unbounded when the underlying distribution is normal.
1.12.29. Show that approximation (1.10.5) is exact for the double exponential
(Laplace) distribution.
1.12.30. Extend the simulation study of Example 1.8.2 to the other contaminated normal situations found in Table 1.7.1. Comment on the results. Compare the empirical results for the Wilcoxon with the asymptotic results found in the table.
The following R code performs the contaminated normal simulation dis-
cussed in Example 1.8.2. (Semicolons are end of line indicators. As indicated
in the call to onesampr, the normal scores estimator is computed by using the
gradient R function spnsc and score function phinscp.)
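The book's code itself is not reproduced here; the fragment below is our own minimal stand-in for one cell of such a study, using base R and the median of the Walsh averages in place of the Robnp calls (all settings are illustrative):

nsims <- 2000; n <- 30; eps <- 0.10; sigc <- 3
hl <- function(x) {                    # median of the Walsh averages
  w <- outer(x, x, "+") / 2
  median(w[upper.tri(w, diag = TRUE)])
}
est <- matrix(0, nsims, 2)
for (i in 1:nsims) {
  ind <- rbinom(n, 1, eps)             # contaminated normal sample
  x <- rnorm(n, sd = ifelse(ind == 1, sigc, 1))
  est[i, ] <- c(mean(x), hl(x))
}
var(est[, 1]) / var(est[, 2])          # empirical efficiency of the HL estimate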
H0 : θ = 0 versus HA : θ > 0.
Assume that T (θ) is standardized so that the decision rule of the (asymptotic)
level α test is given by
(a) For θ0 > 0, determine the asymptotic power γ(θ0 ), i.e., determine
1.12.34. The signed-rank Wilcoxon scores are optimal for the logistic distri-
bution while the sign scores are optimal for the Laplace distribution. A family
of score functions which are optimal for distributions with logistic “middles”
and Laplace “tails” are the bent scores. These are continuous score functions
ϕ+ (u) with a linear (positive slope and intercept 0) piece for 0 < u < b and
a constant piece for b < u < 1, for a specified value of b; see Policello and
Hettmansperger (1976). These are called signed-rank Winsorized Wilcoxon
scores.
(a) Obtain the standardized scores such that ∫ [ϕ+(u)]² du = 1.
(b) For these scores with b = 0.75, obtain the corresponding estimate of location and an estimate of its standard error for the following data set:

7.94 8.13 8.11 7.96 7.83 7.04 7.91 7.82
7.42 8.06 8.51 7.88 8.96 7.58 8.14 8.06

The computation can be carried out with the Robnp call onesampr(x,score=phipb,grad=sphipb,param=c(.75)).
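For concreteness, here is a sketch of such a standardized bent score function in R (our own construction from the definition above; the book's phipb may be parameterized differently):

phi_bent <- function(u, b = 0.75) {
  s <- 1 / (b * sqrt(1 - 2 * b / 3))   # slope making int_0^1 phi^2 du = 1
  ifelse(u < b, s * u, s * b)          # linear on (0, b), constant on (b, 1)
}
integrate(function(u) phi_bent(u)^2, 0, 1)$value   # approximately 1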
Chapter 2
Two-Sample Problems
2.1 Introduction
Let X1 , . . . , Xn1 be a random sample with common distribution function F (x)
and density function f (x). Let Y1 , . . . , Yn2 be another random sample, inde-
pendent of the first, with common distribution function G(x) and density g(x).
We call this the general model throughout this chapter. A natural null hypoth-
esis is H0 : F (x) = G(x). In this chapter we consider rank and sign tests of
this hypothesis. A general alternative to H0 is HA : F(x) ≠ G(x) for some x.
Except for Section 2.10 on the scale model we are generally concerned with
the alternative models where one distribution is stochastically larger than the
other; for example, the alternative that G is stochastically larger than F which
can be expressed as HA : G(x) ≤ F (x) with a strict inequality for some x.
This family of alternatives includes the location model, described next, and
the Lehmann alternative models discussed in Section 2.7, which are used in
survival analysis.
As in Chapter 1, the location models are of primary interest. For these
models G(x) = F (x − ∆) for some parameter ∆. Thus the parameter ∆
represents a shift in location between the two distributions. It can be expressed
as ∆ = θY −θX where θY and θX are the medians of the distributions of G and
F or equivalently as ∆ = µY − µX where, provided they exist, µY and µX are
the means of G and F . In the location problem the null hypothesis becomes
H0 : ∆ = 0. In addition to tests of this hypothesis we develop estimates
and confidence intervals for ∆. We call this the location model throughout
this chapter and we show that this is a generalization of the location problem
defined in Chapter 1.
As in Chapter 1 with the one-sample problems, for the two-sample prob-
lems, we offer the reader computational R functions which do the computation
for the rank-based analyses discussed in this chapter.
Z_i = ∆c_i + e_i ,  1 ≤ i ≤ n ,   (2.2.2)

where e_1, ..., e_n are iid with distribution function F(x). Let C = [c_i] denote the n × 1 design matrix and let Ω_FULL denote the column space of C. We can express the location model as

Z = C∆ + e ,   (2.2.3)

Z = 1θ + C∆ + e* .   (2.2.4)
Note that a regular norm satisfies the first three properties but in lieu
of the fourth property, the norm of a vector is 0 if and only if the vector is
0. The following inequalities establish the invariance of pseudo-norms to the
parameter θ:
H0 : ∆ = 0 versus HA : ∆ ≠ 0 .   (2.2.13)
The closer S∗ (0) is to 0 the more plausible is the hypothesis H0 . More formally,
we define the gradient test of H0 versus HA by the rejection rule,
where the critical values k and l depend on the null distribution of S∗ (0).
Typically, the null distribution of S∗ (0) is symmetric about 0 and k = −l.
The reduction in dispersion test is given by

Reject H0 in favor of HA if D*(0) − D*(∆̂) ≥ m ,
where the critical value m is determined by the null distribution of the test
statistic. In this chapter, as in Chapter 1, we are concerned with the gradient
test while in Chapter 3 we use the reduction in dispersion test. A confidence
interval for ∆ of confidence (1 − α)100% is the interval {∆ : k < S∗ (∆) < l}
and
1 − α = P∆ [k < S∗ (∆) < l] . (2.2.14)
Since D*(∆) is convex, S*(∆) is nonincreasing and, hence,

∆̂_L = inf{∆ : S*(∆) < l}  and  ∆̂_U = sup{∆ : S*(∆) > k} ;   (2.2.15)
compare (1.3.10). Often we are able to invert k < S∗ (∆) < l to find an explicit
formula for the upper and lower end points.
We discuss a large class of general pseudo-norms in Section 2.5, but now we
present the pseudo-norms that yield the pooled t-test and the Mann-Whitney-
Wilcoxon test.
Note that this pseudo-norm is the L1 norm based on the differences between
the components and that it is the second term of expression (1.3.20), which
defines the norm of the signed rank analysis of Chapter 1. Note further, that
this pseudo-norm differs from the least squares pseudo-norm in that the square
root is taken inside the double summation. In Exercise 2.13.2 the reader is
asked to show that this indeed is a pseudo-norm and that further it can be
written in terms of ranks as
‖u‖_R = 4 Σ_{i=1}^{n} ( R(u_i) − (n+1)/2 ) u_i .
Our estimate of ∆ is a value which makes the gradient zero; that is, makes half of the differences positive and the other half negative. Thus the rank-based estimate of ∆ is

∆̂_R = med_{i,j} {Y_j − X_i} .   (2.2.18)

This pseudo-norm estimate is often called the Hodges-Lehmann estimate of shift for the two-sample problem (Hodges and Lehmann, 1963). As we show in Section 2.4.4, ∆̂_R has an approximate normal distribution with mean ∆ and standard deviation τ √((1/n1) + (1/n2)), where the scale parameter τ is given in display (2.4.22).
From the gradient we define

S_R(∆) = Σ_{i=1}^{n1} Σ_{j=1}^{n2} sgn(Y_j − X_i − ∆) .   (2.2.19)

Next define

S_R+(∆) = #(Y_j − X_i > ∆) .   (2.2.20)
Note that we have (with probability one) that SR (∆) = 2SR+ (∆) − n1 n2 . The
statistic SR+ = SR+ (0), originally proposed by Mann and Whitney (1947), is
more convenient to use. The gradient test for the hypotheses (2.2.13) is
Reject H0 in favor of HA if SR+ ≤ k or SR+ ≥ n1 n2 − k ,
S_R+(∆) = W(∆) − n2(n2 + 1)/2 .   (2.2.24)
The test statistic W (0) was proposed by Wilcoxon (1945). Since it is a linear
function of the Mann-Whitney test statistic it has identical statistical prop-
erties. We refer to the statistic, SR+ , as the Mann-Whitney-Wilcoxon statistic
and label it as MWW.
As a final note on the geometry of the rank-based analysis, reconsider the model with the location functional θ in it, i.e., (2.2.4). Suppose we obtain the R estimate of ∆, (2.2.18). Let ê_R = Z − C∆̂_R denote the residuals. Next suppose we want to estimate the location parameter θ by using the weighted L1 norm which was discussed for estimation of location in Section 1.7 of Chapter 1.
Asymptotic theory for the joint distribution of the random vector (θ̂_R, ∆̂_R)′ is discussed in Chapter 3.
2.2.3 Computation

The Mann-Whitney-Wilcoxon analysis which we described above is easily computed using the Robnp function twosampwil. This function returns the value of the Mann-Whitney-Wilcoxon test statistic S_R+ = S_R+(0), (2.2.20), the estimate ∆̂, (2.2.18), the associated confidence interval (2.2.22), and comparison boxplots of the samples. Also, the R intrinsic function wilcox.test and the Minitab command MANN compute this Mann-Whitney-Wilcoxon analysis.
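For example, with samples x and y, a call along the following lines yields the test, the confidence interval, and the Hodges-Lehmann estimate (a base R illustration; the arguments shown are standard wilcox.test options):

wilcox.test(y, x, conf.int = TRUE, conf.level = 0.95)
# conf.int = TRUE adds the distribution-free interval for Delta and
# the Hodges-Lehmann estimate med{Y_j - X_i} to the output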
2.3 Examples
In this section we present two examples which illustrate the methods discussed
in the last section. The calculations were performed by the Robnp functions
twosampwil and twosampt which compute the Mann-Whitney-Wilcoxon and
LS analyses, respectively. By convention, for each difference Yj − Xi = 0,
we add the value 1/2 to the test statistic SR+ . Further, the returned p-value
is calculated with the usual continuity correction. The estimate of τ and its
standard error (SE) displayed in the results are given by expression (2.4.27),
where a full discussion is given. The LS analysis, computed by twosampt, is
based on the traditional pooled two-sample t-test.
Example 2.3.1 (Quail Data). The data for this problem are drawn from a
high-volume drug screen designed to find compounds which reduce low den-
sity lipoproteins, LDL, cholesterol in quail; see McKean, Vidmar, and Sievers
(1989) for a discussion of this screen. For the purposes of the present exam-
ple, we have taken the plasma LDL levels of one group of quail who were fed
over a specified period of time a special diet mixed with a drug compound
and the LDL levels of a second group of quail who were fed the same spe-
cial diet but without the drug compound over the same length of time. A
completely randomized design was employed. We refer to the first group as
the treatment group and the second group as the control group. The data
are displayed in Table 2.3.1. Let θC and θT denote the true median levels of
LDL for the control and treatment populations, respectively. The parameter
of interest is ∆ = θC − θT . We are interested in the alternative hypothesis that
the treatment has been effective; hence the hypotheses are:
H0 : ∆ = 0 versus HA : ∆ > 0 .
> twosampwil(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",
nameresp="LDL cholesterol")
Figure 2.3.1: Comparison boxplots of treatment and control quail LDL levels (vertical axis: LDL cholesterol; groups: Control and Treated).
> twosampt(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",
nameresp="LDL cholesterol")
The data discussed in the last example were drawn from a high-speed drug
screen to discover drug compounds which have the potential to reduce LDL
cholesterol. In this screen, if a compound was at least marginally significant
then the investigation of it would continue; otherwise, it would be eliminated
from further scrutiny. Hence, for this drug compound, the robust and LS
analyses would result in different practical outcomes.
Example 2.3.2 (Hendy-Charles Coin Data, continuation of Example 1.11.1).
Recall that the 84% L1 confidence intervals for the data are disjoint. Thus
we reject the null hypothesis that the silver content is the same for the two
mintings at the 5% level. We now apply the MWW test and confidence interval
to this data and find the Hodges-Lehmann estimate of shift. If the tailweights
of the underlying distributions are moderate, the MWW methods are more
efficient.
The output from the Robnp function twosampwil is:
> twosampwil(Fourth,First)
2.4.1 Testing
Although the geometric motivation of the test statistic SR+ was derived under
the location model, the test can be used for more general models. Recall that
the general model is comprised of a random sample X1 , . . . , Xn1 with cdf F (x)
and a random sample Y1 , . . . , Yn2 with cdf G(x). For the discussion we select
the hypotheses,
H0 : F (x) = G(x), for all x, versus
HA : F (x) ≥ G(x), with strict inequality for some x. (2.4.1)
Under this stochastically ordered alternative Y tends to dominate X; i.e.,
P (Y > X) > 1/2. Our rank-based decision rule is to reject H0 in favor of
HA if SR+ is too large, where SR+ = #(Yj − Xi > 0). Our immediate goal
is to make this precise. What we discuss, of course, holds for the other one-
sided alternative F (x) ≤ G(x) and the two-sided alternative F (x) ≤ G(x) or
F (x) ≥ G(x) as well. Furthermore, since the location model is a submodel of
the general model, what holds for the general model holds for it also. It will
always be clear which set of hypotheses is being considered.
Under H0 , we first show that SR+ is distribution free and then show it is
symmetrically distributed about (n1 n2 )/2.
Theorem 2.4.1. Under the general null hypothesis in (2.4.1), SR+ is distribu-
tion free.
Theorem 2.4.3. Under the general null hypothesis in (2.4.1), let P_{n1,n2}(k) = P_{H0}[S_R+ = k]. Then

P_{n1,n2}(k) = [n2/(n1 + n2)] P_{n1,n2−1}(k − n1) + [n1/(n1 + n2)] P_{n1−1,n2}(k) ,

where P_{n1,n2}(k) satisfies the boundary conditions P_{i,j}(k) = 0 if k < 0, and P_{i,0}(k) and P_{0,j}(k) are 1 or 0 as k = 0 or k ≠ 0.
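A direct R transcription of this recursion (a naive, unmemoized sketch of ours, adequate for small n1 and n2):

pnull <- function(n1, n2, k) {
  if (k < 0) return(0)
  if (n1 == 0 || n2 == 0) return(as.numeric(k == 0))
  n2 / (n1 + n2) * pnull(n1, n2 - 1, k - n1) +
    n1 / (n1 + n2) * pnull(n1 - 1, n2, k)
}
sum(sapply(0:6, function(k) pnull(2, 3, k)))   # the null probabilities sum to 1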
(a) the general model where X has distribution function F (x) and Y has
distribution function G(x);
Of course, from Theorem 2.4.2, the null mean of SR+ is (n1 n2 )/2. In our deriva-
tion we repeatedly make use of the fact that if H is the distribution function
of a random variable Z then the random variable H(Z) has a uniform distri-
bution over the interval (0, 1); see Exercise 2.13.5.
Theorem 2.4.4. Assuming that X_1, ..., X_{n1} are iid F(x) and Y_1, ..., Y_{n2} are iid G(x), and that these two samples are independent of one another, the means of S_R+ under the three models (a)-(c) are:

(a) E[S_R+] = n1 n2 [1 − E[G(X)]] = n1 n2 E[F(Y)]
(b) E[S_R+] = n1 n2 [1 − E[F(X − ∆)]] = n1 n2 E[F(X + ∆)]
(c) E[S_R+] = n1 n2 / 2 .
Proof: We prove only (a), since results (b) and (c) follow directly from it. We can write S_R+ in terms of indicator functions as

S_R+ = Σ_{i=1}^{n1} Σ_{j=1}^{n2} I(Y_j − X_i > 0) ,   (2.4.2)

where the second equality follows from the independence of X and Y. The results then follow.
Theorem 2.4.5. The variances of S_R+ under the models (a)-(c) are:

(a) Var[S_R+] = n1 n2 (E[G(X)] − E²[G(X)]) + n1 n2 (n1 − 1) Var[F(Y)] + n1 n2 (n2 − 1) Var[G(X)]
(b) Var[S_R+] = n1 n2 (E[F(X − ∆)] − E²[F(X − ∆)]) + n1 n2 (n1 − 1) Var[F(Y)] + n1 n2 (n2 − 1) Var[F(X − ∆)]
(c) Var[S_R+] = n1 n2 (n + 1) / 12 .
Proof: Again, only the result (a) is obtained. Using the indicator formulation of S_R+, (2.4.2), we have

Var[S_R+] = Σ_{i=1}^{n1} Σ_{j=1}^{n2} Var[I(Y_j − X_i > 0)]
          + Σ_{i=1}^{n1} Σ_{j=1}^{n2} Σ_{l=1}^{n1} Σ_{k=1}^{n2} Cov[I(Y_j − X_i > 0), I(Y_k − X_l > 0)] ,
where the sums for the covariance terms are over all possible combinations except (i, j) = (l, k). For the first term, note that the variance of I(Y − X > 0) is

P(Y > X)[1 − P(Y > X)] = E[G(X)] − E²[G(X)] .

This yields the first term in (a). For the covariance terms, note that a covariance is 0 unless either j = k or i = l. This leads to the following two cases:
Case (ii) The terms for the covariances where i = l and j 6= k follow similarly
to Case (i). This leads to the third and final term of (a).
The last two theorems suggest that the random variable

Z = ( S_R+ − n1 n2 / 2 ) / √( n1 n2 (n + 1) / 12 )

has an approximate N(0, 1) distribution under H0. This follows from the next results, which yield the asymptotic distribution of S_R+ under general alternatives as well as under the null hypothesis. We obtain these results by projecting our statistic S_R+ down onto a set of linear combinations of independent random variables. Then we can use central limit theory on the projection. See Hájek and Šidák (1967) for a discussion of this technique.
Let T = T(Z_1, ..., Z_n) be a random variable based on a sample Z_1, ..., Z_n such that E[T] = 0. Let

p*_k(x) = E[T | Z_k = x] ,  k = 1, ..., n ,

and define T_p = Σ_{k=1}^{n} p*_k(Z_k). In the next theorem we show that T_p is the projection of T onto the space of linear functions of Z_1, ..., Z_n. Note that unlike T, T_p is a linear combination of independent random variables; hence, its asymptotic distribution is often easier to obtain than that of T. As the following projection theorem shows, it is in a sense the "closest" linear function of the form Σ p_i(Z_i) to T.
Theorem 2.4.6. If W = Σ_{i=1}^{n} p_i(Z_i), then E[(T − W)²] is minimized by taking p_i(x) = p*_i(x). Furthermore, E[(T − T_p)²] = Var[T] − Var[T_p].
Hence the cross-product term is zero and, therefore, the left side of expression (2.4.4) is minimized with respect to W by taking W = T_p. Also, since this holds, in particular, for W = 0, we get

E[T²] = E[(T − T_p)²] + E[T_p²] .

Since both T and T_p have zero means, the second result of the theorem also follows.
From these results a strategy for obtaining the asymptotic distribution of T is apparent. Namely, find the asymptotic distribution of its projection, T_p, and then show Var[T] − Var[T_p] → 0 as n → ∞. This implies that T and T_p have the same asymptotic distribution; see Exercise 2.13.7. We apply this strategy to get the asymptotic distribution of the rank-based methods. As a first step, we obtain the projection of S_R+ − E[S_R+] under the general model.
Theorem 2.4.7. Under the general model, the projection of the random variable S_R+ − E[S_R+] is

T_p = n1 Σ_{j=1}^{n2} (F(Y_j) − E[F(Y_j)]) − n2 Σ_{i=1}^{n1} (G(X_i) − E[G(X_i)]) .   (2.4.5)
or, similarly,

P(Y > X) = E[F(Y)] .

From (a) of Theorem 2.4.4,

E[S_R+] = n1 n2 (1 − E[G(X)]) = n1 n2 P(Y > X) .

Substituting these three results into (2.4.6), we get the desired result.
The next corollary follows immediately.
Note that the asymptotic variance of T_p* is not zero under (E.3). We are now in a position to find the asymptotic distribution of T_p*.

Theorem 2.4.8. Under the general model and the assumptions (D.1) and (E.3), T_p* has an asymptotic N(0, λ1 Var(F(Y)) + λ2 Var(G(X))) distribution.
Theorem 2.4.9. Under the general model and the conditions (E.3) and (D.1), the random variable (S_R+ − E[S_R+]) / √(Var(S_R+)) has a limiting N(0, 1) distribution.

Proof: By the last theorem and Theorem 2.4.6, we need only show that the difference in the variances of S_R+/√(n n1 n2) and T_p* goes to 0 as n → ∞. Note that

Var( S_R+/√(n n1 n2) ) = (n1 n2/(n n1 n2)) (E[G(X)] − E[G(X)]²)
    + (n1 n2 (n1 − 1)/(n n1 n2)) Var(F(Y)) + (n1 n2 (n2 − 1)/(n n1 n2)) Var(G(X)) .

Hence, Var(T_p*) − Var(S_R+/√(n n1 n2)) → 0 and the result follows from Exercise (2.13.7).
The asymptotic distribution of the test statistic under the null hypothesis follows immediately from this theorem. We record it in the next corollary.

z = ( S_R+ − n1 n2 / 2 ) / √( n1 n2 (n + 1) / 12 )

and

1 − Φ(z_{α/2}) = α/2 .

Since we approximate a discrete random variable with a continuous one, we think it is advisable in cases of small samples to use a continuity correction. Fix and Hodges (1955) give an Edgeworth approximation to the distribution of S_R+, and Bickel (1974) discusses the error of this approximation.
Since the standard normal distribution function, Φ, is continuous on the entire real line, we can strengthen the convergence in Theorem 2.4.9 to uniform convergence; that is, the distribution function of the standardized MWW converges uniformly to Φ. Using this, it is not hard to show that the standardized critical values of the MWW converge to their counterparts at the standard normal. Thus if c_{α,n} is the MWW critical value defined by α = P_{H0}[S_R+ ≥ c_{α,n}], then

( c_{α,n} − n1 n2 / 2 ) / √( n1 n2 (n + 1) / 12 ) → z_α ,   (2.4.10)
where 1 − α = Φ(zα ); see Exercise 2.13.8 for details. This result proves useful
in the next section.
We now consider when the test based on SR+ is consistent. Consider the
general set up; i.e., X1 , . . . , Xn1 is a random sample with distribution function
F (x) and Y1 , . . . , Yn2 is a random sample with distribution function G(x).
Consider the hypotheses
H0 : F = G versus HA1 : F (x) ≥ G(x) with F (x0 ) > G(x0 ) for some x0 ,
(2.4.11)
where x0 ∈ Int(S(F ) ∩ S(G)). Such an alternative is called a stochastically
ordered alternative. The next theorem shows that the MWW test statistic
is consistent for this alternative. Likewise it is consistent for the other one-
sided stochastically ordered alternative with F and G interchanged, HA2 , and,
also, for the two-sided alternative which consists of the union of HA1 and HA2 .
These results imply that the MWW test is consistent for location alternatives,
provided F and G have overlapping support. As Exercise 2.13.9 shows, it is
also consistent when one support is shifted to the right of the other support.
Theorem 2.4.10. Suppose that the assumptions (D.1), (2.4.7), and (E.3),
(2.4.8), hold. Under the stochastic ordering alternatives given above, SR+ is a
consistent test.
Proof: Assume the stochastic ordering alternative HA1 , (2.4.11). For an arbi-
trary level α, select the critical level cα such that the test that rejects H0 if
SR+ ≥ cα has asymptotic level α. We want to show that the power of the test
goes to 1 as n → ∞. Since F (x0 ) > G(x0 ) for some point x0 in the interior of
S(F ) ∩ S(G), there exists an interval N such that F (x) > G(x) on N. Hence
E_{HA}[G(X)] = ∫_N G(y)f(y) dy + ∫_{N^c} G(y)f(y) dy
             < ∫_N F(y)f(y) dy + ∫_{N^c} F(y)f(y) dy = 1/2 .   (2.4.12)
where κ is a real number (since the variances are of the same order). But by (2.4.12),

[ (n1 n2/2) − E_{HA}(S_R+) ] / √(V_{HA}(S_R+)) = [ (n1 n2/2) − n1 n2 (1 − E_{HA}(G(X))) ] / √(V_{HA}(S_R+))
    = n1 n2 [ E_{HA}(G(X)) − 1/2 ] / √(V_{HA}(S_R+)) → −∞ .

By Theorem 2.4.9, under HA the random variable (S_R+ − E_{HA}(S_R+)) / √(V_{HA}(S_R+)) converges in distribution to a standard normal variate. Since the convergence is uniform, it follows from the above limits that the power converges to 1. Hence, the MWW test is consistent.
Thus [D_(c_{α/2}+1), D_(n1 n2 − c_{α/2})) is a (1 − α)100% confidence interval for ∆; compare (1.3.30). From the asymptotic null distribution theory for S_R+, Corollary 2.4.2, we can approximate c_{α/2} as

c_{α/2} ≐ n1 n2/2 − z_{α/2} √( n1 n2 (n + 1)/12 ) − .5 .   (2.4.13)
Results similar to those given below can be obtained for the power function of
the other one-sided and the two-sided alternatives. Given a level α, let cα,n1 ,n2
denote the upper critical value for the MWW test of this hypothesis; hence,
the test rejects H0 if SR+ ≥ cα,n1 ,n2 . The power function of this test is given by
From Lemma 2.4.1 and Theorem 2.4.11 we have our first important result
on the power function of the MWW test; namely, that it is monotone.
Theorem 2.4.12. For the above hypotheses (2.4.14), the function γ(∆) is
monotonically increasing in ∆.
Proof: Let ∆1 < ∆2 . Then −∆2 < −∆1 and, hence, from Lemma 2.4.1, we
have SR+ (−∆2 ) ≥ SR+ (−∆1 ). By applying Theorem 2.4.11, the desired result,
γ(∆2 ) ≥ γ(∆1 ), follows from the following:
From this we immediately have that the MWW test is unbiased; that is,
its power function evaluated at an alternative is always at least as large as its
level of significance. We state it as a corollary.
Corollary 2.4.3. For the above hypotheses (2.4.14), γ(∆) ≥ α for all ∆ > 0.
If T is any test for these hypotheses with critical region C, then we say T is a size α test provided

sup_{∆≤0} P_∆[T ∈ C] = α .

For selected α, it follows from the monotonicity of the MWW power function that the MWW test has size α for this more general null hypothesis.
From the above theorems, we have that the MWW power function is mono-
tonically increasing in ∆. Since SR+ (∆) achieves its maximum for ∆ finite, we
have by Theorem 1.5.2 of Chapter 1 that the MWW test is resolving; hence,
its power function approaches one as ∆ → ∞. Even for the location model,
though, we cannot get the power function of the MWW test in closed form.
For local alternatives, however, we can obtain an asymptotic expression for the
power function. Applications of this result include sample size determination
for the MWW test and efficiency comparisons of the MWW with other tests,
both of which we consider.
We need the assumption that the density f(x) has finite Fisher information, i.e.,

(E.1)  f is absolutely continuous, 0 < I(f) = ∫_0^1 ϕ_f²(u) du < ∞ ,   (2.4.16)

where

ϕ_f(u) = − f′(F^{−1}(u)) / f(F^{−1}(u)) .   (2.4.17)

As discussed in Section 3.4, assumption (E.1) implies that f is uniformly bounded.
Once again we consider the one-sided alternative, (2.4.14) (similar results hold for the other one-sided and two-sided alternatives). Consider a sequence of local alternatives of the form

H_{An} : ∆_n = δ/√n ,   (2.4.18)
condition is true. Hence we need only show that the third condition, asymptotic linearity of S̄_R+(∆), is true. This follows provided we can show the variance condition (1.5.18) of Theorem 1.5.6 is true. Note that

S̄_R+(δ/√n) − S̄_R+(0) = −(n1 n2)^{−1} #(0 < Y_j − X_i ≤ δ/√n) .

This is similar to the MWW statistic itself. Using essentially the same argument as that for the variance of the MWW statistic, Theorem 2.4.5, we get

nVar_0[ S̄_R+(δ/√n) − S̄_R+(0) ] = (n/(n1 n2))(a_n − a_n²) + (n(n1 − 1)/(n1 n2))(b_n − c_n²)
    + (n(n2 − 1)/(n1 n2))(d_n − a_n²) ,

where a_n = E_0[F(X + δ/√n) − F(X)], b_n = E_0[(F(Y) − F(Y − δ/√n))²], c_n = E_0[F(Y) − F(Y − δ/√n)], and d_n = E_0[(F(X + δ/√n) − F(X))²]. Using the Lebesgue Dominated Convergence Theorem, it is easy to see that a_n, b_n, c_n, and d_n all converge to 0. Therefore Condition (1.5.18) of Theorem 1.5.6 holds and we have thus established the asymptotic linearity result given by:

sup_{|δ|≤B} | n^{1/2} S̄_R+(δ/√n) − n^{1/2} S̄_R+(0) + δ ∫ f²(x) dx | →_P 0 ,   (2.4.20)

for any B > 0. Therefore, it follows that S_R+(∆) is Pitman Regular.
In order to get the efficacy of the MWW test, we need the quantity σ 2 (0)
defined by
In Exercise 2.13.10 it is shown that the efficacy of the two-sample pooled t-test is √(λ1 λ2) σ^{−1}, where σ² is the common variance of X and Y. Hence the
efficiency of the MWW test to the two-sample t-test is the ratio σ 2 /τ 2 . This
of course is the same efficiency as that of the signed rank Wilcoxon test to
the one-sample t-test; see (1.7.13). In particular if the distribution of X is
normal then the efficiency of the MWW test to the two-sample t-test is .955.
For heavier tailed distributions, this efficiency is usually larger than 1; see
Example 1.7.1.
As in Chapter 1 it is convenient to summarize the asymptotic linearity
result as follows:
√n { (S̄_R+(δ/√n) − µ(0)) / σ(0) } = √n { (S̄_R+(0) − µ(0)) / σ(0) } − c_{MWW} δ + o_p(1) ,   (2.4.23)
In Exercise 2.13.10, it is shown that if γ_{LS}(∆) denotes the power function of the usual two-sample t-test, then

lim_{n→∞} γ_{LS}(∆_n) = 1 − Φ( z_α − √(λ1 λ2) δ/σ ) ,   (2.4.24)
2.4.4 Estimation of ∆

Recall from the geometry earlier in this chapter that the estimate of ∆ based on the rank pseudo-norm is ∆̂_R = med_{i,j}{Y_j − X_i}, (2.2.18). We now obtain several properties of this estimate, including its asymptotic distribution. This leads again to the efficiency properties of the rank-based methods discussed in the last section.

For convenience, we note some equivariances of ∆̂_R = ∆̂(Y, X), which are established in Exercise 2.13.11. First, ∆̂_R is translation equivariant; i.e.,

∆̂_R(Y + ∆ + θ, X + θ) = ∆̂_R(Y, X) + ∆ ,

and scale equivariant; i.e., ∆̂_R(aY, aX) = a∆̂_R(Y, X), for any a. Based on these we next show that ∆̂_R is an unbiased estimate of ∆ under certain conditions.

Theorem 2.4.14. If the errors, e*_i, in the location model (2.2.4) are symmetrically distributed about 0, then ∆̂_R is symmetrically distributed about ∆.
e(∆̂_R, ∆̂_LS) = σ²/τ² = 12 σ_f² ( ∫ f²(x) dx )² .

This agrees with the asymptotic relative efficiency results for the MWW test relative to the t-test and (1.7.13).
∆̂_R ± L_{1−α}/2 .   (2.4.29)

Besides the estimate given in (2.4.26), a consistent estimate of τ was proposed by Koul, Sievers, and McKean (1987) and is discussed in Section 3.7. Using this estimate, small sample studies indicate that z_{α/2} should be replaced by the t critical value t_{(α/2,n−1)}; see McKean and Sheather (1991) for a review of small sample studies on R estimates. In this case, the symmetric confidence interval based on ∆̂_R is directly analogous to the usual t interval based on least squares in that the only difference is that σ̂ is replaced by τ̂.
Example 2.4.1 (Hendy and Charles Coin Data, continued from Examples 1.11.1 and 2.3.2). Recall from Chapter 1 that this example concerned the silver content in two coinages (the first and the fourth) minted during the reign of Manuel I. The data are given in Chapter 1. The Hodges-Lehmann estimate of the difference between the first and the fourth coinage is 1.10% of silver, and a 95% confidence interval for the difference is (.60, 1.70). The length of this confidence interval is 1.10; hence, the estimate of τ given in expression (2.4.27) is 0.595. The symmetrized confidence interval (2.4.28) based on the upper .025 t critical value is (0.46, 1.74). Both of these intervals are in agreement
with the confidence interval obtained in Example 1.11.1 based on the two L1
confidence intervals.
The results for the other one-sided and two-sided alternatives are similar. We
are also concerned with estimation and confidence intervals for ∆. As in the
preceding sections, we first present the geometry.
Recall that the pseudo-norm which generated the MWW analysis could
be written as a linear combination of ranks times residuals. This is easily
generalized. Consider the function
‖u‖* = Σ_{i=1}^{n} a(R(u_i)) u_i ,   (2.5.2)

where a(i) are scores such that a(1) ≤ ··· ≤ a(n) and Σ a(i) = 0. For the next theorem, we also assume that a(i) = −a(n + 1 − i), although this is only used to show the scalar multiplicative property.

Theorem 2.5.1. Suppose that a(1) ≤ ··· ≤ a(n), Σ a(i) = 0, and a(i) = −a(n + 1 − i). Then the function ‖·‖* is a pseudo-norm.

Proof: By the connection between ranks and order statistics we can write

‖u‖* = Σ_{i=1}^{n} a(i) u_(i) .
Next suppose that u_(j) is the last order statistic with a negative score. Since the scores sum to 0, we can write

‖u‖* = Σ_{i=1}^{n} a(i)(u_(i) − u_(j))
     = Σ_{i≤j} a(i)(u_(i) − u_(j)) + Σ_{i≥j} a(i)(u_(i) − u_(j)) .   (2.5.3)

Both terms on the right side are nonnegative; hence, ‖u‖* ≥ 0. Since all the terms in (2.5.3) are nonnegative, ‖u‖* = 0 implies that all the terms are zero. But since the scores are not all 0, yet sum to zero, we must have a(1) < 0 and a(n) > 0. Hence we must have u_(1) = u_(j) = u_(n); i.e., u_(1) = ··· = u_(n). Conversely, if u_(1) = ··· = u_(n), then ‖u‖* = 0. By the condition a(i) = −a(n + 1 − i) it follows that ‖αu‖* = |α|‖u‖*; see Exercise 2.13.16.
In order to complete the proof we need to show the triangle inequality
holds. This is established by the following string of inequalities:
‖u + v‖* = Σ_{i=1}^n a(R(ui + vi))(ui + vi)
         = Σ_{i=1}^n a(R(ui + vi)) ui + Σ_{i=1}^n a(R(ui + vi)) vi
         ≤ Σ_{i=1}^n a(i) u(i) + Σ_{i=1}^n a(i) v(i)
         = ‖u‖* + ‖v‖* .
The proof of the above inequality is similar to that of Theorem 1.3.2 of Chapter
1.
Based on a set of scores satisfying the above assumptions, we can establish a rank inference for the two-sample problem similar to the MWW analysis. We do so for general rank scores of the form aϕ(i) = ϕ(i/(n + 1)), where ϕ(u) is a nondecreasing score function standardized so that ∫ ϕ(u) du = 0 and ∫ ϕ²(u) du = 1; see, also, Assumption (S.1), (3.4.10), of Chapter 3. The last assumptions concerning standardization of the scores are for convenience. The Wilcoxon scores are generated in this way by the linear function ϕ_R(u) = √12 (u − (1/2)) and
the sign scores are generated by ϕ_S(u) = sgn(2u − 1). We denote the corresponding pseudo-norm for scores generated by ϕ(u) as

‖u‖ϕ = Σ_{i=1}^n aϕ(R(ui)) ui . (2.5.6)
These two-sample sign and Wilcoxon scores are generalizations of the sign
and Wilcoxon scores discussed in Chapter 1 for the one-sample problem. In
Section 1.8 of Chapter 1 we presented one-sample analyses based on general
score functions. Similar to the sign and Wilcoxon cases, we can generate a
two-sample score function from any one-sample score function. For reference
we establish this in the following theorem:
Theorem 2.5.2. As discussed at the beginning of Section 1.8, let ϕ⁺(u) be a score function for the one-sample problem. For u ∈ (−1, 0), let ϕ⁺(u) = −ϕ⁺(−u). Define

ϕ(u) = ϕ⁺(2u − 1) , 0 < u < 1 , (2.5.7)

and

‖x‖ϕ = Σ_{i=1}^n ϕ(R(xi)/(n + 1)) xi . (2.5.8)

Then ‖·‖ϕ is a pseudo-norm and

∫₀¹ ϕ²(u) du = ∫₀¹ (ϕ⁺(u))² du . (2.5.10)
Since the test statistic only depends on the ranks of the combined sample it is distribution free under the null hypothesis. As shown in Exercise 2.13.18,

E₀[Sϕ] = 0 (2.5.14)

and

σϕ² = V₀[Sϕ] = ( n1 n2 / (n(n − 1)) ) Σ_{i=1}^n a²(i) . (2.5.15)
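For illustration, a minimal R sketch of this general scores test, standardizing Sϕ by (2.5.14) and (2.5.15) (the function name rank.score.test and the samples x and y are hypothetical), is:

## Two-sample rank test for a general score function phi
rank.score.test <- function(x, y, phi) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  a <- phi((1:n) / (n + 1))          # scores a(i) = phi(i/(n+1))
  a <- a - mean(a)                   # center so that sum a(i) = 0
  r <- rank(c(x, y))                 # ranks in the combined sample
  S <- sum(a[r[(n1 + 1):n]])         # S_phi: sum of scores of the Y-ranks
  V <- (n1 * n2 / (n * (n - 1))) * sum(a^2)   # null variance (2.5.15)
  z <- S / sqrt(V)
  c(S = S, z = z, p.value = 1 - pnorm(z))     # one-sided, HA: Delta > 0
}
## e.g., Wilcoxon scores: rank.score.test(x, y, function(u) sqrt(12) * (u - 0.5))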
see Chernoff and Savage (1958) for a rigorous proof of the limit. Differentiating µϕ(∆) and evaluating the derivative at 0 we obtain

µ′ϕ(0) = λ1 λ2 ∫_{−∞}^{∞} ϕ′[F(t)] f²(t) dt
       = λ1 λ2 ∫_{−∞}^{∞} ϕ[F(t)] ( −f′(t)/f(t) ) f(t) dt
       = λ1 λ2 ∫₀¹ ϕ(u) ϕ_f(u) du = λ1 λ2 τϕ⁻¹ > 0 . (2.5.25)
linearity. This result follows from the results for general rank regression statis-
tics which are developed in Section A.2.2 of the Appendix. By Theorem A.2.8
of the Appendix, the asymptotic linearity result for Sϕ (∆) is given by
(1/√n) Sϕ(δ/√n) = (1/√n) Sϕ(0) − τϕ⁻¹ λ1 λ2 δ + op(1) . (2.5.26)

The efficacy of the test based on Sϕ is then

cϕ = τϕ⁻¹ λ1 λ2 / √(λ1 λ2) = τϕ⁻¹ √(λ1 λ2) . (2.5.27)
Based on this result, sample size determination for the test based on Sϕ can
be conducted similar to that based on the MWW test statistic; see (2.4.25).
Next consider the asymptotic distribution of the estimator ∆̂ϕ. Recall that the estimate ∆̂ϕ solves the equation Sϕ(∆̂ϕ) ≐ 0. Based on Pitman Regularity and Theorem 1.5.7 of Chapter 1, the asymptotic distribution of ∆̂ϕ is given by

√n (∆̂ϕ − ∆) →_D N(0, τϕ² (λ1 λ2)⁻¹) ; (2.5.29)
can write

τϕ⁻¹ = ∫ ϕ(u) ϕ_f(u) du
     = √( ∫ ϕ_f²(u) du ) · [ ∫ ϕ(u) ϕ_f(u) du / ( √( ∫ ϕ_f²(u) du ) √( ∫ ϕ²(u) du ) ) ]
     = ρ √( ∫ ϕ_f²(u) du ) , (2.5.31)

where ρ is the correlation coefficient between ϕ and ϕ_f (recall that ∫ ϕ²(u) du = 1).
Example 2.5.1 (Quail Data, continued from Example 2.3.1). In the larger
study, McKean et al. (1989), from which these data were drawn, the responses
were positively skewed with long right tails, although outliers frequently oc-
curred in the left tail also. McKean et al. conducted an investigation of esti-
mates of the score functions for over 20 of these experiments. Classes of simple scores which seemed appropriate for such data were piecewise linear: linear on the interval (0, b) and constant on (b, 1); i.e., scores of the form

ϕ_b(u) = { (2/(b(2 − b))) u − 1   if 0 < u < b
         { b/(2 − b)             if b ≤ u < 1 . (2.5.33)
These scores are optimal for densities with left logistic and right exponential tails; see Exercise 2.13.19. A value of b which seemed appropriate for this type of data was 3/4. Let S_{3/4} = Σ a_{3/4}(R(Yj)) denote the test statistic based on these scores. The Robnp function phibentr with the argument param = 0.75 computes these scores. Using the Robnp function twosampr2 with the argument score = phibentr computes the rank-based analysis for the score function (2.5.33). Assuming that the treated and control observations are in x and y, respectively, the call and the resulting analysis for a one-sided test as computed by R is:
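A minimal base-R sketch of the same test statistic, using the bent scores (2.5.33) with b = 3/4 and the null moments (2.5.14)-(2.5.15) (phi.bent below is our stand-in for phibentr, and x and y hold the two samples), is:

## Bent score function (2.5.33), b = 3/4
phi.bent <- function(u, b = 0.75) {
  ifelse(u < b, (2 / (b * (2 - b))) * u - 1, b / (2 - b))
}
n1 <- length(x); n2 <- length(y); n <- n1 + n2
a <- phi.bent((1:n) / (n + 1)); a <- a - mean(a)   # centered scores
S34 <- sum(a[rank(c(x, y))[(n1 + 1):n]])           # S_{3/4}
V <- (n1 * n2 / (n * (n - 1))) * sum(a^2)          # null variance (2.5.15)
1 - pnorm(S34 / sqrt(V))                           # one-sided p-value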
Comparing p-values, the analysis based on the score function (2.5.33) is a little
more precise than the MWW analysis given in Example 2.3.1. Recall that the
data are right skewed, so this result is not surprising.
For another class of scores similar to (2.5.33), see the discussion around
expression (3.10.6) in Chapter 3.
where the efficacies are given by expressions (2.5.27) and (1.8.21). Hence the
efficiency and asymptotic properties of the one- and two-sample analyses are
the same. As a final remark, if we write the model as in expression (2.2.4),
then we can use the rank statistic based on the two-sample scores to estimate
∆. We next form the residuals Zi − ∆̂ ci. Then, using the one-sample scores
statistic of Chapter 1, we can estimate θ based on these residuals, as discussed
in Chapter 1. In terms of a regression problem we are estimating the intercept
parameter θ based on the residuals after fitting the regression coefficient ∆.
This is discussed in some detail in Section 3.5.
2.6 L1 Analyses
In this section, we present analyses based on the L1 norm and pseudo-norm.
We discuss the pseudo-norm first, showing that the corresponding test is
the familiar Mood’s (1950) test. The test which corresponds to the norm is
Mathisen’s (1943) test.
Recall that the pseudo-norm based on the Wilcoxon scores can be expressed
as the sum of all absolute differences between the components; see (2.2.17).
In contrast, the pseudo-norm based on the sign scores only involves the n/2 symmetric absolute differences |u(i) − u(n−i+1)|.
In the two-sample location model the corresponding R estimate based on
the pseudo-norm (2.6.1) is a value of ∆ which solves the equation
Sϕ(∆) = Σ_{j=1}^{n2} sgn( R(Yj − ∆) − (n + 1)/2 ) ≐ 0 . (2.6.3)
Note that we are ranking the set {X1 , . . . , Xn1 , Y1 − ∆, . . . , Yn2 − ∆} which
is equivalent to ranking the set {X1 − med Xi , . . . , Xn1 − med Xi , Y1 − ∆ −
med Xi , . . . , Yn2 − ∆ − med Xi }. We must choose ∆ so that half of the ranks
of the Y part of this set are above (n + 1)/2 and half are below. Note that in
the X part of the second set, half of the X part is below 0 and half is above
0. Thus we need to choose ∆ so that half of the Y part of this set is below 0
and half is above 0. This is achieved by taking
∆̂ = med Yj − med Xi . (2.6.4)
This is the same estimate as produced by the L1 norm; see the discussion following (2.2.5). We refer to the above pseudo-norm (2.6.1) as the L1 pseudo-norm. Actually, as pointed out in Section 2.2, this equivalence between estimates based on the L1 norm and the L1 pseudo-norm is true for general regression problems in which the model includes an intercept, as it does here.
The corresponding test statistic for H0 : ∆ = 0 is Σ_{j=1}^{n2} sgn( R(Yj) − (n + 1)/2 ). Note that the sgn function here is only counting the number of Yj's which are above the combined sample median M̂ = med{X1, . . . , Xn1, Y1, . . . , Yn2} minus the number below M̂. Hence a more convenient but equivalent test statistic is

M0⁺ = #(Yj > M̂) , (2.6.5)

which is called Mood's median test statistic; see Mood (1950).
Testing
Since this L1 analysis is based on a rank-based pseudo-norm we could use the
general theory discussed in Section 2.5 to handle the theory for estimation and
testing. As we point out, though, there are some interesting results pertaining to this analysis.
For the null distribution of M0⁺, first assume that n is even. Without loss of generality, assume that n = 2r and n1 ≥ n2. Consider the combined sample as a population of n items, where n2 of the items are Y's and n1 items are X's. Think of the n/2 items which exceed M̂. Under H0 these items are as likely to be an X as a Y. Hence M0⁺, the number of Y's in the top half of the sample, follows the hypergeometric distribution; i.e.,

P(M0⁺ = k) = C(n2, k) C(n1, r − k) / C(n, r) , k = 0, . . . , n2 ,

where r = n/2 and C(a, b) denotes the binomial coefficient. If n is odd the same result holds except in this case r = (n − 1)/2. Thus as a level α decision rule, we would reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if M0⁺ ≥ cα, where cα could be determined from the hypergeometric distribution or approximated by the binomial distribution.
From the properties of the hypergeometric distribution, E0[M0⁺] = r(n2/n) and V0[M0⁺] = ( r n1 n2 (n − r) ) / ( n²(n − 1) ). Under the assumption D.1, (2.4.7), it follows that the limiting distribution of M0⁺ is normal.
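For example, the exact p-value of Mood's test can be computed from this hypergeometric distribution in R; a minimal sketch (mood.median.test is a hypothetical helper, not the Robnp routine; observations equal to the median may need special handling, as in the example below) is:

## Mood's median test with an exact hypergeometric p-value
mood.median.test <- function(x, y) {
  n1 <- length(x); n2 <- length(y); n <- n1 + n2
  Mhat <- median(c(x, y))            # combined sample median
  M0 <- sum(y > Mhat)                # Mood's statistic (2.6.5)
  r <- floor(n / 2)                  # r = n/2 (n even) or (n-1)/2 (n odd)
  # Upper-tail p-value: draw r "top" items; Y's play the role of successes
  p <- phyper(M0 - 1, m = n2, n = n1, k = r, lower.tail = FALSE)
  list(statistic = M0, p.value = p)
}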
Confidence Intervals
Exercise 2.13.21 shows that, for n = 2r,

M0⁺(∆) = #(Yj − ∆ > M̂) = Σ_{i=1}^{n2} I(Y(i) − X(r−i+1) − ∆ > 0) , (2.6.6)

and that the differences

Y(1) − X(r) < Y(2) − X(r−1) < · · · < Y(n2) − X(r−n2+1)

can be ordered only knowing the order statistics from the individual samples. It is further shown that if k is such that P(M0⁺ ≤ k) = α/2, then a (1 − α)100% confidence interval for ∆ is given by
Efficiency Results
We obtain the efficiency results from the asymptotic distribution of the estimate, ∆̂ = med Yj − med Xi, of ∆. Equivalently, we could obtain the results by the asymptotic linearity that was derived for arbitrary scores in (2.5.26); see Exercise 2.13.22.

Theorem 2.6.1. Under the conditions cited in Example 1.5.2 (the L1 Pitman regularity conditions) and (2.4.7), we have

√n (∆̂ − ∆) →_D N(0, (4 λ1 λ2 f²(0))⁻¹) . (2.6.7)
Proof: Without loss of generality assume that ∆ and θ are 0. We can write

√n ∆̂ = √(n/n2) √n2 med Yj − √(n/n1) √n1 med Xi .

From Example 1.5.2, we have

√n2 med Yj = (1/(2f(0))) (1/√n2) Σ_{j=1}^{n2} sgn Yj + op(1) ;

hence, √n2 med Yj →_D Z2, where Z2 is N(0, (4f²(0))⁻¹). Likewise √n1 med Xi →_D Z1, where Z1 is N(0, (4f²(0))⁻¹). Since Z1 and Z2 are independent, we have that √n ∆̂ →_D (λ2)^{−1/2} Z2 − (λ1)^{−1/2} Z1, which yields the result.
The efficacy of Mood's test is thus √(λ1 λ2) 2f(0). The asymptotic relative efficiency of Mood's test to the two-sample t-test is 4σ² f²(0), while its asymptotic relative efficiency with the MWW test is f²(0) / ( 3 (∫ f²)² ). These are the same as the efficiency results of the sign test to the t-test and to the Wilcoxon signed-rank test, respectively, that were obtained in Chapter 1; see Section 1.7.
Example 2.6.1 (Quail Data, continued from Example 2.3.1). For the quail data the median of the combined samples is M̂ = 64. For the subsequent test based on Mood's test we eliminated the three data points which had this value. Thus n = 27, n1 = 9, and n2 = 18. The value of Mood's test statistic is M0⁺ = #(Pj > 64) = 11. Since E_{H0}(M0⁺) = 8.67 and V_{H0}(M0⁺) = 1.55, the standardized value (using the continuity correction) is 1.47 with a p-value of .071. Using all the data, the point estimate corresponding to Mood's test is 19 while a 90% confidence interval, using the normal approximation, is (−10, 31).
Testing
Mathisen's test statistic, Ma⁺ = #(Yj > med Xi), similar to Mood's, has a hypergeometric distribution under H0.
Theorem 2.6.2. Suppose n1 is odd and is written as n1 = 2n1* + 1. Then under H0 : ∆ = 0,

P(Ma⁺ = t) = C(n1* + t, n1*) C(n2 − t + n1*, n1*) / C(n, n1) , t = 0, 1, . . . , n2 .
Proof: The proof is based on a conditional argument. Given X(n1*+1) = x, Ma⁺ is binomial with n2 trials and 1 − F(x) as the probability of success. The density of X(n1*+1) is

f*(x) = ( n1! / (n1*!)² ) (1 − F(x))^{n1*} F(x)^{n1*} f(x) .
Using this and the fact that the samples are independent we get

P(Ma⁺ = t) = C(n2, t) ∫ (1 − F(x))^t F(x)^{n2−t} f*(x) dx
           = C(n2, t) ( n1! / (n1*!)² ) ∫ (1 − F(x))^{t+n1*} F(x)^{n1*+n2−t} f(x) dx
           = C(n2, t) ( n1! / (n1*!)² ) ∫₀¹ (1 − u)^{t+n1*} u^{n1*+n2−t} du .
By properties of the β function this reduces to the result.
Once again using the conditional argument, we obtain the moments of Ma⁺ as

E0[Ma⁺] = n2/2 (2.6.9)

V0[Ma⁺] = n2(n + 1) / ( 4(n1 + 2) ) ; (2.6.10)

see Exercise 2.13.23.
The result when n1 is even is found in Exercise 2.13.23. For the asymptotic
null distribution of Ma+ we make use of the linearity result for the sign process
derived in Chapter 1; see Example 1.5.2.
Theorem 2.6.3. Under H0 and D.1, (2.4.7), Ma⁺ has an approximate N( n2/2, n2(n + 1)/(4(n1 + 2)) ) distribution.
Proof: Assume without loss of generality that the true median of X and Y is
0. Let θ̂ = med Xi. Note that

Ma⁺ = ( Σ_{j=1}^{n2} sgn(Yj − θ̂) + n2 ) / 2 . (2.6.11)
Clearly under (D.1), √n2 θ̂ is bounded in probability. Hence, by the asymptotic linearity result for the L1 analysis, obtained in Example 1.5.2, we have

n2^{−1/2} Σ_{j=1}^{n2} sgn(Yj − θ̂) = n2^{−1/2} Σ_{j=1}^{n2} sgn(Yj) − 2f(0) √n2 θ̂ + op(1) .

Therefore

n2^{−1/2} Σ_{j=1}^{n2} sgn(Yj − θ̂) = n2^{−1/2} Σ_{j=1}^{n2} sgn(Yj) − √(n2/n1) n1^{−1/2} Σ_{i=1}^{n1} sgn(Xi) + op(1) .
Note that

n2^{−1/2} Σ_{j=1}^{n2} sgn(Yj) →_D N(0, 1)

and

√(n2/n1) n1^{−1/2} Σ_{i=1}^{n1} sgn(Xi) →_D N(0, λ2/λ1) .

The result follows from these asymptotic distributions, the independence of the samples, expression (2.6.11), and the fact that asymptotically the variance of Ma⁺ satisfies

n2(n + 1) / ( 4(n1 + 2) ) ≐ n2 (4λ1)⁻¹ .
Confidence Intervals
Note that Ma⁺(∆) = #(Yj − ∆ > θ̂) = #(Yj − θ̂ > ∆); hence, if k is such that P0(Ma⁺ ≤ k) = α/2, then (Y(k+1) − θ̂, Y(n2−k) − θ̂) is a (1 − α)100% confidence interval for ∆. For testing the two-sided hypothesis H0 : ∆ = 0 versus HA : ∆ ≠ 0 we would reject H0 if 0 is not in the confidence interval. This is equivalent, however, to rejecting if θ̂ is not in the interval (Y(k+1), Y(n2−k)).
Suppose we determine k by the normal approximation. Then

k ≐ n2/2 − z_{α/2} √( n2(n + 1)/(4(n1 + 2)) ) − .5 ≐ n2/2 − z_{α/2} √( n2/(4λ1) ) − .5 .
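A minimal R sketch of Mathisen's test and this confidence interval, using the null variance (2.6.10) and the normal approximation for k (the function name mathisen.test is hypothetical), is:

## Mathisen's (control) median test and CI for Delta
mathisen.test <- function(x, y, alpha = 0.05) {
  n1 <- length(x); n2 <- length(y)
  theta.hat <- median(x)
  Ma <- sum(y > theta.hat)                       # test statistic
  V <- n2 * (n1 + n2 + 1) / (4 * (n1 + 2))       # null variance (2.6.10)
  z <- (Ma - n2 / 2) / sqrt(V)
  k <- floor(n2 / 2 - qnorm(1 - alpha / 2) * sqrt(V) - 0.5)
  ys <- sort(y)
  list(statistic = Ma, z = z,
       conf.int = c(ys[k + 1] - theta.hat, ys[n2 - k] - theta.hat))
}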
Remarks on Efficiency
Since the estimator of ∆ based on the Mathisen procedure is the same as that
of Mood’s procedure, the asymptotic relative efficiency results for Mathisen’s
procedure are the same as that of Mood’s. Using another type of efficiency
due to Bahadur (1967), Killeen, Hettmansperger, and Sievers (1972) show it
is generally better to compute the median of the smaller sample.
Curtailed sampling on the Y ’s is one situation where Mathisen’s test would
be used instead of Mood’s test since with Mathisen’s test an early decision
could be made; see Gastwirth (1968). For another perspective on median tests,
see Freidlin and Gastwirth (2000).
Example 2.6.2 (Quail Data, continued from Examples 2.3.1 and 2.6.1). For
this data, med Ti = 49. Since one of the placebo values was also 49, we elimi-
nated it in the subsequent computation of Mathisen’s test. The test statistic
has the value Ma+ = #(Cj > 49) = 17. Using n2 = 19 and n1 = 10 the null
mean and variance are 9.5 and 11.875, respectively. This leads to a standard-
ized test statistic of 2.03 (using the continuity correction) with a p-value of
.021. Utilizing all the data, the corresponding point estimate and confidence
interval are 19 and (6, 27). This differs from the MWW and Mood analyses; see
Examples 2.3.1 and 2.6.1, respectively.
Definition 2.7.1. An estimator ∆̂(X, Y) of ∆ is said to be an equivariant estimator of ∆ if ∆̂(X + a1, Y) = ∆̂(X, Y) − a and ∆̂(X, Y + a1) = ∆̂(X, Y) + a.
Note that the L1 estimator and the Hodges-Lehmann estimator are both
equivariant estimators of ∆. Indeed, as Exercise 2.13.24 shows, any estimator
based on the rank pseudo-norms discussed in Section 2.5 is an equivariant
estimator of ∆. As the following theorem shows, the breakdown point of an
equivariant estimator is bounded above by .25.
|∆̂(X*, Y) − ∆̂(X, Y)| ≤ B . (2.7.1)
Next let X** = (X1, . . . , Xm, X_{m+1} − a, . . . , X_{n1} − a)′. Then X** contains n1 − m = [n1/2] ≤ m altered points. Therefore,

|∆̂(X**, Y) − ∆̂(X, Y)| ≤ B . (2.7.2)

Equivariance implies that ∆̂(X**, Y) = ∆̂(X*, Y) + a. By (2.7.1) we have

∆̂(X, Y) − B ≤ ∆̂(X*, Y) ≤ ∆̂(X, Y) + B (2.7.3)

and

∆̂(X, Y) − B + a ≤ ∆̂(X**, Y) ≤ ∆̂(X, Y) + B + a . (2.7.4)
Recall the projection of the statistic S̄_R(0) − 1/2 given in Theorem 2.4.7. Since the difference between it and this statistic goes to zero in probability we can, after some algebra, obtain the following representation for the Hodges-Lehmann estimator,

√n ∆̂_R = ( ∫ f² )⁻¹ (1/√n) { Σ_{j=1}^{n2} (F(Yj) − 1/2)/λ2 − Σ_{i=1}^{n1} (F(Xi) − 1/2)/λ1 } + op(1) .
Hence, the influence function of the R estimate based on the score function ϕ is given by

Ω(z) = { −(τϕ/λ1) ϕ(F(z))   if z is an x
       { (τϕ/λ2) ϕ(F(z))    if z is a y ,
For a lifetime random variable X with cdf F and pdf f, the hazard function of X is defined by

h_X(t) = f(t) / (1 − F(t))

and represents the likelihood that a subject dies at time t given that he has survived until that time; see Exercise 2.13.25.
In this section, we consider the class of lifetime models that are called Lehmann alternative models, for which the distribution function G satisfies

1 − G(x) = (1 − F(x))^α , (2.8.1)

where the parameter α > 0. See Section 4.4 of Maritz (1981) for an overview
of nonparametric methods for these models. The Lehmann model generalizes
the exponential scale model F (x) = 1−exp(−x) and G(x) = 1−(1−F (x))α =
1 − exp(−αx). As shown in Exercise 2.13.25, the hazard function of Yj is given
by hY (t) = αhX (t); i.e., the hazard function of Yj is proportional to the hazard
function of Xi ; hence, these models are also referred to as proportional haz-
ards models; see, also, Section 3.10. The null hypothesis can be expressed
as HL0 : α = 1. The alternative we consider is HLA : α < 1; that is, Y is
less hazardous than X; i.e., Y has more chance of long survival than X and
The last equality holds, since 1−F (X) has a uniform (0, 1) distribution. Under
HLA , then, Pα (Y > X) > 1/2; i.e., Y tends to dominate X.
The MWW test statistic SR+ = #(Yj > Xi ) is a consistent test statistic for
HL0 versus HLA , by Theorem 2.4.10. We reject HL0 in favor of HLA for large
values of SR⁺. Furthermore, by Theorem 2.4.4 and (2.8.2), we have that

Eα[SR⁺] = n1 n2 Eα[1 − G(X)] = n1 n2 / (1 + α) .

This suggests as an estimate of α the statistic

α̂ = ( (n1 n2)/SR⁺ ) − 1 . (2.8.3)
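In R, for instance, this estimate can be computed directly (samples x and y assumed):

## Estimate of alpha in the Lehmann model, (2.8.3)
SRplus <- sum(outer(y, x, ">"))                  # MWW statistic #(Y_j > X_i)
alpha.hat <- (length(x) * length(y)) / SRplus - 1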
P[ε ≤ t] = P[log X − log θ ≤ t] = 1 − exp(−e^t) ;
Next consider the distribution of log Y. Using expression (2.8.1) and a few steps of algebra we get

P[log Y ≤ t] = 1 − exp( −(α/θ) e^t ) .

But from this it is easy to see that we can model Y as

log Y = log θ + log(1/α) + ε , (2.8.7)
where the error random variable has the above extreme value distribution.
From (2.8.6) and (2.8.7) we see that the log-transformation problem is simply
a two-sample location problem with shift parameter ∆ = − log α. Here, HL0
is equivalent to H0 : ∆ = 0 and HLA is equivalent to HA : ∆ > 0. We refer to
this model as the log exponential model for the remainder of this section.
Thus any of the rank-based analyses that we have discussed in this chapter
can be used to analyze this model.
Let’s consider the analysis based on the optimal score function for the
model. Based on Section 2.5 and Exercise 2.13.19, the optimal scores for the
extreme value distribution are generated by the function
Hence the optimal rank test in the log exponential model is given by
Xn2 Xn2
R(Yj ) R(log Yj )
SL = ϕfǫ =− 1 + log 1 −
j=1
n+1 j=1
n+1
Xn2
R(Yj )
= − 1 + log 1 − . (2.8.9)
j=1
n+1
We reject H_L0 in favor of H_LA for large values of S_L. By (2.5.14) the null mean of S_L is 0, while from (2.5.18) its null variance is given by

σ²_{ϕfε} = ( n1 n2 / (n(n − 1)) ) Σ_{i=1}^n [ 1 + log( 1 − i/(n + 1) ) ]² . (2.8.10)
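A minimal R sketch of this test, using the scores (2.8.8) and the variance (2.8.10) (samples x and y assumed), is:

## Test based on the optimal log exponential scores (2.8.8)
n1 <- length(x); n2 <- length(y); n <- n1 + n2
a <- -(1 + log(1 - (1:n) / (n + 1)))        # scores a(i) = phi(i/(n+1))
a <- a - mean(a)                            # center so (2.5.14) holds exactly
SL <- sum(a[rank(c(x, y))[(n1 + 1):n]])     # S_L: scores of the Y-ranks
V <- (n1 * n2 / (n * (n - 1))) * sum(a^2)   # null variance, cf. (2.8.10)
1 - pnorm(SL / sqrt(V))                     # reject H_L0 for large S_L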
Besides estimation, the confidence intervals discussed in Section 2.5 for general scores can be obtained for the score function ϕ_fε; see Example 2.8.1 for an illustration. Thus another estimate of α would be α̂ = exp{−∆̂}. As discussed in Exercise 2.13.27, an asymptotic confidence interval for α can be formulated from this relationship. Keep in mind, though, that we are assuming that X is exponentially distributed.
As a further note, since ϕ_fε(u) is an unbounded function, it follows from Section 2.7.2 that the influence function of ∆̂ is unbounded. Thus the estimate is not robust.
A frequently used test statistic, equivalent to S_L, was proposed by Savage. To derive it, denote R(Yj) by Rj. Then we can write

log( 1 − Rj/(n + 1) ) = ∫_1^{1−Rj/(n+1)} (1/t) dt = ∫_{Rj/(n+1)}^{0} ( 1/(1 − t) ) dt .
Exercise 2.13.28 shows that its null mean and variance are given by

E_{H0}[S̃_L] = 0

σ̃² = ( n1 n2 / (n − 1) ) { 1 − (1/n) Σ_{j=1}^n (1/j) } . (2.8.14)
Hence an asymptotic level α test is to reject H_L0 in favor of H_LA if S̃_L ≥ σ̃ z_α.
Based on the above Riemann sum it would seem that S̃_L and S_L are close statistics. Indeed they are asymptotically equivalent and, hence, both are optimal when X is exponentially distributed; see Hájek and Šidák (1967) or Kalbfleisch and Prentice (1980) for details.
Since the Savage test is asymptotically optimal, its efficacy is √(λ1 λ2) times the square root of Fisher information, i.e., √(λ1 λ2) I^{1/2}(fε), as discussed in Section 2.5. Because I(fε) = 1 for the extreme value distribution, this efficacy is √(λ1 λ2). Hence the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model is 3/4; see Exercise 2.13.29.
Recall that the efficacy of the L1 procedures, both Mood's and Mathisen's, is 2fε(θε)√(λ1 λ2), where θε denotes the median of the extreme value distribution. This turns out to be θε = log(log 2). Hence fε(θε) = (log 2)/2, which leads to the efficacy √(λ1 λ2) log 2 for the L1 methods. Thus the asymptotic relative efficiency of the L1 procedures with respect to the procedure based on Savage scores is (log 2)² = .480. The asymptotic relative efficiency of the L1 methods to the MWW at this model is .6406. Therefore there is a substantial loss of efficiency if L1 methods are used for the log exponential model. This makes sense since the extreme value distribution has very light tails.
The variance of a random variable with density fǫ is π 2 /6; hence the asymp-
totic relative efficiency of the t-test to the Savage test at the log exponential
model is 6/π 2 = .608. Hence, for the procedures analyzed in this chapter on
the log exponential model the Savage test is optimal followed, in order, by the
MWW, t-, and L1 tests.
Example 2.8.1 (Lifetimes of an Insulation Fluid). The data below are drawn
from an example on page 3 of Lawless (1982); see, also, Nelson (1982, p. 227).
They consist of the breakdown times (in minutes) of an electrical insulating
fluid when subject to two different levels of voltage stress, 30 and 32 kV.
Suppose we are interested in testing to see if the lower level is less hazardous
than the higher level.
Let Y and X denote the log of the breakdown times of the insulating
fluid at the voltage stresses of 30 kV and 32 kV, respectively. Let ∆ = θY −
θX denote the shift in locations. We are interested in testing H0 : ∆ = 0
versus HA : ∆ > 0. The comparison boxplots for the log-transformed data are
displayed in the left panel of Figure 2.8.1. It appears that the lower level (30
kV) is less hazardous.
[Figure 2.8.1: Left panel: comparison boxplots of breakdown time by voltage level; right panel: plot of the breakdown times against exponential quantiles.]
The Robnp function twosampr2 with the score argument set at philogr
obtains the analysis based on the log-rank scores. Briefly, the results are:
as:

X(1)1, . . . , X(1)n1  iid f(1)(t)     Y(1)1, . . . , Y(1)n2  iid f(1)(t − ∆)
  .                                      .
  .                                      .
X(k)1, . . . , X(k)n1  iid f(k)(t)     Y(q)1, . . . , Y(q)n2  iid f(q)(t − ∆)

U_RSS = Σ_{s=1}^{q} Σ_{i=1}^{k} U_si .
compute U*_RSS on these samples. Repeating this B times, we obtain the sample of test statistics U*_{RSS,1}, . . . , U*_{RSS,B}. Then the bootstrap p-value for our test is #(U*_{RSS,j} ≥ U_RSS)/B, where U_RSS is the value of the statistic based on the original data. Generally we take B = 1000 for a p-value. It is clear how to modify the above argument to allow for k ≠ q and n1 ≠ n2.
Zi = ζci + ei , 1 ≤ i ≤ n , (2.10.2)
These scores are not surprising, because the distribution of |X| is exponential.
Hence, this is precisely the log linear problem with exponentially distributed
lifetime that was discussed in Section 2.8; see the discussion around expression
(2.8.8).
Example 2.10.3 (L(|X|) Is a Member of the Generalized F -family). In Sec-
tion 3.10 a discussion is devoted to a large family of commonly used distri-
butions called the generalized F -family for survival type data. In particular,
as shown there, if |X| follows an F(2, 2)-distribution, then it follows (Exercise 2.13.31) that log |X| has a logistic distribution. Thus the MWW statistic
is the optimal rank score statistic in this case.
where the last equality holds because the log function is strictly increasing. This is not necessarily a standardized score function, but it follows from the discussion on general scores found in Section 2.5 and (2.5.18) that the null mean µϕ and null variance σϕ² of the statistic are given by

µϕ = n2 ā   and   σϕ² = ( n1 n2 / (n(n − 1)) ) Σ_{i=1}^n ( a(i) − ā )² , (2.10.12)

where ā is the average of the scores a(i). The asymptotic version of this test statistic rejects H0 at approximate level α if z ≥ zα, where

z = (Sϕ − µϕ)/σϕ . (2.10.13)
The interval (ζ̂_L, ζ̂_U), where ζ̂_L and ζ̂_U solve the respective equations

Sϕ*(ζ̂_L) ≐ z_{α/2} σϕ* + µϕ*   and   Sϕ*(ζ̂_U) ≐ −z_{α/2} σϕ* + µϕ* .
S*_FK = Σ_{j=1}^{n2} { Φ⁻¹( R|Yj*|/(2(n + 1)) + 1/2 ) }² . (2.10.20)
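A minimal R sketch of the statistic (2.10.20), standardized via (2.10.12)-(2.10.13), under the assumption that the aligned observations Xj*, Yj* are obtained by subtracting the respective sample medians, is:

## Two-sample Fligner-Killeen type statistic (2.10.20); samples in x and y
n1 <- length(x); n2 <- length(y); n <- n1 + n2
folded <- abs(c(x - median(x), y - median(y)))   # aligned, folded samples
rY <- rank(folded)[(n1 + 1):n]                   # ranks of the folded Y's
SFK <- sum(qnorm(rY / (2 * (n + 1)) + 0.5)^2)    # statistic (2.10.20)
a <- qnorm((1:n) / (2 * (n + 1)) + 0.5)^2        # scores a(i)
mu <- n2 * mean(a)                               # null mean, (2.10.12)
V <- (n1 * n2 / (n * (n - 1))) * sum((a - mean(a))^2)
z <- (SFK - mu) / sqrt(V)                        # compare with (2.10.13)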
Example 2.10.4 (Efficacy for the Score Function ϕ_FK(u)). To use expression (2.5.27) for the efficacy, we must first standardize the score function ϕ_FK(u) = {Φ⁻¹[(u + 1)/2]}² − 1, (2.10.7). Using the substitution (u + 1)/2 = Φ(t), we have

∫₀¹ ϕ_FK(u) du = ∫_{−∞}^{∞} t² φ(t) dt − 1 = 1 − 1 = 0 .
Figure 2.10.1: Comparison boxplots of treated and control weight gains in rats.
(Wilcoxon), the linear Wilcoxon score function; (Quad), the score function
ϕ(u) = u2 ; and (Logistic), the optimal score function if the distribution of X
is logistic (see Exercise 2.13.32). The error distributions include the normal
and the χ2 (1) distributions and several members of the skewed contaminated
normal distribution. In the latter case, the random variable X is written as
X = X1 (1 − Iǫ ) + Iǫ X2 , where X1 and X2 have N(0, 1) and N(µc , σc2 ) distribu-
tions, respectively, Iǫ has a Bernoulli distribution with probability of success
ǫ, and X1 , X2 , and Iǫ are mutually independent. For the study ǫ was set at
0.3 and µc and σc varied. The pdfs of the three SCN distributions in Table
2.10.1 are shown in Figure 2.10.2. The pdf in the bottom right corner panel
of the figure is that of the χ²(1)-distribution. For all but the last situation in Ta-
ble 2.10.1, the sample sizes are n1 = 20 and n2 = 25. The last situation is
for n1 = n2 = 10. The number of simulations for each situation was set at
1000. For each run, the two-sided alternative, HA : η 6= 1, was tested and
the estimator of η and an associated confidence interval for η were obtained.
Computations were performed by Robnp functions.
The table shows the empirical α levels at the nominal 0.10, 0.05, and 0.01 levels; the empirical confidence coefficient for a nominal 95% confidence interval; the mean of the estimates of η; and the MSE for η̂. Of the five analyses, overall the Fligner-Killeen analysis (fk2) performed the best. This analysis was valid (nominal levels and empirical coverage) in all the situations,
except for the χ²(1) distribution at the 10% level and the larger sample sizes. Even here, its empirical level is 0.128. The other tests were liberal in the skewed situations; some, such as the Wilcoxon test, were quite liberal. Also, the fk analysis (exponent 1 in its score function) was liberal for the χ²(1) situations. Notice that the Fligner-Killeen analysis achieved the lowest MSE in all the situations.
[Figure 2.10.2: Pdfs of the three SCN distributions of Table 2.10.1; the bottom right panel shows the pdf of the χ²(1)-distribution.]
As a final remark, another class of linear rank statistics for the two-sample
scale problem consists of simple linear rank statistics of the form

S = Σ_{j=1}^{n2} a(R(Yj)) , (2.10.24)

where the scores are generated as a(i) = ϕ(i/(n + 1)). The folded rank statistics discussed above suggest that ϕ be a convex (or concave) function. One popular score function is the quadratic function ϕ(u) = (u − 1/2)². The resulting
Table 2.10.1: Empirical Levels, Confidences, and MSE's for the Monte Carlo Study Discussed in Example 2.10.6

Normal Errors, n1 = 20, n2 = 25
            α̂.10   α̂.05   α̂.01   Ĉnf.95     η̂       MSE(η̂)
Logistic    0.083  0.041  0.006  0.961    1.037     0.060
Quad.       0.080  0.030  0.008  0.970    1.043     0.076
Wilcoxon    0.073  0.033  0.004  0.967    1.042     0.097
fk2         0.087  0.039  0.004  0.960    1.036     0.057
fk          0.077  0.033  0.005  0.969    1.037     0.067

SKCN(µc = 2, σc = √2, ǫc = 0.3), n1 = 20, n2 = 25
Logistic    0.106  0.036  0.006  0.965    1.035     0.076
Quad.       0.106  0.046  0.008  0.953    1.040     0.095
Wilcoxon    0.103  0.049  0.007  0.952    1.043     0.117
fk2         0.100  0.034  0.006  0.966    1.033     0.073
fk          0.099  0.047  0.006  0.953    1.034     0.085

SKCN(µc = 6, σc = √2, ǫc = 0.3), n1 = 20, n2 = 25
Logistic    0.081  0.033  0.006  0.966    1.067     0.166
Quad.       0.122  0.068  0.020  0.933    1.105     0.305
Wilcoxon    0.163  0.103  0.036  0.897    1.125     0.420
fk2         0.072  0.026  0.005  0.974    1.057     0.126
fk          0.111  0.057  0.015  0.942    1.075     0.229

SKCN(µc = 12, σc = √2, ǫc = 0.3), n1 = 20, n2 = 25
Logistic    0.084  0.046  0.007  0.954    1.091     0.298
Quad.       0.138  0.085  0.018  0.916    1.183     0.706
Wilcoxon    0.171  0.116  0.038  0.886    1.188     0.782
fk2         0.074  0.042  0.007  0.958    1.070     0.201
fk          0.115  0.069  0.015  0.932    1.109     0.400

χ²(1), n1 = 20, n2 = 25
Logistic    0.154  0.086  0.023  0.913    1.128056  0.353
Quad.       0.249  0.149  0.047  0.851    1.170     0.482
Wilcoxon    0.304  0.197  0.067  0.804    1.196     0.611
fk2         0.128  0.066  0.018  0.936    1.120     0.336
fk          0.220  0.131  0.039  0.870    1.154     0.432

χ²(1), n1 = 10, n2 = 10
Logistic    0.132  0.062  0.018  0.934    1.360     1.495
Quad.       0.192  0.099  0.035  0.900    1.457     2.108
Wilcoxon    0.276  0.166  0.042  0.833    1.560     3.311
fk2         0.111  0.057  0.013  0.941    1.335     1.349
fk          0.199  0.103  0.033  0.893    1.450     2.086
statistic,

S_M = Σ_{j=1}^{n2} ( R(Yj)/(n + 1) − 1/2 )² , (2.10.25)
was proposed by Mood (1954) as a test statistic for the hypotheses (2.10.1).
For the realistic problem with unknown location, though, the observations
have to be first aligned. Asymptotic theory holds, provided the underlying
distribution is symmetric. This class of aligned rank tests, though, did not
perform nearly as well as the folded rank statistics, (2.10.16), in the large
Monte Carlo study of Conover et al. (1981). Hence, we recommend the folded
rank-based analyses discussed above.
Var_{H0}(SR⁺) / ( n1 n2 (n + 1) ) → λ1 Var(F(Y)) + λ2 Var(G(X)) .

Under the assumptions that the sample sizes are the same and that L(X) and L(Y) have the same form, we can simplify expression (2.11.2) further. We express the result in the following theorem.
Theorem 2.11.1. Suppose that the null hypothesis in (2.11.1) is true. Assume
that the distributions of Y and X are symmetric, n1 = n2 , and G(x) = F (x/η)
where η is an unknown parameter. Then the maximum observed significance
level is 1 − Φ(.816zα ) which is approached as η → 0 or η → ∞.
and

Var(G(X)) = ∫ F²(x/η) dF(x) − 1/4 → { 1/4  if η → 0
                                     { 0    if η → ∞ .
From these two results and (2.11.2), the true significance level of the MWW test satisfies

α_{SR⁺} → { 1 − Φ( zα (3/2)^{−1/2} )  if η → 0
          { 1 − Φ( zα (3/2)^{−1/2} )  if η → ∞ .

Hence,

α_{SR⁺} → 1 − Φ( zα (3/2)^{−1/2} ) = 1 − Φ(.816 zα) ,
Recall from expression (2.6.8) of Section 2.6.2 that Mathisen's test statistic (centered version) is given by S2(θ̂_X). This is our test statistic. The modification lies in its asymptotic distribution, which is given in the next theorem.

Theorem 2.11.2. Assume the null hypothesis in expression (2.11.1) is true. Then under the assumption (D.1), (2.4.7), (1/√n2) S2(θ̂_X) is asymptotically normal with mean 0 and asymptotic variance 1 + K12², where K12² is defined by

K12² = ( λ2 g²(θY) ) / ( λ1 f²(θX) ) . (2.11.6)
Proof: Assume without loss of generality that θX = θY = 0. From the asymptotic linearity results discussed in Example 1.5.2 of Chapter 1, we have that

(1/√n2) S2(θn) ≐ (1/√n2) S2(0) − 2g(0) √n2 θn ,

for √n |θn| ≤ c, c > 0. Since √n2 θ̂_X is bounded in probability, upon substitution in the last expression we get

(1/√n2) S2(θ̂_X) ≐ (1/√n2) S2(0) − 2g(0) √n2 θ̂_X . (2.11.7)

In Example 1.5.2, we also have the approximation

θ̂_X ≐ (1/(2f(0))) n1⁻¹ S1(0) , (2.11.8)
By Theorem 2.11.2 we have the asymptotic null variance of the test statistic S2(θ̂_X)/√n. From the above discussion, then, the statistic S2(θ̂_X) is Pitman Regular with efficacy

c_{m2} = 2λ2 g(0) / √( λ2(1 + K12²) ) = √(λ1 λ2) 2g(0) / √( λ1 + λ2 (g²(0)/f²(0)) ) . (2.11.15)
Using Theorem 1.5.4 of Chapter 1, consistency of the modified Mathisen’s
test for the hypotheses (2.11.1) is obtained provided µ(∆) > µ(0). But this
follows immediately from the inequality G(−∆) > G(0).
The average to consider here is S̄_R = (n1 n2)⁻¹ SR⁺. Letting µ(∆) denote the mean of S̄_R under ∆, we have µ′(0) = ∫ g(x)f(x) dx > 0. The variance we need is σ²(0) = lim_{n→∞} n Var₀(S̄_R), which using the above result on variance simplifies to

σ²(0) = λ2⁻¹ [ ∫ F² dG − ( ∫ F dG )² ] + λ1⁻¹ [ ∫ (1 − G)² dF − ( ∫ (1 − G) dF )² ] .

The process SR⁺(∆) is Pitman Regular and, in particular, its efficacy is given by

c_{mww} = √(λ1 λ2) ∫ g(x)f(x) dx / √( λ1 [ ∫ F² dG − ( ∫ F dG )² ] + λ2 [ ∫ (1 − G)² dF − ( ∫ (1 − G) dF )² ] ) .
(2.11.18)
As with the modified Mathisen’s test, we show consistency of the modified
MWW test by using Theorem 1.5.4. Again we need only show that µ(0) <
µ(∆). But this follows immediately provided the supports of F and G overlap
in a neighborhood of 0. Note that this shows that the modified MWW is
consistent for the hypotheses (2.11.1) under the further restriction that the
densities of X and Y are symmetric.
t_W = ( Ȳ − X̄ ) / √( s_X²/n1 + s_Y²/n2 ) ,

where s_X² and s_Y² are the sample variances of the Xi's and Yj's, respectively. Under these assumptions, it follows that these sample variances are consistent estimates of σ_X² and σ_Y², respectively; hence, the test has approximate level α. If F0 is also normal then, under H0, t_W has an approximate t distribution with a degrees of freedom correction proposed by Welch (1949). This test is frequently used in practice and we subsequently call it the Welch t-test.
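In R, for example, the Welch correction is the default behavior of the base function t.test:

t.test(y, x, alternative = "greater")                      # Welch t (var.equal = FALSE)
t.test(y, x, alternative = "greater", var.equal = TRUE)    # pooled t, discussed next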
In contrast, the pooled t-test can behave poorly in this situation, since we have

t_p = ( Ȳ − X̄ ) / √( [ ((n1 − 1)s_X² + (n2 − 1)s_Y²)/(n1 + n2 − 2) ] (1/n1 + 1/n2) )
    ≐ ( Ȳ − X̄ ) / √( s_X²/n2 + s_Y²/n1 ) ;

that is, the sample variances are divided by the wrong sample sizes. Hence unless the sample sizes are fairly close the pooled t is not asymptotically distribution free. Exercise 2.13.42 obtains the true asymptotic level of t_p.
In order to get the efficacy of the Welch t, consider the statistic Ȳ − X̄. The mean function at ∆ is µ(∆) = ∆; hence, µ′(0) = 1. It follows from the asymptotic distribution discussed above that

√n [ √(λ1 λ2) (Ȳ − X̄) / √( λ2 σ_X² + λ1 σ_Y² ) ] →_D N(0, 1) ;

hence, σ(0) = √( λ2 σ_X² + λ1 σ_Y² ) / √(λ1 λ2). Thus the efficacy of t_W is given by

c_{tW} = µ′(0)/σ(0) = √(λ1 λ2) / √( λ2 σ_X² + λ1 σ_Y² ) . (2.11.19)
We obtain the ARE’s of the above procedures for the case where G(x) =
F (x/η) and F (x) has density f (x) symmetric about 0 with variance 1. Thus η
is the ratio of standard deviations σY /σX . For this case the efficacies (2.11.15),
(2.11.18), and (2.11.19) reduce to
c_{m2} = 2√(λ1 λ2) f(0) / √( λ2 + λ1 η² )

c_{mww} = √(λ1 λ2) ∫ g f / √( λ1 [ ∫ F² dG − ( ∫ F dG )² ] + λ2 [ ∫ (1 − G)² dF − ( ∫ (1 − G) dF )² ] )

c_{tW} = √(λ1 λ2) / √( λ2 + λ1 η² ) .
Thus the ARE between the modified Mathisen's procedure and the Welch procedure is the ratio c²_{m2}/c²_{tW} = 4σ_X² f²(0) = 4f0²(0). This is the same ARE as in the location problem. In particular, the ARE does not depend on η =
as in the location problem. In particular the ARE does not depend on η =
σY /σX . Thus the modified Mathisen’s test in comparison to tW would have
poor efficiency at the normal distribution, .63, but in general it would be much
more efficient than tW for heavy-tailed distributions. Similar to the modified
Mathisen’s test, the Mood test can also be modified for these problems; see
Exercise 2.13.43. Its efficacy is the same as that of Mathisen's test.
Asymptotic relative efficiencies involving the modified Wilcoxon do depend
on the ratio of scale parameters η. Fligner and Rust (1982) show that if the
variances of X and Y are quite different then the modified Mathisen’s test
may be as efficient as the modified MWW irrespective of the shape of the
underlying distribution.
Fligner and Policello (1981) conducted a simulation study of the pooled
t, Welch’s t, MWW, and the modified MWW over situations where F and G
differ in scale only. The unmodified tests did not maintain their level. Welch’s
t performed well when F and G were normal whereas the modified MWW
performed well over all situations, including unequal sample sizes and normal
and contaminated normal distributions. In the simulation study performed by
Fligner and Rust (1982), they found that the modified Mood test maintains its
level over the situations that were considered by Fligner and Policello (1981).
As a final note, Welch’s t requires distributions with the same shape and
the modified MWW requires symmetric densities. The modified Mathisen’s
test and the modified Mood test, though, are consistent tests for the general
problem stated in expression (2.11.1).
Let X denote the response of a subject after treatment A has been applied and
let Y be the corresponding measurement for a subject after treatment B has
been applied. The natural null hypothesis, H0 , is that there is no difference
in treatment effects. A one-sided alternative is that the response of a subject
under treatment B is in general larger than that of a subject under treatment A. Reversing the roles of A and B yields the other one-sided alternative, while the union of these two alternatives results in the two-sided alternative.
Again for definiteness we choose as our alternative, HA , the first one-sided
alternative.
The completely randomized design and the paired design are two experi-
mental designs which are often employed in this situation. In the completely
randomized design, n subjects are selected at random from the population of
interest and n1 of them are randomly assigned to treatment A while the re-
maining n2 = n − n1 are assigned to treatment B. At the end of the treatment
period, we then have two samples, one on X while the other is on Y . The two
sample procedures discussed in the previous sections can be used to analyze
the data. Proper randomization along with carefully controlled experimental
conditions give credence to the assumptions that the samples are random and
are independent of one another. The design that produced the data of Example
2.3.1 was a completely randomized design.
While the completely randomized design is often used in practice, the
underlying variability may impair the power of any procedure, robust or tra-
ditional, to detect alternative hypotheses. The design discussed next usually
results in a more powerful analysis but it does require a pairing device; i.e., a
block of length two.
Suppose we have a pairing device (block of length two). Some examples
include identical twins for a study on human subjects, litter mates for a
study on animal subjects, or the same exterior wall of a house for a study
on the durability of exterior house paints. In the paired design, n pairs of
subjects are randomly selected from the population of interest. Within each
pair, one member is randomly assigned to treatment A while the other re-
ceives treatment B. Again let X and Y denote the responses of subjects after
treatments A and B, respectively, have been applied. This experimental de-
sign results in a sample of pairs (X1 , Y1), . . . , (Xn , Yn ). The sample differences
D1 = X1 − Y1, . . . , Dn = Xn − Yn, however, become the single sample of interest.
Note that the random pairing in this design induces under the null hypothesis
a symmetrical distribution for the differences.
Proof: Let F(x, y) denote the joint distribution of (X, Y). Under the null hypothesis, the random assignment within pairs implies that (X, Y) and (Y, X) have the same distribution; hence,

P[D ≤ t] = P[Y − X ≤ t] = P[X − Y ≤ t] = P[−D ≤ t] .
Example 2.12.1 (Darwin Data). The data, Table 2.12.1, are some measure-
ments recorded by Charles Darwin in 1878. They consist of 15 pairs of heights
in inches of cross-fertilized plants and self-fertilized plants (Zea mays), each
pair grown in the same pot.
Let Di denote the difference between the heights of the cross-fertilized
and self-fertilized plants of the ith pot and let θ denote the median of the distribution of Di.
Table 2.12.1: Plant Growth, Cross (C) and Self (S) Fertilized

Pot    1       2       3       4       5       6       7       8
C    23.500  12.000  21.000  22.000  19.125  21.500  22.125  20.375
S    17.375  20.375  20.000  20.000  18.375  18.625  18.625  15.250

Pot    9      10      11      12      13      14      15
C    18.250  21.625  23.250  21.000  22.125  23.000  12.000
S    16.500  18.000  16.250  18.000  12.750  15.500  18.000
Estimate 3 SE is 1.307422
95 % Confidence Interval is ( 1 , 6.125 )
Estimate of the scale parameter tau 5.063624
The value of the signed-rank Wilcoxon statistic for this data is T = 72 with the
approximate p-value of 0.043. The corresponding estimate of θ is 3.14 inches
and the 95% confidence interval is (.50, 5.21).
There are 13 positive differences, so the standardized value of the sign test
statistic is 2.58, with the p-value of 0.01. The corresponding estimate of θ
is 3 inches and the 95% interpolated confidence interval is (1.00, 6.13). The paired t-
test statistic has the value of 2.15 with p-value 0.050. The difference in sample
means is 2.62 inches and the corresponding 95% confidence interval is (0, 5.23).
Note that the outliers impaired the t-test and to a lesser degree the Wilcoxon
signed-rank test.
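These analyses can be reproduced with base R (rather than the Robnp routines), using the data of Table 2.12.1:

## Darwin data, Table 2.12.1
cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125, 20.375,
           18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625, 15.250,
           16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)
wilcox.test(cross, self, paired = TRUE, conf.int = TRUE)  # signed-rank Wilcoxon
binom.test(sum(cross > self), length(cross))              # sign test (13 of 15)
t.test(cross, self, paired = TRUE)                        # paired t-test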
[Figure: Boxplot of the paired differences for the Darwin data.]
We only consider the case where the random vector (Y, X) is jointly normal with variance-covariance matrix

V = σ² [ 1  ρ
         ρ  1 ] .

Then τ = √(π/3) σ √(2(1 − ρ)).
Now suppose we select the sample size n∗ so that the Wilcoxon signed-rank
test has power γ + (θ0 ) to detect the one-sided alternative θ0 > 0 for a level α
test. It then follows from (1.5.26) that

γ⁺(θ0) ≐ 1 − Φ( zα − √(n*) θ0/τ ) ,

and hence

n* ≐ ( zα − z_{γ⁺(θ0)} )² τ² / θ0² .

Substituting the value of τ into this final equation, we have that the necessary sample size for the paired design to have the desired local power is

n* ≐ [ ( zα − z_{γ⁺(θ0)} )² / θ0² ] (π/3) σ² 2(1 − ρ) . (2.12.1)
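A small R function implementing (2.12.1) (the name n.paired and its arguments are hypothetical):

## Sample size for the paired design, per (2.12.1)
n.paired <- function(theta0, power, alpha, sigma2, rho) {
  za <- qnorm(1 - alpha)     # upper alpha critical value
  zg <- qnorm(1 - power)     # z_{gamma+(theta0)}
  ((za - zg)^2 / theta0^2) * (pi / 3) * sigma2 * 2 * (1 - rho)
}
## e.g., n.paired(theta0 = 0.5, power = 0.80, alpha = 0.05, sigma2 = 1, rho = 0.5)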
2.13 Exercises
2.13.1. (a) Derive the L2 estimates of intercept and shift based on the L2
norm on Model (2.2.4).
(b) Next apply the pseudo-norm, (2.2.16), to (2.2.4) and derive the estimating
function. Show that the natural test statistic is the pooled t-statistic.
2.13.2. Show that (2.2.17) is a pseudo-norm. Show, also, that it can be written
in terms of ranks; see the formula following (2.2.17).
2.13.5. Prove that if a continuous random variable Z has cdf H(z), then the
random variable H(Z) has a uniform distribution on (0, 1).
2.13.9. Explain what happens to the MWW statistic when one support is
shifted completely to the right of the other support. What does this imply
about the consistency of the MWW in this case?
2.13.10. Show that the L2 estimating function is Pitman Regular and derive the efficacy of the pooled t-test. Also, establish the asymptotic power lemma, Theorem 2.4.13, for the L2 case. Finally, establish the asymptotic distribution of √n(Ȳ − X̄).
(c) Use the Wilcoxon procedure to estimate ∆ and obtain a 95% confidence
interval for it.
(d) Obtain the true value of τ . Use your confidence interval in the last item
to obtain an estimate of τ . Obtain a symmetric 95% confidence interval
for ∆ based on your estimate.
2.13.15. Write an R function to bootstrap the distribution of ∆̂. Obtain the bootstrap distribution for 500 bootstraps of the data of Problem 2.13.14. What is your bootstrap estimate of τ? Compare with the true value and the other estimates.
2.13.16. Verify the scalar multiple condition for the pseudo-norm in the proof
of Theorem 2.5.1.
2.13.17. Verify (2.5.9) and (2.5.10).
2.13.18. Consider the process Sϕ(∆), (2.5.11):

(a) Show that Sϕ(∆) is a decreasing step function, with steps occurring at Yj − Xi.

(b) Using Part (a) and the MWW estimator as a starting value, write with some details an algorithm which obtains the estimator ∆̂ϕ.

(c) Verify expressions (2.5.14), (2.5.15), and (2.5.16).
2.13.19. Consider the optimal score function (2.5.22):

(a) Show it is location invariant and scale equivariant. Hence, show that if g(x) = (1/σ) f((x − µ)/σ), then ϕ_g = σ⁻¹ ϕ_f.

(b) Use (2.5.22) to show that the MWW is asymptotically efficient when the underlying distribution is logistic (F(x) = (1 + exp(−x))⁻¹, −∞ < x < ∞).

(c) Show that (2.6.1) is optimal for a Laplace or double exponential distribution (f(x) = (1/2) exp(−|x|), −∞ < x < ∞).

(d) Show that the optimal score function for the extreme value distribution (f(x) = exp{x − e^x}, −∞ < x < ∞) is given by (2.8.8).

(e) Show that the optimal score function for the normal distribution is given by (2.5.32). Show that it is standardized.

(f) Show that (2.5.33) is the optimal score function for an underlying distribution that has a left logistic tail and a right exponential tail.
2.13.20. Show that when the underlying density f is symmetric then ϕf (1 −
u) = −ϕf (u).
2.13.21. Show that expression (2.6.6) is true and that, for n = 2r, the differences

Y(1) − X(r) < Y(2) − X(r−1) < · · · < Y(n2) − X(r−n2+1)

can be ordered only knowing the order statistics from the individual samples.
2.13.34. We consider the Siegel-Tukey (1960) test for the equality of variances
when the underlying centers are equal but possibly unknown. The test statistic
is the sum of ranks of the Y sample in the combined sample (MWW statistic).
However, the ranks are assigned in a different way: In the ordered combined
sample assign rank 1 to the smallest value, rank 2 to the largest value, rank
3 to the second largest value, rank 4 to the second smallest value, and so
on, alternatively assigning ranks to end values. To test H0 : varX = varY vs
HA : varX > varY , reject H0 when the sum of ranks of the Y sample is large.
Find the mean, variance, and the limiting distribution of the test statistic.
Show how to find an approximate size α test.
2.13.35. Develop a sample size formula for the scale problem similar to the
sample size formula in the location problem, (2.4.25).
2.13.37. Compute the efficiency of Mood’s scale test and the Ansari-Bradley
scale test relative to the classical F -test for equality of variances.
2.13.38. Show that the Ansari-Bradley scale test is optimal for f(x) = (1/2)(1 + |x|)⁻², −∞ < x < ∞.
2.13.39. Show that when F and G have densities symmetric at 0 (or any common point), the expected value of SR⁺ is n1 n2/2.
2.13.40. Show that the estimate of (2.11.17) based on the empirical cdfs is
consistent and that it is a function only of the combined sample ranks.
2.13.41. Under the general model in Section 2.11.5, derive the limiting distribution of √n(Ȳ − ∆ − X̄).
2.13.42. Find the true asymptotic level of the pooled t-test under the null
hypothesis in (2.11.1).
(a) Obtain comparison dotplots between the heights of the pitchers and hit-
ters. Does a shift model seem appropriate?
(b) Use the MWW test statistic to test the hypotheses H0 : ∆ = 0 versus
HA : ∆ > 0. Compute the p-value.
(c) Determine a point estimate for ∆ and a 95% confidence interval for ∆
based on MWW procedure.
(d) Obtain an estimate of the standard deviation of ∆̂. Use it to obtain an approximate 95% confidence interval for ∆.
2.13.45. Repeat Exercise 2.13.44 when ∆ is the shift parameter for the dif-
ference in pitchers’ and hitters’ weights.
2.13.46. Repeat Exercise 2.13.44 when ∆ is the shift parameter for the dif-
ference in left-handed (A-1) and right-handed (A-0) pitchers’ ERA’s and the
hypotheses are H0 : ∆ = 0 versus HA : ∆ 6= 0.
X: 8 12 18
Y: 13 22 25
and the scores are a(1) = −6/3.6, a(2) = −1/3.6, a(3) = −1/3.6, a(4) =
1/3.6, a(5) = 1/3.6, a(6) = 6/3.6. Find the p-value of the test.
2.13.48. A study was performed to investigate the response time between two
drugs, A and B. It was thought that the response time for A was higher. Ten
subjects were selected. Each was randomly assigned to one of the drugs and
after a specified period (including a washout period), their response times were
recorded. Using a nonparametric procedure, test the appropriate hypotheses
and conclude in terms of the p-value.
Subject 1 2 3 4 5 6 7 8 9 10
A 114 116 97 54 91 103 99 63 86 102
B 105 111 72 81 56 98 121 81 69 87
2.13.49. Let X1, X2, . . . , Xn1 be a random sample with common cdf and pdf F(t) and f(t), respectively. Let Y1, Y2, . . . , Yn2 be a random sample with common cdf and pdf G(t) = F(t − ∆) and g(t) = f(t − ∆), respectively. Assume that the Yj's and Xi's are independent. Let a(i) be a set of rank scores such that Σ_{i=1}^n a(i) = 0, where n = n1 + n2. Let S(∆) = Σ_{i=1}^{n2} a(R(Yi − ∆)). Consider the hypotheses

H0 : ∆ = 0 versus HA : ∆ < 0 .

Assume that a level α test is to reject H0 if S(0) < c0. Prove that the power function of this test is nonincreasing (decreasing).
2.13.50. Let X1, X2, . . . , Xn1 be a random sample with common cdf and pdf F(t) and f(t), respectively. Let Y1, Y2, . . . , Yn2 be a random sample with common cdf and pdf G(t) = F(t − ∆) and g(t) = f(t − ∆), respectively. Assume that the Yj's and Xi's are independent. Let n = n1 + n2. Let a(i) = ϕ[i/(n + 1)] be a set of rank scores, where

ϕ(u) = { 1/4       3/4 < u < 1
       { u − 1/2   1/4 < u < 3/4
       { −1/4      0 < u < 1/4 .

Let S = Σ_{i=1}^{n2} a[R(Yi)]. Suppose the sampling results in: X: 8, 13 and Y: 12, 15.

(a) Compute S.

(b) Let ∆̂ be the corresponding estimator. Is ∆̂ > 0 or is ∆̂ < 0? Why (answer using the value of S)?
2.13.51. Suppose Y1 , . . . , Yn and X1 , . . . , Xn are all independent and have the
same distribution with support on (0, ∞), (X > 0 and Y > 0). Let Zi = Yi /Xi ,
i = 1, 2, . . . , n, and T = #{Zi > 1}.
(a) Find the distribution of T .
(b) Write a location model in terms of the log Zi . What does the location
parameter mean in terms of the original random variables?
(c) What is the underlying hypothesis of Part (b)? What does it mean in
terms of the original random variables?
Chapter 3
Linear Models
3.1 Introduction
In this chapter we discuss the theory for a rank-based analysis of a general
linear model. Applications of this analysis to experimental design models are
discussed in Chapters 4 and 5. The rank-based analysis is complete, consisting
of estimation, testing, and diagnostic tools for checking the adequacy of fit of
the model, outlier detection, and detection of influential cases. As in the earlier
chapters, we present the analysis in terms of its geometry.
The analysis could be based on either rank scores or signed-rank scores.
We have chosen to use the general rank scores of Chapter 2. This allows the
error distribution to be either asymmetric or symmetric. An analysis based
on signed-rank scores would parallel the one based on rank scores except that
the theory would require a symmetric error distribution; see Hettmansperger
and McKean (1983) for discussion. Although the results are established for
general score functions, we illustrate the methods with Wilcoxon and sign
scores throughout. We commonly use the subscripts R and S for results based
on Wilcoxon and sign scores, respectively.
There is software available for the robust nonparametric procedures dis-
cussed in this chapter. The software (R code) ww developed by Terpstra and
McKean (2005) computes the linear model procedures based on Wilcoxon
scores and, also, the high breakdown (HBR) procedures. It also computes
most of the diagnostic procedures discussed in this chapter. We illustrate its
use in several examples. The R software Rfit developed by Kloke and McKean
(2010) uses the R function optim to obtain the rank-based fit for general scores
functions. It includes functions for inference and diagnostics. Kapenga, Mc-
Kean, and Vidmar (1988) developed a Fortran program rglm which computes
these methods. A web interface for rglm is discussed by Crimin, Abebe, and
McKean (2008). See, also, McKean, Terpstra, and Kloke (2009) for a recent
review of computational procedures for rank-based fitting procedures.
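For example, with the CRAN package Rfit, a rank-based fit with Wilcoxon scores is obtained along the following lines (the data frame and variable names are hypothetical):

## Rank-based fit of a linear model with Rfit (Wilcoxon scores by default)
library(Rfit)
fit <- rfit(y ~ x1 + x2, data = mydata)
summary(fit)    # coefficient estimates, standard errors, and tests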
Yi = α + x′i β + ei . (3.2.2)
Y = 1α + η + e , where η ∈ ΩF . (3.2.4)
H0 : Mβ = 0 versus HA : Mβ ≠ 0 , (3.2.5)
Dϕ(Y, Ω_F) = ‖Y − Ŷϕ‖ϕ = min_{η∈Ω_F} ‖Y − η‖ϕ . (3.2.7)
[Figure: The geometry of the fits: the response Y, the subspaces Ω_F and Ω_R, the fitted values η̂_F and η̂_R, and the distances d_F and d_R.]
where τϕ and τ_S are the scale parameters defined in displays (3.4.4) and (3.4.6), respectively. From this result, an asymptotic confidence interval for the linear function h′β is given by

h′β̂ϕ ± t_{(α/2, n−p−1)} τ̂ϕ √( h′(X′X)⁻¹h ) , (3.2.9)

where the estimate τ̂ϕ is discussed in Section 3.7.1. The use of t-critical values instead of z-critical values is documented in the small sample studies cited in Section 3.7. Note the close analogy between this confidence interval and those based on LS estimates. The only difference is that σ̂ has been replaced by τ̂ϕ.
We make use of the coordinate-free model, especially in Chapter 4; however, in this chapter we are primarily concerned with the properties of the estimator β̂ϕ and it is more convenient to use the coordinate model (3.2.3). Define the dispersion function by

Dϕ(β) = ‖Y − Xβ‖ϕ . (3.2.10)
Ŷ_LS = Argmin_{η∈Ω_F} ‖Y − η‖²_LS ,

where ‖·‖_LS denotes the least squares pseudo-norm given by (2.2.16) of Chapter 2. The value of η which minimizes this pseudo-norm is

η̂_LS = HY , (3.2.14)

where H is the projection matrix onto the space Ω_F; i.e., H = X(X′X)⁻¹X′. Denote the sum of squared residuals by SSE = min_{η∈Ω_F} ‖Y − η‖²_LS = ‖(I − H)Y‖²_LS. In order to have similar notation we denote this minimum by D²_LS(Y, Ω_F). Also, it is easy to show that the least squares estimate of β is β̂_LS = (X′X)⁻¹X′Y.
Y = 1α + η + e , where η ∈ ΩF , (3.2.15)
and Ω_F is the column space of the full model design matrix X. Let Ŷ_{ϕ,Ω_F} denote the R fitted value in the full model. Note that Dϕ(Y, Ω_F) is the amount of residual dispersion not accounted for in fitting the Model (3.2.4). These are shown geometrically in Figure 3.2.1.
Next let Ω_R denote the reduced model subspace of Ω_F subject to H0. In symbols, Ω_R = {η ∈ Ω_F : η = Xβ, for some β such that Mβ = 0}. In Exercise 3.15.7 the reader is asked to show that Ω_R is a subspace of Ω_F of dimension p − q. Let Ŷ_{ϕ,Ω_R} denote the R estimate of η when the reduced model is fit and let Dϕ(Y, Ω_R) = ‖Y − Ŷ_{ϕ,Ω_R}‖ϕ denote the distance between Y and the subspace Ω_R. These are illustrated in Figure 3.2.1. The nonnegative quantity
RDϕ = Dϕ (Y, ΩR ) − Dϕ (Y, ΩF ) , (3.2.16)
denotes the reduction in residual dispersion when we pass from the re-
duced model to the full model. Large values of RDϕ indicate HA while small
values support H0 .
This drop in residual dispersion, RDϕ, is analogous to the drop in residual sums of squares for the LS analysis. In fact, to obtain this reduction in sums of squares, we need only replace the R norm with the square of the Euclidean norm in the above development. Thus the drop in sums of squared errors is

SS = D²_LS(Y, Ω_R) − D²_LS(Y, Ω_F) ,

where D²_LS(Y, Ω_F) is defined above. Hence the test statistic based on the reduction in sums of squared residuals can be written as

F_LS = (SS/q) / σ̂² , (3.2.17)

where σ̂² = D²_LS(Y, Ω_F)/(n − p). Other than replacing one norm with another, Figure 3.2.1 remains the same for the two analyses, LS and R.
In order to be useful as a test statistic, similar to least squares, the reduction in dispersion RDϕ must be standardized. The asymptotic distribution theory that follows suggests the standardization

Fϕ = (RDϕ/q) / (τ̂ϕ/2) , (3.2.18)

where τ̂ϕ is the estimate of τϕ discussed in Section 3.7. Small sample studies cited in Section 3.7 indicate that Fϕ should be compared with F-critical values
Table 3.2.1: Robust ANOVA Table for the Hypotheses H0 : Mβ = 0 versus HA : Mβ ≠ 0, where RDϕ = Dϕ(Y, Ω_R) − Dϕ(Y, Ω_F)

Source       Reduction in Dispersion   df            Mean Reduction in Dispersion   Fϕ
Regression   RDϕ                       q             RDϕ/q                          Fϕ
Error                                  n − (p + 1)   τ̂ϕ/2
Table 3.3.1: Data for Example 3.3.1. (The number of calls is in tens of millions
and the years are from 1950-1973. The top rows are years and the bottom rows
are the number of calls.)
50 51 52 53 54 55 56 57 58 59 60 61
0.44 0.47 0.47 0.59 0.66 0.73 0.81 0.88 1.06 1.20 1.35 1.49
62 63 64 65 66 67 68 69 70 71 72 73
1.61 2.12 11.90 12.40 14.20 15.90 18.20 21.20 4.30 2.40 2.70 2.90
The R scores test is the test based on the gradient. Theorem 3.5.2, below, gives the asymptotic distribution of the gradient Sϕ(0) under the null hypothesis. This leads to the asymptotic level α test: reject H0 if

( β̂′ϕ (X′X) β̂ϕ / p ) / τ̂ϕ² ≥ F(α, p, n − p − 1) . (3.2.22)
3.3 Examples
We offer several examples to illustrate the rank-based estimates and test procedures discussed in the last section. For all the examples, we use Wilcoxon scores, ϕ(u) = √12 (u − (1/2)), for the rank-based estimates of the regression
coefficients. We estimate the intercept by the median of the residuals and we
estimate the scale parameter τϕ as discussed in Section 3.7. We begin with a
simple regression data set and proceed to multiple regression problems.
Example 3.3.1 (Telephone Data). The response for this data set is the num-
ber of telephone calls (tens of millions) made in Belgium for the years 1950
through 1973. Time, the years, serves as our only predictor variable. The data
is discussed in Rousseeuw and Leroy (1987) and, for convenience, is displayed
in Table 3.3.1.
The Wilcoxon estimates of the intercept and slope are −7.13 and .145,
respectively, while the LS estimates are −26 and .504. The reason for this
disparity in fits is easily seen in Panel A of Figure 3.3.1 which is a scatter-
plot of the data overlaid with the LS and Wilcoxon fits. Note that the years
1964 through 1969 had a profound effect on the LS fit while the Wilcoxon fit
was much less sensitive to these years. As discussed in Rousseeuw and Leroy
the recording system for the years 1964 through 1969 differed from the other
years. Panels B and C of Figure 3.3.1 are the Studentized residual plots of
the fits; see (3.9.31) of Section 3.9. As with internal LS-Studentized residuals,
values of the internal R Studentized residuals which exceed 2 in absolute value
are potential outliers. Note that the internal Wilcoxon Studentized residuals
clearly show that the years 1964-1969 are outliers while the internal LS Stu-
dentized residuals only detect 1969. The Wilcoxon Studentized residuals also
mildly detect the year 1970. Based on the scatterplot, this point does not
follow the trend of the early (before 1964) years either. The scatterplot and
Wilcoxon residual plot indicate that there may be a quadratic trend over the
years before the outliers occur. The last few years, though, do not seem to
follow this trend. Hence, a linear model for this data is questionable. On the
basis of these plots, we do not discuss any formal inference for this data set.
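Although we carry out no formal inference, the Wilcoxon fit itself is easy to reproduce, at least approximately, by direct minimization of the dispersion function sketched in Section 3.2; the data below are those of Table 3.3.1, and the optimizer choice is again our own.

    # Belgian telephone data (years 50-73; calls in tens of millions)
    year = np.arange(50, 74, dtype=float)
    calls = np.array([0.44, 0.47, 0.47, 0.59, 0.66, 0.73, 0.81, 0.88,
                      1.06, 1.20, 1.35, 1.49, 1.61, 2.12, 11.90, 12.40,
                      14.20, 15.90, 18.20, 21.20, 4.30, 2.40, 2.70, 2.90])
    Xc = (year - year.mean()).reshape(-1, 1)            # centered design

    fit = minimize(dispersion, np.zeros(1), args=(Xc, calls),
                   method="Nelder-Mead")
    slope = fit.x[0]                                    # roughly .145
    alpha = np.median(calls - Xc @ fit.x)               # median of residuals
    intercept = alpha - slope * year.mean()             # roughly -7.13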
Figure 3.3.1: Panel A: Scatterplot of the telephone data, overlaid with the LS
and Wilcoxon fits; Panel B: Internal LS Studentized residual plot; Panel C:
Internal Wilcoxon Studentized residual plot; and Panel D: Wilcoxon dispersion
function.
Panel D of Figure 3.3.1 depicts the Wilcoxon dispersion function over the
interval (−.2, .6). Note that the Wilcoxon estimate β̂R = .145 is the minimizing value. Next consider the hypotheses H0: β = 0 versus HA: β ≠ 0. The basis for the test statistic Fϕ can be read from this plot. The reduction in dispersion is given by RD = D(0) − D(.145). Also, the gradient test of these hypotheses would be the negative of the slope of the dispersion function at 0; i.e., −D′(0).
Figure 3.3.2: Panels A-G: Plots of log-salary versus each of the predictors for
the baseball data of Example 3.3.2; Panel H: Internal Wilcoxon Studentized
residual plot.
Table 3.3.2: Predictors for Baseball Salaries of Pitchers and Their Estimated
(Wilcoxon Fit) Coefficients
Predictor Estimate Stand. Error t-ratio
log Years in professional baseball .839 .044 19.15
Average wins per year .045 .028 1.63
Average losses per year -.024 .026 -.921
Earned run average -.146 .070 -2.11
Average games per year -.006 .004 -1.60
Average innings per year .004 .003 1.62
Average saves per year .012 .011 1.07
Intercept 4.22 .324
Scale (τ ) .388
In Example 3.9.2 of Section 3.9 a residual analysis of this data set is per-
formed. This analysis indicates that the model which includes the covariate,
the linear terms of the factors, the simple two-way interaction terms of the
factors, and the quadratic terms of the three factors SAE, SAI, and ADS is
adequate. Let xj for j = 1, . . . , 4 denote the level of the factors SAI, SAE,
ADS, and TYPE, respectively, and let ci denote the value of the covariate.
Then the model is expressed as

yi = α + β1 x1,i + β2 x2,i + β3 x3,i + β4 x4,i + β5 x1,i x2,i + β6 x1,i x3,i
   + β7 x1,i x4,i + β8 x2,i x3,i + β9 x2,i x4,i + β10 x3,i x4,i
   + β11 x²1,i + β12 x²2,i + β13 x²3,i + β14 ci + ei . (3.3.1)
The Wilcoxon and LS estimates of the regression coefficients and their
standard errors are given in Table 3.3.4. The Wilcoxon estimates are more
precise. As the diagnostic analysis of Example 3.9.2 shows, this is due to the
outliers in this data set.
Note that the Wilcoxon estimate of the parameter β13 , the quadratic term
of the factor ADS is significant. Again referring to the residual analysis given in
Example 3.9.2, there is some graphical evidence to retain the three quadratic
coefficients in the model. In order to statistically confirm this evidence, we
test the hypotheses
H0 : β11 = β12 = β13 = 0 versus HA : βj ≠ 0 for some j = 11, 12, 13 .
The Wilcoxon test is summarized in Table 3.3.5 and it is based on the test
statistic (3.2.18). The p-value of the test is 0.047 and, hence, is significant
at the 0.05 level. The LS F -test statistic is insignificant, though, with the p-
value 0.340. As with its estimates of the regression coefficients, the LS F -test
statistic has been impaired by the outliers.
see Exercise 1.12.21. It follows from (3.4.2) that assumption (E.1) implies that f is uniformly bounded and is uniformly continuous.
An assumption that is used for analyses based on the L1 norm is (E.2): f(θe) > 0, where θe denotes the median of the error distribution. The scale parameter τϕ is given by

τϕ = [ ∫₀¹ ϕ(u)ϕf(u) du ]^{-1} , (3.4.4)

where

ϕf(u) = −f′(F^{-1}(u)) / f(F^{-1}(u)) . (3.4.5)
Under (E.1) the scale parameter τϕ is well defined. A second scale parameter
τS is defined as:
τS = (2f (θe ))−1 ; (3.4.6)
see (1.5.22). Note that it is well defined under assumption (E.2).
As above let H = X(X′X)^{-1}X′ denote the projection matrix onto Ω, the column space of X. Our asymptotic theory assumes that the design matrix X is embedded in a sequence of design matrices which satisfy the next two properties. We should subscript quantities such as X and the projection matrix with n to show this, but as a matter of convenience we have not done so. We do subscript the leverage values hnii, which are the diagonal entries of the projection matrix H. We often impose the next two conditions on the design matrix:
Hence

max_{1≤k≤p} max_{1≤i≤n} ( x²ik / Σ_{j=1}^n x²jk ) ≤ max_{1≤i≤n} hnii .
E[S(Y − Xβ0)] = 0
V[S(Y − Xβ0)] = σ²a X′X ,

where σ²a = (n − 1)^{-1} Σ_{i=1}^n a²(i) ≐ 1; the covariance between any two distinct scores is

cov( a(R(ei)), a(R(ej)) ) = −σ²a/n . (3.5.1)
The first and third limits are positive reals. For the second limit, note that the random variable inside the expectation is bounded; hence, by the Lebesgue Dominated Convergence Theorem we can interchange the limit and expectation. Since (ǫBn/Mn) → ∞ as n → ∞, the expectation goes to 0 and our desired result is obtained.
Similar to Chapter 2, Exercise 3.15.9 obtains the proof of the above theo-
rem for the special case of the Wilcoxon scores by first getting the projection
of the statistic W .
Note from this theorem we have the gradient test that all the regression coefficients are 0; that is, H0: β = 0 versus HA: β ≠ 0. Consider the test statistic

T = σa^{-2} S(Y)′(X′X)^{-1}S(Y) . (3.5.8)
From the last theorem, an approximate level α test for H0 versus HA is: reject H0 if T ≥ χ²(α, p), where χ²(α, p) denotes the upper level α critical value of the χ²-distribution with p degrees of freedom.
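A sketch of (3.5.8) and the corresponding decision rule, reusing the Wilcoxon-score helpers introduced earlier (Wilcoxon scores and a centered design are our working assumptions):

    from scipy.stats import chi2

    def gradient_test(X, Y, level):
        # S(Y) = X' a(R(Y)); T = S'(X'X)^{-1} S / sigma_a^2, cf. (3.5.8)
        n, p = X.shape
        a = wilcoxon_scores(n)
        ranks = np.argsort(np.argsort(Y)) + 1
        S = X.T @ a[ranks - 1]
        sigma_a2 = np.sum(a ** 2) / (n - 1.0)
        T = float(S @ np.linalg.solve(X.T @ X, S)) / sigma_a2
        return T, T >= chi2.ppf(1.0 - level, p)   # reject H0 if T >= chi^2(level, p)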
Theorem A.3.8 of the Appendix gives the following linearity result for the
process S(β n ):
n^{-1/2} S(βn) = n^{-1/2} S(β0) − τϕ^{-1} Σ √n(βn − β0) + op(1) , (3.5.10)

for √n(βn − β0) = O(1), where the scale parameter τϕ is given by (3.4.4).
Recall that we have made use of this result in Section 2.5 when we showed
that the two-sample location process under general scores functions is Pitman
Regular. If we integrate the RHS of this result we obtain a locally smooth
approximation of the dispersion function D(β n ) which is given by the following
quadratic function:
the R estimates and test statistics. As discussed in Section 3.7.3, it also leads
to a Gauss-Newton-type algorithm for obtaining R estimates.
The following theorem shows that Q provides a local approximation to D.
This is an asymptotic quadraticity result which was proved by Jaeckel (1972).
It in turn is based on an asymptotic linearity result derived by Jurečková
(1971) and displayed above, (3.5.10). It is proved in the Appendix; see Theo-
rem A.3.8.
Theorem 3.5.3. Under the Model (3.2.3) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4, for any ǫ > 0 and c > 0,

P[ max_{‖β−β0‖<c/√n} |D(Y − Xβ) − Q(Y − Xβ)| ≥ ǫ ] → 0 , (3.5.12)
as n → ∞.
We use this result to obtain the asymptotic distribution of the R estimate.
Without loss of generality assume that the true β 0 = 0. Then we can write
Q(Y − Xβ) = (2τϕ)^{-1}β′X′Xβ − β′S(Y) + D(Y). Because Q is a quadratic function it follows from differentiation that it is minimized by

β̃ = τϕ(X′X)^{-1}S(Y) . (3.5.13)
Under the further restriction that the errors have finite variance σ², Exercise 3.15.10 shows that the least squares estimate β̂LS of β satisfies √n(β̂LS − β) →D Np(0, σ²Σ^{-1}). Hence as in the location problems of Chapters
1 and 2, the asymptotic relative efficiency between the R estimates and least
squares is the ratio σ 2 /τϕ2 , where τϕ is the scale parameter (3.4.4). Thus the R
estimates of regression coefficients have the same high efficiency relative to LS
estimates as do the rank-based estimates in the location problem. In particu-
lar, the efficiency of the Wilcoxon estimates relative to the LS estimates at the
normal distribution is .955. For longer tailed error distributions this relative
efficiency is much higher; see the efficiency discussion for contaminated normal
distributions in Example 1.7.1.
From the above corollary, R estimates are asymptotically unbiased. It fol-
lows from the invariance properties, if we additionally assume that the errors
have a symmetric distribution, that R estimates are unbiased for all sample
sizes; see Exercise 3.15.11 for details.
The random vector β̃, (3.5.13), is an asymptotic representation of the R estimate β̂ϕ. The following representation proves useful later:
Corollary 3.5.2. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1) in Section 3.4,

√n( β̂ϕ − β ) = τϕ (n^{-1}X′X)^{-1} n^{-1/2} X′ϕ(F(Y)) + op(1) ,

where the notation ϕ(F(Y)) means the n × 1 vector whose ith component is ϕ(F(Yi)).
Proof: This follows immediately from (A.3.9), (A.3.10), the proof of Theorem
3.5.2, and equation (3.5.13).
Based on this last corollary, we have that the influence function of the R estimate is given by

Ω(x0, y0; β̂ϕ) = τϕ Σ^{-1} x0 ϕ(F(y0)) . (3.5.17)
Lemma 3.5.1. Assume conditions (E.1), (E.2), (S.1), (D.1), and (D.2) of Section 3.4. For any ǫ > 0 and for any a ∈ R,

lim_{n→∞} P[ |S1(Y − an^{-1/2}1 − Xβ̂ϕ) − S1(Y − an^{-1/2}1)| ≥ ǫ√n ] = 0 .
The proof of this lemma was first given by Jurečková (1971) for general signed-
rank scores and it is briefly sketched in the Appendix for the sign scores; see
Lemma A.3.2. This lemma leads to the asymptotic linearity result for the
process (3.5.18).
We need the following linearity result:
Theorem 3.5.6. Assume conditions (E.1), (E.2), (S.1), (D.1), and (D.2) of Section 3.4. For any ǫ > 0 and c > 0,

P[ sup_{|a|≤c} |n^{-1/2}S1(Y − an^{-1/2}1 − Xβ̂ϕ) − n^{-1/2}S1(Y − Xβ̂ϕ) + aτS^{-1}| ≥ ǫ ] → 0 .

Proof: By the triangle inequality, the quantity inside the probability statement is bounded above by

|n^{-1/2}S1(Y − an^{-1/2}1 − Xβ̂ϕ) − n^{-1/2}S1(Y − an^{-1/2}1)|
+ |n^{-1/2}S1(Y − an^{-1/2}1) − n^{-1/2}S1(Y) + aτS^{-1}|
+ |n^{-1/2}S1(Y) − n^{-1/2}S1(Y − Xβ̂ϕ)| .
We can apply Lemma 3.5.1 to the first and third terms on the right side of
the above inequality. For the middle term we can use the asymptotic linearity
result in Chapter 1 for the sign process, (1.5.23). This yields the result for
any a and the sup follows from the monotonicity of the process, similar to the
proof of Theorem 1.5.6 of Chapter 1.
Letting a = 0 in Lemma 3.5.1, we have that the difference n^{-1/2}S1(Y − Xβ̂ϕ) − n^{-1/2}S1(Y) goes to zero in probability. Thus the asymptotic distribution of n^{-1/2}S1(Y − Xβ̂ϕ) is the same as that of n^{-1/2}S1(Y), namely, N(0, 1).
We have two applications of these results. The first is found in the next lemma.
Lemma 3.5.2. Assume conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4. The random variable n^{1/2}α̂S is bounded in probability.

Proof: Let ǫ > 0 be given. Since n^{-1/2}S1(Y − Xβ̂ϕ) is asymptotically N(0, 1), there exists a c < 0 such that

P[ n^{-1/2}S1(Y − Xβ̂ϕ) < c ] < ǫ/2 . (3.5.20)

Take c* = τS(c − ǫ). By the process's monotonicity and the definition of α̂S, we have the implication n^{1/2}α̂S < c* ⇒ n^{-1/2}S1(Y − c*n^{-1/2}1 − Xβ̂ϕ) ≤ 0. Adding in and subtracting out the above linearity result leads to

P[ n^{1/2}α̂S < c* ] ≤ P[ n^{-1/2}S1(Y − n^{-1/2}c*1 − Xβ̂ϕ) ≤ 0 ]
  ≤ P[ |n^{-1/2}S1(Y − n^{-1/2}c*1 − Xβ̂ϕ) − n^{-1/2}S1(Y − Xβ̂ϕ) + c*τS^{-1}| ≥ ǫ ]
    + P[ n^{-1/2}S1(Y − Xβ̂ϕ) − c*τS^{-1} < ǫ ] .

The first term on the right side can be made less than ǫ/2 for sufficiently large n, whereas the second term is (3.5.20). From this it follows that n^{1/2}α̂S is bounded below in probability; a similar argument yields an upper bound, and hence n^{1/2}α̂S is bounded in probability.
where τS is given in (3.4.6). From this we have that n^{1/2}(α̂S − α0) →D N(0, τS²). Our interest, though, is in the joint distribution of α̂S and β̂ϕ.
By Corollary 3.5.2 the corresponding asymptotic representation of β̂ϕ for the true vector of regression coefficients β0 is

n^{1/2}( β̂ϕ − β0 ) = τϕ (n^{-1}X′X)^{-1} n^{-1/2} X′ϕ(F(Y)) + op(1) , (3.5.23)
Proof: As above assume without loss of generality that the true parameters are 0. It is easier to work with the random vector Tn = ( τS^{-1}√n α̂S, √n (τϕ^{-1}(n^{-1}X′X)β̂ϕ)′ )′. Let t = (t1, t2′)′ be an arbitrary, nonzero vector in R^{p+1}. We need only show that Zn = t′Tn has an asymptotically univariate normal distribution. By the above representations,

Zn = n^{-1/2} Σ_{k=1}^n ( t1 sgn(Yk) + (t2′xk)ϕ(F(Yk)) ) + op(1) . (3.5.24)
Denote the sum on the right side of (3.5.24) as Zn∗ . We need only show that
Zn∗ converges in distribution to a univariate normal distribution. Denote the
kth summand as Z*nk. We use the Lindeberg-Feller Central Limit Theorem. Our application of this theorem is similar to its use in the proof of Theorem 3.5.2. First note that since the score function ϕ is standardized (∫ϕ = 0), E(Z*n) = 0. Let B²n = Var(Z*n). Because the individual summands are independent, the Yk are identically distributed, ϕ is standardized (∫ϕ² = 1), and the design is centered, B²n simplifies to

B²n = n^{-1} ( Σ_{k=1}^n t1² + Σ_{k=1}^n (t2′xk)² + 2t1 cov(sgn(Y1), ϕ(F(Y1))) Σ_{k=1}^n t2′xk )
    = t1² + t2′(n^{-1}X′X)t2 + 0 .

Hence by (D.2),

lim_{n→∞} B²n = t1² + t2′Σt2 . (3.5.25)
Since B²n converges to a positive constant, we need only show that the Lindeberg sum

Σ_{k=1}^n E[ Z*²nk I(|Z*nk| > ǫBn) ] (3.5.26)

converges to 0. By the triangle inequality the indicator function satisfies

I( n^{-1/2}|t1| + n^{-1/2}|t2′xk||ϕ(F(Yk))| > ǫBn ) ≥ I( |Z*nk| > ǫBn ) . (3.5.27)
Following the discussion after expression (3.5.7), we have that n^{-1/2}|x′kt2| ≤ Mn where Mn is independent of k and, furthermore, Mn → 0. Hence, we have

I( |ϕ(F(Yk))| > (ǫBn − n^{-1/2}t1)/Mn ) ≥ I( n^{-1/2}|t1| + n^{-1/2}|t2′xk||ϕ(F(Yk))| > ǫBn ) . (3.5.28)
Thus the sum in expression (3.5.26) is less than or equal to

Σ_{k=1}^n E[ Z*²nk I( |ϕ(F(Yk))| > (ǫBn − n^{-1/2}t1)/Mn ) ]
= t1² E[ I( |ϕ(F(Y1))| > (ǫBn − n^{-1/2}t1)/Mn ) ]
+ (2/n) E[ sgn(Y1)ϕ(F(Y1)) I( |ϕ(F(Y1))| > (ǫBn − n^{-1/2}t1)/Mn ) ] Σ_{k=1}^n t2′xk
+ E[ ϕ²(F(Y1)) I( |ϕ(F(Y1))| > (ǫBn − n^{-1/2}t1)/Mn ) ] (1/n) Σ_{k=1}^n (t2′xk)² .
Because the design is centered, the middle term on the right side is 0. As remarked above, the term (1/n)Σ_{k=1}^n(t2′xk)² = (1/n)t2′X′Xt2 converges to a positive constant. In the expression (ǫBn − n^{-1/2}t1)/Mn, the numerator converges to a positive constant as the denominator converges to 0; hence, the expression goes to ∞. Therefore, since ϕ is bounded, the indicator function converges to 0. Again using the boundedness of ϕ, we can interchange limit and expectation by the Lebesgue Dominated Convergence Theorem. Thus condition (3.5.26) is true and, hence, Z*n converges in distribution to a univariate normal distribution. Therefore, Tn converges to a multivariate normal distribution. Note by (3.5.25) it follows that the asymptotic covariance matrix of (α̂S, β̂′ϕ)′ is the result displayed in the theorem.
In the above development, we considered the centered design. In practice, though, we are often concerned with an uncentered design. Let α* denote the intercept for the uncentered model. Then α* = α − x̄′β, where x̄ denotes the vector of column averages of the uncentered design matrix. An estimate of α* based on R estimates is given by α̂*S = α̂S − x̄′β̂ϕ. Based on the last theorem, it follows (Exercise 3.15.14) that

(α̂*S, β̂′ϕ)′ is approximately N_{p+1}( (α0, β0′)′, [ κn , −τϕ²x̄′(X′X)^{-1} ; −τϕ²(X′X)^{-1}x̄ , τϕ²(X′X)^{-1} ] ) , (3.5.29)

where κn = n^{-1}τS² + τϕ²x̄′(X′X)^{-1}x̄ and τS and τϕ are given respectively by (3.4.6) and (3.4.4).
If Wilcoxon scores are used, then the estimate is the median of the Walsh averages, (1.3.25), while if sign scores are used the estimate is the median of the residuals.
Let b̂⁺ϕ = (α̂⁺ϕ, β̂′ϕ)′. We next briefly sketch the development of the asymptotic distribution of b̂⁺ϕ. Assume without loss of generality that the true parameter vector (α0, β′0)′ is 0. Suppose instead of the residuals we had the true errors in (3.5.30). Theorem A.2.11 of the Appendix then yields an asymptotic linearity result for the process. McKean and Hettmansperger (1976) show that this result holds for the residuals also; that is,

n^{-1/2} S⁺(êR − αn^{-1/2}1) = n^{-1/2} S⁺(e) − ατϕ^{-1} + op(1) , (3.5.32)

for all |α| ≤ c, where c > 0. Using arguments similar to those in McKean and Hettmansperger (1976), we can show that √n α̂⁺ϕ is bounded in probability; hence, by (3.5.32) we have that

√n α̂⁺ϕ = τϕ n^{-1/2} S⁺(e) + op(1) . (3.5.33)
But by (A.2.43) and (A.2.45) of the Appendix, we have the second representation given by

√n α̂⁺ϕ = τϕ n^{-1/2} Σ_{i=1}^n ϕ⁺(F⁺(|ei|)) sgn(ei) + op(1)
        = τϕ n^{-1/2} Σ_{i=1}^n ϕ⁺(2F(ei) − 1) + op(1) , (3.5.34)

where F⁺ is the distribution function of the absolute errors |ei|. Due to symmetry, F⁺(t) = 2F(t) − 1. Then using the relationship between the rank and the signed-rank scores, ϕ⁺(u) = ϕ((u + 1)/2), we obtain finally

√n α̂⁺ϕ = τϕ n^{-1/2} Σ_{i=1}^n ϕ(F(Yi)) + op(1) . (3.5.35)
Theorem 3.5.8. Under assumptions (D.1), (D.2), (E.1), (E.2), (S.1), and (S.3) of Section 3.4,

(α̂⁺ϕ, β̂′ϕ)′ has an approximate N_{p+1}( (α0, β′0)′, τϕ²(X′1X1)^{-1} ) distribution , (3.5.37)

where X1 = [1 X].
H0 : Mβ = 0 versus HA : Mβ ≠ 0 , (3.6.1)
Lemma 3.6.1. Let β̂ denote the R estimate of β in the full model (3.2.3). Then under (E.1), (S.1), (D.1), and (D.2) of Section 3.4,

D(β̂) − Q(β̂) →P 0 . (3.6.2)
>1−ǫ.
Lemma 3.6.2. Let β̃ denote the minimizing value of the quadratic function Q. Then under (E.1), (S.1), (D.1), and (D.2) of Section 3.4,

Q(β̃) − Q(β̂) →P 0 . (3.6.3)
It is shown in Exercise 3.15.15 that the factor in brackets in the last equation
is bounded in probability. Since the left factor converges to zero in probability
by Theorem 3.5.5 the desired result follows.
It is easier to work with the equivalent formulation of the linear hypotheses given by
Lemma 3.6.3. An equivalent formulation of the model and the hypotheses is:
where the columns of Q1 form an orthonormal basis for the kernel of the
matrix M, the columns of Q2 form an orthonormal basis for the column space
It follows that

Y = 1α + Xβ + e = 1α + X*1β*1 + X*2β*2 + e .

Hence, rewrite the model as

Y = 1α + X1β1 + X2β2 + e , (3.6.7)

and the hypotheses as

H0: β2 = 0 versus HA: β2 ≠ 0 . (3.6.8)
With these lemmas, we are now ready to obtain the asymptotic distribution
of Fϕ. Let βr = (β′1, 0′)′ denote the reduced model vector of parameters, let β̂r,1 denote the reduced model R estimate of β1, and let β̂r = (β̂′r,1, 0′)′. We use similar notation with the minimizing value of the approximating quadratic Q. With this notation, the drop in dispersion becomes RDϕ = D(β̂r) − D(β̂). McKean and Hettmansperger (1976) proved the following:
Theorem 3.6.1. Suppose the assumptions (E.1), (D.1), (D.2), and (S.1) of
Section 3.4 hold. Then under H0 ,
RDϕ / (τϕ/2) →D χ²(q) .
By Lemma 3.6.1 the first and fifth differences go to zero in probability and
by Lemma 3.6.2 the second and fourth differences go to zero in probability.
Hence we need only show that the middle difference converges in distribution to the intended distribution. As in Lemma 3.6.2, algebra leads to

Q(β̃) = −2^{-1}τϕ S(Y)′(X′X)^{-1}S(Y) + D(Y) ,

while

Q(β̃r) = −2^{-1}τϕ S(Y)′ [ (X′1X1)^{-1} 0 ; 0 0 ] S(Y) + D(Y) .

Combining these last two results, the middle difference becomes

Q(β̃r) − Q(β̃) = 2^{-1}τϕ S(Y)′ { (X′X)^{-1} − [ (X′1X1)^{-1} 0 ; 0 0 ] } S(Y) ,

where

X′X = [ A1 B ; B′ A2 ]

and

W = ( A2 − B′A1^{-1}B )^{-1} . (3.6.9)
Fϕ = (RDϕ/q) / (τ̂ϕ/2) . (3.6.10)
Although the test statistic qFϕ has an asymptotic χ2 distribution, small sample
studies (see below) have indicated that it is best to compare the test statistic
with F-critical values having q and n − p − 1 degrees of freedom; that is, the test at nominal level α is:

Reject H0: Mβ = 0 in favor of HA: Mβ ≠ 0 if Fϕ ≥ F(α, q, n − p − 1) . (3.6.11)
McKean and Sheather (1991) review numerous small sample studies concerning the validity of the rank-based analysis based on the test statistic Fϕ. These small sample studies demonstrate that the empirical α levels of Fϕ over a variety of designs, sample sizes, and error distributions are close to the nominal values.
In classical inference there are three tests of general hypotheses: the like-
lihood ratio test (reduction in sums of squares test), Wald’s test, and Rao’s
scores (gradient) test. A good discussion of these tests can be found in Rao
(1973). When the hypotheses are the general linear hypotheses (3.6.1), the er-
rors have a normal distribution, and the least squares procedure is used then
the three test statistics are algebraically equivalent. Actually the equivalence
holds without normality, although in this case the reduction in sums of squares
statistic is not the likelihood ratio test; see the discussion in Hettmansperger
and McKean (1983).
There are three rank-based tests for the general linear hypotheses, also.
The reduction in dispersion test statistic Fϕ is the analogue of the likeli-
hood ratio test, i.e., the reduction in sums of squares test. Since Wald’s test
statistic is a quadratic form in full model estimates, its rank analogue is given by

Fϕ,Q = ( (Mβ̂)′ [ M(X′X)^{-1}M′ ]^{-1} (Mβ̂) / q ) / τ̂ϕ² . (3.6.12)
Provided τ̂ϕ is a consistent estimate of τϕ, it follows from the asymptotic distribution of β̂, Corollary 3.5.1, that under H0, qFϕ,Q has an asymptotic χ² distribution. Hence the test statistics Fϕ and Fϕ,Q have the same null asymptotic distributions. Actually, as Exercise 3.15.16 shows, the difference of the test statistics converges to zero in probability under H0. Unlike the classical methods, though, they are not algebraically equivalent; see Hettmansperger and McKean (1983).
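A sketch of the Wald-type statistic (3.6.12); the full model estimate beta_hat and the scale estimate tau_hat are assumed to have been computed already.

    def F_phi_Q(beta_hat, tau_hat, X, M):
        # (M beta)'[M (X'X)^{-1} M']^{-1}(M beta) / (q tau^2), cf. (3.6.12)
        n, p = X.shape
        q = M.shape[0]
        Mb = M @ beta_hat
        MXXM = M @ np.linalg.solve(X.T @ X, M.T)     # M (X'X)^{-1} M'
        stat = float(Mb @ np.linalg.solve(MXXM, Mb)) / (q * tau_hat ** 2)
        p_value = 1.0 - f_dist.cdf(stat, q, n - p - 1)
        return stat, p_value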
The rank gradient scores test is easiest to define in terms of the repa-
rameterized model, (3.6.18); that is, the null hypothesis is H0 : β 2 = 0.
Rewrite the random vector defined in (3.6.10) of Theorem 3.6.1 using as the
true parameter under H0, β0 = (β′01, 0′)′; i.e.,

( [−B′A1^{-1}  I] n^{-1/2}S(Y − Xβ0) )′ (nW) ( [−B′A1^{-1}  I] n^{-1/2}S(Y − Xβ0) ) . (3.6.13)
From the proof of Theorem 3.6.1 this quadratic form has an asymptotic χ² distribution with q degrees of freedom. Since it does depend on β0, it cannot serve directly as a test statistic; substituting a reduced model estimate for β0 leads to an aligned rank test.
The reduced model estimate must satisfy √n(β̂*r − β0) = Op(1) under H0.
Then the statistic in (3.6.15) is

A*ϕ = S*′2 { X′2X2 − X′2X1(X′1X1)^{-1}X′1X2 }^{-1} S*2 , (3.6.17)

where

S*2 = S2(Y − X1β̂*r,1) − X′2X1(X′1X1)^{-1} S1(Y − X1β̂*r,1) . (3.6.18)

Note that when the R estimate is used, the second term in S*2 vanishes and we have (3.6.15); see Adichie (1978) and Chiang and Puri (1984).
Hettmansperger and McKean (1983) give a general discussion of these three
tests. Note that both Fϕ,Q and Fϕ require estimation of full model estimates
and the scale parameter τϕ while Aϕ does not. However when using a linear
model, one is usually interested in more than hypothesis testing. Of primary
interest is checking the quality of the fit; i.e., does the model fit the data.
This requires estimation of the full model parameters and an estimate of τϕ .
Diagnostics for fits based on R estimates are the topics of Section 3.9. One is
also usually interested in estimating contrasts and their standard errors. For
R estimates this requires an estimate of τϕ . Moreover, as discussed in Hett-
mansperger and McKean (1983), the small sample properties of the aligned
rank test can be poor on certain designs.
The influence function of the test statistic Fϕ is derived in Appendix A.5.2. As discussed there, it is easier to work with √(qFϕ). The result is given by

Ω(x0, y0; √(qFϕ)) = |ϕ[F(y0 − x′0βr)]| ( x′0 { (X′X)^{-1} − [ (X′1X1)^{-1} 0 ; 0 0 ] } x0 )^{1/2}  (3.6.19)

and, as shown in the Appendix, the null distribution of Fϕ can be read from this result. Note that similar to the R estimates, the influence function of Fϕ is bounded in the Y-space but not in the x-space; see (3.5.17).
Consistency
We want to show that the test statistic Fϕ is consistent for the general linear
hypothesis, (3.2.5). Without loss of generality, we again reparameterize the
model as in (3.6.18) and consider as our hypothesis H0: β2 = 0 versus HA: β2 ≠ 0. Let β0 = (β′01, β′02)′ be the true parameter. We assume that the alternative is true; hence, β02 ≠ 0. Let α be a given level of significance.
Let T(τϕ) = RDϕ/(τϕ/2) where RDϕ = D(β̂r) − D(β̂). Because we estimate τϕ under the full model by a consistent estimate, to show consistency of Fϕ it suffices to show

Pβ0[ T(τϕ) ≥ χ²α,q ] → 1 , (3.6.20)
as n → ∞.
As in the proof under the null hypothesis, it is convenient to work with
the approximating quadratic function Q(Y − Xβ), (3.5.11). As above, let β̃ and β̂ denote the minimizing values of Q and D, respectively, under the full model. The present argument simplifies if, for the full model, we replace β̂ by β̃.
Applying asymptotic quadraticity, Theorem A.3.8, the first and third differences go to 0 in probability, while the second difference goes to 0 in probability by Lemma 3.6.2; hence the left side goes to 0 in probability under the alternative model. Thus we need only show that

Pβ0[ (2/τϕ)( D(β̂r) − D(β̃) ) ≥ χ²α,q ] → 1 , (3.6.21)
Theorem 3.6.2. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section
3.4 hold. The test statistic Fϕ is consistent for the hypotheses (3.2.5).
Efficiency Results
The above result establishes that the rank-based test statistic Fϕ is consistent
for the general linear hypothesis, (3.2.5). We next derive the efficiency results
of the test. Our first step is to obtain the asymptotic power of Fϕ along a
sequence of alternatives. This generalizes the asymptotic power lemmas dis-
cussed in Chapters 1 and 2. From this the efficiency results follow. As with the
consistency discussion it is more convenient to work with the model (3.6.18).
The sequence of alternative models to the hypothesis H0 : β 2 = 0 is:
Y = 1α + X1β1 + X2(θ/√n) + e , (3.6.22)
Theorem 3.6.3. Under the sequence of models (3.6.22) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4,

RDϕ / (τϕ/2) →D χ²q(ηϕ) ,

where χ²q(ηϕ) has a noncentral χ²-distribution with q degrees of freedom and noncentrality parameter

ηϕ = τϕ^{-2} θ′W0^{-1}θ , (3.6.24)

where W0 = lim_{n→∞} nW and W is defined in display (3.6.9).
Proof: As in the proof of Theorem 3.6.1 we can write the drop in dispersion as the sum of the same five differences. Since the first two and last two differences go to zero in probability under the null model, it follows from the discussion on contiguity (Section A.2.2) that these differences go to zero in probability under the model (3.6.22). Hence we need only be concerned about the middle difference. Since β1 = 0, the middle difference reduces to the same quantity as in Theorem 3.6.1; i.e., we obtain

RDϕ / (τϕ/2) = ( [−B′A1^{-1} I] S(Y) )′ W ( [−B′A1^{-1} I] S(Y) ) + op(1) .
for all c > 0. Since √n‖βn‖ = ‖θ‖, we can take c = ‖θ‖ and get

‖ n^{-1/2}S(Y − Xβn) − n^{-1/2}S(Y) − τϕ^{-1}Σ(0′, θ′)′ ‖ = op(1) . (3.6.25)
The above probability statements hold under the null model and, hence, by
contiguity under the sequence of models (3.6.22) also. Under the sequence of
models (3.6.22), however,

n^{-1/2}S(Y − Xβn) →D Np(0, Σ) .
Both matrices A2 and A−1 1 are positive definite; hence, the noncentrality pa-
rameter is maximized when θ is in the kernel of B. One way of assuring this
for a design is to take B = 0. Because B = X′1 X2 this condition holds for
orthogonal designs. Therefore orthogonal designs are generally more efficient
than nonorthogonal designs.
We next obtain the asymptotic relative efficiency of the test statistic Fϕ
with respect to the least squares classical F -test, FLS , defined by (3.2.17) in
Section 3.2.2. The theory for FLS under local alternatives is outlined in Exer-
cise 3.15.18 where it is shown that, under the additional assumption that the
random errors ei have finite variance σ 2 , the null asymptotic distribution of
qFLS is a central χ2q distribution. Thus both Fϕ and FLS have the same asymp-
totic null distribution. As outlined in Exercise 3.15.18, under the sequence of
models (3.6.22), qFLS has an asymptotic noncentral χ²_{q,ηLS} distribution with noncentrality parameter

ηLS = (σ²)^{-1} θ′W0^{-1}θ . (3.6.27)
Based on Theorem 3.6.3, the asymptotic relative efficiency of Fϕ and FLS
is the ratio of their noncentrality parameters; i.e.,

e(Fϕ, FLS) = ηϕ/ηLS = σ²/τϕ² .
Thus the efficiency results for the rank-based estimates and tests discussed in
this section are the same as the efficiency results presented in Chapters 1 and
2. An asymptotically efficient analysis can be obtained if the selected rank
score function is ϕf (u) = −f0′ (F0−1 (u))/f0(F0−1 (u)) where f0 is the form of the
density of the error distribution. If the errors have a logistic distribution then
the Wilcoxon scores result in an asymptotically efficient analysis.
Usually we have no knowledge of the distribution of the errors. In which
case, we would recommend using Wilcoxon scores. With them, the loss in
relative efficiency to the classical analysis at the normal distribution is only
5%, while the gain in efficiency over the classical analysis for long-tailed error
distributions can be substantial as discussed in Chapters 1 and 2.
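The .955 figure is a quick calculation; the following verification is ours, though it agrees with (3.6.34) of Example 3.6.2 below. For Wilcoxon scores an integration by parts gives

τϕ^{-1} = √12 ∫_{−∞}^{∞} f²(x) dx ,

and at the N(0, σ²) density ∫ f² = (2σ√π)^{-1}; hence

τϕ = 2σ√π/√12 = σ√(π/3) , so e(Fϕ, FLS) = σ²/τϕ² = 3/π ≐ .955 .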
Many of the studies reviewed in the article by McKean and Sheather (1991)
included power comparisons of the rank-based analyses with the least squares
F -test, FLS . The empirical power of FLS at normal error distributions was
slightly better than the empirical power of Fϕ , under Wilcoxon scores. Under
error distributions with heavier tails than the normal distribution, the empir-
ical power of Fϕ was generally larger, often much larger, than the empirical
power of FLS . These studies provide empirical evidence that the good asymp-
totic efficiency properties of the rank-based analysis hold in the small sample
setting.
As discussed above, the noncentrality parameters of the test statistics Fϕ
and FLS differ in only the scale parameters. Hence, in practice, planning de-
signs based on the noncentrality parameter of Fϕ can proceed similar to the
planning of a design using the noncentrality parameter of FLS ; see, for exam-
ple, the discussion in Chapter 4 of Graybill (1976).
By Lemmas 3.6.1 and 3.6.2 the two differences on the right side converge to 0 in probability. After some algebra, we obtain

Q(β̃) = −(τϕ/2) { (n^{-1/2}S(e))′ (n^{-1}X′X)^{-1} (n^{-1/2}S(e)) } + D(e) .
By Theorem 3.5.2 the term in braces on the right side converges in distribution to a χ² random variable with p degrees of freedom. This implies that (D(e) − D(ê))/(τϕ/2) also converges in distribution to a χ² random variable with p degrees of freedom. Although this is a stronger result than we need, it does imply that n^{-1}(D(e) − D(ê)) converges to 0 in probability. Hence, n^{-1}D(ê) converges in probability to D̄e.
The natural analog to the least squares F-test statistic is

F*ϕ = (RD/q) / (σ̂D/2) , (3.6.29)

where σ̂D = D(ê)/(n − p − 1), rather than Fϕ. But we have

qF*ϕ = [ (τ̂ϕ/2) / (n^{-1}D(ê)/2) ] qFϕ →D κF χ²(q) , (3.6.30)

where κF is defined by

τ̂ϕ / (n^{-1}D(ê)) →P κF . (3.6.31)
Hence, to have a limiting χ2 -distribution for qFϕ∗ we need to have κF = 1.
Below we give several examples where this occurs. In the first example, the
form of the error distribution is known while in the second example the errors
are normally distributed; however, these cases rarely occur in practice.
There is an even more acute problem with using Fϕ∗ , though. In Sec-
tion A.5.2 of the Appendix, we show that the influence function of Fϕ∗ is
not bounded in the Y -space, while, as noted above, the influence function of
the statistic Fϕ is bounded in the Y -space provided the score function ϕ(u)
is bounded. Note, however, that the influence functions of D(b e) and Fϕ∗ are
linear rather than quadratic as is the influence function of FLS . Hence, they
are somewhat less sensitive to outliers in the Y -space than FLS ; see Hettman-
sperger and McKean (1978).
Example 3.6.1 (Form of Error Density Known). Assume that the errors have
density f (x) = σ −1 f0 (x/σ) where f0 is known. Our choice of scores would then
be the optimal scores given by
ϕ0(u) = − (1/√I(f0)) f′0(F0^{-1}(u)) / f0(F0^{-1}(u)) , (3.6.32)

where I(f0) denotes the Fisher information corresponding to f0. These scores yield an asymptotically efficient rank-based analysis. Exercise 3.15.20 shows that with these scores

τϕ = D̄e . (3.6.33)

Thus κF = 1 for this example and qF*ϕ0 has a limiting χ²(q)-distribution under H0.
Example 3.6.2 (Errors Are Normally Distributed). In this case the form of the error density is f0(x) = (√(2π))^{-1} exp{−x²/2}; i.e., the standard normal density. This is of course a subcase of the last example. The optimal scores in this case are the normal scores ϕ0(u) = Φ^{-1}(u), where Φ denotes the standard normal distribution function. Using these scores, the statistic qF*ϕ0 has a limiting χ²(q)-distribution under H0. Note here that the score function ϕ0(u) = Φ^{-1}(u) is unbounded; hence the above theory must be modified to obtain this result. Under further regularity conditions on the design matrix, Jurečková (1969) obtained asymptotic linearity for the unbounded score function case; see, also, Koul (1992, p. 51). Using these results, the limiting distribution of qF*ϕ0 can be obtained. The R estimates based on these scores, however, have an unbounded influence function; see Section 1.8.1. We next consider this analysis for Wilcoxon and sign scores.
If Wilcoxon scores are employed, then Exercise 3.15.21 shows that

τϕ = σ√(π/3) (3.6.34)

and

D̄e = σ√(3/π) . (3.6.35)

Thus, in this case, a consistent estimate of τϕ/2 is n^{-1}D(ê)(π/6).
For sign scores a similar computation yields

τS = σ√(π/2) (3.6.36)

and

D̄e = σ√(2/π) . (3.6.37)

Hence n^{-1}D(ê)(π/4) is a consistent estimate of τS/2.
Note that both examples are overly restrictive and again in all cases the resulting rank-based test of the general linear hypothesis H0 has an unbounded influence function, even in the case when the errors have a normal density and the analysis is based on Wilcoxon or sign scores. In general, then, we recommend using a bounded score function ϕ and the corresponding test statistic Fϕ, (3.2.18), which is highly efficient and whose influence function, (3.6.19), is bounded in the Y-space.
ϕ*(u) = ( ϕ(u) − ϕ(0) ) / ( ϕ(1) − ϕ(0) ) . (3.7.1)
γ* = −∫_{−∞}^{∞} ϕ*(F(x))f′(x) dx = ∫_{−∞}^{∞} ϕ*′(F(x))f²(x) dx = ∫_{−∞}^{∞} f(x) dϕ*(F(x)) ,

H(y) = P[|Z1 − Z2| ≤ y] = ∫_{−∞}^{∞} [ F(z2 + y) − F(z2 − y) ] dϕ*(F(z2)) for y > 0, and H(y) = 0 for y ≤ 0. (3.7.2)
Let h(y) denote the density of H(y). Upon differentiating under the integral
sign in expression (3.7.2) it easily follows that
h(0) = 2γ ∗ . (3.7.3)
H(y) = ∫₀¹ [ F(F^{-1}(t) + y) − F(F^{-1}(t) − y) ] dϕ*(t) . (3.7.4)
Next let F̂n denote the empirical distribution function of the R residuals and let F̂n^{-1}(t) = inf{x : F̂n(x) ≥ t} denote the usual inverse of F̂n. Let Ĥn denote the estimate of H which is obtained by replacing F by F̂n. Some simplification follows by noting that for t ∈ ((j − 1)/n, j/n], F̂n^{-1}(t) = ê(j). This leads to the
following form of Ĥn,

Ĥn(y) = ∫₀¹ [ F̂n(F̂n^{-1}(t) + y) − F̂n(F̂n^{-1}(t) − y) ] dϕ*(t)
      = Σ_{j=1}^n ∫_{((j−1)/n, j/n]} [ F̂n(F̂n^{-1}(t) + y) − F̂n(F̂n^{-1}(t) − y) ] dϕ*(t)
      = Σ_{j=1}^n [ F̂n(ê(j) + y) − F̂n(ê(j) − y) ] [ ϕ*(j/n) − ϕ*((j − 1)/n) ]
      = (1/n) Σ_{i=1}^n Σ_{j=1}^n [ ϕ*(j/n) − ϕ*((j − 1)/n) ] I(|ê(i) − ê(j)| ≤ y) . (3.7.5)
Theorem 3.7.1. Under (E.1),(D.1), (S.1), and (S.2) of Section 3.4, and for
any 0 < δ < 1,
sup_{ϕ∈C} |γ̂n,δ − γ| →P 0 ,
ϕ∈C
where C denotes the class of all bounded, right continuous, nondecreasing score
functions defined on the interval (0, 1).
The proof can be found in Koul et al. (1987). It follows immediately that
τbϕ = 1/b
γn,δ is a consistent estimate of τϕ . Note that the uniformity condition
on the scores in the theorem is more than we need here. This result, though,
proves useful in adaptive procedures which estimate the score function; see
McKean and Sievers (1989).
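As a rough numerical illustration only, the following sketch computes Ĥn of (3.7.5) for Wilcoxon scores (for which ϕ*(u) = u, so the weights are all 1/n) and the density-type quotient Ĥn(t)/(2t), which by (3.7.3) approximates γ for a small window t; the window choice and the finite-sample refinements built into γ̂n,δ are omitted here.

    def H_n(y, resid):
        # Empirical H-hat of (3.7.5); with Wilcoxon scores phi*(u) = u,
        # so phi*(j/n) - phi*((j-1)/n) = 1/n for every j
        n = len(resid)
        e = np.sort(resid)
        wt = np.full(n, 1.0 / n)
        diffs = np.abs(e[:, None] - e[None, :])      # |e_(i) - e_(j)|
        return float(np.sum((diffs <= y) * wt[None, :]) / n)

    def tau_hat_window(resid, t):
        # H(t) ~ h(0) t = 2 gamma t for small t, by (3.7.3); tau = 1/gamma
        gamma = H_n(t, resid) / (2.0 * t)
        return 1.0 / gamma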
Since the scores are differentiable, an approximation of Ĥn is obtained by an application of the mean value theorem to (3.7.5), which results in

Ĥ*n(y) = (cn n)^{-1} Σ_{i=1}^n Σ_{j=1}^n ϕ*′(j/(n + 1)) I(|ê(i) − ê(j)| ≤ y) , (3.7.7)

where cn = Σ_{j=1}^n ϕ*′(j/(n + 1)) is such that Ĥ*n is a distribution function.
Note that this is similar to the least squares correction on the maximum
likelihood estimate (under normality) of the variance.
Q(β) = (2τ̂ϕ^{(0)})^{-1} (β − β̂^{(0)})′ X′X (β − β̂^{(0)}) − (β − β̂^{(0)})′ S(Y − Xβ̂^{(0)}) + D(Y − Xβ̂^{(0)}) .

This quadratic is minimized by

β̂^{(1)} = β̂^{(0)} + τ̂ϕ^{(0)} (X′X)^{-1} S(Y − Xβ̂^{(0)}) . (3.7.9)
This is the first Newton step. In the same way that the first step was defined in terms of the initial estimate, so can a second step be defined in terms of the first step. We call these iterated estimates β̂^{(k)} or k-step estimates. In practice, though, we would want to know if D(β̂^{(1)}) is less than D(β̂^{(0)}) before proceeding. A more formal algorithm is presented below.
These k-step estimates satisfy some interesting properties themselves which
we briefly discuss; details can be found in McKean and Hettmansperger (1978).
Provided the initial estimate is such that √n(β̂^{(0)} − β) is bounded in probability, then for any k ≥ 1 we have

√n( β̂^{(k)} − β̂ϕ ) →P 0 ,

where β̂ϕ denotes a minimizing value of D. Hence the k-step estimates have the same asymptotic distribution as β̂ϕ. Furthermore, τ̂ϕ^{(k)} is a consistent estimate of τϕ, if it is any of the scale estimates discussed in Section 3.7.1 based on k-step residuals. Let Fϕ^{(k)} denote the R test of a general linear hypothesis based on reduced and full model k-step estimates. Then it can be shown that Fϕ^{(k)} satisfies the same asymptotic properties as the test statistic Fϕ under the null hypothesis and contiguous alternatives. Also it is consistent for any alternative HA.
Formal Algorithm

In order to outline the algorithm used by RGLM, first consider the QR-decomposition of X, which is given by

Q′X = R , (3.7.10)
ê^{(k)} = ê^{(k−1)} − τ̂ϕ H a(R(ê^{(k−1)})) , (3.7.11)

where a(R(ê^{(k−1)})) denotes the vector whose ith component is a(R(ê^{(k−1)}_i)). Let D^{(k)} denote the dispersion function evaluated at ê^{(k)}. The Newton step is a
step from ê^{(k−1)} along the direction τ̂ϕ H a(R(ê^{(k−1)})). If D^{(k)} < D^{(k−1)} the step has been successful; otherwise, a linear search can be made along the direction to find a value which minimizes D. This would then become the kth step residual. Such a search can be performed using methods such as false position, as discussed below in Section 3.7.3. Stopping rules can be based on the relative drop in dispersion; i.e., stop when

( D^{(k−1)} − D^{(k)} ) / D^{(k−1)} < ǫD , (3.7.12)

where ǫD is a specified tolerance. A similar stopping rule can be based on the relative size of the step. Upon stopping at step k, obtain the fitted value Ŷ = Y − ê^{(k)} and then the estimate of β by solving Xβ = Ŷ.
A formal algorithm is: Let ǫD and ǫs be the given stopping tolerances.
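A minimal sketch of such an iteration follows (Wilcoxon scores, the dispersion and score helpers sketched earlier, a fixed τ̂ϕ, and crude step-halving in place of the false position search of Section 3.7.3; all are simplifications of what an implementation such as RGLM actually does):

    def k_step_fit(X, Y, beta0, tau_hat, eps_D=1e-4, max_steps=20):
        beta = np.asarray(beta0, dtype=float).copy()
        D_old = dispersion(beta, X, Y)
        XtX_inv = np.linalg.inv(X.T @ X)
        a = wilcoxon_scores(len(Y))
        for _ in range(max_steps):
            e = Y - X @ beta
            ranks = np.argsort(np.argsort(e)) + 1
            step = tau_hat * (XtX_inv @ (X.T @ a[ranks - 1]))  # Newton step (3.7.9)
            lam = 1.0
            while dispersion(beta + lam * step, X, Y) >= D_old and lam > 1e-4:
                lam /= 2.0                                     # fall back along the step
            beta = beta + lam * step
            D_new = dispersion(beta, X, Y)
            if (D_old - D_new) / D_old < eps_D:                # stopping rule (3.7.12)
                break
            D_old = D_new
        return beta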
S(b) = K , (3.7.13)

1. Bracket Step. Beginning with an initial estimate b^{(0)}, step along the b-axis to b^{(1)} where the interval (b^{(0)}, b^{(1)}), or vice-versa, brackets the solution.
Asymptotic linearity can be used here to make these steps; for instance, if ζ^{(0)} is an estimate of ζ based on b^{(0)}, then the first step is

2. Regula-Falsi. Assume the interval (b^{(0)}, b^{(1)}) brackets the solution and that b^{(1)} is the more recent value of b^{(0)}, b^{(1)}. If |b^{(1)} − b^{(0)}| < ǫ then stop. Else, the next step is where the secant line determined by b^{(0)}, b^{(1)} intersects the b-axis; i.e.,

b^{(2)} = b^{(0)} − S(b^{(0)}) (b^{(1)} − b^{(0)}) / ( S(b^{(1)}) − S(b^{(0)}) ) . (3.7.15)

(a) If (b^{(0)}, b^{(2)}) brackets the solution, then replace b^{(1)} by b^{(2)} and go to (2), but use S(b^{(0)})/2 in place of S(b^{(0)}) in the determination of the secant line (this is the Illinois modification).

(b) If (b^{(2)}, b^{(1)}) brackets the solution, then replace b^{(0)} by b^{(2)} and go to (2).
The above algorithm is easy to implement. Such searches are used in the
package RGLM; see Kapenga, McKean, and Vidmar (1988).
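In code, steps (1)-(2) might be realized as follows for a generic monotone step function S and target value K; the bracketing phase and the tolerance handling are our own simplifications and not the RGLM implementation.

    def illinois_solve(S, K, b0, b1, eps=1e-6, max_iter=100):
        # Regula falsi with the Illinois modification for S(b) = K
        g0, g1 = S(b0) - K, S(b1) - K
        assert g0 * g1 <= 0.0, "(b0, b1) must bracket the solution"
        for _ in range(max_iter):
            if abs(b1 - b0) < eps:
                break
            b2 = b0 - g0 * (b1 - b0) / (g1 - g0)   # secant step, cf. (3.7.15)
            g2 = S(b2) - K
            if g0 * g2 <= 0.0:
                b1, g1 = b2, g2                    # (b0, b2) brackets the solution
                g0 /= 2.0                          # Illinois modification
            else:
                b0, g0 = b2, g2                    # (b2, b1) brackets the solution
        return 0.5 * (b0 + b1)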
3.8 L1 Analysis
This section is devoted to L1 procedures. These are widely used procedures;
see, for example, Bloomfield and Steiger (1983). We first show that they are
equivalent to R estimates based on the sign score function under Model (3.2.4).
Hence the asymptotic theory for L1 estimation and subsequent analysis is
contained in Section 3.5. The asymptotic theory for L1 estimation can also
be found in Bassett and Koenker (1978) and Rao (1988) from an L1 point of
view.
Consider the sign scores; i.e., the scores generated by ϕ(u) = sgn(u − 1/2). In this section we denote the associated pseudo-norm by

‖v‖S = Σ_{i=1}^n sgn( R(vi) − (n + 1)/2 ) vi ,  v ∈ R^n ;
see, also, Section 2.6.1. This score function is optimal if the errors follow a
double exponential (Laplace) distribution; see Exercise 2.13.19 of Chapter 2.
We summarize the analysis based on the sign scores, but first we show that
indeed the R estimates based on sign scores are also L1 estimates, provided
that the intercept is estimated by the median of residuals.
Consider the intercept model, (3.2.4), as given in Section 3.2 and let Ω de-
note the column space of X and Ω1 denote the column space of the augmented
matrix X1 = [1 X].
First consider the R estimate of η ∈ Ω based on the L1 pseudo-norm. This is a vector ŶS ∈ Ω such that ŶS = Argmin_{η∈Ω} ‖Y − η‖S. Next consider the L1 estimate for the space Ω1; i.e., the L1 estimate of α1 + η. This is a vector ŶL1 ∈ Ω1 such that ŶL1 = Argmin_{θ∈Ω1} Σ_{i=1}^n |Yi − θi|.
the normal distribution is .63, and it can be much more efficient than LS for
heavier tailed error distributions.
As Exercise 3.15.22 shows, the drop in dispersion test based on sign scores,
FS , is, except for the scale parameter, the likelihood ratio test of the general
linear hypothesis (3.2.5), provided the errors have a double exponential dis-
tribution. For other error distributions, the same comments about efficiency
of the L1 estimates can be made about the test FS .
In terms of implementation, Schrader and McKean (1987) found it more
difficult to standardize the L1 statistics than other R procedures, such as
the Wilcoxon. Their most successful standardization of FS was based on the
following bootstrap procedure:

1. Fit the full model; obtain the test statistic FS and the residuals ê1, . . . , ên.

2. Select ẽ1, . . . , ẽñ, the ñ = n − (p + 1) nonzero residuals.

3. Draw a sample of size n, with replacement, from ẽ1, . . . , ẽñ; treating it as a sample from the null model, compute the test statistic, say F*S.

4. Repeat step 3 B times, and let p* denote the proportion of the B values of F*S which equal or exceed the observed FS.

5. Reject H0 at level α if p* ≤ α.
Notice that by using full model residuals, the algorithm estimates the null
distribution of FS . The algorithm depends on the number B of bootstrap
samples taken. We suggest at least 2000.
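One plausible implementation of the resampling scheme (F_S here denotes a user-supplied function that refits both models to a set of responses and returns the sign-scores drop-in-dispersion statistic; it is our illustration, not the text's algorithm):

    def bootstrap_p_value(F_S, F_obs, residuals, n, rng, B=2000):
        # Estimate the null distribution of F_S from the full model residuals
        nonzero = residuals[residuals != 0.0]     # the n - (p+1) nonzero residuals
        exceed = 0
        for _ in range(B):
            e_star = rng.choice(nonzero, size=n, replace=True)  # null model sample
            if F_S(e_star) >= F_obs:
                exceed += 1
        return exceed / B                          # reject H0 at level alpha if <= alpha

    # Example: p_star = bootstrap_p_value(F_S, F_obs, e_hat, n, np.random.default_rng(1))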
3.9 Diagnostics
An important part in the analysis of a linear model is the examination of
the resulting fit. Tools for doing this include residual plots and diagnostic
techniques. Over the last fifteen years or so, these tools have been developed
for fits based on least squares; see, for example, Cook and Weisberg (1982) and
Belsley, Kuh, and Welsch (1980). Least squares residual plots can be used to
detect such things as curvature not accounted for by the fitted model; see Cook
and Weisberg (1989) for a recent discussion. Further diagnostic techniques can
be used to detect outliers, which are points that differ greatly from the pattern set
by the bulk of the data and to measure the influence of individual cases on
the least squares fit. See McKean and Sheather (2009) for a recent review of
diagnostic procedures.
In this section we explore the properties of the residuals from the rank-
based fits, showing how they can be used to determine model misspecification.
We present diagnostic techniques for rank-based residuals that detect outlying
and influential cases. Together these tools offer the user a residual analysis for
the rank-based method for the fit of a linear model similar to the residual
analysis based on least squares estimates.
In this section we consider the same linear model, (3.2.3), as in Section 3.2. For a given score function ϕ, let β̂ϕ and êR denote the R estimate of β and the residuals from the R fit of the model based on these scores. Much of
the discussion is taken from the articles by McKean, Sheather, and Hettman-
sperger (1990, 1991, 1993). Also, see Dixon and McKean (1996) for a robust
rank-based approach to modeling heteroscedasticity.
Y = 1α + Xβ + Zγ + e , (3.9.1)

where Z is an n × q centered matrix of constants and γ = θ/√n, for θ ≠ 0.
Note that this sequence of models is contiguous to Model (3.2.3). Suppose
we fit Model (3.2.3), i.e. Y = 1α + Xβ + e, when model (3.9.1) is the true
model. Hence the model has been misspecified. As a first step in examining
the residuals in this situation, we consider the limiting distribution of the
corresponding R estimate.
Theorem 3.9.1. Assume Model (3.9.1) is the true model. Let β̂ϕ be the R estimate for Model (3.2.3). Suppose that conditions (E.1) and (S.1) of Section 3.4 are true and that conditions (D.1) and (D.2) are true for the augmented matrix [X Z]. Then

β̂ϕ has an approximate Np( β + (X′X)^{-1}X′Zθ/√n, τϕ²(X′X)^{-1} ) distribution. (3.9.2)
Proof: Without loss of generality assume that β = 0. Note that the situation
here is the same as the situation in Theorem 3.6.3; except now the null hy-
pothesis corresponds to γ = 0 and β b is the reduced model estimate. Thus we
ϕ
seek the asymptotic distribution of the reduced model estimate. As in Section 3.5.1 it is easier to consider the corresponding pseudo-estimate β̃, which is the reduced model estimate which minimizes the quadratic Q(Y − Xβ), (3.5.11).
Under the null hypothesis, γ = 0, √n(β̂ϕ − β̃) →P 0; hence, by contiguity, √n(β̂ϕ − β̃) →P 0 under the sequence of Models (3.9.1). Thus β̂ϕ and β̃ have the same asymptotic distributions under (3.9.1); hence, it suffices to find the distribution of β̃. But by (3.5.13),

β̃ = τϕ(X′X)^{-1}S(Y) , (3.9.3)

where S(Y) is the first p components of the vector T(Y) = [X Z]′a(R(Y)). By (3.6.26) of Theorem 3.6.3,

n^{-1/2}T(Y) →D N_{p+q}( τϕ^{-1}Σ*(0′, θ′)′, Σ* ) . (3.9.4)
Proof: Without loss of generality assume that the regression coefficients are 0. By (A.3.10) and expression (3.6.25) of Theorem 3.6.3 we can write

n^{-1/2}T(Y) = n^{-1/2} [ X′ϕ(F(e)) ; Z′ϕ(F(e)) ] + τϕ^{-1} n^{-1} [ X′Zθ ; Z′Zθ ] + op(1) ;

hence,

n^{-1/2}S(Y) = n^{-1/2}X′ϕ(F(e)) + τϕ^{-1} n^{-1} X′Zθ + op(1) .

By expression (3.9.3) and the fact that √n(β̂ − β̃) →P 0, the result follows.
From this corollary we obtain the following first order expressions of the R residuals and R fitted values:

ŶR ≐ α1 + Xβ + τϕ Hϕ(F(e)) + HZγ (3.9.6)
êR ≐ e − τϕ Hϕ(F(e)) + (I − H)Zγ , (3.9.7)

where H = X(X′X)^{-1}X′. In Exercise 3.15.23 the reader is asked to show that the least squares fitted values and residuals satisfy

ŶLS = α1 + Xβ + He + HZγ (3.9.8)
êLS = e − He + (I − H)Zγ . (3.9.9)
In terms of model misspecification the coefficients of interest are the regres-
sion coefficients. Hence, at this time we need not consider the effect of the
estimation of the intercept. This avoids the problem of which estimate of the
intercept to use. In practice, though, for both R and LS fits, the intercept
is also fitted and, subsequently, its effect is removed from the residuals. We
also include the effect of estimation of the intercept in our discussion of the
standardization of residuals and fitted values in Sections 3.9.2 and 3.9.3, re-
spectively.
Suppose that the linear model (3.2.3) is correct. Based on its first order
expression when γ = 0, b eR is a function of the random errors similar to b eLS ;
hence, it follows that a plot of b eR versus Y b R should generally be a random
scatter, similar to the least squares residual plot.
In the case of model misspecification, note that the R residuals and least
squares residuals have the same asymptotic bias, namely (I − H)Zγ. Hence R
residual plots, similar to those of least squares, are useful in identifying model
misspecification.
For least squares residual plots, since least squares residuals and the fitted
values are uncorrelated, any pattern in this plot is due to model misspecifica-
tion and not the fitting procedure used. The converse, however, is not true.
As the example on the potency of drug compounds below illustrates, the least
squares residual plot can exhibit a random scatter for a poorly fitted model.
This orthogonality in the LS residual plot does, however, make it easier to
pick out patterns in the plot. Of course the R residuals are not orthogonal to
the R fitted values, but they are usually close to orthogonality; see Naranjo
et al. (1994). We introduce the following parameter ν to measure the extent
of departure from orthogonality.
Denote general fitted values and residuals by Ŷ and ê respectively. The expected departure from orthogonality is the parameter ν defined by

ν = E[ ê′Ŷ ] . (3.9.10)
For least squares, νLS is of course 0. For R fits, we have the following first
order expression for it:
Theorem 3.9.2. Under the assumptions of Theorem 3.9.1 and either Model (3.2.3) or Model (3.9.1),

νR ≐ pτϕ( E[ϕ(F(e1))e1] − τϕ ) . (3.9.11)
Proof: Suppose Model (3.9.1) holds. Using the above first order expressions we have

νR ≐ E[ ( e + α1 − τϕHϕ(F(e)) + (I − H)Zγ )′ ( Xβ + τϕHϕ(F(e)) + HZγ ) ] .

Using E[ϕ(F(e))] = 0, E[e] = E(e1)1, and the fact that X is centered, this expression simplifies to

νR ≐ τϕ E[ tr Hϕ(F(e))e′ ] − τϕ² E[ tr Hϕ(F(e))ϕ(F(e))′ ] .

Since the components of e are independent, the result follows. The result is invariant to either of the models.
Although in general, νR 6= 0 for R estimates, if, as the next corollary shows,
optimal scores (see Examples 3.6.1 and 3.6.2) are used the expected departure
from orthogonality is 0.
Corollary 3.9.2. Under the hypothesis of the last theorem, if optimal R scores are used then νR = 0.

Proof: Let ϕ(u) = −c f′(F^{-1}(u))/f(F^{-1}(u)), where c is chosen so that ∫ϕ²(u)du = 1. Then

τϕ = [ ∫ ϕ(u) ( −f′(F^{-1}(u))/f(F^{-1}(u)) ) du ]^{-1} = c .

Some simplification and an integration by parts shows

∫ ϕ(F(e)) e dF(e) = c ∫ f(e) de = c .

Thus E[ϕ(F(e1))e1] − τϕ = c − c = 0 and, hence, νR = 0.
model would be more appropriate. Panel B of Figure 3.9.1 displays the residual
plot from the R fit of a quadratic model. Some curvature is still present in the
plot. A cubic polynomial was fitted next. Its R residual plot, found in Panel C
of Figure 3.9.1, is much more of a random scatter than the first two plots. On
the basis of residual plots the cubic polynomial is an adequate model. Least
squares residual plots would also lead to a third degree polynomial.
Figure 3.9.1: Panel A through C are the residual plots of the Wilcoxon fits of
the linear, quadratic, and cubic models, respectively, for the cloud data. Panel
D is the q−q plot based on the Wilcoxon fit of the cubic model.
Table 3.9.2: Wilcoxon (W) and LS Estimates (LS) of the Regression Coefficients for Cloud Data. (Standard errors are in parentheses.)

Method   Intercept     Linear       Quadratic    Cubic        Scale
W        22.35 (.18)   2.24 (.17)   -.23 (.04)   .01 (.003)   τ̂ϕ = .307
LS       22.31 (.15)   2.22 (.15)   -.22 (.04)   .01 (.003)   σ̂ = .281
these residual plots as discussed in the next section. Table 3.9.2 displays the
estimated coefficients along with their standard errors. The Wilcoxon and least
squares fits are practically the same.
Example 3.9.2 (Potency Data, Example 3.3.3 continued). This example was
discussed in Section 3.3. Recall that the data were the result of an experiment
concerning the potency of drug compounds manufactured under different levels
of four factors and one covariate. Here we want to discuss a residual analysis
of the rank-based fits of the two models that were fit in Example 3.3.3.
First consider Model (3.3.1) without the quadratic terms, i.e., without the
parameters β11 , β12 , and β13 . The residuals used are the internal R Studentized
residuals defined in the next section; see (3.9.31). They provide a convenient
scale for detecting outliers. The curvature in the Wilcoxon-residual plot of
this model, Panel A of Figure 3.9.2, is quite apparent, indicating the need for quadratic terms in the model; whereas the LS residual plot, Panel C of Figure 3.9.2, does not exhibit this quadratic effect. As the R residual plot indicates, there are outliers in the data and these had an effect on the LS fit. Panels B and D display the residual plots when the squared terms of the factors are added to the model; i.e., Model (3.3.1) was fit. This R residual plot no longer exhibits the quadratic effect, indicating a better fitting model. Also, by examining the R plots for both models, it is seen that the outlyingness of some of the outliers indicated in the plot for the first model was accounted for by the larger model.
Figure 3.9.2: Panels A and B are the Wilcoxon internal Studentized residuals
plots for models without and with, respectively, the three quadratic terms
β11 , β12 and β13 . Panels C and D are the analogous plots for the LS fit.
Theorem 3.9.3. Under the conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4, if the intercept estimate is α̂S, then a first order representation of the variance of êR,i is

   Var(êR,i) ≐ σ²(1 − K₁n⁻¹ − K₂hc,i) ,   (3.9.13)
Let J = 11′/n denote the projection onto the space spanned by 1. Since our design matrix is [1 X], the leverage of the ith case is hᵢ = n⁻¹ + hc,i, where hc,i is the ith diagonal entry of the projection matrix H. By expanding the above expression and using the independence of the components of e, we get after some simplification (see Exercise 3.15.25):

   Cov(êR) ≐ σ²{I − K₁J − K₂H} ,   (3.9.17)
where

   K₁ = (τS²/σ²)(2δS/τS − 1) ,   (3.9.18)

   K₂ = (τϕ²/σ²)(2δ/τϕ − 1) ,   (3.9.19)

   δS = E[eᵢ sgn(eᵢ)] ,   (3.9.20)

   δ = E[eᵢ ϕ(F(eᵢ))] ,   (3.9.21)

   σ² = Var(eᵢ) = E((eᵢ − E(eᵢ))²) .   (3.9.22)
This yields the first result, (3.9.13). Next consider the case of a symmetric error distribution. If the estimate of the intercept is given by α̂ϕ⁺, discussed in Section 3.5.2, the result simplifies to (3.9.14).

From Cook and Weisberg (1982, p. 11), in the least squares case Var(êLS,i) = σ²(1 − hᵢ), so that K₁ and K₂ are correction factors due to using the rank score function.
Based on the results in the theorem, an estimate of the variance-covariance matrix of êR is

   S̃ = σ̂²{I − K̂₁J − K̂₂Hc} ,   (3.9.23)

where

   K̂₁ = (τ̂S²/σ̂²)(2δ̂S/τ̂S − 1) ,   (3.9.24)

   K̂₂ = (τ̂ϕ²/σ̂²)(2δ̂/τ̂ϕ − 1) ,   (3.9.25)

   δ̂S = (n − p)⁻¹ Σᵢ |êR,i| ,   (3.9.26)

and

   δ̂ = (n − p)⁻¹ D(β̂ϕ) .   (3.9.27)

The estimators τ̂S and τ̂ϕ are discussed in Section 3.7.1.

To complete the estimate of Cov(êR) we need to estimate σ. A robust estimate of it is given by the MAD, σ̂ = 1.483 medᵢ|êR,i − medⱼ{êR,j}|. The ith diagonal entry of S̃ then estimates the variance of the ith residual:

   s̃²R,i = σ̂²(1 − K̂₁n⁻¹ − K̂₂hc,i) ,   (3.9.28)
n
i i
i i
i i
and, analogously, the estimate of the variance of the ith LS residual is

   s̃²LS,i = σ̂²LS(1 − hᵢ) ;   (3.9.29)

see page 11 of Cook and Weisberg (1982), and recall that hᵢ = n⁻¹ + xᵢ′(X′X)⁻¹xᵢ. If the error distribution is symmetric, (3.9.28) reduces to

   s̃²R,i = σ̂²(1 − K̂₂hᵢ) .   (3.9.30)

We define the internal R Studentized residuals as

   rR,i = êR,i / s̃R,i ,   (3.9.31)

where s̃R,i is the square root of either (3.9.28) or (3.9.30), depending on whether one assumes an asymmetric or symmetric error distribution, respectively.
It is interesting to compare expression (3.9.30) with the estimate of the variance of the least squares residual, σ̂²LS(1 − hᵢ). The correction factor K̂₂ depends on the score function ϕ(·) and the underlying symmetric error distribution. If, for example, the error distribution is normal and if we use normal scores, then K̂₂ converges in probability to 1; see Exercise 3.15.26. In general, however, we do not wish to specify the error distribution, and then K̂₂ provides a natural adjustment.
A simple benchmark is useful in declaring whether or not a case is an out-
lier. We are certainly not advocating eliminating such cases but flagging them
as potential outliers and targeting them for further study. As we discussed in
the last section, the distribution of the R residuals should resemble the true
distribution of the errors. Hence a single rule for all cases is not apparent. In general, unless the residuals appear to be from a highly skewed distribution, a simple rule is to declare a case to be a potential outlier if its residual exceeds two standard errors in absolute value; i.e., |rR,i| > 2.
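To make the pieces above concrete, here is a minimal R sketch that assembles the correction factors and applies the two-standard-error rule. It is an illustration under stated assumptions, not a routine from the text: ehat holds the R residuals, disp the minimized dispersion D(β̂ϕ), hc the diagonal of the centered projection matrix, p the number of regression parameters, and tauS, tauphi the scale estimates of Section 3.7.1.

n <- length(ehat)
sigmahat <- 1.483 * median(abs(ehat - median(ehat)))       # MAD estimate of sigma
deltaS <- sum(abs(ehat)) / (n - p)                         # (3.9.26)
delta <- disp / (n - p)                                    # (3.9.27)
K1 <- (tauS^2 / sigmahat^2) * (2 * deltaS / tauS - 1)      # (3.9.24)
K2 <- (tauphi^2 / sigmahat^2) * (2 * delta / tauphi - 1)   # (3.9.25)
s2 <- sigmahat^2 * (1 - K1 / n - K2 * hc)                  # (3.9.28)
rstud <- ehat / sqrt(s2)                                   # (3.9.31)
which(abs(rstud) > 2)                                      # flag potential outliers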
The matrix S̃, (3.9.23), is an estimate of a first order approximation of Cov(êR). It is not necessarily positive semi-definite and we have not constrained it to be so. In practice this has not proved troublesome, since only occasionally
Figure 3.9.3: Internal Wilcoxon Studentized residual plot, Panel A, and cor-
responding normal q−q plot, Panel B, for the Cloud Data.
   Y = X₁b + θᵢdᵢ + e ,   (3.9.32)

   H₀: θᵢ = 0 versus H_A: θᵢ ≠ 0 .   (3.9.33)
One way of testing these hypotheses is to use the test procedures described in
Section 3.6. This requires fitting Model (3.9.32) for each value of i. A second
approach is described next.
Note that we can rewrite Model (3.9.32) equivalently as

   Y = X₁b* + θᵢd*ᵢ + e ,   (3.9.34)

where d*ᵢ = (I − H₁)dᵢ, H₁ is the projection matrix onto the column space of X₁, and b* = b + H₁dᵢθᵢ; see Exercise 3.15.27. Because of the orthogonality between X₁ and d*ᵢ, the least squares estimate of θᵢ can be obtained by a simple linear regression of Y on d*ᵢ, or equivalently of êLS on d*ᵢ. For the rank-based estimate, the asymptotic distribution theory of the regression estimates suggests a similar approach. Accordingly, let θ̂R,i denote the R estimate when êR is regressed on d*ᵢ. This is a simple regression and the estimate can be
obtained by a linear search algorithm; see Section 3.7.2. As Exercise 3.15.29
shows, this estimate is the inversion of an aligned rank statistic to test the
hypotheses (3.9.33). Next let τ̂ϕ,i denote the estimate of τϕ produced from this regression. We define the external R Studentized residual to be the statistic

   tR(i) = θ̂R,i / ( τ̂ϕ,i / √(1 − h₁,i) ) ,   (3.9.35)
where h₁,i is the ith diagonal entry of H₁. Note that we have standardized θ̂R,i by its asymptotic standard error.
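For illustration, the following R sketch carries out these steps for a single case i; it is our own minimal version, not the book's routine. It assumes X1 = [1 X], the R residuals ehat, the case index i, and a hypothetical helper tauhat() that estimates τϕ from a vector of residuals. The Wilcoxon slope of a simple regression through the origin is computed as the weighted median of the pairwise slopes, which minimizes the pairwise dispersion.

n <- nrow(X1)
H1 <- X1 %*% solve(crossprod(X1), t(X1))   # projection onto the column space of X1
dstar <- (diag(n) - H1)[, i]               # d*_i = (I - H1) d_i
num <- outer(ehat, ehat, "-"); den <- outer(dstar, dstar, "-")
ok <- upper.tri(den) & den != 0
slopes <- num[ok] / den[ok]; wts <- abs(den[ok])
o <- order(slopes)
thetahat <- slopes[o][which(cumsum(wts[o]) >= sum(wts) / 2)[1]]  # weighted median
tR <- thetahat / (tauhat(ehat - thetahat * dstar) / sqrt(1 - H1[i, i]))  # (3.9.35)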
A final remark on these external t-statistics is in order. In the mean shift
model, (3.9.32), the leverage value of the ith case is 1. Hence, the design
assumption (D.2), (3.4.7), is not true. This invalidates both the LS and rank-
based asymptotic theory for the external t-statistics. In light of this, we do
not propose the statistic tR (i) as a test statistic for the hypotheses (3.9.33)
but as a diagnostic for flagging potential outliers. As a benchmark, we suggest
the value 2.
to a fitted value scale with estimates of scale based on the delete-i model is given by

   RDFFITSᵢ = RDFFITᵢ / √( n⁻¹τ̂S²(i) + hc,i τ̂ϕ²(i) ) .   (3.9.39)

This is an R analogue of the least squares diagnostic DFFITSᵢ proposed by Belsley et al. (1980). For standardization based on the original model, replace τ̂S(i) and τ̂ϕ(i) by τ̂S and τ̂ϕ, respectively. We define

   RDCOOKᵢ = RDFFITᵢ / √( n⁻¹τ̂S² + hc,i τ̂ϕ² ) .   (3.9.40)
If α̂R⁺ is used as the estimate of the intercept then, provided the errors have a symmetric distribution, the R diagnostics are obtained by replacing the estimate of Var(ŶR,i) with hᵢτ̂ϕ²; see Exercise 3.15.30 for details. This results in the diagnostics

   RDFFITS_symm,i = RDFFITᵢ / (√hᵢ τ̂ϕ(i))   (3.9.41)

and

   RDCOOK_symm,i = RDFFITᵢ / (√hᵢ τ̂ϕ) .   (3.9.42)

This eliminates the need to estimate τS.
There is also disagreement on what benchmarks to use for flagging points of potential influence. As Belsley et al. (1980) discuss in some detail, DFFITS is inversely influenced by sample size. They advocate a size-adjusted benchmark of 2√(p/n) for DFFITS. Cook and Weisberg (1982) suggest a more conservative value, which results in √p. We use both benchmarks in the examples. We realize these diagnostics only flag potentially influential points that require investigation. Similar to the two references cited above, we would never recommend indiscriminately deleting observations solely because their diagnostic values exceed the benchmark. Rather, these are potential points of influence which should be investigated.
The diagnostics described above are formed with the leverage values based
on the projection matrix. These leverage values are nonrobust (see Rousseeuw
and van Zomeren, 1990). For data sets with clusters of outliers in factor space
robust leverage values can be formulated in terms of high breakdown estimates
of the center and scatter matrix in factor space. One such choice would be the
MVE, minimum volume ellipsoid, proposed by Rousseeuw and van Zomeren
(1990). Other estimates could be based on the robust singular value decompo-
sition discussed by Ammann (1993). See, also, Simpson, Ruppert, and Carroll
(1992). We recommend computing ŶR(i) with a one or two step R estimate
based on the residuals from the original model; see Section 3.7.2. Each step
involves a single ordering of the residuals which are nearly in order (in fact
on the first step they are in order) and a single projection onto the range of
X (easily obtained by using the routines in LINPACK as discussed in Section
3.7.2).
The diagnostic RDFFITSᵢ measures the change in the fitted values when the ith case is deleted. Similarly, we can also measure changes in the estimates of the regression coefficients. For the LS analysis, this is the diagnostic DFBETAS proposed by Belsley, Kuh, and Welsch (1980). The corresponding diagnostics for the rank-based analysis are:

where β̂ϕ(i) denotes the R estimate of β in the delete-i model. A similar statistic can be constructed for the intercept parameter. Furthermore, a DCOOK version can also be constructed as above. These diagnostics are often used when |RDFFITSᵢ| is large. In such cases, it may be of interest to know which components of the regression coefficients are more influential than other components. The benchmark suggested by Belsley, Kuh, and Welsch (1980) is 2/√n.
Example 3.9.4 (Free Fatty Acid (FFA) Data). The data for this example
can be found in Morrison (1983, p. 64) and for convenience we have placed it
(Free Fatty Acid Data) at the url cited in the Preface. The response is the level
of free fatty acid of prepubescent boys while the independent variables are age,
weight, and skin fold thickness. The sample size is 41. Panel A of Figure 3.9.4
depicts the residual plot based on the least squares internal t-residuals. From
this plot there appear to be several outliers. Certainly the cases 12, 22, 26,
and 9 are outlying and perhaps the cases 8, 10, and 38. In fact, the first four
of these cases probably control the least squares fit, obscuring cases 8, 10, and
38.
As our first R fit of this data, we used the Wilcoxon scores with the intercept estimated by the median of the residuals, α̂S. Note that all seven cases
stand out in the Wilcoxon residual plot based on the internal R Studentized
residuals, (3.9.31); see Panel B of Figure 3.9.4. This is further confirmed by the
fits displayed in Table 3.9.3, where the LS fit with these seven cases deleted
is very similar to the Wilcoxon fit using all the cases. The q−q plot of the internal R Studentized residuals, Panel C of Figure 3.9.4, also highlights these outlying cases. Similar to the residual plot, the q−q plot suggests that the underlying error distribution is positively skewed with a light left tail. The
estimates of the regression coefficients and their standard errors are displayed
in Table 3.9.3. Due to the skewness in the data, it is not surprising that the LS
and R estimates of the intercept are different, since the former estimates the mean of the residuals while the latter estimates the median of the residuals.
Table 3.9.4 displays the values of the R and LS diagnostics for the cases of
interest. For the seven cases cited above, the internal Wilcoxon Studentized
residuals, (3.9.31), definitely flag three of the cases and for two of the others it
exceeds 1.70; see Panel B of Figure 3.9.4. As RDFFITS, (3.9.39), indicates, none of these seven cases seem to have an effect on the Wilcoxon fit (the liberal benchmark is .62), whereas the 12th case appears to have an effect on the least squares fit. RDFFITS exceeded the benchmark only for case 2, for which it had the value −.64. Case 36 with h₃₆ = .53 has high leverage but it
did not have an adverse effect on either the Wilcoxon fit or the LS fit. This
is true too of cases 11 and 40 which were the only other cases whose leverage
values exceeded the benchmark of 2p/n.
As we noted above, both the residual and the q−q plots indicate that the
distribution of the residuals is positively skewed. This suggests a transforma-
tion as discussed below, or perhaps a prudent choice of a score function which
would be more appropriate for skewed error distributions than the Wilcoxon
scores. The score function ϕ.5 (u), (2.5.33), is more suited to positively skewed
errors. Panel D of Figure 3.9.4 displays the internal R Studentized residuals
based on the R fit using this bent score function. From this plot and the
tabled diagnostics, the outliers stand out more from this fit than the previous
two fits. The RDF F IT S values for this fit are even smaller than those of the
Wilcoxon fit, which is expected since this score function protects on the right.
While Case 7 has a little influence on the bent score fit, no other cases have
RDF F IT S exceeding the benchmark.
Table 3.9.3 displays the estimates of the betas for the three fits along with
their standard errors. At the .05 level, coefficients 2 and 3 are significant for
the robust fits while only coefficient 2 is significant for the LS fit. The robust
fits appear to be an improvement over LS. Of the two robust fits, the bent
Figure 3.9.4: Panel A: internal LS Studentized residual plot; Panel B: internal Wilcoxon Studentized residual plot; Panel C: normal q−q plot of the internal Wilcoxon Studentized residuals; Panel D: internal Studentized residual plot based on the bent score fit; Free Fatty Acid Data.
Table 3.9.4: Regression Diagnostics for Cases of Interest for the Free Fatty Acid Data

                  LS                Wilcoxon          Bent Score
Case   hᵢ      Int. t  DFFIT     Int. t  DFFIT     Int. t  DFFIT
8      0.12     1.16    0.43      1.57    0.44      1.73    0.31
9      0.04     1.74    0.38      2.14    0.13      2.37    0.26
10     0.09     1.12    0.36      1.59    0.53      1.84    0.30
12     0.06     2.84    0.79      3.30    0.33      3.59    0.30
22     0.05     2.26    0.53      2.51   −0.06      2.55    0.11
26     0.04     1.51    0.32      1.79    0.20      1.86    0.10
38     0.15     1.27    0.54      1.70    0.53      1.93    0.19
2      0.10    −1.19   −0.40     −0.17   −0.64     −0.75   −0.48
7      0.11    −1.07   −0.37     −0.75   −0.44     −0.74   −0.64
11     0.22     0.56    0.30      0.97    0.31      1.03    0.07
40     0.25    −0.51   −0.29     −0.31   −0.21     −0.35    0.06
36     0.53     0.18    0.19     −0.04   −0.27     −0.66   −0.34
Figure 3.9.5: Panel A: Internal R Studentized residuals plot of the log trans-
formed free fatty acid data; Panel B: Corresponding normal q−q plot.
Y = α + x′ β + e , (3.10.1)
where λ and γ are unknown parameters. In this case it follows that the hazard
function of T is proportional to the baseline hazard function with the covariate
acting as the factor of proportionality; i.e.,
and generalized gamma if (m₁, m₂) → (∞, 1); see Kalbfleisch and Prentice. If (m₁, m₂) = (1, 1), then e has a logistic distribution. In general this class
contains a variety of shapes. The distributions are symmetric for m1 = m2 ,
positively skewed for m1 > m2 , and negatively skewed for m1 < m2 . While
Kalbfleisch and Prentice discuss this class for m1 , m2 ≥ 1, we extend the class
to m1 , m2 > 0 in order to include heavier-tailed error distributions.
For random errors with distribution GF (2m1 , 2m2 ), the optimal rank score
function is given by
ϕm1 ,m2 (u) = (m1 m2 (exp {F −1 (u)} − 1))/(m2 + m1 exp {F −1 (u)}) , (3.10.6)
where F is the cdf of the GF (2m1 , 2m2 ) distribution; see Exercise 3.15.31.
We label these scores as GF (2m1 , 2m2 ) scores. It follows that the scores are
strictly increasing and bounded below by −m1 and above by m2 . Hence an
R-analysis based on these scores has bounded influence in the Y -space.
This class of scores can be conveniently divided into the four subclasses
C1 through C4 which are represented by the four quadrants with center (1, 1)
as depicted in Figure 3.10.1. The point (1, 1) in this figure corresponds to
the linear-rank, Wilcoxon scores. These scores are optimal for the logistic
distribution, GF (2, 2), and form a “natural” center point for the scores. One
score function from each class with the density for which it is optimal is plotted
in Figure 3.10.2. These plots are generally representative. The score functions
in C2 change from concave to convex as u increases and, hence, are suitable
for light-tailed error structure, while, those in C4 pass from convex to concave
and are suitable for heavy tailed error structure. The score functions in C3
are always convex and are suitable for negatively skewed error structure with
heavy left tails and moderate right tails, while those in C1 are suitable for
positively skewed errors with heavy right tails and moderate left tails.
Figure 3.10.2 shows how a score function corresponds to its density. If the
density has a heavy right tail then the score function tends to be flat on the
right side; hence, the resulting estimate is less sensitive to outliers on the right.
While if the density has a light right tail then the scores tend to rise on the
right in order to accentuate points on the right. The plots in Figure 3.10.2
suggest approximating these scores by scores consisting of two or three line
segments such as the bent score function, (2.5.33).
Generally the GF(2m₁, 2m₂) scores cannot be obtained in closed form due to F⁻¹, but software such as R can easily produce them. For example, the R command qf(u,df1,df2) returns F⁻¹(u), where F is the cdf of an F-random variable with degrees of freedom ν₁ = df1 and ν₂ = df2. There are two interesting subclasses for which closed forms are possible. These are the subclasses GF(2, 2m₂) and GF(2m₁, 2). As Exercise 3.15.32 shows, the random variables for these classes are the logs of variates having Pareto distributions. A short computational sketch of the scores follows.
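The following R sketch is one way to produce these scores from qf(). It assumes, consistent with the log-Pareto subclasses just noted, that a GF(2m₁, 2m₂) variate is the log of an F(2m₁, 2m₂) variate, so that exp{F⁻¹(u)} in (3.10.6) is simply the F quantile returned by qf; the function name gf_scores is ours.

gf_scores <- function(u, m1, m2) {
  w <- qf(u, 2 * m1, 2 * m2)           # exp{F^{-1}(u)}, under the stated assumption
  m1 * m2 * (w - 1) / (m2 + m1 * w)    # unnormalized score (3.10.6)
}
u <- seq(0.01, 0.99, by = 0.01)
range(gf_scores(u, 2, 10))             # the scores lie between -m1 and m2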
Figure 3.10.1: Schematic of the four classes, C1 - C4, of the GF (2m1 , 2m2 )
scores.
Figure 3.10.2: Column A contains plots of the densities: the Class C1 distribu-
tion GF (3, .8); the Class C2 distribution GF (4, 8); the Class C3 distribution
GF (.5, 6); and the Class C4 distribution GF (1, .6). Column B contains the
corresponding optimal score functions.
   ARE = 3m₂(m₂ + 2) / (2m₂ + 1)² ,   (3.10.9)
to overfitting. Its use, however, can lead to insight and may prove beneficial
for fitting future data sets of the same type; see McKean et al. (1989) for such
an application. Using XLISP-STAT (Tierney, 1990) Wang (1996) presents a
graphical interface for methods of score selection.
[Figure: log survival times overlaid with the GF(2,10) and LS fits.]
The one method for score selection that we briefly touch on here is based on
q−q plots; see McKean and Sievers (1989). Using Wilcoxon scores we obtained
an initial fit of the oneway layout model as discussed in Chapter 4. Panel A of
Figure 3.10.4 displays the q−q plot of the ordered residuals versus the logistic
quantiles based on this fit. Although the left tail of the logistic distribution
appears adequate, the right side of the plot indicates that distributions with
lighter right tails might be more appropriate. This is confirmed by the near
linearity of the GF (2, 10) quantiles versus the Wilcoxon residuals. After trying
several R-fits using GF (2m1 , 2m2 ) scores with m1 , m2 ≥ 1, we decided that
the q−q plot of the GF(2, 10) fit, Panel B of Figure 3.10.4, appeared to be most linear, and we used it to conduct the following R-analysis.
Figure 3.10.4: Panel A: q−q plot of the Wilcoxon fit residuals versus logistic quantiles; Panel B: q−q plot of the GF(2,10) fit residuals versus GF(2,10) quantiles.
For the fit of the full model using the scores GF(2, 10), the minimum value of the dispersion function, D, is 103.298 and the estimate of τϕ is 1.38. Note that this minimum value of D is the analogue of the "pure" sum of squared errors in a least squares analysis; hence, we use the notation DPE = 103.298 for pure error dispersion. We first test the goodness of fit of a simple linear model; the reduced model is thus the simple linear model. The alternative hypothesis is that the model is not linear but, other than this, it is not specified; hence, the full model is the oneway layout. Thus the hypotheses are
interval for the slope parameter β to be −17.67 ± 3.67; hence, it appears that
the slope parameter differs significantly from 0. In Lawless there was interest
in computing a confidence interval for E(Y |x = log 20). The robust estimate
of this conditional mean is Ŷ = 11.07 and a confidence interval is 11.07 ± 1.9.
Similar to the other robust confidence intervals, this interval is the same as in
the least squares analysis, except that τ̂ϕ replaces σ̂. A fuller discussion of the
R analysis of this data set can be found in McKean and Sievers (1989).
Y = α + x′ β + e (3.11.1)
(a.s.) for the correlation model. This allows us to easily obtain inference meth-
ods for the correlation model as discussed below.
First define the modulus of a matrix A to be
   m(A) = max_{i,j} |a_ij| .   (3.11.6)
As Exercise 3.15.33 shows, the following three facts follow from this definition: m(AB) ≤ p·m(A)m(B), where p is the common dimension of A and B; m(AA′) ≥ m(A)²; and m(A) = max aᵢᵢ if A is positive semidefinite. We next
need a preliminary lemma found in Arnold (1980).
Lemma 3.11.1. Let {a_n} be a sequence of nonnegative real numbers. If n⁻¹ Σ_{i=1}^n aᵢ → a₀, then n⁻¹ sup_{1≤i≤n} aᵢ → 0.

Proof: We have

   a_n/n = (1/n) Σ_{i=1}^n aᵢ − ((n−1)/n)·(1/(n−1)) Σ_{i=1}^{n−1} aᵢ → 0 .   (3.11.7)

Now suppose that n⁻¹ sup_{1≤i≤n} aᵢ does not converge to 0. Then for some ε > 0 and for all integers N there exists an n_N such that n_N ≥ N and n_N⁻¹ sup_{1≤i≤n_N} aᵢ ≥ ε. Thus we can find a subsequence of integers {n_j} such that n_j → ∞ and n_j⁻¹ sup_{1≤i≤n_j} aᵢ ≥ ε. Let a_{i_{n_j}} = sup_{1≤i≤n_j} aᵢ. Then

   ε ≤ a_{i_{n_j}}/n_j ≤ a_{i_{n_j}}/i_{n_j} .   (3.11.8)

Also, since n_j → ∞ and ε > 0, i_{n_j} → ∞; hence, expression (3.11.8) leads to a contradiction of expression (3.11.7).
The following theorem is due to Arnold (1980).

Theorem 3.11.1. Under (3.11.5),

   lim_{n→∞} max diag{ X(X′X)⁻¹X′ } = 0 , a.s. ;   (3.11.9)
Proof: Using the facts cited above on the modulus of a matrix, we have

   m( X(X′X)⁻¹X′ ) ≤ p² n⁻¹ m(XX′) m( (n⁻¹X′X)⁻¹ ) .   (3.11.10)
By Lemma 3.11.1 we have n⁻¹ sup_{i≤n} Uᵢ → 0 a.s. Since XX′ is positive semidefinite, the desired conclusion is obtained from the facts which followed expression (3.11.6).
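As a quick numerical illustration of the modulus facts used in this proof (our own sketch, with m() implementing (3.11.6)):

m <- function(A) max(abs(A))            # modulus (3.11.6)
set.seed(1)
A <- matrix(rnorm(2 * 3), 2, 3)
B <- matrix(rnorm(3 * 4), 3, 4)
m(A %*% B) <= 3 * m(A) * m(B)           # TRUE: m(AB) <= p m(A) m(B), p = 3
m(A %*% t(A)) >= m(A)^2                 # TRUE: m(AA') >= m(A)^2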
Thus, given X, we have the same assumptions on the design matrix as we did in the previous sections. By conditioning on X, the theory derived in Section 3.5 holds for the correlation model also. Such a conditional argument is demonstrated in Theorem 3.11.2 below. For later discussion we summarize the rank-based inference for the correlation model. Given a specified score function ϕ, let β̂ϕ denote the R estimate of β defined in Section 3.2. Under the correlation model (3.11.1) and the assumptions (3.11.4), (S.1), (3.4.10), and (3.11.5), √n(β̂ϕ − β) →ᴰ N_p(0, τϕ²Σ⁻¹). Also, the estimates of τϕ discussed in Section 3.7.1 are consistent estimates of τϕ under the correlation model. Let τ̂ϕ denote such an estimate. In terms of testing, consider the R test statistic Fϕ = (RD/p)/(τ̂ϕ/2) of the above hypothesis H₀ of independence. Employing the usual conditional argument, it follows that pFϕ →ᴰ χ²(p, δ_R), a.e. M under Hₙ: β = θ/√n, where the noncentrality parameter δ_R is given by δ_R = θ′Σθ/τϕ².
Likewise for the LS estimate β̂_LS of β: using the conditional argument (see Arnold (1980) for details), √n(β̂_LS − β) →ᴰ N_p(0, σ²Σ⁻¹) and, under Hₙ, pF_LS →ᴰ χ²(p, δ_LS) with noncentrality parameter δ_LS = θ′Σθ/σ². Thus the ARE of the R test Fϕ to the least squares test F_LS is the ratio of noncentrality parameters, σ²/τϕ². This is the usual ARE of rank tests to tests based on least squares in simple location models. Hence the test statistic Fϕ has efficiency robustness. The theory of rank-based tests in Section 3.6 applies to the correlation model.
We return to measures of association and their estimates. For motivation,
we consider the least squares measure first.
It follows from the above discussion on the R test statistic that the influence function of R₂ is bounded in the Y-space.

The parameters that respectively correspond to the statistics D(0) and D(β̂_R) are D_y = ∫ϕ(G(y))y dG(y) and D_e = ∫ϕ(F(e))e dF(e); see the discussion in Section 3.6.3. The population CMDs associated with R₁ and R₂ are

   R̄₁ = RD/D_y ,   (3.11.16)

   R̄₂ = RD/(RD + (τϕ/2)) ,   (3.11.17)

where RD = D_y − D_e.
Theorem 3.11.2. Under the correlation model (3.11.1) and the assumptions (E.1), (2.4.16), (S.1), (3.4.10), (S.2), (3.4.11), and (3.11.5),

   Rᵢ →ᴾ R̄ᵢ a.e. M , i = 1, 2 .
Proof: We have

   n⁻¹D(0) = n⁻¹ Σ_{i=1}^n ϕ( (n/(n+1)) F̂ₙ(Yᵢ) ) Yᵢ = ∫ ϕ( (n/(n+1)) F̂ₙ(t) ) t dF̂ₙ(t) ,

where F̂ₙ denotes the empirical distribution function of the random sample Y₁, . . . , Yₙ. As n → ∞ the integral converges to D_y.
Next consider the reduction in dispersion. By Theorem 3.11.1, with probability 1 we can restrict the sample space to a space on which Huber's design condition (D.1) holds and on which n⁻¹X′X → Σ. Then, conditionally given X, we have the assumptions found in Section 3.4 for the non-stochastic model. Hence, from the discussion found in Section 3.6.3, (1/n)D(β̂_R) →ᴾ D_e; hence it is true unconditionally, a.e. M. The consistency of τ̂ϕ was discussed above. The result then follows.
the facts that ϕ takes on both positive and negative values and that G is
absolutely continuous.
The next theorem is taken from Witt (1989).
Theorem 3.11.4. Suppose f and g satisfy the conditions (E.1) and (E.2) in Section 3.4 and that ϕ satisfies assumption (S.2), (3.4.11). Then RD is a strictly convex function of β and has a minimum value of 0 at β = 0.

Proof: We show that the gradient of RD is zero at β = 0 and that its second matrix derivative is positive definite. Note first that the distribution function, G, and density, g, of Y can be expressed as G(y) = ∫F(y − β′x)dM(x) and g(y) = ∫f(y − β′x)dM(x). We have
   ∂RD/∂β = − ∫∫∫ ϕ′[G(y)] y f(y − β′x) f(y − β′u) u dM(x) dM(u) dy
            − ∫∫ ϕ[G(y)] y f′(y − β′x) x dM(x) dy .   (3.11.23)
Since E[x] = 0, both terms on the right side of the above expression are 0 at β = 0. Before obtaining the second derivative, we rewrite the first term of (3.11.23) as

   − ∫ [ ∫∫ ϕ′[G(y)] y f(y − β′x) f(y − β′u) dy dM(x) ] u dM(u)
     = − ∫ [ ∫ ϕ′[G(y)] g(y) y f(y − β′u) dy ] u dM(u) .   (3.11.24)
Now integrate the first term on the right side of (3.11.24) by parts with respect to y, using dt = f′(y − β′x)dy and v = ϕ[G(y)]. This leads to

   ∂²RD/∂β∂β′ = − ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x(u − x)′ dy dM(x) dM(u) .   (3.11.25)
We have, however, the following identity:

   ∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) (u − x)(u − x)′ dy dM(x) dM(u)
    = ∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) u(u − x)′ dy dM(x) dM(u)
    − ∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x(u − x)′ dy dM(x) dM(u) .

Since the two integrals on the right side of the last expression are negatives of each other, this, combined with expression (3.11.25), leads to
   2 ∂²RD/∂β∂β′ = ∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) (u − x)(u − x)′ dy dM(x) dM(u) .

Since the functions f and M are continuous and the score function is increasing, it follows that the right side of this last expression is a positive definite matrix.
It follows from these theorems that the R̄ᵢs satisfy properties of association similar to R̄². We have 0 ≤ R̄ᵢ ≤ 1. By Theorem 3.11.4, R̄ᵢ = 0 if and only if β = 0, if and only if Y and x are independent.
Theorem 3.11.5. Suppose Model (3.11.1) holds. Assume further that (x, Y) follows a multivariate normal distribution with variance-covariance matrix

   Σ_(x,Y) = ( Σ      Σβ
               β′Σ    σ_e² + β′Σβ ) .   (3.11.26)

Then, from (3.11.16) and (3.11.17),

   R̄₁ = 1 − √(1 − R̄²) ,   (3.11.27)

   R̄₂ = (1 − √(1 − R̄²)) / ( 1 − √(1 − R̄²)[1 − (1/(2T²))] ) ,   (3.11.28)

where T = ∫₀¹ ϕ(u)Φ⁻¹(u) du and R̄² = 1 − σ_e²/σ_y², with σ_y² = σ_e² + β′Σβ, is the population coefficient of multiple determination.
Proof: Note that σ_y² = σ_e² + β′Σβ and E(Y) = α + β′E[x]. Further, the distribution function of Y is G(y) = Φ((y − α − β′E(x))/σ_y), where Φ is the standard normal distribution function. Then

   D_y = ∫_{−∞}^{∞} ϕ[Φ(y/σ_y)] y dΦ(y/σ_y)   (3.11.29)
       = σ_y T .   (3.11.30)

Similarly, D_e = σ_e T. Hence,

   RD = (σ_y − σ_e)T .   (3.11.31)

By the definition of R̄², we have R̄² = 1 − σ_e²/σ_y². This leads to the relationship

   1 − √(1 − R̄²) = (σ_y − σ_e)/σ_y .   (3.11.32)

The result (3.11.27) follows from the expressions (3.11.31) and (3.11.32).
The result (3.11.27) follows from the expressions (3.11.31) and (3.11.32).
For the result (3.11.28), by the assumptions on the distribution of (x, Y ),
the distribution of e is N(0, σe2 ); i.e., f (x) = (2πσe2 )−1/2 exp {−x2 /(2σe2 )} and
F (x) = Φ(x/σe ). It follows that f ′ (x)/f (x) = −σe−2 x, which leads to
f ′ (F −1 (u)) 1
− ′
= Φ−1 (u) .
f (F (u)) σe
Hence,
Z 1
1 −1
τϕ−1 = ϕ(u) du Φ (u)
0 σe
Z
1 1
= ϕ(u)Φ−1 (u) du .
σe 0
Note that T is free of all parameters. It can be shown directly that the R̄ᵢs are one-to-one increasing functions of R̄²; see Exercise 3.15.35. Hence, for the multivariate normal model the parameters R̄², R̄₁, and R̄₂ are equivalent.
Although the CMDs are equivalent for the normal model, they measure dependence between x and Y on different scales. We can use the relationships derived in the last theorem to have these coefficients measure the same quantity at the normal model, by simply solving for R̄² in terms of R̄₁ and R̄₂ in (3.11.27) and (3.11.28), respectively. These parameters are useful later, so we call them R̄₁* and R̄₂*, respectively. Hence, solving as indicated, we get

   R̄₁*² = 1 − (1 − R̄₁)²   (3.11.33)

   R̄₂*² = 1 − [ (1 − R̄₂) / (1 − R̄₂(1 − (1/(2T²)))) ]² .   (3.11.34)

Again, at the multivariate normal model we have R̄² = R̄₁*² = R̄₂*².
For Wilcoxon scores and sign scores, the reader is asked to show in Exercise 3.15.36 that (1/(2T²)) = π/6 and (1/(2T²)) = π/4, respectively; a quick numerical check follows.
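The check (our own sketch; here T = ∫₀¹ ϕ(u)Φ⁻¹(u) du as in the proof of Theorem 3.11.5):

T_wil <- integrate(function(u) sqrt(12) * (u - 0.5) * qnorm(u), 0, 1)$value
T_sign <- integrate(function(u) sign(u - 0.5) * qnorm(u), 0, 1)$value
c(1 / (2 * T_wil^2), pi / 6)    # both approximately 0.5236
c(1 / (2 * T_sign^2), pi / 4)   # both approximately 0.7854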
Ghosh and Sen (1971) proposed the mixed rank test statistic to test the
hypothesis of independence (3.11.3). It is essentially the gradient test of the
hypothesis H0 : β = 0. As we showed in Section 3.6, this test statistic is
asymptotically equivalent to Fϕ . Ghosh and Sen (1971), also, proposed a pure
rank statistic in which both variables are ranked and scored.
formalized like the MAXR procedure in SAS. In a similar vein, the stepwise
model building criteria based on LS estimation (Draper and Smith, 1966)
could easily be robustified by using R estimates in place of LS estimates and
the robust test statistic Fϕ in place of FLS .
where the weights b_ij are positive and symmetric, i.e., b_ij = b_ji. It is then easy to show (see Exercise 3.15.42) that the function (3.12.1) is a pseudo-norm. As noted in Section 2.2.2, if the weights b_ij ≡ 1, then this pseudo-norm is proportional to the pseudo-norm based on the Wilcoxon scores. Hence we refer to this as a generalized R (HBR) pseudo-norm.

Since this is a pseudo-norm, we can develop estimation and testing procedures using the same geometry as in the last chapter. Briefly, the HBR estimate of β in Model (3.2.3) is a vector β̂_HBR such that

   β̂_HBR = Argmin_β ‖Y − Xβ‖_HBR .   (3.12.2)
3.12.2 Weights
The weight for a point (xi , Yi), i = 1, . . . , n, for the HBR estimates is a function
of two components. One component depends on the “distance” of the point
xi from the center of the X-space (factor space) and the other component
depends on the size of the residual based on an initial high breakdown fit. As
shown below, these components are used in combination, so the weight due to
one component may be offset by the weight of the other component.
First, we consider distance in factor space. It seems reasonable to down-
weight points far from the center of the data. The leverage values hi =
n−1 + x′ci (X′c Xc )−1 xci , for i = 1, . . . , n, measure distance (Mahalanobis) from
the center relative to the scatter matrix X′c Xc . Leverage values, though, are
based on means and the usual (LS) variance-covariance scatter matrix which
are not robust estimators. There are several robust estimators of location and
scatter from which to choose, including the high breakdown minimum co-
variance determinant (MCD) which is an ellipsoid that covers about half
of the data and yet has minimum determinant. Although computationally
intensive, Rousseeuw and Van Driessen (1999) present a fast computational
algorithm for it. Let vc denote the center of the ellipsoid. Letting V denote
the MCD, the sample covariance of the points covered, the robust distances
are given by
   Qᵢ = (xᵢ − v_c)′ V⁻¹ (xᵢ − v_c) .   (3.12.1)

We define the associated weights by wᵢ = min{1, c/Qᵢ}, where c is usually set at the 95th percentile of the χ²(p) distribution. Note that "good" points generally have weights 1.
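A minimal sketch of these robust distances and weights, assuming the robustbase package for the MCD (covMcd) and a design matrix xmat without the intercept column:

library(robustbase)
mcd <- covMcd(xmat)                            # MCD center and scatter
Q <- mahalanobis(xmat, mcd$center, mcd$cov)    # robust distances (3.12.1)
cq <- qchisq(0.95, df = ncol(xmat))            # c, the 95th percentile of chi-square(p)
w <- pmin(1, cq / Q)                           # w_i = min{1, c/Q_i}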
The class of GR estimates proposed by Sievers (1983) uses weights of the form b_ij = wᵢwⱼ which depend only on distance in factor space. As shown by
Naranjo and Hettmansperger (1994), these estimates have positive breakdown
and bounded influence in factor space, but as Exercise 3.15.41 shows they are
always less efficient than the Wilcoxon estimates, unless all the weights are 1.
Further, at times, the loss in efficiency can be severe; see Chang et al. (1999)
for discussion. One reason is that “good” points of high leverage (points that
follow the model) are downweighted by the same amount as points at the same
distance from the center of factor space but which do not follow the model
(“bad” points of high leverage). The asymptotic variance of the GR estimators
is given in Exercise 3.15.41, also.
The weights for the HBR estimates are a function of the GR weights and residual information from the Y-space. The residuals are based on a high breakdown initial estimate of the regression coefficients. We have chosen to use the least trimmed squares (LTS) estimate, which is given by

   Argmin Σ_{i=1}^{h} [Y − α − x′β]²_{(i)} ,   (3.12.2)

where h = [n/2] + 1 and where the notation (i) denotes the ith ordered absolute residual; see Rousseeuw and Van Driessen (1999). Let ê⁰ denote the residuals from this initial fit; a computational sketch is given below.
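A minimal sketch of this initial fit, assuming the robustbase package for ltsReg:

library(robustbase)
lts <- ltsReg(xmat, y, alpha = 0.5)    # LTS fit; h is roughly [n/2] + 1
e0 <- residuals(lts)                   # the initial residuals
sigma0 <- mad(e0)                      # 1.483 med_i |e0_i - med_j {e0_j}|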
Define the function ψ(t) by ψ(t) = 1, t, or −1 according as t ≥ 1, −1 < t < 1, or t ≤ −1. Let σ be estimated by the initial scaling estimate MAD = 1.483 medᵢ|ê⁰ᵢ − medⱼ{ê⁰ⱼ}|. Recall the robust distances Qᵢ defined in expression (3.12.1). Let

   mᵢ = ψ(b/Qᵢ) = min{ 1, b/Qᵢ } ,
where the tuning constants b and c are both set at 4. From this point-of-view, it
is clear that these weights downweight both outlying points in factor space and
outlying responses. Note that the initial residual information is a multiplicative
factor in the weight function. Hence, a good leverage point generally has a
small (in absolute value) initial residual which offsets its distance in factor
space. The following example illustrates the differences among the Wilcoxon,
GR, and HBR estimates.
Example 3.12.1 (Stars Data). This data set is drawn from an astronomy
study on the star cluster CYG OB1 which contains 47 stars; see Rousseeuw
and Leroy (1987) for a discussion on the history of the data set. The response is
the logarithm of the light intensity of the star while the independent variable is
the logarithm of the temperature of the star. For convenience, we have placed the Stars Data at the url cited in the Preface. The data are also shown in Panel
A of Figure 3.12.1. Note that four of the stars, called giants, form a cluster of
outliers in factor space while the rest of the stars fall in a point cloud. Panel A
shows also the overlay plot of the LS and Wilcoxon fits. Note that the cluster
of four outliers in the x space have exerted such a strong influence on the
fits that they have drawn the LS and Wilcoxon fits towards the cluster. This
behavior is predictable based on the influence functions of these estimates.
These four giant cases have very large robust distances from the center of the
data. Hence the weights used by the GR estimates severely downweight these
points, resulting in its fit through the point cloud. For this data set, the initial
LTS fit ignores the four giant stars and fits the point cloud. Hence, the four
giant stars are “bad” leverage points and, hence, are downweighted for the
HBR fit, also.
The ww command to compute the GR or HBR estimates is the same as for the Wilcoxon estimates, wwest, except that the argument for the weight indicator bij is set at bij="GR" or bij="HBR", respectively. For example, suppose the design matrix without the intercept column is in the variable xmat and the response vector is in the variable y. Then the following R commands return the LS, Wilcoxon, GR, and HBR estimates:

ls.fit = lm(y~xmat)
wil.fit = wwest(xmat,y,bij="WIL",print.tbl=F)
gr.fit = wwest(xmat,y,bij="GR",print.tbl=F)
hbr.fit = wwest(xmat,y,bij="HBR",print.tbl=F)
Example 3.12.2 (Stars Data, continued). Suppose in the last example that
we had no subject matter available concerning the data set. Then based on
the scatterplot, we may decide to fit a quadratic model. The plots of the LS,
Wilcoxon, GR, and HBR fits for the quadratic model are found in Panel B of
Figure 3.12.1. The quadratic fits based on the LS, Wilcoxon, and HBR esti-
mates follow the curvature in the data, while the GR fit misses the curvature
resulting in a very poor fit. For the quadratic model, the cluster of four giant
stars are “good” data points and the HBR weights take this into account.
The weights used for the GR fit, however, ignore this residual information
and severely downweight the four giant star cases, resulting in the poor fit as
shown in the figure.
The last two plots in the figure, Panels C and D, are the residual plots
for the GR and HBR fits. Based on their fits, the LS and Wilcoxon residual
plots are the same as the HBR. The pattern in the GR residual plot (Panel
C), while not random, does not indicate how to proceed with model selection.
This is often true for residual plots based on high breakdown fits; see McKean
et al. (1993).
Figure 3.12.1: Panel A: Stars data overlaid with the LS, Wilcoxon, GR, and HBR fits of the linear model; Panel B: the corresponding fits of the quadratic model; Panel C: GR residual plot for the quadratic model; Panel D: HBR residual plot for the quadratic model.
3.12.3 Asymptotic Normality of β̂_HBR

The asymptotic normality of the HBR estimates was developed by Chang (1995) and Chang et al. (1999). Much of our development is in Appendix A.6, which is taken from this latter article. Our discussion is for general weights, under assumptions that we specify as we proceed. In order to establish asymptotic normality of β̂_HBR, we need some further notation and assumptions. Define the parameters

where

   B_ij(t) = E_β[ b_ij I(0 < Yᵢ − Yⱼ < t) ] .   (3.12.5)
   Cₙ = X′AₙX .   (3.12.7)

Since the rows and columns of Aₙ sum to zero, it can be shown that

   Cₙ = Σ_{i<j} γ_ij b_ij (xⱼ − xᵢ)(xⱼ − xᵢ)′ ;   (3.12.8)
For the correlation model, an explicit expression can be given for the matrix
CH assumed in (H.1); see (3.12.24) and, also, Lemma 3.12.1.
As our theory shows, the HBR estimate attains 50% breakdown (Section 3.12.4) and asymptotic normality, at rate √n, provided the initial estimates of the regression coefficients have these qualities. One such estimate is the least trimmed squares, LTS, which is given by expression (3.12.2). Another class of such estimates are the rank-based estimates proposed by Hössjer (1994); see also Croux, Rousseeuw, and Hössjer (1994).
The development of the theory for β̂_HBR proceeds similarly to that of the R estimates. The theory is sketched in the Appendix, Section A.6, and here we present only the two main results: the asymptotic distribution of the gradient and the asymptotic distribution of the estimate.
where the remainder term is uniformly small over all i and j. Under Assump-
tion (H.1), (3.12.10), the result follows.
Discussion on obtaining standard errors for the estimators can be found in
Section 3.12.6.
Remark 3.12.1 (Empirical Efficiency). As noted above, there is always a
loss of efficiency of the GR estimator relative to the Wilcoxon estimator. It
was hoped that the HBR estimator would regain some of this efficiency. This
was confirmed in a Monte Carlo study which is discussed in Section 8 of the
article by Chang et al. (1999). In this study, over a series of designs, which
included contamination in both responses and factor space, in all but two of
the situations, the empirical efficiency of the HBR estimate relative to the
Wilcoxon estimate was always larger than that of the GR estimate relative to
the Wilcoxon estimate.
Remark 3.12.2 (Stability Study). To obtain its full 50% breakdown, the
HBR estimates require initial estimates with 50% breakdown. It is known that
slight changes to centrally located data can cause some high breakdown esti-
mates to change by a large amount. This was discussed for the high breakdown
least median squares (LMS) estimates by Hettmansperger and Sheather (1992,
1993) and later confirmed in a Monte Carlo study by Sheather, McKean, and
Hettmansperger (1997). Part of the article by Chang et al. (1999) consisted of
a stability study for the HBR estimator using LMS and LTS starting values.
Over the situations investigated, the HBR estimates were much more stable
than either the LTS or LMS estimates but were less stable than the Wilcoxon
estimates.
unique solution β. In particular, this implies that neither all of the xᵢs are the same nor all of the yᵢs are the same; hence, provided the weights have not broken down, both constants M₁ and M₂ of Lemma 3.12.2 are positive.
Theorem 3.12.3. Assume that the data points Z are in general position. Let v, V, and β̂⁽⁰⁾ denote the initial estimates of location, scatter, and β. Let ε*ₙ(v, Z), ε*ₙ(V, Z), and ε*ₙ(β̂⁽⁰⁾, Z) denote their corresponding breakdown points. Then the breakdown point of the HBR estimator is

   ε*ₙ(β̂_HBR, Z) = min{ ε*ₙ(v, Z), ε*ₙ(V, Z), ε*ₙ(β̂⁽⁰⁾, Z), 1/2 } .   (3.12.20)

Proof: Corrupt m points in the data set Z and let Z′ be the sample consisting of these corrupt points and the remaining n − m points. Assume that Z′ is in general position and that v(Z′), V(Z′), and β̂⁽⁰⁾(Z′) have not broken down. Then the constants M₁ and M₂ of Lemma 3.12.2 are positive and finite. Hence, by Lemma 3.12.2, ‖β̂_HBR(Z′)‖ < ∞ and the theorem follows.
Based on this last result, the HBR estimate has 50% breakdown provided
the initial estimates v, V, and β b (0) all have 50% breakdown. Assuming that
the data points are in general position, the MCD estimates of location and
scatter as discussed near expression (3.12.1) have 50% breakdown. For initial
estimates of the regression coefficients, again assuming that the data points
are in general position, the LTS estimates, (3.12.2), have 50% breakdown; see,
also, Hössjer (1994). The HBR estimates used in the examples of Section 3.12.6
employ the MCD estimates of location and scatter and the LTS estimate of
the regression coefficients, resulting in the weights defined in (3.12.3).
Influence functions are derived at the model where both x and y are stochastic; hence, consider the correlation model of Section 3.11,

   Y = x′β + e ,   (3.12.22)

Theorem 3.12.4. The influence function of the estimate β̂_HBR is given by

   Ω(x₀, y₀, β̂_HBR) = (1/2) C_H⁻¹ ∫∫ (x₀ − x₁) b(x₁, x₀, y₁, y₀) sgn{y₀ − y₁} dF(y₁) dM(x₁) ,   (3.12.25)

where C_H is given by expression (3.12.24).
In order to show that the influence function correctly identifies the asymptotic distribution of the estimator, define Wᵢ* as

   Wᵢ* = ∫∫ (xᵢ − x₁) b(x₁, xᵢ, y₁, Yᵢ) sgn(Yᵢ − y₁) dF(y₁) dM(x₁) .   (3.12.26)

If we can show that (1/√n) Σ_{i=1}^n Wᵢ* →ᵈ N(0, Σ_H), then we are done. From the proof of Theorem A.6.4 in the Appendix, it suffices to show that

   (1/√n) Σ_{i=1}^n (Uᵢ − Wᵢ*) →ᴾ 0 ,   (3.12.28)

where Uᵢ = (1/n) Σ_{j=1}^n (xᵢ − xⱼ) E[ b_ij sgn(Yᵢ − Yⱼ) | Yᵢ ]. Writing the left side of (3.12.28) as

   (1/n^{3/2}) Σ_{i=1}^n Σ_{j=1}^n (xᵢ − xⱼ) { E[ b_ij sgn(Yᵢ − Yⱼ) | Yᵢ ] − g_ij(0) sgn(Yᵢ − Yⱼ) } ,

where g_ij(0) ≡ b(xⱼ, xᵢ, Yⱼ, Yᵢ), the proof is analogous to the proof of Theorem A.6.4.
3.12.5 Discussion

The influence function Ω(x₀, y₀, β̂_HBR) for the HBR estimate is a continuous function of x₀ and y₀. With a proper choice of a weight function it is bounded in both the x and y spaces. This is true for the weights given by (3.12.3); furthermore, for these weights Ω(x₀, y₀, β̂_HBR) goes to zero as x₀ and y₀ get large in any direction.

The influence function Ω(x₀, y₀, β̂) is a generalization of the influence functions for the Wilcoxon estimates; see Exercise 3.15.46. Figure 3 of Chang et al. (1999) shows the influence function of the HBR estimator for the special case where (x, Y) has a bivariate normal distribution with mean 0 and the identity matrix as the variance-covariance matrix. For this plot we used the weights given by (3.12.3), where mᵢ = ψ(b/xᵢ²) with the constants b = c = 4. As discussed above, the plot verifies that the influence function is bounded in both the x and y spaces and goes to zero as x₀ and y₀ get large in any direction. For comparison purposes, Figure 3 of Chang et al. (1999) also shows
   Σ̂_H = (n − 1)⁻¹ Σ_{i=1}^n (Ûᵢ − Ū)(Ûᵢ − Ū)′ ,   (3.12.30)

where Ū denotes the average of the Ûᵢ.
For the matrix C_H, consider the results in Lemma 3.12.1. Upon substituting the estimated weights for the weights, expression (3.12.19) simplifies to

   B̂′_ij(0) ≐ b̂_ij ∫ f²(t) dt = b̂_ij (1/(√12 τ_W)) ,   (3.12.31)

where τ_W is the scale parameter (3.4.4) for the Wilcoxon score function, which, for convenience, is τ_W = [√12 ∫ f²(t) dt]⁻¹. To estimate τ_W, we use the estimator τ̂_W given in expression (3.7.8). Now approximating b_ij in Cₙ using (3.12.3) leads to the estimate

   Ĉₙ = (4√3 τ̂_W n²)⁻¹ Σ_{i=1}^n Σ_{j=1}^n b̂_ij (xⱼ − xᵢ)(xⱼ − xᵢ)′ .   (3.12.32)
   α̂_HBR = med_{1≤i≤n} { yᵢ − xᵢ′β̂_HBR } .   (3.12.34)
Because √n(β̂_HBR − β) is bounded in probability and X is centered, it follows, using an argument very similar to the corresponding result for the R estimates (see McKean et al., 1990), that the joint asymptotic distribution of α̂_HBR and β̂_HBR is given by

   √n ( ( α̂_HBR, β̂_HBR′ )′ − ( α, β′ )′ ) →ᴰ N( 0, ( τ_S²   0′
                                                     0    (1/4)C⁻¹ΣC⁻¹ ) ) ,   (3.12.35)

where τ_S is defined by (3.4.6); see the discussion around expression (1.5.29) for estimation of this scale parameter.
   ê*ᵢ = Yᵢ − α̂ − xᵢ′β̂_HBR ;   (3.12.36)

see the discussion around expression (A.7.2) in the Appendix. This yields an estimate of the matrix A*.
We recommend estimating σ by MAD = 1.483 med{|ê*ᵢ|}; κ₁ by

   κ̂₁ = (1/n) Σ_{i=1}^n |ê*ᵢ| ;   (3.12.39)

and κ₂ by

   κ̂₂ = (1/n) Σ_{i=1}^n ( R(ê*ᵢ)/(n+1) − 1/2 ) ê*ᵢ ,   (3.12.40)
which is a consistent estimate of κ2 ; see McKean et al. (1990). An estimator
of Σ_H is given in expression (3.12.30). Let V̂ denote the estimate of Var(ê*), and let σ̂²_{ê,i} denote the ith diagonal entry of V̂. Define the Studentized residuals by

   ẽ*ᵢ = ê*ᵢ / σ̂_{ê,i} .   (3.12.41)

As in LS, these standard errors correct for both the underlying variance of the errors and location. For flagging outliers, appropriate benchmarks for these residuals are ±2; see Section 3.9 for discussion.
Using the ww package, the Studentized residuals based on the HBR fit
are computed by the R commands: fit.hbr = wwest(xmat,y,bij="HBR")
and studres.hbr(xmat,fit.hbr$tmp2$wmat,fit.hbr$tmp1$resid), where
xmat and y contain the design matrix and the vector of responses, respectively.
where the ei s are simulated iid N(0, 1) variates and the xi s are simulated
contaminated normal variates with the contamination proportion set at .25
and the ratio of the variance of the contaminated part to the noncontaminated
part set at 16. Panel A of Figure 3.12.2 displays a scatter plot of the data
overlaid with the Wilcoxon, HBR, and LMS fits. The estimated coefficients
for these fits are in Table 3.12.1. As shown, the Wilcoxon fit is quite good in
fitting the curvature of the data. Its estimates are close to the true values. On
the other hand, the high breakdown fits are quite poor. The LMS fit missed
the curvature in the data. This is true too for the HBR fit, although the fit
did correct itself somewhat from the poor LMS starting values. Panels B and
C of Figure 3.12.2 contain the internal Studentized residual plots based on
the Wilcoxon and the HBR fits, respectively. Based on the Wilcoxon residual
plot, no further models would be considered. The HBR residual plot shows as
outliers the two points which were fitted poorly. It also has a mild linear trend
in it, which is not helpful since a linear term was fit. This trend is true for the
LMS residual plot (Panel D), although it gives an overall impression of the
lack of a quadratic term in the model. In such cases in practice, a higher degree
polynomial may be fitted, which in this case would be incorrect. Difficulties
Figure 3.12.2: Panel A: For the quadratic data of Example 3.12.3, scatter
plot of data overlaid by Wilcoxon, HBR, and LMS fits; Panel B: Studentized
residual plot based on the Wilcoxon fit; Panel C: Studentized residual plot
based on the HBR fit; Panel D: Residual plot based on the LMS fit.
in reading residual plots from high breakdown fits, as encountered here, were
discussed in general in McKean et al. (1993).
measure the difference in fits among LS, highly efficient R, and high breakdown
R fits. These diagnostics were developed by McKean et al. (1996a,1999); see,
also, McKean and Sheather (2009) for a recent discussion. The package ww
computes them; see Terpstra and McKean (2005) and McKean, Terpstra, and
Kloke (2009). We sketch the development for the diagnostics that differentiate
between the LS and R fits first. Also, we focus on Wilcoxon scores, leaving the
general scores analog to the exercises. We begin by looking at the difference
in LS and Wilcoxon (W) fits, which leads to our diagnostics.
Consider the linear model (3.2.3). The design matrix is centered, so the LS estimator is β̂_LS = (X′X)⁻¹X′Y. Let β̂_W denote the R estimate, immediately following expression (3.2.7), based on Wilcoxon scores (ϕ(u) = √12[u − (1/2)]). We use τ_W to denote the corresponding R scale parameter τϕ. For Wilcoxon scores, recall that τ_W = 1/(√12 ∫ f(t)² dt). We then have the following theorem:
Theorem 3.13.1. Assume that the random errors eᵢ of Model (3.2.3) have finite variance σ². Also, assume that assumptions (E1), (E2), (D1), and (D2) of Section 3.4 are true. Then β̂_LS − β̂_W is asymptotically normal with mean 0 and variance-covariance matrix

   Var(β̂_LS − β̂_W) = δ²(X′X)⁻¹ ,   (3.13.1)

where δ² = σ² + τ_W² − 2κ and κ = √12 τ_W E[e(F(e) − 1/2)].
The proof of this theorem can be found in the Appendix; see Section A.8. It follows that δ² ≥ (τ_W − σ)². To see this, note that E[e(F(e) − 1/2)] = Cov(e, F(e)) ≥ 0; hence, κ ≥ 0. Then, by the Cauchy-Schwarz inequality,

   κ ≤ τ_W √(E(e²)) · √(E[(√12(F(e) − 1/2))²]) = τ_W σ · 1 = τ_W σ ,

so that δ² = σ² + τ_W² − 2κ ≥ (τ_W − σ)².

   μ_d = μ_e − μ_ẽ .   (3.13.2)
One way to avoid a problem here is to assume that the errors have a symmetric distribution, i.e., μ_d = 0, but this is undesirable in developing diagnostics for exploratory analysis. Instead, we consider measures composed of two parts: one part measures the difference in slope parameters and the other part measures differences in the estimates of the intercept. First, the following notation is convenient here. Let b = (α, β′)′ denote the vector of parameters, and let b̂_LS and b̂_W denote respectively the LS and Wilcoxon estimates of b.

The version of Theorem 3.13.1 that includes the intercept is the following corollary. Let τ_S = 1/(2f(θ)), where f is the error density and θ is the median of the error distribution. Assume without loss of generality that the design matrix X is centered.

   TDBETAS(LS, W) = (b̂_LS − b̂_W)′ Â_W⁻¹ (b̂_LS − b̂_W) ,   (3.13.4)

where

   Â_W = ( τ̂_S²/n   0′
           0        τ̂_W²(X′X)⁻¹ ) .   (3.13.5)
   |Ŷ_{W,i} − Ŷ_{LS,i}| / (τ̂_W √h_ii) > 2√((p+1)/n) .   (3.13.8)

We use this expression to obtain a benchmark for the diagnostic TDBETAS(LS, W) as follows:

   TDBETAS(LS, W) = (b̂_W − b̂_LS)′ [τ̂_W²(X₁′X₁)⁻¹]⁻¹ (b̂_W − b̂_LS)
    = (1/τ̂_W²) [X₁(b̂_W − b̂_LS)]′ [X₁(b̂_W − b̂_LS)]
    = (1/τ̂_W²) Σᵢ (Ŷ_{W,i} − Ŷ_{LS,i})²
    = (p+1) (1/n) Σᵢ [ (Ŷ_{W,i} − Ŷ_{LS,i}) / (τ̂_W √((p+1)/n)) ]² .

Since h_ii has the average value (p+1)/n, (3.13.8) suggests flagging TDBETAS(LS, W) as large whenever TDBETAS(LS, W) > (p+1)(2√((p+1)/n))², or

   TDBETAS(LS, W) > 4(p+1)²/n .   (3.13.9)
We proceed the same way for diagnostics to indicate differences in fits between the Wilcoxon and HBR fits and between the LS and HBR fits. The asymptotic representation for the HBR estimate of β, (A.7.1), can be used to obtain the asymptotic distribution of the differences between these fits. For data sets where the HBR weights are close to one, though, the covariance matrix of this difference is practically singular, resulting in the diagnostic being quite liberal; see McKean et al. (1996a, 1999) for discussion. So, as we did for the diagnostic between the Wilcoxon and LS fits, we standardize the differences in fits using the asymptotic covariance matrix of the Wilcoxon estimate, i.e., Â_W. Hence, the total differences in fits are given by

   TDBETAS(W, HBR) = (b̂_W − b̂_HBR)′ Â_W⁻¹ (b̂_W − b̂_HBR)   (3.13.10)

and

   TDBETAS(LS, HBR) = (b̂_LS − b̂_HBR)′ Â_W⁻¹ (b̂_LS − b̂_HBR) .   (3.13.11)

We recommend using the benchmark given by (3.13.9) for these diagnostics, also. Likewise, the diagnostics for casewise differences are given by

   CFITSᵢ(W, HBR) = (Ŷ_{W,i} − Ŷ_{HBR,i}) / SE(Ŷ_{W,i})   (3.13.12)

and

   CFITSᵢ(LS, HBR) = (Ŷ_{LS,i} − Ŷ_{HBR,i}) / SE(Ŷ_{W,i}) .   (3.13.13)
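A minimal R sketch of the total-difference diagnostic and its benchmark (our own illustration, not the ww code shown in the examples below; bhat_ls and bhat_w are assumed (p+1)-vectors of intercept and slope estimates, X the centered n × p design matrix, and tauhat_s, tauhat_w the scale estimates discussed above):

tdbetas <- function(bhat_ls, bhat_w, X, tauhat_s, tauhat_w) {
  n <- nrow(X); p <- ncol(X)
  Aw <- rbind(c(tauhat_s^2 / n, rep(0, p)),               # (3.13.5)
              cbind(0, tauhat_w^2 * solve(crossprod(X))))
  d <- bhat_ls - bhat_w
  td <- drop(t(d) %*% solve(Aw, d))                       # (3.13.4)
  list(tdbetas = td, benchmark = 4 * (p + 1)^2 / n)       # (3.13.9)
}

In practice, the ww package computes these diagnostics with its fitdiag function, as the examples below illustrate.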
Example 3.13.1 (Bonds Data). Siegel (1997) presented a data set, the Bonds data, which we use to illustrate some of these concepts. It was further discussed in Sheather (2009) and McKean and Sheather (2009). The responses are the bid prices for U.S. treasury bonds, while the independent variable is the coupon rate (size of the bond's periodic payment rate, in percent). The data are shown
in Panel A of Figure 3.13.1 overlaid with the LS (solid line) and Wilcoxon (bro-
ken line) fits. The fits differ dramatically and the diagnostic TDBETA(LS,W)
has the value 213.7 which far exceeds the benchmark of 0.457. The three cases
yielding the largest values for the casewise diagnostic CFITS are cases 4, 13,
and 35. Panels B and C display the LS and Wilcoxon Studentized residual
plots. As can be seen, the Wilcoxon Studentized residual plot highlights Cases
4, 13, and 35, also. Their Studentized residuals exceed 20 and clearly should be
labeled outliers. These are the outlying points on the far left in the scatterplot
of the data. On the other hand, the LS Studentized residual plot shows only
two of them exceeding the benchmark. Further, the bow-tie pattern of the
Wilcoxon residual plot indicates heteroscedasticity of the errors. As discussed
in Sheather (2009), this heteroscedasticity is to be expected because the bonds
have different maturity dates.
As further discussed in Sheather (2009), the three outlying cases are of a
different type of bond than the others. The plot in Panel D is the Studentized
residuals versus fitted values for the Wilcoxon fit after removing these three
cases. Note that there are still a few outlying data points. The diagnostic
TDBETA(LS,Wil) has the value 1.55 which exceeds the benchmark of 0.50
but the difference is far less than the difference based on the original data.
Next consider the differences between the LS and HBR fits. The leverage
values corresponding to the three outlying cases exceed the benchmark for
leverage points (the smallest leverage value of these three cases has value 0.152
which exceeds the benchmark of 0.114). The diagnostic TDBETA(LS,HBR)
has the value 318.8, which far exceeds the benchmark. As discussed above, the Wilcoxon fit is sensitive to outliers in factor space, and in this case TDBETA(Wil,HBR) is 10.5. When the outliers are omitted, the value of this statistic is 0.034, which is less than the benchmark.
In this simple regression model, it is obvious that the three outlying cases are on the edge of factor space. As the next example shows, in a multiple regression setting such outlying cases are often not so easily identified.
[Figure 3.13.1: Panel A: Scatterplot of the Bonds data, bid price versus coupon rate, overlaid with the LS and Wilcoxon fits; Panel B: LS Studentized residual plot; Panel C: Wilcoxon Studentized residual plot; Panel D: Wilcoxon Studentized residuals versus fitted values after removing the three outliers.]

[Figure 3.13.2 (Hawkins data): Panel A: Wilcoxon residual plot; Panel B: LS residual plot.]
The residual plot for the HBR fit of the Hawkins data is in Panel A of Figure 3.13.3. Note that the HBR fit correctly identified the 10 bad points of high leverage and fit the 4 good points of high leverage well. Table 3.13.1 displays the estimates and standard errors of the fits.
The differences-in-fits diagnostics were successful for this data set. As displayed on the plot in Panel B of Figure 3.13.3, TDBETAS(W, HBR) = 1324, which far exceeds the benchmark value of 0.853 and indicates that the Wilcoxon and HBR fits differ substantially. The plot in Panel B consists of the diagnostics CFITSi(W, HBR) versus Case i. The 14 largest values of CFITSi(W, HBR) are for the 14 outlying cases. Recall that the Wilcoxon fit incorrectly fit the 4 good leverage points. So it is reassuring to see that all 14 outlying cases were correctly identified. Further, the gap between these 14 CFITSi(W, HBR) values and those of the other cases would lead one to consider a fit based on the other 61 cases.
Assuming that the matrix x1 is the design matrix (not including the intercept), the following ww code obtains the fits and the diagnostics:

fit.ls = lm(y~x1)
fit.wl = wwest(x1,y)
fit.hbr = wwest(x1,y,bij="HBR")
fdwilhbr = fitdiag(x1,y,est=c("WIL","HBR"))
fdwilhbr$tdbeta
fdwilhbr$bmtd
cfit = fdwilhbr$cfit
fdwills = fitdiag(x1,y,est=c("WIL","LS"))
fdlshbr = fitdiag(x1,y,est=c("LS","HBR"))

[Figure 3.13.3: Panel A: GR residuals versus GR fitted values; Panel B: Standardized GR residuals versus Case.]
3.14 Rank-Based Procedures for Nonlinear Models

Consider the nonlinear model

Yi = fi(θ0) + εi ,  i = 1, . . . , n ,   (3.14.1)
θ̂W,n = argmin_{θ∈Θ} ‖Y − f(θ)‖W .   (3.14.2)
We assume that fi(θ) is defined and continuous for all θ ∈ Θ, for all i. It then follows that the dispersion function is a continuous function of θ and, hence, since Θ is compact, that the Wilcoxon estimate θ̂W,n exists.
To state the asymptotic properties of the LS and R nonlinear estimates,
certain assumptions are required. These are discussed in detail in Abebe and
McKean (2007). We do note the analog of Assumption D.3, (3.4.8), for the
linear model; that is, the sequence of matrices

n⁻¹ Σ_{i=1}^n {∇fi(θ0)}{∇fi(θ0)}′   (3.14.3)

converges to a positive definite matrix Σ(θ0), where ∇fi(θ) is the p × 1 gradient of fi(θ) with respect to θ. Under these assumptions, Abebe and McKean (2007) showed that θ̂W,n converges in probability to θ0. They then derived the asymptotic distribution of θ̂W,n. Similar to the derivation in the linear model case of Section 3.5, this involves a pseudo-linear model.
Consider the local linear model (3.14.4), in which, for i = 1, . . . , n, the pseudo-responses Yi* and pseudo-errors ei*(θ) are defined in terms of the gradients ∇fi(θ0).
Note that the probability density function of the errors of Model (3.14.4) is h(t), i.e., the density function of εi. Define the corresponding Wilcoxon dispersion function as

Dn*(θ) ≡ [2n(n + 1)]⁻¹ Σ_{i<j} |ei*(θ) − ej*(θ)| .   (3.14.5)
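For concreteness, (3.14.5) can be computed directly from the pairwise differences of the pseudo-residuals; the following R sketch (an O(n²) illustration of the definition, not the algorithm used by rank-based fitting software) does so:

wil.disp.pairwise <- function(e) {
  # D_n*(theta) of (3.14.5): scaled sum of |e_i*(theta) - e_j*(theta)| over i < j
  n <- length(e)
  d <- abs(outer(e, e, "-"))
  sum(d[lower.tri(d)]) / (2 * n * (n + 1))
}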
Furthermore, let

θ̂n = argmin_{θ∈Θ} Dn*(θ) .   (3.14.6)
It then follows (see Exercise 3.15.43) that
√n(θ̂n − θ0) →_D Np(0, τ²Σ(θ0)) ,   (3.14.7)

where τ = 1/(√12 ∫ h²(t) dt); see (3.4.4). Abebe and McKean (2007) show that √n(θ̂W,n − θ̂n) → 0, in probability; hence, we have the asymptotic distribution for the Wilcoxon estimator, which we state in the next theorem.
Using the pseudo model and Section 3.5, a useful asymptotic representation
of the Wilcoxon estimate is given by:
√n(θ̂W,n − θ0) = τ (n⁻¹X*′X*)⁻¹ n^{−1/2} X*′ ϕ[H(Y* − X*θ0)] + op(1) ,   (3.14.10)

where X* is the n × p matrix whose ith row is {∇fi(θ0)}′ and Y* is the vector of pseudo-responses.
3.14.1 Implementation
To implement the asymptotic inference based on the Wilcoxon estimate, we need a consistent estimator of the variance-covariance matrix. Define the statistic Σ(θ̂W,n) to be Σ(θ0) of expression (3.14.3) with θ0 replaced by θ̂W,n. By the consistency of θ̂W,n for θ0, Σ(θ̂W,n) converges in probability to Σ(θ0). Next, it follows from the asymptotic representation (3.14.10) that the estimator, (3.7.8), of τ proposed by Koul et al. (1987) for linear models is also a consistent estimator of τ for our nonlinear model. We denote this estimator by τ̂. Thus τ̂²Σ(θ̂W,n) is a consistent estimator of the asymptotic variance-covariance matrix of θ̂W,n.
Estimation Algorithm

Similar to the LS estimates for nonlinear models, a Gauss-Newton type of algorithm can be used to obtain the Wilcoxon fit. Recall that this is an iterated algorithm which uses the Taylor series expansion of the function f(θ), evaluated at the current estimate, to obtain the estimate at the next iteration. Thus each iteration consists of fitting a linear model. Abebe and McKean (2007) show that this algorithm for obtaining the Wilcoxon fit converges in a probability sense. Using this algorithm, all that is required to compute the Wilcoxon fit of the nonlinear model is a routine for rank-based fits of linear models.
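A minimal R sketch of such a Gauss-Newton iteration, under the assumptions above, follows. The inner Wilcoxon linear fit is carried out here by direct minimization of the Wilcoxon dispersion with optim(); in practice a dedicated rank-based linear fitter (such as wwest of the ww package used earlier) would be substituted. All function names are ours, for illustration only.

wil.disp <- function(b, X, y) {
  # Wilcoxon dispersion of the residuals e = y - X b, using the Wilcoxon
  # scores a(i) = sqrt(12) * (i/(n + 1) - 1/2)
  e <- as.vector(y - X %*% b)
  n <- length(e)
  sum(sqrt(12) * (rank(e) / (n + 1) - 0.5) * e)
}

wil.gauss.newton <- function(f, grad, y, theta0, tol = 1e-6, maxit = 50) {
  theta <- theta0
  for (it in 1:maxit) {
    X <- grad(theta)      # n x p matrix of gradients at the current estimate
    r <- y - f(theta)     # working responses for the linearized model
    # the Wilcoxon fit (through the origin) of r on X is the Gauss-Newton increment
    delta <- optim(rep(0, length(theta)), wil.disp, X = X, y = r)$par
    theta <- theta + delta
    if (sqrt(sum(delta^2)) < tol) break
  }
  theta
}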
Table 3.14.1: Wilcoxon and LS Estimates Based on the Original Data with Standard Errors (SE) and the Wilcoxon Estimates Based on the Data with the Substituted Gross Outlier

               Original Data Set                    Outlier Data Set
      Wil. Est.     SE     LS Est.      SE       Wil. Est.     SE
θ1      0.1902   0.0161     0.1903   0.0219        0.1897   0.0161
θ2      0.0061   0.0002     0.0061   0.0003        0.0061   0.0002
θ3      0.0197   0.0006     0.0105   0.0008        0.0107   0.0006
Example 3.14.1 (Chwirut’s Data). These data are taken from the ultrasonic
block reference study by Chwirut (1979). The response variable is ultrasonic
response and the predictor variable is metal distance. The study involved 214
observations. The model under consideration is

fi(θ) ≡ f(xi; θ1, θ2, θ3) ≡ exp[−θ1 xi] / (θ2 + θ3 xi) ,  i = 1, . . . , 214 .
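In terms of the Gauss-Newton sketch given in Section 3.14.1 above, this model and its analytic gradient might be coded as follows (a hypothetical illustration; x and y denote the metal-distance and ultrasonic-response vectors, and the starting values are arbitrary):

f    <- function(th) exp(-th[1] * x) / (th[2] + th[3] * x)
grad <- function(th) {
  d <- th[2] + th[3] * x
  g <- exp(-th[1] * x)
  cbind(-x * g / d, -g / d^2, -x * g / d^2)  # partials w.r.t. th[1], th[2], th[3]
}
# theta.hat <- wil.gauss.newton(f, grad, y, theta0 = c(0.1, 0.01, 0.01))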
Using the Wilcoxon and LS fitting procedures, we fit the (original) data and
then a data set with one observation replaced by an outlier. Figure 3.14.1
displays the results of the fits.
For the original data, as shown in the figure and by the estimates given
in Table 3.14.1, the LS and Wilcoxon fits are quite similar. As shown in the
residual plots of Figure 3.14.1, there are several moderate outliers in the orig-
inal data set. These outliers have an impact on the LS estimate of scale, the square root of MSE, which has the value σ̂LS = 3.36. In contrast, the Wilcoxon estimate of τ is τ̂ = 2.45, which explains why the Wilcoxon standard errors in the table of estimates are smaller than those of LS.
For robustness considerations, we introduced a gross outlier in the response
space (observation 17 was changed from 8.025 to 5000). The Wilcoxon and LS
fits were obtained. As shown in Figure 3.14.1, the LS estimate essentially did
not converge. From the plot of the fitted models and residual plots, it is clear
that the Wilcoxon fit performs dramatically better than its LS counterpart. In Table 3.14.1 the Wilcoxon estimates are displayed with their standard errors. There is little difference between the Wilcoxon fits for the original data set and the data set with the gross outlier.
[Figure 3.14.1: Wilcoxon and LS fits and residual plots for the Chwirut data. Panels include residuals versus the predictor for the original data; (c) Wilcoxon and LS Fits: Original Data; (d) Wilcoxon Residual vs. Predictor: Outlier; (e) LS Residual vs. Predictor: Outlier; (f) Wilcoxon and LS Fits: Outlier.]
3.15 Exercises
3.15.1. For the baseball data in Example 3.3.2, explore other transformations
of the predictor Years in order to obtain a better fitting model than the one
discussed in the example.
(b) Determine the change in dispersion as β moves across one of these defining
planes.
(c) For the telephone data, Example 3.3.1, obtain the plot shown in Panel
D of Figure 3.3.1; i.e., a plot of the dispersion function D(β) for a set
of values β in the interval (−.2, .6). Locate the estimate of slope on the
plot.
(d) Plot the gradient function S(β) for the same set of values β in the interval
(−.2, .6). Locate the estimate of slope on the plot.
(a) Show that the gradient test statistic (3.5.8) simplifies to the square of the
standardized MWW test statistic (2.2.21).
(b) Show that the regression estimate of the slope parameter is the Hodges-
Lehmann estimator given by expression (2.2.18).
(c) Verify Parts (a) and (b) by fitting the data in the two-sample problem of
Exercise 2.13.45 as a regression model.
3.15.4. For the simple linear regression problem, if the values of the independent variable x are distinct and equally spaced, show that the Wilcoxon test statistic is equivalent to the test for correlation based on Spearman's r_s, where

r_s = Σ (R(xi) − (n+1)/2)(R(yi) − (n+1)/2) / [ √(Σ (R(xi) − (n+1)/2)²) √(Σ (R(yi) − (n+1)/2)²) ] .
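As a quick numerical check, r_s is just the Pearson correlation of the ranks; in R (for data without ties):

# agrees with cor(x, y, method = "spearman")
spearman.rs <- function(x, y) cor(rank(x), rank(y))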
3.15.5. For the simple linear regression model consider the process

T(β) = Σ_{i=1}^n Σ_{j=1}^n sgn(xi − xj) sgn((Yi − xiβ) − (Yj − xjβ)) .
(a) Show under the null hypothesis H0 : β = 0, that E(T (0)) = 0 and that
Var (T (0)) = 2(n − 1)n(2n + 5)/9.
(b) Determine the estimate of β based on inverting the test statistic T(0); i.e., the value of β which solves T(β) ≐ 0.
(c) Show that when the two-sample problem is written as a regression model,
(2.2.2), this estimate of β is the Hodges-Lehmann estimate (2.2.18).
Note: Kendall’s τ is a measure of association between xi and Yi given by
τ = T (0)/(n(n − 1)); see Chapter 4 of Hettmansperger (1984) for further
discussion.
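Numerically, T(0) and the associated Kendall's τ are easy to compute; the sketch below (our own helper, for illustration) can be checked against cor(x, y, method = "kendall") when there are no ties:

kendall.T0 <- function(x, y) {
  # double sum over all (i, j) of sgn(x_i - x_j) * sgn(y_i - y_j);
  # the i = j terms vanish and each pair is counted twice
  sum(sign(outer(x, x, "-")) * sign(outer(y, y, "-")))
}
# tau.hat <- kendall.T0(x, y) / (length(x) * (length(x) - 1))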
3.15.6. Show that the R estimate, β̂ϕ, is an equivariant estimator; that is, β̂ϕ(Y + Xδ) = β̂ϕ(Y) + δ and β̂ϕ(kY) = k β̂ϕ(Y).
3.15.7. Consider Model 3.2.1 and the hypotheses (3.2.5). Let ΩF denote the
column space of the full model design matrix X and let ΩR denote the subspace
of ΩF subject to H0 . Show that ΩR is a subspace of ΩF and determine its
dimension. Hint: One way of establishing the dimension is to show that C =
X(X′X)−1 M′ is a basis matrix for ΩF ∩ ΩcR .
3.15.8. Show that Assumptions (3.4.9) and (3.4.8) imply Assumption (3.4.7).
3.15.9. For the special case of Wilcoxon scores, obtain the proof of Theorem
3.5.2 by first getting the projection of the statistic S(0).
3.15.10. Assume that the errors ei in Model (3.2.2) have finite variance σ². Let β̂LS denote the least squares estimate of β. Show that √n(β̂LS − β) →_D Np(0, σ²Σ⁻¹). Hint: First show that the LS estimate is location and scale equivariant. Then, without loss of generality, we can assume that the true β is 0.
3.15.11. Under the additional assumption that the errors have a symmetric
distribution, show that R estimates are unbiased for all sample sizes.
3.15.12. Let ϕf (u) = −f ′ (F −1 (u))/f (F −1(u)) denote the optimal scores for
the density f (x) and suppose that f is symmetric. Show that ϕf (1 − u) =
−ϕf (u); that is, the optimal scores are odd about 1/2.
(d) Based on (3.15.3) and (3.15.4), show that FLS(σ²) has a limiting noncentral χ² distribution with noncentrality parameter given by expression (3.6.27).

(e) Obtain the final result by showing that σ̂² is a consistent estimate of σ² under the sequence of Models (3.6.22).
3.15.19. Show that De, (3.6.28), is a scale parameter; i.e., De(F_{ae+b}) = |a| De(Fe).
3.15.20. Establish expression (3.6.33).
3.15.21. Suppose Wilcoxon scores are used.
(a) Establish the expressions (3.6.34) and (3.6.35).
i i
i i
i i
(a) Show that the optimal rank score function is given by expression (3.10.6).
(b) Show that the asymptotic relative efficiency between the Wilcoxon anal-
ysis and the rank-based analysis based on the optimal scores for the
distribution GF (2m1 , 2m2 ) is given by expression (3.10.8).
3.15.32. Suppose the errors have density function (3.15.5).
(a) Show that the optimal scores are given by expression (3.10.7).
(b) Show that the asymptotic relative efficiency of the Wilcoxon analysis to
the rank analysis based on the optimal rank score function for the density
(3.15.5) is given by expression (3.10.9).
3.15.33. The definition of the modulus of a matrix A is given in expression
(3.11.6). Verify the three properties concerning the modulus of a matrix listed
in the text following this definition.
3.15.34. Consider Example 3.11.1. If Wilcoxon scores are used, show that Dy = √(3/4) E|Y1 − Y2|, where Y1, Y2 are iid with distribution function G, and that De = √(3/4) E|e1 − e2|, where e1, e2 are iid with distribution function F. Next assume that sign scores are used. Show that Dy = E|Y − med Y|, where med Y denotes the median of Y; likewise De = E|e − med e|. Using this, verify the expressions for Dy, De, and τϕ found in the example.
3.15.38. For the baseball data given in Exercise 1.12.33, consider the variables
height and weight.
3.15.39. Consider the model

Y = X*β* + e ,   (3.15.6)

where X* is n × p and its column space Ω*F does not include 1. This model is often called regression through the origin. Note for the pseudo-norm ‖·‖ϕ that

‖Y − X*β*‖ϕ = Σ_{i=1}^n a(R(yi − x*i′β*))(yi − x*i′β*)   (3.15.7)
            = Σ_{i=1}^n a(R(yi − (x*i − x̄*)′β*))(yi − (x*i − x̄*)′β*)
            = Σ_{i=1}^n a(R(yi − α − (x*i − x̄*)′β*))(yi − α − (x*i − x̄*)′β*) ,

where x*i is the ith row of X* and x̄* is the vector of column averages of X*.
Based on this result, the estimate of the regression coefficients based on the R fit of Model (3.15.6) estimates the regression coefficients of the centered model, i.e., the model with the design matrix X = X* − H1X*. Hence, in general, the parameter β* is not estimated. This problem also occurs in a weighted regression model. Dixon and McKean (1996) proposed the following solution. Assume that (3.15.6) is the true model, but obtain the R fit of the model:
Y = 1α1 + X*β*1 + e = [1  X*] (α1, β*1′)′ + e ,   (3.15.8)
where the true α1 is 0. Let X1 = [1 X*] and let Ω1 denote the column space of X1. Let Ŷ1 = 1α̂1 + X*β̂*1 denote the R fitted value based on the fit of Model (3.15.8). Note that Ω* ⊂ Ω1. Let Ŷ* = H_{Ω*}Ŷ1 be the projection of this fitted value onto the desired space Ω*. Finally, estimate β* by solving the equation

X*β̂* = Ŷ* .   (3.15.9)
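Computationally, (3.15.9) amounts to projecting the fitted values of the intercept fit onto the column space of X* and reading off the coefficients; a minimal R sketch (yhat1 is assumed to hold the R fitted values Ŷ1 from the fit of (3.15.8)) is:

# since the projection of yhat1 lies in col(Xstar), solving (3.15.9) is the
# least squares regression of yhat1 on Xstar
beta.origin <- function(Xstar, yhat1) qr.coef(qr(Xstar), yhat1)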
(b) Assume that the density function of the errors is symmetric, that the R score function is odd about 1/2, and that the intercept α1 is estimated by solving the equation T⁺(êR − α) ≐ 0, as discussed in Section 3.5.2. Under these assumptions, show that
(c) Next, suppose that the intercept is estimated by the median of the residuals from the R fit of (3.15.8). Using the asymptotic representations (3.5.23) and (3.5.22), show that the asymptotic representation of β̂* is given by

β̂* = τS (X*′X*)⁻¹X*′H1 sgn(e) + τϕ (X*′X*)⁻¹X*′HX ϕ[F(e)] + op(1/√n) .   (3.15.11)

Use this result to show that the asymptotic variance of β̂* is given by

AsyVar(β̂*) = τS² (X*′X*)⁻¹X*′H1X*(X*′X*)⁻¹ + τϕ² (X*′X*)⁻¹X*′HXX*(X*′X*)⁻¹ .   (3.15.12)
(d) Show that the invariance to x̄* exhibited in (3.15.7) is true for any pseudo-norm.
3.15.40. The data in Table 3.15.1 are presented in Graybill and Iyer (1994).
The dependent variable is the weight (in grams) of a crystalline form of a
certain chemical compound while the independent variable is the length of
time (in hours) that the crystal was allowed to grow. A model of interest is
the regression through the origin model (3.15.6). Obtain the R estimate of β* for these data using the procedure described in (3.15.9). Compare this fit with the R fit of the model that includes an intercept.
v′Bv = ‖WXu‖² − ‖HWXu‖² ≥ 0 ,
3.15.42. Consider the linear model (3.2.3). Show that expression (3.12.1) is
a pseudo-norm.
3.15.43. Consider the pseudo-linear model (3.14.4) of Section 3.14. For the
Wilcoxon pseudo estimator, obtain the asymptotic result (3.14.7).
3.15.44. By filling in the brief sketch below, write out the Gauss-Newton algorithm for the Wilcoxon estimate of the nonlinear model (3.14.1). Let θ̂0 be an initial estimate of θ. Let f0 = f(θ̂0). Write the norm to be minimized as ‖Y − f‖W = ‖Y − f0 + [f0 − f]‖W. Then use a Taylor series of order 1 to approximate the term in brackets. The increment for the next step estimate is the Wilcoxon estimate of this approximate linear model with Y − f0 as the dependent variable. For actual implementation, discuss why the regression through the origin algorithm of Exercise 3.15.39 is usually necessary here.
3.15.46. Consider the influence function of the HBR estimate given in expression (A.5.22). If the weights for the residuals and the x's are both set at 1, show that the influence function of the HBR estimate simplifies to the influence function of the Wilcoxon estimate given in (3.5.17).
Chapter 4

Experimental Designs: Fixed Effects
4.1 Introduction
In this chapter we discuss rank-based inference for experimental designs based
on the theory developed in Chapter 3. We concentrate on factorial type de-
signs and analysis of covariance designs but, based on our discussion, it is
clear how to extend the rank-based analysis for any fixed effects design. For
example, based on this rank-based inference Vidmar and McKean (1996) de-
veloped a response surface methodology which is quite analogous to the tra-
ditional response surface methods. We discuss estimation of effects, tests of
linear hypotheses concerning effects, and multiple comparison procedures. We
illustrate this rank-based inference with numerous examples. One purpose of
our discussion is to show how this rank-based analysis is analogous to the tra-
ditional analysis based on least squares. In Section 4.2.5 we introduce pseudo-
observations which are based on an R fit of the full model. We show that
the rank-based analysis (Wald-type) can be obtained by substituting these
pseudo-observations in place of the responses in a package that obtains the
traditional analysis. We begin with the one-way design.
In our development we apply rank scores to residuals. In this sense our
methods are not pure rank statistics; but they do provide consistent and highly
efficient tests for traditional linear hypotheses. The rank transform method is
a pure rank test and it is discussed in Section 4.7 where we describe various
drawbacks to the approach for testing traditional linear hypotheses in linear
models. Brunner and his colleagues have successfully developed a general ap-
proach to testing in designed experiments based on pure ranks, although the
hypotheses of their approach are generally not linear hypotheses. Brunner and
Puri (1996) provide an excellent survey of these pure rank tests. We do not pursue them further in this book.
where the eij's are iid random variables with density f(x) and distribution function F(x), and the parameter µi is a convenient location parameter (for example, the mean or median). Let T(F) denote the location functional. Assume, without loss of generality, that T(F) = 0. Let ∆ii′ denote the shift between the distributions of Yij and Yi′l. Recall from Chapter 2 that the parameters ∆ii′ are invariant to the choice of location functional and that ∆ii′ = µi − µi′. If µi is the mean of the Yij then Hocking (1985) calls this the means model. If µi is the median of the Yij then we call it the medians model; see Section 4.2.4 below.
Observational studies can also be modeled this way. Suppose k independent
samples are drawn from k different populations. If we assume further that the
distributions for the different populations differ by at most a shift in locations
then Model (4.2.1) is appropriate. But as in all observational studies, care
must be taken in the interpretation of the results of the analyses.
While the parameters µi fix the locations, the parameters of interest in this chapter are contrasts of the form h = Σ_{i=1}^k ciµi where Σ_{i=1}^k ci = 0.
Similar to the shift parameters, contrasts are invariant to the choice of location functional. In fact, contrasts are linear functions of these shifts; i.e.,

h = Σ_{i=1}^k ciµi = Σ_{i=1}^k ci(µi − µ1) = Σ_{i=2}^k ci∆i1 = c1′∆1 ,   (4.2.2)

where c1′ = (c2, . . . , ck) and

∆1 = (∆21, . . . , ∆k1)′   (4.2.3)

is the vector of location shifts from the first cell.
the theory of Chapter 3, we often use ∆1 which references cell 1. But picking
cell 1 is only for convenience and similar results hold for the selection of any
other cell.
As in Chapter 2, we can write this model as a linear model as follows. Let Z′ = (Y11, . . . , Y1n1, . . . , Yk1, . . . , Yknk) denote the vector of all observations, let µ′ = (µ1, . . . , µk) denote the vector of locations, and let n = Σ ni denote the total sample size. The model can then be expressed as a linear model of the form
Z = Wµ + e , (4.2.4)
where e denotes the n × 1 vector of random errors eij and the n × k design matrix W is the appropriate incidence matrix of 0s and 1s; i.e.,

W = diag{1n1, 1n2, . . . , 1nk} ,   (4.2.5)

the block diagonal matrix whose ith diagonal block is the ni × 1 vector of ones.
Note that the vector 1n is in the column space of W; hence, the theory derived
in Chapter 3 is valid for this model.
At times it is convenient to reparameterize the model in terms of a vector
of shift parameters. For the vector ∆1 , let W1 denote the last k − 1 columns
of W and let X be the centered W1 ; i.e., X = (I − H1 )W1 , where H1 =
1(1′ 1)−1 1′ = n−1 11′ and 1′ = (1, . . . , 1) . Then we can write Model (4.2.4) as
Z = α1 + X∆1 + e , (4.2.6)
where ∆1 is given in expression (4.2.3). It is easy to show that for any matrix
[1|X∗ ], having the same column space as W, that its corresponding non-
intercept parameters are linear functions of the shifts and, hence, are invariant
to the selected location functional. The relationship between Models (4.2.4) and (4.2.6) is explored further in Section 4.2.4.
The symbol DE stands for the dispersion of the errors and is analogous to the LS sum of squared errors, SSE. Upon fitting such a model, a residual analysis
as discussed in Section 3.9 should be conducted to assess the goodness of fit
of the model.
Example 4.2.1 (LDL Cholesterol of Quail). Thirty-nine quail were randomly
assigned to four diets, each diet containing a different drug compound, which,
hopefully, would reduce low density lipid (LDL) cholesterol. The drug com-
pounds are labeled: I, II, III, and IV. At the end of the prescribed experimental
time, the LDL cholesterol of each quail was measured. The data are displayed in the comparison boxplots found in Panel A of Figure 4.2.1; for convenience, we have also placed the data at the url site listed in the Preface. From these boxplots, it appears that Drug Compound II was more effective than the other three in lowering LDL. The data appear to be positively skewed with a long right tail. We fitted the data using Wilcoxon scores, ϕ(u) = √12(u − 1/2), and adjusted the residuals to have median 0. Panel B of Figure 4.2.1 displays the q−q plot of the Wilcoxon Studentized residuals. The long right tail of the error distribution is apparent from this plot, as are the six outliers. The estimates of τϕ and τS are
19.19 and 21.96, respectively. For comparison the LS estimate of σ was 30.49.
The Wilcoxon and LS estimates of the cell locations along with their standard
errors are:
Drug Wilcoxon Fit LS Fit
Compound Est. SE Est. SE
I 67.0 6.3 74.5 9.6
II 42.0 6.3 52.3 9.6
III 63.0 6.3 73.8 9.6
IV 62.0 6.6 67.6 10.1
The Wilcoxon and LS estimates of the location levels are quite different, as
they should be since they estimate different functionals under asymmetric
errors. The long right tail has drawn out the LS estimates. The standard
errors of the Wilcoxon estimates are much smaller than their LS counterparts.
Figure 4.2.1: Panel A: Comparison boxplots for data of Example 4.2.1; Panel
B: Wilcoxon internal R Studentized residual normal q−q plot.
The data set in the last example was taken from a much larger study dis-
cussed in McKean, Vidmar, and Sievers (1989). Most of the data in that study
exhibited long right tails. The left tails were also long; hence, transformations
such as logarithms were not effective. Scores more appropriate for positively
skewed data were used with considerable success in this study. These scores
are briefly discussed in Example 2.5.1.
H0: Ik−1∆1 = 0; that is, all the regression coefficients are zero. We discuss two rank-based tests for this hypothesis.
One appropriate test statistic is the gradient test statistic, (3.5.8), which is given by

T = σa⁻² S(Z)′(X′X)⁻¹S(Z) ,   (4.2.15)

where S(Z)′ = (S2(Z), . . . , Sk(Z)) for

Si(Z) = Σ_{j=1}^{ni} a(R(Zij)) .   (4.2.16)
H0: Mµ = 0 versus HA: Mµ ≠ 0 ,   (4.2.25)
Example 4.2.3 (Poland China Pigs). This data set, presented on page 87 of Scheffé (1959), concerns the birth weights of Poland China pigs in eight litters. For convenience we have tabled the data in Table 4.2.4. There are 56 pigs in the eight litters. The sample sizes of the litters vary from 4 to 10.
In Exercise 4.8.3 a residual analysis is conducted of this data set and the
hypothesis (4.2.13) is tested. Here we are only concerned with the following
i i
i i
i i
contrast suggested by Scheffé. Assume that the litters 1, 3, and 4 were sired
by one boar while the other litters were sired by another boar. The contrast
of interest is that the average litter birthweights of the pigs sired by the boars
are the same; i.e., H0 : h = 0 where
h = (1/3)(µ1 + µ3 + µ4) − (1/5)(µ2 + µ5 + µ6 + µ7 + µ8) .   (4.2.26)
For this hypothesis, the matrix M of expression (4.2.25) is given by [5 −3 5 5 −3 −3 −3 −3]. The value of the LS F-test statistic is 11.19, while
Fϕ = 15.65. There are 1 and 48 degrees of freedom for this hypothesis so both
tests are highly significant. Hence both tests indicate a difference in average
litter birthweights of the boars. The reason Fϕ is more significant than FLS is
clear from the residual analysis found in Exercise 4.8.3.
Table 4.2.5: All Pairwise 95% Wilcoxon Confidence Intervals for the Quail
Data
Difference Estimate Confidence Interval
µ2 − µ1 -25.0 (−42.7, −7.8)
µ2 − µ3 -21.0 (−38.6, −3.8)
µ2 − µ4 -20.0 (−37.8, −2.0)
µ1 − µ3 4.0 (−13.41, 21.41)
µ1 − µ4 5.0 (−12.89, 22.89)
µ3 − µ4 1.0 (−16.89, 18.89)
h = Mµ = M1 ∆1 . (4.2.28)
Note that the only difference for the LS fit is that σ² would be substituted for τϕ². Expressions (4.2.28) and (4.2.29) are the basic relationships used by the pseudo-observations discussed in Section 4.2.5.
To illustrate these relationships, suppose we want a confidence interval for
µi − µi′ . Based on expression (4.2.29), an asymptotic (1 − α)100% confidence
interval is given by
µ̂i − µ̂i′ ± t(α/2, n−k) τ̂ϕ √(1/ni + 1/ni′) ;   (4.2.30)
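In R, (4.2.30) is immediate; the wrapper below is our own illustrative helper:

ci.diff <- function(mui, muip, tauhat, ni, nip, n, k, alpha = 0.05) {
  # asymptotic (1 - alpha)100% confidence interval (4.2.30) for mu_i - mu_i'
  me <- qt(1 - alpha / 2, n - k) * tauhat * sqrt(1 / ni + 1 / nip)
  c(lower = (mui - muip) - me, upper = (mui - muip) + me)
}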
Medians Model
Suppose we are interested in estimates of the level locations themselves. We
first need to select a location functional. For the discussion we use the median;
although, for any other functional, only a change of the scale parameter τS
is necessary. Assume then that the R residuals have been adjusted so that
their median is zero. As discussed above, (4.2.10), the estimate of µi is Ŷij, for any j = 1, . . . , ni, where Ŷij is the fitted value of Yij. Let µ̂ = (µ̂1, . . . , µ̂k)′. Further, µ̂ is asymptotically normal with mean µ, and the asymptotic variance of µ̂i is given in expression (4.2.11). As Exercise 4.8.4 shows, the asymptotic covariance between estimates of location levels, for i ≠ i′, is given by expression (4.2.31). As Exercises 4.8.4 and 4.8.18 show, expressions (3.9.38) and (4.2.31) lead to a verification of the confidence interval (4.2.30).
Note that if the scale parameters are the same, say, τS = τϕ = κ then the
approximate variance reduces to κ2 /ni and the covariances are 0. Hence, in
this case, the estimates µ bi are asymptotically independent. This occurs in the
following two ways:
1. For the fit of Model (4.2.4) use a score function ϕ(u) which satisfies (S2)
and use the location functional based on the corresponding signed-rank
score function ϕ+ (u) = ϕ((u + 1)/2). The asymptotic theory, though,
requires the assumption of symmetric errors. If the Wilcoxon score func-
tion is used then the location functional would result in the residuals
being adjusted so that the median of the Walsh averages of the adjusted
residuals is 0.
2. Use the l1 score function ϕS (u) = sgn(u − (1/2)) to fit Model (4.2.4) and
use the median as the location functional. This of course is equivalent
to using an l1 fit on Model (4.2.4). The estimate of µi is then the cell
median.
4.2.5 Pseudo-observations
We next discuss a convenient way to estimate and test contrasts once an R fit of Model (4.2.4) is obtained. Let Ẑ denote the R fit of this model, let ê denote the vector of residuals, let a(R(ê)) denote the vector of scored residuals, and let τ̂ϕ be the estimate of τϕ. Let HW denote the projection matrix onto the column space of the incidence matrix W. Because of (3.2.13), the fact that 1n is in the column space of W, and the fact that the scores sum to 0, we get

HW a(R(ê)) = 0 .   (4.2.32)
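A minimal R sketch of forming pseudo-observations from a Wilcoxon fit follows. It assumes the usual construction of the pseudo-observations as the fitted values plus τ̂ϕ times the scored residuals; the vectors zhat and ehat and the scalar tauhat are taken from the rank-based fit, and all names are illustrative:

wilcoxon.scores <- function(e) {
  # Wilcoxon scores a(R(e_i)) = sqrt(12) * (R(e_i)/(n + 1) - 1/2)
  n <- length(e)
  sqrt(12) * (rank(e) / (n + 1) - 0.5)
}
pseudo.obs <- function(zhat, ehat, tauhat) zhat + tauhat * wilcoxon.scores(ehat)

The resulting pseudo-observations can then be passed to a traditional LS routine, e.g., anova(lm(ztilde ~ level)), to obtain the Wald-type rank-based analysis.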
The execution of this command returned the value Fϕ,Q = 3.45 with a p-value of .027, which is close to the result based on the Fϕ-statistic.
In this section we say that this confidence interval has experiment error
rate α. As Exercise 4.8.8 illustrates, simultaneous confidence for several such
intervals can easily slip well below 1 − α. The error rate for a simultaneous
confidence procedure is called its family error rate.
We next describe six robust multiple comparison procedures for the problem of all pairwise comparisons. The error rates for them are based on asymptotics. But note that the same is true for MCPs based on least squares when the normality assumption is not valid. Sufficient Minitab code is given to demonstrate how easily these procedures can be performed.
The F -test that appears in the AOV table upon execution of these com-
mands is Wald’s test statistic Fϕ,Q for the hypotheses (4.2.13). Recall
from Chapter 3 that it is asymptotically equivalent to Fϕ under the null
and local hypotheses.
3. Tukey's T Procedure. This is a multiple comparison procedure for the set of all contrasts, h = Σ_{i=1}^k ciµi where Σ_{i=1}^k ci = 0. Assume that the sample sizes for the levels are the same, say, n1 = · · · = nk = m. The basic geometric fact for this procedure is the following equivalence due to Tukey (see Miller, 1981): for t > 0,

max_{1≤i,i′≤k} |(µ̂i − µi) − (µ̂i′ − µi′)| ≤ t   ⟺   (4.3.4)

Σ_{i=1}^k ciµ̂i − (t/2) Σ_{i=1}^k |ci| ≤ Σ_{i=1}^k ciµi ≤ Σ_{i=1}^k ciµ̂i + (t/2) Σ_{i=1}^k |ci| ,

for all contrasts Σ_{i=1}^k ciµi with Σ_{i=1}^k ci = 0. Hence, to obtain simultaneous confidence intervals for the set of all contrasts, we need the distribution of the left side of this inequality. But first note that
(µ̂i − µi) − (µ̂i′ − µi′) = {(µ̂i − µi) − (µ̂1 − µ1)} − {(µ̂i′ − µi′) − (µ̂1 − µ1)}
                        = (∆̂i1 − ∆i1) − (∆̂i′1 − ∆i′1) .

Hence, we need only consider the asymptotic distribution of ∆̂1, which by (4.2.19) is Nk−1(∆1, (τϕ²/m)[I + J]).
Recall that if v1, . . . , vk are iid N(0, σ²), then max_{1≤i,i′≤k} |vi − vi′|/σ has the Studentized range distribution with k and ∞ degrees of freedom. But we can write this random variable as

max_{1≤i,i′≤k} |vi − vi′| = max_{1≤i,i′≤k} |(vi − v1) − (vi′ − v1)| .
Hence we need only consider the random vector of shifts v1′ = (v2 − v1, . . . , vk − v1) to determine the distribution. But v1 has distribution Nk−1(0, σ²[I + J]). Based on this, it follows from the asymptotic distribution of ∆̂1 that if we substitute qα;k,∞ τϕ/√m for t in expression (4.3.4), where qα;k,∞ is the upper α critical value of a Studentized range distribution with k and ∞ degrees of freedom, then the asymptotic probability of the resulting expression is 1 − α.
The parameter τϕ, though, is unknown and must be replaced by an estimate. In the Tukey T procedure for LS, the parameter is σ. The usual estimate s of σ is such that, if the errors are normally distributed, then the random variable (n − k)s²/σ² has a χ² distribution and is independent of the LS location estimates. In this case the Studentized range distribution with k and n − k degrees of freedom is used. If the errors are not normally distributed then this distribution leads to an approximate simultaneous confidence procedure. We proceed similarly for the procedure based on the robust estimates. Replacing t in expression (4.3.4) by qα;k,n−k τ̂ϕ/√m, where qα;k,n−k is the upper α critical value of a Studentized range distribution with k and n − k degrees of freedom, yields an approximate simultaneous confidence procedure for the set of all contrasts. As discussed before, though, small sample studies have shown
that the Student t-distribution works well for inference based on the
robust estimates. Hopefully these small sample properties carry over to
the approximation based on the Studentized range distribution. Further
research is needed in this area.
Tukey's procedure requires that the level sample sizes be the same, which is frequently not the case in practice. A simple adjustment due to Kramer (1956) results in the simultaneous confidence intervals

µ̂i − µ̂i′ ± (1/√2) qα;k,n−k τ̂ϕ √(1/ni + 1/ni′) .   (4.3.5)
These intervals have approximate family error rate α. This approxima-
tion is often called the Tukey-Kramer procedure.
In summary, the R Tukey-Kramer procedure declares

levels i and i′ differ if |µ̂i − µ̂i′| ≥ (1/√2) qα;k,n−k τ̂ϕ √(1/ni + 1/ni′) .   (4.3.6)
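In R, the cutoff in (4.3.6) can be computed with the built-in Studentized range quantile function qtukey(); the wrapper name is ours:

tk.cutoff <- function(tauhat, ni, nip, n, k, alpha = 0.05) {
  # right-hand side of (4.3.6): Studentized range critical value times the
  # standard error of the difference of the R estimates
  qtukey(1 - alpha, k, n - k) / sqrt(2) * tauhat * sqrt(1 / ni + 1 / nip)
}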
The asymptotic family error rate for this procedure is approximately α. To obtain these R Tukey intervals in Minitab, assume that the pseudo-observations, Ỹij, are in column 10, the corresponding levels, i, are in column 11, and the constant α is in k1. Then the following two lines of Minitab code obtain the intervals:
Tables for this critical value at the 5% and 1% levels are provided in
Miller (1981). The separate ranking procedure declares
levels i and i′ differ if Ri·^(i′) ≥ cα;m,k or Ri′·^(i) ≥ cα;m,k .   (4.3.12)

This procedure has an approximate family error rate of α and was developed independently by Steel (1960) and Dwass (1960).
An approximate level α test of the hypotheses (4.2.13) is given by

Reject H0 if max_{1≤i,i′≤k} Ri·^(i′) ≥ cα;m,k ,   (4.3.13)
although as noted for the last procedure the Kruskal-Wallis test is the
usual choice in practice.
Corresponding simultaneous confidence intervals can be constructed similarly to the confidence intervals developed in Chapter 2 for a shift in locations based on the MWW statistic. For the confidence interval for the ith and i′th samples corresponding to the test (4.3.12), first form the
differences between the two samples, say,

Dkl^{ii′} = Yik − Yi′l ,  1 ≤ k, l ≤ m .
Let D(1), . . . , D(m²) denote the ordered differences. Note here that the critical value cα;m,k is for the sum of the ranks and not for statistics of the form SR⁺, (2.4.2); recall that these versions of the Wilcoxon statistic differ by the constant m(m + 1)/2. The confidence interval is then formed from the appropriate ordered differences, as in Chapter 2. It follows that this set of confidence intervals, over all pairs of levels i and i′, forms a set of simultaneous 1 − α confidence intervals. Using the iterative algorithm discussed in Section 3.7.2, the differences need not actually be formed.
4.3.1 Discussion
We have presented robust analogues to three of the most popular multiple
comparison procedures: the Bonferroni, Fisher’s protected least significant
difference, and the Tukey T method. These procedures provide the user with
estimates of the most interesting parameters in these experiments, namely the
simple contrasts between treatment effects, and estimates of standard errors
with which to assess these contrasts. The robust analogues are straightforward.
Replace the LS estimates of the effects by the robust estimates and replace the
estimate of σ by the estimate of τϕ . Furthermore, these robust procedures can
easily be obtained by using the pseudo-observations as discussed in Section
4.2.5. Hence, the asymptotic relative efficiency between the LS-based MCP
and its robust analogue is the same as the ARE between the LS estimator
and robust estimator, as discussed in Chapters 1-3. In particular if Wilcoxon
scores are used, then the ARE of the Wilcoxon MCP to that of the LS MCP
is .955 provided the errors are normally distributed. For error distributions
with longer tails than the normal, the Wilcoxon MCP is generally much more
efficient than its LS MCP counterpart.
The theory behind the robust MCPs is asymptotic, hence, the error rates
are approximate. But this is true also for the LS MCPs when the errors are
not normally distributed. Verification of the validity and power of both LS
and robust MCPs is based on small sample studies. The small sample study
by McKean et al. (1989) demonstrated that the Wilcoxon Fisher PLSD had
the same validity as its LS counterpart over a variety of error distributions for
a one-way design. For normal errors, the LS MCP had slightly more empirical
power than the Wilcoxon. Under error distributions with heavier tails than
the normal, though, the empirical power of the Wilcoxon MCP was larger
than the empirical power of the LS MCP.
The decision as to which MCP to use has long been debated in the litera-
ture. It is not our purpose here to discuss these issues. We refer the reader to
books devoted to MCPs for discussions on this topic; see, for example, Miller
(1981) and Hsu (1996). We do note that, besides τϕ replacing σ, the error
part of the robust MCP is the same as that of LS; hence, arguments that one
procedure dominates another in a certain situation holds for the robust MCP
as well as for LS.
There has been some controversy on the two simultaneous rank-based test-
ing procedures that we presented: pairwise tests based on joint rankings and
pairwise tests based on separate rankings. Miller (1981) and Hsu (1996) both
favor the tests based on separate rankings because in the separate rankings
procedure the comparison between two levels is not influenced by any infor-
mation from the other levels which is not the case for the procedure based on
joint rankings. They point out that this is true of the LS procedure, also, since
the comparison between two levels is based only on the difference in sample
means for those two levels, except for the estimate of scale. However, Lehmann
(1975) points out that the joint ranking makes use of all the information in
the experiment while the separate ranking procedure does not. The spacings between all the points are information that is utilized by the joint ranking procedure and that is lost in the separate ranking procedure. The quail data, Example 4.3.1, are illustrative. The separate ranking procedure did quite poorly on this data set. The sample sizes are moderate and, in comparisons where half of the information is lost, the outliers impaired the procedure. In contrast,
the joint ranking procedure came close to declaring drug compounds 1 and
2 different. Consider also the LS procedure on this data set. It is true that the outliers impaired the sample means, but the estimated variance, being a weighted average of the level sample variances, was drawn down by pooling over all the information; for example, instead of using s3 = 37.7 in the comparisons with the third level, the LS procedure uses the pooled standard deviation s = 30.5.
procedure. Also, the separate rankings procedure can lead to inconsistencies
in that it could declare Treatment A superior to B, Treatment B superior to
Treatment C, while not declaring Treatment A superior to Treatment C; see
page 245 of Lehmann (1975) for a simple illustration.
where the eijk's are iid with distribution and density functions F and f, respectively.
Let T denote the location functional of interest and assume without loss of
generality that T (F ) = 0. The submodels described below utilize the two-way
structure of the design.
Model 4.4.1 is the same as the one-way design model (4.2.1) of Section 4.2.
Using the scores a(i) = ϕ(i/(n + 1)), the R fit of this model can be obtained
as described in that section. We use the same notation as in Section 4.2; i.e., ê denotes the residuals from the fit, adjusted so that T(Fn) = 0 where Fn is the empirical distribution function of the residuals; µ̂ denotes the R estimate of µ, the ab × 1 vector of the µij's; and τ̂ϕ denotes the estimate of τϕ. For the examples discussed in this section, Wilcoxon scores are used and the residuals are adjusted so that their median is 0.
An interesting submodel is the additive model which is given by
For the additive model, the profile plots (µij versus i or j) are parallel. A diagnostic check for the additive model is to plot the sample profile plots (µ̂ij versus i or j) and see how close the profiles are to parallel. The null hypotheses of interest for this model are the main effect hypotheses given by
Note that there are a − 1 and b − 1 free constraints for H0A and H0B , respec-
tively. Under H0A , the levels of A have no effect on the response.
The interaction parameters are defined as the differences between the
full model parameters and the additive model parameters; i.e.,
hypotheses (H0A and H0B) and the interaction hypothesis (H0AB). As with a LS analysis, one has to know which main effect hypotheses are being tested by the LS package. For instance, the main effect hypothesis H0A, (4.4.3), is a Type III sums of squares hypothesis in SAS; see Speed, Hocking, and Hackney (1978).
Source                   RD    df    MRD      FR
Temperature (T)        26.40    2   13.20   121.7
Motor Insulation (I)    3.72    2    1.86    17.2
T × I                   1.24    4    .310     2.86
Error                          30    .108
Since F (.05, 4, 30) = 2.69, the test of interaction is significant at the .05 level.
This confirms the profile plot, Panel A. It is interesting to note that the
least squares F -test statistic for interaction was 1.30 and, hence, was not
significant. The LS analysis was impaired because of the outliers. The row
effect hypothesis is that the average row effects are the same. The column
Figure 4.4.1: Panel A: Cell median profile plot for data of Example 4.4.1, cell
medians based on the Wilcoxon fit; Panel B: Internal Wilcoxon Studentized re-
sidual plot; Panel C: Logistic q−q plot based on internal Wilcoxon Studentized
residuals; Panel D: Casewise plot of the Wilcoxon Studentized residuals.
effect hypothesis is similarly defined. Both main effects are significant. In the
presence of interaction, though, we have interpretation difficulties with main
effects.
In Nelson’s discussion of this problem it was of interest to estimate the
simple contrasts of mean lifetimes of insulations at the temperature setting of
200◦ F. Since this is the first temperature setting, these contrasts are µ1j −µ1j ′ .
The estimates of these contrasts along with corresponding confidence intervals
formed under the Tukey-Kramer procedure as discussed above, (4.3.6), are
given by:
It seems that insulations 2 and 3 are better than insulation 1 at the tem-
perature of 200◦ F, but between insulations 2 and 3 there is no discernible
difference.
In the last example, the number of observations per parameter was less than five. To offset uneasiness over the use of the rank analysis for such small samples, McKean and Sievers (1989) conducted a Monte Carlo study on this design. The empirical levels and powers of the R analysis were good over situations similar to those suggested by these data.
Hence the slope at the ith level is βi = β + γi and, thus, each treatment combination has its own linear model. There are two natural hypotheses for this model: H0C: β1 = · · · = βk and H0L: µ1 = · · · = µk. If H0C is true, then the differences between the levels of Factor A are just the differences in the location parameters µi for a given value of the covariate. In this case, contrasts
H0C: γ11 = · · · = γpk,pk versus HAC: γij ≠ γi′j′ for some (i, j) ≠ (i′, j′) .   (4.5.3)

Other hypotheses of interest consist of contrasts in the µij. In general, let M be a q × k matrix of contrasts and consider the hypotheses

H0: Mµ = 0 versus HA: Mµ ≠ 0 .   (4.5.4)
and the values of the test statistic Fϕ can be used to test them. This analysis
can be conducted by the package rglm. It can also be conducted by fitting
the full model and obtaining the pseudo-observations. These in turn can be
substituted for the responses in a package which performs the traditional LS
analysis of covariance in order to obtain the R analysis.
Example 4.5.1 (Snake Data). As an example of an analysis of covariance
problem consider the data set discussed by Afifi and Azen (1972). For conve-
nience, we have placed the Snake Data at the url cited in the Preface. The
study involves four methods, three of which are intended to reduce a human’s
fear of snakes. Forty subjects were given a behavior approach test to determine
how close they could walk to a snake without feeling uncomfortable. This score
was taken as the covariate. Next they were randomly assigned to one of the
four treatments with ten subjects assigned to a treatment. The first treatment
was a control (placebo) while the other three treatments were different meth-
ods intended to reduce a human’s fear of snakes. The response was a subject’s
score on the behavior approach test after treatment. Hence, the sample size is
40 and the number of independent variables in Model (4.5.2) is 8. Wilcoxon
scores were used to conduct the analysis of covariance described above with
the residuals adjusted to have median 0.
The plots of the response variable versus the covariate for each treatment
are found in Panels A-D of Figure 4.5.1. It is clear from the plots that the
relationship between the response and the covariate varies with the treatment,
from virtually no relationship for the first treatment (placebo) to a fairly
strong linear relationship for the third treatment. Outliers are apparent in
these plots also. These plots are overlaid with Wilcoxon and LS fits of the
full model, Model (4.5.1). Panels E and F of Figure 4.5.1 are, respectively,
the internal Wilcoxon Studentized residual plot and the internal Wilcoxon
Studentized logistic q−q plot. The outliers stand out in these plots. From the
residual plot, the data appear to be heteroscedastic and, as Exercise 4.8.14
shows, the square root transformation of the response does lead to a better
fit.
Table 4.5.1 displays the Wilcoxon and LS estimates of the linear models for each treatment. As this table and Figure 4.5.1 show, the largest discrepancies between the Wilcoxon and LS estimates occur for those treatments which have large outliers. The estimates of τϕ and σ are 3.92 and 5.82, respectively; hence, as the table shows, the estimated standard errors of the Wilcoxon estimates are lower than their LS counterparts.
Table 4.5.2 displays the analysis of dispersion table for these data. Note that Fϕ strongly rejects H0C (p-value is 0.015). This confirms the discussion above based on Figure 4.5.1. The second hypothesis tested is that of no treatment effect, H0: µ1 = · · · = µ4. Although Fϕ strongly rejects this hypothesis also, in light of the results for H0C, the practical interpretation of such a decision is not
Figure 4.5.1: Panels A-D: For the snake data, scatterplots of final distance
versus initial distance for the placebo and treatments 2-4, overlaid with the
Wilcoxon fit (solid line) and the LS fit (dashed line); Panel E: Internal
Wilcoxon Studentized residual plot; Panel F: Wilcoxon Studentized logistic
q−q plot.
obvious. The value of the LS F-test for H0C is 2.34 (p-value is 0.078). If H0C is not rejected then the LS analysis could lead to an invalid interpretation. The outliers spoiled the LS analysis of this data set. As shown in Exercise 4.8.15, both the R analysis and the LS analysis strongly reject H0C for the square root transformation of the response.
where yijkl denotes the response for the lth replicate, at Fee i, Scope j, and
Supervision k. Wilcoxon scores were selected for the fit with residuals adjusted
to have median 0. Panels A and B of Figure 4.6.1 show, respectively, the
residual and normal q−q plots for the internal R Studentized residuals, (3.9.31),
based on this fit. The scatter in the residual plot is fairly random and flat.
There do not appear to be any outliers. The main trend in the normal q −q
Figure 4.6.1: Panel A: Wilcoxon Studentized residual plot for data of Example
4.6.1; Panel B: Wilcoxon Studentized residual normal q−q plot.
plot indicates tails lighter than those of a normal distribution. Hence, the fit
is good and we proceed with the analysis.
Table 4.6.1 displays the tests of the effects based on the LS and Wilcoxon
fits. The Wald-type Fϕ,Q statistic based on the pseudo-observations is also
given. The LS and Wilcoxon analyses agree, which is not surprising based on
the residual plot. The main effects are highly significant and the only sig-
nificant interaction is the interaction between Scope and Supervision. As a
subsequent analysis, we consider nine contrasts of interest. We use the Bon-
ferroni method based on the pseudo-observations as discussed in Section 4.3.
We used Minitab to obtain the results that follow. Because the factor Fee
does not interact with the other two factors, the contrasts of interest for this
factor are: µ1·· − µ2··, µ1·· − µ3··, and µ2·· − µ3··.
Table 4.6.2: Wilcoxon Estimates of Contrasts of Interest for the Market Data

Contr.          Est.     Conf. Int.         Contr.    Est.     Conf. Int.
µ1·· − µ2··      1.05    (−1.59, 3.69)      µ·11     111.9    (108.9, 114.9)
µ1·· − µ3··     31.34    (28.70, 33.98)     µ·12     101.0    (98.0, 104.0)
µ2·· − µ3··     30.28    (27.64, 32.92)     µ·21     106.4    (103.4, 109.4)
                                            µ·22      81.6    (78.6, 84.6)
Table 4.6.2 presents the estimates of these contrasts and the 95% Bonferroni confidence intervals, which are given by the estimate of the contrast ± t(.05/18; 36) τ̂ϕ √(2/16) ≐ ±2.64. From these results, quality of work significantly improves for either high or average fees over low fees. The comparison between high and average fees is insignificant.
Since the factors Scope and Supervision interact, but do not interact separately or jointly with the factor Fee, the parameters of interest are the simple contrasts among µ·11, µ·12, µ·21, and µ·22. Table 4.6.2 displays the estimates of these parameters. Using α = .05, the Bonferroni bound for a simple contrast here is t(.05/18; 36) τ̂ϕ √(2/12) ≐ 3.04. Hence all six simple pairwise contrasts among these parameters are significantly different from 0. In particular, averaging over fees, the best quality of work occurs when all contract work is done in house and under local supervision. The source of the interaction between the factors Scope and Supervision is also clear from these estimates.
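The Bonferroni margins quoted above are simple to reproduce; the helper below (an illustrative wrapper, with m the number of contrasts in the family, df the error degrees of freedom, and nrep the number of observations entering each mean) computes t(α/(2m); df) τ̂ϕ √(2/nrep):

bonf.margin <- function(tauhat, m, df, nrep, alpha = 0.05) {
  qt(1 - alpha / (2 * m), df) * tauhat * sqrt(2 / nrep)
}
# bonf.margin(tauhat, m = 9, df = 36, nrep = 16)   # the +/- 2.64 bound above
# bonf.margin(tauhat, m = 9, df = 36, nrep = 12)   # the +/- 3.04 bound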
Example 4.6.2 (Pigs and Diets). This data set is discussed on page 291
of Rao (1973). It concerns the effect of diets on the growth rate of pigs. For
convenience, we have tabled the data at the url cited in the Preface. There are
three diets, called A, B, and C. Besides the diet classification, the pigs were
classified according to their pens (5 levels) and sex (2 levels). Their initial
weight was also recorded as a covariate.
The design is a 5 × 3 × 2 crossed factorial with only one replication. For comparison purposes, we use the same model that Rao used, which is a fixed effects model with main effects and the two-way interaction between the
factors Diets and Sex. Letting yijk and xijk denote, respectively, the growth
rate in pounds per week and the initial weight of the pig in pen i, on diet j
and sex k, this model is given by:
Figure 4.6.2: Panel A: Internal Wilcoxon Studentized residual plot for data of
Example 4.6.2; Panel B: Internal Wilcoxon Studentized residual normal q−q
plot.
Panels A and B of Figure 4.6.2 display the residual plot and normal q−q plot of the internal R Studentized residuals based on the Wilcoxon fit. The residual plot shows the three outliers. The outliers are prominent in the q−q plot, but note that even the remaining plotted points indicate an error distribution with heavier tails than the normal. Not surprisingly, the estimate of τϕ is smaller than that of σ: .413 and .506, respectively. The largest outlier corresponds to the 6th pig, which had the lowest initial weight (recall that the internal R Studentized residuals account for position in factor space) but whose response was above the first quartile. The second largest outlier corresponds to the pig which had the lowest response.
The results of the tests for the effects for the LS and Wilcoxon fits are:

Effect     df    FLS     Fϕ     Fϕ,Q
Pen         4    2.35    3.65   3.48**
Table 4.6.3: Test Statistics for the Effects of the Pigs and Diets Data with No Covariate

Effect       df    FLS     Fϕ      Fϕ,Q
Pen           4    2.95*   4.20*   5.87*
Diet          2    2.77    4.80*   5.54*
Sex           1    1.08    3.01    3.83
Diet×Sex      2    0.55    1.28    1.46
σ̂ or τ̂ϕ     20    .648    .499    .501
* Denotes significance at the .05 level
interpretation of the main effects. Also, all three agree on the need for the covariate. Diet has a significant effect on weight gain, as does Sex. The robust analyses indicate that Pen is also a contributing factor.
The results of the analyses when the covariate is not taken into account are given in Table 4.6.3.
It is interesting to note here that the factor Diet is not significant based on the LS fit while it is for the Wilcoxon analyses. The heavy tails of the error distribution, evident in the residual plots, have foiled the LS analysis.
Hence the R residuals have been adjusted by the fit so that the ranks are
orthogonal to the x-space, i.e., the ranks are “free” of the x’s. These are the
ranks that are used in the R test statistic Fϕ , at the full model. Under H0 this
would also be true of the expected ranks of the residuals in the R fit of the
reduced model. Note, also, that the statistic Fϕ is invariant to the values of
the parameters of the reduced model.
Unlike the rank-based analysis there is no general supporting theory for
the RT. Hora and Conover (1984) presented asymptotic null theory on the
RT for treatment effect in a randomized block design with no interaction.
Thompson and Ammann (1989) explored the efficiency of this RT, showing,
however, that this efficiency depends on the block parameters. RT theory for
repeated measures designs has been developed by Akritas (1991, 1993) and
Thompson (1991b). These extensions also have the unpleasant trait that their
efficiencies depend on nuisance parameters.
Many of these theoretical studies on the RT have raised serious questions
concerning the validity of the RT for simple two-way and more complicated
designs. For a two-way crossed factorial design, Brunner and Neumann (1986)
showed that the RT statistics are not reasonable for testing main effects in the
presence of interaction for designs larger than 2×2 designs. This was echoed by
Akritas (1990), who stated further that RT statistics are not reasonable test statistics for interaction or for most other common hypotheses in either two-way crossed or nested classifications.
1990 and Thompson, 1991a, 1993), the nonlinear nature of the RT is faulted.
For a given model the hypotheses of interest are linear contrasts in model
parameters. The rank transform, though, is nonlinear; hence often the original
hypothesis is no longer tested by the rank transformed data. The same issue
was raised earlier by Fligner (1981) in a discussion of the article by Conover
and Iman (1981).
In terms of small sample properties, initial simulations of the RT analysis
on certain models (see, for example, Iman, 1974) did appear promising. Now
there has been ample evidence based on simulation studies questioning the
wisdom of doing RTs on designs as simple as two-way factorial designs with
interaction; see, for example, Blair, Sawilowsky, and Higgins (1987) and the
Preface in Sawilowsky (2007). We discuss one such study next and then present
an analysis of covariance example where the use of the RT results in a poor
analysis.
In order to compare the behavior of the rank-based analysis and the RT on this design, we performed part of their simulation study. We considered
standard normal errors and contaminated normal errors, which had 10% con-
tamination from a normal distribution with mean 0 and standard deviation 8.
The normal variates were generated as discussed in Marsaglia and Bray (1964)
using uniform variates which were generated by a portable Fortran generator
written by Kahaner, Moler, and Nash (1989). There were 5 replications per cell
and the nonnull constant of proportionality c was set at .75. The simulation
size was 1000.
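For readers who want to experiment with errors of this type, a minimal R sketch follows (our own, not the portable Fortran generator used in the study); the function name rcn is hypothetical.

# Minimal sketch of contaminated normal errors: with probability eps an
# observation comes from N(0, sdc^2), otherwise from N(0, 1).
rcn <- function(n, eps = 0.10, sdc = 8) {
  contaminated <- rbinom(n, 1, eps) == 1
  rnorm(n, mean = 0, sd = ifelse(contaminated, sdc, 1))
}
set.seed(42)
errors <- rcn(120)   # e.g., errors for one simulated data set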
Tables 4.7.1 and 4.7.2 summarize the results of our study for the following
two situations: the two-way interaction A × C and the three-way interaction
effect A × B × C. The alternative for the A × C situation had all main effects and all two-way interactions in the model, while the alternative for the A × B × C situation had all main effects and all two-way interactions in the model besides the three-way interaction.
These were poor situations for the RT in the study conducted by Sawilowsky
et al. and as Tables 4.7.1 and 4.7.2 indicate the RT behaves poorly for these
situations in our study also. Its empirical α levels are deplorable. For instance,
at the nominal .10 level for the three-way interaction test under normal errors,
the RT has an empirical level of .777, while the level is .511 at the contaminated
normal. In contrast the levels of the rank-based analysis were quite close to
the nominal levels under normal errors and slightly conservative under the
contaminated normal errors. In terms of power, note that the empirical power
of the rank-based analysis is slightly less than the empirical power of LS under
normal errors while it is substantially greater than the power of LS under
contaminated normal errors. For the three-way interaction test, the empirical
power of the RT falls below its empirical level.
Example 4.7.1 (The Rat Data). This example, taken from Shirley (1981),
contrasts the rank-based methods, the rank transformed methods, and least
squares methods in an analysis of covariance setting. The response is the time
it takes a rat to enter a chamber after receiving a treatment designed to delay
the time of entry. There were 30 rats in the experiment and they were divided
evenly into three groups. The rats in Groups 2 and 3 received an antidote to
Table 4.7.3: LS (Top Row) and Wilcoxon (Bottom Row) Estimates (Standard Errors) for the Rat Data

          Group 1                 Group 2                 Group 3
     α            β          α           β          α           β        σ or τϕ
  −39.1 (20.)  76.8 (10.)  −15.6 (22.)  20.5 (14.)  −14.7 (19.)  21.9 (12.)   20.5
  −54.3 (16.)  84.2 (8.6)  −19.3 (18.)  21.0 (11.)  −11.6 (16.)  17.4 (10.)   17.0
the treatment. The covariate is the time taken by the rat to enter the chamber
prior to its treatment. The data are displayed in Panel A of Figure 4.7.1; for
convenience we have also placed the data at the url listed in the Preface. As
a full model, we considered the model

yij = αj + βj xij + eij ,   j = 1, 2, 3 , i = 1, . . . , 10 ,   (4.7.2)

where yij denotes the response for the ith rat in Group j and xij denotes the
corresponding covariate. There is a slight quadratic aspect to the Wilcoxon
residual plot, Panel B of Figure 4.7.1, which is investigated in Exercise 4.8.16.
Panel C of Figure 4.7.1 displays a plot of the internal Wilcoxon Studentized
residuals by case. Note that there are several outliers. These also can be seen
in the plots of the data for groups 2 and 3, Panels E and F of Figure 4.7.1.
Note that the outliers have an effect on the LS-fits, drawing the fits toward
the outliers in each group. In particular, for Group 3, it only took one outlier
to spoil the LS fit. On the other hand, the Wilcoxon fit is not affected by
the outliers. The estimates are given in Table 4.7.3. As the plots indicate, the
LS and Wilcoxon estimates differ numerically. Further evidence of the more
precise R fits relative to the LS fits is given by the estimates of the scale
parameters σ and τϕ found in Table 4.7.3.
We first test for homogeneity of slopes for the groups; i.e., H0 : β1 = β2 =
β3 . As clearly shown in Panel A of Figure 4.7.1 this does not appear to be true
for this data. While the slopes for Groups 2 and 3 seem to be about the same
(the Wilcoxon 95% confidence interval for β2 − β3 is 3.9 ± 27.2), the slope for
Group 1 appears to differ from the other two. To confirm this statistically, the
value of the Fϕ statistic to test homogeneity of slopes, H0 , has the value 9.88
with 2 and 24 degrees of freedom, which is highly significant (p < .001). This
says that Group 1, the group that did not receive the antidote, does differ
Figure 4.7.1: Panel A: Wilcoxon fits of all groups; Panel B: Internal Wilcoxon
Studentized residual plot; Panel C: Internal Wilcoxon Studentized residuals
by Case; Panel D: LS (solid line) and Wilcoxon (dashed line) fits for Group
1; Panel E: LS (solid line) and Wilcoxon (dashed line) fits for Group 2; Panel
F: LS (solid line) and Wilcoxon (dashed line) fits for Group 3.
significantly from the other two groups in terms of how the groups interact
with the covariate. In particular, the estimated slope of post-treatment time
to pre-treatment time for the rats in Group 1 is about four times as large as
the slope for the rats in the two groups which received the antidote. Because
there is interaction between the groups and the covariate, we did not proceed
with the second test on average group effects; i.e., testing α1 = α2 = α3 .
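As an aside, the LS analog of this test of homogeneity of slopes is easily run in R; the sketch below, on toy data with hypothetical variable names, compares the full and reduced LS fits with anova. The analysis above, of course, is based on the rank-based statistic Fϕ rather than the LS F-test.

# Hedged sketch: LS analog of the test of H0: beta1 = beta2 = beta3.
# `post` (response), `pre` (covariate), and `group` are hypothetical names.
set.seed(1)
group <- factor(rep(1:3, each = 10))
pre   <- runif(30, 1, 3)
post  <- c(80, 21, 22)[group] * pre + rnorm(30, sd = 15)  # toy data
full    <- lm(post ~ group + pre:group)   # separate slope for each group
reduced <- lm(post ~ group + pre)         # common slope
anova(reduced, full)                      # LS F-test with 2 and 24 df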
Shirley (1981) performed a rank transform on this data by ranking the
response and then applying standard least squares analysis. It is clear from
Panel A of Figure 4.7.1 that this nonlinear transform results in homogeneous
slopes for the ranked problem, as confirmed by Shirley’s analysis. But the rank
transform is a nonlinear transform and the subsequent analysis based on the
rank transformed data does not test homogeneity of slopes in Model (4.7.2).
The RT analysis is misleading in this case.
Note that using the rank-based analysis we performed an overall analysis
of this data set, including a residual analysis for model criticism. Hypotheses
of interest were readily tested and estimates of contrasts, along with standard
errors, were easily obtained.
4.8 Exercises
4.8.1. Derive expression (4.2.19).
4.8.2. In Section 4.2.2 when we have only two levels, show that the Kruskal-
Wallis test is equivalent to the MWW test discussed in Chapter 2.
4.8.3. Consider a one-way design for the data in Example 4.2.3. Fit the model
using Wilcoxon estimates and conduct a residual analysis, including residual
and q−q plots of standardized residuals. Identify any outliers. Next test the
hypothesis (4.2.13) using the Kruskal-Wallis test and the test based on Fϕ .
4.8.4. Using the notation of Section 4.2.4, show that the asymptotic covariance between μ̂i and μ̂i′ is given by expression (4.2.31). Next show that expressions (3.9.38) and (4.2.31) lead to a verification of the confidence interval (4.2.30).
sup_h [ h′y / √(h′D⁻¹h) ] = √(y′Dy) ,   (4.8.1)

for all vectors h such that Σi hi = 0.
Hence, if the Kruskal-Wallis test rejects H0 at level α, then there must be at least one contrast in the rank averages that exceeds the critical value √(χ²α(k − 1)). This provides Scheffé-type multiple contrast tests with family error rate approximately equal to α.
4.8.7. Apply the procedure presented in Exercise 4.8.6 to the quail data of
Example 4.2.1. Use α = .10.
4.8.8. Let I1 and I2 be (1 − α)100% confidence intervals for parameters θ1 and θ2, respectively.

(a) Suppose the confidence intervals I1 and I2 are independent. Show that

1 − 2α ≤ P[{θ1 ∈ I1} ∩ {θ2 ∈ I2}] ≤ 1 − α .   (4.8.2)
(b) Generalize expression (4.8.2) to k confidence intervals and derive the Bon-
ferroni procedure described in (4.3.2).
4.8.9. In the notation of the Pairwise Tests Based on Joint Rankings procedure of Section 4.3, show that R1 is asymptotically Nk−1(0, (k(n + 1)/12)(Ik−1 + Jk−1)) under H0 : μ1 = · · · = μk. (Hint: The asymptotic normality follows as in Theorem 3.5.2. In order to determine the covariance matrix of R1, first obtain the covariance matrix of the random vector R = (R1·, . . . , Rk·)′ and then obtain the covariance matrix of R1 by using the transformation [−1k−1 Ik−1].)
4.8.10. In Section 4.3, the Pairwise Tests Based on Joint Rankings
procedure was discussed based on Wilcoxon scores. Generalize this procedure
for an arbitrary score function ϕ(u).
4.8.11. For the baseball data in Exercise 1.12.33, consider the following one-
way problem. The response of interest is the hitter’s average and the three
groups are left-handed hitters, right-handed hitters, and switch hitters. Using
either Minitab or rglm, obtain the following analyses based on Wilcoxon scores:
(a) Using the test statistic Fϕ , test for an overall group effect. Obtain the
p-value and conclude at the 5% level.
(b) Use the protected LSD procedure of Fisher to compare the groups at the
5% level.
height = α + weightβ + e .
Using Wilcoxon scores and either Minitab or rglm, investigate whether or not
the same simple linear model can be used for both the pitchers and hitters.
Obtain the p-value for the test of this hypothesis based on the statistic Fϕ .
4.8.14. In Example 4.5.1 obtain the square root of the response and fit it to
the full model. Perform a residual analysis on the resulting fit. In particular
identify any outliers and compare the heteroscedasticity in the plot of the
residuals versus the fitted values with the analogous plot in Example 4.5.1.
4.8.15. For Example 4.5.1, overlay the Wilcoxon and LS fits for the four
treatments based on the square root transformation of the response. Then
obtain an analysis of covariance for both the Wilcoxon and LS analyses for
the transformed data. Comment on the plots and the results of the analyses.
4.8.16. Consider Example 4.7.1. Investigate whether a model which also in-
cludes quadratic terms in the covariates is more appropriate for the Rat data
than Model (4.7.2).
4.8.17. Consider Example 4.7.1. Eliminate the placebo group, Group 1, and
perform an analysis of covariance on Groups 2 and 3. Use the linear model,
(4.7.2). Is there any difference between these groups?
Yijl = µ + αi + βj + eijl ; i = 1, . . . , a , j = 1, . . . , k , l = 1, . . . , m ,
Suppose we rank the data in the ith block from 1 to mk for i = 1, . . . , a. Let
Rj be the sum of the ranks for the jth treatment. Show that
E(Rj) = am(mk + 1)/2 ,
Var(Rj) = am²(mk + 1)(k − 1)/12 ,
Cov(Rj, Rl) = −am²(mk + 1)/12 .

Further, argue that

Km = ((k − 1)/k) Σ_{j=1}^{k} [ (Rj − E(Rj)) / √Var(Rj) ]²
   = (12/(akm²(mk + 1))) Σ_{j=1}^{k} Rj² − 3a(mk + 1) .
4.8.20. The data in Table 4.8.1 are the results of a 3 × 4 design discussed
in Box and Cox (1964). Forty-eight animals were exposed to three different
poisons and four different treatments. The response was the survival time of
the animal. The design was balanced. Use (4.4.1) as the full model to answer
the questions below.
(a) Using Wilcoxon scores obtain the fit of the full model. Sketch the cell me-
dian profile plot based on this fit and discuss whether or not interaction
between poison and treatments is present.
(b) Based on the Wilcoxon fit, plot the residuals versus the fitted values.
Comment on the appropriateness of the model. Also obtain the internal
Wilcoxon Studentized residuals and identify any outliers.
(c) Using the statistic Fϕ , obtain the robust ANOVA table (main effects and
interaction) for this data. Conclude in terms of the p-values.
(d) Note that the hypothesis matrix for interaction defines six interaction
contrasts. Use the Bonferroni and Protected LSD multiple comparison
procedures, (4.3.2) and (4.3.3), to investigate these contrasts. Determine
which, if any, are significant.
(e) Repeat the analysis in Parts (c) and (d), (Bonferroni analysis), using LS.
Compare the Wilcoxon and LS results.
H0 : µ1 = · · · = µk versus HA : µ1 ≤ · · · ≤ µk ,
where S⁺st = #(Ytj > Ysi) for i = 1, . . . , ns and j = 1, . . . , nt ; see (2.2.20).
This test for ordered alternatives was proposed independently by Jonckheere
(1954) and Terpstra (1952). Under H0 , show the following:
Chapter 5

Models with Dependent Error Structure
5.1 Introduction
In this chapter, we develop robust rank-based fitting and inference procedures
for linear models with dependent error structure. The first four sections con-
sider general mixed models. For these, the underlying model is linear but the
data come in clusters; for example, in blocks, subjects, or centers. Hence, there
is a cluster (block) effect; that is, the observations within a cluster are depen-
dent random variables. As in Chapter 4, the fixed effects of the linear model
are of interest but the dependent error structure must be taken into account
in the development of the inference procedures. The last section of the chapter
discusses rank-based fitting and inference procedures for autoregressive time
series models.
Geometry
The geometry of the R fit of Model 5.2.1 is the same as for that of the linear
model in Chapter 3. Let ϕ(u), 0 < u < 1, be a specified score function
which satisfies Assumption (S.1) of Section 3.4 and consider the norm given
in expression (3.2.6). Then, as in Chapter 3, the R estimator of β is
β̂ϕ = Argmin ‖Y − Xβ‖ϕ .   (5.2.3)
For Model (5.2.1), properties of this estimator were developed by Kloke, Mc-
Kean, and Rashid (2009). They refer to it as the JR estimator for joint rank-
ing; however, we use the terminology of Chapter 3 and call it an R estimator.
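To make the definition concrete, the following R sketch (ours, not the authors' software) evaluates the Wilcoxon pseudo-norm in (5.2.3) as Jaeckel's dispersion function and minimizes it numerically with optim; for serious use, dedicated fitting routines are preferable.

# Sketch: Wilcoxon dispersion D(beta) = sum_i a(R(e_i)) e_i with
# scores a(i) = sqrt(12) * (i/(n+1) - 1/2), minimized with optim().
wdisp <- function(beta, X, y) {
  e <- as.vector(y - X %*% beta)
  n <- length(e)
  sum(sqrt(12) * (rank(e)/(n + 1) - 0.5) * e)
}
set.seed(1)
X <- cbind(rnorm(60), rnorm(60))
y <- as.vector(X %*% c(2, -1)) + rt(60, df = 3)     # heavy-tailed errors
betaR <- optim(lm.fit(X, y)$coefficients, wdisp, X = X, y = y)$par
betaR    # rank-based (Wilcoxon) estimate of beta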
Asymptotic Theory
The asymptotic theory for the R estimates is similar to the theory in Chapter
3. For this reason we only briefly sketch it in the following discussion. First,
certain conditions are needed. Assume that the random vectors e1 , e2 , . . . , em
are independent; i.e., the responses drawn from different blocks or clusters
are independent. Assume further that the univariate marginal distributions
of ek are the same for all k. As discussed at the end of this section (see
Subsection 5.2.1), this holds for many models of practical interest; however, in
Section 5.5, we do discuss more general rank-based estimators which do not
require this assumption. Let F (x) and f (x) denote the common univariate
marginal distribution function and density function. Assume that f (x) follows
Assumption (E.1) of Section 3.4 and that the usual regularity (likelihood)
conditions hold; see, for example, Section 6.5 of Lehmann and Casella (1998).
For the design matrix X, assume that Huber’s condition (D.2) of Section 3.4
holds. As with the asymptotic theory for the traditional estimates (see Liang
and Zeger, 1986), assume that the number of clusters goes to ∞, i.e., m → ∞,
and that nk ≤ M, for all k, for some constant M.
Because of the invariances, without loss of generality, assume that the true
regression parameters are zero in Model (5.2.1). As in Chapter 3, asymptotic
theory for the fixed effects estimator involves establishing the distribution of
the gradient and the asymptotic quadraticity of the dispersion function.
Consider Model (5.2.1) and assume the above conditions. It then follows
from Brunner and Denker (1994) that the projection of the gradient Sϕ (Y −
Xβ) is the random vector X′ ϕ[F(Y−Xβ)], where ϕ[F(Y−Xβ)] = (ϕ[F (Y11 −
x′11 β)], . . . , ϕ[F (Ymnm − x′mnm β)])′ . We need to assume that the covariance
structure of this projection is asymptotically stable; that is, the following
limit exists and is positive definite:
Σϕ = lim_{m→∞} n⁻¹ Σ_{k=1}^{m} X′k Σϕ,k Xk ,   (5.2.5)
where
Σϕ,k = Cov{ϕ[F(ek )]}. (5.2.6)
In likelihood methods, a similar assumption is made on the limit of the co-
variance matrix of the errors.
As discussed by Kloke et al. (2009), under these assumptions, it follows
from Theorem 3.1 of Brunner and Denker (1994) that
(1/√n) SX(0) →D Np(0, Σϕ) .   (5.2.7)

Based on (5.2.7) and (5.2.8), we obtain the asymptotic distribution of β̂ϕ, which we summarize in the following theorem.

Theorem 5.2.1. Under the assumptions discussed above, the distribution of β̂ϕ is approximately normal with mean β and covariance matrix

Vϕ = τϕ² (X′X)⁻¹ ( Σ_{k=1}^{m} X′k Σϕ,k Xk ) (X′X)⁻¹ .   (5.2.9)
êR = Y − α̂S 1n − Xβ̂ϕ .   (5.2.11)
TW,ϕ = (Mβ̂ϕ)ᵀ [M V̂ϕ Mᵀ]⁻¹ (Mβ̂ϕ) .   (5.2.13)
5.2.1 Applications
In many applications the form of the covariance structure of the random vector
of errors ek of Model (5.2.1) is known. This can result in a simplified asymptotic covariance structure for β̂ϕ. We discuss several such cases in the next
few sections. In Section 5.3, we consider a simple mixed model with clusters
handled as a random effect. Here, besides an estimate of τϕ , only an additional
covariance parameter is required to estimate Vϕ . In Section 5.4.1, we discuss
a transformed procedure for a simple mixed model, provided that the design
matrices for each cluster, Xk s, have full column rank. Another rich class of
such models is the repeated measure designs, where cluster is synonymous
with subject. Two common types of covariance structure for these designs
are: (i) the covariance of the errors for a subject have compound symmetri-
cal structure, i.e., a simple random effect model, or (ii) the errors follow a
stationary time series model, for instance an autoregressive model. For Case
(ii), the univariate marginals would have the same distribution and, hence,
the above assumptions hold for our rank-based estimates. Using the residuals
from the rank-based fit, R estimators of the autoregressive parameters of the
error distribution can be obtained. These estimates could then be used in the
usual way to transform the observations and then a second (generalized) R
estimate could be obtained based on these transformed observations. This is
a robust analogue of the two-stage estimation procedure discussed for cluster
samples in Rao, Sutradhar and Yue (1993). Generalized R estimators based
on transformations are discussed in Sections 5.4 and 5.5.
Assume that the random effects b1 , . . . , bm are independent and identically dis-
tributed random variables. It follows that the distribution of ek is exchange-
able. In particular, all marginal distributions of ek are the same; so, the theory
of Section 5.2 holds. This family of models contains the randomized block de-
signs, but as in Section 5.2 the blocks can be incomplete and/or unbalanced.
We call Model (5.3.1) the simple mixed model.
For this model, the asymptotic variance-covariance matrix of β̂ϕ, (5.2.9), simplifies to

Vϕ = τϕ² (X′X)⁻¹ Σ_{k=1}^{m} X′k Σϕ,k Xk (X′X)⁻¹ ,   Σϕ,k = (1 − ρϕ) Ink + ρϕ Jnk ,   (5.3.2)

where ρϕ = cov{ϕ[F(e11)], ϕ[F(e12)]} = E{ϕ[F(e11)]ϕ[F(e12)]}. Also, the asymptotic variance of the intercept, (5.2.10), simplifies to n⁻¹τS²(1 + n∗ρ∗S), for ρ∗S = cov[sgn(e11), sgn(e12)] and n∗ = n⁻¹ Σ_{k=1}^{m} nk(nk − 1). As with LS, for positive definiteness, we need to assume that each of ρϕ and ρ∗S exceeds maxk{−1/(nk − 1)}. Let M = Σ_{k=1}^{m} nk(nk − 1)/2 − p (the subtraction of p, the dimension of the vector β, is a degrees of freedom correction). A simple moment estimator of ρϕ is

ρ̂ϕ = M⁻¹ Σ_{k=1}^{m} Σ_{i>j} a[R(êki)] a[R(êkj)] .   (5.3.3)
Plugging this into (5.3.2) and using the estimate of τϕ discussed earlier, we
have an estimate of the asymptotic covariance matrix of the R estimators.
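A direct R transcription of (5.3.3) is sketched below; ehat, cluster, and p are hypothetical names for the vector of R residuals, the block identifier (assumed a factor), and the dimension of β.

# Sketch of the moment estimator rho-hat_phi, (5.3.3), Wilcoxon scores.
rho.phi <- function(ehat, cluster, p,
                    phi = function(u) sqrt(12) * (u - 0.5)) {
  n  <- length(ehat)
  a  <- phi(rank(ehat)/(n + 1))          # scored ranks of the residuals
  nk <- table(cluster)
  M  <- sum(nk * (nk - 1)/2) - p         # degrees of freedom correction
  # within each cluster, sum over i > j of a_ki * a_kj
  within <- tapply(a, cluster, function(ak) (sum(ak)^2 - sum(ak^2))/2)
  sum(within) / M
}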
For the general mixed model (5.2.1) of Section 5.2, the AREs for the rank-
based procedures are difficult to obtain; however, for the simple mixed model,
(5.3.1), the ARE can be obtained in closed form provided the design is centered
within each block; see Kloke et al. (2009). The reader is asked to show in
Exercise 5.7.2 that for Wilcoxon scores, this ARE is
ARE(FW,ϕ, FLS) = [(1 − ρ)/(1 − ρϕ)] · 12σ² ( ∫ f²(t) dt )² ,   (5.3.4)
êR,kj = ykj − [α̂ + x′kj β̂] ,   k = 1, . . . , m; j = 1, . . . , nk .   (5.3.5)

In this simple mixed model, the residuals êR,kj, (5.3.5), are often called the marginal residuals. In addition, though, we have the conditional residuals for the errors εkj, which are defined by

ε̂kj = êR,kj − b̂k ,   j = 1, . . . , nk , k = 1, . . . , m .   (5.3.7)

A robust estimate of the variance of the errors εkj is

σ̂ε² = [MAD_{1≤j≤nk, 1≤k≤m}(ε̂kj)]² .   (5.3.8)

Hence, robust estimates of the total variance σ² and the intraclass correlation coefficient are

σ̂² = σ̂ε² + σ̂b²   and   ρ̂ = σ̂b²/σ̂² .   (5.3.9)

Thus, our robust estimates of the variance components are given in expressions (5.3.6), (5.3.8), and (5.3.9).
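The sketch below illustrates these estimates in R; since expression (5.3.6) is not reproduced above, the predictions b̂k are taken, as an assumption, to be the cluster medians of the marginal residuals. Here ehat and cluster (a factor) are hypothetical names.

# Sketch of the robust variance component estimates (5.3.8)-(5.3.9).
bhat   <- tapply(ehat, cluster, median)   # assumed stand-in for (5.3.6)
epshat <- ehat - bhat[cluster]            # conditional residuals, (5.3.7)
sig2b  <- mad(bhat)^2                     # robust sigma_b^2
sig2e  <- mad(epshat)^2                   # robust sigma_eps^2, (5.3.8)
sig2   <- sig2b + sig2e                   # total variance, (5.3.9)
rho    <- sig2b / sig2                    # intraclass correlation, (5.3.9)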
Because the block sample sizes nk are not necessarily the same,
some additional notation simplifies the presentation. Let ν1 and ν2
be two parameters and define the block-diagonal matrix B(ν1 , ν2 ) =
diag{B1 (ν1 , ν2 ), . . . , Bm (ν1 , ν2 )}, where Bk (ν1 , ν2 ) = (ν1 − ν2 )Ink + ν2 Jnk ,
k = 1, . . . , m. Hence, for Model (5.3.1), we can write Var(e) = σ 2 B(1, ρ).
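A few lines of R suffice to build B(ν1, ν2) for given cluster sizes; the sketch uses bdiag from the Matrix package.

# Sketch: block-diagonal B(nu1, nu2) with blocks (nu1 - nu2) I + nu2 J.
library(Matrix)
Bmat <- function(nu1, nu2, nk) {
  blocks <- lapply(nk, function(m) (nu1 - nu2) * diag(m) +
                                    nu2 * matrix(1, m, m))
  as.matrix(bdiag(blocks))
}
Bmat(1, 0.3, c(2, 3))   # Var(e)/sigma^2 for clusters of sizes 2 and 3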
Using the asymptotic representation for β̂ϕ given in expression (5.2.8), a tedious calculation, similar to that in Section 3.9.2, shows that the approximate covariance matrix of êR is given by

CR = σ² B(1, ρ) + (τS²/n²) Jn B(1, ρ∗S) Jn + τ² Hc B(1, ρϕ) Hc
   − (τS/n) B(δ∗11, δ∗12) Jn − τ B(δ11, δ12) Hc − (τS/n) Jn B(δ∗11, δ∗12)
   + (ττS/n) Jn B(γ11, γ12) Hc − τ Hc B(δ11, δ12) + (τSτ/n) Hc B(γ11, γ12) Jn ,   (5.3.10)

where Hc is the projection matrix onto the column space of the centered design matrix Xc, Jn is the n × n matrix of all ones, and

δ∗11 = E[e11 sgn(e11)] ,   δ∗12 = E[e11 sgn(e12)] ,
δ11 = E[e11 ϕ(F(e11))] ,   δ12 = E[e11 ϕ(F(e12))] ,
γ11 = E[sgn(e11) ϕ(F(e11))] ,   γ12 = E[sgn(e11) ϕ(F(e12))] .
Table 5.3.1: Wilcoxon and LS Estimates with SEs of Effects for the Crab Grass
Data
Wilcoxon Traditional
Contrast Est. SE Est. SE
Nit 39.90 4.08 69.76 28.7
Pho 10.95 4.08 −11.52 28.7
Pot −1.60 4.08 28.04 28.7
D34 3.26 5.76 57.74 40.6
D24 7.95 5.76 8.36 40.6
D14 24.05 5.76 31.90 40.6
test for the factor density: TW,ϕ = 23.23 (p = 0.001) and Flme = 6.33 with p = 0.001. The robust estimates of the variance components are: σ̂² = 209.20, σ̂b² = 20.45, and ρ̂ = 0.098. These are essentially unchanged from their values on the original data. If on the original data the experimenter had run the robust fit and compared it with the traditional fit, then the outlier would have been discovered immediately.
Figure 5.3.1 contains the Wilcoxon Studentized residual plot and q−q plot
for the original data. We have removed the large outlier from the plots, so
that we can focus on the remaining data. The “vacant middle” in the residual
plot is an indication that interaction may be present. For the hypothesis of
interaction between the nutrients, the value of the Wald-type test statistic
is TW,ϕ = 30.61, with p = 0.000. Hence, the R analysis strongly confirms
that interaction is present. On the other hand, the traditional likelihood ratio
test statistic for this interaction is 2.92, with p = 0.404. In the presence of
interaction, many statisticians would consider interaction contrasts instead of
a main effects analysis. Hence, for such statisticians, the robust and traditional
analyses would have different practical interpretations.
Figure 5.3.1: Studentized residual and q−q plots, minus large outlier.
A randomized block design as in the crab grass example was used, with the correlation
structure as estimated by the Wilcoxon analysis. The empirical confidences of
the asymptotic 95% confidence intervals were recorded. These intervals are of
the form Estimate ±1.96×SE, where SE denotes the standard errors of the
estimates. The number of simulations was 10,000 for each situation, therefore,
the error in the table based on the usual 95% confidence interval for a propor-
tion is 0.004. The empirical confidences for the Wilcoxon are quite good with
the target of 0.95 usually within range of error. They were perhaps a little
conservative at the contaminated normal situation. Hence, the Wilcoxon
analysis appears to be valid for this design. The intervals based on the tra-
ditional fit are slightly liberal. The empirical AREs between two estimators
displayed in Table 5.3.2 are the ratios of empirical mean squared errors of the
two estimators. As the table shows, the traditional fit is more efficient at the
normal but the efficiencies are close to the value 0.95 for the independent er-
ror case. The Wilcoxon analysis is much more efficient over the contaminated
normal situation.
Does this rank-based analysis differ from the independent error analysis
of Chapter 3? As a tentative answer to this question, Kloke et al. (2009)
ran 10,000 simulations using the model for the crab grass example. Wilcoxon
scores were used for both analyses. To avoid confusion, call the analysis of
Chapter 3 the IR analysis (I for independent errors), and the analysis of this
section the R analysis. They considered normal error distributions, setting the
variance components at the values of the robust estimates. Because the R and
IR fits are the same, they considered the differences in their inferences of the six
effects listed in Table 5.3.1. For 95% nominal confidence, the average empirical
confidences over these six contrasts are 95.32% and 96.12%, respectively for the
R and IR procedures. Hence, both procedures appear valid. For a measure of
efficiency, they averaged, across the contrasts, the averages of squared lengths
of the confidence intervals. The ratio of the R to the IR averages is 0.914;
hence for the simulation, the R inference is about 9% more efficient than the
IR inference. Similar results for the traditional analyses are reported in Rao
et al. (1993).
the WW2 fit has more precision than one based on the Wilcoxon fit.
To investigate this gain in precision, we ran a small simulation study. We
used the same model and the same correlation structure as estimated by the
Wilcoxon fit. We considered normal and contaminated normal errors, with the
percent of contamination at 20% and the relative variance of the contaminated
part at 25. For each situation 10,000 simulations were run. The AREs were
very similar for all six parameters, so we only report their averages. For the
normal situation the average ARE between the WW2 and Wilcoxon estimates
was 0.90; hence, the WW2 estimate was 10% less efficient for the normal
situation. For the contaminated normal situation, though, this average was
1.21; hence, the WW2 estimate was 20% more efficient than the Wilcoxon
estimate for the contaminated normal situation.
There are families of score functions besides the Winsorized Wilcoxon
scores. Gastwirth (1966) presents several families of score functions appro-
priate for classes of distributions with tails heavier than the exponential dis-
tribution. For certain cases, he selects a score based on a maxi-min strategy.
Y∗ = X∗β + e∗ ,   (5.4.1)
Ŷ∗ = HX∗ Y∗ .   (5.4.3)
Arnold Transformation
Arnold (1981, Chapters 14 and 15) discusses a Helmert transformation for
these types of models for traditional (least squares) analyses for balanced
designs, i.e., all nk ’s are the same. Kloke and McKean (2010a) generalized
Arnold’s results to unbalanced designs and developed the properties of the R
fit for the transformed data. Consider the nk × nk orthogonal matrix
Γk = [ (1/√nk) 1′nk ]
     [      C′k      ] ,   (5.4.6)
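Numerically, such a matrix can be obtained by completing 1nk/√nk to an orthonormal basis; the R sketch below does this via a QR decomposition (one convenient choice of C′k; any orthonormal completion serves).

# Sketch: an orthogonal matrix whose first row is 1'/sqrt(nk); the
# remaining rows (the C'_k of (5.4.6)) come from a QR completion.
Gamma.k <- function(nk) {
  Q <- qr.Q(qr(matrix(1, nk, 1)), complete = TRUE)
  Q[, 1] <- Q[, 1] * sign(Q[1, 1])   # make the first column +1/sqrt(nk)
  t(Q)
}
round(Gamma.k(4) %*% t(Gamma.k(4)), 10)   # identity: Gamma_k is orthogonal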
i.e., β̂ATLS = (X∗′X∗)⁻¹X∗′y∗₂. This is the extension of Arnold's (1981) solution that was proposed by Kloke and McKean (2010a) for the unbalanced case of Model (5.4.7). As usual, estimate the intercept based on the mean of the residuals,

α̂LS = (1/n) 1′(y − ŷ) = (1/n) 1′(In − X(X∗′X∗)⁻¹X∗C′) y = ȳ .

As Exercise 5.7.3 shows, the joint asymptotic distribution is

(α̂LS , β̂′ATLS)′ ∼̇ Np+1( (α, β′)′ , block-diag{ σ₁² , σ₂²(X∗′X∗)⁻¹ } ) ,   (5.4.8)

where σ₁² = (σ²/n²) Σ_{k=1}^{m} [(1 − ρ)nk + ρnk²] and σ₂² = σ²(1 − ρ). Notice that if inference is to be on β, then we avoid explicit estimation of ρ. To estimate σ₂², we may use σ̂₂² = Σ_{k=1}^{m} Σ_{j=1}^{nk} (ê∗kj)²/(n∗ − p), where ê∗kj = y∗kj − x∗′kj β̂.
Table 5.4.2: ATR and ATLS Estimates and Standard Errors for Example 5.4.1
ATR ATLS
Est SE Est SE
α 70.8 3.54 72.8 8.98
∆ −14.45 1.61 −14.45 1.19
β 1.43 0.65 1.46 0.33
For a sensitivity analysis, the second data point was changed to y12(i) = y11 + ∆y, where ∆y varied from −30 to 30. Then the parameters ∆(i) are estimated based on the data set with the outlier. The graph below displays the relative change of the estimate, (∆̂ − ∆̂(i))/∆̂, as a function of ∆y.
Over this range of ∆y, the relative changes in the ATR estimate are between −0.042 and 0.062. In contrast, as the reader is asked to show in Exercise 5.7.4, the relative change in ATLS over this range is between 0.125 and 0.394. Hence,
the relative change in the ATR estimates is small, which indicates the robust-
ness of the ATR estimates.
Thus the solution to the GEE equations (5.5.5) also can be expressed as
where wit (β) = ϕ[R(eit (β))/(n + 1)]/[eit (β) − m(β)] is a weight function.
As usual, we take wit (β) = 0 if eit (β) − m(β) = 0. Note that by using
the median of the residuals in conjunction with property (2.5.9), the weights
are positive. To accommodate other score functions besides those that satisfy
(2.5.9), quantiles other than the median can be used; see Example 5.5.3 and
Sievers and Abebe (2004) for discussion.
For the initial estimator of β, we recommend the rank-based estimator of Chapter 3 based on the score function ϕ(u). Denote this estimator by β̂R^(0). As estimates of the weights, we use ŵit(β̂R^(0)); i.e., the weight function evaluated at β̂R^(0). Expression (5.5.10) leads to the dispersion function

D∗R(β | β̂R^(0)) = Σ_{i=1}^{m} Σ_{t=1}^{ni} ŵit(β̂R^(0)) [ eit(β) − m(β̂R^(0)) ]²
              = Σ_{i=1}^{m} Σ_{t=1}^{ni} [ √ŵit(β̂R^(0)) eit(β) − √ŵit(β̂R^(0)) m(β̂R^(0)) ]² .

Let

β̂R^(1) = Argmin D∗R(β | β̂R^(0)) .   (5.5.11)

This establishes a sequence of IRLS estimates, {β̂R^(k)}, k = 1, 2, . . . .
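A single step of the scheme is easy to express in R. The sketch below uses Wilcoxon scores and solves the weighted problem in (5.5.11) exactly with lm.wfit, setting a weight to zero whenever a residual equals the median, as discussed above.

# Sketch of one IRLS step: rank-based weights at beta0, then weighted LS.
phi <- function(u) sqrt(12) * (u - 0.5)          # Wilcoxon score function
irls.step <- function(X, y, beta0) {
  e <- as.vector(y - X %*% beta0)
  n <- length(e)
  m <- median(e)
  w <- phi(rank(e)/(n + 1)) / (e - m)            # weights w_it(beta0)
  w[!is.finite(w)] <- 0                          # w = 0 when e_it = m
  # minimizing sum w (y - X beta - m)^2 is a weighted regression of y - m on X
  lm.wfit(x = X, y = y - m, w = w)$coefficients  # next iterate
}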
R
After some algebraic simplification, we obtain the gradient

▽D∗R(β | β̂R^(k)) = −2 Σ_{i=1}^{m} Dᵢᵀ V̂ᵢ^{−1/2} Ŵᵢ V̂ᵢ^{−1/2} [ Yᵢ − a′(θ) − m(β̂R^(k)) ] ,
where ϕ†i denotes the ni × 1 vector (ϕ[R(e†i1 )/(n + 1)], . . . , ϕ[R(e†ini )/(n + 1)])T .
The covariance structure suggests a simple moment estimator. Let β̂^(k) and V̂ᵢ^(k) denote the final estimates of β and Vᵢ (for the ith subject), respectively. Then the residuals which estimate e†ᵢ ≡ (e†ᵢ1, . . . , e†ᵢnᵢ)ᵀ are given by

ê†ᵢ = [V̂ᵢ^(k)]^{−1/2} (Yᵢ − Ĝᵢ^(k)(β̂^(k))) ,   i = 1, . . . , m ,   (5.5.16)

where Ĝᵢ^(k) = [V̂ᵢ^(k)]^{−1/2} a′(θ̂^(k)) and θ̂it^(k) = h(xitᵀ β̂^(k)). Let R(ê†it) denote the rank of ê†it among these residuals.
where the second integral is derived from the first by integration by parts
followed by a substitution. The parameter τϕ is of course the usual scale pa-
rameter for the R estimates in the linear model based on the score function
ϕ(u). The second part of the approximation is to replace the weight matrix by (1/τ̂ϕ)I. We label this estimator of the covariance matrix of β̂^(k) by (AP).
Empirical Efficiency
Norm 0.974 0.974 0.972 0.973
CN 2.065 2.102 2.050 2.055
The response of interest here is C-reactive protein (CRP). Elevated CRP levels are a marker of low-
grade chronic inflammation and may predict a higher risk for cardiovascular
disease (Ridker et al., 2002). The effect of interest is the difference in CRP
between the two groups, which we denote by θ. Hence, a one-sided hypothesis
of interest is
H0 : θ ≥ 0 versus HA : θ < 0. (5.5.19)
Out of the 21 subjects in the study, 3 were removed due to noncompliance
or incomplete information. Thus, we consider the remaining 18 individuals, 9
in each group. CRP level was obtained 24 hours and immediately prior to the
acute bout of exercise and subsequently 24, 72, and 120 hours following exer-
cise giving 90 data points in all. For the reader’s convenience, the CRP data
are displayed at the url listed in the Preface. The top left comparison boxplot
of Figure 5.5.1 shows the effect based on the raw responses. An estimate of
the effect based on the raw data is the difference in medians, which is −0.54. Note
that the responses are skewed with outliers in each group. We took the time
of measurement as a covariate. Let yi and xi denote respectively the 5 × 1
vectors of observations and times of measurements for subject i and let ci
denote his/her indicator variable for Group, i.e., its components are either 0
(for Moderate Fitness) or 1 (for High Fitness). Then our model is
where ei denotes the vector of errors for the ith individual. We present the
results for three covariance structures of ei : working independence (WI), com-
pound symmetry (CS), and autoregressive-one (AR(1)). We fit the GEEWR
estimate for each of these covariance structures using Wilcoxon scores.
Figure 5.5.1: Comparison boxplots of the CRP data and of the residuals from the fits; the bottom panels display the residual and q−q plots of the CS fit.
The error model for compound symmetry is the simple mixed model; i.e.,
ei = bi 1ni + ai , where bi is the random effect for subject i and the components
of ai are iid and independent from bi . Let σb2 and σa2 denote the variances of bi
and aij , respectively. Let σt2 = σb2 +σa2 denote the total variance and ρ = σb2 /σt2
denote the intraclass coefficient. In this case, the covariance matrix of ei is of
the form σt2 [(1−ρ)I+ρJ]. We estimated these variance component parameters
σt2 and ρ at each step of the fit of Model (5.5.20) using the robust estimators
discussed in Section 5.3.1.
The error model for the AR(1) is eij = ρ1 ei,j−1 + aij , j = 2, . . . , ni, where the aij's are iid, for the ith subject. The (s, t) entry in the covariance matrix of ei is κρ1^{|s−t|}, where κ = σa²/(1 − ρ1²). To estimate the covariance structure at
step k, for each subject, we model this autoregressive model using the current
residuals. For each subject, we then estimate ρ1 , using the Wilcoxon regression
estimate of Chapter 3; see, also, Section 5.6 on time series. As our estimate of
ρ1 , we take the median over subjects of these Wilcoxon regression estimates.
Likewise, as our estimate of σa2 we took the median over subjects of MAD2 of
the residuals based on the AR(1) fits.
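In outline, this estimation step looks as follows in R, with lm standing in for the Wilcoxon regression estimate of Chapter 3 and res a hypothetical list holding each subject's current residuals.

# Sketch of the per-subject AR(1) estimates described above.
lag1.slope <- function(e) coef(lm(e[-1] ~ 0 + e[-length(e)]))[1]
rho1  <- median(sapply(res, lag1.slope))          # median over subjects
sig2a <- median(sapply(res, function(e)
           mad(e[-1] - rho1 * e[-length(e)])^2))  # MAD^2 of AR(1) residuals
kappa <- sig2a / (1 - rho1^2)
V     <- kappa * rho1^abs(outer(1:5, 1:5, "-"))   # 5 x 5 AR(1) covariance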
Note that there are only 18 experimental units in this problem, nine for
each treatment. So it is a small sample problem. Accordingly, we used a boot-
strap to standardize the GEEWR estimates. Our bootstrap consisted of re-
sampling the 18 experimental units, nine from each group. This keeps the
covariance structure intact. Then for each bootstrap sample, the GEEWR es-
timate was computed and recorded. We used 3000 bootstrap samples. With
these small samples, the outliers had an effect on the bootstrap, also. Hence,
we used the MAD of the bootstrap estimates of θ as our standard error of θ̂.
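The structure of this bootstrap is sketched below; fit.theta is a hypothetical function returning θ̂ from a data frame dat with subject identifier id and group indicator grp.

# Sketch of the subject-level bootstrap: resample nine subjects per group
# with replacement and use the MAD of the estimates as a standard error.
boot.se <- function(dat, fit.theta, B = 3000) {
  info <- unique(dat[, c("id", "grp")])
  ids  <- split(info$id, info$grp)              # subject ids by group
  ests <- replicate(B, {
    pick <- unlist(lapply(ids, sample, replace = TRUE))
    star <- do.call(rbind, lapply(seq_along(pick), function(i) {
      d <- dat[dat$id == pick[i], ]
      d$id <- i                                  # keep resampled ids distinct
      d
    }))
    fit.theta(star)
  })
  mad(ests)                                      # robust bootstrap SE
}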
Table 5.5.2 summarizes the three GEEWR estimates of θ and β, along
with the estimates of the variance components for the CS and AR(1) models.
As the comparison boxplot of residuals shows in Figure 5.5.1, the three fits are
similar. The WI and AR(1) estimates of the effect θ are quite similar, including
their bootstrap standard errors. The CS estimate of θ, though, is more precise
and it is closer to the difference (based on the raw data) in medians −0.54.
The traditional fit of the simple mixed model (under CS covariance structure),
would be the maximum likelihood fit based on normality. We obtained this fit
by using the lme function in R. Its estimate of θ is −0.319 with standard error
0.297. For the hypotheses of interest (5.5.19), based on asymptotic normality,
the CS GEEWR estimate is marginally significant with p = 0.064, while the
mle estimate is insignificant with p = 0.141.
Note that the residual and q − q plots of the CS GEEWR fit, bottom
plots of Figure 5.5.1, show that the error distribution is right skewed with
a heavy right tail. This suggests using scores more appropriate for skewed
error distributions than the Wilcoxon scores. We considered a simple score
from the class of Winsorized Wilcoxon scores. The Wilcoxon score function
is linear. For this data, a suitable Winsorizing score function is the piecewise
linear function, which is linear on the interval (0, c) and then constant on the
interval (c, 1). As discussed in Example 2.5.1 of Chapter 2, these scores are
optimal for a skewed distribution with a logistic left tail and an exponential
right tail. We obtained the GEEWR fit of this data using this score function
with c = 0.80, i.e., the bend is at 0.80. To insure positive weights, we used the
47th percentile as the location estimator m(β) in the definition of the weights;
see the discussion around expression (5.5.10). The computed estimates and
their bootstrap standard errors are given in the last row of Table 5.5.2 for the
compound symmetry case. The estimate of θ is −0.442 which is closer than the
Wilcoxon estimate to the difference in medians based on the raw data. Using
the bootstrap standard error, the corresponding z-test for hypotheses (5.5.19)
is −1.57 with the p-value of 0.059, which is more significant than the test
based on Wilcoxon scores. Computationally, the iterated reweighted GEEWR
algorithm remains the same except that the Wilcoxon scores are replaced by
these Winsorized Wilcoxon scores.
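For reference, this bent score function takes two lines of R; the centering constant, which makes the scores integrate to zero, is ours.

# Sketch of the Winsorized Wilcoxon (bent) score function: linear on
# (0, c), constant on (c, 1), centered so that its integral is zero.
phi.bent <- function(u, c = 0.80) pmin(u, c) - (c - c^2/2)
curve(phi.bent(x), 0, 1, xlab = "u", ylab = "phi(u)")   # bend at u = 0.80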
As a final note, the residual plot of the GEEWR fit for the compound
symmetric dependence structure also shows some heteroscedasticity. The vari-
ability of the residuals is directly proportional to the fitted values. This scalar
trend can be modeled robustly using the rank-based procedures discussed in
Exercise 3.15.39.
where R(Xi − Y′i−1φ) denotes the rank of Xi − Y′i−1φ among X1 − Y′0φ, . . . , Xn − Y′n−1φ. Koul and Saleh (1993) developed the asymptotic theory for these rank-based estimates. As we note in the next paragraph, though,
ory for these rank-based estimates. As we note in the next paragraph, though,
because of the autoregressive model, error distributions with even moderately
heavy tails produce outliers in factor space (points of high leverage). With
this in mind, the high breakdown weighted-Wilcoxon estimates discussed in
Section 3.12 seem more appropriate. The asymptotic theory for these weighted
Wilcoxon estimates was developed by Terpstra, McKean, and Naranjo (2000,
2001). For an account of M estimates and GM estimates for the autoregressive
model see Bustos (1982), Martin and Yohai (1991), and Rousseeuw and Leroy
(1987).
Suppose the random errors of Model (5.6.1) have a heavy-tailed distri-
bution. In this case, by the nature of the model, outlying errors (outliers in
response space) become also errors in factor space. For instance, if, at time
i, ei is an outlier then by the model Xi is an outlier in response space but,
at time i + 1, Xi appears in the design matrix and hence is also an outlier in
factor space. Since the outlier becomes incorporated into the model, outliers
of this form are generally “good” points of high leverage; see, e.g., page 275
of Rousseeuw and Leroy (1987). These are called innovative outliers (IO).
Another class of outliers, additive outliers (AO), occur frequently in time
series data; see Fox (1972) for discussion of both AO and IO types of outliers
(he labeled them as Type I and Type II, respectively). One way of modeling
AO and IO types of outliers is with a simple mixture distribution. Suppose
Xi follows Model (5.6.1) but we observe instead Xi∗ where
Xi∗ = Xi + νi ; i = 1, 2, . . . , n (5.6.5)
where the weights b̂ij are the robust weights given by expression (3.12.3).
Recall these weights downweight “bad” points of high leverage but not “good”
points of high leverage. Hence, based on the discussion of AO and IO outliers,
the HBR estimates seem appropriate for the autoregressive model. Other types
of weights, for autoregressive models, including the GR weights discussed in
Chapter 3, are presented in Terpstra et al. (2001).
The R collection of functions ww developed by Terpstra and McKean (2005)
can be used to compute the HBR estimates with the weights (3.12.3). In their
accompanying article, Terpstra and McKean discuss the application of ww to
an autoregressive time series data set. The package ww also computes the diagnostics TDBETAS(W, HBR), (3.13.10), and CFITSi(W, HBR), (3.13.12), discussed in Chapter 3, which are used to compare the Wilcoxon and HBR
fits. We used ww for the computation in this section.
theory. The results are very close to those for the HBR estimates for the lin-
ear model of Chapter 3 and we will point out the similarities. The proofs,
though, use results from stochastic processes and are different; see Terpstra et
al. (2000, 2001) for the details of the proofs. Let bij = b (Yi−1 , Yj−1, bei , b
ej ) de-
note the weight function, which is similar to its definition in the linear model
case; see expression (3.12.13). Define the terms,
Z ∞
γF (Y0 , Y1 ) = b (Y0 , t, Y1 , t) f (t) dF (t) ,
−∞
Z ∞
B(u1 , u2 , e) = sgn(s − e)b(u1 , u2 , e, s) dF (s), (5.6.7)
−∞
Z ∞
AF (Y0 , Y1 , Y2 ) = B(Y0 , Y1 , e)B(Y0 , Y2 , e) dF (e).
−∞
Next, define
Z
1
CF = (y1 − y0 )γF (y0 , y1 )(y1 − y0 )′ dG(y0 )dG(y1 ) (5.6.8)
2
and
Z
ΣF = (y1 − y0 )AF (y0 , y1 , y2 )((y2 − y0 )′ dG(y0 )dG(y1 )dG(y2 ), (5.6.9)
Note how similar the results of (2) and (3) are to Theorems 3.12.1 and
3.12.2, respectively. Terpstra et al. (2000) developed consistent, method of
moment type of estimators for CF and ΣF . These estimates are essentially
the same as the estimates discussed in Section 3.12.6 for the HBR estimates.
For inference, we recommend that the estimates discussed in Chapter 3 be
used. Hence, we suggest using K̂HBR of expression (3.12.33) for the estimate of the asymptotic variance-covariance matrix of φ̂n.
As in the linear model case, the intercept parameter φ0 cannot be estimated
directly by minimizing the rank-based pseudo-norms. As in Chapter 3, a robust
estimate of the intercept is the median of the residuals. More specifically, define
the initial residuals as follows,
êi = Xi − Y′i−1 φ̂n ,   i = 1, 2, . . . , n .
where K̂HBR, (3.12.33), is the estimate of the variance-covariance matrix of φ̂HBR.
For efficiency results, consider the sequence of local alternatives Hn : Mφ = n^{−1/2}ζ, where ζ ≠ 0. The following theorem follows easily from Theorem 5.6.1;
see Exercise 5.7.7.
(0) Set p = P , i = 1.
(2) Let φ2,i = (φp−i+1, . . . , φp )′ . Then use the Wald test procedure to test
H0 : φ2,i = 0 versus HA : φ2,i ≠ 0.
(3) If H0 is rejected then stop and declare p to be the order; otherwise, set
p = p − 1 and i = i + 1 and go to (1).
Graybill (1976) discusses an algorithm similar to the one in the last ex-
ample for selecting the order of a polynomial. Terpstra and McKean (2005)
discuss this algorithm for rank-based methods for polynomial models. In a
small simulation study, the algorithm was successful in determining the order
of the polynomial.
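A sketch of the step-down procedure is given below, with an LS fit of the AR(P) model standing in for the rank-based fits discussed in the text; at stage i the Wald statistic tests the trailing i coefficients.

# Sketch of the step-down order-selection algorithm (LS stand-in).
select.order <- function(x, P = 5, alpha = 0.05) {
  Z   <- embed(x, P + 1)              # column 1: x_t; columns 2..P+1: lags
  fit <- lm(Z[, 1] ~ Z[, -1])
  phi <- coef(fit)[-1]                # phi_1, ..., phi_P
  V   <- vcov(fit)[-1, -1]
  for (i in 1:P) {                    # test phi_{P-i+1}, ..., phi_P = 0
    idx <- (P - i + 1):P
    W   <- t(phi[idx]) %*% solve(V[idx, idx]) %*% phi[idx]   # Wald statistic
    if (W > qchisq(1 - alpha, df = i)) return(P - i + 1)     # declare order
  }
  0                                   # no coefficient significant
}
set.seed(1)
select.order(arima.sim(list(ar = c(0.5, -0.3)), n = 200), P = 5)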
The algorithm selected p = 2 for both fits. The reader is asked to run this
algorithm for the HBR fit in Exercise 5.7.8.
Table 5.6.1 displays the estimates along with standard errors for the LS,
Wilcoxon, and HBR fits. Notice that the HBR fit differs from the LS fit. This
is clear from the top right plot of Figure 5.6.1 which shows the data overlaid
with the LS (dashed line) and HBR (solid line) fits. Both large outliers were
omitted from this plot to improve the resolution. The HBR fit hugs the data
much better than the LS fit. The HBR shows a negative estimate of φ2 while
the LS estimate is positive. In terms of inference, the HBR estimates of both
orders are highly significant. For LS, only the estimate of φ1 is significant. The
outliers have impaired the LS fit and its associated inference. The diagnostic
TDBETA between the HBR and LS fits is 258, well beyond the benchmark of
0.48.
The HBR fit differs also from the Wilcoxon fit. The diagnostic for it is
TDBETA = 233. As the casewise plot of Figure 5.6.1 shows, the two fits differ
at many cases. The Wilcoxon fit differs some from the LS fit (TDBETA =
11.7). The final plot of Figure 5.6.1 shows the Studentized residuals of the
HBR fit versus Cases. The two large outliers are clear, along with a few others.
But the remainder of the points fall within the benchmarks. For this data, the
HBR fit performed the best.
The Studentized residuals discussed in the last example were those dis-
cussed in Chapter 3 for the HBR fit, expression (3.12.41); see Terpstra, Mc-
Kean, and Anderson (2003) for Studentized residuals for the traditional and
robust fits for the AR(1) model.
Table 5.6.1: LS, Wilcoxon and HBR Estimates and Standard Errors for the
Residential Data of Example 5.6.2
Procedure    φ̂1      s.e.(φ̂1)    φ̂2       s.e.(φ̂2)
LS           0.473    0.116        0.166     0.116
Wil          0.503    0.029       −0.151     0.029
HBR          0.413    0.069       −0.290     0.076
Figure 5.6.1: Plots for Example 5.6.2: seasonally adjusted data versus lag-2 data; seasonally adjusted data with the LS and HBR fits (the two large outliers omitted); CFIT between the HBR and Wilcoxon fits by case (TD = 233); HBR Studentized residuals by case.
time; the A phase of the responses for the subject falls before the intervention
while the B phase of his/her responses falls after the intervention. A common
design matrix for this experiment is a first order design allowing for differ-
ent intercepts and slopes in the phases; see Huitema and McKean (2000) for
discussion. Since the data are collected over time on the same subject, an au-
toregressive model for the random errors is often assumed. The general mixed
model of Section 5.2 is also of this type when the observations in a cluster are
taken over time, such as in a repeated measures design. In such cases, we may
want to model the random errors with an autoregressive model. These types of
models differ from the autoregressive model (5.6.1) discussed at the beginning
of this section in two aspects: firstly, the parameters of interest are those of
the linear model not those of the time series and, secondly, the series is often
quite short. Type AB intervention experiments on a single subject may only
be of length five for each phase. Likewise in a repeated measures design there
may be just a few measurements per subject.
For discussion, suppose the data (Y1, x1), . . . , (Yn, xn) follow the linear model

Yi = x′iβ + ei ,   ei = φ1ei−1 + · · · + φkei−k + ai ,   (5.6.12)

where xi is a p × 1 vector of covariates for the ith response, k is the order of the
autoregressive model, and the ai ’s are iid (white noise). For many real data
situations k is quite small, often k = 1, i.e., an AR(1). One way of proceeding
is to fit the linear model, (5.6.12), and obtain the residuals from the fit. For
our discussion, assume a robust fit is used, say, the Wilcoxon fit. Let b ei denote
the residuals based on this fit. In practice, diagnostics are then run on these
residuals examining them for time series trends. If the check is negative then
usually one proceeds with the linear model analysis. If it is positive then other
fitting methods are used. We discuss these two aspects from a robust point of
view.
A simple diagnostic plot for systematic dependence consists of the resid-
uals versus time order. There are general tests for dependence, including the
nonparametric runs tests. For this test, runs of positive and negative residuals
(in time order) are obtained and measured against what is expected under
independence. Huitema and McKean (1996), though, found that the runs test
based on residuals had very poor small sample properties for the AB interven-
tion designs that they considered. On the other hand, diagnostic tests designed
for specific dependent alternatives, such as the Durbin-Watson test, were valid.
With the autoregressive errors in mind, there are specific diagnostic tools
to use on the residuals. Simple diagnostic plots, lag plots, consist of the
scatter plots of êi versus êi−j , j = 1, . . . , k. Linear patterns are indicative of
an autoregressive model. For traditional methods, a common test for an AR(1)
model on the errors of a linear model is based on the Durbin-Watson statistic
given by

d = Σ_{t=2}^{n} (ẽt − ẽt−1)² / Σ_{t=1}^{n} ẽt² = 2 − (ẽ1² + ẽn²)/Σ_{t=1}^{n} ẽt² − 2r1 ,   (5.6.14)

where

r1 = Σ_{t=2}^{n} ẽt ẽt−1 / Σ_{t=1}^{n} ẽt²   (5.6.15)

and ẽt denotes the tth LS residual. The null (errors are iid) distribution depends on the design matrix, so often approximate critical values are used. By the far right expression of (5.6.14), the statistic d is a function of r1. This suggests another test statistic based on r1 given by

h = ( r1 + (p + 1)/n ) / √( (n − 2)²/[(n − 1)n²] ) ;   (5.6.16)
see Huitema and McKean (2000). The associated test is based on standard
normal critical values. In the article by Huitema and McKean, this test per-
formed as well as the Durbin-Watson tests in terms of power and validity over
AB type designs. The factor (p + 1)/n in the formula for h is a bias correction.
Provided an AR(1) is used to model the errors, r1 is the LS estimate of φ1 ;
however, in the test statistic h it is standardized under the null hypothesis
of independence. This suggests using the robust analog; i.e., using the HBR
estimate of the AR(1) model (based on the R residuals), with standardization
under the null hypothesis, as a diagnostic test statistic.
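Computationally, r1 and h are one-liners; the sketch below computes them in R from any vector of residuals e and the number of regression parameters p.

# Sketch: the lag-one statistic (5.6.15) and the h test statistic (5.6.16).
h.stat <- function(e, p) {
  n  <- length(e)
  r1 <- sum(e[-1] * e[-n]) / sum(e^2)
  h  <- (r1 + (p + 1)/n) / sqrt((n - 2)^2 / ((n - 1) * n^2))
  c(r1 = r1, h = h, p.value = 2 * pnorm(-abs(h)))  # standard normal test
}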
If dependence is diagnosed, there are several traditional fitting procedures
to fit the linear model. Several methods make use of transformations based
on estimates of the dependent structure. The reader is cautioned, though,
because this can lead to very liberal inference; see, for instance, the study by
Huitema et al. (1999). The problem appears to be the bias in the estimates.
McKnight et al. (2000) developed a double bootstrap procedure based on a
two-stage Durbin type approach (Chapter 9 of Fuller, 1996), for autoregressive
errors. The first bootstrap corrects the bias of the estimates of the
autocorrelation coefficients while the second bootstrap yields a valid inference
for the regression parameters of the linear model, (5.6.12). Robust analogs of
these traditional methods are currently being investigated.
5.7 Exercises
5.7.1. Assume the simple mixed model (5.3.1). Show that expression (5.3.2)
is true.
5.7.2. Obtain the ARE between the R and traditional estimates found in
expression (5.3.4), for Wilcoxon scores when the random error vector has a
multivariate normal distribution.
5.7.3. Show that the asymptotic distribution of the LS estimator for the
Arnold transformed model is given by expression (5.4.8).
5.7.4. Consider Example 5.4.1.
(a) Verify the ATR and ATLS estimates in Table 5.4.2.
(b) Over the range of ∆y used in the example, verify the relative changes in
the ATR and ATLS estimates as shown in the example.
5.7.5. Consider the discussion of test statistics around expression (5.2.13).
Explore the asymptotic distributions of the drop in dispersion and aligned
rank test statistics under the null and contiguous alternatives for the general
mixed model.
5.7.6. Continuing with the last exercise, suppose that the simple mixed model
(5.3.1) is true. Suppose further that the design is centered within each block;
i.e., X′k 1nk = 0p . For example, this is true for an ANOVA design in which
all subjects have all treatment combinations such as the Plasma Example of
Section 4.
(d) Show that FW,ϕ , (5.2.13), and FRD,ϕ are asymptotically equivalent under
the null and local alternative models.
(e) Explore the asymptotic distribution of the aligned rank test under the
conditions of this exercise.
(b) Replace the two large outliers in the data set with their predicted HBR
fits. Run the Wilcoxon and HBR fits of the changed data set. Obtain
the diagnostics TDBETA and CFIT.
Chapter 6
Multivariate
g(‖x‖), where ‖x‖ = (xᵀx)^{1/2} is the Euclidean norm of x. The contours of the density are circular.

3. In an elliptical model the pdf has the form |det Σ|^{−1/2} g(xᵀΣ⁻¹x), where det denotes determinant and Σ is a symmetric, positive definite matrix. The contours of the density are ellipses.
Note that elliptical symmetry implies symmetry which in turn implies di-
rectional symmetry. In an elliptical model, the contours of the density are
elliptical and if Σ is the identity matrix then we have a spherically symmet-
ric distribution. An elliptical distribution can be transformed into a spherical
one by a transformation of the form Y = DX where D is a nonsingular ma-
trix. Along with various models, we encounter various transformations in this
chapter. The following definition summarizes the transformations.
where S is the sample covariance matrix. In Exercise 6.8.1, the reader is asked
to show that the vector of sample means is affine equivariant and Hotelling’s
T 2 test statistic is affine invariant.
As in the earlier chapters, we begin with a criterion function or with a set
of estimating equations. To fix the ideas, suppose that we wish to estimate
θ or test the hypothesis H0 : θ = 0 and we are given a pair of estimating
equations:
S(θ) = ( S1(θ) , S2(θ) )′ = 0 ;   (6.1.1)
Definition 6.1.3. We say that the multivariate process S(θ) is Pitman Regular if the following conditions hold:

(b) E0(S(0)) = 0.

(c) (1/√n) S(0) →D0 Z ∼ N2(0, A).

(d) sup_{‖b‖≤B} ‖ (1/√n) S(n^{−1/2}b) − (1/√n) S(0) + Bb ‖ →P 0 .
The matrix A in (c) is the asymptotic covariance matrix of (1/√n)S(0) and the matrix B in (d) can be computed in various ways, depending on when differentiation and expectation can be interchanged. We list the various computations of B for completeness. Note that ▽ denotes differentiation with respect to the components of θ:

B = −E0[ n⁻¹ ▽ S(θ) |θ=0 ]
  = n⁻¹ ▽ Eθ S(0) |θ=0
  = E0[ (−▽ log f(X)) Ψᵀ(X) ] ,   (6.1.2)

where ▽ log f(X) denotes the vector of partial derivatives of log f(X) and Ψ(·) is such that

(1/√n) S(θ) = (1/√n) Σ_{i=1}^{n} Ψ(Xi − θ) + op(1) .
The first criterion function generates the vector of means, the L2 or least
squares estimates. The other two criterion functions generate different versions
of what may be considered L1 estimates or bivariate medians. The two types
of medians differ in their equivariance properties. See Small (1990) for an ex-
cellent review of multidimensional medians. The vector of means is equivariant
under affine transformations of the data; see Exercise 6.8.1. In each of these
criterion functions we have pushed the square root operation deeper into the
expression. As we see, this produces very different types of estimates. We now
take the gradients of these criterion functions and display the corresponding
estimating functions. The computation of these gradients is given in Exercise
6.8.2.
S1(θ) = [D1(θ)]⁻¹ ( Σ(xi1 − θ1) , Σ(xi2 − θ2) )′   (6.1.6)

S2(θ) = Σ_{i=1}^{n} ‖xi − θ‖⁻¹ ( xi1 − θ1 , xi2 − θ2 )′   (6.1.7)

S3(θ) = ( Σ sgn(xi1 − θ1) , Σ sgn(xi2 − θ2) )′ .   (6.1.8)
In (6.1.8) if the vector is zero, then we take the term in the summation to
be zero also. In Exercise 6.8.3 the reader is asked to verify that S2 (θ) = S3 (θ)
in the univariate case; hence, we already see something new in the structure of
the bivariate location model over the univariate location model. On the other
hand, S1 (θ) and S3 (θ) are componentwise equations unlike S2 (θ) in which the
two components are entangled. The solution to (6.1.8) is the vector of medians,
and the solution to (6.1.7) is the spatial median which is discussed in Section
6.3. We begin with an analysis of componentwise estimating equations and
then consider other types.
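The spatial median of (6.1.7) is easily computed by the classical Weiszfeld iteration, sketched here in R (our code, not the authors'); each step is a weighted mean with weights 1/‖xi − θ‖.

# Sketch: the spatial median (solution of (6.1.7)) by Weiszfeld iteration.
spatial.median <- function(X, tol = 1e-8, maxit = 200) {
  theta <- colMeans(X)                           # starting value
  for (it in 1:maxit) {
    d <- sqrt(rowSums(sweep(X, 2, theta)^2))     # distances ||x_i - theta||
    d <- pmax(d, tol)                            # guard against division by 0
    theta1 <- colSums(X / d) / sum(1 / d)        # weighted mean, weights 1/d
    if (sum(abs(theta1 - theta)) < tol) break
    theta <- theta1
  }
  theta
}
set.seed(1)
spatial.median(cbind(rnorm(50), rnorm(50)))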
Sections 6.2.3 through 6.4.4 deal with one-sample estimates and tests based on vector signs and ranks. Both rotational and affine invariant/equivariant methods are developed. Two- and several-sample models are treated in Section 6.6 as examples of location models; there we are primarily concerned with componentwise methods.
Figure 6.2.1: Panel A: Boxplots of the changes in pulmonary function for the cotton dust data (responses standardized by componentwise standard deviations); Panel B: Normal q–q plot for the component CC, original scale.

The q–q plot for the component CC shows outlying values on the right side. The plots (not shown) for the other two
components exhibit no outliers. In the case of the componentwise Wilcoxon
test, Section 6.2.3, we consider $(n+1)S_4(0)$ in (6.2.14) along with $(n+1)^2 A$, essentially in (6.2.15). For this data $(n+1)S_4^T(0) = (-63, -52, 28)$ and

$$(n+1)^2\widehat{A} = \frac{1}{n}\begin{pmatrix} 649 & 620.5 & -260.5 \\ 620.5 & 649.5 & -141.5 \\ -260.5 & -141.5 & 650 \end{pmatrix}.$$

The diagonal elements are $\sum_i R^2(|X_{is}|)$, which should be $\sum_i i^2 = 650$ but differ for the first two components due to ties among the absolute values. The off-diagonal elements are $\sum_i R(|X_{is}|)R(|X_{it}|)\,\mathrm{sgn}(X_{is})\,\mathrm{sgn}(X_{it})$. The test statistic is then $n^{-1} S_4^T(0)\widehat{A}^{-1} S_4(0) = 7.82$. From the $\chi^2(3)$ distribution, the approximate p-value is 0.0498. Hence, the Wilcoxon test rejects the null hypothesis at essentially the same level as Hotelling's $T^2$ test.
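For readers who wish to reproduce such calculations, the following R sketch (ours; the function name is hypothetical) computes the componentwise signed-rank statistic and its quadratic form, assuming the observations are the rows of an n × p matrix X and using midranks for ties:

# Sketch of the componentwise Wilcoxon (signed-rank) test; X is an n x p
# matrix (assumed); ties among absolute values handled by midranks via rank().
comp_wilcoxon_test <- function(X) {
  n  <- nrow(X)
  R  <- apply(abs(X), 2, rank)            # R(|X_is|): ranks of absolute values
  SR <- R * sign(X)                       # signed ranks
  S4 <- colSums(SR) / (n + 1)             # so (n+1) S4 is the integer vector above
  A  <- crossprod(SR) / (n * (n + 1)^2)   # moment estimate; (n+1)^2 A as in text
  stat <- as.numeric(t(S4) %*% solve(A) %*% S4) / n
  c(statistic = stat, p.value = 1 - pchisq(stat, df = ncol(X)))
}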
In the construction of tests we generally must estimate the matrix A.
When testing H0 : θ = 0 the question arises as to whether or not we should
center the data using $\hat{\theta}$. If we do not center then we are using a reduced model estimate of A; otherwise, it is a full model estimate. Reduced model estimates are generally used in randomization tests; in that case $\widehat{A}$ need only be computed once in the process of randomizing and recomputing the test statistic $n^{-1} S^T \widehat{A}^{-1} S$. Note also that when $H_0: \theta = 0$ is true, $\hat{\theta} \xrightarrow{P} 0$.
6.2.1 Estimation
Let θ = (θ1 , θ2 )T denote the true vector of location parameters. Then, when
(6.1.2) holds, the asymptotic covariance matrix in Theorem 6.1.2 is
$$B^{-1}AB^{-1} = \begin{pmatrix} \dfrac{E\psi^2(X_{11}-\theta_1)}{[E\psi'(X_{11}-\theta_1)]^2} & \dfrac{E\psi(X_{11}-\theta_1)\psi(X_{12}-\theta_2)}{E\psi'(X_{11}-\theta_1)\,E\psi'(X_{12}-\theta_2)} \\[2ex] \dfrac{E\psi(X_{11}-\theta_1)\psi(X_{12}-\theta_2)}{E\psi'(X_{11}-\theta_1)\,E\psi'(X_{12}-\theta_2)} & \dfrac{E\psi^2(X_{12}-\theta_2)}{[E\psi'(X_{12}-\theta_2)]^2} \end{pmatrix}. \quad (6.2.4)$$
where f1 and f2 denote the marginal pdfs of the joint pdf f (s, t) and θ1 and θ2
denote the componentwise medians. In applications, the estimate of θ is the
vector of componentwise sample medians, which we denote by $(\hat{\theta}_1, \hat{\theta}_2)^T$. For inference, an estimate of the asymptotic covariance matrix (6.2.5) is required. An estimate of $E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)$ is the simple moment estimator $n^{-1}\sum \mathrm{sgn}(x_{i1}-\hat{\theta}_1)\,\mathrm{sgn}(x_{i2}-\hat{\theta}_2)$. The estimators discussed in Section 1.5.5, (1.5.29), can be used to estimate the scale parameters $1/2f_1(0)$ and $1/2f_2(0)$.
We now turn to the efficiency of the vector of sample medians with respect
to the vector of sample means. Assume for each component that the median
and mean are the same and that without loss of generality their common value
is 0. Let $\delta = \det(B^{-1}AB^{-1}) = \det(A)/[\det(B)]^2$ be the Wilks generalized variance. For the vector of sample medians we obtain

$$\delta = \frac{1 - (E\,\mathrm{sgn}X_{11}\,\mathrm{sgn}X_{12})^2}{16 f_1^2(0) f_2^2(0)}$$
and the efficiency of the vector of medians with respect to the vector of means is given by

$$e(\mathrm{med},\mathrm{mean}) = 4\sigma_1\sigma_2 f_1(0) f_2(0)\sqrt{\frac{1-\rho^2}{1 - [E\,\mathrm{sgn}X_{11}\,\mathrm{sgn}X_{12}]^2}}\,. \quad (6.2.6)$$
Note that $E\,\mathrm{sgn}X_{11}\,\mathrm{sgn}X_{12} = 4P(X_{11} < 0, X_{12} < 0) - 1$. When the underlying distribution is bivariate normal with means 0, variances 1, and correlation ρ, Exercise 6.8.4 shows that

$$P(X_{11} < 0, X_{12} < 0) = \frac{1}{4} + \frac{1}{2\pi}\sin^{-1}\rho\,. \quad (6.2.7)$$

Further, the marginal distributions are standard normal; hence, (6.2.6) becomes

$$e(\mathrm{med},\mathrm{mean}) = \frac{2}{\pi}\sqrt{\frac{1-\rho^2}{1 - [(2/\pi)\sin^{-1}\rho]^2}}\,. \quad (6.2.8)$$
The first factor, $2/\pi \approx .637$, is the univariate efficiency of the median relative
to the mean when the underlying distribution is normal and also the efficiency
of the vector of medians relative to the vector of means when the correlation
in the underlying model is zero. The second factor accounts for the bivariate
structure of the model and, in general, depends on the correlation ρ. Some
values of the efficiency are given in Table 6.2.1.
Clearly, as the elliptical contours of the underlying normal distribution
flatten out, the efficiency of the vector of medians decreases. This is the first
indication that the vector of medians is not affine (or even rotation) equiv-
ariant. The vector of means is affine equivariant and hence the dependency
of the efficiency on ρ must be due to the vector of medians. Indeed, Exercise
6.8.5 asks the reader to construct an example showing that when the axes are
rotated the vector of means rotates into the new vector of means while the
vector of medians fails to do so.
6.2.2 Testing
We now consider the properties of bivariate tests. Recall that we assume the
underlying bivariate distribution is symmetric. In addition, we would generally
Table 6.2.1: Efficiency (6.2.8) of the Vector of Medians Relative to the Vector
of Means When the Underlying Distribution is Bivariate Normal
ρ 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .99
eff .64 .63 .63 .62 .60 .58 .56 .52 .47 .40 .22
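The entries of Table 6.2.1 follow directly from (6.2.8); as a quick check, a short R computation (using asin for $\sin^{-1}$):

# Reproducing Table 6.2.1 from expression (6.2.8)
e_med_mean <- function(rho)
  (2 / pi) * sqrt((1 - rho^2) / (1 - ((2 / pi) * asin(rho))^2))
rho <- c(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99)
round(e_med_mean(rho), 2)
# 0.64 0.63 0.63 0.62 0.60 0.58 0.56 0.52 0.47 0.40 0.22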
use an odd ψ-function, so that ψ(t) = −ψ(−t). This implies that ψ(t) =
ψ(|t|)sgn(t) which is useful shortly.
Now referring to Theorem 6.1.2 along with the corresponding matrix A, the test of $H_0: \theta = 0$ vs $H_A: \theta \ne 0$ rejects the null hypothesis when $\frac{1}{n}S^T(0)A^{-1}S(0) \ge \chi^2_\alpha(2)$. Note that the covariance term in A is $E\psi(X_{11})\psi(X_{12}) = \iint \psi(s)\psi(t)f(s,t)\,ds\,dt$, and it depends upon the underlying
bivariate distribution f . Hence, even the sign test based on the componentwise
sign statistics S3 (0) is not distribution-free under the null hypothesis as it
is in the univariate case. In this case, Eψ(X11 )ψ(X12 ) = 4P (X11 < 0, X12 <
0) − 1 as we saw in the discussion of estimation.
To make the test operational we must estimate the components of A. Since
they are expectations, we use moment estimates, under the null hypothesis.
Now condition (c) in Definition 6.1.3 guarantees that the test with the esti-
mated A is asymptotically distribution-free since it has a limiting chisquare
distribution, independent of the underlying distribution. What can we say
about finite samples?
First note that

$$S(0) = \begin{pmatrix} \sum \psi(|x_{i1}|)\,\mathrm{sgn}(x_{i1}) \\ \sum \psi(|x_{i2}|)\,\mathrm{sgn}(x_{i2}) \end{pmatrix}. \quad (6.2.9)$$
Under the assumption of symmetry, (x1 , . . . , xn ) is a realization of
(s1 x1 , . . . , sn xn ) where (s1 , . . . , sn ) is a vector of independent random variables
each equalling ±1 with probability 1/2. Hence $Es_i = 0$ and $Es_i^2 = 1$. Con-
ditional on (x1 , . . . , xn ) then, under the null hypothesis, there are 2n equally
likely sign combinations associated with these vectors. Note that the sign
changes attach to the entire vector. From (6.2.9), we see that conditionally,
the scores are not affected by the sign changes and S(0) depends on the sign
changes only through the signs of the components of the observation vectors.
It follows at once that the conditional mean of S(0) under the null hypothesis
is 0. Further the conditional covariance matrix is given by
$$\begin{pmatrix} \sum \psi^2(|x_{i1}|) & \sum \psi(|x_{i1}|)\psi(|x_{i2}|)\,\mathrm{sgn}(x_{i1})\mathrm{sgn}(x_{i2}) \\ \sum \psi(|x_{i1}|)\psi(|x_{i2}|)\,\mathrm{sgn}(x_{i1})\mathrm{sgn}(x_{i2}) & \sum \psi^2(|x_{i2}|) \end{pmatrix}. \quad (6.2.10)$$
Note that conditionally, n−1 times this matrix is an estimate of the matrix A
above. Thus we have a conditionally distribution-free sign change distribution.
For small to moderate n the test statistic (quadratic form) can be computed
for each combination of signs and a conditional p-value of the test is the
number of values (divided by 2n ) of the test statistic at least as large as the
observed value of the test statistic. In the first chapter on univariate methods
this argument also leads to unconditionally distribution-free tests in the case
of the univariate sign and rank tests since in those cases the signs and the
ranks do not depend on the values of the conditioning variables. Again, the
situation is different in the bivariate case due to the matrix A which must
be estimated since it depends on the unknown underlying distribution. In
Exercise 6.8.6 the reader is asked to construct the sign change distributions
for some examples.
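As an illustration of the sign-change recipe just described, here is a minimal R sketch (the helper name is hypothetical), assuming an n × 2 data matrix x and an odd score function psi; it enumerates all $2^n$ sign vectors, so it is feasible only for small n:

# Sketch of the conditional sign-change p-value; x is an n x 2 matrix
# (assumed) and psi an odd score function, e.g. the identity.
signchange_pvalue <- function(x, psi = identity) {
  n  <- nrow(x)
  sc <- psi(abs(x)) * sign(x)            # scores: fixed under vector sign changes
  Ainv <- solve(crossprod(sc) / n)       # (6.2.10) divided by n; fixed over signs
  stat <- function(s) {                  # quadratic form for a sign vector s
    S <- colSums(s * sc)
    as.numeric(t(S) %*% Ainv %*% S) / n
  }
  obs   <- stat(rep(1, n))
  signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))  # all 2^n patterns
  mean(apply(signs, 1, stat) >= obs)     # proportion at least as large
}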
We now turn to a more detailed analysis of the tests based on S1 = S1 (0)
and S3 = S3 (0). Recall that S1 is the vector of sample means. The matrix
A is the covariance matrix of the underlying distribution and we take the
sample covariance matrix as the natural estimate. The resulting test statistic is $n\bar{X}^T\widehat{A}^{-1}\bar{X}$, which is Hotelling's $T^2$ statistic. Note that for $T^2$ we typically use a centered estimate of A; if we want the randomization distribution, we use the uncentered estimate. Since $BA^{-1}B = \Sigma_f^{-1}$, where $\Sigma_f$ is the covariance matrix of the underlying distribution, the asymptotic noncentrality parameter for Hotelling's test is $\gamma^T \Sigma_f^{-1}\gamma$. The vector $S_3$ is the vector of component sign
statistics. By inverting (6.2.5) we can write down the noncentrality parameter
for the bivariate componentwise sign test.
To illustrate the efficiency of the bivariate sign test relative to Hotelling’s
test we simplify the structure as follows: assume that the marginal distribu-
tions are identical. Let ξ = 4P (X11 < 0, X12 < 0) − 1 and let ρ denote the
underlying correlation, as usual. Then Hotelling's noncentrality parameter is

$$\frac{1}{\sigma^2(1-\rho^2)}\,\gamma^T\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}\gamma = \frac{\gamma_1^2 - 2\rho\gamma_1\gamma_2 + \gamma_2^2}{\sigma^2(1-\rho^2)}\,. \quad (6.2.11)$$

Likewise the noncentrality parameter for the bivariate sign test is

$$\frac{4f^2(0)}{1-\xi^2}\,\gamma^T\begin{pmatrix} 1 & -\xi \\ -\xi & 1 \end{pmatrix}\gamma = \frac{4f^2(0)(\gamma_1^2 - 2\xi\gamma_1\gamma_2 + \gamma_2^2)}{1-\xi^2}\,. \quad (6.2.12)$$

The efficiency of the bivariate sign test relative to Hotelling's test is the ratio of their respective noncentrality parameters:

$$\frac{4f^2(0)\sigma^2(1-\rho^2)(\gamma_1^2 - 2\xi\gamma_1\gamma_2 + \gamma_2^2)}{(1-\xi^2)(\gamma_1^2 - 2\rho\gamma_1\gamma_2 + \gamma_2^2)}\,. \quad (6.2.13)$$
There are three contributing factors in this efficiency: $4f^2(0)\sigma^2$, which is the univariate efficiency of the sign test relative to the t-test; $(1-\rho^2)/(1-\xi^2)$, due to the dependence structure in the bivariate distribution; and the final factor, which depends on the direction determined by γ.
Table 6.2.2: Minimum and Maximum Efficiencies of the Bivariate Sign Test
Relative to Hotelling’s T 2 When the Underlying Distribution is Bivariate Nor-
mal
ρ 0 .2 .4 .6 .8 .9 .99
min .64 .58 .52 .43 .31 .22 .07
max .64 .68 .71 .72 .72 .71 .66
In Table 6.2.2 we give some values of the maximum and minimum efficien-
cies when the underlying distribution is bivariate normal with means 0, vari-
ances 1, and correlation ρ. This table can be compared to Table 6.2.1 which
contains the corresponding estimation efficiencies. We have $f^2(0) = (2\pi)^{-1}$ and $\xi = (2/\pi)\sin^{-1}\rho$. Hence, the dependence of the efficiency on the direction
determined by γ is apparent. The examples involving the bivariate normal
distribution also show the superiority of the vector of means over the vec-
tor of medians and Hotelling’s test over the bivariate sign test as expected.
Bickel (1964, 1965) gives a more thorough analysis of the efficiency for general
models. He points out that when heavy-tailed models are expected then the
medians and sign test are much better provided ρ is not too close to ±1.
In the exercises the reader is asked to show that Hotelling’s T 2 statistic is
affine invariant. Thus the efficiency properties of this statistic do not depend on
ρ. This means that the bivariate sign test cannot be affine invariant; again, this
is developed in the exercises. It is now natural to inquire about the properties
of the estimate and test based on S2 . This estimating function cannot be
written in the componentwise form that we have been considering. Before we
turn to this statistic, we consider estimates and tests based on componentwise
ranking.
Using the projection method, Theorem 2.4.6, we have from Exercise 6.8.8,
for the case θ = 0,
$$S_4(0) = \begin{pmatrix} \sum F_1^+(|x_{i1}|)\,\mathrm{sgn}(x_{i1}) \\ \sum F_2^+(|x_{i2}|)\,\mathrm{sgn}(x_{i2}) \end{pmatrix} + o_p(1) = \begin{pmatrix} \sum 2[F_1(x_{i1}) - 1/2] \\ \sum 2[F_2(x_{i2}) - 1/2] \end{pmatrix} + o_p(1)$$
Table 6.2.3: Minimum, Maximum, and Estimation Efficiencies of the Componentwise Wilcoxon Methods Relative to the L2 Methods When the Underlying Distribution is Bivariate Normal

ρ     0    .2   .4   .6   .8   .9   .99
min  .96  .94  .93  .91  .89  .88  .87
max  .96  .96  .97  .97  .96  .96  .96
est  .96  .96  .95  .94  .93  .92  .91
where Rit is the rank of |Xit | in the tth component among |X1t |, . . . , |Xnt |. This
estimate is the conditional covariance and can be used in estimating A in the
construction of an asymptotically distribution-free test; when we estimate the
asymptotic covariance matrix of θ b we first center the data and then compute
(6.2.16).
The estimator that solves S4 (θ) = 0 is the vector of Hodges-Lehmann
estimates for the two components; that is, the vector of medians of Walsh
averages for each component. Like the vector of medians, the vector of HL
estimates is not equivariant under orthogonal transformations and the test is
not invariant under these transformations. This shows up in the efficiency with
respect to the L2 methods which are an equivariant estimate and an invariant
test. Theorem 6.1.2 provides the asymptotic distribution of the estimator and
the asymptotic local power of the test.
Suppose the underlying distribution is bivariate normal with means 0,
variances 1, and correlation ρ, then the estimation and testing efficiencies are
given by
$$e(\mathrm{HL},\mathrm{mean}) = \frac{3}{\pi}\sqrt{\frac{1-\rho^2}{1-9\delta^2}} \quad (6.2.17)$$

$$e(\mathrm{Wil},\mathrm{Hotel}) = \frac{3(1-\rho^2)}{\pi(1-9\delta^2)}\left\{\frac{\gamma_1^2 - 6\delta\gamma_1\gamma_2 + \gamma_2^2}{\gamma_1^2 - 2\rho\gamma_1\gamma_2 + \gamma_2^2}\right\}. \quad (6.2.18)$$

Exercise 6.8.9 asks the reader to apply Lemma 6.2.1 and show the testing efficiency is bounded between

$$\frac{3(1+\rho)}{2\pi\left[2 - \frac{3}{\pi}\cos^{-1}\left(\frac{\rho}{2}\right)\right]} \quad\text{and}\quad \frac{3(1-\rho)}{2\pi\left[2 - \frac{3}{\pi}\cos^{-1}\left(\frac{\rho}{2}\right)\right]}\,. \quad (6.2.19)$$
In Table 6.2.3 we provide some values of the minimum and maximum effi-
ciencies as well as estimation efficiency. Note how much more stable the rank
methods are than the sign methods. Bickel (1964) points out, however, that
when there is heavy contamination and ρ is close to ±1 the estimation effi-
ciency can be arbitrarily close to 0. Further, this efficiency can be arbitrarily
large. This behavior is due to the fact that the sign and rank methods are neither rotation nor affine invariant/equivariant.
$$\widehat{A} = \frac{1}{n}\sum_{i=1}^{n} \|x_i\|^{-2} x_i x_i^T\,, \quad (6.3.1)$$

which can be used to construct the spatial sign test statistic, with

$$\frac{1}{\sqrt{n}} S_2(0) \xrightarrow{D} N_2(0, A) \quad\text{and}\quad \frac{1}{n} S_2^T(0)\widehat{A}^{-1} S_2(0) \xrightarrow{D} \chi^2(2)\,. \quad (6.3.2)$$
where I is the identity matrix. Use a moment estimate for B similar to the
estimate of A.
The spatial median is determined by

$$\hat{\theta} = \mathop{\mathrm{Argmin}}_{\theta} \sum_{i=1}^{n} \|x_i - \theta\|\,. \quad (6.3.4)$$
Table 6.3.1: Weight of Cork Borings (in Centigrams) in Four Directions for 28
Trees
N E S W N E S W
72 66 76 77 91 79 100 75
60 53 66 63 56 68 47 50
56 57 64 58 79 65 70 61
41 29 36 38 81 80 68 58
32 32 35 36 78 55 67 60
30 35 34 26 46 38 37 38
39 39 31 27 39 35 34 37
42 43 31 25 32 30 30 32
37 40 31 25 60 50 67 54
33 29 27 36 35 37 48 39
32 30 34 28 39 36 39 31
63 45 74 63 50 34 37 40
54 46 60 52 43 37 39 50
47 51 52 43 48 54 57 43
four measurements on each tree and we wish to test the equality of marginal
locations: H0 : θN = θS = θE = θW . This is a common hypothesis in repeated
measure designs. See Jan and Randles (1996) for an excellent discussion of is-
sues in repeated measures designs. We reduce the data to trivariate vectors via N − S, S − E, E − W. Then we test δ = 0, where $\delta^T = (\theta_N - \theta_S, \theta_S - \theta_E, \theta_E - \theta_W)$.
Table 6.3.1 displays the original n = 28 four component data vectors.
We consider the differences: N − S, S − E, and E − W . For the reader’s
convenience, at the url listed in the Preface, we have tabled these differences
along with the unit spatial sign vectors kxk−1 x for each data point. Note that,
except for rounding error, for the spatial sign vectors, the sum of squares in
each row is 1.
We compute the spatial sign statistic to be $S_2^T = (7.78, -4.99, 6.65)$ and, from (6.3.1),

$$\widehat{A} = \begin{pmatrix} .2809 & -.1321 & -.0539 \\ -.1321 & .3706 & -.0648 \\ -.0539 & -.0648 & .3484 \end{pmatrix}.$$

Then $n^{-1} S_2^T(0)\widehat{A}^{-1} S_2(0) = 14.74$, which yields an asymptotic p-value of .002, using a $\chi^2$ approximation with 3 degrees of freedom. Hence, we easily reject $H_0: \delta = 0$ and conclude that boring size depends on direction.
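The computations in this example take only a few lines in R; the sketch below (ours) assumes the 28 × 3 matrix d of differences is available, for example from the url cited in the Preface:

# Sketch of the spatial sign test (6.3.1)-(6.3.2); d is the n x 3 matrix of
# differences N-S, S-E, E-W (assumed available).
spatial_sign_test <- function(d) {
  n  <- nrow(d)
  u  <- d / sqrt(rowSums(d^2))    # unit spatial sign vectors
  S2 <- colSums(u)                # spatial sign statistic
  A  <- crossprod(u) / n          # moment estimate (6.3.1)
  stat <- as.numeric(t(S2) %*% solve(A) %*% S2) / n
  c(statistic = stat, p.value = 1 - pchisq(stat, df = ncol(d)))
}
# For the cork differences this should reproduce, approximately, 14.74 and .002.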
For estimation we return to the original component data. Since we have rejected the null hypothesis of equality of locations, we want to estimate the four components of the location vector $\theta^T = (\theta_1, \theta_2, \theta_3, \theta_4)$ via the spatial median.
and this is used in the quadratic form with S2 (0) to construct the test statistic;
see Möttönen and Oja (1995, Section 2.1).
To consider the asymptotically distribution-free version of this test we use
the form
$$S_2(0) = \sum \begin{pmatrix} \cos\phi_i \\ \sin\phi_i \end{pmatrix} \quad (6.3.10)$$

where, recall, $0 \le \phi < 2\pi$, and the multivariate central limit theorem implies that $\frac{1}{\sqrt{n}} S_2(0)$ has a limiting bivariate normal distribution with mean 0
and covariance matrix A. We now translate A and its estimate into polar
coordinates.
$$A = E\begin{pmatrix} \cos^2\phi & \cos\phi\sin\phi \\ \cos\phi\sin\phi & \sin^2\phi \end{pmatrix}, \qquad \widehat{A} = \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} \cos^2\phi_i & \cos\phi_i\sin\phi_i \\ \cos\phi_i\sin\phi_i & \sin^2\phi_i \end{pmatrix} \quad (6.3.11)$$

Hence, rejecting when $\frac{1}{n} S_2^T(0)\widehat{A}^{-1} S_2(0) \ge \chi^2_\alpha(2)$ is an asymptotically size α test.

The polar coordinate representation of B is given by

$$B = E\,r^{-1}\begin{pmatrix} 1-\cos^2\phi & -\cos\phi\sin\phi \\ -\cos\phi\sin\phi & 1-\sin^2\phi \end{pmatrix} = E\,r^{-1}\begin{pmatrix} \sin^2\phi & -\cos\phi\sin\phi \\ -\cos\phi\sin\phi & \cos^2\phi \end{pmatrix}. \quad (6.3.12)$$

Hence, $\sqrt{n}$ times the spatial median is limiting bivariate normal with asymp-
totic covariance matrix equal to B−1 AB−1 . The corresponding noncentral-
ity parameter of the noncentral chisquare limiting distribution of the test is
γ T BA−1Bγ. We are now in a position to evaluate the efficiency of the spa-
tial median and the spatial sign test with respect to the mean vector and
Hotelling’s test under various model assumptions. The following result is ba-
sic and is derived in Exercise 6.8.10.
Theorem 6.3.1. Suppose the underlying distribution is spherically symmet-
ric so that the joint density is of the form f (x) = h(kxk). Let (r, φ) be the
polar coordinates. Then r and φ are stochastically independent, the pdf of φ is
uniform on (0, 2π] and the pdf of r is g(r) = 2πrf (r), for r > 0.
Theorem 6.3.2. If the underlying distribution is spherically symmetric, then
the matrices A = (1/2)I and B = [(Er −1 )/2]I. Hence, under the null hypoth-
esis, the test statistic n−1 ST2 (0)A−1 S2 (0) is distribution-free over the class of
spherically symmetric distributions.
Proof: First note that

$$E\cos\phi\sin\phi = \frac{1}{2\pi}\int_0^{2\pi} \cos\phi\sin\phi\,d\phi = 0\,.$$
Then note that, similarly, $E\cos^2\phi = E\sin^2\phi = 1/2$; hence $A = (1/2)I$. The expression for B follows in the same way from (6.3.12), using the independence of r and φ given in Theorem 6.3.1.
Since A and B are multiples of the identity, the noncentrality parameter $\gamma^T BA^{-1}B\gamma = \frac{1}{2}\{Er^{-1}\}^2\gamma^T\gamma$ depends on γ only through its length, independent of the direction. Recall, for the mean vector and $T^2$, that $A = 2^{-1}E(r^2)I$, $\det B^{-1}AB^{-1} = 2^{-1}E(r^2)$, and $\gamma^T BA^{-1}B\gamma = [2/E(r^2)]\gamma^T\gamma$. This is because both the spatial L1 methods and the L2 methods are equivariant and invariant with respect to orthogonal (rotations and reflections) transformations. Hence, we see that the efficiency is

$$e(\mathrm{spatial}\ L_1, L_2) = \frac{1}{4}\,Er^2\{Er^{-1}\}^2\,. \quad (6.3.13)$$
If, in addition, we assume the underlying distribution is spherical normal (bivariate normal with means 0 and identity covariance matrix), then $Er^{-1} = \sqrt{\pi/2}$, $Er^2 = 2$, and $e(\mathrm{spatial}\ L_1, L_2) = \pi/4 \approx .785$. Hence, at the spherical normal model, the spatial L1 methods based on $S_2(\theta)$ are more efficient relative to the L2 methods than the componentwise L1 methods (.637) discussed in Section 6.2.3.
In Exercise 6.8.12 the reader is asked to show that the efficiency of the spatial L1 methods relative to the L2 methods under a k-variate spherical model is given by

$$e_k(\mathrm{spatial}\ L_1, L_2) = \left(\frac{k-1}{k}\right)^2 E(r^2)\,[E(r^{-1})]^2\,. \quad (6.3.14)$$

When the k-variate spherical model is normal, the exercise shows that

$$Er^{-1} = \frac{\Gamma[(k-1)/2]}{\sqrt{2}\,\Gamma(k/2)}\,, \quad\text{with}\ \Gamma(1/2) = \sqrt{\pi}\,.$$

Table 6.3.2 gives some values for this efficiency
as a function of dimension. Hence, we see that the efficiency increases with
dimension. This suggests that the spatial methods are superior to the compo-
nentwise L1 methods, at least for spherical models.
Table 6.3.2: Efficiency of the Spatial L1 Methods Relative to the L2 Methods at the k-Variate Spherical Normal Model

k                     2      4      6
e(spatial L1, L2)   0.785  0.884  0.920
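These entries can be verified in R from (6.3.14), using $E(r^2) = k$ (since $r^2$ has a $\chi^2(k)$ distribution at the spherical normal) together with the expression for $Er^{-1}$ above:

# Check of (6.3.14) at the k-variate spherical normal model.
ek <- function(k) {
  Erinv <- gamma((k - 1) / 2) / (sqrt(2) * gamma(k / 2))  # E(r^{-1})
  ((k - 1) / k)^2 * k * Erinv^2                           # E(r^2) = k
}
round(sapply(c(2, 4, 6), ek), 3)   # 0.785 0.884 0.920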
Table 6.3.3: Efficiency of the Spatial L1 Methods Relative to the L2 Methods as a Function of σ, the Ratio of the Underlying Variances (Bivariate Normal Model)

σ                     1     .8     .6     .4     .2    .05    .01
e(spatial L1, L2)   0.785  0.783  0.773  0.747  0.678  0.593  0.321
$$Er^{-1}\cos^2\phi = \frac{1}{2}\sqrt{\frac{\pi}{2}}\sum_{j=0}^{\infty} \frac{(2j+2)!\,(2j)!}{2^{4j+1}(j!)^2[(j+1)!]^2}(1-\sigma^2)^j\,,$$

and

$$Er^{-1}\sin^2\phi = Er^{-1} - Er^{-1}\cos^2\phi\,.$$
Thus $A = \mathrm{diag}[(1+\sigma)^{-1}, \sigma(1+\sigma)^{-1}]$ and the distribution of the test statistic, even under the normal model, depends on σ. The formulas can be
used to compute the efficiency of the spatial L1 methods relative to the L2
methods; numerical values are given in Table 6.3.3. The dependency of the
efficiency on σ reflects the dependency of the efficiency on the underlying
correlation which is present prior to rotation.
Hence, just as the componentwise L1 methods have decreasing efficiency as
a function of the underlying correlation, the spatial L1 methods have decreas-
ing efficiency as a function of the ratio of underlying variances. It should be
emphasized that the spatial methods are most appropriate for spherical models
where they have equivariance and invariance properties. The componentwise
methods, although equivariant and invariant under scale transformations of
the components, cannot tolerate changes in correlation. See Mardia (1972)
and Fisher (1987, 1993) for further discussion of spatial methods. In higher
dimensions, Mardia refers to the angle test as Rayleigh’s test; see Section 9.3.1
of Mardia (1972). Möttönen and Oja (1995) extend the spatial median and
the spatial sign test to higher dimensions. See Table 6.3.5 below for efficien-
cies relative to Hotelling’s test for higher dimensions and for a multivariate t
underlying distribution. Note that for higher dimensions and lower degrees of
freedom, the spatial sign test is superior to Hotelling’s T 2 .
This is also the sum of the centered spatial ranks of the observations when ranked among the combined observations and their reflections. Note that $-u(x_i - x_j) = u(x_j - x_i)$, so that $\sum\sum u(x_i - x_j) = 0$, and the statistic can be computed from

$$S_5(0) = \sum_i \sum_j u(x_i + x_j)\,, \quad (6.3.17)$$
Table 6.3.4: Each Row is a Spatial Signed-Rank Vector for the Data Differences
Example 6.3.2 (Cork Borings, Example 6.3.1 continued). We use the spatial
signed-rank method (6.3.20) to test the hypothesis. Table 6.3.4 provides the
vector signed-ranks, r+ (xi ) defined in expression (6.3.18).
Then $S_5^T(0) = (4.94, -2.90, 5.17)$ and

$$n^3\widehat{A}^{-1} = \begin{pmatrix} .1231 & -.0655 & .0050 \\ -.0655 & .1611 & -.0373 \\ .0050 & -.0373 & .1338 \end{pmatrix},$$
Efficiency
The test in (6.3.20) can be developed from the point of view of asymptotic
theory and the efficiency can be computed. The computations are quite in-
volved. The multivariate t distributions provide both a range of tailweights
and a range of dimensions. A summary of these efficiencies is found in Table
6.3.5; see Möttönen, Oja, and Tienari (1997) for details.
The Möttönen and Oja (1995) test efficiency increases with the dimen-
sion; see especially, the circular normal case. The efficiency begins at .95
and increases! The efficiency also increases with tailweight, as expected. This
Table 6.3.5: The Row Labeled Spatial SR Are the Asymptotic Efficiencies of
Multivariate Spatial Signed-Rank Test, (6.3.20), Relative to Hotelling’s Test
under the Multivariate t Distribution; the Efficiencies for the Spatial Sign
Test, (6.3.2), Are Given in the Rows Labeled Spatial Sign
Degrees of Freedom
Dimension Test 3 4 6 8 10 15 20 ∞
1 Spatial SR 1.90 1.40 1.16 1.09 1.05 1.01 1.00 0.95
Spatial Sign 1.62 1.13 0.88 0.80 0.76 0.71 0.70 0.64
2 Spatial SR 1.95 1.43 1.19 1.11 1.07 1.03 1.01 0.97
Spatial Sign 2.00 1.39 1.08 0.98 0.93 0.88 0.85 0.79
3 Spatial SR 1.98 1.45 1.20 1.12 1.08 1.04 1.02 0.97
Spatial Sign 2.16 1.50 1.17 1.06 1.01 0.95 0.92 0.85
4 Spatial SR 2.00 1.46 1.21 1.13 1.09 1.04 1.025 0.98
Spatial Sign 2.25 1.56 1.22 1.11 1.05 0.99 0.96 0.88
6 Spatial SR 2.02 1.48 1.22 1.14 1.10 1.05 1.03 0.98
Spatial Sign 2.34 1.63 1.27 1.15 1.09 1.03 1.00 0.92
10 Spatial SR 2.05 1.49 1.23 1.14 1.10 1.06 1.04 0.99
Spatial Sign 2.42 1.68 1.31 1.19 1.13 1.06 1.03 0.95
strongly suggests that the Möttönen and Oja approach is an excellent way to
extend the idea of signed rank from the univariate case. See Example 6.6.2 for
a discussion of the two-sample spatial rank test.
Hodges-Lehmann Estimator
The estimator derived from S5 (θ) = 0 is the spatial median of the pairwise
averages, a spatial Hodges-Lehmann (1963) estimator. This estimator is stud-
ied in great detail by Chaudhuri (1992). His paper contains a thorough review
of multidimensional location estimates. He develops a Bahadur representation
for the estimate. From his Theorem 3.2, we can immediately conclude that
$$\sqrt{n}\,\hat{\theta} = B_2^{-1}\,\frac{\sqrt{n}}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n} u\!\left(\tfrac{1}{2}(x_i + x_j)\right) + o_p(1) \quad (6.3.21)$$

where $B_2 = E\{\|x^*\|^{-1}(I - \|x^*\|^{-2}x^*(x^*)^T)\}$ and $x^* = \frac{1}{2}(x_1 + x_2)$. Hence, the asymptotic distribution of $\sqrt{n}\,\hat{\theta}$ is determined by that of $n^{-3/2}S_5(0)$. This leads to

$$\sqrt{n}\,\hat{\theta} \xrightarrow{D} N_2(0, B_2^{-1}A_2B_2^{-1})\,, \quad (6.3.22)$$

where $A_2 = E\{u(x_1+x_2)(u(x_1+x_2))^T\}$. Moment estimates of $A_2$ and $B_2$ can be used. In fact the estimator $\widehat{A}$, defined in expression (6.3.19), is a consistent estimate of $A_2$. Bose and Chaudhuri (1993) and Chaudhuri (1993) discuss
refinements in the estimation of A2 and B2 .
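A simple way to compute the estimator is Weiszfeld iteration applied to the pairwise averages; the R sketch below is ours (a standard algorithm, not necessarily the authors' routine), assuming the data are the rows of a matrix x:

# Spatial median by Weiszfeld iteration; rows of w are points.
spatial_median <- function(w, tol = 1e-8, maxit = 500) {
  th <- colMeans(w)
  for (it in 1:maxit) {
    d <- pmax(sqrt(rowSums(sweep(w, 2, th)^2)), 1e-12)  # guard zero distances
    thnew <- colSums(w / d) / sum(1 / d)
    if (sqrt(sum((thnew - th)^2)) < tol) break
    th <- thnew
  }
  thnew
}
# Spatial Hodges-Lehmann estimate: spatial median of the pairwise averages.
spatial_hl <- function(x) {
  pr <- which(upper.tri(diag(nrow(x)), diag = TRUE), arr.ind = TRUE)
  spatial_median((x[pr[, 1], , drop = FALSE] + x[pr[, 2], , drop = FALSE]) / 2)
}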
Choi and Marden (1997) extend these spatial rank methods to the two-
sample model and the one-way layout. They also consider tests for ordered
alternatives; see, also, Oja (2010).
where

$$A = \begin{pmatrix} \int_0^1 \cos^2\pi t\,dt & \int_0^1 \cos\pi t\sin\pi t\,dt \\ \int_0^1 \cos\pi t\sin\pi t\,dt & \int_0^1 \sin^2\pi t\,dt \end{pmatrix} = \frac{1}{2}I\,,$$
Proof: To prove this we show that under a spherical model the angle statistic
S2 (0) and scores statistic S6 (0) are asymptotically equivalent. Then S6 (0) has
the same A and B matrices as in Theorem 6.3.2. But since S6 (0) leads to an
affine invariant test statistic, it follows that the same A and B continue to
apply for elliptical models.
Recall that under the spherical model, $s_{(1)}, \ldots, s_{(n)}$ are iid random variables with $P(s_i = 1) = P(s_i = -1) = 1/2$. Then we consider

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} s_{(i)}\begin{pmatrix} \cos\frac{\pi i}{n+1} \\ \sin\frac{\pi i}{n+1} \end{pmatrix} - \frac{1}{\sqrt{n}}\sum_{i=1}^{n} s_{(i)}\begin{pmatrix} \cos\varphi_{(i)} \\ \sin\varphi_{(i)} \end{pmatrix} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} s_{(i)}\begin{pmatrix} \cos\frac{\pi i}{n+1} - \cos\varphi_{(i)} \\ \sin\frac{\pi i}{n+1} - \sin\varphi_{(i)} \end{pmatrix}.$$
The cdf of the uniform distribution on $[0, \pi)$ is equal to $t/\pi$ for $0 \le t < \pi$. Let $G_n(t)$ be the empirical cdf of the angles $\varphi_i$, $i = 1, \ldots, n$. Then $G_n^{-1}(\frac{i}{n+1}) = \varphi_{(i)}$ and $\max_i |\frac{\pi i}{n+1} - \varphi_{(i)}| \le \sup_t |G_n^{-1}(t) - \pi t| = \pi\sup_t |G_n(t) - t/\pi| \to 0$ wp1 by the Glivenko-Cantelli Lemma. The result now follows by using a linear approximation to $\cos(\frac{\pi i}{n+1}) - \cos\varphi_{(i)}$ and noting that cos and sin are bounded. The same argument applies to the second component. Hence, the difference of the two statistics is $o_p(1)$ and they are asymptotically equivalent. The results for the
angle statistic now apply to S6 (0) for a spherical model. The affine invariance
extends the result to an elliptical model.
The main implication of this proposition is that the efficiency of the test
based on S6 (0) relative to Hotelling’s test is π/4 ≈ .785 for all bivariate normal
models, not just the spherical normal model. Recall that the test based on
S2 (0), the angle sign test, has efficiency π/4 only for the spherical normal and
declining efficiency for elliptical normal models. Hence, we not only gain affine
invariance but also have a constant, nondecreasing efficiency.
Oja and Nyblom (1989) study a class of sign tests for the bivariate loca-
tion problem. They show that Blumen’s test is locally most powerful invariant
for the entire class of elliptical models. Ducharme and Milasevic (1987) define
a normalized spatial median as an estimate of location of a spherical distri-
bution. They construct a confidence region for the modal direction. These
methods are resistant to outliers.
distributed on the unit sphere. We then compute the spatial sign test (6.3.2)
on the transformed data. The result is an affine invariant test.
Let x1 , ..., xn be a random sample of size n from a k-variate multivariate
symmetric distribution with symmetry center 0. Suppose for the moment that
a nonsingular matrix $U_x$, determined by the data, exists and satisfies

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{U_x x_i}{\|U_x x_i\|}\right)\left(\frac{U_x x_i}{\|U_x x_i\|}\right)^T = \frac{1}{k}\,I\,. \quad (6.4.4)$$
Hence, the unit vectors of the transformed data have covariance matrix
equal to that of a random vector uniformly distributed on the unit k-sphere.
Below we describe a simple and fast way to compute Ux for any dimension k.
The spatial sign test statistic (6.3.2), computed on the transformed data, becomes

$$\frac{1}{n}\,S_7^T\widehat{A}^{-1}S_7 = \frac{k}{n}\,S_7^T S_7 \quad (6.4.5)$$

where

$$S_7 = \sum_{i=1}^{n}\frac{U_x x_i}{\|U_x x_i\|}\,. \quad (6.4.6)$$
The following lemma is helpful in the proof of the theorem. The lemma’s
proof depends on a uniqueness result from Tyler (1987).
Lemma 6.4.1. Let D be a nonsingular matrix. Then:

1. $D^T U_{Dx}^T U_{Dx} D = c_0 U_x^T U_x$ for some positive constant $c_0$ that may depend on D and the data, and

2. there exists an orthogonal matrix G such that $\sqrt{c_0}\,G U_x = U_{Dx} D$.
Tyler (1987) showed that the matrix $U_{Dx}$ defined from $Dx_1, \ldots, Dx_n$ is unique up to a positive constant. Hence, $U_{Dx} = aU^*$ for some positive constant a. Hence,

$$U_{Dx}^T U_{Dx} = a^2 U^{*T}U^* = a^2 (D^T)^{-1} U_x^T U_x D^{-1}$$

and $D^T U_{Dx}^T U_{Dx} D = a^2 U_x^T U_x$, which completes the proof of part (1) with $c_0 = a^2$. Next, define $G = c_0^{-1/2} U_{Dx} D U_x^{-1}$, where $c_0$ comes from the lemma. Then, using part (1), it follows that $G^T G = I$ and G is orthogonal. Hence,

$$\sqrt{c_0}\,G U_x = c_0^{1/2} c_0^{-1/2}\, U_{Dx} D U_x^{-1} U_x = U_{Dx} D$$

and, since the unit vectors of the transformed data therefore rotate by G,

$$S_7^D = G\sum_{i=1}^{n}\frac{U_x x_i}{\|U_x x_i\|} = G S_7\,.$$

Hence, $S_7^{DT} S_7^D = S_7^T S_7$, and the affine invariance follows from the orthogonal invariance of $S_7^T S_7$.
Sketch of the argument that the asymptotic distribution is chisquared with k degrees of freedom. Tyler (1987) showed that there exists a unique upper triangular matrix $U^*$, with upper left diagonal element equal to 1, such that

$$E\left[\left(\frac{U^* X}{\|U^* X\|}\right)\left(\frac{U^* X}{\|U^* X\|}\right)^T\right] = \frac{1}{k}\,I$$

and $\sqrt{n}(U_x - U^*) = O_p(1)$. Theorem 6.1.2 implies that $(k/n)S_7^{*T}S_7^*$ is asymptotically chisquared with k degrees of freedom, where $U^*$ replaces $U_x$ in $S_7$. But since $U_x$ and $U^*$ are close, $(k/n)S_7^{*T}S_7^* - (k/n)S_7^T S_7 = o_p(1)$, and the asymptotic distribution follows. See the appendix in Randles (2000) for details.
We have assumed symmetry of the underlying multivariate distribution. The results continue to hold with the weaker assumption of directional symmetry about 0, in which $X/\|X\|$ and $-X/\|X\|$ have the same distribution.
$$\frac{k}{n}\left(\sum_{i=1}^{n}\delta_i\frac{U_x x_i}{\|U_x x_i\|}\right)^T\left(\sum_{i=1}^{n}\delta_i\frac{U_x x_i}{\|U_x x_i\|}\right)$$
Computation of Ux
It remains to compute Ux from the data x1 , ..., xn . The following efficient
iterative procedure is due to Tyler (1987) who also shows the sequence of
iterates converges when n > k(k − 1).
We begin with

$$V_0 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i}{\|x_i\|}\right)\left(\frac{x_i}{\|x_i\|}\right)^T,$$

and $U_0 = \mathrm{Chol}(V_0^{-1})$, where Chol(M) is the upper triangular Cholesky decomposition of the positive definite matrix M divided by the upper left diagonal element of the upper triangular matrix. This places a 1 as the first element of the main diagonal and makes Chol(M) unique.
If $\|V_0 - k^{-1}I\|$ is sufficiently small (a prespecified tolerance), stop and take $U_x = U_0$. If $\|V_0 - k^{-1}I\|$ is large, compute

$$V_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{U_0 x_i}{\|U_0 x_i\|}\right)\left(\frac{U_0 x_i}{\|U_0 x_i\|}\right)^T,$$
This starting procedure is used, since starting values need to be affine invariant
and equivariant.
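A compact alternative is to use the fixed-point characterization of Tyler's scatter matrix and then extract the normalized Cholesky factor. The R sketch below is ours (not Tyler's exact routine above; function names are hypothetical), and it also applies the result to compute the affine invariant sign test (6.4.5):

# Sketch: Tyler's transformation Ux via fixed-point iteration, then the
# affine invariant spatial sign test (6.4.5)-(6.4.6); x is n x k (assumed).
tyler_U <- function(x, tol = 1e-8, maxit = 500) {
  n <- nrow(x); k <- ncol(x)
  V <- diag(k)
  for (it in 1:maxit) {
    d <- rowSums((x %*% solve(V)) * x)        # d_i = x_i' V^{-1} x_i
    Vnew <- crossprod(x / sqrt(d)) * (k / n)  # (k/n) sum x_i x_i' / d_i
    Vnew <- Vnew / Vnew[1, 1]                 # fix the scale
    if (max(abs(Vnew - V)) < tol) { V <- Vnew; break }
    V <- Vnew
  }
  U <- chol(solve(V))    # upper triangular, U'U = V^{-1}
  U / U[1, 1]            # normalize so the upper-left element is 1
}
affine_sign_test <- function(x) {
  y <- x %*% t(tyler_U(x))                    # rows are Ux x_i
  u <- y / sqrt(rowSums(y^2))
  S7 <- colSums(u)
  stat <- ncol(x) * sum(S7^2) / nrow(x)       # (k/n) S7' S7, as in (6.4.5)
  c(statistic = stat, p.value = 1 - pchisq(stat, df = ncol(x)))
}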
For a fixed Ux there exists a unique solution for θ, and for fixed θ there ex-
ists a unique Ux up to multiplicative constant. In simulations and calculations
described in Hettmansperger and Randles (2002) the alternating algorithm did
not fail to converge. However, the equations defining the simultaneous solution
$(U_x, \hat{\theta})$ do not fully satisfy all conditions stated in the literature for existence and uniqueness; see Maronna (1976) and Kent and Tyler (1991).
The asymptotic distribution theory developed in Hettmansperger and Randles (2002) shows that $\hat{\theta}$ is approximately multivariate normally distributed under the assumption of directional symmetry and, hence, symmetry. The asymptotic covariance matrix is complicated and we recommend a bootstrap estimate of the covariance matrix of $\hat{\theta}$.
The approach taken above is more general. If we begin with the orthog-
onally invariant statistic in (6.3.2) and use a matrix U that satisfies the in-
variance property in part (2) of Lemma 6.4.1 then the resulting statistic is
affine invariant. For example we could take U to be the inverse of the sample
covariance matrix. This results in a test statistic studied by Hössjer and Croux
(1995). We prefer the more robust matrix Ux proposed by Tyler (1987).
Example 6.4.1 (Mathematics and Statistics Exam Scores). We now illus-
trate the one-sample affine invariant spatial sign test (6.4.5) and the affine
equivariant spatial median on a small data set. A major advantage of this
method is the speed of computation which allows for bootstrap estimates of
the covariance matrix and standard errors for the estimator. The data consists
of 20 vectors, chosen at random from a larger data set published in Mardia,
Kent, and Bibby (1979). Each vector consists of four components and records
test scores in Mechanics, Vectors, Analysis, and Statistics. We wish to test the
hypothesis that there are no differences among the examination topics. This
is a traditional hypothesis in repeated measures designs; see Jan and Randles
(1996) for a thorough discussion of this problem. Similar to our findings above
on efficiencies, they found that multivariate sign and signed-rank tests were
often superior to least squares in robustness of level and efficiency.
We consider the trivariate data that result when the Statistics score is
subtracted from the other three scores. For convenience, we have tabled these
differences at the url cited in the Preface. We suppose that the trivariate
data are a sample of size 20 from a symmetric distribution with center θ =
$(\theta_1, \theta_2, \theta_3)^T$ and we wish to test $H_0: \theta = 0$ versus $H_A: \theta \ne 0$. In Table 6.4.1
we have the HR estimates (standard errors) and the tests for the affine spatial
methods, Hotelling’s T2 , and Oja’s affine methods described later in Section
6.4.3. The standard errors of the HR estimate are obtained from a bootstrap
estimate of the covariance matrix. The following estimates are based on 500
bootstrap resamples.
$$\widehat{\mathrm{Cov}}(\hat{\theta}) = \begin{pmatrix} 33.88 & 10.53 & 21.05 \\ 10.53 & 17.03 & 12.49 \\ 21.05 & 12.49 & 32.71 \end{pmatrix}.$$
The standard errors in Table 6.4.1 are the square roots of the main diagonal of this matrix.
Table 6.4.1: Results for the Original and Contaminated Test Score Data: Mean
of Signed-Rank Vectors, Usual Mean Vectors, the Hodges-Lehmann Estimate
of θ; Results for the Signed-Rank Test (6.4.16) and Hotelling’s T 2 Test
Test Asymp.
M −S V −S A−S Statistic p-value
Original Data
HR Estimate −2.12 13.85 6.21
SE HR 5.82 4.13 5.72
Mean -4.95 12.10 2.40
SE Mean 4.07 3.33 3.62
Oja HL-est. -3.05 14.06 4.96
Affine Sign Test (6.4.5) 14.19 0.0027
Hotelling’s T 2 13.47 0.0037
Oja Signed rank (6.4.16) 14.07 0.0028
Contaminated Data
HR Estimate −2.92 12.83 6.90
SE HR 5.58 8.27 6.60
Mean Vector -4.95 8.60 2.40
Oja HL-estimate -3.90 12.69 4.64
Affine Sign Test (6.4.5) 10.76 0.0131
Hotelling’s T 2 6.95 .0736
Oja Signed rank (6.4.16) 10.09 0.0178
The affine sign methods suggest that the major source of statistical sig-
nificance is the V − S difference. In particular, Vector scores are higher than
Statistics scores. A more convenient comparison is achieved by estimating the
locations in the four-dimensional problem. We find the affine equivariant spa-
tial median for M, V, A, S to be (along with bootstrap standard errors) 36.54
(8.41), 53.04 (5.09), 44.28 (8.39), and 39.65 (7.06). This again reflects the sig-
nificant differences between Vector scores and Statistics. In fact, it appears
the Vector exam was easiest while the other subjects are roughly equivalent.
with δi = ±1.
Recall that the Hodges-Lehmann estimate related to the spatial signed-
rank statistic is the spatial median of the pairwise averages of the data vectors.
This estimate is orthogonally equivariant but not affine equivariant. We use
the transformation-retransformation method. We transform the data using $V_x$ to get $y_i = V_x x_i$, $i = 1, \ldots, n$, and then compute the spatial median of the pairwise averages $(y_i + y_j)/2$, which we denote by $\hat{\tau}$. Then we retransform it back: $\hat{\theta} = V_x^{-1}\hat{\tau}$. This estimate is now affine equivariant. Because of the complexity of the asymptotic covariance matrix we recommend a bootstrap estimate of the covariance matrix of $\hat{\theta}$.
Efficiency
Recall Table 6.3.5 which provides efficiency values for either the spatial sign
test or the spatial signed-rank test relative to Hotelling’s T2 test. The calcula-
tions were made for the spherical t-distribution for various degrees of freedom
and finally for the spherical normal distribution. Now that we have affine
invariant sign and signed-rank tests and affine equivariant estimates we can
apply these efficiency results to elliptical t and normal distributions. Hence, we
again see the superiority of the sign and signed-rank methods over Hotelling’s
test and the sample mean. The affine invariant tests and affine equivariant
estimates are efficient and robust alternatives to the traditional least squares
methods.
In the case of the affine invariant sign test, Randles (2000) presents a power
sensitivity simulation comparing his test to Hotelling’s T 2 test, Blumen’s test,
and Oja’s sign test (6.4.14). In addition to the multivariate normal distribu-
tion, he included t distributions and a skewed distribution. Randles’ affine
invariant sign test performed extremely well. Although Oja’s sign test per-
formed comparably, it is much more computationally intensive than Randles’
test.
The estimator that solves (6.4.12) is called the Oja median and we are inter-
ested in its properties. This estimator minimizes the sum of triangular areas
formed by all pairs of observations along with θ. Niinimaa, Nyblom, and Oja
(1992) provide a fortran program for computing the Oja median and discuss
further aspects of its computation; see, also, the R package OjaNP. Brown and
Hettmansperger (1987a) present a geometric description of the determination
of the Oja median. The statistic S8 (0) forms the basis of a sign-type statistic
for testing H0 : θ = 0. We refer to this test as the Oja sign test. In order
to study the Oja median and the Oja sign test we need once again to deter-
mine the matrices A and B. Before doing this we rewrite (6.4.12) in a more
convenient form, a form that expresses it as a function of s1 , . . . , sn . Recall
the polar form of x, (6.3.7), that we have been using and at the same time
introduce the vector y as follows:
$$x = r\begin{pmatrix} \cos\phi \\ \sin\phi \end{pmatrix} = rs\begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix} = sy\,.$$
sin φ sin ϕ
As usual 0 ≤ ϕ < π, s indicates whether x is above or below the horizontal
axis, and r is the length of x. Hence, if s = 1 then y = x, and if s = −1 then
y = −x, so y is always above the horizontal axis.
Theorem 6.4.3. The following string of equalities is true:

$$\frac{1}{n}S_8(0) = \frac{1}{2n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\mathrm{sgn}\left\{\det\begin{pmatrix} x_{i1} & x_{j1} \\ x_{i2} & x_{j2} \end{pmatrix}\right\}(x_j^* - x_i^*) = \frac{1}{2n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} s_i s_j (s_j y_j^* - s_i y_i^*) = \frac{1}{2}\sum_{i=1}^{n} s_i z_i$$

where

$$z_i = \frac{1}{n}\sum_{j=1}^{n-1} y_{i+j}^* \quad\text{and}\quad y_{n+i} = -y_i\,.$$
Proof: The first formula follows at once from (6.4.12). In the second formula
we need to recall the ∗ operation. It entails a counterclockwise rotation of 90
degrees. Suppose, without loss of generality, that 0 ≤ ϕ1 ≤ . . . ≤ ϕn ≤ π.
Then
$$\mathrm{sgn}\,\det\begin{pmatrix} x_{i1} & x_{j1} \\ x_{i2} & x_{j2} \end{pmatrix} = \mathrm{sgn}\,\det\begin{pmatrix} s_i r_i\cos\varphi_i & s_j r_j\cos\varphi_j \\ s_i r_i\sin\varphi_i & s_j r_j\sin\varphi_j \end{pmatrix} = \mathrm{sgn}\{s_i s_j r_i r_j(\cos\varphi_i\sin\varphi_j - \sin\varphi_i\cos\varphi_j)\} = s_i s_j\,\mathrm{sgn}\{\sin(\varphi_j - \varphi_i)\} = s_i s_j\,.$$

Now if $x_i$ is in the first or second quadrant then $y_i^* = x_i^* = s_i x_i^*$, and if $x_i$ is in the third or fourth quadrant then $y_i^* = -x_i^* = s_i x_i^*$. Hence, in all cases we have $x_i^* = s_i y_i^*$. The second formula now follows. The third formula follows by straightforward algebraic manipulations. We leave these details to the reader; see Exercise 6.8.15.
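A direct R transcription of the first formula of Theorem 6.4.3 may clarify the computation (an O(n²) loop; the package OjaNP provides production code). Here x is assumed to be an n × 2 data matrix:

# Sketch of (1/n) S8(0) from Theorem 6.4.3; rot90 is the counterclockwise
# 90-degree rotation x* used in the text.
rot90 <- function(v) c(-v[2], v[1])
oja_sign_stat <- function(x) {
  n <- nrow(x); s <- c(0, 0)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    dij <- sign(x[i, 1] * x[j, 2] - x[i, 2] * x[j, 1])  # sgn det[x_i x_j]
    s <- s + dij * (rot90(x[j, ]) - rot90(x[i, ]))
  }
  s / (2 * n)
}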
Based on the notation at the end of the proof of Theorem 6.4.3, we have

$$z_i = \sum_{j=i+1}^{n} y_j^* - \sum_{j=1}^{i-1} y_j^*\,, \quad i = 2, \ldots, n-1\,, \qquad z_1 = \sum_{j=2}^{n} y_j^*\,, \qquad z_n = -\sum_{j=1}^{n-1} y_j^*\,. \quad (6.4.13)$$

The third formula shows that we have a sign statistic similar to the ones that we have been studying. Under the null hypothesis, $(s_1, \ldots, s_n)$ and $(z_1, \ldots, z_n)$ are independent. Hence, conditionally on $z_1, \ldots, z_n$ (or equivalently conditionally on $y_1, \ldots, y_n$) the conditional covariance matrix of $S_8(0)$ is $\widehat{A} = \frac{1}{4}\sum_i z_i z_i^T$. A conditional distribution-free test is

$$\text{reject } H_0: \theta = 0 \text{ when } S_8^T(0)\widehat{A}^{-1}S_8(0) \ge \chi^2_\alpha(2)\,. \quad (6.4.14)$$
then

$$\frac{1}{n^{3/2}}S_8(0) = \frac{1}{2\sqrt{n}}\sum_{i=1}^{n} s_i\,z\!\left(\frac{i}{n}\right) + o_p(1)\,.$$

Proof: We sketch the argument. A more general result and a rigorous argument can be found in Brown et al. (1992). We begin by referring to formula (6.4.13). Recall that

$$\frac{1}{n}\sum_{i=1}^{n} y_i^* = \frac{1}{n}\sum_{i=1}^{n} r_i\begin{pmatrix} -\sin\varphi_i \\ \cos\varphi_i \end{pmatrix}.$$

Consider the second component and let $\cong$ mean that the approximation is valid up to $o_p(1)$ terms. From the discussion of $\max_i |\pi i/(n+1) - \varphi_{(i)}|$ in Theorem 6.4.1, we have

$$\frac{1}{n}\sum_{i=1}^{[nt]} r_i\cos\varphi_i \cong \frac{1}{n}\{Er\}\sum_{i=1}^{[nt]}\cos\varphi_i \cong \frac{Er}{\pi}\,\frac{\pi}{n}\sum_{i=1}^{[nt]}\cos\frac{\pi i}{n+1} \cong \frac{Er}{\pi}\int_0^{\pi t}\cos u\,du = \frac{Er}{\pi}\sin\pi t\,.$$

Furthermore,

$$\frac{1}{n}\sum_{i=[nt]}^{n} r_i\cos\varphi_i \cong \frac{Er}{\pi}\int_{\pi t}^{\pi}\cos u\,du = -\frac{Er}{\pi}\sin\pi t\,.$$

Hence the formula holds for the second component. The first component formula follows in a similar way.
This proposition is important since it shows that the Oja sign test is asymp-
totically equivalent to Blumen’s test under elliptical models since they are
both invariant under affine transformations. Hence, the efficiency results for
Blumen’s test carry over for spherical and elliptical models to the Oja sign
test. Also recall that Blumen’s test is locally most powerful invariant for the
class of elliptical models so the Oja sign test should be quite good for elliptical
models in general. The two tests are not equivalent for nonelliptical models.
In Brown et al. (1992) the efficiency of the Oja sign test relative to Blumen’s
test was computed for a class of symmetric densities with contours of the form
$|x_1|^m + |x_2|^m$. When m = 2 we have spherical densities, and when m = 1 we
have Laplacian densities with independent marginals. Table 1 of Brown et al.
(1992) shows that the Oja sign test is more efficient than Blumen’s test except
when m = 2 where, of course, the efficiency is 1. Hettmansperger, Nyblom,
and Oja (1994) extend the Oja methods to dimensions higher than 2 in the
one sample case and Hettmansperger and Oja (1994) extend the methods to
higher dimensions for the multisample problem.
In Brown and Hettmansperger (1987a), the idea of an affine invariant
rank vector is introduced. The approach is similar to that of Möttönen and
Oja (1995) for the spatial rank vector discussed earlier; see Section 6.3.2. The
Oja criterion D8 (θ) with m = 1 in Section 6.4.3 is a multivariate extension
of the univariate $L_1$ criterion function, and we take its gradient to be the centered rank vector. Recall in the univariate case $D(\theta) = \sum |x_j - \theta|$ and the derivative $D'(\theta) = \sum \mathrm{sgn}(\theta - x_j)$. Hence, $D'(x_i)$ is the centered rank of $x_i$. Likewise the vector centered rank of $x_k$ is defined to be:

$$R_n(x_k) = \frac{1}{2}\nabla D_8(x_k) = \frac{1}{2}\sum_{i<j}\mathrm{sgn}\left\{\det\begin{pmatrix} 1 & 1 & 1 \\ x_{k1} & x_{i1} & x_{j1} \\ x_{k2} & x_{i2} & x_{j2} \end{pmatrix}\right\}(x_j^* - x_i^*)\,. \quad (6.4.15)$$
Again we use the idea of affine invariant vector rank to define the Oja signed-rank statistic. Let $R_{2n}(x_k)$ be the rank vector when $x_k$ is ranked among the observation vectors $x_1, \ldots, x_n$ and their reflections $-x_1, \ldots, -x_n$. Then the test statistic is $S_9(0) = \sum R_{2n}(x_j)$. Now $R_{2n}(-x_j) = -R_{2n}(x_j)$, so the conditional covariance matrix (conditioning on the observed data) is

$$\widehat{A} = \sum_{j=1}^{n} R_{2n}(x_j)R_{2n}^T(x_j)\,.$$
the data of Example 6.4.1. The numerical results are similar to the results of
the affine spatial methods; see Table 6.4.1 for the results. Note that due to
computational complexity it is not possible to bootstrap the covariance matrix
of the Oja HL-estimate. The R library OjaNP can be used for computations.
the data. Further, the discriminant representation can be used for allocation
of data of unknown group membership. Crimin, McKean, and Sheather (2007)
proposed a robust approach to discriminant analysis based on efficient robust
discriminant coordinates. These coordinates are obtained by the maximiza-
tion of a Lawley-Hotelling test based on robust estimates. The design matrix
used in their fitting is the usual one-way incidence matrix of zeros and ones;
hence, the procedure uses highly efficient robust estimators to do the fitting.
This produces efficient robust discriminant coordinates which allow the user
to visually assess the differences among groups. In particular, Crimin et al.
showed that the robust procedure based on using the Hettmansperger and
Randles (2002) (HR) estimates of location for each group had quite good ef-
ficiency properties. In the simulation study conducted by Crimin et al. these
efficiency properties were verified for the situations investigated. Other than
the normally distributed data, the HR procedure was more efficient than the
LS procedure in terms of empirical misclassification probabilities. Further,
over the situations investigated (including the elliptical Cauchy distribution),
it was much more efficient than the high-breakdown but low-efficiency discriminant analysis proposed by Hawkins and McLachlan (1997), which uses
minimum covariance determinants (MCD) as an estimator of scatter.
ΩT (x) ∝ (F1 (x11 ) − 1/2, F2 (x21 ) − 1/2), where Fi (· ) is the ith marginal cdf.
Hence, the influence is bounded in this case as well. The breakdown point is
29%, the same as the univariate case. Note, however, that the componentwise
methods are neither rotation nor affine invariant/equivariant.
$$\|y_j - \hat{\theta}(Y_m)\| \ge \|y_j\| - \|\hat{\theta}(Y_m)\| \ge \|y_j\| - (d_m + 2M)\,. \quad (6.5.1)$$

$$\|x_k - \hat{\theta}(Y_m)\| \ge M + \|x_k\| + d_m\,. \quad (6.5.2)$$
Next split the following sum up over contaminated and not contaminated
where $d_i = \|U_x(x_i - \theta)\|^2$, with $u_1(d) = d^{-1/2}$ and $u_2(d) = kd^{-1}$. Because they are M estimators, the breakdown value for $\hat{\theta}$ is between $(k+1)^{-1}$ and $k^{-1}$, where k is the dimension of the underlying population. The asymptotic theory for
recall (6.3.3). Hence, we see that the influence function is bounded with a pos-
itive breakdown. Note however that the breakdown decreases as the dimension
of the underlying distribution increases.
$$Y = 1\alpha^T + X\beta + \varepsilon\,, \quad (6.6.2)$$
where $R_{ij}$ is the rank of $Y_{ij} - x_i^T\beta^{(j)}$ when ranked among $Y_{1j} - x_1^T\beta^{(j)}, \ldots, Y_{nj} - x_n^T\beta^{(j)}$. The rank scores are generated by a score function $\varphi(u)$, $0 < u < 1$, as $a(i) = \varphi(\frac{i}{n+1})$, with $\int\varphi(u)\,du = 0$ and $\int\varphi^2(u)\,du = 1$; see Section 3.4. Let the score matrix A be defined as follows:

$$A = \begin{pmatrix} a_{11} & a_{12} \\ \vdots & \vdots \\ a_{n1} & a_{n2} \end{pmatrix} = (a^{(1)}, a^{(2)}) \quad (6.6.4)$$

so that each column is the set of rank scores within the column.
where $a_i^T = (a_{i1}, a_{i2}) = (a(R(Y_{i1} - x_i^T\beta^{(1)})), a(R(Y_{i2} - x_i^T\beta^{(2)})))$ and $r_i^T = (Y_{i1} - x_i^T\beta^{(1)}, Y_{i2} - x_i^T\beta^{(2)})$. Note at once that this is an analog, using inner products, of the univariate criterion in Section 3.2.1. In fact, $D(\beta)$ is the sum of the
see Exercise 6.8.18 and equation (3.2.11). Again, note that the two columns in
(6.6.6) are the estimating functions for the two concatenated univariate linear
models and xi is the ith row of X written as a column.
Hence, the componentwise multivariate R estimator of β is the $\widehat{\beta}$ that minimizes (6.6.5) or solves $L(\beta) \doteq 0$. Further, $L(0)$ is the basic quantity that we
use to test H0 : β = 0. We must statistically assess the size of L(0) and reject
H0 and claim the presence of a regression effect when L(0) is “too large” or
“too far from the zero matrix.”
We first consider testing H0 : β = 0 since the distribution theory of the test
statistic is useful later for the asymptotic distribution theory of the estimate.
For the linear model we need some results on direct products; see Magnus
and Neudecker (1988) for a complete discussion. We list here the results that
we need:
1. Let A and B be m × n and p × q matrices. The mp × nq matrix $A \otimes B$ defined by

$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix} \quad (6.6.7)$$

is called the direct product or Kronecker product of A and B.

2. The following identities hold (a quick numerical check appears after this list):

$$(A \otimes B)^T = A^T \otimes B^T\,, \quad (6.6.8)$$
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}\,, \quad (6.6.9)$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD\,. \quad (6.6.10)$$
These facts are used in the proofs of the theorems in the rest of this chapter.
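The identities are easy to verify numerically with R's kronecker function; for example:

# Quick check of (6.6.8)-(6.6.10) with random 2 x 2 matrices
A <- matrix(rnorm(4), 2); B <- matrix(rnorm(4), 2)
C <- matrix(rnorm(4), 2); D <- matrix(rnorm(4), 2)
max(abs(t(kronecker(A, B)) - kronecker(t(A), t(B))))                         # ~ 0
max(abs(solve(kronecker(A, B)) - kronecker(solve(A), solve(B))))             # ~ 0
max(abs(kronecker(A, B) %*% kronecker(C, D) - kronecker(A %*% C, B %*% D)))  # ~ 0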
where $F_1(s)$ and $F_2(t)$ are the marginal cdfs of $F(s,t)$. Since the ranks are centered and using the same argument as in Theorem 3.5.1, $E(L_{col}) = 0$ and

$$V = \mathrm{Cov}(L_{col}) = \begin{pmatrix} \sigma^2_{a^{(1)}}X^TX & \sigma_{a^{(1)}a^{(2)}}X^TX \\ \sigma_{a^{(1)}a^{(2)}}X^TX & \sigma^2_{a^{(2)}}X^TX \end{pmatrix} = \frac{1}{n-1}A^TA \otimes X^TX\,. \quad (6.6.14)$$

Further,

$$\frac{1}{n}V \to \begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix} \otimes \Sigma\,, \quad (6.6.15)$$

where $n^{-1}X^TX \to \Sigma$ and Σ is positive definite.
The test statistic for $H_0: \beta = 0$ is the quadratic form

$$AR = L_{col}^T V^{-1} L_{col} = (n-1)\,L_{col}^T\left[(A^TA)^{-1} \otimes (X^TX)^{-1}\right]L_{col} \quad (6.6.16)$$

where we use a basic formula for finding the inverse of a direct product; see (6.6.9). Before discussing the distribution theory we record one final result from traditional multivariate analysis:

$$AR = (n-1)\,\mathrm{trace}\{L^T(X^TX)^{-1}L(A^TA)^{-1}\}\,; \quad (6.6.17)$$
see Exercise 6.8.19. This result is useful in translating a quadratic form involv-
ing a direct product into a trace involving ordinary matrix products. Expres-
sion (6.6.17) corresponds to the Lawley-Hotelling trace statistic based on
ranks within the components. The following theorem summarizes the distri-
bution theory needed to carry out the test.
Proof: This theorem follows along the same lines as Theorem 3.5.2. Use a projection to establish that $\frac{1}{\sqrt{n}}L_{col}$ is asymptotically normally distributed; then AR is asymptotically chisquared. The details are left as Exercise 6.8.20; however, the projection is provided below for use with the estimator:

$$\frac{1}{\sqrt{n}}L_{col} = \frac{1}{\sqrt{n}}\begin{pmatrix} X^T\varphi^{(1)} \\ X^T\varphi^{(2)} \end{pmatrix} + o_p(1) \quad (6.6.18)$$

where $\varphi^{(i)} = (\varphi(F_i(\epsilon_{1i})), \ldots, \varphi(F_i(\epsilon_{ni})))^T$, $i = 1, 2$, and $F_1, F_2$ are the marginal cdfs. Recall also that $a(i) = \varphi(\frac{i}{n+1})$, where $\varphi(\cdot)$ is the score generating function. The asymptotic covariance matrix is given in (6.6.15).
where $R_{11}, \ldots, R_{n1}$ are the ranks of the combined samples in the first component and similarly $R_{12}, \ldots, R_{n2}$ for the second component. Note that $\sigma_{a^{(1)}a^{(2)}} = \frac{n}{n+1}r_s$, where $r_s$ is Spearman's rank correlation coefficient. Hence,

$$\frac{1}{n-1}A^TA = \begin{pmatrix} \frac{n}{n+1} & \sigma_{a^{(1)}a^{(2)}} \\ \sigma_{a^{(1)}a^{(2)}} & \frac{n}{n+1} \end{pmatrix} = \frac{n}{n+1}\begin{pmatrix} 1 & r_s \\ r_s & 1 \end{pmatrix} \to \begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}$$

where

$$\sigma_{12} = 12\iint\left(F_1(r) - \frac{1}{2}\right)\left(F_2(s) - \frac{1}{2}\right)dF(r,s)$$

depends on the underlying bivariate distribution.
Next, we must consider the design matrix X for the two-sample model.
Recall (2.2.1) and (2.2.2) which cast the two-sample model as a linear model in
the univariate case. The design matrix (or vector in this case) is not centered.
For convenience we modify C in (2.2.1) to have 1 in the first n1 places and
0 elsewhere. Note that the mean of C is $n_1/n$, and subtracting this from the elements of C yields the centered design:

$$X = \frac{1}{n}\begin{pmatrix} n_2 \\ \vdots \\ n_2 \\ -n_1 \\ \vdots \\ -n_1 \end{pmatrix}.$$

So $l_i$ is the centered and scaled sum of ranks of the first sample in the ith component.
Now $L_{col} = (l_1, l_2)^T$ has an approximate bivariate normal distribution with covariance matrix:

$$\mathrm{Cov}(L_{col}) = \frac{1}{n-1}(A^TA) \otimes (X^TX) = \frac{n_1 n_2}{n(n-1)}A^TA = \frac{n_1 n_2}{n+1}\begin{pmatrix} 1 & r_s \\ r_s & 1 \end{pmatrix}.$$

Note that $\sigma_{12}$ is unknown but estimated by Spearman's rank correlation coefficient $r_s$ (see the above discussion). Hence the test is based on AR in (6.6.17). It is easy to invert $\mathrm{Cov}(L_{col})$ and we have (see Exercise 6.8.20)

$$AR = \frac{n+1}{n_1 n_2(1 - r_s^2)}\{l_1^2 + l_2^2 - 2r_s l_1 l_2\} = \frac{1}{1-r_s^2}\{l_1^{*2} + l_2^{*2} - 2r_s l_1^* l_2^*\}\,,$$

where $l_1^*$ and $l_2^*$ are the standardized MWW statistics. We reject $H_0: \beta = 0$ at approximately level α when $AR \ge \chi^2_\alpha(2)$. The test statistic AR is a quadratic form in the component Mann-Whitney-Wilcoxon rank statistics, and $r_s$ provides the adjustment for the correlation between the components.
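A sketch of the whole procedure in R, assuming the two samples are the rows of matrices y1 and y2 (hypothetical names):

# Sketch of the bivariate two-sample rank test AR; y1, y2 are assumed to be
# n1 x 2 and n2 x 2 response matrices.
bivariate_mww <- function(y1, y2) {
  n1 <- nrow(y1); n2 <- nrow(y2); n <- n1 + n2
  R  <- apply(rbind(y1, y2), 2, rank)      # combined-sample component ranks
  rs <- cor(R[, 1], R[, 2])                # Spearman's rank correlation
  z  <- (colSums(R[1:n1, ]) - n1 * (n + 1) / 2) /
        sqrt(n1 * n2 * (n + 1) / 12)       # standardized MWW statistics
  AR <- (z[1]^2 + z[2]^2 - 2 * rs * z[1] * z[2]) / (1 - rs^2)
  c(AR = AR, p.value = 1 - pchisq(AR, df = 2))
}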
Example 6.6.2 (Brains of Mice Data). For this example, we consider bivari-
ate data on levels of certain biochemical components in the brains of mice;
see the url listed in the Preface for the tabled data. The treatment group re-
ceived a drug which was hypothesized to alter these levels. The control group
received a placebo.
The ranks of the combined treatment and control data for each component
are given in the table, under component ranks. The Spearman rank correlation coefficient is computed from these component ranks.
Figure 6.6.1: Panel A: Plot of the data for the brains of mice data; Panel B: Plot of the corresponding ranks of the brains of mice data (T = treatment, C = control).
In this design the first of the k populations is taken as the reference pop-
ulation with location (α1 , α2 )T . The ith row of the β matrix is the vector of
shift parameters for the (i + 1)st population relative to the first population.
We wish to test H0 : β = 0 that all populations have the same (unknown)
location vector.
The matrix A = (a(1) , a(2) ) has the centered and scaled Wilcoxon scores of
the previous example. Hence, a(1) is the vector of rank scores for the combined
k samples in the first component. Since the rank scores are centered, we have
and the second version is easier to compute. Now $L(0) = (L^{(1)}, L^{(2)})$ and the hth component of column i is

$$l_{hi} = \sqrt{12}\sum_{j\in S_h}\left(\frac{R_{ji}}{n+1} - \frac{1}{2}\right) = \frac{\sqrt{12}}{n+1}\,n_h\left(\bar{R}_{hi} - \frac{n+1}{2}\right)$$

where $S_h$ is the index set corresponding to the hth sample and $\bar{R}_{hi}$ is the average rank of the hth sample in the ith component.
As in the previous example, we replace $\frac{1}{n-1}A^TA$ by its limit with 1 on the main diagonal and $\sigma_{12}$ off the diagonal. Then let $((\sigma^{ij}))$ be the inverse matrix.
This is easy to compute and is useful below. The test statistic is then, from (6.6.16),

$$AR \approx (L^{(1)T}, L^{(2)T})\begin{pmatrix} \sigma^{11}(X^TX)^{-1} & \sigma^{12}(X^TX)^{-1} \\ \sigma^{12}(X^TX)^{-1} & \sigma^{22}(X^TX)^{-1} \end{pmatrix}\begin{pmatrix} L^{(1)} \\ L^{(2)} \end{pmatrix}$$
$$= \sigma^{11}L^{(1)T}(X^TX)^{-1}L^{(1)} + 2\sigma^{12}L^{(1)T}(X^TX)^{-1}L^{(2)} + \sigma^{22}L^{(2)T}(X^TX)^{-1}L^{(2)}\,.$$

The ≈ indicates that the right side contains asymptotic quantities which must be estimated in practice. Now

$$L^{(1)T}(X^TX)^{-1}L^{(1)} = \sum_{j=1}^{k} n_j^{-1}l_{j1}^2 = \frac{12}{(n+1)^2}\sum_{j=1}^{k} n_j\left(\bar{R}_{j1} - \frac{n+1}{2}\right)^2 = \frac{n}{n+1}H_1$$
Treatment       1          2        ···    k
Component 1   $\bar{R}_{11}$   $\bar{R}_{21}$   ···   $\bar{R}_{k1}$
Component 2   $\bar{R}_{12}$   $\bar{R}_{22}$   ···   $\bar{R}_{k2}$

Then use Minitab or some other package to find the two Kruskal-Wallis statistics. To compute $H_{12}$ either use the formula above or use

$$H_{12} = \sum_{j=1}^{k}\left(1 - \frac{n_j}{n}\right)Z_{j1}Z_{j2}\,, \quad (6.6.19)$$

where $Z_{ji} = (\bar{R}_{ji} - (n+1)/2)/\sqrt{\mathrm{Var}\,\bar{R}_{ji}}$ and $\mathrm{Var}\,\bar{R}_{ji} = (n - n_j)(n+1)/(12n_j)$; see Exercise 6.8.22. The package Minitab lists the $Z_{ji}$ in its output.
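Putting the pieces together, a hedged R sketch of the bivariate k-sample test, assuming y is the n × 2 response matrix and g the group factor (the factor n/(n+1) links $H_1$, $H_2$, and $H_{12}$ to the quadratic forms above):

# Sketch of the bivariate k-sample rank test; H1, H2 are the usual
# Kruskal-Wallis statistics and H12 follows (6.6.19).
kw_bivariate <- function(y, g) {
  n <- nrow(y); nj <- as.numeric(table(g)); k <- length(nj)
  R  <- apply(y, 2, rank)
  rs <- cor(R[, 1], R[, 2])                     # estimates sigma_12
  Z  <- sapply(1:2, function(i) {
    Rbar <- tapply(R[, i], g, mean)             # average ranks by group
    (Rbar - (n + 1) / 2) / sqrt((n - nj) * (n + 1) / (12 * nj))
  })
  H1  <- sum((1 - nj / n) * Z[, 1]^2)           # Kruskal-Wallis, component 1
  H2  <- sum((1 - nj / n) * Z[, 2]^2)           # Kruskal-Wallis, component 2
  H12 <- sum((1 - nj / n) * Z[, 1] * Z[, 2])    # (6.6.19)
  AR  <- (n / (n + 1)) * (H1 - 2 * rs * H12 + H2) / (1 - rs^2)
  c(AR = AR, p.value = 1 - pchisq(AR, df = 2 * (k - 1)))
}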
The last example shows that in the general regression problem with Wilcoxon scores, if we wish to test $H_0: \beta = 0$, the test statistic (6.6.16) can be written as

$$AR = \frac{1}{1-\hat{\sigma}_{12}^2}\left\{L^{(1)T}(X^TX)^{-1}L^{(1)} - 2\hat{\sigma}_{12}L^{(1)T}(X^TX)^{-1}L^{(2)} + L^{(2)T}(X^TX)^{-1}L^{(2)}\right\}$$

where the estimate of $\sigma_{12}$ can be taken to be $r_s$ or $\frac{n}{n+1}r_s$, with $r_s$ Spearman's rank correlation coefficient, and

$$l_{hi} = \frac{\sqrt{12}}{n+1}\sum_{j=1}^{n}\left(R_{ji} - \frac{n+1}{2}\right)x_{jh} = \sum_{j=1}^{n} a(R(Y_{ji}))x_{jh}\,.$$
and this is the asymptotic covariance matrix for $\sqrt{n}(\widehat{\beta}_{col} - \beta_{col})$. We remind the reader that when we use the Wilcoxon score $\varphi(u) = \sqrt{12}(u - \frac{1}{2})$, then $\tau_i^{-1} = \sqrt{12}\int f_i^2(x)\,dx$, with $f_i$ the marginal pdf, $i = 1, 2$, and $\hat{\sigma}_{12} = \frac{n}{n+1}r_s$, where $r_s$ is Spearman's rank correlation coefficient. See Section 3.7.1 for a discussion of the estimation of $\tau_i$.
here τ, $\sigma_{12}$, Σ are given in Theorem 6.6.2. Let V denote the asymptotic covariance matrix and let $V^* = n(M\widehat{\beta})_{col}^T\widehat{V}^{-1}(M\widehat{\beta})_{col}$. Then

$$V^* = \widehat{\beta}_{col}^T\left\{\left[\hat{\tau}\,\tfrac{1}{n-1}A^TA\,\hat{\tau}\right]^{-1} \otimes M^T(M(X^TX)^{-1}M^T)^{-1}M\right\}\widehat{\beta}_{col}$$
$$= \mathrm{trace}\left\{(M\widehat{\beta})^T(M(X^TX)^{-1}M^T)^{-1}(M\widehat{\beta})\left[\hat{\tau}\,\tfrac{1}{n-1}A^TA\,\hat{\tau}\right]^{-1}\right\} \quad (6.6.21)$$

Using tr to denote trace, denote the test statistic, (6.6.21), defined in the last theorem to be
The next theorem describes the test when only K is involved. After that we put the two results together for the general statement.

Theorem 6.6.4. Under $H_0: \beta K = 0$, where K is a 2 × s matrix, $\sqrt{n}(\widehat{\beta}K)_{col}$ is asymptotically

$$N_{ps}\left(0,\ K^T\tau\begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}\tau K \otimes \Sigma^{-1}\right)$$

where τ, $\sigma_{12}$, and Σ are given in Theorem 6.6.2. Let V denote the asymptotic covariance matrix. Then

$$n(\widehat{\beta}K)_{col}^T\widehat{V}^{-1}(\widehat{\beta}K)_{col} = \mathrm{trace}\left\{(\widehat{\beta}K)^T(X^TX)(\widehat{\beta}K)\left[K^T\hat{\tau}\,\tfrac{1}{n-1}A^TA\,\hat{\tau}K\right]^{-1}\right\}$$

is asymptotically $\chi^2(ps)$.

Proof: First note that $(\widehat{\beta}K)_{col} = (K^T \otimes I)\widehat{\beta}_{col}$. Then from Theorem 6.6.2, the asymptotic covariance matrix of $\sqrt{n}\,\widehat{\beta}_{col}$ is

$$\mathrm{AsyCov}(\sqrt{n}\,\widehat{\beta}_{col}) = \tau\begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}\tau \otimes \Sigma^{-1}\,.$$

Hence, the asymptotic covariance matrix of $\sqrt{n}(\widehat{\beta}K)_{col}$ is

$$\mathrm{AsyCov}(\sqrt{n}(\widehat{\beta}K)_{col}) = (K^T \otimes I)\left[\tau\begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}\tau \otimes \Sigma^{-1}\right](K \otimes I) = K^T\tau\begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}\tau K \otimes \Sigma^{-1}\,,$$

which is the desired result. The asymptotic normality and chisquare distribution follow from Theorem 6.6.2.
The previous two theorems can be combined to yield the general case.
If V is the asymptotic covariance matrix with estimate $\widehat{V}$ then, letting tr denote the trace operator,

$$n(M\widehat{\beta}K)_{col}^T\widehat{V}^{-1}(M\widehat{\beta}K)_{col} = (M\widehat{\beta}K)_{col}^T\left\{\left[K^T\hat{\tau}\,\tfrac{1}{n-1}A^TA\,\hat{\tau}K\right]^{-1} \otimes [M(X^TX)^{-1}M^T]^{-1}\right\}(M\widehat{\beta}K)_{col}$$
$$= \mathrm{tr}\left\{(M\widehat{\beta}K)^T[M(X^TX)^{-1}M^T]^{-1}(M\widehat{\beta}K)\left[K^T\hat{\tau}\,\tfrac{1}{n-1}A^TA\,\hat{\tau}K\right]^{-1}\right\} \quad (6.6.24)$$

has an asymptotic $\chi^2(rs)$ distribution.
The last theorem provides great flexibility in composing and testing hypotheses in the multivariate linear model. We must estimate the matrix \beta along with the other parameters familiar from the linear model; once we have these estimates, however, the test statistic is obtained by a simple series of matrix multiplications and the trace operation.
Denote the test statistic, (6.6.24), defined in the last theorem, by

Q_{MVRK} = \mathrm{tr}\left\{ (M\hat{\beta}K)^T \left[ M(X^TX)^{-1}M^T \right]^{-1} (M\hat{\beta}K) \left[ K^T \hat{\tau} \frac{1}{n-1} A^TA\, \hat{\tau} K \right]^{-1} \right\} .    (6.6.25)
Then the corresponding level \alpha asymptotic decision rule is: reject H_0: M\beta K = O in favor of H_A: M\beta K \ne O if Q_{MVRK} \ge \chi^2_\alpha(rs).
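In code, Q_{MVRK} is exactly the series of matrix multiplications and trace described above. Below is a minimal Python sketch (ours, not from the text); it assumes the fitted quantities betahat (p × d), tauhat (a d × d diagonal matrix of scale estimates), and the n × d matrix A of scored residuals are available from the rank-based fit:

import numpy as np
from scipy.stats import chi2

def q_mvrk(betahat, X, M, K, tauhat, A, alpha=0.05):
    # betahat: p x d R estimate; A'A/(n-1) estimates the score covariance.
    # Returns (Q_MVRK, degrees of freedom, reject-at-level-alpha).
    n = X.shape[0]
    MBK = M @ betahat @ K                                  # r x s matrix
    left = np.linalg.inv(M @ np.linalg.inv(X.T @ X) @ M.T)
    right = np.linalg.inv(K.T @ tauhat @ (A.T @ A / (n - 1)) @ tauhat @ K)
    Q = float(np.trace(MBK.T @ left @ MBK @ right))
    df = M.shape[0] * K.shape[1]                           # rs degrees of freedom
    return Q, df, Q >= chi2.ppf(1.0 - alpha, df)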
Under the above assumptions, and assuming that \Lambda is positive definite and that H_0: M\beta K = O is true, Q_{LH} has an asymptotic \chi^2 distribution with rs degrees of freedom. This type of hypothesis arises in profile analysis; see Chinchilli and Sen (1982) for this application. In order to illustrate these tests, we complete this section with an example.
Example 6.6.4 (Tablet Potency Data). The following data are the results
from a pharmaceutical experiment on the effects of four factors on five mea-
surements of a tablet:
Responses Factors Covariate
POT2 POT4 RSDCU HARD H2 O SAE SAI ADS TYPE POT0
7.94 3.15 1.20 8.50 0.188 1 1 1 1 9.38
8.13 3.00 0.90 6.80 0.250 1 1 1 -1 9.67
8.11 2.70 2.00 9.50 0.107 1 1 -1 1 9.91
7.96 4.05 2.30 6.00 0.125 1 1 -1 -1 9.77
7.83 1.90 0.50 9.80 0.142 -1 1 1 1 9.50
7.91 2.30 0.90 6.60 0.229 -1 1 1 -1 9.35
7.82 1.40 1.10 8.43 0.112 -1 1 -1 1 9.58
7.42 2.60 2.60 8.50 0.093 -1 1 -1 -1 9.69
8.06 2.00 1.90 6.17 0.207 1 -1 1 1 9.62
8.51 2.80 1.70 7.20 0.184 1 -1 1 -1 9.89
7.88 3.35 4.70 9.30 0.107 1 -1 -1 1 9.80
7.58 3.05 4.00 8.10 0.102 1 -1 -1 -1 9.73
8.14 1.20 0.80 7.17 0.202 -1 -1 1 1 9.51
8.06 2.95 2.50 7.80 0.027 -1 -1 1 -1 9.82
7.31 1.85 2.10 8.70 0.116 -1 -1 -1 1 9.20
8.66 4.10 3.60 6.40 0.114 -1 -1 -1 -1 9.53
8.16 3.95 2.00 8.00 0.183 0 0 0 1 9.67
8.02 2.85 1.10 6.61 0.139 0 0 0 -1 9.41
8.03 3.20 3.60 9.80 0.171 0 1 0 1 9.62
7.93 3.20 6.10 7.33 0.152 0 1 0 -1 9.49
7.84 3.95 2.00 7.70 0.165 0 -1 0 1 9.96
7.59 1.15 2.10 7.03 0.149 0 -1 0 -1 9.79
8.28 3.95 0.70 8.40 0.195 1 0 0 1 9.46
7.75 3.35 2.20 6.37 0.168 1 0 0 -1 9.78
7.95 3.85 7.20 9.30 0.158 -1 0 0 1 9.48
8.69 2.80 1.30 6.57 0.169 -1 0 0 -1 9.46
8.38 3.50 1.70 8.00 0.249 0 0 1 1 9.73
8.15 2.00 2.30 6.80 0.189 0 0 1 -1 9.67
8.12 3.85 2.50 7.90 0.116 0 0 -1 1 9.84
7.72 3.50 2.20 5.60 0.110 0 0 -1 -1 9.84
7.96 3.55 1.80 7.85 0.135 0 0 0 1 9.50
8.20 2.75 0.60 7.20 0.161 0 0 0 -1 9.78
8.10 3.30 0.97 8.73 0.152 0 0 0 1 9.71
8.16 3.90 2.40 7.50 0.155 0 0 0 -1 9.57
There are n = 34 data cases. The five responses are: (POT2), potency of the tablet at the end of 2 weeks; (POT4), potency of the tablet at the end of 4 weeks; the third and fourth responses are measures of the tablet purity (RSDCU) and hardness (HARD); and the fifth response is its water content (H2O). Hence, we have a 5-dimensional response rather than the bivariate responses discussed so far. This means that the degrees of freedom are 5r rather than 2r in Theorem 6.6.3. The factors are: SAI, the amount of intragranular stearic acid, which was set at the three levels −1, 0, and 1; SAE, the amount of extragranular stearic acid, which was set at the three levels −1, 0, and 1; ADS,
the amount of croscarmellose sodium, which was set at the three levels −1, 0, and 1; and TYPE of stearic acid, which was set at two levels −1 and 1. The initial potency of the compound, POT0, served as a covariate. These data were used in an example in the article by Davis and McKean (1993) and much of our discussion below is taken from this article.
This data set was treated as a univariate model for the response POT2
in Chapter 3; see Examples 3.3.3 and 3.9.2. As our full model we choose the
same model described in expression (3.3.1) of Example 3.3.3. It includes: the
linear effects of the four factors; six simple two-way interactions between the
factors; the three quadratic terms of the factors SAI, SAE, and ADS; and the
covariate; with the intercept, this gives a total of fifteen terms. The need for the quadratic terms was
discussed in the diagnostic analysis of this model for the response POT2; see
Example 3.9.2. Hence, Y is 34 × 5, X is 34 × 14, and β is 14 × 5.
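As a concrete illustration of these dimensions, here is one way the 34 × 14 design matrix could be assembled in Python (a sketch under our own column ordering; the function and argument names are ours, and the inputs are the factor and covariate columns of the data table):

import numpy as np

def tablet_design(SAE, SAI, ADS, TYPE, POT0):
    # Four linear terms, the six two-way interactions, quadratics in
    # SAI, SAE, and ADS, and the covariate POT0: 14 columns in all.
    F = np.column_stack([SAE, SAI, ADS, TYPE])
    interactions = [F[:, i] * F[:, j] for i in range(4) for j in range(i + 1, 4)]
    quadratics = [np.asarray(SAI) ** 2, np.asarray(SAE) ** 2, np.asarray(ADS) ** 2]
    X = np.column_stack([F] + interactions + quadratics + [POT0])
    return X - X.mean(axis=0)   # centered; the intercept is estimated separately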
Table 6.6.1 displays the results for the test statistic Q_{MVR}, (6.6.22), for the usual ANOVA hypotheses of interest: main effects, interaction effects broken down as simple two-way and quadratic, and the covariate. Also listed are the hypothesis matrices M for each effect, where the notation O_{t×u} represents a t × u matrix of 0s and I_t is the t × t identity matrix. Also given for comparison purposes are the results of the traditional Lawley-Hotelling test, based on the statistic (6.6.27) with K = I_5. For example, M = [I_4 \; O_{4\times 10}] yields a test of the hypothesis

H_0: \beta_{ij} = 0 , \quad i = 1, \ldots, 4 , \; j = 1, \ldots, 5 ;

that is, the linear terms vanish in all five components. Note that M is 4 × 14, so r = 4 and hence we have 4 × 5 = 20 degrees of freedom in Theorem 6.6.3.
The other hypothesis matrices are developed similarly. The robust analysis
indicates that all effects are significant except the covariate effect. In particular
the quadratic effect is significant for the robust analysis but not for the Lawley-
Hotelling test. This confirms the discussion on LS and robust residual plots
for this data set given in Example 3.9.2.
Are the effects of the factors different on the potencies of the tablet after 2 weeks, POT2, and after 4 weeks, POT4? This question can be answered by evaluating the statistic Q_{MVRK}, (6.6.26), for hypotheses of the form M\beta K, for the matrices M given in Table 6.6.1 and the 5 × 1 matrix K where K' = [1 −1 0 0 0]. For example, \beta_{11}, \ldots, \beta_{41} are the linear effects of SAE, SAI, ADS, and TYPE
on POT2 and \beta_{12}, \ldots, \beta_{42} are the linear effects on POT4. We may want to test
the hypothesis
H0 : β11 = β12 , . . . , β41 = β42 .
The M matrix picks the appropriate βs within a component and the K
matrix compares the results across components. From Table 6.6.1, choose
M = [I4 O4×10 ]. Then
M\beta = \begin{pmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{15} \\ \vdots & \vdots & & \vdots \\ \beta_{41} & \beta_{42} & \cdots & \beta_{45} \end{pmatrix} .
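The construction of this (M, K) pair is immediate in code; a short Python sketch (our own illustration):

import numpy as np

# M picks out the four linear-effect rows of beta; K contrasts the
# first two response components (POT2 versus POT4).
M = np.hstack([np.eye(4), np.zeros((4, 10))])        # 4 x 14
K = np.array([[1.0], [-1.0], [0.0], [0.0], [0.0]])   # 5 x 1, K' = [1 -1 0 0 0]

# With a fitted 14 x 5 estimate betahat, M @ betahat @ K is the 4 x 1
# vector of differences beta_i1 - beta_i2, i = 1,...,4, which is then
# tested with Q_MVRK, (6.6.26).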
This approach can be extended to the spatial rank and affine rank cases; recall the discussion in Example 6.3.2. In the spatial case the criterion function is D(\alpha, \beta) = \sum \|y_i - \alpha - \beta^T x_i\|, (6.1.4). Let u(x) = \|x\|^{-1}x and r_i^T = y_i^T - \alpha^T - x_i'\beta; then D(\alpha, \beta) = \sum u^T(r_i)\,r_i and hence,

A = \begin{pmatrix} u^T(r_1) \\ \vdots \\ u^T(r_n) \end{pmatrix} .

Further, let R_c(r_i) = \sum_j u(r_i - r_j) be the centered spatial rank vector. Then the criterion function is D(\alpha, \beta) = \sum R_c^T(r_i)\,r_i and

A^* = \begin{pmatrix} R_c^T(r_1) \\ \vdots \\ R_c^T(r_n) \end{pmatrix} .
The tests then can be carried out using the chisquare critical values. See Brown and Hettmansperger (1987b) and Möttönen and Oja (1995) for details.
For the details in the affine invariant sign or rank vector cases see Brown
and Hettmansperger (1987b), Hettmansperger, Nyblom, and Oja (1994), and
Hettmansperger, Möttönen, and Oja (1997a,b).
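Computationally, the matrices A and A* are straightforward to form from the residuals; the following Python sketch (ours) computes the spatial signs and centered spatial rank vectors directly:

import numpy as np

def spatial_signs(R):
    # Rows of A: u(r_i) = r_i / ||r_i||, with zero rows mapped to zero.
    R = np.asarray(R, dtype=float)
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    safe = np.where(norms == 0.0, 1.0, norms)
    return np.where(norms > 0.0, R / safe, 0.0)

def centered_spatial_ranks(R):
    # Rows of A*: R_c(r_i) = sum_j u(r_i - r_j); a direct O(n^2) loop.
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    ranks = np.zeros_like(R)
    for i in range(n):
        ranks[i] = spatial_signs(R[i] - R).sum(axis=0)
    return ranks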
Rao (1988) and Bai, Chen, Miao, and Rao (1990) consider a different formulation of a linear model. Suppose, for i = 1, \ldots, n, Y_i = X_i\beta + \epsilon_i, where Y_i is a 2 × 1 vector, X_i is a 2 × q matrix of known values, and \beta is a q × 1 vector of unknown parameters. Further, \epsilon_1, \ldots, \epsilon_n is an iid set of random vectors from a distribution with median vector 0. The criterion function is \sum \|Y_i - X_i\beta\|, the spatial criterion function. Estimates, tests, and the asymptotic theory are developed in the above references.
been recorded. Let y_{ijl} denote the response for the ith subject in Group j for the lth variable and let y_{ij} = (y_{ij1}, \ldots, y_{ijd})^T denote the vector of responses for this subject. Consider the model

y_{ij} = \mu_j + e_{ij} , \quad j = 1, \ldots, k , \; i = 1, \ldots, n_j ,

where the e_{ij} are independent and identically distributed. Let n = \sum n_j denote the total sample size. Let Y_{n \times d} denote the matrix of responses (the y_{ij}s are stacked sequentially by group) and let \epsilon be the corresponding n × d matrix of e_{ij}. Let \Gamma = (\mu_1, \ldots, \mu_k)^T be the k × d matrix of parameters. We can then write the model as

Y = W\Gamma + \epsilon ,    (6.7.1)

where W is the incidence matrix in expression (4.2.5). This is our full model and it is the multivariate analog of the basic model of Chapter 4, (4.2.1). If \mu_j is the vector of medians, then this is the multivariate medians model. On the other hand, if \mu_j is the vector of means, then this is the multivariate means model.
We are interested in the following general hypotheses:

H_0: M\Gamma K = O \;\; \text{versus} \;\; H_A: M\Gamma K \ne O ,    (6.7.2)

where M is an r × k contrast matrix (the rows of M sum to zero) of rank r and K is a d × s matrix of rank s.
In order to use the theory of Section 6.6 we need to transform Model (6.7.1) into a model of the form (6.6.2). Consider the k × k elementary column matrix E which replaces the first column of a matrix by the sum of all the columns of the matrix; i.e.,

[c_1 \; c_2 \; \cdots \; c_k] E = \left[ \sum_{i=1}^k c_i \;\; c_2 \; \cdots \; c_k \right] ,    (6.7.3)

for any matrix [c_1 c_2 · · · c_k]. Note that E is nonsingular. Hence we can write Model (6.7.1) as

Y = W\Gamma + \epsilon = WEE^{-1}\Gamma + \epsilon = [\mathbf{1} \; W_1] \begin{pmatrix} \alpha^T \\ \beta \end{pmatrix} + \epsilon ,    (6.7.4)
Table 6.7.1: Estimates Based on the Wilcoxon and LS Fits for the Paspalum Grass Data, Example 6.7.1. (V is the variance-covariance matrix of the vector of random errors \epsilon.)
Wilcoxon Fit LS Fit
Components Components
Parameter (1) (2) (3) (1) (2) (3)
µ11 1.04 3.14 .82 .97 3.12 .78
µ21 2.74 3.70 3.05 2.71 3.67 3.02
µ31 2.47 3.63 3.25 2.40 3.61 3.20
µ41 1.49 3.29 2.79 1.45 3.29 2.76
µ12 .94 3.12 .77 .92 3.12 .70
µ22 1.95 3.43 2.58 1.96 3.43 2.55
µ32 2.26 3.36 3.19 2.19 3.39 3.11
µ42 1.09 3.18 2.45 1.01 3.17 2.41
τ or σ .376 .188 .333 .370 .197 .292
             1.04  .62  .92      .14  .04  .09
A^T A or V    .62 1.04  .57      .04  .04  .03
              .92  .57 1.04      .09  .03  .09
Table 6.7.2: Test Statistics Q_{MVRK} and Q_{LH} Based on the Wilcoxon and LS Fits, Respectively, for the Paspalum Grass Data, Example 6.7.1. (Marginal F-tests are also given. The numerator degrees of freedom are given. Note that the denominator degrees of freedom for the marginal F-tests is 40.)
                        Wilcoxon                             LS
               MVAR         Marginal F_phi         MVAR          Marginal F_LS
Effect          df  Q_MVRK   df  (1)   (2)   (3)     df  Q_LH    df  (1)   (2)   (3)
Treat.           3   14.9     1  9.19  7.07  11.6     3  12.2     1  11.4  6.72  8.66
Temp.            9   819      3  32.5  13.4  61.4     9  980      3  45.2  13.4  162
Treat.×Temp.     9   11.2     3  2.27  1.49  1.35     9  7.98     3  2.01  .79   1.36
Using the estimates of Table 6.7.1 and the elementary column matrix E, as defined above expression (6.7.3), we obtained the test statistics Q_{MVRK}, (6.6.26), based on the Wilcoxon fit. For comparison we also obtained the LS test statistics Q_{LH}, (6.6.27). The values of these statistics for the hypotheses of interest are summarized in Table 6.7.2. The test for interaction is not significant while both main effects, Treatment and Temperature, are significant. The results are quite similar for the traditional test. We also tabulated the marginal test statistics, F_\varphi. The results for each component are similar to the multivariate result.
6.8 Exercises
6.8.1. Show that the vector of sample means of the components is affine
equivariant. See Definition 6.1.1.
6.8.3. Show that in the univariate case S2 (θ) = S3 (θ), (6.1.7) and (6.1.8).
6.8.5. Construct an example in the bivariate case for which the mean vector
rotates into the new mean vector but the vector of componentwise medians
does not rotate into the new vector of medians.
6.8.6. Students were given a math aptitude and reading comprehension test before starting an intensive study skills workshop. At the end of the program they were given the test again. The following data represent the changes in the math and reading tests for the five students in the program.
Math Reading
11 7
20 40
-10 -4
10 12
16 5
6.8.8. Using the projection method discussed in Chapter 2, derive the pro-
jection of the statistic given in (6.2.14).
6.8.9. Apply Lemma 6.2.1 and show that (6.2.19) provides the bounds on the
testing efficiency of the Wilcoxon test relative to Hotelling’s test in the case
of a bivariate normal distribution.
(a) Show that the efficiency of the spatial L_1 methods relative to the L_2 methods with a k-variate spherical model is given by

e_k(\text{spatial } L_1, L_2) = \left( \frac{k-1}{k} \right)^2 E(r^2) \left[ E(r^{-1}) \right]^2 .

(b) Next assume that the k-variate spherical model is normal. Show that

E\,r^{-1} = \frac{\Gamma[(k-1)/2]}{\sqrt{2}\,\Gamma(k/2)} ,

with \Gamma(1/2) = \sqrt{\pi}. (A quick numerical check is sketched below.)
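A quick numerical check of parts (a) and (b) combined (our own sketch; it uses the additional fact that E(r^2) = k at the k-variate normal):

from math import gamma, sqrt

def ek_spatial(k):
    # e_k(spatial L1, L2) at the k-variate normal: E(r^2) = k and
    # E(r^{-1}) = Gamma((k-1)/2) / (sqrt(2) Gamma(k/2)).
    Er_inv = gamma((k - 1) / 2.0) / (sqrt(2.0) * gamma(k / 2.0))
    return ((k - 1.0) / k) ** 2 * k * Er_inv ** 2

print(ek_spatial(2))   # 0.785... = pi/4 for the bivariate normal
print(ek_spatial(3))   # the efficiency increases with the dimension k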
6.8.13. Show that the spatial median is equivariant and that the spatial sign test is invariant under orthogonal transformations of the data.
6.8.15. Complete the proof of Theorem 6.4.3 by establishing the third formula
for S8 (0).
6.8.16. Show that the Oja median and Oja sign test are affine equivariant
and affine invariant, respectively. See Section 6.4.3.
6.8.17. Show that the maximum breakdown point for a translation equiv-
ariant estimator is (n+1)/(2n). An estimator is translation equivariant if
T (X + a1) = T (X) + a1, for every real a. Note that 1 is the vector of all
ones.
6.8.23. Consider Model (6.7.1) for a repeated measures design in which the responses are recorded on the same variable over time; i.e., y_{ijl} is the response for the ith subject in Group j at time period l. In this model the vector \mu_j is the profile vector for the jth group, and the plot of \mu_{jl} versus l is called the population profile plot for Group j. Let \hat{\mu}_j denote the estimate of \mu_j based on the R fit of Model (6.7.1). The plot of \hat{\mu}_{jl} versus l is called the sample profile plot of Group j. These group plots are overlaid and are called the sample profiles. A hypothesis of interest is whether or not the population profiles are parallel.
(c) Test the hypotheses (6.8.1) using the procedure (6.6.26) based on
Wilcoxon scores. Repeat using the LS test procedure (6.6.27).
(d) Repeat items (b) and (c) if the 13th rat at time period 2 took 80 seconds to run the maze instead of 34. Note that the p-value of the LS procedure changes from .77 to .15 while the p-value of the Wilcoxon procedure changes from .95 to .85.
6.8.24. Consider the data of Example 6.7.1.
(a) Using the Wilcoxon scores, fit Model (6.7.4) to the original data. Obtain the marginal residual plots, which show heteroscedasticity. Reason that the log transformation is appropriate. Show that the residual plots based on the transformed data remove much of the heteroscedasticity. For both the transformed and original data obtain the internal Wilcoxon Studentized residuals. Identify the outliers.
(b) In order to see the effect of the transformation, obtain the Wilcoxon and
LS analyses of Example 6.7.1 based on the original data. Discuss your
findings.
Appendix A
Asymptotic Results
Then

\sum_{i=1}^n a_{in} X_i \xrightarrow{D} N(0, \sigma^2\sigma_a^2) .

Proof: Take W_{in} of Theorem A.1.1 to be a_{in}X_i. Then the mean of W_{in} is 0 and its variance is \sigma_{in}^2 = a_{in}^2\sigma^2. By (A.1.5), \max \sigma_{in}^2 \to 0 and by (A.1.4), \sum \sigma_{in}^2 \to \sigma^2\sigma_a^2. Hence we need only show that condition (A.1.3) is true. For i = 1, \ldots, n, define

W_{in}^* = \max_{1 \le j \le n} |a_{jn}|\,|X_i| .

Then |W_{in}^*| \ge |W_{in}|; hence, I_\epsilon(|W_{in}|) \le I_\epsilon(|W_{in}^*|), for \epsilon > 0. Therefore,

\sum_{i=1}^n E\left[ W_{in}^2 I_\epsilon(|W_{in}|) \right] \le \sum_{i=1}^n E\left[ W_{in}^2 I_\epsilon(|W_{in}^*|) \right] = \left\{ \sum_{i=1}^n a_{in}^2 \right\} E\left[ X_1^2 I_\epsilon(|W_{1n}^*|) \right] .    (A.1.6)

Note that the sum in braces converges to \sigma_a^2. Because X_1^2 I_\epsilon(|W_{1n}^*|) converges to 0 pointwise and is bounded above by the integrable function X_1^2, it follows by Lebesgue's Dominated Convergence Theorem that the right side of (A.1.6) converges to 0. Thus condition (A.1.3) of Theorem A.1.1 is true and we are finished.
Note that the simple Central Limit Theorem follows from this corollary by
taking ain = n−1/2 , so that (A.1.4) and (A.1.5) hold.
We have immediately from (A.2.3) that the mean and variance of T are

E(T) = 0 \;\; \text{and} \;\; \mathrm{Var}(T) = \sum_{i=1}^n x_i^2 .    (A.2.8)
Because the means of S and T are the same, it follows that S has the same asymptotic distribution as T provided the second moment of their difference goes to 0. But this follows from the string of inequalities:

E\left[ \left( \frac{1}{\sqrt{n}}S - \frac{1}{\sqrt{n}}T \right)^2 \right] = \frac{1}{n} E\left[ \left( \sum_{i=1}^n x_i \left\{ \varphi\!\left( \frac{n}{n+1}F_n(Y_i) \right) - \varphi(F(Y_i)) \right\} \right)^2 \right]
  \le \frac{n}{n-1} \left\{ \frac{1}{n}\sum_{i=1}^n x_i^2 \right\} E\left[ \left\{ \varphi\!\left( \frac{n}{n+1}F_n(Y_1) \right) - \varphi(F(Y_1)) \right\}^2 \right] \to \sigma_x^2 \cdot 0 ,

where the inequality and the derivation of the limit are given on page 160 of Hájek and Šidák (1967). This results in the following theorem.

Theorem A.2.1. Under the above assumptions,

\frac{1}{\sqrt{n}}(T - S) \xrightarrow{P} 0 ,    (A.2.10)

and

\frac{1}{\sqrt{n}} S \xrightarrow{D} N(0, \sigma_x^2) .    (A.2.11)
Hence we have established the null asymptotic distribution theory of a
simple linear rank statistic.
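A small Monte Carlo illustration of this null distribution theory (our own sketch, with Wilcoxon scores and standard normal errors; under H_0 the sample variance of S should be close to \sum x_i^2, per (A.2.8) and Theorem A.2.1):

import numpy as np

rng = np.random.default_rng(0)
n, nsim = 200, 4000
x = np.linspace(-1.0, 1.0, n)
x -= x.mean()                                   # centered regression constants
S = np.empty(nsim)
for b in range(nsim):
    y = rng.standard_normal(n)                  # iid errors, so H0 holds
    ranks = np.argsort(np.argsort(y)) + 1       # ranks R(Y_i)
    scores = np.sqrt(12.0) * (ranks / (n + 1.0) - 0.5)   # Wilcoxon scores
    S[b] = np.sum(x * scores)
print(S.var() / np.sum(x ** 2))                 # should be close to 1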
The random variables in the first term, f'(Y_i)/f(Y_i), are iid with mean 0 and variance I(f). Because the sequence d_1, \ldots, d_n satisfies (A.2.14)-(A.2.16), we can use Corollary A.1.1 to show that, under p_y, l converges in distribution to N(-I(f)\sigma_d^2/2, I(f)\sigma_d^2). By the definition of contiguity (A.2.1) and the discussion immediately following LeCam's first lemma, we have the result

the densities q_d = \prod_{i=1}^n f(y_i + d_i) are contiguous to p_y = \prod_{i=1}^n f(y_i) ;    (A.2.21)

see, also, page 204 of Hájek and Šidák (1967).
We next establish the key result:

Theorem A.2.2. For T given by (A.2.7) and under p_y and the assumptions (3.4.1), (A.2.1), (A.2.2), (A.2.14)-(A.2.17),

\begin{pmatrix} \frac{1}{\sqrt{n}}T \\ l \end{pmatrix} \xrightarrow{D} N_2\left( \begin{pmatrix} 0 \\ -\frac{I(f)\sigma_d^2}{2} \end{pmatrix} , \; \begin{pmatrix} \sigma_x^2 & \sigma_{xd}\gamma_f \\ \sigma_{xd}\gamma_f & I(f)\sigma_d^2 \end{pmatrix} \right) .    (A.2.22)
√
Proof: Consider the random vector V = (T / n, l)′ , where T is defined in ex-
pression (A.2.7). To show that V is asymptotically normal under pn it suffices
to show that for t ∈ R2 , t 6= 0, t′V is asymptotically univariate normal. By
the above discussion, for the second component of V, we need only be con-
cerned with the first term in expression (A.2.19); hence, for t = (t1 , t2 )′ , define
the random variables Win by
X n X n
1 f ′ (Yi)
√ xi t1 ϕ(F (Yi)) + t2 di = Win . (A.2.23)
i=1
n f (Y i ) i=1
We want to apply Theorem A.1.1. The random variables Win are independent
and have mean 0. After some simplification, we can show that the variance of
Win is
2 1 xi
σin = x2i t21 + t22 d2i I(f ) − 2t1 t2 di √ γf , (A.2.24)
n n
where γf is given by
Z 1 ′ −1
f (F (u))
γf = ϕ(u) − du . (A.2.25)
0 f (F −1(u))
Note by assumptions (A.2.1), (A.2.2), and (A.2.15)-(A.2.17) that
n
X
2
σin → t21 σx2 + t22 σd2 I(f ) − 2t1 t2 γf σxd > 0 , (A.2.26)
i=1
and that
2 1 22 2 1
max σin ≤ max xi t1 + t2 I(f ) max d2i + 2|t1 t2 |γf max √ |xi | max |di | → 0 ;
1≤i≤n 1≤i≤n n 1≤i≤n 1≤i≤n n 1≤i≤n
(A.2.27)
hence conditions (A.1.2) and (A.1.1) are true. Thus to obtain the result we need to show

\lim_{n\to\infty} \sum_{i=1}^n E\left[ W_{in}^2 I_\epsilon(|W_{in}|) \right] = 0 ,    (A.2.28)

for \epsilon > 0. But |W_{in}| \le W_{in}^*, where

W_{in}^* = |t_1| \max_{1\le j\le n} \frac{1}{\sqrt{n}}|x_j|\,|\varphi(F(Y_i))| + |t_2| \max_{1\le j\le n} |d_j| \left| \frac{f'(Y_i)}{f(Y_i)} \right| .

Hence,

\sum_{i=1}^n E\left[ W_{in}^2 I_\epsilon(|W_{in}|) \right] \le \sum_{i=1}^n E\left[ W_{in}^2 I_\epsilon(W_{in}^*) \right]
 = \sum_{i=1}^n E\left[ \left( t_1^2 \frac{1}{n} x_i^2 \varphi^2(F(Y_1)) + t_2^2 d_i^2 \left( \frac{f'(Y_1)}{f(Y_1)} \right)^2 + 2 t_1 t_2 \frac{1}{\sqrt{n}} x_i d_i\, \varphi(F(Y_1)) \left( -\frac{f'(Y_1)}{f(Y_1)} \right) \right) I_\epsilon(W_{1n}^*) \right]
 = \left\{ \sum_{i=1}^n \frac{1}{n} t_1^2 x_i^2 \right\} E\left[ \varphi^2(F(Y_1)) I_\epsilon(W_{1n}^*) \right]
 + \left\{ \sum_{i=1}^n t_2^2 d_i^2 \right\} E\left[ \left( \frac{f'(Y_1)}{f(Y_1)} \right)^2 I_\epsilon(W_{1n}^*) \right]
 + 2 t_1 t_2 \left\{ \frac{1}{\sqrt{n}} \sum_{i=1}^n x_i d_i \right\} E\left[ \varphi(F(Y_1)) \left( -\frac{f'(Y_1)}{f(Y_1)} \right) I_\epsilon(W_{1n}^*) \right] .    (A.2.29)

Because I_\epsilon(W_{1n}^*) \to 0 pointwise and each of the other random variables in the expectations of (A.2.29) is absolutely integrable, the Lebesgue Dominated Convergence Theorem implies that each of these expectations converges to 0. The desired limit in expression (A.2.28) then follows from assumptions (A.2.1), (A.2.2), and (A.2.15)-(A.2.17). Hence V is asymptotically bivariate normal. We can obtain its asymptotic variance-covariance matrix from expression (A.2.26), which completes the proof.
Based on Theorem A.2.2, an application of LeCam's third lemma leads to the asymptotic distribution of T/\sqrt{n} under local alternatives, which we state in the following theorem.

Theorem A.2.3. Under the sequence of densities q_d = \prod_{i=1}^n f(y_i + d_i) and the assumptions (3.4.1), (A.2.1), (A.2.2), (A.2.14)-(A.2.17),

\frac{1}{\sqrt{n}} T \xrightarrow{D} N(\sigma_{xd}\gamma_f, \sigma_x^2) ,    (A.2.30)

\frac{1}{\sqrt{n}} S \xrightarrow{D} N(\sigma_{xd}\gamma_f, \sigma_x^2) .    (A.2.31)
We next establish the connection between T and T_d; see Theorem 1.3.1, also.

Theorem A.2.4. Under the likelihoods q_d and p_y, we have the following identity:

P_{q_d}\left[ \frac{1}{\sqrt{n}}T \le t \right] = P_{p_y}\left[ \frac{1}{\sqrt{n}}T_d \le t \right] .    (A.2.34)

Proof: The proof follows from the following string of equalities:

P_{q_d}\left[ \frac{1}{\sqrt{n}}T \le t \right] = P_{q_d}\left[ \frac{1}{\sqrt{n}}\sum_{i=1}^n x_i \varphi(F(Y_i)) \le t \right]
 = P_{q_d}\left[ \frac{1}{\sqrt{n}}\sum_{i=1}^n x_i \varphi(F((Y_i - d_i) + d_i)) \le t \right]
 = P_{p_y}\left[ \frac{1}{\sqrt{n}}\sum_{i=1}^n x_i \varphi(F(Z_i + d_i)) \le t \right]
 = P_{p_y}\left[ \frac{1}{\sqrt{n}}T_d \le t \right] ,    (A.2.35)

where the third equality follows because the sequence of random variables Z_1, \ldots, Z_n follows the likelihood p_y.
We next establish an asymptotic relationship between T and Td .
The first factor in the last expression converges to \sigma_x^2; hence, it suffices to show that the limit of the second factor is 0. Fix y. Let \epsilon > 0 be given. Then since \varphi(u) is continuous a.e. we can assume it is continuous at F(y). Hence there exists a \delta_1 > 0 such that |\varphi(z) - \varphi(F(y))| < \epsilon for |z - F(y)| < \delta_1. By the uniform continuity of F, choose \delta_2 > 0 such that |F(t) - F(s)| < \delta_1 for |s - t| < \delta_2. By (A.2.16) choose N_0 so that n > N_0 implies

\max_{1\le i\le n} \{|d_i|\} < \delta_2 .

Therefore,

\lim \int_{-\infty}^{\infty} \max_{1\le i\le n} \left[ \varphi(F(y)) - \varphi(F(y + d_i)) \right]^2 f(y)\,dy \le \epsilon^2 ,
We need to express these results for the random variables S, (A.2.4), and S_d, (A.2.32). Because the densities q_d are contiguous to p_y and (T - S)/\sqrt{n} \to 0 in probability under p_y, it follows that (T - S)/\sqrt{n} \to 0 in probability under q_d. By a change of variable this means (T_d - S_d)/\sqrt{n} \to 0 in probability under p_y. This discussion leads to the following two results which we state in a theorem.
Next we relate the result of Theorem A.2.7 to (2.5.26), the asymptotic linearity of the general scores statistic in the two-sample problem. Recall in the two-sample problem that c_i = 0 for 1 \le i \le n_1 and c_i = 1 for n_1 + 1 \le i \le n_1 + n_2 = n, (2.2.1). Hence, x_i = c_i - \bar{c} = -n_2/n for 1 \le i \le n_1 and x_i = n_1/n for n_1 + 1 \le i \le n. Defining d_i = -\delta x_i/\sqrt{n}, it is easy to check that conditions (A.2.14)-(A.2.17) hold with \sigma_{xd} = -\lambda_1\lambda_2\delta. Further,
\frac{1}{\sqrt{n}} S_\varphi(\delta/\sqrt{n}) = \frac{1}{\sqrt{n}} S_\varphi(0) - \lambda_1\lambda_2\gamma_f\delta + o_p(1) .

Finally, using the usual partition argument, Theorem 1.5.6, and the monotonicity of S_\varphi(\delta/\sqrt{n}), we have:

c_\varphi = \frac{\lambda_1\lambda_2\gamma_f}{\sigma(0)} = \sqrt{\lambda_1\lambda_2}\,\tau_\varphi^{-1} ;    (A.2.41)

see (2.5.27).
where the scores are generated as a^+(i) = \varphi^+(i/(n+1)) for a nonnegative and square integrable function \varphi^+(u) which is standardized such that \int (\varphi^+(u))^2\,du = 1.
The null asymptotic distribution of Tϕ+ was derived in Section 1.8 so here
we are concerned with its behavior under local alternatives. Also the deriva-
tions here are similar to those for simple linear rank statistics, Section A.2.2;
hence, our derivation is brief.
Hence,

\frac{1}{\sqrt{n}} T_{b\varphi^+}^* = \frac{1}{\sqrt{n}} T_{\varphi^+}^* + b\gamma_h + o_p(1) .    (A.2.55)

A similar result holds for the signed-rank statistic. For the results needed in Chapter 1, however, it is convenient to change the notation to:

T_{\varphi^+}(b) = \sum_{i=1}^n a^+(R|X_i - b|)\,\mathrm{sgn}(X_i - b) .    (A.2.56)
where c_i is the ith row of C, and note that \Delta = O(1) because n^{-1}X'X \to \Sigma > 0 and \sqrt{n}\beta = O(1). Then C'C = I_p and H_C = H_X, where H_C is the projection matrix onto the column space of C. Note that since X is centered, C is also. Also \|c_i\|^2 = h_{nii}^2 where h_{nii}^2 is the ith diagonal entry of H_X. It is straightforward to show that c_i'\Delta = x_i'\beta. Using the conditions (D.2) and (D.3), the following conditions are readily established:

\bar{d} = 0    (A.3.4)

\sum_{i=1}^n d_i^2 \le \sum_{i=1}^n \|c_i\|^2 \|\Delta\|^2 = p\|\Delta\|^2 , \;\; \text{for all } n    (A.3.5)

\max_{1\le i\le n} d_i^2 \le \|\Delta\|^2 \max_{1\le i\le n} \|c_i\|^2    (A.3.6)
where the scores are generated by a function ϕ which satisfies (S.1), (3.4.10).
We now show that the theory established in Section A.2 for simple linear rank
statistics holds for Snj , for each j.
Fix j; then the regression coefficients x_i of Section A.2 are given by x_i = \sqrt{n}\,c_{ij}. Note from (A.3.2) that \sum x_i^2/n = \sum c_{ij}^2 = 1; hence, condition (A.2.2) is true. Further, by (A.3.6),

\frac{\max_{1\le i\le n} x_i^2}{\sum_{i=1}^n x_i^2} = \max_{1\le i\le n} c_{ij}^2 \to 0 ;
where

T_{nj}(0) = \sum_{i=1}^n c_{ij}\,\varphi(F(Y_i)) .    (A.3.11)
Theorem A.3.1. Under the above assumptions, for ǫ > 0 and for all ∆
for arbitrary c > 0 and ǫ > 0. This result was first shown by Jurečková (1971)
under more stringent conditions on the design matrix.
Consider the dispersion function discussed in Chapter 2. In terms of the
above notation
D_n(\Delta) = \sum_{i=1}^n a(R(Y_i - c_i\Delta))(Y_i - c_i\Delta) .    (A.3.14)
for arbitrary c > 0 and ǫ > 0. Our main result of this section shows that
(A.3.12), (A.3.13), and (A.3.16) are equivalent. The proof proceeds as in Heiler
and Willers (1988) who established their results based on convex function
theory. Before proceeding with the proof, for the reader’s convenience, we
present some notes on convex functions.
and

\lim_{n\to\infty} f_n(x_0) = f(x_0) , \;\; \text{for at least one } x_0 \in C^* .    (A.3.22)
Let N denote the set of positive integers. Let \Delta^{(1)}, \Delta^{(2)}, \ldots be a listing of the vectors in p-space with rational components. By (A.3.12) the right side of (A.3.26) goes to 0 in probability for \Delta^{(1)}. Hence, for every infinite index set N^* \subset N there exists another infinite index set N_1^{**} \subset N^* such that

[S_n(\Delta^{(1)}) - S_n(0) + \gamma\Delta^{(1)}] \xrightarrow{a.s.} 0 ,    (A.3.27)

for n \in N_1^{**}. Since the right side of (A.3.26) goes to 0 in probability for \Delta^{(2)} and N_1^{**} is an infinite index set, there exists another infinite index set N_2^{**} \subset N_1^{**} such that

[S_n(\Delta^{(i)}) - S_n(0) + \gamma\Delta^{(i)}] \xrightarrow{a.s.} 0 ,    (A.3.28)

for n \in N_2^{**} and i \le 2. We continue and, hence, get a sequence of nested infinite index sets N_1^{**} \supset N_2^{**} \supset \cdots \supset N_i^{**} \supset \cdots such that

[S_n(\Delta^{(j)}) - S_n(0) + \gamma\Delta^{(j)}] \xrightarrow{a.s.} 0 ,    (A.3.29)

for n \in N_i^{**} \supset N_{i+1}^{**} \supset \cdots and j \le i. Let \widetilde{N} be a diagonal infinite index set of the sequence N_1^{**} \supset N_2^{**} \supset \cdots \supset N_i^{**} \supset \cdots. Then

[S_n(\Delta) - S_n(0) + \gamma\Delta] \xrightarrow{a.s.} 0 ,    (A.3.30)

for n \in \widetilde{N} and for all rational \Delta.

Define the convex function H_n(\Delta) = D_n(\Delta) - D_n(0) + \Delta'S_n(0). Then

D_n(\Delta) - Q_n(\Delta) = H_n(\Delta) - \gamma\Delta'\Delta/2    (A.3.31)
\nabla(D_n(\Delta) - Q_n(\Delta)) = \nabla H_n(\Delta) - \gamma\Delta .    (A.3.32)

Hence by (A.3.30) we have

\nabla H_n(\Delta) \xrightarrow{a.s.} \gamma\Delta = \nabla\,\gamma\Delta'\Delta/2 ,    (A.3.33)

for n \in \widetilde{N} and for all rational \Delta. Also note

H_n(0) = 0 = \gamma\Delta'\Delta/2\,\big|_{\Delta=0} .    (A.3.34)

Since H_n is convex and (A.3.33) and (A.3.34) hold, we have by Theorem A.3.6 that \{H_n(\Delta)\}_{n\in\widetilde{N}} converges to \gamma\Delta'\Delta/2 a.s., uniformly on each compact subset of R^p. That is, by (A.3.31), D_n(\Delta) - Q_n(\Delta) \to 0 a.s., uniformly on each compact subset of R^p. Since N^* is arbitrary, we can conclude (see Theorem 4, page 103 of Tucker, 1967) that D_n(\Delta) - Q_n(\Delta) \xrightarrow{P} 0 uniformly on each compact subset of R^p.
A.3.3 Asymptotic Distance between \hat{\beta} and \tilde{\beta}

This section contains a proof of Theorem 3.5.5. It shows that the R estimate in Chapter 3 is close to the value which minimizes the quadratic approximation to the dispersion function. The proof is due to Jaeckel (1972). For convenience, we restate the theorem.

Theorem A.3.9. Under the Model (3.2.3), (E.1), (D.1), (D.2), and (S.1) in Section 3.4,

\sqrt{n}(\hat{\beta} - \tilde{\beta}) \xrightarrow{P} 0 .
Proof: Choose \epsilon > 0 and \delta > 0. Since \sqrt{n}\tilde{\beta} converges in distribution, there exists a c_0 such that

P\left[ \|\tilde{\beta}\| \ge c_0/\sqrt{n} \right] < \delta/2 ,    (A.3.42)

for n sufficiently large. Let

T = \min\left\{ Q(Y - X\beta) : \|\beta - \tilde{\beta}\| = \epsilon/\sqrt{n} \right\} - Q(Y - X\tilde{\beta}) ,    (A.3.43)

for sufficiently large n. By (A.3.42) and (A.3.44) we can assert with probability greater than 1 - \delta that for sufficiently large n, |Q(Y - X\tilde{\beta}) - D(Y - X\tilde{\beta})| < T/2 and \|\tilde{\beta}\| < c_0/\sqrt{n}. This implies with probability greater than 1 - \delta that for sufficiently large n,

D(Y - X\tilde{\beta}) < Q(Y - X\tilde{\beta}) + T/2 \;\; \text{and} \;\; \|\tilde{\beta}\| < c_0/\sqrt{n} .    (A.3.45)
Let 0 < \lambda_{n,1} \le \cdots \le \lambda_{n,p} be the eigenvalues of n^{-1}X'X and let \gamma_{n,1}, \ldots, \gamma_{n,p} be a corresponding set of orthonormal eigenvectors. The spectral decomposition of n^{-1}X'X is n^{-1}X'X = \sum_{i=1}^p \lambda_{n,i}\gamma_{n,i}\gamma_{n,i}'. From this we can show, for any vector \delta, that \delta' n^{-1}X'X \delta \ge \lambda_{n,1}\|\delta\|^2 and, further, that the minimum is achieved over all vectors of unit length when \delta = \gamma_{n,1}. It then follows that

\min_{\|\delta\|=a} \delta' n^{-1}X'X \delta = \lambda_{n,1} a^2 ,

where we define the event C_{1,n} = \{\sqrt{n}\|\tilde{\beta} - \beta_0\| < c\}. Since t > 0, by asymptotic quadraticity, Theorem A.3.8, there exists an n_2 such that for n > n_2,

P_{\beta_0}(C_{2,n}) \ge 1 - (\epsilon/2) ,    (A.3.48)
where C_{2,n} = \{ \max_{\sqrt{n}\|\beta - \beta_0\| \le c+a} |Q(\beta) - D(\beta)| < t/3 \}. For the remainder of the proof assume that n \ge \max\{n_0, n_1, n_2\} = n^*. Next suppose \beta is such that \sqrt{n}\|\beta - \tilde{\beta}\| = a. Then on C_{1,n} it follows that \sqrt{n}\|\beta - \beta_0\| \le c + a. Hence on both C_{1,n} and C_{2,n} we have

D(\beta) > Q(\beta) - (t/3)
       \ge Q(\tilde{\beta}) + t - (t/3)
        = Q(\tilde{\beta}) + 2(t/3)
        > D(\tilde{\beta}) + (t/3) .

Therefore, for all \beta such that \sqrt{n}\|\beta - \tilde{\beta}\| = a, D(\beta) - D(\tilde{\beta}) > (t/3) > c_0. But D is convex; hence on C_{1,n} \cap C_{2,n}, for all \beta such that \sqrt{n}\|\beta - \tilde{\beta}\| \ge a, D(\beta) - D(\tilde{\beta}) > (t/3) > c_0.

Finally choose n_3 such that for n \ge n_3, \delta > (c+a)/\sqrt{n}, where \delta is the positive distance between \beta_0 and R_r. Now assume that n \ge \max\{n^*, n_3\} and C_{1,n} \cap C_{2,n} is true. Recall that the reduced model R estimate is \hat{\beta}_r = (\hat{\beta}_{r,1}', 0')' where \hat{\beta}_{r,1} lies in R_r; hence,

\sqrt{n}\|\hat{\beta}_r - \tilde{\beta}\| \ge \sqrt{n}\|\hat{\beta}_r - \beta_0\| - \sqrt{n}\|\tilde{\beta} - \beta_0\| \ge \sqrt{n}\,\delta - c > a .

Thus on C_{1,n} \cap C_{2,n}, D(\hat{\beta}_r) - D(\tilde{\beta}) > c_0. Thus for n sufficiently large we have

P\left[ D(\hat{\beta}_r) - D(\tilde{\beta}) > (2\tau)^{-1}\chi^2_{\alpha,q} \right] \ge 1 - \epsilon .
Proof: Let a be arbitrary but fixed and let c > |a|. After matching notation, Theorem A.4.3 leads to the result

\max_{\|(X'X)^{1/2}\beta\| \le c} \left| \frac{1}{\sqrt{n}} S_1(Y - an^{-1/2} - X\beta) - \frac{1}{\sqrt{n}} S_1(Y) + 2f(0)a \right| = o_p(1) .    (A.3.49)

Obviously the above result holds for \beta = 0. Hence for any \epsilon > 0,

P\left[ \max_{\|(X'X)^{1/2}\beta\| \le c} \left| \frac{1}{\sqrt{n}} S_1(Y - an^{-1/2} - X\beta) - \frac{1}{\sqrt{n}} S_1(Y - an^{-1/2}) \right| \ge \epsilon \right]
 \le P\left[ \max_{\|(X'X)^{1/2}\beta\| \le c} \left| \frac{1}{\sqrt{n}} S_1(Y - an^{-1/2} - X\beta) - \frac{1}{\sqrt{n}} S_1(Y) + 2f(0)a \right| \ge \frac{\epsilon}{2} \right]
 + P\left[ \left| \frac{1}{\sqrt{n}} S_1(Y - an^{-1/2}) - \frac{1}{\sqrt{n}} S_1(Y) + 2f(0)a \right| \ge \frac{\epsilon}{2} \right] .

By (A.3.49), for n sufficiently large, the two terms on the right side are arbitrarily small. The desired result follows from this since (X'X)^{1/2}\hat{\beta} is bounded in probability.
as n → ∞.
Because the c_i's are centered it follows that E_{p_d}(U_1(0,0)) = 0. Thus by the last lemma, we need only show that \mathrm{Var}(U_1(\alpha, \Delta) - U_1(0,0)) \to 0. By considering the variance of the sign of a random variable, simplification leads to the bound:

\mathrm{Var}(U_1(\alpha, \Delta) - U_1(0,0)) \le 4 \sum_{i=1}^n c_i^2 \left| F_0(\alpha/\sqrt{n} + \Delta c_i) - F_0(0) \right| .

By our assumptions, \max_i |\Delta c_i + \alpha/\sqrt{n}| \to 0 as n \to \infty. From this and the continuity of F_0 at 0, it follows that \mathrm{Var}(U_1(\alpha, \Delta) - U_1(0,0)) \to 0.
We need analogous results for the process U0 (α, ∆).
as n → ∞.
where we have used the fact that the c_i's are centered. Note that |\xi_{in}| is between 0 and |\alpha/\sqrt{n} + c_i\Delta| and that \max |\alpha/\sqrt{n} + c_i\Delta| \to 0 as n \to \infty. By the continuity of f_0 at 0, the desired result follows.
as n → ∞.
Because the median of Y_i is 0, E_0[U_0(0,0)] = 0. Hence by the last lemma it suffices to show that \mathrm{Var}(U_0(\alpha, \Delta) - U_0(0,0)) \to 0. But,

\mathrm{Var}(U_0(\alpha, \Delta) - U_0(0,0)) \le \frac{4}{n} \sum_{i=1}^n \left| F_0\!\left( \frac{\alpha}{\sqrt{n}} + c_i\Delta \right) - F_0(0) \right| .
function W, such that the distribution functions \{(1-s)H + sW\} lie in the domain of T, the following limit exists:

\lim_{s\to 0} \frac{T[(1-s)H + sW] - T[H]}{s} = \int \psi_H \, dW ,    (A.5.1)

for some function \psi_H.
Our functional T(H) is defined implicitly by the equation (1.8.5). Using the symmetry of h(x), (A.5.5), and (A.5.6), we can write the defining equation for \theta = T(H) as

0 = \int_{-\infty}^{\infty} \varphi^+(H(x) - H(2\theta - x))\, h(x)\,dx
0 = \int_{-\infty}^{\infty} \varphi(1 - H(2\theta - x))\, h(x)\,dx .    (A.5.7)

For the derivation, we proceed as discussed above; see the discussion around expression (A.5.3). Consider the contaminated distribution of H(x) given by

H_{t,\epsilon}(x) = (1 - \epsilon)H(x) + \epsilon\Delta_t(x) ,

where 0 < \epsilon < 1 is the proportion of contamination and \Delta_t(x) is the distribution function for a point mass at t. By (A.5.3) the influence function is the derivative of the functional at \epsilon = 0. To obtain this derivative we implicitly differentiate the defining equation (A.5.7) at H_{t,\epsilon}(x); i.e., at

0 = (1 - \epsilon) \int_{-\infty}^{\infty} \varphi(1 - (1-\epsilon)H(2\theta - x) - \epsilon\Delta_t(2\theta - x))\, h(x)\,dx
  + \epsilon \int_{-\infty}^{\infty} \varphi(1 - (1-\epsilon)H(2\theta - x) - \epsilon\Delta_t(2\theta - x))\, d\Delta_t(x) .
Next I_4 reduces to

-\int_{-\infty}^{-t} \varphi'(H(x))\, h(x)\,dx = -\int_0^{H(-t)} \varphi'(u)\,du = \varphi(H(t)) + \varphi(0) .
Combining these results and solving for \dot{\theta} leads to the influence function, which we can write in either of the following two ways:

\Omega(t, \hat{\theta}_{\varphi^+}) = \frac{\varphi(H(t))}{\int_{-\infty}^{\infty} \varphi'(H(x))\, h^2(x)\,dx}
  = \frac{\varphi^+(2H(t) - 1)}{4\int_0^{\infty} \varphi^{+\prime}(2H(x) - 1)\, h^2(x)\,dx} .    (A.5.9)

Taking the partial derivative of the right side with respect to s, setting s = 0, and substituting \Delta_{y_0} for W leads to the influence function

\Omega(y_0; D(0)) = -\int \varphi(G(y))\,y\,dG(y) - \int \varphi'(G(y))G(y)\,y\,dG(y)
  + \int_{y_0}^{\infty} \varphi'(G(y))\,y\,dG(y) + \varphi(G(y_0))\,y_0 .    (A.5.11)
Note that this is not bounded in the Y-space and, hence, the statistic D(0) is not robust. Thus, as noted in Section 3.11, the coefficient of multiple determination R1, (3.11.16), is not robust. A similar development establishes the influence function for the denominator of the LS coefficient of multiple determination R2, showing that it too is not bounded. Hence R2 is not a robust statistic.
Another example is the influence function of the least squares estimate of \beta, which is given by

\Omega(x_0, y_0; \hat{\beta}_{LS}) = \sigma^{-1} y_0 \Sigma^{-1} x_0 .    (A.5.12)

The influence function of the least squares estimate is, thus, unbounded in both the Y- and x-spaces.
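A short numerical look at this unboundedness (our own illustration; the values of \sigma, \Sigma, and x_0 are arbitrary):

import numpy as np

sigma, Sigma = 1.0, np.eye(2)
x0 = np.array([1.0, 2.0])
for y0 in (1.0, 10.0, 100.0):
    omega = (1.0 / sigma) * y0 * np.linalg.solve(Sigma, x0)
    print(y0, np.linalg.norm(omega))   # the norm grows linearly in |y0|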
Influence Function of \hat{\beta}_\varphi
Thus,

\frac{\partial G_s^*(y - x'T(H_s))}{\partial s} = -\int F[y - T(H_s)'(x - v)]\,dM(v)
  + (1-s)\int F'[y - T(H_s)'(x - v)](v - x)'\frac{\partial T}{\partial s}\,dM(v)
  + \int\!\!\int_{u \le y - T(H_s)'(x - v)} dW(v, u) + sB_2 ,
Solving this last expression for \dot{T}, the influence function of \hat{\beta}_\varphi is given by

Hence the influence function of \hat{\beta}_\varphi is bounded in the Y-space but not in the x-space. The estimate is thus bias robust. Note that the asymptotic representation of \hat{\beta}_\varphi in Corollary 3.5.23 can be written in terms of this influence function as

\sqrt{n}\,\hat{\beta}_\varphi = n^{-1/2} \sum_{i=1}^n \Omega(x_i, Y_i; \hat{\beta}_\varphi) + o_p(1) .
Influence Function of F_\varphi

Rewrite the correlation model as

Y = \alpha + x_1'\beta_1 + x_2'\beta_2 + e

H_0: \beta_2 = 0 \;\; \text{versus} \;\; H_A: \beta_2 \ne 0 ,    (A.5.18)

F_\varphi = \frac{RD/q}{\hat{\tau}/2} ,

where D_1(H) and D_2(H) are the reduced and full model functionals given by

D_1(H) = \int \varphi[G^*(y - x_1'T_1(H))](y - x_1'T_1(H))\,dH(x, y)
D_2(H) = \int \varphi[G^*(y - x'T(H))](y - x'T(H))\,dH(x, y) ,    (A.5.19)
and T_1(H) and T_2(H) denote the reduced and full model functionals for \beta_1 and \beta, respectively. Let \beta_r = (\beta_1', 0')' denote the true vector of parameters under H_0. Then the random variables Y - x'\beta_r and x are independent. Next write \Sigma as

\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} .

It is convenient to define the matrices \Sigma_r and \Sigma_r^+ as

\Sigma_r = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & 0 \end{pmatrix} \;\; \text{and} \;\; \Sigma_r^+ = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} .

(a) RD(0) = 0

(b) \frac{\partial RD(H_s)}{\partial s}\Big|_{s=0} = 0

(c) \frac{\partial^2 RD(H_s)}{\partial s^2}\Big|_{s=0} = \tau\,\varphi^2[F(y - x'\beta_r)]\, x'\left( \Sigma^{-1} - \Sigma_r^+ \right) x .
Proof: Part (a) is immediate. For Part (b), it follows from (A.5.19) that

\frac{\partial D_2(H_s)}{\partial s} = -\int \varphi[G_s^*(y - x'T(H_s))](y - x'T(H_s))\,dH
 + (1-s)\int \varphi'[G_s^*(y - x'T(H_s))](y - x'T(H_s)) \frac{\partial G_s^*}{\partial s}\,dH
 + (1-s)\int \varphi[G_s^*(y - x'T(H_s))]\left( -x'\frac{\partial T}{\partial s} \right) dH
 + \int \varphi[G_s^*(y - x'T(H_s))](y - x'T(H_s))\,dW(y) + sB ,
\cdot \int \varphi(F(y - x'\beta_r))\, x \, d\Delta_{x_0,y_0}(x, y) + R .    (A.5.21)
Upon substituting the empirical distribution function for \Delta_{x_0,y_0} in expression (A.5.21), we have at the sample

nQ^2(H_s) = \left[ \frac{1}{\sqrt{n}} \sum_{i=1}^n x_i \varphi\!\left( \frac{1}{n} R(Y_i - x_i'\beta_r) \right) \right]' \left( \Sigma^{-1} - \Sigma_r^+ \right) \left[ \frac{1}{\sqrt{n}} \sum_{i=1}^n x_i \varphi\!\left( \frac{1}{n} R(Y_i - x_i'\beta_r) \right) \right] + o_p(1) .

This expression is equivalent to the expression (3.6.10), which yields the asymptotic distribution of the test statistic in Section 3.6.
Theorem A.5.1. The influence function for the estimate \hat{\beta}_{HBR} is given by

\Omega(x_0, y_0, \hat{\beta}_{HBR}) = \frac{1}{2} C_H^{-1} \int\!\!\int (x_0 - x_1)\, b(x_1, x_0, y_1, y_0)\,\mathrm{sgn}\{y_0 - y_1\}\,dF(y_1)\,dM(x_1) ,    (A.5.22)

where C_H is given by expression (3.12.24).

Proof: Let \Delta_0(x, y) denote the distribution function of the point mass at the point (x_0, y_0) and consider the contaminated distribution H_t = (1-t)H + t\Delta_0 for 0 < t < 1. Let \beta(H_t) denote the functional at H_t. Then \beta(H_t) satisfies

0 = \int\!\!\int (x_2 - x_1)\, b(x_1, x_2, y_1, y_2) \left[ I(y_2 - y_1 < (x_2 - x_1)'\beta(H_t)) - \frac{1}{2} \right] dH_t(x_1, y_1)\,dH_t(x_2, y_2) .    (A.5.23)
Once again using the symmetry in the x arguments and y arguments of the function b, we can simplify this expression to

0 = -\left[ \frac{1}{2} \int\!\!\int\!\!\int (x_2 - x_1)\, b(x_1, x_2, y_1, y_1)(x_2 - x_1)'\, f^2(y_1)\,dy_1\,dM(x_1)\,dM(x_2) \right] \dot{\beta}
  + \int\!\!\int (x_0 - x_1)\, b(x_1, x_0, y_1, y_0) \left[ I(y_1 < y_0) - \frac{1}{2} \right] dF(y_1)\,dM(x_1) .

Using the relationship between the indicator function and the sign function and the definition of C_H, (3.12.24), we can rewrite this last expression as

0 = -C_H \dot{\beta} + \frac{1}{2} \int\!\!\int (x_0 - x_1)\, b(x_1, x_0, y_1, y_0)\,\mathrm{sgn}\{y_0 - y_1\}\,dF(y_1)\,dM(x_1) .
is continuous in t.
Proof: This result follows from (3.4.1), (3.12.13), and an application of Leibniz's rule on differentiation of definite integrals.
Let ∆ be arbitrary but fixed. Denote Wij (∆) by Wij , suppressing depen-
dence on ∆.
Lemma A.6.2. Under assumptions (E.1), (3.4.1), and (H.4), (3.12.13), there
exist constants |ξij | < |tij | such that E(bij Wij ) = −tij Bij′ (ξij ).
Proof: Since W_{ij} = 1, -1, or 0 according as t_{ij} < y_j - y_i < 0, 0 < y_j - y_i < t_{ij}, or otherwise, we have

E_{\beta_0}(b_{ij} W_{ij}) = \int\!\!\int_{t_{ij} < y_j - y_i < 0} b_{ij} f_Y(y)\,dy - \int\!\!\int_{0 < y_j - y_i < t_{ij}} b_{ij} f_Y(y)\,dy .

When t_{ij} > 0, E(b_{ij}W_{ij}) = -B_{ij}(t_{ij}) = B_{ij}(0) - B_{ij}(t_{ij}) = -t_{ij}B_{ij}'(\xi_{ij}) by Lemma A.6.1 and the Mean Value Theorem. The same result holds for t_{ij} < 0, which proves the lemma.
Lemma A.6.3. Under assumptions (H.3), (3.12.12), and (H.4), (3.12.13), we have

b_{ij} = g_{ij}(\hat{\beta}_0) = g_{ij}(0) + [\nabla g_{ij}(\xi)]'\hat{\beta}_0 = g_{ij}(0) + O_p(1/\sqrt{n}) ,

uniformly over all i and j, where \|\xi\| \le \|\hat{\beta}_0\|.

Proof: Follows from a multivariate Mean Value Theorem (see, e.g., page 355 of Apostol, 1974), and by (3.12.12) and (3.12.13).
Lemma A.6.4. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and
(3.4.8),
(i) E(gij (0)gik (0)Wij Wik ) −→ 0, as n → ∞
(ii) E(gij (0)Wij ) −→ 0, as n → ∞
uniformly over i and j.
Proof: Without loss of generality, let t_{ij} > 0 and t_{ik} > 0, where the indices i, j, and k are all different. Then

E(g_{ij}(0)g_{ik}(0)W_{ij}W_{ik}) = E[g_{ij}g_{ik}\, I(0 < Y_j - Y_i < t_{ij})\, I(0 < Y_k - Y_i < t_{ik})]
 = \int_{-\infty}^{\infty} \int_{y_i}^{y_i + t_{ik}} \int_{y_i}^{y_i + t_{ij}} g_{ij}\, g_{ik}\, f_i f_j f_k \, dy_j\,dy_k\,dy_i .

Assumptions (3.4.7) and (3.4.8) imply (1/n)\max_i (x_{ik} - \bar{x}_k)^2 \to 0 for all k, or equivalently (1/\sqrt{n})\max_i |x_{ik} - \bar{x}_k| \to 0 for all k, which implies that t_{ij} \to 0. Since the integrand is bounded, this proves (i).
Similarly, E(g_{ij}(0)W_{ij}) = \int_{-\infty}^{\infty}\int_{y_i}^{y_i + t_{ij}} g_{ij} f_i f_j \, dy_j\,dy_i \to 0, which proves (ii).
Proof: To prove (i), recall that b_{12} = g_{12}(0) + [\nabla g_{12}(\xi)]'\hat{\beta}_0. Thus

Let I_1, I_2, and I_3 denote the three terms on the right side. The term I_1 is 0, by independence. Now,

I_2 = 2E\left\{ [\nabla g_{12}(\xi)]'\hat{\beta}_0\, W_{12}\, g_{34}(0)W_{34} \right\} - 2E\left\{ [\nabla g_{12}(\xi)]'\hat{\beta}_0\, W_{12} \right\} E\left\{ g_{34}(0)W_{34} \right\} = I_{21} - I_{22} .

The term [\nabla g_{12}(\xi)]'\hat{\beta}_0 = b_{12} - g_{12}(0) is bounded and of magnitude o_p(1). If we can show that \sqrt{n}W_{12} is integrable, then it follows using standard arguments that I_{21} = o(1/n). Let F^* denote the cdf of y_2 - y_1 and f^* denote its pdf. Using the mean value theorem,

E[\sqrt{n}W_{12}(\Delta)] = \sqrt{n}(1/2)E[\mathrm{sgn}(Y_2 - Y_1 - (x_2 - x_1)'\Delta/\sqrt{n}) - \mathrm{sgn}(Y_2 - Y_1)]
 = \sqrt{n}(1/2)\left[ 2F^*(-(x_2 - x_1)'\Delta/\sqrt{n}) - 2F^*(0) \right]
 = -\sqrt{n}\, f^*(\xi^*)(x_2 - x_1)'\Delta/\sqrt{n} \le f^*(\xi^*)\left| (x_2 - x_1)'\Delta \right| ,

for |\xi^*| < |(x_2 - x_1)'\Delta/\sqrt{n}|. The right side of the inequality in the last expression is bounded. This proves that I_{21} = o(1/n). Similarly,

I_{22} = 2(1/n)E\left\{ [\nabla g_{12}(\xi)]'\hat{\beta}_0\, (\sqrt{n}W_{12}) \right\} E\left\{ g_{34}(0)(\sqrt{n}W_{34}) \right\} = o(1/n) ,

hence I_2 = o(n^{-1}). The term I_3 can be shown to be o(n^{-1}) similarly, which proves (i). The proof of (ii) is analogous to (i). To prove (iii), note that

The first term is o(1) by Lemma A.6.4. The second and third terms are clearly o(1). This proves (iii). Result (iv) is analogously proved.
We are now ready to state and prove asymptotic linearity. Consider the negative gradient function

S(\beta) = -\nabla D(\beta) = \sum\!\!\sum_{i<j} b_{ij}\,\mathrm{sgn}(z_j - z_i)(x_j - x_i) .    (A.6.7)

Proof: Write R(\Delta) = n^{-3/2}\left[ S(n^{-1/2}\Delta) - S(0) + 2n^{-1/2}C_n\Delta \right]. We show that

\sup_{\|\Delta\| \le C} \|R(\Delta)\| = 2 \sup_{\|\Delta\| \le C} \left\| n^{-3/2} \left[ \sum\!\!\sum_{i<j} b_{ij}(x_j - x_i)W_{ij}(\Delta) + n^{-1/2}C_n\Delta \right] \right\| \xrightarrow{P} 0 .
We show that E(R_k(\Delta)) \to 0 and \mathrm{Var}(R_k(\Delta)) \to 0. By Lemma A.6.2 and the definition of \gamma_{ij},

E(R_k) = 2n^{-3/2} \sum\!\!\sum_{i<j} (x_{jk} - x_{ik})\left[ E(b_{ij}W_{ij}) + \gamma_{ij} t_{ij} E(b_{ij}) \right]
 = 2n^{-3/2} \sum\!\!\sum_{i<j} (x_{jk} - x_{ik})\, t_{ij} \left[ B_{ij}'(0) - B_{ij}'(\xi_{ij}) \right]
 \le 2\left[ (1/n^2)\sum\!\!\sum_{i<j} (x_{jk} - x_{ik})^2 \right]^{1/2} \left[ (1/n)\sum\!\!\sum_{i<j} t_{ij}^2 \right]^{1/2} \sup_{i,j}\left| B_{ij}'(0) - B_{ij}'(\xi_{ij}) \right| \to 0 ,

since (1/n)\sum\!\!\sum_{i<j} t_{ij}^2 = (1/n)\Delta'X'X\Delta = O(1) and \sup_{i,j}|B_{ij}'(0) - B_{ij}'(\xi_{ij})| \to 0 by Lemma A.6.1.
+ \gamma_{lm} t_{lm} b_{lm}) .

The double sum term above goes to 0, since there are n^2 bounded terms in the double sum, multiplied by n^{-3}. There are two types of covariance terms in the quadruple sum: covariance terms with all four indices different, e.g. ((i,j),(l,m)) = ((1,2),(3,4)), and covariance terms with one index of the first pair equal to one index of the second pair, e.g. ((i,j),(l,m)) = ((1,2),(1,3)). Since there are of magnitude n^4 terms with all four indices different, we need to show that each such covariance term is o(n^{-1}). This immediately follows from Lemma A.6.5. Finally, there are of magnitude n^3 covariance terms with one shared index, and we need to show each such term is o(1). Again, this immediately follows from Lemma A.6.5. Hence, we have established the desired result.
Next define the approximating quadratic process,

Q(\beta) = D(0) - \sum\!\!\sum_{i<j} b_{ij}\,\mathrm{sgn}(Y_j - Y_i)(x_j - x_i)'\beta + \beta'C_n\beta .    (A.6.8)

Let

D^*(\Delta) = n^{-1}D(n^{-1/2}\Delta)    (A.6.9)

and

Q^*(\Delta) = n^{-1}Q(n^{-1/2}\Delta) .    (A.6.10)

Note that minimizing D^*(\Delta) and Q^*(\Delta) is equivalent to minimizing D(n^{-1/2}\Delta) and Q(n^{-1/2}\Delta), respectively.

The next result is asymptotic quadraticity.

Theorem A.6.3. Under assumptions (3.12.10)-(3.12.13), (3.4.1), (3.4.7), and (3.4.8), for a fixed constant C and for any \epsilon > 0,

P\left( \sup_{\|\Delta\| < C} |Q^*(\Delta) - D^*(\Delta)| \ge \epsilon \right) \to 0 .    (A.6.11)
Proof: Since

\frac{\partial Q^*}{\partial \Delta} - \frac{\partial D^*}{\partial \Delta} = 2n^{-3/2}\left[ \sum\!\!\sum_{i<j} b_{ij}(x_j - x_i)W_{ij} + C_n(n^{-1/2}\Delta) \right] = R(\Delta) ,

it follows from Theorem A.6.2 that for \epsilon > 0 and C > 0,

P\left( \sup_{\|\Delta\| < C} \left\| \frac{\partial Q^*}{\partial \Delta} - \frac{\partial D^*}{\partial \Delta} \right\| \ge \epsilon/C \right) \to 0 .
Proof: Let S_P denote the projection of S^*(0) = n^{-3/2}S(0) onto the space of linear combinations of independent random variables. Then

S_P = \sum_{k=1}^n E[S^*(0)\,|\,y_k] = \sum_{k=1}^n E\left[ n^{-3/2} \sum\!\!\sum_{i<j} (x_j - x_i)\, b_{ij}\,\mathrm{sgn}(Y_j - Y_i)\,\Big|\,y_k \right]
 = n^{-3/2} \sum_{k=1}^n \left[ \sum_{i=1}^{k-1} (x_k - x_i)\, E[b_{ik}\,\mathrm{sgn}(y_k - Y_i)\,|\,y_k] + \sum_{j=k+1}^n (x_j - x_k)\, E[b_{kj}\,\mathrm{sgn}(Y_j - y_k)\,|\,y_k] \right]
 = n^{-3/2} \sum_{k=1}^n \sum_{j=1}^n (x_j - x_k)\, E[b_{kj}\,\mathrm{sgn}(Y_j - y_k)\,|\,y_k]
 = (1/\sqrt{n}) \sum_{k=1}^n U_k ,    (A.6.13)
where the quadruple sum is taken over (j, k) \ne (l, m). The double sum term goes to 0 since there are n^2 bounded terms divided by n^3. There are two types of covariance terms in the quadruple sum: terms with four different indices and terms with three different indices (i.e., one shared index). Covariance terms with four different indices are zero (this can be shown by writing out the covariance in terms of expectations, and using symmetry to show that each covariance term is zero). Thus we only need to consider covariance terms with three different indices and show that the sum goes to 0. Letting k be the shared index (without loss of generality), and noting that E\,T(Y_j, Y_k) = 0 for all j, k, we have

n^{-3} \sum_k \sum_{j \ne k} \sum_{l \ne k, j} \mathrm{Cov}[T(Y_j, Y_k), T(Y_l, Y_k)]
 = n^{-3} \sum_k \sum_{j \ne k} \sum_{l \ne k, j} E\left\{ T(Y_j, Y_k) \cdot T(Y_l, Y_k) \right\}
 = n^{-3} \sum_k \sum_{j \ne k} \sum_{l \ne k, j} E\left\{ \left[ E(b_{kj}\,\mathrm{sgn}(Y_j - y_k)\,|\,y_k) - b_{kj}\,\mathrm{sgn}(Y_j - Y_k) \right] \cdot \left[ E(b_{kl}\,\mathrm{sgn}(Y_l - y_k)\,|\,y_k) - b_{kl}\,\mathrm{sgn}(Y_l - Y_k) \right] \right\}
 = n^{-3} \sum_k \sum_{j \ne k} \sum_{l \ne k, j} E\left\{ \left[ E(g_{kj}(0)\,\mathrm{sgn}(Y_j - y_k)\,|\,y_k) - g_{kj}(0)\,\mathrm{sgn}(Y_j - Y_k) \right] \cdot \left[ E(g_{kl}(0)\,\mathrm{sgn}(Y_l - y_k)\,|\,y_k) - g_{kl}(0)\,\mathrm{sgn}(Y_l - Y_k) \right] \right\} + o_p(1) ,

where the last equality follows from the relation b_{kj} = g_{kj}(0) + O_p(1/\sqrt{n}).
Expanding the product, each term in the triple sum may be written as

E\left[ E(g_{kj}(0)\,\mathrm{sgn}(Y_j - y_k)\,|\,y_k) \right]^2
 + E\left\{ g_{kj}(0)\,\mathrm{sgn}(Y_j - Y_k)\, g_{kl}(0)\,\mathrm{sgn}(Y_l - Y_k) \right\}
 - 2E\left\{ g_{kj}(0)\,\mathrm{sgn}(Y_j - Y_k)\left[ E(g_{kl}(0)\,\mathrm{sgn}(Y_l - y_k)\,|\,y_k) \right] \right\}
 = (1 + 1 - 2)\, E\left[ E(g_{kj}(0)\,\mathrm{sgn}(Y_j - y_k)\,|\,y_k) \right]^2 = 0 ,

where the first equality follows by taking conditional expectations with respect to y_k inside appropriate terms. A similar method applies to terms where k is not the shared index. The theorem is proved.
Proof of Theorem A.6.1: Let \tilde{\beta} denote the value which minimizes Q(\beta). Then \tilde{\beta} is the solution to

0 = S(0) - 2C_n\beta ,

so that \sqrt{n}\tilde{\beta} = (1/2)(n^{-2}C_n)^{-1}[n^{-3/2}S(0)] \sim AN(0, (1/4)\,C^{-1}\Sigma C^{-1}), by Theorem A.6.4 and assumption (D.2), (3.4.7). It remains to show that \sqrt{n}(\tilde{\beta} - \hat{\beta}) = o_p(1). This follows from Theorem A.6.3 and convexity of D(\beta), using standard arguments as in Jaeckel (1972).

\sqrt{n}\,\hat{\beta} = \frac{1}{2}(n^{-2}X'A_nX)^{-1} \frac{1}{\sqrt{n}} \sum_{k=1}^n U_k + o_p(1) ,    (A.7.1)
The estimate of \alpha is given by the median of the HBR residuals. This estimator can be expressed asymptotically as

\hat{\alpha} = \tau_S\, n^{-1} \sum_{i=1}^n \mathrm{sgn}(e_i) + o_p(n^{-1/2}) ;    (A.7.3)
see McKean et al. (1990). Using (A.7.1) and (A.7.3) we have the following first order expression for the residuals \hat{e}_i^*, (3.12.36),

\hat{e}^* \doteq e - \tau_S \frac{1}{n}\sum_{i=1}^n \mathrm{sgn}(e_i)\,\mathbf{1} - \frac{1}{2\sqrt{n}}\, X\, (n^{-2}X'A^*X)^{-1} \frac{1}{\sqrt{n}} \sum_{k=1}^n U_k^* .    (A.7.4)

Write

\mathrm{Var}(\hat{e}^*) \doteq E[(\hat{e}^* - E[e_1]\mathbf{1})(\hat{e}^* - E[e_1]\mathbf{1})'] .    (A.7.6)

The approximate variance-covariance matrix of \hat{e}^* can then be obtained by substituting the right side of expression (A.7.4) for \hat{e}^* in expression (A.7.6) and then expanding and taking expectations term-by-term. This is a tedious derivation, but by making use of E[U_k] = 0, \mathrm{med}\, e_i = 0, \sum_{i=1}^n x_i = 0, and \sum_{j=1}^n a_{ij}^* = \sum_{j=1}^n a_{ji}^* = 0, we obtain expression (3.12.37) of Section 3.12.7.
where Fc (e) = [Fc (e1 )−1/2, . . . , Fc (en )−1/2]′ is an n×1 vector of independent
random variables.
Now,
Finally, note that from expressions (A.8.1) and (A.8.2) both \hat{\beta}_{LS} and \hat{\beta}_R are functions of the errors e_i. Hence, asymptotic normality follows in the usual way by using (A2) to show that the Lindeberg condition holds.

Proof of Corollary 3.13.1: Recall that \hat{\alpha}_{LS} = \bar{Y} = (1/n)\sum_{i=1}^n Y_i. Define the R intercept estimate, \hat{\alpha}_R, as the median of the residuals. Expression (3.5.22) gives the asymptotic representation of \hat{\alpha}_R which, for convenience, we restate here:

\hat{\alpha}_R = (1/n)\tau_s \sum_{i=1}^n \mathrm{sgn}(Y_i) + o_p(n^{-1/2}) ,    (A.8.3)

which gives

\mathrm{Var}(\hat{\alpha}_{LS} - \hat{\alpha}_R) = (1/n^2)\sum_{i=1}^n \mathrm{Var}(Y_i - \tau_s\,\mathrm{sgn}(Y_i))
 = (1/n^2)\sum_{i=1}^n \left[ \mathrm{Var}(Y_i) + \tau_s^2\,\mathrm{Var}(\mathrm{sgn}(Y_i)) - 2\tau_s\,\mathrm{cov}(Y_i, \mathrm{sgn}(Y_i)) \right]
 = (1/n^2)\sum_{i=1}^n \left[ \sigma^2 + \tau_s^2 - 2\tau_s E(e_i\,\mathrm{sgn}(e_i)) \right]
 = (1/n)\left[ \sigma^2 + \tau_s^2 - 2\tau_s E(e_1\,\mathrm{sgn}(e_1)) \right] .

Next we need to show that the intercept and slope differences have zero covariance. This follows from the fact that \mathbf{1}'X = 0. Asymptotic normality follows as in the proof of Theorem 3.13.1.
References
Abebe, A., McKean, J. W., and Kloke, J. D. (2010), Iterated reweighted rank-
based estimates for GEE models, Submitted.
Babu, G. J. and Koti, K. M. (1996), Sign test for ranked-set sampling, Com-
munications in Statistics, Part A-Theory and Methods, 25(7), 1617-1630.
Bai, Z. D., Chen, X. R., Miao, B. Q., and Rao, C. R. (1990), Asymptotic
theory of least distance estimate in multivariate linear models, Statistics,
21, 503-519.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York: John Wiley and Sons.
Blair, R. C., Sawilowsky, S. S., and Higgins, J. J. (1987), Limitations of the rank
transform statistic in tests for interaction, Communications in Statistics,
Part B-Simulation and Computation, 16, 1133-1145.
Blumen, I. (1958), A new bivariate sign test, Journal of the American Statistical
Association, 53, 448-456.
Chang, W., McKean, J. W., Naranjo, J. D., and Sheather, S. J. (1999), High
breakdown rank-based regression, Journal of the American Statistical As-
sociation, 94, 205-219.
Chiang, C.-Y. and Puri, M. L. (1984), Rank procedures for testing subhypothe-
ses in linear regression, Annals of the Institute of Statistical Mathematics,
36, 35-50.
Crimin, K., Abebe, A. and McKean, J. W. (2008), Robust General Linear Mod-
els and Graphics via a User Interface, Journal of Modern Applied Statistics,
7, 318-330.
Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. (1979), Linpack
Users’ Guide, Philadelphia: SIAM.
Dowell, M. and Jarratt, P. (1971), A modified regula falsi method for computing the root of an equation, BIT, 11, 168-171.
DuBois, C., ed, (1960), Lowie’s Selected Papers in Anthropology, Berkeley: Uni-
versity of California Press.
Dwass, M. (1960), Some k-sample rank order tests, In: I. Olkin, et al., eds,
Contributions to Probability and Statistics. Stanford: Stanford University
Press.
Fix, E. and Hodges, J. L., Jr. (1955), Significance probabilities of the Wilcoxon
test, Annals of Mathematical Statistics, 26, 301-312.
Fox, A. J. (1972), Outliers in time series, Journal of the Royal Statistical Society
B, 34, 350-363.
Gastwirth, J. L. (1968), The first median test: A two-sided version of the control
median test, Journal of the American Statistical Association, 63, 692-706.
Gastwirth, J. L. (1971), On the sign test for symmetry, Journal of the American
Statistical Association, 66, 821-823.
Ghosh, M. and Sen, P. K. (1971), On a class of rank order tests for regres-
sion with partially formed stochastic predictors, Annals of Mathematical
Statistics, 42, 650-661.
Hájek, J. and Šidák, Z. (1967), Theory of Rank Tests, New York: Academic
Press.
Hallin, M., Jurečková, J., and Koul, H. L. (2007), Serial autoregression and regression rank score statistics, In: V. Nair, ed, Advances in Statistical Modeling and Inference; Essays in Honor of Kjell Doksum's 65th Birthday, Singapore: World Scientific.
Hallin, M. and Mélard, G. (1988), Rank-based tests for randomness against first-order ARMA processes, Journal of the American Statistical Association, 16, 402-432.
Hampel, F. R. (1974), The influence curve and its role in robust estimation,
Journal of the American Statistical Association, 69, 383-393.
Hardy, G. H., Littlewood, J. E., and Polya, G. (1952), Inequalities. 2nd ed.,
Cambridge: Cambridge University Press.
Hawkins, D. M., Bradu, D., and Kass, G. V. (1984), Location of several outliers
in multiple regression data using elemental sets, Technometrics, 26, 197-208.
Hodges, J. L., Jr. and Lehmann, E. L. (1956), The efficiency of some non-
parametric competitors of the t-test, Annals of Mathematical Statistics, 27,
324-335.
Hodges, J. L., Jr. and Lehmann, E. L. (1961), Comparison of the normal scores
and Wilcoxon tests, In: Proceedings of the Fourth Berkeley Symposium on
Mathematical Statistics and Probability, 1, 307-317, Berkeley: University of
California Press.
Hössjer, O. (1994), Rank-based estimates in the linear model with high break-
down point, Journal of the American Statistical Association, 89, 149-158.
Huber, P. J. (1981), Robust Statistics, New York: John Wiley and Sons.
Huitema, B. E., McKean, J. W., and Zhao, J. (1996), The runs test for au-
tocorrelated errors: Unacceptable properties, Journal of Educational and
Behavior Statistics, 21, 390-404.
Iman, R. L. (1974), A power study of the rank transform for the two-way
classification model when interaction may be present, Canadian Journal of
Statistics, 2, 227-239.
Jan, S. L. and Randles, R. H. (1995), A multivariate signed sum test for the
one-sample location problem, Journal of Nonparametric Statistics, 4, 49-63.
Johnson, G. D., Nussbaum, B. D., Patil, G. P., and Ross, N. P. (1996), Design-
ing cost-effective environmental sampling using concomitant information,
Chance, 9, 4-16.
Kahaner, D., Moler, C., and Nash, S. (1989), Numerical Methods and Software,
Englewood Cliffs, New Jersey: Prentice Hall.
Kapenga, J. A., McKean, J. W., and Vidmar, T. J. (1988), RGLM: Users Man-
ual, Amer. Statist. Assoc. Short Course on Robust Statistical Procedures
for the Analysis of Linear and Nonlinear Models, New Orleans.
Kloke, J., McKean, J. W., and Rashid, M. (2009), Rank-based estimation and
associated inferences for linear models with cluster correlated errors, Journal
of the American Statistical Association, 104, 384-390.
Larsen, R. J. and Stroup, D. F. (1976), Statistics in the Real World, New York:
Macmillan.
Lawless, J. F. (1982), Statistical Models and Methods for Lifetime Data, New
York: John Wiley and Sons.
Li, H. (1991), Rank Procedures for the Logistic Model, Unpublished Ph.D. The-
sis, Western Michigan University, Kalamazoo, MI.
Liang, K.-Y. and Zeger, S. L. (1986), Longitudinal data analysis using gener-
alized linear models, Biometrika, 73, 13-22.
Liu, R. Y. and Singh, K. (1993), A quality index based on data depth and
multivariate rank tests, Journal of the American Statistical Association,
88, 405-414.
Maritz, J. S., Wu, M. and Staudte, R. G. Jr. (1977), A location estimator based
on a U-statistic, Annals of Statistics, 5, 779-786.
Martin, R.D. and Yohai, V.J. (1991), Bias robust estimation of autoregression
parameters, In: Directions in Robust Statistics and Diagnostics, Part I, eds,
W. Stahel and S. Weisberg, 233-246, New York: Springer-Verlag.
Mason, R. L., Gunst, R. F., and Hess, J. L. (1989), Statistical Design and
Analysis of Experiments, New York: John Wiley and Sons.
McKean, J. W. and Sievers, G. L. (1989), Rank scores suitable for the analysis
of linear models under asymmetric error distributions, Technometrics, 31,
207-218.
McKean, J. W., Vidmar, T.J., and Sievers, G. L. (1989), A robust two stage
multiple comparison procedure with application to a random drug screen,
Biometrics 45, 1281-1297.
Möttönen, J. (1997a), SAS/IML Macros for spatial sign and rank tests, Math-
ematics Department, University of Oulu, Finland.
Möttönen, J., Hettmansperger, T. P., Oja, H., and Tienari, J. (1998), On the
efficiency of the affine invariant multivariate rank tests, Journal of Multi-
variate Analysis, 66, 118-132.
Möttönen, J. and Oja, H. (1995), Multivariate spatial sign and rank methods, Journal of Nonparametric Statistics, 5, 201-213.
Möttönen, J., Oja, H., and Tienari, J. (1997), On the efficiency of multivariate
spatial sign and rank tests, Annals of Statistics, 25, 542-552.
Nelson, W. (1982), Applied Lifetime Data Analysis, New York: John Wiley and
Sons.
Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Ap-
plied Linear Statistical Models, 4th Ed., Chicago: Irwin.
Niinimaa, A., Oja, H. and Nyblom, J. (1992), Algorithm AS277: The Oja bi-
variate median, Applied Statistics, 41, 611-617.
Niinimaa, A., Oja, H., and Tableman, M. (1990), The finite-sample breakdown
point of the Oja bivariate median, Statistics and Probability Letters, 10,
325-328.
Numerical Algorithms Group, Inc. (1983), Library Manual Mark 15, Oxford:
Numerical Algorithms Group.
Oja, H. and Nyblom, J. (1989), Bivariate sign tests, Journal of the American
Statistical Association, 84, 249-259.
Olshen, R. A. (1967), Sign and Wilcoxon test for linearity, Annals of Mathe-
matical Statistics, 38, 1759-1769.
Pinheiro, J. and Bates, D. (2000), Mixed Effects Models in S and S-Plus, New
York: Springer.
Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and the R Core team (2008),
nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-89.
Plaisance, E. P., Taylor, J. K., Alhasson, S., Abebe, A., Mestek, M. L., and Grandjean, P. W. (2007), Cardiovascular fitness and vascular inflammatory markers following acute aerobic exercise, International Journal of Sport Nutrition and Exercise Metabolism, 17, 152-162.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, 2nd Edi-
tion, New York: John Wiley and Sons.
Rao, J. N. K., Sutradhar, B. C., and Yue, K. (1993), Generalized least squares
F test in regression analysis with two-stage cluster samples, Journal of the
American Statistical Association, 88, 1388-1391.
Ridker, P. M., Rifai, N., Rose, L., Buring, J. E., and Cook, N. R. (2002), Comparison of C-reactive protein and low-density lipoprotein cholesterol levels in the prediction of first cardiovascular events, New England Journal of Medicine, 347, 1557-1565.
Rousseeuw, P. and Van Driessen, K. (1999), A fast algorithm for the minimum
covariance determinant estimator, Technometrics, 41, 212-223.
Ryan, B., Joiner, B., and Cryer, J. (2005), Minitab Handbook, 5th Ed., Aus-
tralia: Thomson.
Scheffé, H. (1959), The Analysis of Variance, New York: John Wiley and Sons.
Searle, S. R. (1971), Linear Models, New York: John Wiley and Sons.
Terpstra, J., McKean, J. W., and Naranjo, J. D. (2001), Weighted Wilcoxon es-
timates for autoregression, Australian & New Zealand Journal of Statistics,
43, 399-419.
Thompson, G. L. (1993), Correction Note to: A note on the rank transform for
interactions, (V. 78, 697-701), Biometrika, 80, 211.
Welch, B. L. (1937), The significance of the difference between two means when
the population variances are unequal, Biometrika, 29, 350-362.
Werner, C. and Brunner, E. (2007), Rank methods for the analysis of clustered
data in diagnostic trials, Computational Statistics and Data Analysis, 51,
5041-5054.