Small Sample Corrections For LTS and MCD: Springer-Verlag 2002

Metrika (2002) 55: 111–123
> Springer-Verlag 2002
Small sample corrections for LTS and MCD

G. Pison, S. Van Aelst*, and G. Willems
Department of Mathematics and Computer Science, Universitaire Instelling Antwerpen (UIA),
Universiteitsplein 1, B-2610 Wilrijk, Belgium
(e-mail: gpison@uia.ua.ac.be, vaelst@uia.ua.ac.be, gewillem@uia.ua.ac.be)
Abstract. The least trimmed squares estimator and the minimum covariance
determinant estimator [6] are frequently used robust estimators of regression
and of location and scatter. Consistency factors can be computed for both
methods to make the estimators consistent at the normal model. However, for
small data sets these factors do not make the estimator unbiased. Based on
simulation studies we therefore construct formulas which allow us to compute
small sample correction factors for all sample sizes and dimensions without
having to carry out any new simulations. We give some examples to illustrate
the e¤ect of the correction factor.
Key words: Robustness – Least Trimmed Squares estimator – Minimum Co-

variance Determinant estimator – Bias
1 Introduction
The classical estimators of regression and multivariate location and scatter

can be heavily influenced when outliers are present in the data set. To over-
come this problem Rousseeuw [6] introduced the least trimmed squares (LTS)
estimator as a robust alternative for least squares regression and the minimum
covariance determinant (MCD) estimator instead of the empirical mean and
covariance estimators.
Consistency factors can be computed to make the LTS scale and MCD
scatter estimators consistent at the normal model. However, these consistency
factors are not su‰cient to make the LTS scale or MCD scatter unbiased for
small sample sizes. Simulations and examples with small sample sizes clearly
show that these estimators underestimate the true scatter such that too many
observations are identified as outliers.
* Research Assistant with the FWO Belgium.

112 G. Pison et al.
To solve this problem we construct small sample correction factors which

allow us to identify outliers correctly. For several sample sizes n and dimen-
sions p we carried out Monte-Carlo simulations with data generated from the
standard Gaussian distribution. Based on the results we then derive a formula
which approximates the actual correction factors very well. These formulas
allow us to compute the correction factor at any sample size n and dimension
p immediately without having to carry out any new simulations.
In Section 2, we focus on the LTS scale estimator. We start with a moti-
vating example and then introduce the Monte-Carlo simulation study. Based
on the simulation results we construct the function which yields finite sample
corrections for all n and p. Similarly, correction factors for the MCD scatter
estimator are constructed in Section 3. The reweighted version of both meth-
ods is shortly treated in Section 4. In Section 5 we apply the LTS and MCD
on a real data set to illustrate the e¤ect of the small sample correction factor.
Section 6 gives the conclusions.
2 Least trimmed squares estimator
Consider the regression model

yi ¼ y t xi þ ei ð1Þ
t p
for i ¼ 1; . . . ; n. Here xi ¼ ðxi1 ; . . . ; xip Þ A R are the regressors, yi A R is the
response and ei A R is the error term. We assume that the errors e1 ; . . . ; en are
independent of the carriers and are i.i.d. according to Nð0; s 2 Þ which is the
usual assumption for outlier identification and inference. For every y A R p we
denote the corresponding residuals by ri ðyÞ ¼ ri :¼ yi y t xi and r1:n 2
a a
2
rn:n denote the squared ordered residuals.
The LTS estimator searches for the optimal subset of size h whose least
squares fit has the smallest sum of squared residuals. Formally, for 0:5 a a a
1, the LTS estimator y^ minimizes the objective function
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
u
u 1 X hðaÞ
ka t ðr 2 Þi:n ð2Þ
hðaÞ i¼1
ð 1=2
n q 2 1 a 1
where ka ¼ u dF with q ¼ F þ is the consistency
hðaÞ q 2 2
factor for the LTS scale [2] and h ¼ hðaÞ determines the subset size.
When a ¼ 0:5, hðaÞ equals ½ðn þ p þ 1Þ=2 which yields the highest break-
down value (50%), and when a ¼ 1, hðaÞ equals n, such that we obtain the
least squares estimator. For other values of a we compute the subset size by
linear interpolation. To compute the LTS we use the FAST-LTS algorithm of
Rousseeuw and Van Driessen [9]. The LTS estimate of the error scale is given
by the minimum of the objective function (2).
2.1 Example
In this example we generated n ¼ 30 points such that the predictor variables
are generated from a multivariate standard Gaussian N5 ð0; I Þ distribution and
Small sample corrections for LTS and MCD 113
Fig. 1. Robust standardized residuals (a) without correction factors, and (b) with correction fac-
tors of a generated data set with n ¼ 30 objects and p ¼ 5 regressors.
the response variable comes from the univariate standard Gaussian distribu-
tion. We used the LTS estimator with a ¼ 0:5 to analyse this data set and
computed the robust standardized residuals ri ðy^Þ=^s based on the LTS esti-
mates y^ and s^. Using the cuto¤ values þ2.5 and 2.5 we expect to find ap-
proximately 1% of outliers in the case of normally distributed errors. Hence,
we expect to find at most one outlier in our example. In Figure 1a the robust
standardized residuals of the observations are plotted. We see that LTS finds
6 outlying objects which is much more than expected. The main problem
is that LTS underestimates the scale of the residuals. Therefore the robust
standardized residuals are too large, and too many observations are flagged as
outliers.
2.2 Monte-Carlo simulation study

To determine correction factors for small data sets, first a Monte-Carlo sim-
ulation study is carried out for several sample sizes n and dimensions p. In the
simulation we also consider the distribution of x to be Gaussian. Note that the
LTS estimator y^ ¼ ðy^1t ; y^2 Þ t with y^1 the slope vector and y^2 the intercept, is
regression, scale and a‰ne equivariant (see [7] page 132). This means that
y^1 ðAxi ; syi þ v t xi þ wÞ ¼ ðA1 Þ t ðsy^1 ðxi ; yi Þ þ vÞ
y^2 ðAxi ; syi þ v t xi þ wÞ ¼ sy^2 ðxi ; yi Þ þ w
for every v A R p , s 0 0; w A R and nonsingular p p matrix A. Also the LTS

scale s^ is a‰ne equivariant meaning that
s^2 ðsri þ wÞ ¼ s 2 s^2 ðri Þ

for every s 0 0, w A R. From these equivariances it follows that
s^2 ðsyi þ v t xi þ w y^1 ðAxi ; syi þ v t xi þ wÞ y^2 ðAxi ; syi þ v t xi þ wÞ
¼ s 2 s^2 ðy y^1 ðxi ; yi Þxi y^2 ðxi ; yi ÞÞ ð3Þ

114 G. Pison et al.
Therefore it su‰ces to consider standard Gaussian distributions for x and y

since (3) shows that this correction factor remains valid for any Gaussian
distribution of x and y.
In the simulation, for sample size n and dimension p we generate re-
gressors X ð jÞ A R n p and a response variable Y ð jÞ A R n 1 . For each dataset
Z ð jÞ ¼ ðX ð jÞ ; Y ð jÞ Þ, j ¼ 1; . . . ; m we then determine the LTS scale s^P ð jÞ
of the
m
residuals corresponding to the LTS fit. Finally, the mean mð^ 1
sÞ :¼ m j¼1 s^ð jÞ
is computed. If the estimator is unbiased we have E½^ s ¼ 1 for this model, so
we expect that also mð^ sÞ equals approximately 1. In general, denote cp;a n :¼ mð^ 1
sÞ
a a
then E½cp; n s^ equals approximately 1, so we can use cp; n as a finite-sample
correction factor to make the LTS scale unbiased.
To determine the correction factor we performed m ¼ 1000 simulations for
di¤erent sample sizes n and dimensions p, and for several values of a. For the
model with intercept ðxp ¼ 1Þ we denote the resulting correction factor cp;a n
and for the model without intercept it is denoted by c~p;a n .
From the simulations, we found empirically that for fixed n and p the
mean mð^ sÞ is approximately linear in function of a. Therefore we reduced the
actual simulations to cases with a ¼ 0:5 and a ¼ 0:875. For values of a in be-
tween we then determine the correction factor by linear interpolation. If a ¼ 1
then least squares regression is carried out. In this case, we don’t need a cor-
rection factor because this estimator is unbiased. So, if 0:875 a a a 1 we in-
terpolate between the value of mð^ sÞ for a ¼ 0:875 and 1 to determine the
correction factor.
In Table 1, the mean mð^ sÞ for LTS with intercept and a ¼ 0:5 is given for
several values of n and p. We clearly see that when the sample size n is small,
mð^ sÞ is very small. Moreover, for n fixed, the mean becomes smaller when the
dimension increases. Note that for fixed p the mean increases monotone to 1,
so for large samples the consistency factor su‰ces to make the estimator un-
biased. Table 2 shows the result for a ¼ 0:875. In comparison with Table 1 we
see that these values of mð^ sÞ are higher such that the correction factor cp;0:875n
0:5
will be smaller than cp; n for the same value of n and p. Similar results were
found for LTS without intercept.
sÞ for a ¼ 0:5 and for several sample sizes n and dimensions p

Table 1. mð^
pnn 20 25 30 35 50 55 80 85 100
1 0.71 0.77 0.77 0.81 0.84 0.86 0.89 0.90 0.91

3 0.49 0.58 0.60 0.65 0.71 0.74 0.79 0.81 0.83
5 0.35 0.45 0.46 0.53 0.60 0.64 0.71 0.72 0.75
8 0.25 0.26 0.34 0.36 0.49 0.51 0.62 0.62 0.67
sÞ for a ¼ 0:875 and for several sample sizes n and dimensions p

Table 2. mð^
pnn 20 25 30 35 50 55 80 85 100
1 0.91 0.94 0.94 0.96 0.97 0.97 0.98 0.98 0.99

3 0.83 0.86 0.88 0.90 0.93 0.94 0.95 0.96 0.97
5 0.73 0.77 0.83 0.83 0.89 0.90 0.93 0.93 0.95
8 0.56 0.69 0.72 0.75 0.84 0.85 0.90 0.90 0.92
2.3 Finite sample corrections
We now construct a function which approximates the actual correction factors

obtained from the simulations and allows us to compute the correction factor
at any sample size n and dimension p immediately without having to carry out
any new simulations. First, for a fixed dimension p we plotted the mean mð^sÞ
versus the number of observations n. We made plots for several dimensions p
(1 a p a 10), for a ¼ 0:5, and a ¼ 0:875 and for LTS with and without inter-
cept. Some plots are shown in Figure 2.
sÞ has a smooth pat-
From these plots we see that for p fixed the mean mð^
tern in function of n. For fixed p we used the model
g
fpa ðnÞ ¼ 1 þ ð4Þ
nb
to fit the mean mð^ sÞ in function of n. Hence, for each p and a we obtain the
corresponding parameters g :¼ gp; a and b :¼ b p; a for LTS with intercept and
g :¼ g~p; a , b :¼ b~p; a for LTS without intercept. In Figure 2 the functions ob-
tained by using the model (4) are superimposed. We see that the function
values fpa ðnÞ approximate the actual values of mð^ sÞ obtained from the simu-
lations very well.
Fig. 2. The approximation function fpa ðnÞ for (a) p ¼ 1, a ¼ 0:5 and LTS without intercept, (b)
p ¼ 4, a ¼ 0:875 and LTS without intercept, (c) p ¼ 3, a ¼ 0:5 and LTS with intercept, (d) p ¼ 7,
a ¼ 0:875 and LTS with intercept.
116 G. Pison et al.
Fig. 3. The approximating function gqa ð pÞ for (a) q ¼ 3, a ¼ 0:5 and LTS with intercept, (b) q ¼ 5,
When the regression dataset has a dimension that was included in our
simulation study, then the functions fpa ðnÞ already yield a correction factor for
all possible values of n. However, when the data set has another dimension,
then we have not yet determined the corresponding correction factor. To be
able to obtain correction factors for these higher dimensions we fitted the
function values fpa ðqp 2 Þ for q ¼ 3 and q ¼ 5 as a function of the number of
dimensions p ( p b 2). In Figure 3 we plotted the values fpa ðqp 2 Þ versus the
dimension p for the LTS with intercept and a ¼ 0:5. Also in Figure 3 we see a
smooth pattern. Note that the function values fpa ðqp 2 Þ converge to 1 as p goes
to infinity since we know from (4) that fpa ðqp 2 Þ goes to 1 if qp 2 goes to infinity.
The model we used to fit the values fpa ðqp 2 Þ in function of p is given by
h
gqa ðpÞ ¼ 1 þ : ð5Þ
pk
By fitting this model for q ¼ 3 and 5 and a ¼ 0:5 and 0.875 we obtain the
corresponding parameters h :¼ hq; a and k :¼ kq; a for LTS with intercept and
h :¼ h~q; a , k :¼ k~q; a for LTS without intercept. From Figure 3 we see that the
resulting functions fit the points very well.
Finally, for any n and p we now have the following procedure to determine
the corresponding correction factor for the LTS scale estimator. For the LTS
with intercept the correction factor in the case p ¼ 1 is given by c1;a n :¼ f a1ðnÞ
1
where f1a ðnÞ ¼ 1 þ g1; a =n b1; a . In the case p > 1, we first solve the following
system of equations
h3; a gp; a
1þ k
¼1þ ð6Þ
p 3; a
ð3p 2 Þ bp; a
h5; a gp; a
1þ k
¼ 1 þ ð7Þ
p 5; a ð5p 2 Þ bp; a
to obtain the estimates g^p; a and b^p; a of the parameter values gp; a and b p; a .
Fig. 4. The approximation f^pa ðnÞ for (a) p ¼ 1, a ¼ 0:5 and LTS without intercept, (b) p ¼ 4,
a ¼ 0:875 and LTS without intercept, (c) p ¼ 3, a ¼ 0:5 and LTS with intercept, (d) p ¼ 7,
Note that the system of equations (6)–(7) can be rewritten into a linear system
of equations by taking logarithms. The corresponding correction factor is
^
then given by cp;a n :¼ 1=f^pa ðnÞ where f^pa ðnÞ ¼ 1 þ g^p; a =nbp; a . Similarly, we also
obtain the correction factors for the LTS without intercept.
Using this procedure we obtain the functions shown in Figure 4. We can
clearly see that these functions are nearly the same as the original functions
fpa ðnÞ shown in Figure 2.
Let us reconsider the example of Section 2.1. The corrected LTS estimator
with a ¼ 0:5 is now used to analyse the dataset. The resulting robust stan-
dardized residuals are plotted in Figure 1b. Using the cuto¤ values F1 ð0:9875Þ
and F1 ð0:9875Þ we find 1 outlier which corresponds with the 2.5% of out-
liers we expect to find. Also, we clearly see that the corrected residuals are
much smaller than the uncorrected. The corrected residuals range between 3
and 2 while the uncorrected residuals range between 5 and 4. We conclude
that the scale is not underestimated when we use the LTS estimator with small
sample corrections and therefore it gives more reliable values for the stan-
dardized residuals and more reliable outlier identification.
Finally, we investigated whether the correction factor is also valid when
working with non-normal explanatory variables. In Table 3 we give the mean
mð^ sÞ for some simulation set ups where we used exponential, student (with 3
df.) and cauchy distributed carriers. The approximated values f^pa ðnÞ of mð^ sÞ
118 G. Pison et al.
sÞ for several other distributions of the carriers

Table 3. mð^
n ¼ 20 n ¼ 40 n ¼ 60 n ¼ 80
exp, p ¼ 4, a ¼ 0:875, without intercept 0.84 0.91 0.94 0.96

(0.82) (0.91) (0.94) (0.96)
t3 , p ¼ 3, a ¼ 0:5, with intercept 0.50 0.67 0.75 0.80

(0.52) (0.68) (0.75) (0.79)
cauchy, p ¼ 7, a ¼ 0:875, with intercept 0.63 0.83 0.88 0.92

(0.63) (0.81) (0.88) (0.91)
obtained with normally distributed carriers are given between brackets. From
Table 3 we see that the di¤erence between the simulated value and the cor-
rection factor is very small. Therefore, we conclude that in general, also for
nonnormal carrier distributions, the correction factor makes the LTS scale
unbiased.
3 Minimum covariance determinant estimator
The MCD estimates the location vector m and the scatter matrix S. Sup-
pose we have a dataset Zn ¼ fzi ; i ¼ 1; . . . ; ng H R p , then the MCD searches
for the subset of h ¼ hðaÞ observations whose covariance matrix has the low-
est determinant. For 0:5 a a a 1, its objective is to minimize the determinant
of
la Sfull ð8Þ
P hðaÞ P hðaÞ
1
where Sfull ¼ hðaÞ i¼1 ðzi m^n Þðzi m^n Þ t with m^n ¼ hðaÞ
1
i¼1 zi . The factor
la ¼ a=Fw 2 ðqa Þ with qa ¼ wp;2 a makes the MCD scatter estimator consistent at
pþ2
the normal model (see [3]). The MCD center is then the mean of the optimal
subset and the MCD scatter is a multiple of its covariance matrix as given by
(8). A fast algorithm have been constructed to compute the MCD ([8]).
3.1 Example
Similarly as for LTS, we generated data from a multivariate standard Gaus-

sian distribution. For n ¼ 20 observations of N4 ð0; I Þ we computed the MCD
estimates with a ¼ 0:75. As cuto¤ value to determine outliers the 97.5%
quantile of the w42 distribution is used. Since no outliers are present, we there-
fore expect that MCD will find at most one outlier in this case. Nevertheless,
the MCD estimator identifies 4 outlying objects as shown in Figure 5a where
we plotted the robust distances of the 20 observations. Hence a similar prob-
lem arises as with LTS. The MCD estimator underestimates the volume of the
scatter matrix, such that the robust distances are too large. Therefore the
MCD identifies too many observations as outliers.
Fig. 5. Robust distances (a) without correction factors, (b) with correction factors, of a generated
data set with n ¼ 20 objects and p ¼ 4 dimensions.
3.2 Monte-Carlo simulation study
A Monte-Carlo simulation study is carried out for several sample sizes n and
dimensions p. We generated datasets X ð jÞ A R n p from the standard Gaussian
distribution. It su‰ces to consider the standard Gaussian distribution since
the MCD is a‰ne equivariant (see [7] page 262). For each dataset X ð jÞ , j ¼
1; . . . ; m we then determine the MCD scatter matrix S^ ð jÞ . If the estimator
is unbiased, we have that E½S^ ¼ Ip so we expect that the p-th root of the
determinant of S^ equals 1. Therefore, the mean of the p-th root of the deter-
P m ^ð jÞ 1=p
minant given by mðjS^jÞ :¼ m1 j¼1 ðjS jÞ , where jAj denotes the determi-
nant of a square matrix A, is computed. Denote dp;a n :¼ 1 ^ , then we expect
mðjS jÞ
that the determinant of dp;a n S^ equals approximately 1. Similarly as for LTS,
we now use dp;a n as a finite-sample correction factor for MCD. We performed
m ¼ 1000 simulations for di¤erent sample sizes n and dimensions p, and for
several values of a to compute the correction factors.
From the simulation study similar results as for LTS were obtained. Em-
pirically we found that the mean mðjS^jÞ is approximately linear in function of
a so we reduced the actual simulations to cases with a ¼ 0:5 and a ¼ 0:875.
The other values of a are determined by linear interpolation. Also here we saw
that the mean is very small when the sample size n is small, and for fixed p the
mean increases monotone to 1 when n goes to infinity.
3.3 Finite sample corrections
We now construct a function which approximates the actual correction factors

obtained from the simulations. The same setup as for LTS is used. Model (4)
and for p > 2 also model (5) with q ¼ 2 and q ¼ 3 are used to derive a func-
tion which yields a correction factor for every n and p. The function values
f^pa ðnÞ obtained from this procedure are illustrated in Figure 6. In this Figure
the mean mðjS^jÞ is plotted versus the sample size n for a fixed p and a and
superimposed are the functions f^pa ðnÞ. We see that the function values f^pa ðnÞ
are very close to the original values obtained from the simulations.
120 G. Pison et al.
Fig. 6. The approximation f^pa ðnÞ for (a) p ¼ 8, a ¼ 0:5, and (b) p ¼ 6, a ¼ 0:875.
Finally, we return to the example in Section 3.1. We now use the corrected
MCD estimator to analyse the dataset. The resulting robust distances are
plotted in Figure 5b. Using the same cuto¤ value we now find 1 outlier which
corresponds to the 2.5% of outliers that is expected. Note that the corrected
distances are much smaller than the uncorrected ones. The corrected distances
are all below 15.5 while the uncorrected distances range between 0 and 20.
When we use the MCD with small sample corrections the volume of the MCD
scatter estimator is not underestimated anymore, so we obtain more reliable
robust distances and outlier identification.
4 Reweighted LTS and MCD
To increase the e‰ciency of the LTS and MCD, the reweighted version of
these estimators is often used in practice [7]. Similarly to the initial LTS and
MCD, the reweighted LTS scale and MCD scatter are not unbiased at small
samples even when the consistency factor is included. Therefore, we also de-
termine small sample corrections for the reweighted LTS and MCD based
on the corrected LTS and MCD as initial estimators. We performed Monte-
Carlo studies similar to those for the initial LTS and MCD to compute the
finite-sample correction factor for several sample sizes n and dimensions p.
Based on these simulation results, we then constructed functions which deter-
mine the finite sample correction factor for all n and p.
5 Examples
Let us now look at some real data examples. First we consider the Coleman
data set which contains information on 20 schools from the Mid-Atlantic and
New England states, drawn from a population studied by [1]. The dataset
contains 5 predictor variables which are the sta¤ salaries per pupil ðx1 Þ, the
percent of white-collar fathers ðx2 Þ, the socioeconomic status composite devi-
ation ðx3 Þ, the mean teacher’s verbal test score ðx4 Þ and the mean mother’s
educational level ðx5 Þ. The response variable y measures the verbal mean test
score. Analyzing this dataset using LTS with intercept and a ¼ 0:5, we obtain
Fig. 7. Robust standardized residuals for the coleman data (n ¼ 20, p ¼ 5) based on LTS with
intercept and a ¼ 0:75 (a) uncorrected, (b) corrected, (c) uncorrected reweighted, and (d) corrected
reweighted.
the standardized residuals shown in Figure 7. Figure 7a is based on LTS

without correction factor while Figure 7b is based on the corrected LTS. The
corresponding results for the reweighted LTS are shown in Figures 7c and 7d.
Based on the uncorrected LTS 7 objects are identified as outliers. On the other
hand, by using the corrected LTS the standardized residuals are rescaled and
only 2 huge outliers and 1 boundary case are left. The standardized residuals
of the uncorrected LTS range between 11 and 15 while the values of the
corrected LTS range between 4 and 5. Also when using the reweighted LTS
we can see that the uncorrected LTS finds 5 outliers and 2 boundary cases
while the corrected version only finds 2 outliers.
In the second example we consider the aircraft dataset [5] which deals with
23 single-engine aircraft built between 1947–1979. We use the MCD with
a ¼ 0:75 to analyse the 4 independent variables which are Aspect Ratio ðx1 Þ,
Lift-to-Drag ratio ðx2 Þ, Weight ðx3 Þ and Thrust ðx4 Þ. Based on MCD without
correction factor we obtain the robust distances shown in Figure 8a. We see
that 4 observations are identified as outliers of which aircraft 15 is a boundary
case. The robust distance of aircraft 14 equals 494. If we use the corrected
MCD then we obtain the robust distances in Figure 8b where the boundary
case has disappeared. Note that the robust distances have been rescaled. For
example the robust distance of aircraft 14 is reduced to 395. Similar results are
obtained for the reweighted MCD as shown by Figures 8c and 8d.
122 G. Pison et al.
Fig. 8. Robust distances for the aircraft data (n ¼ 23, p ¼ 4) based on MCD with a ¼ 0:75 (a)
uncorrected, (b) corrected, (c) uncorrected reweighted, and (d) corrected reweighted.
6 Conclusions
Even when a consistency factor is included, this is not su‰cient to make
the LTS and MCD unbiased at small samples. Consequently, the LTS based
standardized residuals and the MCD based robust distances are too large such
that too many observations are identified as outliers. To solve this problem,
we performed Monte-Carlo simulations to compute correction factors for
several sample sizes n and dimensions p. Based on the simulation results we
constructed functions that allow us to determine the correction factor for all
sample sizes and all dimensions. Similar results have been obtained for the
reweighted LTS and MCD. Some examples have been given to illustrate the
di¤erence between the uncorrected and corrected estimators.
References
[1] Coleman J et al., (1966) Equality of educational opportunity. U.S. Department of Health,
Washington D.C.
[2] Croux C, Rousseeuw PJ (1992) A class of high-breakdown scale estimators based on sub-
ranges. Communications in Statistics 21:1935–1951
[3] Croux C, Haesbroeck G (1999) Influence function and e‰ciency of the minimum covariance
determinant scatter matrix estimator. Journal of Multivariate Analysis 71:161–190
[4] Flury B, Riedwyl H (1988) Multivariate statistics: a practical approach. Cambridge University
Press
[5] Gray JB (1985) Graphics for regression diagnostics. American Statistical Association Pro-
ceedings of the Statistical Computing Section, 102–107
[6] Rousseeuw PJ (1984) Least median of squares regression. Journal of the American Statistical
Association 79:871–880
[7] Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley-Interscience,
New York
[8] Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance deter-
minant estimator. Technometrics 41:212–223
[9] Rousseeuw PJ, Van Driessen K (1999) Computing LTS regression for large data sets. Tech-
nical Report, Universitaire Instelling Antwerpen

Small Sample Corrections For LTS and MCD: Springer-Verlag 2002

Uploaded by

Copyright:

Available Formats

Small Sample Corrections For LTS and MCD: Springer-Verlag 2002

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Small Sample Corrections For LTS and MCD: Springer-Verlag 2002

Uploaded by

Copyright:

Available Formats

Metrika (2002) 55: 111–123

> Springer-Verlag 2002

Small sample corrections for LTS and MCD

Key words: Robustness – Least Trimmed Squares estimator – Minimum Co-

The classical estimators of regression and multivariate location and scatter

* Research Assistant with the FWO Belgium.

To solve this problem we construct small sample correction factors which

2 Least trimmed squares estimator

Consider the regression model

2.2 Monte-Carlo simulation study

y^1 ðAxi ; syi þ v t xi þ wÞ ¼ ðA1 Þ t ðsy^1 ðxi ; yi Þ þ vÞ

y^2 ðAxi ; syi þ v t xi þ wÞ ¼ sy^2 ðxi ; yi Þ þ w

for every v A R p , s 0 0; w A R and nonsingular p p matrix A. Also the LTS

s^2 ðsri þ wÞ ¼ s 2 s^2 ðri Þ

s^2 ðsyi þ v t xi þ w  y^1 ðAxi ; syi þ v t xi þ wÞ  y^2 ðAxi ; syi þ v t xi þ wÞ

¼ s 2 s^2 ðy  y^1 ðxi ; yi Þxi  y^2 ðxi ; yi ÞÞ ð3Þ

Therefore it su‰ces to consider standard Gaussian distributions for x and y

sÞ for a ¼ 0:5 and for several sample sizes n and dimensions p

1 0.71 0.77 0.77 0.81 0.84 0.86 0.89 0.90 0.91

sÞ for a ¼ 0:875 and for several sample sizes n and dimensions p

1 0.91 0.94 0.94 0.96 0.97 0.97 0.98 0.98 0.99

2.3 Finite sample corrections

We now construct a function which approximates the actual correction factors

sÞ for several other distributions of the carriers

exp, p ¼ 4, a ¼ 0:875, without intercept 0.84 0.91 0.94 0.96

t3 , p ¼ 3, a ¼ 0:5, with intercept 0.50 0.67 0.75 0.80

cauchy, p ¼ 7, a ¼ 0:875, with intercept 0.63 0.83 0.88 0.92

3 Minimum covariance determinant estimator

Similarly as for LTS, we generated data from a multivariate standard Gaus-

3.2 Monte-Carlo simulation study

3.3 Finite sample corrections

We now construct a function which approximates the actual correction factors

4 Reweighted LTS and MCD

the standardized residuals shown in Figure 7. Figure 7a is based on LTS

You might also like

y^1 ðAxi ; syi þ v t xi þ wÞ ¼ ðA1 Þ t ðsy^1 ðxi ; yi Þ þ vÞ

s^2 ðsyi þ v t xi þ w y^1 ðAxi ; syi þ v t xi þ wÞ y^2 ðAxi ; syi þ v t xi þ wÞ

¼ s 2 s^2 ðy y^1 ðxi ; yi Þxi y^2 ðxi ; yi ÞÞ ð3Þ