7 Single Index Models
The single index regression model is
$$E(y \mid x) = g(x'\beta). \qquad (1)$$
The single index binary choice model is
$$P(y = 1 \mid x) = E(y \mid x) = g(x'\beta), \qquad (2)$$
where g is an unknown distribution function. We use g (rather than, say, F) to emphasize the connection with the regression model.
In both contexts, the function g includes any location and level shift, so the vector $X_i$ cannot include an intercept. The scale of $\beta$ is not identified, so some normalization of $\beta$ is needed. It is typically easier to impose this on $\beta$ than on g. One approach is to set $\beta'\beta = 1$. A second approach is to set one component of $\beta$ equal to one. (This second approach requires that this variable truly has a non-zero coefficient.)
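As a small illustration of the two normalizations (a minimal sketch in Python; the function names are mine, not from the text):

```python
import numpy as np

def normalize_unit_length(beta):
    # Normalization 1: impose beta'beta = 1.
    return beta / np.linalg.norm(beta)

def normalize_first_coefficient(beta):
    # Normalization 2: set the first coefficient to one.
    # This requires that the first variable truly has a non-zero coefficient.
    return beta / beta[0]

beta = np.array([2.0, -1.0, 0.5])
print(normalize_unit_length(beta))        # approximately [ 0.873, -0.436,  0.218]
print(normalize_first_coefficient(beta))  # [ 1.0, -0.5, 0.25]
```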
The vector $X_i$ must be of dimension 2 or larger. If $X_i$ is one-dimensional, then $\beta$ is simply normalized to one, and the model is the one-dimensional nonparametric regression $E(y \mid x) = g(x)$ with no semiparametric component.
Identification of $\beta$ and g also requires that $X_i$ contains at least one continuously distributed variable, and that this variable has a non-zero coefficient. If not, $X_i'\beta$ would take only a discrete set of values, and it would be impossible to identify a continuous function g on this discrete support.
The single index regression model in error form is
$$y_i = g(X_i'\beta) + e_i$$
$$E(e_i \mid X_i) = 0.$$
This model generalizes the linear regression model (which sets g(z) to be linear), and is a
restriction of the nonparametric regression model.
The gain over full nonparametrics is that there is only one nonparametric dimension, so the
curse of dimensionality is avoided.
Suppose g were known. Then you could estimate $\beta$ by (nonlinear) least squares. The LS criterion would be
$$S_n(\beta; g) = \sum_{i=1}^{n}\left(y_i - g(X_i'\beta)\right)^2.$$
We could think about replacing g with an estimate $\hat g$, but since $g(z)$ is the conditional mean of $y_i$ given $X_i'\beta = z$, g depends on $\beta$, so a two-step estimator is likely to be inefficient.
In his PhD thesis, Ichimura proposed a semiparametric estimator, published later in the Journal
of Econometrics (1993).
Ichimura suggested replacing g with the leave-one-out NW estimator
$$\hat g_{-i}(X_i'\beta) = \frac{\sum_{j \neq i} k\!\left(\frac{(X_j - X_i)'\beta}{h}\right) y_j}{\sum_{j \neq i} k\!\left(\frac{(X_j - X_i)'\beta}{h}\right)}.$$
The leave-one-out version is used since we are estimating the regression at the i'th observation, $X_i$.
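A minimal sketch of this leave-one-out estimator (my own implementation, not Ichimura's; a Gaussian kernel is assumed for concreteness):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def loo_nw(y, X, beta, h, kernel=gaussian_kernel):
    # Leave-one-out Nadaraya-Watson estimates of g(X_i'beta), one per observation.
    z = X @ beta                       # index values X_i'beta
    u = (z[None, :] - z[:, None]) / h  # u[i, j] = (X_j - X_i)'beta / h
    K = kernel(u)
    np.fill_diagonal(K, 0.0)           # leave out observation i
    return (K @ y) / K.sum(axis=1)
```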
Since the NW estimator only converges uniformly over compact sets, Ichimura introduces trimming into the sum of squared errors. The criterion is then
$$S_n(\beta) = \sum_{i=1}^{n}\left(y_i - \hat g_{-i}(X_i'\beta)\right)^2 1_i(b).$$
He is not too specific about how to pick the trimming function, and it is likely that it is not important in applications.
The estimator of $\beta$ is then
$$\hat\beta = \operatorname{argmin}_{\beta}\, S_n(\beta).$$
The criterion is somewhat similar to cross-validation. Indeed, Hardle, Hall, and Ichimura (Annals of Statistics, 1993) suggest picking $\beta$ and the bandwidth h jointly by minimization of $S_n(\beta)$.
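Putting the pieces together, a sketch of the estimator with the first coefficient normalized to one (reusing loo_nw from the sketch above; the default of no trimming and the use of Nelder-Mead are simplifications of mine, not prescriptions from the papers):

```python
import numpy as np
from scipy.optimize import minimize

def ichimura_criterion(theta, y, X, h, trim):
    beta = np.concatenate(([1.0], theta))  # normalization: first coefficient equals one
    ghat = loo_nw(y, X, beta, h)           # leave-one-out NW fit, defined above
    return np.sum(trim * (y - ghat)**2)    # trimmed sum of squared errors S_n(beta)

def ichimura_estimate(y, X, h, trim=None, theta0=None):
    n, q = X.shape
    trim = np.ones(n) if trim is None else trim
    theta0 = np.zeros(q - 1) if theta0 is None else theta0
    out = minimize(ichimura_criterion, theta0, args=(y, X, h, trim),
                   method="Nelder-Mead")
    return np.concatenate(([1.0], out.x))

# Following Hardle, Hall, and Ichimura (1993), the bandwidth h could instead be
# appended to theta and chosen jointly by minimizing the same criterion.
```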
In his paper, Ichimura claims that $\hat g_{-i}(X_i'\beta)$ could be replaced by any other uniformly consistent estimator and the consistency of $\hat\beta$ would be maintained, but his asymptotic normality result would be lost. In particular, his proof rests on the asymptotic orthogonality of the derivative of $\hat g_{-i}(X_i'\beta)$ with $e_i$, which holds since the former is a leave-one-out estimator, and fails if it is a conventional NW estimator.
The tricky thing is that $\hat g_{-i}(X_i'\beta)$ is not estimating $g(X_i'\beta_0)$; rather, it is estimating
$$G(X_i'\beta) = E\left(y_i \mid X_i'\beta\right) = E\left(g(X_i'\beta_0) \mid X_i'\beta\right).$$
Hardle, Hall, and Ichimura (1993) show that the LS criterion is asymptotically equivalent to replacing $\hat g_{-i}(X_i'\beta)$ with $G(X_i'\beta)$, so
$$S_n(\beta) \simeq S_n^*(\beta) = \sum_{i=1}^{n}\left(y_i - G(X_i'\beta)\right)^2.$$
This approximation is essentially the same as Andrews' MINPIN argument, and relies on the estimator $\hat g_{-i}(X_i'\beta)$ being a leave-one-out estimator, so that it is orthogonal with the error $e_i$.
This means that $\hat\beta$ is asymptotically equivalent to the minimizer of $S_n^*(\beta)$, an NLLS problem.
As we know from Econ 710, the asymptotic distribution of the NLLS estimator is identical to that of least squares on the constructed regressor
$$X_i^* = \frac{\partial}{\partial\beta} G(X_i'\beta)\Big|_{\beta = \beta_0}.$$
This implies
$$\sqrt{n}\left(\hat\beta - \beta_0\right) \to_d N(0, V)$$
$$V = Q^{-1}\Omega Q^{-1}$$
$$Q = E\left(X_i^* X_i^{*\prime}\right)$$
$$\Omega = E\left(X_i^* X_i^{*\prime} e_i^2\right).$$
Let
$$g^{(1)}(z) = \frac{d}{dz} g(z)$$
denote the derivative of g.
Then
$$G(X_i'\beta) = E\left(g(X_i'\beta_0) \mid X_i'\beta\right) \simeq E\left(g(X_i'\beta) + g^{(1)}(X_i'\beta)\, X_i'(\beta_0 - \beta) \mid X_i'\beta\right) = g(X_i'\beta) + g^{(1)}(X_i'\beta)\, E\left(X_i \mid X_i'\beta\right)'(\beta_0 - \beta)$$
since $g(X_i'\beta)$ and $g^{(1)}(X_i'\beta)$ are measurable with respect to $X_i'\beta$. Another Taylor expansion for $g(X_i'\beta)$ yields that this is approximately
$$G(X_i'\beta) \simeq g(X_i'\beta_0) + g^{(1)}(X_i'\beta)\left(X_i - E\left(X_i \mid X_i'\beta\right)\right)'(\beta - \beta_0) \simeq g(X_i'\beta_0) + g^{(1)}(X_i'\beta_0)\left(X_i - E\left(X_i \mid X_i'\beta_0\right)\right)'(\beta - \beta_0),$$
the final approximation holding for $\beta$ in a $n^{-1/2}$ neighborhood of $\beta_0$. (The error is of smaller stochastic order.)
We see that
$$X_i^* = \frac{\partial}{\partial\beta} G(X_i'\beta)\Big|_{\beta = \beta_0} \simeq g^{(1)}(X_i'\beta_0)\left(X_i - E\left(X_i \mid X_i'\beta_0\right)\right).$$
Ichimura rigorously establishes this result.
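A plug-in sketch of the implied variance estimator, computed for the free coefficients under the $\beta_1 = 1$ normalization (dropping the normalized coefficient sidesteps the singularity of Q in the direction of $\beta_0$ created by the scale normalization; this detail, the numerical derivative, and the reuse of gaussian_kernel from the earlier sketch are choices of mine, not steps spelled out in the notes):

```python
import numpy as np

def nw_fit(y, z, z0, h, kernel=gaussian_kernel):
    # Nadaraya-Watson estimate of E(y | index = z0_i) at each evaluation point z0_i.
    K = kernel((z[None, :] - z0[:, None]) / h)
    return (K @ y) / K.sum(axis=1)

def ichimura_avar_free(y, X, beta_hat, h, eps=1e-3):
    # Plug-in sandwich V = Q^{-1} Omega Q^{-1} for all coefficients but the first.
    n = len(y)
    z = X @ beta_hat
    g = nw_fit(y, z, z, h)
    # numerical derivative g^(1) of the estimated link function
    g1 = (nw_fit(y, z, z + eps, h) - nw_fit(y, z, z - eps, h)) / (2 * eps)
    # E(X_i | X_i'beta), estimated column by column
    EX = np.column_stack([nw_fit(X[:, k], z, z, h) for k in range(X.shape[1])])
    Xstar = g1[:, None] * (X - EX)   # X_i* = g^(1)(X_i'b)(X_i - E(X_i | X_i'b))
    Xs = Xstar[:, 1:]                # drop the normalized coefficient
    e = y - g
    Q = Xs.T @ Xs / n
    Omega = Xs.T @ (Xs * (e**2)[:, None]) / n
    Qinv = np.linalg.inv(Q)
    return Qinv @ Omega @ Qinv
```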
This asymptotic distribution is slightly different than the one which would be obtained if the function g were known a priori. In that case, the asymptotic design depends on $X_i$, not on $X_i - E\left(X_i \mid X_i'\beta_0\right)$:
$$Q = E\left(g^{(1)}(X_i'\beta_0)^2\, X_i X_i'\right).$$
Now consider the binary choice model
$$y_i = 1\left(X_i'\beta \geq e_i\right)$$
where $e_i$ is an error. If $e_i$ is independent of $X_i$ and has distribution function g, then the data satisfy the single-index regression
$$E(y \mid x) = g(x'\beta).$$
This is the model studied by Klein and Spady (Econometrica, 1993).
If g were known, $\beta$ could be estimated by maximum likelihood. The log-likelihood would be
$$L_n(\beta; g) = \sum_{i=1}^{n}\left[y_i \ln g(X_i'\beta) + (1 - y_i)\ln\left(1 - g(X_i'\beta)\right)\right].$$
This is analogous to the sum-of-squared-errors function $S_n(\beta; g)$ for the semiparametric regression model.
Similarly to Ichimura, Klein and Spady suggest replacing g with the leave-one-out NW estimator
$$\hat g_{-i}(X_i'\beta) = \frac{\sum_{j \neq i} k\!\left(\frac{(X_j - X_i)'\beta}{h}\right) y_j}{\sum_{j \neq i} k\!\left(\frac{(X_j - X_i)'\beta}{h}\right)}.$$
Making this substitution and adding a trimming function leads to the feasible likelihood criterion
$$L_n(\beta) = \sum_{i=1}^{n}\left[y_i \ln \hat g_{-i}(X_i'\beta) + (1 - y_i)\ln\left(1 - \hat g_{-i}(X_i'\beta)\right)\right] 1_i(b).$$
Klein and Spady emphasize that the trimming indicator should not be a function of $\beta$, but instead of a preliminary estimator. They suggest a trimming indicator of the form
$$1_i(b) = 1\left(\hat f(X_i'\tilde\beta) \geq b\right),$$
where $\tilde\beta$ is a preliminary estimator of $\beta$, and $\hat f$ is an estimate of the density of $X_i'\tilde\beta$. Klein and Spady observe that trimming does not seem to matter in their simulations.
The Klein-Spady estimator for $\beta$ is the value $\hat\beta$ which maximizes $L_n(\beta)$.
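A sketch of the Klein-Spady criterion along the same lines (reusing loo_nw from the sketch above; the clipping of $\hat g$ away from 0 and 1 is a numerical safeguard of mine, not part of the original proposal):

```python
import numpy as np
from scipy.optimize import minimize

def klein_spady_loglik(theta, y, X, h, trim):
    beta = np.concatenate(([1.0], theta))  # first coefficient normalized to one
    p = loo_nw(y, X, beta, h)              # leave-one-out NW estimate of g(X_i'beta)
    p = np.clip(p, 1e-6, 1 - 1e-6)         # keep the log-likelihood finite
    return np.sum(trim * (y * np.log(p) + (1 - y) * np.log(1 - p)))

def klein_spady_estimate(y, X, h, trim=None, theta0=None):
    n, q = X.shape
    trim = np.ones(n) if trim is None else trim
    theta0 = np.zeros(q - 1) if theta0 is None else theta0
    out = minimize(lambda th: -klein_spady_loglik(th, y, X, h, trim),
                   theta0, method="Nelder-Mead")
    return np.concatenate(([1.0], out.x))
```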
In many respects the Ichimura and Klein-Spady estimators are quite similar.
Unlike Ichimura, Klein and Spady impose the assumption that the kernel k must be fourth-order (i.e., bias-reducing). They also impose that the bandwidth h satisfy the rate $n^{-1/6} < h < n^{-1/8}$, which is smaller than the optimal $n^{-1/9}$ rate for a fourth-order kernel. It is unclear to me if these are merely technical sufficient conditions, or if there is a substantive difference with the semiparametric regression case.
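For concreteness, one standard fourth-order (bias-reducing) kernel built from the Gaussian, which would satisfy this requirement (this particular construction is a common textbook choice, not one specified by Klein and Spady):

```python
import numpy as np

def fourth_order_gaussian_kernel(u):
    # Integrates to one, has zero second moment, and a non-zero fourth moment.
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return 0.5 * (3.0 - u**2) * phi
```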
Klein and Spady also have no discussion of how to select the bandwidth. Following the ideas of Hardle, Hall and Ichimura, it seems sensible that it could be selected jointly with $\beta$ by maximization of $L_n(\beta)$, but this is just a conjecture.
They establish the asymptotic distribution of their estimator. As in Ichimura, letting g denote the distribution of $e_i$, define the function
$$G(X_i'\beta) = E\left(y_i \mid X_i'\beta\right).$$
Then
$$\sqrt{n}\left(\hat\beta - \beta_0\right) \to_d N\left(0, H^{-1}\right)$$
$$H = E\left(\frac{\partial}{\partial\beta} G(X_i'\beta_0)\, \frac{\partial}{\partial\beta'} G(X_i'\beta_0)\, \frac{1}{g(X_i'\beta_0)\left(1 - g(X_i'\beta_0)\right)}\right).$$
They are not specific about the derivative component, but if I understand it correctly it is the same as in Ichimura, so
$$\frac{\partial}{\partial\beta} G(X_i'\beta)\Big|_{\beta = \beta_0} \simeq g^{(1)}(X_i'\beta_0)\left(X_i - E\left(X_i \mid X_i'\beta_0\right)\right).$$
The Klein-Spady estimator achieves the semiparametric efficiency bound for the single-index binary choice model.
Thus in the context of binary choice, it is preferable to use Klein-Spady over Ichimura. Ichimura's LS estimator is inefficient (as the regression model is heteroskedastic), and it is much easier and cleaner to use the Klein-Spady estimator than a two-step weighted LS estimator.
Consider next the nonparametric regression $E(y \mid x) = \mu(x)$ and the weighted average derivative
$$\delta = E\left(\mu^{(1)}(X)\, w(X)\right),$$
where $\mu^{(1)}(x) = \frac{\partial}{\partial x}\mu(x)$ and $w(x)$ is a weight function. It is particularly convenient to set $w(x) = f(x)$, the marginal density of X. Thus Powell, Stock and Stoker (Econometrica, 1989) define this as the (density-weighted) average derivative
$$\delta = E\left(\mu^{(1)}(X) f(X)\right).$$
This is a measure of the average effect of X on y. It is a simple vector, and therefore easier to report than a full nonparametric estimator.
There is a connection with the single index model, where
$$\mu(x) = g(x'\beta),$$
for then
$$\mu^{(1)}(x) = g^{(1)}(x'\beta)\beta$$
and
$$\delta = c\beta,$$
where
$$c = E\left(g^{(1)}(X'\beta) f(X)\right).$$
Since $\beta$ is identified only up to scale, the constant c doesn't matter. That is, a (normalized) estimate of $\delta$ is an estimate of normalized $\beta$.
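For example (with numbers of my own choosing), if $\hat\delta = (0.4, -0.2)'$, dividing through by the first element gives $(1, -0.5)'$ as the estimate of $\beta$ under the $\beta_1 = 1$ normalization, while dividing by $\|\hat\delta\|$ gives the unit-length normalization.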
PSS observe that by integration by parts
$$\delta = E\left(\mu^{(1)}(X) f(X)\right) = \int \mu^{(1)}(x) f(x)^2\, dx = -2\int \mu(x) f(x) f^{(1)}(x)\, dx = -2E\left(y f^{(1)}(X)\right).$$
This suggests the estimator
$$\hat\delta = -\frac{2}{n}\sum_{i=1}^{n} y_i \hat f_{-i}^{(1)}(X_i),$$
where $\hat f_{-i}(X_i)$ is the leave-one-out kernel density estimator and $\hat f_{-i}^{(1)}(X_i)$ is its first derivative.
This is a convenient estimator: there is no denominator to complicate uniform convergence, and only a density estimator is needed, not a conditional mean.
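A sketch of the resulting estimator with a product Gaussian kernel (the implementation details, including the common scalar bandwidth, are choices of mine):

```python
import numpy as np

def pss_average_derivative(y, X, h):
    # delta_hat = -(2/n) * sum_i y_i * fhat'_{-i}(X_i), where fhat_{-i} is a
    # leave-one-out kernel density estimate with a product Gaussian kernel.
    n, q = X.shape
    U = (X[None, :, :] - X[:, None, :]) / h          # U[i, j, :] = (X_j - X_i) / h
    K = np.exp(-0.5 * np.sum(U**2, axis=2)) / (2 * np.pi)**(q / 2)
    idx = np.arange(n)
    K[idx, idx] = 0.0                                # leave out observation i
    # gradient of the leave-one-out density estimate at X_i
    grad_f = (U * K[:, :, None]).sum(axis=1) / ((n - 1) * h**(q + 1))
    return -2.0 * (y[:, None] * grad_f).mean(axis=0)
```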
PSS show that $\hat\delta$ is $\sqrt{n}$-consistent and asymptotically normal, with a convenient covariance matrix. The asymptotic bias is a bit complicated.
Let $q = \dim(X)$. Set $p = (q+4)/2$ if q is even and $p = (q+3)/2$ if q is odd; e.g. $p = 2$ for $q = 1$, $p = 3$ for $q = 2$ or $q = 3$, and $p = 4$ for $q = 4$.
PSS require that the kernel used to estimate f be of order at least p: thus a second-order kernel for $q = 1$, and a fourth-order kernel for $q = 2$, 3, or 4.
PSS then show that the asymptotic bias satisfies
$$n^{1/2} E\left(\hat\delta - \delta\right) = O\left(n^{1/2} h^p\right),$$
which is $o(1)$ if the bandwidth is selected so that $n h^{2p} \to 0$. This is violated (h is too big) if h is selected to be optimal for estimation of $f$ or $f^{(1)}$; the requirement forces the bandwidth to undersmooth in order to reduce the bias. This type of result is commonly seen in semiparametric methods. Unfortunately, it does not lead to a practical rule for bandwidth selection.
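As an illustration (a worked example of mine, not from PSS): with $q = 1$ and $p = 2$, the MSE-optimal bandwidth for estimating $f$ is of order $n^{-1/5}$, which gives $n h^{2p} = n \cdot n^{-4/5} = n^{1/5} \to \infty$, violating the condition, while an undersmoothed choice such as $h \propto n^{-1/3}$ gives $n h^{4} = n^{-1/3} \to 0$.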