Regularization Theory and Neural Networks Architectures¹
Federico Girosi, Michael Jones and Tomaso Poggio
Center for Biological and Computational Learning
and
Artificial Intelligence Laboratory
Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Abstract
We had previously shown that regularization principles lead to approximation schemes which are equivalent to networks with one layer of hidden units, called Regularization Networks. In particular, standard smoothness functionals lead to a subclass of regularization networks, the well known Radial Basis Functions approximation schemes. This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks. In particular, we introduce new classes of smoothness functionals that lead to different classes of basis functions. Additive splines as well as some tensor product splines can be obtained from appropriate classes of smoothness functionals. Furthermore, the same generalization that extends Radial Basis Functions (RBF) to Hyper Basis Functions (HBF) also leads from additive models to ridge approximation models, containing as special cases Breiman's hinge functions, some forms of Projection Pursuit Regression and several types of neural networks. We propose to use the term Generalized Regularization Networks for this broad class of approximation schemes that follow from an extension of regularization. In the probabilistic interpretation of regularization, the different classes of basis functions correspond to different classes of prior probabilities on the approximating function spaces, and therefore to different types of smoothness assumptions.

In summary, different multilayer networks with one hidden layer, which we collectively call Generalized Regularization Networks, correspond to different classes of priors and associated smoothness functionals in a classical regularization principle. Three broad classes are a) Radial Basis Functions that can be generalized to Hyper Basis Functions, b) some tensor product splines, and c) additive splines that can be generalized to schemes of the type of ridge approximation, hinge functions and several perceptron-like neural networks with one hidden layer.
¹ This paper will appear in Neural Computation, vol. 7, pages 219-269, 1995. An earlier version of this paper appeared as MIT AI Memo 1430, 1993.
1 Introduction
In recent years we and others have argued that the task of learning from examples can be
considered in many cases to be equivalent to multivariate function approximation, that
is, to the problem of approximating a smooth function from sparse data, the examples.
The interpretation of an approximation scheme in terms of networks and vice versa has
also been extensively discussed (Barron and Barron, 1988; Poggio and Girosi, 1989, 1990,
1990b; Girosi, 1992; Broomhead and Lowe, 1988; Moody and Darken, 1988, 1989; White,
1989, 1990; Ripley, 1994; Omohundro, 1987; Kohonen, 1990; Lapedes and Farber, 1988;
Rumelhart, Hinton and Williams, 1986; Hertz, Krogh and Palmer, 1991; Kung, 1993;
Sejnowski and Rosenberg, 1987; Hurlbert and Poggio, 1988; Poggio, 1975).
In a series of papers we have explored a quite general approach to the problem
of function approximation. The approach regularizes the ill-posed problem of function
approximation from sparse data by assuming an appropriate prior on the class of approximating functions. Regularization techniques (Tikhonov, 1963; Tikhonov and Arsenin,
1977; Morozov, 1984; Bertero, 1986; Wahba, 1975, 1979, 1990) typically impose smoothness constraints on the approximating set of functions. It can be argued that some
form of smoothness is necessary to allow meaningful generalization in approximation
type problems (Poggio and Girosi, 1989, 1990). A similar argument can also be used
(see section 9.1) in the case of classification where smoothness is a condition on the classification boundaries rather than on the input-output mapping itself. Our use of regularization, which follows the classical technique introduced by Tikhonov, identifies
the approximating function as the minimizer of a cost functional that includes an error
term and a smoothness functional, usually called a stabilizer. In the Bayesian interpretation of regularization (see Kimeldorf and Wahba, 1971; Wahba, 1990; Bertero, Poggio
and Torre, 1988; Marroquin, Mitter and Poggio, 1987; Poggio, Torre and Koch, 1985)
the stabilizer corresponds to a smoothness prior, and the error term to a model of the
noise in the data (usually Gaussian and additive).
In Poggio and Girosi (1989, 1990, 1990b) and Girosi (1992) we showed that regularization principles lead to approximation schemes which are equivalent to networks with one
"hidden" layer, which we call Regularization Networks (RN). In particular, we described how a certain class of radial stabilizers – and the associated priors in the equivalent Bayesian formulation – lead to a subclass of regularization networks, the already-known
Radial Basis Functions (Powell, 1987, 1990; Franke, 1982, 1987; Micchelli, 1986; Kansa,
1990a,b; Madych and Nelson, 1990; Dyn, 1987, 1991; Hardy, 1971, 1990; Buhmann,
1990; Lancaster and Salkauskas, 1986; Broomhead and Lowe, 1988; Moody and Darken,
1988, 1989; Poggio and Girosi, 1990, 1990b; Girosi, 1992). The regularization networks
with radial stabilizers we studied include many classical one-dimensional (Schumaker,
1981; de Boor, 1978) as well as multidimensional splines and approximation techniques,
such as radial and non-radial Gaussian, thin-plate splines (Duchon, 1977; Meinguet,
1979; Grimson, 1982; Cox, 1984; Eubank, 1988) and multiquadric functions (Hardy,
1971, 1990). In Poggio and Girosi (1990, 1990a) we extended this class of networks to
Hyper Basis Functions (HBF). In this paper we show that an extension of Regularization Networks, which we propose to call Generalized Regularization Networks (GRN),
encompasses an even broader range of approximation schemes including, in addition to
HBF, tensor product splines, many of the general additive models, and some of the
neural networks. As expected, GRN have approximation properties of the same type as
already shown for some of the neural networks (Girosi and Poggio, 1990; Cybenko, 1989;
Hornik, Stinchcombe and White, 1989; White, 1990; Irie and Miyake, 1988; Funahashi,
1989, Barron, 1991, 1994; Jones, 1992; Mhaskar and Micchelli, 1992, 1993; Mhaskar,
1993, 1993a).
The plan of the paper is as follows. We first discuss the solution of the variational problem of regularization. We then introduce three different classes of stabilizers – and the corresponding priors in the equivalent Bayesian interpretation – that lead to different classes of basis functions: the well-known radial stabilizers, tensor-product stabilizers, and the new additive stabilizers that underlie additive splines of different types. It is then possible to show that the same argument that extends Radial Basis Functions to Hyper Basis Functions also leads from additive models to some ridge approximation
schemes, defined as
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{K} h_\alpha(\mathbf{w}_\alpha \cdot \mathbf{x}) $$
where the $h_\alpha$ are appropriate one-dimensional functions.
Special cases of ridge approximation are Breiman's hinge functions (1993), Projection Pursuit Regression (PPR) (Friedman and Stuezle, 1981; Huber, 1985; Diaconis and
Freedman, 1984; Donoho and Johnstone, 1989; Moody and Yarvin, 1991) and Multilayer Perceptrons (Lapedes and Farber, 1988; Rumelhart, Hinton and Williams, 1986;
Hertz, Krogh and Palmer, 1991; Kung, 1993; Sejnowski and Rosenberg, 1987). Simple
numerical experiments are then described to illustrate the theoretical arguments.
In summary, the chain of our arguments shows that some ridge approximation schemes are approximations of Regularization Networks with appropriate additive stabilizers. The form of $h_\alpha$ depends on the stabilizer, and includes in particular cubic splines (used in typical implementations of PPR) and one-dimensional Gaussians. Perceptron-like neural networks with one hidden layer and with a Gaussian activation function are included. It seems impossible, however, to directly derive from regularization principles the sigmoidal activation functions typically used in feedforward neural networks. We discuss, however, in a simple example, the close relationship between basis functions of the hinge, the sigmoid and the Gaussian type.
The appendices deal with observations related to the main results of the paper and more technical details.
2 The regularization approach to the approximation problem
Suppose that the set $g = \{(\mathbf{x}_i, y_i) \in R^d \times R\}_{i=1}^{N}$ of data has been obtained by random sampling a function $f$, belonging to some space of functions $X$ defined on $R^d$, in the
presence of noise, and suppose we are interested in recovering the function f , or an
estimate of it, from the set of data g. This problem is clearly ill-posed, since it has an
infinite number of solutions. In order to choose one particular solution we need to have
some a priori knowledge of the function that has to be reconstructed. The most common
form of a priori knowledge consists in assuming that the function is smooth, in the sense
that two similar inputs correspond to two similar outputs. The main idea underlying
regularization theory is that the solution of an ill-posed problem can be obtained from
a variational principle, which contains both the data and prior smoothness information.
Smoothness is taken into account by defining a smoothness functional $\phi[f]$ in such a way that lower values of the functional correspond to smoother functions. Since we look for a function that is simultaneously close to the data and also smooth, it is natural to choose as a solution of the approximation problem the function that minimizes the following functional:
$$ H[f] = \sum_{i=1}^{N} (f(\mathbf{x}_i) - y_i)^2 + \lambda \phi[f] \qquad (1) $$
where $\lambda$ is a positive number that is usually called the regularization parameter. The first term enforces closeness to the data, and the second smoothness, while the regularization parameter controls the trade-off between these two terms, and can be chosen according to cross-validation techniques (Allen, 1974; Wahba and Wold, 1975; Golub, Heath and Wahba, 1979; Craven and Wahba, 1979; Utreras, 1979; Wahba, 1985) or to some other principle, such as Structural Risk Minimization (Vapnik, 1988).
It can be shown that, for a wide class of functionals $\phi$, the solutions of the minimization of the functional (1) all have the same form. Although a detailed and rigorous derivation of the solution of this problem is beyond the scope of this paper, a simple derivation of this general result is presented in appendix (A). In this section we just present a family of smoothness functionals and the corresponding solutions of the variational problem. We refer the reader to the current literature for the mathematical details (Wahba, 1990; Madych and Nelson, 1990; Dyn, 1987).
We first need to give a more precise definition of what we mean by smoothness and define a class of suitable smoothness functionals. We refer to smoothness as a measure of the "oscillatory" behavior of a function. Therefore, within a class of differentiable functions, one function will be said to be smoother than another one if it oscillates less. If we look at the functions in the frequency domain, we may say that a function is smoother than another one if it has less energy at high frequency (smaller bandwidth). The high frequency content of a function can be measured by first high-pass filtering the function, and then measuring the power, that is the $L_2$ norm, of the result. In formulas, this suggests defining smoothness functionals of the form:
this suggests de ning smoothness functionals of the form:
Z
jf~(s)j2
[f ] = d ds ~
(2)
R
G(s)
where $\tilde{\ }$ indicates the Fourier transform, $\tilde{G}$ is some positive function that tends to zero as $\|\mathbf{s}\| \to \infty$ (so that $1/\tilde{G}$ is a high-pass filter) and for which the class of functions such that this expression is well defined is not empty. For a well defined class of functions $\tilde{G}$ (Madych and Nelson, 1990; Dyn, 1991; Dyn et al., 1989) this functional is a semi-norm, with a finite dimensional null space $N$. The next section will be devoted to giving examples of the possible choices for the stabilizer $\phi$. For the moment we just assume that it can be written as in equation (2), and make the additional assumption that $\tilde{G}$ is symmetric, so that its Fourier transform $G$ is real and symmetric. In this case it is possible to show (see appendix (A) for a sketch of the proof) that the function that
minimizes the functional (1) has the form:
$$ f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\mathbf{x} - \mathbf{x}_i) + \sum_{\alpha=1}^{k} d_\alpha\, \psi_\alpha(\mathbf{x}) \qquad (3) $$
where $\{\psi_\alpha\}_{\alpha=1}^{k}$ is a basis in the $k$-dimensional null space $N$ of the functional $\phi$, that in most cases is a set of polynomials, and therefore will be referred to as the "polynomial term" in equation (3). The coefficients $d_\alpha$ and $c_i$ depend on the data, and satisfy the following linear system:
$$ (G + \lambda I)\mathbf{c} + \Psi^T \mathbf{d} = \mathbf{y} \qquad (4) $$
$$ \Psi \mathbf{c} = 0 \qquad (5) $$
where $I$ is the identity matrix, and we have defined
$$ (\mathbf{y})_i = y_i, \quad (\mathbf{c})_i = c_i, \quad (\mathbf{d})_i = d_i, $$
$$ (G)_{ij} = G(\mathbf{x}_i - \mathbf{x}_j), \quad (\Psi)_{\alpha i} = \psi_\alpha(\mathbf{x}_i). $$
Notice that if the data term in equation (1) is replaced by $\sum_{i=1}^{N} V(f(\mathbf{x}_i) - y_i)$, where $V$ is any differentiable function, the solution of the variational principle still has the form (3), but the coefficients cannot be found anymore by solving a linear system of equations (Girosi, 1991; Girosi, Poggio and Caprile, 1991).
The existence of a solution to the linear system shown above is guaranteed by the existence of the solution of the variational problem. The case of $\lambda = 0$ corresponds to pure interpolation. In this case the existence of an exact solution of the linear system of equations depends on the properties of the basis function $G$ (Micchelli, 1986).
The approximation scheme of equation (3) has a simple interpretation in terms of a
network with one layer of hidden units, which we call a Regularization Network (RN).
Appendix B describes the extension to the vector output scheme.
In summary, the argument of this section shows that using a regularization network
of the form (3), for a certain class of basis functions G, is equivalent to minimizing the
functional (1). In particular, the choice of G is equivalent to the corresponding choice
of the smoothness functional (2).
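The following is a minimal sketch (our own, not from the paper) of how such a regularization network can be built with a Gaussian basis function and an empty null space, so that the polynomial term of equation (3) drops and the coefficients follow from equation (4) alone; variable names and parameter values are illustrative.

```python
import numpy as np

def gaussian_basis(r2, sigma=1.0):
    # radial Gaussian basis function G(x) = exp(-||x||^2 / sigma^2)
    return np.exp(-r2 / sigma**2)

def train_rn(X, y, lam=1e-3, sigma=1.0):
    # Build the N x N matrix (G)_ij = G(x_i - x_j) and solve (G + lam*I) c = y
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    G = gaussian_basis(sq_dists, sigma)
    return np.linalg.solve(G + lam * np.eye(len(y)), y)

def eval_rn(x, X, c, sigma=1.0):
    # f(x) = sum_i c_i G(x - x_i), equation (3) without the polynomial term
    r2 = np.sum((X - x)**2, axis=-1)
    return gaussian_basis(r2, sigma) @ c

# toy usage: fit a noisy one-dimensional function
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
c = train_rn(X, y, lam=1e-2, sigma=0.2)
print(eval_rn(np.array([0.3]), X, c, sigma=0.2))
```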
2.1 Dual representation of Regularization Networks
Consider an approximating function of the form (3), neglecting the "polynomial term" for simplicity. A compact notation for this expression is:
$$ f(\mathbf{x}) = \mathbf{c} \cdot \mathbf{g}(\mathbf{x}) \qquad (6) $$
where $\mathbf{g}(\mathbf{x})$ is the vector of functions such that $(\mathbf{g}(\mathbf{x}))_i = G(\mathbf{x} - \mathbf{x}_i)$. Since the coefficients $\mathbf{c}$ satisfy the linear system (4), solution (6) becomes
$$ f(\mathbf{x}) = \left( (G + \lambda I)^{-1} \mathbf{y} \right) \cdot \mathbf{g}(\mathbf{x}). $$
We can rewrite this expression as
$$ f(\mathbf{x}) = \sum_{i=1}^{N} y_i\, b_i(\mathbf{x}) = \mathbf{y} \cdot \mathbf{b}(\mathbf{x}) \qquad (7) $$
in which the vector $\mathbf{b}(\mathbf{x})$ of basis functions is defined as
$$ \mathbf{b}(\mathbf{x}) = (G + \lambda I)^{-1} \mathbf{g}(\mathbf{x}) \qquad (8) $$
and now depends on all the data points and on the regularization parameter $\lambda$. The representation (7) of the solution of the approximation problem is known as the dual of equation (6), and the basis functions $b_i(\mathbf{x})$ are called the equivalent kernels, because of the similarity between equation (7) and the kernel smoothing technique that we will define in section 2.2 (Silverman, 1984; Hardle, 1990; Hastie and Tibshirani, 1990). While in equation (6) the "difficult" part is the computation of the vector of coefficients $c_i$, the set of basis functions $\mathbf{g}(\mathbf{x})$ being easily built, in equation (7) the "difficult" part is the computation of the basis functions $\mathbf{b}(\mathbf{x})$, the coefficients of the expansion being explicitly given by the $y_i$. Notice that $\mathbf{b}(\mathbf{x})$ depends on the distribution of the data in the input space and that the kernels $b_i(\mathbf{x})$, unlike the kernels $G(\mathbf{x} - \mathbf{x}_i)$, are not translated replicas of the same kernel. Notice also that, as shown in appendix B, a dual representation of the form (7) exists for all the approximation schemes that consist of linear superpositions of arbitrary numbers of basis functions, as long as the error criterion that is used to determine the parameters of the approximation is quadratic.
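As an illustration, the equivalent kernels of equation (8) can be computed numerically; the sketch below (a minimal example with a Gaussian basis function, variable names and parameter values are ours) verifies that the primal form (6) and the dual form (7) give the same value.

```python
import numpy as np

def equivalent_kernels(x, X, lam=1e-2, sigma=0.2):
    # b(x) = (G + lam*I)^{-1} g(x), equation (8), for a Gaussian basis function
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    G = np.exp(-sq / sigma**2)
    g = np.exp(-np.sum((X - x)**2, axis=-1) / sigma**2)
    return np.linalg.solve(G + lam * np.eye(len(X)), g)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(30, 1))
y = np.cos(3 * X[:, 0])
x = np.array([0.4])
b = equivalent_kernels(x, X)                                    # dual: f(x) = y . b(x)
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
c = np.linalg.solve(np.exp(-sq / 0.2**2) + 1e-2 * np.eye(30), y)
g = np.exp(-np.sum((X - x)**2, axis=-1) / 0.2**2)
print(y @ b, c @ g)                                             # the two values coincide
```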
The dual representation provides an intuitive way of looking at the approximation
scheme (3): the value of the approximating function at an evaluation point x is explicitly
expressed as a weighted sum of the values yi of the function at the examples xi. This
concept is not new in approximation theory, and has been used, for example, in the
theory of quasi-interpolation. The case in which the data points $\{\mathbf{x}_i\}$ coincide with the multi-integers $Z^d$, where $Z$ is the set of integer numbers, has been extensively studied in the literature, and it is also known as Schoenberg's approximation (Schoenberg, 1946a, 1946b, 1969; Rabut, 1991, 1992; Madych and Nelson, 1990a; Jackson, 1988; de Boor, 1990; Buhmann, 1990, 1991; Dyn et al., 1989). In this case, an approximation $f^*$ to a function $f$ is sought of the form:
$$ f^*(\mathbf{x}) = \sum_{\mathbf{j} \in Z^d} f(\mathbf{j})\, \psi(\mathbf{x} - \mathbf{j}) \qquad (9) $$
where $\psi$ is some fast-decaying function that is a linear combination of Radial Basis Functions. The approximation scheme (9) is therefore a linear superposition of Radial Basis Functions in which the functions $\psi(\mathbf{x} - \mathbf{j})$ play the role of equivalent kernels.
Quasi-interpolation is interesting because it could provide good approximations without the need of solving complex minimization problems or large linear systems. For a discussion of such non-iterative training algorithms see Mhaskar (1993a) and references therein.
Although difficult to prove rigorously, we can expect the kernels $b_i(\mathbf{x})$ to decrease with the distance of the data points $\mathbf{x}_i$ from the evaluation point, so that only the neighboring points affect the estimate of the function at $\mathbf{x}$, providing therefore a "local" approximation scheme. Even if the original basis function $G$ is not "local", like the multiquadric $G(\mathbf{x}) = \sqrt{1 + \|\mathbf{x}\|^2}$, the basis functions $b_i(\mathbf{x})$ are bell shaped, local functions, whose locality will depend on the choice of the basis function $G$, on the density of data points, and on the regularization parameter $\lambda$. This shows that apparently "global" approximation schemes can be regarded as local, memory-based techniques (see equation 7) (Mhaskar, 1993a). It should be noted, however, that these techniques do not have the highest possible degree of locality, since the parameter that controls the locality is the regularization parameter $\lambda$, which is the same for all the kernels. It is possible to devise even more local techniques, in which each kernel has a parameter that controls its locality (Bottou and Vapnik, 1992; Vapnik, personal communication).
When the data are equally spaced on an infinite grid, we expect the basis functions $b_i(\mathbf{x})$ to become translation invariant, and therefore the dual representation (7) becomes a convolution filter. For a study of the properties of these filters in the case of one dimensional cubic splines see the work of Silverman (1984), who gives explicit results for the shape of the equivalent kernel.
Let us consider some simple experiments that show the shape of the equivalent kernels in specific situations. We first considered a data set composed of 36 equally spaced points on the domain $[0,1] \times [0,1]$, at the nodes of a regular grid with spacing equal to 0.2. We use the multiquadric basis function $G(\mathbf{x}) = \sqrt{c^2 + \|\mathbf{x}\|^2}$, where $c$ has been set to 0.2. Figure 1a shows the original multiquadric function, and figure 1b the equivalent kernel $b_{16}$, in the case of $\lambda = 0$, where, according to definition (8),
$$ b_i(\mathbf{x}) = \sum_{j=1}^{36} (G^{-1})_{ij}\, G(\mathbf{x} - \mathbf{x}_j). $$
All the other kernels, except those close to the border, are very similar, since the data
are equally spaced, and translation invariance holds approximately.
Consider now a one dimensional example with a multiquadric basis function:
$$ G(x) = \sqrt{c^2 + x^2}. $$
The data set was chosen to be a non-uniform sampling of the interval $[0,1]$, that is the set
$$ \{0.0,\ 0.1,\ 0.2,\ 0.3,\ 0.4,\ 0.7,\ 1.0\}. $$
In figures 1c, 1d and 1e we have drawn, respectively, the equivalent kernels $b_3$, $b_5$ and $b_6$, under the same definitions. Notice that all of them are bell-shaped, although the original basis function is an increasing, cup-shaped function. Notice, moreover, that the shape of the equivalent kernels changes from $b_3$ to $b_6$, becoming broader in moving from a high to a low sample density region. This phenomenon has been shown by Silverman (1986) for cubic splines, but we expect it to appear in much more general cases.
The connection between regularization theory and the dual representation (7) becomes clear in the special case of "continuous" data, for which the regularization functional has the form:
$$ H[f] = \int d\mathbf{x}\, (f(\mathbf{x}) - y(\mathbf{x}))^2 + \lambda \phi[f] \qquad (10) $$
where $y(\mathbf{x})$ is the function to be approximated. This functional can be intuitively seen as the limit of the functional (1) when the number of data points goes to infinity and their spacing is uniform. It is easily seen that, when the stabilizer $\phi[f]$ is of the form (2), the solution of the regularization functional (10) is
$$ f(\mathbf{x}) = y(\mathbf{x}) * B(\mathbf{x}) \qquad (11) $$
where $B(\mathbf{x})$ is the Fourier transform of
$$ \tilde{B}(\mathbf{s}) = \frac{\tilde{G}(\mathbf{s})}{\lambda + \tilde{G}(\mathbf{s})} $$
(see Poggio, Voorhees and Yuille (1988) for some examples of $B(\mathbf{x})$). The solution (11) is therefore a filtered version of the original function $y(\mathbf{x})$ and, consistently with the results of Silverman (1984), has the form (7), where the equivalent kernels are translates of the function $B(\mathbf{x})$ defined above. Notice the effect of the regularization parameter: for $\lambda = 0$ the equivalent kernel $B(\mathbf{x})$ is a Dirac delta function, and $f(\mathbf{x}) = y(\mathbf{x})$ (no noise), while for $\lambda \to \infty$ we have $B(\mathbf{x}) \propto G(\mathbf{x})$ and $f \propto G * y$ (a low-pass filter).
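A quick way to visualize the equivalent kernel of equation (11) is to form $\tilde{B} = \tilde{G}/(\lambda + \tilde{G})$ on a frequency grid and transform back; the sketch below does this for a Gaussian $\tilde{G}$ (our own illustrative example, not from the paper).

```python
import numpy as np

# Frequency-domain construction of the equivalent kernel B for continuous data:
# B_tilde(s) = G_tilde(s) / (lam + G_tilde(s)), then inverse FFT back to space.
n, L = 1024, 20.0                       # number of samples, spatial extent
s = np.fft.fftfreq(n, d=L / n) * 2 * np.pi
G_tilde = np.exp(-s**2)                 # Fourier transform of a Gaussian stabilizer
for lam in (0.01, 0.1, 1.0):
    B_tilde = G_tilde / (lam + G_tilde)
    B = np.fft.fftshift(np.fft.ifft(B_tilde)).real
    print(lam, B.max())                 # the kernel gets broader and lower as lam grows
```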
The dual representation is illuminating and especially interesting for the case of a multi-output network – approximating a vector field – that is discussed in Appendix B.
2.2 Normalized kernels
An approximation technique very similar to Radial Basis Functions is the so-called Normalized Radial Basis Functions (Moody and Darken, 1988, 1989). A Normalized Radial Basis Functions expansion is a function of the form:
$$ f(\mathbf{x}) = \frac{\sum_{\alpha=1}^{n} c_\alpha\, G(\mathbf{x} - \mathbf{t}_\alpha)}{\sum_{\alpha=1}^{n} G(\mathbf{x} - \mathbf{t}_\alpha)}. \qquad (12) $$
The only difference between equation (12) and Radial Basis Functions is the normalization factor in the denominator, which is an estimate of the probability distribution of the data. A discussion about the relation between normalized Gaussian basis function networks, Gaussian mixtures and Gaussian mixture classifiers can be found in the work of Tresp, Hollatz and Ahmad (1993). In the rest of this section we show that a particular version of this approximation scheme again has a tight connection to regularization theory.
Let $P(\mathbf{x}, y)$ be the joint probability of inputs and outputs of the network, and let us assume that we have a sample of $N$ pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ randomly drawn according to $P$. Our goal is to build an estimator (a network) $f$ that minimizes the expected risk:
$$ I[f] = \int d\mathbf{x}\, dy\, P(\mathbf{x}, y)\,(y - f(\mathbf{x}))^2. \qquad (13) $$
This cannot be done, since the probability $P$ is unknown, and usually the empirical risk:
$$ I_{emp}[f] = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 \qquad (14) $$
is minimized instead. An alternative consists in obtaining an approximation of the probability $P(\mathbf{x}, y)$ first, and then in minimizing the expected risk. If this option is chosen, one could use the regularization approach to probability estimation (Vapnik and Stefanyuk, 1978; Aidu and Vapnik, 1989; Vapnik, 1982), that leads to the well known technique of Parzen windows. A Parzen window estimator $P^*$ for the probability distribution of a set of data $\{\mathbf{z}_i\}_{i=1}^{N}$ has the form:
$$ P^*(\mathbf{z}) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h}\, \Phi\!\left(\frac{\mathbf{z} - \mathbf{z}_i}{h}\right) \qquad (15) $$
where $\Phi$ is an appropriate kernel, for example a Gaussian, whose $L_1$ norm is 1, and where $h$ is a positive parameter that, for simplicity, we set to 1 from now on. If the joint probability $P(\mathbf{x}, y)$ in the expected risk (13) is approximated with a Parzen window estimator $P^*$, we obtain an approximated expression for the expected risk, $I^*[f]$, that can be explicitly minimized. In order to show how this can be done, we notice that we need to approximate the probability distribution $P(\mathbf{x}, y)$, and therefore the random variable $\mathbf{z}$ of equation (15) is $\mathbf{z} = (\mathbf{x}, y)$. Hence, we choose a kernel of the following form²:
$$ \Phi(\mathbf{z}) = K(\|\mathbf{x}\|)\, K(y) $$
where $K$ is a standard one-dimensional, symmetric kernel, like the Gaussian. The Parzen window estimator of $P(\mathbf{x}, y)$ is therefore:
$$ P^*(\mathbf{x}, y) = \frac{1}{N} \sum_{i=1}^{N} K(\|\mathbf{x} - \mathbf{x}_i\|)\, K(y - y_i) \qquad (16) $$
An approximation to the expected risk is therefore obtained as:
$$ I^*[f] = \frac{1}{N} \sum_{i=1}^{N} \int d\mathbf{x}\, dy\, K(\|\mathbf{x} - \mathbf{x}_i\|)\, K(y - y_i)\,(y - f(\mathbf{x}))^2. $$
In order to find an analytical expression for the minimum of $I^*[f]$ we impose the stationarity constraint:
$$ \frac{\delta I^*[f]}{\delta f(\mathbf{s})} = 0, $$
that leads to the following equation:
² Any kernel of the form $\Phi(\mathbf{z}) = K(\mathbf{x}, y)$ in which the function $K$ is even in each of the variables $\mathbf{x}$ and $y$ would lead to the same conclusions that we obtain for this choice.
$$ \sum_{i=1}^{N} \int d\mathbf{x}\, dy\, K(\|\mathbf{x} - \mathbf{x}_i\|)\, K(y - y_i)\,(y - f(\mathbf{x}))\, \delta(\mathbf{x} - \mathbf{s}) = 0. $$
Performing the integral over $\mathbf{x}$, and using the fact that $\|K\|_{L_1} = 1$, we obtain:
$$ f(\mathbf{x}) = \frac{\sum_{i=1}^{N} K(\|\mathbf{x} - \mathbf{x}_i\|) \int dy\, K(y - y_i)\, y}{\sum_{i=1}^{N} K(\|\mathbf{x} - \mathbf{x}_i\|)}. $$
Performing a change of variable in the integral of the previous expression and using the fact that the kernel $K$ is symmetric, we finally conclude that the function that minimizes the approximated expected risk is:
$$ f(\mathbf{x}) = \frac{\sum_{i=1}^{N} y_i\, K(\|\mathbf{x} - \mathbf{x}_i\|)}{\sum_{i=1}^{N} K(\|\mathbf{x} - \mathbf{x}_i\|)}. \qquad (17) $$
The right hand side of the equation converges to $f$ when the number of examples goes to infinity, provided that the scale factor $h$ tends to zero at an appropriate rate. This form of approximation is known as kernel regression, or the Nadaraya-Watson estimator, and it has been the subject of extensive study in the statistics community (Nadaraya, 1964; Watson, 1964; Rosenblatt, 1971; Priestley and Chao, 1972; Gasser and Muller, 1985; Devroye and Wagner, 1980). A similar derivation of equation (17) has been given by Specht (1991), but we should remark that this equation is usually derived in a different way, within the framework of locally weighted regression, assuming a locally constant model (Hardle, 1990) with a local weight function $K$.
Notice that this equation has the form of equation (12), in which the centers coincide with the examples, and the coefficients $c_i$ are simply the values $y_i$ of the function at the data points $\mathbf{x}_i$. On the other hand, the equation is an estimate of $f$ which is linear in the observations $y_i$ and therefore also has the general form of equation (7).
The Parzen window estimator, and therefore expression (17), can be derived in the framework of regularization theory (Vapnik and Stefanyuk, 1978; Aidu and Vapnik, 1989; Vapnik, 1982) under a smoothness assumption on the probability distribution that has to be estimated. This means that, in order to derive equation (17), a smoothness assumption has to be made on the joint probability distribution $P(\mathbf{x}, y)$, rather than on the regression function as in (2).
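For concreteness, a minimal sketch of the Nadaraya-Watson estimator (17) with a Gaussian kernel $K$ follows; names and the bandwidth value are our own choices, not the paper's.

```python
import numpy as np

def nadaraya_watson(x, X, y, h=0.1):
    # Equation (17): f(x) = sum_i y_i K(||x - x_i||) / sum_i K(||x - x_i||),
    # with a Gaussian kernel of bandwidth h.
    w = np.exp(-np.sum((X - x)**2, axis=-1) / (2 * h**2))
    return w @ y / w.sum()

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(100)
print(nadaraya_watson(np.array([0.25]), X, y))   # roughly sin(pi/2) = 1
```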
3 Classes of stabilizers
In the previous section we considered the class of stabilizers of the form:
$$ \phi[f] = \int_{R^d} d\mathbf{s}\, \frac{|\tilde{f}(\mathbf{s})|^2}{\tilde{G}(\mathbf{s})} \qquad (18) $$
and we have seen that the solution of the minimization problem always has the same form. In this section we discuss three different types of stabilizers belonging to the class (18), corresponding to different properties of the basis functions $G$. Each of them corresponds to different a priori assumptions on the smoothness of the function that must be approximated.
3.1 Radial stabilizers
Most of the commonly used stabilizers have radial symmetry, that is, they satisfy the following equation:
$$ \phi[f(\mathbf{x})] = \phi[f(R\mathbf{x})] $$
for any rotation matrix $R$. This choice reflects the a priori assumption that all the variables have the same relevance, and that there are no privileged directions. Rotation invariant stabilizers correspond to radial basis functions $G(\|\mathbf{x}\|)$. Much attention has
been dedicated to this case, and the corresponding approximation technique is known as
Radial Basis Functions (Powell, 1987, 1990; Franke, 1982, 1987; Micchelli, 1986; Kansa,
1990a,b; Madych and Nelson, 1990; Dyn, 1987, 1991; Hardy, 1971, 1990; Buhmann,
1990; Lancaster and Salkauskas, 1986; Broomhead and Lowe, 1988; Moody and Darken,
1988, 1989; Poggio and Girosi, 1990, 1990b; Girosi, 1992). The class of admissible Radial Basis Functions is the class of conditionally positive definite functions (Micchelli, 1986) of any order, since it has been shown (Madych and Nelson, 1990; Dyn, 1991) that in this case the functional of equation (18) is a semi-norm, and the associated variational problem is well defined. All the Radial Basis Functions can therefore be derived in this framework. We explicitly give two important examples.
Duchon multidimensional splines
Duchon (1977) considered measures of smoothness of the form
$$ \phi[f] = \int_{R^d} d\mathbf{s}\, \|\mathbf{s}\|^{2m}\, |\tilde{f}(\mathbf{s})|^2. $$
In this case $\tilde{G}(\mathbf{s}) = \frac{1}{\|\mathbf{s}\|^{2m}}$ and the corresponding basis function is therefore
$$ G(\mathbf{x}) = \begin{cases} \|\mathbf{x}\|^{2m-d} \ln \|\mathbf{x}\| & \text{if } 2m > d \text{ and } d \text{ is even} \\ \|\mathbf{x}\|^{2m-d} & \text{otherwise.} \end{cases} \qquad (19) $$
In this case the null space of $\phi[f]$ is the vector space of polynomials of degree at most $m$ in $d$ variables, whose dimension is
$$ k = \binom{d + m - 1}{d}. $$
These basis functions are radial and conditionally positive definite, so that they represent just particular instances of the well known Radial Basis Functions technique (Micchelli, 1986; Wahba, 1990). In two dimensions, for $m = 2$, equation (19) yields the so called "thin plate" basis function $G(\mathbf{x}) = \|\mathbf{x}\|^2 \ln \|\mathbf{x}\|$ (Harder and Desmarais, 1972; Grimson, 1982).
The Gaussian
A stabilizer of the form
$$ \phi[f] = \int_{R^d} d\mathbf{s}\, e^{\frac{\|\mathbf{s}\|^2}{\beta}}\, |\tilde{f}(\mathbf{s})|^2, $$
where $\beta$ is a fixed positive parameter, has $\tilde{G}(\mathbf{s}) = e^{-\frac{\|\mathbf{s}\|^2}{\beta}}$ and as basis function the Gaussian function (Poggio and Girosi, 1989; Yuille and Grzywacz, 1988). The Gaussian function is positive definite, and it is well known from the theory of reproducing kernels (Aronszajn, 1950) that positive definite functions (Stewart, 1976) can be used to define norms of the type (18). Since $\phi[f]$ is a norm, its null space contains only the zero element, and the additional null space terms of equation (3) are not needed, unlike in Duchon splines. A disadvantage of the Gaussian is the appearance of the scaling parameter $\beta$, while Duchon splines, being homogeneous functions, do not depend on any scaling parameter. However, it is possible to devise good heuristics that furnish sub-optimal, but still good, values of $\beta$, or good starting points for cross-validation procedures.
Other Basis Functions
Here we give a list of other functions that can be used as basis functions in the Radial Basis Functions technique, and that are therefore associated with the minimization of some functional. In the following table we indicate as "p.d." the positive definite functions, which do not need any polynomial term in the solution, and as "c.p.d. k" the conditionally positive definite functions of order $k$, which need a polynomial of degree $k$ in the solution. It is a well known fact that positive definite functions tend to zero at infinity, whereas conditionally positive definite functions tend to infinity.
$G(r) = e^{-r^2}$                      Gaussian, p.d.
$G(r) = \sqrt{r^2 + c^2}$              multiquadric, c.p.d. 1
$G(r) = \frac{1}{\sqrt{c^2 + r^2}}$    inverse multiquadric, p.d.
$G(r) = r^{2n+1}$                      thin plate splines, c.p.d. n
$G(r) = r^{2n} \ln r$                  thin plate splines, c.p.d. n
3.2 Tensor product stabilizers
An alternative to choosing a radial function $\tilde{G}(\mathbf{s})$ in the stabilizer (18) is a tensor product type of basis function, that is a function of the form
$$ \tilde{G}(\mathbf{s}) = \prod_{j=1}^{d} \tilde{g}(s_j) \qquad (20) $$
where $s_j$ is the $j$-th coordinate of the vector $\mathbf{s}$, and $\tilde{g}$ is an appropriate one-dimensional function. When $g$ is positive definite the functional $\phi[f]$ is clearly a norm and its null space is empty. In the case of a conditionally positive definite function the structure of the null space can be more complicated and we do not consider it here. Stabilizers with $\tilde{G}(\mathbf{s})$ as in equation (20) have the form
$$ \phi[f] = \int_{R^d} d\mathbf{s}\, \frac{|\tilde{f}(\mathbf{s})|^2}{\prod_{j=1}^{d} \tilde{g}(s_j)} $$
which leads to a tensor product basis function
$$ G(\mathbf{x}) = \prod_{j=1}^{d} g(x_j) $$
where $x_j$ is the $j$-th coordinate of the vector $\mathbf{x}$ and $g(x)$ is the Fourier transform of $\tilde{g}(s)$.
An interesting example is the one corresponding to the choice:
$$ \tilde{g}(s) = \frac{1}{1 + s^2}, $$
which leads to the basis function:
$$ G(\mathbf{x}) = \prod_{j=1}^{d} e^{-|x_j|} = e^{-\sum_{j=1}^{d} |x_j|} = e^{-\|\mathbf{x}\|_{L_1}}. $$
This basis function is interesting from the point of view of VLSI implementations, because it requires the computation of the $L_1$ norm of the input vector $\mathbf{x}$, which is usually easier to compute than the Euclidean norm $L_2$. However, this basis function is not very smooth, and its performance in practical cases should first be tested experimentally. Notice that if the approximation is needed for computing derivatives, smoothness of an appropriate degree is clearly a necessary requirement (see Poggio, Voorhees and Yuille, 1988).
We notice that the choice $\tilde{g}(s) = e^{-s^2}$ leads again to the Gaussian basis function $G(\mathbf{x}) = e^{-\|\mathbf{x}\|^2}$.
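As a small illustration (our own, with unit scale parameters assumed), the two tensor product basis functions just mentioned differ only in which norm of $\mathbf{x}$ enters the exponent:

```python
import numpy as np

def laplacian_tensor_basis(x):
    # G(x) = prod_j exp(-|x_j|) = exp(-||x||_1), from g~(s) = 1/(1+s^2)
    return np.exp(-np.sum(np.abs(x), axis=-1))

def gaussian_tensor_basis(x):
    # G(x) = prod_j exp(-x_j^2) = exp(-||x||_2^2), from g~(s) = exp(-s^2)
    return np.exp(-np.sum(x**2, axis=-1))

x = np.array([0.3, -0.4])
print(laplacian_tensor_basis(x), gaussian_tensor_basis(x))
```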
3.3 Additive stabilizers
We have seen in the previous section how some tensor product approximation schemes can be derived in the framework of regularization theory. We will now see that it is also possible to derive the class of additive approximation schemes in the same framework, where by additive approximation we mean an approximation of the form
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{d} f_\alpha(x^\alpha) \qquad (21) $$
where $x^\alpha$ is the $\alpha$-th component of the input vector $\mathbf{x}$ and the $f_\alpha$ are one-dimensional functions that will be defined as the additive components of $f$ (from now on Greek letter indices will be used in association with components of the input vectors). Additive models are well known in statistics (Hastie and Tibshirani, 1986, 1987, 1990; Stone, 1985; Wahba, 1990; Buja, Hastie and Tibshirani, 1989) and can be considered as a generalization of linear models. They are appealing because, being essentially a superposition of one-dimensional functions, they have a low complexity, and they share with linear models the feature that the effects of the different variables can be examined separately.
The simplest way to obtain such an approximation scheme is to choose, if possible, a stabilizer that corresponds to an additive basis function:
$$ G(\mathbf{x}) = \sum_{\alpha=1}^{d} \theta_\alpha\, g(x^\alpha) \qquad (22) $$
where $\theta_\alpha$ are certain fixed parameters. Such a choice would lead to an approximation scheme of the form (21) in which the additive components $f_\alpha$ have the form:
$$ f_\alpha(x^\alpha) = \theta_\alpha \sum_{i=1}^{N} c_i\, g(x^\alpha - x_i^\alpha) \qquad (23) $$
Notice that the additive components are not independent at this stage, since there is only one set of coefficients $c_i$. We postpone the discussion of this point to section (4.2).
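A minimal sketch (ours, with illustrative parameter values) of evaluating an additive scheme of the form (21)-(23), with Gaussian one-dimensional components and a single shared set of coefficients:

```python
import numpy as np

def additive_rbf(x, X, c, theta, sigma=0.2):
    # f(x) = sum_alpha theta_alpha * sum_i c_i g(x^alpha - x_i^alpha),
    # with g a one-dimensional Gaussian; the coefficients c are shared by
    # all additive components, as in equation (23).
    g = np.exp(-((x[None, :] - X)**2) / sigma**2)   # shape (N, d)
    return float(theta @ (g.T @ c))

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(20, 2))                 # 20 centers in 2 dimensions
c = rng.standard_normal(20)
theta = np.array([1.0, 0.5])
print(additive_rbf(np.array([0.3, 0.8]), X, c, theta))
```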
We would like then to write stabilizers corresponding to the basis function (22) in the form (18), where $\tilde{G}(\mathbf{s})$ is the Fourier transform of $G(\mathbf{x})$. We notice that the Fourier transform of an additive function like the one in equation (22) exists only in the generalized sense (Gelfand and Shilov, 1964), involving the $\delta$ distribution. For example, in two dimensions we obtain
$$ \tilde{G}(\mathbf{s}) = \theta_x\, \tilde{g}(s_x)\, \delta(s_y) + \theta_y\, \tilde{g}(s_y)\, \delta(s_x) \qquad (24) $$
and the interpretation of the reciprocal of this expression is delicate. However, almost additive basis functions can be obtained if we approximate the delta functions in equation (24) with Gaussians of very small variance. Consider, for example in two dimensions, the stabilizer:
$$ \phi[f] = \int_{R^2} d\mathbf{s}\, \frac{|\tilde{f}(\mathbf{s})|^2}{\theta_x\, \tilde{g}(s_x)\, e^{-\frac{s_y^2}{2\epsilon^2}} + \theta_y\, \tilde{g}(s_y)\, e^{-\frac{s_x^2}{2\epsilon^2}}} \qquad (25) $$
This corresponds to a basis function of the form:
$$ G(x, y) = \theta_x\, g(x)\, e^{-\frac{\epsilon^2 y^2}{2}} + \theta_y\, g(y)\, e^{-\frac{\epsilon^2 x^2}{2}}. \qquad (26) $$
In the limit of $\epsilon$ going to zero the denominator in expression (25) approaches equation (24), and the basis function (26) approaches a basis function that is the sum of one-dimensional basis functions. In this paper we do not discuss this limit process in a rigorous way. Instead we outline another way to obtain additive approximations in the framework of regularization theory.

Let us assume that we know a priori that the function $f$ that we want to approximate is additive, that is:
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{d} f_\alpha(x^\alpha) $$
We then apply the regularization approach and impose a smoothness constraint, not on the function $f$ as a whole, but on each single additive component, through a regularization functional of the form (Wahba, 1990; Hastie and Tibshirani, 1990):
$$ H[f] = \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + \lambda \sum_{\alpha=1}^{d} \frac{1}{\theta_\alpha} \int_{R} ds\, \frac{|\tilde{f}_\alpha(s)|^2}{\tilde{g}(s)} $$
where $\theta_\alpha$ are given positive parameters which allow us to impose different degrees of smoothness on the different additive components. The minimizer of this functional is found with the same technique described in appendix (A), and, skipping null space terms, it has the usual form
$$ f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\mathbf{x} - \mathbf{x}_i) \qquad (27) $$
where
$$ G(\mathbf{x} - \mathbf{x}_i) = \sum_{\alpha=1}^{d} \theta_\alpha\, g(x^\alpha - x_i^\alpha), $$
as in equation (22).
We notice that the additive component of equation (27) can be written as
$$ f_\alpha(x^\alpha) = \sum_{i=1}^{N} c_i^\alpha\, g(x^\alpha - x_i^\alpha) $$
where we have defined
$$ c_i^\alpha = c_i\, \theta_\alpha. $$
The additive components are therefore not independent, because the parameters $\theta_\alpha$ are fixed. If the $\theta_\alpha$ were free parameters, the coefficients $c_i^\alpha$ would be independent, as well as the additive components.
Notice that the two ways we have outlined for deriving additive approximation from regularization theory are equivalent. They both start from a priori assumptions of additivity and smoothness of the class of functions to be approximated. In the first technique the two assumptions are woven together in the choice of the stabilizer (equation 25); in the second they are made explicit and exploited sequentially.
4 Extensions: from Regularization Networks to Generalized Regularization Networks
In this section we will first review some extensions of regularization networks, and then will apply them to Radial Basis Functions and to additive splines.
A fundamental problem in almost all practical applications in learning and pattern
recognition is the choice of the relevant input variables. It may happen that some of the
variables are more relevant than others, that some variables are just totally irrelevant, or
that the relevant variables are linear combinations of the original ones. It can therefore
be useful to work not with the original set of variables $\mathbf{x}$, but with a linear transformation of them, $W\mathbf{x}$, where $W$ is a possibly rectangular matrix. In the framework of regularization theory, this can be taken into account by making the assumption that the approximating function $f$ has the form $f(\mathbf{x}) = F(W\mathbf{x})$ for some smooth function $F$. The smoothness assumption is now made directly on $F$, through a smoothness functional $\phi[F]$ of the form (18). The regularization functional is expressed in terms of $F$ as
$$ H[F] = \sum_{i=1}^{N} (y_i - F(\mathbf{z}_i))^2 + \lambda \phi[F] $$
where $\mathbf{z}_i = W\mathbf{x}_i$. The function that minimizes this functional is clearly, according to the results of section (2), of the form:
$$ F(\mathbf{z}) = \sum_{i=1}^{N} c_i\, G(\mathbf{z} - \mathbf{z}_i) $$
(plus possibly a polynomial in $\mathbf{z}$). Therefore the solution for $f$ is:
$$ f(\mathbf{x}) = F(W\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(W\mathbf{x} - W\mathbf{x}_i) \qquad (28) $$
This argument is rigorous for given and known $W$, as in the case of classical Radial Basis Functions. Usually the matrix $W$ is unknown, and it must be estimated from the examples. Estimating both the coefficients $c_i$ and the matrix $W$ by least squares is
usually not a good idea, since we would end up trying to estimate a number of parameters
that is larger than the number of data points (though one may use regularized least
squares). Therefore, it has been proposed (Moody and Darken, 1988, 1989; Broomhead
and Lowe, 1988; Poggio and Girosi, 1989, 1990) to replace the approximation scheme of
equation (28) with a similar one, in which the basic shape of the approximation scheme
is retained, but the number of basis functions is decreased. The resulting approximating
function that we call the Generalized Regularization Network (GRN) is:
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, G(W\mathbf{x} - W\mathbf{t}_\alpha) \qquad (29) $$
where $n < N$ and the centers $\mathbf{t}_\alpha$ are chosen according to some heuristic, or are considered as free parameters (Moody and Darken, 1988, 1989; Poggio and Girosi, 1989, 1990). The coefficients $c_\alpha$, the elements of the matrix $W$, and possibly the centers $\mathbf{t}_\alpha$, are estimated according to a least squares criterion. The elements of the matrix $W$ could
also be estimated through cross-validation (Allen, 1974; Wahba and Wold, 1975; Golub,
Heath and Wahba, 1979; Craven and Wahba, 1979; Utreras, 1979; Wahba, 1985), which
may be a formally more appropriate technique.
In the special case in which the matrix $W$ and the centers $\mathbf{t}_\alpha$ are kept fixed, the resulting technique is one originally proposed by Broomhead and Lowe (1988), and the coefficients satisfy the following linear equation:
$$ G^T G\, \mathbf{c} = G^T \mathbf{y}, $$
where we have defined the following vectors and matrices:
$$ (\mathbf{y})_i = y_i, \quad (\mathbf{c})_\alpha = c_\alpha, \quad (G)_{i\alpha} = G(W\mathbf{x}_i - W\mathbf{t}_\alpha). $$
This technique, which has become quite common in the neural network community,
has the advantage of retaining the form of the regularization solution, while being less
complex to compute. A complete theoretical analysis has not yet been given, but some
results, in the case in which the matrix $W$ is set to the identity, are already available
(Sivakumar and Ward, 1991; Poggio and Girosi, 1989).
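As a simple sketch of this special case (fixed $W$ and centers), the coefficients can be obtained by ordinary least squares on the $n$-column design matrix; the example below uses a Gaussian basis function and randomly chosen centers, both of which are our own illustrative choices.

```python
import numpy as np

def grn_design_matrix(X, T, W, sigma=0.5):
    # (G)_{i,alpha} = G(W x_i - W t_alpha), here with a Gaussian G
    Z, U = X @ W.T, T @ W.T
    sq = np.sum((Z[:, None, :] - U[None, :, :])**2, axis=-1)
    return np.exp(-sq / sigma**2)

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]**2
T = X[rng.choice(200, size=20, replace=False)]   # n = 20 centers drawn from the data
W = np.eye(2)                                    # W kept fixed (identity here)
G = grn_design_matrix(X, T, W)
c, *_ = np.linalg.lstsq(G, y, rcond=None)        # solves G^T G c = G^T y
print(np.mean((G @ c - y)**2))                   # training error
```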
The next sections discuss approximation schemes of the form (29) in the cases of
radial and additive basis functions.
4.1 Extensions of Radial Basis Functions
In the case in which the basis function is radial, the approximation scheme of equation (29) becomes:
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, G(\|\mathbf{x} - \mathbf{t}_\alpha\|_W) $$
where we have defined the weighted norm:
$$ \|\mathbf{x}\|_W^2 \equiv \mathbf{x}^T W^T W \mathbf{x}. \qquad (30) $$
The basis functions of equation (29) are not radial anymore, or, more precisely, they are radial in the metric defined by equation (30). This means that the level curves of the basis functions are not circles, but ellipses, whose axes do not need to be aligned with the coordinate axes. Notice that in this case what is important is not the matrix $W$ itself, but rather the symmetric matrix $W^T W$. Therefore, by the Cholesky decomposition, it is sufficient to consider $W$ to be upper triangular. The optimal center locations $\mathbf{t}_\alpha$ satisfy the following set of nonlinear equations (Poggio and Girosi, 1990, 1990a):
$$ \mathbf{t}_\alpha = \frac{\sum_i P_i^\alpha\, \mathbf{x}_i}{\sum_i P_i^\alpha}, \qquad \alpha = 1, \ldots, n \qquad (31) $$
where the $P_i^\alpha$ are coefficients that depend on all the parameters of the network and are not necessarily positive. The optimal centers are then a weighted sum of the example points. Thus in some cases it may be more efficient to "move" the coefficients $P_i^\alpha$ rather than the components of $\mathbf{t}_\alpha$ (for instance when the dimensionality of the inputs is high relative to the number of data points).
The approximation scheme defined by equation (29) has been discussed in detail in Poggio and Girosi (1990) and Girosi (1992), so we will not discuss it further. In the next section we will consider its analogue in the case of additive basis functions.
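A brief sketch of how a Hyper Basis Function unit with the weighted norm (30) can be evaluated, with a Gaussian G and an upper-triangular W; names and values are illustrative assumptions of ours, not the paper's.

```python
import numpy as np

def hbf(x, centers, coeffs, W, sigma=1.0):
    # f(x) = sum_alpha c_alpha G(||x - t_alpha||_W), with
    # ||x||_W^2 = x^T W^T W x and a Gaussian G
    diff = (x[None, :] - centers) @ W.T          # W(x - t_alpha) for each center
    r2 = np.sum(diff**2, axis=-1)                # weighted squared norms
    return coeffs @ np.exp(-r2 / sigma**2)

W = np.array([[1.0, 0.3],                        # upper triangular: only W^T W matters
              [0.0, 2.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
coeffs = np.array([1.0, -0.5])
print(hbf(np.array([0.5, 0.2]), centers, coeffs, W))
```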
4.2 Extensions of additive splines
In the previous sections we have seen an extension of the classical regularization technique. In this section we derive the form that this extension takes when applied to
additive splines. The resulting scheme is very similar to Projection Pursuit Regression
(Friedman and Stuezle, 1981; Huber, 1985; Diaconis and Freedman, 1984; Donoho and
Johnstone, 1989; Moody and Yarvin, 1991).
We start from the "classical" additive spline, derived from regularization in section (3.3):
$$ f(\mathbf{x}) = \sum_{i=1}^{N} c_i \sum_{\alpha=1}^{d} \theta_\alpha\, G(x^\alpha - x_i^\alpha) \qquad (32) $$
In this scheme the smoothing parameters $\theta_\alpha$ should be known, or can be estimated by cross-validation. An alternative to cross-validation is to consider the parameters $\theta_\alpha$ as free parameters, and to estimate them with a least squares technique together with the coefficients $c_i$. If the parameters $\theta_\alpha$ are free, the approximation scheme of equation (32) becomes the following:
$$ f(\mathbf{x}) = \sum_{i=1}^{N} \sum_{\alpha=1}^{d} c_i^\alpha\, G(x^\alpha - x_i^\alpha) $$
where the coefficients $c_i^\alpha$ are now independent. Of course, now we must estimate $N \times d$ coefficients instead of just $N$, and we are likely to encounter an overfitting problem. We then adopt the same idea presented in section (4), and consider an approximation scheme of the form
$$ f(\mathbf{x}) = \sum_{\beta=1}^{n} \sum_{\alpha=1}^{d} c_\beta^\alpha\, G(x^\alpha - t_\beta^\alpha), \qquad (33) $$
in which the number of centers is smaller than the number of examples, reducing the number of coefficients that must be estimated. We notice that equation (33) can be written as
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{d} f_\alpha(x^\alpha) $$
where each additive component has the form:
$$ f_\alpha(x^\alpha) = \sum_{\beta=1}^{n} c_\beta^\alpha\, G(x^\alpha - t_\beta^\alpha). $$
Therefore another advantage of this technique is that the additive components are now independent, each of them being a one-dimensional Radial Basis Function expansion.
We can now use the same argument from section (4) to introduce a linear transformation of the inputs $\mathbf{x} \to W\mathbf{x}$, where $W$ is a $d' \times d$ matrix. Calling $\mathbf{w}_\alpha$ the $\alpha$-th row of $W$, and performing the substitution $\mathbf{x} \to W\mathbf{x}$ in equation (33), we obtain
$$ f(\mathbf{x}) = \sum_{\beta=1}^{n} \sum_{\alpha=1}^{d'} c_\beta^\alpha\, G(\mathbf{w}_\alpha \cdot \mathbf{x} - t_\beta^\alpha). \qquad (34) $$
We now define the following one-dimensional function:
$$ h_\alpha(y) = \sum_{\beta=1}^{n} c_\beta^\alpha\, G(y - t_\beta^\alpha) $$
and rewrite the approximation scheme of equation (34) as
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{d'} h_\alpha(\mathbf{w}_\alpha \cdot \mathbf{x}). \qquad (35) $$
Notice the similarity between equation (35) and the Projection Pursuit Regression technique: in both schemes the unknown function is approximated by a linear superposition of one-dimensional functions of variables which are projections of the original variables on certain vectors that have been estimated. In Projection Pursuit Regression the choice of the functions $h_\alpha(y)$ is left to the user. In our case the $h_\alpha$ are one-dimensional Radial Basis Functions, for example cubic splines or Gaussians. The choice depends, strictly speaking, on the specific prior, that is, on the specific smoothness assumptions made by the user. Interestingly, in many applications of Projection Pursuit Regression the functions $h_\alpha$ have indeed been chosen to be cubic splines, but other choices are Flexible Fourier Series, rational approximations and orthogonal polynomials (see Moody and Yarvin, 1991).
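A sketch of the ridge scheme (34)-(35) with Gaussian one-dimensional basis functions; the shapes, names and values below are our own illustrative choices.

```python
import numpy as np

def ridge_model(x, W, C, T, sigma=0.5):
    # f(x) = sum_alpha h_alpha(w_alpha . x), with
    # h_alpha(y) = sum_beta C[alpha, beta] * G(y - T[alpha, beta]),
    # G a one-dimensional Gaussian (equations (34)-(35)).
    y = W @ x                                        # projections w_alpha . x, shape (d',)
    G = np.exp(-((y[:, None] - T)**2) / sigma**2)    # shape (d', n)
    return float(np.sum(C * G))

d_prime, n = 3, 4
rng = np.random.default_rng(5)
W = rng.standard_normal((d_prime, 2))                # d' projection directions in R^2
C = rng.standard_normal((d_prime, n))                # coefficients c_beta^alpha
T = rng.uniform(-1, 1, size=(d_prime, n))            # centers t_beta^alpha
print(ridge_model(np.array([0.2, -0.7]), W, C, T))
```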
Let us briefly review the steps that bring us from the classical additive approximation scheme of equation (23) to a Projection Pursuit Regression-like type of approximation:
1. the regularization parameters $\theta_\alpha$ of the classical approximation scheme (23) are considered as free parameters;
2. the number of centers is chosen to be smaller than the number of data points;
3. the true relevant variables are assumed to be some unknown linear combination of the original variables.
We notice that in the extreme case in which each additive component has just one center ($n = 1$), the approximation scheme of equation (34) becomes:
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{d'} c_\alpha\, G(\mathbf{w}_\alpha \cdot \mathbf{x} - t_\alpha). \qquad (36) $$
When the basis function $G$ is a Gaussian we call – somewhat improperly – a network of this type a Gaussian Multilayer Perceptron (MLP) Network, because if $G$ were a threshold or sigmoidal function this would be a Multilayer Perceptron with one layer of hidden units. The sigmoidal functions, typically used instead of the threshold, cannot be derived directly from regularization theory because they are not symmetric, but we will see in section (6) the relationship between a sigmoidal function and the absolute value function, which is a basis function that can be derived from regularization.
There are a number of computational issues related to how to find the parameters of an approximation scheme like the one of equation (34), but we do not discuss them here. We present instead, in section (7), some experimental results, and will describe the algorithm used to obtain them.
5 The Bayesian interpretation of Generalized Regularization Networks
It is well known that a variational principle such as equation (1) can be derived not
only in the context of functional analysis (Tikhonov and Arsenin, 1977), but also in a
probabilistic framework (Kimeldorf and Wahba, 1971; Wahba, 1980, 1990; Poggio, Torre
and Koch, 1985; Marroquin, Mitter and Poggio, 1987; Bertero, Poggio and Torre, 1988).
In this section we illustrate this connection informally, without addressing the related
mathematical issues.
Suppose that the set $g = \{(\mathbf{x}_i, y_i) \in R^n \times R\}_{i=1}^{N}$ of data has been obtained by random sampling a function $f$, defined on $R^n$, in the presence of noise, that is
$$ f(\mathbf{x}_i) = y_i + \epsilon_i, \quad i = 1, \ldots, N \qquad (37) $$
where $\epsilon_i$ are random independent variables with a given distribution. We are interested in recovering the function $f$, or an estimate of it, from the set of data $g$. We take a probabilistic approach, and regard the function $f$ as the realization of a random field with a known prior probability distribution. Let us define:

- $P[f|g]$ as the conditional probability of the function $f$ given the examples $g$.
- $P[g|f]$ as the conditional probability of $g$ given $f$. If the function underlying the data is $f$, this is the probability that by random sampling the function $f$ at the sites $\{\mathbf{x}_i\}_{i=1}^{N}$ the set of measurements $\{y_i\}_{i=1}^{N}$ is obtained. This is therefore a model of the noise.
- $P[f]$: the a priori probability of the random field $f$. This embodies our a priori knowledge of the function, and can be used to impose constraints on the model, assigning significant probability only to those functions that satisfy those constraints.
Assuming that the probability distributions $P[g|f]$ and $P[f]$ are known, the posterior distribution $P[f|g]$ can now be computed by applying Bayes' rule:
$$ P[f|g] \propto P[g|f]\, P[f]. \qquad (38) $$
We now make the assumption that the noise variables in equation (37) are normally distributed, with variance $\sigma^2$. Therefore the probability $P[g|f]$ can be written as:
$$ P[g|f] \propto e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2} $$
where $\sigma^2$ is the variance of the noise.
The model for the prior probability distribution $P[f]$ is chosen in analogy with the discrete case (when the function $f$ is defined on a finite subset of an $n$-dimensional lattice) for which the problem can be formalized (see for instance Marroquin, Mitter and Poggio, 1987). The prior probability $P[f]$ is written as
$$ P[f] \propto e^{-\alpha \phi[f]} \qquad (39) $$
where $\phi[f]$ is a smoothness functional of the type described in section (3) and $\alpha$ is a positive real number. This form of probability distribution gives high probability only to those functions for which the term $\phi[f]$ is small, and embodies the a priori knowledge that one has about the system.
Following Bayes' rule (38) the a posteriori probability of $f$ is written as
$$ P[f|g] \propto e^{-\frac{1}{2\sigma^2}\left[ \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + 2\sigma^2 \alpha\, \phi[f] \right]}. \qquad (40) $$
One simple estimate of the function $f$ from the probability distribution (40) is the so called MAP (Maximum A Posteriori) estimate, that considers the function that maximizes the a posteriori probability $P[f|g]$, and therefore minimizes the exponent in equation (40). The MAP estimate of $f$ is therefore the minimizer of the following functional:
$$ H[f] = \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + \lambda \phi[f] $$
where $\lambda = 2\sigma^2\alpha$. This functional is the same as that of equation (1), and from here it is clear that the parameter $\lambda$, which is usually called the "regularization parameter", determines the trade-off between the level of the noise and the strength of the a priori assumptions about the solution, therefore controlling the compromise between the degree of smoothness of the solution and its closeness to the data. Notice that functionals of the type (39) are common in statistical physics (Parisi, 1988), where $\phi[f]$ plays the role of an energy functional. It is interesting to notice that, in that case, the correlation function of the physical system described by $\phi[f]$ is the basis function $G(\mathbf{x})$.
As we have pointed out (Poggio and Girosi, 1989; Rivest, pers. comm.), prior probabilities can also be seen as a measure of complexity, assigning high complexity to the
functions with small probability. It has been proposed by Rissanen (1978) to measure
the complexity of a hypothesis in terms of the bit length needed to encode it. It turns
out that the MAP estimate mentioned above is closely related to the Minimum Description Length Principle: the hypothesis f which for given g can be described in the most
compact way is chosen as the \best" hypothesis. Similar ideas have been explored by
others (for instance Solomonoff, 1978). They connect data compression and coding with
Bayesian inference, regularization, function approximation and learning.
6 Additive splines, hinge functions, sigmoidal neural nets
In the previous sections we have shown how to extend RN to schemes that we have called GRN, which include ridge approximation schemes of the PPR type, that is
$$ f(\mathbf{x}) = \sum_{\alpha=1}^{d'} h_\alpha(\mathbf{w}_\alpha \cdot \mathbf{x}), $$
where
$$ h_\alpha(y) = \sum_{\beta=1}^{n} c_\beta^\alpha\, G(y - t_\beta^\alpha). $$
The form of the basis function $G$ depends on the stabilizer, and a list of "admissible" $G$ has been given in section (3). These include the absolute value $G(x) = |x|$ – corresponding to piecewise linear splines – and the function $G(x) = |x|^3$ – corresponding to cubic splines (used in typical implementations of PPR) – as well as Gaussian functions. Though it may seem natural to think that sigmoidal multilayer perceptrons may be included in this framework, it is actually impossible to derive directly from regularization principles the sigmoidal activation functions typically used in Multilayer Perceptrons.
In the following section we show, however, that there is a close relationship between
basis functions of the hinge, the sigmoid and the Gaussian type.
6.1 From additive splines to ramp and hinge functions
We will consider here the one-dimensional case, since multidimensional additive approximations consist of one-dimensional terms. We consider the approximation with the lowest possible degree of smoothness: piecewise linear. The associated basis function $G(x) = |x|$ is shown in figure 2a, and the associated stabilizer is given by
$$ \phi[f] = \int_{-\infty}^{\infty} ds\, s^2\, |\tilde{f}(s)|^2. $$
This assumption thus leads to approximating a one-dimensional function as the linear combination, with appropriate coefficients, of translates of $|x|$. It is easy to see that a linear combination of two translates of $|x|$ with appropriate coefficients (positive and negative and equal in absolute value) yields the piecewise linear threshold function $L(x)$ also shown in figure 2b. Linear combinations of translates of such functions can be used to approximate one-dimensional functions. A similar derivative-like linear combination of two translates of $L(x)$ with appropriate coefficients yields the Gaussian-like function $g_L(x)$ also shown in figure 2c. Linear combinations of translates of this function can also be used for the approximation of a function. Thus any given approximation in terms of $g_L(x)$ can be rewritten in terms of $L(x)$, and the latter can in turn be expressed in terms of the basis function $|x|$.

Notice that the basis function $|x|$ underlies the "hinge" technique proposed by Breiman (1993), whereas the basis functions $L(x)$ are sigmoidal-like and the $g_L(x)$ are Gaussian-like. The arguments above show the close relations between all of them, despite the fact that only $|x|$ is strictly a "legal" basis function from the point of view of regularization ($g_L(x)$ is not, though the very similar but smoother Gaussian is). Notice also that $|x|$ can be expressed in terms of "ramp" functions, that is $|x| = x^+ + x^-$. Thus a one-hidden-layer perceptron using the activation function $L(x)$ can be rewritten in terms of a generalized regularization network with basis function $|x|$. The equivalent kernel is effectively local only if there exists a sufficient number of centers for each dimension $(\mathbf{w}_\alpha \cdot \mathbf{x})$. This is the case for Projection Pursuit Regression but not for usual one-hidden-layer perceptrons.
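A small sketch (with our own normalization and translate spacing) of how the piecewise linear "sigmoidal" $L(x)$ and the "Gaussian-like" $g_L(x)$ can be built from translates of $|x|$:

```python
import numpy as np

def L(x, a=1.0):
    # piecewise linear threshold ("sigmoidal-like"): difference of two translates of |x|
    return (np.abs(x + a) - np.abs(x - a)) / (2 * a)

def g_L(x, a=1.0):
    # "Gaussian-like" bump: difference of two translates of L(x)
    return (L(x + a, a) - L(x - a, a)) / 2

x = np.linspace(-4, 4, 9)
print(L(x))     # ramps from -1 to 1 and saturates
print(g_L(x))   # bell-shaped, nonzero only near the origin
```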
These relationships imply that it may be interesting to compare how well each of these basis functions is able to approximate some simple function. To do this we used the model $f(x) = \sum_{\beta=1}^{n} c_\beta\, G(w_\beta x - t_\beta)$ to approximate the function $h(x) = \sin(2\pi x)$ on $[0, 1]$, where $G(x)$ is one of the basis functions of figure 2. Fifty training points and 10,000 test points were chosen uniformly on $[0, 1]$. The parameters were learned using the iterative backfitting algorithm (Friedman and Stuezle, 1981; Hastie and Tibshirani, 1990; Breiman, 1993) that will be described in section 7. We looked at the function learned after fitting 1, 2, 4, 8 and 16 basis functions. Some of the resulting approximations are plotted in figure 3.
The results show that the performance of all three basis functions is fairly close as the number of basis functions increases. All models did a good job of approximating $\sin(2\pi x)$. The absolute value function did slightly worse and the "Gaussian" function did slightly better. It is interesting that the approximation using two absolute value functions is almost identical to the approximation using one "sigmoidal" function, which again shows that two absolute value basis functions can sum to equal one "sigmoidal" piecewise linear function.
7 Numerical illustrations
7.1 Comparing additive and non-additive models
In order to illustrate some of the ideas presented in this paper and to provide some practical intuition about the various models, we present numerical experiments comparing the performance of additive and non-additive networks on two-dimensional problems. A model consisting of a sum of two-dimensional Gaussians can be changed from a non-additive Radial Basis Function network to an additive network by "elongating" the Gaussians along the two coordinate axes $x$ and $y$. This allows us to measure the performance of a network as it changes from a non-additive scheme to an additive one.
Five different models were tested. The first three differ only in the variances of the Gaussian along the two coordinate axes. The ratio of the $x$ variance to the $y$ variance determines the elongation of the Gaussian. These models all have the same form and can be written as:
$$ f(\mathbf{x}) = \sum_{i=1}^{N} c_i \left[ G_1(\mathbf{x} - \mathbf{x}_i) + G_2(\mathbf{x} - \mathbf{x}_i) \right] $$
where
$$ G_1 = e^{-\left(\frac{x^2}{\sigma_1^2} + \frac{y^2}{\sigma_2^2}\right)}, \qquad G_2 = e^{-\left(\frac{x^2}{\sigma_2^2} + \frac{y^2}{\sigma_1^2}\right)}. $$
The models differ only in the values of $\sigma_1$ and $\sigma_2$. For the first model, $\sigma_1 = 0.5$ and $\sigma_2 = 0.5$ (RBF), for the second model $\sigma_1 = 10$ and $\sigma_2 = 0.5$ (elliptical Gaussian), and for the third model, $\sigma_1 = \infty$ and $\sigma_2 = 0.5$ (additive). These models correspond to placing two Gaussians at each data point $\mathbf{x}_i$, with one Gaussian elongated in the $x$ direction and one elongated in the $y$ direction.
Model 1:  $f(x,y) = \sum_{i=1}^{20} c_i\,\left[\, e^{-\left(\frac{(x-x_i)^2}{\sigma_1^2}+\frac{(y-y_i)^2}{\sigma_2^2}\right)} + e^{-\left(\frac{(x-x_i)^2}{\sigma_2^2}+\frac{(y-y_i)^2}{\sigma_1^2}\right)}\,\right]$      $\sigma_1 = \sigma_2 = 0.5$
Model 2:  same form as Model 1      $\sigma_1 = 10,\ \sigma_2 = 0.5$
Model 3:  $f(x,y) = \sum_{i=1}^{20} c_i\,\left[\, e^{-\frac{(x-x_i)^2}{\sigma^2}} + e^{-\frac{(y-y_i)^2}{\sigma^2}}\,\right]$      $\sigma = 0.5$
Model 4:  $f(x,y) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-(\mathbf{w}_\alpha\cdot\mathbf{x} - t_\alpha)^2}$      -
Model 5:  $f(x,y) = \sum_{\alpha=1}^{n} c_\alpha\, \sigma(\mathbf{w}_\alpha\cdot\mathbf{x} - t_\alpha)$      -

Table 1: The five models we tested in our numerical experiments.
In the first case (RBF) there is no elongation, in the second case (elliptical Gaussian) there is moderate elongation, and in the last case (additive) there is infinite elongation.
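To make the three elongation regimes concrete, the following sketch (our own illustration; the training data, random seed and use of a least-squares solver are assumptions, since in the paper the coefficients are computed by solving the linear system of equations (4)) builds the sum-of-two-Gaussians model for the RBF, elliptical and additive settings:

```python
import numpy as np

def design_matrix(X, centers, s1, s2):
    """Sum of two elongated Gaussians per center, as in models 1-3.
    X, centers: arrays of shape (num_points, 2) and (num_centers, 2).
    s1, s2: the two widths sigma_1, sigma_2; s1 = np.inf gives the additive limit."""
    dx = X[:, None, 0] - centers[None, :, 0]
    dy = X[:, None, 1] - centers[None, :, 1]
    g1 = np.exp(-(dx**2 / s1**2 + dy**2 / s2**2))
    g2 = np.exp(-(dx**2 / s2**2 + dy**2 / s1**2))
    return g1 + g2

# Illustrative data (the paper uses 20 training points of h_add or the Gabor function).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 4 * (X[:, 1] - 0.5) ** 2

for s1, label in [(0.5, "RBF"), (10.0, "elliptical"), (np.inf, "additive")]:
    G = design_matrix(X, X, s1, 0.5)           # centers fixed at the data points
    c, *_ = np.linalg.lstsq(G, y, rcond=None)  # least-squares coefficients
    print(label, "training RMS error:", np.sqrt(np.mean((G @ c - y) ** 2)))
```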
The fourth model is a Generalized Regularization Network model, of the form (36), that uses a Gaussian basis function:

$$f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-(\mathbf{w}_\alpha\cdot\mathbf{x} - t_\alpha)^2}\,.$$

In this model, to which we referred earlier as a Gaussian MLP network (equation 36), the weight vectors, centers, and coefficients are all learned.
In order to see how sensitive the performances were to the choice of basis function, we also repeated the experiments for model (4) with a sigmoid (which is not a basis function that can be derived from regularization theory) replacing the Gaussian basis function. In our experiments we used the standard sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}\,.$$
Models (1) to (5) are summarized in table 1: notice that only model (5) is a Multilayer
Perceptron in the standard sense.
In the first three models, the centers were fixed in the learning algorithm and equal to the training examples. The only parameters that were learned were the coefficients $c_i$, which were computed by solving the linear system of equations (4). The fourth and the fifth model were trained by fitting one basis function at a time according to the following recursive algorithm with backfitting (Friedman and Stuetzle, 1981; Hastie and Tibshirani, 1990; Breiman, 1993):
• Add a new basis function;
• Optimize the parameters $w_\alpha$, $t_\alpha$ and $c_\alpha$ using the "random step" algorithm (Caprile and Girosi, 1990) described below;
• Backfitting: for each basis function added so far:
  - hold the parameters of all other functions fixed;
  - re-optimize the parameters of that function;
• Repeat the backfitting stage until there is no significant decrease in $L_2$ error.
The "random step" (Caprile and Girosi, 1990) is a stochastic optimization algorithm that is very simple to implement and that usually finds good local minima. The algorithm works as follows: pick random changes to each parameter such that each random change lies within some interval [a, b]. Add the random changes to each parameter and then calculate the new error between the output of the network and the target values. If the error decreases, then keep the changes and double the length of the interval for picking random changes. If the error increases, then throw out the changes and halve the size of the interval. If the length of the interval becomes less than some threshold, then reset the length of the interval to some larger value.
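The random-step procedure can be sketched as follows (a minimal implementation of the description above; the symmetric step interval, iteration count, threshold and reset value are our own illustrative choices):

```python
import numpy as np

def random_step(error_fn, params, interval=0.5, min_interval=1e-4,
                reset_interval=0.5, n_iter=2000, rng=None):
    """Stochastic "random step" optimization as described in the text.
    error_fn maps a parameter vector to a scalar error; params is a 1-D array."""
    rng = np.random.default_rng() if rng is None else rng
    params = params.copy()
    best_err = error_fn(params)
    for _ in range(n_iter):
        # Pick a random change for each parameter within the current interval.
        step = rng.uniform(-interval, interval, size=params.shape)
        err = error_fn(params + step)
        if err < best_err:
            params += step          # keep the changes
            best_err = err
            interval *= 2.0         # and double the interval
        else:
            interval *= 0.5         # discard the changes and halve the interval
            if interval < min_interval:
                interval = reset_interval   # reset to a larger value
    return params, best_err

# Example: fit a single Gaussian unit c * exp(-(w*x - t)^2) to noisy 1-D data.
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)
def err(p):  # p = (w, t, c)
    return np.mean((p[2] * np.exp(-(p[0] * x - p[1]) ** 2) - y) ** 2)
p_opt, e_opt = random_step(err, np.array([1.0, 0.5, 1.0]))
print(p_opt, e_opt)
```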
The five models were each tested on two different functions: a two-dimensional additive function,

$$h_{add}(x, y) = \sin(2\pi x) + 4\,(y - 0.5)^2\,,$$

and the two-dimensional Gabor function,

$$g_{Gabor}(x, y) = e^{-\|\mathbf{x}\|^2} \cos\!\left(0.75\,\pi\,(x + y)\right)\,.$$

The training data for the functions $h_{add}$ and $g_{Gabor}$ consisted of 20 points picked from a uniform distribution on $[0,1]\times[0,1]$ and $[-1,1]\times[-1,1]$ respectively. Another 10,000 points were randomly chosen to serve as test data. The results are summarized in table 2 (see Girosi, Jones and Poggio, 1993 for a more extensive description of the results).
As expected, the results show that the additive model (3) was able to approximate the additive function $h_{add}(x,y)$ better than both the RBF model (1) and the elliptical Gaussian model (2), and that there seems to be a smooth degradation of performance as the model changes from the additive to the Radial Basis Function. Just the opposite results are seen in approximating the non-additive Gabor function $g_{Gabor}(x,y)$, shown in figure 4a. The RBF model (1) did very well, while the additive model (3) did a very poor job, as shown in figure 4b.
                       Model 1     Model 2     Model 3     Model 4     Model 5
h_add(x,y)    train:   0.000036    0.000067    0.000001    0.000170    0.000743
              test:    0.011717    0.001598    0.000007    0.001422    0.026699
g_Gabor(x,y)  train:   0.000000    0.000000    0.000000    0.000001    0.000044
              test:    0.003818    0.344881    67.95237    0.033964    0.191055

Table 2: A summary of the results of our numerical experiments. Each table entry contains the $L_2$ errors for both the training set and the test set.
However, figure 4c shows that the GRN scheme (model 4) gives a fairly good approximation, because the learning algorithm finds better directions for projecting the data than the x and y axes of the pure additive model.
Notice that the first three models we considered had a number of parameters equal to the number of data points, and were supposed to exactly interpolate the data, so that one may wonder why the training errors are not exactly zero. The reason is the ill-conditioning of the associated linear system, which is a typical problem of Radial Basis Functions (Dyn, Levin and Rippa, 1986).
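As an illustration of this ill-conditioning (our own numerical example, with an arbitrary Gaussian width and a small regularization term that the paper's experiments do not necessarily use):

```python
import numpy as np

# Condition number of the Gaussian interpolation matrix for closely spaced centers,
# and the effect of a small term lambda added on the diagonal.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=20))
G = np.exp(-((x[:, None] - x[None, :]) / 0.5) ** 2)   # sigma = 0.5 as in models 1-3
print("cond(G)         =", np.linalg.cond(G))
lam = 1e-8
print("cond(G + lam*I) =", np.linalg.cond(G + lam * np.eye(len(x))))
```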
8 Hardware and biological implementation of network architectures
We have seen that different network architectures can be derived from regularization by making somewhat different assumptions on the classes of functions used for approximation. Given the basic common roots, one is tempted to argue (and numerical experiments support the claim) that there will be small differences in average performance of the various architectures (see also Lippmann 1989, 1991). It becomes therefore interesting to ask which architectures are easier to implement in hardware.
All the schemes that use the same number of centers as examples, such as RBF and additive splines, are expensive in terms of memory requirements (if there are many examples) but have a simple learning stage. More interesting are the schemes that use fewer centers than examples (and use the linear transformation W). There are at least two perspectives for our discussion: we can consider implementation of radial vs. additive schemes, and we can consider different activation functions. Let us first discuss radial vs. non-radial functions, such as a Gaussian RBF vs. a Gaussian MLP network. For a VLSI implementation, the main difference is in computing a scalar product rather than an $L_2$ distance, which is usually more expensive both for digital and analog VLSI. The $L_2$ distance, however, might be replaced with the $L_1$ distance, that is, a sum of absolute values, which can be computed efficiently. Notice that a Radial Basis Functions scheme that uses the $L_1$ norm has been derived in section (3.2) from a tensor-product stabilizer.
Let us consider now different activation functions. Activation functions such as the Gaussian, the sigmoid or the absolute value are equally easy to compute, especially if look-up table approaches are used. In analog hardware it is somewhat simpler to generate a sigmoid than a Gaussian, although Gaussian-like shapes can be synthesized with fewer than 10 transistors (J. Harris, personal communication).
In practical implementations other issues, such as trade-offs between memory and computation and on-chip learning, are likely to be much more relevant than the specific chosen architecture. In other words, a general conclusion about ease of implementation is not possible: none of the architectures we have considered holds a clear edge.
From the point of view of biological implementations the situation is somewhat different. The hidden unit in MLP networks with sigmoidal-like activation functions is a plausible, albeit much over-simplified, model of real neurons. The sigmoidal transformation of a scalar product seems much easier to implement in terms of known biophysical mechanisms than the Gaussian of a multidimensional Euclidean distance. On the other hand, it is intriguing to observe that HBF centers and tuned cortical neurons behave alike (Poggio and Hurlbert, 1993). In particular, a Gaussian HBF unit is maximally excited when each component of the input exactly matches each component of the center. Thus the unit is optimally tuned to the stimulus value specified by its center. Units with multidimensional centers are tuned to complex features, made of the conjunction of simpler features. This description is very like the customary description of cortical cells optimally tuned to some more or less complex stimulus. So-called place coding is the simplest and most universal example of tuning: cells with roughly bell-shaped receptive fields have peak sensitivities for given locations in the input space, and by overlapping, cover all of that space. Thus tuned cortical neurons seem to behave more like Gaussian HBF units than like the sigmoidal units of MLP networks: the tuned response function of cortical neurons mostly resembles $\exp(-\|\mathbf{x}-\mathbf{t}\|^2)$ more than it does $\sigma(\mathbf{x}\cdot\mathbf{w})$. When the stimulus to a cortical neuron is changed from its optimal value in any direction, the neuron's response typically decreases. The activity of a Gaussian HBF unit would also decline with any change in the stimulus away from its optimal value t. For the sigmoid unit, though, certain changes away from the optimal stimulus will not decrease its activity, for example when the input x is multiplied by a constant greater than 1.
How might, then, multidimensional Gaussian receptive fields be synthesized from known receptive fields and biophysical mechanisms?
The simplest answer is that cells tuned to complex features may be constructed from a hierarchy of simpler cells tuned to incrementally larger conjunctions of elementary features. This idea, popular among physiologists, can immediately be formalized in terms of Gaussian radial basis functions, since a multidimensional Gaussian function can be decomposed into the product of lower dimensional Gaussians (Ballard, 1986; Mel, 1988, 1990, 1992; Poggio and Girosi, 1990). There are several biophysically plausible ways to implement Gaussian RBF-like units (see Poggio and Girosi, 1989 and Poggio, 1990), but none is particularly simple. Ironically, one of the plausible implementations of an RBF unit may exploit circuits based on sigmoidal nonlinearities (see Poggio and Hurlbert, 1993). In general, the circuits required for the various schemes described in this paper are reasonable from a biological point of view (Poggio and Girosi, 1989; Poggio, 1990). For example, the normalized basis function scheme of section (2.2) could be implemented as outlined in figure 5, where a "pool" cell summates the activities of all hidden units and shunts the output unit with a shunting inhibition approximating the required division operation.
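A minimal sketch of the normalized basis function scheme that the pool cell approximates (the Gaussian width, centers and coefficients below are arbitrary illustrative choices):

```python
import numpy as np

def normalized_rbf(x, centers, coeffs, sigma=0.5):
    """Normalized radial basis function scheme: a weighted sum of Gaussian
    activations divided by the summed ("pool") activation of the hidden units.
    x: (num_points, d), centers: (num_centers, d), coeffs: (num_centers,)."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    act = np.exp(-sq_dist / (2.0 * sigma ** 2))     # hidden-unit activities
    pool = act.sum(axis=1)                          # "pool" cell output
    return (act * coeffs).sum(axis=1) / pool        # division by the pool activity

# Tiny usage example with made-up centers and coefficients.
centers = np.array([[0.2, 0.2], [0.8, 0.8]])
coeffs = np.array([1.0, -1.0])
print(normalized_rbf(np.array([[0.2, 0.2], [0.5, 0.5]]), centers, coeffs))
```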
9 Summary and remarks
A large number of approximation techniques can be written as multilayer networks with one hidden layer. In past papers (Poggio and Girosi, 1989; Poggio and Girosi, 1990, 1990b; Girosi, 1992) we showed how to derive Radial Basis Functions, Hyper Basis Functions and several types of multidimensional splines from regularization principles. We had not used regularization to yield approximation schemes of the additive type (Wahba, 1990; Hastie and Tibshirani, 1990), such as additive splines, ridge approximation of the Projection Pursuit Regression type and hinge functions. In this paper, we show that appropriate stabilizers can be defined to justify such additive schemes, and that the same extension that leads from RBF to HBF leads from additive splines to ridge function approximation schemes of the Projection Pursuit Regression type. Our Generalized Regularization Networks include, depending on the stabilizer (that is, on the prior knowledge about the functions we want to approximate), HBF networks, ridge approximation, tensor product splines and Perceptron-like networks with one hidden layer and appropriate activation functions (such as the Gaussian). Figure 6 shows a diagram of the relationships. Notice that HBF networks and ridge approximation networks are directly related in the special case of normalized inputs (Maruyama, Girosi and Poggio, 1992).
We now feel that a common theoretical framework justifies a large spectrum of approximation schemes in terms of different smoothness constraints imposed within the same regularization functional to solve the ill-posed problem of function approximation from sparse data. The claim is that many different networks and corresponding approximation schemes can be derived from the variational principle

$$H[f] = \sum_{i=1}^{N} \left( f(\mathbf{x}_i) - y_i \right)^2 + \lambda\,\phi[f]\,. \qquad\qquad (41)$$

They differ because of different choices of the stabilizer $\phi$, which correspond to different
assumptions of smoothness. In this context, we believe that the Bayesian interpretation is one of the main advantages of regularization: it makes clear that different network architectures correspond to different prior assumptions of smoothness of the functions to be approximated.
The common framework we have derived suggests that differences between the various network architectures are relatively minor, corresponding to different smoothness assumptions. One would expect that each architecture will work best for the class of functions defined by the associated prior (that is, stabilizer), an expectation which is consistent with the numerical results in this paper (see also Donoho and Johnstone, 1989).
9.1 Classification and smoothness
From the point of view of regularization, the task of classification (instead of regression) may seem to represent a problem, since the role of smoothness is less obvious. Consider for simplicity binary classification, in which the output y is either 0 or 1, and let $P(\mathbf{x}, y) = P(\mathbf{x})\,P(y|\mathbf{x})$ be the joint probability of the input-output pairs $(\mathbf{x}, y)$. The average cost associated to an estimator $f(\mathbf{x})$ is the expected risk (see section 2.2)

$$I[f] = \int d\mathbf{x}\, dy\; P(\mathbf{x}, y)\,\left( y - f(\mathbf{x}) \right)^2\,.$$

The problem of learning is now equivalent to minimizing the expected risk based on N samples of the joint probability distribution $P(\mathbf{x}, y)$, and it is usually solved by minimizing the empirical risk (14). Here we discuss two possible approaches to the problem of finding the best estimator:
• If we look for an estimator in the class of real-valued functions, it is well known that the minimizer $f_0$ of the expected risk is the so-called regression function, that is

$$f_0(\mathbf{x}) = \int dy\; y\, P(y|\mathbf{x}) = P(1|\mathbf{x})\,. \qquad\qquad (42)$$

Therefore, a real-valued network f trained on the empirical risk (14) will approximate, under certain conditions of consistency (Vapnik, 1982; Vapnik and Chervonenkis, 1991), the conditional probability distribution of class 1, $P(1|\mathbf{x})$. In this case our final estimator f is real valued, and in order to obtain a binary estimator we have to apply a threshold function to it, so that our final solution turns out to be $\theta(f(\mathbf{x}))$, where $\theta$ is the Heaviside function.
• We could look for an estimator with range $\{0, 1\}$, for example of the form $f(\mathbf{x}) = \theta[g(\mathbf{x})]$. In this case the expected risk becomes the average number of misclassified vectors. The function that minimizes the expected risk is not the regression function anymore, but a binary approximation to it.
We argue that in both cases it makes sense to assume that f (and g) is a smooth real-valued function, and therefore to use regularization networks to approximate it. The argument is that a natural prior constraint for classification is smoothness of the classification boundaries, since otherwise it would be impossible to generalize the correct classification effectively from a set of examples. Furthermore, a condition that usually provides smooth classification boundaries is smoothness of the underlying regressor: a smooth function usually has "smooth" level crossings. Thus both approaches described above suggest imposing smoothness of f or g, that is, approximating f or g with a regularization network.
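To make the first approach concrete, the sketch below (our own illustration, not an experiment from the paper) fits a Gaussian regularization network to binary labels by solving a regularized linear system and then thresholds the real-valued output, which estimates $P(1|\mathbf{x})$; the synthetic data, kernel width and regularization parameter are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_gram(A, B, sigma=0.3):
    # Gram matrix G_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # binary labels in {0, 1}

lam = 1e-3
G = gaussian_gram(X, X)
c = np.linalg.solve(G + lam * np.eye(len(X)), y)   # regularized coefficients

X_test = rng.uniform(-1, 1, size=(1000, 2))
f = gaussian_gram(X_test, X) @ c                   # real-valued estimate of P(1|x)
labels = (f > 0.5).astype(int)                     # threshold (at 1/2, since f estimates a probability)
print("test accuracy:", (labels == (X_test[:, 0] + X_test[:, 1] > 0)).mean())
```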
9.2 Complexity of the approximation problem
So far we have discussed several approximation techniques only from the point of view of the representation and architecture, and we did not discuss how well they perform in approximating functions of different function spaces. Since these techniques are derived under different a priori smoothness assumptions, we clearly expect them to perform optimally when those a priori assumptions are satisfied. This makes it difficult to compare their performances, since we expect each technique to work best on a different class of functions. However, if we measure performance by how quickly the approximation error goes to zero when the number of parameters of the approximation scheme goes to infinity, very general results from the theory of linear and nonlinear widths (Timan, 1963; Pinkus, 1986; Lorentz, 1962, 1986; DeVore, Howard and Micchelli, 1989; DeVore, 1991; DeVore and Yu, 1991) suggest that all techniques share the same limitations. For example, when approximating an s times continuously differentiable function of d variables with some function parametrized by n parameters, one can prove that even the "best" nonlinear parametrizations cannot achieve an accuracy that is better than the Jackson-type bound, that is $O(n^{-s/d})$. Here the adjective "best" is used in the sense defined by DeVore, Howard and Micchelli (1989) in their work on nonlinear n-widths, which restricts the sets of nonlinear parametrizations to those for which the optimal parameters depend continuously on the function that has to be approximated. Notice that, although this is a desirable property, not all approximation techniques may have it, and therefore these results may not always be applicable. However, the basic intuition is that a class of functions has an intrinsic complexity that increases exponentially in the ratio d/s, where s is a smoothness index, that is, a measure of the amount of constraints imposed on the functions of the class. Therefore, if the smoothness index is kept constant, we expect that the number of parameters needed in order to achieve a certain accuracy increases exponentially with the number of dimensions, irrespective of the approximation technique, showing the phenomenon known as "the curse of dimensionality" (Bellman, 1961).
Clearly, if we consider classes of functions with a smoothness index that increases when the number of variables increases, then a rate of convergence independent of the dimensionality can be obtained, because the increase in complexity due to the larger number of variables is compensated by the decrease due to the stronger smoothness constraint.
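To spell out the arithmetic behind this trade-off (our own back-of-the-envelope computation, consistent with the Jackson-type bound above): inverting the rate gives the number of parameters required for a target accuracy $\varepsilon$, and letting the smoothness grow linearly with the dimension removes the exponential dependence on d:

$$\varepsilon \sim n^{-s/d} \;\Longrightarrow\; n \sim \varepsilon^{-d/s}\,; \qquad \text{if } s = c\,d \text{ with } c > 0 \text{ fixed, then } n \sim \varepsilon^{-1/c}\,, \text{ independent of } d.$$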
In order to make this concept clear, we summarized in table (3) a number of different approximation techniques, and the constraints that can be imposed on them in order to make the approximation error $O(1/\sqrt{n})$, that is "independent of the dimension", and therefore immune to the curse of dimensionality. Notice that, since these techniques are derived under different a priori assumptions, the explicit forms of the constraints are different. For example, in entries 5 and 6 of the table (Girosi and Anzellotti, 1992, 1993; Girosi, 1993) the result holds in $H^{2m,1}(R^d)$, that is the Sobolev space of functions whose derivatives up to order 2m are integrable (Ziemer, 1989). Notice that the number of derivatives that are integrable has to increase with the dimension d in order to keep the rate of convergence constant. A similar phenomenon appears in entries 2 and 3 (Barron, 1991, 1993; Breiman, 1993), but in a less obvious way. In fact, it can be shown (Girosi and Anzellotti, 1992, 1993) that, for example, the spaces of functions considered by Barron (entry 2) and Breiman (entry 3) are the sets of functions that can be written as the convolution of $\|\mathbf{x}\|^{1-d}$ and of $\|\mathbf{x}\|^{2-d}$, respectively, with some function whose Fourier transform is integrable. Notice that, in this way, it becomes more apparent that these spaces of functions become more and more constrained as the dimension increases, due to the more and more rapid fall-off of the terms $\|\mathbf{x}\|^{1-d}$ and $\|\mathbf{x}\|^{2-d}$. The same phenomenon is also very clear in the results of H.N. Mhaskar (1993), who proved that the rate of convergence of approximation of functions with s continuous derivatives by multilayered feedforward neural networks is $O(n^{-s/d})$: if the number of continuous derivatives s increases linearly with the dimension d, the curse of dimensionality disappears, leading to a rate of convergence independent of the dimension.
It is important to emphasize that in practice the parameters of the approximation scheme have to be estimated using a finite amount of data (Vapnik and Chervonenkis, 1971, 1981, 1991; Vapnik, 1982; Pollard, 1984; Geman, Bienenstock and Doursat, 1992; Haussler, 1989; Baum and Haussler, 1989; Baum, 1988; Moody, 1991a, 1991b). In fact, what one does in practice is to minimize the empirical risk (see equation 14), while what one would really like to do is to minimize the expected risk (see equation 13). This introduces an additional source of error, sometimes called "estimation error", that usually depends on the dimension d in a much milder way than the approximation error, and can be estimated using the theory of uniform convergence of relative frequencies to probabilities (Vapnik and Chervonenkis, 1971, 1981, 1991; Vapnik, 1982; Pollard, 1984).
Specific results on the generalization error, which combine both approximation and estimation error, have been obtained by A. Barron (1991, 1994) for sigmoidal neural networks, and by Niyogi and Girosi (1994) for Gaussian Radial Basis Functions. Although these bounds are different, they all have the same qualitative behaviour: for a fixed number of data points the generalization error first decreases when the number of parameters increases, then reaches a minimum and starts increasing again, revealing the well known phenomenon of overfitting. For a general description of how the approximation and estimation errors combine to bound the generalization error see Niyogi and Girosi (1994).
9.3 Additive structure and the sensory world
In this last section we address the surprising relative success of additive schemes of the ridge approximation type in real world applications. As we have seen, ridge approximation schemes depend on priors that combine additivity of one-dimensional functions with the usual assumption of smoothness. Do such priors capture some fundamental property of the physical world? Consider for example the problem of object recognition, or the problem of motor control. We can recognize almost any object from any of many small subsets of its features, visual and non-visual. We can perform many motor actions in several different ways. In most situations, our sensory and motor worlds are redundant. In terms of GRN this means that instead of high-dimensional centers, any of several lower-dimensional centers, that is components, are often sufficient to perform a given task. This means that the "and" of a high-dimensional conjunction can be replaced by the "or" of its components (low-dimensional conjunctions): a face may be recognized by its eyebrows alone, or a mug by its color. To recognize an object, we may use not only templates comprising all its features, but also subtemplates, comprising subsets of features, and in some situations the latter, by themselves, may be fully sufficient. Additive, small centers (in the limit with dimensionality one) with the appropriate W are of course associated with stabilizers of the additive type.
Splitting the recognizable world into its additive parts may well be preferable to reconstructing it in its full multidimensionality, because a system composed of several independent, additive parts is inherently more robust than a whole simultaneously dependent on each of its parts. The small loss in uniqueness of recognition is easily offset by the gain against noise and occlusion. There is also a possible meta-argument that we mention here only for the sake of curiosity. It may be argued that humans would not be able to understand the world if it were not additive, because of the too-large number of necessary examples (given the high dimensionality of any sensory input such as an image). Thus one may be tempted to conjecture that our sensory world is biased towards an "additive structure".
1. Function space: $\int_{R^d} d\mathbf{s}\; |\tilde f(\mathbf{s})| < +\infty$ (Jones, 1992). Norm: $L_2$. Approximation scheme: $f(\mathbf{x}) = \sum_{i=1}^{n} c_i\,\sin(\mathbf{x}\cdot\mathbf{w}_i + \theta_i)$.
2. Function space: $\int_{R^d} d\mathbf{s}\; \|\mathbf{s}\|\,|\tilde f(\mathbf{s})| < +\infty$ (Barron, 1991). Norm: $L_2$. Approximation scheme: $f(\mathbf{x}) = \sum_{i=1}^{n} c_i\,\sigma(\mathbf{x}\cdot\mathbf{w}_i + \theta_i)$.
3. Function space: $\int_{R^d} d\mathbf{s}\; \|\mathbf{s}\|^2\,|\tilde f(\mathbf{s})| < +\infty$ (Breiman, 1993). Norm: $L_2$. Approximation scheme: $f(\mathbf{x}) = \sum_{i=1}^{n} c_i\,|\mathbf{x}\cdot\mathbf{w}_i + \theta_i|_+ + \mathbf{x}\cdot\mathbf{a} + b$.
4. Function space: convolutions of $e^{-\|\mathbf{x}\|^2}$ with functions in $L_1(R^d)$ (Girosi and Anzellotti, 1992). Norm: $L_1(R^2)$. Approximation scheme: $f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\|\mathbf{x}-\mathbf{t}_\alpha\|^2}$.
5. Function space: $H^{2m,1}(R^d)$, $2m > d$ (Girosi and Anzellotti, 1992). Norm: $L_1(R^2)$. Approximation scheme: $f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, G_m(\|\mathbf{x}-\mathbf{t}_\alpha\|^2)$.
6. Function space: $H^{2m,1}(R^d)$, $2m > d$ (Girosi, 1993). Norm: $L_2(R^2)$. Approximation scheme: $f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\frac{\|\mathbf{x}-\mathbf{t}_\alpha\|^2}{2}}$.

Table 3: Approximation schemes and corresponding function spaces with the same rate of convergence $O(n^{-\frac{1}{2}})$. The function $\sigma$ is the standard sigmoidal function, the function $|x|_+$ in the third entry is the ramp function, and the function $G_m$ in the fifth entry is a Bessel potential, that is the Fourier transform of $(1+\|\mathbf{s}\|^2)^{-\frac{m}{2}}$ (Stein, 1970). $H^{2m,1}(R^d)$ is the Sobolev space of functions whose derivatives up to order 2m are integrable (Ziemer, 1989).
Acknowledgements  We are grateful to P. Niyogi, H. Mhaskar, J. Friedman, J. Moody, V. Tresp and one of the (anonymous) referees for useful discussions and suggestions. This paper describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and at the Artificial Intelligence Laboratory at MIT. This research is sponsored by grants from the Office of Naval Research under contracts N00014-91-J-0385 and N00014-92-J-1879; by a grant from the National Science Foundation under contract ASC-9217041 (which includes funds from DARPA provided under the HPCC program). Support for the A.I. Laboratory's artificial intelligence research is provided by ONR contract N00014-91-J-4038. Tomaso Poggio is supported by the Uncas and Helen Whitaker Chair at the Whitaker College, Massachusetts Institute of Technology.
APPENDICES
A Derivation of the general form of solution of the
regularization problem
We have seen in section (2) that the regularized solution of the approximation problem is the function that minimizes a cost functional of the following form:

$$H[f] = \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right)^2 + \lambda\,\phi[f] \qquad\qquad (43)$$

where the smoothness functional $\phi[f]$ is given by

$$\phi[f] = \int_{R^d} d\mathbf{s}\; \frac{|\tilde f(\mathbf{s})|^2}{\tilde G(\mathbf{s})}\,.$$

The first term measures the distance between the data and the desired solution f, and the second term measures the cost associated with the deviation from smoothness.
For a wide class of functionals the solutions of the minimization problem (43) all have the same form. A detailed and rigorous derivation of the solution of the variational principle associated with equation (43) is outside the scope of this paper. We present here a simple derivation and refer the reader to the current literature for the mathematical details (Wahba, 1990; Madych and Nelson, 1990; Dyn, 1987).
We first notice that, depending on the choice of G, the functional $\phi[f]$ can have a non-empty null space, and therefore there is a certain class of functions that are "invisible" to it. To cope with this problem we first define an equivalence relation among all the functions that differ by an element of the null space of $\phi[f]$. Then we express the first term of $H[f]$ in terms of the Fourier transform of f (for simplicity of notation we take all the constants that appear in the definition of the Fourier transform to be equal to 1),

$$f(\mathbf{x}) = \int_{R^d} d\mathbf{s}\; \tilde f(\mathbf{s})\, e^{i\mathbf{x}\cdot\mathbf{s}}\,,$$

obtaining the functional

$$H[\tilde f] = \sum_{i=1}^{N} \left( y_i - \int_{R^d} d\mathbf{s}\; \tilde f(\mathbf{s})\, e^{i\mathbf{x}_i\cdot\mathbf{s}} \right)^2 + \lambda \int_{R^d} d\mathbf{s}\; \frac{|\tilde f(\mathbf{s})|^2}{\tilde G(\mathbf{s})}\,.$$

Then we notice that since f is real, its Fourier transform satisfies the constraint

$$\tilde f^{*}(\mathbf{s}) = \tilde f(-\mathbf{s})$$
so that the functional can be rewritten as:

$$H[\tilde f] = \sum_{i=1}^{N} \left( y_i - \int_{R^d} d\mathbf{s}\; \tilde f(\mathbf{s})\, e^{i\mathbf{x}_i\cdot\mathbf{s}} \right)^2 + \lambda \int_{R^d} d\mathbf{s}\; \frac{\tilde f(-\mathbf{s})\,\tilde f(\mathbf{s})}{\tilde G(\mathbf{s})}\,.$$

In order to find the minimum of this functional we take its functional derivative with respect to $\tilde f$ and set it to zero:

$$\frac{\delta H[\tilde f]}{\delta \tilde f(\mathbf{t})} = 0 \qquad \forall\, \mathbf{t} \in R^d\,. \qquad\qquad (44)$$
We now proceed to compute the functional derivatives of the first and second term of $H[\tilde f]$. For the first term we have:

$$\frac{\delta}{\delta \tilde f(\mathbf{t})} \sum_{i=1}^{N} \left( y_i - \int_{R^d} d\mathbf{s}\; \tilde f(\mathbf{s})\, e^{i\mathbf{x}_i\cdot\mathbf{s}} \right)^2 = -2 \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right) \int_{R^d} d\mathbf{s}\; \frac{\delta \tilde f(\mathbf{s})}{\delta \tilde f(\mathbf{t})}\, e^{i\mathbf{x}_i\cdot\mathbf{s}}$$
$$= -2 \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right) \int_{R^d} d\mathbf{s}\; \delta(\mathbf{s} - \mathbf{t})\, e^{i\mathbf{x}_i\cdot\mathbf{s}} = -2 \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right) e^{i\mathbf{x}_i\cdot\mathbf{t}}\,.$$
For the smoothness functional we have:

$$\frac{\delta}{\delta \tilde f(\mathbf{t})} \int_{R^d} d\mathbf{s}\; \frac{\tilde f(-\mathbf{s})\,\tilde f(\mathbf{s})}{\tilde G(\mathbf{s})} = 2 \int_{R^d} d\mathbf{s}\; \frac{\tilde f(-\mathbf{s})}{\tilde G(\mathbf{s})}\, \frac{\delta \tilde f(\mathbf{s})}{\delta \tilde f(\mathbf{t})} = 2 \int_{R^d} d\mathbf{s}\; \frac{\tilde f(-\mathbf{s})}{\tilde G(\mathbf{s})}\, \delta(\mathbf{s} - \mathbf{t}) = 2\, \frac{\tilde f(-\mathbf{t})}{\tilde G(\mathbf{t})}\,.$$
Using these results we can now write equation (44) as:

$$-\sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right) e^{i\mathbf{x}_i\cdot\mathbf{t}} + \lambda\, \frac{\tilde f(-\mathbf{t})}{\tilde G(\mathbf{t})} = 0\,.$$

Changing $\mathbf{t}$ into $-\mathbf{t}$ and multiplying both sides of this equation by $\tilde G(-\mathbf{t})/\lambda$ we get:

$$\tilde f(\mathbf{t}) = \frac{\tilde G(-\mathbf{t})}{\lambda} \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right) e^{-i\mathbf{x}_i\cdot\mathbf{t}}\,.$$
We now define the coefficients

$$c_i = \frac{y_i - f(\mathbf{x}_i)}{\lambda}\,, \qquad i = 1, \dots, N\,,$$

assume that $\tilde G$ is symmetric (so that its Fourier transform is real), and take the Fourier transform of the last equation, obtaining:

$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, \delta(\mathbf{x} - \mathbf{x}_i) * G(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\mathbf{x} - \mathbf{x}_i)\,.$$
We now recall that we had defined as equivalent all the functions differing by a term that lies in the null space of $\phi[f]$, and therefore the most general solution of the minimization problem is

$$f(\mathbf{x}) = \sum_{i=1}^{N} c_i\, G(\mathbf{x} - \mathbf{x}_i) + p(\mathbf{x})$$

where $p(\mathbf{x})$ is a term that lies in the null space of $\phi[f]$, that is, a set of polynomials for most common choices of the stabilizer $\phi[f]$.
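For a positive definite G the coefficient equations $c_i = (y_i - f(\mathbf{x}_i))/\lambda$, combined with $f(\mathbf{x}_j) = \sum_i c_i\, G(\mathbf{x}_j - \mathbf{x}_i)$, are equivalent to the linear system $(G + \lambda I)\,\mathbf{c} = \mathbf{y}$ with $G_{ij} = G(\mathbf{x}_i - \mathbf{x}_j)$. The following sketch checks this numerically for a Gaussian basis function (the data, width and $\lambda$ are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)   # noisy samples

def G(u, sigma=0.1):
    # Gaussian basis function G(u) = exp(-u^2 / (2 sigma^2)).
    return np.exp(-u**2 / (2 * sigma**2))

lam = 1e-2
Gmat = G(x[:, None] - x[None, :])
c = np.linalg.solve(Gmat + lam * np.eye(len(x)), y)   # (G + lam I) c = y

def f(x_new):
    # Regularized solution f(x) = sum_i c_i G(x - x_i).
    return G(x_new[:, None] - x[None, :]) @ c

# Verify the coefficient equations c_i = (y_i - f(x_i)) / lam.
print("max |c_i - (y_i - f(x_i))/lam| =", np.max(np.abs(c - (y - f(x)) / lam)))
```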
B Approximation of vector fields with regularization networks

Consider the problem of approximating a q-dimensional vector field $\mathbf{y}(\mathbf{x})$ from a set of sparse data, the examples, which are pairs $(\mathbf{x}_i, \mathbf{y}_i)$ for $i = 1, \dots, N$. Choose a Generalized Regularization Network as the approximation scheme, that is, a network with one "hidden" layer and linear output units. Consider the case of N examples, $n \le N$ centers, input dimensionality d and output dimensionality q (see figure 7). Then the approximation is

$$\mathbf{y}(\mathbf{x}) = \sum_{\alpha=1}^{n} \mathbf{c}_\alpha\, G(\mathbf{x} - \mathbf{t}_\alpha) \qquad\qquad (45)$$

where G is the chosen basis function and the coefficients $\mathbf{c}_\alpha$ are now q-dimensional vectors, $\mathbf{c}_\alpha = (c^1_\alpha, \dots, c^\mu_\alpha, \dots, c^q_\alpha)$ (the components of an output vector will always be denoted by superscript, Greek indices).
Here we assume, for simplicity, that G is positive definite, in order to avoid the need for additional polynomial terms in the previous equation. Equation (45) can be rewritten in matrix notation as
$$\mathbf{y}(\mathbf{x}) = C\, \mathbf{g}(\mathbf{x}) \qquad\qquad (46)$$

where the matrix C is defined by $(C)^\mu_{\ \alpha} = c^\mu_\alpha$ and $\mathbf{g}(\mathbf{x})$ is the vector with elements $(\mathbf{g}(\mathbf{x}))_\alpha = G(\mathbf{x} - \mathbf{t}_\alpha)$. Assuming, for simplicity, that there is no noise in the data (which is equivalent to choosing $\lambda = 0$ in the regularization functional (1)), the equations for the coefficients $\mathbf{c}_\alpha$ can be found by imposing the interpolation conditions

$$\mathbf{y}_i = C\, \mathbf{g}(\mathbf{x}_i)\,.$$

Introducing the following notation:

$$(Y)^\mu_{\ i} = y^\mu(\mathbf{x}_i)\,, \qquad (C)^\mu_{\ \alpha} = c^\mu_\alpha\,, \qquad (G)_{\alpha i} = G(\mathbf{x}_i - \mathbf{t}_\alpha)\,,$$

the matrix of coefficients C is given by

$$C = Y\, G^+$$

where $G^+$ is the pseudoinverse of G (Penrose, 1955; Albert, 1972). Substituting this expression in equation (46), the following expression is obtained:

$$\mathbf{y}(\mathbf{x}) = Y\, G^+\, \mathbf{g}(\mathbf{x})\,.$$

After some algebraic manipulations, this expression can be rewritten as

$$\mathbf{y}(\mathbf{x}) = \sum_{i=1}^{N} b_i(\mathbf{x})\, \mathbf{y}_i$$

where the functions $b_i(\mathbf{x})$, which are the elements of the vector $\mathbf{b}(\mathbf{x})$, depend on the chosen G according to

$$\mathbf{b}(\mathbf{x}) = G^+\, \mathbf{g}(\mathbf{x})\,.$$

Therefore, it follows (though it is not so well known) that the vector field $\mathbf{y}(\mathbf{x})$ is approximated by the network as a linear combination of the example fields $\mathbf{y}_i$.
Thus for any choice of the regularization network and any choice of the (positive definite) basis function, the estimated output vector is always a linear combination of the output example vectors, with coefficients $\mathbf{b}$ that depend on the input value. The result is valid for all networks with one hidden layer and linear outputs, provided that the mean square error criterion is used for training.
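A small numerical check of this result (with an arbitrary Gaussian basis function and random data of our own choosing): the output computed through $C = Y G^+$ coincides with the linear combination $\sum_i b_i(\mathbf{x})\,\mathbf{y}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d, q = 12, 5, 3, 2              # examples, centers, input dim, output dim
X = rng.standard_normal((N, d))       # example inputs x_i
Y = rng.standard_normal((q, N))       # example outputs y_i as columns of Y
T = rng.standard_normal((n, d))       # centers t_alpha

def g(x):
    # Vector of basis-function activations, (g(x))_alpha = G(x - t_alpha).
    return np.exp(-((x - T) ** 2).sum(axis=1))

G = np.array([g(x) for x in X]).T     # (G)_{alpha,i} = G(x_i - t_alpha), shape (n, N)
C = Y @ np.linalg.pinv(G)             # coefficient matrix C = Y G^+

x_new = rng.standard_normal(d)
b = np.linalg.pinv(G) @ g(x_new)      # b(x) = G^+ g(x), one weight per example
print(C @ g(x_new))                   # network output y(x) = C g(x)
print(Y @ b)                          # same output as a combination of the y_i
```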
Figure 1: (a) The multiquadric function. (b) An equivalent kernel for the multiquadric basis function in the case of two-dimensional equally spaced data. (c, d, e) The equivalent kernels b3, b5 and b6 for nonuniform one-dimensional multiquadric interpolation (see text for explanation).

Figure 2: (a) Absolute value basis function, |x|. (b) Sigmoidal-like basis function L(x). (c) Gaussian-like basis function gL(x).

Figure 3: Approximation of sin(2πx) using 8 basis functions of the (a) absolute value type, (b) sigmoidal-like type, and (c) Gaussian-like type.

Figure 4: (a) The function to be approximated, g(x, y). (b) Additive Gaussian model approximation of g(x, y) (model 3). (c) GRN approximation of g(x, y) (model 4).

Figure 5: An implementation of the normalized radial basis function scheme. A "pool" cell (dotted circle) summates the activities of the hidden units and then divides the output of the network. The division may be approximated in a physiological implementation by shunting inhibition.

Figure 6: Several classes of approximation schemes and corresponding network architectures can be derived from regularization with the appropriate choice of smoothness priors and associated stabilizers and basis functions, showing the common Bayesian roots. (In the diagram, regularization networks (RN) with a radial stabilizer give RBF, with an additive stabilizer give additive splines, and with a product stabilizer give tensor product splines; allowing movable metrics and movable centers leads to the Generalized Regularization Networks (GRN), including HyperBF and ridge approximation, which are related when ||x|| = 1.)

Figure 7: The most general network with one hidden layer and vector output. Notice that this approximation of a q-dimensional vector field has in general fewer parameters than the alternative representation consisting of q networks with one-dimensional outputs. If the only free parameters are the weights from the hidden layer to the output (as for simple RBF with n = N, where N is the number of examples) the two representations are equivalent.
References
[1] F.A. Aidu and V.N. Vapnik. Estimation of probability density on the basis of the
method of stochastic regularization. Avtomatika i Telemekhanika, (4):84{97, April
1989.
[2] A. Albert. Regression and the Moore-Penrose Pseudoinverse. Academic Press,
New York, 1972.
[3] D. Allen. The relationship between variable selection and data augmentation and
a method for prediction. Technometrics, 16:125{127, 1974.
[4] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 686:337{
404, 1950.
[5] D. H. Ballard. Cortical connections and parallel processing: structure and function.
Behavioral and Brain Sciences, 9:67{120, 1986.
[6] A. R. Barron and Barron R. L. Statistical learning networks: a unifying view. In
Symposium on the Interface: Statistics and Computing Science, Reston, Virginia,
April 1988.
[7] A.R. Barron. Approximation and estimation bounds for artificial neural networks. Technical Report 59, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, March 1991.
[8] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Transaction on Information Theory, 39(3):930{945, May 1993.
[9] A.R. Barron. Approximation and estimation bounds for artificial neural networks.
Machine Learning, 14:115{133, 1994.
[10] E. B. Baum. On the capabilities of multilayer perceptrons. J. Complexity, 4:193{
215, 1988.
[11] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural
Computation, 1:151{160, 1989.
[12] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton,
NJ, 1961.
[13] M. Bertero. Regularization methods for linear inverse problems. In C. G. Talenti,
editor, Inverse Problems. Springer-Verlag, Berlin, 1986.
[14] M. Bertero, T. Poggio, and V. Torre. Ill-posed problems in early vision. Proceedings
of the IEEE, 76:869{889, 1988.
[15] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation,
4(6):888{900, November 1992.
[16] L. Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Transaction on Information Theory, 39(3):999-1013, May
1993.
[17] D.S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive
networks. Complex Systems, 2:321{355, 1988.
[18] M.D. Buhmann. Multivariate cardinal interpolation with radial basis functions.
Constructive Approximation, 6:225{255, 1990.
[19] M.D. Buhmann. On quasi-interpolation with Radial Basis Functions. Numerical
Analysis Reports DAMPT 1991/NA3, Department of Applied Mathematics and
Theoretical Physics, Cambridge, England, March 1991.
[20] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. The
Annals of Statistics, 17:453{555, 1989.
[21] B. Caprile and F. Girosi. A nondeterministic minimization algorithm. A.I. Memo
1254, Artificial Intelligence Laboratory, Massachusetts Institute of Technology,
Cambridge, MA, September 1990.
[22] D.D. Cox. Multivariate smoothing spline functions. SIAM J. Numer. Anal.,
21:789{813, 1984.
[23] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating
the correct degree of smoothing by the method of generalized cross validation.
Numer. Math, 31:377{403, 1979.
[24] G. Cybenko. Approximation by superposition of a sigmoidal function. Math.
Control Systems Signals, 2(4):303{314, 1989.
[25] C. de Boor. A Practical Guide to Splines. Springer-Verlag, New York, 1978.
[26] C. de Boor. Quasi-interpolants and approximation power of multivariate splines.
In M. Gasca and C.A. Micchelli, editors, Computation of Curves and Surfaces,
pages 313{345. Kluwer Academic Publishers, Dordrecht, Netherlands, 1990.
[27] R. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation.
Manuskripta Mathematika, 1989.
[28] R.A. DeVore. Degree of nonlinear approximation. In C.K. Chui, L.L. Schumaker,
and D.J. Ward, editors, Approximation Theory, VI, pages 175{201. Academic
Press, New York, 1991.
[29] R.A. DeVore and X.M. Yu. Nonlinear n-widths in Besov spaces. In C.K. Chui, L.L.
Schumaker, and D.J. Ward, editors, Approximation Theory, VI, pages 203{206.
Academic Press, New York, 1991.
[30] L.P. Devroye and T.J. Wagner. Distribution-free consistency results in nonparametric discrimination and regression function estimation. The Annals of Statistics,
8:231{239, 1980.
[31] P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit. The
Annals of Statistics, 12(3):793{815, 1984.
[32] D.L. Donoho and I.M. Johnstone. Projection-based approximation and a duality
with kernel methods. The Annals of Statistics, 17(1):58{106, 1989.
[33] J. Duchon. Spline minimizing rotation-invariant semi-norms in Sobolev spaces.
In W. Schempp and K. Zeller, editors, Constructive theory of functions of several
variables, Lecture Notes in Mathematics, 571. Springer-Verlag, Berlin, 1977.
[34] N. Dyn. Interpolation of scattered data by radial functions. In C.K. Chui, L.L.
Schumaker, and F.I. Utreras, editors, Topics in multivariate approximation. Academic Press, New York, 1987.
[35] N. Dyn. Interpolation and approximation by radial and related functions. In C.K.
Chui, L.L. Schumaker, and D.J. Ward, editors, Approximation Theory, VI, pages
211{234. Academic Press, New York, 1991.
[36] N. Dyn, I.R.H. Jackson, D. Levin, and A. Ron. On multivariate approximation by
integer translates of a basis function. Computer Sciences Technical Report 886,
University of Wisconsin{Madison, November 1989.
[37] N. Dyn, D. Levin, and S. Rippa. Numerical procedures for surface fitting of
scattered data by radial functions. SIAM J. Sci. Stat. Comput., 7(2):639{659,
April 1986.
[38] R.L. Eubank. Spline Smoothing and Nonparametric Regression, volume 90 of
Statistics, textbooks and monographs. Marcel Dekker, Basel, 1988.
[39] R. Franke. Scattered data interpolation: tests of some methods. Math. Comp.,
38(5):181{200, 1982.
[40] R. Franke. Recent advances in the approximation of surfaces from scattered data.
In C.K. Chui, L.L. Schumaker, and F.I. Utreras, editors, Topics in multivariate
approximation. Academic Press, New York, 1987.
[41] J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the
American Statistical Association, 76(376):817{823, 1981.
[42] K. Funahashi. On the approximate realization of continuous mappings by neural
networks. Neural Networks, 2:183{192, 1989.
[43] Th. Gasser and H.G. Muller. Estimating regression functions and their derivatives
by the kernel method. Scand. Journ. Statist., 11:171{185, 1985.
[44] I.M. Gelfand and G.E. Shilov. Generalized functions. Vol. 1: Properties and Operations. Academic Press, New York, 1964.
[45] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance
dilemma. Neural Computation, 4:1{58, 1992.
[46] F. Girosi. Models of noise and robust estimates. A.I. Memo 1287, Artificial
Intelligence Laboratory, Massachusetts Institute of Technology, 1991.
[47] F. Girosi. On some extensions of radial basis functions and their applications in
artificial intelligence. Computers Math. Applic., 24(12):61-80, 1992.
[48] F. Girosi. Regularization theory, radial basis functions and networks. In
V. Cherkassky, J.H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks. Theory and Pattern Recognition Applications. Springer-Verlag, Subseries F, Computer and Systems Sciences, 1993.
[49] F. Girosi and G. Anzellotti. Rates of convergence of approximation by translates. A.I. Memo 1288, Arti cial Intelligence Laboratory, Massachusetts Institute
of Technology, 1992.
[50] F. Girosi and G. Anzellotti. Rates of convergence for radial basis functions and
neural networks. In R.J. Mammone, editor, Artificial Neural Networks for Speech
and Vision, pages 97{113, London, 1993. Chapman & Hall.
[51] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From
regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial
Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
[52] F. Girosi and T. Poggio. Networks and the best approximation property. Biological
Cybernetics, 63:169{176, 1990.
[53] F. Girosi, T. Poggio, and B. Caprile. Extensions of a theory of networks for
approximation and learning: outliers and negative examples. In R. Lippmann,
J. Moody, and D. Touretzky, editors, Advances in Neural information processings
systems 3, San Mateo, CA, 1991. Morgan Kaufmann Publishers.
[54] G. Golub, M. Heath, and G. Wahba. Generalized cross validation as a method for
choosing a good ridge parameter. Technometrics, 21:215{224, 1979.
[55] W. E. L. Grimson. A computational theory of visual surface interpolation. Proceedings of the Royal Society of London B, 298:395{427, 1982.
[56] R.L. Harder and R.M. Desmarais. Interpolation using surface splines. J. Aircraft,
9:189{191, 1972.
[57] W. Hardle. Applied nonparametric regression, volume 19 of Econometric Society
Monographs. Cambridge University Press, 1990.
[58] R.L. Hardy. Multiquadric equations of topography and other irregular surfaces.
J. Geophys. Res, 76:1905{1915, 1971.
[59] R.L. Hardy. Theory and applications of the multiquadric-biharmonic method.
Computers Math. Applic., 19(8/9):163{208, 1990.
[60] T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science,
1:297{318, 1986.
[61] T. Hastie and R. Tibshirani. Generalized additive models: some applications. J.
Amer. Statistical Assoc., 82:371{386, 1987.
[62] T. Hastie and R. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman and Hall, London, 1990.
[63] D. Haussler. Decision theoretic generalizations of the PAC model for neural net
and other learning applications. Technical Report UCSC-CRL-91-02, University
of California, Santa Cruz, 1989.
[64] J.A. Hertz, A. Krogh, and R. Palmer. Introduction to the theory of neural computation. Addison-Wesley, Redwood City, CA, 1991.
[65] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are
universal approximators. Neural Networks, 2:359{366, 1989.
[66] P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435{475, 1985.
[67] A. Hurlbert and T. Poggio. Synthesizing a color algorithm from examples. Science,
239:482{485, 1988.
[68] B. Irie and S. Miyake. Capabilities of three-layered Perceptrons. IEEE International Conference on Neural Networks, 1:641{648, 1988.
[69] I.R.H. Jackson. Radial Basis Functions methods for multivariate approximation.
Ph.d. thesis, University of Cambridge, U.K., 1988.
[70] L.K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for Projection Pursuit Regression and neural network training. The
Annals of Statistics, 20(1):608{613, March 1992.
[71] E.J. Kansa. Multiquadrics - a scattered data approximation scheme with applications to computational fluid dynamics - I. Computers Math. Applic., 19(8/9):127-
145, 1990a.
[72] E.J. Kansa. Multiquadrics - a scattered data approximation scheme with applications to computational fluid dynamics - II. Computers Math. Applic., 19(8/9):147-
161, 1990b.
[73] G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on
stochastic processes and smoothing by splines. Ann. Math. Statist., 2:495{502,
1971.
[74] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464{1480,
1990.
[75] S.Y. Kung. Digital Neural Networks. Prentice Hall, Englewood Cliffs, New Jersey,
1993.
[76] P. Lancaster and K. Salkauskas. Curve and Surface Fitting. Academic Press,
London, 1986.
[77] A. Lapedes and R. Farber. How neural nets work. In Dana Z. Anderson, editor,
Neural Information Processing Systems, pages 442{456. Am. Inst. Physics, NY,
1988. Proceedings of the Denver, 1987 Conference.
[78] R. P. Lippmann. Review of neural networks for speech recognition. Neural Computation, 1:1{38, 1989.
[79] R. P. Lippmann and Y. Lee. A critical overview of neural network pattern classifiers. Presented at Neural Networks for Computing Conference, Snowbird, UT,
1991.
[80] G. G. Lorentz. Metric entropy, widths, and superposition of functions. Amer.
Math. Monthly, 69:469{485, 1962.
[81] G. G. Lorentz. Approximation of Functions. Chelsea Publishing Co., New York,
1986.
[82] W.R. Madych and S.A. Nelson. Multivariate interpolation and conditionally positive definite functions. II. Mathematics of Computation, 54(189):211-230, January
1990.
[83] W.R. Madych and S.A. Nelson. Polyharmonic cardinal splines: a minimization
property. Journal of Approximation Theory, 63:303{320, 1990a.
[84] J. L. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed
problems in computational vision. J. Amer. Stat. Assoc., 82:76{89, 1987.
[85] M. Maruyama, F. Girosi, and T. Poggio. A connection between HBF and MLP.
A.I. Memo No. 1291, Artificial Intelligence Laboratory, Massachusetts Institute of
Technology, 1992.
[86] J. Meinguet. Multivariate interpolation at arbitrary points made simple. J. Appl.
Math. Phys., 30:292{304, 1979.
[87] B. W. Mel. MURPHY: a robot that learns by doing. In D. Z. Anderson, editor,
Neural Information Processing Systems. American Institute of Physics, University
of Colorado, Denver, 1988.
[88] B.W. Mel. The Sigma-Pi column: A model of associative learning in cerebral
neocortex. Technical Report 6, California Institute of Technology, 1990.
[89] B.W. Mel. NMDA-based pattern-discrimination in a modeled cortical neuron.
Neural Computation, 4:502{517, 1992.
[90] H.N. Mhaskar. Approximation properties of a multilayered feedforward artificial
neural network. Advances in Computational Mathematics, 1:61{80, 1993.
[91] H.N. Mhaskar. Neural networks for localized approximation of real functions. In
C.A. Kamm et al., editor, Neural networks for signal processing III, Proceedings
of the 1993 IEEE-SP Workshop, pages 190{196, New York, 1993a. IEEE Signal
Processing Society.
[92] H.N. Mhaskar and C.A. Micchelli. Approximation by superposition of a sigmoidal
function. Advances in Applied Mathematics, 13:350{373, 1992.
[93] H.N. Mhaskar and C.A. Micchelli. How to choose an activation function. In S. J.
Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information
Processing Systems 5. San Mateo, CA: Morgan Kaufmann Publishers, 1993.
[94] C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.
[95] J. Moody. Note on generalization, regularization, and architecture selection in
nonlinear learning systems. In Proceedings of the First IEEE-SP Workshop on
Neural Networks for Signal Processing, pages 1{10, Los Alamitos, CA, 1991a.
IEEE Computer Society Press.
[96] J. Moody. The effective number of parameters: an analysis of generalization and
regularization in nonlinear learning systems. In J. Moody, S. Hanson, and R. Lippmann, editors, Advances in Neural information processings systems 4, pages 847{
854, Palo Alto, CA, 1991b. Morgan Kaufmann Publishers.
[97] J. Moody and C. Darken. Learning with localized receptive fields. In G. Hinton,
T. Sejnowski, and D. Touretzsky, editors, Proceedings of the 1988 Connectionist
Models Summer School, pages 133{143, Palo Alto, 1988.
[98] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation, 1(2):281{294, 1989.
[99] J. Moody and N. Yarvin. Networks with learned unit response functions. In
J. Moody, S. Hanson, and R. Lippmann, editors, Advances in Neural information
processings systems 4, pages 1048{1055, Palo Alto, CA, 1991. Morgan Kaufmann
Publishers.
[100] V.A. Morozov. Methods for solving incorrectly posed problems. Springer-Verlag,
Berlin, 1984.
[101] E.A. Nadaraya. On estimating regression. Theor. Prob. Appl., 9:141{142, 1964.
[102] P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. A.I. Memo 1467,
Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1994.
[103] S. Omohundro. Efficient algorithms with neural network behaviour. Complex
Systems, 1:273, 1987.
[104] G. Parisi. Statistical Field Theory. Addison-Wesley, Reading, Massachusetts, 1988.
[105] E. Parzen. On estimation of a probability density function and mode. Ann. Math.
Statis., 33:1065{1076, 1962.
[106] R. Penrose. A generalized inverse for matrices. Proc. Cambridge Philos. Soc.,
51:406{413, 1955.
[107] A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, New York, 1986.
[108] T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201{
209, 1975.
[109] T. Poggio. A theory of how the brain might work. In Cold Spring Harbor Symposia
on Quantitative Biology, pages 899{910. Cold Spring Harbor Laboratory Press,
1990.
[110] T. Poggio and F. Girosi. A theory of networks for approximation and learning.
A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of
Technology, 1989.
[111] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings
of the IEEE, 78(9), September 1990.
[112] T. Poggio and F. Girosi. Extension of a theory of networks for approximation and
learning: dimensionality reduction and clustering. In Proceedings Image Understanding Workshop, pages 597{603, Pittsburgh, Pennsylvania, September 11{13
1990a. Morgan Kaufmann.
[113] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent
to multilayer networks. Science, 247:978{982, 1990b.
[114] T. Poggio and A. Hurlbert. Observation on cortical mechanisms for object recognition and learning. In C. Koch and J. Davis, editors, Large-scale Neuronal Theories
of the Brain. In press.
[115] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory.
Nature, 317:314{319, 1985.
[116] T. Poggio, H. Voorhees, and A. Yuille. A regularized solution to edge detection.
Journal of Complexity, 4:106{123, 1988.
[117] D. Pollard. Convergence of stochastic processes. Springer-Verlag, Berlin, 1984.
[118] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review.
In J. C. Mason and M. G. Cox, editors, Algorithms for Approximation. Clarendon
Press, Oxford, 1987.
[119] M.J.D. Powell. The theory of radial basis functions approximation in 1990. In W.A.
Light, editor, Advances in Numerical Analysis Volume II: Wavelets, Subdivision
Algorithms and Radial Basis Functions, pages 105{210. Oxford University Press,
1992.
[120] M.B. Priestley and M.T. Chao. Non-parametric function fitting. Journal of the
Royal Statistical Society B, 34:385{392, 1972.
[121] C. Rabut. How to build quasi-interpolants. applications to polyharmonic B-splines.
In P.-J. Laurent, A. Le Mehaute, and L.L. Schumaker, editors, Curves and Surfaces, pages 391{402. Academic Press, New York, 1991.
[122] C. Rabut. An introduction to Schoenberg's approximation. Computers Math.
Applic., 24(12):149{175, 1992.
[123] B.D. Ripley. Neural networks and related methods for classification. Proc. Royal
Soc. London, in press, 1994.
[124] J. Rissanen. Modeling by shortest data description. Automatica, 14:465{471, 1978.
[125] M. Rosenblatt. Curve estimates. Ann. Math. Statist., 64:1815{1842, 1971.
[126] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by
back-propagating errors. Nature, 323(9):533{536, October 1986.
[127] I.J. Schoenberg. Contributions to the problem of approximation of equidistant
data by analytic functions, part a: On the problem of smoothing of graduation, a
first class of analytic approximation formulae. Quart. Appl. Math., 4:45-99, 1946a.
[128] I.J. Schoenberg. Cardinal interpolation and spline functions. Journal of Approximation theory, 2:167{206, 1969.
[129] L.L. Schumaker. Spline functions: basic theory. John Wiley and Sons, New York,
1981.
[130] T. J. Sejnowski and C. R. Rosenberg. Parallel networks that learn to pronounce
English text. Complex Systems, 1:145-168, 1987.
[131] B.W. Silverman. Spline smoothing: the equivalent variable kernel method. The
Annals of Statistics, 12:898{916, 1984.
[132] N. Sivakumar and J.D. Ward. On the best least square fit by radial functions to
multidimensional scattered data. Technical Report 251, Center for Approximation
Theory, Texas A & M University, June 1991.
[133] R.J. Solomonoff. Complexity-based induction systems: comparison and convergence theorems. IEEE Transactions on Information Theory, 24, 1978.
[134] D.F. Specht. A general regression neural network. IEEE Transactions on Neural
Networks, 2(6):568{576, 1991.
[135] E.M. Stein. Singular integrals and differentiability properties of functions. Princeton, N.J., Princeton University Press, 1970.
[136] J. Stewart. Positive definite functions and generalizations, an historical survey.
Rocky Mountain J. Math., 6:409{434, 1976.
[137] C. J. Stone. Additive regression and other nonparametric models. The Annals of
Statistics, 13:689{705, 1985.
[138] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization
method. Soviet Math. Dokl., 4:1035{1038, 1963.
[139] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston,
Washington, D.C., 1977.
[140] A.F. Timan. Theory of approximation of functions of a real variable. Macmillan,
New York, 1963.
[141] V. Tresp, J. Hollatz, and S. Ahmad. Network structuring and training using rule-based knowledge. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances
in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann
Publishers, 1993.
[142] F. Utreras. Cross-validation techniques for smoothing spline functions in one or
two dimensions. In T. Gasser and M. Rosenblatt, editors, Smoothing techniques
for Curve Estimation, pages 196{231. Springer-Verlag, Heidelberg, 1979.
[143] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin, 1982.
[144] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. and its Applications, 17(2):264-
280, 1971.
[145] V.N. Vapnik and A. Ya. Chervonenkis. The necessary and sufficient conditions for
the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3):543{564, 1981.
[146] V.N. Vapnik and A. Ya. Chervonenkis. The necessary and sufficient conditions for
consistency in the empirical risk minimization method. Pattern Recognition and
Image Analysis, 1(3):283{305, 1991.
[147] V.N. Vapnik and A.R. Stefanyuk. Nonparametric methods for restoring probability
densities. Avtomatika i Telemekhanika, (8):38{52, 1978.
[148] G. Wahba. Smoothing noisy data by spline functions. Numer. Math, 24:383{393,
1975.
[149] G. Wahba. Smoothing and ill-posed problems. In M. Golberg, editor, Solutions
methods for integral equations and applications, pages 183{194. Plenum Press, New
York, 1979.
[150] G. Wahba. Spline bases, regularization, and generalized cross-validation for solving approximation problems with large quantities of noisy data. In J. Ward and
E. Cheney, editors, Proceedings of the International Conference on Approximation
theory in honour of George Lorenz, Austin, TX, January 8{10 1980. Academic
Press.
[151] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter
in the generalized splines smoothing problem. The Annals of Statistics, 13:1378{
1402, 1985.
[152] G. Wahba. Splines Models for Observational Data. Series in Applied Mathematics,
Vol. 59, SIAM, Philadelphia, 1990.
[153] G. Wahba and S. Wold. A completely automatic french curve. Commun. Statist.,
4:1{17, 1975.
[154] G.S. Watson. Smooth regression analysis. Sankhya A, 26:359{372, 1964.
[155] H. White. Learning in artificial neural networks: a statistical perspective. Neural
Computation, 1:425{464, 1989.
[156] H. White. Connectionist nonparametric regression: Multilayer perceptrons can
learn arbitrary mappings. Neural Networks, 3(535-549), 1990.
[157] A. Yuille and N. Grzywacz. The motion coherence theory. In Proceedings of the
International Conference on Computer Vision, pages 344{354, Washington, D.C.,
December 1988. IEEE Computer Society Press.
[158] W.P. Ziemer. Weakly differentiable functions: Sobolev spaces and functions of
bounded variation. Springer-Verlag, New York, 1989.