Lect Notes 3
Approximation Methods
In this chapter, we deal with a very important problem that we will encounter
in a wide variety of economic problems: approximation of functions. Such
a problem commonly occurs when it is too costly either in terms of time or
complexity to compute the true function or when this function is unknown
and we just need to have a rough idea of its main properties. Usually the only
thing that is required then is to be able to compute this function at one or a
few points and formulate a guess for all other values. This leaves us with some
choice concerning either the local or global character of the approximation
and the level of accuracy we want to achieve. As we will see in different
applications, choosing the method is often a matter of efficiency and ease of
computing.
Following Judd [1998], we will consider three types of approximation methods:
1. Local approximations, which essentially exploit information on the value of the function at one point and its derivatives at the same point. The idea is then to obtain a (hopefully) good approximation of the function in a neighborhood of the benchmark point.
2. L^p approximations, which actually find a nice function that is close to the function we want to evaluate in the sense of an L^p norm. Ideally, we would need information on the whole function to find a good approximation, which is usually infeasible or would make the approximation problem totally irrelevant! Therefore, we usually rely on interpolation, which then appears as the other side of the same problem, but only requires knowing the function at some points.
3. Regressions, which may be viewed as an intermediate situation between the two preceding cases, as they usually rely, exactly as in econometrics, on m moments to find n parameters of the approximating function.
3.1 Local approximations

3.1.1 Taylor series expansion

Taylor series expansion is certainly the most well-known and natural approximation to any student.

The basic framework: This approximation relies on the standard Taylor's theorem:
Theorem 1 Suppose F : R^n --> R is a C^{k+1} function. Then, for x* in R^n, we have

F(x) = F(x*) + \sum_{i=1}^{n} \frac{\partial F}{\partial x_i}(x*)(x_i - x_i*)
     + \frac{1}{2} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \frac{\partial^2 F}{\partial x_{i_1} \partial x_{i_2}}(x*)(x_{i_1} - x*_{i_1})(x_{i_2} - x*_{i_2}) + ...
     + \frac{1}{k!} \sum_{i_1=1}^{n} ... \sum_{i_k=1}^{n} \frac{\partial^k F}{\partial x_{i_1} ... \partial x_{i_k}}(x*)(x_{i_1} - x*_{i_1}) ... (x_{i_k} - x*_{i_k})
     + O(||x - x*||^{k+1})
Provided F is at least C^{k+1} in a neighborhood of x*, we are sure that the error will be at most of order O(||x - x*||^{k+1}).^1
In fact, we may look at Taylor series expansions from a slightly different perspective. The expansion is a power series whose coefficients are built from the derivatives \frac{1}{k!}\frac{\partial^k F}{\partial x^k}(x*). For a power series \sum_{k=0}^{\infty} \alpha_k (x - x*)^k, define the radius of convergence

r = sup{ |x| : \sum_{k=0}^{\infty} |\alpha_k| |x|^k < \infty }

r therefore provides the maximal radius of x in C for which the complex series converges: for any x in C such that |x| < r the series converges, while it diverges whenever |x| > r. A function F is then said to be analytic at x* if it can be written as

F(x) = \sum_{k=0}^{\infty} \alpha_k (x - x*)^k   for ||x - x*|| < r

^1 Let us recall at this point that a function f : R^n --> R^k is O(x^l) if lim_{x->0} ||f(x)||/||x||^l < \infty.
For example, let us consider the tangent function tan(x), which is defined as the ratio of two analytic functions,

tan(x) = sin(x)/cos(x)

with

sin(x) = \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{(2n+1)!}   and   cos(x) = \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n}}{(2n)!}

As a second example, consider the function F(x) = log(1 - x) around x* = 0. Its Taylor series expansion,

\sum_{k=0}^{\infty} \frac{1}{k!}\frac{\partial^k F}{\partial x^k}(x*)(x - x*)^k = - \sum_{k=1}^{\infty} \frac{x^k}{k},

converges only up to the nearest singular point of the function, x_o = 1.
What the theorem tells us is that this approximation may be used for values of x such that |x - x*| is below |x* - x_o| = |0 - 1| = 1, that is for x such that -1 < x < 1. In other words, the radius of convergence of the Taylor series expansion of log(1 - x) around x* = 0 is 1. To illustrate this, we report in table 3.1 the true value of log(1 - x), its approximate value using 100 terms in the summation, and the absolute deviation of this approximation from the true value.^2 As can be seen from the table, as soon as |x - x*| approaches the radius of convergence, the Taylor series approximation performs more and more poorly.
Table 3.1: Taylor series expansion for log(1 - x)

   x        log(1-x)      100-term approx.   abs. deviation
 -0.9999    0.69309718     0.68817193        0.00492525
 -0.9900    0.68813464     0.68632282        0.00181182
 -0.9000    0.64185389     0.64185376        1.25155e-007
 -0.5000    0.40546511     0.40546511        1.11022e-016
  0.0000    0.00000000     0.00000000        0
  0.5000   -0.69314718    -0.69314718        2.22045e-016
  0.9000   -2.30258509    -2.30258291        2.18735e-006
  0.9900   -4.60517019    -4.38945277        0.215717
  0.9999   -9.21034037    -5.17740221        4.03294
^2 The word "true" lies between quotes, as this value has itself been computed by the computer and is therefore an approximation!
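As a quick numerical check, the computation behind table 3.1 can be reproduced in a few lines of MATLAB. This is a minimal sketch: the truncation at 100 terms follows the text, everything else is a plain implementation of the series -sum_{k=1}^{100} x^k/k.

x      = [-0.9999 -0.99 -0.9 -0.5 0 0.5 0.9 0.99 0.9999]';
truef  = log(1-x);                  % "true" value, itself computed numerically
approx = zeros(size(x));
for k = 1:100
   approx = approx - x.^k/k;        % truncated Taylor series around x* = 0
end
dev = abs(truef-approx);            % absolute deviation
disp([x truef approx dev])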
To illustrate the use of Taylor series expansions, consider the following two equations, which characterize the deterministic optimal growth model:^3

\frac{1}{c_t} = \beta \frac{1}{c_{t+1}} (\alpha k_{t+1}^{\alpha-1} + 1 - \delta)   (3.1)

k_{t+1} = k_t^{\alpha} - c_t + (1 - \delta) k_t   (3.2)

^3 Since we just want to make the case for Taylor series expansions, we do not need to be any more precise about the origin of these two equations. We therefore take them as given, but we will come back to their determination in the sequel.
Linearization: Let us denote by k\hat_t the deviation of the capital stock k_t with respect to its steady state level, such that k\hat_t = k_t - k*. Likewise, we define c\hat_t = c_t - c*. The first step of linearization is to re-express the system in the form F(k_{t+1}, c_{t+1}, k_t, c_t) = 0 with

F(k_{t+1}, c_{t+1}, k_t, c_t) = ( k_{t+1} - k_t^{\alpha} + c_t - (1-\delta) k_t ;  \frac{1}{c_t} - \beta \frac{1}{c_{t+1}} (\alpha k_{t+1}^{\alpha-1} + 1 - \delta) )

A first-order Taylor expansion around the steady state (k*, c*, k*, c*) then yields

F(k_{t+1}, c_{t+1}, k_t, c_t) \simeq F_1(k*,c*,k*,c*) k\hat_{t+1} + F_2(k*,c*,k*,c*) c\hat_{t+1} + F_3(k*,c*,k*,c*) k\hat_t + F_4(k*,c*,k*,c*) c\hat_t = 0

For the Euler equation, for instance, this gives

- \frac{1}{c*^2} c\hat_t + \beta \frac{1}{c*^2} (\alpha k*^{\alpha-1} + 1 - \delta) c\hat_{t+1} - \beta \frac{1}{c*} \alpha(\alpha-1) k*^{\alpha-2} k\hat_{t+1} = 0
which simplifies to

k\hat_{t+1} - \alpha k*^{\alpha-1} k\hat_t + c\hat_t - (1-\delta) k\hat_t = 0
- c\hat_t + \beta (\alpha k*^{\alpha-1} + 1 - \delta) c\hat_{t+1} - \beta \alpha(\alpha-1) \frac{c*}{k*} k*^{\alpha-1} k\hat_{t+1} = 0
We then have to solve the implied linear dynamic system, but this is another story that we will deal with in a couple of chapters.
Log-linearization: It is often preferable to take a log-linear approximation to the equilibrium. Such an approximation is usually taken because it delivers a natural interpretation of the coefficients in front of the variables: these can be interpreted as elasticities. Indeed, let us consider the following one-dimensional function f(x) and let us assume that we want to take a log-linear approximation of f around x*. This amounts to using, as deviation, a log-deviation rather than a simple deviation, such that we can define

x\hat = log(x) - log(x*)
Then a restatement of the problem is in order, as we are to take an approximation with respect to log(x):

f(x) = f(exp(log(x)))

which leads to the following first-order Taylor expansion

f(x) \simeq f(x*) + f'(exp(log(x*))) exp(log(x*)) x\hat = f(x*) + f'(x*) x* x\hat
If we apply this technique to the growth model, we end up with the following system

k\hat_{t+1} - \alpha k*^{\alpha-1} k\hat_t + \frac{c*}{k*} c\hat_t - (1-\delta) k\hat_t = 0
- c\hat_t + c\hat_{t+1} - (\alpha-1)(1 - \beta(1-\delta)) k\hat_{t+1} = 0

where, from the steady state relations, \alpha \beta k*^{\alpha-1} = 1 - \beta(1-\delta).
3.2 Regressions as approximation
This type of approximation is particularly common in economics, as it just corresponds to ordinary least squares (OLS). As you know, the problem essentially amounts to approximating a function F by another function G of exogenous variables. In other words, we have a set of observable endogenous variables y_i, i = 1,...,N, which we are willing to explain in terms of the set of exogenous variables X_i = {x_{1i},...,x_{ki}}, i = 1,...,N. This problem amounts to finding a set of parameters \theta that solves

min_{\theta} \sum_{i=1}^{N} (y_i - G(X_i; \theta))^2   (3.3)

The idea is therefore that \theta is chosen such that, on average, G(X; \theta) is close enough to y, such that G delivers a good approximation of the true function F in the region of X that we consider. This is actually nothing else than econometrics!

There are, however, several choices that can be made, and these give us much more freedom than in econometrics. We now investigate these choices.
Number of data points: In econometrics, one cannot choose the number of data points used to reveal information on the F function, as data are given by history. In numerical analysis this constraint is relaxed: we may be exactly identified, for example, meaning that the number of data points, N, is exactly equal to the number of parameters, p, we want to reveal. Never would a good econometrician do this kind of thing! Obviously, we can instead impose a situation where N > p in order to exploit more information. The difference between these two choices should be clear to you: in the first case we are sure that the approximation will be exact at the selected points, whereas this will not necessarily be the case in the second experiment. To see that, just think of the following: assume we have a sample made of 2 data points for a function that we want to approximate using a linear function. We actually need 2 parameters to define a linear function: the intercept \alpha and the slope \beta. In such a case, (3.3) rewrites
min_{\alpha,\beta} (y_1 - \alpha - \beta x_1)^2 + (y_2 - \alpha - \beta x_2)^2

The first-order conditions of this problem take the form

( 1    1  ) ( y_1 - \alpha - \beta x_1 )
( x_1  x_2) ( y_2 - \alpha - \beta x_2 ) = 0,   or   A.v = 0

This system then just amounts to finding the null space of the matrix A which, in the case x_1 \neq x_2 (which we can always impose, as we will see next), imposes v_i = 0, i = 1, 2. This therefore leads to

y_1 = \alpha + \beta x_1
y_2 = \alpha + \beta x_2

such that the approximation is exact. When the system is overidentified this is not the case anymore.
Selection of data points: In dynamic economic models, candidate exogenous variables are often highly serially correlated, and so are their powers. For instance, the capital stock is essentially a distributed sum of past investment:

k_t = \sum_{l=0}^{\infty} (1 - \delta)^l i_{t-l}

Therefore, taking powers of the capital stock as a basis for the exogenous variables is basically a bad idea. To see that, let us assume that \delta = 0.025 and that i_t is a white noise process with volatility 0.1; let us then simulate a 1000-data-point process for the capital stock and compute the correlation matrix of k_t^j, j = 1,...,4. We get:
         k_t      k_t^2    k_t^3    k_t^4
k_t      1.0000   0.9835   0.9572   0.9315
k_t^2    0.9835   1.0000   0.9933   0.9801
k_t^3    0.9572   0.9933   1.0000   0.9963
k_t^4    0.9315   0.9801   0.9963   1.0000
implying a very high correlation between the powers of the capital stock, therefore raising the possibility of multicollinearity. A typical answer to this problem is to rely on orthogonal polynomials rather than monomials. We will discuss this issue extensively in the next section. A second possibility is to rely on parsimonious approaches that do not require too much information in terms of function specification. An alternative is to use neural networks.
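A minimal sketch of the simulation just described is given below. The positive mean of the investment process and the zero initial condition are assumptions made here so that the capital stock remains positive; the rest follows the text (delta = 0.025, white-noise investment with standard deviation 0.1, 1000 data points).

delta = 0.025;
T     = 1000;
i     = 0.1 + 0.1*randn(T,1);         % investment: white noise around a positive mean (assumed)
k     = zeros(T,1);
for t = 2:T
   k(t) = (1-delta)*k(t-1) + i(t);    % k_t as a distributed sum of past investment
end
disp(corrcoef([k k.^2 k.^3 k.^4]))    % correlation matrix of the powers of k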
Before going to neural network approximations, we first have to deal with a potential problem that arises in all the examples given so far: when we solve a model, the true decision rule is unknown, such that we do not know the function we are dealing with. However, the main properties of the decision rule are known; in particular, we know that it has to satisfy some conditions imposed by economic theory. As an example, let us attempt to find an approximate solution for the consumption decision rule in the deterministic optimal growth model. Economic theory teaches us that consumption should
satisfy the Euler equation

\frac{1}{c_t} = \beta \frac{1}{c_{t+1}} (\alpha k_{t+1}^{\alpha-1} + 1 - \delta)   (3.4)

together with the law of motion of capital

k_{t+1} = k_t^{\alpha} - c_t + (1 - \delta) k_t

Let us assume that consumption may be approximated by a second-order polynomial in the capital stock,

c_t \simeq \Psi(k_t; \theta) = \theta_0 + \theta_1 k_t + \theta_2 k_t^2   (3.5)
over a given interval of values for the capital stock. Our problem is then to find the triple {\theta_0, \theta_1, \theta_2} such that

\sum_{t=1}^{N} ( \frac{1}{\Psi(k_t;\theta)} - \beta \frac{1}{\Psi(k_{t+1};\theta)} (\alpha k_{t+1}^{\alpha-1} + 1 - \delta) )^2

is minimal. This actually amounts to solving a nonlinear least-squares problem. However, a lot of structure is put on this problem, as k_{t+1} has to satisfy the law of motion for capital, k_{t+1} = k_t^{\alpha} - \Psi(k_t;\theta) + (1-\delta) k_t. Define the residual

R(k_t, \theta) \equiv \frac{1}{\Psi(k_t;\theta)} - \beta \frac{1}{\Psi(k_{t+1};\theta)} (\alpha k_{t+1}^{\alpha-1} + 1 - \delta)
The parameter vector \theta is then chosen so as to make the sum of squared residuals, \sum_{t=1}^{N} R(k_t, \theta)^2, as small as possible.
The true decision rule was computed using value iteration, which we shall study later.
As can be seen from the graph, the solution we obtained with our approximation is rather accurate, and we actually have

\frac{1}{N} \sum_{i=1}^{N} | \frac{c_i^{true} - c_i^{LS}}{c_i^{true}} | = 8.743369e-04   and   max_{i \in {1,...,N}} | \frac{c_i^{true} - c_i^{LS}}{c_i^{true}} | = 0.005140

such that the error we are making is particularly small. Indeed, using the approximate decision rule, an agent would make a maximal error of 0.51% in its economic calculus, i.e. 51 cents for each $100 spent on consumption.
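A hedged sketch of the nonlinear least-squares problem just described is given below; the parameter values, the grid bounds and the use of fminsearch are illustrative assumptions, not the notes' original code.

alpha = 0.3; beta = 0.95; delta = 0.1;                % technology/preference parameters (assumed)
kgrid = linspace(0.2,6,20)';                          % grid of capital stocks (assumed bounds)
psi   = @(k,th) th(1) + th(2)*k + th(3)*k.^2;         % consumption rule Psi(k;theta)
kp    = @(k,th) k.^alpha - psi(k,th) + (1-delta)*k;   % law of motion for capital
R     = @(k,th) 1./psi(k,th) - beta*(alpha*kp(k,th).^(alpha-1)+1-delta)./psi(kp(k,th),th);
ssr   = @(th) sum(R(kgrid,th).^2);                    % sum of squared Euler residuals
theta = fminsearch(ssr,[0.2;0.5;0]);                  % nonlinear least squares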
Figure 3.2: Least-square approximation of consumption (consumption plotted against the capital stock)
Neural networks offer a parsimonious alternative. A single-layer neural network takes the form

G(x; \theta) = h( \sum_{i=1}^{n} \theta_i g(x^{(i)}) )

where x is the vector of inputs, and h and g are scalar functions. A common choice for g in the case of the single-layer neural network is g(x) = x. Therefore, if we set h to be the identity too, we are back to the standard OLS model with monomials.
Figure 3.3: Neural networks (panel (a): single-layer network; panel (b): hidden-layer feedforward network)
A second and perhaps much more interesting type of neural network is depicted in panel (b) of figure 3.3: the so-called hidden-layer feedforward network. In this specification, we are closer to the idea of a network, as the transformed inputs are fed to another node that processes them to extract information. The associated functional form is given by

G(x, \theta, \gamma) = f( \sum_{j=1}^{m} \theta_j h( \sum_{i=1}^{n} \gamma_{ij} g(x^{(i)}) ) )
Common choices for the activation function h include the step function

h(x) = 1 for x > 0,  0 for x < 0,

the sigmoid function

h(x) = \frac{1}{1 + exp(-x)},

and the cumulative normal distribution

h(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} exp(-\frac{t^2}{2}) dt
Obtaining an approximation then simply amounts to determining the set of coefficients {\theta_j, \gamma_{ij}; i = 1,...,n, j = 1,...,m}. This can simply be achieved by nonlinear least squares, that is by solving

min_{\theta,\gamma} \sum_{l=1}^{N} ( y_l - G(x_l, \theta, \gamma) )^2
One particularly nice feature of neural networks is that they deliver accurate approximations using only a few parameters. This characteristic is related to the high flexibility of the approximating functions they use. Further, as established by the following theorem by Hornik, Stinchcombe and White [1989], neural networks are extremely powerful in that they offer a universal approximation method.
Theorem 3 Let f : R^n --> R be a continuous function to be approximated. Let h be a continuous function, h : R --> R, such that either (i) \int h(x) dx is finite and nonzero, or (ii) h is a squashing function. Let \Sigma(h) be the set of all possible single hidden-layer feedforward neural networks using h as the hidden-layer activation function, that is, functions of the form g(x) = \sum_{j=1}^{m} a_j h(w_j' x + b_j) with a_j, b_j in R, w_j in R^n, w_j \neq 0, m = 1, 2, .... Then, for all \epsilon > 0, all probability measures \mu, and all compact sets K in R^n, there is a g in \Sigma(h) such that

sup_{x in K} |f(x) - g(x)| <= \epsilon   and   \int |f(x) - g(x)| d\mu <= \epsilon
As an illustration, let us approximate the function

F(x) = min( max(-1.5, (x - 1/2)^3), 2 )

over the interval [-3; 3] and consider a single hidden-layer feedforward network of the form

F~(x, \theta, \gamma, \delta) = \frac{\theta_1}{1 + exp(\gamma_1 x + \delta_1)} + \frac{\theta_2}{1 + exp(\gamma_2 x + \delta_2)}
Estimating the parameters by nonlinear least squares yields the values 6.8424, -10.0893, -1.5091, -7.6414 and -2.9238 for the coefficients of the two logistic terms,
and the accuracy of the resulting approximation, measured on the sample {x_i}, is

( \sum_{i=1}^{N} (F(x_i) - F~(x_i, \Theta\hat))^2 )^{1/2} = 0.0469   and   max_i |F(x_i) - F~(x_i, \Theta\hat)| = 0.2200
i
True
Neural Network
1.5
1
0.5
0
0.5
1
1.5
2
3
0
x
All these methods actually relied on regressions. They are simple, but may be either totally unreliable and ill-conditioned in a number of problems, or too demanding from a computational point of view.
3.2.1 Orthogonal polynomials
This class of polynomials possesses by definition the orthogonality property, which will prove to be extremely efficient and useful in a number of problems. For instance, it will solve the multicollinearity problem we encountered with OLS. This property will also greatly simplify the computation of the approximation in a number of problems we will deal with in the sequel. First of all, we need to introduce some preliminary concepts.
Definition 4 (Weighting function) A weighting function \omega(x) on the interval [a; b] is a function that is positive almost everywhere on [a; b] and has a finite integral on [a; b].

An example of such a weighting function is \omega(x) = (1 - x^2)^{-1/2} over the interval [-1; 1]. Indeed, \omega(x) is positive everywhere on (-1; 1), and its integral over the interval is finite:

\int_{-1}^{1} (1 - x^2)^{-1/2} dx = arcsin(x) |_{-1}^{1} = \pi
Definition 5 (Inner product) Let us consider two functions f_1(x) and f_2(x), both defined at least on [a; b]. The inner product with respect to the weighting function \omega(x) is given by

<f_1, f_2> = \int_a^b f_1(x) f_2(x) \omega(x) dx

For example, with f_1(x) = 1, f_2(x) = x and \omega(x) = (1 - x^2)^{-1/2} on [-1; 1],

<f_1, f_2> = \int_{-1}^{1} \frac{x}{\sqrt{1 - x^2}} dx = [ -\sqrt{1 - x^2} ]_{-1}^{1} = 0

Hence, in this case, the inner product of 1 and x with respect to \omega(x) on the interval [-1; 1] is identically null. This will actually define the orthogonality property.
Definition 6 (Orthogonal polynomials) The family of polynomials {P_n(x)} is mutually orthogonal with respect to \omega(x) iff

<P_i, P_j> = 0 for i \neq j

Definition 7 (Orthonormal polynomials) The family of polynomials {P_n(x)} is mutually orthonormal with respect to \omega(x) iff it is orthogonal and

<P_i, P_i> = 1 for all i

Table 3.3 reports the most common families of orthogonal polynomials (see Judd [1998] for a more detailed exposition) and table 3.4 their recursive formulation.
Table 3.3: Orthogonal polynomials (definitions)

Family      \omega(x)          [a; b]         Definition
Legendre    1                  [-1; 1]        P_n(x) = \frac{(-1)^n}{2^n n!} \frac{d^n}{dx^n} (1 - x^2)^n
Chebychev   (1 - x^2)^{-1/2}   [-1; 1]        T_n(x) = cos(n cos^{-1}(x))
Laguerre    exp(-x)            [0; \infty)    L_n(x) = \frac{exp(x)}{n!} \frac{d^n}{dx^n} (x^n exp(-x))
Hermite     exp(-x^2)          (-\infty; \infty)   H_n(x) = (-1)^n exp(x^2) \frac{d^n}{dx^n} exp(-x^2)
Table 3.4: Orthogonal polynomials (recursive formulation)

Family      Order 0        Order 1          Recursion
Legendre    P_0(x) = 1     P_1(x) = x       P_{n+1}(x) = \frac{2n+1}{n+1} x P_n(x) - \frac{n}{n+1} P_{n-1}(x)
Chebychev   T_0(x) = 1     T_1(x) = x       T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)
Laguerre    L_0(x) = 1     L_1(x) = 1 - x   L_{n+1}(x) = \frac{2n+1-x}{n+1} L_n(x) - \frac{n}{n+1} L_{n-1}(x)
Hermite     H_0(x) = 1     H_1(x) = 2x      H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x)

3.3 Least square orthogonal polynomial approximation

We will now discuss one of the most common approaches to approximation, which goes back to the easy OLS approach we have seen previously.
The problem is to find, among polynomials of degree at most n, the one that is closest to F in the least-squares sense defined by the weighting function \omega:

min_{deg(g) <= n} \int_a^b (F(x) - g(x))^2 \omega(x) dx

Writing g(x) = \sum_{i=0}^{n} c_i \varphi_i(x), where {\varphi_i} is a family of polynomials that are mutually orthogonal with respect to \omega, the first-order conditions are <F - \sum_i c_i \varphi_i, \varphi_j> = 0 for all j, which yields

c_i = \frac{<F, \varphi_i>}{<\varphi_i, \varphi_i>}

Therefore, we have

F(x) \simeq \sum_{i=0}^{n} \frac{<F, \varphi_i>}{<\varphi_i, \varphi_i>} \varphi_i(x)
There are several examples of the use of this type of approximation; Fourier approximation, for instance, is suitable for periodic functions. Nevertheless, I will focus in a coming section on one particular approach, which we will use quite often in the next chapters: Chebychev approximation.
Beyond least-squares approximation, there exist other approaches that depart from least squares in the norm they use. Uniform approximation, for instance, attempts to find a sequence of polynomials p_n, deg(p_n) <= n, such that

lim_{n --> \infty} max_{x in [a;b]} |F(x) - p_n(x)| = 0

There are theorems in approximation theory stating that the quality of this approximation increases polynomially as the degree of the polynomial increases, and that the polynomial rate of convergence is faster for smoother functions.
3.4 Interpolation methods
Up to now, we have seen methods to compute the value of a function at some particular points. In most cases, however, we are not only interested in the function at those points but also between them. This is the problem of interpolation.
3.4.1 Linear interpolation
Linear interpolation simply connects consecutive data points (x_{i-1}, y_{i-1}) and (x_i, y_i) by a straight line, such that for x in [x_{i-1}; x_i] the approximation is

g(x) = \frac{y_i - y_{i-1}}{x_i - x_{i-1}} x + \frac{x_i y_{i-1} - x_{i-1} y_i}{x_i - x_{i-1}}
Although such approximations have proved to be very useful in a lot of applications, they can be particularly inefficient for different reasons:

1. the method does not deliver an approximating function but rather a collection of approximations, one for each interval;

2. it requires identifying the interval where the approximation is to be computed, which may be particularly costly when the grid is not uniform;

3. it can obviously perform very badly for nonlinear functions.
Therefore, there obviously exist alternative interpolation methods that perform better, but which are not necessarily more efficient.
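As an aside, MATLAB's built-in interp1 performs exactly this piecewise linear interpolation by default; a two-line check with an illustrative function:

xi = linspace(0,2,6);  yi = sqrt(xi);   % nodes and data (illustrative)
g  = interp1(xi,yi,0.7);                % piecewise linear interpolant at x = 0.7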
3.4.2 Lagrange interpolation
Assume we are given n data points {(x_i, y_i)}. Lagrange interpolation consists in finding the polynomial P of degree n-1 such that y_i = P(x_i). Therefore, the method is exact at each data point. The Lagrange polynomials are defined by

P_i(x) = \prod_{j \neq i} \frac{x - x_j}{x_i - x_j}

and the interpolating polynomial is

P(x) = \sum_{i=1}^{n} y_i P_i(x)
The obvious problem that we can immediately see from this formula is that if the number of points is high enough, this type of interpolation is totally intractable. Indeed, computing a single P_i(x) already requires 2(n-1) subtractions and n multiplications; this has to be done for the n data points to compute all needed P_i(x); then, to compute P(x), we need n additions and n multiplications. Therefore, this requires about 3n^2 operations! One may instead attempt to compute directly
P(x) = \sum_{i=0}^{n-1} \alpha_i x^i

which amounts to solving the system

y_1 = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_1^2 + ... + \alpha_{n-1} x_1^{n-1}
y_2 = \alpha_0 + \alpha_1 x_2 + \alpha_2 x_2^2 + ... + \alpha_{n-1} x_2^{n-1}
...
y_n = \alpha_0 + \alpha_1 x_n + \alpha_2 x_n^2 + ... + \alpha_{n-1} x_n^{n-1}

that is, A \alpha = y
where the x_i, i = 1,...,n, are the interpolation nodes and A is the so-called Vandermonde matrix

A = ( 1  x_1  x_1^2  ...  x_1^{n-1}
      1  x_2  x_2^2  ...  x_2^{n-1}
      ...
      1  x_n  x_n^2  ...  x_n^{n-1} )
Note that the interpolating polynomial is unique: if two polynomials P and Q of degree at most n-1 both interpolate the data, then P*(x) = P(x) - Q(x) is at most of degree n-1 and is zero at all n nodes; but the only such polynomial is the zero polynomial, so that P = Q.
Here again, the method may be quite demanding from a computational point of view, as we have to invert a matrix of size (n x n). There exist much more efficient methods, in particular the so-called Chebychev approximation, which works very well for smooth functions.
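For completeness, a minimal sketch of the monomial (Vandermonde) approach in MATLAB, for an illustrative function and set of nodes:

n = 5;
x = linspace(-1,1,n)';                % interpolation nodes (illustrative)
y = exp(x);                           % data to interpolate (illustrative)
A = ones(n,n);                        % Vandermonde matrix
for j = 2:n
   A(:,j) = A(:,j-1).*x;
end
alpha = A\y;                          % coefficients of P(x) = sum_i alpha_i x^(i-1)
Pq    = polyval(flipud(alpha),0.3);   % evaluate the interpolant at a query point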
3.5 Chebychev approximation
Chebychev polynomials are defined over the interval [-1; 1]. Nevertheless, this interval may be generalized to [a; b] by transforming the data using the formula

x = 2 \frac{y - a}{b - a} - 1   for y in [a; b]

Beyond the standard orthogonality property, Chebychev polynomials exhibit a discrete orthogonality property at the roots r_k, k = 1,...,n, of the polynomial T_n:

\sum_{k=1}^{n} T_i(r_k) T_j(r_k) = 0 for i \neq j,  n for i = j = 0,  n/2 for i = j \neq 0
Figure 3.5: Chebychev polynomials
The Chebychev least-squares approximation of a function F over [-1; 1] is obtained by setting

c_i = \frac{2}{\pi} \int_{-1}^{1} \frac{F(x) T_i(x)}{\sqrt{1 - x^2}} dx   for i = 0,...,n

and

g_n(x) \equiv \frac{c_0}{2} + \sum_{i=1}^{n} c_i T_i(x)
Theorem 5 Assume F is a C^k function over the interval [-1; 1], and let g_n denote its Chebychev approximation of degree n. Then there exists a constant c such that

||F - g_n||_{\infty} <= c \frac{log(n)}{n^k}

This theorem is of great importance, as it actually states that the approximation g_n(x) will be as close as we might want to F(x) as the degree of approximation n increases to infinity. In effect, since the approximation error is bounded by a term proportional to log(n)/n^k, it shrinks to zero as n increases. Further, the next theorem establishes a useful property of the coefficients of the approximation.
Theorem 6 Assume F is a C^k function over the interval [-1; 1] that admits the Chebychev representation

F(x) = \frac{c_0}{2} + \sum_{i=1}^{\infty} c_i T_i(x)

Then there exists a constant c such that |c_j| <= \frac{c}{j^k} for all j >= 1.
In practice, the approximation of a function F over an interval [a; b] by a polynomial of the form

G(x) \equiv \sum_{i=0}^{n} \alpha_i T_i( 2 \frac{x - a}{b - a} - 1 )

may be obtained by the following algorithm (Chebychev regression):

1. Choose the degree n of the approximation and a number of nodes m >= n + 1, and compute the m Chebychev interpolation nodes on [-1; 1]:

r_k = cos( \frac{2k - 1}{2m} \pi )   for k = 1,...,m

2. Adjust the nodes, r_k, to fit in the [a; b] interval:

x_k = (r_k + 1) \frac{b - a}{2} + a   for k = 1,...,m

3. Evaluate the function at each node: y_k = F(x_k), k = 1,...,m.

4. Compute the (n+1) Chebychev coefficients

\alpha_i = \frac{\sum_{k=1}^{m} y_k T_i(r_k)}{\sum_{k=1}^{m} T_i(r_k)^2}   for i = 0,...,n

and form the approximation

G(x) \equiv \sum_{i=0}^{n} \alpha_i T_i( 2 \frac{x - a}{b - a} - 1 )
Note that each \alpha_i takes the same form as a simple OLS coefficient, cov(y, T_i(x))/var(T_i(x)): because of the discrete orthogonality of the Chebychev polynomials at the nodes, the regressors are orthogonal, and the whole vector of coefficients may equivalently be obtained from the OLS regression of Y on X, where

X = ( T_0(x_1)  T_1(x_1)  ...  T_n(x_1)
      T_0(x_2)  T_1(x_2)  ...  T_n(x_2)
      ...
      T_0(x_m)  T_1(x_m)  ...  T_n(x_m) )   and   Y = ( y_1 ; y_2 ; ... ; y_m )
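A minimal MATLAB sketch of this Chebychev regression, written for the smooth example discussed next (F(x) = x^0.1 on [0.01; 2], degree 6, 100 nodes):

n = 6;  m = 100;                        % degree and number of nodes
a = 0.01;  b = 2;                       % interval
F = @(x) x.^0.1;                        % function to approximate
r = cos((2*(1:m)'-1)*pi/(2*m));         % Chebychev nodes on [-1;1]
x = (r+1)*(b-a)/2 + a;                  % nodes adjusted to [a;b]
y = F(x);                               % data
T = [ones(m,1) r];                      % Chebychev polynomials evaluated at the nodes
for i = 3:n+1
   T = [T 2*r.*T(:,i-1)-T(:,i-2)];
end
alpha = (T'*y)./diag(T'*T);             % Chebychev coefficients alpha_0,...,alpha_n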
We now report two examples implementing the algorithm. The first one deals with a smooth function of the type F(x) = x^{\theta}. The second one evaluates the accuracy of the approximation in the case of a non-smooth function, F(x) = min(max(-1.5, (x - 1/2)^3), 2).

In the case of the smooth function, we set \theta = 0.1 and approximate the function over the interval [0.01; 2]. We select 100 nodes and evaluate the accuracy of degree 2 and degree 6 approximations. Figure 3.6 reports the true function and the corresponding approximations, and table 3.5 reports the coefficients. As can be seen from the table, adding terms to the approximation does not alter the coefficients of lower degree. This just reflects the orthogonality properties of the Chebychev polynomials that we saw in the formula determining each \alpha_i. It is of great importance, as it states that once we have obtained a high-order approximation, obtaining lower orders is particularly simple. This is the economization principle. Further, as can be seen from the figure, a good approximation to the function is obtained at rather low degrees: the difference between the function and its approximation at order 6 is already small.
Table 3.5: Chebychev coefficients, F(x) = x^0.1

       n=2       n=6
c0     0.9547    0.9547
c1     0.1567    0.1567
c2    -0.0598   -0.0598
c3               0.0324
c4              -0.0202
c5               0.0136
c6              -0.0096
Figure 3.6: Chebychev approximation of F(x) = x^0.1 (left panel: true function and approximations of order 2 and 6; right panel: residuals)
We now turn to the non-smooth function F(x) = min(max(-1.5, (x - 1/2)^3), 2) over [-3; 3], again approximated by G(x) = \sum_{i=0}^{n} \alpha_i T_i(2 \frac{x-a}{b-a} - 1), with n = 7 and n = 15. Table 3.6 reports the coefficients and figure 3.7 the approximations and their residuals.
Table 3.6: Chebychev coefficients, F(x) = min(max(-1.5, (x - 1/2)^3), 2)

        n=7       n=15
c0     -0.0140   -0.0140
c1      2.0549    2.0549
c2      0.4176    0.4176
c3     -0.3120   -0.3120
c4     -0.1607   -0.1607
c5     -0.0425   -0.0425
c6     -0.0802   -0.0802
c7      0.0571    0.0571
c8                0.1828
c9                0.0275
c10              -0.1444
c11              -0.0686
c12               0.0548
c13               0.0355
c14              -0.0012
c15               0.0208
Figure 3.7: Chebychev approximation of F(x) = min(max(-1.5, (x - 1/2)^3), 2) (approximations of order 7 and 15 and residuals)
Note that the kinks of F occur at the two points x_ and x- where (x_ - 1/2)^3 = -1.5 and (x- - 1/2)^3 = 2; over the interval [x_ ; x-] the function is exactly cubic, so that an approximation restricted to that interval would be perfect with n = 3. This suggests that piecewise approximation may be of interest in a number of cases.
3.6 Piecewise interpolation
The simplest piecewise polynomial approximations are built from B-splines. Consider a grid of nodes x_0 < x_1 < ... < x_m. The B-splines of order 0 are the step functions, for i = 0,...,m-1,

B_i^0(x) = 0 if x < x_i,  1 if x_i <= x <= x_{i+1},  0 if x > x_{i+1}

the B-splines of order 1 are the tent functions

B_i^1(x) = 0 if x < x_i,  \frac{x - x_i}{x_{i+1} - x_i} if x_i <= x <= x_{i+1},  \frac{x_{i+2} - x}{x_{i+2} - x_{i+1}} if x_{i+1} <= x <= x_{i+2},  0 if x > x_{i+2}

and higher-order B-splines are obtained from the recursion

B_i^n(x) = \frac{x - x_i}{x_{i+n} - x_i} B_i^{n-1}(x) + \frac{x_{i+n+1} - x}{x_{i+n+1} - x_{i+1}} B_{i+1}^{n-1}(x)
Cubic splines are the most widely used splines to interpolate functions. Therefore, we will describe the method in greater detail in this case. Let the spline on each subinterval [x_i; x_{i+1}], i = 0,...,n-1, be given by

S_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3   (3.6)

The first set of restrictions imposes that the spline passes through the data points, that is a_i = y_i for i = 0,...,n-1, and

a_{n-1} + b_{n-1}(x_n - x_{n-1}) + c_{n-1}(x_n - x_{n-1})^2 + d_{n-1}(x_n - x_{n-1})^3 = y_n   (3.7)
The second set of restrictions imposes continuity of the function at the upper bound of each interval,

S_i(x_i) = S_{i-1}(x_i) for i = 1,...,n-1

which implies, noting h_i = x_i - x_{i-1},

a_i = a_{i-1} + b_{i-1} h_i + c_{i-1} h_i^2 + d_{i-1} h_i^3   for i = 1,...,n-1   (3.8)
This furnishes 2n identification restrictions, such that 2n additional restrictions are still needed. Since we are dealing with cubic spline interpolation, the approximation is required to be C^2, implying that first- and second-order derivatives should be continuous. This yields the following n-1 conditions on the first-order derivatives,

S_i'(x_i) = S_{i-1}'(x_i) for i = 1,...,n-1,   or   b_i = b_{i-1} + 2 c_{i-1} h_i + 3 d_{i-1} h_i^2   (3.9)

and the following n-1 conditions on the second-order derivatives,

S_i''(x_i) = S_{i-1}''(x_i) for i = 1,...,n-1,   or   2 c_i = 2 c_{i-1} + 6 d_{i-1} h_i   (3.10)
We are still short of two conditions, which are imposed at the two endpoints. One possibility, when the derivative of F is known at the endpoints, is to impose S_0'(x_0) = F'(x_0) and S_{n-1}'(x_n) = F'(x_n). This is the so-called Hermite spline. However, in a number of situations such information on the derivative of F is either not known or does not exist (think of F not being differentiable at some points), such that a further source of information is needed. One can then rely on an approximation of the slope by the secant line. This is what is proposed by the secant Hermite spline, which amounts to approximating F'(x_0) and F'(x_n) by the secant line over the corresponding interval:

S_0'(x_0) = \frac{S_0(x_1) - S_0(x_0)}{x_1 - x_0}   and   S_{n-1}'(x_n) = \frac{S_{n-1}(x_n) - S_{n-1}(x_{n-1})}{x_n - x_{n-1}}

Another popular choice, used below, is the natural spline, which sets the second derivative to zero at both endpoints (in particular c_0 = 0). The system can then be solved as follows. From (3.10), d_{i-1} = \frac{1}{3 h_i}(c_i - c_{i-1}) for i = 1,...,n-1; substituting into (3.8) and (3.9) to eliminate the b's and d's, the c's are found to satisfy

\frac{3}{h_{i+1}} (y_{i+1} - y_i) - \frac{3}{h_i} (y_i - y_{i-1}) = h_i c_{i-1} + 2(h_i + h_{i+1}) c_i + h_{i+1} c_{i+1}
In matrix form, the c coefficients therefore solve A c = B, where A is the (n-1)x(n-1) tridiagonal matrix

A = ( 2(h_0+h_1)   h_1
      h_1          2(h_1+h_2)   h_2
                   h_2          2(h_2+h_3)   h_3
                                ...
                                h_{n-3}      2(h_{n-3}+h_{n-2})   h_{n-2}
                                             h_{n-2}              2(h_{n-2}+h_{n-1}) )

and

c = ( c_1 ; ... ; c_{n-1} ),   B = ( \frac{3}{h_1}(y_2 - y_1) - \frac{3}{h_0}(y_1 - y_0) ; ... ; \frac{3}{h_{n-1}}(y_n - y_{n-1}) - \frac{3}{h_{n-2}}(y_{n-1} - y_{n-2}) )
The matrix A is tridiagonal (and therefore sparse); it is also symmetric and strictly diagonally dominant with a positive diagonal, hence positive definite and therefore invertible. We then obtain all the c_i, i = 1,...,n-1, and can compute the b's and d's as
b_{i-1} = \frac{y_i - y_{i-1}}{h_i} - \frac{1}{3}(c_i + 2 c_{i-1}) h_i   for i = 1,...,n-1,   and   b_{n-1} = \frac{y_n - y_{n-1}}{h_n} - \frac{2 c_{n-1} h_n}{3}

and

d_{i-1} = \frac{1}{3 h_i}(c_i - c_{i-1})   for i = 1,...,n-1,   and   d_{n-1} = - \frac{c_{n-1}}{3 h_n}

Finally, we have had a_i = y_i, i = 0,...,n-1, from the very beginning.
Once the coefficients are in hand, the evaluation of the spline at arbitrary points has to be undertaken. The only difficult part in this job is to identify the interval the value of the argument of the function belongs to, i.e. we have to find i in {0,...,n-1} such that x in [x_i, x_{i+1}]. Nevertheless, as long as the nodes are generated using an invertible formula, there will be no cost to determining the interval. Most of the time, a uniform grid is used, such that the interval [a; b] is divided using the linear scheme x_i = a + i \Delta, where \Delta = (b-a)/(n-1) and i = 0,...,n-1. In such a case, it is particularly simple to locate the interval, since i is just the integer part of (x - a)/\Delta. In the example that follows we deliberately avoid the simple 4-node grid {-3, 0.5 - 1.5^{1/3}, 0.5 + 2^{1/3}, 3}, as taking this grid would yield a perfect approximation (remember that the central part of the function is cubic!).
As an example of spline approximation, figure 3.8 reports the spline approximation to the non-smooth function F(x) = min(max(-1.5, (x - 1/2)^3), 2), considering a uniform grid over the [-3; 3] interval with 3, 7 and 15 nodes. Figure 3.9 reports the corresponding L2 and L-infinity approximation errors as a function of the number of nodes.

Figure 3.8: Cubic spline approximation of F(x) = min(max(-1.5, (x - 1/2)^3), 2)

As can be seen from figure 3.9, increasing the number of nodes improves the approximation, in that the error is driven to zero. Nevertheless, it also appears that convergence is not monotonic in the case of the L-infinity error. This is actually related to the fact that F, in this case, is not even C^1 on the overall interval. In fact, as soon as we consider a smooth function this convergence is monotonic, as can be seen from the lower panel, which reports the same errors for the function F(x) = x^0.1 over the interval [0.01; 2]. This actually illustrates the following result.
Theorem 7 Let F be a C^4 function over the interval [x_0; x_n], let S be its cubic spline approximation on {x_0, x_1,...,x_n}, and let \delta >= max_i {x_i - x_{i-1}}. Then

||F - S||_{\infty} <= \frac{5}{384} ||F^{(4)}||_{\infty} \delta^4

and

||F' - S'||_{\infty} <= \frac{9 + \sqrt{3}}{216} ||F^{(4)}||_{\infty} \delta^3

This theorem gives upper bounds on the spline approximation error and indicates that these bounds decrease at a fast pace (a power of 4 of the mesh size) as the number of nodes increases.
Figure 3.9: Approximation errors (L2 and L-infinity errors against the number of nodes; upper panels: F(x) = min(max(-1.5, (x - 1/2)^3), 2); lower panels: F(x) = x^0.1)
The following Matlab code computes the coefficients of the (natural) cubic spline on a uniform grid; the function, interval and number of nodes follow the example above.

F    = @(x) min(max(-1.5,(x-0.5).^3),2);     % function to interpolate (as in the example)
nbx  = 15;                                   % number of nodes
xmin = -3;                                   % lower bound of the interval
xmax = 3;                                    % upper bound of the interval
dx   = (xmax-xmin)/(nbx-1);                  % uniform step
x    = (xmin:dx:xmax)';                      % grid of nodes
y    = F(x);                                 % data to interpolate

A    = spalloc((nbx-2),(nbx-2),3*nbx-8);     % creates sparse matrix A
B    = zeros((nbx-2),1);                     % creates vector B
A(1,[1 2]) = [2*(dx+dx) dx];
B(1) = 3*(y(3)-y(2))/dx-3*(y(2)-y(1))/dx;
for i=2:nbx-3;
   A(i,[i-1 i i+1]) = [dx 2*(dx+dx) dx];
   B(i) = 3*(y(i+2)-y(i+1))/dx-3*(y(i+1)-y(i))/dx;
end
A(nbx-2,[nbx-3 nbx-2]) = [dx 2*(dx+dx)];
B(nbx-2) = 3*(y(nbx)-y(nbx-1))/dx-3*(y(nbx-1)-y(nbx-2))/dx;

c    = [0;A\B];                              % c_0 = 0 (natural spline), then solve for c_1..c_{n-1}
a    = y(1:nbx-1);                           % a coefficients
b    = (y(2:nbx)-y(1:nbx-1))/dx-dx*([c(2:nbx-1);0]+2*c(1:nbx-1))/3;
d    = ([c(2:nbx-1);0]-c(1:nbx-1))/(3*dx);
S    = [a;b;c(1:nbx-1);d];                   % Matrix of spline coefficients
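Once the coefficients are computed, evaluating the spline amounts to locating the interval and applying the cubic formula; a minimal sketch using the objects defined above:

xq = linspace(xmin,xmax,200)';                 % query points
j  = min(floor((xq-xmin)/dx)+1,nbx-1);         % interval index: xq in [x(j); x(j+1)]
t  = xq - x(j);
Sq = a(j) + b(j).*t + c(j).*t.^2 + d(j).*t.^3; % spline values at the query points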
One potential problem that may arise with the type of methods we have developed until now is that we have not imposed any particular restriction on the shape of the approximation relative to the true function. This may be of great importance in some cases. Assume, for instance, that we need to approximate a function F(x_t) that characterizes the dynamics of a variable x_t: if the approximation fails to preserve properties of F such as monotonicity or concavity, the approximated dynamics may display spurious behavior.
3.7 Shape-preserving approximations

In this section, we will see an approximation method that preserves the shape of the function we want to approximate. This method was proposed by Schumaker [1983] and essentially amounts to exploiting information on both the level and the slope of the function to be approximated to build a smooth approximation. We will deal with two situations. The first one, Hermite interpolation, assumes that we have information on both the level and the slope of the function to approximate. The second one, which uses Lagrange data, assumes that no information on the slope of the function is available. Both methods were originally developed using quadratic splines.
3.7.1 Hermite interpolation
This method assumes that we have information on both the level and the slope of the function to be approximated. Assume we want to approximate the function F on the interval [x_1, x_2] and we know y_i = F(x_i) and z_i = F'(x_i), i = 1, 2. Schumaker proposes to build a quadratic function S(x) on [x_1; x_2] that satisfies

S(x_i) = y_i and S'(x_i) = z_i for i = 1, 2

Schumaker first establishes the following:

Lemma 1 If \frac{z_1 + z_2}{2} = \frac{y_2 - y_1}{x_2 - x_1}, then the quadratic function

S(x) = y_1 + z_1 (x - x_1) + \frac{z_2 - z_1}{2(x_2 - x_1)} (x - x_1)^2

satisfies the interpolation conditions above, with S'(x) = z_1 + \frac{z_2 - z_1}{x_2 - x_1}(x - x_1).
However, the conditions stated by this lemma are extremely stringent and do not usually apply, such that we have to adapt the procedure. This may be done by adding a node between x_1 and x_2 and constructing another spline that satisfies the lemma.
Lemma 2 For every x* in (x_1, x_2) there exists a unique quadratic spline solving the interpolation conditions, of the form

S(x) = \alpha_{01} + \alpha_{11} (x - x_1) + \alpha_{21} (x - x_1)^2   for x in [x_1; x*]
S(x) = \alpha_{02} + \alpha_{12} (x - x*) + \alpha_{22} (x - x*)^2    for x in [x*; x_2]

where

\alpha_{01} = y_1,   \alpha_{11} = z_1,   \alpha_{21} = \frac{z- - z_1}{2(x* - x_1)},
\alpha_{02} = y_1 + \frac{z- + z_1}{2}(x* - x_1),   \alpha_{12} = z-,   \alpha_{22} = \frac{z_2 - z-}{2(x_2 - x*)},

and where z- is pinned down by the requirement that S(x_2) = y_2, namely z- = \frac{2(y_2 - y_1) - z_1(x* - x_1) - z_2(x_2 - x*)}{x_2 - x_1}.
While the latter lemma fully characterizes the quadratic spline, it gives no information on x*, which therefore remains to be selected. x* will be set such that the spline matches the desired shape properties. First note that if z_1 and z_2 are both positive (negative), then S(x) is monotone if and only if z_1 z- >= 0 (<= 0), which is equivalent to

2(y_2 - y_1) >=< (x* - x_1) z_1 + (x_2 - x*) z_2   if z_1, z_2 >=< 0
This essentially deals with the monotonicity problem; we now have to tackle the question of curvature. To do so, we compute the slope of the secant line between x_1 and x_2,

\Delta = \frac{y_2 - y_1}{x_2 - x_1}

Then, if (z_2 - \Delta)(z_1 - \Delta) >= 0, this indicates the presence of an inflexion point in the interval [x_1; x_2], such that the interpolant can be neither convex nor concave. Conversely, if |z_2 - \Delta| < |z_1 - \Delta| and x* satisfies

x_1 < x* <= x- \equiv x_1 + \frac{2(x_2 - x_1)(z_2 - \Delta)}{z_2 - z_1}

or if |z_2 - \Delta| > |z_1 - \Delta| and x* satisfies

x- \equiv x_2 + \frac{2(x_2 - x_1)(z_1 - \Delta)}{z_2 - z_1} <= x* < x_2

then the interpolant preserves the curvature suggested by the data (see Schumaker [1983] for the precise statements).
In practice, x* may be selected by the following procedure:

1. Compute \Delta = (y_2 - y_1)/(x_2 - x_1).
2. If (z_1 + z_2)/2 = \Delta, use the spline of lemma 1 and stop; else go to 3.
3. If (z_1 - \Delta)(z_2 - \Delta) >= 0, set x* = (x_1 + x_2)/2 and stop; else go to 4.
4. If |z_1 - \Delta| < |z_2 - \Delta|, set x* = (x_1 + x-)/2 and stop; else go to 5.
5. If |z_1 - \Delta| > |z_2 - \Delta|, set x* = (x_2 + x-)/2 and stop.

We then have in hand a value of x* for [x_1; x_2]. We apply the same procedure to each subinterval to get x*_i in [x_i; x_{i+1}], and then solve the general interpolation problem as explained in lemma 2.
Note that everything so far assumes that we have Hermite data in hand, i.e. {x_i, y_i, z_i : i = 0,...,n}. However, knowledge of the slope is usually not the rule, and we therefore have to adapt the algorithm to such situations.
3.7.2 Lagrange data
Assume now that we do not have any data on the slope of the function, that is, we are only endowed with Lagrange data {x_i, y_i : i = 0,...,n}. In such a case, we just have to add the needed information, namely an estimate of the slope of the function at each node. Define

L_i = ( (x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2 )^{1/2}   and   \delta_i = \frac{y_{i+1} - y_i}{x_{i+1} - x_i}

and set

z_i = \frac{L_{i-1} \delta_{i-1} + L_i \delta_i}{L_{i-1} + L_i} if \delta_{i-1} \delta_i > 0,   0 if \delta_{i-1} \delta_i <= 0,   for i = 2,...,n-1

and

z_1 = \frac{3 \delta_1 - z_2}{2}   and   z_n = \frac{3 \delta_{n-1} - z_{n-1}}{2}

Then, we just apply exactly the same procedure as described in the previous section.
3.8 Multidimensional approximations
Computing a multidimensional approximation to a function may be quite cumbersome and even impossible in some cases. To understand the problem, let us restate an example provided by Judd [1998]. Assume we have data points {P_1, P_2, P_3, P_4} = {(1,0), (-1,0), (0,1), (0,-1)} in R^2 and the corresponding data z_i = F(P_i), i = 1,...,4. Assume now that we want to construct the approximation of the function F using a linear combination of {1, x, y, xy}, defined as

G(x, y) = a + b x + c y + d x y

such that G(x_i, y_i) = z_i. Finding a, b, c, d amounts to solving the linear system

( 1   1   0   0 ) ( a )   ( z_1 )
( 1  -1   0   0 ) ( b ) = ( z_2 )
( 1   0   1   0 ) ( c )   ( z_3 )
( 1   0  -1   0 ) ( d )   ( z_4 )

But the matrix of this system is singular: its last column is identically zero, since x_i y_i = 0 at every data point. The system therefore has no solution in general, even though we have as many basis functions as data points: in higher dimensions, the choice of the basis and the choice of the grid of points have to be made jointly.
3.8.1 Tensor product bases
The idea here is to use the tensor product of univariate functions to form a basis of multivariate functions. In order to better understand this point, let us consider that we want to approximate a function F : R^2 --> R using simple polynomials, and take as univariate bases X = {1, x, x^2} and Y = {1, y, y^2}. The tensor product basis is then

{1, x, y, xy, x^2, y^2, x^2 y, x y^2, x^2 y^2}

i.e. all possible 2-term products of elements belonging to X and Y. We are now in a position to define the n-fold tensor product basis for functions of n variables {x_1,...,x_i,...,x_n}.
Definition 10 Given a basis P_i = {p_k^i(x_i)}_{k=0}^{\kappa_i} for functions of each single variable x_i, i = 1,...,n, the tensor product basis is given by

B = { \prod_{i=1}^{n} p_{k_i}^i(x_i) : 0 <= k_i <= \kappa_i, i = 1,...,n }
An important problem with this type of tensor product basis is its size. For example, considering an m-dimensional space with polynomials of order n, we already get (n+1)^m terms! This exponential growth in the number of terms makes it particularly costly to use this type of basis as soon as the number of terms or the number of nodes is high. Nevertheless, it will often be satisfactory or sufficient for low enough polynomial orders (in practice, n = 2!). Therefore, one often relies on less computationally costly bases.
3.8.2 Complete polynomials
The complete set of polynomials of total degree \kappa in n variables is

B^c_{\kappa} \equiv { x_1^{k_1} ... x_n^{k_n} : k_1,...,k_n >= 0, \sum_{i=1}^{n} k_i <= \kappa }

To see this more clearly, let us consider the example developed in the previous section (X = {1, x, x^2} and Y = {1, y, y^2}) and let us assume that \kappa = 2. In this case, the complete polynomials basis is

B^c = {1, x, y, x^2, y^2, xy} = B \ {x y^2, x^2 y, x^2 y^2}
Note that we have actually already encountered this type of basis, as this is typically what Taylor's theorem for many dimensions delivers:

F(x) \simeq F(x*) + \sum_{i=1}^{n} \frac{\partial F}{\partial x_i}(x*)(x_i - x_i*) + ... + \frac{1}{\kappa!} \sum_{i_1=1}^{n} ... \sum_{i_{\kappa}=1}^{n} \frac{\partial^{\kappa} F}{\partial x_{i_1} ... \partial x_{i_{\kappa}}}(x*)(x_{i_1} - x*_{i_1}) ... (x_{i_{\kappa}} - x*_{i_{\kappa}})

For instance, in the two-dimensional case with \kappa = 2, the expansion is

F(x,y) \simeq F(x*,y*) + F_x(x*,y*)(x - x*) + F_y(x*,y*)(y - y*)
        + \frac{1}{2} ( F_{xx}(x*,y*)(x - x*)^2 + 2 F_{xy}(x*,y*)(x - x*)(y - y*) + F_{yy}(x*,y*)(y - y*)^2 )

which rewrites

F(x,y) \simeq \beta_0 + \beta_1 x + \beta_2 y + \beta_3 x^2 + \beta_4 y^2 + \beta_5 x y

such that the implicit polynomial basis is the complete polynomials basis of order 2 with 2 variables.
The key difference between tensor product bases and complete polynomial bases lies essentially in the rate at which the size of the basis increases. As aforementioned, tensor product bases grow exponentially while complete polynomial bases only grow polynomially, which reduces the computational cost of the approximation. But what do we lose using complete polynomials rather than tensor product bases? From a theoretical point of view, Taylor's theorem gives us the answer: nothing! Indeed, Taylor's theorem indicates that the elements of B^c deliver an approximation in the neighborhood of x* that exhibits an asymptotic degree of convergence equal to \kappa. The n-fold tensor product, B, can deliver only a \kappa-th degree of convergence as well, since it does not contain all terms of degree \kappa + 1. In other words, complete polynomials and tensor product bases deliver the same degree of asymptotic convergence, and complete polynomial based approximations therefore achieve essentially the same accuracy as tensor product based approximations.
Once we have chosen a basis, we can proceed to approximation. For example, we may use Chebychev approximation in higher-dimensional problems. Judd [1998] reports the algorithm for this problem. As we will see, it takes advantage of a very nice feature of orthogonal polynomials: they inherit their orthogonality property when extended to higher dimensions. Let us then assume that we want to compute the Chebychev approximation of a 2-dimensional function F(x, y) over the interval [a_x; b_x] x [a_y; b_y] and let us assume, to keep things simple for a while, that we use a tensor product basis. Then the algorithm is as follows:
1. Choose polynomial orders n_x (for x) and n_y (for y), and numbers of nodes m_x >= n_x + 1 and m_y >= n_y + 1.

2. Compute the Chebychev interpolation nodes on [-1; 1]:

z_k^x = cos( \frac{2k-1}{2 m_x} \pi ), k = 1,...,m_x   and   z_k^y = cos( \frac{2k-1}{2 m_y} \pi ), k = 1,...,m_y

3. Adjust the nodes to the intervals [a_x; b_x] and [a_y; b_y]:

x_k = a_x + (1 + z_k^x) \frac{b_x - a_x}{2}, k = 1,...,m_x   and   y_k = a_y + (1 + z_k^y) \frac{b_y - a_y}{2}, k = 1,...,m_y

4. Evaluate the function at each pair of nodes: F(x_k, y_l), k = 1,...,m_x, l = 1,...,m_y.

5. Compute the matrix of Chebychev coefficients

\alpha_{ij} = \frac{ \sum_{k=1}^{m_x} \sum_{l=1}^{m_y} F(x_k, y_l) T_i^x(z_k^x) T_j^y(z_l^y) }{ ( \sum_{k=1}^{m_x} T_i^x(z_k^x)^2 ) ( \sum_{l=1}^{m_y} T_j^y(z_l^y)^2 ) }

or, in matrix form, \alpha = ( T^x(z^x)' \Omega T^y(z^y) ) ./ ( ||T^x(z^x)||^2 ||T^y(z^y)||^2 ), where \Omega is the matrix of function values and ./ denotes element-by-element division.

6. Form the approximation

G(x, y) = \sum_{i=0}^{n_x} \sum_{j=0}^{n_y} \alpha_{ij} T_i^x( 2 \frac{x - a_x}{b_x - a_x} - 1 ) T_j^y( 2 \frac{y - a_y}{b_y - a_y} - 1 )
As an illustration of the algorithm, we compute the approximation of the CES function

F(x, y) = ( x^{\rho} + y^{\rho} )^{1/\rho}

on the [0.01; 2] x [0.01; 2] interval for \rho = 0.75. We use 5th-order polynomials for both x and y and 20 nodes for both x and y, such that there are 400 possible interpolation nodes. Applying the algorithm we just described, we get the matrix of coefficients reported in table 3.7. As can be seen from the table, most of the coefficients are close to zero as soon as they involve the cross-product of higher-order terms, such that using a complete polynomials basis would yield the same efficiency at a lower computational cost. Figure 3.10 reports the graph of the residuals of the approximation.
Table 3.7: Matrix of Chebychev coefficients (tensor product basis)

kx\ky       0         1         2         3         4         5
0         2.4251    1.2744   -0.0582    0.0217   -0.0104    0.0057
1         1.2744    0.2030   -0.0366    0.0124   -0.0055    0.0029
2        -0.0582   -0.0366    0.0094   -0.0037    0.0018   -0.0009
3         0.0217    0.0124   -0.0037    0.0016   -0.0008    0.0005
4        -0.0104   -0.0055    0.0018   -0.0008    0.0004   -0.0003
5         0.0057    0.0029   -0.0009    0.0005   -0.0003    0.0002
%
% Step 1 and 2: orders, numbers of nodes, parameters and Chebychev nodes
% (values as stated in the text)
%
nx = 5; ny = 5; mx = 20; my = 20; rho = 0.75;
ax = 0.01; bx = 2; ay = 0.01; by = 2;
rx = cos((2*(1:mx)'-1)*pi/(2*mx));
ry = cos((2*(1:my)'-1)*pi/(2*my));
x  = ax+(1+rx)*(bx-ax)/2;
y  = ay+(1+ry)*(by-ay)/2;
%
% Step 3: evaluate the function at the nodes
%
Y = zeros(mx,my);
for ix=1:mx;
   for iy=1:my;
      Y(ix,iy) = (x(ix)^rho+y(iy)^rho)^(1/rho);
   end
end
%
% Step 4: Chebychev coefficients
%
Xx = [ones(mx,1) rx];
for i=3:nx+1;
   Xx = [Xx 2*rx.*Xx(:,i-1)-Xx(:,i-2)];
end
Xy = [ones(my,1) ry];
for i=3:ny+1;
   Xy = [Xy 2*ry.*Xy(:,i-1)-Xy(:,i-2)];
end
T2x = diag(Xx'*Xx);              % sums of squared T_i at the nodes (x direction)
T2y = diag(Xy'*Xy);              % sums of squared T_j at the nodes (y direction)
a   = (Xx'*Y*Xy)./(T2x*T2y');    % matrix of coefficients alpha_ij
Figure 3.10: Residuals of the Chebychev approximation of the CES function (tensor product basis)
If we now want to perform the same approximation using a complete polynomials basis, we just have to modify the algorithm to take into account the fact that, when iterating on i and j, we want to impose i + j <= \kappa. Let us compute it for \kappa = 5. This implies that the basis will consist of
1, T1x (.), T1y (.), T2x (.), T2y (.), T3x (.), T3y (.), T4x (.), T4y (.), T5x (.), T5y (.),
T1x (.)T1y (.), T1x (.)T2y (.), T1x (.)T3y (.), T1x (.)T4y (.),
T2x (.)T1y (.), T2x (.)T2y (.), T2x (.)T3y (.),
T3x (.)T1y (.), T3x (.)T2y (.),
T4x (.)T1y (.)
Table 3.8: Matrix of Chebychev coefficients (complete polynomials basis)

kx\ky       0         1         2         3         4         5
0         2.4251    1.2744   -0.0582    0.0217   -0.0104    0.0057
1         1.2744    0.2030   -0.0366    0.0124   -0.0055
2        -0.0582   -0.0366    0.0094   -0.0037
3         0.0217    0.0124   -0.0037
4        -0.0104   -0.0055
5         0.0057
A first thing to note is that the coefficients that remain are the same as the ones we got with the tensor product basis. This should not come as a surprise, as it is just the expression of the Chebychev economization we already encountered in the unidimensional case, which is a direct consequence of the orthogonality of the Chebychev polynomials. Figure 3.11 reports the residuals from the approximation using the complete basis. As can be seen from the figure, this constrained approximation yields quantitatively similar results compared to the tensor product basis, therefore achieving almost the same accuracy while being less costly from a computational point of view. The only lines of the Matlab code affected by the adoption of the complete polynomials basis are those of step 4; a sketch is given below.
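One possible way of modifying step 4, reusing the objects Xx, Xy, Y, T2x and T2y defined above (this is a sketch, not necessarily the original implementation):

kappa = 5;                                   % total degree of the complete basis
ac    = zeros(nx+1,ny+1);
for i = 0:nx
   for j = 0:ny
      if i+j <= kappa
         ac(i+1,j+1) = (Xx(:,i+1)'*Y*Xy(:,j+1))/(T2x(i+1)*T2y(j+1));
      end
   end
end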
Figure 3.11: Residuals of the Chebychev approximation of the CES function (complete polynomials basis)
3.9 Finite element approximations

Finite elements are extremely popular among engineers (especially in aeronautics). This approach considers elements that are zero over most of the domain of approximation. Although they are extremely powerful in the case of 2-dimensional problems, they are more difficult to implement in higher dimensions. We therefore focus on the bidimensional case.
3.9.1 Bilinear approximations
A bilinear interpolation proposes to interpolate data linearly in both coordinate directions. Assume that we have the values of a function F(x, y) at the four points

P_1 = (-1, -1),   P_2 = (-1, 1),   P_3 = (1, -1),   P_4 = (1, 1)

and define the associated cardinal functions b_1(x,y) = (1-x)(1-y)/4, b_2(x,y) = (1-x)(1+y)/4, b_3(x,y) = (1+x)(1-y)/4 and b_4(x,y) = (1+x)(1+y)/4, each of which equals one at its own point and zero at the three others. Then

F(x, y) \simeq F(-1,-1) b_1(x,y) + F(-1,1) b_2(x,y) + F(1,-1) b_3(x,y) + F(1,1) b_4(x,y)

If we have data on [a_x; b_x] x [a_y; b_y], we use the linear transformation we have already encountered a great number of times,

x --> 2 \frac{x - a_x}{b_x - a_x} - 1,   y --> 2 \frac{y - a_y}{b_y - a_y} - 1,

and proceed as follows: locate the rectangle of the grid the point belongs to, map it into [-1; 1]^2, and apply the bilinear formula. The resulting approximation is continuous, but smoothness of the derivatives would only be insured if we were to construct biquadratic or bicubic interpolations, for example. Note that this type of interpolation scheme is typically what is done when a computer draws a 3D graph of a function. In figure 3.12, we plot the residuals of the bilinear approximation of the CES function we approximated in the previous section, with 5 uniform intervals (6 nodes, such that x = {0.010, 0.408, 0.806, 1.204, 1.602, 2.000} and y = {0.010, 0.408, 0.806, 1.204, 1.602, 2.000}).
Like in the spline approximation procedure, the most difficult step once we have obtained an approximation is to determine which square the point at which we want an approximation belongs to. We therefore face exactly the same type of problem.
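MATLAB's interp2 implements precisely this bilinear scheme; a small sketch on the CES example (grid values follow the text, the query point is illustrative):

rho   = 0.75;
F     = @(x,y) (x.^rho + y.^rho).^(1/rho);
xg    = linspace(0.01,2,6);  yg = linspace(0.01,2,6);   % 6 nodes in each direction
[X,Y] = meshgrid(xg,yg);
Z     = F(X,Y);
G     = interp2(xg,yg,Z,1.3,0.7,'linear');              % bilinear interpolant at (1.3,0.7)
res   = F(1.3,0.7) - G;                                 % pointwise residual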
Figure 3.12: Residuals of the bilinear approximation
3.9.2 Simplicial 2D linear interpolation

An alternative to interpolating on rectangles is to interpolate linearly on triangles (simplices) in the plane. To do so, and assuming that the Lagrange data have been mapped into the unit triangle with vertices P_1 = (0,0), P_2 = (0,1) and P_3 = (1,0), one associates with each vertex a cardinal function that equals one only at the point to which is associated the same index (i = j): b_1, b_2 and b_3 are the cardinal functions on P_1, P_2, P_3. Let us now add the point P_4 = (1,1); then we have the following cardinal functions for P_2, P_3, P_4:

b_4(x,y) = 1 - x,   b_5(x,y) = 1 - y,   b_6(x,y) = x + y - 1

Therefore, on the square P_1, P_2, P_3, P_4 the interpolant for F is given by

G(x,y) = F(0,0)(1 - x - y) + F(0,1) y + F(1,0) x                 if x + y <= 1
G(x,y) = F(0,1)(1 - x) + F(1,0)(1 - y) + F(1,1)(x + y - 1)       if x + y > 1
It should be clear to you that while these methods are pretty easy to implement in two dimensions, they become increasingly cumbersome as the dimension of the problem grows.
Bibliography

Hornik, K., M. Stinchcombe, and H. White, Multi-Layer Feedforward Networks are Universal Approximators, Neural Networks, 1989, 2, 359-366.

Judd, K.L., Numerical Methods in Economics, Cambridge, Massachusetts: MIT Press, 1998.

Schumaker, L.L., On Shape-preserving Quadratic Spline Interpolation, SIAM Journal of Numerical Analysis, 1983, 20, 854-864.
Contents

3 Approximation Methods
  3.1 Local approximations
      3.1.1 Taylor series expansion
  3.2 Regressions as approximation
      3.2.1 Orthogonal polynomials
  3.3 Least square orthogonal polynomial approximation
  3.4 Interpolation methods
      3.4.1 Linear interpolation
      3.4.2 Lagrange interpolation
  3.5 Chebychev approximation
  3.6 Piecewise interpolation
  3.7 Shape-preserving approximations
      3.7.1 Hermite interpolation
      3.7.2 Lagrange data
  3.8 Multidimensional approximations
      3.8.1 Tensor product bases
      3.8.2 Complete polynomials
  3.9 Finite element approximations
      3.9.1 Bilinear approximations
      3.9.2 Simplicial 2D linear interpolation