
Computational Methods for Aerospace

Engineering
Aero 423 Course Notes: Winter 2019

Krzysztof Fidkowski
University of Michigan, Ann Arbor, MI, 48109, USA

February 27, 2019


© 2019, K. Fidkowski
Contents

1 Mathematics Review
  1.1 Linear Algebra
    1.1.1 Vectors and Matrices
    1.1.2 Matrix Systems
    1.1.3 Eigenvalues and Eigenvectors
    1.1.4 Sample Systems in Aerospace Engineering
  1.2 Differential Equations
    1.2.1 Classification of PDEs
    1.2.2 Initial and Boundary Conditions
    1.2.3 Sample Equations in Aerospace Engineering
  1.3 Norms and Function Spaces

2 ODE Integration
  2.1 Introduction
    2.1.1 The Forward Euler Method
    2.1.2 The Predictor-Corrector Method
  2.2 Multi-Step Methods
  2.3 Accuracy, Consistency, Stability
    2.3.1 Truncation Error
    2.3.2 Consistency
    2.3.3 Global Error and Convergence
  2.4 Stability
    2.4.1 Zero Stability
    2.4.2 Eigenvalue Stability
    2.4.3 Implications for Convergence
  2.5 Multi-Stage Methods
    2.5.1 Predictor-Corrector Methods
    2.5.2 Runge-Kutta Methods
  2.6 Stiffness and Implicit Methods
    2.6.1 Stiffness
    2.6.2 Implicit Methods for Linear Systems
    2.6.3 Implicit Methods for Nonlinear Systems

3 Finite Difference Methods
  3.1 Finite Difference Approximations
  3.2 Truncation Error Analysis
  3.3 General Finite-Difference Formulas
    3.3.1 Undetermined coefficients
    3.3.2 Lagrange interpolation
  3.4 Problem Setup and Boundary Conditions
  3.5 Multiple Dimensions
  3.6 Stability Analysis
    3.6.1 Von-Neumann Analysis
    3.6.2 Matrix Stability

4 Probabilistic Methods
  4.1 Probability Review
    4.1.1 Outcomes and Events
    4.1.2 The Meaning of Probability
    4.1.3 Random Variables
    4.1.4 The Probability Density Function (PDF)
    4.1.5 Expected Value and Mean
    4.1.6 Variance and Standard Deviation
    4.1.7 The Cumulative Distribution Function (CDF)
    4.1.8 Percentiles
  4.2 Monte Carlo Sampling
    4.2.1 Introduction
    4.2.2 Sampling a Distribution
    4.2.3 Multiple Inputs
  4.3 Error Estimates for Monte Carlo
    4.3.1 Estimator of the Mean
    4.3.2 Standard Error of an Estimator

5 Finite Volume Methods
  5.1 Conservation Laws
  5.2 Cells and Meshes
  5.3 A First-Order Discretization
    5.3.1 One Dimension
    5.3.2 Two Dimensions
  5.4 The Finite Volume Method for Systems
  5.5 High-Order Accuracy

6 Optimization
  6.1 Introduction
  6.2 Gradient-Based Methods
    6.2.1 Steepest Descent
    6.2.2 Conjugate Gradients
    6.2.3 Line Search Methods
    6.2.4 Computing Gradients
  6.3 Gradient-Free Methods
    6.3.1 Nelder-Mead/Simplex Method
    6.3.2 Genetic Algorithms

7 Finite Element Methods
  7.1 Function Approximation
    7.1.1 Finite-Dimensional Function Spaces
    7.1.2 Local Support
  7.2 The Equations
    7.2.1 Collocation
    7.2.2 Weighted Residuals
  7.3 Numerical Integration
  7.4 The Finite Element Method in One Dimension
    7.4.1 Mesh and Basis Functions
    7.4.2 The Reference Element
    7.4.3 The Weak Form
    7.4.4 Boundary Conditions
    7.4.5 Matrix Assembly and Solution
  7.5 The Finite Element Method in Two Dimensions
    7.5.1 Mesh and Basis Functions
    7.5.2 The Reference Element
    7.5.3 The Weak Form
    7.5.4 Boundary Conditions
    7.5.5 Matrix Assembly
  7.6 High-Order Finite Elements
  7.7 Discontinuous Finite Elements
    7.7.1 Advection in One Dimension

Chapter 1

Mathematics Review

1.1 Linear Algebra


1.1.1 Vectors and Matrices
By u ∈ R^N we mean a vector, usually a column, of N real numbers. We often write

    u = [u_1, u_2, ..., u_N]^T,

where the superscript T denotes the transpose operation, which means flipping rows into
columns and vice-versa. A matrix A ∈ R^{M×N} is a two-dimensional array of real numbers,
with M rows and N columns,

    A = [ A_11  A_12  ...  A_1N ]
        [ A_21  A_22  ...  A_2N ]  ∈ R^{M×N},
        [  :     :          :   ]
        [ A_M1  A_M2  ...  A_MN ]

    A^T = [ A_11  A_21  ...  A_M1 ]
          [ A_12  A_22  ...  A_M2 ]  ∈ R^{N×M}.
          [  :     :          :   ]
          [ A_1N  A_2N  ...  A_MN ]
A dot or inner product between two column vectors of the same size, u, v ∈ R^N, is

    u · v = u^T v = v^T u = Σ_{n=1}^N u_n v_n = a scalar.   (1.1)

Two vectors are orthogonal if their dot product is zero. We can multiply a column vector
u ∈ R^N from the left by a matrix A ∈ R^{M×N}. This operation produces a vector v ∈ R^M,
the entries of which are dot products between u and the rows of A:

    v = Au = [ A_11  A_12  ...  A_1N ] [ u_1 ]   [ Σ_n A_1n u_n ]
             [ A_21  A_22  ...  A_2N ] [ u_2 ] = [ Σ_n A_2n u_n ]  ∈ R^M.
             [  :     :          :   ] [  :  ]   [      :       ]
             [ A_M1  A_M2  ...  A_MN ] [ u_N ]   [ Σ_n A_Mn u_n ]

In a similar way we can multiply matrices, say A ∈ R^{N×M} and B ∈ R^{M×L},

    C = AB ∈ R^{N×L},   C_nl = Σ_{m=1}^M A_nm B_ml.

Multiplying a column vector and a row vector gives an outer product, the result of which is
a matrix,

    [ v_1 ]                       [ v_1 u_1  v_1 u_2  ...  v_1 u_N ]
    [ v_2 ] [u_1 u_2 ... u_N]  =  [ v_2 u_1  v_2 u_2  ...  v_2 u_N ]
    [  :  ]                       [    :        :             :    ]
    [ v_M ]                       [ v_M u_1  v_M u_2  ...  v_M u_N ]

The determinant of a square matrix, det(A), is a scalar computed from a systematic
combination of products of the matrix entries. For example, for a 2 × 2 matrix,

    det [ a  b ] = ad − bc.
        [ c  d ]
The determinant has several properties,
1. det(I) = 1, where I is the identity matrix
2. det(AB) = det(A) det(B)
3. det(A^T) = det(A)
4. det(A^{-1}) = 1/det(A)

The inverse of a square matrix, A^{-1}, if it exists, is a matrix that multiplies A to give the
identity matrix,

    A A^{-1} = A^{-1} A = I.

We will sometimes use “Matlab” notation, namely the colon (:), to denote parts of matrices.
For example, A(i, :) denotes the ith row of matrix A.
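The operations above can be exercised in a few lines of NumPy (a Python sketch for illustration; the course codes themselves are in MATLAB, and the values here are arbitrary):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])

dot = u @ v                    # inner product u^T v, a scalar
w = A @ u                      # matrix-vector product, a vector in R^3
outer = np.outer(v, u)         # outer product v u^T, a 3x3 matrix (rank 1)

# determinant properties: det(AB) = det(A) det(B), det(A^T) = det(A)
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))
print(dot, w, np.linalg.matrix_rank(outer))
```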

1.1.2 Matrix Systems


A typical N × N linear system is

Au = b. (1.2)

Here, b is a known right-hand-side, and u is our vector of N unknowns. The solution is

    u = A^{-1} b,

and it exists when A is invertible, also called non-singular. A matrix is non-singular if and
only if the determinant is nonzero. This means that the columns of A must be linearly-independent:
i.e. we should not be able to write any column as a linear combination of
the other columns. The same holds for the rows – a nonzero determinant is equivalent to
linearly-independent rows.
The rank of a matrix, rank(A), is the number of linearly-independent columns, which is
the same as the number of linearly-independent rows. An N × N matrix must have full rank
of N to be invertible.
The null space of an N × N matrix A is the set of all vectors u for which Au = 0. The
dimension of the null space is (N − rank(A)), so a full-rank matrix has only the trivial null
space (just the zero vector).
One way to solve Eqn. 1.2 is through a direct method such as Gaussian elimination. This
is a systematic technique for solving for one variable from one equation (in terms of the
other variables), and substituting this expression into all of the other equations. The cost of
Gaussian elimination of a dense N × N matrix is O(N^3). Pivoting is a technique for choosing
the order in which the variables/equations are visited in order to minimize errors associated
with division by small numbers.
When solving systems on a computer, an important consideration is round-off error due
to finite machine precision. Most programs operate with “double precision” which means
that 64 bits (0/1s) are used to encode a real number. This means that many (uncountably
infinite) numbers are not representable on a computer. So every time we operate on numbers
(e.g. add, multiply, etc.) we potentially introduce a little bit of error. For double-precision
operations on O(1) numbers, these errors are each O(10^-16). However, over many operations,
these errors can accumulate and be noticeable in the final result. In addition, some (not so
good) numerical algorithms can be unstable, meaning that errors are amplified exponentially
over the operations.
Another way to solve Eqn. 1.2 is through an iterative method. Such a method starts with
a guess for the solution and improves on that guess at each iteration. Iterations stop when
the residual,

r ≡ Au − b

becomes small in some chosen measure. Iterative methods are often the only option for large
sparse (mostly zero entries) matrices.
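A short Python sketch of both approaches (the matrix and right-hand side are arbitrary, and the Jacobi iteration shown is just one simple example of an iterative method, not one from these notes):

```python
import numpy as np

# Direct solve of A u = b (NumPy uses an O(N^3) LU factorization with pivoting)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
u = np.linalg.solve(A, b)

# the residual r = A u - b is near machine precision for a well-conditioned A
r = A @ u - b
assert np.linalg.norm(r) < 1e-12

# a simple iterative method (Jacobi): improve a guess until the residual is small
x = np.zeros(2)
D = np.diag(np.diag(A))                     # diagonal part of A
for _ in range(100):
    x = x + np.linalg.solve(D, b - A @ x)   # x <- x + D^{-1} (b - A x)
print(u, np.linalg.norm(A @ x - b))
```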

1.1.3 Eigenvalues and Eigenvectors


An eigenvector v ∈ R^N of an N × N matrix A is a special nonzero vector for which

Av = λv, (1.3)

where the scalar λ is the associated eigenvalue. Rearranging this equation shows that v must
live in the null space of (A − λI), so that λ must satisfy

det(A − λI) = 0.

λ could be zero, in which case the corresponding eigenvector lies in the null space of A – this
also means that the matrix is singular, or not invertible. v scaled by a nonzero constant is
still an eigenvector, the same one for our purposes. Multiple eigenvectors can have the same
eigenvalue, in which case any linear combination of those eigenvectors is also an eigenvector.
Positive definite matrices are ones for which all eigenvalues are real and positive. If the
matrix is also symmetric, it is termed SPD (symmetric positive definite). These matrices
have a complete set of orthogonal eigenvectors.
To be precise, v defined in Eqn. 1.3 is a right eigenvector. Left eigenvectors satisfy
v^T A = λ v^T, so they are right eigenvectors of A^T. A matrix that has a complete set of
eigenvectors can be decomposed as

    A = R Λ L = R Λ R^{-1} = L^{-1} Λ L,   (1.4)

where the columns of R are the right eigenvectors, the rows of L are the left eigenvectors,
and Λ = diag(λ_1, ..., λ_N) is a diagonal matrix with the eigenvalues on the main diagonal.
Example 1.1 (A 3 × 3 matrix). Consider the following matrix,

    A = [  1  2  3 ]
        [  3  2  1 ]
        [ -2  0  2 ]

We can show that A is not invertible by calculating its determinant,

    det(A) = 1(2·2 − 0·1) − 2(3·2 − (−2)·1) + 3(3·0 − (−2)·2) = 4 − 16 + 12 = 0.

Since the determinant is zero, the matrix is not invertible. To identify the null space of A,
we look for vectors v = [a, b, c]^T for which

    Av = 0  ⇒  [  1  2  3 ] [ a ]
               [  3  2  1 ] [ b ] = 0
               [ -2  0  2 ] [ c ]

So the system for a, b, c is

    a + 2b + 3c = 0
    3a + 2b + c = 0
    −2a + 2c = 0

The last of these says that a = c, and substituting for a in either the first or second equation
shows that b = −2c. So the null space consists of multiples of the vector

    v = [ a ]   [  1 ]
        [ b ] = [ -2 ]
        [ c ]   [  1 ]

Finally, we can calculate the eigenvalues of A, λ, which must satisfy det(A − λI) = 0:

    det [ 1−λ   2     3  ]
        [  3   2−λ    1  ] = 0
        [ −2    0    2−λ ]

Expanding the determinant, we obtain a polynomial for λ,

    (1 − λ)(2 − λ)² − 2(3(2 − λ) + 2) + 3(2(2 − λ)) = 0
    (1 − λ)(4 − 4λ + λ²) − 2(8 − 3λ) + 3(4 − 2λ) = 0
    4 − 8λ + 5λ² − λ³ − 16 + 6λ + 12 − 6λ = 0
    −8λ + 5λ² − λ³ = 0
    λ(λ² − 5λ + 8) = 0

The roots of this equation are

    λ_1 = 0,   λ_2 = (5 − √7 i)/2,   λ_3 = (5 + √7 i)/2.
———
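These hand calculations can be checked numerically (a Python sketch using the same matrix A as in the example):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [-2.0, 0.0, 2.0]])

# the determinant is zero (to roundoff), so A is singular
assert abs(np.linalg.det(A)) < 1e-10

# [1, -2, 1]^T spans the null space
v = np.array([1.0, -2.0, 1.0])
assert np.allclose(A @ v, 0.0)

# eigenvalues: 0 and (5 ± sqrt(7) i)/2
lam = np.sort_complex(np.linalg.eigvals(A))
print(lam)
```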

1.1.4 Sample Systems in Aerospace Engineering


Let’s look at a few matrix systems that arise in aerospace engineering.

1.1.4.1 Prandtl’s Lifting-Line Equation


In Prandtl’s Lifting Line (PLL) theory, the lift distribution over the span of a wing is
decomposed into sinusoidal modes, and the (non-dimensional) coefficients on these modes,
A_n, are the unknowns. The PLL equation at point θ (surrogate for the coordinate
y = −b cos(θ)/2) on the wing reads

    (4b / (a_0(θ) c(θ))) Σ_{n=1}^N A_n sin nθ + Σ_{n=1}^N n A_n (sin nθ / sin θ)
        = α_∞ + α_g(θ) − α_{L=0}(θ),   (1.5)

where b is the span, c(θ) is the chord, a_0(θ) is the sectional lift curve slope, α_g(θ) is the
geometric twist, and α_{L=0}(θ) is the sectional zero-lift angle of attack. N is the number
of unknown modes, i.e. coefficients A_n. To get N equations, we enforce Eqn. 1.5 at N
collocation points, θ_i, usually equally-spaced in θ. In matrix form, this looks like

    collocation point 1 →  [ M_11  M_12  ...  M_1N ] [ A_1 ]   [ F_1 ]
    collocation point 2 →  [ M_21  M_22  ...  M_2N ] [ A_2 ] = [ F_2 ]   (1.6)
            ...            [  :     :          :   ] [  :  ]   [  :  ]
    collocation point N →  [ M_N1  M_N2  ...  M_NN ] [ A_N ]   [ F_N ]

where

    M_in = (4b / (a_0(θ_i) c(θ_i))) sin nθ_i + n (sin nθ_i / sin θ_i),   (1.7)
    F_i = α_∞ + α_g(θ_i) − α_{L=0}(θ_i).   (1.8)

Given information about the wing, we can solve this matrix system, Eqn. 1.6, for the unknown
A_n coefficients.
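The assembly of Eqns. 1.7 and 1.8 can be sketched in Python for an assumed untwisted rectangular wing (the span, chord, lift-curve slope a_0 = 2π, and angle of attack below are illustrative values, not taken from the notes):

```python
import numpy as np

# Assumed illustrative wing: span b, constant chord c, a0 = 2*pi, alpha = 5 deg
b, c, a0 = 10.0, 1.0, 2.0 * np.pi
alpha = np.radians(5.0)          # alpha_inf + alpha_g - alpha_L=0, constant here

N = 20
i = np.arange(1, N + 1)
theta = i * np.pi / (N + 1)      # collocation points, equally spaced, away from 0 and pi
n = np.arange(1, N + 1)

# M[i, n] follows Eqn. 1.7; F[i] follows Eqn. 1.8
T, Nn = np.meshgrid(theta, n, indexing="ij")
M = 4.0 * b / (a0 * c) * np.sin(Nn * T) + Nn * np.sin(Nn * T) / np.sin(T)
F = alpha * np.ones(N)

A = np.linalg.solve(M, F)
print(A[:3])   # A_1 dominates; even modes vanish for this symmetric planform
```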

1.1.4.2 Masses on a Spring

Spring/mass systems sometimes arise not only in dynamics but also in simplified models of
more complicated systems. Let’s look at one example: two masses, m_1 at position x_1 and
m_2 at position x_2, joined by one spring of stiffness k.

The tension force in the spring has a magnitude given by Hooke’s law,

    |F| = k(x_2 − x_1 − d),

where k is the spring stiffness and d is the unstretched spring length. The accelerations of
each mass due to this force are then

    ẍ_1 = +|F|/m_1 = k(x_2 − x_1 − d)/m_1
    ẍ_2 = −|F|/m_2 = −k(x_2 − x_1 − d)/m_2

Defining a state vector as

    u = [x_1, ẋ_1, x_2, ẋ_2]^T,

we can write down the equations of motion via the following linear system,

         [ ẋ_1 ]   [          ẋ_1          ]   [   0     1    0     0 ] [ x_1 ]   [    0     ]
    u̇ = [ ẍ_1 ] = [  k(x_2 − x_1 − d)/m_1 ] = [ −k/m_1  0   k/m_1  0 ] [ ẋ_1 ] + [ −kd/m_1 ]
         [ ẋ_2 ]   [          ẋ_2          ]   [   0     0    0     1 ] [ x_2 ]   [    0     ]
         [ ẍ_2 ]   [ −k(x_2 − x_1 − d)/m_2 ]   [  k/m_2  0  −k/m_2  0 ] [ ẋ_2 ]   [  kd/m_2 ]

    ⇒ u̇ = Au + q   (1.9)
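A quick Python sketch that builds A and q from Eqn. 1.9 (with made-up values of k, d, m_1, m_2) and checks the physics:

```python
import numpy as np

# Assumed illustrative values: stiffness k, unstretched length d, masses m1, m2
k, d, m1, m2 = 2.0, 1.0, 1.0, 3.0
A = np.array([[0.0,    1.0, 0.0,    0.0],
              [-k/m1,  0.0, k/m1,   0.0],
              [0.0,    0.0, 0.0,    1.0],
              [k/m2,   0.0, -k/m2,  0.0]])
q = np.array([0.0, -k*d/m1, 0.0, k*d/m2])

def f(u):
    return A @ u + q

# at rest with the spring unstretched (x2 - x1 = d), nothing accelerates
u_eq = np.array([0.0, 0.0, d, 0.0])
assert np.allclose(f(u_eq), 0.0)

# stretch the spring: mass 1 is pulled toward mass 2 and vice versa
u = np.array([0.0, 0.0, d + 0.5, 0.0])
print(f(u))
```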

1.1.4.3 Truss Displacement

Consider a simple triangular truss in which a horizontal member AB and an inclined member
BC, at angle θ, meet at node B. We are interested in the displacement of point B under
a downward load of Q. Hooke’s law lets us relate length changes to forces inside both of the
truss members,

    F_AB = k_AB δ_AB,
    F_BC = k_BC δ_BC,

where k_AB, k_BC are stiffnesses (“AE/L”) of the truss members, δ_AB, δ_BC are the member
length changes (positive means extension), and F_AB, F_BC are forces in the members (positive
means tension). If (u, v) denote the unknown x and y displacements at point B, we can write

    δ_AB = u,
    δ_BC = u cos θ − v sin θ.

Summing the forces to zero at node B then gives

    [ F_x ]   [ −F_AB − F_BC cos θ ]   [ −k_AB δ_AB − k_BC δ_BC cos θ ]   [ 0 ]
    [ F_y ] = [   F_BC sin θ − Q   ] = [    k_BC δ_BC sin θ − Q       ] = [ 0 ]

    ⇒  [ −k_AB u − k_BC (u cos θ − v sin θ) cos θ ]   [ 0 ]
       [     k_BC (u cos θ − v sin θ) sin θ       ] = [ Q ]

    ⇒  [ −k_AB − k_BC cos²θ   k_BC sin θ cos θ ] [ u ]   [ 0 ]
       [  k_BC cos θ sin θ    −k_BC sin²θ      ] [ v ] = [ Q ]

Solving this 2 × 2 system gives the desired displacements at node B.
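A Python sketch solving the 2 × 2 truss system for assumed (illustrative) stiffnesses, angle, and load:

```python
import numpy as np

# Assumed illustrative values
k_AB, k_BC = 1.0e6, 2.0e6        # member stiffnesses AE/L [N/m]
theta = np.radians(30.0)
Q = 1.0e3                        # downward load at B [N]

s, c = np.sin(theta), np.cos(theta)
K = np.array([[-k_AB - k_BC * c**2,  k_BC * s * c],
              [ k_BC * c * s,       -k_BC * s**2]])
rhs = np.array([0.0, Q])

u, v = np.linalg.solve(K, rhs)
print(u, v)   # x and y displacements of node B [m]

# consistency check: member forces at B balance the applied load
dAB, dBC = u, u * c - v * s
assert np.isclose(k_BC * dBC * s, Q)
assert np.isclose(-k_AB * dAB - k_BC * dBC * c, 0.0)
```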
Example 1.2 (Setting up a system). The vertical (z-direction) force on a falling object is
given by

    F_z = −mg + D,

where m is the mass of the object, g is the (constant) acceleration due to gravity, and the
drag force is

    D = (1/2) ρ(z) w² C_D(Re),   where   Re = ρ(z) w L / μ.

Note that the density ρ is a function of the altitude, z. w = ż is the vertical velocity, and
L, μ are constants.
Using the state vector

    u = [z, w]^T,

we can write down a nonlinear equation of the form u̇ = f(u). Since force is mass times
acceleration, we have

    m ẇ = F_z = −mg + D
    ẇ = −g + D/m

So our system is

    u̇ = [ ż ] = [    w     ] = f(u).
         [ ẇ ]   [ −g + D/m ]

We can now linearize this equation about the state ū = [z̄, w̄]^T to obtain a linear system
of the form

    u̇′ = Au′ + q,

where u = ū + u′. Substituting u = ū + u′ into our equations, we have, since ū is constant,

    u̇ = u̇′ = f(ū + u′) ≈ f(ū) + (∂f/∂u)|_ū u′.

So we see that q is going to be f(ū), and A is going to be (∂f/∂u)|_ū:

    q = f(ū) = [        w̄        ]
               [ −g + D(z̄, w̄)/m ]

    A = (∂f/∂u)|_ū = [ ∂f_1/∂z  ∂f_1/∂w ] = [      0             1        ]
                     [ ∂f_2/∂z  ∂f_2/∂w ]   [ (1/m) ∂D/∂z   (1/m) ∂D/∂w ]

where

    ∂D/∂z = (1/2) (∂ρ/∂z) w² C_D(Re) + (1/2) ρ w² (∂C_D/∂Re) (∂Re/∂z)
    ∂D/∂w = ρ w C_D(Re) + (1/2) ρ w² (∂C_D/∂Re) (∂Re/∂w)
    ∂Re/∂z = (wL/μ) (∂ρ/∂z)
    ∂Re/∂w = ρL/μ

Note all values and derivatives are calculated at (z̄, w̄).
———
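The linearization can be verified by comparing the analytic A against a finite-difference Jacobian. The Python sketch below assumes, purely for illustration, an exponential density ρ(z) = ρ_0 e^{−z/H} and a constant drag coefficient (so ∂C_D/∂Re = 0); all numbers are made up:

```python
import numpy as np

# Assumed illustrative parameters
m, g = 10.0, 9.81
rho0, H, CD = 1.225, 8500.0, 1.0

def rho(z):  return rho0 * np.exp(-z / H)
def drho(z): return -rho0 / H * np.exp(-z / H)
def D(z, w): return 0.5 * rho(z) * w**2 * CD

def f(u):
    z, w = u
    return np.array([w, -g + D(z, w) / m])

def A_analytic(z, w):
    dDdz = 0.5 * drho(z) * w**2 * CD     # dC_D/dRe = 0 under this assumption
    dDdw = rho(z) * w * CD
    return np.array([[0.0, 1.0],
                     [dDdz / m, dDdw / m]])

ub = np.array([2000.0, -50.0])           # linearization state (z_bar, w_bar)
A = A_analytic(*ub)

# central finite-difference Jacobian, column by column
eps = 1e-4
A_fd = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2); e[j] = eps
    A_fd[:, j] = (f(ub + e) - f(ub - e)) / (2 * eps)

print(np.max(np.abs(A - A_fd)))
```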

1.2 Differential Equations


We deal with various differential equations in aerospace engineering. Ordinary differential
equations (ODEs) involve derivatives with respect to one variable, such as time. An example
of an ODE system is given in Eqn. 1.9. On the other hand, partial differential equations
(PDEs) involve derivatives with respect to multiple variables, e.g. multiple spatial coordinates
and/or time. Let’s look a little more closely at what distinguishes PDEs and what
kind of PDEs we find in aerospace engineering.

1.2.1 Classification of PDEs
A PDE is called hyperbolic if it admits wave-like solutions, i.e. solutions of the form e^{i(k·x−ωt)},
where k is the wave number vector and ω is the frequency. Problems involving transport,
e.g. convection of a material, or wave propagation (e.g. aeroacoustics) are often governed
by hyperbolic PDEs.
A PDE is called elliptic if it does not admit any wave-like solutions. Elliptic equations
describe phenomena where source disturbances affect the entire domain. We see these
equations whenever we study diffusion phenomena. If a time derivative term is added to a
spatially elliptic equation, the resulting equation is called parabolic.

1.2.2 Initial and Boundary Conditions


PDEs require appropriate initial and boundary conditions in order to be well-posed. Initial
conditions refer to the state at time t = 0: this is required when solving unsteady problems,
which could be either hyperbolic or parabolic. Boundary conditions refer to specifying the
state, or parts of the state, at the boundaries of the domain: such information is generally
required when solving either steady or unsteady problems. In some cases, the boundary
information may be given in terms of derivatives of states, so-called Neumann boundary
conditions, as opposed to the state information only, so-called Dirichlet boundary conditions.
Robin boundary conditions involve both the state and its derivatives. On some boundaries,
we may not need to specify any boundary conditions, e.g. at the outflow of a hyperbolic
propagation problem; in such cases all of the information is taken from the interior.

1.2.3 Sample Equations in Aerospace Engineering


1.2.3.1 Multi-Dimensional Heat Conduction
Conduction of heat in a material of thermal conductivity κ is governed by Fourier’s law. If
q(x) is the volumetric source of heat within the material (zero if no heat sources are present),
the equation governing the material temperature, T (x), is

−∇ · (κ∇T ) = q(x).

This is an elliptic equation, and its solution gives the steady-state temperature distribution
in the material. If we are interested in transients, we can add a time derivative term,
    ∂T/∂t − ∇ · (κ∇T) = q(x, t).

This equation for T(x, t) is then parabolic.

1.2.3.2 Linearized Compressible Potential Flow


Inviscid subsonic or supersonic flow with small disturbances in two dimensions can be
described by a potential flow where the disturbance potential φ̂(x, y) is governed by the
following PDE:

    (1 − M∞²) ∂²φ̂/∂x² + ∂²φ̂/∂y² = 0,   (1.10)

where M∞ is the Mach number. This equation is valid for either M∞ < 1 or M∞ > 1.
Note that in the case of M∞ < 1, the equation is elliptic, whereas for M∞ > 1 the equation
becomes hyperbolic. This agrees with the intuition that in subsonic flow disturbances can
propagate and affect the entire domain, while in supersonic flow the disturbances propagate
along certain waves, or characteristic directions, and only affect a portion of the domain.

1.2.3.3 Compressible Euler Equations of Gas Dynamics


The compressible Euler equations of gas dynamics in two spatial dimensions are given by

    ∂(ρ)/∂t + ∂(ρu)/∂x + ∂(ρv)/∂y = 0          (mass)
    ∂(ρu)/∂t + ∂(ρu² + p)/∂x + ∂(ρuv)/∂y = 0   (x-momentum)
    ∂(ρv)/∂t + ∂(ρvu)/∂x + ∂(ρv² + p)/∂y = 0   (y-momentum)
    ∂(ρE)/∂t + ∂(ρuH)/∂x + ∂(ρvH)/∂y = 0       (energy)

The variables appearing in this system of equations are defined as follows:

    ρ     = density
    u, v  = x, y components of the velocity vector
    E     = total energy per unit mass
    e     = internal energy per unit mass
    H     = total enthalpy per unit mass
    h     = internal enthalpy per unit mass
    p     = pressure
    R     = gas constant for air, c_p − c_v
    γ     = ratio of specific heats, c_p/c_v = 1.4
    c_p   = specific heat at constant pressure
    c_v   = specific heat at constant volume
    c     = speed of sound
    M     = Mach number

For a calorically-perfect gas,

    e = c_v T,   h = c_p T,   p = ρRT,
    p = (γ − 1)(ρE − (1/2)ρ(u² + v²)) = (γ − 1)ρe,
    H = E + p/ρ,
    c = √(γRT) = √(γp/ρ),   M = √(u² + v²)/c.

The Euler equations can be written in compact form,

    ∂u/∂t + ∇ · F(u) = 0   (1.11)
where the state and flux vectors are

    state:  u = [ ρ, ρu, ρv, ρE ]^T,
    flux:   F = [ ρu, ρu² + p, ρuv, ρuH ]^T î + [ ρv, ρvu, ρv² + p, ρvH ]^T ĵ.
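The calorically-perfect-gas relations above let us recover primitive quantities from a conservative state vector (a Python sketch with an arbitrary illustrative state):

```python
import numpy as np

# Conservative state u = [rho, rho*u, rho*v, rho*E]; illustrative values
gamma = 1.4
rho, rhou, rhov, rhoE = 1.2, 120.0, 0.0, 250000.0

u, v, E = rhou / rho, rhov / rho, rhoE / rho
p = (gamma - 1.0) * (rhoE - 0.5 * rho * (u**2 + v**2))   # p = (gamma - 1) rho e
H = E + p / rho                                          # total enthalpy
c = np.sqrt(gamma * p / rho)                             # speed of sound
M = np.sqrt(u**2 + v**2) / c                             # Mach number
print(p, c, M)
```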

1.2.3.4 Euler-Bernoulli Beam Deflection


When a distributed load q(x) is applied to a one-dimensional beam, the beam deflection
w(x) is governed by a bi-harmonic equation,

    d²/dx² ( EI d²w/dx² ) = q(x),

where EI is the product of the elastic modulus and the moment of inertia. This is an
elliptic equation for w(x), and to solve it we need four boundary conditions since the equation
involves four derivatives.

1.3 Norms and Function Spaces


We will use norms, i.e. scalar numbers, to quantify sizes of vectors and functions. Norms are
sometimes induced by inner products, such as the one in Eqn. 1.1 for vectors – the resulting
spaces are called Hilbert spaces. If (u, v) denotes the inner product between u and v, then
the induced norm is ‖u‖ = √(u, u). But norms can be defined without inner products, and
any (complete) linear space with a norm is called a Banach space.
Some useful norms for discrete vectors u ∈ R^N or functions v(x) ∈ R, x ∈ Ω are

1. L1 norm:

    ‖u‖_L1 = (1/N) Σ_{n=1}^N |u_n|,    ‖v‖_L1 = (1/|Ω|) ∫_Ω |v(x)| dΩ

2. L2 norm:

    ‖u‖_L2 = √( (1/N) Σ_{n=1}^N u_n² ),    ‖v‖_L2 = √( (1/|Ω|) ∫_Ω v(x)² dΩ )

3. Lp norm, p > 1:

    ‖u‖_Lp = ( (1/N) Σ_{n=1}^N u_n^p )^{1/p},    ‖v‖_Lp = ( (1/|Ω|) ∫_Ω v(x)^p dΩ )^{1/p}

4. L∞ norm:

    ‖u‖_L∞ = max_n |u_n|,    ‖v‖_L∞ = ess sup_{x∈Ω} |v(x)|

The integrals appearing above are Lebesgue integrals, which basically means that they do
not care about changes to v on a set of zero measure (e.g. at a point or on a line in a 2D
domain). That’s also why we use “ess sup” instead of “max” in the definition of the L∞
norm for functions: “ess sup” is a maximum (supremum) that ignores regions of Ω which
are of zero measure.
Associated with the function norms defined above are Lebesgue spaces, Lp. We say that
a function v(x) lies in the space Lp if it has a bounded (non-infinite) Lp norm. Another
useful space and norm for functions is H¹. The H¹ norm is defined as

    ‖v‖²_H¹ = (1/|Ω|) ( ∫_Ω v(x)² dΩ + ∫_Ω ∇v · ∇v dΩ ).

So functions in the H¹ space must have square-integrable values and first derivatives (gradients).
This is a stricter requirement than that for L2, which just requires square-integrable
values.
Finally, we define spaces C^m(Ω) as including only those functions that have continuous
derivatives up to order m. For example, C^0(Ω) is the space of continuous functions on Ω (no
jumps), C^1(Ω) is the space of functions with continuous first derivatives, etc.
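The discrete norms can be written directly from their definitions (a Python sketch; note the 1/N normalization used in these notes, which differs from the unnormalized norms in, e.g., numpy.linalg.norm):

```python
import numpy as np

# Discrete L1, L2, and L-infinity norms, with the 1/N normalization used above
def norm_L1(u):   return np.mean(np.abs(u))
def norm_L2(u):   return np.sqrt(np.mean(u**2))
def norm_Linf(u): return np.max(np.abs(u))

u = np.array([3.0, -4.0, 0.0, 1.0])
print(norm_L1(u), norm_L2(u), norm_Linf(u))
```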

Chapter 2

ODE Integration

2.1 Introduction
Consider the following example from orbital mechanics:
Example 2.1 (Satellite in orbit). A satellite orbiting Earth feels a radially-inward acceleration
of GM_earth/r² when it is a distance r from the earth’s center. Putting the Earth at the
origin of a two-dimensional coordinate system, the x and y accelerations are

    ẍ = −(GM_earth/r²)(x/r) = −GM_earth x/r³,
    ÿ = −(GM_earth/r²)(y/r) = −GM_earth y/r³.

Defining the position/velocity state by

    u = [x, y, ẋ, ẏ]^T,

the equations of motion are

         [ ẋ ]   [       ẋ        ]
    u̇ = [ ẏ ] = [       ẏ        ] = f(u).   (2.1)
         [ ẍ ]   [ −GM_earth x/r³ ]
         [ ÿ ]   [ −GM_earth y/r³ ]

Note that f(u) is a nonlinear (because r = √(x² + y²)), vector-valued function of the state u.
———
When using numerical methods to solve an ODE, we typically divide a time horizon of
interest, t ∈ [0, T], into N intervals of equal length Δt ≡ T/N. (In general the intervals can
be of different lengths, for example with small time steps at important times.) Denote by
u^n, 0 ≤ n ≤ N, the state at time node n, i.e. at t = t^n ≡ nΔt. u^0 is a known initial
condition, and our objective in finite-difference methods for ODEs is to determine u^{n+1}
using u^n and perhaps other prior states.
We must be aware that numerical methods are not going to give us the exact answer
to the states at the time nodes. Instead, each u^n will be an approximation to the exact
solution at t^n, which we will denote by u_exact(t^n). For good numerical methods, the error in
our approximation will hopefully decrease as Δt → 0.
Now, let’s look at a couple methods for numerically solving a nonlinear ODE, specifically
the one in Eqn. 2.1.

2.1.1 The Forward Euler Method


One of the simplest finite-difference methods for solving

    u̇ = f(u),   (2.2)

is to numerically approximate the time derivative of the state, u̇, as the difference of the
states at two adjacent time nodes, divided by Δt. So the time derivative of the state at time
node t^n is

    u̇|_{t^n} ≈ (u^{n+1} − u^n)/Δt.

Substituting this into Eqn. 2.2 and solving for u^{n+1} gives the forward Euler update equation,

    u^{n+1} = u^n + Δt f(u^n).   (2.3)
Figure 2.1 shows a graphical interpretation of this equation.



Figure 2.1: Graphical interpretation of the forward Euler method for a scalar problem. The state
at tn+1 is given by extrapolating the state at un using the slope at tn , which is just f (un ).
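The first-order accuracy of forward Euler can be seen on the scalar model problem u̇ = −u, u(0) = 1, whose exact solution is e^{−t} (a Python sketch; this model problem is chosen here for illustration): halving Δt roughly halves the error at the final time.

```python
import numpy as np

def forward_euler(f, u0, T, N):
    """March u' = f(u) from u0 to time T in N steps of u <- u + dt*f(u)."""
    dt = T / N
    u = u0
    for _ in range(N):
        u = u + dt * f(u)
    return u

f = lambda u: -u                 # model problem: u' = -u, exact solution e^{-t}
errs = [abs(forward_euler(f, 1.0, 1.0, N) - np.exp(-1.0)) for N in (50, 100, 200)]
print(errs, [errs[i] / errs[i+1] for i in range(2)])   # ratios approach 2 (first order)
```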

Example 2.2 (Forward Euler for the Satellite Problem). Let’s apply the forward Euler
method, Eqn. 2.3, to the system in Example 2.1. Suppose that the initial state is
    u(t^0 = 0) = [ x^0 = 4 × 10^7 m,  y^0 = 0 m,  ẋ^0 = 0 m/s,  ẏ^0 = 2500 m/s ]^T
Using this as an initial condition, we write a forward Euler code, given in Listing 2.1, for
solving the system in Eqn. 2.1.

Listing 2.1: Code for computing a forward Euler solution to the satellite problem.
function [tn,un] = satFE(u0,T,N);
% function [tn,un] = satFE(u0,T,N);
% PURPOSE:
%   This function solves the satellite orbit problem using forward Euler
% INPUTS:
%   u0 : initial state [x, y, xdot, ydot] [m]
%   T  : end time (start is 0) [s]
%   N  : number of time steps
% OUTPUTS:
%   tn : (N+1)x1 vector of time values
%   un : (N+1)x4 vector of states

% initialize arrays
tn = linspace(0,T,N+1);
un = zeros(N+1,4);

un(1,:) = u0;   % initial state
dt = T/N;       % time step

% Forward Euler time integration
for n=1:N,
  tn(n+1) = tn(n) + dt;
  un(n+1,:) = un(n,:) + dt*fsat(un(n,:));
end

%---------------------------------
function f = fsat(u);
GM = 6.67e-11 * 5.976e24;   % G times M, M = mass of Earth [kg]
r  = sqrt(u(1)^2 + u(2)^2);
f(1) = u(3);                % xdot
f(2) = u(4);                % ydot
f(3) = -GM*u(1)/r^3;
f(4) = -GM*u(2)/r^3;

The results in the form of orbit radius versus time for two different time steps are shown
in Figure 2.2. We note that errors appear to build up during the course of the simulation,
and that decreasing the time step improves the accuracy of integration relative to the exact
solution.
———

2.1.2 The Predictor-Corrector Method


Looking at Figure 2.1, we see that the forward Euler method assumes that the derivative u̇ over
t ∈ [tn , tn+1 ] equals the derivative at t = tn . This will generally not be true, and by t = tn+1
the time derivative might be quite different. A simple improvement is to use the average of
f = u̇ at t = tn and t = tn+1 . To get f at t = tn+1 we need an approximate state there, and
for this we can use the forward Euler state. So the improved update consists of two steps:
1. Use forward Euler to predict the state at tn+1 ,
ũn+1 = un + ∆t f (un ).

Figure 2.2: Forward Euler solution to an orbiting satellite problem: orbit radius (distance from Earth center, km) versus time (hours) for ∆t = 100 s and ∆t = 10 s, compared to the exact solution.

2. Use f (ũn+1 ) to correct the update through an improved slope,

    u^{n+1} = u^n + (∆t/2) [ f(u^n) + f(ũ^{n+1}) ].
Figure 2.3 shows a graphical interpretation of this equation.

Figure 2.3: Graphical interpretation of the predictor-corrector method for a scalar problem. The
state at tn+1 is given by extrapolating the state at un using the average of the slopes at tn and at
tn+1.

This two-stage update requires two evaluations of f (·), which makes this method roughly twice as expensive as forward
Euler per time step. However, we expect improved accuracy, which means possibly fewer
time steps, with the predictor-corrector method.

Example 2.3 (Predictor-Corrector for the Satellite Problem). Let’s apply the predictor-
corrector method to our system in Example 2.1. Using the same initial condition as in
Example 2.2 we run a predictor-corrector code for the system in Eqn. 2.1. The code is given
in Listing 2.2.

Listing 2.2: Code for computing a predictor-corrector solution to the satellite problem.
function [tn,un] = satPC(u0,T,N)
% function [tn,un] = satPC(u0,T,N)
% PURPOSE:
%   This function solves the satellite orbit problem using predictor-corrector
% INPUTS:
%   u0 : initial state [x, y, xdot, ydot]
%   T  : end time (start is 0)
%   N  : number of time steps
% OUTPUTS:
%   tn : (N+1)x1 vector of time values
%   un : (N+1)x4 vector of states

% initialize arrays
tn = linspace(0,T,N+1);
un = zeros(N+1,4);

un(1,:) = u0;    % initial state
dt = T/N;        % time step

% Predictor-corrector time integration
for n = 1:N
  tn(n+1) = tn(n) + dt;
  uFE = un(n,:) + dt*fsat(un(n,:));                        % predictor (forward Euler)
  un(n+1,:) = un(n,:) + dt*0.5*(fsat(un(n,:))+fsat(uFE));  % corrector (average slope)
end

%---------------------------------
function f = fsat(u)
GM = 6.67e-11 * 5.976e24;    % G*M, with M = mass of Earth [kg]
r  = sqrt(u(1)^2 + u(2)^2);
f(1) = u(3);                 % xdot
f(2) = u(4);                 % ydot
f(3) = -GM*u(1)/r^3;         % xddot
f(4) = -GM*u(2)/r^3;         % yddot

The results in the form of orbit radius versus time for two different time steps are shown
in Figure 2.4. Again, the errors appear to build up during the course of the simulation,
and decreasing the time step improves the accuracy of integration. While the curves show
qualitatively similar errors to those in Figure 2.2, note that we choose the time steps to be
much larger for the predictor-corrector method. Even though the cost per time step is about
twice as much when using predictor-corrector, in this case the cost is certainly worth it.
———

Figure 2.4: Predictor-corrector solution to an orbiting satellite problem: orbit radius (distance from Earth center, km) versus time (hours) for ∆t = 1600 s and ∆t = 800 s, compared to the exact solution.

2.2 Multi-Step Methods

Many time-integration methods can be classified as multi-step. These are methods that relate
an unknown state at time node n+1 to a certain number of previous states and slopes. They
are typically defined for time discretizations with uniform time steps, ∆t = constant. The
general formula for a K-step method for solving u̇ = f (u) is
    Σ_{k=1−K}^{1} α_k u^{n+k} = ∆t Σ_{k=1−K}^{1} β_k f(u^{n+k}, t^{n+k})        (2.4)

where αk , βk are coefficients that define the method. Note that we put tn+k as an argument
into f to include cases where f depends directly on time t in addition to the state u. The
relationship between k and n is illustrated below.

n+1−K n−1 n n+1 t


k =1−K k = −1 k=0 k=1

In Eqn. 2.4, we typically set α1 = 1, and our goal is to solve for un+1 . A complication arises
on the right-hand side if β1 ≠ 0. In this case our equation for un+1 involves f (un+1 , tn+1 )
. . . in other words, un+1 depends implicitly on itself through f . Such methods are called
implicit. On the other hand, if β1 = 0 we can directly, or explicitly, solve for un+1 in terms
of previous states and slopes. Such methods are called explicit.
The forward Euler method in Eqn. 2.3 is an explicit multi-step method with K = 1, and
α0 = −1, α1 = 1, β0 = 1, all other αk , βk zero.
The predictor-corrector update is not a multi-step method because it uses f from a temporary
state update instead of f from only the time nodes. Some other popular multi-step methods
are as follows:

• First-order Backward Difference (BDF1), also known as Backward Euler,

    u^{n+1} − u^n = ∆t f(u^{n+1}, t^{n+1}).

• Second-order Backward Difference (BDF2),

    (3/2) u^{n+1} − 2 u^n + (1/2) u^{n−1} = ∆t f(u^{n+1}, t^{n+1}).

• Third-order Backward Difference (BDF3),

    (11/6) u^{n+1} − 3 u^n + (3/2) u^{n−1} − (1/3) u^{n−2} = ∆t f(u^{n+1}, t^{n+1}).

• Second-order Adams-Bashforth (AB2),

    u^{n+1} − u^n = ∆t [ (3/2) f(u^n, t^n) − (1/2) f(u^{n−1}, t^{n−1}) ].

• Third-order Adams-Bashforth (AB3),

    u^{n+1} − u^n = ∆t [ (23/12) f(u^n, t^n) − (16/12) f(u^{n−1}, t^{n−1}) + (5/12) f(u^{n−2}, t^{n−2}) ].

• Second-order Adams-Moulton or Trapezoidal (AM2, Trap),

    u^{n+1} − u^n = ∆t [ (1/2) f(u^{n+1}, t^{n+1}) + (1/2) f(u^n, t^n) ].

• Third-order Adams-Moulton (AM3),

    u^{n+1} − u^n = ∆t [ (5/12) f(u^{n+1}, t^{n+1}) + (8/12) f(u^n, t^n) − (1/12) f(u^{n−1}, t^{n−1}) ].
A few comments are in order. First, of the methods listed, only the Adams-Bashforth
methods are explicit; all others are implicit. Second, backwards differentiation methods
all use just one f , at n + 1, while varying the number of states on the left-hand-side. On
the other hand, the Adams-Bashforth and Adams-Moulton methods all have un+1 − un on
the left hand side and instead vary the number of f terms on the right-hand side. Finally,
for methods that require more than one previous state (e.g. BDF2), another method (e.g.
BDF1) must be used for the first time step(s).
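As a concrete sketch of how such an update is implemented in practice (in Python rather than the MATLAB of the earlier listings; the function name and test problem are our own choices), here is AB2 applied to the scalar problem u̇ = −u, with a single forward Euler step supplying the missing starting value:

```python
import math

def ab2(f, u0, T, N):
    """Second-order Adams-Bashforth (AB2) for the scalar ODE du/dt = f(u).
    One forward Euler step generates the extra starting value AB2 needs."""
    dt = T / N
    u = [u0, u0 + dt * f(u0)]   # u^0, then a forward Euler startup step for u^1
    for n in range(1, N):
        u.append(u[n] + dt * (1.5 * f(u[n]) - 0.5 * f(u[n - 1])))
    return u

# u' = -u, u(0) = 1, so u(1) = exp(-1); halving dt should cut the error ~4x
errs = [abs(ab2(lambda u: -u, 1.0, 1.0, N)[-1] - math.exp(-1.0))
        for N in (50, 100)]
observed_order = math.log(errs[0] / errs[1], 2.0)   # should be close to 2
```

Note the startup issue mentioned above appearing directly in the code: the lower-order starting step does not spoil the observed second-order convergence.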
So how do we decide which method to use? The two main considerations are
1. Cost for a given accuracy: we’d like to use the computationally-cheapest but most
accurate method available. The choice of method will generally depend on the desired
level of accuracy.
2. Stability: we don’t want the solution to blow up, which can happen if a method is
unstable. Stability depends in part on the size of the time step.
We address these considerations in the following sections.

2.3 Accuracy, Consistency, Stability
Let’s look at accuracy in the context of multi-step methods. We can think of Eqn. 2.4 as
an equation for un+1 given data at previous time nodes. Let’s say this data were perfect,
meaning taken from the exact solution. How accurate would our answer for un+1 be in this
case? A systematic technique for answering this question is to calculate the truncation error
associated with the update.

2.3.1 Truncation Error


The truncation error, τ n+1 , is the quantity that remains in Eqn. 2.4 after the exact
solution, uexact (t) is plugged in for both un+k and f = u̇. Defining “what remains” by
left-hand-side minus the right-hand-side, we obtain,
    τ^{n+1} ≡ Σ_{k=1−K}^{1} α_k u_exact^{n+k} − ∆t Σ_{k=1−K}^{1} β_k f(u_exact^{n+k}, t^{n+k}).        (2.5)

This error directly pollutes the next state, un+1 , and it is sometimes called the local error
or the error made in one time step. The error arises because we are using finite difference
approximations to the time derivatives.
We say that a multi-step method is of order p if

    τ^{n+1} = O(∆t^{p+1}).        (2.6)

For example, for a second-order method (p = 2), this equation says that the truncation
error must decrease by a factor of 8 when ∆t is halved. We’ll see very soon that the use of
p versus p + 1 is related to the distinction between local and global order of accuracy.
How do we calculate τ n+1 ? The most straightforward way is through a Taylor-series
expansion of the state and derivative. Let’s do this about the state un . We have

    u^{n+k} = u^n + k∆t (u_t)^n + (1/2)(k∆t)^2 (u_tt)^n + O(∆t^3)
                                                                                (2.7)
    f(u^{n+k}) = u_t^{n+k} = u_t^n + k∆t (u_tt)^n + (1/2)(k∆t)^2 (u_ttt)^n + O(∆t^3)

Substituting these expressions into the multi-step update equation and collecting terms with
the same power of ∆t gives us our truncation error. The higher the order of accuracy of a
method, the more powers of ∆t will cancel in this substitution.

Example 2.4 (Truncation Error of Forward Euler). The forward Euler update in Eqn. 2.3
involves un+1 and un . Expanding un+1 using the Taylor series from Eqn. 2.7 in the forward
Euler update gives the following equation,
    u^n + ∆t u_t^n + (1/2)∆t^2 u_tt^n + O(∆t^3) = u^n + ∆t u_t^n,

where the left-hand side is the Taylor expansion of u^{n+1} and on the right-hand side we have used f^n = u_t^n.

The truncation error is the left-hand side minus the right-hand side,

    τ^{n+1} = (1/2)∆t^2 u_tt^n + O(∆t^3).
Since the leading remaining term is O(∆t2 ), we say that forward Euler is first-order accurate,
p = 1.

———
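This result is easy to check numerically. For u̇ = −u with u(0) = 1, the exact solution is e^{−t}, so u_tt(0) = 1 and we expect τ/∆t² → 1/2 as ∆t → 0. A quick Python spot check (an illustration, not part of the notes):

```python
import math

# truncation error of forward Euler for u' = -u, u(0) = 1 (exact: u = e^{-t}):
# tau = u_exact(dt) - [u_exact(0) + dt * f(u_exact(0))]
def tau_fe(dt):
    return math.exp(-dt) - (1.0 - dt)

# the ratio tau / dt^2 should approach u_tt(0)/2 = 1/2 as dt shrinks
ratios = [tau_fe(dt) / dt**2 for dt in (1e-1, 1e-2, 1e-3)]
```

The ratio visibly approaches 1/2 as ∆t decreases, consistent with the leading O(∆t²) term above.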

Example 2.5 (Minimizing Truncation Error). Suppose we would like to determine the
most accurate K = 2 explicit multi-step method. Since we’re looking for an explicit method,
β1 = 0 and the general multi-step formula in Eqn. 2.4 for K = 2 becomes

    α_{−1} u^{n−1} + α_0 u^n + α_1 u^{n+1} = ∆t ( β_{−1} f^{n−1} + β_0 f^n ).
Inserting the Taylor-series expansions about t = tn from Eqn. 2.7 and matching coefficients
we obtain

    O(1):      α_{−1} + α_0 + α_1 = 0
    O(∆t):     −α_{−1} + α_1 = β_{−1} + β_0
    O(∆t^2):   α_{−1}/2 + α_1/2 = −β_{−1}
    O(∆t^3):   −α_{−1}/6 + α_1/6 = β_{−1}/2
    O(∆t^4):   α_{−1}/24 + α_1/24 = −β_{−1}/6

We expect our coefficients to be defined only up to an arbitrary multiplicative constant, so
we set α_1 = 1 to remove this arbitrariness. That leaves four unknowns, which means that we
can match up to the O(∆t3 ) terms above. Note that we should not expect to match O(∆t4 )
because this equation is not even compatible with the O(∆t2 ) equation. Solving the first
four of the above equations yields

α−1 = −5, α0 = 4, α1 = 1, β−1 = 2, β0 = 4.

We would call this a third-order accurate method because the truncation error, given by
the first term that we could not match, is O(∆t4 ). Unfortunately, we’ll soon see that this
method is useless because it is unstable!

———
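The four matched conditions form a small linear system that can be solved exactly. A Python sketch (our own helper, using exact Fraction arithmetic) that reproduces the coefficients above:

```python
from fractions import Fraction as F

def gauss_solve(A, b):
    """Plain Gauss-Jordan elimination in exact Fraction arithmetic."""
    n = len(A)
    M = [[F(x) for x in row] + [F(bi)] for row, bi in zip(A, b)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)   # pivot search
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i and M[r][i] != 0:
                c = M[r][i] / M[i][i]
                M[r] = [a - c * ai for a, ai in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

# unknowns [alpha_-1, alpha_0, beta_-1, beta_0], with alpha_1 fixed to 1;
# rows are the O(1), O(dt), O(dt^2), O(dt^3) matching conditions above
A = [[1, 1, 0, 0],                #  a_-1 + a_0 + 1 = 0
     [-1, 0, -1, -1],             # -a_-1 + 1 = b_-1 + b_0
     [F(1, 2), 0, 1, 0],          #  a_-1/2 + 1/2 = -b_-1
     [F(-1, 6), 0, F(-1, 2), 0]]  # -a_-1/6 + 1/6 = b_-1/2
b = [-1, -1, F(-1, 2), F(-1, 6)]
coeffs = gauss_solve(A, b)        # -> [-5, 4, 2, 4]
```

The solve reproduces α−1 = −5, α0 = 4, β−1 = 2, β0 = 4 exactly.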

2.3.2 Consistency
How large of a truncation error are we willing to accept? For a time integration method
to be useful, it must be at least first-order accurate, which means that we need at least
τ n+1 = O(∆t2 ). Zeroth order accuracy won't do because it means that the truncation error
is O(∆t), which we can achieve even by neglecting all of the physics. For example, imagine a
silly time-marching method in which the state just stayed constant, un+1 = un , regardless

of what f was. Assuming a continuous actual solution, at each time step we would be
committing an error of O(∆t) since our state would not be changing while the actual state
would be changing by ∼ f ∆t. But of course this is a useless method, and we’d like to make
sure our methods do better.
We call a method consistent if it is at least first-order accurate, i.e. if τ n+1 = O(∆t2 ).
Referring to the example in the previous paragraph, we can think of consistency as a require-
ment that we respect the physics of the problem. Let’s look at what consistency means for
a multi-step method. Substituting the Taylor-series expansions from Eqn. 2.7 into Eqn. 2.5,
we find
" 1 # " 1 #
X X
n+1 n
τ = αk uexact + (kαk − βk ) unexact,t ∆t + O(∆t2 )
k=1−K k=1−K

For consistency, we need the O(1) and O(∆t) terms to cancel,

    Σ_{k=1−K}^{1} α_k = 0,        (2.8)

    Σ_{k=1−K}^{1} k α_k = Σ_{k=1−K}^{1} β_k .        (2.9)

Example 2.6 (Consistent Multistep Schemes). Using Eqn. 2.8 and Eqn. 2.9, we can quickly
verify that all of the schemes presented at the beginning of this section are consistent. The
general form of a consistent two-step method is

    ξ u^{n−1} − (1 + 2ξ) u^n + (1 + ξ) u^{n+1} = ∆t ( φ f^{n−1} + (1 − θ − φ) f^n + θ f^{n+1} ).        (2.10)

We verify this by first checking that Eqn. 2.8 holds,

    Σ_{k=1−K}^{1} α_k = ξ − (1 + 2ξ) + (1 + ξ) = 0.

Eqn. 2.9 also holds since,

    Σ_{k=1−K}^{1} k α_k = (−1)ξ + 0·(−(1 + 2ξ)) + 1·(1 + ξ) = 1 = φ + (1 − θ − φ) + θ = Σ_{k=1−K}^{1} β_k .

Also, by looking at the truncation error, one can verify that the scheme in Eqn. 2.10 will be
second-order accurate if ξ = θ −φ−1/2. It will be third-order accurate if also ξ = −2φ−1/6.

———
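Conditions 2.8 and 2.9 are easy to check mechanically. A small Python sketch (coefficient tables transcribed from the list in Section 2.2; the helper function is ours) that verifies a few of the schemes:

```python
from fractions import Fraction as F

def consistent(alphas, betas):
    """Check the multi-step consistency conditions, Eqn. 2.8 and Eqn. 2.9.
    alphas, betas map the offset k (k = 1 is the new state) to coefficients."""
    eq28 = sum(alphas.values()) == 0
    eq29 = sum(k * a for k, a in alphas.items()) == sum(betas.values())
    return eq28 and eq29

schemes = {  # transcribed from the list in Section 2.2
    "BDF1": ({1: 1, 0: -1}, {1: 1}),
    "BDF2": ({1: F(3, 2), 0: -2, -1: F(1, 2)}, {1: 1}),
    "AB2":  ({1: 1, 0: -1}, {0: F(3, 2), -1: F(-1, 2)}),
    "Trap": ({1: 1, 0: -1}, {1: F(1, 2), 0: F(1, 2)}),
}
all_consistent = all(consistent(a, b) for a, b in schemes.values())

# counterexample: "u stays constant" satisfies Eqn. 2.8 but not Eqn. 2.9
frozen_ok = consistent({1: 1, 0: -1}, {})
```

The "frozen state" scheme from the previous sub-section fails exactly the second condition, as expected.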

2.3.3 Global Error and Convergence
When we integrate ODEs, we’re interested in the error not only in an individual time step,
but also over the entire time history. We define the global error of a discrete solution at time
nodes un relative to the exact solution unexact = uexact (n∆t) as

    global error = ε_global ≡ max_{n=1...N} || u^n − u_exact^n ||_{L∞} .

In this expression, we use the L∞ norm at each time node in order to pick out the maximum
error over the components of the state. In short, the global error is the maximum difference
between the computed state and the exact state over the course of the simulation.
We say that a finite difference method is convergent for a particular ODE, u̇ = f , if the
global error goes to zero as we reduce the time step,

    ε_global → 0  as  ∆t → 0.

Note that the concept of convergence applies to the combination of method and ODE. A
particular integration method may converge for one ODE but not for another.
For convergent methods, we can relate the global error, ε_global, to the truncation error,
τ^n. Recall that the truncation error τ^n is the error in the time step n−1 → n. From an
order-of-magnitude point of view, we expect these local truncation errors to add up over the
course of the simulation, so that the error at the final time would be the biggest, i.e.

    ε_global = O( Σ_{n=1}^{N} ||τ^n|| ).        (2.11)

Now, recall from Eqn. 2.6 that we named a method “order p” if each truncation error was
O(∆t^{p+1}). If we add up N ∼ 1/∆t such truncation errors in Eqn. 2.11, we expect

    ε_global = O( Σ_{n=1}^{N} ||τ^n|| ) = O( N ∆t^{p+1} ) = O( (1/∆t) ∆t^{p+1} ) = O( ∆t^p ).        (2.12)

So we expect the global error to decrease at a rate of ∆t to a power that is one less than
that for the local or truncation error. We also see that the name “order p” is appropriate
since this is the rate at which the global error converges.
Finally, we mention that the local errors do not always sum as indicated in Eqn. 2.12.
Conceivably, although not often, local errors could cancel so that the global error super-
converges at a rate higher than p. A more typical case is that of the global error growing
faster than expected, in fact exponentially, due to a lack of stability in the method. We
consider this case in the next section.

Example 2.7 (Global Convergence for the Satellite Problem). In this example we mea-
sure the global convergence rates of two integration methods, forward Euler and predictor-
corrector, for the satellite problem of Example 2.1. To do this, we run the two codes for

various time steps ∆t and we plot the natural logarithm, “log”, of the global error versus
log of ∆t. Why the logarithms? Since we expect

    ε_global = C ∆t^p ,

then, taking the log of both sides, we have

    log ε_global = log C + p log ∆t.

So a plot of log of the global error versus log of ∆t should reduce to a straight line with
slope p. Figure 2.5 shows such plots for the two methods we've been considering.

Figure 2.5: Global errors of two time-integration methods for the orbiting satellite problem. The
slopes of the lines on the log-log scale (approximately 0.99 for forward Euler and 2.03 for
predictor-corrector) indicate the orders of accuracy of the methods.

Note that to make the plots easier to read, we just use a log-log scale instead of plotting the log of the
quantities. These plots are not exactly straight lines, although they are pretty close. We
calculate slopes using the two runs at the smallest ∆t – we expect runs at small ∆t to be
indicative of what happens as ∆t asymptotically approaches zero. From the figure, we see
that the predictor-corrector method is much more accurate relative to forward Euler, and
that it converges at a faster rate: approximately 2 compared to approximately 1 for forward
Euler. This result is consistent with the expected convergence rates for the methods.

———
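The same slope-measurement procedure can be reproduced on a scalar test problem, u̇ = −u, without the satellite code. A Python sketch (our own, not from the notes):

```python
import math

def solve(method, lam, u0, T, N):
    """Integrate u' = lam*u with forward Euler ('fe') or predictor-corrector ('pc')."""
    dt, u = T / N, u0
    for _ in range(N):
        if method == "fe":
            u = u + dt * lam * u
        else:
            u_tilde = u + dt * lam * u                    # predictor
            u = u + 0.5 * dt * (lam * u + lam * u_tilde)  # corrector
    return u

def observed_rate(method):
    # halve dt once and measure the log2 ratio of global errors at t = 1
    errs = [abs(solve(method, -1.0, 1.0, 1.0, N) - math.exp(-1.0))
            for N in (100, 200)]
    return math.log(errs[0] / errs[1], 2.0)

rate_fe = observed_rate("fe")   # expect about 1
rate_pc = observed_rate("pc")   # expect about 2
```

The measured slopes mirror the figure: roughly 1 for forward Euler and 2 for predictor-corrector.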

2.4 Stability
Numerical ODE integration methods approximate the solution uexact (t) by a sequence of
discrete states, un . If these states grow unbounded (when the exact solution does not), we
say that the integration method applied to that particular ODE is unstable. In this section
we make this concept a little more precise, with emphasis on multi-step methods.

2.4.1 Zero Stability


Suppose that we are interested in a fixed time range of t ∈ [0, T ]. We approximate the
solution of an ODE with N + 1 states un and a time step of ∆t = T /N . We say that a
multi-step scheme is zero-stable if the solution remains bounded as ∆t → 0, i.e. as N → ∞.
Note that this concept does not address stability for a given ∆t, which we discuss in the
next sub-section. Instead, for now we only care about the limit ∆t → 0.
As ∆t → 0 for a multi-step method, the contribution of the “physics”, i.e. the right-hand
side of Eqn. 2.4, drops out, since all of these terms are multiplied by ∆t. What's left is
a recurrence relation, which we write assuming the state u is a scalar for simplicity,

    Σ_{k=1−K}^{1} α_k u^{n+k} = 0.        (2.13)

For zero-stability, all solutions of this recurrence relation must remain bounded as n → ∞.
Recurrence relations admit solutions of the form u^n = u^0 g^n, where g^n here means a
scalar g raised to the nth power, and where g is a (complex) root of the following polynomial
equation,

    Σ_{k=1−K}^{1} α_k g^k = 0.

For zero stability, we need all roots of this equation to satisfy |g| ≤ 1, where |g| = 1 is allowed
only as long as the root is not repeated (otherwise u^n = n u^0 g^n would also be a solution, and it grows with n).

Example 2.8 (Zero Stability of Forward Euler). For the forward Euler scheme, the recur-
rence relation from Eqn. 2.13 reads

un+1 − un = 0
u0 g n+1 − u0 g n = 0
u0 g n (g − 1) = 0

We see that the solutions to this equation are g = 0 and a single root g = 1. Hence, the
forward-Euler scheme is zero-stable.

———

Example 2.9 (Lack of Zero Stability). Recall the most-accurate K = 2 explicit multi-step
scheme that we derived in Example 2.5,

    −5 u^{n−1} + 4 u^n + u^{n+1} = ∆t ( 2 f^{n−1} + 4 f^n ).
Ignoring the right-hand side, the recurrence relation from Eqn. 2.13 reads

    −5 u^{n−1} + 4 u^n + u^{n+1} = 0
    u^0 g^{n−1} ( −5 + 4g + g^2 ) = 0

The nonzero solutions to this equation are g = 1 and g = −5. The problematic root is
g = −5 since it does not satisfy |g| < 1. Why is this problematic? The reason is that having
g = −5 as a root means that un = u0 (−5)n is a solution to the homogeneous (zero right-
hand side) part of the equation. So an error that grows exponentially with n can “slip by”
undetected and un-damped by the numerical method. In practice, the source of this error
can be as innocuous as machine precision (rounding) errors – exponential growth makes such
an error a problem in only a few iterations.
As a demonstration, we apply this unstable scheme to the satellite problem of Exam-
ple 2.1. Figure 2.6 shows the result for the orbit radius versus time. The solution blows up
exponentially away from the exact solution shortly after the start of integration. If we look
closely, we see that the errors are multiplied by a factor of approximately -5 from iteration
to iteration. For reference, the time step here was about ∆t = 2.6s. These errors are seeded
by machine precision, and for smaller time steps, we would expect the solution to blow up
“sooner”, albeit after a similar number of time steps.

———
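We can reproduce both parts of this example in a few lines: the root check, and the exponential error growth seeded by a tiny perturbation. A Python sketch (illustrative only):

```python
import math

# characteristic polynomial of the recurrence -5u^{n-1} + 4u^n + u^{n+1} = 0:
# g^2 + 4g - 5 = 0
a, b, c = 1.0, 4.0, -5.0
disc = math.sqrt(b * b - 4.0 * a * c)
roots = sorted([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])  # [-5.0, 1.0]
zero_stable = all(abs(g) <= 1 for g in roots)                   # False here

# the g = -5 root in action: iterate u^{n+1} = -4u^n + 5u^{n-1},
# starting from the exact constant solution perturbed at machine-epsilon level
u = [1.0, 1.0 + 1e-15]
for n in range(1, 30):
    u.append(-4.0 * u[n] + 5.0 * u[n - 1])
blow_up = abs(u[-1] - 1.0)   # grows roughly like 5^n times the seed
```

After only thirty steps, a perturbation of order 10⁻¹⁵ has been amplified to an O(10⁴) error, matching the behavior in Figure 2.6.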

2.4.2 Eigenvalue Stability


While zero stability tells us what happens in the limit ∆t → 0, we often care about the
practical case of a finite time step, ∆t > 0. For example, we might want large time steps if
we’re solving a steady-state problem or if we want to solve a problem without too many time
steps for reasons of efficiency. In such cases we need to know whether a time integration
method is stable for a particular ∆t.
We will analyze stability of linear systems of the form

u̇ = Au, (2.14)

Note that adding a constant source term on the right-hand side would not affect the stability
analysis.
Figure 2.6: Results from applying the time-integration scheme derived in Example 2.5 to the
satellite problem. This scheme is not zero stable, and hence the legend entry of “unstable” – the
red dashed curve grows unbounded shortly after the start of the time integration.

Let's assume that the above system is non-defective, meaning that A has a complete set
of eigenvectors. We can then diagonalize A by writing A = RΛR^{−1}, and the system in
Eqn. 2.14 becomes

    u̇ = A u
    u̇ = R Λ R^{−1} u
    R^{−1} u̇ = Λ R^{−1} u        (define w ≡ R^{−1} u)
    ẇ = Λ w

In the last line we have a set of decoupled ODEs. That is, if the original system is N × N ,
we now have N independent scalar equations, each of the form

ẇ = λw, (2.15)

where λ is an eigenvalue of A. So the maximum time step allowed for integrating Eqn. 2.14
is going to be limited by the most restrictive eigenvalue of A. More importantly, the diago-
nalization shows us that instead of tackling systems all at once, we can instead consider just
scalar equations of the form in Eqn. 2.15. This is what we will do, and to avoid introducing
new variables, we will use u instead of w. So we consider the following scalar ODE,

u̇ = λu, (2.16)

where λ is a possibly-complex scalar.

To analyze the stability of a time-integration scheme, we calculate the amplification
factor, g, of the state after one iteration. That is, we write

un+1 = gun , (2.17)

where g is a constant that could be complex. The motivation for writing the above equa-
tion is that we’re working with a homogeneous (just λu on the right-hand side) and time-
independent (λ is constant) system. So there is nothing to distinguish one iteration from
another, and if after one iteration the state is multiplied by g, it will also be multiplied by g
after the next iteration. In particular, we can write un = g n u0 , where g n means g raised to
the nth power.
What’s the use of a homogeneous system? It is the form of the error equation of any
constant-coefficient non-homogeneous system. The amplification factor then gives us the
rate at which the error grows, if |g| > 1, or decays, if |g| < 1. For stability we want the
latter case, |g| < 1.
The amplification factor g will depend on the “physics” of the problem, i.e. λ, and on the
numerics via the choice of integration scheme and time step, ∆t. Let’s see how this works
through an example.
Example 2.10 (Eigenvalue Stability of Forward Euler). Recall the forward Euler update
from Eqn. 2.3, which for f (u) = λu is

un+1 = un + ∆t λun .

To calculate the amplification factor, we substitute un = g n u0 into this equation,

g n+1 u0 = g n u0 + ∆t λg n u0
g n u0 (g) = g n u0 (1 + λ∆t)
⇒ g = 1 + λ∆t

For stability, we’d like the amplification factor to have magnitude less than 1. From the
above equation, we see that for real λ, this means −2 < λ∆t < 0. But in general, λ can be
complex, and so we need to allow for complex g. We determine the general stability boundary
by setting g = eiθ , since this is the set of all complex numbers such that |g| = 1. Hence we
have,

eiθ = 1 + λ∆t
⇒ λ∆t = eiθ − 1

which is a circle of radius 1 centered at −1 in the complex number plane. This analysis gives
us the stability boundary, and all that’s left is figuring out which side of the boundary is
stable. To do this, we pick a point, i.e. a value of λ∆t, and calculate |g|: for example, for
forward Euler, choosing the point λ∆t = −1 gives |g| = 0, which is stable. This means that
the region inside the unit circle centered at λ∆t = −1 is stable, and we obtain the stability
diagram shown in Figure 2.7.


Figure 2.7: Eigenvalue stability region for the forward Euler method. The region inside the circle
is stable as it yields amplification factors with |g| < 1.

———
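A few lines of Python (sample points are our own choices) make the geometry of Figure 2.7 concrete: points inside the circle |1 + λ∆t| < 1 are stable, points outside are not, and the boundary parameterization λ∆t = e^{iθ} − 1 gives |g| = 1.

```python
import cmath

def g_fe(lam_dt):
    """Forward Euler amplification factor, g = 1 + lambda*dt."""
    return 1 + lam_dt

inside  = [-1.0, -0.5, -1.0 + 0.5j]       # inside the unit circle about -1
outside = [0.5, -2.5, 1j, -1.0 + 1.5j]    # outside it
all_stable = all(abs(g_fe(z)) < 1 for z in inside)
any_stable = any(abs(g_fe(z)) < 1 for z in outside)

# points on the boundary lambda*dt = e^{i theta} - 1 give |g| = 1
on_boundary = all(abs(abs(g_fe(cmath.exp(1j * t) - 1.0)) - 1.0) < 1e-12
                  for t in (0.3, 1.0, 2.5))
```

Note in particular that the purely imaginary sample λ∆t = i lies outside the region, a point we revisit for multi-stage methods.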

We make a few remarks about the concept of a stability region derived in Example 2.10.

• The stability boundary is a contour in the complex number plane of λ∆t.

• If λ∆t falls inside the stability region, for all eigenvalues λ in the case of a system,
then the time integration method is stable.

• We often cannot control λ directly, as it usually is given by the physics or the spatial
discretization in the case of PDEs.

• ∆t will usually be the first step for controlling stability, and we will often speak of
time-step restrictions, which refer to maximum values of ∆t allowed for stability.

• Zero stability manifests itself in the diagram as stability in the left-half-plane as we
  zoom into the origin by making ∆t → 0.

Example 2.11 (Eigenvalue Stability of Backward Euler). The backward Euler method, also
known as first-order backward differencing (BDF1), is a one-step method similar to forward
Euler with the exception that the forcing term is evaluated at n+1 instead of at n. Applying
this method to our prototypical eigenvalue system in Eqn. 2.16 we have

un+1 = un + ∆t λun+1 .

The stability analysis now proceeds in a way similar to the forward Euler case. Introducing
the amplification factor via Eqn. 2.17,

    g^{n+1} u^0 = g^n u^0 + ∆t λ g^{n+1} u^0
    g = 1 + g λ∆t
    ⇒ g = 1 / (1 − λ∆t)

To determine the stability boundary, we set |g| = 1, i.e. g = e^{iθ},

    e^{iθ} = 1 / (1 − λ∆t)
    e^{iθ} − λ∆t e^{iθ} = 1
    ⇒ λ∆t = 1 − e^{−iθ},

which is a circle of radius 1 centered at 1 in the complex number plane. Choosing the circle
origin, λ∆t = 1, shows that the inside of the circle is unstable while the outside is stable.
Therefore, we obtain the stability region shown in Figure 2.8.

Figure 2.8: Eigenvalue stability region for the backward Euler method. The region inside the
circle is unstable as it yields amplification factors with |g| > 1.

Note the much larger extent of the backward Euler stability region compared to forward Euler. In particular, the entire left
half plane in Figure 2.8 is stable. When this happens, we say that the method is A-stable.

———
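A quick numerical spot check of A-stability (sample points chosen by us): every sampled λ∆t in the left half plane gives |g| < 1, while a point inside the unit circle centered at +1 gives |g| > 1.

```python
def g_be(lam_dt):
    """Backward Euler amplification factor, g = 1/(1 - lambda*dt)."""
    return 1.0 / (1.0 - lam_dt)

# A-stability spot check: sampled points with Re(lambda*dt) < 0 all damp
left_half = [complex(-x, y) for x in (0.01, 1.0, 100.0)
                            for y in (-50.0, 0.0, 3.0)]
a_stable_samples = all(abs(g_be(z)) < 1 for z in left_half)

# inside the circle centered at +1 the method amplifies, e.g. at lambda*dt = 0.5
amplifies_inside = abs(g_be(0.5)) > 1
```

Analytically this is clear from |g| = 1/|1 − λ∆t|: any λ∆t with negative real part lies at distance greater than 1 from the point +1.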

2.4.3 Implications for Convergence
In order for a multistep time-integration method to be convergent, as defined in Section 2.3.3,
it must be both consistent and stable. This result is summarized in the Dahlquist Equivalence
Theorem, which states that a method is convergent if and only if it is zero stable and
consistent. For consistency, we need at least first-order accuracy, i.e. for the truncation
error to be at least O(∆t2 ). For stability we are concerned with zero stability because the
idea of convergence applies to the case ∆t → 0.
We gain some intuition for the Dahlquist Equivalence Theorem by looking at non-
convergent situations. If a method is not zero stable, as in Example 2.9, then the error
will explode after a few iterations regardless of the time step size and hence clearly we will
not have convergence. If a method is not consistent, the solution might not blow up but in-
stead it might “converge” to an incorrect value. Thus, we need both stability and consistency
to see convergence.

2.5 Multi-Stage Methods


We saw that multi-step methods evaluate the state at time node n+1 using the state at time
node n and possibly at other preceding time nodes. The higher the order of a multi-step
method, the more preceding states such a method uses. Since all of the previous states
are already known, calculating the next state is relatively inexpensive. However, there are
some drawbacks. A practical one is the question of what to do at the initial time steps,
where the required number of preceding states is not available. The answer here is usually
to use a low-order method until enough states are available. Another drawback is that
accurate multi-step methods generally have poor stability for nonzero eigenvalues close to
the imaginary axis (e.g. oscillatory modes of a system). Multi-stage methods address both
of these drawbacks, albeit at additional cost per time step.
As illustrated in Figure 2.9, multi-stage methods use temporary intermediate states to
compute the state at time node n + 1 using only the previous state at n.


Figure 2.9: Schematic comparison of multi-step versus multi-stage time-integration methods. In


multi-stage integration, the intermediate fractional states are temporary: they are only constructed
for the purpose of calculating the n + 1 state from the n state, and they are discarded in the next
time step.

2.5.1 Predictor-Corrector Methods
The simplest multi-stage method is predictor-corrector integration, already introduced in
Section 2.1.2. There are various forms of this method, one of them being the already-
presented mix of forward Euler and trapezoidal integration:
    ũ^{n+1} = u^n + ∆t f(u^n , t^n )
    u^{n+1} = u^n + (∆t/2) [ f(u^n , t^n ) + f(ũ^{n+1} , t^{n+1} ) ].
Note that here, ũn+1 is the temporary intermediate state. This method is second-order
accurate because it uses the average of the slope at the current state and at a (rough)
prediction of the n + 1 state.
Example 2.12 (Eigenvalue Stability of the Predictor-Corrector Method). For multi-stage
methods we can use the same eigenvalue stability analysis procedure that we used for multi-
step methods. That is, we consider a homogeneous scalar eigenvalue problem, f (u) = λu,
and an exponentially-growing/decaying state, un = g n u0 . Combining the two stages of
predictor-corrector into one equation, we have
    u^{n+1} = u^n + (∆t/2) [ f(u^n , t^n ) + f(ũ^{n+1} , t^{n+1} ) ]
            = u^n + (∆t/2) [ λu^n + λũ^{n+1} ]
            = u^n + (∆t/2) [ λu^n + λ( u^n + ∆tλu^n ) ]
            = u^n [ 1 + λ∆t + (1/2)(λ∆t)^2 ]
    ⇒ g^{n+1} u^0 = g^n u^0 [ 1 + λ∆t + (1/2)(λ∆t)^2 ]
            g = 1 + λ∆t + (1/2)(λ∆t)^2
Having solved for the amplification factor, we can now determine the region of stability. The
stability boundary is given by those λ∆t values that give |g| = 1, i.e. g = eiθ . Substituting
eiθ for g above, we obtain the following quadratic equation,
    (1/2)(λ∆t)^2 + λ∆t + 1 − e^{iθ} = 0
    ⇒ λ∆t = −1 ± sqrt( 2e^{iθ} − 1 )
Figure 2.10 shows a plot of this solution for all values of θ ∈ [0, 2π]. Note the vertically-oval
nature of the stability boundary. We particularly care about the behavior near the origin,
where eigenvalues that are close to the imaginary axis now have a chance of being inside
the stability boundary (e.g. for a sufficiently-small time step). Compare, for example, to
the forward Euler stability boundary in Figure 2.7, which, as a circle, turns away from the
origin much sooner.


Figure 2.10: Stability boundary for the combined Euler/Trapezoidal predictor-corrector method.
The region enclosed by the boundary is stable.

———
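The amplification factor g = 1 + λ∆t + (λ∆t)²/2 can be probed directly (sample points are our own): the negative real axis is stable out to λ∆t = −2, where |g| = 1 exactly, and a point just to the left of the imaginary axis is stable.

```python
def g_pc(z):
    """Predictor-corrector amplification factor, g = 1 + z + z^2/2, z = lambda*dt."""
    return 1 + z + 0.5 * z * z

# sample the negative real axis: stable, marginal, unstable
gm1, gm2, gm25 = (abs(g_pc(z)) for z in (-1.0, -2.0, -2.5))

# a lightly-damped oscillatory mode, just left of the imaginary axis
near_imag = abs(g_pc(complex(-0.05, 0.5)))
```

This illustrates the remark above: eigenvalues near the imaginary axis can fall inside the stability boundary for a sufficiently small time step, in contrast to forward Euler.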

2.5.2 Runge-Kutta Methods


Probably the most widely-used multi-stage methods are of the Runge-Kutta family. In this
section we discuss and analyze two such methods: a second-order and a fourth-order version.

2.5.2.1 Two-stage Runge-Kutta


A popular two-stage method is modified Euler,

u^{n+1/2} = u^n + (1/2)∆t f(u^n, t^n)
                                                                (2.18)
u^{n+1} = u^n + ∆t f(u^{n+1/2}, t^{n+1/2})

In the first stage, forward Euler is used to obtain an approximation to the state at the
midpoint of the time step. The slope, f , at this midpoint state is then used to advance the
state from node n to n + 1. Figure 2.11 presents these steps graphically for a scalar problem.

Example 2.13 (Order of Accuracy of Modified Euler). We already mentioned that the
modified Euler method is second-order accurate. Let’s show this for a scalar problem by
calculating the local error in one time step: n → n + 1. In the first stage, we recognize that

[Figure: sketch of u versus t showing the state u^n, the predicted midpoint state u^{n+1/2},
the slopes f(u^n, t^n) and f(u^{n+1/2}, t^{n+1/2}), and the updated state u^{n+1}]

Figure 2.11: Graphical representation of the two stages in modified Euler time integration.

f(u^n, t^n) = u_t^n, the time derivative of the state at n, so

u^{n+1/2} = u^n + (∆t/2) u_t^n.

Now, in the second stage,

f(u^{n+1/2}, t^{n+1/2}) = d/dt [ u^n + (∆t/2) u_t^n ] = u_t^n + (∆t/2) u_tt^n.

So the second stage becomes

u^{n+1} = u^n + ∆t [ u_t^n + (∆t/2) u_tt^n ] = u^n + ∆t u_t^n + (∆t²/2) u_tt^n.
But this last expression is just a truncated Taylor-series expansion of u about t^n. The leading
missing term is O(∆t³), which is then the truncation error, or local order of accuracy. Since
the global order of accuracy is one order lower, the global error is O(∆t²), and the method
is second-order accurate.
———
Example 2.14 (Stability of Modified Euler). Following Example 2.12, we combine the two
stages of modified Euler for a scalar problem, f (u) = λu, into one equation,
u^{n+1} = u^n + ∆t λ [ u^n + (1/2)∆t λu^n ]
        = u^n + λ∆t u^n + (1/2)(λ∆t)² u^n
⇒  g^{n+1} u^0 = g^n u^0 [ 1 + λ∆t + (1/2)(λ∆t)² ]
⇒  g = 1 + λ∆t + (1/2)(λ∆t)²
This is the same amplification factor as for the predictor-corrector method in Example 2.12,
and hence the stability region is as shown in Figure 2.10. In summary, modified Euler is a
second-order accurate explicit time integration method.
———

2.5.2.2 Four-stage Runge-Kutta

One of the most popular four-stage integration methods is the following sequence:

f_0 = f(u^n, t^n)
f_1 = f(u^n + (1/2)∆t f_0, t^n + ∆t/2)
f_2 = f(u^n + (1/2)∆t f_1, t^n + ∆t/2)                          (2.19)
f_3 = f(u^n + ∆t f_2, t^n + ∆t)
u^{n+1} = u^n + (∆t/6)(f_0 + 2f_1 + 2f_2 + f_3)

Note that we present this method slightly differently from the previous ones, in terms of
f-values instead of state values. This is how one would implement the method for efficiency,
since it minimizes the number of evaluations of f, which are typically the most expensive
part of the time integration. Although they are hidden, we do have three temporary states
in the above sequence, namely u^n + (1/2)∆t f_0, u^n + (1/2)∆t f_1, and u^n + ∆t f_2.
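To make the sequence concrete, here is a minimal Python sketch of one step of this scheme (the course listings are in MATLAB; the function name `rk4_step` is our own, not from the notes):

```python
def rk4_step(f, u, t, dt):
    """One step of the four-stage Runge-Kutta sequence in Eqn. 2.19."""
    f0 = f(u, t)
    f1 = f(u + 0.5 * dt * f0, t + 0.5 * dt)
    f2 = f(u + 0.5 * dt * f1, t + 0.5 * dt)
    f3 = f(u + dt * f2, t + dt)
    return u + (dt / 6.0) * (f0 + 2.0 * f1 + 2.0 * f2 + f3)

# one step on u' = -u from u(0) = 1; the result matches exp(-dt) to O(dt^5)
u1 = rk4_step(lambda u, t: -u, 1.0, 0.0, 0.1)
```

Note that each step costs exactly four evaluations of f, which is where the "four-stage" name comes from.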

Example 2.15 (Stability of Fourth-Order Runge-Kutta). For the linear scalar problem
f (u) = λu, we can combine all four stages in Eqn. 2.19 into one equation,

u^{n+1} = u^n [ 1 + λ∆t + (1/2)(λ∆t)² + (1/6)(λ∆t)³ + (1/24)(λ∆t)⁴ ]

Substituting u^n = g^n u^0 gives us the amplification factor

g = 1 + λ∆t + (1/2)(λ∆t)² + (1/6)(λ∆t)³ + (1/24)(λ∆t)⁴

We can attempt to set g = eiθ and to solve for λ∆t, but this is difficult as we would have
to solve a fourth-order polynomial equation. A simpler method of obtaining the stability
boundary is to plot contours of |g| in the real/imaginary complex number plane of λ∆t and
to identify the |g| = 1 contour. Listing 2.3 presents a code that makes such a contour plot,
and Figure 2.12 shows the result.

Listing 2.3: Code for computing the stability boundary of fourth-order Runge-Kutta integration.
function stabRK4
% PURPOSE:
%   Plots the stability region for four-stage Runge-Kutta
% INPUTS: none
% OUTPUTS: none, makes a plot

% grid of points in complex lambda-delta t space
xv = linspace(-3,1,400);
yv = linspace(-3.5,3.5,400);
[x,y] = meshgrid(xv,yv);
z = x + i*y;  % z = lambda * delta t
g = 1 + z + 0.5*z.^2 + (1/6)*z.^3 + (1/24)*z.^4;  % amplification factor
gmag = abs(g);  % magnitude

% make contour plot
figure(1); clf;
contour(x,y,gmag,[1,1.05],'b-','linewidth',2); hold on;
xlabel('Real (\lambda\Delta t)','fontsize',16);
ylabel('Imag (\lambda\Delta t)','fontsize',16);
plot([0, 0], [-3.5, 3.5], 'k-', 'linewidth', 2);  % y axis
plot([-3, 1], [0, 0], 'k-', 'linewidth', 2);      % x axis
grid on;
axis equal;
axis([-3, 1, -3.5, 3.5]);

% print figure to pdf
set(gca,'fontsize',16);
s = '../figs/stabRK4.eps';
print('-depsc2', s);
unix(sprintf('epstopdf %s; rm %s', s, s));

Testing a point such as λ∆t = −1 shows that the interior of the contour is stable. Note that
the stability boundary encloses portions of the imaginary axis – this is good for stability of
lightly-damped oscillatory modes. Also, the stability boundary extends further along the
negative real axis compared to forward Euler and second-order Runge-Kutta, which means
that the time steps could be larger for a given eigenvalue. Finally, the name “four-stage” is
given to this method because a minimum of four function evaluations, f (u, t), are required
to take one time step. Equivalently, there are three intermediate states.

———

2.6 Stiffness and Implicit Methods


2.6.1 Stiffness
We call a system of equations stiff if it exhibits a wide range of time scales. Physically,
this means that the system contains both fast and slow processes. For example, imagine
slowly stretching and compressing a stiff spring: your applied forcing function is slow, but
the spring’s response to any disturbances is very quick. Stiffness also occurs for systems
without forcing if some responses are much faster compared to others. Mathematically, for
linear or linearized systems, this means that the magnitudes of the system eigenvalues are

[Figure: |g| = 1 contour in the complex λ∆t plane; axes Real(λ∆t) and Imag(λ∆t)]

Figure 2.12: Stability boundary for the fourth-order Runge-Kutta method in Eqn. 2.19. The
region inside the contour is stable.

highly-disparate. A rule of thumb is that if the ratio of the largest to smallest eigenvalue
magnitudes, |λ_max|/|λ_min|, is greater than about 1000, the system is likely stiff.
As an example, consider the following system:

u̇ = Au,   where   A = [ −1, 1 ; 0, −1000 ]                      (2.20)

The eigenvalues are λ_1 = −1 and λ_2 = −1000. Suppose we wanted to integrate this system
forward in time, starting from t = 0, u = [1, 1]^T, to t = 1. Using forward Euler, we calculate
the time step restriction from Figure 2.7 as ∆tmax = .002, since anything larger would put
λ2 ∆t in the unstable region, λ∆t < −2. Let’s see how the solution looks for various ∆t.
Figure 2.13 shows the forward Euler solution versus time for three time steps, ∆t = .001,
∆t = .002, and ∆t = .0021. We make several observations about these results. First, the
eigenvalue stability analysis that we did earlier proves correct, since for time steps ∆t > .002
the integration is unstable: u1 and u2 exhibit exponentially-growing oscillations. Second, for
this problem, the decay of u1 is the slow, “long-time”, phenomenon, while the decay of u2
is a fast transient. Third, at the stability limit, we already see oscillations in u2 , and while
they do not grow, they also do not go away. Fourth, beyond the stability limit, oscillations
in u2 grow ... and they pollute u1 through the coupling of the states in the system. Finally,
we note that if we only care about the long-time behavior exhibited by u1 , which has a time
scale of O(1), then the maximum time step of ∆t = .002 set by the stability restriction is
likely excessively small. The remedy for this situation is an implicit time integration method.
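The experiment is easy to reproduce. The following Python sketch (our own code, not from the notes, which use MATLAB) applies forward Euler to Eqn. 2.20 and exhibits the blow-up once ∆t exceeds 0.002:

```python
import numpy as np

def forward_euler(A, u0, dt, T):
    """Integrate u' = A u from t = 0 to t = T with forward Euler."""
    u = np.array(u0, dtype=float)
    for _ in range(int(round(T / dt))):
        u = u + dt * (A @ u)
    return u

A = np.array([[-1.0, 1.0], [0.0, -1000.0]])
u_ok = forward_euler(A, [1.0, 1.0], 0.001, 1.0)    # inside the stability limit
u_bad = forward_euler(A, [1.0, 1.0], 0.0025, 1.0)  # beyond dt_max = 0.002: unstable
```

With ∆t = 0.001 the fast component u_2 is damped and u_1 tracks its slow decay; with ∆t = 0.0025 the u_2 component grows without bound, as the eigenvalue analysis predicts.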

2.6.2 Implicit Methods for Linear Systems


In Section 2.2 we made a distinction between explicit and implicit multi-step integration
methods. In the following we will restrict our attention to multi-step methods, although
multi-stage and other time integration methods can also be classified as explicit or implicit.
In an implicit multi-step method, the update formula for un+1 , i.e. Eqn. 2.4, depends on
the unknown un+1 through the forcing function f . This means that we cannot evaluate un+1
as easily as in an explicit method because un+1 also appears on the right-hand side of the
equation. However, on the face of it this is not a huge problem: we just have to re-arrange
the update equation to put the unknown un+1 on the left-hand side. Let’s see how this works
for a prototypical implicit time integration method: backward Euler.

Example 2.16 (The Backward Euler Method for a Stiff Linear System). Consider again
the 2 × 2 system in Eqn. 2.20. The backward Euler update equation for this system reads

u^{n+1} − u^n = ∆t A u^{n+1}.                                   (2.21)

The only difference between this equation and a forward Euler update is the last superscript
n + 1, on the right-hand side, which would be n for forward Euler. However, this is a big
difference for the solution approach and for stability.

[Figure: six panels showing u_1 (left column) and u_2 (right column) versus time, for
∆t = .001, .002, and .0021]

Figure 2.13: Forward Euler time integration results for the stiff system given in Eqn. 2.20.

To solve Eqn. 2.21 we put terms with the unknown u^{n+1} on the left-hand side and
everything else on the right-hand side,

u^{n+1} − ∆t A u^{n+1} = u^n
(I − ∆t A) u^{n+1} = u^n
u^{n+1} = (I − ∆t A)^{−1} u^n

For linear systems, the matrix inversion can be precomputed before the simulation, and this
saves redundant calculations at each time step.
Turning back to the problem at hand, Eqn. 2.20, we integrate the 2 × 2 system from
t = 0, u = [1, 1]^T, to t = 1 – the same setup as we had for forward Euler. For backward Euler,
there is no time step restriction, as indicated in Figure 2.8, and we run with a wider range of
time steps. Figure 2.14 shows the backward Euler solution versus time for three time steps,
∆t = .001, ∆t = .01, and ∆t = .1. The lack of time step restriction allows us to obtain
stable results with time steps that are much larger than we could use for an explicit method.
For ∆t = .1, we only have 10 time steps in the simulation. In this case, the transient in u2
is severely under-resolved, as expected since this transient is very fast. On the other hand,
the u1 solutions are very similar for all three time steps.
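A minimal Python sketch of this update (our own code; the precomputed-inverse idea is as described above) makes the lack of a time step restriction easy to check:

```python
import numpy as np

def backward_euler(A, u0, dt, T):
    """Integrate u' = A u with the update u^{n+1} = (I - dt*A)^{-1} u^n."""
    M = np.linalg.inv(np.eye(len(u0)) - dt * A)  # formed once, outside the time loop
    u = np.array(u0, dtype=float)
    for _ in range(int(round(T / dt))):
        u = M @ u
    return u

A = np.array([[-1.0, 1.0], [0.0, -1000.0]])
u_end = backward_euler(A, [1.0, 1.0], 0.1, 1.0)  # dt fifty times the explicit limit
```

Even at ∆t = 0.1 the integration stays stable: the transient in u_2 is crushed in one step and the slow mode in u_1 is still captured to within a few percent.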

———

The previous example showed that solving for u^{n+1} requires inverting an N × N system.
This will be the case for all implicit methods, not just backward Euler. Depending on the
properties of A, the matrix inversion could be expensive – up to O(N³) for general dense
matrices. This cost has to be factored into the decision when choosing between explicit and
implicit methods.
Another consideration besides stability when choosing between time integration methods
is accuracy. In Figure 2.15 we compare the accuracy versus time step of several popular
explicit and implicit methods for solving the system in Eqn. 2.20. We measure accuracy of
the solution we care about, u_1(t), and we define the error as the maximum difference over
the simulated time history between the computed solution and the analytical solution,
which is

u_1(t) = (1000 e^{−t} − e^{−1000t}) / 999.
The results in Figure 2.15 show that explicit methods are desirable when high accuracy
is required, in this case error levels below approximately 10^{−3}. For errors below this, the
time step restriction as dictated by the accuracy requirement is below the stability limit and
hence the cheaper explicit methods make sense. However, if the accuracy requirement is
less stringent, say an error level of 10^{−2}, the explicit methods do not really get any cheaper
because they are stuck with the small time step restriction due to stability. On the other
hand, the implicit methods can deliver the desired less-stringent accuracy with many fewer
time steps, almost two orders of magnitude in our particular example.

[Figure: six panels showing u_1 (left column) and u_2 (right column) versus time, for
∆t = .001, .01, and .1]

Figure 2.14: Backward Euler time integration results for the stiff system given in Eqn. 2.20.

[Figure: log-log plot of maximum error in u_1 versus time step ∆t, with curves for Forward
Euler (rate = 1.07), Second-order Runge-Kutta (rate = 2.11), Backward Euler (rate = 1),
and Trapezoidal (rate = 2)]

Figure 2.15: Accuracy comparison of time integration methods for the stiff system in Eqn. 2.20.
The two explicit methods (dashed red) are unstable for the larger time steps, and hence their
accuracy results are not shown over the entire ∆t range. The convergence rates given in the legend
are measured between the two smallest time-step simulations.

Finally, we remark that Figure 2.15 also shows the convergence rates for each of the
methods. These are measurements of what is expected to be the global convergence rate,
e.g. the p in Eqn. 2.12. We measure this rate by expressing the error as ε_global = C∆t^p and
using the ratio of the errors at the last two points (smallest ∆t) to calculate p. We see that
the rates agree almost spot-on with the theoretical expectations.
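As a sketch, the rate measurement amounts to one line (a Python illustration; the error values below are made-up placeholders, not the figure's data):

```python
import math

def observed_rate(dt1, e1, dt2, e2):
    """Observed convergence order p, assuming error = C * dt**p."""
    return math.log(e1 / e2) / math.log(dt1 / dt2)

# a second-order method: halving dt cuts the error by a factor of four
p = observed_rate(1e-3, 4.0e-6, 5e-4, 1.0e-6)
```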

2.6.3 Implicit Methods for Nonlinear Systems


Nonlinear ODEs are ones in which f (u, t) depends non-linearly on the state u – this will
be the case in most practical applications. Implicit methods applied to nonlinear systems
require special attention because the unknown state un+1 cannot be easily isolated out of
f (un+1 , tn+1 ). Instead, we have to solve a nonlinear system of algebraic equations to compute
the updated state.
Consider for example a nonlinear, time-invariant system,

u̇ = f (u).

We note that the changes required for non time-invariant systems are straightforward and
just require keeping track of the time at which f needs to be evaluated. Applying our
prototypical implicit method, backward Euler, gives the update formula,

u^{n+1} = u^n + ∆t f(u^{n+1})
u^{n+1} − ∆t f(u^{n+1}) = u^n
R(u^{n+1}) ≡ u^{n+1} − ∆t f(u^{n+1}) − u^n = 0

This is a nonlinear system of equations which we can solve using the Newton-Raphson method.
The first step in Newton-Raphson is to rearrange the equations to define a residual, R(un+1 ),
as we did in the last line above. We now pose the following problem:

Determine w such that R(w) = 0. (2.22)

Here w is just a temporary variable that we will update in an iterative manner until we drive
the residual to zero, at which point w will be our desired state un+1 .
So how do we solve the problem in Eqn. 2.22? With the Newton-Raphson method we
start with a guess for w – a good one is usually the most recent known state, un . Then, we
iteratively update w by linearizing the residual and driving it to zero. That is, given a value
of w for which R(w) ≠ 0, we seek an update ∆w such that

R(w + ∆w) ≈ 0.

Linearizing the left-hand side and dropping high-order terms, we have

R(w) + (∂R/∂w)|_w ∆w = 0,                                       (2.23)

where (∂R/∂w)|_w is called the residual Jacobian matrix. It is the derivative of the residual
vector with respect to the state. In our linear example, Example 2.16, this was just I − ∆tA.
For general nonlinear problems, the residual Jacobian will depend on the state, and so that's
why we indicate the state (w) about which the linearization is performed.
Solving Eqn. 2.23 for ∆w requires inverting a linear system,

∆w = − [ (∂R/∂w)|_w ]^{−1} R(w).

Once ∆w is available, we update w → w + ∆w and repeat. We usually have to repeat
because the residual at the updated state will still likely be nonzero (note that we dropped
high-order terms to obtain the above linear equation). The number of times we repeat the
Newton-Raphson iteration depends on the required accuracy, which is typically given as
a criterion on a norm of the residual, ‖R‖. The method converges very quickly, in fact
quadratically when close to the solution, so driving the residual norm to very low values is
usually not difficult.
With the multiple matrix solves and residual/Jacobian evaluations, the cost of implicit
methods relative to explicit ones certainly goes up for nonlinear problems. Yet, in practice,
we still have very stiff nonlinear problems that make implicit methods desirable.

Example 2.17 (Newton-Raphson for Root Finding). Consider the following nonlinear
equation,

R(w) = w³ + 2w − 3 = 0,

which has a single real root at w = 1. In this example we will implement the Newton-Raphson
method to solve this equation given a starting guess, w^0.
First, at a given w, we linearize the above equation to determine the Newton-Raphson
update, ∆w. The linearized equation is

(∂R/∂w) ∆w = −R(w)
(3w² + 2) ∆w = −w³ − 2w + 3

Solving for ∆w gives us the Newton-Raphson update,

∆w = (−w³ − 2w + 3) / (3w² + 2)
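The formula is straightforward to implement; here is a minimal Python sketch of the iteration (our own transcription, not the notes' code, which is in MATLAB):

```python
def newton_root(w0, tol=1e-12, max_iter=25):
    """Newton-Raphson for R(w) = w^3 + 2w - 3 = 0 with dw = -R/(dR/dw)."""
    w = w0
    for _ in range(max_iter):
        R = w**3 + 2.0 * w - 3.0
        if abs(R) < tol:
            break
        w += (-w**3 - 2.0 * w + 3.0) / (3.0 * w**2 + 2.0)
    return w

root_a = newton_root(0.0)   # converges in a few iterations
root_b = newton_root(10.0)  # farther guess: more iterations, same root
```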

After implementing this formula numerically, we try two values for w^0, namely 0 and 10,
and we tabulate the convergence of w to the exact solution:

w^0 = 0:
Iteration 1: error = 5.0000E-01
Iteration 2: error = 1.1429E-01
Iteration 3: error = 7.3659E-03
Iteration 4: error = 3.2426E-05
Iteration 5: error = 6.3086E-10

w^0 = 10:
Iteration 1: error = 5.6325E+00
Iteration 2: error = 3.3780E+00
Iteration 3: error = 1.8710E+00
Iteration 4: error = 8.8302E-01
Iteration 5: error = 2.9406E-01
Iteration 6: error = 4.4175E-02
Iteration 7: error = 1.1434E-03
Iteration 8: error = 7.8394E-07
Iteration 9: error = 3.6882E-13
An initial guess far from the root requires more Newton iterations to converge. This is
because the linearization becomes less and less valid the further we move away from the root.
However, note that once the approximate value is close to the root, the convergence is very
rapid: the error decreases quadratically in the last few iterations in each case.
———
Example 2.18 (The Trapezoidal Method for a Nonlinear System). Consider the following
nonlinear modification to the linear system in Eqn. 2.20,

u̇_1 = −u_1 + u_2²
u̇_2 = −1000 u_2

Note that u_2 now appears quadratically in the first equation. We write this as

u̇ = f(u),

where f = [f_1, f_2], and f_1 = −u_1 + u_2², f_2 = −1000 u_2.
Trapezoidal integration, presented in Section 2.2, consists of the following update formula,

u^{n+1} − u^n = ∆t [ (1/2) f(u^{n+1}) + (1/2) f(u^n) ].

Putting all terms on the left-hand side gives us the residual,

R(u^{n+1}) ≡ u^{n+1} − u^n − ∆t [ (1/2) f(u^{n+1}) + (1/2) f(u^n) ] = 0.

Rewriting the residual in terms of w = [w_1, w_2]^T, we have

R(w) ≡ w − u^n − ∆t [ (1/2) f(w) + (1/2) f(u^n) ] = 0.

The residual Jacobian matrix is then

∂R/∂w = I − (1/2)∆t ∂f/∂w,   where   ∂f/∂w = [ −1, 2w_2 ; 0, −1000 ].

A Matlab code that implements the Newton-Raphson method for this residual is given in
Listing 2.4.

Listing 2.4: Code for trapezoidal integration of a nonlinear problem
function [t,u] = NonLinTrap(N)
% PURPOSE:
%   Performs trapezoidal integration of a nonlinear problem
% INPUTS:
%   N : number of time steps
% OUTPUTS:
%   t : vector of N+1 time node values
%   u : 2 by N+1 solution vector

T = 1;                  % length of simulation
dt = T/N;               % time step
t = linspace(0,T,N+1);  % vector of time nodes
u = zeros(2,N+1);       % state vector
u(:,1) = [1;1];         % initial condition
Rnorm = 1e-8;           % Newton-Raphson residual tolerance

for n = 1:N,  % loop over time steps

  fn = calcf(u(:,n));  % f-value at u^n
  w = u(:,n);          % Newton-Raphson initialization

  for k = 1:10,  % allow max 10 Newton-Raphson iterations
    R = w - u(:,n) - dt*(0.5*calcf(w) + 0.5*fn);
    if (norm(R) < Rnorm) break; end
    R_w = eye(2) - 0.5*dt*calcf_u(w);  % residual Jacobian
    dw = -R_w \ R;
    w = w + dw;
  end

  % make sure the residual tolerance was met
  if (norm(R) > Rnorm), error('residual norm not met!'); end;

  u(:,n+1) = w;  % store solution as next state

end

%-------------------------------
function f = calcf(u)
f = [-u(1) + u(2)^2; -1000*u(2)];

%-------------------------------
function f_u = calcf_u(u)
f_u = [-1, 2*u(2); 0, -1000];

Running this code with various time steps (i.e. values of N) gives the error convergence plot
shown in Figure 2.16. As the error we use the maximum difference between the computed
u_1 and the exact u_1(t), which is

u_1(t) = (2000 e^{−t} − e^{−2000t}) / 1999.

Note that we achieve second-order accuracy as expected for the trapezoidal method, and
that we attain stability even for large ∆t.
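For readers working in Python, a direct transcription of the same algorithm (our own sketch, mirroring the structure of Listing 2.4) is below; with N = 1000 it reproduces the exact u_1(1) to within a few times 10^{−5}:

```python
import numpy as np

def f(u):
    return np.array([-u[0] + u[1]**2, -1000.0 * u[1]])

def f_u(u):
    return np.array([[-1.0, 2.0 * u[1]], [0.0, -1000.0]])

def nonlin_trap(N, T=1.0):
    """Trapezoidal integration of the nonlinear system, Newton-Raphson at each step."""
    dt = T / N
    u = np.array([1.0, 1.0])
    for _ in range(N):
        fn = f(u)
        w = u.copy()
        for _ in range(10):  # Newton-Raphson iterations
            R = w - u - dt * (0.5 * f(w) + 0.5 * fn)
            if np.linalg.norm(R) < 1e-10:
                break
            w = w - np.linalg.solve(np.eye(2) - 0.5 * dt * f_u(w), R)
        u = w
    return u

u_end = nonlin_trap(1000)
exact = (2000.0 * np.exp(-1.0) - np.exp(-2000.0)) / 1999.0
```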

———

[Figure: log-log plot of maximum error in u_1 versus time step ∆t; Trapezoidal, rate = 2]

Figure 2.16: Convergence with time step of trapezoidal integration for a nonlinear 2 × 2 system.

Chapter 3

Finite Difference Methods

3.1 Finite Difference Approximations


Finite difference approximations are widely used in discretizing ordinary and partial differ-
ential equations. One of the simplest finite difference formulas is that of a first derivative,

du u(x + ∆x) − u(x)
≈ , (3.1)
dx x ∆x

where in the limit ∆x → 0 we recover the standard definition of a derivative. We can use
this formula to approximate the slope of u(x) at a point of a grid, as shown in Figure 3.1.
Note that for now we assume a uniform grid, which is one in which the points are equally
spaced. On this grid, we represent u(x) through ui , which are values of u at the N + 1 nodes,
xi = i∆x, 0 ≤ i ≤ N .
[Figure: one-dimensional uniform grid with nodes 0, 1, 2, ..., i−1, i, i+1, ..., N−1, N,
half nodes i−1/2 and i+1/2, and spacing ∆x]

Figure 3.1: Uniform grid used for one-dimensional finite difference formulas. The spacing between
all consecutive nodes is ∆x.

If we’re sitting at node i, we can look forward to i + 1 and calculate the first derivative
using Eqn. 3.1,

du ui+1 − ui
forward difference: ≈ . (3.2)
dx i ∆x

But we can also look backward to node i − 1,



du ui − ui−1
backward difference: ≈ . (3.3)
dx i ∆x

Or, for symmetry, we can average the forward and backward formulas to obtain

central difference:   du/dx|_i ≈ (u_{i+1} − u_{i−1}) / (2∆x).   (3.4)
We’ll see in the next section that the central difference formula is more accurate than the
forward or backward difference formulas.
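A quick numerical check anticipates that result (a Python sketch; the test function sin(x) is our choice, not from the notes):

```python
import math

def fd_errors(dx, x=1.0):
    """Errors of the forward, backward, and central differences for d(sin)/dx."""
    u, exact = math.sin, math.cos(x)
    fwd = (u(x + dx) - u(x)) / dx
    bwd = (u(x) - u(x - dx)) / dx
    ctr = (u(x + dx) - u(x - dx)) / (2.0 * dx)
    return abs(fwd - exact), abs(bwd - exact), abs(ctr - exact)

coarse = fd_errors(1e-2)
fine = fd_errors(5e-3)  # halve dx: one-sided errors halve, central error quarters
```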
We can apply the central-difference formula using consecutive nodes to get an
approximation of the first derivative half-way between the nodes,

du/dx|_{i+1/2} ≈ (u_{i+1} − u_i) / ∆x,   and   du/dx|_{i−1/2} ≈ (u_i − u_{i−1}) / ∆x.

Note the use of i ± 1/2, as defined in Figure 3.1. Taking the difference of these two first
derivatives gives us an approximation to the second derivative at node i,

d²u/dx²|_i ≈ [ du/dx|_{i+1/2} − du/dx|_{i−1/2} ] / ∆x ≈ [ (u_{i+1} − u_i)/∆x − (u_i − u_{i−1})/∆x ] / ∆x
           = (u_{i+1} − 2u_i + u_{i−1}) / ∆x².                  (3.5)
These are just a few popular finite difference formulas. Before we discuss how to derive
general formulas, we first look at how to measure their accuracy.

3.2 Truncation Error Analysis


For a finite ∆x, the forward difference formula in Eqn. 3.1 is just an approximation of
the true first derivative at x. We identify the truncation error as the difference between
this approximation, when the exact u is used in the finite difference formula, and the true
derivative,

forward difference truncation error = τ^forward ≡ [u(x + ∆x) − u(x)]/∆x − du/dx|_x ,   (3.6)

where the first term is the approximate derivative and the second is the true derivative.

This definition of the truncation error differs from that used for ODEs in Chapter 2 because
here we are looking at the error in the derivative of interest, whereas before we were looking
at the error in the updated state.
We can calculate the truncation error, or at least the rate of convergence with ∆x, by
using Taylor-series expansions. In the above equation, we expand u(x + ∆x) as

u(x + ∆x) = u(x) + ∆x du/dx + (1/2)∆x² d²u/dx² + (1/6)∆x³ d³u/dx³ + O(∆x⁴).

All derivatives above are taken at x. Using this expansion in Eqn. 3.6, the truncation error
of our forward difference is

τ^forward = (1/∆x) [ u(x) + ∆x du/dx + (1/2)∆x² d²u/dx² + (1/6)∆x³ d³u/dx³ + O(∆x⁴) − u(x) ] − du/dx
          = (1/2)∆x d²u/dx² + (1/6)∆x² d³u/dx³ + O(∆x³).
This shows that the forward difference is first-order accurate, meaning that the truncation
error has a leading term of order ∆x¹. Note that here the leading term in τ gives the order of
accuracy because we are looking at the error in a derivative, as opposed to the error in an
updated state that we considered in ODEs. A similar analysis for the backward difference
scheme shows that it is also first-order accurate,

τ^backward = [u(x) − u(x − ∆x)]/∆x − du/dx|_x
           = (1/∆x) [ u(x) − u(x) + ∆x du/dx − (1/2)∆x² d²u/dx² + (1/6)∆x³ d³u/dx³ + O(∆x⁴) ] − du/dx
           = −(1/2)∆x d²u/dx² + (1/6)∆x² d³u/dx³ + O(∆x³).
The central difference, [u(x + ∆x) − u(x − ∆x)]/(2∆x), is the average of the forward and
backward differences, so its truncation error is

τ^central = (1/2) ( τ^forward + τ^backward )
          = (1/6)∆x² d³u/dx³ + O(∆x³).
This shows that the central difference approximates the derivative to second order, as opposed
to the first-order accuracy of forward and backward differences.
Example 3.1 (Second Derivative Difference Truncation Error). How accurate is the
second-derivative difference in Eqn. 3.5?

d²u/dx²|_i ≈ (u_{i+1} − 2u_i + u_{i−1}) / ∆x²
Using a Taylor-series expansion about node i, we have

u_{i+1} = u(x_i + ∆x) = u_i + ∆x du/dx + (1/2)∆x² d²u/dx² + (1/6)∆x³ d³u/dx³ + O(∆x⁴)
u_{i−1} = u(x_i − ∆x) = u_i − ∆x du/dx + (1/2)∆x² d²u/dx² − (1/6)∆x³ d³u/dx³ + O(∆x⁴)

All derivatives are taken at x_i. The truncation error is the error between the finite difference
approximation via Eqn. 3.5 and the actual second derivative,

τ = (u_{i+1} − 2u_i + u_{i−1})/∆x² − d²u/dx²
  = (1/∆x²) [ 0 + 0 + ∆x² d²u/dx² + 0 + O(∆x⁴) ] − d²u/dx²
  = O(∆x²)

So this finite difference gives a "second-order" accurate approximation of the second
derivative.

———
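The predicted order is easy to verify numerically; in this Python sketch (test function sin chosen by us, whose second derivative is −sin), halving ∆x should cut the error by about four:

```python
import math

def second_diff_error(dx, x=1.0):
    """Error of the three-point formula, Eqn. 3.5, applied to u = sin."""
    approx = (math.sin(x + dx) - 2.0 * math.sin(x) + math.sin(x - dx)) / dx**2
    return abs(approx - (-math.sin(x)))  # exact second derivative is -sin(x)

ratio = second_diff_error(1e-2) / second_diff_error(5e-3)  # about 4
```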

Example 3.2 (An approximation to the fourth derivative). Let's derive a difference
approximation to d⁴u/dx⁴ at node j by applying the second-order central difference
approximation twice. The answer will involve nodes j − 2, j − 1, j, j + 1, j + 2.
At node j in our finite difference grid,

d⁴u/dx⁴ (x_j) ≈ [ d²u/dx²(x_{j+1}) − 2 d²u/dx²(x_j) + d²u/dx²(x_{j−1}) ] / ∆x²
             ≈ [ (u_{j+2} − 2u_{j+1} + u_j) − 2(u_{j+1} − 2u_j + u_{j−1}) + (u_j − 2u_{j−1} + u_{j−2}) ] / ∆x⁴
             = (u_{j+2} − 4u_{j+1} + 6u_j − 4u_{j−1} + u_{j−2}) / ∆x⁴

To determine the order of accuracy, we use Taylor-series expansions,

u_{j−2} = u_j − 2∆x u_x + (1/2)(2∆x)² u_xx − (1/6)(2∆x)³ u_xxx + (1/24)(2∆x)⁴ u_xxxx
          − (1/120)(2∆x)⁵ u_xxxxx + O(∆x⁶)
u_{j−1} = u_j − ∆x u_x + (1/2)∆x² u_xx − (1/6)∆x³ u_xxx + (1/24)∆x⁴ u_xxxx
          − (1/120)∆x⁵ u_xxxxx + O(∆x⁶)
u_j     = u_j
u_{j+1} = u_j + ∆x u_x + (1/2)∆x² u_xx + (1/6)∆x³ u_xxx + (1/24)∆x⁴ u_xxxx
          + (1/120)∆x⁵ u_xxxxx + O(∆x⁶)
u_{j+2} = u_j + 2∆x u_x + (1/2)(2∆x)² u_xx + (1/6)(2∆x)³ u_xxx + (1/24)(2∆x)⁴ u_xxxx
          + (1/120)(2∆x)⁵ u_xxxxx + O(∆x⁶)

Summing these together with the given coefficients yields

u_{j+2} − 4u_{j+1} + 6u_j − 4u_{j−1} + u_{j−2} = 0·u_j + 0·u_x + 0·u_xx + 0·u_xxx
                                                + ∆x⁴ u_xxxx + 0·u_xxxxx + O(∆x⁶).

Since we have to divide by ∆x⁴, the leading error term that remains will be O(∆x²) and so
the finite difference approximation is second-order accurate.

———
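A numerical spot-check of this result (a Python sketch; u = sin, whose fourth derivative is sin itself, is our choice of test function):

```python
import math

def fourth_diff(u, x, dx):
    """Five-point approximation to the fourth derivative derived above."""
    return (u(x + 2*dx) - 4.0*u(x + dx) + 6.0*u(x)
            - 4.0*u(x - dx) + u(x - 2*dx)) / dx**4

e1 = abs(fourth_diff(math.sin, 1.0, 0.10) - math.sin(1.0))
e2 = abs(fourth_diff(math.sin, 1.0, 0.05) - math.sin(1.0))
ratio = e1 / e2  # about 4, consistent with second-order accuracy
```

Note that ∆x should not be taken too small here: the stencil sum shrinks like ∆x⁴, so roundoff cancellation eventually dominates.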

3.3 General Finite-Difference Formulas
In the previous sections we presented a few common finite difference formulas that had
simple interpretations related to the definition of the first derivative. Now we take a look at
systematically deriving general finite difference formulas.
The problem statement reads as follows: given l + r + 1 consecutive points on a uniform
grid, as illustrated in Figure 3.2, determine coefficients a_j, −l ≤ j ≤ r, such that the following
summation (i.e. finite difference)

Σ_{j=−l}^{r} a_j u_j ,                                          (3.7)

approximates a desired derivative quantity, usually at x0 . Note that the finite difference
stencil consists of the point in question, j = 0, as well as l points to the left and r points
to the right. For example, for a forward difference, l = 0, r = 1; for a backward difference,
l = 1, r = 0, and for a central difference, l = r = 1.

[Figure: nodes x_{−l}, x_{−l+1}, ..., x_{−1}, x_0, x_1, ..., x_{r−1}, x_r on a uniform grid with spacing ∆x]

Figure 3.2: Points used in the derivation of a general finite-difference formula. The spacing
between all consecutive nodes is ∆x.

3.3.1 Undetermined coefficients


In this method we insert into Eqn. 3.7 Taylor-series expansions for u_j about x = x_0,

u_j = u_0 + (x_j − x_0) du/dx + (1/2)(x_j − x_0)² d²u/dx² + . . . .

We then group powers of ∆x and choose a_j so that we obtain an approximation of the desired
derivative with the maximum possible accuracy.

Example 3.3 (One-sided Difference for du/dx). How accurate of an approximation can we
obtain for the first derivative, du/dx, at node 0 using information from only one side? In
particular, let's use two points on the left and no points on the right: l = 2, r = 0. So the
players are u_{−2}, u_{−1}, and u_0. We have (all derivatives are at x_0)

u_{−2} = u_0 + (−2∆x) du/dx + (1/2)(−2∆x)² d²u/dx² + (1/6)(−2∆x)³ d³u/dx³ + . . .
u_{−1} = u_0 + (−∆x) du/dx + (1/2)(−∆x)² d²u/dx² + (1/6)(−∆x)³ d³u/dx³ + . . .
Inserting into Eqn. 3.7 and grouping powers of ∆x leaves

Σ_{j=−l}^{r} a_j u_j = a_{−2} u_{−2} + a_{−1} u_{−1} + a_0 u_0
  = a_{−2} [ u_0 + (−2∆x) du/dx + (1/2)(−2∆x)² d²u/dx² + (1/6)(−2∆x)³ d³u/dx³ + . . . ]
  + a_{−1} [ u_0 + (−∆x) du/dx + (1/2)(−∆x)² d²u/dx² + (1/6)(−∆x)³ d³u/dx³ + . . . ]
  + a_0 u_0
  = (a_{−2} + a_{−1} + a_0) u_0
    + (−2a_{−2} − a_{−1}) ∆x du/dx
    + (2a_{−2} + (1/2)a_{−1}) ∆x² d²u/dx²
    + (−(4/3)a_{−2} − (1/6)a_{−1}) ∆x³ d³u/dx³
    + O(∆x⁴)
We want this sum to approximate the first derivative, du/dx, as accurately as possible. But
we only have three knobs, a_{−2}, a_{−1}, a_0, so we can't get too ambitious about accuracy.
Focusing on the first three terms above, we have the following equations:

a_{−2} + a_{−1} + a_0 = 0
−2a_{−2} − a_{−1} = 1/∆x
2a_{−2} + (1/2)a_{−1} = 0

Summing the last two of these, we obtain

−(1/2)a_{−1} = 1/∆x   ⇒   a_{−1} = −2/∆x

Substituting this result into the last equation,

a_{−2} = −(1/4)a_{−1} = 1/(2∆x).

Finally, substituting both results into the first equation gives

a_0 = −a_{−2} − a_{−1} = −1/(2∆x) + 2/∆x = 3/(2∆x)
So our one-sided approximation for the first derivative is

du/dx ≈ (3/(2∆x)) u_0 − (2/∆x) u_{−1} + (1/(2∆x)) u_{−2} = (3u_0 − 4u_{−1} + u_{−2}) / (2∆x)

A quick check in the above equation with the Taylor-series expansions shows that the
truncation error in this finite difference is O(∆x²) – i.e. the difference is second-order accurate.

———
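The same conditions can be assembled and solved as a small linear system; this Python sketch (our own hypothetical helper `onesided_weights`, not from the notes) recovers a_{−2} = 1/(2∆x), a_{−1} = −2/∆x, a_0 = 3/(2∆x):

```python
import math
import numpy as np

def onesided_weights(dx):
    """Undetermined coefficients for (l, r) = (2, 0): match u, du/dx, d2u/dx2 at x0."""
    offsets = np.array([-2.0, -1.0, 0.0]) * dx
    # row k enforces the coefficient of the k-th derivative in the Taylor expansion
    M = np.vstack([offsets**k / math.factorial(k) for k in range(3)])
    rhs = np.array([0.0, 1.0, 0.0])  # reproduce the first derivative exactly
    return np.linalg.solve(M, rhs)

a = onesided_weights(0.1)  # expect [1/(2*0.1), -2/0.1, 3/(2*0.1)] = [5, -20, 15]
```

This systematic formulation is handy when the stencil is larger than a hand calculation comfortably allows.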

3.3.2 Lagrange interpolation


In this method, we derive a finite difference approximation by first fitting a polynomial of
order l + r to the u-values at the l + r + 1 points. We do this by Lagrange interpolation. We
then calculate the desired derivative by differentiating the interpolated polynomial.
Specifically, the order l + r polynomial that interpolates values u_j at the l + r + 1 points
x_j is

ũ(x) = Σ_{j=−l}^{r} L_j(x) u_j ,   where   L_j(x) = Π_{i≠j} (x − x_i)/(x_j − x_i).

Note that the Lagrange polynomial L_j(x) is 1 at node x_j and 0 at all of the other nodes.
Once we have ũ(x) we can differentiate it at x = x_0 to obtain the desired derivative.

Example 3.4 (Derivation of the Central Difference for du dx


). Using l = r = 1, let’s see if we
can use Lagrange interpolation to derive the central difference formula that we presented in
Eqn. 3.4. The three Lagrange interpolating functions are

(x − x0 )(x − x1 )
L−1 (x) = 1 at node j = −1
(−∆x)(−2∆x)
(x − x−1 )(x − x1 )
L0 (x) = 1 at node j = 0
(∆x)(−∆x)
(x − x−1 )(x − x0 )
L1 (x) = 1 at node j = 1
(2∆x)(∆x)

The polynomial

1
X
ũ(x) = Lj (x)uj
j=−1

will then interpolate the three states u−1 , u0 , u1 . To obtain the first derivative at x = x0 we
differentiate this polynomial and evaluate at x = x0 ,
\[
\left.\frac{d\tilde{u}}{dx}\right|_{x_0} = \sum_{j=-1}^{1} \left.\frac{dL_j}{dx}\right|_{x_0} u_j
\]

The derivatives of the Lagrange polynomials are

\begin{align*}
\left.\frac{dL_{-1}}{dx}\right|_{x_0} &= \left.\frac{2x - x_1 - x_0}{2\Delta x^2}\right|_{x_0} = \frac{-1}{2\Delta x},\\
\left.\frac{dL_0}{dx}\right|_{x_0} &= \left.\frac{2x - x_1 - x_{-1}}{-\Delta x^2}\right|_{x_0} = 0,\\
\left.\frac{dL_1}{dx}\right|_{x_0} &= \left.\frac{2x - x_0 - x_{-1}}{2\Delta x^2}\right|_{x_0} = \frac{1}{2\Delta x}.
\end{align*}

So our finite difference approximation is


\[
\left.\frac{du}{dx}\right|_{x_0} \approx \left.\frac{d\tilde{u}}{dx}\right|_{x_0} = \sum_{j=-1}^{1}\left.\frac{dL_j}{dx}\right|_{x_0} u_j = \frac{-1}{2\Delta x}\,u_{-1} + 0\cdot u_0 + \frac{1}{2\Delta x}\,u_1 = \frac{u_1 - u_{-1}}{2\Delta x},
\]
and we see that we have indeed recovered our central difference result. We don’t directly
get the order of accuracy using Lagrange interpolation, but we can reason as follows: we fit
a quadratic function of x to our three points so the slope is going to be a linear function of
x, which has an error (missing term) that behaves like O(∆x2 ). In general, however, getting
the order of accuracy is more straightforward when using Taylor-series expansions.
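The Lagrange route can also be carried out numerically: interpolate data that is 1 at one node and 0 at the others, differentiate the resulting polynomial, and read off the weight. A Python sketch (the helper `fd_weights` is our own, built on numpy's polynomial routines):

```python
import numpy as np

def fd_weights(xs, x0):
    # weight for node j = derivative of the j-th Lagrange polynomial at x0:
    # fit the polynomial through data that is 1 at node j and 0 elsewhere,
    # then differentiate it and evaluate at x0.
    w = []
    for j in range(len(xs)):
        uj = np.zeros(len(xs))
        uj[j] = 1.0                                # L_j is 1 at node j only
        p = np.polyfit(xs, uj, len(xs) - 1)        # interpolating polynomial
        w.append(np.polyval(np.polyder(p), x0))    # dL_j/dx at x0
    return np.array(w)

dx = 0.1
w = fd_weights([-dx, 0.0, dx], 0.0)   # expect [-1/(2 dx), 0, 1/(2 dx)]
```

The recovered weights match the central-difference coefficients derived above.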
———
3.4 Problem Setup and Boundary Conditions
Finite difference formulas can be used to discretize ordinary or partial differential equations
into a set of algebraic equations relating the unknown solution values at the nodes. In this
section we show how to do this for a one-dimensional scalar problem, paying particular
attention to the boundary conditions.
To obtain a solvable system, we should write down as many equations as we have un-
knowns. The unknowns are the values of u at the nodes of our grid. The equations will
generally be the discretized differential equation written down at the grid nodes, so that the
number of equations is the same as the number of unknowns.
Points near or on boundaries require some attention. First of all, points close to bound-
aries might not have enough left or right neighbors in order to apply the desired finite
difference formulas. In such cases one-sided difference formulas of matching accuracy can
usually be used. Second, for points directly on the boundaries, one has to decide whether
to keep these points as unknowns and, if so, what the associated equations should be. We
distinguish between two types of boundary conditions (presented in 1D):
Dirichlet boundary condition: $u$ specified

Neumann boundary condition: $\frac{du}{dx}$ specified

We will focus on Dirichlet boundary conditions, where u is given on the boundary nodes.
We could keep the values of $u$ on these nodes as unknowns and add the trivial equations “$u$ = known value” to our system. This is simple, but it increases the size of the system.
Instead, we will usually treat the given boundary condition values as known quantities, so
that the unknowns are only the interior nodes. Then, if any finite difference needs a value
from our Dirichlet boundary nodes, we use the given boundary condition value.
Example 3.5 (Steady Heat Conduction in a Bar with a Source). The following differential
equation, introduced for multiple dimensions in Section 1.2.3.1, governs the steady-state
temperature distribution inside a one-dimensional bar with a heat source term,
\[
-\kappa \frac{d^2 T}{dx^2} = q(x), \qquad (3.8)
\]
where T (x) is the temperature distribution, κ is the thermal conductivity, and q(x) is the
heat source. Our domain will be 0 ≤ x ≤ 1, and we will use a uniform grid of N intervals so
that the spacing is ∆x = 1/N .
Let's use a second-order difference, i.e. Eqn. 3.5, to discretize the second derivative $\frac{d^2T}{dx^2}$ in our equation. The discretized form of Eqn. 3.8 then becomes, at node $x_i$,
\[
-\kappa\, \frac{T_{i+1} - 2T_i + T_{i-1}}{\Delta x^2} = q(x_i).
\]
Rearranging the terms,
\[
-T_{i-1} + 2T_i - T_{i+1} = \frac{1}{\kappa}\Delta x^2\, q(x_i).
\]
κ
This equation holds at all interior nodes. Now we need boundary conditions. Suppose that
the temperature is fixed at both ends of the bar, so that T0 and TN are both known constants.
This means that we do not have any unknowns or equations at i = 0 and i = N . Also, the
equations at i = 1 and i = N − 1 need to be rearranged slightly to put the known values on
the right-hand side.
\begin{align*}
\text{at } i = 1: &\qquad 2T_1 - T_2 = \frac{1}{\kappa}\Delta x^2 q(x_1) + T_0\\
\text{at } i = N-1: &\qquad -T_{N-2} + 2T_{N-1} = \frac{1}{\kappa}\Delta x^2 q(x_{N-1}) + T_N
\end{align*}
κ
So our system of equations, in matrix form, reads
    
\[
\begin{bmatrix}
2 & -1 & 0 & \cdots & & 0\\
-1 & 2 & -1 & & & \\
0 & -1 & 2 & -1 & & \\
\vdots & & \ddots & \ddots & \ddots & \vdots\\
 & & & -1 & 2 & -1\\
0 & & \cdots & 0 & -1 & 2
\end{bmatrix}
\begin{bmatrix} T_1\\ T_2\\ T_3\\ \vdots\\ T_{N-2}\\ T_{N-1} \end{bmatrix}
=
\begin{bmatrix}
q(x_1)\Delta x^2/\kappa + T_0\\
q(x_2)\Delta x^2/\kappa\\
q(x_3)\Delta x^2/\kappa\\
\vdots\\
q(x_{N-2})\Delta x^2/\kappa\\
q(x_{N-1})\Delta x^2/\kappa + T_N
\end{bmatrix}
\]

Listing 3.1 shows a Matlab code that sets up and solves this system and Figure 3.3 shows
results for q(x) = sin(πx), κ = 1, T0 = 1, TN = 2, and N = 16.

Listing 3.1: Computes the temperature in a heated 1D bar with Dirichlet boundary conditions.
function [x,T] = heat1d(T0, TN, kappa, N)
% function [x,T] = heat1d(T0, TN, kappa, N)
% PURPOSE:
%   This function determines the temperature in a heated 1D bar
% INPUTS:
%   T0,TN : temperatures at bar endpoints
%   kappa : heat conductivity inside bar
%   N     : number of intervals
% OUTPUTS:
%   x : N+1 by 1 vector of node locations, including boundary endpoints
%   T : N+1 by 1 vector of temperature values at each x

dx = 1/N;                     % interval length
x  = linspace(dx, 1-dx, N-1); % interior node positions

% set up system: A*T = b
e = ones(N-1,1);
A = spdiags([-e, 2*e, -e], -1:1, N-1, N-1);
b = zeros(N-1,1);
for i = 1:N-1
  b(i) = sin(pi*x(i))*dx^2/kappa;  % source term
end
b(1)   = b(1)   + T0;  % additions to rhs due to Dirichlet BC
b(N-1) = b(N-1) + TN;

T = A\b;               % solve system

x = [0; x'; 1];        % augment with boundary nodes
T = [T0; T; TN];       % augment with boundary values

How good is our solution? The answer depends on the number of points we use. This is a
simple problem and we know the exact solution. So let’s measure accuracy by an L2 error
norm,
\[
\epsilon = \sqrt{\sum_{i=1}^{N-1} \Delta x \left(T_i - T^{\rm exact}(x_i)\right)^2},
\]

where T exact (x) is the exact solution, given by

\[
T^{\rm exact}(x) = \frac{1}{\kappa\pi^2}\sin(\pi x) + T_0 + (T_N - T_0)\, x.
\]
Figure 3.4 shows the convergence of the $L_2$ error norm with mesh refinement. As indicated by the slope, second-order accuracy is achieved, i.e. $\epsilon \sim O(\Delta x^2)$. This supports our analysis in Example 3.1.
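The same solve and convergence study can be sketched in Python (assuming the same source $q(x) = \sin(\pi x)$, conductivity, and Dirichlet values as above; a dense solve stands in for spdiags, which is fine at these sizes):

```python
import numpy as np

def heat1d(T0, TN, kappa, N):
    # interior nodes and the tridiagonal system from the notes
    dx = 1.0 / N
    x = np.linspace(dx, 1.0 - dx, N - 1)
    A = (np.diag(2.0 * np.ones(N - 1))
         + np.diag(-np.ones(N - 2), 1) + np.diag(-np.ones(N - 2), -1))
    b = np.sin(np.pi * x) * dx**2 / kappa     # source term
    b[0] += T0                                 # Dirichlet contributions
    b[-1] += TN
    return x, np.linalg.solve(A, b)

def l2_error(T0, TN, kappa, N):
    x, T = heat1d(T0, TN, kappa, N)
    Tex = np.sin(np.pi * x) / (kappa * np.pi**2) + T0 + (TN - T0) * x
    return np.sqrt(np.sum((1.0 / N) * (T - Tex)**2))

rate = np.log2(l2_error(1, 2, 1, 16) / l2_error(1, 2, 1, 32))   # ~2
```

Doubling the number of intervals cuts the $L_2$ error by about a factor of four, matching the slope in Figure 3.4.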
———
[Figure 3.3: plot of temperature $T$ versus $x$, rising from $T = 1$ at $x = 0$ to $T = 2$ at $x = 1$.]

Figure 3.3: Sample finite-difference result for the temperature in a heated 1D bar, with q(x) =
sin(πx), κ = 1, T0 = 1, TN = 2, and N = 16.

[Figure 3.4: log–log plot of the $L_2$ error norm versus $\Delta x$; the slope indicates a convergence rate of 2.]

Figure 3.4: Convergence of the discrete L2 error norm of the computed temperature distribution
relative to the actual temperature distribution for Example 3.5.

3.5 Multiple Dimensions
In Section 3.3 we presented a general form for finite difference formulas in one dimension.
This form extends to multiple dimensions; in two dimensions, a finite difference formula will
typically involve a rectangular grid of points, as illustrated in Figure 3.5. Note that the

[Figure 3.5: rectangular stencil of nodes $x_{-l}, \ldots, x_r$ by $y_{-d}, \ldots, y_u$.]

Figure 3.5: Points used in the derivation of a general finite-difference formula in two dimensions.
The spacing between all consecutive nodes is ∆x in the x direction and ∆y in the y direction.

center point is i = 0, j = 0 – there are l columns of points to the left, r columns to the
right, d rows below, and u rows above. The total number of points is (l + r + 1)(u + d + 1).
We then use a weighted sum over these points to approximate a desired derivative quantity,
such as the following mixed derivative,

\[
\left.\frac{\partial^{m+n} u}{\partial x^m \partial y^n}\right|_{x_0,y_0} \approx \sum_{i=-l}^{r} \sum_{j=-d}^{u} a_{ij}\, u_{ij}. \qquad (3.9)
\]

We can still use either the method of undetermined coefficients or Lagrange interpolation
in order to calculate the coefficients aij . When using undetermined coefficients, the Taylor-
series expansion in multiple dimensions reads

\begin{align*}
u_{ij} = u_{00} &+ (x_i - x_0)\frac{\partial u}{\partial x} + (y_j - y_0)\frac{\partial u}{\partial y} + \frac{1}{2}(x_i - x_0)^2\frac{\partial^2 u}{\partial x^2}\\
&+ \frac{1}{2}(y_j - y_0)^2\frac{\partial^2 u}{\partial y^2} + (x_i - x_0)(y_j - y_0)\frac{\partial^2 u}{\partial x \partial y} + \ldots
\end{align*}

Note that all derivatives are computed at x0 , y0 . When using Lagrange interpolants, the
Lagrange polynomial at each node is given by the product of one-dimensional Lagrange

interpolants in x and y,
\begin{align*}
\tilde{u}(x,y) &= \sum_{i=-l}^{r} \sum_{j=-d}^{u} L^x_i(x)\, L^y_j(y)\, u_{ij},\\
L^x_i(x) &= \prod_{k \neq i} \frac{x - x_k}{x_i - x_k}, \qquad
L^y_j(y) = \prod_{k \neq j} \frac{y - y_k}{y_j - y_k},\\
a_{ij} &= \left.\frac{\partial^{m+n}}{\partial x^m \partial y^n}\Big(L^x_i(x)\, L^y_j(y)\Big)\right|_{x_0,y_0}.
\end{align*}
Two-dimensional problems require more points and hence are more expensive to set up and
solve compared to one-dimensional problems. However, the basic procedure for setting up
the linear system remains unchanged: each node contributes one unknown and one equation.
Finite difference methods are great for multi-dimensional problems where the domain
is simple – i.e. rectangular. For non-rectangular domains, more general finite difference
formulas that do not rely on equal spacing of the points can be used with so-called curvilinear
grids. Alternatively, the equations on a general domain can sometimes be mapped to a simple
domain, such as a unit square. This changes the equations but allows the use of simple finite
differences on uniform grids.
Example 3.6 (Two-dimensional Poisson’s Equation). Poisson’s equation in two dimensions
is
\[
\underbrace{\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}}_{\nabla^2 u} = f(x,y). \qquad (3.10)
\]

Consider a rectangular domain of dimensions $l_x \times l_y$ with Dirichlet boundary conditions, $u = 0$, as shown in Figure 3.6. Using a grid of $N \times M$ intervals, with boundary values known, the interval spacings are
\[
\Delta x = \frac{l_x}{N}, \qquad \Delta y = \frac{l_y}{M}.
\]
We can use our second-order discretization, Eqn. 3.5, for each second derivative in Eqn. 3.10,
\[
\frac{\partial^2 u}{\partial x^2} \approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{\Delta x^2}, \qquad
\frac{\partial^2 u}{\partial y^2} \approx \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{\Delta y^2}.
\]
Substituting these formulas into Eqn. 3.10 gives the following system,
\begin{align*}
\frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{\Delta x^2} + \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{\Delta y^2} &= f_{i,j},\\
u_{0,j} = u_{N,j} &= 0,\\
u_{i,0} = u_{i,M} &= 0.
\end{align*}

[Figure 3.6: rectangle of dimensions $l_x \times l_y$; interior $\Omega$, boundary $\Gamma$.]

Figure 3.6: Rectangular domain for a two-dimensional Poisson problem.

Note, fi,j is f (x, y) evaluated at node (i, j) in our lattice. We can write this system in matrix
form as

AU = F,

where $U$ is a column vector of all $(N-1)(M-1)$ unknowns, which can in principle be ordered in an arbitrary fashion. Typically, some logical ordering is used, such as stacking unknowns from our $x,y$ lattice row by row,
\[
U = [u_{1,1}, u_{2,1}, \ldots, u_{N-1,1},\; u_{1,2}, u_{2,2}, \ldots, u_{N-1,2},\; \ldots,\; u_{1,M-1}, u_{2,M-1}, \ldots, u_{N-1,M-1}]^T.
\]

Note that the number of unknowns, $N_u = (N-1)(M-1)$, equals the number of grid nodes in the interior of the domain. Since $u$ is prescribed on the boundary through the boundary condition, the nodes on the boundary are not treated as unknowns and do not contribute any equations. $F$ is a column vector of the right-hand side $f(x,y)$ source term evaluated at the $N_u = (N-1)(M-1)$ interior nodes. Finally, $A$ is an $N_u \times N_u$ matrix in which each row contains the coefficients multiplying the $U$ vector for a single equation.
For example, for N = M = 3 and ∆x = ∆y = h, the equations (see definitions of
quantities in Figure 3.7) are

\begin{align*}
\text{at } i=1,\, j=1: &\qquad \frac{1}{h^2}\left(-4u_{11} + u_{21} + u_{12}\right) = f_{11}\\
\text{at } i=2,\, j=1: &\qquad \frac{1}{h^2}\left(u_{11} - 4u_{21} + u_{22}\right) = f_{21}\\
\text{at } i=1,\, j=2: &\qquad \frac{1}{h^2}\left(u_{11} - 4u_{12} + u_{22}\right) = f_{12}\\
\text{at } i=2,\, j=2: &\qquad \frac{1}{h^2}\left(u_{21} + u_{12} - 4u_{22}\right) = f_{22}
\end{align*}
The corresponding matrix system is

[Figure 3.7: $2\times 2$ interior grid with unknowns $u_{11}, u_{21}, u_{12}, u_{22}$, spacing $h$, and $u = 0$ on the boundary.]

Figure 3.7: Nodes for a N = M = 3 grid of points for the two-dimensional Poisson problem.

    
\[
\underbrace{\frac{1}{h^2}
\begin{bmatrix}
-4 & 1 & 1 & 0\\
1 & -4 & 0 & 1\\
1 & 0 & -4 & 1\\
0 & 1 & 1 & -4
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} u_{11}\\ u_{21}\\ u_{12}\\ u_{22} \end{bmatrix}}_{U}
=
\underbrace{\begin{bmatrix} f_{11}\\ f_{21}\\ f_{12}\\ f_{22} \end{bmatrix}}_{F}
\]
For greater N and M , the matrix will be correspondingly larger. Fortunately, however, the
matrix will be sparse (mostly zeros) because in each equation, each unknown only “talks to”
its four immediate neighbors (above, below, left, right).
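For arbitrary $N$ and $M$, one convenient way to assemble $A$ with this row-by-row ordering is via Kronecker products of one-dimensional difference operators. A Python sketch (our own construction, not from the notes; dense matrices are used for brevity where a sparse format would normally be preferred):

```python
import numpy as np

# Assemble the 2D Laplacian of Eqn. 3.10 on an N x M interval grid with
# u = 0 on the boundary, unknowns ordered row by row.
def poisson2d_matrix(N, M, lx=1.0, ly=1.0):
    dx, dy = lx / N, ly / M
    def lap1d(n, h):                       # 1D second-difference operator
        return (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
                + np.diag(np.ones(n - 1), -1)) / h**2
    Ix, Iy = np.eye(N - 1), np.eye(M - 1)
    # d2/dx2 acts within each row of unknowns; d2/dy2 couples the rows
    return np.kron(Iy, lap1d(N - 1, dx)) + np.kron(lap1d(M - 1, dy), Ix)

A = poisson2d_matrix(3, 3)   # N = M = 3, h = 1/3: the 4x4 example above
```

For `N = M = 3` this reproduces the $4\times 4$ matrix of the example (scaled by $1/h^2 = 9$).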
———
3.6 Stability Analysis
We can derive various finite difference approximations for derivatives that appear in ordinary
or partial differential equations. So which finite difference should we use for a particular
problem? Accuracy is one consideration – at the very least, we need consistency which means
that our approximations should be at least first-order accurate. Beyond consistency, high-
order accuracy is usually advantageous, although it should be traded off against increased
computational time. But accuracy is not the only consideration: we should also pay attention
to stability. That is, a very accurate finite difference may be unstable and hence useless.
The PDE analog of the Dahlquist equivalence theorem for ODEs, presented in Section 2.4.3, is the Lax equivalence theorem. This theorem likewise states that

stability + consistency → convergence

Some definitions are as follows:

• stability: the solution remains bounded as ∆x and ∆t approach zero on a fixed space-
time domain.

• consistency: the numerical scheme is derived logically from the PDE; i.e. the error in a
Taylor-series sense between the numerical difference equation and the PDE approaches
zero as ∆x and ∆t approach zero.
• convergence: “if we do enough work, we’ll get the right answer” – i.e. the numerical
solution approaches the exact solution of the PDE as ∆x and ∆t approach zero.
We note that the Lax equivalence theorem applies only to linear problems on a fixed space-
time domain. For nonzero ∆x or ∆t, we might see instability even for a convergent scheme.
In this section we look at two ways to analyze stability given a particular discretization.

3.6.1 Von-Neumann Analysis
Von-Neumann analysis applies to linear initial-value problems on uniform grids with peri-
odic boundary conditions. These assumptions seem limiting, yet the applicability of von-
Neumann analysis is quite broad: a scheme that fails this test has fundamental flaws that
need to be addressed.
The key observation in von-Neumann analysis is that difference equations possess exact
solutions of sinusoidal variation,

\[
u^n_j = \mathrm{Re}\left[g^n e^{ij\phi}\right],
\]

where
• g is a modal amplification factor (a complex number),

• $i = \sqrt{-1}$,
• j is a spatial node index,
• φ is a phase angle which is the amount by which the phase of our solution “mode”
changes per mesh interval.
On a finite mesh with $N$ intervals, $\phi$ takes on discrete values, namely
\[
\phi \in \left\{\frac{\pi}{N}, \frac{2\pi}{N}, \ldots, \pi\right\}.
\]
Low values of φ correspond to low-frequency modes, while high values of φ yield high-
frequency modes, as illustrated in Figure 3.8. Therefore, if the solution at one time instant
is a pure sine wave with phase change of φ per ∆x, then the solution at the next time
level will be that sine wave multiplied by g. Since g may be complex, both the phase and
amplitude of the sine wave can be affected. For stability, we care about the amplitude, so
our requirement is that |g| ≤ 1 for all φ ∈ [0, π] – note that g will be a function of the
frequency of the mode, i.e. φ. This is another way of saying that we do not want any of the
modes representable on the grid to “blow up.”
The steps for performing von-Neumann stability analysis read as follows,

[Figure 3.8: a low-frequency mode, $\phi = \pi/N$, and a high-frequency mode, $\phi = \pi$.]

Figure 3.8: In von-Neumann analysis, φ ∈ [0, π] describes the frequency of a sinusoidal error
mode. Specifically, φ is the phase change per mesh interval.

1. Choose a finite-difference scheme.

2. Substitute $u^n_j = g^n e^{ij\phi}$ into the scheme.

3. Simplify and solve for g(φ).

4. If possible, determine requirements on scheme parameters such that $|g(\phi)| \leq 1$ for all $\phi$ in $[0, \pi]$.

Example 3.7 (FTCS Discretization of the Advection Equation). In this example we look
at the forward-in-time centered-in-space (FTCS) discretization of a linear convection (ad-
vection) PDE,

\[
\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = 0. \qquad (3.11)
\]
“Forward-in-time” means that we discretize the time derivative using a forward difference. “Centered-in-space” means we discretize $\frac{\partial u}{\partial x}$ using a central difference, so that our discretized equation is

\begin{align*}
\frac{u^{n+1}_j - u^n_j}{\Delta t} + a\,\frac{u^n_{j+1} - u^n_{j-1}}{2\Delta x} &= 0\\
\Rightarrow\quad u^{n+1}_j &= u^n_j - \frac{a\Delta t}{2\Delta x}\left(u^n_{j+1} - u^n_{j-1}\right)
\end{align*}
Note that the subscript j denotes the spatial index in our finite difference grid, and the
superscript n denotes the temporal index, i.e. the time node. To apply von-Neumann
stability analysis, we substitute $u^n_j = g^n e^{ij\phi}$ into this equation and calculate the amplification factor $g$,

\begin{align*}
g^{n+1} e^{ij\phi} &= g^n e^{ij\phi} - \frac{a\Delta t}{2\Delta x}\left(g^n e^{i(j+1)\phi} - g^n e^{i(j-1)\phi}\right)\\
g &= 1 - \frac{a\Delta t}{2\Delta x}\left(e^{i\phi} - e^{-i\phi}\right)\\
&= 1 - \frac{a\Delta t}{\Delta x}\, i\sin\phi
\end{align*}
The magnitude of the amplification factor satisfies
\[
|g|^2 = 1 + \left(\frac{a\Delta t}{\Delta x}\right)^2 \sin^2\phi \geq 1.
\]
So the FTCS scheme for advection is unconditionally unstable, because the magnitude of $g$ is always greater than or equal to 1 (equality only in the special case $\phi = \pi$).
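This conclusion is easy to confirm numerically; a short Python sketch that scans $\phi$ for several CFL numbers $\sigma = a\Delta t/\Delta x$:

```python
import numpy as np

# For FTCS, g = 1 - i*sigma*sin(phi), so |g| >= 1 for every phi and every
# positive sigma: shrinking the time step does not rescue the scheme.
phi = np.linspace(0.0, np.pi, 500)
gmin = min(np.min(np.abs(1.0 - 1j * sigma * np.sin(phi)))
           for sigma in [0.01, 0.1, 0.5, 1.0])
# the minimum magnitude is exactly 1, attained where sin(phi) = 0
```

Since the smallest magnitude over all modes is 1, every other mode grows, confirming the unconditional instability.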
———
Example 3.8 (FOU Discretization of the Advection Equation). In this example we look at
the first-order upwind (FOU) discretization of the same PDE as in the previous example,
\[
\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = 0.
\]
The only difference is the spatial discretization, which is now “upwind”, which means
\begin{align*}
\frac{\partial u}{\partial x} &\approx \frac{u^n_j - u^n_{j-1}}{\Delta x} \quad \text{for } a > 0,\\
\frac{\partial u}{\partial x} &\approx \frac{u^n_{j+1} - u^n_j}{\Delta x} \quad \text{for } a < 0.
\end{align*}

Let’s just look at the case a > 0, since the other result is similar. Our discretization is

\begin{align*}
\frac{u^{n+1}_j - u^n_j}{\Delta t} + a\,\frac{u^n_j - u^n_{j-1}}{\Delta x} &= 0\\
\Rightarrow\quad u^{n+1}_j &= u^n_j - \frac{a\Delta t}{\Delta x}\left(u^n_j - u^n_{j-1}\right)
\end{align*}
To apply von-Neumann stability analysis, we substitute $u^n_j = g^n e^{ij\phi}$ into this equation and calculate the amplification factor $g$,
\begin{align*}
g^{n+1} e^{ij\phi} &= g^n e^{ij\phi} - \frac{a\Delta t}{\Delta x}\left(g^n e^{ij\phi} - g^n e^{i(j-1)\phi}\right)\\
g &= 1 - \frac{a\Delta t}{\Delta x}\left(1 - e^{-i\phi}\right)
\end{align*}
Let’s define σ = a∆t/∆x; note that σ is nondimensional and positive. The amplification
factor is then

g = 1 − σ + σ cos φ − iσ sin φ

The square magnitude of g is

|g|2 = (1 − σ + σ cos φ)2 + (σ sin φ)2


= (1 − σ)2 + 2(1 − σ)σ cos φ + σ 2

Now we ask whether |g| is less than or greater than 1. The answer depends on σ and φ. We
don’t have control over φ: this is the frequency of the mode we’re analyzing, and in practice
error modes of all frequencies appear, e.g. due to finite precision calculations. So we have
to consider all possible $\phi \in [0, \pi]$. Since $-1 \leq \cos\phi \leq 1$, we have
\begin{align*}
|g|^2 &= (1-\sigma)^2 + 2(1-\sigma)\sigma\cos\phi + \sigma^2\\
&\leq \max\left[(1-\sigma)^2 + 2(1-\sigma)\sigma(-1) + \sigma^2,\; (1-\sigma)^2 + 2(1-\sigma)\sigma(1) + \sigma^2\right]\\
&= \max\left[(1-\sigma-\sigma)^2,\; (1-\sigma+\sigma)^2\right]\\
&= \max\left[(1-2\sigma)^2,\; 1\right]
\end{align*}

So we see that $|g| \leq 1$ as long as $(1-2\sigma)^2 \leq 1$. Since $\sigma$ is positive, this means that we need $\sigma \leq 1$. The conclusion is that our first-order upwind discretization is stable if $\sigma = |a|\Delta t/\Delta x \leq 1$ – note that we have put absolute values around $a$ in order to include the similar result for $a < 0$. $\sigma$ is called the CFL (Courant-Friedrichs-Lewy) number, and this restriction is a well-known result for convection problems. Note that the requirement on $\sigma$ can be used to set the time step according to $a$ and $\Delta x$.
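A numerical scan of the amplification factor, as a Python sketch, shows the CFL limit directly:

```python
import numpy as np

# Scan g = 1 - sigma*(1 - exp(-i*phi)) over phi in [0, pi] for CFL
# numbers on either side of the limit sigma = 1.
def fou_gmax(sigma, n=2001):
    phi = np.linspace(0.0, np.pi, n)
    g = 1.0 - sigma * (1.0 - np.exp(-1j * phi))
    return np.max(np.abs(g))

stable = fou_gmax(0.9) <= 1.0 + 1e-12    # within the CFL limit
unstable = fou_gmax(1.1) > 1.0           # CFL limit violated
```

For $\sigma > 1$ the largest growth occurs at the highest frequency, $\phi = \pi$, where $g = 1 - 2\sigma$.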
———
3.6.2 Matrix Stability
Von-Neumann stability analysis gives us a theoretical result under idealized conditions such
as a uniform grid and periodic boundary conditions. But sometimes we would like to know
whether a particular method will be stable on an arbitrary grid with other boundary condi-
tions. For such practical situations, a matrix stability analysis can give us this information.
Matrix stability analysis is based on the idea of a semi-discrete form, which is the PDE
written with only the spatial term discretized – the time derivative is left alone. The general
form is given by

\[
\frac{dU}{dt} = AU + Q, \qquad (3.12)
\]
where

• U is the vector of discrete unknowns, such as node values in a finite difference dis-
cretization.

• A is the matrix that arises from the spatial discretization; it represents how the un-
knowns talk to each other through the spatial derivatives.

• Q is a source term independent of U.

Note that this analysis applies to linear equations. Nonlinear equations would have to be
linearized prior to analyzing stability.

But we have already seen Eqn. 3.12 ... this was back in our eigenvalue stability analysis
for ODE systems. For example, Eqn. 2.14 was just our current equation without the source
term, which happens to be irrelevant for stability analysis. So at this point, we rely on
our results from ODEs, where we found that the stability of a time integration method will
generally depend on λi ∆t, where λi are the eigenvalues of A.
The steps for matrix stability analysis are then as follows:
• Apply the spatial discretization in order to obtain a semi-discrete system of the form
in Eqn. 3.12.

• Calculate the eigenvalues of the system matrix A; call them λi .

• Use the results of eigenvalue stability analysis for ODEs to determine whether a par-
ticular time integration method will be stable.
Example 3.9 (Matrix Stability of FOU for Advection). Let’s look at the matrix stability
of the first-order upwind method presented in Example 3.8. We’ll look at one-dimensional
problems with periodic and Dirichlet boundary conditions. In the case of Dirichlet boundary
conditions, we’ll only have to set the state on the inflow boundary. For a > 0 this will be the
left boundary. The outflow boundary state is never used since the derivative is approximated
by an upwind difference. This makes sense because physically the outflow state should not
be relevant for a pure advection problem. Assuming a > 0, our semi-discrete PDE obtained
from Eqn. 3.11 reads
\[
\frac{du_j}{dt} = -a\,\frac{u_j - u_{j-1}}{\Delta x}.
\]
The boundary condition comes into our equation for the first unknown, j = 1, since then
uj−1 = u0 , which is the prescribed Dirichlet boundary condition. So in matrix form our
equation with Dirichlet boundary conditions reads
      
\[
\frac{d}{dt}
\begin{bmatrix} u_1\\ u_2\\ u_3\\ \vdots\\ u_{N-1}\\ u_N \end{bmatrix}
= -\frac{a}{\Delta x}
\underbrace{\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0\\
-1 & 1 & 0 & \cdots & 0 & 0\\
0 & -1 & 1 & \cdots & 0 & 0\\
\vdots & & \ddots & \ddots & & \vdots\\
0 & \cdots & 0 & -1 & 1 & 0\\
0 & \cdots & 0 & 0 & -1 & 1
\end{bmatrix}}_{A}
\begin{bmatrix} u_1\\ u_2\\ u_3\\ \vdots\\ u_{N-1}\\ u_N \end{bmatrix}
+
\begin{bmatrix} a u_0/\Delta x\\ 0\\ 0\\ \vdots\\ 0\\ 0 \end{bmatrix}.
\]
Two entries change when periodic boundary conditions are used: the top-right entry of $A$ changes from 0 to $-1$, and the source-term entry $a u_0/\Delta x$ becomes zero.
We use Matlab to compute the eigenvalues of the system matrix A using both Dirichlet
and periodic boundary conditions for various N and a = 1, ∆x = 1/N (i.e. a domain of
length L = 1). Figure 3.9 shows the results. In both cases, we note that the results are

consistent with the von-Neumann stability analysis. For example, when a forward Euler
time discretization is used, the eigenvalue stability region is as shown in Figure 2.7. To get
λ∆t into this stability region, we would need ∆t < 1/N . But this result is specific to our
choice of a and L, and a dimensional analysis shows that the general result is ∆t < ∆x/a,
i.e. σ = a∆t/∆x < 1, as derived in Example 3.8.
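A Python sketch of this eigenvalue computation (the notes use Matlab; `fou_matrix` is our illustrative helper, with $a = 1$ and $\Delta x = 1/N$ as in the figure):

```python
import numpy as np

# Assemble the FOU system matrix for a > 0 with Dirichlet (inflow)
# or periodic boundary conditions, and inspect its eigenvalues.
def fou_matrix(N, a=1.0, periodic=False):
    dx = 1.0 / N
    A = np.eye(N) - np.diag(np.ones(N - 1), -1)   # (u_j - u_{j-1}) coupling
    if periodic:
        A[0, -1] = -1.0                            # u_0 wraps around to u_N
    return -(a / dx) * A

lam_d = np.linalg.eigvals(fou_matrix(16))                  # Dirichlet
lam_p = np.linalg.eigvals(fou_matrix(16, periodic=True))   # periodic

# Dirichlet: a defective (Jordan-type) block whose eigenvalues all equal
# -a/dx = -16, though the computed values scatter slightly around it.
# Periodic: a circulant matrix with eigenvalues -a/dx*(1 - exp(-i*theta)),
# which lie on a circle of radius a/dx centered at -a/dx.
```

In both cases the eigenvalues have nonpositive real parts, consistent with Figure 3.9.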

[Figure 3.9 panels: eigenvalues in the complex plane for N = 8, 16, 32; (a) Dirichlet BCs, (b) periodic BCs.]

Figure 3.9: Eigenvalues of the system matrix for a first-order upwind (FOU) discretization of
the one-dimensional advection equation, using both Dirichlet and periodic boundary conditions.
———
Chapter 4

Probabilistic Methods

4.1 Probability Review
In this section we provide a review of basic probability and statistics.

4.1.1 Outcomes and Events
When performing an experiment, an outcome, ζ, is a result of the experiment that is recorded.
An event, A, is some set of outcomes to which we can assign a probability. An elementary
event is one that consists of a single outcome.

Example 4.1 (Manufacturing of a Solar Cell Array). Consider the manufacturing process of
a solar cell array. The “experiment” is the manufacturing of a single cell. A simple outcome
could be the binary result of a quality-control test: an outcome value of $\zeta = 1$ indicates a pass of the test, while an outcome value of $\zeta = 0$ indicates a fail.
The event that a produced cell passes the test is an example of an elementary event.
Suppose now that we have another experiment, in which we manufacture 100 solar cells
and count Ndefective = the number of defective cells – i.e. the ones that fail the quality-
control test. An event in this case could be the situation where Ndefective < 5. This is not an
elementary event, since the event consists of five possible outcomes, Ndefective ∈ {0, 1, 2, 3, 4}.
———
4.1.2 The Meaning of Probability
We denote an event by $A$, and the probability of that event happening by $P(A)$. This probability must satisfy the following conditions:

• P (A) ≥ 0.

• P (A) = 1 ⇔ the event A is certain.

• Given two mutually exclusive events, A and B, P (A ∪ B) = P (A) + P (B).

4.1.3 Random Variables
A random variable, which we will denote by an upper-case letter, e.g. X, is a value that is
a function of the outcome of the experiment. We sometimes indicate the dependence on the
outcome explicitly by writing X(ζ). This could be just the value of the outcome itself, or it
could be a more complicated function. A random variable could be discrete, meaning that it
takes on a finite or countably infinite number of values, or continuous, meaning that it can
take on a value that varies continuously in one or more intervals of real numbers.

Example 4.2 (Cost of Manufacturing of a Solar Cell Array). Consider again the manufac-
turing process of a solar cell array, specifically the case where we produce 100 cells and count
the number of defective cells: ζ = Ndefective . Suppose now that we are interested in the cost
associated with defective cells. If the cost of each defective cell is C = 1, then we can define
a random variable that measures this cost for 100 cells as

X = Cζ = ζ.

A more complicated example would be the case where a too-high failure rate triggers an
inspection of the manufacturing process. For example, suppose that if Ndefective > 10, we
incur an inspection cost of Cinspect . Then the random variable that measures the “cost of
failure” for 100 cells is

\[
X = \begin{cases} C\zeta & \text{for } \zeta \leq 10,\\ C_{\rm inspect} + C\zeta & \text{for } \zeta > 10. \end{cases}
\]
———
4.1.4 The Probability Density Function (PDF)
For a continuous random variable, $X$, we use the probability density function (PDF), $p(x)$, to describe the probability of the random variable value, $x$, being in some range. For example, given a random variable $X$, the probability that $a \leq X \leq b$ is given by the integral of the PDF over the range,
\[
P(a \leq X \leq b) = \int_a^b p(x)\, dx.
\]

One of the simplest probability density functions is one that is constant, $p(x) = \text{const.}$, on an interval and zero outside it. In particular,
\[
p(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \leq x \leq b,\\[4pt] 0 & \text{otherwise.} \end{cases} \qquad (4.1)
\]

Note that the integral of p(x) over all possible ranges of x is 1, indicating that when consid-
ering all possible events, the probabilities must “add up” to unity.

4.1.5 Expected Value and Mean
Given the PDF, p(x), of a continuous random variable X, the expected value is defined as
\[
E(X) \equiv \int_{-\infty}^{\infty} x\, p(x)\, dx. \qquad (4.2)
\]

For a discrete random variable, the above integral is replaced with a sum over the discrete
values that x could take on, and the PDF just becomes a value for each x. The expected
value of X, which we will denote by µx , is also known as the mean value.

4.1.6 Variance and Standard Deviation
The variance of X is defined as,
\[
\sigma_x^2 \equiv \int_{-\infty}^{+\infty} (x - \mu_x)^2\, p(x)\, dx.
\]

The value σx is known as the standard deviation of X. The variance is a measure of the
variability in X about its mean value. A frequently-used relationship between the mean and
variance is,
\[
\sigma_x^2 = E(X^2) - \mu_x^2.
\]
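This identity is easy to verify by sampling. A Python sketch for a uniform random variable, whose exact variance is $(b-a)^2/12$:

```python
import numpy as np

# Monte Carlo check of sigma_x^2 = E(X^2) - mu_x^2 for X uniform on [a, b].
rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x = rng.uniform(a, b, 200_000)

direct = np.mean((x - x.mean())**2)      # definition of the variance
identity = np.mean(x**2) - x.mean()**2   # E(X^2) - mu^2
```

Both estimates agree with each other (the relation is algebraically exact) and, to sampling accuracy, with the exact value $(b-a)^2/12 = 0.75$.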

4.1.7 The Cumulative Distribution Function (CDF)
The cumulative distribution function (CDF) of a continuous random variable X is defined
as the probability that X ≤ x. Specifically,
\[
{\rm CDF}(x) \equiv P(X \leq x).
\]
From the basic assumptions of probability, we can show that ${\rm CDF}(-\infty) = 0$ and ${\rm CDF}(+\infty) = 1$. The CDF and PDF of $X$ are related as follows,
\begin{align*}
{\rm CDF}(a) &= \int_{-\infty}^{a} p(x)\, dx,\\
{\rm CDF}(b) - {\rm CDF}(a) &= \int_a^b p(x)\, dx,\\
p(x) &= \frac{d}{dx}\,{\rm CDF}(x).
\end{align*}

4.1.8 Percentiles
The $u$ percentile of $X$ is the smallest number $x_u$ such that
\[
u = P(X \leq x_u) = {\rm CDF}(x_u).
\]
Note that since $u$ is a probability, its range is $0 \leq u \leq 1$.

4.2 Monte Carlo Sampling
The Monte Carlo method consists of using repeated simulations of an experiment in order
to obtain statistics of quantities of interest. This simple method has many uses, and here
we focus on one, which is the propagation of uncertainties through a system.

4.2.1 Introduction
Consider the general setup of a numerical simulation shown in Figure 4.1.

[Figure 4.1: block diagram, inputs $\mu$ $\to$ system $\to$ output $y$.]

Figure 4.1: Schematic illustration of an input-to-output mapping for a system, such as a numerical
simulation. We consider multiple inputs, in the vector µ and a single scalar output, y.

Here, $\mu \in \mathbb{R}^p$ is a vector of $p$ inputs, and $y$ is a scalar output. Multiple outputs can be easily considered, and we choose one output only for simplicity of exposition. In the Monte
Carlo method, we will assume that we know the probability distributions of the inputs, i.e.
the entries of µ, and our goal is to determine the probability distribution of the output
y. Throughout this section we assume that the inputs and outputs are continuous random
variables.
The essence of the Monte Carlo method is to obtain an approximate probability distri-
bution of y through repeated simulations with different, independent, choices of inputs µ.
These input samples, µi , are chosen randomly, according to the probability distributions
governing the inputs. Note that here i indexes the different input samples: for every i, we
have a separate input vector µi . So if we run N simulations, i ∈ [1 . . . N ].
The result of running N simulations is N outputs, yi . We can think of this set of outputs
as an approximation to the probability density function of y, p(y). To see this, let’s take
the range of outputs, [ymin , ymax ], and divide it into M equally-spaced intervals, called bins.
For each bin, we then count the number of output samples that lie inside the bin. Dividing
these numbers (one for each bin) by N and plotting versus the bins in a bar-graph form
gives a histogram of the output distribution. Since the samples were drawn independently at
random, the histogram is an approximation to the probability density function p(y). Sample
histograms are shown in Figures 4.7 and 4.8. Note that the number of bins, M , affects the
“smoothness” of the histogram: we want to choose an $M$ that is sufficiently high to resolve $p(y)$ well, but low enough relative to $N$ so that on average there are still a large number of samples per bin.
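The binning procedure just described can be sketched as follows (`histogram` is our illustrative helper; in practice a library routine such as `numpy.histogram` would be used):

```python
import numpy as np

# Divide [ymin, ymax] into M equal bins, count the samples in each bin,
# and divide the counts by N.  (Dividing additionally by the bin width
# would turn the bar heights into a density estimate of p(y).)
def histogram(y, M):
    ymin, ymax = np.min(y), np.max(y)
    edges = np.linspace(ymin, ymax, M + 1)
    counts = np.array([np.sum((y >= lo) & (y < hi))
                       for lo, hi in zip(edges[:-1], edges[1:])])
    counts[-1] += np.sum(y == ymax)    # include the right endpoint
    return counts / len(y), edges

rng = np.random.default_rng(1)
frac, edges = histogram(rng.normal(size=10_000), M=20)
```

The bin fractions sum to 1, reflecting the fact that every sample lands in exactly one bin.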
In summary, we are using the Monte Carlo method to obtain an approximation to the
probability density function, p(y), of an output of interest y. This approximation is in the
form of samples yi , i ∈ [1 . . . N ], which we can visualize in a histogram and from which we
can compute statistics (as we will soon see). The method is approximate because we will use
a finite number of samples, N . That is, to obtain exact statistics, we would need N → ∞.
We will see in Section 4.3 how to estimate the errors incurred by using a finite N .

Example 4.3 (Variability in Aircraft Range). Consider the Breguet range equation for a
propeller aircraft,

\[
R = \frac{\eta}{c}\frac{C_L}{C_D}\ln\frac{W_0}{W_1}, \qquad (4.3)
\]
where R is the range and the parameters are

• η = propeller efficiency

• c = specific fuel consumption

• CL /CD = lift-to-drag ratio, assumed constant throughout cruise

• W0 = initial weight of aircraft and fuel

• $W_1 = W_0 - W_f$ = final weight of aircraft (no fuel); $W_f$ is the fuel weight

Furthermore, we model the drag coefficient as

\[
C_D = C_{D0} + \frac{C_L^2}{\pi e\, AR},
\]
where CD0 is the zero-lift drag coefficient, e is the span efficiency factor, and AR is the
aspect ratio. During initial stages of aircraft design, not all of these parameters may be
exactly known. But if we know the probability distributions of the input parameters (we
don’t always, but sometimes we can make reasonable guesses), we can say something about
the expected variability in the output, which in this case is the range.
Table 4.1 provides a list of all parameters with baseline values, and it notes ones that we
will treat as uncertain in the following sections.

Table 4.1: Listing of parameters for the Breguet range equation example.

Parameter   Baseline value     Type

CL          0.5                Fixed
AR          8                  Fixed
Wf          170 kg             Fixed
η           0.8                Uncertain
c           7.5 × 10⁻⁷ m⁻¹     Uncertain
CD0         0.025              Uncertain
e           0.9                Uncertain
W0          1300 kg            Uncertain

———
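As a quick numerical check of Eqn. 4.3 at the baseline values in Table 4.1, the following Python sketch (the function name is ours, not part of the notes) evaluates the range:

```python
import math

def breguet_range(eta, c, CL, CD0, e, AR, W0, Wf):
    """Breguet range for a propeller aircraft (Eqn. 4.3), with the drag polar
    CD = CD0 + CL^2/(pi e AR)."""
    CD = CD0 + CL ** 2 / (math.pi * e * AR)
    return (eta / c) * (CL / CD) * math.log(W0 / (W0 - Wf))

# baseline parameters from Table 4.1 (range comes out in meters)
R = breguet_range(eta=0.8, c=7.5e-7, CL=0.5, CD0=0.025, e=0.9, AR=8.0,
                  W0=1300.0, Wf=170.0)
# about 2073 km at the baseline
```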

4.2.2 Sampling a Distribution

In the Monte Carlo method we need to pick samples for input random variables that are
governed by certain probability distributions. Usually, we have at our disposal a (pseudo-)
random number generator, often one which returns uniform random numbers in a predefined
range, such as [0, 1]. Let’s see how we can use such a function to sample general distributions.
A very simple distribution that we can sample directly is that of a uniformly-distributed
random variable. The PDF of such a variable is given in Eqn. 4.1 and Figure 4.2 illustrates
it schematically.


Figure 4.2: Probability density function for a uniform random variable X that takes on a constant
value for x between a and b.

Given a uniform random variable U between 0 and 1 (from our random number genera-
tor), we can obtain a sample of our random variable X by the following equation,

X = a + U (b − a).
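In code, with `random.random()` playing the role of U (a Python sketch; the notes' listings use Matlab's `rand`):

```python
import random

def sample_uniform(a, b):
    """Map a uniform U on [0, 1) to a sample of X uniform on [a, b)."""
    return a + random.random() * (b - a)

random.seed(0)
xs = [sample_uniform(0.85, 0.95) for _ in range(20000)]
xbar = sum(xs) / len(xs)  # should be near (a + b)/2 = 0.9
```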

Example 4.4 (Sampling a uniform distribution). Continuing with Example 4.3, we consider
the case where all uncertain parameters are at their baseline values, but the span efficiency
factor e is distributed uniformly on the range e ∈ [0.85, 0.95]. We are interested in the effect
of this variation on the range, R.
Listing 4.1 presents a code that implements the Monte Carlo method by sampling e and
computing the range for each sample.

Listing 4.1: Propagates uniform variation in span efficiency to the range.

function [ve, vR] = MCuniform(N);
% function [ve, vR] = MCuniform(N);
% PURPOSE:
%   Propagates a uniformly-varying input parameter to the output
% INPUTS:
%   N : number of samples in the Monte Carlo method
% OUTPUTS:
%   ve : vector of input samples
%   vR : vector of output samples (range, in m)

% constant parameters
CL    = 0.5;
AR    = 8;
Wf    = 170;
eta   = 0.8;
c     = 7.5e-7;  % [m^{-1}]
CD0   = 0.025;
ebase = 0.9;     % baseline span efficiency (not used in the sampling below)
W0    = 1300;

% input samples: e uniform on [0.85, 0.95]
ve = 0.85 + 0.1*rand(N,1);

% drag coefficient
CD = CD0 + CL*CL./(pi*ve*AR);

% range
vR = eta/c*CL./CD*log(W0/(W0-Wf));

Figure 4.3 illustrates both the input and the output histograms for N = 10⁴ samples. Note
the near-uniform distribution of the input samples, and a slight variation in the range his-
togram.

Figure 4.3: Input and output samples for N = 10⁴ samples of the Monte Carlo method in
Example 4.4.

———

Non-uniform random variables are somewhat harder to sample. For example, consider
a random variable X governed by a linear probability density function, as illustrated in
Figure 4.4. When sampling X we expect to obtain x values that are biased towards 1, since
those values are more probable.

Figure 4.4: Probability density function for a linear random variable X defined over 0 ≤ x ≤ 1.

So we cannot just use a uniform random number generator
between 0 and 1 to sample X. For more complicated probability distributions, this task
becomes even more difficult. Let’s now look at a general approach for sampling arbitrary
probability distributions.
We consider the inversion approach for sampling a general probability distribution via
a simple (uniform) random number generator. The goal is to map realizations of a uniform
random variable U , which takes on values 0 ≤ u ≤ 1, into samples of a random variable
X governed by the general probability density function p(x). To do this, we use as our
mapping the inverse of the cumulative distribution function, CDF(x). Recall that CDF(x)
maps values of x to the real numbers between 0 and 1. The inverse of this mapping then takes
values between 0 and 1 and converts them to values of x. The resulting samples conform to
the PDF of X, since the probability of drawing a sample between x and x + dx is p(x)dx,
as illustrated in Figure 4.5.
In summary, the inversion method for sampling X consists of two steps,

1. Pick a random number, 0 ≤ u ≤ 1, from a random number generator.

2. Convert u to x via the inverse CDF function, x = CDF⁻¹(u).
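When the CDF cannot be inverted in closed form, step 2 can be carried out numerically, for example by bisection on the (monotone) CDF. A Python sketch under that assumption, using the linear PDF of Figure 4.4, whose CDF is x², as the test case:

```python
def sample_inverse_cdf(cdf, lo, hi, u, tol=1e-12):
    """Step 2 of the inversion method: solve CDF(x) = u by bisection,
    assuming cdf is monotone increasing on [lo, hi]."""
    a, b = lo, hi
    while b - a > tol:
        m = 0.5 * (a + b)
        if cdf(m) < u:
            a = m
        else:
            b = m
    return 0.5 * (a + b)

# example: the linear PDF of Figure 4.4 has CDF(x) = x^2 on [0, 1]
x = sample_inverse_cdf(lambda t: t * t, 0.0, 1.0, 0.3)
# x*x should recover u = 0.3
```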

Example 4.5 (Sampling a linear distribution). Let’s apply the inverse CDF method to
sample a random variable X governed by the linear probability density function given in
Figure 4.4. The PDF, p(x), is nonzero only on the interval 0 ≤ x ≤ 1, and there p(x) = 2x.
Since the cumulative distribution function is the integral of p(x), we have

                          ⎧ 0    for x < 0
CDF(x) = ∫₀ˣ p(x′) dx′ =  ⎨ x²   for 0 ≤ x ≤ 1
                          ⎩ 1    for x > 1


Figure 4.5: Illustration of the inverse CDF approach for sampling a random variable X governed
by a general probability density function p(x). Note that p(x) = d(CDF)/dx.

The inverse CDF function, x = CDF⁻¹(u), maps each u, 0 ≤ u ≤ 1, to x. Inverting the
above equation and using 0 ≤ x ≤ 1 (since that’s the only interval over which x is defined
with nonzero probability), we have

x = CDF⁻¹(u) = √u.
Using this inverse CDF, we can now sample X with a simple random number generator for
u. Listing 4.2 presents a code that does this, and Figure 4.6 shows histograms of both u
(should be uniform) and x (should be linear) for N = 10⁴ samples.

Listing 4.2: Code for sampling a linear PDF via the inverse CDF method.

function [u,x] = MClinear(N);
% function [u,x] = MClinear(N);
% PURPOSE:
%   Samples a linear PDF by the inverse CDF method
% INPUTS:
%   N : number of samples
% OUTPUTS:
%   u : samples of a uniform random variable
%   x : samples of the linear random variable of interest

% uniform samples
u = rand(N,1);

% inverse CDF
x = sqrt(u);

———
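A quick sanity check on the x = √u sampler: the linear PDF p(x) = 2x has mean ∫₀¹ x · 2x dx = 2/3, and the sample average should approach this value. A Python sketch:

```python
import math
import random

random.seed(1)
N = 50000
xs = [math.sqrt(random.random()) for _ in range(N)]  # inverse CDF sampling
xbar = sum(xs) / N  # should approach the exact mean, 2/3
```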

4.2.3 Multiple Inputs


When using the Monte Carlo method to sample a system with multiple inputs, stored in the
vector µ ∈ Rp , we must create many independent sample vectors, µi , i ∈ [1 . . . N ]. For each


Figure 4.6: Samples of u and x for a test run of the inverse CDF method in Example 4.5.

vector, we will typically need to generate several random variables – e.g. p if all inputs are
independent. If the inputs are not independent, we can use joint probability distribution
functions to sample them, but we will not consider this here.

Example 4.6 (Variability in Aircraft Range with Multiple Uncertainty Sources). Consider
again the Breguet range equation presented in Example 4.3. We now apply the Monte Carlo
method to predict the variation in range under three sources of uncertainty, listed in
Table 4.2. The other input parameters are taken at baseline values.

Table 4.2: Parameters and distributions for the multiple-uncertainty study in Example 4.6.

Parameter   Probability distribution

CD0         Linear, with p(0.021) = 0, rising linearly to a peak at 0.027
e           Uniform between 0.85 and 0.95
W0          Normal with mean 1300 kg and standard deviation of 50 kg

The new distribution here is the “normal” one, for which the PDF is given by

normal distribution: p(x) = 1/(σ√(2π)) e^{−(x−µ)²/(2σ²)}.

In this example we take advantage of Matlab’s built-in function randn for sampling such a
random variable.
Listing 4.3 presents a code that implements the Monte-Carlo method for these three
sources of uncertainty, and Figure 4.7 shows histograms of the input samples and the final
output, the range.

Listing 4.3: Code for sampling multiple uncertainty sources in the Breguet range equation.

function [CD0, e, W0, R] = BreguetMC(N)
% function [CD0, e, W0, R] = BreguetMC(N)
% PURPOSE:
%   Propagates multiple input uncertainties to the range output
% INPUTS:
%   N : number of samples in the Monte Carlo method
% OUTPUTS:
%   CD0 : vector of CD0 input samples
%   e   : vector of e input samples
%   W0  : vector of W0 input samples
%   R   : vector of output samples

% fixed parameters
CL  = 0.5;
AR  = 8;
Wf  = 170;
eta = 0.8;
c   = 7.5e-7;  % [m^{-1}]

% Now we generate N independent realizations for each uncertain input

% parasitic drag coefficient
u   = rand(N,1);             % u is uniform between 0 and 1
x   = sqrt(u);               % x is linear between 0 and 1 (peak at 1)
CD0 = .021 + x*(.027-.021);  % linear between desired values

% span efficiency factor
e = 0.85 + 0.1*rand(N,1);

% weight
W0 = 1300 + 50*randn(N,1);

% drag coefficient
CD = CD0 + CL*CL./(pi*e*AR);

% range
R = eta/c*CL./CD.*log(W0./(W0-Wf));

———
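A compact Python port of the same study (the constants mirror Listing 4.3; the function name and seed are our own choices) makes the propagation easy to check:

```python
import math
import random

def breguet_mc(N, seed=0):
    """Monte Carlo on the Breguet range with the three uncertain inputs of Table 4.2."""
    rng = random.Random(seed)
    CL, AR, Wf, eta, c = 0.5, 8.0, 170.0, 0.8, 7.5e-7
    R = []
    for _ in range(N):
        CD0 = 0.021 + math.sqrt(rng.random()) * (0.027 - 0.021)  # linear PDF
        e = 0.85 + 0.1 * rng.random()                            # uniform PDF
        W0 = 1300.0 + 50.0 * rng.gauss(0.0, 1.0)                 # normal PDF
        CD = CD0 + CL * CL / (math.pi * e * AR)
        R.append(eta / c * CL / CD * math.log(W0 / (W0 - Wf)))
    return R

R = breguet_mc(10000)
Rbar = sum(R) / len(R)  # mean range in m, on the order of 2070 km
```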

4.3 Error Estimates for Monte Carlo


As we already mentioned, Monte Carlo sampling is a means of approximating statistics using
a finite sample size, N . As we increase N , we expect our answers to improve, but practical
considerations, such as an expensive numerical calculation for each sample, generally
limit the number of samples we can take. In this section we look at how we can estimate
the accuracy of statistical quantities computed using the Monte Carlo method for finite N .

4.3.1 Estimator of the Mean


Consider the output y of a system of interest. Given uncertain inputs, y is governed by a con-
tinuous random variable, Y , with probability density function p(y). Recall from Section 4.1.5


Figure 4.7: Histograms of the input samples and the range output for the multiple uncertainty
source study in Example 4.6.

that the mean, or expected value, of random variable Y is given by
µy = E(Y ) = ∫_{−∞}^{∞} y p(y) dy.

Now, when we run a Monte Carlo simulation, we have N output samples, yi . Since we assume
independent samples, a reasonable estimator of the output mean is a simple arithmetic
average of the samples,
ȳ ≡ (1/N) Σ_{i=1}^{N} yi .    (4.4)

This estimator is unbiased, meaning that its expected error is zero. To see this, we take the
expected value of the difference between the estimator and the actual mean value µy ,

E(ȳ − µy ) = E(ȳ) − E(µy )
           = E( (1/N) Σ_{i=1}^{N} yi ) − µy
           = (1/N) Σ_{i=1}^{N} E(yi ) − µy .

Since in the Monte Carlo method each yi comes from an independent draw from the PDF of
y, we have E(yi ) = µy . So the above equation tells us that
E(ȳ − µy ) = (1/N) Σ_{i=1}^{N} µy − µy = (1/N) N µy − µy = 0.
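The unbiasedness result is easy to verify numerically: averaging ȳ over many independent realizations should recover µy. A Python sketch with a known distribution (uniform on [0, 1), so µy = 1/2):

```python
import random

random.seed(2)
K, N = 2000, 50  # K realizations of N samples each
ybars = []
for _ in range(K):
    samples = [random.random() for _ in range(N)]
    ybars.append(sum(samples) / N)  # mean estimator, Eqn. 4.4
grand_mean = sum(ybars) / K  # should approach mu_y = 0.5
```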

4.3.2 Standard Error of an Estimator


In the Monte Carlo method, a computed value of ȳ depends not only on the number of
samples N , but also on the random number generator. Recall that we are randomly selecting
inputs, so in general when we run Monte Carlo multiple times, we will obtain different sets
of outputs {yi } – let’s call these sets realizations. For example, it’s possible that in one
realization all yi are the same because the random number generator always returned the
same values; this is possible, although in general not probable, for a good random number
generator. It turns out that the output dependence on N and on the random number
generator are related, and we can reduce the dependence by increasing N .
Example 4.7 (Realization Variability for the Breguet Range Equation). Consider again the
Breguet range equation uncertainty study presented in Example 4.6. In that example we
presented a set of results showing the variation in the range (output y) given three uncertain
inputs, for a single realization (which was long, N = 10⁴ samples). Let’s now look at
generating shorter realizations, but multiple times instead of just once. Table 4.3 presents

Table 4.3: Range mean estimator calculations for different Monte Carlo realizations in Exam-
ple 4.6. Note, N = 100 for all realizations.

Realization Range mean estimator, ȳ [km]


1 2083
2 2088
3 2071
4 2075
5 2089
··· ···

results for the range mean estimator, ȳ, for several realizations with N = 100 samples each.

If we think about creating many realizations, in our case 1000 of them, we can ask how
the estimator values will be distributed. Figure 4.8 presents histograms of these distributions
for N = 100 and N = 500. As shown in the figure, increasing N decreases the variation


Figure 4.8: Histogram of the mean range estimator across 1000 realizations of Monte Carlo with
N = 100 (left) and N = 500 (right), as described in Example 4.7.

of the mean estimator, since the spread in the histogram for N = 500 is smaller than that
for N = 100. Also, the shape of the distribution resembles a “bell curve”, i.e. a normal
distribution, especially for N = 500.

———

Note that the histograms in Figure 4.8 in the previous example do not show the PDF
of y, p(y). Instead, the present histograms show different information: they illustrate the

spread in the output mean value estimator across different realizations of a Monte-Carlo
experiment. In practice, we typically cannot afford to run multiple Monte-Carlo experiments,
but we are often still interested in how much (statistical) error we make in one realization.
We have shown that the mean estimator in Eqn. 4.4 is unbiased, which means that over
many Monte Carlo realizations the error will average out to zero. To get a sense of how
much error we expect to see in a single Monte Carlo realization, we look at the variance of
ȳ − µy across realizations,

E[(ȳ − µy )²] = E[ ( (1/N) Σ_{i=1}^{N} yi − µy )² ]
              = E[ ( (1/N) Σ_{i=1}^{N} (yi − µy ) )² ]
              = (1/N²) E[ ( Σ_{i=1}^{N} (yi − µy ) ) ( Σ_{j=1}^{N} (yj − µy ) ) ]
              = (1/N²) E[ Σ_{i=1}^{N} (yi − µy )² ] + (1/N²) E[ Σ_{i=1}^{N} Σ_{j≠i} (yi − µy )(yj − µy ) ]
              = (1/N²) Σ_{i=1}^{N} E[(yi − µy )²] + (1/N²) Σ_{i=1}^{N} Σ_{j≠i} E[(yi − µy )(yj − µy )]

Since the samples yi are independent draws, their covariance is zero, which means that the
second term in the last equation above drops out. Now, in the first term, the summand is
just the variance of y, i.e.

E[(yi − µy )²] ≡ σy² ,

where σy is the standard deviation of the output distribution for y – this quantity is in-
dependent of N . So, returning to the realization-to-realization variance in our mean value
estimator, we have
σȳ² ≡ E[(ȳ − µy )²] = (1/N²) N σy² = σy²/N.    (4.5)

We call σȳ = σy/√N the standard error of the mean estimator, and this quantity depends
inversely on √N . So the larger the sample size, the smaller the standard error, but the
reduction is not very quick. For example, an order of magnitude reduction in the standard
error requires 100 times as many samples, N .
For a large sample size N , the central limit theorem tells us that the distribution of the
mean estimator, ȳ, over Monte Carlo realizations will be approximately normal with mean µy
and standard deviation σy/√N . This explains the bell-shaped histograms in Example 4.7,
and the reduction in the spread with increasing N .
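The 1/√N scaling of the standard error can also be observed empirically: quadrupling N should roughly halve the spread of ȳ across realizations. A Python sketch (uniform samples, so σy = 1/√12 ≈ 0.289; the function name and seeds are ours):

```python
import math
import random

def std_of_mean_estimator(N, K=3000, seed=3):
    """Empirical standard deviation of ybar over K realizations of N samples each."""
    rng = random.Random(seed)
    ybars = [sum(rng.random() for _ in range(N)) / N for _ in range(K)]
    m = sum(ybars) / K
    return math.sqrt(sum((y - m) ** 2 for y in ybars) / (K - 1))

s100 = std_of_mean_estimator(100)
s400 = std_of_mean_estimator(400)
ratio = s100 / s400  # should be near sqrt(400/100) = 2
```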

Now, a caveat is that in practical simulations, we do not know σy , which is the standard
deviation of the actual output y. But, after running Monte Carlo, we do have a realization,
the set {yi }, and we can estimate σy as the standard deviation of the samples,
σy² ≈ sy² ≡ (1/(N − 1)) Σ_{i=1}^{N} (yi − ȳ)² .

As defined above, s2y is an unbiased estimate of the actual variance in y, σy2 . Directly using
s2y in place of σy2 in the standard error estimate (Eqn. 4.5) ignores the fact that s2y is itself
only approximate, and for small N the resulting additional errors may pollute the standard
error estimate.

Example 4.8 (Standard Error Estimates for the Breguet Range Equation). In Example 4.7
we presented statistics for the range estimator, ȳ = R̄, for N = 100 and for N = 500.
Let’s see if the above standard error estimates are accurate for these data. Running a long
realization, N = 1000, we find that the standard deviation of the range is

σy ≈ sy = 124 km.

So for N = 100, we would expect a standard error of σȳ = σy/√100 = 12.4 km, and for
N = 500 we would expect σȳ = σy/√500 = 5.54 km. Using the actual data in Figure 4.8, we
have that for N = 100, the standard deviation of ȳ over the computed 1000 realizations is
12.06 km and for N = 500 it is 5.49 km. So the standard error estimates are quite close to
the actual standard deviations over realizations.

———

In practice, a typical use of the standard error estimates is to determine an adequate


sample size for a Monte Carlo simulation, given some desired tolerance on the statistics.
This determination is done on the fly during the Monte Carlo sampling, since it is based
on the sample data yi . Finally, we note that while we have only discussed measuring error
for the mean estimator, estimators exist for other statistical quantities of interest, such as
variances, probabilities of failure, etc.

Chapter 5

Finite Volume Methods

5.1 Conservation Laws


A conservation law takes the following general differential form,

∂u/∂t + ∇ · F⃗ = 0,    (5.1)

where

• u ∈ Rs is the conservative state vector with s components.

• F⃗ = F⃗(u) ∈ [Rs ]d is a flux that tells us the direction and rate at which the conserved
quantities pass through space. Note that d is the number of spatial dimensions, and
the arrow in F⃗ denotes a spatial vector. The units of each flux are conserved quantity
per unit area per unit time.

An example of a system of conservation laws with s = 4 and d = 2 is given by the
compressible Euler equations in two dimensions, as presented in Section 1.2.3.3.
An example of a scalar conservation law is a convection (also known as advection) equa-
tion. This equation models the transport of a conserved scalar quantity, u, in a prescribed
velocity field V⃗ . That is, the flux is F⃗ = V⃗ u, so that the scalar conservation law reads

∂u/∂t + ∇ · (V⃗ u) = 0.    (5.2)

The differential form in Eqn. 5.1 often derives from an integral form, which we can recover
by integrating Eqn. 5.1 over an area A. Note that we restrict our attention to two spatial

dimensions. Referring to the definitions in Figure 5.1, the integral form of Eqn. 5.1 is

∫_A ( ∂u/∂t + ∇ · F⃗ ) dA = 0

∫_A ∂u/∂t dA + ∫_A ∇ · F⃗ dA = 0      (apply the divergence theorem to the second term)

∫_A ∂u/∂t dA + ∮_{∂A} F⃗ · n⃗ dl = 0    (5.3)

Note that in Eqn. 5.3 the normal vector ~n points outward from the enclosed area.


Figure 5.1: Definitions for application of the divergence (Gauss) theorem in two dimensions.

5.2 Cells and Meshes


To solve conservation laws with the finite volume method, we need to use a computational
mesh. A computational mesh, or grid, is a partitioning of the domain into non-overlapping
volumes that are simple to work with – these volumes are called cells, or elements. Sample
meshes around an airfoil geometry are shown in Figure 5.2.
In the finite volume method the cells in the meshes can be of almost arbitrary shape.
Meshes composed of triangles (2D) and tetrahedra (3D), i.e. simplex cells, are usually the
easiest to generate. Quadrilaterals (2D) and hexahedra (3D) are also used, although such
meshes can be harder to generate for complex geometries. Sometimes meshes contain cells
of different shapes, e.g. triangles mixed in with quadrilaterals, and these are called hybrid
meshes. An example of a hybrid mesh is shown in Figure 5.3.
An important distinction between mesh types is structured versus unstructured. In a
structured mesh, the connectivity between cells is implied and not explicitly stored. An
example of a structured mesh is a lattice in a Cartesian coordinate system, e.g. the set
of points (i, j) where i and j are integers. The neighbors of each point can be found by
just incrementing/decrementing the indices i, j. The mesh in Figure 5.2(a) is an example
of a structured mesh. On the other hand, finding the neighbors of a cell in the mesh in

(a) Structured mesh (b) Unstructured mesh

Figure 5.2: Sample structured and unstructured meshes around a NACA 0012 airfoil.

Figure 5.3: An example of a hybrid mesh containing both triangles and quadrilaterals.

Figure 5.2(b) is not as easy. Instead, for such meshes the connectivity between cells is
typically stored, and these meshes are called unstructured. Most often, although not always,
unstructured meshes will consist of triangles or tetrahedra.
The number of cells in a mesh and local variations in size, orientation, and stretching
of cells all affect the quality of the final finite-volume solution. The mesh must also resolve
the geometry to a sufficient accuracy. Mesh generation is therefore an important and often
difficult task, one that may have to be performed adaptively to improve solution accuracy.

5.3 A First-Order Discretization


We will now present a method for numerically obtaining an approximate solution to our
conservation law. The starting point will be the integral form in Eqn. 5.3, which we will
apply repeatedly to cells of a computational mesh. Doing this yields the finite-volume (area
in 2D, line in 1D) method. A key concept in the finite volume method is the idea of a cell
average of the state, which we define (in 2D) as
cell average on cell i = ui ≡ (1/Ai ) ∫_{Ai} u dA,

where Ai refers to the area of cell i. Eqn. 5.3 in terms of cell averages becomes
Ai dui /dt + ∮_{∂Ai} F⃗ · n⃗ dl = 0    (5.4)

The cell averages will be the unknowns in our discretization. Let’s first see how this looks
in one spatial dimension.

5.3.1 One Dimension


In one spatial dimension, our domain will be a portion of the x axis, say x ∈ [0, L]. We divide
this domain into N cells, i.e. intervals, and apply the integral conservation law on each cell.
The cells need not be of equal size. Figure 5.4 illustrates this division schematically with
three cells, for a scalar conservation law. We’ll restrict our attention to scalars for the time
being in order to simplify the presentation.
The quantities that appear in Figure 5.4 are

xi±1/2 : coordinates of left/right endpoints of cell i


∆xi : size of ith cell (here an interval)
ui : average of u in cell i – this is what we will solve for
N : number of cells = number of unknowns
U : N × 1 vector of unknowns over the cells, {ui }


Figure 5.4: Definitions used in a first-order finite volume method for a one-dimensional scalar
problem.

We obtain N equations by enforcing the integral conservation law, Eqn. 5.4, to hold
inside each cell. The integral around the boundary of a cell in 1D becomes just the difference
between cell endpoint values, so that Eqn. 5.4 on cell i reads

∆xi dui /dt + Fi+1/2 − Fi−1/2 = 0,    (5.5)
              \_______________/
                     Ri

where the subscript on F denotes where in the mesh F is evaluated; for example, i − 1/2
refers to the left endpoint of cell i, which is the same as the right endpoint of cell i − 1. In
the above expression we also define a flux residual for cell i, Ri . For a steady-state problem
we expect all flux residuals to be zero when we have the correct solution. In an unsteady
problem, the flux residuals act like a source for the time rate of change of each cell average.
So far, we have not made any approximations: Eqn. 5.5 expresses a physical conservation
statement on cell i. However, we cannot implement the method in Eqn. 5.5 without knowing
how to evaluate the fluxes Fi±1/2 . Here is where the approximation comes in, as when we
solve the problem numerically, we do not know the value of u on the interfaces – we only
store cell averages on each element. So we define numerical fluxes, F̂i±1/2 , that express the
interface fluxes in terms of the unknowns uj .
In order to compute numerical fluxes at interfaces, we must first say something about the
state at the interfaces. This is where we make an approximation, and a rather crude one:
we assume that the state in each cell is constant and equal to the cell average. Clearly this
is not right for any real problem where the solution varies continuously across the domain.
However, as the sizes of the cells shrink to zero, we expect the solution accuracy to improve.
The rate of improvement will be first order, the bare minimum for consistency, since we will
never have any slope in the cells.
With the piecewise constant approximation, we will obtain a solution representation like
that shown in Figure 5.4. So how do we compute fluxes? Let’s focus on the scalar advection
case in which the analytical flux is

F = V u,

where V is the advection speed. But at interfaces, the state u is discontinuous in our
approximation. So which u do we use to define the flux? The right, left, or some average?
An initial inclination might be to use the average of the state on either side of the interface
to calculate the flux. While fair to both states, this is a bad choice because advection is
a directional phenomenon. Instead, a much better choice is to use just the left or just the
right state, depending on the sign of the velocity, V . This is what’s called an upwind flux.
We motivate the idea of an upwind flux by considering how a piecewise constant state
distribution at t = 0 evolves in time. As shown in Figure 5.5, for V > 0, the entire state
distribution convects to the right as time progresses. Looking at the interfaces, we see that


Figure 5.5: Behavior of a piecewise constant solution under pure advection, for a constant
advection velocity (shown positive): the entire profile convects to the right by V t for any time
t > 0, so that the flux at each interface is determined by the state to the left of that interface.

the states at the interfaces take on values from the left cells for t > 0. So the numerical
flux in the case of V > 0 should use the left states. Conversely, if V < 0, the numerical flux
should use the right states. We can then express the numerical flux at xi+1/2 as

           ⎧ V ui      for V ≥ 0
F̂i+1/2 =  ⎨
           ⎩ V ui+1    for V < 0

We write this more compactly, but completely equivalently, as


1 1
F̂i+1/2 = (V ui + V ui+1 ) − |V | (ui+1 − ui ) . (5.6)
2 2
In this form the numerical flux is the average of the left and right fluxes, corrected by a term
that introduces upwinding. Note, the term “upwind” refers to the observation that, at any
point in space, we need to look in the direction from which the “wind” (velocity) is coming
to determine the state at a future time.
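The upwind flux of Eqn. 5.6 is a one-liner in any language; a Python sketch (the notes' own code is in Matlab) with a check that it selects the upwind state:

```python
def upwind_flux(V, uL, uR):
    """Upwind numerical flux for scalar advection (Eqn. 5.6): the average of the
    left and right fluxes, corrected by a |V|-weighted upwinding term."""
    return 0.5 * (V * uL + V * uR) - 0.5 * abs(V) * (uR - uL)

fpos = upwind_flux(+1.0, 2.0, 5.0)  # V > 0: uses the left state, V*uL = 2.0
fneg = upwind_flux(-1.0, 2.0, 5.0)  # V < 0: uses the right state, V*uR = -5.0
```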
The final step in obtaining a discrete formulation is to discretize Eqn. 5.5 in time. Here
we can use any of the methods presented in Chapter 2. A simple choice is forward Euler, for
which Eqn. 5.5 reads

∆xi (ui^{n+1} − ui^n )/∆t + F̂i+1/2^n − F̂i−1/2^n = 0,    (5.7)

where the superscript n denotes the time index. This is the equation that we can solve for
ui^{n+1} : the state at the next time index in each cell.
An important consideration is how to set the time step, ∆t. An eigenvalue stability
analysis of the spatially-discretized system, as presented in Section 2.4.2, can give us the
answer specific to the time integration method, albeit after a lot of work. Instead, a stan-
dard practice for solving advection problems with explicit time integration methods, such as
forward Euler, is to set the time step based on the CFL (Courant-Friedrichs-Lewy) number.
This nondimensional measure of the time step is defined as

CFL = V ∆t/∆x,

where V is the advection speed. When cell sizes differ, we can define a CFL number for each
cell, or we can just look at the CFL number of the smallest cell, as this will be the most
restrictive. We set the time step ∆t by choosing a constant CFL. For forward Euler (and
many other methods) the stability requirement reduces to ensuring

CFL < 1

This is known as the CFL condition and it serves as a speed limit for how “fast” we can
march in time: i.e. it dictates how large a time step we can choose. The CFL condition
has a physical motivation: the time step of an iteration should not be larger than the time
that it takes for the flow to traverse the smallest cell. That is, if the time step is too large,
we cannot use a standard explicit update, since the flow would convect past the nearest
neighbor.
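Setting the time step from a target CFL number is then a single formula; a minimal Python sketch (the function name and values are illustrative):

```python
def cfl_time_step(CFL, dx_min, V):
    """Time step from a chosen CFL number, based on the smallest cell
    (the most restrictive one when cell sizes differ)."""
    return CFL * dx_min / abs(V)

dt = cfl_time_step(0.5, 0.01, 2.0)  # 0.5 * 0.01 / 2.0 = 0.0025
```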
Finally, boundary conditions are imposed through fluxes on the domain boundaries. In
some cases it is convenient to define “ghost cells”, with known states, just outside the domain
to simplify the calculation of these boundary fluxes.

Example 5.1 (A 1D finite volume code for advection with periodic boundary conditions).
Listing 5.1 presents a code that implements a one-dimensional finite volume method for an
advection problem with a constant velocity. The boundary conditions are periodic, which
means that whatever leaves the right end of the domain re-enters through the left end.
Essentially, this means we do not need to handle any boundary conditions in this problem.

Listing 5.1: Implementation of a finite-volume method in one spatial dimension on uniformly-
spaced cells with a constant advection speed.

function [x,u,err] = Conv1D(N);
% function [x,u,err] = Conv1D(N);
% PURPOSE:
%   This function solves a 1D convection problem with periodic
%   boundary conditions using a finite volume method.
% INPUTS:
%   N : number of spatial cells
% OUTPUTS:
%   x   : N+1 endpoint coordinates of cells
%   u   : N cell averages at the final time
%   err : L2 norm of error at final time step

% Space and Time Domain
L = 1;          % domain extent
V = 1;          % convection velocity
T = L/abs(V);   % ensures one period

% initialize arrays
x  = linspace(0,L,N+1);            % N+1 cell endpoints
xm = 0.5*(x(1:end-1) + x(2:end));  % N cell midpoints
u0 = exp(-25*(xm-L/2).^2/L^2);     % initial state
dx = L/N;                          % cell size

% CFL number and time step
CFL = 0.5;
dt  = CFL*dx/abs(V);
Nt  = ceil(T/dt);  % number of time steps
dt  = T/Nt;        % ensures we always get to T

% indices of left and right cells (periodic)
IL = [N, 1:(N-1)];
IR = [2:N, 1];

% loop over time steps and apply forward Euler
u = u0;
for n=1:Nt,
  % numerical flux on left of each cell
  FhatL = 0.5*(V*u(IL)+V*u) - 0.5*abs(V)*(u-u(IL));
  % numerical flux on right of each cell
  FhatR = 0.5*(V*u+V*u(IR)) - 0.5*abs(V)*(u(IR)-u);
  % update state
  u = u - dt/dx*(FhatR-FhatL);
end

% compute error
err = norm(u - u0)*dx;  % dx is here because we are integrating

We run the code with an initial condition that is a Gaussian bump, and we convect this
profile for one period: i.e. the final time is T = L/V , where L is the length of the domain
and V is the advection speed. The exact solution is an unchanged profile (i.e. the initial
condition), and so we can compute a numerical error relative to this profile. Figure 5.6
shows solutions at various resolutions and convergence of the final profile with increasing
resolution. Note the observed first-order convergence rate.

———
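A pure-Python port of Listing 5.1 (stdlib only; the Matlab listing above is the reference implementation) reproduces the key behavior: the final-time error decreases as the mesh is refined.

```python
import math

def conv1d(N, CFL=0.5, L=1.0, V=1.0):
    """First-order finite volume (upwind flux) + forward Euler for 1D advection
    with periodic boundary conditions, run for one full period."""
    dx = L / N
    xm = [(i + 0.5) * dx for i in range(N)]  # cell midpoints
    u0 = [math.exp(-25.0 * (x - L / 2) ** 2 / L ** 2) for x in xm]
    u = u0[:]
    T = L / abs(V)              # one period
    dt = CFL * dx / abs(V)
    Nt = math.ceil(T / dt)
    dt = T / Nt                 # land exactly on T
    for _ in range(Nt):
        # upwind flux at the left face of each cell (periodic via u[i-1])
        fhat = [0.5 * (V * u[i - 1] + V * u[i])
                - 0.5 * abs(V) * (u[i] - u[i - 1]) for i in range(N)]
        # forward Euler update: right face of cell i is left face of cell i+1
        u = [u[i] - dt / dx * (fhat[(i + 1) % N] - fhat[i]) for i in range(N)]
    # same error metric as the Matlab listing
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, u0))) * dx

e32, e64 = conv1d(32), conv1d(64)  # error shrinks under refinement
```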

(a) Typical final-time solutions    (b) Solution convergence

Figure 5.6: Finite-volume solutions to a one-dimensional advection problem and convergence of
the final-time solution error, measured using an L2 norm of cell averages. The observed slope
is approximately 1.06.

5.3.2 Two Dimensions

Let’s now look at a finite volume method for a scalar problem in two dimensions. The PDE
is given by Eqn. 5.2. We still use Eqn. 5.4 for our discretization, but our cells can now be of
various shapes. Since the shape of the cells does not change the key concepts, we will focus
our attention on triangles.
Figure 5.7 shows an example of a triangular cell with its nearest neighbors. Eqn. 5.4

Figure 5.7: A triangular cell, indexed by i, in two dimensions. The edges are straight and the
normal is outward-pointing and constant on each edge.
applied to this cell, for a scalar problem, yields
$$A_i \frac{du_i}{dt} + \underbrace{\oint_{\partial A_i} \vec{F} \cdot \vec{n}\, dl}_{R_i} = 0, \qquad (5.8)$$

where Ri is the flux residual. Assuming as we did in the one-dimensional case that the
solution is constant in each cell, we can express the flux residual in cell i in terms of the cell
averages that we are storing. That is, we write,
$$R_i = \sum_{e=1}^{3} \hat{F}(u_i, u_{N(i,e)}, \vec{n}_{i,e})\, \Delta l_{i,e}, \qquad (5.9)$$

where e is an index over the three edges, N (i, e) is the element adjacent to cell i across edge
e, ~ni,e is the outward-pointing normal vector for edge e, and ∆li,e is the length of edge e in
cell i. Figure 5.8 illustrates these definitions.

Figure 5.8: Definitions of edges and normals for a triangle. [The edge e of cell i runs from
(xA, yA) to (xB, yB), has length ∆li,e, and carries the outward normal ~ni,e pointing into the
neighboring cell N(i, e).]

For an edge between two points $(x_A, y_A)$ and $(x_B, y_B)$, the length and normal are
$$\Delta l_{i,e} = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2},$$
$$\vec{n}_{i,e} = \left[(y_B - y_A)\,\hat{i} + (x_A - x_B)\,\hat{j}\right] / \Delta l_{i,e}.$$
Note that the normal points to the right as we traverse from A to B.
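These formulas translate directly into code. The following Python helper (the name and interface are ours, for illustration; the course listings are in MATLAB) computes the length and right-pointing unit normal of an edge:

```python
import math

def edge_geometry(xA, yA, xB, yB):
    # edge length: sqrt((xA - xB)^2 + (yA - yB)^2)
    dl = math.sqrt((xA - xB)**2 + (yA - yB)**2)
    # unit normal points to the right as we traverse from A to B
    nx = (yB - yA) / dl
    ny = (xA - xB) / dl
    return dl, (nx, ny)
```

For example, the edge from (0, 0) to (0, 1) has length 1 and normal (1, 0), i.e. to the right of the direction of traversal.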


F̂ (ui , uN (i,e) , ~ni,e ) is the flux function that tells us how much of u is flowing from cell i
to the neighbor cell, N (i, e), through edge e. As in the one-dimensional case, we will use
an upwind flux, but this time the upwinding decision has to be made based on the normal
component of the velocity field, V~ · ~ni,e . That is, we write,

$$\hat{F}(u_i, u_{N(i,e)}, \vec{n}_{i,e}) = \begin{cases} \vec{V} \cdot \vec{n}_{i,e}\, u_i & \text{for } \vec{V} \cdot \vec{n}_{i,e} \geq 0 \\ \vec{V} \cdot \vec{n}_{i,e}\, u_{N(i,e)} & \text{for } \vec{V} \cdot \vec{n}_{i,e} < 0 \end{cases}$$

We can again write this as one equation by using absolute values,

$$\hat{F}(u_i, u_{N(i,e)}, \vec{n}_{i,e}) = \frac{1}{2} \vec{V} \cdot \vec{n}_{i,e} \left(u_i + u_{N(i,e)}\right) - \frac{1}{2} \left|\vec{V} \cdot \vec{n}_{i,e}\right| \left(u_{N(i,e)} - u_i\right)$$

The last piece is the time integration, and using forward Euler in Eqn. 5.8 gives

$$A_i \frac{u_i^{n+1} - u_i^n}{\Delta t} + \sum_{e=1}^{3} \hat{F}(u_i, u_{N(i,e)}, \vec{n}_{i,e})\, \Delta l_{i,e} = 0. \qquad (5.10)$$

This equation gives us $u_i^{n+1}$, the updated state for cell i. Note that the update depends only
on the states of cells that are immediately adjacent to cell i.

In a finite-volume implementation, the flux is typically the most time-consuming part of


the update calculation. Since there is only one flux per edge of the mesh, the fluxes should
be calculated by looping over the edges and accumulating residuals on each element. Once
all flux residuals are computed, each cell takes an update according to Eqn. 5.10.
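The edge-loop strategy just described can be sketched as follows (a Python sketch, for illustration; the edge-list data structure and function name are ours):

```python
def accumulate_residuals(u, edges, V):
    """Flux residuals R[i] (Eqn. 5.9) via a single loop over interior edges.

    u     : list of cell-average states
    edges : list of (iL, iR, nx, ny, dl); the unit normal (nx, ny) points
            from cell iL into cell iR, and dl is the edge length
    V     : constant velocity (Vx, Vy)
    """
    R = [0.0] * len(u)
    for (iL, iR, nx, ny, dl) in edges:
        vn = V[0] * nx + V[1] * ny   # normal velocity component
        # upwind flux written with absolute values, as in the text
        fhat = 0.5 * vn * (u[iL] + u[iR]) - 0.5 * abs(vn) * (u[iR] - u[iL])
        R[iL] += fhat * dl           # the flux leaves cell iL ...
        R[iR] -= fhat * dl           # ... and enters cell iR
    return R
```

Each edge's flux is computed exactly once and contributes with opposite signs to its two adjacent cells, which is what makes the method conservative.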

The size of the time step is again most easily set by a CFL condition. In multiple
dimensions, we have various measures of the size of a cell; a common one is the hydraulic
diameter, which is twice the area divided by the perimeter. Finally, boundary conditions can
be enforced by setting either ghost states or fluxes on edges that lie on domain boundaries.

Example 5.2 (A 2D finite volume code for advection with periodic boundary conditions).
Listing 5.2 presents a code that implements a two-dimensional finite volume method for an
advection problem with a constant velocity: this is the two-dimensional version of Exam-
ple 5.1. Note that the code uses rectangular elements, and that the boundary conditions are
periodic in both a left/right and top/bottom sense.

Listing 5.2: Implementation of a 2D finite-volume method for scalar advection.
function [X,Y,U,err] = Conv2D(Nx, Ny);
% function [X,Y,U,err] = Conv2D(Nx, Ny);
% PURPOSE:
%   This function solves a 2D convection problem with periodic
%   boundary conditions using a finite volume method.
% INPUTS:
%   Nx, Ny : number of spatial cells in x and y
% OUTPUTS:
%   X,Y : x,y coordinates of cell vertices
%   U   : Nx-by-Ny cell averages at the final time
%   err : L2 norm of error at final time step

xrange = [-2, 2]; yrange = [-2, 2];  % range

% Construct mesh
x = linspace(xrange(1),xrange(2),Nx+1);
y = linspace(yrange(1),yrange(2),Ny+1);
[X,Y] = ndgrid(x,y);

% Calculate midpoint values in each control volume
xmid = 0.5*(x(1:Nx) + x(2:Nx+1));
ymid = 0.5*(y(1:Ny) + y(2:Ny+1));
[Xmid,Ymid] = ndgrid(xmid,ymid);

% Calculate cell size in control volumes (assumed equal)
dx = x(2) - x(1); dy = y(2) - y(1); A = dx*dy;

% Set velocity and final time
Vx = 1; Vy = 1; tfinal = 4;

% Set timestep
CFL = 1.0; dt = CFL/(abs(Vx)/dx + abs(Vy)/dy);

% Set initial condition to U0 = exp(-x^2 - 20*y^2)
U0 = exp(-Xmid.^2 - 20*Ymid.^2); U = U0;

% Loop until t > tfinal
dt = dt*(tfinal/dt)/ceil(tfinal/dt);  % ensure integer # time steps
t = 0;
while (t < tfinal),

  % The following implements periodic BCs by creating a larger array for
  % U and putting the appropriate values in the first and last columns/rows
  Ubc(2:Nx+1,2:Ny+1) = U;  % Copy U into Ubc
  Ubc(     1,2:Ny+1) = U(Nx,:); Ubc(Nx+2,2:Ny+1) = U( 1,:);
  Ubc(2:Nx+1,     1) = U(:,Ny); Ubc(2:Nx+1,Ny+2) = U(:, 1);

  % First the i interfaces
  F = 0.5*    Vx *( Ubc(2:Nx+2,2:Ny+1) + Ubc(1:Nx+1,2:Ny+1)) ...
    - 0.5*abs(Vx)*( Ubc(2:Nx+2,2:Ny+1) - Ubc(1:Nx+1,2:Ny+1));

  % Now the j interfaces
  G = 0.5*    Vy *( Ubc(2:Nx+1,2:Ny+2) + Ubc(2:Nx+1,1:Ny+1)) ...
    - 0.5*abs(Vy)*( Ubc(2:Nx+1,2:Ny+2) - Ubc(2:Nx+1,1:Ny+1));

  % Add contributions to residuals from fluxes
  R = (F(2:Nx+1,:) - F(1:Nx,:))*dy + (G(:,2:Ny+1) - G(:,1:Ny))*dx;

  U = U - (dt/A)*R;  % Forward Euler step
  t = t + dt;        % Increment time
end

% compute error; U(:) reshapes to a vector so that norm is the vector 2-norm
err = norm(U(:) - U0(:))*A;  % A is here because we are integrating

We run the code with an initial condition that is a two-dimensional Gaussian bump (see
definition in the code), and we convect this profile for one period. As in the one-dimensional
case, we can compute a numerical error relative to the exact profile. Figure 5.9 shows
solutions at various resolutions and convergence of the final profile with increasing resolution.
Note the observed convergence rate, which is close to first order.

Figure 5.9: Finite-volume solutions to a two-dimensional advection problem and convergence of
the final-time solution error, measured using an L2 norm of cell averages. [Panels: (a) Nx = Ny = 50;
(b) Nx = Ny = 100; (c) Nx = Ny = 200; (d) solution convergence, error in u versus ∆x = ∆y, with a
best-fit slope of 1.27.]
c 2019, K. Fidkowski
———

5.4 The Finite Volume Method for Systems


The finite volume method extends naturally to systems of nonlinear conservation laws, such
as the Euler equations of gas dynamics. The key difference compared to our linear scalar
equation is in the definition of the flux function at cell interfaces. We saw in the scalar
case that the flux function must rely on the “upwind” state in order to faithfully model the
advection physics. For hyperbolic systems of conservation laws, we must extend this idea
to multiple waves: in effect, we have to calculate a flux that for each wave uses the correct
upwind state.
A relatively simple flux function for systems is the local Lax-Friedrichs flux. In one spatial
dimension, the flux function is (compare with Eqn. 5.6),
$$\hat{F}_{i+1/2} = \frac{1}{2}\left(F_i + F_{i+1}\right) - \frac{1}{2} s_{\max} \left(u_{i+1} - u_i\right), \qquad (5.11)$$
where Fi = F(ui ), Fi+1 = F(ui+1 ), and smax is the maximum speed of any wave in the
system, based on either ui or ui+1 . The goal of using smax is to upwind every wave in the
system, although for systems with different wave speeds, this flux will upwind too much
thereby creating a significant amount of numerical diffusion.
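As a concrete illustration, here is a sketch of the local Lax-Friedrichs flux for the one-dimensional Euler equations, for which the maximum wave speed is $|v| + c$. This is written in Python for illustration (the course listings are in MATLAB), and the ratio of specific heats γ = 1.4 is our assumption:

```python
GAMMA = 1.4  # assumed ratio of specific heats (air)

def euler_flux(u):
    """1D Euler flux vector; u = [rho, rho*v, rho*E] (conservative state)."""
    rho, mom, E = u
    v = mom / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * v * v)  # ideal-gas pressure
    return [mom, mom * v + p, v * (E + p)]

def max_wave_speed(u):
    """Largest characteristic speed |v| + c for the 1D Euler equations."""
    rho, mom, E = u
    v = mom / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * v * v)
    return abs(v) + (GAMMA * p / rho) ** 0.5  # c = sqrt(gamma*p/rho)

def llf_flux(uL, uR):
    """Local Lax-Friedrichs flux, Eqn. 5.11."""
    FL, FR = euler_flux(uL), euler_flux(uR)
    smax = max(max_wave_speed(uL), max_wave_speed(uR))
    return [0.5 * (fl + fr) - 0.5 * smax * (r - l)
            for fl, fr, l, r in zip(FL, FR, uL, uR)]
```

When the left and right states are identical, the dissipation term vanishes and the flux reduces to the exact Euler flux.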
An alternative flux that more carefully upwinds waves one by one is given by
$$\hat{F}_{i+1/2} = \frac{1}{2}\left(F_i + F_{i+1}\right) - \frac{1}{2} \left|\frac{\partial F}{\partial u}(u^*)\right| \left(u_{i+1} - u_i\right), \qquad (5.12)$$
where $\left|\frac{\partial F}{\partial u}(u^*)\right|$ refers to taking the absolute values of the eigenvalues, i.e. $R|\Lambda|L$, in the
eigenvalue decomposition (see Eqn. 1.4). u∗ is an appropriately-chosen intermediate state
based on ui and ui+1 . This choice is only important for nonlinear problems, and various
options are available, the most notable being the Roe-average, which produces exact single-
wave solutions to the Riemann problem.
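For a linear system with a constant Jacobian A, the matrix $|A| = R|\Lambda|L$ in Eqn. 5.12 can be computed directly from the eigendecomposition. A NumPy sketch (function names are ours; a diagonalizable A with real eigenvalues is assumed):

```python
import numpy as np

def abs_jacobian(A):
    """|A| = R |Lambda| L, where A = R Lambda L and L = R^{-1}."""
    lam, R = np.linalg.eig(A)
    L = np.linalg.inv(R)
    return R @ np.diag(np.abs(lam)) @ L

def upwind_system_flux(A, uL, uR):
    """Eqn. 5.12 for a linear system with flux F = A u (constant Jacobian)."""
    absA = abs_jacobian(A)
    return 0.5 * (A @ uL + A @ uR) - 0.5 * absA @ (uR - uL)
```

For example, for A = [[0, 1], [1, 0]] (eigenvalues ±1), |A| is the identity, so each wave is upwinded with unit speed.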

5.5 High-Order Accuracy


Using cell averages as inputs into the numerical fluxes that we have presented means that
we are approximating the solution as constant in each cell. This gives a spatially first-order
accurate method, since the discretization error is proportional to the cell size (e.g. diameter)
to the first power. We can do better with a high-order extension; in fact, the industry standard
is a second-order finite volume method.
In a second-order finite volume method, we need a linear representation of the solution
inside each cell. We cannot construct this using only the cell average. Instead we must look
to cell averages on neighboring cells for extra information. For triangular cells, the states
on the three adjacent neighbors will give us (more than) enough information to construct a

gradient of the state. For example, we can compute the gradient in cell A by using integration
by parts,
$$\int_A \nabla u \, dA = \oint_{\partial A} \hat{u}\, \vec{n}\, ds,$$

where û is some average state on the boundary of cell A. A simple arithmetic average will
often suffice here.
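The resulting cell-gradient formula, $\nabla u \approx \frac{1}{A}\sum_e \hat{u}_e\, \vec{n}_e\, \Delta l_e$, can be sketched as follows (a Python sketch with an illustrative interface of our choosing):

```python
def green_gauss_gradient(area, edges):
    """Cell gradient from the identity  A * grad(u) = sum_e uhat_e * n_e * dl_e.

    area  : cell area
    edges : list of (uhat, nx, ny, dl), with outward unit normals
    """
    gx = sum(uhat * nx * dl for (uhat, nx, ny, dl) in edges) / area
    gy = sum(uhat * ny * dl for (uhat, nx, ny, dl) in edges) / area
    return gx, gy
```

For example, on a unit square cell with u = x (edge averages 1, 0, 0.5, 0.5 on the right, left, top, and bottom edges), the formula recovers ∇u = (1, 0) exactly.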
Higher-order accuracy can be obtained by reconstructing the state at the interfaces using
a wider stencil of neighboring cells. In all cases involving reconstruction, shock capturing
techniques are needed to prevent oscillations and nonlinear instabilities.

Chapter 6

Optimization

6.1 Introduction
In this chapter we discuss numerical methods associated with optimization, which is im-
portant to many engineering design applications. When presenting the methods, we will
consider two types of systems. The first is a “toy problem,” in which a scalar output, J, is a
simple function of n inputs, x ∈ Rn , as illustrated in Figure 6.1(a). In practical applications,
however, the system to be optimized will not be as simple. For example, in numerical simula-
tions of fluids or structures, we have the concept of a state, U, which acts as an intermediary
between the inputs and the output. This situation is illustrated in Figure 6.1(b). Note that
we will often write J(x) as shorthand for J(U, x), since the state U is itself a function of x.

[Figure: (a) Toy problem: x → (analytic function) → J(x); (b) System with state: x → R(U, x) = 0 → J(U, x)]

Figure 6.1: Input-to-output mapping of a simple "toy problem" in which the output J is an
analytic function of the input vector, x, as well as one of a system that involves a possibly-large
state vector, U.

We will deal with an optimization problem in which we try to minimize the output,
i.e. the objective function, J, by finding the best possible input vector, x. Note that a
maximization problem could be easily handled in this context just by using −J instead of
J.

6.2 Gradient-Based Methods


Gradient-based optimization methods make use of a code’s ability to compute not only the
output J(x) but also the output gradient,
$$\text{output gradient} = \nabla J = \frac{\partial J}{\partial x} = \text{a } 1 \times n \text{ vector.}$$
The general form of an iterative gradient-based optimization algorithm is given by the fol-
lowing steps (k is the iteration number):

1. Given xk = solution from the previous iteration or guess at k = 0.

2. Compute a search direction, pk ∈ Rn , using gradient information.

3. Compute a step length αk such that

J(xk + αk pk ) < J(xk )

4. Update the state according to

xk+1 = xk + αk pk ,

and return to step 1 with k incremented.

6.2.1 Steepest Descent


By definition, the gradient ∇J gives the direction in the input space in which J increases
most rapidly. So the opposite direction, −∇J, should give us the direction in which J
decreases most rapidly. To obtain a unit-length search direction, we normalize the gradient
by defining,

$$p_k = \frac{-(\nabla J)^T}{|\nabla J|}. \qquad (6.1)$$

Note that the transpose on the gradient just converts a 1 × n vector into an n × 1 vector so
that we can use it to increment xk . The normalization of the search direction in Eqn. 6.1
is solely out of convenience for use in the step-length determination. We do not lose any
information here because the magnitude of the negative gradient −∇J does not contain
information about how far we have to go in that direction in input space to hit the minimum
J. Instead, the step length αk is computed via a line search, as discussed in Section 6.2.3.
Figure 6.2 illustrates the convergence of the steepest-descent for a two-dimensional input
vector, x = (x1 , x2 ), and an analytic J(x). Note the “zig-zag” convergence towards the
minimum, the severity of which depends on the starting location. This effect is due to
the fact that, by construction, each search direction is orthogonal to the previous one, i.e.
pk+1 ⊥ pk . The anisotropy of the objective function with respect to parameters also plays a
role; e.g. if the contours of J(x) were circles around the minimum, steepest descent would
converge in one step. However, general J(x) will not be perfectly isotropic, and in these
cases steepest descent performs rather poorly due to the zig-zag convergence towards the
minimum.
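This behavior is easy to reproduce in a few lines. The following Python sketch (illustrative; the course codes are in MATLAB) applies steepest descent to the quadratic $J(x) = 0.5(x_1^2 + 10x_2^2)$; for a quadratic, the exact minimizing step length along a direction is available in closed form and stands in for the line search:

```python
def quad_J(x):
    return 0.5 * (x[0]**2 + 10.0 * x[1]**2)

def quad_grad(x):
    return [x[0], 10.0 * x[1]]

def steepest_descent(x0, n_iter):
    """Steepest descent on quad_J with the exact step length per direction."""
    x = list(x0)
    for _ in range(n_iter):
        g = quad_grad(x)
        # search direction p = -g; for this quadratic (A = diag(1, 10))
        # the exact minimizing step length is alpha = (g.g)/(g.A.g)
        gAg = g[0]**2 + 10.0 * g[1]**2
        alpha = (g[0]**2 + g[1]**2) / gAg
        x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    return x
```

Starting from x0 = (10, 1), the error decreases by only a fixed factor per step, so many iterations are needed despite the exact line search; this is the zig-zag effect.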

Figure 6.2: Steepest descent for $J(x) = 0.5(x_1^2 + 10x_2^2)$, starting with $x_0 = (10, 1)$.

6.2.2 Conjugate Gradients


When applicable, the method of conjugate gradients improves on steepest descent by keeping
track of previous search directions to prevent the zig-zag convergence pattern. At the most
basic level, the conjugate gradient method applies to minimizations of quadratic J(x), which
can be expressed as
$$J(x) = \frac{1}{2} x^T A x - b^T x,$$
where A ∈ Rn×n is a positive-definite matrix, and b ∈ Rn . The method extends to general
convex J(x), which locally near the minimum can be approximated as quadratic. Considering
the linear system, Ax = b, the residual of this system turns out to be the negative gradient
of our J,

r ≡ b − Ax = −∇J.

The conjugate gradient method begins like steepest descent: for k = 0 we set the first
search direction (forgetting about normalization for the moment) equal to the negative gra-
dient, p0 = −∇J = r0 . Subsequent search directions, pk , also start out as the gradient,
but “projections” (in the conjugate, not orthogonal, sense) of this gradient along previous
directions are then taken out, according to the following formula:
$$p_k = \underbrace{r_k}_{\text{negative gradient of } J \text{ at } x_k} - \sum_{i=0}^{k-1} \underbrace{\frac{p_i^T A r_k}{p_i^T A p_i}\, p_i}_{\text{ensures } p_k \text{ is conjugate to } p_i}.$$

We say that two vectors $p_i$, $p_k$ are conjugate with respect to A if
$$p_i^T A p_k = 0.$$

By making the newest search direction pk conjugate to the previous search directions, i <
k, we ensure that at the updated state xk+1 the gradient rk+1 will also be automatically
conjugate to the previous directions, i < k. This means that we actually do not have to
store the whole history of search directions; instead, just the current and previous one are
enough.
The matrix A is the second derivative of J with respect to x, also called the Hessian. If
we have access to A, or at least to matrix-vector products involving A, we do not need to
perform a line search to determine αk at each step. Instead, taking advantage of the fact
that J is quadratic in x, the step length is
$$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}.$$
For general convex functions J, without access to A, we would resort to a line search to
determine αk .
Figure 6.3 illustrates the rapid convergence of the conjugate gradient method. Comparing
to steepest descent, we note that the zig-zag behavior is gone and that the method converges
in just two iterations. In fact, for n dimensions, the conjugate gradient method will converge
in at most n iterations.
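In practice the sum over all previous directions collapses to a two-term recurrence, so only the current residual and search direction need to be stored. A Python sketch of this standard form, for quadratic J, where the matrix enters only through a matrix-vector product Av (the function names are ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_minimize(Av, b, x0, n_iter):
    """Conjugate gradients for J(x) = 0.5 x^T A x - b^T x.
    Av(v) returns the matrix-vector product A v."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, Av(x))]   # residual = -grad J
    p = list(r)                                 # first direction = r0
    for _ in range(n_iter):
        Ap = Av(p)
        alpha = dot(r, r) / dot(p, Ap)          # exact step for quadratic J
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r_new, r_new) / dot(r, r)    # two-term recurrence coefficient
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
    return x
```

On the two-dimensional quadratic of Figure 6.3 (A = diag(1, 10), b = 0, x0 = (10, 1)), this converges to the minimum in exactly two iterations, up to roundoff.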

Figure 6.3: Conjugate gradient method for $J(x) = 0.5(x_1^2 + 10x_2^2)$, starting with $x_0 = (10, 1)$.
The method converges in two steps.

6.2.3 Line Search Methods


In a gradient-based optimization method, the step length αk used in the update
xk+1 = xk + αk pk
should be chosen wisely to yield the largest possible decrease in J. That is, we’d like to find
αk for which
J(xk + αk pk ) is minimum.

Some gradient-based methods, such as Newton’s method, generally already return a well-
scaled search direction for which we would expect αk ≈ 1. However, steepest descent or the
general form of conjugate gradients returns search vectors whose magnitudes do not give us
information about αk . It is in these cases that a line search is most useful.
A typical plot of J versus αk is given in Figure 6.4. In a line search, we usually start
with a guess for αk , either from some prior information or from a (not too wild) estimate.

[Figure: $J(x_k + \alpha_k p_k)$ versus $\alpha_k$, with a minimum at the optimal $\alpha_k$]

Figure 6.4: Typical output behavior with step length.

If we start with a conservative estimate for αk we can try ramping it up, e.g. by factors
of 2, until J(xk + αk pk ) begins to increase. The last αk before the increase would then
be our answer. We could refine this answer by “zooming in” in the neighborhood of αk ,
depending on how much effort we were willing to expend on finding the precise minimizer
of J. Conversely, if we start with a large estimate of αk , or if the ramping up never causes
a decrease of J, then we could backtrack by dividing αk by a factor, say 2, until J begins to
increase. The last value of αk before this increase would then be our answer.
More sophisticated line search techniques exist; for example, in some methods quadratic
functions are fit to J evaluations at three recent αk points, in an effort to efficiently and
accurately determine the minimizing αk . These fits can be powerful, but they do rely on a
certain degree of smoothness in J(x).
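The backtracking strategy described above can be sketched in a few lines (a Python sketch; for simplicity this version only requires a plain decrease in J, not a sufficient-decrease condition):

```python
def backtrack_line_search(J, x, p, alpha0=1.0, factor=0.5, max_halvings=50):
    """Shrink alpha from alpha0 until J(x + alpha*p) < J(x)."""
    J0 = J(x)
    alpha = alpha0
    for _ in range(max_halvings):
        x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
        if J(x_new) < J0:
            return alpha          # first alpha that decreases J
        alpha *= factor           # backtrack by the given factor
    return alpha
```

For a descent direction that is well scaled, the initial guess alpha0 = 1 is accepted immediately; an overly long direction triggers one or more halvings.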

6.2.4 Computing Gradients


For gradient-based optimization methods we need to know how J changes with x,

$$\frac{dJ}{dx} \in \mathbb{R}^{1 \times n} = n \text{ sensitivities.}$$

These are constant if the equations and output are linear. Otherwise $dJ/dx$ changes with $x$. For
practical problems with large state vectors U, the output J cannot be written in closed form
as a function of the inputs and the gradient may be considerably expensive to compute.

6.2.4.1 Finite Differences
A simple way to obtain (approximate) derivatives is to use finite differencing. For each input,
xi , 1 ≤ i ≤ n, we calculate
$$\frac{dJ}{dx_i} \approx \frac{J(x_1, x_2, \ldots, x_i + \epsilon, \ldots, x_n) - J(x_1, x_2, \ldots, x_i, \ldots, x_n)}{\epsilon}$$
for some "small" choice of the scalar $\epsilon$. This calculation is easy to implement but expensive
for large n when each J calculation requires solving a large system of equations for the state
U. Conversely, it is efficient when we have small n but many outputs J.
Choosing $\epsilon$ for finite differencing is not easy, as it cannot be too small (machine precision
errors) or too large (finite difference discretization errors). A good choice for $\epsilon$ that works
in many cases is the square root of machine precision. The complex step method resolves
the "too small" problem by using $i\epsilon$, where $i = \sqrt{-1}$, assuming that J can be evaluated for
complex numbers.
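Both approaches can be sketched compactly (a Python sketch; the default step sizes shown are illustrative choices):

```python
import math

def fd_gradient(J, x, eps=math.sqrt(2.2e-16)):
    """Forward finite-difference gradient; eps ~ sqrt(machine precision)."""
    g = []
    J0 = J(x)
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        g.append((J(xp) - J0) / eps)   # one extra J evaluation per input
    return g

def complex_step_gradient(J, x, eps=1e-30):
    """Complex step: dJ/dx_i = Im(J(x + i*eps*e_i)) / eps.
    Requires J to accept complex inputs; no subtractive cancellation."""
    g = []
    for i in range(len(x)):
        xp = [complex(v) for v in x]
        xp[i] += 1j * eps
        g.append(J(xp).imag / eps)
    return g
```

Because there is no difference of nearly equal numbers, the complex step can use an extremely small eps and recover derivatives to near machine precision.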

6.2.4.2 Adjoint Methods


For systems of equations in which there is a state governed by N residuals, R(U, x) = 0, we
can use the adjoint method to precompute the expensive step in finite differencing, which is
the sensitivity of J to the N residuals. We call this sensitivity Ψ ∈ RN , so that by the chain
rule, the required sensitivities are
$$\frac{dJ}{dx} = \Psi^T \frac{\partial R}{\partial x}.$$
In this expression we assume that J does not explicitly depend on x; if it does we just
add the linearization of this dependence to the right-hand side of the above equation. The
adjoint method is efficient for computing a large number of sensitivities for one output, at
an incremental cost of one residual calculation and one vector product per sensitivity. This
is useful when the forward problem (x ⇒ J) is very expensive.
Let’s take a closer look at how the adjoint method works. The key idea underpinning this
method is that we do not need to solve the forward problem each time we want a sensitivity.

[Figure: x → R → U → J, with linearizations ∂R/∂x and ∂J/∂U; the solver mapping R to U is the expensive step, and δJ = Ψ^T δR]

Figure 6.5: An adjoint method precomputes the expensive portion of the input-to-output calculation.

Suppose that given x we find U such that R(U, x) = 0. Now perturb x → x + δx. What
is the effect on J? Well, we can re-solve for the state, but this is expensive if we have to
do it many times. Instead, we can separate the effects of x on R, and R on J. The adjoint
method precomputes the effect of R on J, which is the expensive step. The resulting N
sensitivities are stored in the vector Ψ.
The chain of operations in the adjoint method is as follows (assuming small perturbations):

1. Input perturbation: $x \rightarrow x + \delta x$

2. Residual perturbation: $R(U, x + \delta x) = \delta R \neq 0 \;\rightarrow\; R(U, x) + \left.\frac{\partial R}{\partial x}\right|_{U,x} \delta x = \delta R$

3. State perturbation: $R(U + \delta U, x + \delta x) = 0 \;\rightarrow\; R(U, x) + \left.\frac{\partial R}{\partial x}\right|_{U,x} \delta x + \left.\frac{\partial R}{\partial U}\right|_{U,x} \delta U = 0$

4. Output perturbation: $J(U + \delta U) = J(U) + \delta J \;\rightarrow\; \delta J = \frac{\partial J}{\partial U} \delta U$

Subtracting step 2 from step 3 and assuming small disturbances, we obtain
$$\left.\frac{\partial R}{\partial U}\right|_{U,x} \delta U = -\delta R \quad\Rightarrow\quad \delta U = -\left(\frac{\partial R}{\partial U}\right)^{-1} \delta R.$$

Combining this result with the output linearization in step 4 to express the output pertur-
bation in terms of the residual perturbation, we have
$$\delta J = \frac{\partial J}{\partial U} \delta U = \underbrace{-\frac{\partial J}{\partial U} \left(\frac{\partial R}{\partial U}\right)^{-1}}_{\Psi^T,\;\; \Psi \in \mathbb{R}^N} \delta R. \qquad (6.2)$$
Taking the transpose of the equation $\Psi^T = -\frac{\partial J}{\partial U} \left(\frac{\partial R}{\partial U}\right)^{-1}$ and moving everything to the
left-hand side gives the adjoint equation,
$$\left(\frac{\partial R}{\partial U}\right)^T \Psi + \left(\frac{\partial J}{\partial U}\right)^T = 0$$

The ith component of Ψ is the sensitivity of J to changes in the ith residual.


Since $R(U, x) = 0$, from step 2 above we have $\delta R = \left.\frac{\partial R}{\partial x}\right|_{U,x} \delta x$, and Eqn. 6.2 becomes
$$\delta J = \Psi^T \left.\frac{\partial R}{\partial x}\right|_{U,x} \delta x \quad\Rightarrow\quad \frac{dJ}{dx} = \Psi^T \left.\frac{\partial R}{\partial x}\right|_{U,x}$$

Therefore, once we have Ψ, no more solves are required for new sensitivities for the same
output. Note that the calculation of $\partial R/\partial x$ is cheap compared to a forward solve.
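The whole procedure is easiest to see on a small model problem. The following NumPy sketch uses the (hypothetical) linear residual R(U, x) = AU − x with output J = cᵀU, solves the adjoint equation, and forms dJ/dx = Ψᵀ ∂R/∂x with ∂R/∂x = −I:

```python
import numpy as np

# Hypothetical linear model problem: R(U, x) = A U - x = 0, output J = c^T U
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
c = np.array([1.0, 2.0])

# Adjoint equation: (dR/dU)^T Psi + (dJ/dU)^T = 0, i.e.  A^T Psi = -c
Psi = np.linalg.solve(A.T, -c)

# Sensitivities: dJ/dx = Psi^T dR/dx, and here dR/dx = -I
dJdx = -Psi

def J_of_x(x):
    # forward problem: solve R(U, x) = 0 for U, then evaluate J = c^T U
    return c @ np.linalg.solve(A, x)
```

One adjoint solve yields all n sensitivities; by contrast, finite differencing J_of_x would require a separate forward solve for each input.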

6.3 Gradient-Free Methods
In certain optimization problems, gradients of the objective function are not readily available,
e.g. due to the complexity of the solver or to a non-differentiable nature of J(x). For such
cases, gradient-based methods are no longer applicable. Yet even by only knowing how to
evaluate J(x), we can still solve optimization problems. In this section we briefly review two
popular options for gradient-free optimization.

6.3.1 Nelder-Mead/Simplex Method


The (nonlinear) simplex method can be used for a problem with a small or moderate number
of inputs. This method defines a simplex structure, which in n-dimensional space consists
of n + 1 points – think of a line segment in 1-d space, a triangle in 2-d space, a tetrahedron
in 3-d space, etc. At each vertex of the simplex, we evaluate J, and iterations consist of
changing the simplex through simple operations by comparing the relative values of J at
the vertices – no gradient information is required. The operations that modify the simplex
consist of reflection, contraction, expansion, and shrinking. Upon convergence, the simplex
will ideally enclose a small volume containing the optimal solution, so that the final x can
be estimated as the centroid of the simplex.

6.3.2 Genetic Algorithms


A genetic algorithm iterates a population of x values to locate the minimizer of J. The use
of a population, as opposed to a single iterate as in gradient-based methods, makes genetic
algorithms less likely to get stuck in local extrema. However, iterating a large population
requires many evaluations of the output J, which can be expensive for large systems of
equations.
The three basic properties of a genetic algorithm are selection, crossover, and mutation.
Selection refers to the idea that members of the population who are close to the optimum
(have a low J) are more likely to survive to a subsequent generation. Crossover refers to the
fact that new members of the population take on “traits” of existing population members, in
an effort to preserve optimal parameter combinations. Finally, mutation refers to an inherent
randomness in which traits, or parameter combinations, are modified during the generation
of new population members.
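A minimal real-coded genetic algorithm illustrating all three properties might look as follows (a Python sketch; the tournament selection, blend crossover, and Gaussian mutation choices are illustrative, not the only options):

```python
import random

def genetic_minimize(J, bounds, pop_size=30, n_gen=60, seed=0):
    """Minimize J over [lo, hi] with a simple real-coded genetic algorithm."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    lo, hi = bounds
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(n_gen):
        new_pop = []
        for _ in range(pop_size):
            # selection: the better (lower-J) of two random members reproduces
            a, b = rng.choice(pop), rng.choice(pop)
            p1 = a if J(a) < J(b) else b
            a, b = rng.choice(pop), rng.choice(pop)
            p2 = a if J(a) < J(b) else b
            # crossover: the child blends the two parents' "traits"
            w = rng.random()
            child = w * p1 + (1.0 - w) * p2
            # mutation: a small random perturbation of the trait
            child += rng.gauss(0.0, 0.01 * (hi - lo))
            new_pop.append(min(max(child, lo), hi))
        pop = new_pop
    return min(pop, key=J)
```

Each generation requires pop_size output evaluations, which is why genetic algorithms become expensive when each J involves a large simulation.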
Genetic algorithms are useful for multi-objective optimization in which we do not have a
single output but instead want to study the trade-off between multiple outputs (e.g. range
versus cost of an aircraft design). This is because of the availability of a large population
of points in parameter space, which can be used to plot Pareto fronts on axes of different
outputs.

Chapter 7

Finite Element Methods

7.1 Function Approximation


So far we have seen finite difference methods, in which we solve for an approximate state
at nodes of a grid, and finite volume methods, in which we solve for the averages of the
solution on cells of a mesh. In both of these cases, if we were to inquire about the value
of the solution at an arbitrary point in the domain, such as in between grid nodes in a
finite-difference method, we would have to do a little interpretation. For example, we might
interpolate nearby grid nodes or cell averages.
In contrast, finite element methods rely on a functional representation of the solution.
That is, the result of a finite element method is a function that we can evaluate at any
arbitrary point. Now, a function is a collection of an infinite number of points, and as a
discrete method a finite element algorithm cannot "solve" for all of these values at once.
Instead, in finite elements we make approximations. We do this by restricting the set of
possible functions that we are going to look at when calculating the solution. The resulting
set is a finite-dimensional function space.

7.1.1 Finite-Dimensional Function Spaces


What is a finite-dimensional function space? Let’s look at an example: polynomials of a
certain order, p, constitute a finite dimensional function space. In particular, in one spatial
dimension, x, we can write a polynomial, u(x), of order p using p + 1 coefficients,

$$u(x) = a_p x^p + a_{p-1} x^{p-1} + \ldots + a_1 x + a_0.$$

Since we can choose the p + 1 coefficients ai arbitrarily, there are an infinite number of
polynomials of a given order p. In addition, each polynomial is defined at an infinite number
of points, x. However, we say that u(x) as defined above lies in a (p+1)-dimensional function
space. Why? Because u(x) is a superposition of only p + 1 linearly-independent functions,
which are: $1, x, x^2, \ldots, x^p$. That is, we only have p + 1 "knobs" that we can adjust when
defining u(x). This is now tractable from a computational point of view, since we can easily
store (p + 1) numbers on a computer.

The space of polynomials of order p is just one example of a finite-dimensional function
space, although it is by far the most widely-used for finite elements. In general, we define
a function space by basis functions, φi (x), i ∈ [1 . . . N ], where N is the number of basis
functions. For example, for polynomials of order p, a possible set of N = p + 1 basis
functions is

$$\phi_0 = 1, \quad \phi_1 = x, \quad \phi_2 = x^2, \quad \ldots, \quad \phi_p = x^p.$$

This is called the monomial basis, and it is usually fine to use for low orders (up to about
p = 5) – for higher-orders, the basis functions start looking more and more linearly-dependent
when using finite precision arithmetic on computers, so methods based on this basis will
become increasingly ill-conditioned.
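This ill-conditioning is easy to observe numerically: the condition number of the Vandermonde matrix for the monomial basis grows rapidly with p. A NumPy sketch (the sample points and interval are illustrative choices):

```python
import numpy as np

def monomial_condition(p, n_pts=50):
    """Condition number of the Vandermonde matrix for the monomial
    basis 1, x, ..., x^p evaluated at equispaced points in [0, 1]."""
    x = np.linspace(0.0, 1.0, n_pts)
    V = np.vander(x, p + 1, increasing=True)  # columns are 1, x, ..., x^p
    return np.linalg.cond(V)
```

Even by p = 10 the conditioning has degraded by many orders of magnitude relative to low orders, which is why well-conditioned bases (e.g. orthogonal polynomials) are preferred at high order.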
We then approximate a function u(x) by a linear combination of the basis functions,
$$u(x) \approx \sum_{j=1}^{N} U_j \phi_j(x). \qquad (7.1)$$

where Uj is the coefficient on basis function j. These coefficients are the “knobs” in the
functional approximation. They will become the unknowns in a finite element method.
Figure 7.1 provides an illustration of such a linear combination using two basis functions.

Figure 7.1: Schematic of a representation of a function as the linear combination of basis func-
tions, see Eqn. 7.1. In this case, u(x) is obtained by combining N = 2 basis functions, φ1(x) and
φ2(x).

Note, in multiple dimensions, basis functions must be defined over ~x ∈ Rd (d = dimension)


instead of just x ∈ R, and there are various ways to do this. For example, in two dimensions
one can define polynomials of x and y, and a possible basis is the set of functions $\{x^r y^s\}$,
where $r + s \leq p$.

7.1.2 Local Support


Constructing basis functions over complex domains can be a daunting task if we consider
the entire domain all at once. For example when analyzing an intricate structural part,
it is nearly impossible ahead of time to determine basis functions over the entire domain

that are relevant/necessary/admissible for representing the desired solution. To address this
problem, finite element methods break the computational domain into a large number of
simpler domains, called elements of a mesh. An element is the same object as a cell in finite
volume methods – see, for example, the meshes shown in Figure 5.2.
Given a mesh of elements, we can now simplify the task of constructing basis functions
by considering only basis functions with local support. A function has local support if it
is nonzero only on a small subset of the elements. This simplifies the definition of basis
functions because we do not have to define these functions everywhere. Approximating a
solution over the entire domain is now a systematic process.
Standard finite element methods typically use linear (or bi/trilinear) basis functions as-
sociated with nodes of a mesh. Figure 7.2 shows examples of such basis functions in one
dimension and in two dimensions, on triangles. Note that these basis functions take on a
value of 1 at one and only one node, and that they are nonzero (piecewise linear, in fact)
only on the elements adjacent to that node.

[Figure panels: (a) One dimension: the hat function φj peaks at node j, with support on the
two elements between nodes j−1 and j+1; (b) Two dimensions: a hat function on triangles.]

Figure 7.2: Hat functions in one and two spatial dimensions.

With this basis, we still use Eqn. 7.1 to approximate the solution. Note that the unknowns
Uj , which are the coefficients of the basis functions in the approximation, become just the
solution values at the nodes. This is because each basis function is one at the node with which
it is associated and zero at all other nodes. So in the solution approximation in Eqn. 7.1, we
are effectively “pitching a tent” from nodal values of the solution, Uj . Although with this
basis we are solving for nodal values, which we also do in finite difference discretizations,
the equations for these unknowns will generally be different between the two methods. For
example, in finite element methods we can systematically obtain equations on an arbitrary
unstructured mesh, a task that would be difficult/ambiguous for finite-difference methods.
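Evaluating the "tent" expansion of Eqn. 7.1 with hat functions reduces to linear interpolation between the two nodes of the containing element, since only two basis functions are nonzero there. A one-dimensional Python sketch (the function name is ours):

```python
def hat_eval(x_nodes, U, x):
    """Evaluate sum_j U_j phi_j(x) for hat functions on the nodes x_nodes."""
    # find the element [x_j, x_{j+1}] containing x
    for j in range(len(x_nodes) - 1):
        if x_nodes[j] <= x <= x_nodes[j + 1]:
            # only phi_j and phi_{j+1} are nonzero on this element
            t = (x - x_nodes[j]) / (x_nodes[j + 1] - x_nodes[j])
            return (1.0 - t) * U[j] + t * U[j + 1]
    raise ValueError("x outside the mesh")
```

At a node, the expansion returns the nodal coefficient itself, consistent with the coefficients being the solution values at the nodes.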
Finally, boundary conditions can also be handled systematically in finite element meth-
ods, unlike in finite difference methods where we potentially have to change the stencil at

boundaries. When approximating a function for which boundary values are known, we just
use the known nodal values on these boundaries as the coefficients of the corresponding basis
functions. These nodes will no longer carry unknowns. Boundary conditions that specify
fluxes or gradients can also be handled systematically, as we will soon see.

7.2 The Equations


Using Eqn. 7.1 we can approximate the solution over the entire domain, as long as the set
of basis functions covers the domain. However, we still need equations for Uj , the unknown
coefficients on each basis function. Let’s look at two approaches for obtaining these equations.

7.2.1 Collocation
In collocation we obtain equations by requiring the PDE to hold at a fixed number of points.

Example 7.1 (Collocation for a 1D diffusion problem). Consider a diffusion PDE,

u_{xx} + 50e^x = 0, \qquad (7.2)
u(-1) = u(1) = 0,
\Omega = [-1, 1].

The exact solution is

u_{exact}(x) = -50e^x + 50x\,\sinh 1 + 50\cosh 1. \qquad (7.3)

We approximate the solution to this PDE using two basis functions,


u(x) = U_1 \underbrace{(1 - x^2)}_{\phi_1(x)} + U_2 \underbrace{(x - x^3)}_{\phi_2(x)}. \qquad (7.4)

Let’s try the method of collocation using the following two points: x1 = −1/3 and x2 = 1/3.
To substitute Eqn. 7.4 into Eqn. 7.2, we need to evaluate the second derivative, u_{xx}. This gives

u_{xx} = -2U_1 - 6xU_2.

Requiring the PDE to hold at the two collocation points gives two equations for U_1 and U_2,

-2U_1 - 6x_1 U_2 + 50e^{x_1} = 0
-2U_1 - 6x_2 U_2 + 50e^{x_2} = 0

Solving this system gives U1 = 26.4018, U2 = 8.4885. Using these values in our approxima-
tion, we can now construct a plot of u(x) and compare it to the exact solution, uexact (x), as
shown in Figure 7.3.

[Figure: (a) the approximate solution u(x) from collocation vs. the exact solution; (b) the PDE residual, u_{xx} + 50e^x, of the approximation.]

Figure 7.3: Comparison of exact solution to that computed by two-point collocation. Note that the zero intercepts of the residual plot occur at the collocation points.

———
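The small collocation system above is easy to reproduce numerically. The following sketch uses Python/NumPy for illustration (the course codes use MATLAB):

```python
import numpy as np

# Collocation for u_xx + 50 e^x = 0 with phi1 = 1 - x^2, phi2 = x - x^3.
# The residual -2*U1 - 6*x*U2 + 50*e^x is forced to zero at x = -1/3, 1/3.
xc = np.array([-1.0/3.0, 1.0/3.0])            # collocation points
A = np.array([[-2.0, -6.0*xc[0]],
              [-2.0, -6.0*xc[1]]])            # coefficients of U1, U2
b = -50.0*np.exp(xc)                          # move the source term to the RHS
U = np.linalg.solve(A, b)
print(U)  # approximately [26.4018, 8.4885]
```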

Determining at which points to enforce the PDE in the collocation method is a somewhat
arbitrary process. There are reasonable choices of points that yield poor results. In addition,
the degree of polynomials used for the basis functions needs to be sufficiently high in order
to be able to evaluate the PDE residual. For example, for the above diffusion equation we
could not have used linear basis functions, since two derivatives of a linear function yield
zero. For these reasons, the collocation method is generally not a very robust method for
obtaining our equations.

7.2.2 Weighted Residuals


A more systematic approach for obtaining equations is the method of weighted residuals.
Instead of enforcing the PDE at individual points, we instead integrate the PDE over the
entire domain. However, doing simply this just gives us one equation. To obtain multiple
unique equations, we multiply the PDE by linearly-independent weight (also called “test”)
functions prior to integrating over the domain.
We need the same number of weight functions as we have unknowns. Rather than defining
new functions for this purpose (which is viable, but requires a little more work), a simple
approach is to use the basis functions from the approximation of u as weight functions. That
is, the φj functions serve two purposes: approximation of the state and weight functions on
the residual.

Example 7.2 (Weighted Residuals for a 1D diffusion problem). Consider the same diffusion
PDE as in the previous example, Eqn. 7.2. We again approximate the solution to this PDE

using two basis functions,
u(x) = U_1 \underbrace{(1 - x^2)}_{\phi_1(x)} + U_2 \underbrace{(x - x^3)}_{\phi_2(x)}.

However, we now obtain equations by weighting the PDE with the basis functions and
integrating over the domain. So our two equations are
\int_{-1}^{1} \phi_i(x) \left[ u_{xx} + 50e^x \right] dx = 0,

where i = 1, 2. We can now integrate by parts to shift one derivative off u and onto the basis functions. The result is

-\int_{-1}^{1} \frac{d\phi_i}{dx}\, u_x \, dx + \underbrace{\left[ \phi_i u_x \right]_{-1}^{1}}_{0} + \int_{-1}^{1} \phi_i \, 50e^x \, dx = 0.

The boundary term from the integration by parts drops out because the functions \phi_i are zero on the boundary. Substituting our approximation for u_x = U_1 \frac{d\phi_1}{dx} + U_2 \frac{d\phi_2}{dx}, we obtain

-\int_{-1}^{1} \frac{d\phi_i}{dx}\left( U_1 \frac{d\phi_1}{dx} + U_2 \frac{d\phi_2}{dx} \right) dx + \int_{-1}^{1} \phi_i \, 50e^x \, dx = 0.

Rearranging,
i = 1: \quad -\int_{-1}^{1} \frac{d\phi_1}{dx}\frac{d\phi_1}{dx}\, dx \, U_1 - \int_{-1}^{1} \frac{d\phi_1}{dx}\frac{d\phi_2}{dx}\, dx \, U_2 + \int_{-1}^{1} \phi_1 \, 50e^x \, dx = 0
i = 2: \quad -\int_{-1}^{1} \frac{d\phi_2}{dx}\frac{d\phi_1}{dx}\, dx \, U_1 - \int_{-1}^{1} \frac{d\phi_2}{dx}\frac{d\phi_2}{dx}\, dx \, U_2 + \int_{-1}^{1} \phi_2 \, 50e^x \, dx = 0

Evaluating the integrals gives a system of two equations,

-\frac{8}{3} U_1 - 0\, U_2 + \frac{200}{e} = 0
-0\, U_1 - \frac{8}{5} U_2 + 100e - \frac{700}{e} = 0
The solution to this system is U1 = 27.5910, U2 = 8.9454. Figure 7.4 compares this solution
to the exact one.
———
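The Galerkin system above can also be checked numerically. The sketch below (Python/NumPy, used for illustration) evaluates the entries A_ij = -\int \phi_i' \phi_j' dx and F_i = \int \phi_i \cdot 50e^x dx with a 10-point Gauss rule, then solves AU + F = 0:

```python
import numpy as np

# Weighted residuals for u_xx + 50 e^x = 0 with phi1 = 1 - x^2, phi2 = x - x^3.
xq, wq = np.polynomial.legendre.leggauss(10)   # 10-point Gauss rule on [-1, 1]

phi = [1 - xq**2, xq - xq**3]                  # basis values at the quadrature points
dphi = [-2*xq, 1 - 3*xq**2]                    # basis derivatives

A = np.array([[-np.sum(wq*dphi[i]*dphi[j]) for j in range(2)]
              for i in range(2)])              # A_ij = -int phi_i' phi_j' dx
F = np.array([np.sum(wq*phi[i]*50*np.exp(xq)) for i in range(2)])

U = np.linalg.solve(A, -F)                     # solve A U + F = 0
print(U)  # approximately [27.5910, 8.9454]
```

The diagonal entries come out to -8/3 and -8/5 and the off-diagonal entries vanish, matching the hand-computed system.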
Note the integration by parts in the above example, which shifts derivatives from the
approximation functions to the weight/test functions. This allows us to use basis functions
of lower degree than that suggested by the differential operator. For example, we will soon
use linear basis functions for elliptic differential equations.
In the method of weighted residuals, the equations fall out automatically once we define
a basis for approximating the solution. There is also a sounder theoretical framework sup-
porting integrating over the entire domain compared to evaluating the PDE at just a few
collocation points. For this reason, the most common finite element methods are based on
the method of weighted residuals.

[Figure: (a) the approximate solution u(x) from weighted residuals vs. the exact solution; (b) the PDE residual, u_{xx} + 50e^x, of the approximation.]

Figure 7.4: Comparison of exact solution to that computed by the method of weighted residuals.

7.3 Numerical Integration


In the method of weighted residuals, we have to integrate basis functions multiplying the
PDE over portions of the domain. For simple problems, we can do the integrals analyti-
cally. However, for more complex problems, such as nonlinear PDEs in multiple dimensions,
analytical integration is not practical. Instead, we can use numerical integration.
We will consider using weighted sums of point values of a function to approximate the
integral of that function. We will restrict our attention to one spatial dimension at first,
on a fixed domain, [−1, 1]. Other intervals can be mapped to this domain, so this choice is
not restrictive. Specifically, we seek one set of Nq points, xq , and weights, wq , such that for
arbitrary polynomials f (x) of order p,
\int_{-1}^{1} f(x)\,dx = \sum_{q=1}^{N_q} w_q f(x_q).

Note that while for general functions f (x) the numerical integration will be approximate, we
require that for polynomial f (x) (up to a certain order) the numerical integration is exact.
One approach to numerical integration is as follows. First, sample f (x) at Nq = p + 1
arbitrary distinct points xq . Second, fit an order p polynomial and integrate this polynomial
using tabulated values for the integrals of, say, each of the monomials. Third, express the
result as a weighted linear combination of f (xq ); the weights are the wq that we seek.
The problem with this simple approach is that with a careless choice of xq , some of the
weights may turn out to be very large and/or negative, creating conditioning and numerical
precision problems. We can do much better in terms of conditioning and accuracy by choosing
xq judiciously.

A key observation is that the process of integration turns a function of x into a scalar
(the integral). This loss of information suggests that less information about f (x) is needed
to compute the integral than to fully describe the function.

Example 7.3 (Numerical integration of a linear function). Suppose we want to integrate an arbitrary linear function on x ∈ [-1, 1]. We only need one point, which is at the midpoint of the interval – for us at x = 0:

\int_{-1}^{1} f_{linear}(x)\,dx = 2 f_{linear}(0) \quad \Rightarrow \quad N_q = 1, \; x_1 = 0, \; w_1 = 2.

The graphical proof of this integration rule is illustrated below. Note that we would need two points to fully determine the function f_{linear}. The reason we can get away with one point for integration is that we do not need to know the function exactly in order to compute its integral. In addition, in numerical integration we can tune both the weights and the point locations.

[Figure: an arbitrary linear function on [-1, 1]; the area under f equals the area of a rectangle of height f_{linear}(0) and width 2.]
———

Consider an arbitrary order p polynomial,

f(x) = c_0 + c_1 x + \ldots + c_{p-1}x^{p-1} + c_p x^p. \qquad (7.5)

Integrating this polynomial from -1 to 1, we obtain

I \equiv \int_{-1}^{1} f(x)\,dx = 2\left(c_0 + \frac{c_2}{3} + \frac{c_4}{5} + \ldots\right). \qquad (7.6)

Note that the odd powers of x integrate to zero, leaving \lfloor p/2 \rfloor + 1 terms, where \lfloor a \rfloor is the greatest integer less than or equal to a. So to compute I we just need N_q = \lfloor p/2 \rfloor + 1 pieces of information (e.g. values of f). Put another way, N_q points will let us integrate polynomial orders up to p = 2N_q - 1.

Example 7.4 (Two-point integration). Suppose we choose to use N_q = 2 integration points. We should be able to integrate polynomials up to p = 3 (cubics) exactly. For the case of p = 3, the integral in Eqn. 7.6 is

I = 2\left(c_0 + \frac{c_2}{3}\right) = w_1 f(x_1) + w_2 f(x_2)
  = w_1\left(c_0 + c_1 x_1 + c_2 x_1^2 + c_3 x_1^3\right) + w_2\left(c_0 + c_1 x_2 + c_2 x_2^2 + c_3 x_2^3\right).

Matching coefficients on the c_i, we have

c_0: \quad 2 = w_1 + w_2
c_1: \quad 0 = w_1 x_1 + w_2 x_2
c_2: \quad 2/3 = w_1 x_1^2 + w_2 x_2^2
c_3: \quad 0 = w_1 x_1^3 + w_2 x_2^3

The following values of w_q, x_q satisfy these equations:

w_1 = 1, \quad w_2 = 1, \qquad x_1 = -\frac{1}{\sqrt{3}}, \quad x_2 = \frac{1}{\sqrt{3}}.
———
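We can hedge against algebra mistakes by checking the derived rule numerically. A quick sketch (Python/NumPy, with arbitrarily chosen cubic coefficients) confirms that the two-point rule reproduces the exact integral:

```python
import numpy as np

# Two-point Gauss rule: w = [1, 1], x = [-1, 1]/sqrt(3).
xq = np.array([-1.0, 1.0])/np.sqrt(3.0)
wq = np.array([1.0, 1.0])

c = np.array([0.7, -1.3, 2.0, 0.5])           # arbitrary cubic: c0 + c1 x + c2 x^2 + c3 x^3
f = lambda x: c[0] + c[1]*x + c[2]*x**2 + c[3]*x**3

I_quad = np.sum(wq*f(xq))
I_exact = 2*(c[0] + c[2]/3)                   # Eqn. 7.6: odd powers integrate to zero
print(abs(I_quad - I_exact))  # ~ 0 (machine precision)
```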

Note that computing the weights and points for numerical integration requires the solu-
tion of a nonlinear system of equations, and this gets harder for large Nq . For example, a
Newton-Raphson approach is not guaranteed to converge, especially for poor initial guesses
of the point locations. Moreover, the system becomes increasingly ill-conditioned for high
Nq as columns start to appear linearly dependent – this is due to the fact that monomials
are not a good basis. It turns out that in one dimension there is a clever choice of polyno-
mial basis functions that circumvents these difficulties. The result is a set of points called
the Gauss-Legendre points, where the quadrature point locations are the roots of Legendre
polynomials, and the weights come from solving a well-conditioned linear system.
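In practice one rarely derives these rules by hand. For example, NumPy tabulates Gauss-Legendre points and weights; a short sketch using `numpy.polynomial.legendre.leggauss` recovers the two-point rule derived above and checks the expected degree of exactness:

```python
import numpy as np

# Gauss-Legendre rules from NumPy; Nq = 2 recovers the points derived above.
xq, wq = np.polynomial.legendre.leggauss(2)
print(np.sort(xq))  # approximately [-0.5774, 0.5774] = [-1, 1]/sqrt(3)
print(wq)           # [1., 1.]

# An Nq-point rule integrates x^p exactly on [-1, 1] for all p <= 2*Nq - 1.
for Nq in range(1, 8):
    xq, wq = np.polynomial.legendre.leggauss(Nq)
    for p in range(2*Nq):
        exact = 0.0 if p % 2 == 1 else 2.0/(p + 1)
        assert abs(np.sum(wq*xq**p) - exact) < 1e-12
```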
For domains in multiple dimensions, integration rules are generally harder to derive. This
is not the case for squares and cubes, where tensor products of the one-dimensional rules
can be used. However, for other shapes, such as triangles or tetrahedra, the process is more
painstaking, and once found, integration rules are usually tabulated for a prototypical reference shape (e.g. a unit right triangle).

7.4 The Finite Element Method in One Dimension


In this section we present the finite element method for the one-dimensional diffusion equa-
tion,

(\nu u_x)_x + f(x) = 0,

where ν is the diffusivity and f (x) is a source term. This equation can be used to model
the conduction of heat or the diffusion of a chemical species, and it is a prototypical form
for more complicated diffusion phenomena. Note, in some cases the diffusivity ν may not
be constant. We will allow it to depend on the position, x, but not on the state, u, as the
latter case would result in a nonlinear system.
The finite-element method is, at its core, the method of weighted residuals presented in
Section 7.2.2. It is in fact a specialized version of this method, in that the basis functions are
judiciously chosen to enable a systematic and local construction of the system of equations.
Let’s see how this works for the problem at hand.

7.4.1 Mesh and Basis Functions


Suppose our domain of interest is x ∈ [0, L]. We divide this domain into Ne intervals that
are not necessarily equally-spaced. The resulting mesh is shown in Figure 7.5. On this mesh

[Figure: a 1D mesh from x = 0 to x = L divided into N_e elements; element k spans [x_k, x_{k+1}] with length \Delta x_k.]

Figure 7.5: Mesh of N_e elements for a one-dimensional finite-element discretization.

we construct a set of N = Ne + 1 basis functions to support a piecewise linear approximation


of our solution. Each basis function is a “hat” shape that is one at a single node and drops
linearly to zero at the adjacent neighbor nodes. Figure 7.6 illustrates these basis functions
on our mesh. Note that on the two boundaries we have basis functions that look like half of
the interior basis functions.
[Figure: N_e + 1 hat functions \phi_1, \ldots, \phi_{N_e+1} on the mesh from x = 0 to x = L, with half-hats at the two boundary nodes.]

Figure 7.6: N_e + 1 linear basis functions on a mesh of N_e elements for a one-dimensional finite-element discretization.

7.4.2 The Reference Element


In the finite element method we need to define basis functions over elements that could be of different
sizes. We also need to integrate over these elements when computing the weighted residuals.
Both of these tasks are simplified by the use of a reference element.

In one dimension, as a reference element we use the interval [−1, 1]. We denote by ξ the
reference-space coordinate in this interval. Now, for every element in the mesh, we consider
a mapping to this reference element, as illustrated in Figure 7.7. Note that the indexing on
[Figure: element k, spanning [x_k, x_{k+1}] in the global domain [0, L], mapped to the reference interval \xi \in [-1, 1].]

Figure 7.7: Mapping of element k to a reference element, \xi \in [-1, 1].

the nodes and elements is such that x_{k+1} = x_k + \Delta x_k. The mapping from \xi to x is

x = x_k + \frac{\xi + 1}{2}\,\Delta x_k. \qquad (7.7)

The reverse mapping, from x to \xi, is

\xi = 2\,\frac{x - x_k}{\Delta x_k} - 1. \qquad (7.8)

7.4.3 The Weak Form


To obtain equations, we use the method of weighted residuals. That is, we multiply the
PDE by test functions and integrate by parts over the elements. We will use a Galerkin
finite element method, in which the test functions are the same as the approximation basis
functions.
Consider test function φi , where i ∈ [1 . . . Ne + 1] is an index over the nodes. When
we multiply the PDE by these test functions, the integrand is only nonzero over the two
elements adjacent to node i. This is due to the local support of each finite element test
function. So we only need to integrate over the support of φi – i.e. element i − 1 (between
xi−1 and xi ) and element i (between xi and xi+1 ).
\int_\Omega \phi_i \left[ (\nu u_x)_x + f(x) \right] dx = 0

\Rightarrow \int_{x_{i-1}}^{x_{i+1}} \phi_i \left[ (\nu u_x)_x + f(x) \right] dx = 0

\Rightarrow \int_{x_{i-1}}^{x_{i+1}} \phi_i (\nu u_x)_x \, dx + \int_{x_{i-1}}^{x_{i+1}} \phi_i f(x) \, dx = 0

\Rightarrow -\int_{x_{i-1}}^{x_{i+1}} \phi_{i,x}\, \nu u_x \, dx + \left[ \phi_i \nu u_x \right]_{x_{i-1}}^{x_{i+1}} + \int_{x_{i-1}}^{x_{i+1}} \phi_i f(x) \, dx = 0 \qquad (7.9)

We have assumed that we are dealing with an interior basis function, i.e. one that is not on
the boundary. The treatment of boundary basis functions is discussed in the next subsection.

Now, for every interior basis function, the middle term in the above expression vanishes
because φi = 0 at xi−1 and at xi+1 . So the equation associated with every interior node is
-\int_{x_{i-1}}^{x_{i+1}} \phi_{i,x}\, \nu u_x \, dx + \int_{x_{i-1}}^{x_{i+1}} \phi_i f(x) \, dx = 0. \qquad (7.10)

We call this a weak enforcement of the PDE since we only require the PDE to hold in an
integral sense.

7.4.4 Boundary Conditions


The weak form requires special treatment at domain boundaries. At the boundaries, extra
information is generally available in the form of boundary conditions, and this information
must be incorporated into the finite element method. We will see how this happens for three
of the most common types of boundary conditions.

1. Dirichlet: the value of u is prescribed on the boundary.

2. Neumann: the flux (gradient) of u, specifically νux , is prescribed on the boundary. For
example, in a heat-transfer problem, an adiabatic wall would require \nu u_x = 0.

3. Robin: a linear combination of u and the flux νux is prescribed on the boundary. In a
heat-transfer problem, this could be a model for a convective boundary.

We present each of these conditions applied to the left boundary of our domain. The
relevant equation is the weighted residual associated with φ1 , which, starting from Eqn. 7.9,
is
-\int_{x_1}^{x_2} \phi_{1,x}\, \nu u_x \, dx + \left[ \phi_1 \nu u_x \right]_{x_1}^{x_2} + \int_{x_1}^{x_2} \phi_1 f(x) \, dx = 0. \qquad (7.11)

Let’s see how this equation looks for each boundary condition.

7.4.4.1 Dirichlet Boundary Condition


In the Dirichlet case, we know the value of u on the boundary. Let’s call it uDirichlet . For our
case, this means we know u at the left boundary, i.e. u1 . This value is the coefficient on the
first basis function, i.e. U1 which multiplies φ1 , in our nodal Lagrange basis. So we do not
need the boundary weak form in Eqn. 7.11, and instead our equation is

U_1 = u_{Dirichlet}. \qquad (7.12)

Another strategy is to simply eliminate the unknown U1 from the system by replacing all
occurrences of U1 with uDirichlet in the other equations.

7.4.4.2 Neumann Boundary Condition
In the Neumann case, we know the value of νux on the boundary. Let’s call this qNeumann .
Eqn. 7.11 then becomes
-\int_{x_1}^{x_2} \phi_{1,x}\, \nu u_x \, dx - \phi_1(x_1)\, q_{Neumann} + \int_{x_1}^{x_2} \phi_1 f(x) \, dx = 0 \qquad (7.13)

Note that while \phi_1(x_2) = 0, \phi_1(x_1) \neq 0, so the term containing the Neumann flux boundary
condition does not vanish. It is only through this boundary term, which arose from our
integration by parts, that the Neumann boundary condition is incorporated. This is another
advantage of applying integration by parts to the weighted residuals.

7.4.4.3 Robin Boundary Condition


Finally, in Robin boundary conditions, the value of q_{Robin} \equiv au + b\nu u_x is known on the boundary, where a and b are some constants. We can solve for \nu u_x in terms of the known quantity and u,

\nu u_x \big|_{x_1} = \frac{1}{b}\left( q_{Robin} - a\, u(x_1) \right).

As in the Neumann case, we substitute this value of \nu u_x into Eqn. 7.11, obtaining

-\int_{x_1}^{x_2} \phi_{1,x}\, \nu u_x \, dx - \phi_1(x_1)\, \frac{1}{b}\left( q_{Robin} - a\, u(x_1) \right) + \int_{x_1}^{x_2} \phi_1 f(x) \, dx = 0 \qquad (7.14)

This is the equation for the weighted residual associated with the first basis function. Note
that in the Robin boundary condition, the boundary flux term that came from the integration
by parts now brings in a dependence on the state. This dependence must be taken into
account when assembling the stiffness matrix.

7.4.5 Matrix Assembly and Solution


7.4.5.1 The Stiffness Matrix, A
In the finite element method, each basis function brings both an unknown (the coefficient in
the approximation of u(x)) and an equation (the weighted residual). We will consider only
linear PDEs, for which the resulting system of equations will also be linear. Let’s take a
closer look at how the weak form equations become a matrix system.
We start with the equation for every interior basis function, Eqn. 7.10 (repeated here),

-\int_{x_{i-1}}^{x_{i+1}} \phi_{i,x}\, \nu u_x \, dx + \int_{x_{i-1}}^{x_{i+1}} \phi_i f(x) \, dx = 0.

Into this equation we substitute the expansion for u(x) from Eqn. 7.1 (repeated here),

u(x) \approx \sum_{j=1}^{N} U_j \phi_j(x).

Recall that N is the number of basis functions, which is Ne + 1 for our 1D mesh of Ne
elements. The resulting equation is
-\int_{x_{i-1}}^{x_{i+1}} \phi_{i,x}\, \nu \left( \sum_{j=1}^{N} U_j \phi_{j,x} \right) dx + \int_{x_{i-1}}^{x_{i+1}} \phi_i f(x) \, dx = 0

\sum_{j=1}^{N} \underbrace{\left( -\int_{x_{i-1}}^{x_{i+1}} \phi_{i,x}\, \nu\, \phi_{j,x} \, dx \right)}_{A_{ij}} U_j + \underbrace{\int_{x_{i-1}}^{x_{i+1}} \phi_i f(x) \, dx}_{F_i} = 0 \qquad (7.15)

This is now just an N × N matrix system,

AU + F = 0,

where the entries of the stiffness matrix A are Aij , the entries of the solution vector U are
Uj , and the entries of the (LHS) forcing vector F are Fi . The matrix A is sparse (mostly
zeros), because the basis functions have local support. That is, Aij will only be nonzero
if the nonzero values of φi and φj overlap – which does not happen often since each basis
function is nonzero over only two elements.

7.4.5.2 Stiffness Matrix Assembly


We can build the matrix A using a systematic approach by looping over the elements. From
Eqn. 7.15, we have that (for i not a boundary node), the matrix components are
A_{ij} = -\int_{x_{i-1}}^{x_{i+1}} \phi_{i,x}\, \nu\, \phi_{j,x} \, dx. \qquad (7.16)

Building the A matrix one row at a time is a bit cumbersome because each basis function
is piecewise linear, so that the slope (φi,x ) changes over the elements. This creates even
more cumbersome bookkeeping in two and three dimensions. An alternative approach is to
initialize A = 0 and then to build the matrix up with contributions to A from each element.
Let’s pick one element, k ∈ [1 . . . Ne ]. There are two nonzero basis functions on this
element, φk and φk+1 , as illustrated in Figure 7.8. So according to Eqn. 7.16, element k

[Figure: the two nonzero hat functions, \phi_k and \phi_{k+1}, on element k spanning [x_k, x_{k+1}].]

Figure 7.8: Nonzero basis functions on element k.

contributes to four entries in the calculation of A. These four contributions are

A_{k,k} = A_{k,k} - \int_{x_k}^{x_{k+1}} \phi_{k,x}\, \nu\, \phi_{k,x} \, dx \qquad A_{k,k+1} = A_{k,k+1} - \int_{x_k}^{x_{k+1}} \phi_{k,x}\, \nu\, \phi_{k+1,x} \, dx
A_{k+1,k} = A_{k+1,k} - \int_{x_k}^{x_{k+1}} \phi_{k+1,x}\, \nu\, \phi_{k,x} \, dx \qquad A_{k+1,k+1} = A_{k+1,k+1} - \int_{x_k}^{x_{k+1}} \phi_{k+1,x}\, \nu\, \phi_{k+1,x} \, dx

Note that we say these are contributions to, not necessarily the actual entries in, A. This is
because on element k we are only looking at a portion of the basis functions φk and φk+1 –
both of these will generally be nonzero on other elements (e.g. k − 1 or k + 1). Additional
contributions will come from integrating over these other elements.
Looping over the elements and adding to A in the manner described above is called
(stiffness) matrix assembly. In fact, for each element k we can create a 2 × 2 matrix A^k that
is the contribution of element k to the total stiffness matrix A. Then by looping over the
elements we “tile” the total stiffness matrix as illustrated in Figure 7.9.

[Figure: tridiagonal tiling of A by the 2 × 2 elemental matrices A^k; the block for element k covers rows and columns k and k + 1 and overlaps its neighbors.]

Figure 7.9: Assembly of matrix A with elemental contributions A^k. Note the overlap of the elemental contributions, which is due to the fact that each interior basis function spans two elements.

7.4.5.3 Incorporating Boundary Conditions into A and F


When using Dirichlet boundary conditions, we need to enforce Eqn. 7.12. So we replace the
first row of the stiffness matrix by a single 1 on the main diagonal, and we replace the first
entry in the forcing vector F by −uDirichlet .
When using Neumann boundary conditions, the stiffness matrix does not change. Looking
at Eqn. 7.13, the only difference in this equation compared to that of an interior node is the
presence of the constant −φ1 (x1 )qNeumann term, which we transfer over to the forcing vector.
Finally, when using Robin boundary conditions, both the stiffness matrix and the forcing vector change. Looking at Eqn. 7.14, we see that the first extra term, -\phi_1(x_1)\frac{1}{b}q_{Robin}, is a constant that can be incorporated into the forcing vector. The second additional term, \phi_1(x_1)\frac{a}{b}u(x_1) = \phi_1(x_1)\frac{a}{b}U_1, affects the (1, 1) entry of the stiffness matrix – specifically, we need to add the quantity \phi_1(x_1)\frac{a}{b} to this entry in A.
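The three treatments above can be sketched in a few lines (Python/NumPy for illustration; the matrix values below are placeholders, not from the text):

```python
import numpy as np

# Imposing a Dirichlet BC on an assembled 1D system A U + F = 0 (N nodes).
N = 4
A = -2.0*np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)  # placeholder stiffness
F = np.ones(N)                                          # placeholder forcing

uD = 1.5                         # prescribed Dirichlet value at node 1
A[0, :] = 0.0; A[0, 0] = 1.0     # replace the first row by an identity row
F[0] = -uD                       # so the first equation reads U_1 - u_D = 0

# (Neumann: A unchanged, add -phi_1(x_1)*q_Neumann to F[0].
#  Robin: additionally add phi_1(x_1)*(a/b) to A[0, 0].)

U = np.linalg.solve(A, -F)       # solve A U + F = 0
print(U[0])  # 1.5, the prescribed boundary value
```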

7.4.5.4 Reference Element Calculations
The entries in the stiffness matrix and forcing vector, e.g. in Eqn. 7.15, require integrals over the elements. In the case of the stiffness matrix, the derivatives of the basis functions are constants, and so we can evaluate these integrals analytically. However, the forcing vector requires an integral of the product of a basis function and some arbitrary f(x). This is most easily done using numerical integration, after mapping to a reference element. Specifically, the map from element k with x \in [x_k, x_{k+1}] to a reference element with \xi \in [-1, 1] is given in Eqn. 7.7 and Eqn. 7.8. So we can write, for example,

\int_{x_k}^{x_{k+1}} \phi_k(x) f(x) \, dx = \int_{-1}^{1} \phi_L(\xi)\, f(x(\xi))\, \frac{\Delta x_k}{2} \, d\xi
\int_{x_k}^{x_{k+1}} \phi_{k+1}(x) f(x) \, dx = \int_{-1}^{1} \phi_R(\xi)\, f(x(\xi))\, \frac{\Delta x_k}{2} \, d\xi

where \phi_L(\xi) = (1 - \xi)/2 and \phi_R(\xi) = (1 + \xi)/2 are the two basis functions on the reference element. Note, \frac{\Delta x_k}{2} is the Jacobian of the mapping – i.e. dx = \frac{\Delta x_k}{2}\, d\xi. We can now evaluate these integrals using a numerical quadrature rule with a number of points N_q that is appropriate for the desired accuracy.
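As a concrete sketch of this mapping (Python/NumPy; the element bounds and f(x) = x^2 are assumed values, chosen so the exact integral is available for comparison):

```python
import numpy as np

# Evaluate int_{x_k}^{x_{k+1}} phi_k(x) f(x) dx on the reference element.
xk, xk1 = 0.25, 0.5
dx = xk1 - xk
f = lambda x: x**2                       # integrand with phi_k is cubic -> 2-pt Gauss is exact

xiq = np.array([-1.0, 1.0])/np.sqrt(3.0) # 2-point Gauss rule in xi
wq = np.array([1.0, 1.0])
phiL = 0.5*(1 - xiq)                     # left reference basis function phi_L(xi)
xq = xk + 0.5*(xiq + 1)*dx               # map xi -> x (Eqn. 7.7)

I = np.sum(wq*phiL*f(xq))*dx/2           # Jacobian: dx = (dx_k/2) dxi
print(I)  # 11/768 = 0.0143229..., the exact value
```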
The reference element mapping is also useful for the analytical stiffness matrix calculation. For example, let's look at the 2 × 2 contribution associated with element k, A^k. The first entry in this matrix is

-\int_{x_k}^{x_{k+1}} \phi_{k,x}\, \nu\, \phi_{k,x} \, dx = -\int_{-1}^{1} \underbrace{\frac{2}{\Delta x_k}\phi_{L,\xi}}_{\phi_{k,x}}\, \nu\, \underbrace{\frac{2}{\Delta x_k}\phi_{L,\xi}}_{\phi_{k,x}}\, \frac{\Delta x_k}{2}\, d\xi \qquad (7.17)
= -\frac{2\nu}{\Delta x_k} \int_{-1}^{1} \phi_{L,\xi}\, \phi_{L,\xi} \, d\xi

We used the chain rule to transform basis function derivatives with respect to x into derivatives with respect to \xi. We can now evaluate the integral, since \phi_{L,\xi} = -1/2. Doing the same for the other three entries of A^k gives

A^k = \frac{\nu}{\Delta x_k} \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}
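A quick numerical check of this result (Python/NumPy; the values of \nu and \Delta x_k are arbitrary assumed values):

```python
import numpy as np

# Reproduce A^k = (nu/dx_k)*[[-1, 1], [1, -1]] from the reference-element form.
nu, dxk = 2.0, 0.1
dphi_xi = np.array([-0.5, 0.5])               # [phi_L,xi, phi_R,xi]

# A^k_ij = -(2 nu / dx_k) int_{-1}^{1} dphi_i dphi_j dxi; the integrand is
# constant on [-1, 1], so the integral equals 2*dphi_i*dphi_j.
Ak = np.array([[-(2*nu/dxk)*2*dphi_xi[i]*dphi_xi[j] for j in range(2)]
               for i in range(2)])

print(Ak)  # equals (nu/dxk)*[[-1, 1], [1, -1]]
```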

7.4.5.5 Putting it all Together


Let’s now look at an implementation of the finite element method for a one-dimensional heat
transfer problem.
Example 7.5 (Steady Heat Conduction in a Bar with a Source). The governing equation for the temperature, denoted by T(x), in a one-dimensional rod is

\frac{d}{dx}\left( \kappa \frac{dT}{dx} \right) + f(x) = 0,

where \kappa is the thermal conductivity, and f(x) is the (negative of the) heat source. We consider a domain of x \in [0, 1], a source term of the form f(x) = \sin(\pi x), \kappa = 1, and the following boundary conditions:

A Dirichlet BC at x = 0: \quad T(0) = 1,
A Neumann BC at x = 1: \quad \kappa \frac{dT}{dx} = 0.
Listing 7.1 presents a code that implements the finite element discretization for this problem,
using a uniform grid of elements of size ∆x = 1/Ne .

Listing 7.1: Computes the temperature in a heated 1D bar using the FEM
function [x,T] = heat1d(TL, kappa, Ne);
% function [x,T] = heat1d(TL, kappa, Ne);
% PURPOSE:
%   Solves a 1D heat transfer problem using FEM.
% INPUTS:
%   TL    : temperature at the left end of the bar
%           (the right end of the bar is adiabatic)
%   kappa : heat conductivity inside the bar
%   Ne    : number of elements
% OUTPUTS:
%   x : Ne+1 by 1 vector of node locations
%   T : Ne+1 by 1 vector of temperatures at each node

dx = 1/Ne;              % element length (all the same)
x = linspace(0,1,Ne+1); % node positions

N = Ne+1;               % # unknowns = # nodes
A = sparse(N,N);        % stiffness matrix
F = zeros(N,1);         % forcing vector (on LHS)

% elemental stiffness matrix
Ak = kappa/dx * [-1 1; 1 -1];

% numerical integration (quadrature) rule
xiq = [-1;1]/sqrt(3); wq = [1, 1];

% Ref element basis functions at quadrature points
Phiq = [0.5*(1-xiq), 0.5*(1+xiq)];

% set up system by looping over elements
for k = 1:Ne,

  Ik = [k, k+1];  % unknowns associated with this element

  A(Ik, Ik) = A(Ik, Ik) + Ak;  % add to stiffness matrix

  xq = (k-1)*dx + 0.5*(xiq + 1)*dx;  % quad points in global space

  F(Ik) = F(Ik) + Phiq'*diag(wq)*sin(pi*xq)*dx/2;  % numerical integration

end

% boundary conditions
A(1,:) = zeros(1,N); A(1,1) = 1; F(1) = -TL;  % Dirichlet on left
% adiabatic on right -> nothing to do for Neumann

% solve system
T = -A\F;

Running the code with Ne = 8 elements produces the plot shown in Figure 7.10 on the
left. Note how both boundary conditions appear satisfied. A more quantitative comparison
is shown in the error convergence plot in Figure 7.10, where the L2 error is computed against
the exact analytical solution. Note the second-order rate of convergence of the error. This rate is expected because we are using linear polynomials to approximate the solution: the leading error term is quadratic in \Delta x.
[Figure: (left) the computed temperature T(x) on a mesh of N_e = 8 elements, rising from T = 1 at x = 0 toward the adiabatic right end; (right) a log-log plot of the discrete L2 error norm versus dx, showing a convergence rate of 2.]

Figure 7.10: Sample finite element solution and convergence of the discrete L2 error norm of the computed temperature distribution relative to the actual temperature distribution for Example 7.5.

———

7.5 The Finite Element Method in Two Dimensions


We now present the finite element method for the two-dimensional diffusion equation,
∇ · (ν∇u) + f (x, y) = 0. (7.18)
Again, ν is a diffusivity and f is a source term. Note that the diffusive flux, ν∇u, is now a
vector.

7.5.1 Mesh and Basis Functions


We discretize Eqn. 7.18 on a domain broken down into non-overlapping triangles. The set
of these triangles constitutes a mesh, such as the one illustrated in Figure 5.2(b). Each of
the triangles is an element, and the nodes are the mesh vertices.
On the mesh of triangles, we approximate the state using piecewise linear “hat” basis
functions, as illustrated in Figure 7.2(b). Note that each basis function is one at a single node
and is nonzero only on the elements adjacent to that node. As in the one-dimensional case,
we will associate basis functions with all nodes of the mesh, be they interior or boundary.

7.5.2 The Reference Element
To simplify calculations we will work with a reference element, which in two dimensions will
be a unit right triangle, as illustrated in Figure 7.11. We denote the reference triangle by T ,
and in the figure we show a mapping between T and an arbitrary element Ωk . The mapping

[Figure: the unit right triangle T in reference coordinates (\xi, \eta), with vertices (0, 0), (1, 0), (0, 1), mapped to a global element \Omega_k with vertices \vec{x}_1 = [x_1, y_1]^T, \vec{x}_2 = [x_2, y_2]^T, \vec{x}_3 = [x_3, y_3]^T.]

Figure 7.11: Mapping from the reference element, with coordinates \vec{\xi} = (\xi, \eta), to an arbitrarily-shaped and oriented global element with coordinates \vec{x} = (x, y).

from reference space to global space, \vec{x}(\vec{\xi}), is a linear function that interpolates the three global nodes that define the triangle, \vec{x}_1, \vec{x}_2, \vec{x}_3. Using the interpolation requirement, we can construct this function as

\vec{x}(\vec{\xi}) = \vec{x}_1 + (\vec{x}_2 - \vec{x}_1)\xi + (\vec{x}_3 - \vec{x}_1)\eta,

where (\xi, \eta) are the two components of \vec{\xi}. Inspecting this mapping, we see that it is linear and that the three nodes of the reference triangle map to the three nodes \vec{x}_1, \vec{x}_2, \vec{x}_3. In matrix form, we can write

\vec{x} = \vec{x}_1 + J\vec{\xi},

where J is the mapping Jacobian matrix given by

J = \frac{\partial \vec{x}}{\partial \vec{\xi}} = \begin{bmatrix} x_2 - x_1 & x_3 - x_1 \\ y_2 - y_1 & y_3 - y_1 \end{bmatrix}.

The inverse mapping is then \vec{\xi} = J^{-1}(\vec{x} - \vec{x}_1), where the inverse mapping Jacobian is

J^{-1} = \frac{\partial \vec{\xi}}{\partial \vec{x}} = \frac{1}{J}\begin{bmatrix} y_3 - y_1 & x_1 - x_3 \\ y_1 - y_2 & x_2 - x_1 \end{bmatrix}, \qquad J \equiv \det(J).

When transforming integrals from global space to reference space, we must include the determinant of the mapping, J, inside the reference-space integrand,

\int_{\Omega_k} f(\vec{x})\, d\Omega = \int_T f(\vec{x}(\vec{\xi}))\, J \, dT = J \int_T f(\vec{x}(\vec{\xi}))\, dT,

where the last step uses the fact that J is constant for our linear mapping.

Finally, we can use the chain rule when changing variables during differentiation:

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial \xi}\frac{\partial \xi}{\partial x} + \frac{\partial f}{\partial \eta}\frac{\partial \eta}{\partial x}, \qquad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial \xi}\frac{\partial \xi}{\partial y} + \frac{\partial f}{\partial \eta}\frac{\partial \eta}{\partial y},

or, in compact form,

\frac{\partial f}{\partial \vec{x}} = \frac{\partial f}{\partial \vec{\xi}} \frac{\partial \vec{\xi}}{\partial \vec{x}} = \frac{\partial f}{\partial \vec{\xi}} J^{-1}.

Note that gradients such as \frac{\partial f}{\partial \vec{\xi}} are treated as row vectors.
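The mapping formulas can be sketched in code (Python/NumPy; the triangle vertices are assumed values for illustration):

```python
import numpy as np

# Jacobian, its inverse, and the chain rule for one triangle.
x1 = np.array([0.0, 0.0])
x2 = np.array([2.0, 0.0])
x3 = np.array([0.5, 1.0])

J = np.column_stack([x2 - x1, x3 - x1])   # J = d(x)/d(xi): columns (x2-x1), (x3-x1)
detJ = np.linalg.det(J)
Jinv = np.linalg.inv(J)                   # d(xi)/d(x)

# Reference-space gradients of the linear basis functions, as row vectors
grad_xi = np.array([[-1.0, -1.0],
                    [ 1.0,  0.0],
                    [ 0.0,  1.0]])
grad_x = grad_xi @ Jinv                   # chain rule: grad_x f = (grad_xi f) J^{-1}

print(detJ)                # 2.0 = twice the triangle's area
print(grad_x.sum(axis=0))  # [0, 0]: the basis functions sum to the constant 1
```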

7.5.3 The Weak Form


As in the one-dimensional case, we use the method of weighted residuals to obtain our
equations. This means we multiply the PDE in Eqn. 7.18 by test functions that are the
same as the basis functions, and then we integrate by parts over the entire domain, Ω. The
result is
\int_\Omega \phi_i \left[ \nabla \cdot (\nu \nabla u) + f(x, y) \right] d\Omega = 0

\int_\Omega \phi_i\, \nabla \cdot (\nu \nabla u)\, d\Omega + \int_\Omega \phi_i f(x, y)\, d\Omega = 0

-\int_\Omega \nabla\phi_i \cdot (\nu \nabla u)\, d\Omega + \underbrace{\int_{\partial\Omega} \phi_i (\nu \nabla u) \cdot \vec{n}\, dl}_{\text{result of integrating by parts}} + \int_\Omega \phi_i f(x, y)\, d\Omega = 0 \qquad (7.19)

Since each basis/test function has local support, once we substitute an expansion for u, we
expect to obtain a sparse system, as in the one-dimensional case.

7.5.4 Boundary Conditions


The boundary condition treatment in two dimensions is very similar to that in one dimension.
First, for nodes on Dirichlet boundaries, we replace the weak form corresponding to the test
function on that node with the equation,
U_i = u_{Dirichlet}.
On Neumann boundaries, we prescribe the normal flux q_{Neumann} = (\nu \nabla u) \cdot \vec{n}. Looking at the weak form in Eqn. 7.19, the contribution to the equation for test function \phi_i is

\int_{\partial\Omega} \phi_i\, q_{Neumann}\, dl.

This contribution is only nonzero for those test functions that are nonzero on the Neumann
boundaries. Note that we still have unknowns associated with the nodes on these boundaries.
Finally, for Robin boundary conditions, we can again solve for the normal flux (\nu \nabla u) \cdot \vec{n} in
terms of the prescribed data and the state u on the boundary. Using the resulting expression
in the weak form brings in a dependence on the state that must be taken into account in the
formation of the stiffness matrix.

7.5.5 Matrix Assembly
We obtain the stiffness matrix A by substituting the expansion for u(x, y),

u(x, y) = \sum_{j=1}^{N} U_j \phi_j(x, y),

into the weak form in Eqn. 7.19. Note that the sum is over N , the total number of nodes.
In the following we assume that there are no Robin boundary conditions, so that the only
dependence on the state comes from the domain-interior integrals in Eqn. 7.19. Incorporating
Robin boundary conditions is a straightforward extension. Since

\nabla u = \sum_{j=1}^{N} U_j \nabla\phi_j,

the stiffness matrix is calculated as follows:

-\int_\Omega \nabla\phi_i \cdot (\nu \nabla u)\, d\Omega = \sum_{j=1}^{N} \underbrace{\left( -\int_\Omega \nu\, \nabla\phi_i \cdot \nabla\phi_j\, d\Omega \right)}_{A_{ij}} U_j.

The resulting system is

AU + F = 0,

where the forcing vector F contains contributions from the source term in Eqn. 7.19 and
from boundary condition data.
As in 1D, we can form A and F efficiently by first zeroing these arrays out and then
looping over the elements and adding each element’s contribution to these quantities. To see
how this works, let’s pick one element, k. Denote by i, j ∈ [1, 2, 3] the local indices of the
nodes that make up the triangular element Ωk . Denote by g(j) ∈ [1 . . . N ] the corresponding
global indices of the local nodes. Then, the contribution of element k to the global stiffness
matrix is,

\[
A_{g(i),g(j)} = A_{g(i),g(j)} + A^k_{ij},
\]

where Akij are components of the local stiffness matrix, Ak , for element k. Similarly, the
contribution to the global forcing vector is

\[
F_{g(i)} = F_{g(i)} + F^k_i,
\]

where Fik are the components of the local forcing vector.
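The element loop above can be sketched as follows (illustrative Python, not code from the notes; the `nodes`/`elems` arrays and the callable interfaces for the local matrices are assumptions):

```python
import numpy as np

def assemble(nodes, elems, local_stiffness, local_forcing):
    """Assemble the global stiffness matrix A and forcing vector F by
    looping over elements and scattering local contributions.

    nodes : (N, 2) array of node coordinates
    elems : (Ne, 3) array of global node indices g(j) for each triangle
    local_stiffness, local_forcing : callables returning the 3x3 element
        matrix A^k and 3-vector F^k (assumed interfaces)
    """
    N = nodes.shape[0]
    A = np.zeros((N, N))   # zero out before accumulating
    F = np.zeros(N)
    for k, g in enumerate(elems):          # g maps local -> global indices
        Ak = local_stiffness(nodes[g])
        Fk = local_forcing(nodes[g])
        for i in range(3):
            F[g[i]] += Fk[i]               # F_{g(i)} += F^k_i
            for j in range(3):
                A[g[i], g[j]] += Ak[i, j]  # A_{g(i),g(j)} += A^k_{ij}
    return A, F
```

In practice A would be stored in a sparse format, as in Listing 7.2, but the scatter pattern is the same.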

137
c 2019, K. Fidkowski
We can evaluate the local stiffness matrix by mapping the integral to the reference element. Using the equations in Section 7.5.2, we obtain
\[
A^k_{ij} = -\int_{\Omega_k} \nu\, \nabla\phi_i \cdot \nabla\phi_j \, d\Omega
= -\int_{\Omega_k} \nu\, \left(\nabla_\xi \phi_i \, J^{-1}\right) \cdot \left(\nabla_\xi \phi_j \, J^{-1}\right) d\Omega
= -\int_{T} \nu\, \left(\nabla_\xi \phi_i \, J^{-1}\right) \cdot \left(\nabla_\xi \phi_j \, J^{-1}\right) \det J \; dT
\]

In the above equations, $\nabla_\xi$ denotes the gradient with respect to the reference coordinates, i.e. $\nabla_\xi = \partial/\partial\vec{\xi}$ as opposed to $\nabla = \partial/\partial\vec{x}$. Finally, on the reference triangle, the basis functions and gradients are

\[
\begin{array}{lll}
\phi_1 = 1 - \xi - \eta, & \dfrac{\partial\phi_1}{\partial\xi} = -1, & \dfrac{\partial\phi_1}{\partial\eta} = -1 \\[2mm]
\phi_2 = \xi, & \dfrac{\partial\phi_2}{\partial\xi} = 1, & \dfrac{\partial\phi_2}{\partial\eta} = 0 \\[2mm]
\phi_3 = \eta, & \dfrac{\partial\phi_3}{\partial\xi} = 0, & \dfrac{\partial\phi_3}{\partial\eta} = 1
\end{array}
\]
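A minimal sketch of this mapped computation for a single linear triangle (illustrative Python; the function name and interface are assumptions). Because the linear basis functions have constant gradients, the reference-element integral reduces to the constant integrand times $\det J$ times the reference-triangle area of $1/2$:

```python
import numpy as np

def local_stiffness_tri(xy, nu=1.0):
    """Local stiffness matrix A^k for one linear triangle via the
    reference-element mapping.

    xy : (3, 2) array of the triangle's vertex coordinates
    nu : diffusivity
    """
    # reference gradients of phi1 = 1-xi-eta, phi2 = xi, phi3 = eta
    Gref = np.array([[-1.0, -1.0],
                     [ 1.0,  0.0],
                     [ 0.0,  1.0]])
    # Jacobian of the map (xi, eta) -> (x, y)
    J = np.column_stack((xy[1] - xy[0], xy[2] - xy[0]))
    detJ = np.linalg.det(J)
    G = Gref @ np.linalg.inv(J)    # physical gradients, one per row
    # constant integrand * |det J| * (reference area = 1/2)
    return -nu * (G @ G.T) * abs(detJ) * 0.5
```

A useful sanity check: because the basis functions sum to one, their gradients sum to zero, so each row of $A^k$ sums to zero.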

7.6 High-Order Finite Elements


So far we have presented the finite element method for piecewise linear (“hat”) basis functions. This is indeed the most common form of the method. However, higher-order versions also exist. The idea behind high-order elements is to use polynomials of degree higher than 1 to approximate the solution inside each element – see, for example, the discussion of finite-dimensional spaces in Section 7.1.1. Fortunately, aside from this change in the form
of basis functions, not much else changes in the finite element method. We show this for
quadratic (p = 2) elements in one dimension.
We need a basis to approximate the solution with quadratic polynomials inside each
element. The basis of monomials, 1, x, x2 , would work, but the bookkeeping with monomials
is harder when we go to enforce continuity of basis functions at the nodes. Instead, a
better basis results from the nodal Lagrange functions, illustrated in Figure 7.12. The
three functions that comprise the Lagrange basis on one element are all quadratics. Each
function is equal to 1 on one “node” of the element, and zero on all other nodes. In the
case of quadratic, p = 2, approximation, we define an additional node as the midpoint of the
element.
We can calculate the local stiffness matrix using the same procedure as in the p = 1 case. By Eqn. 7.17, the $i,j$ component of the local stiffness matrix $A^k$ is
\[
A^k_{ij} = -\frac{2\nu}{\Delta x_k} \int_{-1}^{1} \phi_{i,\xi}\, \phi_{j,\xi}\, d\xi,
\]

Figure 7.12: Basis functions for one-dimensional approximation using quadratic functions, defined on the reference element $\xi \in [-1, 1]$: $\phi_1 = \xi(\xi-1)/2$, $\phi_2 = 1 - \xi^2$, $\phi_3 = \xi(\xi+1)/2$.

where $i, j \in [1, 2, 3]$. The derivatives of the three Lagrange basis functions in reference space are
\[
\phi_{1,\xi} = \xi - \frac{1}{2}, \qquad
\phi_{2,\xi} = -2\xi, \qquad
\phi_{3,\xi} = \xi + \frac{1}{2}.
\]
Using these formulas, the local stiffness matrix becomes
\[
A^k = -\frac{2\nu}{\Delta x_k}
\begin{bmatrix}
\frac{7}{6} & -\frac{4}{3} & \frac{1}{6} \\[1mm]
-\frac{4}{3} & \frac{8}{3} & -\frac{4}{3} \\[1mm]
\frac{1}{6} & -\frac{4}{3} & \frac{7}{6}
\end{bmatrix}
\]
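The entries of this matrix can be double-checked by evaluating the reference-element integral with the same three-point Gauss rule used later in Listing 7.2 (an illustrative Python sketch; the function name is an assumption):

```python
import numpy as np

def local_stiffness_p2(dx, nu=1.0):
    """Local stiffness matrix for a 1D quadratic element of length dx,
    A^k_ij = -(2*nu/dx) * int_{-1}^{1} phi_{i,xi} phi_{j,xi} dxi,
    evaluated with a 3-point Gauss rule (exact for this degree-2 integrand).
    """
    xiq = np.array([-1.0, 0.0, 1.0]) * np.sqrt(3.0 / 5.0)  # Gauss points
    wq = np.array([5.0, 8.0, 5.0]) / 9.0                   # Gauss weights
    # derivatives of the Lagrange basis on [-1, 1], one row per function
    dphi = np.stack([xiq - 0.5, -2.0 * xiq, xiq + 0.5])
    # entry (i, j) = sum_q w_q * phi_i'(xi_q) * phi_j'(xi_q)
    return -(2.0 * nu / dx) * (dphi * wq) @ dphi.T
```

Running this with $\Delta x_k = 1$, $\nu = 1$ reproduces the closed-form matrix above.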

Example 7.6 (Steady Heat Conduction in a Bar with a Source using a p = 2 FEM). In
this example we solve the same problem as in Example 7.5, but this time using quadratic
elements. The governing equations and boundary conditions (repeated below) remain the
same.
 
\[
\frac{d}{dx}\left(\kappa \frac{dT}{dx}\right) + f(x) = 0, \quad x \in [0, 1], \qquad f(x) = \sin(\pi x), \quad \kappa = 1
\]

A Dirichlet BC at $x = 0$: $T(0) = 1$; a Neumann (adiabatic) BC at $x = 1$: $\kappa\, dT/dx = 0$.
Listing 7.2 presents a code that implements the finite element discretization for this problem,
using a uniform grid of elements of size ∆x = 1/Ne .

Listing 7.2: Computes the temperature in a heated 1D bar using a quadratic FEM
function [x,T] = heat1dHO(TL, kappa, Ne);
% function [x,T] = heat1dHO(TL, kappa, Ne);
% PURPOSE:
%   Solves a 1D heat transfer problem using quadratic FEM.
% INPUTS:
%   TL    : temperature at the left end of the bar
%           (the right end of the bar is adiabatic)
%   kappa : heat conductivity inside the bar
%   Ne    : number of elements
% OUTPUTS:
%   x : 2*Ne+1 by 1 vector of node locations
%   T : 2*Ne+1 by 1 vector of temperatures at each node

dx = 1/Ne;                   % element length (all the same)
x  = linspace(0,1,2*Ne+1);   % node positions

N = 2*Ne+1;        % # unknowns = # vertex nodes + # midpoint nodes
A = sparse(N,N);   % stiffness matrix
F = zeros(N,1);    % forcing vector (on LHS)

% elemental stiffness matrix
Ak = -2*kappa/dx * [7, -8, 1; -8, 16, -8; 1, -8, 7]/6;

% numerical integration (quadrature) rule
xiq = [-1;0;1]*sqrt(3/5);  wq = [5,8,5]/9;

% reference-element basis functions at the quadrature points
Phiq = [0.5*xiq.*(xiq-1), 1-xiq.^2, 0.5*xiq.*(1+xiq)];

% set up system by looping over elements
for k = 1:Ne,

  Ik = [2*k-1, 2*k, 2*k+1];   % unknowns associated with this element

  A(Ik,Ik) = A(Ik,Ik) + Ak;   % add to stiffness matrix

  xq = (k-1)*dx + 0.5*(xiq+1)*dx;   % quadrature points in global space

  F(Ik) = F(Ik) + Phiq'*diag(wq)*sin(pi*xq)*dx/2;   % numerical integration

end

% boundary conditions
A(1,:) = zeros(1,N); A(1,1) = 1; F(1) = -TL;   % Dirichlet on left
% adiabatic on right -> nothing to do for Neumann

% solve system
T = -A\F;

Running the code with Ne = 8 quadratic elements produces the plot shown in Figure 7.13
on the left. In fact, we are only plotting the nodal values of the quadratic solution, connected
with straight lines. We could obtain a better picture if we subdivided each element into a
larger number of sub-elements for plotting. Nevertheless, we can see that both boundary
conditions again appear satisfied. A more quantitative comparison is shown in the error
convergence plot in Figure 7.13, where the error is computed against the exact analytical
solution. Note the third-order rate of convergence of the error. We can reason this from the
fact that we are using quadratic polynomials to approximate our solution: the leading error

term will be cubic in ∆x.
Figure 7.13: Sample quadratic finite element solution and convergence of the discrete L2 error
norm of the computed temperature distribution relative to the actual temperature distribution for
Example 7.6.

———

7.7 Discontinuous Finite Elements


The linear and high-order finite element methods discussed up to this point are of the con-
tinuous variety, meaning that the approximated solution, u(x) or u(x, y), is a (piecewise)
continuous function over the domain. Continuous finite elements are generally a great choice
for differential equations that are diffusion-dominated: e.g. heat transfer or structural dy-
namics. However, they yield unstable discrete systems for convection-dominated equations,
such as those that govern compressible fluid flows.
For convection-dominated equations, the finite volume method is the historically-preferred
approach because of the stability brought about by the upwind-biased flux functions. Such
upwinding cannot be easily applied to continuous finite elements because there is no ambigu-
ity about the solution at element interfaces ... the solution is continuous there. How can we
“fix” this? We can allow for discontinuities at element interfaces in the solution approxima-
tion, as illustrated in Figure 7.14 1 . The resulting method is called Discontinuous Galerkin
(DG), where “Galerkin” refers to the fact that the approximation and test space are going to
be the same. Adding discontinuities in the solution approximation seems counter-intuitive
from a stability point of view: we can imagine even wilder oscillations when jumps are pos-
sible between elements. However, the jumps bring about flexibility in choosing fluxes with
1
This is not the only fix – another popular approach is to add a specific form of diffusion to the continuous
finite-element weak form to eliminate the unstable modes; the resulting method is called the Streamline-Upwind Petrov-Galerkin (SUPG) method.


Figure 7.14: Comparison of linear continuous and discontinuous finite element solution approx-
imations in one dimension.

an upwind bias and this improves stability. The discontinuities do increase the expense of
the discretization because, for a given number of elements, it takes more numbers (degrees
of freedom) to represent a discontinuous solution compared to a continuous solution.

7.7.1 Advection in One Dimension


The prototypical equation for convection-dominated flows is scalar advection. We’ll consider
a steady-state advection with a source term,

\[
a \frac{du}{dx} + s(x) = 0, \quad x \in [0, L] \qquad (7.20)
\]
\[
u(x = 0) = u_0 \qquad (7.21)
\]

where a > 0 is a constant advection velocity, L is the length of the domain, s(x) is a known
source function, and u0 is a specified value on the left boundary. Note that we do not need
a boundary condition at x = L, the outflow.
We divide the domain into N equal elements of size ∆x = L/N . On each element, k,
we represent the solution as a polynomial. Just like in the continuous case, we’ll write the
polynomial using basis functions, φk,j (x). For an order p polynomial, we will need p + 1
basis functions, so 1 ≤ j ≤ p + 1. We now have two indices on the basis functions: k for the
element number and j for the index of the basis function within the element. So we write

\[
\text{on element } k: \quad u(x) = \sum_{j=1}^{p+1} U_{k,j}\, \phi_{k,j}(x). \qquad (7.22)
\]

To get the weak form, we use the discontinuous basis functions as test functions. Because
of the discontinuities between elements, we cannot apply integration by parts to the entire
domain all at once. Instead, we use integration by parts on a single element k. That is,
we multiply the differential equation in Eqn. 7.20 by test functions φk,i and integrate the

advection term by parts over the element,
\[
\int_{x_k}^{x_{k+1}} \phi_{k,i} \left( a\frac{du}{dx} + s(x) \right) dx = 0
\]
\[
\int_{x_k}^{x_{k+1}} \phi_{k,i}\, a\frac{du}{dx}\, dx + \int_{x_k}^{x_{k+1}} \phi_{k,i}\, s(x)\, dx = 0
\]
\[
-\int_{x_k}^{x_{k+1}} \frac{d\phi_{k,i}}{dx}\, a u \, dx + \left[\phi_{k,i}\, a u\right]_{x_k}^{x_{k+1}} + \int_{x_k}^{x_{k+1}} \phi_{k,i}\, s(x)\, dx = 0 \qquad (7.23)
\]

The above weak form requires evaluations of au (the “flux”) on the element boundaries.
However, the solution can be discontinuous between elements! To resolve this problem, we
borrow an idea from the finite-volume method and use an upwind flux. That is, we set au
on an element interface to au from the left element (since a > 0). So we write

\[
\left. au \right|_{x_k} = a\, u_{k-1}(x_k), \qquad
\left. au \right|_{x_{k+1}} = a\, u_k(x_{k+1}), \qquad (7.24)
\]

where uk (x) is the solution representation on element k. The resulting weak form on element
k is
\[
-\int_{x_k}^{x_{k+1}} \frac{d\phi_{k,i}}{dx}\, a u\, dx
+ \phi_{k,i}(x_{k+1})\, a\, u_k(x_{k+1})
- \phi_{k,i}(x_k)\, a\, u_{k-1}(x_k)
+ \int_{x_k}^{x_{k+1}} \phi_{k,i}\, s(x)\, dx = 0. \qquad (7.25)
\]

Note that the upwind flux couples solutions on adjacent elements because the right-most
value of u(x) on element k − 1 is required in the equations for element k. Without this
coupling we would have N independent systems and we would never be able to solve the
global problem. Also, for the first element, k = 1, the upwind flux is just au0 ; i.e. it uses
the boundary condition.
To obtain a system of equations, we substitute the solution approximation from Eqn. 7.22
into the weak form and make a linear system by evaluating integrals of the basis functions
and their derivatives. These integrals are most easily done on a reference element. Depending
on the form of the basis functions, the integrals may be evaluated analytically; an alternative
is numerical integration. The integral of the source term is typically done with numerical
integration. Both the source term and terms involving the known boundary condition will
move to the right-hand-side of the linear system. After solving the system, we will have the
solution coefficients Uk,j from which we can plot the solution, or evaluate outputs and errors.
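To make these steps concrete, here is a minimal sketch of the full $p = 1$ DG discretization of Eqns. 7.20–7.21 (illustrative Python; the function name, interface, and the choice of nodal basis $\phi_1 = (1-\xi)/2$, $\phi_2 = (1+\xi)/2$ on the reference element $\xi \in [-1,1]$ are assumptions, not from the notes):

```python
import numpy as np

def dg1_advection(a, L, N, s, u0):
    """Solve a du/dx + s(x) = 0 on [0, L], u(0) = u0, with p = 1 DG
    elements and the upwind flux of Eqn. 7.24 (a > 0 assumed).

    Returns an (N, 2) array U with U[k, :] the nodal coefficients of
    the linear solution on element [x_k, x_{k+1}].
    """
    dx = L / N
    # C_ij = int_{-1}^{1} phi_{i,xi} phi_j dxi  for phi = (1-xi)/2, (1+xi)/2
    C = np.array([[-0.5, -0.5], [0.5, 0.5]])
    # element block: volume term -a*C plus upwind flux at the right face,
    # where phi_2(x_{k+1}) = 1 picks out U_{k,2}
    Ek = -a * C + a * np.array([[0.0, 0.0], [0.0, 1.0]])
    xiq = np.array([-1.0, 1.0]) / np.sqrt(3.0)   # 2-point Gauss rule
    wq = np.array([1.0, 1.0])
    phiq = np.column_stack(((1 - xiq) / 2, (1 + xiq) / 2))

    A = np.zeros((2 * N, 2 * N))
    b = np.zeros(2 * N)
    for k in range(N):
        I = slice(2 * k, 2 * k + 2)
        A[I, I] += Ek
        if k == 0:
            b[0] += a * u0            # inflow BC enters via the upwind flux
        else:
            A[2 * k, 2 * k - 1] += -a  # -a u_{k-1}(x_k) couples to element k-1
        xq = k * dx + (xiq + 1) * dx / 2          # quad points in global space
        b[I] -= (phiq.T * wq) @ s(xq) * dx / 2    # source moved to the RHS
    return np.linalg.solve(A, b).reshape(N, 2)
```

As a check, for a constant source the exact solution is linear and lies in the approximation space, so the DG solution reproduces it to machine precision.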

