
Mathematical Programming (2022) 196:935–985

https://doi.org/10.1007/s10107-022-01875-8

FULL LENGTH PAPER · Series B

Stochastic dual dynamic programming for multistage stochastic mixed-integer nonlinear optimization

Shixuan Zhang1 · Xu Andy Sun2

Received: 8 January 2020 / Accepted: 29 July 2022 / Published online: 20 August 2022
© The Author(s) 2022

Abstract
In this paper, we study multistage stochastic mixed-integer nonlinear programs (MS-
MINLP). This general class of problems encompasses, as important special cases,
multistage stochastic convex optimization with non-Lipschitzian value functions and
multistage stochastic mixed-integer linear optimization. We develop stochastic dual
dynamic programming (SDDP) type algorithms with nested decomposition, determin-
istic sampling, and stochastic sampling. The key ingredient is a new type of cuts based
on generalized conjugacy. Several interesting classes of MS-MINLP are identified,
where the new algorithms are guaranteed to obtain the global optimum without the
assumption of complete recourse. This significantly generalizes the classic SDDP algo-
rithms. We also characterize the iteration complexity of the proposed algorithms. In
particular, for a (T + 1)-stage stochastic MINLP satisfying L-exact Lipschitz regular-
ization with d-dimensional state spaces, to obtain an ε-optimal root node solution, we
prove that the number of iterations of the proposed deterministic sampling algorithm
is upper bounded by O((2LT/ε)^d), and is lower bounded by O((LT/(4ε))^d) for the general
case or by O((LT/(8ε))^(d/2−1)) for the convex case. This shows that the obtained complexity
bounds are rather sharp. It also reveals that the iteration complexity depends polyno-
mially on the number of stages. We further show that the iteration complexity depends
linearly on T , if all the state spaces are finite sets, or if we seek a (T ε)-optimal solution
when the state spaces are infinite sets, i.e. allowing the optimality gap to scale with
T . To the best of our knowledge, this is the first work that reports global optimiza-
tion algorithms as well as iteration complexity results for solving such a large class
of multistage stochastic programs. The iteration complexity study resolves a conjecture by the late Prof. Shabbir Ahmed in the general setting of multistage stochastic mixed-integer optimization.

Xu Andy Sun (corresponding author)
sunx@mit.edu

Shixuan Zhang
szhang483@gatech.edu

1 H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA

2 Sloan School of Management, Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA, USA

Keywords Multistage stochastic programming · MINLP · SDDP · Complexity analysis

Mathematics Subject Classification 90C15 · 90C11 · 90C30 · 90C60 · 90C39 · 49N15

1 Introduction

A multistage stochastic mixed-integer nonlinear program (MS-MINLP) is a sequential decision-making problem under uncertainty with both continuous and integer decisions and a nonconvex nonlinear objective function and constraints. This provides an
extremely powerful modeling framework. Special classes of MS-MINLP, such as mul-
tistage stochastic linear programming (MS-LP) and mixed-integer linear programming
(MS-MILP), have already found a wide range of applications in diverse fields such
as electric power system scheduling and expansion planning [3, 36, 37, 39], portfolio
optimization under risk [8, 21, 26], and production and capacity planning problems
[2, 4, 9, 12], just to name a few.
Significant progress has been made in the classic nested Benders decomposition
(NBD) algorithms for solving MS-LP with general scenario trees, and an efficient
random sampling variation of NBD, the stochastic dual dynamic programming (SDDP)
algorithm, is developed for MS-LP with scenario trees having stagewise independent
structures. In the past few years, these algorithms have been extended to solve MS-MILP [31].
For example, SDDP is generalized to Stochastic Dual Dynamic integer Programming
(SDDiP) algorithm for global optimization of MS-MILP with binary state variables
[38, 39]. Despite the rapid development, key challenges remain in further extending
SDDP to the most general problems in MS-MINLP: 1) There is no general cutting plane
mechanism for generating exact under-approximation of nonconvex, discontinuous,
or non-Lipschitzian value functions; 2) The computational complexity of SDDP-type
algorithms is not well understood even for the most basic MS-LP setting, especially the
interplay between iteration complexity of SDDP, optimality gap of obtained solution,
number of stages, and dimension of the state spaces of the MS-MINLP.
This paper aims to develop new methodologies that address these challenges. In particular, we develop a unified cutting plane mechanism in the SDDP
framework for generating exact under-approximation of value functions of a large
class of MS-MINLP, and develop sharp characterization of the iteration complexity
of the proposed algorithms. In the remainder of this section, we first give an overview
of the literature, then summarize more details of our contributions.

1.1 Literature review

Benders decomposition [6], Dantzig–Wolfe decomposition [11], and the L-shaped method [35] are standard algorithms for solving two-stage stochastic LPs. Nested


decomposition procedures for deterministic models are developed in [16, 20]. Lou-
veaux [25] first generalized the two-stage L-shaped method to multistage quadratic
problems. Nested Benders decomposition for MS-LP was first proposed in Birge [7]
and Pereira and Pinto [28]. SDDP, the sampling variation of NBD, was first proposed
in [29]. By far the largest application area of SDDP is the energy sector; see e.g. [14, 23, 29, 34].
Recently, SDDP has been extended to SDDiP [38]. It is observed that the cuts generated from Lagrangian relaxation of the nodal problems in an MS-MILP are always tight at the given parent node's state, as long as all the state variables take only binary values and the problem has complete recourse. From this fact, the SDDiP algorithm is proved to find an
exact optimal solution in finitely many iterations with probability one. In this way, the
SDDiP algorithm makes it possible to solve nonconvex problems through binarization
of the state variables [19, 39]. In addition, when the value functions of MS-MILP with general integer state variables are assumed to be Lipschitz continuous (a critical assumption), [1] proposes augmented Lagrangian cuts, obtained via augmented Lagrangian duality, which add a reverse norm term to the linear part.
The convergence analysis of the SDDP-type algorithms begins with the linear cases
[10, 18, 24, 30, 33], where almost sure finite convergence is established based on the
polyhedral nodal problem structures. For convex problems, if the value functions are
Lipschitz continuous and the state space is compact, asymptotic convergence of the
under-approximation of the value functions leads to asymptotic convergence of the
optimal value and optimal solutions [15, 17]. By constructing over-approximations
of value functions, an SDDP algorithm with deterministic sampling and asymptotic convergence is proposed for the convex case in [5]. Upon completion of this paper,
we became aware of the recent work [22], which proves iteration complexity upper
bounds for multistage convex programs under the assumption that all the value functions and their under-approximations are Lipschitz continuous. It is shown that
for discounted problems, the iteration complexity depends linearly on the number of
stages. However, the following conjecture (suggested to us by the late Prof. Shab-
bir Ahmed) remains to be resolved, especially for the problems without convexity,
Lipschitz continuity, or discounts:
Conjecture 1 The number of iterations needed for SDDP/SDDiP to find an optimal
first-stage solution grows linearly in terms of the number of stages T , while it may
depend nonlinearly on other parameters such as the diameter D and the dimension d
of the state space.
Our study resolves this conjecture by giving a full picture of the iteration complexity of
SDDP-type algorithms in a general setting of MS-MINLP problems that allow exact
Lipschitz regularization (defined in Sect. 2.3). In the following, we summarize our
contributions.

1.2 Contributions

1. To tackle the MS-MINLP problems without Lipschitz continuous value functions, which the existing SDDP algorithms and complexity analyses cannot handle, we propose a regularization approach to provide a surrogate of the original problem


such that the value functions become Lipschitz continuous. In many cases, the
regularized problem preserves the set of optimal solutions.
2. We use the theory of generalized conjugacy to develop a cut generation scheme, producing what we call generalized conjugacy cuts, which are valid for the value functions of MS-MINLP. Moreover, the generalized conjugacy cuts are shown to be tight for the regularized value functions. They can be replaced by linear cuts without compromising such tightness when the problem is convex.
3. With the regularization and the generalized conjugacy cuts, we propose three algo-
rithms for MS-MINLP, including nested decomposition for general scenario trees,
SDDP algorithms with random sampling as well as deterministic sampling similar
to [5] for stagewise independent scenario trees.
4. We obtain upper and lower bounds on the iteration complexity for the proposed
SDDP with both sampling methods for MS-MINLP problems. The complexity
bounds show that, in general, Conjecture 1 holds if we only seek a (T ε)-optimal solution, instead of an ε-optimal first-stage solution for a (T + 1)-stage problem, or when all the state spaces are finite sets.
In addition, this paper contains the following contributions compared with the recent
independent work [22]: (1) We consider a much more general class of problems which
are not necessarily convex. As a result, all the iteration complexity upper bounds of the
algorithms are also valid for these nonconvex problems. (2) We use the technique of
regularization to make the iteration complexity bounds independent of the subproblem
oracles. This is particularly important for the conjecture, since the Lipschitz constants
of the under-approximation of value functions may exceed those of the original value
functions. (3) We propose matching lower bounds on the iteration complexity of the
algorithms and characterize important cases for the conjecture to hold.
This paper is organized as follows. In Sect. 2 we introduce the problem formulation,
regularization of the value functions, and the approximation scheme using generalized
conjugacy. Sect. 3 proposes SDDP algorithms. Sect. 4 investigates upper bounds on
the iteration complexity of the proposed algorithms, while Sect. 5 focuses on lower bounds, thereby completing the picture of the iteration complexity analysis. We finally
provide some concluding remarks in Sect. 6.

2 Problem formulations

In this section, we first present the extensive and recursive formulations of multistage
optimization. Then we characterize the properties of the value functions, with examples
to show that they may fail to be Lipschitz continuous. With this motivation in mind,
we propose a penalty reformulation of the multistage problem through regularization
of value functions and show that it is equivalent to the original formulation for a
broad class of problems. Finally, we propose generalized conjugacy cuts for under-
approximation of value functions.
2.1 Extensive and recursive formulation

For a multistage stochastic program, let T = (N, E) be the scenario tree, where N is the set of nodes and E is the set of edges. For each node n ∈ N, let a(n) denote the


parent node of n, C(n) denote the set of child nodes of n, and T(n) denote the subtree starting from the node n. Given a node n ∈ N, let t(n) denote the stage that the node n is in, and let T := max_{n∈N} t(n) denote the last stage of the tree T. A node in the last stage is called a leaf node; otherwise it is a non-leaf node. The set of nodes in stage t is denoted as N(t) := {n ∈ N : t(n) = t}. We use the convention that the root node of the tree is denoted as r ∈ N with t(r) = 0, so the total number of stages is T + 1. The parent node of the root node is denoted as a(r), which is a dummy node for ease of notation.
For every node n ∈ N, let F_n denote the feasibility set, in some Euclidean space, of the decision variables (x_n, y_n) of the nodal problem at node n. We refer to x_n as the state variable and y_n as the internal variable of node n. Denote the image of the projection of F_n onto the subspace of the variable x_n as X_n, which is referred to as the state space. Let x_{a(r)} = 0 serve as a dummy parameter, and thus X_{a(r)} = {0}. The nonnegative nodal cost function of the problem at node n is denoted as f_n(x_{a(n)}, y_n, x_n) and is defined on the set {(z, y, x) : z ∈ X_{a(n)}, (x, y) ∈ F_n}. We allow f_n to take the value +∞ so that indicator functions can be modeled as part of the cost. Let p_n > 0 for all n ∈ N denote the probability that node n on the scenario tree is realized. For the root node, p_r = 1. The transition probability that node m is realized conditional on its parent node n being realized is given by p_{nm} := p_m / p_n for all edges (n, m) ∈ E.
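The tree notation above can be made concrete with a small, hypothetical Python encoding; the names `parent`, `prob`, `stage`, and `p_trans` are our own illustrative choices, not part of the paper.

```python
from collections import defaultdict

# A 3-node toy tree: root r in stage 0 with two children, each realized w.p. 1/2.
parent = {"r": None, "m1": "r", "m2": "r"}   # a(n); the root has a dummy parent
prob = {"r": 1.0, "m1": 0.5, "m2": 0.5}      # p_n, with p_r = 1

children = defaultdict(list)                 # C(n), recovered from a(n)
for n, a in parent.items():
    if a is not None:
        children[a].append(n)

def stage(n):
    """t(n): the number of edges on the path from the root to n."""
    return 0 if parent[n] is None else 1 + stage(parent[n])

def p_trans(n, m):
    """Transition probability p_nm = p_m / p_n for an edge (n, m)."""
    return prob[m] / prob[n]

T = max(stage(n) for n in parent)            # last stage of the tree; here T = 1
# Conditional probabilities out of each node sum to one.
assert abs(sum(p_trans("r", m) for m in children["r"]) - 1.0) < 1e-12
```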
The multistage stochastic program considered in this paper is defined in the fol-
lowing extensive form:

v_prim := min_{(x_n, y_n) ∈ F_n, ∀n ∈ N} Σ_{n∈N} p_n f_n(x_{a(n)}, y_n, x_n).    (1)

The recursive formulation of the problem (1) is defined as


Q_n(x_{a(n)}) := min_{(x_n, y_n) ∈ F_n} { f_n(x_{a(n)}, y_n, x_n) + Σ_{m∈C(n)} p_{nm} Q_m(x_n) },    (2)

where n ∈ N and Q_n(x_{a(n)}) is the value function of node n. At a leaf node, the sum in (2) reduces to zero, as there are no child nodes (C(n) = ∅).
The problem on the right-hand side of (2) is called the nodal problem of node n. Its
objective function consists of the nodal cost function f n and the expected cost-to-go
function, which is denoted as 𝒬_n for future reference, i.e.

𝒬_n(x_n) := Σ_{m∈C(n)} p_{nm} Q_m(x_n).    (3)

To ensure that the minimum in problem (1) is well defined and finite, we make the
following very general assumption on f n and Fn throughout the paper.
Assumption 1 For every node n ∈ N , the feasibility set Fn is compact, and the
nodal cost function f_n is nonnegative and lower semicontinuous (l.s.c.). The sum Σ_{n∈N} p_n f_n is a proper function, i.e., there exist (x_n, y_n) ∈ F_n for all nodes n ∈ N such that Σ_{n∈N} p_n f_n(x_{a(n)}, y_n, x_n) < +∞.


Note that the state variable xa(n) only appears in the objective function f n of node
n, not in the constraints. Perhaps the more common way is to allow xa(n) to appear in
the constraints of node n. It is easy to see that any such constraint can be modeled by
an indicator function of (xa(n) , xn , yn ) in the objective f n .
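For intuition, the recursion (2) can be evaluated by brute force on a toy two-stage instance, with the state space discretized to a finite grid and the internal variables y_n omitted; all instance data below (`GRID`, `cost`, the tree) are hypothetical.

```python
# Toy two-stage tree: root "r" with two equally likely children "m1", "m2".
GRID = [i / 10 for i in range(11)]                  # discretized X_n = [0, 1]
children = {"r": ["m1", "m2"], "m1": [], "m2": []}
p_trans = {("r", "m1"): 0.5, ("r", "m2"): 0.5}
# Nodal cost f_n(z, x): z is the parent state, x is the node's own state.
cost = {
    "r":  lambda z, x: x,                  # root pays its own state
    "m1": lambda z, x: (x - z) ** 2,       # child 1 tracks the parent state
    "m2": lambda z, x: abs(x - 0.5) + z,   # child 2 also inherits the cost z
}

def Q(n, z):
    """Value function Q_n(z) of (2), by backward recursion over the tree."""
    return min(
        cost[n](z, x) + sum(p_trans[n, m] * Q(m, x) for m in children[n])
        for x in GRID
    )

v_prim = Q("r", 0.0)   # root value, with the dummy parent state x_{a(r)} = 0
```

For a leaf node the sum over C(n) is empty, matching the remark after (2).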

2.2 Continuity and convexity of value functions

The following proposition presents some basic properties of the value function Q n
under Assumption 1.
Proposition 1 Under Assumption 1, the value function Q n is lower semicontinuous
(l.s.c.) for all n ∈ N . Moreover, for any node n ∈ N ,
1. if f_n(z, y, x) is Lipschitz continuous in the first variable z with constant l_n, i.e. |f_n(z, y, x) − f_n(z′, y, x)| ≤ l_n‖z − z′‖ for any z, z′ ∈ X_{a(n)} and any (x, y) ∈ F_n, then Q_n is also Lipschitz continuous with constant l_n;
2. if X_{a(n)} and F_n are convex sets, and f_n and 𝒬_n are convex functions, then Q_n is also convex.
The proof is given in Sect. A.1.1. When Q_m is l.s.c. for all m ∈ C(n), the sum Σ_{m∈C(n)} p_{nm} Q_m is l.s.c. Therefore, the minimum in the definition (2) is well defined, since F_n is assumed to be compact.
If the objective function f n (xa(n) , yn , xn ) is not Lipschitz, e.g., when it involves
an indicator function of xa(n) , or equivalently when xa(n) appears in the constraint
of the nodal problem of Q n (xa(n) ), then the value function Q n may not be Lipschitz
continuous, as is shown by the following examples.
Example 1 Consider the convex nonlinear two-stage problem

v* := min_{x,z,w} { x + z : (z − 1)² + w² ≤ 1, w = x, x ∈ [0, 1] }.

The objective function and all constraints are Lipschitz continuous. The optimal objec-
tive value v ∗ = 0, and the unique optimal solution is (x ∗ , z ∗ , w ∗ ) = (0, 0, 0). At the
optimal solution, the inequality constraint is active. Note that the problem can be
equivalently written as v* = min_{0≤x≤1} x + Q(x), where Q(x) is defined on [0, 1] as Q(x) := min{ z : ∃w ∈ R s.t. (z − 1)² + w² ≤ 1, w = x } = 1 − √(1 − x²), which is not locally Lipschitz continuous at the boundary point x = 1. Therefore, Q(x) is not Lipschitz continuous on [0, 1].
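The failure of Lipschitz continuity in Example 1 can be checked numerically; the sketch below is our own illustration, evaluating Q(x) = 1 − √(1 − x²) and showing that its difference quotients blow up near x = 1.

```python
import math

def Q(x):
    # min z s.t. (z - 1)^2 + w^2 <= 1, w = x  =>  z = 1 - sqrt(1 - x^2)
    return 1.0 - math.sqrt(1.0 - x * x)

# Difference quotients toward the boundary point x = 1 grow without bound
# (roughly like sqrt(2/h)), so no finite Lipschitz constant works on [0, 1].
slopes = [(Q(1.0) - Q(1.0 - h)) / h for h in (1e-2, 1e-4, 1e-6)]
assert slopes[0] < slopes[1] < slopes[2]
assert slopes[2] > 1e2
```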
Example 2 Consider the mixed-integer linear two-stage problem

v* := min_{x,z} { 1 − 2x + z : z ≥ x, x ∈ [0, 1], z ∈ {0, 1} }.

The optimal objective value is v* = 0, and the unique optimal solution is (x*, z*) = (1, 1). Note that the problem can be equivalently written as v* = min{ 1 − 2x + Q(x) : 0 ≤ x ≤ 1 }, where the function Q(x) is defined on [0, 1] as Q(x) := min{ z ∈ {0, 1} : z ≥ x }, which equals 0 if x = 0 and 1 for all 0 < x ≤ 1, i.e., Q(x) is discontinuous at x = 0 and therefore not Lipschitz continuous on [0, 1].
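The discontinuity in Example 2 can likewise be verified numerically with a small illustrative sketch of ours:

```python
# Q(x) = min{z in {0,1} : z >= x} from Example 2, checked numerically.
def Q(x):
    return min(z for z in (0, 1) if z >= x)

assert Q(0.0) == 0
assert Q(1e-9) == 1        # jump discontinuity at x = 0

# First stage: v* = min over x in [0, 1] of 1 - 2x + Q(x), attained at x = 1.
xs = [i / 1000 for i in range(1001)]
v = min(1 - 2 * x + Q(x) for x in xs)
assert abs(v) < 1e-12      # v* = 0 at (x*, z*) = (1, 1)
```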


These examples show a major issue with the introduction of value functions Q n ,
namely Q n may fail to be Lipschitz continuous even when the original problem only
has constraints defined by Lipschitz continuous functions. This could lead to failure of
algorithms based on approximation of the value functions. In the next section, we will discuss how to circumvent this issue without compromising feasibility or optimality for a wide range of problems.

2.3 Regularization and penalty reformulation

The main idea of avoiding failure of cutting plane algorithms in multistage dynamic
programming is to use some Lipschitz continuous envelope functions to replace the
original value functions, which we refer to as regularized value functions.
To begin with, we say a function ψ : R^d → R_+ is a penalty function if ψ(x) = 0 if and only if x = 0, and the diameter of its level set lev_α(ψ) := {x ∈ R^d : ψ(x) ≤ α} approaches 0 as α → 0. In this paper, we focus on penalty functions that are locally Lipschitz continuous, for a reason that will become clear in Proposition 2.
For each node n, we introduce a new variable z n as a local variable of node n and
impose the duplicating constraint xa(n) = z n . This is a standard approach for obtaining
dual variables through relaxation (e.g. [38]). The objective function can then be written
as f n (z n , yn , xn ). Let ψn be a penalty function for node n ∈ N . The new coupling
constraint is relaxed and penalized in the objective function by σn ψn (xa(n) − z n ) for
some σn > 0. Then the DP recursion with penalization becomes
Q_n^R(x_{a(n)}) := min_{(x_n, y_n) ∈ F_n, z_n ∈ X_{a(n)}} { f_n(z_n, y_n, x_n) + σ_n ψ_n(x_{a(n)} − z_n) + Σ_{m∈C(n)} p_{nm} Q_m^R(x_n) },    (4)

for all n ∈ N, and Q_n^R is referred to as the regularized value function. By convention, X_{a(r)} = {x_{a(r)}} = {0} and therefore the penalization ψ_r(x_{a(r)} − z_r) ≡ 0 for any z_r ∈ X_{a(r)}. Since the state spaces are compact, without loss of generality we can scale the penalty functions ψ_n such that the Lipschitz constant of ψ_n on X_{a(n)} − X_{a(n)} is 1. The following proposition shows that Q_n^R is a Lipschitz continuous envelope function of Q_n for all nodes n.
Proposition 2 Suppose ψ_n is a 1-Lipschitz continuous penalty function on the compact set X_{a(n)} − X_{a(n)} for all n ∈ N. Then Q_n^R(x) ≤ Q_n(x) for all x ∈ X_{a(n)}, and Q_n^R(x) is σ_n-Lipschitz continuous on X_{a(n)}. Moreover, if the original problem (2) and the penalty functions ψ_n are convex, then Q_n^R(x) is also convex.
The key idea is that by adding the Lipschitz continuous penalty function ψ_n to the nodal problem, we can make Q_n^R(x) Lipschitz continuous even when Q_n(x) is not. The proof is given in Sect. A.1.2. The optimal value of the regularized root nodal problem
v_reg := min_{(x_r, y_r) ∈ F_r} { f_r(x_{a(r)}, y_r, x_r) + Σ_{m∈C(r)} p_{rm} Q_m^R(x_r) }    (5)


is thus an underestimate of v_prim, i.e., v_reg ≤ v_prim. For notational convenience, we also define the regularized expected cost-to-go function for each node n as:

𝒬_n^R(x_n) := Σ_{m∈C(n)} p_{nm} Q_m^R(x_n).    (6)

Definition 1 For any ε > 0, a feasible root node solution (x_r, y_r) ∈ F_r is said to be ε-optimal to the regularized problem (4) if it satisfies f_r(x_{a(r)}, y_r, x_r) + 𝒬_r^R(x_r) ≤ v_reg + ε.

Next we discuss conditions under which v_reg = v_prim and any optimal solution (x_n, y_n)_{n∈N} to the regularized problem (4) is feasible, and hence optimal, to the original problem (2). Note that by expanding Q_m^R in the regularized problem (4) for all nodes, we obtain the extensive formulation of the regularized problem:
  
v_reg = min_{(x_n, y_n) ∈ F_n, z_n ∈ X_{a(n)}, n ∈ N} Σ_{n∈N} p_n [ f_n(z_n, y_n, x_n) + σ_n ψ_n(x_{a(n)} − z_n) ].    (7)

We refer to problem (7) as the penalty reformulation and make the following assump-
tion on its exactness for the rest of the paper.
Assumption 2 We assume that the penalty reformulation (7) is exact for the given
penalty parameters σn > 0, n ∈ N , i.e., any optimal solution of (7) satisfies z n = xa(n)
for all n ∈ N .
Assumption 2 guarantees that any optimal solution of the regularized extensive formulation (7) is feasible for the original problem (1); since v_reg ≤ v_prim, it is also optimal to the original problem, and we have v_reg = v_prim. Thus the regularized value functions serve as surrogates of the original value functions without compromising the feasibility of their optimal solutions. A consequence of Assumption 2 is that the original and regularized value functions coincide at all optimal solutions, the proof of which is given in Sect. A.1.3.
Lemma 1 Under Assumption 2, any optimal solution (x_n, y_n)_{n∈N} to problem (1) satisfies Q_n^R(x_{a(n)}) = Q_n(x_{a(n)}) for all n ≠ r.
We illustrate the regularization on the examples through Fig. 1a and b. In Fig. 1a, the value function Q(x) derived in Example 1 is not Lipschitz continuous at x = 1. With ψ(x) = |x| and σ = 4/3, we obtain the regularized value function, which coincides with the original one on [0, 0.8] and is Lipschitz continuous on the entire interval [0, 1]. In Fig. 1b, the value function Q(x) derived in Example 2 is not continuous at x = 0. With ψ(x) = |x| and σ = 5, we obtain the regularized value function, which coincides with the primal one on {0} ∪ [0.2, 1] and is Lipschitz continuous on the entire interval [0, 1]. In both examples, it can be easily verified that the penalty reformulation is exact and thus preserves the optimal solutions.
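The regularized function of Fig. 1a can be reproduced numerically; the following sketch is our own grid-based inf-convolution (the names `Q_reg`, `ZS` are hypothetical), for Example 1 with ψ = |·| and σ = 4/3.

```python
import math

def Q(x):
    """Second-stage value function of Example 1 on [0, 1]."""
    return 1.0 - math.sqrt(max(0.0, 1.0 - x * x))

SIGMA = 4.0 / 3.0
ZS = [i / 10000 for i in range(10001)]        # fine grid for the inner min

def Q_reg(x):
    """Q^R(x) = min_z { Q(z) + sigma * |x - z| }, cf. (4) with psi = |.|."""
    return min(Q(z) + SIGMA * abs(x - z) for z in ZS)

# Coincides with Q on [0, 0.8], where |Q'| <= 4/3 ...
assert all(abs(Q_reg(x) - Q(x)) < 1e-4 for x in (0.0, 0.3, 0.6, 0.8))
# ... and is linear with slope 4/3 beyond 0.8, hence Lipschitz on all of [0, 1].
assert abs(Q_reg(1.0) - (Q(0.8) + SIGMA * 0.2)) < 1e-4
```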
We comment that Assumption 2 can hold for appropriately chosen penalty factors
in various mixed-integer nonlinear optimization problems, including


Fig. 1 Value functions in Examples 1 and 2

– convex problems with interior feasible solutions,
– problems with finite state spaces,
– problems defined by mixed-integer linear functions, and
– problems defined by continuously differentiable functions,
if a certain constraint qualification is satisfied and proper penalty functions are chosen.
We refer the readers to Sect. B in the Appendix for detailed discussions. We emphasize
that Assumption 2 should be interpreted as a restriction on the MS-MINLP problem
class studied in this paper. Namely all problem instances in our discussion must satisfy
Assumption 2 with a given set of penalty functions ψn and penalty parameters σn ,
while they can have other varying problem data, such as the number of stages T and the characteristics of the state spaces X_n. In general, it is possible that a uniform choice
of σn needs to grow with T to satisfy the assumption.

2.4 Generalized conjugacy cuts and value function approximation

In this part, we first introduce generalized conjugacy cuts for nonconvex functions and then apply them to the under-approximation of value functions of MS-MINLP.

2.4.1 Generalized conjugacy cuts

Let Q : X → R_+ ∪ {+∞} be a proper, l.s.c. function defined on a compact set X ⊆ R^d. Let U be a nonempty set for dual variables. Given a continuous function Φ : X × U → R, the Φ-conjugate of Q (see e.g., Chapter 11-L in [32]) is defined as

Q^Φ(u) = max_{x∈X} { Φ(x, u) − Q(x) }.    (8)

The following generalized Fenchel–Young inequality holds by definition for any x ∈ X and u ∈ U:

Q(x) + Q^Φ(u) ≥ Φ(x, u).


For any û ∈ U and an associated maximizer x̂ in (8), we define

C^Φ(x | û, v̂) := v̂ + Φ(x, û),    (9)

where v̂ := −Q^Φ(û). Then the following inequality, derived from the generalized Fenchel–Young inequality, is valid for any x ∈ X:

Q(x) ≥ C^Φ(x | û, v̂),    (10)

which we call a generalized conjugacy cut for the target function Q.
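The validity (10) can be illustrated with a small numeric sketch for a nonconvex Q; the function, the point x̄, and the dual pair below are our own hand-picked data, not taken from the paper.

```python
# X = [0, 1] discretized; Phi(x, (lam, rho)) = -lam*(xbar - x) - rho*|xbar - x|.
X = [i / 1000 for i in range(1001)]
Q = lambda x: (x * x - 0.5) ** 2             # a nonconvex target function
xbar, lam, rho = 0.25, 0.0, 2.0              # a fixed dual point u_hat

Phi = lambda x: -lam * (xbar - x) - rho * abs(xbar - x)
Q_conj = max(Phi(x) - Q(x) for x in X)       # Phi-conjugate (8) at u_hat
cut = lambda x: -Q_conj + Phi(x)             # the cut C^Phi(x | u_hat, v_hat)

# Generalized Fenchel-Young: the cut under-estimates Q everywhere on X,
# and it touches Q at a maximizer of (8).
assert all(Q(x) >= cut(x) - 1e-12 for x in X)
assert min(Q(x) - cut(x) for x in X) < 1e-9
```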

2.4.2 Value function approximation

The generalized conjugacy cuts can be used in the setting of an augmented Lagrangian dual [1] with bounded dual variables. For a nodal problem n ∈ N, n ≠ r, and a point x̄ ∈ X_{a(n)}, define Φ_n^x̄(x, u) := −⟨λ, x̄ − x⟩ − ρψ_n(x̄ − x), where u := (λ, ρ) ∈ R^{d_n + 1} are parameters. Consider a compact set of parameters U_n = {(λ, ρ) : ‖λ‖_* ≤ l_{n,λ}, 0 ≤ ρ ≤ l_{n,ρ}} with nonnegative bounds l_{n,λ} and l_{n,ρ}, where ‖·‖_* is the dual norm of ‖·‖. Consider the following dual problem

v̂_n := max_{(λ,ρ)∈U_n} min_{z∈X_{a(n)}} [ Q_n(z) + ⟨λ, x̄ − z⟩ + ρψ_n(x̄ − z) ].    (11)

Denote by ẑ_n and (λ̂_n, ρ̂_n) an optimal primal-dual solution of (11). The dual problem (11) can be viewed as choosing (λ̂_n, ρ̂_n) as the value of û in (9), which makes the constant term −Q^Φ(û) as large as possible, and thus makes the generalized conjugacy cut (10) as tight as possible. With this choice of the parameters, a generalized conjugacy cut for Q_n at x̄ is given by

Q_n(x) ≥ C_n^{Φ_n^x̄}(x | λ̂_n, ρ̂_n, v̂_n) = −⟨λ̂_n, x̄ − x⟩ − ρ̂_n ψ_n(x̄ − x) + v̂_n, ∀x ∈ X_{a(n)}.    (12)

Proposition 3 Given the above definition of (11)–(12), if (x̄_n, ȳ_n)_{n∈N} is an optimal solution to problem (1) and the bound l_{n,ρ} satisfies l_{n,ρ} ≥ σ_n for all nodes n, then for every node n, the generalized conjugacy cut (12) is tight at x̄_n, i.e. Q_n(x̄_n) = C_n^{Φ_n^{x̄_n}}(x̄_n | λ̂_n, ρ̂_n, v̂_n).
The proof is given in Sect. A.1.4. The proposition guarantees that, under Assumption 2,
the generalized conjugacy cuts are able to approximate the value functions exactly at
any state associated with an optimal solution.
In the special case where problem (2) is convex and ψ_n(x) = ‖x‖ for all n ∈ N, the exactness of the generalized conjugacy cut holds even if we set l_{n,ρ} = 0, i.e., the conjugacy cut is linear. To be precise, we begin with the following lemma.
Lemma 2 Let X ⊂ R^d be a convex, compact set. Given a convex, proper, l.s.c. function Q : X → R ∪ {+∞}, for any x ∈ X, the inf-convolution satisfies

Q □ (σ‖·‖)(x) := min_{z∈X} { Q(z) + σ‖x − z‖ } = max_{‖λ‖_* ≤ σ} min_{z∈X} { Q(z) + ⟨λ, x − z⟩ }.    (13)

The proof is given in Sect. A.1.5. Next we show the tightness in the convex case, similar to Proposition 3. The proof is given in Sect. A.1.6.

Proposition 4 Suppose (2) is convex and ψ_n(x) = ‖x‖ for all nodes n. Given the above definition of (11)–(12), if (x̄_n, ȳ_n)_{n∈N} is an optimal solution to problem (1) and the bounds satisfy l_{n,λ} ≥ σ_n, l_{n,ρ} = 0 for all nodes n, then for every node n, the generalized conjugacy cut (12) is exact at x̄_n, i.e. Q_n(x̄_n) = C_n^{Φ_n^{x̄_n}}(x̄_n | λ̂_n, ρ̂_n, v̂_n).

In this case, the generalized conjugacy reduces to the usual conjugacy for convex
functions and the generalized conjugacy cut is indeed linear. This enables approxima-
tion of the value function that preserves convexity.
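Lemma 2 can be sanity-checked numerically on the convex Q of Example 1; in the sketch below, both sides of (13) are approximated by grids whose sizes and tolerances are our own choices.

```python
import math

Q = lambda z: 1.0 - math.sqrt(max(0.0, 1.0 - z * z))   # convex, from Example 1
SIGMA = 4.0 / 3.0
Z = [i / 2000 for i in range(2001)]                    # grid for X = [0, 1]
LAMS = [l * SIGMA / 100 for l in range(-100, 101)]     # grid for |lam| <= sigma

def inf_conv(x):
    """Left-hand side of (13): min_z Q(z) + sigma*|x - z|."""
    return min(Q(z) + SIGMA * abs(x - z) for z in Z)

def dual(x):
    """Right-hand side of (13): best linear bound over |lam| <= sigma."""
    return max(min(Q(z) + lam * (x - z) for z in Z) for lam in LAMS)

for x in (0.0, 0.5, 0.9, 1.0):
    assert abs(inf_conv(x) - dual(x)) < 2e-2           # the two sides agree
```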

Remark 1 This proposition can be generalized to special nonconvex problems where Q is extensible to a Lipschitz continuous convex function defined on the convex hull conv X. This is true, for instance, if X is the finite set of extreme points of a polytope, e.g., {0, 1}^d. The above discussion provides an alternative explanation of the exactness of the Lagrangian cuts in SDDiP [38] assuming relatively complete recourse.

3 Nested decomposition and dual dynamic programming algorithms

In this section, we introduce a nested decomposition algorithm for general scenario trees, and two dual dynamic programming algorithms for stagewise independent scenario trees. Since the size of the scenario tree could be large, we focus our attention on finding an ε-optimal root node solution x_r^* (see Definition 1), rather than an optimal solution {x_n^*}_{n∈T} for the entire tree.

3.1 Subproblem oracles

Before we propose the new algorithms, we first define subproblem oracles, which we
will use to describe the algorithms and conduct complexity analysis. A subproblem
oracle is an oracle that takes subproblem information together with the current algo-
rithm information to produce a solution to the subproblem. With subproblem oracles,
we can describe the algorithms consistently regardless of convexity.
We assume three subproblem oracles in this paper, corresponding to the forward
steps and backward steps of non-root nodes, and the root node step in the algorithms.
For non-root nodes, we assume the following two subproblem oracles.

Definition 2 (Forward Step Subproblem Oracle for Non-Root Nodes) Consider the following subproblem for a non-root node n:

min_{(x,y)∈F_n, z∈X_{a(n)}} { f_n(z, y, x) + σ_n ψ_n(x_{a(n)} − z) + Θ_n(x) },    (F)


where the parent node's state variable x_{a(n)} ∈ X_{a(n)} is a given parameter and Θ_n : X_n → R̄ is an l.s.c. function representing an under-approximation of the expected cost-to-go function 𝒬_n(x) defined in (3). The forward step subproblem oracle finds an optimal solution of (F) given x_{a(n)} and Θ_n; that is, we denote this oracle as a mapping O_n^F that takes (x_{a(n)}, Θ_n) as input and outputs an optimal solution (x_n, y_n, z_n) of (F) for n ≠ r.
Recall that the values σ_n for all n ∈ N in (F) are the chosen penalty parameters that satisfy Assumption 2. In view of Propositions 3 and 4, for the dual variable set U_n := {(λ, ρ) : ‖λ‖_* ≤ l_{n,λ}, 0 ≤ ρ ≤ l_{n,ρ}} used in the next definition, we set l_{n,λ} ≥ σ_n and l_{n,ρ} = 0 in the convex case with ψ_n = ‖·‖, or l_{n,ρ} ≥ σ_n otherwise.
Definition 3 (Backward Step Subproblem Oracle for Non-Root Nodes) Consider the following subproblem for a non-root node n:

max_{(λ,ρ)∈U_n} min_{(x,y)∈F_n, z∈X_{a(n)}} { f_n(z, y, x) + ⟨λ, x_{a(n)} − z⟩ + ρψ_n(x_{a(n)} − z) + Θ_n(x) },    (B)

where the parent node's state variable x_{a(n)} ∈ X_{a(n)} is a given parameter and Θ_n : X_n → R̄ is an l.s.c. function representing an under-approximation of the expected cost-to-go function. The backward step subproblem oracle finds an optimal solution of (B) for the given x_{a(n)} and Θ_n. Similarly, we denote this oracle as a mapping O_n^B that takes (x_{a(n)}, Θ_n) as input and outputs an optimal solution (x_n, y_n, z_n; λ_n, ρ_n) of (B) for n ≠ r.
For the root node, we assume the following subproblem oracle.
Definition 4 (Subproblem Oracle for the Root Node) Consider the following subproblem for the root node r ∈ N:

min_{(x,y)∈F_r} { f_r(x_{a(r)}, y, x) + Θ_r(x) },    (R)

where Θ_r : X_r → R̄ is an l.s.c. function representing an under-approximation of the expected cost-to-go function. The subproblem oracle for the root node is denoted by O_r; it takes Θ_r as input and outputs an optimal solution (x_r, y_r) of (R) for the given function Θ_r.

These subproblem oracles O_n^F, O_n^B (including the parameters σ_n, l_{n,λ}, and l_{n,ρ} for all n ≠ r) and O_r will be given as inputs to the algorithms. They may return any optimal solution to the corresponding nodal subproblem. In numerical implementations, they are usually handled by subroutines or external solvers.
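For intuition, a forward-step oracle in the sense of Definition 2 can be mimicked by brute-force enumeration on a tiny 1-d instance; everything below (the grid, the cost `f`, the omission of the internal variable y) is a hypothetical illustration of ours, not the paper's implementation.

```python
GRID = [i / 100 for i in range(101)]     # both X_n and X_{a(n)} = [0, 1]

def forward_oracle(x_parent, theta, f, sigma, psi):
    """Return a minimizer (x, z) of (F): f(z, x) + sigma*psi(x_parent - z) + theta(x)."""
    return min(
        ((x, z) for x in GRID for z in GRID),
        key=lambda xz: f(xz[1], xz[0]) + sigma * psi(x_parent - xz[1]) + theta(xz[0]),
    )

# Toy nodal cost: track the parent state, with a small bias on the own state;
# theta = 0 plays the role of the trivial under-approximation of the cost-to-go.
f = lambda z, x: (x - z) ** 2 + 0.1 * x
x, z = forward_oracle(0.3, lambda x: 0.0, f, 1.0, abs)
assert z == 0.3                          # the penalty pulls the copy z to x_parent
```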

3.2 Under- and over-approximations of cost-to-go functions

We first show how to iteratively construct under-approximations of the expected cost-to-go functions using the generalized conjugacy cuts developed in Sect. 2.4. The under-approximation serves as a surrogate of the true cost-to-go function in the algorithm.


Let i ∈ N be the iteration index of an algorithm. Assume (x_n^i, y_n^i)_{n∈N} are feasible solutions to the regularized nodal problem (4) in the i-th iteration. Then the under-approximation of the expected cost-to-go function is defined recursively from leaf nodes to the root node, and inductively for i ∈ N as

    Q_n^i(x) := max{ Q_n^{i−1}(x),  Σ_{m∈C(n)} p_{nm} C_m^i(x | λ̂_m^i, ρ̂_m^i, v_m^i) },  ∀ x ∈ X_n,    (14)

where Q_n^0 ≡ 0 on X_n. In the definition (14), C_m^i is the generalized conjugacy cut for Q_m at the i-th iteration with Φ_{mn}^{x_n^i}(x, λ, ρ) = −⟨λ, x_n^i − x⟩ − ρ ψ_m(x_n^i − x) (cf. (11)–(12)), that is,

    C_m^i(x | λ̂_m^i, ρ̂_m^i, v_m^i) := −⟨λ̂_m^i, x_n^i − x⟩ − ρ̂_m^i ψ_m(x_n^i − x) + v_m^i,    (15)

where (x̂_m^i, ŷ_m^i, ẑ_m^i; λ̂_m^i, ρ̂_m^i) = O_m^B(x_n^i, Q_m^i), and v_m^i satisfies

    v_m^i = f_m(ẑ_m^i, ŷ_m^i, x̂_m^i) + ⟨λ̂_m^i, x_n^i − ẑ_m^i⟩ + ρ̂_m^i ψ_m(x_n^i − ẑ_m^i) + Q_m^i(x̂_m^i).    (16)

The next proposition shows that Q_n^i is indeed an under-approximation of Q_n; the proof is given in Sect. A.2.1.

Proposition 5 For any n ∈ N and i ∈ N, Q_n^i(x) is (Σ_{m∈C(n)} p_{nm}(l_{m,λ} + l_{m,ρ}))-Lipschitz continuous and

    Q_n(x) ≥ Q_n^i(x),  ∀ x ∈ X_n.
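As a concrete illustration, the cut formula (15) and the update (14) can be sketched in a few lines of Python. This is our own minimal sketch (the names `conjugacy_cut` and `under_approx` are not from the paper), with the Euclidean norm standing in for a generic penalty ψ_m:

```python
import numpy as np

def conjugacy_cut(x, x_i, lam, rho, v, psi=np.linalg.norm):
    """Generalized conjugacy cut (15): -<lam, x_i - x> - rho*psi(x_i - x) + v."""
    x, x_i, lam = map(np.asarray, (x, x_i, lam))
    return float(-np.dot(lam, x_i - x) - rho * psi(x_i - x) + v)

def under_approx(x, prev, cuts, probs):
    """Update (14): max of the previous under-approximation and the
    probability-weighted sum of the children's cuts."""
    return max(prev(x), sum(p * c(x) for p, c in zip(probs, cuts)))
```

At the generating state x = x_n^i the cut value is exactly v_m^i (both penalty terms vanish), and away from it the cut decreases at rate at most l_{m,λ} + l_{m,ρ}, consistent with the Lipschitz constant in Proposition 5.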

Now, we propose the following over-approximation of the regularized expected cost-to-go functions, which is used in the sampling and termination of the proposed nested decomposition and dual dynamic programming algorithms. For i ∈ N, at the root node r, let (x_r^i, y_r^i) = O_r(Q_r^{i−1}), and, at each non-root node n, let (x_n^i, y_n^i, z_n^i) = O_n^F(x_{a(n)}^i, Q_n^{i−1}). Then the over-approximation of the regularized expected cost-to-go function is defined recursively, from leaf nodes to the child nodes of the root node, and inductively for i ∈ N by
    Q̄_n^i(x) :=
      conv{ Q̄_n^{i−1}(x),  Σ_{m∈C(n)} p_{nm} ( v̄_m^i + σ_m ‖x − x_n^i‖ ) },   if (4) is convex,
      min { Q̄_n^{i−1}(x),  Σ_{m∈C(n)} p_{nm} ( v̄_m^i + σ_m ‖x − x_n^i‖ ) },   otherwise,    (17)

where Q̄_n^0 ≡ +∞ for any non-leaf node n ∈ N, Q̄_n^i ≡ 0 for any iteration i ∈ N and any leaf node n, and v̄_m^i satisfies

    v̄_m^i = f_m(z_m^i, y_m^i, x_m^i) + σ_m ψ_m(x_n^i − z_m^i) + Q̄_m^i(x_m^i).    (18)

Here, the operation conv{f, g} forms the convex hull of the union of the epigraphs of any continuous functions f and g defined on the space R^d. More precisely, using convex conjugacy, we define

    conv{f, g}(x) := (min{f(x), g(x)})**  =  sup_{λ∈R^d} inf_{z∈R^d} { min{f(z), g(z)} + ⟨λ, x − z⟩ }.    (19)

The key idea behind the upper bound function (17) is to exploit the Lipschitz continuity of the regularized value function Q_m^R(x). In particular, it follows from induction that v̄_m^i is an upper bound on Q_m^R(x_n^i), and then, by the σ_m-Lipschitz continuity of Q_m^R(x), we have v̄_m^i + σ_m‖x − x_n^i‖ ≥ Q_m^R(x_n^i) + σ_m‖x − x_n^i‖ ≥ Q_m^R(x) for all x ∈ X_n. The next proposition summarizes this property, with the proof given in Sect. A.2.2.
Proposition 6 For any non-root node n ∈ N and i ≥ 1, Q̄_n^i(x) is (Σ_{m∈C(n)} p_{nm} σ_m)-Lipschitz continuous. Moreover, we have v̄_m^i ≥ Q_m^R(x_n^i) for any node m ∈ C(n) and thus

    Q̄_n^i(x) ≥ Q_n^R(x),  ∀ x ∈ X_n.
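The non-convex branch of (17) admits a similarly short sketch (the helper name is ours; the convex branch would additionally require the convex hull operation (19)):

```python
import numpy as np

def over_approx_update(x, prev, x_i, children):
    """Second case of (17): min of the previous over-approximation and the
    weighted Lipschitz majorant built at the forward solution x_i.
    children: iterable of (p_m, vbar_m, sigma_m)."""
    dist = np.linalg.norm(np.asarray(x, float) - np.asarray(x_i, float))
    new = sum(p * (vbar + sigma * dist) for p, vbar, sigma in children)
    return min(prev(x), new)
```

At x = x_n^i the new majorant equals Σ_m p_{nm} v̄_m^i, and by Proposition 6 it stays above Q_n^R everywhere on X_n.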

3.3 A nested decomposition algorithm for general trees

We first propose a nested decomposition algorithm in Algorithm 1 for a general scenario tree. In each iteration i, Algorithm 1 carries out a forward step, a backward step, and a root node update step. In the forward step, the algorithm proceeds from t = 1 to T by solving all the nodal subproblems in stage t with the current under-approximations of their cost-to-go functions. After all the state variables x_n^i are obtained for nodes n ∈ N, the backward step goes from t = T back to 1. At each node n in stage t, it first updates the under-approximation of the expected cost-to-go function. Next it solves the dual problem to obtain an optimal primal-dual solution pair (x̂_n^i, ŷ_n^i, ẑ_n^i; λ̂_n^i, ρ̂_n^i), which is used to construct a generalized conjugacy cut via (15), together with the values v_n^i and v̄_n^i calculated by (16) and (18). Finally, the algorithm updates the root node solution using the updated under-approximation of the cost-to-go function and determines the new lower and upper bounds. The incumbent solution (x_r^*, y_r^*) may also be updated as the algorithm output at termination, although it is not used in later iterations.
Algorithm 1 solves the regularized problem (4) for an ε-optimal root node solution. To justify the ε-optimality of the output of the algorithm, we have the following proposition, whose proof is given in Sect. A.2.3.

Proposition 7 Given any ε > 0, if UpperBound − LowerBound ≤ ε, then the returned solution (x_r^*, y_r^*) is an ε-optimal root node solution to the regularized problem (4). In particular, if Q̄_r^i(x_r^{i+1}) − Q_r^i(x_r^{i+1}) ≤ ε for some iteration index i, then UpperBound − LowerBound ≤ ε and Algorithm 1 terminates after the i-th iteration.

Algorithm 1 A nested decomposition algorithm for a general tree

Require: scenario tree T = (N, E) with subproblem oracles O_r, O_n^F, O_n^B, n ≠ r, and ε > 0
Ensure: an ε-optimal root node solution (x_r^*, y_r^*) to the regularized problem (4)
1: Initialize: i ← 1; Q_n^0 ← 0, Q̄_n^0 ← +∞ ∀ n : C(n) ≠ ∅ and Q̄_n^0 ← 0 ∀ n : C(n) = ∅
2: Evaluate (x_r^1, y_r^1) = O_r(0)
3: Set LowerBound ← f_r(x_{a(r)}, y_r^1, x_r^1), UpperBound ← +∞
4: while UpperBound − LowerBound > ε do
5:   for t = 1, . . . , T − 1 do    ▷ i-th forward step
6:     for n ∈ N(t) do
7:       Evaluate (x_n^i, y_n^i, z_n^i) = O_n^F(x_{a(n)}^i, Q_n^{i−1})
8:     end for
9:   end for
10:  for t = T, . . . , 1 do    ▷ i-th backward step
11:    for n ∈ N(t) do
12:      Update Q_n^i and Q̄_n^i using (14) and (17)
13:      Evaluate (x̂_n^i, ŷ_n^i, ẑ_n^i; λ̂_n^i, ρ̂_n^i) = O_n^B(x_{a(n)}^i, Q_n^i)
14:      Calculate C_n^i, v_n^i, and v̄_n^i using (15), (16), and (18)
15:    end for
16:  end for
17:  Update Q_r^i and Q̄_r^i using (14) and (17)    ▷ root node update
18:  Evaluate (x_r^{i+1}, y_r^{i+1}) = O_r(Q_r^i)
19:  Update LowerBound ← f_r(x_{a(r)}, y_r^{i+1}, x_r^{i+1}) + Q_r^i(x_r^{i+1})
20:  if UpperBound > f_r(x_{a(r)}, y_r^{i+1}, x_r^{i+1}) + Q̄_r^i(x_r^{i+1}) then
21:    Update UpperBound ← f_r(x_{a(r)}, y_r^{i+1}, x_r^{i+1}) + Q̄_r^i(x_r^{i+1})
22:    Set (x_r^*, y_r^*) = (x_r^{i+1}, y_r^{i+1})
23:  end if
24:  i ← i + 1
25: end while
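To see how the forward step, backward step, and bound updates of Algorithm 1 fit together, the following self-contained toy is helpful. It is entirely our own construction, not the paper's implementation: a two-stage problem with a one-dimensional state, true cost-to-go Q(x) = |x − 0.5| (so L = 1), linear cuts for the under-approximation (the convex case with ρ = 0), and min-type Lipschitz majorants for the over-approximation (the second case of (17), which is valid, if less tight, for convex problems). A grid search stands in for the root subproblem oracle.

```python
import numpy as np

def toy_nested_decomposition(eps=1e-3, max_iter=100):
    """Two-stage toy run of the forward/backward pattern in Algorithm 1.

    True cost-to-go: Q(x) = |x - 0.5|, 1-Lipschitz on the state space [0, 1].
    Returns (root solution, final gap, iterations used)."""
    L = 1.0
    Q = lambda x: abs(x - 0.5)                 # exact second-stage value function
    grid = np.linspace(0.0, 1.0, 1001)         # grid search stands in for the root oracle
    cuts, majorants = [], []                   # (value, subgradient, center), (value, center)
    for i in range(max_iter):
        under = lambda x: max([0.0] + [v + g * (x - c) for v, g, c in cuts])
        over = lambda x: min([float("inf")] + [v + L * abs(x - c) for v, c in majorants])
        vals = [under(x) for x in grid]
        x_i = float(grid[int(np.argmin(vals))])  # forward step: minimize the lower model
        lower, upper = min(vals), over(x_i)      # LowerBound / UpperBound updates
        if upper - lower <= eps:
            return x_i, upper - lower, i
        g = 1.0 if x_i > 0.5 else -1.0           # backward step: subgradient cut at x_i
        cuts.append((Q(x_i), g, x_i))            # linear cut, cf. (15) with rho = 0
        majorants.append((Q(x_i), x_i))          # Lipschitz majorant, cf. (17)
    return x_i, upper - lower, max_iter
```

On this instance the gap closes after a handful of iterations, with the root solution converging to the minimizer x = 0.5.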

3.4 A deterministic sampling dual dynamic programming algorithm

Starting from this subsection, we turn our attention to stagewise independent stochastic problems, defined by the following assumption.

Assumption 3 For any t = 1, . . . , T − 1 and any n, n′ ∈ N(t), the state space, the transition probabilities, and the data associated with the child nodes C(n) and C(n′) are identical. In particular, this implies Q_n(x) = Q_{n′}(x) =: Q_t(x) for all x ∈ X_n = X_{n′} =: X_t ⊆ R^{d_t}.

We write n ∼ n′ for n, n′ ∈ N(t), t = 1, . . . , T − 1, if the nodes n, n′ are defined by identical data. We then use Ñ(t) := N(t)/∼ to denote the set of nodes defined by distinct data in stage t, with size N_t := |Ñ(t)|, for all t = 1, . . . , T − 1; that is, Ñ := ∪_{t=0}^{T} Ñ(t) forms a recombining scenario tree [39]. For each node m ∈ Ñ(t), we write p_{t−1,m} := p_{nm} for any n ∈ Ñ(t − 1), since p_{n,m} = p_{n′,m} for any n, n′ ∈ Ñ(t − 1). Due to stagewise independence, it suffices to keep track of the state of each stage in the algorithm, instead of the state of each node. To be consistent, we also denote the root node solution as (x_0^i, y_0^i) for i ∈ N. We present the algorithm in Algorithm 2.
Similar to Algorithm 1, each iteration of Algorithm 2 consists of a forward step, a backward step, and a root node update step. In particular, at a node n ∈ Ñ(t) with t < T, the forward step proceeds to a child node m ∈ Ñ(t + 1) whose approximation gap γ_m^i := Q̄_{t+1}^{i−1}(x_m^i) − Q_{t+1}^{i−1}(x_m^i) is among the largest of all the approximation gaps at the states x_{m′}^i of nodes m′ ∈ Ñ(t + 1). The state variable of this node m is then taken as the state variable of stage t(m) in iteration i. Due to stagewise independence, the backward step at each stage t only needs to generate cuts for the nodes in the recombining tree Ñ. The optimality of the returned solution (x_0^*, y_0^*) is guaranteed by Proposition 7.

Algorithm 2 Deterministic sampling dual dynamic programming algorithm

Require: recombining scenario tree Ñ with subproblem oracles O_r, O_n^F, O_n^B, n ≠ r, and ε > 0
Ensure: an ε-optimal root node solution (x_0^*, y_0^*) to the regularized problem (4)
1: Initialize: i ← 1; Q_t^0 ← 0 ∀ t; Q̄_t^0 ← +∞ ∀ t ≤ T − 1 and Q̄_T^0 ← 0
2: Evaluate (x_0^1, y_0^1) = O_r(0)
3: Set LowerBound ← f_r(x_{a(r)}, y_0^1, x_0^1), UpperBound ← +∞
4: while UpperBound − LowerBound > ε do
5:   for t = 1, . . . , T − 1 do    ▷ i-th forward step
6:     for n ∈ Ñ(t) do
7:       Evaluate (x_n^i, y_n^i, z_n^i) = O_n^F(x_{t−1}^i, Q_t^{i−1})
8:       Calculate the gap γ_n^i = Q̄_t^{i−1}(x_n^i) − Q_t^{i−1}(x_n^i)
9:     end for
10:    Select any n*(t) ∈ {n ∈ Ñ(t) : γ_n^i ≥ γ_{n′}^i, ∀ n′ ∈ Ñ(t)}, and let x_t^i ← x_{n*(t)}^i
11:  end for
12:  for t = T, . . . , 1 do    ▷ i-th backward step
13:    Update Q_t^i and Q̄_t^i using (14) and (17)
14:    for n ∈ Ñ(t) do
15:      Evaluate (x̂_n^i, ŷ_n^i, ẑ_n^i; λ̂_n^i, ρ̂_n^i) = O_n^B(x_{t−1}^i, Q_t^i)
16:      Calculate C_n^i, v_n^i, v̄_n^i using (15), (16), and (18)
17:    end for
18:  end for
19:  Update Q_0^i and Q̄_0^i using (14) and (17)    ▷ root node update
20:  Evaluate (x_0^{i+1}, y_0^{i+1}) = O_r(Q_0^i)
21:  Update LowerBound ← f_r(x_{a(r)}, y_0^{i+1}, x_0^{i+1}) + Q_0^i(x_0^{i+1})
22:  if UpperBound > f_r(x_{a(r)}, y_0^{i+1}, x_0^{i+1}) + Q̄_0^i(x_0^{i+1}) then
23:    Update UpperBound ← f_r(x_{a(r)}, y_0^{i+1}, x_0^{i+1}) + Q̄_0^i(x_0^{i+1})
24:    Set (x_0^*, y_0^*) = (x_0^{i+1}, y_0^{i+1})
25:  end if
26:  i ← i + 1
27: end while
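The forward-step selection rule (line 10) simply picks a stage-t node attaining the largest approximation gap γ_n^i, computed from the current over- and under-approximations. A minimal sketch (the helper name is ours):

```python
def select_node(gaps):
    """Line 10 of Algorithm 2: return a node n*(t) maximizing the gap
    gamma_n^i = over_approx(x_n^i) - under_approx(x_n^i).

    gaps: mapping from node identifier to its approximation gap."""
    return max(gaps, key=gaps.get)
```

Ties may be broken arbitrarily, as the algorithm allows "any" maximizer.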


3.5 A stochastic sampling dual dynamic programming algorithm

Now we present a stochastic dual dynamic programming algorithm, which uses stochastic sampling rather than deterministic sampling. Instead of traversing the scenario tree to find a path with the largest approximation gap, the stochastic sampling algorithm generates M scenario paths, for some M ≥ 1, before each iteration begins. To be precise, we introduce the following notation. Let P = ∏_{t=1}^{T} Ñ(t) denote the set of all possible scenario paths from stage 1 to stage T. A scenario path is a T-element sequence P = (n_1, . . . , n_T) ∈ P, where n_t ∈ Ñ(t) for each t = 1, . . . , T. In the i-th iteration, we sample M independent scenario paths P^i = {P^{i,1}, . . . , P^{i,M}}, and we use P_t^{i,j} to denote the t-th node in the scenario path P^{i,j}, i.e., the node in the t-th stage of the j-th scenario path in the i-th iteration, for 1 ≤ j ≤ M and 1 ≤ t ≤ T. Since in each iteration the solutions and the approximations depend on the scenario path P^{i,j}, we use two superscripts i and j for solutions and cuts, where a single superscript i was used in the deterministic sampling algorithm. In addition, for every node n ∈ Ñ(t) in some stage t, the under-approximation of the expected cost-to-go function is updated over all scenario path indices j = 1, . . . , M, with M cuts in total, i.e.,

    Q_t^i(x) := max{ Q_t^{i−1}(x),  Σ_{m∈Ñ(t+1)} p_{tm} C_m^{i,j}(x | λ̂_m^{i,j}, ρ̂_m^{i,j}, v_m^{i,j}),  1 ≤ j ≤ M },    (20)

where C_m^{i,j} is the generalized conjugacy cut generated via formula (15) with (x̂_m^{i,j}, ŷ_m^{i,j}, ẑ_m^{i,j}; λ̂_m^{i,j}, ρ̂_m^{i,j}) = O_m^B(x_t^{i,j}, Q_{t+1}^i). With these notations, the algorithm is displayed in Algorithm 3.
in Algorithm 3.
Unlike the preceding two algorithms, Algorithm 3 does not need to construct the over-approximation of the regularized value functions to select the child node to proceed with. Instead, it determines the scenario paths before the forward step starts. In the forward step, each nodal problem on the sampled scenario paths is solved. Then in the backward step, the dual problems are solved at the nodes defined by distinct data, given the parent node's state variable obtained in the forward step. The termination criterion is flexible. In the existing literature [33, 38], statistical upper bounds based on the sampled scenario paths are often used together with the lower bound to terminate the algorithm. In particular, for convex problems, if we set σ_n = +∞, which implies l_{n,λ} = +∞ in the backward step subproblem oracles O_n^B for all n ∈ N, then Algorithm 3 reduces to the usual SDDP algorithm in the literature [15, 33].

4 Upper bounds on iteration complexity of proposed algorithms

In this section, we derive upper bounds on the iteration complexity of the three proposed algorithms, i.e., bounds on the iteration index at which each algorithm terminates. These upper bounds imply convergence of the algorithms to an ε-optimal root node solution for any ε > 0.


Algorithm 3 Stochastic sampling dual dynamic programming algorithm

Require: recombining scenario tree Ñ with subproblem oracles O_r, O_n^F, O_n^B, n ≠ r
1: Initialize: i ← 1; Q_t^0 ← 0, ∀ t
2: Evaluate (x_0^1, y_0^1) = O_r(0)
3: while some stopping criterion is not satisfied do
4:   Sample M scenario paths P^i = {P^{i,1}, . . . , P^{i,M}}
5:   for j = 1, . . . , M do    ▷ i-th forward step
6:     for t = 1, . . . , T − 1 do
7:       Evaluate (x_t^{i,j}, y_t^{i,j}, z_t^{i,j}) = O_n^F(x_{t−1}^{i,j}, Q_t^{i−1})
8:     end for
9:   end for
10:  for t = T, . . . , 1 do    ▷ i-th backward step
11:    Update Q_t^i using (20)
12:    for j = 1, . . . , M do
13:      for n ∈ N(t)/∼ do
14:        Evaluate (x̂_n^{i,j}, ŷ_n^{i,j}, ẑ_n^{i,j}; λ̂_n^{i,j}, ρ̂_n^{i,j}) = O_n^B(x_{t−1}^{i,j}, Q_t^i)
15:        Calculate C_n^{i,j} and v_n^{i,j} using (15) and (16)
16:      end for
17:    end for
18:  end for
19:  Update Q_0^i using (20)    ▷ root node update
20:  Evaluate (x_0^{i+1}, y_0^{i+1}) = O_r(Q_0^i)
21:  i ← i + 1
22: end while
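The path-sampling step (line 4) can be sketched as drawing each stage's node uniformly and independently, which is consistent with the sampling scheme formalized later as Assumption 4. The function name and the fixed seed are our own choices for this sketch:

```python
import random

def sample_paths(nodes_per_stage, M, rng=None):
    """Draw M scenario paths; the t-th entry of each path is a node index
    chosen uniformly from {0, ..., N_t - 1}, independently across stages and paths."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility of the sketch
    return [tuple(rng.randrange(N_t) for N_t in nodes_per_stage)
            for _ in range(M)]
```

Each returned tuple plays the role of one path P^{i,j} = (n_1, . . . , n_T) over the recombining tree.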

4.1 Upper bound analysis on iteration complexity of Algorithm 1

In this section, we discuss the iteration complexity of Algorithm 1. We begin with the definition of a set of parameters used in the convergence analysis. Let ε > 0 denote the desired root-node optimality gap in Algorithm 1. Let δ = (δ_n)_{n∈N, C(n)≠∅} be a set of positive numbers such that ε = Σ_{n∈N, C(n)≠∅} p_n δ_n. Since ε > 0, such δ_n's clearly exist. Then, we define recursively for each non-leaf node n

    γ_n(δ) := δ_n + Σ_{m∈C(n)} p_{nm} γ_m(δ),    (21)

and γ_n(δ) = 0 for leaf nodes n. For i ∈ N, recall the approximation gap γ_n^i = Q̄_n^{i−1}(x_n^i) − Q_n^{i−1}(x_n^i) for n ∈ N. For leaf nodes, γ_n^i ≡ 0 by definition for all i ∈ N. In addition, we define the sets of indices I_n(δ) for each n ∈ N as

    I_n(δ) := { i ∈ N : γ_n^i > γ_n(δ) and γ_m^i ≤ γ_m(δ), ∀ m ∈ C(n) }.    (22)

Intuitively, the index set I_n(δ) consists of the iteration indices at which all the child nodes of n have good approximations of the expected cost-to-go function at the forward step solution, while the node n itself does not. The next lemma shows that the backward step for node n in an iteration i ∈ I_n(δ) reduces the expected cost-to-go function approximation gap at node n to no more than γ_n(δ).


Lemma 3 If an iteration index i ∈ I_n(δ), i.e., Q̄_n^{i−1}(x_n^i) − Q_n^{i−1}(x_n^i) > γ_n(δ) and Q̄_m^{i−1}(x_m^i) − Q_m^{i−1}(x_m^i) ≤ γ_m(δ) for all m ∈ C(n), then

    Q̄_n^i(x) − Q_n^i(x) ≤ γ_n(δ),  ∀ x ∈ X_n with ‖x − x_n^i‖ ≤ δ_n/(2L_n),    (23)

where L_n := Σ_{m∈C(n)} p_{nm}(l_{m,λ} + l_{m,ρ}) is determined by the input parameters.

The proof is given in Sect. A.3.1. Lemma 3 shows that an iteration being in the index set implies an improvement of the approximation in a neighborhood of the current state. In other words, each i ∈ I_n(δ) carves out a ball of radius δ_n/(2L_n) in the state space X_n such that no point in the ball can be the forward step solution of another iteration in I_n(δ). This implies that we can bound the cardinality |I_n(δ)| by the size and shape of the corresponding state space X_n.
Lemma 4 Let B = {B_{n,k} ⊂ R^{d_n}}_{1≤k≤K_n, n∈N} be a collection of balls, each with diameter D_{n,k} ≥ 0, such that X_n ⊆ ∪_{k=1}^{K_n} B_{n,k}. Then,

    |I_n(δ)| ≤ Σ_{k=1}^{K_n} ( 1 + 2 L_n D_{n,k} / δ_n )^{d_n}.
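The right-hand side of Lemma 4 is easy to evaluate for a given cover; a small helper (our own name):

```python
def covering_bound(L_n, delta_n, diameters, d_n):
    """Lemma 4: upper bound on |I_n(delta)| for a ball cover of X_n,
    sum over balls of (1 + 2*L_n*D_{n,k}/delta_n)^{d_n}."""
    return sum((1.0 + 2.0 * L_n * D / delta_n) ** d_n for D in diameters)
```

Covering X_n by a single ball of diameter D recovers the (1 + 2L_nD/δ_n)^{d_n} term used later in Theorem 2 and its corollaries; degenerate balls (D = 0) each contribute exactly 1, which is the mechanism behind the finite-state bound in Corollary 3.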

The proof is by a volume argument on the covering balls, with details given in Sect. A.3.2. We now state an upper bound on the iteration complexity of Algorithm 1.

Theorem 1 Given ε > 0, choose values δ = (δ_n)_{n∈N, C(n)≠∅} such that δ_n > 0 and Σ_{n∈N, C(n)≠∅} p_n δ_n = ε. Let B = {B_{n,k}}_{1≤k≤K_n, n∈N} be a collection of balls, each with diameter D_{n,k} ≥ 0, such that X_n ⊆ ∪_{k=1}^{K_n} B_{n,k} for n ∈ N. If Algorithm 1 terminates with an ε-optimal root node solution (x_r^*, y_r^*) at the end of the i-th iteration, then

    i ≤ Σ_{n∈N, C(n)≠∅} Σ_{k=1}^{K_n} ( 1 + 2 L_n D_{n,k} / δ_n )^{d_n}.

Proof After the i-th iteration, at least one of the following two situations must occur:

i. At the root node, it holds that Q̄_r^i(x_r^{i+1}) − Q_r^i(x_r^{i+1}) ≤ γ_r(δ), where γ_r is defined in (21).
ii. There exists a node n ∈ N such that Q̄_n^i(x_n^{i+1}) − Q_n^i(x_n^{i+1}) > γ_n(δ), but all of its child nodes satisfy Q̄_m^i(x_m^{i+1}) − Q_m^i(x_m^{i+1}) ≤ γ_m(δ), ∀ m ∈ C(n). In other words, i + 1 ∈ I_n(δ).

Note that γ_r(δ) = δ_r + Σ_{m∈C(r)} p_{rm} γ_m(δ) = · · · = Σ_{n∈N, C(n)≠∅} p_n δ_n = ε. If case i occurs, then by Proposition 7, (x_r^{i+1}, y_r^{i+1}) is an ε-optimal root node solution. Case ii can occur at most Σ_{n∈N} |I_n(δ)| times by Lemma 4. Therefore, we have

    i ≤ Σ_{n∈N, C(n)≠∅} Σ_{k=1}^{K_n} ( 1 + 2 L_n D_{n,k} / δ_n )^{d_n}

when the algorithm terminates. □




Theorem 1 implies the ε-convergence of the algorithm for any ε > 0. We remark that the form of the upper bound depends on the values δ and the covering balls B_{n,k}; the right-hand side can therefore be tightened to the infimum over all possible choices. While it may be difficult to find the best bound in general, in the next section we make specific choices of δ and B and simplify the complexity upper bound, based on the stagewise independence assumption.

4.2 Upper bound analysis on iteration complexity of Algorithm 2

Before giving the iteration complexity bound for Algorithm 2, we slightly adapt the notation of the previous section to the stagewise independent scenario tree. We take the values δ = (δ_n)_{n∈Ñ, C(n)≠∅} such that δ_n = δ_{n′} for all n, n′ ∈ Ñ(t), t = 1, . . . , T. Thus we write δ_t = δ_n for any n ∈ Ñ(t), and δ_0 = δ_r. The vector of γ_t(δ) is defined recursively for non-leaf nodes as

    γ_t(δ) := γ_{t+1}(δ) + δ_t,  if t ≤ T − 1,    (24)

and γ_T(δ) = 0. Let γ_t^i := Q̄_t^{i−1}(x_t^i) − Q_t^{i−1}(x_t^i) and recall that γ_0^i := γ_r^i for each index i. The sets of indices I_t(δ) are defined for t = 0, . . . , T − 1 as

    I_t(δ) := { i ∈ N : γ_t^i > γ_t(δ) and γ_{t+1}^i ≤ γ_{t+1}(δ) }.    (25)

Note that γ_t^i = max_{n∈Ñ(t)} γ_n^i (line 10 in Algorithm 2). By Lemma 3, an iteration i ∈ I_t(δ) implies Q̄_t^i(x) − Q_t^i(x) ≤ γ_t(δ) for all x ∈ X_t with ‖x − x_t^i‖ ≤ δ_t/(2L_t), where L_t = L_n for any n ∈ Ñ(t). Moreover, since X_n = X_t for n ∈ Ñ(t), for any covering balls B_{t,k} ⊂ R^{d_t} with diameters D_{t,k} ≥ 0 such that X_t ⊆ ∪_{k=1}^{K_t} B_{t,k}, the same argument as in Lemma 4 shows that

    |I_t(δ)| ≤ Σ_{k=1}^{K_t} ( 1 + 2 L_t D_{t,k} / δ_t )^{d_t}.    (26)

We summarize the upper bound on the iteration complexity of Algorithm 2 in the next theorem, and omit the proof since it is an almost word-for-word repetition of the previous one with the notation adapted as above.
Theorem 2 Given any ε > 0, choose values δ = (δ_t)_{t=0}^{T−1} such that δ_t > 0 and Σ_{t=0}^{T−1} δ_t = ε. Let B = {B_{t,k} ⊂ R^{d_t}}_{1≤k≤K_t, 0≤t≤T−1} be a collection of balls, each with diameter D_{t,k} ≥ 0, such that X_t ⊆ ∪_{k=1}^{K_t} B_{t,k} for 0 ≤ t ≤ T − 1. If Algorithm 2 terminates with an ε-optimal root node solution (x_0^*, y_0^*) in i iterations, then

    i ≤ Σ_{t=0}^{T−1} Σ_{k=1}^{K_t} ( 1 + 2 L_t D_{t,k} / δ_t )^{d_t}.

We next discuss some special choices of the values δ and the covering ball collections B. First, since the X_t are compact, let B_t be the smallest ball containing X_t. Then diam X_t ≤ D_t ≤ 2 diam X_t, where D_t = diam B_t. Moreover, suppose L_t ≤ L for some L > 0 and d_t ≤ d for some d > 0. Then by taking δ_t = ε/T for all 0 ≤ t ≤ T − 1, we have the following bound.

Corollary 1 If Algorithm 2 terminates with an ε-optimal root node solution (x_0^*, y_0^*), then the iteration index is bounded by

    i ≤ T ( 1 + 2LDT/ε )^d,

where L, d, and D are upper bounds on L_t, d_t, and D_t, 0 ≤ t ≤ T − 1, respectively.

Proof Take δ_t = ε/T for all 0 ≤ t ≤ T − 1 and apply Theorem 2. □

Note that the iteration complexity bound in Corollary 1 grows as O(T^{d+1}) as T → ∞. Such a bound is naturally unsatisfactory, since it is nonlinear in T with a possibly very high degree d. However, by changing the optimality criterion, we next derive an iteration complexity bound that grows linearly in T, while all other parameters L, D, ε, d are independent of T.

Corollary 2 If Algorithm 2 terminates with a (T ε)-optimal root node solution (x_0^*, y_0^*), then the iteration index is bounded by

    i ≤ T ( 1 + 2LD/ε )^d,

where L, d, and D are upper bounds on L_t, d_t, and D_t, 0 ≤ t ≤ T − 1, respectively.

Proof Take δ_t = ε for all 0 ≤ t ≤ T − 1 and apply Theorem 2. □
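The contrast between the two corollaries can be checked numerically; the function names below are ours:

```python
def bound_eps_optimal(T, L, D, d, eps):
    """Corollary 1: iteration bound for an eps-optimal solution, O(T^{d+1})."""
    return T * (1.0 + 2.0 * L * D * T / eps) ** d

def bound_relative(T, L, D, d, eps):
    """Corollary 2: iteration bound for a (T*eps)-optimal solution, linear in T."""
    return T * (1.0 + 2.0 * L * D / eps) ** d
```

Doubling T exactly doubles the Corollary 2 bound, while the Corollary 1 bound grows more than 2^{d+1}-fold.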

The termination criterion in Corollary 2 corresponds to the usual relative optimality gap if the total objective is known to grow at least linearly in T, as is the case for many practical problems. Last, we consider the special case where X_t is finite for all 0 ≤ t ≤ T − 1.

Corollary 3 Suppose the cardinality |X_t| ≤ K < ∞ for all 0 ≤ t ≤ T − 1, for some positive integer K. In this case, if Algorithm 2 terminates with an ε-optimal root node solution (x_0^*, y_0^*), then the iteration index is bounded by

    i ≤ T K.


Proof Note that when X_t is finite, it can be covered by the degenerate balls B_0(x), x ∈ X_t. Thus D_{t,k} = 0 for k = 1, . . . , K_t and K_t ≤ K by assumption. Applying Theorem 2, we get i ≤ Σ_{t=0}^{T−1} Σ_{k=1}^{K_t} 1 ≤ T K. □

The bound in Corollary 3 grows linearly in T and does not depend on the value of ε. In other words, we are able to obtain exact solutions to the regularized problem (4), given the subproblem oracles.

Remark 2 All the iteration complexity bounds in Theorem 2 and Corollaries 1, 2, and 3 are independent of the number of scenario tree nodes N_t in each stage, 1 ≤ t ≤ T. This can be explained by the fact that Algorithm 2 performs 1 + N_T + 2 Σ_{t=1}^{T−1} N_t evaluations of the subproblem oracles in each iteration.

4.3 Upper bound analysis on iteration complexity of Algorithm 3

In the following we study the iteration complexity of Algorithm 3. For clarity, we model the subproblem oracles O_n^F and O_n^B as random functions that are Σ_i^oracle-measurable in each iteration i ∈ N, for any node n ≠ r, where {Σ_i^oracle}_{i=0}^{∞} is a filtration of σ-algebras in the probability space. Intuitively, this model says that the information given by Σ_i^oracle could be used to predict the outcomes of the subproblem oracles. We now make the following assumption on the sampling step.

Assumption 4 In each iteration i, the M scenario paths are sampled uniformly with replacement, independently of each other and of the outcomes of the subproblem oracles. That is, the conditional probability of the j-th sample P^{i,j} taking any scenario n_t ∈ Ñ(t) in stage t is almost surely

    Prob( P_t^{i,j} = n_t | Σ_∞^oracle, σ{P_{t′}^{i′,j′}}_{(i′,j′,t′)≠(i,j,t)} ) = 1/N_t,    (27)

where Σ_∞^oracle := ∪_{i=1}^{∞} Σ_i^oracle, and σ{P_{t′}^{i′,j′}}_{(i′,j′,t′)≠(i,j,t)} is the σ-algebra generated by the scenario samples other than the j-th sample in stage t of iteration i.


In the sampling step of the i-th iteration, let γ_t^{i,j} := Q_t^R(x_t^{i,j}) − Q_t^{i−1}(x_t^{i,j}) for any t ≤ T − 1, which is well defined by Assumption 3, and let γ̃_t^{i,j} := max{ Q_t^R(x_n) − Q_t^{i−1}(x_n) : (x_n, y_n, z_n) = O_n^F(x_{t−1}^{i,j}, Q_t^{i−1}), n ∈ Ñ(t) } for each scenario path index 1 ≤ j ≤ M. Note that, by definition, γ_t^{i,j} ≤ γ̃_t^{i,j} for any t = 1, . . . , T − 1, everywhere in the probability space. We define the sets of indices I_t(δ) for each t = 0, . . . , T − 1, similar to those in the deterministic sampling case, as

    I_t(δ) := ∪_{j=1}^{M} { i ∈ N : γ_t^{i,j} > γ_t(δ) and γ̃_{t+1}^{i,j} ≤ γ_{t+1}(δ) }.    (28)

With the same argument, the upper bound (26) on the size of I_t(δ) holds everywhere for each t = 0, . . . , T − 1. However, since the nodes in the forward steps are sampled randomly, we do not necessarily have i ∈ ∪_{t=0}^{T−1} I_t(δ) for each iteration index i ∈ N before Algorithm 3 first finds an ε-optimal root node solution. Instead, for each iteration i we define the event A^i(δ) := { i ∈ ∪_{t=0}^{T−1} I_t(δ) } ∪ ∪_{j=1}^{M} { γ_0^{i−1,j} ≤ γ_0(δ) = ε }, which means that either some approximation is improved in iteration i or the algorithm has found an ε-optimal root node solution in iteration i − 1. The next lemma estimates the conditional probability of A^i(δ) given the oracle outcomes and samplings up to iteration i. For simplicity, we define the two σ-algebras Σ_i^sample := σ{P^{i′,j′}}_{i′≤i, j′=1,...,M} and Σ_i := σ(Σ_i^oracle, Σ_i^sample) for each i.
Lemma 5 Fix any ε = Σ_{t=0}^{T−1} δ_t. Then the conditional probability inequality

    Prob( A^i(δ) | Σ_{i−1} ) ≥ ν := 1 − (1 − 1/N)^M

holds almost surely, where N := ∏_{t=1}^{T−1} N_t if T ≥ 2 and N := 1 otherwise.
The proof is given in Sect. A.3.3. Now we are ready to present the probabilistic
complexity bound of Algorithm 3, the proof of which is given in Sect. A.3.4.

Theorem 3 Let I = I(δ, B) denote the iteration complexity bound in Theorem 2, determined by the vector δ and the collection of state space covering balls B, and let ν denote the probability bound proposed in Lemma 5. Moreover, let ι be the random variable of the smallest index such that the root node solution (x_0^{ι+1}, y_0^{ι+1}) is ε-optimal in Algorithm 3. Then for any real number κ > 1, the probability

    Prob( ι ≥ 1 + κI/ν ) ≤ exp( −Iν(κ − 1)² / (16κ) ).

Remark 3 Theorem 3 shows that for a fixed problem (such that I = I(δ, B) and N = N_1 · · · N_{T−1} are fixed) and any probability threshold q ∈ (0, 1), the number of iterations needed for Algorithm 3 to find an ε-optimal root node solution with probability greater than 1 − q is O(−ln q/ν²), which does not depend on I. In particular, if we set M = 1, then the number of iterations needed is O(−N² ln q), which is exponential in the number of stages T if N_t ≥ 2 for all t = 1, . . . , T − 1. It remains unknown to us whether there exists a complexity bound for Algorithm 3 that is polynomial in T in general.
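The quantities in Lemma 5 and Theorem 3 are straightforward to compute; a small helper (our own name):

```python
import math

def sddp_tail_bound(I, N, M, kappa):
    """Return (nu, iteration threshold, probability bound) from Lemma 5 / Theorem 3:
    Prob(iota >= threshold) <= prob."""
    nu = 1.0 - (1.0 - 1.0 / N) ** M           # Lemma 5
    threshold = 1.0 + kappa * I / nu          # Theorem 3
    prob = math.exp(-I * nu * (kappa - 1.0) ** 2 / (16.0 * kappa))
    return nu, threshold, prob
```

With M = 1 and N large, ν ≈ 1/N, so the threshold scales like N · I, reflecting the exponential dependence discussed in Remark 3; increasing M drives ν toward 1.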

5 Lower bounds on iteration complexity of proposed algorithms

In this section, we discuss the sharpness of the iteration complexity bound for Algorithm 2 given in Sect. 4. In particular, we are interested in whether the number of iterations needed for Algorithm 2 to find an ε-optimal root node solution can grow linearly in T when the state spaces are infinite sets. We will see that in general this is not possible, with or without the assumption of convexity. The following lemma simplifies the discussion in this section.


Lemma 6 Suppose f_n(z, y, x) is l_n-Lipschitz continuous in z for each n ∈ N. If we choose ψ_n(x) = ‖x‖ and σ_n ≥ l_n, then Q_n^R(x) = Q_n(x) on X_{a(n)} for all non-root nodes n ∈ N.

The proof exploits the Lipschitz continuity of f_n and the fact that Q_n^R(x) is an under-approximation of Q_n(x) in an inductive argument. The details are given in Sect. A.4.1. In other words, for problems that already have Lipschitz continuous value functions, the regularization does not change the function values at any point. Thus the examples in the rest of this section serve the discussion not only for Algorithm 2, but also for more general algorithms, including SDDP and SDDiP.

5.1 General Lipschitz continuous problems

We discuss the general Lipschitz continuous case, i.e., the nodal objective functions f_n(z, y, x) are l_n-Lipschitz continuous in z but not necessarily convex. In this case we choose to approximate the value function using ψ_n(x) = ‖x‖ and assume that l_{n,ρ} ≥ l_n. We can set l_{n,λ} = 0 for all n ∈ N without loss of exactness of the approximation, by the proof of Proposition 3. We begin with the following lemma on the complexity of such approximation.

Lemma 7 Consider a norm ball X = {x ∈ R^d : ‖x‖ ≤ D/2} and a finite set of points W = {w_k}_{k=1}^{K} ⊂ X. Suppose that there are β > 0 and an L-Lipschitz continuous function f : X → R_+ such that β < f(w_k) < 2β for k = 1, . . . , K. Define

– Q(x) := max_{k=1,...,K} max{ 0, f(w_k) − L‖x − w_k‖ } and
– Q̄(x) := min_{k=1,...,K} { f(w_k) + L‖x − w_k‖ }.

If K < (DL/(4β))^d, then min_{x∈X} Q(x) = 0 and min_{x∈X} Q̄(x) > β.

The proof is given in Sect. A.4.2. The lemma shows that if the number of points in W is too small, i.e., K < (DL/(4β))^d, then the difference between the upper and lower bounds can be large, i.e., Q̄(x̄) − Q(x̄) > β for some x̄ ∈ X. In other words, to achieve a small gap between the upper and lower bounds, we need a sufficient number of sample points. This lemma is used directly to provide a lower bound on the iteration complexity of Algorithm 2.
of Algorithm 2.
Now we construct a Lipschitz continuous multistage problem defined on a chain, i.e., a scenario tree where each stage has a single node, |N(t)| = 1 for t = 1, . . . , T. The problem is given by the value functions in each stage as

    Q_r = min_{x_0∈X_r} Q_1(x_0),
    Q_t(x_{t−1}) = min_{x_t∈X_t} { f_t(x_{t−1}) + Q_{t+1}(x_t) },  1 ≤ t ≤ T − 1,    (29)
    Q_T(x_{T−1}) = f_T(x_{T−1}).

Here, for all t = 1, . . . , T, f_t : X_t → R_+ is an L-Lipschitz continuous function that satisfies β < f_t(x) < 2β for all x ∈ X_t, with β := ε/T, the number of stages T ≥ 1, and a fixed constant ε > 0. The state space X_t := B^d(D/2) ⊂ R^d is a ball with radius D/2 > 0. We remark that ε will be the optimality gap in Theorem 4. So for a fixed optimality gap ε, we construct an instance of the multistage problem (29) that will prove difficult for Algorithm 2 to solve. Note also that (29) is constructed so that no constraint couples the state variables x_t of different stages.
By Lemma 6, if we choose ψ_n(x) = ‖x‖ for all n ∈ N and l_{n,ρ} = L for problem (29), then we have Q_t^R(x) = Q_t(x) for all t = 1, . . . , T. The next theorem gives a lower bound on the iteration complexity of problem (29) with this choice of penalty functions.

Theorem 4 For any optimality gap ε > 0, there exists a problem of the form (29) with subproblem oracles O_n^F, O_n^B, n ∈ N, and O_r, such that if Algorithm 2 achieves UpperBound − LowerBound ≤ ε in the i-th iteration, then

    i ≥ ( DLT/(4ε) )^d.

The proof is given in Sect. A.4.3. The theorem shows that in general Algorithm 2 needs at least O(T^d) iterations before termination. We comment that this is due to the fact that the approximation using generalized conjugacy is tight only locally. Without convexity, one may need to visit many states covering the state space to achieve tight approximations of the value functions before the algorithm is guaranteed to find an ε-optimal solution.

5.2 Convex Lipschitz continuous problems

In the above example for general Lipschitz continuous problems, we saw that the complexity of Algorithm 2 grows at a rate of O(T^d). It remains to answer whether convexity can help us avoid this possibly undesirable growth rate in terms of d. We show that even when linear cuts, rather than generalized conjugacy cuts, are used for convex value functions, the complexity lower bound of the proposed algorithms cannot be substantially improved. We begin our discussion with a definition.

Definition 5 Given a d-sphere S^d(R) = {x ∈ R^{d+1} : ‖x‖₂ = R} with radius R > 0, a spherical cap with depth β > 0 centered at a point x ∈ S^d(R) is the set

    S_β^d(R, x) := { y ∈ S^d(R) : ⟨y − x, x⟩ ≥ −βR }.
The next lemma shows that we can put many spherical caps on a sphere, the center
of each is not contained in any other spherical cap, the proof of which is given in
Sect. A.4.4.

Lemma 8 Given a d-sphere S^d(R), d ≥ 2, and depth β < (1 − √2/2)R, there exists a
finite set of points W with

    |W| ≥ ((d² − 1)√π / d) · (Γ(d/2 + 1) / Γ(d/2 + 3/2)) · (R/(2β))^{(d−1)/2},

960 S. Zhang, X. A. Sun

such that, for any w ∈ W, S^d_β(R, w) ∩ W = {w}.
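The lower bound of Lemma 8 is easy to evaluate numerically; the sketch below (an illustration, not part of the paper) shows how the guaranteed number of cap centers grows as the depth β shrinks:

```python
import math

def cap_packing_lower_bound(d, R, beta):
    """Lemma 8 lower bound on |W| for caps of depth beta on S^d(R)."""
    assert d >= 2 and 0 < beta < (1 - math.sqrt(2) / 2) * R
    c = (d * d - 1) * math.sqrt(math.pi) / d
    g = math.gamma(d / 2 + 1) / math.gamma(d / 2 + 3 / 2)
    return c * g * (R / (2 * beta)) ** ((d - 1) / 2)

# Degree-(d-1)/2 growth in R/beta: for d = 3 the bound is linear in
# 1/beta, so halving beta doubles the guaranteed packing size.
b = cap_packing_lower_bound(d=3, R=1.0, beta=0.1)  # equals pi * 5 exactly
```

In the constructions below β is proportional to ε, so shrinking the optimality gap forces the number of distinguishable states to grow at this rate.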

Hereafter, we denote a set of points that satisfies Lemma 8 as W^d_β(R) ⊂ S^d(R).
Next we construct an L-Lipschitz convex function, for any L > 0 and ε > 0, that satisfies
certain properties on W^d_{ε/L}(R). The proof is given in Sect. A.4.5.

Lemma 9 Given positive constants ε > 0, L > 0 and a set W^d_{ε/L}(R), let
K := |W^d_{ε/L}(R)|. For any values v_k ∈ (ε/2, ε), k = 1, …, K, define a function
F : B^{d+1}(R) → R as F(x) = max_{k=1,…,K} {0, v_k + (L/R)⟨w_k, x − w_k⟩}. Then F satisfies
the following properties:

1. F is an L-Lipschitz convex function;
2. F(w_k) = v_k for all w_k ∈ W^d_{ε/L}(R);
3. F is differentiable at all w_k, with v_k + ⟨∇F(w_k), w_l − w_k⟩ < 0 for all l ≠ k;
4. For any w_l ∈ W^d_{ε/L}(R), the functions Q_l(x) := max_{k≠l} {0, v_k + ⟨∇F(w_k), x − w_k⟩} and
   Q̄_l(x) := conv_{k≠l} {v_k + L‖x − w_k‖} satisfy

    Q̄_l(w_l) − Q_l(w_l) > 3ε/2.
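A small numerical check of the construction in Lemma 9. This is an illustration under simplifying assumptions: instead of a cap packing from Lemma 8, we take W to be the axis points ±R·e_i on S^d(R), which satisfy the pairwise separation used in properties 1–2 whenever the cap depth is below R, and the values v_k are drawn at random from (ε/2, ε):

```python
import math
import random

d, R, L, eps = 2, 1.0, 1.0, 0.5
# Stand-in center set W: the 2(d+1) axis points on S^d(R).
W = []
for i in range(d + 1):
    for s in (R, -R):
        w = [0.0] * (d + 1)
        w[i] = s
        W.append(w)
random.seed(0)
v = [random.uniform(eps / 2, eps) for _ in W]  # values v_k in (eps/2, eps)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def F(x):
    """F(x) = max_k { 0, v_k + (L/R) * <w_k, x - w_k> } as in Lemma 9."""
    # <w_k, x - w_k> = <w_k, x> - R^2 since ||w_k|| = R.
    pieces = [v[k] + (L / R) * (dot(W[k], x) - R * R) for k in range(len(W))]
    return max([0.0] + pieces)

# Property 2: F interpolates the prescribed values at the centers,
# because every other affine piece is negative at w_k.
interpolates = all(abs(F(W[k]) - v[k]) < 1e-12 for k in range(len(W)))
```

Each affine piece has gradient (L/R)w_k of norm L, so F is also L-Lipschitz (property 1), which can be spot-checked on random point pairs.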

Now we present the multistage convex dual dynamic programming example based
on the following parameters: T ≥ 2 (number of stages), L > 0 (Lipschitz constant),
d ≥ 3 (state space dimension), D = 2R > 0 (state space diameter), and ε > 0
(optimality gap). Choose any L_1, …, L_T such that L/2 ≤ L_T < L_{T−1} < ⋯ <
L_1 ≤ L, and then construct finite sets W_t := W^{d−1}_{ε/((T−1)L_{t+1})}(R) = {w_{t,k}}_{k=1}^{K_t}, K_t =
|W_t|, as defined in Lemma 8 for t = 1, …, T − 1. Moreover, define convex L_{t+1}-
Lipschitz continuous functions F_t as in Lemma 9 for some values v_{t,k} ∈ (ε/(2T − 2), ε/(T − 1)),
k = 1, …, K_t, and the finite sets W_t. By Assumption 3, we define the stagewise
independent scenario tree as follows. There are K_t distinct nodes in each stage t =
1, …, T − 1, which can be denoted by an index pair n = (t, k) for k = 1, …, K_t, and
all nodes are defined by the same data in the last stage T. Then we define our problem
by specifying the nodal cost functions f_r ≡ 0, f_{1,k}(x_0, y_1, x_1) := L_1‖x_1 − w_{1,k}‖ for
k = 1, …, K_1, f_{t,k}(x_{t−1}, y_t, x_t) := F_{t−1}(x_{t−1}) + L_t‖x_t − w_{t,k}‖ for k = 1, …, K_t
and t = 2, …, T − 1, and f_{T,1}(x_{T−1}, y_T, x_T) := F_{T−1}(x_{T−1}), and state spaces
X_t = X = B^d(R). Alternatively, the value functions can be written as

    Q_{1,k} = min_{x_1∈X} { L_1‖x_1 − w_{1,k}‖ + 𝒬_1(x_1) },                      ∀ k ≤ K_1,
    Q_{t,k}(x_{t−1}) = min_{x_t∈X} { F_{t−1}(x_{t−1}) + L_t‖x_t − w_{t,k}‖ + 𝒬_t(x_t) },   k ≤ K_t,
    Q_{T,1}(x_{T−1}) = F_{T−1}(x_{T−1}),                                            (30)


where the second equation is defined for all 2 ≤ t ≤ T − 1, and the expected cost-to-go
functions as

    𝒬_t(x_t) := (1/K_{t+1}) Σ_{k=1}^{K_{t+1}} Q_{t+1,k}(x_t),   t = 0, …, T − 1.

By Lemma 8,

    K_t ≥ (((d − 1)² − 1)√π / (d − 1)) · (Γ((d − 1)/2 + 1) / Γ((d − 1)/2 + 3/2)) · (R L_{t+1}(T − 1)/(2ε))^{(d−2)/2}
        ≥ (d(d − 2)√π / (d − 1)) · (Γ(d/2 + 1/2) / Γ(d/2 + 1)) · (DL(T − 1)/(8ε))^{(d−2)/2}.

Since each value function Q_{t,k} is L_t-Lipschitz continuous, we choose σ_n = L_t
with ψ_n(x) = ‖x‖ for any n = (t, k) ∈ Ñ(t) and t = 1, …, T, so that by Lemma 6
we have Q_{t,k}(x) = Q^R_{t,k}(x) for all x ∈ X. Moreover, due to convexity, we set l_{n,ρ} = 0
for all n ∈ N and l_{n,λ} = L_t for each n ∈ Ñ(t) and t = 1, …, T, i.e., the cuts are
linear. Following the argument of Proposition 4, we know that such linear cuts are
capable of tight approximations. With this choice of regularization, we have the
following theorem on the complexity of Algorithm 2.

Theorem 5 For any optimality gap ε > 0, there exists a multistage stochastic
convex problem of the form (30) such that, if Algorithm 2 gives UpperBound −
LowerBound < ε at the i-th iteration, then

    i > (1/3) · (d(d − 2)√π / (d − 1)) · (Γ(d/2 + 1/2) / Γ(d/2 + 1)) · (DL(T − 1)/(8ε))^{(d−2)/2}.
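This bound can also be evaluated directly; the following sketch (illustrative parameter values, not from the paper) makes the degree-(d/2 − 1) growth in (T − 1)/ε visible:

```python
import math

def thm5_lower_bound(d, D, L, T, eps):
    """Iteration lower bound of Theorem 5 for the convex instance (30)."""
    c = d * (d - 2) * math.sqrt(math.pi) / (3 * (d - 1))
    g = math.gamma(d / 2 + 1 / 2) / math.gamma(d / 2 + 1)
    return c * g * (D * L * (T - 1) / (8 * eps)) ** ((d - 2) / 2)

# For d = 4 the exponent (d-2)/2 equals 1, so the bound grows linearly
# in (T-1)/eps; dimension enters only through the exponent d/2 - 1.
ratio = thm5_lower_bound(4, 2.0, 1.0, 11, 1.0) / thm5_lower_bound(4, 2.0, 1.0, 6, 1.0)
```

Doubling T − 1 at d = 4 exactly doubles the bound, in contrast with the degree-d growth of Theorem 4.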

The proof is given in Sect. A.4.6. The theorem implies that, even if problem (2)
is convex and has Lipschitz continuous value functions, the minimum number of iterations
for Algorithm 2 to obtain a guaranteed ε-optimal root node solution grows as a polynomial
in the ratio T/ε, with degree d/2 − 1.
We remark that Theorems 4 and 5 correspond to two different challenges of the
SDDP type algorithms. The first challenge is that the backward step subproblem oracle
may not give cuts that provide the desired approximation, which could happen when
the value functions are nonconvex or nonsmooth. Theorem 4 results from the worst
case that the backward step subproblem oracle leads to approximations of the value
function in the smallest neighborhood.
The second challenge is that different nodes, or more generally, different scenario
paths give different states in each stage, so sampling and solving the nodal problem on
one scenario path provides little information to the nodal problem on another scenario
path. In example (30), the linear cut obtained in each iteration does not provide any
information on the subsequent iteration states (unless the same node is sampled again).
From this perspective, we believe that unless some special structure of the problem
is exploited, any algorithm that relies on local approximation of value functions will


face the “curse of dimensionality,” i.e., the exponential growth rate of the iteration
complexity in the state space dimensions.

6 Conclusions

In this paper, we propose three algorithms in a unified framework of dual dynamic pro-
gramming for solving multistage stochastic mixed-integer nonlinear programs. The
first algorithm is a generalization of the classic nested Benders decomposition algo-
rithm, which deals with general scenario trees without the stagewise independence
property. The second and third algorithms generalize SDDP with sampling procedures
on a stagewise independent scenario tree, where the second algorithm uses a determin-
istic sampling approach, and the third one uses a randomized sampling approach. The
proposed algorithms are built on regularization of value functions, which enables them
to handle problems with value functions that are non-Lipschitzian or discontinuous.
We show that the regularized problem preserves the feasibility and optimality of the
original multistage program, when the corresponding penalty reformulation satisfies
exact penalization. The key ingredient of the proposed algorithms is a new class of cuts
based on generalized conjugacy for approximating nonconvex cost-to-go functions of
the regularized problems.
We obtain upper and lower bounds on the iteration complexity of the proposed
algorithms on MS-MINLP problem classes that allow exact Lipschitz regularization
with predetermined penalty functions and parameters. The complexity analysis is new
and deepens our understanding of the behavior of SDDP. For example, it gives the first
proof that the iteration complexity of SDDP depends polynomially on the number
of stages, not exponentially, for both convex and nonconvex multistage stochastic
programs, and that this dependence can be reduced to linear if the optimality
gap is allowed to scale linearly with the number of stages, or if all the state spaces are
finite sets. These findings resolve a conjecture of the late Prof. Shabbir Ahmed, who
inspired us to work on this problem.
Funding Open Access funding provided by the MIT Libraries. Funding was provided by the Directorate for
Engineering (Grant Number 1751747).

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Proofs

In this section, we present the proofs to the theorems, propositions, and lemmas that
are not displayed in the main text.


A.1 Proofs for statements in Sect. 2

A.1.1 Proof for Proposition 1

Proof We show that Q_n is l.s.c. by showing that the lower level sets lev_a(Q_n) = {z ∈
X_{a(n)} : Q_n(z) ≤ a} are closed for all a ∈ R. At any leaf node n, the expected cost-to-go
function 𝒬_n(x_n) is zero, thus z is in lev_a(Q_n) if and only if z is in the projection of
the set {(z, y, x) : (x, y) ∈ F_n, f_n(z, y, x) ≤ a}. Since f_n is defined on a
compact set {(z, y, x) : z ∈ X_{a(n)}, (x, y) ∈ F_n} and l.s.c. by Assumption 1, we know
that the set {(z, y, x) : f_n(z, y, x) ≤ a} is compact. Moreover, since the projection
(z, y, x) ↦ z is continuous, the image lev_a(Q_n) is still compact, hence closed.
At any non-leaf node n, suppose Q_m is l.s.c. for all its child nodes m ∈ C(n). Then 𝒬_n
is l.s.c., since 𝒬_n is defined in (3) and p_{nm} > 0 for all m. A point z ∈ lev_a(Q_n) if and
only if z is in the projection of the set {(z, y, x) : (y, x) ∈ F_n, f_n(z, y, x) + 𝒬_n(x) ≤
a}. Similarly, this shows lev_a(Q_n) is closed, since f_n and 𝒬_n are l.s.c. and the projection
(z, y, x) ↦ z is continuous. We thus conclude that Q_n is l.s.c. for every node n in the
scenario tree.
To show claims 1 and 2 in the proposition, take any two points z 1 , z 2 ∈ Xa(n) .
Suppose (x1 , y1 ), (x2 , y2 ) ∈ Fn are the corresponding minimizers in the definition (2).
Therefore, Q n (z 1 ) = f n (z 1 , y1 , x1 ) + Qn (x1 ) and Q n (z 2 ) = f n (z 2 , y2 , x2 ) + Qn (x2 ).
If f_n is Lipschitz continuous in the first variable, then we have

    Q_n(z_1) − Q_n(z_2) = f_n(z_1, y_1, x_1) + 𝒬_n(x_1) − f_n(z_2, y_2, x_2) − 𝒬_n(x_2)
        ≤ f_n(z_1, y_2, x_2) + 𝒬_n(x_2) − f_n(z_2, y_2, x_2) − 𝒬_n(x_2)
        = f_n(z_1, y_2, x_2) − f_n(z_2, y_2, x_2) ≤ l_n‖z_1 − z_2‖.

Likewise, by exchanging z_1 and z_2, we know that Q_n(z_2) − Q_n(z_1) ≤ l_n‖z_1 − z_2‖.


This proves that Q n is Lipschitz continuous with the constant ln .
To show that Q n is convex, take any t ∈ [0, 1]. Since Xa(n) is convex, Q n is defined
at t z 1 + (1 − t)z 2 . Thus,

    Q_n(t z_1 + (1 − t)z_2) ≤ f_n(t z_1 + (1 − t)z_2, t y_1 + (1 − t)y_2, t x_1 + (1 − t)x_2)
                              + 𝒬_n(t x_1 + (1 − t)x_2)
        ≤ t f_n(z_1, y_1, x_1) + (1 − t) f_n(z_2, y_2, x_2) + t𝒬_n(x_1) + (1 − t)𝒬_n(x_2)
        = t Q_n(z_1) + (1 − t) Q_n(z_2).

The first inequality follows from the definition (2), while the second inequality follows
from the convexity of f n and Qn . This shows Q n is convex. 


A.1.2 Proof for Proposition 2

Proof First we show that the partial inf-convolution

    f_n □ (σ_n ψ_n)(x_{a(n)}, y_n, x_n) := min_{z∈X_{a(n)}} { f_n(z, y_n, x_n) + σ_n ψ_n(x_{a(n)} − z) }


is σ_n-Lipschitz continuous in the first variable x_{a(n)}. Note that the minimum is
well-defined since X_{a(n)} is compact and the functions f_n, σ_n ψ_n are l.s.c. Besides,
since z = x_{a(n)} is a feasible solution in the minimization, we know that f_n □
(σ_n ψ_n)(x_{a(n)}, y_n, x_n) ≤ f_n(x_{a(n)}, y_n, x_n) for all x_{a(n)} ∈ X_{a(n)} and (x_n, y_n) ∈ F_n.
Pick any x_1, x_2 ∈ X_{a(n)}, (x, y) ∈ F_n, and let z_1, z_2 ∈ X_{a(n)} be the corresponding
minimizers in the definitions of f_n □ (σ_n ψ_n)(x_1, y, x) and f_n □ (σ_n ψ_n)(x_2, y, x),
respectively. By definition,

    f_n □ (σ_n ψ_n)(x_1, y, x) − f_n □ (σ_n ψ_n)(x_2, y, x)
        = f_n(z_1, y, x) + σ_n ψ_n(x_1 − z_1) − f_n(z_2, y, x) − σ_n ψ_n(x_2 − z_2)
        ≤ f_n(z_2, y, x) + σ_n ψ_n(x_1 − z_2) − f_n(z_2, y, x) − σ_n ψ_n(x_2 − z_2) ≤ σ_n‖x_1 − x_2‖.

Similarly, we can get f_n □ (σ_n ψ_n)(x_2, y, x) − f_n □ (σ_n ψ_n)(x_1, y, x) ≤ σ_n‖x_1 − x_2‖
by exchanging x_1, x_2 and z_1, z_2 in the above inequality. Therefore, f_n □ (σ_n ψ_n) is
σ_n-Lipschitz continuous in the first variable x_{a(n)}.
The regularized problem (4) can be viewed as replacing the nodal objective function
f_n with the inf-convolution f_n □ (σ_n ψ_n). Then by Proposition 1, Q^R_n(x) is σ_n-Lipschitz
continuous on X_{a(n)}. Moreover, if the original problem (2) is convex and the ψ_n are convex
penalty functions, then f_n □ (σ_n ψ_n) is also convex. Proposition 1 then ensures that Q^R_n(x) is
also convex on X_{a(n)}.


A.1.3 Proof for Lemma 1

Proof By definition, we have Q^R_n(x̄_{a(n)}) ≤ Q_n(x̄_{a(n)}) for all n ∈ N, n ≠ r. We show
the other direction by contradiction. Suppose there exists a node n ∈ N such that
Q^R_n(x̄_{a(n)}) < Q_n(x̄_{a(n)}). By definition, there exist z'_m ∈ X_{a(m)} and (x'_m, y'_m) ∈ F_m for
all nodes in the subtree m ∈ T(n), such that

    Q^R_n(x̄_{a(n)}) = (1/p_n) Σ_{m∈T(n)} p_m [ f_m(z'_m, y'_m, x'_m) + σ_m ψ_m(x'_{a(m)} − z'_m) ],

where we write x'_{a(n)} := x̄_{a(n)}. We can extend (x'_m, y'_m, z'_m)_{m∈T(n)} to a feasible solution
(z'_m, y'_m, x'_m)_{m∈N} of the regularized problem by setting z'_m = x̄_{a(m)}, y'_m = ȳ_m, and
x'_m = x̄_m for all m ∉ T(n). Thus

    v^reg ≤ Σ_{m∈T(n)} p_m [ f_m(z'_m, y'_m, x'_m) + σ_m ψ_m(x'_{a(m)} − z'_m) ] + Σ_{m∉T(n)} p_m f_m(x̄_{a(m)}, ȳ_m, x̄_m)
          = p_n Q^R_n(x̄_{a(n)}) + Σ_{m∉T(n)} p_m f_m(x̄_{a(m)}, ȳ_m, x̄_m)
          < p_n Q_n(x̄_{a(n)}) + Σ_{m∉T(n)} p_m f_m(x̄_{a(m)}, ȳ_m, x̄_m)
          = Σ_{m∈T(n)} p_m f_m(x̄_{a(m)}, ȳ_m, x̄_m) + Σ_{m∉T(n)} p_m f_m(x̄_{a(m)}, ȳ_m, x̄_m) = v^prim.


This leads to a contradiction with the assumption that v^reg = v^prim. Therefore, we
conclude that Q^R_n(x̄_{a(n)}) = Q_n(x̄_{a(n)}) for all n ∈ N, n ≠ r.


A.1.4 Proof for Proposition 3

Proof If l_{n,ρ} ≥ σ_n, then (λ, ρ) = (0, σ_n) is contained in U_n, and therefore is a dual
feasible solution for (11). Thus, we have

    Q_n(x̄_n) ≥ C^{Φ_n}_n(x̄_n | λ̂_n, ρ̂_n, v̂_n) = v̂_n ≥ min_{z∈X_{a(n)}} { Q_n(z) + σ_n ψ_n(x̄_n − z) } = Q^R_n(x̄_n)
             = Q_n(x̄_n),

where the first inequality is the validity of the generalized conjugacy cut (10), and the
second and the last equalities are due to Lemma 1 for (x̄_n, ȳ_n)_{n∈N} being an optimal
solution to problem (1). This completes the proof.


A.1.5 Proof for Lemma 2

Proof The minima in (13) are well-defined because of the compactness of X and
the lower semicontinuity of Q. Take any x ∈ X. Since both the primal set X and the dual
set {λ ∈ R^d : ‖λ‖_* ≤ σ} are bounded, by strong duality (cf. Theorem 3.1.30 in [27]),
we have

    max_{‖λ‖_*≤σ} min_{z∈X} { Q(z) + ⟨λ, x − z⟩ } = min_{z∈X} max_{‖λ‖_*≤σ} { Q(z) + ⟨λ, x − z⟩ }
        = min_{z∈X} { Q(z) + σ‖x − z‖ },

which completes the proof.




A.1.6 Proof for Proposition 4

Proof By definition, Q^R_n(x) ≤ Q_n □ (σ_n ψ_n)(x). Since ψ_n(x) = ‖x‖ is convex, by
Proposition 2, Q^R_n(x) is convex. Then by Lemma 2, we have

    Q^R_n(x) = max_{‖λ‖_*≤σ_n} min_{z∈X_{a(n)}} { Q_n(z) + ⟨λ, x − z⟩ }.

Therefore,

    C^{Φ_n}_n(x̄_n | λ̂_n, ρ̂_n, v̂_n) = v̂_n = max_{‖λ‖_*≤l_{n,λ}} min_{z∈X_{a(n)}} { Q_n(z) + ⟨λ, x̄_n − z⟩ }
        ≥ max_{‖λ‖_*≤σ_n} min_{z∈X_{a(n)}} { Q_n(z) + ⟨λ, x̄_n − z⟩ } ≥ Q^R_n(x̄_n).

By Lemma 1, Q^R_n(x̄_n) = Q_n(x̄_n) if (x̄_n, ȳ_n)_{n∈N} is an optimal solution to problem (1).
Therefore, we conclude that C^{Φ_n}_n(x̄_n | λ̂_n, ρ̂_n, v̂_n) = Q_n(x̄_n) due to the validity of
the cut C^{Φ_n}_n by (10).



A.2 Proofs for statements in Sect. 3

A.2.1 Proof for Proposition 5

Proof Let L_n := Σ_{m∈C(n)} p_{nm}(l_{m,λ} + l_{m,ρ}) for simplicity. We prove the proposition
recursively for nodes n ∈ N, and inductively for iteration indices i ∈ N. For leaf nodes
and the first iteration, it holds trivially because 𝒬^i_n(x) = 0 for any leaf node n ∈ N
with C(n) = ∅, and 𝒬^0_n(x) = 0 from the definition (14). Now suppose for some
n ∈ N and i ∈ N, it holds for all m ∈ C(n) that 𝒬^i_m(x) ≤ 𝒬_m(x), 𝒬^{i−1}_n(x) ≤ 𝒬_n(x),
and that 𝒬^{i−1}_n(x) is L_n-Lipschitz continuous. Then it follows from (12), (16), and
(B) that C^i_m(x | λ̂^i_m, ρ̂^i_m, v^i_m) ≤ Q_m(x) for all m ∈ C(n). By (15), C^i_m is (l_{m,λ} +
l_{m,ρ})-Lipschitz continuous, so Σ_{m∈C(n)} p_{nm} C^i_m is L_n-Lipschitz continuous. Thus the
pointwise maximum of 𝒬^{i−1}_n(x) and Σ_{m∈C(n)} p_{nm} C^i_m(x | λ̂^i_m, ρ̂^i_m, v^i_m) (cf. (14)) is still
dominated by 𝒬_n(x) and L_n-Lipschitz continuous.




A.2.2 Proof for Proposition 6

Proof Let L_n = Σ_{m∈C(n)} p_{nm} σ_m for simplicity in this proof. We prove the
statement by induction on the number of iterations i. When i = 1,

    𝒬̄^1_n(x) = min{ +∞, Σ_{m∈C(n)} p_{nm}(v̄^1_m + σ_m‖x − x^1_n‖) } = Σ_{m∈C(n)} p_{nm}(v̄^1_m + σ_m‖x − x^1_n‖),

which is clearly L_n-Lipschitz continuous. For any leaf node n, 𝒬̄^1_n ≡ 0 = 𝒬^R_n by
definition. Going recursively from leaf nodes to the root node, suppose 𝒬̄^1_m ≥ 𝒬^R_m for
all m ∈ C(n) for some node n; then we have

    v̄^1_m = f_m(z^1_m, y^1_m, x^1_m) + σ_m ψ_m(x^1_n − z^1_m) + 𝒬̄^1_m(x^1_m)
          ≥ min{ f_m(z, y, x) + σ_m ψ_m(x^1_n − z) + 𝒬̄^1_m(x) : (x, y) ∈ F_m, z ∈ X_n }
          ≥ min{ f_m(z, y, x) + σ_m ψ_m(x^1_n − z) + 𝒬^R_m(x) : (x, y) ∈ F_m, z ∈ X_n }
          = Q^R_m(x^1_n).                                                   (31)

Thus 𝒬̄^1_n(x) = Σ_{m∈C(n)} p_{nm}(v̄^1_m + σ_m‖x − x^1_n‖) ≥ 𝒬^R_n(x) for all x ∈ X_n by the
σ_m-Lipschitz continuity of the regularized value functions Q^R_m(x) for all m ∈ C(n),
shown in Proposition 2.
Now assume that the statement holds for all iterations up to i − 1. For any leaf node n,
𝒬̄^i_n ≡ 0 = 𝒬^R_n still holds by definition. For any non-leaf node n, suppose 𝒬̄^i_m ≥ 𝒬^R_m
for all m ∈ C(n). Then by the same argument as (31), we know that v̄^i_m ≥ Q^R_m(x^i_n).
By the induction hypothesis, 𝒬̄^{i−1}_n(x) ≥ 𝒬^R_n(x) for all x ∈ X_n. So for the cases without
convexity, 𝒬̄^i_n(x) = min{ 𝒬̄^{i−1}_n(x), Σ_{m∈C(n)} p_{nm}(v̄^i_m + σ_m‖x − x^i_n‖) } is L_n-Lipschitz
continuous and satisfies 𝒬̄^i_n(x) ≥ 𝒬^R_n(x), since Σ_{m∈C(n)} p_{nm}(v̄^i_m + σ_m‖x − x^i_n‖) ≥
𝒬^R_n(x) for all x ∈ X_n following Proposition 2.


It remains to show that in the convex case 𝒬̄^i_n(x) is still L_n-Lipschitz continuous and
satisfies 𝒬̄^i_n(x) ≥ 𝒬^R_n(x) for any x ∈ X_n. Note that 𝒬^R_n(x) can be naturally extended
to the entire space R^{d_n} ⊃ X_n, since 𝒬^R_n □ (L_n‖·‖)(x) = 𝒬^R_n(x) for any x ∈ X_n by
the L_n-Lipschitz continuity of 𝒬^R_n. The above argument of the base case i = 1 for the
nonconvex case can be directly applied to the convex case over x ∈ R^{d_n}. Now assume
that 𝒬̄^{i−1}_n(x) is L_n-Lipschitz continuous on R^{d_n} and 𝒬̄^{i−1}_n(x) ≥ 𝒬^R_n(x) for x ∈ R^{d_n}
up to i − 1. Since 𝒬̃_n(x) := min{ 𝒬̄^{i−1}_n(x), Σ_{m∈C(n)} p_{nm}(v̄^i_m + σ_m‖x − x^i_n‖) } is L_n-
Lipschitz continuous in x ∈ R^{d_n}, we claim that the supremum in the definition (19)
can be attained within the dual norm ball B_*(L_n) := {λ ∈ R^{d_n} : ‖λ‖_* ≤ L_n}. In fact,
for any λ ∉ B_*(L_n), the infimum

    inf_{z∈R^{d_n}} { 𝒬̃_n(z) + ⟨λ, z − x⟩ } ≤ inf_{z∈R^{d_n}} { 𝒬̃_n(x) + L_n‖z − x‖ + ⟨λ, z − x⟩ } = −∞.

As a result, 𝒬̄^i_n(x) is a supremum of L_n-Lipschitz linear functions (of the form
l(x) := 𝒬̃_n(ẑ) + ⟨λ̂, ẑ − x⟩, where λ̂ ∈ B_*(L_n) and ẑ ∈ X_n), and thus is also an
L_n-Lipschitz continuous function. Moreover, 𝒬̃_n(x) ≥ 𝒬^R_n(x) for all x ∈ R^{d_n}. By
(19), 𝒬̄^i_n(x) = (𝒬̃_n)^{**}(x) ≥ (𝒬^R_n)^{**}(x) = 𝒬^R_n(x) for all x ∈ R^{d_n}. This completes
the proof.


A.2.3 Proof for Proposition 7

Proof From the definition of v^reg and Proposition 6,

    v^reg ≤ f_r(x_{a(r)}, y^*_r, x^*_r) + 𝒬^R_r(x^*_r) ≤ f_r(x_{a(r)}, y^*_r, x^*_r) + 𝒬̄^i_r(x^*_r) ≤ UpperBound.

Since UpperBound − LowerBound ≤ ε, we have

    f_r(x_{a(r)}, y^*_r, x^*_r) + 𝒬̄^i_r(x^*_r) ≤ f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + 𝒬^i_r(x^{i+1}_r) + ε.

Then, using the optimality of (x^{i+1}_r, y^{i+1}_r) given by O_r(𝒬^i_r) and the fact that 𝒬^i_r(x) ≤
𝒬_r(x), we see that

    f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + 𝒬^i_r(x^{i+1}_r) ≤ min_{(x,y)∈F_r} { f_r(x_{a(r)}, y, x) + 𝒬_r(x) } = v^prim.

Under Assumption 2, v^reg = v^prim. Therefore, combining all the above inequalities,
we have shown that

    v^reg ≤ f_r(x_{a(r)}, y^*_r, x^*_r) + 𝒬̄^i_r(x^*_r) ≤ v^reg + ε,

which means (x^*_r, y^*_r) is an ε-optimal root node solution to the regularized prob-
lem (4). Now suppose 𝒬̄^i_r(x^{i+1}_r) − 𝒬^i_r(x^{i+1}_r) ≤ ε for some iteration index i. Noting that
UpperBound ≤ f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + 𝒬̄^i_r(x^{i+1}_r), we have


    UpperBound − LowerBound ≤ f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + 𝒬̄^i_r(x^{i+1}_r)
                              − ( f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + 𝒬^i_r(x^{i+1}_r) )
                            = 𝒬̄^i_r(x^{i+1}_r) − 𝒬^i_r(x^{i+1}_r) ≤ ε.

Therefore the algorithm terminates after the i-th iteration.




A.3 Proofs for statements in Sect. 4

A.3.1 Proof for Lemma 3

Proof By definition (14), 𝒬^i_m(x) ≥ 𝒬^{i−1}_m(x) on X_m for all m ∈ C(n). If the problem
is convex with ψ_m = ‖·‖, then by Lemma 2, we have

    v^i_m = max_{‖λ‖_*≤l_{m,λ}} min_{(x,y)∈F_m, z∈X_n} { f_m(z, y, x) + ⟨λ, x^i_n − z⟩ + 𝒬^i_m(x) }
          = min_{(x,y)∈F_m, z∈X_n} { f_m(z, y, x) + l_{m,λ}‖x^i_n − z‖ + 𝒬^i_m(x) }
          ≥ min_{(x,y)∈F_m, z∈X_n} { f_m(z, y, x) + σ_m‖x^i_n − z‖ + 𝒬^i_m(x) },

as l_{m,λ} ≥ σ_m in Definition 3. Otherwise, by definition (16) and the fact that (λ, ρ) =
(0, σ_m) is a dual feasible solution for the problem (B), we have

    v^i_m = max_{(λ,ρ)∈U_m} min_{(x,y)∈F_m, z∈X_n} { f_m(z, y, x) + ⟨λ, x^i_n − z⟩ + ρψ_m(x^i_n − z) + 𝒬^i_m(x) }
          ≥ min_{(x,y)∈F_m, z∈X_n} { f_m(z, y, x) + σ_m ψ_m(x^i_n − z) + 𝒬^i_m(x) }.

Thus in both cases, we have

    v^i_m ≥ min_{(x,y)∈F_m, z∈X_n} { f_m(z, y, x) + σ_m ψ_m(x^i_n − z) + 𝒬^{i−1}_m(x) }
          = f_m(z^i_m, y^i_m, x^i_m) + σ_m ψ_m(x^i_n − z^i_m) + 𝒬^{i−1}_m(x^i_m)

for all m ∈ C(n). The last equality is due to the forward step subproblem oracle
O^F_m(x^i_n, 𝒬^{i−1}_m) in the algorithm. Meanwhile, note that 𝒬̄^i_m(x) ≤ 𝒬̄^{i−1}_m(x) for x ∈ X_m.
By definition (18), we have

    v̄^i_m = f_m(z^i_m, y^i_m, x^i_m) + σ_m ψ_m(x^i_n − z^i_m) + 𝒬̄^i_m(x^i_m)
          ≤ f_m(z^i_m, y^i_m, x^i_m) + σ_m ψ_m(x^i_n − z^i_m) + 𝒬̄^{i−1}_m(x^i_m)


for all m ∈ C(n). Note that by definition (17), 𝒬̄^i_n(x^i_n) ≤ Σ_{m∈C(n)} p_{nm} v̄^i_m,
and by definitions (14) and (15), 𝒬^i_n(x^i_n) ≥ Σ_{m∈C(n)} p_{nm} C^i_m(x^i_n | λ̂^i_m, ρ̂^i_m, v^i_m) =
Σ_{m∈C(n)} p_{nm} v^i_m. Therefore,

    𝒬̄^i_n(x^i_n) − 𝒬^i_n(x^i_n) ≤ Σ_{m∈C(n)} p_{nm}(v̄^i_m − v^i_m)
        ≤ Σ_{m∈C(n)} p_{nm} [ 𝒬̄^{i−1}_m(x^i_m) − 𝒬^{i−1}_m(x^i_m) ] ≤ Σ_{m∈C(n)} p_{nm} γ_m(δ).

Note that 𝒬̄^i_n(x) is Σ_{m∈C(n)} p_{nm} σ_m-Lipschitz continuous by Proposition 6, and
𝒬^i_n(x) is Σ_{m∈C(n)} p_{nm}(l_{m,λ} + l_{m,ρ})-Lipschitz continuous on X_n by Proposition 5.
Since we have l_{m,λ} + l_{m,ρ} ≥ σ_m regardless of convexity of the problem, it holds
that 𝒬̄^i_n(x) and 𝒬^i_n(x) are both L_n-Lipschitz continuous. Therefore, for any x ∈ X_n
with ‖x − x^i_n‖ ≤ δ_n/(2L_n), we have

    𝒬̄^i_n(x) − 𝒬^i_n(x) ≤ 𝒬̄^i_n(x^i_n) − 𝒬^i_n(x^i_n) + 2L_n‖x − x^i_n‖ ≤ Σ_{m∈C(n)} p_{nm} γ_m(δ) + δ_n
                        = γ_n(δ).

This completes the proof.




A.3.2 Proof for Lemma 4


Proof We claim that for any i, j ∈ I_n(δ) with i ≠ j, we have ‖x^i_n − x^j_n‖ > δ_n/(2L_n).
Assume for contradiction that ‖x^i_n − x^j_n‖ ≤ δ_n/(2L_n) for some i < j with i, j ∈ I_n(δ).
By the definition of I_n(δ), γ^i_m ≤ γ_m(δ) for all m ∈ C(n). By Lemma 3, 𝒬̄^i_n(x) − 𝒬^i_n(x) ≤
γ_n(δ) for all x ∈ X_n with ‖x − x^i_n‖ ≤ δ_n/(2L_n). Since j > i and ‖x^i_n − x^j_n‖ ≤ δ_n/(2L_n),
this implies γ^j_n = 𝒬̄^j_n(x^j_n) − 𝒬^j_n(x^j_n) ≤ γ_n(δ), which is a contradiction with j ∈ I_n(δ).
Hence the claim is proved.
Let B(R) and B(R, x) ⊆ R^d denote the closed balls with radius R ≥ 0 centered at 0
and at x, respectively. It follows from the claim that the closed balls B(δ_n/(4L_n), x^i_n) are
non-overlapping for all i ∈ I_n(δ), each with volume Vol B(δ_n/(4L_n)). Thus the sum
of the volumes of these balls is |I_n(δ)| · Vol B(δ_n/(4L_n)). Note that for each index i ∈
I_n(δ), x^i_n ∈ X_n and hence x^i_n ∈ B_{n,k} for some k. The closed ball B(δ_n/(4L_n), x^i_n) ⊆
B_{n,k} + B(δ_n/(4L_n)), and therefore

    ∪_{i∈I_n(δ)} B(δ_n/(4L_n), x^i_n) ⊆ ∪_{k=1}^{K_n} ( B_{n,k} + B(δ_n/(4L_n)) ).


It follows that

    Vol[ ∪_{i∈I_n(δ)} B(δ_n/(4L_n), x^i_n) ] = |I_n(δ)| · Vol B(δ_n/(4L_n))
        ≤ Vol[ ∪_{k=1}^{K_n} ( B_{n,k} + B(δ_n/(4L_n)) ) ] ≤ Σ_{k=1}^{K_n} Vol( B_{n,k} + B(δ_n/(4L_n)) ).

Therefore,

    |I_n(δ)| ≤ Σ_{k=1}^{K_n} Vol( B_{n,k} + B(δ_n/(4L_n)) ) / Vol B(δ_n/(4L_n)) ≤ Σ_{k=1}^{K_n} ( 1 + 2L_n D_{n,k}/δ_n )^{d_n}.

This completes the proof.
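The final covering bound is straightforward to evaluate numerically; the sketch below (parameter values are illustrative, not from the paper) computes Σ_k (1 + 2L_nD_{n,k}/δ_n)^{d_n}:

```python
def lemma4_cardinality_bound(L_n, delta_n, diameters, d_n):
    """Upper bound on |I_n(delta)|: sum_k (1 + 2*L_n*D_{n,k}/delta_n)**d_n."""
    return sum((1 + 2 * L_n * D / delta_n) ** d_n for D in diameters)

# Two covering sets of diameter 1.0 each, tolerance delta_n = 0.5,
# state dimension d_n = 3: each term is (1 + 4)**3 = 125, so b = 250.
b = lemma4_cardinality_bound(L_n=1.0, delta_n=0.5, diameters=[1.0, 1.0], d_n=3)
```

Tightening δ_n inflates each term at rate (1/δ_n)^{d_n}, which is exactly where the dimension dependence of the complexity bounds enters.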




A.3.3 Proof for Lemma 5


Proof For each iteration i ∈ N, the event ∪_{j=1}^M { γ^{i−1,j}_0 ≤ γ_0(δ) = ε } is Σ_{i−1}-
measurable, so it suffices to prove this inequality on its complement in A_i(δ). Note
that

    Prob{ γ^{i,j}_t = γ̃^{i,j}_t | Σ_{i−1} } ≥ Prob{ P^{i,j}_t = n(γ̃^{i,j}_t) | Σ_{i−1} },

where n(γ̃^{i,j}_t) is the smallest node index n ∈ Ñ(t) such that 𝒬̄^{i−1}_n(x_n) − 𝒬^{i−1}_n(x_n) =
γ̃^{i,j}_t for (x_n, y_n, z_n) = O^F_n(x^{i,j}_{t−1}, 𝒬^{i−1}_n), which is determined given Σ_{i−1}. Using
the same argument as in the proof of Theorem 1, Lemma 3 shows that the
event ∩_{t=1}^{T−1} { γ^{i,j}_t = γ̃^{i,j}_t } implies the event { i ∈ ∪_{t=0}^{T−1} I_t(δ) } and hence the
event A_i(δ) for each j = 1, …, M. Therefore, since Σ_{i−1} is contained in
σ(Σ^oracle_∞, σ{P^{i',j'}}_{(i',j')≠(i,j)}), by the independent, uniform sampling (Assump-
tion 4), we have
tion 4), we have

    Prob(A_i(δ) | Σ_{i−1}) ≥ Prob( ∪_{j=1}^M ∩_{t=1}^{T−1} { γ^{i,j}_t = γ̃^{i,j}_t } | Σ_{i−1} )
        ≥ Prob( ∪_{j=1}^M ∩_{t=1}^{T−1} { P^{i,j}_t = n(γ̃^{i,j}_t) } | Σ_{i−1} )
        = 1 − ( 1 − Prob( ∩_{t=1}^{T−1} { P^{i,j}_t = n(γ̃^{i,j}_t) } | Σ_{i−1} ) )^M = 1 − (1 − 1/N)^M.


Here, the last step follows from Prob( ∩_{t=1}^{T−1} { P^{i,j}_t = n(γ̃^{i,j}_t) } | Σ_{i−1} ) =
Π_{t=1}^{T−1} Prob( { P^{i,j}_t = n(γ̃^{i,j}_t) } | Σ_{i−1} ) = Π_{t=1}^{T−1} (1/N_t) = 1/N.
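The quantity 1 − (1 − 1/N)^M is simply the chance that at least one of M independent, uniformly sampled scenario paths equals one designated path among N = Π_t N_t; a quick Monte Carlo sanity check (illustrative, not part of the paper):

```python
import random

def hit_probability(N, M, trials=100_000, seed=1):
    """Estimate Prob(at least one of M uniform draws from {0,...,N-1} hits 0)."""
    rng = random.Random(seed)
    hits = sum(any(rng.randrange(N) == 0 for _ in range(M)) for _ in range(trials))
    return hits / trials

N, M = 8, 4
exact = 1 - (1 - 1 / N) ** M  # the per-iteration success probability in Lemma 5
```

The estimate converges to the closed form, confirming the independence argument used in the last step above.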


A.3.4 Proof for Theorem 3

Proof Let a_i := 1_{A_i} denote the indicator of the event A_i for i ∈ N, and S_i := Σ_{i'=1}^i a_{i'}.
Note that the event {ι ≥ i} implies the event {S_i ≤ I}, so we want to bound the probability
of the latter for sufficiently large indices i.
By Lemma 5, we see that the adapted sequence {S_i − iν}_{i=1}^∞ is a submartingale
with respect to the filtration {Σ_i}_{i=1}^∞, because

    E(S_i − iν | Σ_{i−1}) = S_{i−1} − (i − 1)ν + ( E(a_i | Σ_{i−1}) − ν ) ≥ S_{i−1} − (i − 1)ν.

Moreover, it has bounded differences, as |S_i − iν − (S_{i−1} − (i − 1)ν)| = |a_i − ν| ≤ 2
almost surely. Now apply the one-sided Azuma–Hoeffding inequality to get, for
any k > 0,

    Prob(S_i ≤ iν − k) ≤ exp( −k²/(8i) ).

For any κ > 1, take the smallest iteration index i such that iν ≥ κI, and set k :=
(κ − 1)I. Since I ≥ 2κ/ν, the probability bound can then be written as

    Prob(ι ≥ i) ≤ Prob(S_i ≤ I) ≤ exp( −(κ − 1)²I²/(8i) ) ≤ exp( −(κ − 1)²Iν/(16κ) ).

Substituting the left-hand side with Prob(ι ≥ 1 + κI/ν) using the definition of i, we
have obtained the desired inequality.
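The resulting tail bound decays geometrically in I; a minimal sketch (parameter values illustrative, not from the paper):

```python
import math

def iteration_tail_bound(I, nu, kappa):
    """Theorem 3 tail bound exp(-(kappa-1)^2 * I * nu / (16*kappa)),
    applicable when kappa > 1 and I >= 2*kappa/nu."""
    assert kappa > 1 and I >= 2 * kappa / nu
    return math.exp(-((kappa - 1) ** 2) * I * nu / (16 * kappa))

# With per-iteration success probability nu = 1 - (1 - 1/N)**M, the chance
# of needing more than about kappa*I/nu iterations vanishes in I.
p = iteration_tail_bound(I=100, nu=0.4, kappa=2.0)  # exp(-1.25)
```

Doubling I squares nothing but halves the exponent's slack: the bound shrinks by another factor of exp(−(κ−1)²Iν/(16κ)).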


A.4 Proofs for statements in Sect. 5

A.4.1 Proof for Lemma 6

Proof We prove the lemma recursively starting from the leaf nodes. For leaf nodes
n ∈ N, C(n) = ∅, and Q^R_n(x) = min_{z∈X_{a(n)}} { Q_n(z) + σ_n ψ_n(x − z) } ≥ min_{z∈X_{a(n)}} { Q_n(z) +
l_n‖x − z‖ }. Since Q_n is l_n-Lipschitz continuous, Q_n(z) ≥ Q_n(x) − l_n‖x − z‖. There-
fore, Q^R_n(x) ≥ Q_n(x), and by Proposition 2 we know Q^R_n(x) = Q_n(x) for all
x ∈ X_{a(n)}.
Now suppose for a node n ∈ N we know that all of its child nodes satisfy Q^R_m(x) =
Q_m(x), ∀ x ∈ X_n, for all m ∈ C(n). Then by definition,

    Q^R_n(x_{a(n)}) = min_{(x,y)∈F_n, z∈X_{a(n)}} { f_n(z, y, x) + σ_n ψ_n(x_{a(n)} − z) + 𝒬^R_n(x) }.

By assumption, we know that 𝒬^R_n(x) = 𝒬_n(x) for all x ∈ X_n. Therefore, Q^R_n(x_{a(n)}) =
min_{z∈X_{a(n)}} { Q_n(z) + σ_n ψ_n(x_{a(n)} − z) } ≥ min_{z∈X_{a(n)}} { Q_n(z) + l_n‖x_{a(n)} − z‖ }. Then


again by the l_n-Lipschitz continuity of f_n, we conclude that Q^R_n(x) = Q_n(x) for all
x ∈ X_{a(n)}.


A.4.2 Proof for Lemma 7


Proof We claim that if K < (DL/(4β))^d, then there exists a point x̂ ∈ X such that
‖x̂ − w_k‖ ≥ 2β/L for all k = 1, …, K. We prove the claim by contradiction. Suppose
such a point does not exist, or equivalently, for any point x ∈ X there exists w_k ∈ W
such that ‖x − w_k‖ < 2β/L. This implies that the balls B(2β/L, w_k) cover the set X,
which leads to

    Vol X ≤ Vol[ ∪_{k=1}^K B(2β/L, w_k) ] ≤ Σ_{k=1}^K Vol B(2β/L, w_k) = K · Vol B(2β/L).

Therefore, it must hold that K ≥ Vol X / Vol B(2β/L) = (DL/(4β))^d, hence a contradic-
tion.
The existence of x̂ guarantees that f(w_k) − L‖x̂ − w_k‖ ≤ f(w_k) − 2β < 0 for each
k = 1, 2, …, K. Therefore, 0 ≤ min_{x∈X} Q(x) ≤ Q(x̂) = max_{1≤k≤K} {0, f(w_k) −
L‖x̂ − w_k‖} = 0. From the compactness of X and the continuity of Q̄(x), we have
the inequality min_{x∈X} Q̄(x) ≥ min_{1≤k≤K} Q̄(w_k) = min_{1≤k≤K} f(w_k) > β, which
completes the proof.



A.4.3 Proof for Theorem 4

Proof Let us define the forward subproblem oracle O^F_n in iteration i and stage t as
mapping (x^i_{t−1}, Q^{i−1}_{t+1}) to an optimal solution (x^i_t, z^i_t) of the forward subproblem

    min_{x_t, z_t ∈ X_t} { f_t(z_t) + L‖x^i_{t−1} − z_t‖ + Q^{i−1}_{t+1}(x_t) },

and the backward subproblem oracle O^B_n in iteration i and stage t as mapping
(x^i_{t−1}, Q^i_{t+1}) to an optimal solution (x̂^i_t, ẑ^i_t; λ̂^i_t = 0, ρ̂^i_t = L) of the backward sub-
problem

    max_{λ=0, 0≤ρ≤L} min_{x_t, z_t ∈ X_t} { f_t(z_t) + ρ‖x^i_{t−1} − z_t‖ + Q^i_{t+1}(x_t) }.        (32)

Note that in the backward subproblem (32) we choose l_{n,λ} = 0 and l_{n,ρ} = L. Observe
that the objective function in (32) is nondecreasing in ρ. Therefore, ρ̂^i_t = L
is always an optimal solution for the outer maximization in (32). The root-node oracle
O_r in iteration i simply solves min_{x_0∈X} Q^i_1(x_0) and outputs x^{i+1}_0.


In the backward step (Algorithm 2, step 14), cf. the definition (15), the new
generalized conjugacy cut in iteration k ≤ i is generated by

    C^k_t(x | 0, L, v^k_t) = v^k_t − L‖x − x^k_{t−1}‖,

for stage t ≥ 1, where v^k_t is computed and upper bounded as

    v^k_t = f_t(ẑ^k_t) + L‖x^k_{t−1} − ẑ^k_t‖ + min_{x_t∈X_t} Q^k_{t+1}(x_t)
         ≤ f_t(x^k_{t−1}) + min_{x_t∈X_t} Q^k_{t+1}(x_t)
         ≤ f_t(x^k_{t−1}) + min_{x_t∈X_t} Q^i_{t+1}(x_t),

where the first inequality directly follows from (32), as z_t = x^k_{t−1} is a feasible solution
of the inner minimization problem, and the second inequality is due to the monotonicity
Q^k_{t+1}(x) ≤ Q^i_{t+1}(x) for k ≤ i. Therefore,

    Q^i_t(x) = max_{k=1,…,i} { 0, C^k_t(x | 0, L, v^k_t) }   (by (14))
             = max_{k=1,…,i} { 0, v^k_t − L‖x − x^k_{t−1}‖ }
             ≤ max_{k=1,…,i} { 0, f_t(x^k_{t−1}) − L‖x − x^k_{t−1}‖ } + min_{x_t∈X_t} Q^i_{t+1}(x_t).        (33)

Similarly, by (18), the upper approximation of the value function is computed and
lower bounded as

    v̄^k_t = f_t(z^k_t) + L‖z^k_t − x^k_{t−1}‖ + Q̄^k_{t+1}(x^k_t)
          ≥ f_t(x^k_{t−1}) + Q̄^k_{t+1}(x^k_t)
          ≥ f_t(x^k_{t−1}) + min_{x_t∈X_t} Q̄^k_{t+1}(x_t)
          ≥ f_t(x^k_{t−1}) + min_{x_t∈X_t} Q̄^i_{t+1}(x_t),

where X̂^k_t := arg min_{x_t∈X_t} Q̄^{k−1}_{t+1}(x_t). Therefore, the over-approximation satisfies

    Q̄^i_t(x) = min_{k=1,…,i} { v̄^k_t + L‖x − x^k_{t−1}‖ }   (by (17))
             ≥ min_{k=1,…,i} { f_t(x^k_{t−1}) + L‖x − x^k_{t−1}‖ } + min_{x_t∈X_t} Q̄^i_{t+1}(x_t)
             > ε/T + min_{x_t∈X_t} Q̄^i_{t+1}(x_t),        (34)

where (34) follows from the construction that f_t(x) > ε/T for all x ∈ X_t.


Now using (33) and (34), we can prove the statement of the theorem. Sup-
pose the iteration index i < (DLT/(4ε))^d. Denote w_k := x^k_{t−1} for k = 1, …, i.
Since ε/T < f_t(w_k) < 2ε/T by construction, applying Lemma 7, we get
min_{x_t∈X_t} max_{k=1,…,i} { 0, f_t(x^k_{t−1}) − L‖x_t − x^k_{t−1}‖ } = 0. By (33), min_{x_{t−1}∈X_{t−1}} Q^i_t
(x_{t−1}) ≤ min_{x_t∈X_t} Q^i_{t+1}(x_t) for t = 1, …, T. Note that at stage T, Q^i_{T+1} ≡ 0. There-
fore, min_{x_{t−1}∈X_{t−1}} Q^i_t(x_{t−1}) ≤ 0 for all t = 1, …, T. But since Q^i_t(x) ≥ 0 for all
x ∈ X_{t−1}, we have min_{x∈X_{t−1}} Q^i_t(x) = 0 for all 1 ≤ t ≤ T. Hence we see that
LowerBound = min_{x_0∈X_0} Q^i_1(x_0) = 0 in iteration i.
Since X_t is a norm ball, it is compact. So by (34), we have

    min_{x_{t−1}∈X_{t−1}} Q̄^i_t(x_{t−1}) > ε/T + min_{x_t∈X_t} Q̄^i_{t+1}(x_t),   ∀ 1 ≤ t ≤ T.

This recursion implies that min_{x_0∈X_0} Q̄^i_1(x_0) > T(ε/T) = ε. According to Algo-
rithm 2, Steps 18–19, we have that in iteration i, UpperBound = min_{k=1,…,i} { Q̄^k_1(x^{k+1}_0) } ≥
min_{k=1,…,i} { Q̄^i_1(x^{k+1}_0) } ≥ min_{x_0∈X_0} Q̄^i_1(x_0) > ε. Combining the above analysis, we have
UpperBound − LowerBound > ε in iteration i. Therefore, we conclude that if
UpperBound − LowerBound ≤ ε at the i-th iteration, then we have i ≥ (DLT/(4ε))^d.


A.4.4 Proof for Lemma 8

Proof Let v_d denote the volume of a d-dimensional unit ball. Recall that the d-
volume of S^d(R) is given by

    Vol_d(S^d(R)) = (d + 1) v_{d+1} R^d = ((d + 1)π^{(d+1)/2} / Γ((d + 1)/2 + 1)) R^d.

We next estimate the d-volume of the spherical cap S^d_β(R, x). Let α ∈ (0, π/2) denote
the central angle of the spherical cap, i.e., cos α = 1 − β/R. Since β < (1 − √2/2)R,
we know that α < π/4. Then for any x ∈ S^d(R), the d-volume of the spherical cap
can be calculated through

    Vol_d(S^d_β(R, x)) = ∫₀^α Vol_{d−1}(S^{d−1}(R sin θ)) R dθ = d v_d R^d ∫₀^α (sin θ)^{d−1} dθ.

Note that when θ ∈ (0, α), sin θ > 0 and cos θ/sin θ > 1. Therefore, since d ≥ 2,

    Vol_d(S^d_β(R, x)) ≤ d v_d R^d ∫₀^α (cos θ/sin θ)(sin θ)^{d−1} dθ = d v_d R^d · (sin α)^{d−1}/(d − 1).

By substituting sin α = √(1 − (1 − β/R)²), we have that

    Vol_d(S^d_β(R, x)) / Vol_d(S^d(R)) ≤ (d/(d² − 1)) (v_d/v_{d+1}) (sin α)^{d−1}

123
Stochastic dual dynamic programming… 975

+   ,(d−1)/2
d vd β 2
= 2 1− 1−
d − 1 vd+1 R
 (d−1)/2
d vd 2β
≤ .
d 2 − 1 vd+1 R

Now suppose $W = \{w_k\}_{k=1}^K$ is a maximal set satisfying the assumption, that is, for any $w\in S^d(R)$ with $w\notin W$, there exists $w_k\in W$ such that $w\in S^d_\beta(R,w_k)$. Then $\bigcup_{k=1}^K S^d_\beta(R,w_k) \supseteq S^d(R)$, and therefore
$$\mathrm{Vol}_d(S^d(R)) \le \sum_{k=1}^K \mathrm{Vol}_d(S^d_\beta(R,w_k)) = |W|\,\mathrm{Vol}_d(S^d_\beta(R,w_1)).$$
Therefore we have
$$|W| \ge \frac{\mathrm{Vol}_d(S^d(R))}{\mathrm{Vol}_d(S^d_\beta(R,w_1))} \ge \left[\frac{d}{d^2-1}\,\frac{v_d}{v_{d+1}}\left(\frac{2\beta}{R}\right)^{(d-1)/2}\right]^{-1} = \frac{(d^2-1)\sqrt{\pi}}{d}\,\frac{\Gamma(d/2+1)}{\Gamma(d/2+3/2)}\left(\frac{R}{2\beta}\right)^{(d-1)/2}.$$
This completes the proof. □
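As a numeric sanity check (our own illustration, not part of the paper), the cap-volume ratio bound above can be verified by evaluating the integral $\int_0^\alpha (\sin\theta)^{d-1}\,\mathrm{d}\theta$ with midpoint quadrature, using only the standard library:

```python
import math

def unit_ball_volume(d):
    # v_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the d-dimensional unit ball
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def cap_ratio(d, R, beta, steps=20000):
    # Vol_d(cap) / Vol_d(S^d(R)) via the integral formula from the proof:
    # Vol_d(cap) = d * v_d * R^d * int_0^alpha sin(theta)^(d-1) dtheta,
    # Vol_d(S^d(R)) = (d + 1) * v_{d+1} * R^d, with cos(alpha) = 1 - beta / R.
    alpha = math.acos(1 - beta / R)
    h = alpha / steps
    integral = h * sum(math.sin((i + 0.5) * h) ** (d - 1) for i in range(steps))
    return d * unit_ball_volume(d) * integral / ((d + 1) * unit_ball_volume(d + 1))

def cap_ratio_bound(d, R, beta):
    # The bound (d / (d^2 - 1)) * (v_d / v_{d+1}) * (2*beta / R)^((d-1)/2)
    vd, vd1 = unit_ball_volume(d), unit_ball_volume(d + 1)
    return d / (d ** 2 - 1) * vd / vd1 * (2 * beta / R) ** ((d - 1) / 2)

# The bound requires beta < (1 - sqrt(2)/2) * R, i.e. alpha < pi/4.
for d in (2, 3, 5, 10):
    for beta in (0.01, 0.1, 0.25):
        assert cap_ratio(d, 1.0, beta) <= cap_ratio_bound(d, 1.0, beta)
```

The specific values of $d$ and $\beta$ tested here are our own choices within the range allowed by the lemma.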




A.4.5 Proof for Lemma 9

Proof 1. By construction, $F$ is a convex piecewise linear function. Each linear piece has Lipschitz constant $(L/R)\|w_k\| = L$, since $\|w_k\| = R$. Thus $F$, as the maximum of these linear functions, is also an $L$-Lipschitz function.
2. Since $w_k \notin S^d_{\varepsilon/L}(w_l)$ for $l \ne k$, we have $\langle w_l, w_k - w_l\rangle < -\varepsilon R/L$. Hence $v_l + (L/R)\langle w_l, w_k - w_l\rangle < v_l - \varepsilon < 0$. Therefore, $F(w_k) = \max\{0, v_k\} = v_k$.
3. Notice that the above maximum for $F(w_k)$ is achieved at a unique linear piece, which implies that $F$ is differentiable at $w_k$ for all $k$, with gradient $\nabla F(w_k) = (L/R)\,w_k$. This gives the inequality in property 3 from the proof of property 2.
4. The inequality of property 3 also implies that $\underline{Q}_l(w_l) = 0$. Now we show $\overline{Q}_l(w_l) > 3\varepsilon/2$. Since $w_l \notin S^d_{\varepsilon/L}(w_k)$ for any $k \ne l$, we have $\|w_l - w_k\| > \varepsilon/L$ by the Pythagorean theorem. Also $v_k > \varepsilon/2$, so $q_k := v_k + L\|w_l - w_k\| > 3\varepsilon/2$. Since $\overline{Q}_l(w_l)$ is a convex combination of the $q_k$'s, we have $\overline{Q}_l(w_l) > 3\varepsilon/2$.
This completes the proof. □
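The construction in Lemma 9 is easy to reproduce numerically. The sketch below is our own illustration: the greedy sampling scheme and the concrete values of $d$, $R$, $L$, $\varepsilon$ are our choices, standing in for the $w_k$ and $v_k$ of the lemma. It builds $F(x) = \max\{0,\ \max_k v_k + (L/R)\langle w_k, x - w_k\rangle\}$ and checks the Lipschitz and interpolation properties:

```python
import numpy as np

rng = np.random.default_rng(1)
d, R, L, eps = 5, 1.0, 2.0, 0.1

# Greedily sample points on the sphere of radius R so that each new point
# lies outside the spherical cap of depth eps/L around every earlier point,
# i.e. <w_l, w_k> < R^2 - eps*R/L for all l != k.
W = []
while len(W) < 20:
    w = rng.normal(size=d)
    w *= R / np.linalg.norm(w)
    if all(np.dot(wl, w) < R**2 - eps * R / L for wl in W):
        W.append(w)
W = np.array(W)
v = rng.uniform(eps / 2, eps, size=len(W))   # values v_k in (eps/2, eps)

def F(x):
    # F(x) = max{0, max_k v_k + (L/R) * <w_k, x - w_k>}
    return max(0.0, np.max(v + (L / R) * (W @ x - R**2)))

# Property 1: F is L-Lipschitz (each piece has gradient (L/R) w_k of norm L).
for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    assert abs(F(x) - F(y)) <= L * np.linalg.norm(x - y) + 1e-9

# Property 2: F interpolates the data, F(w_k) = v_k.
assert all(abs(F(W[k]) - v[k]) < 1e-9 for k in range(len(W)))
```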


A.4.6 Proof for Theorem 5

Proof First we claim that in any iteration $i$, for any nodal problem $k$ in stage $t$, the optimal solution in the forward step (Algorithm 2, line 7) must be $x^i_t = w_{t,k}$. To see this, recall that we set $l_{n,\lambda} = L_t$ and $l_{n,\rho} = 0$ for all $n\in\tilde{\mathcal N}(t)$, so by Proposition 5, the under-approximation $\underline{\mathcal Q}^i_t(x)$ of the cost-to-go function is $L_{t+1}$-Lipschitz continuous for all iterations $i\in\mathbb N$. So consider the forward step subproblem for node $n = (t,k)$ with $t \ge 2$ in iteration $i$:
$$\min_{x_t\in X}\left\{F_t(x^i_{t-1}) + L_t\|x_t - w_{t,k}\| + \underline{\mathcal Q}^i_t(x_t)\right\} = F_t(x^i_{t-1}) + \min_{x_t\in X}\left\{L_t\|x_t - w_{t,k}\| + \underline{\mathcal Q}^i_t(x_t)\right\}. \tag{35}$$
Note that by the $L_{t+1}$-Lipschitz continuity of $\underline{\mathcal Q}^i_t$,
$$L_t\|x_t - w_{t,k}\| + \underline{\mathcal Q}^i_t(x_t) \ge \underline{\mathcal Q}^i_t(w_{t,k}) + (L_t - L_{t+1})\,\|x_t - w_{t,k}\|,$$
which, together with the fact that $L_{t+1} < L_t$, implies that $x_t = w_{t,k}$ is the unique optimal solution to the forward step problem (35). The above argument also works for any node in the stage $t = 1$ by simply removing the constant term $F_t(x^i_{t-1})$ in the nodal problem (35).
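The key inequality here, that a penalized objective with coefficient $L_t$ larger than the Lipschitz constant $L_{t+1}$ of the remaining term is minimized exactly at the penalty center, can be illustrated numerically. This is our own sketch; the random piecewise-linear function $Q$ below merely stands in for $\underline{\mathcal Q}^i_t$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, Lt, Lt1 = 3, 2.0, 1.0          # requires L_{t+1} < L_t
w = rng.normal(size=d)            # stands in for the point w_{t,k}

# A random L_{t+1}-Lipschitz function: the max of affine pieces whose
# gradient norms are exactly L_{t+1}.
G = rng.normal(size=(8, d))
G *= Lt1 / np.linalg.norm(G, axis=1, keepdims=True)
b = rng.normal(size=8)

def Q(x):
    return np.max(G @ x + b)

def obj(x):
    # Forward-step objective, without the constant F_t term.
    return Lt * np.linalg.norm(x - w) + Q(x)

# Lipschitz continuity gives Q(x) >= Q(w) - L_{t+1}*||x - w||, hence
# obj(x) >= Q(w) + (L_t - L_{t+1})*||x - w||: x = w is the unique minimizer.
for _ in range(10000):
    x = w + rng.normal(size=d)
    assert obj(x) >= obj(w) + (Lt - Lt1) * np.linalg.norm(x - w) - 1e-9
```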


Now we define over- and under-approximations of the value functions for the purpose of this proof. For node $n = (t,k)$, let
$$\underline{Q}^i_{t,k}(x) := \max_{1\le j\le i}\left\{0,\; \mathcal C^j_{t,k}(x \mid \hat\lambda^j_{t,k}, 0, \underline v^j_{t,k})\right\},$$
and
$$\overline{Q}^i_{t,k}(x) := \mathrm{conv}_{1\le j\le i}\left\{\bar v^j_{t,k} + L_t\|x^j_{t-1} - x\|\right\},$$
where for each $j$, by formula (16),
$$\underline v^j_{t,k} := \max_{\|\lambda\|\le L_t}\;\min_{z,x\in X}\left\{F_{t-1}(z) + \langle\lambda, x^j_{t-1} - z\rangle + L_t\|x - w_{t,k}\| + \underline{\mathcal Q}^j_t(x)\right\}$$
$$= \max_{\|\lambda\|\le L_t}\;\min_{z\in X}\left\{F_{t-1}(z) + \langle\lambda, x^j_{t-1} - z\rangle\right\} + \min_{x\in X}\left\{L_t\|x - w_{t,k}\| + \underline{\mathcal Q}^j_t(x)\right\}$$
$$= F_{t-1}(x^j_{t-1}) + \underline{\mathcal Q}^j_t(w_{t,k}),$$
with $-\hat\lambda^j_{t,k} \in \partial F_{t-1}(x^j_{t-1})$ being an optimal solution to the outer maximization problem, and by formula (18),
$$\bar v^j_{t,k} := F_{t-1}(z^j_{t,k}) + L_t\|x^j_{t-1} - z^j_{t,k}\| + \overline{\mathcal Q}^j_t(x^j_{t,k}) = F_{t-1}(x^j_{t-1}) + \overline{\mathcal Q}^j_t(x^j_{t,k}).$$
The last equalities for $\underline v^j_{t,k}$ and $\bar v^j_{t,k}$ are due to the $L_t$-Lipschitz continuity of $F_{t-1}$. So we have, by the monotonicity of the under- and over-approximations, that for each $j \le i$,
$$\underline{Q}^i_{t,k}(x) = \max_{1\le j\le i}\left\{0,\; F_{t-1}(x^j_{t-1}) + \langle\hat\lambda^j_{t,k}, x^j_{t-1} - x\rangle + \underline{\mathcal Q}^j_t(w_{t,k})\right\} \le \max_{1\le j\le i}\left\{0,\; F_{t-1}(x^j_{t-1}) + \langle\hat\lambda^j_{t,k}, x^j_{t-1} - x\rangle\right\} + \underline{\mathcal Q}^i_t(w_{t,k}),$$
and
$$\overline{Q}^i_{t,k}(x) = \mathrm{conv}_{1\le j\le i}\left\{F_{t-1}(x^j_{t-1}) + L_t\|x^j_{t-1} - x\| + \overline{\mathcal Q}^j_t(w_{t,k})\right\} \ge \mathrm{conv}_{1\le j\le i}\left\{F_{t-1}(x^j_{t-1}) + L_t\|x^j_{t-1} - x\|\right\} + \overline{\mathcal Q}^i_t(w_{t,k}).$$
Therefore,
$$\overline{Q}^i_{t,k}(x) - \underline{Q}^i_{t,k}(x) \ge \overline{\mathcal Q}^i_t(w_{t,k}) - \underline{\mathcal Q}^i_t(w_{t,k}) + \mathrm{conv}_{1\le j\le i}\left\{F_{t-1}(x^j_{t-1}) + L_t\|x^j_{t-1} - x\|\right\} - \max_{1\le j\le i}\left\{0,\; F_{t-1}(x^j_{t-1}) + \langle\hat\lambda^j_{t,k}, x^j_{t-1} - x\rangle\right\}. \tag{36}$$
+ λ̂t,k , xt−1 − x}. (36)

Thus we have $\overline{Q}^i_{t,k}(w_{t-1,k'}) - \underline{Q}^i_{t,k}(w_{t-1,k'}) \ge \overline{\mathcal Q}^i_t(w_{t,k}) - \underline{\mathcal Q}^i_t(w_{t,k})$ for any $k' = 1,\dots,K_{t-1}$, since the last two terms on the right-hand side of (36) are over- and under-approximations of the function $F_{t-1}$, respectively. Moreover, note that $x^j_{t-1} = w_{t-1,k'}$ for some $k' = 1,\dots,K_{t-1}$, as it is the unique solution in the forward step. By Lemma 9, whenever the node $n' = (t-1,k')$ is never sampled up to iteration $i$, we further have
$$\overline{Q}^i_{t,k}(w_{t-1,k'}) - \underline{Q}^i_{t,k}(w_{t-1,k'}) > \frac{3\varepsilon}{2(T-1)} + \overline{\mathcal Q}^i_t(w_{t,k}) - \underline{\mathcal Q}^i_t(w_{t,k}).$$

Recall the definitions (14) and (17): for any $x \in X$,
$$\underline{\mathcal Q}^i_{t-1}(x) = \max_{1\le j\le i}\left\{0,\; \frac{1}{K_t}\sum_{k=1}^{K_t}\mathcal C^j_{t,k}(x \mid \hat\lambda^j_{t,k}, 0, \underline v^j_{t,k})\right\} \le \frac{1}{K_t}\sum_{k=1}^{K_t}\underline{Q}^i_{t,k}(x),$$
and
$$\overline{\mathcal Q}^i_{t-1}(x) = \mathrm{conv}_{1\le j\le i}\left\{\frac{1}{K_t}\sum_{k=1}^{K_t}\left(\bar v^j_{t,k} + L_t\|x^j_{t-1} - x\|\right)\right\} \ge \frac{1}{K_t}\sum_{k=1}^{K_t}\overline{Q}^i_{t,k}(x).$$
Consequently, for any $k' = 1,\dots,K_{t-1}$,
$$\overline{\mathcal Q}^i_{t-1}(w_{t-1,k'}) - \underline{\mathcal Q}^i_{t-1}(w_{t-1,k'}) \ge \frac{1}{K_t}\sum_{k=1}^{K_t}\left[\overline{\mathcal Q}^i_t(w_{t,k}) - \underline{\mathcal Q}^i_t(w_{t,k})\right],$$
and in addition, for any node $n' = (t-1,k')$ not sampled up to iteration $i$,
$$\overline{\mathcal Q}^i_{t-1}(w_{t-1,k'}) - \underline{\mathcal Q}^i_{t-1}(w_{t-1,k'}) > \frac{3\varepsilon}{2(T-1)} + \frac{1}{K_t}\sum_{k=1}^{K_t}\left[\overline{\mathcal Q}^i_t(w_{t,k}) - \underline{\mathcal Q}^i_t(w_{t,k})\right].$$


Therefore, if the iteration index $i \le \frac{1}{3}|W_t|$ for all $t = 1,\dots,T-1$, then there are $K_t - i \ge \frac{2}{3}|W_t|$ nodes not sampled in stage $t$, which implies
$$\frac{1}{K_{t-1}}\sum_{k'=1}^{K_{t-1}}\left[\overline{\mathcal Q}^i_{t-1}(w_{t-1,k'}) - \underline{\mathcal Q}^i_{t-1}(w_{t-1,k'})\right] > \frac{\varepsilon}{T-1} + \frac{1}{K_t}\sum_{k=1}^{K_t}\left[\overline{\mathcal Q}^i_t(w_{t,k}) - \underline{\mathcal Q}^i_t(w_{t,k})\right].$$
Consequently, $\overline{\mathcal Q}^i_r - \underline{\mathcal Q}^i_r > (T-1)\cdot\frac{\varepsilon}{T-1} = \varepsilon$. Therefore, if $\mathrm{UpperBound} - \mathrm{LowerBound} = \overline{\mathcal Q}^i_r - \underline{\mathcal Q}^i_r \le \varepsilon$ in iteration $i$, then
$$i > \frac{1}{3}\min_{t=1,\dots,T-1}|W_t| \ge \frac{1}{3}\,\frac{d(d-2)\sqrt{\pi}}{d-1}\,\frac{\Gamma(d/2+1/2)}{\Gamma(d/2+1)}\left(\frac{DL(T-1)}{8\varepsilon}\right)^{(d-2)/2}.$$
This completes the proof. □




B Problem classes with exact penalization

In this section, we discuss the problem classes that allow an exact penalty reformulation, as stated in Assumption 2. A penalty function $\psi:\mathbb R^d\to\mathbb R_+$ is said to be sharp if $\psi(x) \ge c\,\|x\|$ for all $x \in V \subset \mathbb R^d$, for some open neighborhood $V \ni 0$ and some positive scalar $c > 0$.
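To make the definition concrete, here is a small numeric illustration of ours (not from the paper): the 1-norm is sharp with $c = 1$, while the squared Euclidean norm, although a valid penalty function, is not sharp, since $\psi(x)/\|x\| \to 0$ near the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample points in a small punctured neighborhood of the origin.
pts = rng.uniform(-1e-3, 1e-3, size=(10000, 3))
pts = pts[np.linalg.norm(pts, axis=1) > 1e-9]
norms = np.linalg.norm(pts, axis=1)

# psi(x) = ||x||_1 is sharp: ||x||_1 >= ||x||_2 always, so c = 1 works.
assert (np.abs(pts).sum(axis=1) / norms).min() >= 1.0

# psi(x) = ||x||_2^2 vanishes only at 0, but it is NOT sharp:
# psi(x) / ||x|| = ||x|| -> 0 as x -> 0, so no c > 0 can satisfy
# psi(x) >= c * ||x|| on any neighborhood of the origin.
assert (norms ** 2 / norms).max() < 2e-3
```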

B.1 Convex problems with interior points

For convex problems, the Slater condition implies that the intersection of the domain $\mathrm{dom}(\sum_{n\in\mathcal N} f_n)$ and the feasible sets $\prod_{n\in\mathcal N}\mathcal F_n$ has a non-empty interior. We have the following proposition.
Proposition 8 Suppose the problem (1) is convex and satisfies the Slater condition. Then for any sharp penalty functions $\psi_n$, there exist $\sigma_n > 0$ such that the penalty reformulation is exact.

Proof Consider a perturbation vector $w = (w_n)_{n\in\mathcal N}$ such that $w_n \in X_{a(n)} - X_{a(n)}$ for each $n\in\mathcal N$, and define the perturbation function
$$\tau(w) := \min_{(z_n,x_n,y_n)\in X_{a(n)}\times\mathcal F_n}\left\{\sum_{n\in\mathcal N} p_n f_n(z_n, y_n, x_n)\;\middle|\; w_n = x_{a(n)} - z_n,\ \forall\, n\in\mathcal N\right\}.$$
The function $\tau$ is convex and $v^{\mathrm{prim}} = \tau(0)$ by definition. By the Slater condition, $0\in\mathrm{int}(\mathrm{dom}(\tau))$, so there exists a vector $\lambda$ such that $\tau(w) \ge \tau(0) + \langle\lambda, w\rangle$ for all perturbations $w$. Since the $\psi_n$ are sharp, there exist $\sigma_n > 0$ such that $\sum_{n\in\mathcal N}\sigma_n\psi_n(w_n) + \langle\lambda, w\rangle > 0$ for all $w \ne 0$. Consequently, the penalty reformulation is exact, since $v^{\mathrm{reg}} = \min_w \{\tau(w) + \sum_{n\in\mathcal N}\sigma_n\psi_n(w_n)\}$ and all optimal solutions must satisfy $w_n = x_{a(n)} - z_n = 0$ for all $n\in\mathcal N$. □



B.2 Problems with finite state spaces

We say a problem (2) has finite state spaces if the state spaces Xn are finite sets for all
nodes n. Such problems appear in multistage integer programming [38], or when the
original state spaces can be approximated through finite ones [19, 39]. The following
proposition shows the penalty reformulation is exact whenever the state spaces are
finite.

Proposition 9 For any penalty functions $\psi_n$, $n\in\mathcal N$, if the state spaces are finite, then there exist finite $\sigma_n > 0$ such that the penalty reformulation (7) is exact.

Proof Let $d_n := \min_{x\ne z\in X_{a(n)}}|\psi_n(x-z)|$ for each $n\in\mathcal N$. Since $\psi_n$ is a penalty function and the state space $X_n$ is finite, we know $d_n > 0$. Define $c$ as
$$c := \min_{(z_n,y_n,x_n)\in X_{a(n)}\times\mathcal F_n}\ \sum_{n\in\mathcal N} p_n f_n(z_n, y_n, x_n). \tag{37}$$
Since (37) is a relaxation of the original problem (1), obtained by ignoring the coupling constraints $z_n = x_{a(n)}$, we have $c \le v^{\mathrm{prim}}$. We choose $\sigma_n = 1 + (v^{\mathrm{prim}} - c)/(p_n d_n)$ for all $n\in\mathcal N$. Now let $(x_n, y_n, z_n)_{n\in\mathcal N}$ be an optimal solution to the regularized problem (4). If there exists $x_{a(m)} \ne z_m$ for some $m \ne r$, then $p_m\sigma_m\psi_m(x_{a(m)} - z_m) > v^{\mathrm{prim}} - c$. Consequently,
$$v^{\mathrm{reg}} \ge c + \sum_{n\in\mathcal N} p_n\sigma_n\psi_n(x_{a(n)} - z_n) \ge c + p_m\sigma_m\psi_m(x_{a(m)} - z_m) > c + v^{\mathrm{prim}} - c = v^{\mathrm{prim}}.$$
This is a contradiction, since $v^{\mathrm{reg}} \le v^{\mathrm{prim}}$. Thus, any optimal solution to the reformulation (7) must have $x_{a(n)} = z_n$ for all $n \ne r$, which means the penalty reformulation is exact. □
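The choice of $\sigma_n$ in this proof is entirely constructive, so it can be traced on a toy instance. The sketch below is our own hypothetical two-node example (all names and numbers are our choices, not from the paper): a root node chooses a state, one child keeps a local copy, and the penalty factor from the proof recovers the coupled optimum.

```python
import itertools

# Hypothetical instance: root r chooses state x in X; its single child m
# (probability p_m = 1) keeps a local copy z of x.
X = [0, 1, 2]                        # finite state space
f_r = {0: 3.0, 1: 1.0, 2: 0.0}       # root-node cost of x
f_m = {0: 0.0, 1: 0.5, 2: 5.0}       # child-node cost of z
p_m = 1.0
psi = abs                            # penalty function: psi(w) = 0 iff w = 0

v_prim = min(f_r[x] + f_m[x] for x in X)            # coupled problem, z = x
c = min(f_r.values()) + min(f_m.values())           # the relaxation (37)
d_m = min(psi(x - z) for x, z in itertools.product(X, X) if x != z)
sigma = 1 + (v_prim - c) / (p_m * d_m)              # the choice in the proof

# Penalized problem: min f_r(x) + f_m(z) + sigma * p_m * psi(x - z).
val, x_opt, z_opt = min(
    (f_r[x] + f_m[z] + sigma * p_m * psi(x - z), x, z)
    for x, z in itertools.product(X, X)
)
assert x_opt == z_opt        # exactness: the copy constraint is recovered
assert val == v_prim
```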


B.3 Problems defined by mixed-integer linear functions

The problem (1) is said to be defined by mixed-integer linear functions if all the feasible sets $\mathcal F_n$ and the epigraphs $\mathrm{epi}\, f_n$ are representable by mixed-integer variables and non-strict linear inequalities with rational coefficients. Recall that by Assumption 1, the primal problem is feasible and $v^{\mathrm{prim}} > -\infty$. We have the following proposition on the exact penalty reformulation.

Proposition 10 ([13], Theorem 5) If problem (1) is defined by mixed-integer linear functions and the penalty functions $\psi_n$ are sharp for all $n\in\mathcal N$, then there exist $\sigma_n > 0$ such that the penalty reformulation is exact.


B.4 Problems defined by C1 -functions

The problem (1) is said to be defined by $C^1$-functions if it is defined by functional constraints, using indicator functions, in each node $n\in\mathcal N$:
$$f_n(x_{a(n)}, y_n, x_n) = \begin{cases} f_{n,0}(x_{a(n)}, y_n, x_n), & \text{if } g_{n,i}(x_{a(n)}, y_n, x_n) \le 0,\ i = 1,\dots,I_n,\\ +\infty, & \text{otherwise},\end{cases}$$
with all $f_{n,0}$ and $g_{n,i}$, $i = 1,\dots,I_n$, being continuously differentiable. The Karush-Kuhn-Tucker condition at a feasible point $(x_n, y_n)_{n\in\mathcal N}$ of (1) says that there exist multipliers $\mu_{n,i} \ge 0$, $i = 1,\dots,I_n$, such that
$$\nabla_{x_n,y_n}\left(\sum_{n\in\mathcal N}\Big(f_{n,0}(x_{a(n)}, y_n, x_n) + \sum_{i=1}^{I_n}\mu_{n,i}\, g_{n,i}(x_{a(n)}, y_n, x_n)\Big)\right) = 0,$$
$$\mu_{n,i}\, g_{n,i}(x_{a(n)}, y_n, x_n) = 0,\quad i = 1,\dots,I_n.$$

We have the following proposition on the exactness.

Proposition 11 Suppose the problem (1) is defined by $C^1$-functions and the Karush-Kuhn-Tucker condition holds for every local minimum solution of (1). If the penalty functions $\psi_n$ are sharp for all $n\in\mathcal N$, then there exist $\sigma_n > 0$ such that the penalty reformulation is exact.

We give the proof of Proposition 11 below.

B.4.1 Proof for Proposition 11

We begin by stating a general exact penalization result for problems defined by $C^1$-functions. Consider the following perturbation function:
$$p(u) := \min_{x\in\mathbb R^d}\; f(x,u) \quad \text{s.t.}\quad g_i(x,u)\le 0,\ i=1,\dots,I,\qquad h_j(x,u)=0,\ j=1,\dots,J. \tag{38}$$
Here $u$ is the perturbation vector, and $u = 0$ corresponds to the original primal problem. Let $\psi$ be a penalty function on $\mathbb R^d$ and $\sigma > 0$ a penalty factor. A penalization of the original primal problem $p(0)$ is given by
$$\min_{x\in\mathbb R^d,\, u}\; f(x,u) + \sigma\,\psi(u) \quad \text{s.t.}\quad g_i(x,u)\le 0,\ i=1,\dots,I,\qquad h_j(x,u)=0,\ j=1,\dots,J. \tag{39}$$


Naturally, we may impose a bound on the perturbation, $\|u\| \le R_u$. We assume that $f$, $g_i$, $h_j$ are continuously differentiable in $x$ and $u$ for all $i, j$. Moreover, the compactness in Assumption 1 implies that the feasible region prescribed by the inequality constraints $g_i(x,u)\le 0$ is compact in $x$ for any $u$, i.e., $X = \{x\in\mathbb R^d : \exists\, u,\ \|u\|\le R_u,\ \text{s.t. } g_i(x,u)\le 0,\ i=1,\dots,I\}$ is compact. For example, some of the inequalities may be bounds on the variables, such as $\|x\|_\infty \le 1$. We will show that there exists a penalty factor $\sigma > 0$ such that any optimal solution to (39) is feasible for (38). We first characterize a property of the perturbation function $p(u)$.

Lemma 10 The perturbation function p(u) is lower semicontinuous.

Proof Let $X(u)\subset X$ denote the feasible set in $x$, which depends on $u$. The minimum in the definition is well defined for every $u$ due to the compactness of $X(u)$.
We show that $p(u)$ is lower semicontinuous (lsc) by showing $\liminf_{v\to u} p(v) \ge p(u)$ for any $u$. Assume for contradiction that there exist $\varepsilon > 0$ and a sequence $\{v_k\}_{k=1}^\infty$ such that $v_k\to u$ and $p(v_k) \le p(u) - \varepsilon$ for all $k$. Let $x_k\in\arg\min_{x\in X(v_k)} f(x, v_k)$, so that $p(v_k) = f(x_k, v_k)$. Since $X$ is compact, there exist a subsequence $\{x_{k_j}\}$ and $z\in X$ such that $x_{k_j}\to z$ as $j\to\infty$. Then by continuity of $f$, $f(z,u) = \lim_{j\to\infty} f(x_{k_j}, v_{k_j}) \le p(u) - \varepsilon$. This contradicts the definition of $p(u)$, since $p(u) = \min_{x\in X(u)} f(x,u) \le f(z,u) \le p(u) - \varepsilon$. Therefore $p(u)$ is lsc. □
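A perturbation function of this kind can be lower semicontinuous without being continuous. The toy instance below is our own illustration of this gap: $f(x,u) = x$ over $X(u) = \{x\in[-1,1] : -x\,u \le 0\}$, so $p$ jumps upward just to the right of $u = 0$ while still satisfying $\liminf_{v\to 0} p(v) \ge p(0)$.

```python
import numpy as np

# p(u) = min { x : x in [-1, 1], -x * u <= 0 }. For u > 0 the feasible set
# is [0, 1] and p(u) = 0; for u <= 0 it contains -1 and p(u) = -1. So p
# jumps upward to the right of u = 0, yet liminf_{v->0} p(v) = -1 >= p(0):
# p is lsc at 0 but not continuous there.
def p(u):
    grid = np.arange(-1000, 1001) / 1000.0      # dense grid over [-1, 1]
    feasible = grid[grid * u >= 0]
    return feasible.min()

assert p(0.0) == -1.0
assert p(-1e-9) == -1.0
assert p(1e-9) == 0.0                            # not continuous at u = 0
assert min(p(-1e-9), p(1e-9)) >= p(0.0)          # lower semicontinuous at 0
```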


Now we give the theorem of exact penalization for problems defined by $C^1$-functions.

Proposition 12 If the Karush-Kuhn-Tucker condition is satisfied at every local minimum solution of (38), then the penalty reformulation (39) is exact for some finite $\sigma > 0$.

Proof Let $X(u)$ denote the feasible region in $x$ defined by the constraints $g_i(x,u)\le 0$, $i=1,\dots,I$, and $h_j(x,u)=0$, $j=1,\dots,J$. Then $X(u)$ is compact for any $u$ by the continuity of the constraint functions. We show that for every optimal solution $x_0\in X(0)$, there exist a neighborhood $V(x_0)\ni x_0$ in the $x$ space, a neighborhood $U(x_0)\ni u=0$ in the $u$ space, and a constant $L(x_0) > 0$, such that for all $x\in V(x_0)$ and $u\in U(x_0)$ we have $f(x,u)\ge f(x_0,0) - L(x_0)\cdot\|u\|$. We then use this fact, together with the compactness of $X(0)$, to show the existence of an exact penalization. In this proof, little-o notation is used for brevity, i.e., $o(a)$ denotes a function $b(a)$ such that $\lim_{a\to 0}|b(a)|/\|a\| = 0$.
Pick any optimal solution $x_0\in X(0)$. By definition, it is also a local minimum solution. By hypothesis, the KKT condition is satisfied at $x_0$; that is, there exist $\lambda_i\ge 0$, $i=1,\dots,I$, and $\mu_j\in\mathbb R$, $j=1,\dots,J$, such that
$$\nabla_x f(x_0,0) + \sum_{i=1}^I\lambda_i\nabla_x g_i(x_0,0) + \sum_{j=1}^J\mu_j\nabla_x h_j(x_0,0) = 0,$$
$$h_j(x_0,0) = 0,\ j=1,\dots,J,\qquad g_i(x_0,0)\le 0,\quad \lambda_i\cdot g_i(x_0,0) = 0,\ i=1,\dots,I. \tag{40}$$


Since the $h_j$'s are continuously differentiable and $h_j(x_0,0) = 0$, we have
$$\langle\nabla_x h_j(x_0,0),\, x - x_0\rangle + \langle\nabla_u h_j(x_0,0),\, u\rangle + o(\|x-x_0\| + \|u\|) = 0,\quad j = 1,\dots,J. \tag{41}$$
Let $A\subseteq\{1,\dots,I\}$ denote the set of active inequality constraints. Then similarly we have
$$\langle\nabla_x g_i(x_0,0),\, x - x_0\rangle + \langle\nabla_u g_i(x_0,0),\, u\rangle + o(\|x-x_0\| + \|u\|)\le 0,\quad i\in A. \tag{42}$$

For any $i\notin A$, by the continuity of $g_i$, there exist neighborhoods $W_i$ of $x_0$ and $U_i$ of $u=0$ such that for any $(x,u)\in W_i\times U_i$, the constraint $g_i(x,u) < 0$ remains inactive. Now, from (40), (41), (42), and the continuous differentiability of $f$, we have
$$\begin{aligned} f(x,u) - f(x_0,0) &= \langle\nabla_x f(x_0,0),\, x-x_0\rangle + \langle\nabla_u f(x_0,0),\, u\rangle + o(\|x-x_0\|+\|u\|)\\ &= \Big\langle -\sum_{i\in A}\lambda_i\nabla_x g_i(x_0,0) - \sum_{j=1}^J\mu_j\nabla_x h_j(x_0,0),\ x-x_0\Big\rangle + \langle\nabla_u f(x_0,0),\, u\rangle + o(\|x-x_0\|+\|u\|)\\ &\ge \Big\langle\nabla_u f(x_0,0) + \sum_{i\in A}\lambda_i\nabla_u g_i(x_0,0) + \sum_{j=1}^J\mu_j\nabla_u h_j(x_0,0),\ u\Big\rangle + o(\|x-x_0\|+\|u\|)\\ &> -L(x_0)\cdot\|u\| + o(\|x-x_0\|+\|u\|), \end{aligned}$$
where $L(x_0) := \big\|\nabla_u f(x_0,0) + \sum_{i\in A}\lambda_i\nabla_u g_i(x_0,0) + \sum_{j=1}^J\mu_j\nabla_u h_j(x_0,0)\big\| + 1 > 0$. By the definition of the little-o notation, there exist neighborhoods $V(x_0)\subset\cap_{i\notin A} W_i$ with $x_0\in V(x_0)$, and $U(x_0)\subset\cap_{i\notin A} U_i$ with $0\in U(x_0)$, such that
$$f(x,u) - f(x_0,0) \ge -L(x_0)\cdot\|u\|,\quad \forall\, (x,u)\in V(x_0)\times U(x_0).$$
f (x, u) − f (x0 , 0) ≥ −L(x0 ) · u , ∀ (x, u) ∈ V (x0 ) × U (x0 ).

Now, let $X^{\mathrm{opt}}(0)$ denote the set of optimal solutions in $x$ when $u = 0$. Note that $X^{\mathrm{opt}}(0)\subset X(0)$ is closed due to the continuity of $f$, $g_i$, $h_j$, hence compact. The collection of open sets $\{V(x)\}_{x\in X^{\mathrm{opt}}(0)}$ covers $X^{\mathrm{opt}}(0)$. By compactness, there exists a finite subcollection $\{V(x_k)\}_{k=1}^K$ such that $X^{\mathrm{opt}}(0)\subset\cup_{k=1}^K V(x_k) =: V$. Let $L := \max_{k=1,\dots,K} L(x_k)$ and $U := \cap_{k=1}^K U(x_k)$. Let $f^*$ denote the optimal value for $u = 0$. Then we have
$$f(x,u)\ge f^* - L\cdot\|u\|,\quad \forall\,(x,u)\in V\times U.$$

To show the inequality for $x\notin V$, define $\tilde p(u) := \min_{x\in X(u)\setminus V} f(x,u)$. Note that $\tilde p(0) > f^*$ by the definition of $X^{\mathrm{opt}}(0)$. By Lemma 10, $\tilde p(u)$ is lower semicontinuous, so there exists a neighborhood $U'$ of $0$ such that $\tilde p(u) > f^*$ for all $u\in U'$. Therefore, for all $u\in U\cap U'$, we have $f(x,u)\ge f^* - L\cdot\|u\|$. Finally,


we can show that the penalization is exact. Since $\psi$ is sharp, there exist an open set $\tilde U\subset U\cap U'$ and a positive constant $c > 0$ such that
$$\psi(u)\ge c\,\|u\|\quad \text{on } \tilde U.$$
Let $M := \min_{u\in\bar B_{R_u}(0)\setminus\tilde U}\tilde p(u) > f^*$ and $m := \min_{u\in\bar B_{R_u}(0)\setminus\tilde U}\psi(u) > 0$, where $m > 0$ because $\psi$ is a penalty function. Let $\sigma = (M - f^*)/m + 1$. We then have $f(x,u) \ge f^* - \sigma\cdot\|u\|$ for all $u\in\bar B_{R_u}(0)\setminus\{0\}$ and $x\in\cup_u X(u)$. As a result, any optimal solution to the penalization (39) must satisfy $u = 0$. □


Note that our problem (7) can be written in the form (39) by letting $u = (x_{a(n)} - z_n)_{n\in\mathcal N}$ and including the duplicating constraints $z_n - x_{a(n)} = 0$, for every $n \ne r$ in $\mathcal N$, among the equality constraints $h_j(x,u) = 0$; the other constraints $g_i(x,u)\le 0$ correspond to the functional constraints in the problem (1). Since the $\psi_n$ are sharp, the aggregate penalty function defined by $\psi(u) = \sum_{n\in\mathcal N} p_n\psi_n(x_{a(n)} - z_n)$ is also sharp. Let $\sigma$ denote the penalty factor in Proposition 12. Proposition 11 then follows by letting $\sigma_n = \sigma/p_n$ for all $n\in\mathcal N$.

References
1. Ahmed, S., Cabral, F.G., da Costa, B.F.P.: Stochastic Lipschitz dynamic programming. Math. Program.
191, 755–793 (2022)
2. Ahmed, S., King, A.J., Parija, G.: A Multi-stage Stochastic Integer Programming Approach for Capac-
ity Expansion under Uncertainty. Journal of Global Optimization 26, 3–24 (2003)
3. Baringo, L., Conejo, A.J.: Risk-Constrained Multi-Stage Wind Power Investment. IEEE Transac-
tions on Power Systems 28(1), 401–411 (2013) https://doi.org/10.1109/TPWRS.2012.2205411. http://
ieeexplore.ieee.org/document/6247489/
4. Basciftci, B., Ahmed, S., Gebraeel, N.: Adaptive Two-stage Stochastic Programming with an Appli-
cation to Capacity Expansion Planning. arXiv:1906.03513 [math] (2019)
5. Baucke, R., Downward, A., Zakeri, G.: A deterministic algorithm for solving multistage stochastic
programming problems. Optim. Online p. 25 (2017)
6. Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numerische
Mathematik 4, 238–252 (1962)
7. Birge, J.R.: Decomposition and Partitioning Methods for Multistage Stochastic Linear Programs. Oper.
Res. 33(5), 989–1007 (1985). https://doi.org/10.1287/opre.33.5.989
8. Bradley, S.P., Crane, D.B.: A Dynamic Model for Bond Portfolio Management. Manage. Sci. 19(2),
139–151 (1972). https://doi.org/10.1287/mnsc.19.2.139
9. Chen, Z.L., Li, S., Tirupati, D.: A scenario-based stochastic programming approach for technology
and capacity planning. Computers & Operations Research 29(7), 781–806 (2002) https://doi.org/10.
1016/S0305-0548(00)00076-9. http://linkinghub.elsevier.com/retrieve/pii/S0305054800000769
10. Chen, Z.L., Powell, W.B.: Convergent Cutting-Plane and Partial-Sampling Algorithm for Multistage
Stochastic Linear Programs with Recourse. J. Optim. Theory Appl. 102(3), 497–524 (1999). https://
doi.org/10.1023/A:1022641805263
11. Dantzig, G.B., Wolfe, P.: Decomposition Principle for Linear Programs. Oper. Res. 8(1), 101–111
(1960). https://doi.org/10.1287/opre.8.1.101
12. Escudero, L.F., Kamesam, P.V., King, A.J., Wets, R.J.B.: Production planning via scenario modelling.
Annals of Operations Research 43, 309–335 (1993)
13. Feizollahi, M.J., Ahmed, S., Sun, A.: Exact augmented lagrangian duality for mixed integer linear
programming. Math. Program. 161(1–2), 365–387 (2017)
14. Flach, B., Barroso, L., Pereira, M.: Long-term optimal allocation of hydro generation for a price-maker
company in a competitive market: latest developments and a stochastic dual dynamic programming


approach. IET Generation, Transmission & Distribution 4(2), 299 (2010). https://doi.org/10.1049/iet-
gtd.2009.0107
15. Girardeau, P., Leclere, V., Philpott, A.B.: On the Convergence of Decomposition Methods for Multi-
stage Stochastic Convex Programs. Math. Oper. Res. 40(1), 130–145 (2015). https://doi.org/10.1287/
moor.2014.0664
16. Glassey, C.R.: Nested Decomposition and Multi-Stage Linear Programs. Manage. Sci. 20(3), 282–292
(1973). https://doi.org/10.1287/mnsc.20.3.282
17. Guigues, V.: Convergence analysis of sampling-based decomposition methods for risk-averse multi-
stage stochastic convex programs. SIAM J. Optim. 26(4), 2468–2494 (2016)
18. Guigues, V.: Dual dynamic programing with cut selection: Convergence proof and numerical experi-
ments. Eur. J. Oper. Res. 258(1), 47–57 (2017)
19. Hjelmeland, M.N., Zou, J., Helseth, A., Ahmed, S.: Nonconvex Medium-Term Hydropower Scheduling
by Stochastic Dual Dynamic Integer Programming. IEEE Transactions on Sustainable Energy 10(1),
481–490 (2019) https://doi.org/10.1109/TSTE.2018.2805164. http://ieeexplore.ieee.org/document/
8289405/
20. Ho, J.K., Manne, A.S.: Nested decomposition for dynamic models. Math. Program. 6(1), 121–140
(1974). https://doi.org/10.1007/BF01580231
21. Kusy, M.I., Ziemba, W.T.: A Bank Asset and Liability Management Model. Oper. Res. 34(3), 356–376
(1986). https://doi.org/10.1287/opre.34.3.356
22. Lan, G.: Complexity of stochastic dual dynamic programming. Mathematical Programming pp. 1–38
(2020)
23. Lara, C.L., Mallapragada, D.S., Papageorgiou, D.J., Venkatesh, A., Grossmann, I.E.: Deterministic
electric power infrastructure planning: Mixed-integer programming model and nested decomposition
algorithm. European Journal of Operational Research 271(3), 1037–1054 (2018) https://doi.org/10.
1016/j.ejor.2018.05.039. http://linkinghub.elsevier.com/retrieve/pii/S0377221718304466
24. Linowsky, K., Philpott, A.B.: On the Convergence of Sampling-Based Decomposition Algorithms for
Multistage Stochastic Programs. J. Optim. Theory Appl. 125(2), 349–366 (2005). https://doi.org/10.
1007/s10957-004-1842-z
25. Louveaux, F.V.: A Solution Method for Multistage Stochastic Programs with Recourse with Application
to an Energy Investment Problem. Oper. Res. 28(4), 889–902 (1980). https://doi.org/10.1287/opre.28.
4.889
26. Mulvey, J.M., Vladimirou, H.: Stochastic Network Programming for Financial Planning Problems.
Manage. Sci. 38(11), 1642–1664 (1992). https://doi.org/10.1287/mnsc.38.11.1642
27. Nesterov, Y.: Lectures on Convex Optimization, vol. 137. Springer, Berlin (2018)
28. Pereira, M.V.F., Pinto, L.M.V.G.: Stochastic Optimization of a Multireservoir Hydroelectric Sys-
tem: A Decomposition Approach. Water Resour. Res. 21(6), 779–792 (1985). https://doi.org/10.1029/
WR021i006p00779
29. Pereira, M.V.F., Pinto, L.M.V.G.: Multi-stage stochastic optimization applied to energy planning. Math.
Program. 52(1–3), 359–375 (1991). https://doi.org/10.1007/BF01582895
30. Philpott, A., Guan, Z.: On the convergence of stochastic dual dynamic programming and related
methods. Operations Research Letters 36(4), 450–455 (2008) https://doi.org/10.1016/j.orl.2008.01.
013. http://linkinghub.elsevier.com/retrieve/pii/S0167637708000308
31. Philpott, A., Wahid, F., Bonnans, F.: MIDAS: A Mixed Integer Dynamic Approximation Scheme.
Mathematical Programming 181, 19–50 (2020)
32. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer Science & Business Media,
Berlin (2009)
33. Shapiro, A.: Analysis of stochastic dual dynamic programming method. European J. Oper. Res. 209(1),
63–72 (2011) https://doi.org/10.1016/j.ejor.2010.08.007. http://linkinghub.elsevier.com/retrieve/pii/
S0377221710005448
34. Shapiro, A., Tekaya, W., da Costa, J.P., Soares, M.P.: Risk neutral and risk averse Stochastic Dual
Dynamic Programming method. European J. Oper. Res. 224(2), 375–391 (2013) https://doi.org/10.
1016/j.ejor.2012.08.022. http://linkinghub.elsevier.com/retrieve/pii/S0377221712006455
35. Slyke, R.M.V., Wets, R.: L-Shaped Linear Programs with Applications to Optimal Control and Stochas-
tic Programming. SIAM J. Appl. Math. 17(4), 638–663 (1969). http://www.jstor.org/stable/2099310
36. Takriti, S., Krasenbrink, B., Wu, L.S.Y.: Incorporating Fuel Constraints and Electricity Spot Prices
into the Stochastic Unit Commitment Problem. Oper. Res. 48(2), 268–280 (2000). https://doi.org/10.
1287/opre.48.2.268.12379


37. Zou, J., Ahmed, S., Sun, X.A.: Partially Adaptive Stochastic Optimization for Electric Power Genera-
tion Expansion Planning. INFORMS J. Comput. 30(2), 388–401 (2018). https://doi.org/10.1287/ijoc.
2017.0782
38. Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Math. Program. (2018).
https://doi.org/10.1007/s10107-018-1249-5
39. Zou, J., Ahmed, S., Sun, X.A.: Multistage Stochastic Unit Commitment Using Stochastic Dual
Dynamic Integer Programming. IEEE Trans. Power Syst. 34(3), 1814–1823 (2019) https://doi.org/
10.1109/TPWRS.2018.2880996. http://ieeexplore.ieee.org/document/8532315/

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

