https://doi.org/10.1007/s10107-022-01875-8
Received: 8 January 2020 / Accepted: 29 July 2022 / Published online: 20 August 2022
© The Author(s) 2022
Abstract
In this paper, we study multistage stochastic mixed-integer nonlinear programs (MS-
MINLP). This general class of problems encompasses, as important special cases,
multistage stochastic convex optimization with non-Lipschitzian value functions and
multistage stochastic mixed-integer linear optimization. We develop stochastic dual
dynamic programming (SDDP) type algorithms with nested decomposition, determin-
istic sampling, and stochastic sampling. The key ingredient is a new type of cuts based
on generalized conjugacy. Several interesting classes of MS-MINLP are identified,
where the new algorithms are guaranteed to obtain the global optimum without the
assumption of complete recourse. This significantly generalizes the classic SDDP algo-
rithms. We also characterize the iteration complexity of the proposed algorithms. In
particular, for a (T + 1)-stage stochastic MINLP satisfying L-exact Lipschitz regularization with d-dimensional state spaces, to obtain an ε-optimal root node solution, we prove that the number of iterations of the proposed deterministic sampling algorithm is upper bounded by O((2LT/ε)^d), and is lower bounded by O((LT/(4ε))^d) for the general case or by O((LT/(8ε))^{d/2−1}) for the convex case. This shows that the obtained complexity
bounds are rather sharp. It also reveals that the iteration complexity depends polyno-
mially on the number of stages. We further show that the iteration complexity depends
linearly on T , if all the state spaces are finite sets, or if we seek a (T ε)-optimal solution
when the state spaces are infinite sets, i.e. allowing the optimality gap to scale with
T . To the best of our knowledge, this is the first work that reports global optimiza-
tion algorithms as well as iteration complexity results for solving such a large class
of multistage stochastic programs. The iteration complexity study resolves a conjecture by the late Prof. Shabbir Ahmed in the general setting of multistage stochastic mixed-integer optimization.
Xu Andy Sun (corresponding author)
sunx@mit.edu
Shixuan Zhang
szhang483@gatech.edu
123
936 S. Zhang, X. A. Sun
1 Introduction
Stochastic dual dynamic programming… 937
decomposition procedures for deterministic models are developed in [16, 20]. Lou-
veaux [25] first generalized the two-stage L-shaped method to multistage quadratic
problems. Nested Benders decomposition for MS-LP was first proposed in Birge [7]
and Pereira and Pinto [28]. SDDP, the sampling variation of NBD, was first proposed
in [29]. By far the largest application area of SDDP is the energy sector; see e.g. [14, 23, 29, 34].
Recently, SDDP has been extended to SDDiP [38]. It is observed that the cuts generated from Lagrangian relaxation of the nodal problems in an MS-MILP are always tight at the given parent node's state, as long as all the state variables take only binary values and the problem has complete recourse. Based on this fact, the SDDiP algorithm is proved to find an
exact optimal solution in finitely many iterations with probability one. In this way, the
SDDiP algorithm makes it possible to solve nonconvex problems through binarization
of the state variables [19, 39]. In addition, when the value functions of MS-MILP
with general integer state variables are assumed to be Lipschitz continuous, which is a
critical assumption, augmented Lagrangian cuts with an additional reverse norm term
to the linear part obtained via augmented Lagrangian duality are proposed in [1].
The convergence analysis of the SDDP-type algorithms begins with the linear cases
[10, 18, 24, 30, 33], where almost sure finite convergence is established based on the
polyhedral nodal problem structures. For convex problems, if the value functions are
Lipschitz continuous and the state space is compact, asymptotic convergence of the
under-approximation of the value functions leads to asymptotic convergence of the
optimal value and optimal solutions [15, 17]. By constructing over-approximations
of value functions, an SDDP algorithm with deterministic sampling and asymptotic
convergence is proposed for the convex case in [5]. Upon completion of this paper,
we became aware of the recent work [22], which proves iteration complexity upper
bounds for multistage convex programs under the assumption that all the value functions and their under-approximations are Lipschitz continuous. It is shown that
for discounted problems, the iteration complexity depends linearly on the number of
stages. However, the following conjecture (suggested to us by the late Prof. Shab-
bir Ahmed) remains to be resolved, especially for the problems without convexity,
Lipschitz continuity, or discounts:
Conjecture 1 The number of iterations needed for SDDP/SDDiP to find an optimal
first-stage solution grows linearly in terms of the number of stages T , while it may
depend nonlinearly on other parameters such as the diameter D and the dimension d
of the state space.
Our study resolves this conjecture by giving a full picture of the iteration complexity of
SDDP-type algorithms in a general setting of MS-MINLP problems that allow exact
Lipschitz regularization (defined in Sect. 2.3). In the following, we summarize our
contributions.
1.2 Contributions
such that the value functions become Lipschitz continuous. In many cases, the
regularized problem preserves the set of optimal solutions.
2. We use the theory of generalized conjugacy to develop a cut generation scheme. The resulting cuts, referred to as generalized conjugacy cuts, are valid for the value functions of MS-MINLP. Moreover, the generalized conjugacy cuts are shown to be tight for the regularized value functions, and they can be replaced by linear cuts without compromising such tightness when the problem is convex.
3. With the regularization and the generalized conjugacy cuts, we propose three algo-
rithms for MS-MINLP, including nested decomposition for general scenario trees,
SDDP algorithms with random sampling as well as deterministic sampling similar
to [5] for stagewise independent scenario trees.
4. We obtain upper and lower bounds on the iteration complexity for the proposed
SDDP with both sampling methods for MS-MINLP problems. The complexity
bounds show that in general, Conjecture 1 holds if we only seek a (T ε)-optimal
solution, instead of an ε-optimal first-stage solution for a (T + 1)-stage problem,
or when all the state spaces are finite sets.
In addition, this paper contains the following contributions compared with the recent
independent work [22]: (1) We consider a much more general class of problems which
are not necessarily convex. As a result, all the iteration complexity upper bounds of the
algorithms are also valid for these nonconvex problems. (2) We use the technique of
regularization to make the iteration complexity bounds independent of the subproblem
oracles. This is particularly important for the conjecture, since the Lipschitz constants
of the under-approximation of value functions may exceed those of the original value
functions. (3) We propose matching lower bounds on the iteration complexity of the
algorithms and characterize important cases for the conjecture to hold.
This paper is organized as follows. In Sect. 2 we introduce the problem formulation,
regularization of the value functions, and the approximation scheme using generalized
conjugacy. Sect. 3 proposes SDDP algorithms. Sect. 4 investigates upper bounds on the iteration complexity of the proposed algorithms, while Sect. 5 focuses on lower bounds, thereby completing the iteration complexity analysis. We finally
provide some concluding remarks in Sect. 6.
2 Problem formulations
In this section, we first present the extensive and recursive formulations of multistage
optimization. Then we characterize the properties of the value functions, with examples
to show that they may fail to be Lipschitz continuous. With this motivation in mind,
we propose a penalty reformulation of the multistage problem through regularization
of value functions and show that it is equivalent to the original formulation for a
broad class of problems. Finally, we propose generalized conjugacy cuts for under-
approximation of value functions.
2.1 Extensive and recursive formulation
parent node of n, C(n) denote the set of child nodes of n, and T (n) denote the subtree
starting from the node n. Given a node n ∈ N , let t(n) denote the stage that the node
n is in and let T := maxn∈N t(n) denote the last stage of the tree T . A node in the last
stage is called a leaf node, otherwise a non-leaf node. The set of nodes in stage t is
denoted as N (t):={n ∈ N : t(n) = t}. We use the convention that the root node of
the tree is denoted as r ∈ N with t(r ) = 0 so the total number of stages is T + 1. The
parent node of the root node is denoted as a(r ), which is a dummy node for ease of
notation.
For every node n ∈ N , let Fn denote the feasibility set in some Euclidean space of
decision variables (xn , yn ) of the nodal problem at node n. We refer to xn as the state
variable and yn as the internal variable of node n. Denote the image of the projection
of Fn onto the subspace of the variable xn as Xn , which is referred to as the state space.
Let xa(r ) = 0 serve as a dummy parameter and thus Xa(r ) = {0}. The nonnegative
nodal cost function of the problem at node n is denoted as f n (xa(n) , yn , xn ) and is
defined on the set {(z, y, x) : z ∈ Xa(n) , (x, y) ∈ Fn }. We allow f n to take the value
+∞ so indicator functions can be modeled as part of the cost. Let pn > 0 for all
n ∈ N denote the probability that node n on the scenario tree is realized. For the root
node, pr = 1. The transition probability that node m is realized conditional on its
parent node n being realized is given by pnm := pm / pn for all edges (n, m) ∈ E.
The multistage stochastic program considered in this paper is defined in the fol-
lowing extensive form:
v^prim := min_{(x_n, y_n) ∈ F_n, ∀ n ∈ N} Σ_{n ∈ N} p_n f_n(x_{a(n)}, y_n, x_n).  (1)
To ensure that the minimum in problem (1) is well defined and finite, we make the
following very general assumption on f n and Fn throughout the paper.
Assumption 1 For every node n ∈ N, the feasibility set F_n is compact, and the nodal cost function f_n is nonnegative and lower semicontinuous (l.s.c.). The sum Σ_{n ∈ N} p_n f_n is a proper function, i.e., there exist (x_n, y_n) ∈ F_n for all nodes n ∈ N such that Σ_{n ∈ N} p_n f_n(x_{a(n)}, y_n, x_n) < +∞.
Note that the state variable xa(n) only appears in the objective function f n of node
n, not in the constraints. Perhaps the more common way is to allow xa(n) to appear in
the constraints of node n. It is easy to see that any such constraint can be modeled by
an indicator function of (xa(n) , xn , yn ) in the objective f n .
The following proposition presents some basic properties of the value function Q n
under Assumption 1.
Proposition 1 Under Assumption 1, the value function Q_n is lower semicontinuous (l.s.c.) for all n ∈ N. Moreover, for any node n ∈ N,
1. if f_n(z, y, x) is Lipschitz continuous in the first variable z with constant l_n, i.e. |f_n(z, y, x) − f_n(z′, y, x)| ≤ l_n ‖z − z′‖ for any z, z′ ∈ X_{a(n)} and any (x, y) ∈ F_n, then Q_n is also Lipschitz continuous with constant l_n;
2. if X_{a(n)} and F_n are convex sets, and f_n and 𝒬_n are convex functions, then Q_n is also convex.
The proof is given in Sect. A.1.1. When Q_m is l.s.c. for all m ∈ C(n), the sum Σ_{m ∈ C(n)} p_{nm} Q_m is l.s.c. Therefore, the minimum in the definition (2) is well defined, since F_n is assumed to be compact.
If the objective function f n (xa(n) , yn , xn ) is not Lipschitz, e.g., when it involves
an indicator function of xa(n) , or equivalently when xa(n) appears in the constraint
of the nodal problem of Q n (xa(n) ), then the value function Q n may not be Lipschitz
continuous, as is shown by the following examples.
Example 1 Consider the convex nonlinear two-stage problem
The objective function and all constraints are Lipschitz continuous. The optimal objec-
tive value v ∗ = 0, and the unique optimal solution is (x ∗ , z ∗ , w ∗ ) = (0, 0, 0). At the
optimal solution, the inequality constraint is active. Note that the problem can be
equivalently written as v* = min_{0 ≤ x ≤ 1} {x + Q(x)}, where Q(x) is defined on [0, 1] as Q(x) := min{ z : ∃ w ∈ R s.t. (z − 1)² + w² ≤ 1, w = x } = 1 − √(1 − x²), which is not locally Lipschitz continuous at the boundary point x = 1. Therefore, Q(x) is not Lipschitz continuous on [0, 1].
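As a quick numerical illustration (our sketch, not part of the original example), the closed form Q(x) = 1 − √(1 − x²) can be evaluated directly; its difference quotients near the boundary x = 1 grow without bound, which is exactly the failure of Lipschitz continuity:

```python
import math

def Q(x):
    # Closed form of the value function in Example 1:
    # Q(x) = min{ z : (z - 1)^2 + w^2 <= 1, w = x } = 1 - sqrt(1 - x^2).
    return 1.0 - math.sqrt(1.0 - x * x)

# Difference quotients toward x = 1 behave like sqrt(2/h), hence are unbounded,
# so Q is not Lipschitz continuous on [0, 1].
for h in (1e-2, 1e-4, 1e-6):
    print(h, (Q(1.0) - Q(1.0 - h)) / h)
```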
Example 2 Consider the mixed-integer linear two-stage problem
These examples show a major issue with the introduction of value functions Q n ,
namely Q n may fail to be Lipschitz continuous even when the original problem only
has constraints defined by Lipschitz continuous functions. This could lead to failure of
algorithms based on approximation of the value functions. In the next section, we will
discuss how to circumvent this issue without compromise of feasibility or optimality
for a wide range of problems.
The main idea of avoiding failure of cutting plane algorithms in multistage dynamic
programming is to use some Lipschitz continuous envelope functions to replace the
original value functions, which we refer to as regularized value functions.
To begin with, we say a function ψ : R^d → R_+ is a penalty function if ψ(x) = 0 if and only if x = 0, and the diameter of its level set lev_α(ψ) := {x ∈ R^d : ψ(x) ≤ α} approaches 0 as α → 0. In this paper, we focus on penalty functions that are locally Lipschitz continuous, the reason for which will be clear from Proposition 2.
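For instance (an illustrative sketch of the definition), the Euclidean norm ψ(x) = ‖x‖₂ is a locally Lipschitz penalty function: it vanishes exactly at the origin, and its level set lev_α(ψ) is the closed ball of radius α, whose diameter 2α tends to 0 as α → 0:

```python
import math

def psi(x):
    # Candidate penalty function: psi(x) = ||x||_2 for x in R^d.
    return math.sqrt(sum(xi * xi for xi in x))

def level_set_diameter(alpha):
    # lev_alpha(psi) = {x : ||x||_2 <= alpha} is a ball of diameter 2*alpha,
    # which shrinks to a point as alpha -> 0, as the definition requires.
    return 2.0 * alpha
```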
For each node n, we introduce a new variable z n as a local variable of node n and
impose the duplicating constraint xa(n) = z n . This is a standard approach for obtaining
dual variables through relaxation (e.g. [38]). The objective function can then be written
as f n (z n , yn , xn ). Let ψn be a penalty function for node n ∈ N . The new coupling
constraint is relaxed and penalized in the objective function by σn ψn (xa(n) − z n ) for
some σn > 0. Then the DP recursion with penalization becomes
Q_n^R(x_{a(n)}) := min_{(x_n, y_n) ∈ F_n, z_n ∈ X_{a(n)}} { f_n(z_n, y_n, x_n) + σ_n ψ_n(x_{a(n)} − z_n) + Σ_{m ∈ C(n)} p_{nm} Q_m^R(x_n) },  (4)
The regularized value function Q_n^R is σ_n-Lipschitz continuous on X_{a(n)}. Moreover, if the original problem (2) and the penalty functions ψ_n are convex, then Q_n^R(x) is also convex.
The key idea is that by adding a Lipschitz penalty function ψ_n to the nodal problem, we can make Q_n^R(x) Lipschitz continuous even when Q_n(x) is not. The proof is given in Sect. A.1.2. The optimal value of the regularized root nodal problem
v^reg := min_{(x_r, y_r) ∈ F_r} { f_r(x_{a(r)}, y_r, x_r) + Σ_{m ∈ C(r)} p_{rm} Q_m^R(x_r) }  (5)
Definition 1 For any ε > 0, a feasible root node solution (x_r, y_r) ∈ F_r is said to be ε-optimal to the regularized problem (4) if it satisfies f_r(x_{a(r)}, y_r, x_r) + 𝒬_r^R(x_r) ≤ v^reg + ε.
Next we discuss conditions under which v^reg = v^prim and any optimal solution (x_n, y_n)_{n∈N} to the regularized problem (4) is feasible and hence optimal to the original problem (2). Note that by expanding Q_m^R in the regularized problem (4) for all nodes, we obtain the extensive formulation of the regularized problem:

v^reg = min_{(x_n, y_n) ∈ F_n, z_n ∈ X_{a(n)}, n ∈ N} Σ_{n ∈ N} p_n [ f_n(z_n, y_n, x_n) + σ_n ψ_n(x_{a(n)} − z_n) ].  (7)
We refer to problem (7) as the penalty reformulation and make the following assump-
tion on its exactness for the rest of the paper.
Assumption 2 We assume that the penalty reformulation (7) is exact for the given
penalty parameters σn > 0, n ∈ N , i.e., any optimal solution of (7) satisfies z n = xa(n)
for all n ∈ N .
Assumption 2 guarantees that any optimal solution of the regularized extensive formulation (7) is feasible for the original problem (1); since v^reg ≤ v^prim, such a solution is also optimal to the original problem, and hence v^reg = v^prim. Thus the regularized value functions serve as surrogates of the original value functions without compromising feasibility of the optimal solutions. A consequence of Assumption 2 is that the original and regularized value functions coincide at all optimal solutions, the proof of which is given in Sect. A.1.3.
Lemma 1 Under Assumption 2, any optimal solution (x_n, y_n)_{n∈N} to problem (1) satisfies Q_n^R(x_{a(n)}) = Q_n(x_{a(n)}) for all n ≠ r.
We illustrate the regularization on the examples through Fig. 1a and b. In Fig. 1a, the value function Q(x) derived in Example 1 is not Lipschitz continuous at x = 1. With ψ(x) = |x| and σ = 4/3, we obtain the regularized value function, which coincides with the original one on [0, 0.8] and is Lipschitz continuous on the entire interval [0, 1]. In Fig. 1b, the value function Q(x) derived in Example 2 is not continuous at x = 0. With ψ(x) = |x| and σ = 5, we obtain the regularized value function, which coincides with the primal one on {0} ∪ [0.2, 1] and is Lipschitz continuous on the entire interval [0, 1]. In both examples, it can be easily verified that the penalty reformulation is exact and thus preserves the optimal solutions.
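The regularization of Fig. 1a can be reproduced numerically; the following sketch (with ψ(x) = |x| and σ = 4/3 as in Fig. 1a, and an ad hoc grid) approximates Q^R(x) = min_z { Q(z) + σ|x − z| } and confirms that it is σ-Lipschitz on [0, 1], coincides with Q at x = 0.5, and stays strictly below Q at x = 1:

```python
import math

def Q(x):
    # Value function from Example 1: Q(x) = 1 - sqrt(1 - x^2).
    return 1.0 - math.sqrt(1.0 - x * x)

SIGMA = 4.0 / 3.0
N = 400
GRID = [k / N for k in range(N + 1)]  # discretization of the state space [0, 1]

def Q_reg(x):
    # Regularized value function Q^R(x) = min_z { Q(z) + SIGMA * |x - z| },
    # approximated by restricting z to the grid.
    return min(Q(z) + SIGMA * abs(x - z) for z in GRID)

vals = [Q_reg(x) for x in GRID]
max_slope = max(abs(vals[k + 1] - vals[k]) * N for k in range(N))
print("max slope:", max_slope)                # never exceeds SIGMA
print("gap at x=0.5:", Q_reg(0.5) - Q(0.5))   # zero: regularization is exact here
print("values at x=1:", Q_reg(1.0), Q(1.0))   # regularized value is strictly smaller
```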
We comment that Assumption 2 can hold for appropriately chosen penalty factors
in various mixed-integer nonlinear optimization problems, including
Fig. 1 Value functions in Examples 1 and 2
In this part, we first introduce generalized conjugacy cuts for nonconvex functions and
then apply it to the under-approximation of value functions of MS-MINLP.
where v̂ := −Q^Φ(û). Then the following inequality, derived from the generalized Fenchel–Young inequality, is valid for any x ∈ X,
The generalized conjugacy cuts can be used in the setting of an augmented Lagrangian dual [1] with bounded dual variables. For a nodal problem n ∈ N, n ≠ r, and a point x̄ ∈ X_{a(n)}, define Φ_n^x̄(x, u) := −⟨λ, x̄ − x⟩ − ρ ψ_n(x̄ − x), where u := (λ, ρ) ∈ R^{d_n + 1} are parameters. Consider a compact set of parameters U_n = {(λ, ρ) : ‖λ‖_* ≤ l_{n,λ}, 0 ≤ ρ ≤ l_{n,ρ}} with nonnegative bounds l_{n,λ} and l_{n,ρ}, where ‖·‖_* is the dual norm of ‖·‖. Consider the following dual problem

v̂_n := max_{(λ, ρ) ∈ U_n} min_{z ∈ X_{a(n)}} [ Q_n(z) + ⟨λ, x̄ − z⟩ + ρ ψ_n(x̄ − z) ].  (11)
Denote by ẑ_n and (λ̂_n, ρ̂_n) an optimal primal–dual solution of (11). The dual problem (11) can be viewed as choosing (λ̂_n, ρ̂_n) as the value of û in (9), which makes the constant term −Q^Φ(û) as large as possible, and thus makes the generalized conjugacy cut (10) as tight as possible. With this choice of the parameters, a generalized conjugacy cut for Q_n at x̄ is given by

Q_n(x) ≥ C_n^{Φ_n^x̄}(x | λ̂_n, ρ̂_n, v̂_n) = −⟨λ̂_n, x̄ − x⟩ − ρ̂_n ψ_n(x̄ − x) + v̂_n,  ∀ x ∈ X_{a(n)}.  (12)
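For intuition, the dual problem (11) and the cut (12) can be approximated by brute-force grid search in the one-dimensional setting of Example 1 (our sketch; the bounds l_λ = 1, l_ρ = 2 and the grid resolutions are arbitrary choices, and a real implementation would use the backward step subproblem oracle instead):

```python
import math

def Q(x):
    # Value function from Example 1.
    return 1.0 - math.sqrt(1.0 - x * x)

Z = [k / 500 for k in range(501)]  # discretized state space X = [0, 1]

def conjugacy_cut(x_bar, l_lam=1.0, l_rho=2.0):
    # Approximate the dual (11) over a finite set of multipliers:
    #   max_{(lam, rho)} min_{z in Z} Q(z) + lam*(x_bar - z) + rho*|x_bar - z|,
    # then return the generalized conjugacy cut (12) as a function of x.
    best = (-float("inf"), 0.0, 0.0)
    for i in range(-10, 11):
        lam = i * l_lam / 10
        for j in range(11):
            rho = j * l_rho / 10
            v = min(Q(z) + lam * (x_bar - z) + rho * abs(x_bar - z) for z in Z)
            if v > best[0]:
                best = (v, lam, rho)
    v_hat, lam_hat, rho_hat = best
    return lambda x: -lam_hat * (x_bar - x) - rho_hat * abs(x_bar - x) + v_hat

cut = conjugacy_cut(0.5)
print("max violation:", max(cut(z) - Q(z) for z in Z))  # validity: cut stays below Q
print("cut vs Q at x_bar:", cut(0.5), Q(0.5))           # tightness at x_bar
```

Validity holds by construction of v̂ as an inner minimum; tightness at x̄ is attained here because the penalty bound l_ρ exceeds the Lipschitz constant needed locally, mirroring Proposition 3.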
Proposition 3 Given the above definition of (11)–(12), if (x̄_n, ȳ_n)_{n∈N} is an optimal solution to problem (1) and the bound l_{n,ρ} satisfies l_{n,ρ} ≥ σ_n for all nodes n, then for every node n, the generalized conjugacy cut (12) is tight at x̄_n, i.e. Q_n(x̄_n) = C_n^{Φ_n^{x̄_n}}(x̄_n | λ̂_n, ρ̂_n, v̂_n).
The proof is given in Sect. A.1.4. The proposition guarantees that, under Assumption 2,
the generalized conjugacy cuts are able to approximate the value functions exactly at
any state associated to an optimal solution.
In the special case where problem (2) is convex and ψ_n(x) = ‖x‖ for all n ∈ N, the exactness of the generalized conjugacy cut holds even if we set l_{n,ρ} = 0, i.e. the conjugacy cut is linear. To be precise, we begin with the following lemma.
Lemma 2 Let X ⊂ Rd be a convex, compact set. Given a convex, proper, l.s.c. function
Q : X → R ∪ {+∞}, for any x ∈ X , the inf-convolution satisfies
The proof is given in Sect. A.1.5. Next we show the tightness in the convex case, similar to Proposition 3. The proof is given in Sect. A.1.6.
Proposition 4 Suppose (2) is convex and ψ_n(x) = ‖x‖ for all nodes n. Given the above definition of (11)–(12), if (x̄_n, ȳ_n)_{n∈N} is an optimal solution to problem (1) and the bounds satisfy l_{n,λ} ≥ σ_n, l_{n,ρ} = 0 for all nodes n, then for every node n, the generalized conjugacy cut (12) is exact at x̄_n, i.e. Q_n(x̄_n) = C_n^{Φ_n^{x̄_n}}(x̄_n | λ̂_n, ρ̂_n, v̂_n).
In this case, the generalized conjugacy reduces to the usual conjugacy for convex
functions and the generalized conjugacy cut is indeed linear. This enables approxima-
tion of the value function that preserves convexity.
Before we propose the new algorithms, we first define subproblem oracles, which we
will use to describe the algorithms and conduct complexity analysis. A subproblem
oracle is an oracle that takes subproblem information together with the current algo-
rithm information to produce a solution to the subproblem. With subproblem oracles,
we can describe the algorithms consistently regardless of convexity.
We assume three subproblem oracles in this paper, corresponding to the forward
steps and backward steps of non-root nodes, and the root node step in the algorithms.
For non-root nodes, we assume the following two subproblem oracles.
Definition 2 (Forward Step Subproblem Oracle for Non-Root Nodes) Consider the
following subproblem for a non-root node n,
where the parent node's state variable x_{a(n)} ∈ X_{a(n)} is a given parameter and Θ_n : X_n → R̄ is an l.s.c. function representing an under-approximation of the expected cost-to-go function 𝒬_n(x) defined in (3). The forward step subproblem oracle finds an optimal solution of (F) given x_{a(n)} and Θ_n; that is, we denote this oracle as a mapping O_n^F that takes (x_{a(n)}, Θ_n) as input and outputs an optimal solution (x_n, y_n, z_n) of (F) for n ≠ r.
Recall that the values σ_n for all n ∈ N in (F) are the chosen penalty parameters that satisfy Assumption 2. In view of Propositions 3 and 4, we set l_{n,λ} ≥ σ_n and l_{n,ρ} = 0 for the convex case with ψ_n = ‖·‖, or l_{n,ρ} ≥ σ_n otherwise, for the dual variable set U_n := {(λ, ρ) : ‖λ‖_* ≤ l_{n,λ}, 0 ≤ ρ ≤ l_{n,ρ}} in the next definition.
Definition 3 (Backward Step Subproblem Oracles for Non-Root Nodes) Consider the
following subproblem for a non-root node n,
where the parent node's state variable x_{a(n)} ∈ X_{a(n)} is a given parameter and Θ_n : X_n → R̄ is an l.s.c. function representing an under-approximation of the expected cost-to-go function. The backward step subproblem oracle finds an optimal solution of (B) for the given x_{a(n)} and Θ_n. Similarly, we denote this oracle as a mapping O_n^B that takes (x_{a(n)}, Θ_n) as input and outputs an optimal solution (x_n, y_n, z_n; λ_n, ρ_n) of (B) for n ≠ r.
For the root node, we assume the following subproblem oracle.
Definition 4 (Subproblem Oracle for the Root Node) Consider the following subprob-
lem for the root node r ∈ N ,
Let i ∈ N be the iteration index of an algorithm. Assume (x_n^i, y_n^i)_{n∈N} are feasible solutions to the regularized nodal problem (4) in the i-th iteration. Then the under-approximation of the expected cost-to-go function is defined recursively from leaf nodes to the root node, and inductively for i ∈ N as

𝒬_n^i(x) := max{ 𝒬_n^{i−1}(x), Σ_{m ∈ C(n)} p_{nm} C_m^i(x | λ̂_m^i, ρ̂_m^i, v_m^i) },  ∀ x ∈ X_n,  (14)
where 𝒬_n^0 ≡ 0 on X_n. In the definition (14), C_m^i is the generalized conjugacy cut for Q_m at the i-th iteration with Φ_m^{x_n^i}(x, λ, ρ) = −⟨λ, x_n^i − x⟩ − ρ ψ_m(x_n^i − x) (cf. (11)–(12)), that is,
v_m^i = f_m(ẑ_m^i, ŷ_m^i, x̂_m^i) + ⟨λ̂_m^i, x_n^i − ẑ_m^i⟩ + ρ̂_m^i ψ_m(x_n^i − ẑ_m^i) + 𝒬_m^i(x̂_m^i).  (16)
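Operationally, the update (14) just maintains a growing collection of cuts and evaluates their pointwise maximum. A minimal sketch (class and names are ours, purely illustrative; the cost functions are nonnegative, so zero is a valid initial lower bound):

```python
class CutCollection:
    """Under-approximation of an expected cost-to-go function as the pointwise
    maximum of generalized conjugacy cuts, cf. the recursive update (14)."""

    def __init__(self):
        self.cuts = []  # each cut maps a state x to a lower bound

    def add_cut(self, cut):
        # One backward step appends the probability-weighted aggregate cut.
        self.cuts.append(cut)

    def __call__(self, x):
        # Q^0 = 0, and each update takes a max with the previous approximation,
        # so zero remains part of the max throughout.
        value = 0.0
        for cut in self.cuts:
            value = max(value, cut(x))
        return value

# Usage: two linear cuts; the approximation is their pointwise maximum (and 0).
approx = CutCollection()
approx.add_cut(lambda x: 0.5 * x)
approx.add_cut(lambda x: 1.0 - 0.5 * x)
```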
The next proposition shows that 𝒬_n^i is indeed an under-approximation of 𝒬_n, the proof of which is given in Sect. A.2.1.

Proposition 5 For any n ∈ N and i ∈ N, 𝒬_n^i(x) is (Σ_{m ∈ C(n)} p_{nm}(l_{m,λ} + l_{m,ρ}))-Lipschitz continuous and
where 𝒬̄_n^0 ≡ +∞ for any non-leaf node n ∈ N, 𝒬̄_n^i ≡ 0 for any iteration i ∈ N and any leaf node n, and v̄_m^i satisfies
v̄_m^i = f_m(z_m^i, y_m^i, x_m^i) + σ_m ψ_m(x_n^i − z_m^i) + 𝒬̄_m^i(x_m^i).  (18)
Here, the operation conv{f, g} forms the convex hull of the union of the epigraphs of any continuous functions f and g defined on R^d. More precisely, using convex conjugacy, we define
The key idea behind the upper bound function (17) is to exploit the Lipschitz continuity of the regularized value function Q_m^R(x). In particular, it would follow from induction that v̄_m^i is an upper bound on Q_m^R(x_n^i), and then, by the σ_m-Lipschitz continuity of Q_m^R(x), we have v̄_m^i + σ_m ‖x − x_n^i‖ ≥ Q_m^R(x_n^i) + σ_m ‖x − x_n^i‖ ≥ Q_m^R(x) for all x ∈ X_n.
The next proposition summarizes this property, with the proof given in Sect. A.2.2.

Proposition 6 For any non-root node n ∈ N and i ≥ 1, 𝒬̄_n^i(x) is (Σ_{m ∈ C(n)} p_{nm} σ_m)-Lipschitz continuous. Moreover, we have v̄_m^i ≥ Q_m^R(x_n^i) for any node m ∈ C(n) and thus

𝒬̄_n^i(x) ≥ 𝒬_n^R(x),  ∀ x ∈ X_n.
Starting from this subsection, we turn our attention to stagewise independent stochastic problems, which are defined in the following assumption.
to keep track of the state of each stage in the algorithm, instead of the state of each
node. To be consistent, we also denote the root node solution as (x0i , y0i ) for i ∈ N.
We present the algorithm in Algorithm 2.
Similar to Algorithm 1, each iteration in Algorithm 2 consists of a forward step, a backward step, and a root node update step. In particular, at a node n ∈ Ñ(t) with t < T, the forward step proceeds to a child node m ∈ Ñ(t + 1) whose approximation gap γ_m^i := 𝒬̄_{t+1}^{i−1}(x_m^i) − 𝒬_{t+1}^{i−1}(x_m^i) is among the largest of all the approximation gaps of the states x_m^i of nodes m ∈ Ñ(t + 1). Then the state variable of node m is considered the state variable of stage t(m) in iteration i. Due to stagewise independence, the backward step at each stage t only needs to generate cuts for the nodes in the recombining tree Ñ. The optimality of the returned solution (x_0^*, y_0^*) is guaranteed by Proposition 7.
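The child-selection rule of the forward step can be sketched as follows (hypothetical toy data; `upper` and `lower` stand for the over- and under-approximations of the expected cost-to-go):

```python
def select_child(child_states, upper, lower):
    # Deterministic sampling forward step: among candidate child states,
    # proceed with the node whose approximation gap
    #     gamma_m = upper(x_m) - lower(x_m)
    # is largest (cf. the gaps gamma_m^i in Algorithm 2).
    return max(child_states, key=lambda m: upper(child_states[m]) - lower(child_states[m]))

# Toy illustration with scalar states (hypothetical data).
states = {"m1": 0.2, "m2": 0.9}

def upper(x):   # crude constant over-approximation
    return 1.0

def lower(x):   # under-approximation that is tighter near x = 1
    return x

chosen = select_child(states, upper, lower)  # gap 0.8 at m1 beats 0.1 at m2
```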
𝒬_t^i(x) := max{ 𝒬_t^{i−1}(x), Σ_{m ∈ Ñ(t+1)} p_{tm} C_m^{i,j}(x | λ̂_m^{i,j}, ρ̂_m^{i,j}, v_m^{i,j}), 1 ≤ j ≤ M },  (20)

where C_m^{i,j} is the generalized conjugacy cut generated with (x̂_m^{i,j}, ŷ_m^{i,j}, ẑ_m^{i,j}; λ̂_m^{i,j}, ρ̂_m^{i,j}) = O_m^B(x_n^{i,j}, 𝒬_{t+1}^i) using formula (15). With these notations, the algorithm is displayed in Algorithm 3.
Unlike the preceding two algorithms, Algorithm 3 does not need to construct the
over-approximation of the regularized value functions for selecting the child node to
proceed with. Instead, it determines the scenario paths before the forward step starts.
In the forward step, each nodal problem in the sampled scenario path is solved. Then
in the backward step, the dual problems are solved at the nodes that are defined by
distinct data, dependent on the parent node’s state variable obtained in the forward
step. The termination criterion is flexible. In the existing literature [33, 38], statistical
upper bounds based on the sampled scenario paths are often used together with the
lower bound for terminating the algorithm. In particular, for convex problems, if we set σ_n = +∞, which implies l_{n,λ} = +∞ in the backward step subproblem oracles O_n^B for all n ∈ N, then Algorithm 3 reduces to the usual SDDP algorithm in the literature [15, 33].
In this section, we derive upper bounds on the iteration complexity of the three pro-
posed algorithms, i.e. the bound on the iteration index when the algorithm terminates.
These upper bounds on the iteration complexity imply convergence of the algorithm
to an ε-optimal root node solution for any ε > 0.
In this section, we discuss the iteration complexity of Algorithm 1. We begin with the definition of a set of parameters used in the convergence analysis. Let ε denote the desired root-node optimality gap in Algorithm 1. Let δ = (δ_n)_{n ∈ N, C(n) ≠ ∅} be a set of positive numbers such that ε = Σ_{n ∈ N, C(n) ≠ ∅} p_n δ_n. Since ε > 0, such δ_n's clearly exist. Then we define recursively for each non-leaf node n

γ_n(δ) := δ_n + Σ_{m ∈ C(n)} p_{nm} γ_m(δ),  (21)

and γ_n(δ) = 0 for leaf nodes n. For i ∈ N, recall the approximation gap γ_n^i = 𝒬̄_n^{i−1}(x_n^i) − 𝒬_n^{i−1}(x_n^i) for n ∈ N. For leaf nodes, γ_n^i ≡ 0 by definition for all i ∈ N.
Intuitively, the index set I_n(δ) consists of the iteration indices at which all the child nodes of n have good approximations of the expected cost-to-go function at the forward step solution, while the node n itself does not. The next lemma shows that the backward step for node n in an iteration i ∈ I_n(δ) will reduce the expected cost-to-go function approximation gap at node n to no more than γ_n(δ).
Lemma 3 If an iteration index i ∈ I_n(δ), i.e., 𝒬̄_n^{i−1}(x_n^i) − 𝒬_n^{i−1}(x_n^i) > γ_n(δ) and 𝒬̄_m^{i−1}(x_m^i) − 𝒬_m^{i−1}(x_m^i) ≤ γ_m(δ) for all m ∈ C(n), then

𝒬̄_n^i(x) − 𝒬_n^i(x) ≤ γ_n(δ),  ∀ x ∈ X_n, ‖x − x_n^i‖ ≤ δ_n / (2 L_n),  (23)

where L_n := Σ_{m ∈ C(n)} p_{nm}(l_{m,λ} + l_{m,ρ}) is determined by the input parameters.
The proof is given in Sect. A.3.1. Lemma 3 shows that an iteration being in the index set implies an improvement of the approximation in a neighborhood of the current state. In other words, each i ∈ I_n(δ) carves out a ball of radius δ_n/(2L_n) in the state space X_n such that no point in the ball can be the forward step solution of another iteration i′ ∈ I_n(δ). This implies that we can bound the cardinality |I_n(δ)| by the size and shape of the corresponding state space X_n.
Lemma 4 Let B = {B_{n,k} ⊂ R^{d_n}}_{1 ≤ k ≤ K_n, n ∈ N} be a collection of balls, each with diameter D_{n,k} ≥ 0, such that X_n ⊆ ∪_{k=1}^{K_n} B_{n,k}. Then,

|I_n(δ)| ≤ Σ_{k=1}^{K_n} ( 1 + 2 L_n D_{n,k} / δ_n )^{d_n}.
The proof is by a volume argument on the covering balls, with details given in Sect. A.3.2. We can now state an upper bound on the iteration complexity of Algorithm 1.
Theorem 1 Given ε > 0, choose values δ = (δ_n)_{n ∈ N, C(n) ≠ ∅} such that δ_n > 0 and Σ_{n ∈ N, C(n) ≠ ∅} p_n δ_n = ε. Let B = {B_{n,k}}_{1 ≤ k ≤ K_n, n ∈ N} be a collection of balls, each with diameter D_{n,k} ≥ 0, such that X_n ⊆ ∪_{k=1}^{K_n} B_{n,k} for n ∈ N. If Algorithm 1 terminates with an ε-optimal root node solution (x_r^*, y_r^*) at the end of the i-th iteration, then

i ≤ Σ_{n ∈ N, C(n) ≠ ∅} Σ_{k=1}^{K_n} ( 1 + 2 L_n D_{n,k} / δ_n )^{d_n}.
Proof After the i-th iteration, at least one of the following two situations must happen:
i. At the root node, it holds that 𝒬̄_r^i(x_r^{i+1}) − 𝒬_r^i(x_r^{i+1}) ≤ γ_r(δ), where γ_r is defined in (21).
ii. There exists a node n ∈ N such that 𝒬̄_n^i(x_n^{i+1}) − 𝒬_n^i(x_n^{i+1}) > γ_n(δ), but all of its child nodes satisfy 𝒬̄_m^i(x_m^{i+1}) − 𝒬_m^i(x_m^{i+1}) ≤ γ_m(δ), ∀ m ∈ C(n). In other words, i + 1 ∈ I_n(δ).
Note that γ_r(δ) = δ_r + Σ_{m ∈ C(r)} p_{rm} γ_m(δ) = · · · = Σ_{n ∈ N, C(n) ≠ ∅} p_n δ_n = ε. If case i happens, then by Proposition 7, (x_r^{i+1}, y_r^{i+1}) is an ε-optimal root node solution. Note that case ii can happen at most Σ_{n ∈ N} |I_n(δ)| times by Lemma 4. Therefore, we have
i ≤ Σ_{n ∈ N, C(n) ≠ ∅} Σ_{k=1}^{K_n} ( 1 + 2 L_n D_{n,k} / δ_n )^{d_n},
Theorem 1 implies the ε-convergence of the algorithm for any ε > 0. We remark that the form of the upper bound depends on the values δ and the covering balls B_{n,k}, and therefore the right-hand side can be tightened to the infimum over all possible choices. While it may be difficult to find the best bound in general, in the next section we make specific choices of δ and B and simplify the complexity upper bound, based on the stagewise independence assumption.
Before giving the iteration complexity bound for Algorithm 2, we slightly adapt the notation of the previous section to the stagewise independent scenario tree. We take the values δ = (δ_n)_{n ∈ Ñ, C(n) ≠ ∅} such that δ_n = δ_{n′} for all n, n′ ∈ Ñ(t), t = 1, . . . , T. Thus we denote δ_t = δ_n for any n ∈ Ñ(t), and δ_0 = δ_r. The value γ_t(δ) is defined recursively for non-leaf stages, following (21), as

γ_t(δ) := δ_t + γ_{t+1}(δ),

and γ_T(δ) = 0. Let γ_t^i := 𝒬̄_t^{i−1}(x_t^i) − 𝒬_t^{i−1}(x_t^i) and recall that γ_0^i := γ_r^i for each index i.
Note that $\gamma_t^i = \max_{n\in\tilde{\mathcal{N}}(t)} \gamma_n^i$ (line 10 in Algorithm 2). By Lemma 3, an iteration $i\in\mathcal{I}_t(\delta)$ implies $\overline{\mathcal{Q}}{}^i_t(x) - \mathcal{Q}^i_t(x) \le \gamma_t(\delta)$ for all $x\in\mathcal{X}_n$ with $\|x - x_t^i\| \le \delta_t/(2L_t)$, where $L_t = L_n$ for any $n\in\tilde{\mathcal{N}}(t)$. Moreover, since $\mathcal{X}_n = \mathcal{X}_t$ for $n\in\tilde{\mathcal{N}}(t)$, for any covering balls $B_{t,k}\subset\mathbb{R}^{d_t}$ with diameters $D_{t,k}\ge 0$ such that $\mathcal{X}_t\subseteq\bigcup_{k=1}^{K_t} B_{t,k}$, by the same argument as in Lemma 4, we know that
\[
|\mathcal{I}_t(\delta)| \le \sum_{k=1}^{K_t}\left(1 + \frac{2L_t D_{t,k}}{\delta_t}\right)^{d_t}. \tag{26}
\]
We summarize the upper bound on the iteration complexity of Algorithm 2 in the next theorem, and omit the proof since it is almost a word-for-word repetition of the above with the notation adapted accordingly.

Theorem 2 Given any $\varepsilon > 0$, choose values $\delta = (\delta_t)_{t=0}^{T-1}$ such that $\delta_t > 0$ and $\sum_{t=0}^{T-1}\delta_t = \varepsilon$. Let $\mathcal{B} = \{B_{t,k}\subset\mathbb{R}^{d_t}\}_{1\le k\le K_t,\,0\le t\le T-1}$ be a collection of balls, each
123
Stochastic dual dynamic programming… 955
Kt
with diameter Dt,k ≥ 0, such that Xt ⊆ k=1 Bt,k for 0 ≤ t ≤ T − 1. If Algorithm 2
terminates with an ε-optimal root node solution (x0∗ , y0∗ ) in i iterations, then
T Kt
−1 dt
2L t Dt,k
i≤ 1+ .
δt
t=0 k=1
We next discuss some special choices of the values $\delta$ and the covering ball collections $\mathcal{B}$. First, since the sets $\mathcal{X}_t$ are compact, let $B_t$ be the smallest ball containing $\mathcal{X}_t$. Then $\operatorname{diam}\mathcal{X}_t \le D_t \le 2\operatorname{diam}\mathcal{X}_t$, where $D_t = \operatorname{diam} B_t$. Moreover, suppose $L_t \le L$ for some $L > 0$ and $d_t \le d$ for some $d > 0$. Then by taking $\delta_t = \varepsilon/T$ for all $0\le t\le T-1$, we have the following bound.
Corollary 1 If Algorithm 2 terminates with an $\varepsilon$-optimal root node solution $(x_0^*, y_0^*)$, then the iteration index is bounded by
\[
i \le T\left(1 + \frac{2LDT}{\varepsilon}\right)^{d}.
\]
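As a quick numerical illustration of Corollary 1 (not part of the original analysis), the bound can be evaluated directly; all parameter values below are hypothetical.

```python
def corollary1_bound(T, L, D, d, eps):
    # Iteration upper bound T * (1 + 2*L*D*T/eps)**d from Corollary 1.
    # The concrete parameter values used below are illustrative only.
    return T * (1.0 + 2.0 * L * D * T / eps) ** d

b = corollary1_bound(T=4, L=1.0, D=1.0, d=2, eps=1.0)  # 4 * (1 + 8)^2 = 324
```

Note how the stage count $T$ enters both as the leading factor and inside the base, so the bound is polynomial, not exponential, in $T$ for fixed $d$.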
Corollary 3 If every state space $\mathcal{X}_t$ is a finite set with at most $K$ elements, and Algorithm 2 terminates with an $\varepsilon$-optimal root node solution $(x_0^*, y_0^*)$, then the iteration index is bounded by
\[
i \le TK.
\]
Proof Note that when $\mathcal{X}_t$ is finite, it can be covered by the degenerate balls $B_0(x)$, $x\in\mathcal{X}_t$. Thus $D_{t,k} = 0$ for $k = 1,\dots,K_t$, and $K_t\le K$ by assumption. Applying Theorem 2, we get $i \le \sum_{t=0}^{T-1}\sum_{k=1}^{K_t} 1 \le TK$.
The bound in Corollary 3 grows linearly in $T$ and does not depend on the value of $\varepsilon$. In other words, given access to the subproblem oracles, we are able to obtain exact solutions to the regularized problem (4).
In the following, we study the iteration complexity of Algorithm 3. For clarity, we model the subproblem oracles $\mathcal{O}_n^F$ and $\mathcal{O}_n^B$ as random functions that are $\Sigma_i^{\mathrm{oracle}}$-measurable in each iteration $i\in\mathbb{N}$ for any node $n\neq r$, where $\{\Sigma_i^{\mathrm{oracle}}\}_{i=0}^{\infty}$ is a filtration of $\sigma$-algebras in the probability space. Intuitively, this model says that the information given by $\Sigma_i^{\mathrm{oracle}}$ could be used to predict the outcomes of the subproblem oracles. We now make the following assumption on the sampling step.
Assumption 4 In each iteration $i$, the $M$ scenario paths are sampled uniformly with replacement, independently of each other and of the outcomes of the subproblem oracles. That is, the conditional probability of the $j$-th sample path $P^{i,j}$ taking any scenario $n_t\in\tilde{\mathcal{N}}(t)$ in stage $t$ is almost surely
\[
\mathrm{Prob}\left(P_t^{i,j} = n_t \,\middle|\, \Sigma_\infty^{\mathrm{oracle}},\, \sigma\{P_{t'}^{i',j'}\}_{(i',j',t')\neq(i,j,t)}\right) = \frac{1}{N_t}. \tag{27}
\]
Accordingly, the index sets are adapted to the sampled forward paths:
\[
\mathcal{I}_t(\delta) := \bigcup_{j=1}^{M}\left\{i\in\mathbb{N} : \gamma_t^{i,j} > \gamma_t(\delta) \text{ and } \tilde\gamma_{t+1}^{i,j} \le \gamma_{t+1}(\delta)\right\}. \tag{28}
\]
With the same argument, we know that the upper bound (26) on the sizes of It (δ)
holds everywhere for each t = 0, . . . , T − 1. However, since the nodes in the forward
steps are sampled randomly, we do not necessarily have $i\in\bigcup_{t=0}^{T-1}\mathcal{I}_t(\delta)$ for each iteration index $i\in\mathbb{N}$ before Algorithm 3 first finds an $\varepsilon$-optimal root node solution. Instead, we define for each iteration $i$ the event
\[
A_i(\delta) := \left\{i\in\bigcup_{t=0}^{T-1}\mathcal{I}_t(\delta)\right\} \cup \bigcap_{j=1}^{M}\left\{\gamma_0^{i-1,j} \le \gamma_0(\delta) = \varepsilon\right\},
\]
which means that either some approximation is improved in iteration $i$, or the algorithm has found an $\varepsilon$-optimal root node solution in iteration $i-1$. The next lemma estimates the conditional probability of $A_i(\delta)$ given the oracle outcomes and samplings up to iteration $i$. For simplicity, we define two $\sigma$-algebras $\Sigma_i^{\mathrm{sample}} := \sigma\{P^{i',j'}\}_{i'\le i,\,j'=1,\dots,M}$ and $\Sigma_i := \sigma(\Sigma_i^{\mathrm{oracle}}, \Sigma_i^{\mathrm{sample}})$ for each $i$.
Lemma 5 Fix any $\varepsilon = \sum_{t=0}^{T-1}\delta_t$. Then the conditional probability inequality
\[
\mathrm{Prob}\left(A_i(\delta) \,\middle|\, \Sigma_{i-1}\right) \ge 1 - (1 - 1/N)^M =: \nu, \qquad N := N_1\cdots N_{T-1},
\]
holds almost surely for every iteration $i$.
Remark 3 Theorem 3 shows that for a fixed problem (such that $I = I(\delta,\mathcal{B})$ and $N = N_1\cdots N_{T-1}$ are fixed), given any probability threshold $q\in(0,1)$, the number of iterations needed for Algorithm 3 to find an $\varepsilon$-optimal root node solution with probability greater than $1-q$ is $O(-\ln q/\nu^2)$, which does not depend on $I$. In particular, if we set $M = 1$, then the number of iterations needed is $O(-N^2\ln q)$, which is exponential in the number of stages $T$ if $N_t\ge 2$ for all $t = 1,\dots,T-1$. It remains unknown to us whether there exists a complexity bound for Algorithm 3 that is polynomial in $T$ in general.
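The dependence of the success probability $\nu = 1 - (1 - 1/N)^M$ on the batch size $M$ can be made concrete with a small Python sketch; the per-stage scenario counts and batch sizes below are hypothetical.

```python
import math

def success_prob_nu(Nt, M):
    # nu = 1 - (1 - 1/N)^M, where N = N_1 * ... * N_{T-1} is the total
    # number of scenario paths; Nt lists the per-stage scenario counts.
    # Illustrative computation, following the definition in Lemma 5.
    N = math.prod(Nt)
    return 1.0 - (1.0 - 1.0 / N) ** M

# With M = 1 sample per iteration, nu = 1/N, so the O(-ln q / nu^2)
# iteration estimate scales like N^2 (illustrative values).
nu_single = success_prob_nu([2, 2, 2], M=1)   # N = 8, nu = 1/8
nu_batch = success_prob_nu([2, 2, 2], M=8)    # larger M increases nu
```

Increasing $M$ raises $\nu$ toward 1 and hence shrinks the $1/\nu^2$ factor, at the cost of more subproblems solved per iteration.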
In this section, we discuss the sharpness of the iteration complexity bound of Algorithm 2 given in Sect. 4. In particular, we are interested in whether the number of iterations needed for Algorithm 2 to find an $\varepsilon$-optimal root node solution can grow linearly in $T$ when the state spaces are infinite sets. We will see that in general it cannot, with or without the assumption of convexity. The following lemma simplifies the discussion in this section.
We discuss the general Lipschitz continuous case, i.e., the nodal objective functions $f_n(z,y,x)$ are $l_n$-Lipschitz continuous in $z$ but not necessarily convex. In this case we choose to approximate the value functions using $\psi_n(x) = \|x\|$ and assume that $l_{n,\rho}\ge l_n$. We can set $l_{n,\lambda} = 0$ for all $n\in\mathcal{N}$ without loss of exactness of the approximation, by the proof of Proposition 3. We begin with the following lemma on the complexity of such approximation.
Lemma 7 Consider a norm ball $X = \{x\in\mathbb{R}^d : \|x\|\le D/2\}$ and a finite set of points $W = \{w_k\}_{k=1}^K\subset X$. Suppose there exist $\beta > 0$ and an $L$-Lipschitz continuous function $f : X\to\mathbb{R}_+$ such that $\beta < f(w_k) < 2\beta$ for $k = 1,\dots,K$. Define
– $\underline{Q}(x) := \max_{k=1,\dots,K}\{0,\ f(w_k) - L\|x - w_k\|\}$, and
– $\overline{Q}(x) := \min_{k=1,\dots,K}\{f(w_k) + L\|x - w_k\|\}$.
If $K < \left(\frac{DL}{4\beta}\right)^d$, then $\min_{x\in X}\underline{Q}(x) = 0$ and $\min_{x\in X}\overline{Q}(x) > \beta$.
The proof is given in Sect. A.4.2. The lemma shows that if the number of points in $W$ is too small, i.e., $K < (DL/(4\beta))^d$, then the gap between the upper and lower bounds can be large, i.e., $\overline{Q}(\bar{x}) - \underline{Q}(\bar{x}) > \beta$ for some $\bar{x}$. In other words, a sufficient number of sample points is needed to obtain a small gap between the upper and lower bounds. This lemma is directly used to provide a lower bound on the complexity of Algorithm 2.
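The phenomenon in Lemma 7 is easy to observe numerically. The sketch below is an illustration, not a proof: the centers, the constant function $f$, and the grid are all made-up choices satisfying the lemma's hypotheses with $L = 1$ and $\beta = 0.1$.

```python
import math

# Few centers w_k with beta < f(w_k) < 2*beta: the lower envelope Q_lower
# reaches 0 somewhere in X, while the upper envelope Q_upper stays above beta.
L, beta = 1.0, 0.1
W = [(-0.5, 0.0), (0.0, 0.0), (0.5, 0.0)]      # K = 3 centers inside the ball X
f = {w: 0.15 for w in W}                        # constant f is 1-Lipschitz; 0.1 < 0.15 < 0.2

def Q_lower(x):
    return max([0.0] + [f[w] - L * math.dist(x, w) for w in W])

def Q_upper(x):
    return min(f[w] + L * math.dist(x, w) for w in W)

# A point farther than 0.15 from every center certifies min Q_lower = 0.
x_hat = (0.0, 0.9)
# Sample the upper envelope on a square grid around the centers.
grid = [(i / 10 - 1.0, j / 10 - 1.0) for i in range(21) for j in range(21)]
min_upper = min(Q_upper(x) for x in grid)       # stays above beta on the grid
```

With only three cut points, the two envelopes disagree by more than $\beta$ somewhere, exactly as the lemma predicts for $K < (DL/4\beta)^d$.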
Now we construct a Lipschitz continuous multistage problem defined on a chain, i.e., a scenario tree where each stage has a single node, $|\mathcal{N}(t)| = 1$ for $t = 1,\dots,T$. The problem is given by the value functions in each stage as
\[
\begin{cases}
Q_r = \min_{x_0\in\mathcal{X}_r} Q_1(x_0), &\\
Q_t(x_{t-1}) = \min_{x_t\in\mathcal{X}_t}\{f_t(x_{t-1}) + Q_{t+1}(x_t)\}, & 1\le t\le T-1,\\
Q_T(x_{T-1}) = f_T(x_{T-1}), &
\end{cases} \tag{29}
\]
and $\varepsilon > 0$ is a fixed constant. The state space $\mathcal{X}_t := B^d(D/2)\subset\mathbb{R}^d$ is a ball with radius $D/2 > 0$. We remark that $\varepsilon$ will be the optimality gap in Theorem 4; for a fixed optimality gap $\varepsilon$, we construct an instance of the multistage problem (29) that will prove difficult for Algorithm 2 to solve. Note also that (29) is constructed so that no constraint couples the state variables $x_t$ across stages.
By Lemma 6, if we choose $\psi_n(x) = \|x\|$ for all $n\in\mathcal{N}$ and $l_{n,\rho} = L$ for the problem (29), then we have $Q^R_t(x) = Q_t(x)$ for all $t = 1,\dots,T$. The next theorem shows a lower bound on the iteration complexity of problem (29) with this choice of penalty functions.
Theorem 4 For any optimality gap $\varepsilon > 0$, there exists a problem of the form (29) with subproblem oracles $\mathcal{O}_n^F$, $\mathcal{O}_n^B$, $n\in\mathcal{N}$, and $\mathcal{O}_r$, such that if Algorithm 2 gives $\mathrm{UpperBound} - \mathrm{LowerBound}\le\varepsilon$ in the $i$-th iteration, then
\[
i \ge \left(\frac{DLT}{4\varepsilon}\right)^{d}.
\]
The proof is given in Sect. A.4.3. The theorem shows that in general Algorithm 2 needs at least $\Omega(T^d)$ iterations before termination. We comment that this is due to the fact that the approximation using generalized conjugacy is tight only locally. Without convexity, one may need to visit many states to cover the state space and achieve tight approximations of the value functions before the algorithm is guaranteed to find an $\varepsilon$-optimal solution.
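The lower bound of Theorem 4 and the upper bound of Corollary 1 share the same growth rate $(LDT/\varepsilon)^d$ up to the base constant, which can be checked numerically. The sketch below uses hypothetical parameter values.

```python
def lower_bound_general(D, L, T, d, eps):
    # Theorem 4: i >= (D*L*T / (4*eps))**d for the nonconvex chain example.
    return (D * L * T / (4.0 * eps)) ** d

def upper_bound(T, L, D, d, eps):
    # Corollary 1: i <= T * (1 + 2*L*D*T/eps)**d.
    return T * (1.0 + 2.0 * L * D * T / eps) ** d

# Both bounds grow like (L*D*T/eps)**d, so together they pin down the
# complexity up to a constant-factor base (illustrative values).
lo = lower_bound_general(D=1.0, L=1.0, T=8, d=3, eps=1.0)   # (8/4)^3 = 8
hi = upper_bound(T=8, L=1.0, D=1.0, d=3, eps=1.0)
```

The ratio between the two bases (roughly $2LDT/\varepsilon$ versus $DLT/4\varepsilon$) is a constant, so the exponent $d$ and the polynomial dependence on $T$ match on both sides.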
In the above example for the general Lipschitz continuous case, the complexity of Algorithm 2 grows at a rate of $O(T^d)$. It remains to answer whether convexity could help avoid this possibly undesirable growth rate in terms of $d$. We show that even using linear cuts, rather than generalized conjugacy cuts, for convex value functions, the complexity lower bound of the proposed algorithms cannot be substantially improved. We begin our discussion with a definition.
Definition 5 Given a $d$-sphere $S^d(R) = \{x\in\mathbb{R}^{d+1} : \|x\|_2 = R\}$ with radius $R > 0$, a spherical cap with depth $\beta > 0$ centered at a point $x\in S^d(R)$ is the set
\[
S^d_\beta(R, x) := \left\{y\in S^d(R) : \langle x, y\rangle \ge R(R - \beta)\right\}.
\]
The next lemma shows that we can place many spherical caps on a sphere such that the center of each cap is not contained in any other cap; the proof is given in Sect. A.4.4.

Lemma 8 Given a $d$-sphere $S^d(R)$, $d\ge 2$, and depth $\beta < (1 - \frac{\sqrt{2}}{2})R$, there exists a finite set of points $W\subset S^d(R)$ with
\[
|W| \ge \frac{(d^2 - 1)\sqrt{\pi}}{d}\,\frac{\Gamma(d/2 + 1)}{\Gamma(d/2 + 3/2)}\left(\frac{R}{2\beta}\right)^{(d-1)/2},
\]
such that $w'\notin S^d_\beta(R, w)$ for any two distinct points $w, w'\in W$.
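The packing bound of Lemma 8 can be evaluated with the Gamma function from the standard library. The sketch below is purely illustrative; the dimension, radius, and cap depths are hypothetical values.

```python
import math

def packing_lower_bound(d, R, beta):
    # |W| >= ((d^2 - 1) * sqrt(pi) / d) * Gamma(d/2 + 1) / Gamma(d/2 + 3/2)
    #        * (R / (2*beta))**((d - 1) / 2), as in Lemma 8.
    const = (d * d - 1) * math.sqrt(math.pi) / d
    gammas = math.gamma(d / 2 + 1) / math.gamma(d / 2 + 1.5)
    return const * gammas * (R / (2.0 * beta)) ** ((d - 1) / 2)

# Shrinking the cap depth beta increases the number of well-separated centers.
w_coarse = packing_lower_bound(d=3, R=1.0, beta=0.1)
w_fine = packing_lower_bound(d=3, R=1.0, beta=0.01)
```

As $\beta\to 0$, the guaranteed number of centers grows like $\beta^{-(d-1)/2}$, which is what drives the $\varepsilon$-dependence of the convex lower bound below.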
Hereafter, we denote a set of points that satisfies Lemma 8 by $W^d_\beta(R)\subset S^d(R)$. Next we construct an $L$-Lipschitz convex function, for any $L > 0$ and $\varepsilon > 0$, that satisfies certain properties on $W^d_{\varepsilon/L}(R)$. The proof is given in Sect. A.4.5.

Lemma 9 Given positive constants $\varepsilon > 0$, $L > 0$ and a set $W^d_{\varepsilon/L}(R)$, there exists an $L$-Lipschitz convex function whose associated under- and over-approximations $\underline{Q}{}_l, \overline{Q}{}_l$ satisfy
\[
\overline{Q}{}_l(w_l) - \underline{Q}{}_l(w_l) > \frac{3\varepsilon}{2}.
\]
Now we present the multistage convex dual dynamic programming example, based on the following parameters: $T\ge 2$ (number of stages), $L > 0$ (Lipschitz constant), $d\ge 3$ (state space dimension), $D = 2R > 0$ (state space diameter), and $\varepsilon > 0$ (optimality gap). Choose any $L_1,\dots,L_T$ such that $L/2\le L_T < L_{T-1} < \cdots < L_1\le L$, and then construct finite sets $W_t := W^{d-1}_{\varepsilon/((T-1)L_{t+1})}(R) = \{w_{t,k}\}_{k=1}^{K_t}$, $K_t = |W_t|$, as defined in Lemma 8 for $t = 1,\dots,T-1$. Moreover, define convex $L_{t+1}$-Lipschitz continuous functions $F_t$ for some values $v_{t,k}\in(\varepsilon/(2T-2), \varepsilon/(T-1))$, $k = 1,\dots,K_t$, and the finite sets $W_t$. By Assumption 3, we define the stagewise independent scenario tree as follows. There are $K_t$ distinct nodes in each stage $t = 1,\dots,T-1$, which can be denoted by an index pair $n = (t,k)$ for $k = 1,\dots,K_t$, and all nodes in the last stage $T$ are defined by the same data. Then we define our problem by specifying the nodal cost functions $f_r\equiv 0$; $f_{1,k}(x_0,y_1,x_1) := L_1\|x_1 - w_{1,k}\|$ for $k = 1,\dots,K_1$; $f_{t,k}(x_{t-1},y_t,x_t) := F_{t-1}(x_{t-1}) + L_t\|x_t - w_{t,k}\|$ for $k = 1,\dots,K_t$ and $t = 2,\dots,T-1$; and $f_{T,1}(x_{T-1},y_T,x_T) := F_{T-1}(x_{T-1})$; together with the state spaces $\mathcal{X}_t = X = B^{d}(R)$. Alternatively, the value functions can be written as
\[
\begin{cases}
Q_{1,k} = \min_{x_1\in X}\left\{L_1\|x_1 - w_{1,k}\| + \mathcal{Q}_1(x_1)\right\}, & \forall k\le K_1,\\
Q_{t,k}(x_{t-1}) = \min_{x_t\in X}\left\{F_{t-1}(x_{t-1}) + L_t\|x_t - w_{t,k}\| + \mathcal{Q}_t(x_t)\right\}, & k\le K_t,\\
Q_{T,1}(x_{T-1}) = F_{T-1}(x_{T-1}), &
\end{cases} \tag{30}
\]
where the second equation is defined for all $2\le t\le T-1$, and the expected cost-to-go functions are
\[
\mathcal{Q}_t(x_t) := \frac{1}{K_{t+1}}\sum_{k=1}^{K_{t+1}} Q_{t+1,k}(x_t), \qquad t = 0,\dots,T-1.
\]
By Lemma 8,
\[
\begin{aligned}
K_t &\ge \frac{((d-1)^2 - 1)\sqrt{\pi}}{d-1}\,\frac{\Gamma((d-1)/2 + 1)}{\Gamma((d-1)/2 + 3/2)}\left(\frac{R L_{t+1}(T-1)}{2\varepsilon}\right)^{(d-2)/2}\\
&\ge \frac{d(d-2)\sqrt{\pi}}{d-1}\,\frac{\Gamma(d/2 + 1/2)}{\Gamma(d/2 + 1)}\left(\frac{DL(T-1)}{8\varepsilon}\right)^{(d-2)/2}.
\end{aligned}
\]
Theorem 5 For any optimality gap $\varepsilon > 0$, there exists a multistage stochastic convex problem of the form (30) such that, if Algorithm 2 gives $\mathrm{UpperBound} - \mathrm{LowerBound} < \varepsilon$ at the $i$-th iteration, then
\[
i > \frac{1}{3}\,\frac{d(d-2)\sqrt{\pi}}{d-1}\,\frac{\Gamma(d/2 + 1/2)}{\Gamma(d/2 + 1)}\left(\frac{DL(T-1)}{8\varepsilon}\right)^{(d-2)/2}.
\]
The proof is given in Sect. A.4.6. The theorem implies that, even if problem (2) is convex and has Lipschitz continuous value functions, the minimum number of iterations for Algorithm 2 to obtain a guaranteed $\varepsilon$-optimal root node solution grows as a polynomial of the ratio $T/\varepsilon$, with degree $d/2 - 1$.
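The convex-case lower bound of Theorem 5 can likewise be evaluated numerically; the sketch below uses hypothetical parameters, and the test exploits the fact that for $d = 4$ the bound is exactly linear in $T - 1$.

```python
import math

def lower_bound_convex(d, D, L, T, eps):
    # Theorem 5: i > (1/3) * d*(d-2)*sqrt(pi)/(d-1)
    #            * Gamma(d/2 + 1/2) / Gamma(d/2 + 1)
    #            * (D*L*(T-1) / (8*eps))**((d-2)/2)   (illustrative evaluation)
    const = d * (d - 2) * math.sqrt(math.pi) / (3.0 * (d - 1))
    gammas = math.gamma(d / 2 + 0.5) / math.gamma(d / 2 + 1)
    return const * gammas * (D * L * (T - 1) / (8.0 * eps)) ** ((d - 2) / 2)

# The bound is a polynomial in T/eps of degree (d-2)/2 = d/2 - 1.
b_small = lower_bound_convex(d=4, D=1.0, L=1.0, T=9, eps=1.0)
b_large = lower_bound_convex(d=4, D=1.0, L=1.0, T=17, eps=1.0)  # (T-1) doubles
```

For $d = 4$, doubling $T - 1$ doubles the bound, matching the degree $d/2 - 1 = 1$ stated in the text.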
We remark that Theorems 4 and 5 correspond to two different challenges for SDDP type algorithms. The first challenge is that the backward step subproblem oracle may not give cuts that provide the desired approximation, which can happen when the value functions are nonconvex or nonsmooth. Theorem 4 results from the worst case, in which the backward step subproblem oracle yields approximations of the value function that are tight only in the smallest possible neighborhood.
The second challenge is that different nodes, or more generally, different scenario
paths give different states in each stage, so sampling and solving the nodal problem on
one scenario path provides little information to the nodal problem on another scenario
path. In example (30), the linear cut obtained in each iteration does not provide any
information on the subsequent iteration states (unless the same node is sampled again).
From this perspective, we believe that unless some special structure of the problem
is exploited, any algorithm that relies on local approximation of value functions will
face the “curse of dimensionality,” i.e., the exponential growth rate of the iteration
complexity in the state space dimensions.
6 Conclusions
In this paper, we propose three algorithms in a unified framework of dual dynamic pro-
gramming for solving multistage stochastic mixed-integer nonlinear programs. The
first algorithm is a generalization of the classic nested Benders decomposition algo-
rithm, which deals with general scenario trees without the stagewise independence
property. The second and third algorithms generalize SDDP with sampling procedures
on a stagewise independent scenario tree, where the second algorithm uses a determin-
istic sampling approach, and the third one uses a randomized sampling approach. The
proposed algorithms are built on regularization of value functions, which enables them
to handle problems with value functions that are non-Lipschitzian or discontinuous.
We show that the regularized problem preserves the feasibility and optimality of the
original multistage program, when the corresponding penalty reformulation satisfies
exact penalization. The key ingredient of the proposed algorithms is a new class of cuts
based on generalized conjugacy for approximating nonconvex cost-to-go functions of
the regularized problems.
We obtain upper and lower bounds on the iteration complexity of the proposed
algorithms on MS-MINLP problem classes that allow exact Lipschitz regularization
with predetermined penalty functions and parameters. The complexity analysis is new
and deepens our understanding of the behavior of SDDP. For example, this is the first proof that the iteration complexity of SDDP depends polynomially on the number of stages, not exponentially, for both convex and nonconvex multistage stochastic programs, and this complexity dependence can be reduced to linear if the optimality
gap is allowed to scale linearly with the number of stages, or if all the state spaces are
finite sets. These findings resolve a conjecture of the late Prof. Shabbir Ahmed, who
inspired us to work on this problem.
Funding Open Access funding provided by the MIT Libraries. Funding was provided by the Directorate for Engineering (Grant Number 1751747).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
A Proofs
In this section, we present the proofs to the theorems, propositions, and lemmas that
are not displayed in the main text.
Proof We show that $Q_n$ is l.s.c. by showing that the lower level sets $\mathrm{lev}_a(Q_n) = \{z\in\mathcal{X}_{a(n)} : Q_n(z)\le a\}$ are closed for all $a\in\mathbb{R}$. At any leaf node $n$, the expected cost-to-go function $\mathcal{Q}_n(x_n)$ is zero, so $z\in\mathrm{lev}_a(Q_n)$ if and only if $z$ is in the projection of the set $\{(z,y,x) : (x,y)\in\mathcal{F}_n,\ f_n(z,y,x)\le a\}$. Since $f_n$ is defined on the compact set $\{(z,y,x) : z\in\mathcal{X}_{a(n)},\ (x,y)\in\mathcal{F}_n\}$ and is l.s.c. by Assumption 1, the set $\{(z,y,x) : f_n(z,y,x)\le a\}$ is compact. Moreover, since the projection $(z,y,x)\mapsto z$ is continuous, the image $\mathrm{lev}_a(Q_n)$ is compact, hence closed. At any non-leaf node $n$, suppose $Q_m$ is l.s.c. for all its child nodes $m\in\mathcal{C}(n)$. Then $\mathcal{Q}_n$ is l.s.c., since $\mathcal{Q}_n$ is defined in (3) and $p_{nm} > 0$ for all $m$. A point $z\in\mathrm{lev}_a(Q_n)$ if and only if $z$ is in the projection of the set $\{(z,y,x) : (y,x)\in\mathcal{F}_n,\ f_n(z,y,x) + \mathcal{Q}_n(x)\le a\}$. As before, this shows $\mathrm{lev}_a(Q_n)$ is closed, since $f_n, \mathcal{Q}_n$ are l.s.c. and the projection $(z,y,x)\mapsto z$ is continuous. We thus conclude that $Q_n$ is l.s.c. for every node $n$ in the scenario tree.
To show claims 1 and 2 of the proposition, take any two points $z_1, z_2\in\mathcal{X}_{a(n)}$, and let $(x_1,y_1), (x_2,y_2)\in\mathcal{F}_n$ be corresponding minimizers in the definition (2), so that $Q_n(z_1) = f_n(z_1,y_1,x_1) + \mathcal{Q}_n(x_1)$ and $Q_n(z_2) = f_n(z_2,y_2,x_2) + \mathcal{Q}_n(x_2)$. If $f_n$ is Lipschitz continuous in the first variable, then we have
\[
\begin{aligned}
Q_n(z_1) - Q_n(z_2) &= f_n(z_1,y_1,x_1) + \mathcal{Q}_n(x_1) - f_n(z_2,y_2,x_2) - \mathcal{Q}_n(x_2)\\
&\le f_n(z_1,y_2,x_2) + \mathcal{Q}_n(x_2) - f_n(z_2,y_2,x_2) - \mathcal{Q}_n(x_2)\\
&= f_n(z_1,y_2,x_2) - f_n(z_2,y_2,x_2) \le l_n\|z_1 - z_2\|.
\end{aligned}
\]
The first inequality follows from the definition (2), and the last inequality from the $l_n$-Lipschitz continuity of $f_n$ in its first variable; exchanging $z_1$ and $z_2$ shows that $Q_n$ is $l_n$-Lipschitz continuous. The convexity claim follows from a similar argument using the convexity of $f_n$ and $\mathcal{Q}_n$.
is $\sigma_n$-Lipschitz continuous in the first variable $x_{a(n)}$. Note that the minimum is well-defined, since $\mathcal{X}_{a(n)}$ is compact and the functions $f_n$, $\sigma_n\psi_n$ are l.s.c. Besides, since $z = x_{a(n)}$ is a feasible solution in the minimization, we know that $f_n\,\square\,(\sigma_n\psi_n)(x_{a(n)},y_n,x_n)\le f_n(x_{a(n)},y_n,x_n)$ for all $x_{a(n)}\in\mathcal{X}_{a(n)}$ and $(x_n,y_n)\in\mathcal{F}_n$. Pick any $x_1, x_2\in\mathcal{X}_{a(n)}$ and $(x,y)\in\mathcal{F}_n$, and let $z_1, z_2\in\mathcal{X}_{a(n)}$ be the corresponding minimizers in the definitions of $f_n\,\square\,(\sigma_n\psi_n)(x_1,y,x)$ and $f_n\,\square\,(\sigma_n\psi_n)(x_2,y,x)$, respectively. By definition,
\[
Q^R_n(x_{a(n)}) = \frac{1}{p_n}\sum_{m\in\mathcal{T}(n)} p_m\bigl[f_m(z_m,y_m,x_m) + \sigma_m\psi_m(x_{a(m)} - z_m)\bigr].
\]
This leads to a contradiction with the assumption that $v^{\mathrm{reg}} = v^{\mathrm{prim}}$. Therefore, we conclude that $Q^R_n(x_{a(n)}) = Q_n(x_{a(n)})$ for all $n\in\mathcal{N}$, $n\neq r$.

$= Q_n(\bar{x}_n)$, where the first inequality is the validity of the generalized conjugacy cut (10), and the second and the last equalities are due to Lemma 1, since $(\bar{x}_n,\bar{y}_n)_{n\in\mathcal{N}}$ is an optimal solution to problem (1). This completes the proof.
Proof The minima in (13) are well-defined because of the compactness of $X$ and the lower semicontinuity of $Q$. Take any $x\in X$. Since both the primal set $X$ and the dual set $\{\lambda\in\mathbb{R}^d : \|\lambda\|_*\le\sigma\}$ are bounded, by strong duality (cf. Theorem 3.1.30 in [27]), we have
\[
\max_{\|\lambda\|_*\le\sigma}\min_{z\in X}\{Q(z) + \langle\lambda, x - z\rangle\} = \min_{z\in X}\max_{\|\lambda\|_*\le\sigma}\{Q(z) + \langle\lambda, x - z\rangle\} = \min_{z\in X}\{Q(z) + \sigma\|x - z\|\},
\]
so that
\[
Q^R_n(x) = \max_{\|\lambda\|_*\le\sigma_n}\min_{z\in\mathcal{X}_{a(n)}}\{Q_n(z) + \langle\lambda, x - z\rangle\}.
\]
Therefore,
\[
C^{\Phi_n}_n(\bar{x}_n \mid \hat\lambda_n, \hat\rho_n, \hat{v}_n) = \hat{v}_n = \max_{\|\lambda\|_*\le l_{n,\lambda}}\min_{z\in\mathcal{X}_{a(n)}}\{Q_n(z) + \langle\lambda, \bar{x}_n - z\rangle\}
\]
Proof Let $L_n := \sum_{m\in\mathcal{C}(n)} p_{nm}(l_{m,\lambda} + l_{m,\rho})$ for simplicity. We prove the proposition recursively for nodes $n\in\mathcal{N}$ and inductively for iteration indices $i\in\mathbb{N}$. For leaf nodes and for the first iteration, the claim holds trivially, because $\mathcal{Q}^i_n(x) = 0$ for any leaf node $n\in\mathcal{N}$ with $\mathcal{C}(n) = \varnothing$, and $\mathcal{Q}^0_n(x) = 0$ by the definition (14). Now suppose that for some $n\in\mathcal{N}$ and $i\in\mathbb{N}$ it holds for all $m\in\mathcal{C}(n)$ that $\mathcal{Q}^i_m(x)\le\mathcal{Q}_m(x)$, $\mathcal{Q}^{i-1}_n(x)\le\mathcal{Q}_n(x)$, and that $\mathcal{Q}^{i-1}_n(x)$ is $L_n$-Lipschitz continuous. Then it follows from (12), (16), and (B) that $C^i_m(x\mid\hat\lambda^i_m,\hat\rho^i_m,v^i_m)\le Q_m(x)$ for all $m\in\mathcal{C}(n)$. By (15), $C^i_m$ is $(l_{m,\lambda} + l_{m,\rho})$-Lipschitz continuous, so $\sum_{m\in\mathcal{C}(n)} p_{nm}C^i_m$ is $L_n$-Lipschitz continuous. Thus the pointwise maximum of $\mathcal{Q}^{i-1}_n(x)$ and $\sum_{m\in\mathcal{C}(n)} p_{nm}C^i_m(x\mid\hat\lambda^i_m,\hat\rho^i_m,v^i_m)$ (cf. (14)) is still $L_n$-Lipschitz continuous and bounded above by $\mathcal{Q}_n(x)$, which completes the induction.
Proof Let $L_n = \sum_{m\in\mathcal{C}(n)} p_{nm}\sigma_m$ for simplicity in this proof. We prove the statement by induction on the number of iterations $i$. When $i = 1$,
\[
\overline{\mathcal{Q}}{}^1_n(x) = \min\Bigl\{+\infty,\ \sum_{m\in\mathcal{C}(n)} p_{nm}\bigl(\bar{v}^1_m + \sigma_m\|x - x^1_n\|\bigr)\Bigr\} = \sum_{m\in\mathcal{C}(n)} p_{nm}\bigl(\bar{v}^1_m + \sigma_m\|x - x^1_n\|\bigr),
\]
which is clearly $L_n$-Lipschitz continuous. For any leaf node $n$, $\overline{\mathcal{Q}}{}^1_n\equiv 0 = \mathcal{Q}^R_n$ by definition. Going recursively from the leaf nodes to the root node, suppose $\overline{\mathcal{Q}}{}^1_m\ge\mathcal{Q}^R_m$ for all $m\in\mathcal{C}(n)$ for some node $n$; then we have
\[
\begin{aligned}
\bar{v}^1_m &= f_m(z^1_m, y^1_m, x^1_m) + \sigma_m\psi_m(x^1_n - z^1_m) + \overline{\mathcal{Q}}{}^1_m(x^1_m)\\
&\ge \min\{f_m(z,y,x) + \sigma_m\psi_m(x^1_n - z) + \overline{\mathcal{Q}}{}^1_m(x) : (x,y)\in\mathcal{F}_m,\ z\in\mathcal{X}_n\}\\
&\ge \min\{f_m(z,y,x) + \sigma_m\psi_m(x^1_n - z) + \mathcal{Q}^R_m(x) : (x,y)\in\mathcal{F}_m,\ z\in\mathcal{X}_n\}\\
&= Q^R_m(x^1_n).
\end{aligned} \tag{31}
\]
Thus $\overline{\mathcal{Q}}{}^1_n(x) = \sum_{m\in\mathcal{C}(n)} p_{nm}(\bar{v}^1_m + \sigma_m\|x - x^1_n\|)\ge\mathcal{Q}^R_n(x)$ for all $x\in\mathcal{X}_n$, by the $\sigma_m$-Lipschitz continuity of the regularized value functions $Q^R_m(x)$ for all $m\in\mathcal{C}(n)$, shown in Proposition 2.
Now assume that the statement holds for all iterations up to $i-1$. For any leaf node $n$, $\overline{\mathcal{Q}}{}^i_n\equiv 0 = \mathcal{Q}^R_n$ still holds by definition. For any non-leaf node $n$, suppose $\overline{\mathcal{Q}}{}^i_m\ge\mathcal{Q}^R_m$ for all $m\in\mathcal{C}(n)$. Then by the same argument as (31), we know that $\bar{v}^i_m\ge Q^R_m(x^i_n)$. By the induction hypothesis, $\overline{\mathcal{Q}}{}^{i-1}_n(x)\ge\mathcal{Q}^R_n(x)$ for all $x\in\mathcal{X}_n$. So for the cases without convexity, $\overline{\mathcal{Q}}{}^i_n(x) = \min\bigl\{\overline{\mathcal{Q}}{}^{i-1}_n(x),\ \sum_{m\in\mathcal{C}(n)} p_{nm}(\bar{v}^i_m + \sigma_m\|x - x^i_n\|)\bigr\}$ is $L_n$-Lipschitz continuous and satisfies $\overline{\mathcal{Q}}{}^i_n(x)\ge\mathcal{Q}^R_n(x)$, since $\sum_{m\in\mathcal{C}(n)} p_{nm}(\bar{v}^i_m + \sigma_m\|x - x^i_n\|)\ge\mathcal{Q}^R_n(x)$.
It remains to show that in the convex case $\overline{\mathcal{Q}}{}^i_n(x)$ is still $L_n$-Lipschitz continuous and satisfies $\overline{\mathcal{Q}}{}^i_n(x)\ge\mathcal{Q}^R_n(x)$ for any $x\in\mathcal{X}_n$. Note that $\overline{\mathcal{Q}}{}^i_n(x)$ can be naturally extended to $\mathbb{R}^{d_n}$. As a result, $\overline{\mathcal{Q}}{}^i_n(x)$ is a supremum of $L_n$-Lipschitz linear functions (of the form $l(x) := \overline{\mathcal{Q}}{}^i_n(\hat{z}) + \langle\hat\lambda, \hat{z} - x\rangle$, where $\hat\lambda\in B_*(L_n)$ and $\hat{z}\in\mathcal{X}_n$), and thus is itself an $L_n$-Lipschitz continuous function. Therefore, $\overline{\mathcal{Q}}{}_n(x)\ge\mathcal{Q}^R_n(x)$ for all $x\in\mathbb{R}^{d_n}$. By (19), $\overline{\mathcal{Q}}{}^i_n(x) = (\overline{\mathcal{Q}}{}_n)^{**}(x)\ge(\mathcal{Q}^R_n)^{**}(x) = \mathcal{Q}^R_n(x)$ for all $x\in\mathbb{R}^{d_n}$. This completes the proof.
\[
v^{\mathrm{reg}} \le f_r(x_{a(r)}, y^*_r, x^*_r) + \mathcal{Q}^R_r(x^*_r) \le f_r(x_{a(r)}, y^*_r, x^*_r) + \overline{\mathcal{Q}}{}^i_r(x^*_r) \le \mathrm{UpperBound}.
\]
Then, using the optimality of $(x^{i+1}_r, y^{i+1}_r)$ given by $\mathcal{O}_r(\mathcal{Q}^i_r)$ and the fact that $\mathcal{Q}^i_r(x)\le\mathcal{Q}_r(x)$, we see that $\mathrm{LowerBound} = f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + \mathcal{Q}^i_r(x^{i+1}_r)\le v^{\mathrm{reg}}$. Under Assumption 2, $v^{\mathrm{reg}} = v^{\mathrm{prim}}$. Therefore, combining all the above inequalities, we have shown that
\[
v^{\mathrm{reg}} \le f_r(x_{a(r)}, y^*_r, x^*_r) + \overline{\mathcal{Q}}{}^i_r(x^*_r) \le v^{\mathrm{reg}} + \varepsilon,
\]
which means $(x^*_r, y^*_r)$ is an $\varepsilon$-optimal root node solution to the regularized problem (4). Now suppose $\overline{\mathcal{Q}}{}^i_r(x^{i+1}_r) - \mathcal{Q}^i_r(x^{i+1}_r)\le\varepsilon$ for some iteration index $i$. Since $\mathrm{UpperBound}\le f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + \overline{\mathcal{Q}}{}^i_r(x^{i+1}_r)$, we have
\[
\begin{aligned}
\mathrm{UpperBound} - \mathrm{LowerBound} &\le f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + \overline{\mathcal{Q}}{}^i_r(x^{i+1}_r) - \bigl(f_r(x_{a(r)}, y^{i+1}_r, x^{i+1}_r) + \mathcal{Q}^i_r(x^{i+1}_r)\bigr)\\
&= \overline{\mathcal{Q}}{}^i_r(x^{i+1}_r) - \mathcal{Q}^i_r(x^{i+1}_r) \le \varepsilon.
\end{aligned}
\]
as $l_{m,\lambda}\ge\sigma_m$ in Definition 3. Otherwise, by the definition (16) and the fact that $(\lambda,\rho) = (0,\sigma_n)$ is a dual feasible solution for the problem (B), we have
\[
v^i_m \ge \min\bigl\{f_m(z,y,x) + \sigma_n\psi_n(x^i_n - z) + \mathcal{Q}^{i-1}_m(x) : (x,y)\in\mathcal{F}_m,\ z\in\mathcal{X}_n\bigr\} = f_m(z^i_m, y^i_m, x^i_m) + \sigma_n\psi_n(x^i_n - z^i_m) + \mathcal{Q}^{i-1}_m(x^i_m)
\]
for all $m\in\mathcal{C}(n)$. The last equality is due to the forward step subproblem oracle $\mathcal{O}^F_m(x^i_n, \mathcal{Q}^{i-1}_m)$ in the algorithm. Meanwhile, note that $\overline{\mathcal{Q}}{}^i_m(x)\le\overline{\mathcal{Q}}{}^{i-1}_m(x)$ for $x\in\mathcal{X}_m$.
By the definition (18), we have
\[
\begin{aligned}
\bar{v}^i_m &= f_m(z^i_m, y^i_m, x^i_m) + \sigma_n\psi_n(x^i_n - z^i_m) + \overline{\mathcal{Q}}{}^i_m(x^i_m)\\
&\le f_m(z^i_m, y^i_m, x^i_m) + \sigma_n\psi_n(x^i_n - z^i_m) + \overline{\mathcal{Q}}{}^{i-1}_m(x^i_m)
\end{aligned}
\]
for all $m\in\mathcal{C}(n)$. Note that by the definition (17), $\overline{\mathcal{Q}}{}^i_n(x^i_n)\le\sum_{m\in\mathcal{C}(n)} p_{nm}\bar{v}^i_m$, and by the definitions (14) and (15), $\mathcal{Q}^i_n(x^i_n)\ge\sum_{m\in\mathcal{C}(n)} p_{nm}C^i_m(x^i_n\mid\hat\lambda^i_m,\hat\rho^i_m,v^i_m) = \sum_{m\in\mathcal{C}(n)} p_{nm}v^i_m$. Therefore,
\[
\overline{\mathcal{Q}}{}^i_n(x^i_n) - \mathcal{Q}^i_n(x^i_n) \le \sum_{m\in\mathcal{C}(n)} p_{nm}(\bar{v}^i_m - v^i_m) \le \sum_{m\in\mathcal{C}(n)} p_{nm}\bigl[\overline{\mathcal{Q}}{}^{i-1}_m(x^i_m) - \mathcal{Q}^{i-1}_m(x^i_m)\bigr] \le \sum_{m\in\mathcal{C}(n)} p_{nm}\gamma_m(\delta).
\]
Note that $\overline{\mathcal{Q}}{}^i_n(x)$ is $\sum_{m\in\mathcal{C}(n)} p_{nm}\sigma_m$-Lipschitz continuous by Proposition 6, and $\mathcal{Q}^i_n(x)$ is $\sum_{m\in\mathcal{C}(n)} p_{nm}(l_{m,\lambda} + l_{m,\rho})$-Lipschitz continuous on $\mathcal{X}_n$ by Proposition 5. Since $l_{m,\lambda} + l_{m,\rho}\ge\sigma_m$ regardless of the convexity of the problem, both $\overline{\mathcal{Q}}{}^i_n(x)$ and $\mathcal{Q}^i_n(x)$ are $L_n$-Lipschitz continuous. Therefore, for any $x\in\mathcal{X}_n$ with $\|x - x^i_n\|\le\delta_n/(2L_n)$, we have
\[
\overline{\mathcal{Q}}{}^i_n(x) - \mathcal{Q}^i_n(x) \le \overline{\mathcal{Q}}{}^i_n(x^i_n) - \mathcal{Q}^i_n(x^i_n) + 2L_n\|x - x^i_n\| \le \sum_{m\in\mathcal{C}(n)} p_{nm}\gamma_m(\delta) + \delta_n = \gamma_n(\delta).
\]
\[
\bigcup_{i\in\mathcal{I}_n(\delta)} B(\delta_n/(4L_n), x^i_n) \subseteq \bigcup_{k=1}^{K_n}\bigl(B_{n,k} + B(\delta_n/(4L_n))\bigr).
\]
It follows that
\[
\mathrm{Vol}\Bigl(\bigcup_{i\in\mathcal{I}_n(\delta)} B(\delta_n/(4L_n), x^i_n)\Bigr) = |\mathcal{I}_n(\delta)|\cdot\mathrm{Vol}\,B(\delta_n/(4L_n)) \le \mathrm{Vol}\Bigl(\bigcup_{k=1}^{K_n}\bigl(B_{n,k} + B(\delta_n/(4L_n))\bigr)\Bigr) \le \sum_{k=1}^{K_n}\mathrm{Vol}\bigl(B_{n,k} + B(\delta_n/(4L_n))\bigr).
\]
Therefore,
\[
|\mathcal{I}_n(\delta)| \le \sum_{k=1}^{K_n}\frac{\mathrm{Vol}\bigl(B_{n,k} + B(\delta_n/(4L_n))\bigr)}{\mathrm{Vol}\,B(\delta_n/(4L_n))} = \sum_{k=1}^{K_n}\left(1 + \frac{2L_n D_{n,k}}{\delta_n}\right)^{d_n}.
\]
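The final equality rests on a simple volume identity: the Minkowski sum of a ball of diameter $D_{n,k}$ and a ball of radius $r = \delta_n/(4L_n)$ is a ball of radius $D_{n,k}/2 + r$. A small Python check with hypothetical parameters:

```python
def volume_ratio(L_n, D_nk, delta_n, d_n):
    # Vol(B_{n,k} + B(r)) / Vol(B(r)) with r = delta_n / (4 * L_n):
    # the Minkowski sum is a ball of radius D_nk/2 + r, so the ratio is
    # ((D_nk/2 + r) / r)**d_n = (1 + 2*L_n*D_nk/delta_n)**d_n.
    r = delta_n / (4.0 * L_n)
    return ((D_nk / 2.0 + r) / r) ** d_n

ratio = volume_ratio(L_n=1.0, D_nk=1.0, delta_n=0.1, d_n=2)
closed_form = (1.0 + 2.0 * 1.0 * 1.0 / 0.1) ** 2  # (1 + 2*L*D/delta)^d = 441
```

The two expressions agree up to floating-point error, confirming the algebra behind the covering bound.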
\[
\mathrm{Prob}\bigl\{\gamma^{i,j}_t = \tilde\gamma^{i,j}_t \bigm| \Sigma_{i-1}\bigr\} \ge \mathrm{Prob}\bigl\{P^{i,j}_t = n(\tilde\gamma^{i,j}_t) \bigm| \Sigma_{i-1}\bigr\},
\]
where $n(\tilde\gamma^{i,j}_t)$ is the smallest node index $n\in\tilde{\mathcal{N}}(t)$ such that $\overline{\mathcal{Q}}{}^{i-1}_t(x_n) - \mathcal{Q}^{i-1}_t(x_n) = \tilde\gamma^{i,j}_t$ for $(x_n, y_n, z_n) = \mathcal{O}^F_n(x^{i,j}_{t-1}, \mathcal{Q}^{i-1}_n)$, which is determined given $\Sigma_{i-1}$. Using the same argument as in the proof of Theorem 1, Lemma 3 shows that the event $\bigcap_{t=1}^{T-1}\{\gamma^{i,j}_t = \tilde\gamma^{i,j}_t\}$ implies the event $\{i\in\bigcup_{t=0}^{T-1}\mathcal{I}_t(\delta)\}$, and hence the event $A_i(\delta)$, for each $j = 1,\dots,M$. Therefore, since $\Sigma_{i-1}$ is contained in $\sigma(\Sigma^{\mathrm{oracle}}_\infty, \sigma\{P^{i',j'}\}_{(i',j')\neq(i,j)})$, by the independent, uniform sampling (Assumption 4), we have
\[
\begin{aligned}
\mathrm{Prob}\bigl(A_i(\delta) \bigm| \Sigma_{i-1}\bigr) &\ge \mathrm{Prob}\Bigl(\bigcup_{j=1}^{M}\bigcap_{t=1}^{T-1}\bigl\{P^{i,j}_t = n(\tilde\gamma^{i,j}_t)\bigr\} \Bigm| \Sigma_{i-1}\Bigr)\\
&= 1 - \Bigl(1 - \mathrm{Prob}\Bigl(\bigcap_{t=1}^{T-1}\bigl\{P^{i,j}_t = n(\tilde\gamma^{i,j}_t)\bigr\} \Bigm| \Sigma_{i-1}\Bigr)\Bigr)^M = 1 - (1 - 1/N)^M.
\end{aligned}
\]
Here, the last step follows from $\mathrm{Prob}\bigl(\bigcap_{t=1}^{T-1}\{P^{i,j}_t = n(\tilde\gamma^{i,j}_t)\} \bigm| \Sigma_{i-1}\bigr) = \prod_{t=1}^{T-1}\mathrm{Prob}\bigl(\{P^{i,j}_t = n(\tilde\gamma^{i,j}_t)\} \bigm| \Sigma_{i-1}\bigr) = \prod_{t=1}^{T-1}(1/N_t) = 1/N$.
For any $\kappa > 1$, take the smallest iteration index $i$ such that $i\nu\ge\kappa I$, and set $k := (\kappa - 1)I$. Since $I\ge\frac{i\nu}{2\kappa}$, the probability bound can then be written as
\[
\mathrm{Prob}(\iota\ge i) \le \mathrm{Prob}(S_i\le I) \le \exp\left(-\frac{(\kappa - 1)^2 I^2}{8i}\right) \le \exp\left(-\frac{(\kappa - 1)^2 I\nu}{16\kappa}\right).
\]
Substituting the left-hand side with $\mathrm{Prob}\bigl(\iota\ge 1 + \frac{\kappa I}{\nu}\bigr)$, using the definition of $i$, we obtain the desired inequality.
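The chain of exponential bounds above can be sanity-checked numerically: whenever $I \ge i\nu/(2\kappa)$, the first exponential is no larger than the second. The parameter values in the sketch below are hypothetical.

```python
import math

def tail_bounds(I, i, kappa, nu):
    # The two exponential bounds from the proof: exp(-(kappa-1)^2 I^2 / (8 i))
    # and exp(-(kappa-1)^2 I nu / (16 kappa)); the first implies the second
    # whenever I >= i*nu/(2*kappa). Illustrative evaluation only.
    b1 = math.exp(-((kappa - 1) ** 2) * I * I / (8.0 * i))
    b2 = math.exp(-((kappa - 1) ** 2) * I * nu / (16.0 * kappa))
    return b1, b2

nu, kappa, I = 0.1, 2.0, 50.0
i = math.ceil(kappa * I / nu)          # smallest integer i with i*nu >= kappa*I
b1, b2 = tail_bounds(I, i, kappa, nu)  # here I = 50 >= i*nu/(2*kappa) = 25
```

Both bounds decay exponentially in $I$, which is what yields the high-probability iteration estimate of Theorem 3.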
Proof We prove the lemma recursively, starting from the leaf nodes. For leaf nodes $n\in\mathcal{N}$, $\mathcal{C}(n) = \varnothing$, we have $Q^R_n(x) = \min_{z\in\mathcal{X}_{a(n)}}\{Q_n(z) + \sigma_n\psi_n(x - z)\}\ge\min_{z\in\mathcal{X}_{a(n)}}\{Q_n(z) + l_n\|x - z\|\}$. Since $Q_n$ is $l_n$-Lipschitz continuous, $Q_n(z)\ge Q_n(x) - l_n\|x - z\|$. Therefore, $Q^R_n(x)\ge Q_n(x)$, and by Proposition 2 we know that $Q^R_n(x) = Q_n(x)$ for all $x\in\mathcal{X}_{a(n)}$. Now suppose that for a node $n\in\mathcal{N}$, all of its child nodes satisfy $Q^R_m(x) = Q_m(x)$ for all $x\in\mathcal{X}_n$, $m\in\mathcal{C}(n)$. Then by definition,
\[
Q^R_n(x_{a(n)}) = \min_{(x,y)\in\mathcal{F}_n,\ z\in\mathcal{X}_{a(n)}}\bigl\{f_n(z,y,x) + \sigma_n\psi_n(x_{a(n)} - z) + \mathcal{Q}^R_n(x)\bigr\}.
\]
Therefore, it must hold that $K\ge\mathrm{Vol}\,X/\mathrm{Vol}\,B(2\beta/L) = \left(\frac{DL}{4\beta}\right)^d$, which is a contradiction. The existence of $\hat{x}$ guarantees that $f(w_k) - L\|\hat{x} - w_k\|\le f(w_k) - 2\beta < 0$ for each $k = 1,\dots,K$. Therefore, $0\le\min_{x\in X}\underline{Q}(x)\le\underline{Q}(\hat{x}) = \max_{1\le k\le K}\{0,\ f(w_k) - L\|\hat{x} - w_k\|\} = 0$. From the compactness of $X$ and the continuity of $\overline{Q}(x)$, we have the inequality $\min_{x\in X}\overline{Q}(x)\ge\min_{1\le k\le K}\overline{Q}(w_k) = \min_{1\le k\le K} f(w_k) > \beta$, which completes the proof.
Proof Let us define the forward subproblem oracle $\mathcal{O}^F_n$ in iteration $i$ and stage $t$ as mapping $(x^i_{t-1}, Q^{i-1}_{t+1})$ to an optimal solution $(x^i_t, z^i_t)$ of the forward subproblem
\[
\min_{x_t, z_t\in\mathcal{X}_t}\bigl\{f_t(z_t) + L\|x^i_{t-1} - z_t\| + Q^{i-1}_{t+1}(x_t)\bigr\},
\]
and the backward subproblem oracle $\mathcal{O}^B_n$ in iteration $i$ and stage $t$ as mapping $(x^i_{t-1}, Q^i_{t+1})$ to an optimal solution $(\hat{x}^i_t, \hat{z}^i_t;\ \hat\lambda^i_t = 0,\ \hat\rho^i_t = L)$ of the backward subproblem (32). Note that in the backward subproblem (32), we choose $l_{n,\lambda} = 0$ and $l_{n,\rho} = L$. Observe that the objective function in (32) is nondecreasing in $\rho$; therefore, $\hat\rho^i_t = L$ is always an optimal solution for the outer maximization in (32). The root-node oracle $\mathcal{O}_r$ in iteration $i$ simply solves $\min_{x_0\in X} Q^i_1(x_0)$ and outputs $x^{i+1}_0$.
In the backward step (Algorithm 2, step 14), cf. the definition (15), the new generalized conjugacy cut in iteration $k\le i$ is generated with value
\[
\begin{aligned}
v^k_t &= f_t(\hat{z}^k_t) + L\|x^k_{t-1} - \hat{z}^k_t\| + \min_{x_t\in\mathcal{X}_t} Q^k_{t+1}(x_t)\\
&\le f_t(x^k_{t-1}) + \min_{x_t\in\mathcal{X}_t} Q^k_{t+1}(x_t)\\
&\le f_t(x^k_{t-1}) + \min_{x_t\in\mathcal{X}_t} Q^i_{t+1}(x_t),
\end{aligned}
\]
so that the under-approximation satisfies, for all $x\in\mathcal{X}_{t-1}$,
\[
Q^i_t(x) \le \max_{k=1,\dots,i}\bigl\{0,\ f_t(x^k_{t-1}) - L\|x - x^k_{t-1}\|\bigr\} + \min_{x_t\in\mathcal{X}_t} Q^i_{t+1}(x_t). \tag{33}
\]
Similarly, by (18), the upper approximation of the value function is computed and lower bounded as
\[
\begin{aligned}
\bar{v}^k_t &= f_t(z^k_t) + L\|z^k_t - x^k_{t-1}\| + \overline{Q}{}^k_{t+1}(x^k_t)\\
&\ge f_t(x^k_{t-1}) + \overline{Q}{}^k_{t+1}(x^k_t)\\
&\ge f_t(x^k_{t-1}) + \min_{x_t\in\mathcal{X}_t}\overline{Q}{}^k_{t+1}(x_t)\\
&\ge f_t(x^k_{t-1}) + \min_{x_t\in\mathcal{X}_t}\overline{Q}{}^i_{t+1}(x_t),
\end{aligned}
\]
where $\hat{\mathcal{X}}^k_t := \arg\min_{x_t\in\mathcal{X}_t} Q^{k-1}_{t+1}(x_t)$. Therefore, the over-approximation satisfies
\[
\begin{aligned}
\overline{Q}{}^i_t(x) &= \min_{k=1,\dots,i}\bigl\{\bar{v}^k_t + L\|x - x^k_{t-1}\|\bigr\} \quad\text{(by (17))}\\
&\ge \min_{k=1,\dots,i}\bigl\{f_t(x^k_{t-1}) + L\|x - x^k_{t-1}\|\bigr\} + \min_{x_t\in\mathcal{X}_t}\overline{Q}{}^i_{t+1}(x_t)\\
&> \frac{\varepsilon}{T} + \min_{x_t\in\mathcal{X}_t}\overline{Q}{}^i_{t+1}(x_t), \qquad (34)
\end{aligned}
\]
where (34) follows from the construction that $f_t(x) > \varepsilon/T$ for all $x\in\mathcal{X}_t$.
Now, using (33) and (34), we can prove the statement of the theorem. Suppose the iteration index $i < \left(\frac{DLT}{4\varepsilon}\right)^d$. Denote $w_k := x^k_{t-1}$ for $k = 1,\dots,i$. Since $\varepsilon/T < f_t(w_k) < 2\varepsilon/T$ by construction, applying Lemma 7 we get $\min_{x_t\in\mathcal{X}_t}\max_{k=1,\dots,i}\{0,\ f_t(x^k_{t-1}) - L\|x_t - x^k_{t-1}\|\} = 0$. By (33), $\min_{x_{t-1}\in\mathcal{X}_{t-1}} Q^i_t(x_{t-1})\le\min_{x_t\in\mathcal{X}_t} Q^i_{t+1}(x_t)$ for $t = 1,\dots,T$. Note that at stage $T$, $Q^i_{T+1}\equiv 0$. Therefore, $\min_{x_{t-1}\in\mathcal{X}_{t-1}} Q^i_t(x_{t-1})\le 0$ for all $t = 1,\dots,T$. But since $Q^i_t(x)\ge 0$ for all $x\in\mathcal{X}_{t-1}$, we have $\min_{x\in\mathcal{X}_{t-1}} Q^i_t(x) = 0$ for all $1\le t\le T$. Hence we see that $\mathrm{LowerBound} = \min_{x_0\in\mathcal{X}_0} Q^i_1(x_0) = 0$ in iteration $i$.

Since $\mathcal{X}_t$ is a norm ball, it is compact. So by (34), we have
\[
\min_{x_{t-1}\in\mathcal{X}_{t-1}}\overline{Q}{}^i_t(x_{t-1}) > \varepsilon/T + \min_{x_t\in\mathcal{X}_t}\overline{Q}{}^i_{t+1}(x_t), \qquad \forall\, 1\le t\le T.
\]
This recursion implies that $\min_{x_0\in\mathcal{X}_0}\overline{Q}{}^i_1(x_0) > T(\varepsilon/T) = \varepsilon$. According to Algorithm 2, steps 18–19, in iteration $i$ we have $\mathrm{UpperBound} = \min_{k=1,\dots,i}\{\overline{Q}{}^k_1(x^{k+1}_0)\}\ge\min_{k=1,\dots,i}\{\overline{Q}{}^i_1(x^{k+1}_0)\}\ge\min_{x_0\in\mathcal{X}_0}\overline{Q}{}^i_1(x_0) > \varepsilon$. Combining the above analysis, we have $\mathrm{UpperBound} - \mathrm{LowerBound} > \varepsilon$ in iteration $i$. Therefore, we conclude that if $\mathrm{UpperBound} - \mathrm{LowerBound}\le\varepsilon$ at the $i$-th iteration, then $i\ge\left(\frac{DLT}{4\varepsilon}\right)^d$.
Proof Let $v_d$ denote the volume of the $d$-dimensional unit ball. Recall that the $d$-volume of $S^d(R)$ is given by
\[
\mathrm{Vol}_d(S^d(R)) = (d+1)v_{d+1}R^d = \frac{(d+1)\pi^{(d+1)/2}}{\Gamma(\frac{d+1}{2} + 1)}R^d.
\]
We next estimate the $d$-volume of the spherical cap $S^d_\beta(R,x)$. Let $\alpha\in(0,\pi/2)$ denote the central angle of the spherical cap, i.e., $\cos\alpha = 1 - \beta/R$. Since $\beta < (1 - \frac{\sqrt{2}}{2})R$, we know that $\alpha < \pi/4$. Then for any $x\in S^d(R)$, the $d$-volume of the spherical cap can be calculated through
\[
\mathrm{Vol}_d(S^d_\beta(R,x)) = \int_0^\alpha \mathrm{Vol}_{d-1}(S^{d-1}(R\sin\theta))\,R\,d\theta = d\,v_d R^d\int_0^\alpha(\sin\theta)^{d-1}\,d\theta.
\]
Note that for $\theta\in(0,\alpha)$, $\sin\theta > 0$ and $\cos\theta/\sin\theta > 1$. Therefore, since $d\ge 2$,
\[
\mathrm{Vol}_d(S^d_\beta(R,x)) \le d\,v_d R^d\int_0^\alpha\frac{\cos\theta}{\sin\theta}(\sin\theta)^{d-1}\,d\theta = d\,v_d R^d\cdot\frac{(\sin\alpha)^{d-1}}{d-1}.
\]
By substituting $\sin\alpha = \sqrt{1 - (1 - \beta/R)^2}$, we have that
\[
\frac{\mathrm{Vol}_d(S^d_\beta(R,x))}{\mathrm{Vol}_d(S^d(R))} \le \frac{d}{d^2 - 1}\,\frac{v_d}{v_{d+1}}\left(1 - \Bigl(1 - \frac{\beta}{R}\Bigr)^2\right)^{(d-1)/2} \le \frac{d}{d^2 - 1}\,\frac{v_d}{v_{d+1}}\left(\frac{2\beta}{R}\right)^{(d-1)/2}.
\]
Now suppose $W = \{w_k\}_{k=1}^K$ is a maximal set satisfying the assumption, that is, for any $w\in S^d(R)$, $w\notin W$, there exists $w_k\in W$ such that $w\in S^d_\beta(R, w_k)$. Then $\bigcup_{k=1}^K S^d_\beta(R, w_k)\supseteq S^d(R)$, and therefore
\[
\mathrm{Vol}_d(S^d(R)) \le \sum_{k=1}^K\mathrm{Vol}_d(S^d_\beta(R,w_k)) = |W|\,\mathrm{Vol}_d(S^d_\beta(R,w_1)).
\]
Therefore, we have
\[
|W| \ge \frac{\mathrm{Vol}_d(S^d(R))}{\mathrm{Vol}_d(S^d_\beta(R,w_1))} \ge \left(\frac{d}{d^2 - 1}\,\frac{v_d}{v_{d+1}}\left(\frac{2\beta}{R}\right)^{(d-1)/2}\right)^{-1} = \frac{(d^2 - 1)\sqrt{\pi}}{d}\,\frac{\Gamma(d/2 + 1)}{\Gamma(d/2 + 3/2)}\left(\frac{R}{2\beta}\right)^{(d-1)/2}.
\]
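The key analytic step in this proof, bounding $\int_0^\alpha(\sin\theta)^{d-1}\,d\theta$ by $(\sin\alpha)^{d-1}/(d-1)$ for $\alpha < \pi/4$, can be verified numerically with a simple midpoint Riemann sum; the dimension and angle below are hypothetical test values.

```python
import math

# Check integral_0^alpha (sin t)^(d-1) dt <= (sin alpha)^(d-1) / (d - 1),
# which follows from inserting the factor cos t / sin t > 1 on (0, alpha).
def sin_power_integral(d, alpha, steps=50000):
    # Midpoint Riemann sum; accurate enough for this smooth integrand.
    h = alpha / steps
    return sum(math.sin((j + 0.5) * h) ** (d - 1) for j in range(steps)) * h

d, alpha = 5, math.pi / 6          # alpha < pi/4, d >= 2 as in the proof
lhs = sin_power_integral(d, alpha)
rhs = math.sin(alpha) ** (d - 1) / (d - 1)
```

For these values the integral is roughly half of the bound, so the inequality is comfortably satisfied, which is consistent with it only being used as a one-sided estimate in the proof.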
Proof First we claim that in any iteration $i$, for any nodal problem $k$ in stage $t$, the optimal solution in the forward step (Algorithm 2, line 7) must be $x^i_t = w_{t,k}$. To see this, recall that we set $l_{n,\lambda} = L_t$ and $l_{n,\rho} = 0$ for all $n\in\tilde{\mathcal{N}}(t)$, so by Proposition 5, the
123
976 S. Zhang, X. A. Sun
which, alongside with the fact that L t+1 < L t , implies that xt = wt,k is the unique
optimal solution to the forward step problem (35). The above argument also works
for any node in the stage t = 1 by simply removing the constant term Ft (xt−1
i ) in the
$$Q^i_{t,k}(x) := \max_{1 \le j \le i}\big\{0,\ \mathcal{C}_{t,k}(x \mid \hat{\lambda}^j_{t,k}, 0, v^j_{t,k})\big\},$$
and
$$\overline{Q}{}^i_{t,k}(x) := \mathrm{conv}_{1 \le j \le i}\big\{\bar{v}^j_{t,k} + L_t\|x^j_{t-1} - x\|\big\},$$
with $-\hat{\lambda}^j_{t,k} \in \partial F_{t-1}(x^j_{t-1})$ being an optimal solution to the outer maximization problem, and by formula (18),
$$\bar{v}^j_{t,k} := F_{t-1}(z^j_{t,k}) + L_t\|x^j_{t-1} - z^j_{t,k}\| + \overline{\mathcal{Q}}{}^j_t(x^j_{t,k}) = F_{t-1}(x^j_{t-1}) + \overline{\mathcal{Q}}{}^j_t(x^j_{t,k}).$$
The last equalities of $v^j_{t,k}$ and $\bar{v}^j_{t,k}$ are due to the $L_t$-Lipschitz continuity of $F_{t-1}$. So we have by the monotonicity of the under- and over-approximations that for each $j \le i$,
$$Q^i_{t,k}(x) = \max_{1 \le j \le i}\big\{0,\ F_{t-1}(x^j_{t-1}) + \langle\hat{\lambda}^j_{t,k}, x^j_{t-1} - x\rangle + \mathcal{Q}^j_t(w_{t,k})\big\}$$
$$\le \max_{1 \le j \le i}\big\{0,\ F_{t-1}(x^j_{t-1}) + \langle\hat{\lambda}^j_{t,k}, x^j_{t-1} - x\rangle\big\} + \mathcal{Q}^i_t(w_{t,k}),$$
and
$$\overline{Q}{}^i_{t,k}(x) = \mathrm{conv}_{1 \le j \le i}\big\{F_{t-1}(x^j_{t-1}) + L_t\|x^j_{t-1} - x\| + \overline{\mathcal{Q}}{}^j_t(w_{t,k})\big\} \ge \mathrm{conv}_{1 \le j \le i}\big\{F_{t-1}(x^j_{t-1}) + L_t\|x^j_{t-1} - x\|\big\} + \overline{\mathcal{Q}}{}^i_t(w_{t,k}).$$
Therefore,
$$\overline{Q}{}^i_{t,k}(x) - Q^i_{t,k}(x) \ge \overline{\mathcal{Q}}{}^i_t(w_{t,k}) - \mathcal{Q}^i_t(w_{t,k}) + \mathrm{conv}_{1 \le j \le i}\big\{F_{t-1}(x^j_{t-1}) + L_t\|x^j_{t-1} - x\|\big\} - \max_{1 \le j \le i}\big\{0,\ F_{t-1}(x^j_{t-1}) + \langle\hat{\lambda}^j_{t,k}, x^j_{t-1} - x\rangle\big\}. \quad (36)$$
Thus we have $\overline{Q}{}^i_{t,k}(w_{t-1,k'}) - Q^i_{t,k}(w_{t-1,k'}) \ge \overline{\mathcal{Q}}{}^i_t(w_{t,k}) - \mathcal{Q}^i_t(w_{t,k})$ for any $k' = 1,\dots,K_{t-1}$, since the last two terms on the right-hand side of (36) are over- and under-approximations of the function $F_{t-1}$, respectively. Moreover, note that $x^j_{t-1} = w_{t-1,k'}$ for some $k' = 1,\dots,K_{t-1}$, as it is the unique solution in the forward step. By Lemma 9, whenever the node $n = (t-1,k')$ is never sampled up to iteration $i$, we further have
$$\overline{Q}{}^i_{t,k}(w_{t-1,k'}) - Q^i_{t,k}(w_{t-1,k'}) > \frac{3\varepsilon}{2(T-1)} + \overline{\mathcal{Q}}{}^i_t(w_{t,k}) - \mathcal{Q}^i_t(w_{t,k}).$$
and
$$\overline{\mathcal{Q}}{}^i_{t-1}(x) = \mathrm{conv}_{1 \le j \le i}\left(\frac{1}{K_t}\sum_{k=1}^{K_t}\big(\bar{v}^j_{t,k} + L_t\|x^j_{t-1} - x\|\big)\right) \ge \frac{1}{K_t}\sum_{k=1}^{K_t}\overline{Q}{}^i_{t,k}(x).$$
It follows that
$$\overline{\mathcal{Q}}{}^i_{t-1}(w_{t-1,k'}) - \mathcal{Q}^i_{t-1}(w_{t-1,k'}) \ge \frac{1}{K_t}\sum_{k=1}^{K_t}\big[\overline{\mathcal{Q}}{}^i_t(w_{t,k}) - \mathcal{Q}^i_t(w_{t,k})\big],$$
and, whenever the node $(t-1,k')$ is never sampled up to iteration $i$,
$$\overline{\mathcal{Q}}{}^i_{t-1}(w_{t-1,k'}) - \mathcal{Q}^i_{t-1}(w_{t-1,k'}) > \frac{3\varepsilon}{2(T-1)} + \frac{1}{K_t}\sum_{k=1}^{K_t}\big[\overline{\mathcal{Q}}{}^i_t(w_{t,k}) - \mathcal{Q}^i_t(w_{t,k})\big].$$
Averaging over $k' = 1,\dots,K_{t-1}$ then gives
$$\frac{1}{K_{t-1}}\sum_{k'=1}^{K_{t-1}}\big[\overline{\mathcal{Q}}{}^i_{t-1}(w_{t-1,k'}) - \mathcal{Q}^i_{t-1}(w_{t-1,k'})\big] > \frac{\varepsilon}{T-1} + \frac{1}{K_t}\sum_{k=1}^{K_t}\big[\overline{\mathcal{Q}}{}^i_t(w_{t,k}) - \mathcal{Q}^i_t(w_{t,k})\big].$$
Consequently, $\overline{\mathcal{Q}}{}^i_r - \mathcal{Q}^i_r > (T-1)\cdot\frac{\varepsilon}{T-1} = \varepsilon$. Therefore, if $\mathrm{UpperBound} - \mathrm{LowerBound} = \overline{\mathcal{Q}}{}^i_r - \mathcal{Q}^i_r \le \varepsilon$ in iteration $i$, then
$$i > \frac{1}{3}\min_{t=1,\dots,T-1}|\mathcal{W}_t| \ge \frac{1}{3}\,\frac{d(d-2)\sqrt{\pi}}{d-1}\,\frac{\Gamma(d/2+1/2)}{\Gamma(d/2+1)}\left(\frac{DL(T-1)}{8\varepsilon}\right)^{(d-2)/2}.$$
In this section, we discuss the problem classes that allow an exact penalty reformulation, as stated in Assumption 2. A penalty function $\psi\colon \mathbb{R}^d \to \mathbb{R}_+$ is said to be sharp if $\psi(x) \ge c\|x\|$ for all $x \in V \subset \mathbb{R}^d$, for some open neighborhood $V \ni 0$ and some scalar $c > 0$.
For convex problems, the Slater condition implies that the intersection of the domain $\mathrm{dom}(\sum_{n\in\mathcal{N}} f_n)$ and the feasible set $\prod_{n\in\mathcal{N}}\mathcal{F}_n$ has a non-empty interior. We have the following proposition.
Proposition 8 Suppose the problem (1) is convex and satisfies the Slater condition. For any sharp penalty functions $\psi_n$, there exist $\sigma_n > 0$ such that the penalty reformulation is exact.
Proof Consider a perturbation vector $w = (w_n)_{n\in\mathcal{N}}$ such that $w_n \in \mathcal{X}_{a(n)} - \mathcal{X}_{a(n)}$ for each $n \in \mathcal{N}$, and define the perturbation function
$$\tau(w) := \min_{(z_n,x_n,y_n)\in\mathcal{X}_{a(n)}\times\mathcal{F}_n}\left\{\sum_{n\in\mathcal{N}} p_n f_n(z_n,y_n,x_n)\ \middle|\ w_n = x_{a(n)} - z_n,\ \forall\, n \in \mathcal{N}\right\}.$$
The function $\tau$ is convex and $v^{\mathrm{prim}} = \tau(0)$ by definition. By the Slater condition, $0 \in \mathrm{int}(\mathrm{dom}(\tau))$, so there exists a vector $\lambda \in \mathbb{R}^{|\mathcal{N}|}$ such that $\tau(w) \ge \tau(0) + \langle\lambda, w\rangle$ for all perturbations $w$. Since the $\psi_n$ are sharp, there exist $\sigma_n > 0$ such that $\sum_{n\in\mathcal{N}}\sigma_n\psi_n(w_n) + \langle\lambda, w\rangle > 0$ for all $w \ne 0$. Consequently, the penalty reformulation is exact since $v^{\mathrm{reg}} = \min_w \{\tau(w) + \sum_{n\in\mathcal{N}}\sigma_n\psi_n(w_n)\}$ and all optimal solutions must satisfy $w_n = x_{a(n)} - z_n = 0$ for all $n \in \mathcal{N}$.
We say a problem (2) has finite state spaces if the state spaces Xn are finite sets for all
nodes n. Such problems appear in multistage integer programming [38], or when the
original state spaces can be approximated through finite ones [19, 39]. The following
proposition shows the penalty reformulation is exact whenever the state spaces are
finite.
Proposition 9 For any penalty functions $\psi_n$, $n \in \mathcal{N}$, if the state spaces are finite, then there exist finite $\sigma_n > 0$ such that the penalty reformulation (7) is exact.
Proof Let $d_n := \min_{x \ne z \in \mathcal{X}_{a(n)}} |\psi_n(x - z)|$ for each $n \in \mathcal{N}$. Since $\psi_n$ is a penalty function and the state space $\mathcal{X}_n$ is finite, we know $d_n > 0$. Define $c$ as
$$c := \min_{(z_n,y_n,x_n)\in\mathcal{X}_{a(n)}\times\mathcal{F}_n} \sum_{n\in\mathcal{N}} p_n f_n(z_n,y_n,x_n). \quad (37)$$
Since (37) is a relaxation of the original problem (1), obtained by ignoring the coupling constraints $z_n = x_{a(n)}$, we have $c \le v^{\mathrm{prim}}$. We choose $\sigma_n = 1 + (v^{\mathrm{prim}} - c)/(p_n d_n)$ for all $n \in \mathcal{N}$. Now let $(x_n, y_n, z_n)_{n\in\mathcal{N}}$ be an optimal solution to the regularized problem (4). If $x_{a(m)} \ne z_m$ for some $m \ne r$, then $p_m\sigma_m\psi_m(x_{a(m)} - z_m) > v^{\mathrm{prim}} - c$. Consequently,
$$v^{\mathrm{reg}} \ge c + \sum_{n\in\mathcal{N}} p_n\sigma_n\psi_n(x_{a(n)} - z_n) \ge c + p_m\sigma_m\psi_m(x_{a(m)} - z_m) > c + v^{\mathrm{prim}} - c = v^{\mathrm{prim}}.$$
This is a contradiction since $v^{\mathrm{reg}} \le v^{\mathrm{prim}}$. Thus, any optimal solution to the reformulation (7) must have $x_{a(n)} = z_n$ for all $n \ne r$, which means the penalty reformulation is exact.
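The construction of $\sigma_n$ in this proof can be illustrated on a toy instance. The sketch below (hypothetical two-node data with scalar state space $\{0,1,2\}$ and a single child of probability one) computes $c$, $d_n$, and $\sigma_n$ as above and checks by enumeration that every optimum of the penalized problem satisfies the coupling constraint.

```python
import itertools

# Toy instance: root cost f_r(x), child cost f_c(z), coupling z = x,
# finite state space X, child probability p (illustrative values)
X = [0, 1, 2]
f_r = {0: 3.0, 1: 1.0, 2: 0.0}
f_c = {0: 0.0, 1: 2.0, 2: 5.0}
p = 1.0
psi = lambda w: abs(w)                       # a sharp penalty function

v_prim = min(f_r[x] + p * f_c[x] for x in X)             # coupled problem
c = min(f_r.values()) + p * min(f_c.values())            # relaxation (37)
d_n = min(psi(x - z) for x in X for z in X if x != z)    # = 1 here
sigma = 1 + (v_prim - c) / (p * d_n)                     # penalty factor

# Penalized problem: coupling dropped, violation charged at rate p*sigma
best = min((f_r[x] + p * f_c[z] + p * sigma * psi(x - z), x, z)
           for x, z in itertools.product(X, X))
assert best[1] == best[2]                    # optimum satisfies z = x
assert abs(best[0] - v_prim) < 1e-9          # exactness: v_reg = v_prim
```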
The problem (1) is said to be defined by mixed-integer linear functions if all the feasible sets $\mathcal{F}_n$ and the epigraphs $\mathrm{epi}\, f_n$ are representable by mixed-integer variables and non-strict linear inequalities with rational coefficients. Recall that by Assumption 1, the primal problem is feasible and $v^{\mathrm{prim}} > -\infty$. We have the following proposition on the exact penalty reformulation.
Proposition 11 Suppose the problem (1) is defined by $C^1$-functions and the Karush–Kuhn–Tucker condition holds for every local minimum solution of (1). If the penalty functions $\psi_n$ are sharp for all $n \in \mathcal{N}$, then there exist $\sigma_n > 0$ such that the penalty reformulation is exact.
Here u is the perturbation vector and u = 0 corresponds to the original primal problem.
Let ψ be a penalty function on Rd and σ > 0 a penalty factor. A penalization of the
original primal problem p(0) is given by
Proof Let $X(u) \subset X$ denote the feasible set in $x$ dependent on $u$. The minimum in the definition is well defined for every $u$ due to the compactness of $X(u)$.
We show that $p(u)$ is lower semicontinuous (lsc) by showing $\liminf_{v\to u} p(v) \ge p(u)$ for any $u$. Assume for contradiction that there exist $\varepsilon > 0$ and a sequence $\{v_k\}_{k=1}^\infty$ such that $v_k \to u$ and $p(v_k) \le p(u) - \varepsilon$. Let $x_k \in \arg\min f(x, v_k)$, so that $p(v_k) = f(x_k, v_k)$. Since $X$ is compact, there exist a subsequence $\{x_{k_j}\}$ and $z \in X$ such that $x_{k_j} \to z$ as $j \to \infty$. Then by continuity of $f$, $f(z, u) = \lim_{j\to\infty} f(x_{k_j}, v_{k_j}) \le p(u) - \varepsilon$. This contradicts the definition of $p(u)$, since $p(u) = \min_{x\in X(u)} f(x,u) \le f(z,u) \le p(u) - \varepsilon$. Therefore $p(u)$ is lsc.
Proof Let $X(u)$ denote the feasible region of $x$ defined by the constraints $g_i(x,u) \le 0$, $i = 1,\dots,I$, and $h_j(x,u) = 0$, $j = 1,\dots,J$. Then $X(u)$ is compact for any $u$ by the continuity of the constraint functions. We show that for every optimal solution $x_0 \in X(0)$, there exist a neighborhood $V(x_0) \ni x_0$ in the $x$-space, a neighborhood $U(x_0) \ni u = 0$ in the $u$-space, and a constant $L(x_0) > 0$, such that for all $x \in V(x_0)$ and $u \in U(x_0)$, we have $f(x,u) \ge f(x_0,0) - L(x_0)\cdot\|u\|$. Then we use this fact together with the compactness of $X(0)$ to show the existence of exact penalization. In this proof, little-o notation is used to simplify the presentation, i.e., $o(a)$ denotes a function $b(a)$ such that $\lim_{a\to 0} |b(a)|/\|a\| = 0$.
Pick any optimal solution $x_0 \in X(0)$. By definition, it is also a local minimum solution. By hypothesis, the KKT condition is satisfied at $x_0$; that is, there exist $\lambda_i \ge 0$, $i = 1,\dots,I$, and $\mu_j \in \mathbb{R}$, $j = 1,\dots,J$, such that
$$\nabla_x f(x_0,0) + \sum_{i=1}^I \lambda_i \nabla_x g_i(x_0,0) + \sum_{j=1}^J \mu_j \nabla_x h_j(x_0,0) = 0,$$
$$h_j(x_0,0) = 0,\quad j = 1,\dots,J,$$
$$g_i(x_0,0) \le 0,\quad \lambda_i \cdot g_i(x_0,0) = 0,\quad i = 1,\dots,I. \quad (40)$$
Let $A \subset \{1,\dots,I\}$ denote the set of active inequality constraints. Then similarly we have
$$\begin{aligned}
f(x,u) - f(x_0,0) &= \langle\nabla_x f(x_0,0), x - x_0\rangle + \langle\nabla_u f(x_0,0), u\rangle + o(\|x-x_0\| + \|u\|)\\
&= \Big\langle -\sum_{i\in A}\lambda_i\nabla_x g_i(x_0,0) - \sum_{j=1}^J \mu_j\nabla_x h_j(x_0,0),\ x - x_0\Big\rangle + \langle\nabla_u f(x_0,0), u\rangle + o(\|x-x_0\| + \|u\|)\\
&\ge \Big\langle \nabla_u f(x_0,0) + \sum_{i\in A}\lambda_i\nabla_u g_i(x_0,0) + \sum_{j=1}^J \mu_j\nabla_u h_j(x_0,0),\ u\Big\rangle + o(\|x-x_0\| + \|u\|)\\
&> -L(x_0)\cdot\|u\| + o(\|x-x_0\| + \|u\|),
\end{aligned}$$
where $L(x_0) := \big\|\nabla_u f(x_0,0) + \sum_{i\in A}\lambda_i\nabla_u g_i(x_0,0) + \sum_{j=1}^J \mu_j\nabla_u h_j(x_0,0)\big\| + 1 > 0$. By the definition of the little-o notation, there exist neighborhoods $V(x_0) \subset \cap_{i\notin A} W_i$ with $x_0 \in V(x_0)$, and $U(x_0) \subset \cap_{i\notin A} U_i$ with $0 \in U(x_0)$, such that
Now, let $X^{\mathrm{opt}}(0)$ denote the set of optimal solutions in $x$ when $u = 0$. Note that $X^{\mathrm{opt}}(0) \subset X(0)$ is closed due to the continuity of $f$, $g_i$, $h_j$, hence compact. The collection of open sets $\{V(x)\}_{x\in X^{\mathrm{opt}}(0)}$ covers $X^{\mathrm{opt}}(0)$. By compactness, there exists a finite subcollection $\{V(x_k)\}_{k=1}^K$ such that $X^{\mathrm{opt}}(0) \subset \cup_{k=1}^K V(x_k) =: V$. Let $L := \max_{k=1,\dots,K} L(x_k)$ and $U := \cap_{k=1}^K U(x_k)$. Let $f^*$ denote the optimal value for $u = 0$. Then we have
To show the inequality for $x \notin V$, define $\tilde{p}(u) := \min_{x\in X(u)\setminus V} f(x,u)$. Note that $\tilde{p}(0) > f^*$ by the definition of $X^{\mathrm{opt}}(0)$. By Lemma 10, $\tilde{p}(u)$ is lower semicontinuous, so there exists a neighborhood $U'$ of $0$ such that $\tilde{p}(u) > f^*$ for all $u \in U'$. Therefore, for all $u \in U \cap U'$, we have $f(x,u) \ge f^* - L\cdot\|u\|$. Finally,
we can show that the penalization is exact. Since $\psi$ is sharp, there exist an open set $\tilde{U} \subset U \cap U'$ and a constant $c > 0$ such that
$$\psi(u) \ge c\|u\| \quad \text{on } \tilde{U}.$$
Let $M = \min_{u\in \bar{B}_{R_u}(0)\setminus\tilde{U}} \tilde{p}(u) > f^*$ and $m = \min_{u\in \bar{B}_{R_u}(0)\setminus\tilde{U}} \psi(u) > 0$, where the latter holds because $\psi$ is a penalty function. Let $\sigma = (M - f^*)/m + 1$. We have $f(x,u) \ge f^* - \sigma\cdot\|u\|$ for all $u \in \bar{B}_{R_u}(0)\setminus\{0\}$ and $x \in \cup_u X(u)$. As a result, any optimal solution to the penalization (39) must satisfy $u = 0$.
Note that our problem (7) can be written in the form (39) by letting $u = (x_{a(n)} - z_n)_{n\in\mathcal{N}}$ and including the duplicate constraints $z_n - x_{a(n)} = 0$ for every $n \ne r \in \mathcal{N}$ among the equality constraints $h_j(x,u) = 0$; the other constraints $g_i(x,u) \le 0$ correspond to the functional constraints in the problem (1). Since the $\psi_n$ are sharp, the aggregate penalty function defined by $\psi(u) = \sum_{n\in\mathcal{N}} p_n\psi_n(x_{a(n)} - z_n)$ is also sharp. Let $\sigma$ denote the penalty factor in Proposition 12. Proposition 11 follows by letting $\sigma_n = \sigma/p_n$ for all $n \in \mathcal{N}$.
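The role of the sharpness threshold in this argument can be seen on a one-dimensional example. The sketch below uses the illustrative problem $\min_x x^2$ subject to $x = 1 + u$, so the perturbation function is $p(u) = (1+u)^2$ with $f^* = p(0) = 1$ and $|p'(0)| = 2$; a grid search confirms that the penalization $\min_u p(u) + \sigma|u|$ is exact for $\sigma > 2$ and inexact for $\sigma < 2$.

```python
import numpy as np

# Perturbation function p(u) = (1+u)^2 of min x^2 s.t. x = 1 + u;
# exact penalization requires sigma to exceed the multiplier |p'(0)| = 2.
u = np.linspace(-2, 2, 400_001)
p = (1 + u) ** 2

for sigma, should_be_exact in [(3.0, True), (1.0, False)]:
    u_star = u[np.argmin(p + sigma * np.abs(u))]   # minimizer of penalization
    # exactness <=> the penalized minimizer sits at u = 0
    assert (abs(u_star) < 1e-3) == should_be_exact
```

With $\sigma = 1$ the penalized minimizer drifts to $u \approx -0.5$, which is exactly the failure mode the sharpness and $\sigma$-threshold conditions rule out.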
References
1. Ahmed, S., Cabral, F.G., da Costa, B.F.P.: Stochastic Lipschitz dynamic programming. Math. Program. 191, 755–793 (2022)
2. Ahmed, S., King, A.J., Parija, G.: A Multi-stage Stochastic Integer Programming Approach for Capac-
ity Expansion under Uncertainty. Journal of Global Optimization 26, 3–24 (2003)
3. Baringo, L., Conejo, A.J.: Risk-Constrained Multi-Stage Wind Power Investment. IEEE Transac-
tions on Power Systems 28(1), 401–411 (2013) https://doi.org/10.1109/TPWRS.2012.2205411. http://
ieeexplore.ieee.org/document/6247489/
4. Basciftci, B., Ahmed, S., Gebraeel, N.: Adaptive Two-stage Stochastic Programming with an Appli-
cation to Capacity Expansion Planning. arXiv:1906.03513 [math] (2019)
5. Baucke, R., Downward, A., Zakeri, G.: A deterministic algorithm for solving multistage stochastic
programming problems. Optim. Online p. 25 (2017)
6. Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numerische
Mathematik 4, 238–252 (1962)
7. Birge, J.R.: Decomposition and Partitioning Methods for Multistage Stochastic Linear Programs. Oper.
Res. 33(5), 989–1007 (1985). https://doi.org/10.1287/opre.33.5.989
8. Bradley, S.P., Crane, D.B.: A Dynamic Model for Bond Portfolio Management. Manage. Sci. 19(2),
139–151 (1972). https://doi.org/10.1287/mnsc.19.2.139
9. Chen, Z.L., Li, S., Tirupati, D.: A scenario-based stochastic programming approach for technology
and capacity planning. Computers & Operations Research 29(7), 781–806 (2002) https://doi.org/10.
1016/S0305-0548(00)00076-9. http://linkinghub.elsevier.com/retrieve/pii/S0305054800000769
10. Chen, Z.L., Powell, W.B.: Convergent Cutting-Plane and Partial-Sampling Algorithm for Multistage
Stochastic Linear Programs with Recourse. J. Optim. Theory Appl. 102(3), 497–524 (1999). https://
doi.org/10.1023/A:1022641805263
11. Dantzig, G.B., Wolfe, P.: Decomposition Principle for Linear Programs. Oper. Res. 8(1), 101–111
(1960). https://doi.org/10.1287/opre.8.1.101
12. Escudero, L.F., Kamesam, P.V., King, A.J., Wets, R.J.B.: Production planning via scenario modelling.
Annals of Operations Research 43, 309–335 (1993)
13. Feizollahi, M.J., Ahmed, S., Sun, A.: Exact augmented lagrangian duality for mixed integer linear
programming. Math. Program. 161(1–2), 365–387 (2017)
14. Flach, B., Barroso, L., Pereira, M.: Long-term optimal allocation of hydro generation for a price-maker
company in a competitive market: latest developments and a stochastic dual dynamic programming
approach. IET Generation, Transmission & Distribution 4(2), 299 (2010). https://doi.org/10.1049/iet-
gtd.2009.0107
15. Girardeau, P., Leclere, V., Philpott, A.B.: On the Convergence of Decomposition Methods for Multi-
stage Stochastic Convex Programs. Math. Oper. Res. 40(1), 130–145 (2015). https://doi.org/10.1287/
moor.2014.0664
16. Glassey, C.R.: Nested Decomposition and Multi-Stage Linear Programs. Manage. Sci. 20(3), 282–292
(1973). https://doi.org/10.1287/mnsc.20.3.282
17. Guigues, V.: Convergence analysis of sampling-based decomposition methods for risk-averse multi-
stage stochastic convex programs. SIAM J. Optim. 26(4), 2468–2494 (2016)
18. Guigues, V.: Dual dynamic programing with cut selection: Convergence proof and numerical experi-
ments. Eur. J. Oper. Res. 258(1), 47–57 (2017)
19. Hjelmeland, M.N., Zou, J., Helseth, A., Ahmed, S.: Nonconvex Medium-Term Hydropower Scheduling
by Stochastic Dual Dynamic Integer Programming. IEEE Transactions on Sustainable Energy 10(1),
481–490 (2019) https://doi.org/10.1109/TSTE.2018.2805164. http://ieeexplore.ieee.org/document/
8289405/
20. Ho, J.K., Manne, A.S.: Nested decomposition for dynamic models. Math. Program. 6(1), 121–140
(1974). https://doi.org/10.1007/BF01580231
21. Kusy, M.I., Ziemba, W.T.: A Bank Asset and Liability Management Model. Oper. Res. 34(3), 356–376
(1986). https://doi.org/10.1287/opre.34.3.356
22. Lan, G.: Complexity of stochastic dual dynamic programming. Mathematical Programming pp. 1–38
(2020)
23. Lara, C.L., Mallapragada, D.S., Papageorgiou, D.J., Venkatesh, A., Grossmann, I.E.: Deterministic
electric power infrastructure planning: Mixed-integer programming model and nested decomposition
algorithm. European Journal of Operational Research 271(3), 1037–1054 (2018) https://doi.org/10.
1016/j.ejor.2018.05.039. http://linkinghub.elsevier.com/retrieve/pii/S0377221718304466
24. Linowsky, K., Philpott, A.B.: On the Convergence of Sampling-Based Decomposition Algorithms for
Multistage Stochastic Programs. J. Optim. Theory Appl. 125(2), 349–366 (2005). https://doi.org/10.
1007/s10957-004-1842-z
25. Louveaux, F.V.: A Solution Method for Multistage Stochastic Programs with Recourse with Application
to an Energy Investment Problem. Oper. Res. 28(4), 889–902 (1980). https://doi.org/10.1287/opre.28.
4.889
26. Mulvey, J.M., Vladimirou, H.: Stochastic Network Programming for Financial Planning Problems.
Manage. Sci. 38(11), 1642–1664 (1992). https://doi.org/10.1287/mnsc.38.11.1642
27. Nesterov, Y.: Lectures on Convex Optimization, vol. 137. Springer, Berlin (2018)
28. Pereira, M.V.F., Pinto, L.M.V.G.: Stochastic Optimization of a Multireservoir Hydroelectric Sys-
tem: A Decomposition Approach. Water Resour. Res. 21(6), 779–792 (1985). https://doi.org/10.1029/
WR021i006p00779
29. Pereira, M.V.F., Pinto, L.M.V.G.: Multi-stage stochastic optimization applied to energy planning. Math.
Program. 52(1–3), 359–375 (1991). https://doi.org/10.1007/BF01582895
30. Philpott, A., Guan, Z.: On the convergence of stochastic dual dynamic programming and related
methods. Operations Research Letters 36(4), 450–455 (2008) https://doi.org/10.1016/j.orl.2008.01.
013. http://linkinghub.elsevier.com/retrieve/pii/S0167637708000308
31. Philpott, A., Wahid, F., Bonnans, F.: MIDAS: A Mixed Integer Dynamic Approximation Scheme.
Mathematical Programming 181, 19–50 (2020)
32. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer Science & Business Media,
Berlin (2009)
33. Shapiro, A.: Analysis of stochastic dual dynamic programming method. European J. Oper. Res. 209(1),
63–72 (2011) https://doi.org/10.1016/j.ejor.2010.08.007. http://linkinghub.elsevier.com/retrieve/pii/
S0377221710005448
34. Shapiro, A., Tekaya, W., da Costa, J.P., Soares, M.P.: Risk neutral and risk averse Stochastic Dual
Dynamic Programming method. European J. Oper. Res. 224(2), 375–391 (2013) https://doi.org/10.
1016/j.ejor.2012.08.022. http://linkinghub.elsevier.com/retrieve/pii/S0377221712006455
35. Slyke, R.M.V., Wets, R.: L-Shaped Linear Programs with Applications to Optimal Control and Stochas-
tic Programming. SIAM J. Appl. Math. 17(4), 638–663 (1969). http://www.jstor.org/stable/2099310
36. Takriti, S., Krasenbrink, B., Wu, L.S.Y.: Incorporating Fuel Constraints and Electricity Spot Prices
into the Stochastic Unit Commitment Problem. Oper. Res. 48(2), 268–280 (2000). https://doi.org/10.
1287/opre.48.2.268.12379
37. Zou, J., Ahmed, S., Sun, X.A.: Partially Adaptive Stochastic Optimization for Electric Power Genera-
tion Expansion Planning. INFORMS J. Comput. 30(2), 388–401 (2018). https://doi.org/10.1287/ijoc.
2017.0782
38. Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Math. Program. (2018).
https://doi.org/10.1007/s10107-018-1249-5
39. Zou, J., Ahmed, S., Sun, X.A.: Multistage Stochastic Unit Commitment Using Stochastic Dual
Dynamic Integer Programming. IEEE Trans. Power Syst. 34(3), 1814–1823 (2019) https://doi.org/
10.1109/TPWRS.2018.2880996. http://ieeexplore.ieee.org/document/8532315/