ADMM For Combinatorial Graph Problems: Preprint
Abstract
We investigate a class of general combinatorial graph problems, including MAX-CUT and community
detection, reformulated as quadratic objectives over nonconvex constraints and solved via the alternating
direction method of multipliers (ADMM). We propose two reformulations: one using vector variables and
a binary constraint, and the other further reformulating the Burer-Monteiro form for simpler subproblems.
Despite the nonconvex constraint, we prove the ADMM iterates converge to a stationary point in both
formulations, under mild assumptions. Additionally, recent work suggests that in the latter form, when
the matrix factors are wide enough, a local optimum is with high probability also a global optimum. To
demonstrate the scalability of our algorithms, we include results for MAX-CUT, community detection,
and image segmentation on benchmark and simulated examples.
1 INTRODUCTION
A number of important problems arising in graph applications can be formulated as a binary quadratic
optimization problem, for different choices of C:

    minimize_x    x^T C x
    subject to    x ∈ {+1, −1}^n                                   (1.1)
Here, C is an n × n symmetric matrix, where n is the number of nodes in an undirected graph. For different
choices of C, problem (1.1) may correspond to the MAX-CUT problem in combinatorics, the 2-community
detection problem in machine learning, or many other combinatorial graph problems.
Exactly solving combinatorial problems of this nature is very difficult (in general NP-Hard), and even the
best branch-and-bound methods do not scale well beyond a few thousand nodes.
1.1 Notation
Define S^n as the space of n × n symmetric matrices. For an undirected graph with n nodes, we define A ∈ S^n
as the adjacency matrix of the graph, where A_ij > 0 is the strictly positive weight of the edge between nodes i
and j, and A_ij = 0 if there is no edge between i and j. Define 1 as the vector of all 1's.
1.2 Community detection
The stochastic block model [HLL83] is one of the simplest generative models for this application. Given a
graph with n nodes and parameters 0 < q < p < 1, the model partitions the nodes into two communities and
generates an edge between nodes in the same community with probability p and between nodes in different
communities with probability q. Following the analysis in [ABH16], we can define C = ((p+q)/2) 1 1^T − A, and
the solution to (1.1) gives a solution to the community detection problem with sharp recovery guarantees.
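The following NumPy sketch shows how such a two-community instance and the corresponding cost matrix C can be generated; the function name and defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sbm_cost_matrix(n, m, p, q, seed=0):
    """Sample a two-community stochastic block model and build C = ((p+q)/2) * 11^T - A.

    n: total nodes, m: size of the first community, p/q: within/between edge probabilities.
    (Illustrative helper; names and defaults are assumptions.)
    """
    rng = np.random.default_rng(seed)
    labels = np.array([1] * m + [-1] * (n - m))      # ground-truth community labels
    same = np.equal.outer(labels, labels)            # True if two nodes share a community
    prob = np.where(same, p, q)                      # edge probability for each pair
    A = (rng.random((n, n)) < prob).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                      # symmetric adjacency, no self-loops
    C = (p + q) / 2 * np.ones((n, n)) - A            # cost matrix from [ABH16]
    return C, A, labels
```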
1.3 MAX-CUT
While the community detection problem aims to find two densely connected clusters in a graph, the MAX-
CUT problem attempts to find two clusters with dense interconnections. For an undirected graph with n
nodes and (possibly weighted) edges between some pairs of nodes, a cut is defined as a partition of the n
nodes, and the cut value is the sum of the weights of the severed edges. The MAX-CUT problem is to find
the cut in a given graph that gives the maximum cut value. When C = (A − diag(A1))/4, the solution to
(1.1) is exactly the maximum cut.
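A matching sketch for the MAX-CUT cost matrix and the cut value it encodes (again, the helper names are illustrative assumptions):

```python
import numpy as np

def maxcut_cost_matrix(A):
    """Build C = (A - diag(A 1)) / 4 so that minimizing x^T C x over x in {-1,+1}^n
    maximizes the cut value of the (possibly weighted) adjacency matrix A."""
    n = A.shape[0]
    return (A - np.diag(A @ np.ones(n))) / 4.0

def cut_value(A, x):
    """Cut value of the partition encoded by x in {-1,+1}^n: total weight of severed edges."""
    return float(np.sum(A[np.outer(x, x) < 0]) / 2.0)   # each severed edge appears twice in A
```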
Many methods exist to find the MAX-CUT of a graph. Combinatorial approaches typically rely on branch-and-bound
schemes, using a linear relaxation of (1.1) as a bound [BGJR88, DSDJ+95], where
the binary constraint is replaced with 0 ≤ (x + 1)/2 ≤ 1. Historically, these “polyhedral methods” were
the main approach to find exact solutions of the MAX-CUT problem. Though this is an NP-Hard problem,
if the graph is sparse enough, branch-and-bound converges quickly even for very large graphs [DSDJ+ 95].
However, when the graph is not very sparse, the linear relaxation is loose, and finding efficient branching
mechanisms is challenging, causing the algorithm to run slowly.
The MAX-CUT problem can also be approximated by one pass of the linear relaxation, with an a priori
bound on the ratio f_relax/f_exact in terms of the number of edges [PT94]. A tighter approximation can be
found with the semidefinite relaxation (see next section), which is also used for better bounding in
branch-and-bound techniques [HR98, RRW07, BV08, BST11]. In particular, the rounding algorithm of [GW95]
returns a feasible x̂ given the optimal Z of the relaxation, and is shown to satisfy, in expectation,

    (x̂^T C x̂) / (x^T C x) ≥ 0.878,

where x solves (1.1). For this reason, semidefinite relaxations of problems of type (1.1) are heavily
studied (e.g. [PRW95, Hel00, FK97]).
1.4 Semidefinite relaxation

A tighter bound comes from the standard semidefinite relaxation (SDR) of (1.1), which replaces xx^T by a
matrix variable Z ⪰ 0 with diag(Z) = 1 and minimizes Tr(CZ). Factorizing Z = RR^T with R ∈ R^{n×r}
(the Burer-Monteiro approach) gives the nonconvex program

    minimize_{R ∈ R^{n×r}}    Tr(R^T C R)
    subject to                 diag(R R^T) = 1,                    (1.3)
which we will call the factorized SDP (FSDP). Note that FSDP is not convex; however, because the search
space is compact, if r is large enough such that r(r + 1)/2 > n, then the global optimum of the FSDP (1.3)
is the global optimum of the SDR [Pat98, Bar95]. Furthermore, recent work [BVB16] shows that almost
all local optima of the FSDP are also global optima, suggesting that any stationary point of the FSDP is a
reasonable approximation of (1.1).
To solve (1.3), [BM03] put forward an algorithm based on the augmented Lagrangian method to find local
optima of (1.3); unsurprisingly, given the later observation of Boumal et al. [BVB16], this method empirically
found global optima almost all of the time. However, solving (1.3) is still numerically burdensome; in the
augmented Lagrangian, the objective is quartic in R and is usually minimized with an iterative numerical
method, such as L-BFGS.
Another way of finding a low-rank solution to the SDR is to use the nuclear norm as a surrogate for
the rank penalty [CR09, RFP10, UHZ+ 16]. This method is popular because of its theoretical guarantees and
experimental success. A difficulty with nuclear norm minimization is in choosing the penalty parameter; in
general this is done through cross-validation. Also, when the rank function appears as a hard constraint in
the optimization problem, the corresponding bound for the surrogate model is ambiguous.
2 VECTOR FORM
We begin by considering a simple reformulation of (1.1) using only vector variables:

    minimize_{x, y ∈ R^n}    x^T C x
    subject to               y ∈ {−1, 1}^n                         (2.4)
                             x = y.
This is similar to the nonconvex formulations proposed in [BPC+ 11], Chapter 9.1.
To solve (2.4) via ADMM, we first form the augmented Lagrangian

    L(x, y; µ) = x^T C x + ⟨µ, x − y⟩ + (ρ/2)‖x − y‖_F^2,          (2.5)
where µ ∈ R^n is the corresponding dual variable and ρ > 0 is the weighting factor associated with the
penalty term. Under the ADMM framework, the problem in (2.4) can be solved by alternating between the two
primal variables x and y, and then updating the dual variable µ. The ADMM for (2.4) is summarized in
Algorithm 1. Note that, unlike typical ADMM, the penalty parameter ρ in Algorithm 1 increases with
the iteration index, which facilitates convergence.
Algorithm 1 ADMM for the vector formulation (2.4)
1: Input: C, ρ^0 > 0, α > 1, tolerance ε
2: Initialize x^0, µ^0
3: for k = 0, 1, 2, . . . do
4:     y^{k+1} = sign(x^k + µ^k/ρ^k)
5:     x^{k+1} = argmin_x L(x, y^{k+1}; µ^k), i.e., (2C + ρ^k I) x^{k+1} = ρ^k y^{k+1} − µ^k
6:     µ^{k+1} = µ^k + ρ^k (x^{k+1} − y^{k+1})                     (2.7)
7:     ρ^{k+1} = α ρ^k
8:     if ‖x^k − y^k‖ ≤ ε then
9:         break
10:    end if
11: end for
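The following NumPy sketch mirrors Algorithm 1; the update order (y, then x, then µ) and the closed-form x-step follow the description above, while the function name, initialization, and parameter values are illustrative assumptions.

```python
import numpy as np

def admm_vector(C, rho=1.0, alpha=1.01, max_iter=500, tol=1e-6, seed=0):
    """Sketch of the vector-form ADMM (Algorithm 1); parameter values are illustrative."""
    n = C.shape[0]
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    mu = np.zeros(n)
    for k in range(max_iter):
        # y-update: projection of x + mu/rho onto {-1, +1}^n
        y = np.sign(x + mu / rho)
        y[y == 0] = 1.0
        # x-update: minimize x^T C x + <mu, x - y> + (rho/2) ||x - y||^2
        x = np.linalg.solve(2 * C + rho * np.eye(n), rho * y - mu)
        # dual and penalty updates
        mu = mu + rho * (x - y)
        rho = alpha * rho
        if np.linalg.norm(x - y) <= tol:
            break
    xr = np.sign(x)
    xr[xr == 0] = 1.0
    return xr
```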
Definition 2.1 x^T C x is an L_1-smooth function; that is, for any x_1, x_2, ‖2Cx_1 − 2Cx_2‖ ≤ L_1 ‖x_1 − x_2‖.
Definition 2.2 L_H ∈ R_+ is the smallest number such that −L_H I ⪯ 2C, where 2C is the Hessian of the
objective function of (1.1). In other words, the Hessian is lower bounded.
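For the quadratic objective, both constants can be read directly off the spectrum of C (an immediate consequence of the definitions, stated here only for concreteness):

    L_1 = 2‖C‖_2    and    L_H = max{0, −2 λ_min(C)},

since ‖2Cx_1 − 2Cx_2‖ ≤ 2‖C‖_2 ‖x_1 − x_2‖ and −L_H I ⪯ 2C holds exactly when L_H ≥ −2λ_min(C).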
Corollary 2.5 With the prerequisites in Theorem 2.4 satisfied and (y^*, x^*; µ^*) as the accumulation point,
we have that ∇_x f(x^*) + µ^* = 0; in particular, if µ^* = 0, then x^* is the global minimum of (1.1).

3 MATRIX FORM

We now turn to a matrix (Burer-Monteiro) reformulation of (1.3):

    minimize_{X, Y, Z}    Tr(CZ)
    subject to            diag(Z) = 1                              (3.9)
                          Z = X Y^T
                          X = Y.
Here the matrix variables are Z ∈ Sn , and X, Y ∈ Rn×r . The extra variable Y and the constraint X = Y
are introduced to transform the quadratic constraint in (1.3) to bilinear, which will simplify the subproblems
significantly. Two cases of r are of interest. When r = 1, (3.9) is equivalent to (1.1), with X = x. When
r = ⌈√(2n)⌉, then r(r+1)/2 > n and the global optimum of (3.9) is with high probability that of the SDR,
based on the results of [BVB16].
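As a concrete check of this choice of r: for n = 2500 (one of the sizes used in Section 4), r = ⌈√5000⌉ = 71 and r(r+1)/2 = 2556 > 2500, so the factor X has only 71 columns rather than n = 2500.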
To solve (3.9) via ADMM, we build the augmented Lagrangian function

    L(Z, X, Y; Λ_1, Λ_2) = Tr(CZ) + ⟨Λ_1, Z − XY^T⟩ + (ρ/2)‖Z − XY^T‖_F^2
                           + ⟨Λ_2, X − Y⟩ + (ρ/2)‖X − Y‖_F^2,      (3.10)
where Λ_1 ∈ R^{n×n} and Λ_2 ∈ R^{n×r} are the dual variables corresponding to the two coupling constraints,
respectively. Similarly to the vector case, problem (3.9) is solved by alternating minimization between
two primal variable sets, Y and (Z, X), and then updating the dual variables Λ_1 and Λ_2. This sequence is
summarized in Algorithm 2 below.
Algorithm 2 ADMM for the matrix formulation (3.9)
1: Input: C, r, ρ^0 > 0, α > 1, tolerance ε
2: Initialize Z^0, X^0, Y^0, Λ_1^0, Λ_2^0
3: for k = 0, 1, 2, . . . do
4:     Y^{k+1} = argmin_Y ‖Z^k − X^k Y^T + Λ_1^k/ρ^k‖_F^2 + ‖X^k − Y + Λ_2^k/ρ^k‖_F^2            (3.11)
5:     (Z^{k+1}, X^{k+1}) = argmin_{Z, X : diag(Z) = 1} L(Z, X, Y^{k+1}; Λ_1^k, Λ_2^k)           (3.12)
6:     Λ_1^{k+1} = Λ_1^k + ρ^k (Z^{k+1} − X^{k+1}(Y^{k+1})^T),   Λ_2^{k+1} = Λ_2^k + ρ^k (X^{k+1} − Y^{k+1})
7:     ρ^{k+1} = α ρ^k                                                                           (3.13)
8:     if max{‖X^k − Y^k‖, ‖Z^k − X^k (Y^k)^T‖} ≤ ε then
9:         break
10:    end if
11: end for
The first-order optimality conditions of the (Z, X) subproblem (3.12) are

    C − A^*(ν) + Λ_1 + ρ(Z − XY^T) = 0
    Λ_2 − Λ_1 Y + ρ(XY^T − Z)Y + ρ(X − Y) = 0
    A(Z) = 1,

where A(Z) = diag(Z), A^* is its adjoint, and ν ∈ R^n is the dual variable for the local constraint A(Z) = 1.
Given Y, solving for ν reduces to solving the diagonal linear system

    G ν = A((C + Λ_1)(I + Y Y^T)) + ρ(1 − A(D Y^T)),

where

    D = (1/ρ)(Λ_1 Y − Λ_2) + Y,     G = A^*(A(I + Y Y^T)).
Since G is a diagonal positive definite matrix, computing G^{-1} is simple. Then

    X = BY + D,    and    Z = XY^T + B,

where

    B = −(1/ρ)(C − A^*(ν) + Λ_1).
Note that the complexity of all inversions is O(r^3), and they are particularly simple if r = 1. In that case,
the complexity is dominated by the matrix multiplications.
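Putting the pieces together, a minimal NumPy sketch of Algorithm 2 could look as follows. The Y-step and the ν-system are written out as derived above, and all names, initializations, and parameter values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def admm_matrix(C, r, rho=1.0, alpha=1.01, max_iter=300, tol=1e-6, seed=0):
    """Sketch of the matrix-form ADMM (Algorithm 2) with closed-form subproblem updates."""
    n = C.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, r))
    Y = X.copy()
    Z = X @ Y.T
    L1 = np.zeros((n, n))
    L2 = np.zeros((n, r))
    for k in range(max_iter):
        # Y-update (3.11): an r x r linear system, Y (X^T X + I) = M1^T X + M2
        M1 = Z + L1 / rho
        M2 = X + L2 / rho
        Y = np.linalg.solve(X.T @ X + np.eye(r), (M1.T @ X + M2).T).T
        # (Z, X)-update (3.12): closed form via the dual variable nu of diag(Z) = 1
        D = (L1 @ Y - L2) / rho + Y
        g = 1.0 + np.sum(Y * Y, axis=1)                       # diag(I + Y Y^T), i.e. G
        rhs = np.diag((C + L1) @ (np.eye(n) + Y @ Y.T)) + rho * (1.0 - np.diag(D @ Y.T))
        nu = rhs / g
        B = -(C - np.diag(nu) + L1) / rho
        X = B @ Y + D
        Z = X @ Y.T + B
        # dual and penalty updates (3.13)
        L1 = L1 + rho * (Z - X @ Y.T)
        L2 = L2 + rho * (X - Y)
        rho = alpha * rho
        if max(np.linalg.norm(X - Y), np.linalg.norm(Z - X @ Y.T)) <= tol:
            break
    return X
```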
Theorem 3.2 If Assumption 3.1 holds, and given a sequence {ρ^k} such that

    ρ^k > 0,    Σ_k ρ^{k+1}/(ρ^k)^2 < ∞,    and    Σ_k 1/ρ^k < ∞,

then the iterates of Algorithm 2 converge to a KKT stationary point of (3.9).
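For example, the geometric schedule ρ^k = α^k ρ^0 produced by (3.13), with ρ^0 > 0 and α > 1, satisfies all three conditions, since

    Σ_k ρ^{k+1}/(ρ^k)^2 = (α/ρ^0) Σ_k α^{−k} < ∞    and    Σ_k 1/ρ^k = (1/ρ^0) Σ_k α^{−k} < ∞.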
Unfortunately, the extension from KKT stationary points to global minima is not yet known when r(r+1)/2 < n
(in particular, when r = 1). However, our empirical results suggest that even when r = 1, a local solution to
(3.9) often well-approximates the global solution to (1.1).
4 NUMERICAL RESULTS
From the convergence proofs, we have shown that our algorithms converge to a KKT stationary point and, if
r = ⌈√(2n)⌉, with high probability achieve the global solution of the SDR. In this section, we give empirical
evidence that these points are, in addition, good solutions to (1.1), our original combinatorial problem. We show
this through three applications: community detection on the stochastic block model, MAX-CUT, and image
segmentation.
We evaluate four methods for solving (1.1):
1. SD: the solution to the SDR rounded to a binary vector using a Goemans-Williamson style rounding
technique [GW95];
2. V: the vector formulation (2.4) solved with Algorithm 1, taking x_r = sign(x);
3. MR1: the matrix formulation (3.9) with r = 1, solved with Algorithm 2;
4. MRR: the matrix formulation (3.9) with r = ⌈√(2n)⌉, solved with Algorithm 2 and rounded to a binary vector.
Figure 1: Sample evolution for n = 2500, showing the objective value Tr(CX), best objective value using
xr = sign(x), and recovery rate using xr .
Rounding For both the SDP and MRR methods, we need to round the solutions to a binary vector. For the SDP
method, we first compute an eigenvalue decomposition X = QΛQ^T and form a factor F = QΛ^{1/2}, where the diagonal
elements of Λ are in decreasing order of magnitude. Then we scan k = 1, . . . , n and compute x_{k,t} = sign(F_k z_t) for
trials t = 1, . . . , 10. Here, F_k contains the first k columns of F, and each element of z_t is drawn i.i.d. from a
standard Gaussian distribution. We keep the x_r = x_{k,t} that minimizes x_r^T C x_r. For the MRR method, we repeat
the procedure using a factor F = UΣ^{1/2}, where X = UΣV^T is the SVD of X. For MR1 and V, we simply
take x_r = sign(x) as the binary solution.
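A NumPy sketch of this rounding procedure is below; it assumes a factor F whose columns are already ordered as described, and the helper name is an illustrative assumption.

```python
import numpy as np

def round_factor(C, F, trials=10, seed=0):
    """Goemans-Williamson style rounding as described above: scan prefixes F_k of the
    factor F, draw random Gaussian directions z_t, and keep the sign vector with the
    smallest objective x^T C x."""
    rng = np.random.default_rng(seed)
    n_cols = F.shape[1]
    best_x, best_val = None, np.inf
    for k in range(1, n_cols + 1):
        Fk = F[:, :k]                       # first k columns of the factor
        for _ in range(trials):
            z = rng.standard_normal(k)
            x = np.sign(Fk @ z)
            x[x == 0] = 1.0
            val = x @ C @ x
            if val < best_val:
                best_x, best_val = x, val
    return best_x
```

For SD one would pass F = QΛ^{1/2} from the eigendecomposition of X, and for MRR F = UΣ^{1/2} from the SVD, as described above.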
Computer information The following simulations are performed on a Dell Poweredge R715 with an
AMD Opteron(tm) processor, with 64 GB of RAM and 24 cores. It is running Ubuntu 16.04 (Xenial) with
Matlab 2017a.
Figure 2: Top: CPU per-iteration runtime for four methods. Bottom: CPU runtime for overhead operations
(rounding for SDP and MRR and initial matrix inversion or SVD for V). MR1 gives a vector output and
has no additional overhead.
n m p S V MR1 MRR
1000 100 0.1 1 1 1 1
1000 500 0.1 1 1 1 1
2500 250 0.04 1 1 1 1
2500 1250 0.04 1 1 1 1.00
5000 500 0.02 1 1 1 1.00
5000 2500 0.02 1 1 1 1.00
10000 1000 0.01 1 1 1 1
10000 5000 0.01 1 1 1 1.00
Table 1: Correct community recovery rate for the stochastic block model. Results after 10 iterations for the
S, MR1, and MRR methods and 50 iterations for V, averaged over 10 trials for each method. n = number
of nodes in the graph. m = number of nodes in the smaller of the two communities. p = probability of an edge
within a community, and q = p/10 the probability of an edge between communities. 1 = perfect recovery, 1.00 =
rounds to 1.
g3-15 g3-8 pm3-15-50 pm3-8-50
n 3375 512 3375 512
# edges 20250 3072 20250 3072
sparsity 0.18% 1.2% 0.18% 1.2%
BK 281029888 41684814 2964 454
R 12838418 4816306 170 62
MR1 255681256 36780180 1990 312
V 202052290 8213762 2016 330
Table 2: MAX-CUT values for graphs from the 7th DIMACS Challenge. BK = best known. R = best of
1000 random guesses. MR1 = matrix formulation, r = 1. V = vector formulation.
4.2 MAX-CUT
Table 2 gives the best MAX-CUT values using best-of-random-guesses and our approaches over four examples
from the 7th DIMACS Implementation Challenge in 2002.1 Our solutions are very close to the best known
solutions.
5 CONCLUSION
We present two methods for solving quadratic combinatorial problems using ADMM on two reformulations.
Though the problem has a nonconvex constraint, we give convergence results to KKT solutions under much
milder conditions than previously shown. From this, we give empirical solutions to several graph-based
combinatorial problems, specifically MAX-CUT and community detection; both can be used in additional
downstream applications, like image segmentation.
References
[ABH16] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block
model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
[Bar95] Alexander I. Barvinok. Problems of distance geometry and convex properties of quadratic maps.
Discrete & Computational Geometry, 13(2):189–202, 1995.
1 See http://dimacs.rutgers.edu/Workshops/7thchallenge/. Problems downloaded from
http://www.optsicom.es/maxcut/
Figure 3: Image segmentation. The center (right) columns show the best MAX-CUT (community detection)
results.
[BGJR88] Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of
combinatorial optimization to statistical physics and circuit layout design. Operations Research,
36(3):493–513, 1988.
[BM03] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidef-
inite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
[BM05] Samuel Burer and Renato DC Monteiro. Local minima and convergence in low-rank semidefinite
programming. Mathematical Programming, 103(3):427–444, 2005.
[BPC+ 11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed opti-
mization and statistical learning via the alternating direction method of multipliers. Foundations
and Trends in Machine Learning, 3(1):1–122, 2011.
[BST11] Xiaowei Bao, Nikolaos V Sahinidis, and Mohit Tawarmalani. Semidefinite relaxations for
quadratically constrained quadratic programming: A review and comparisons. Mathematical
programming, 129(1):129–157, 2011.
[BV08] Samuel Burer and Dieter Vandenbussche. A finite branch-and-bound algorithm for nonconvex
quadratic programming via semidefinite relaxations. Mathematical Programming, 113(2):259–
282, 2008.
[BVB16] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro ap-
proach works on smooth semidefinite programs. In Advances in Neural Information Processing
Systems, pages 2757–2765, 2016.
[CR09] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization.
Foundations of Computational mathematics, 9(6):717, 2009.
[DSDJ+ 95] Caterina De Simone, Martin Diehl, Michael Jünger, Petra Mutzel, Gerhard Reinelt, and Gio-
vanni Rinaldi. Exact ground states of Ising spin glasses: New experimental results with a
branch-and-cut algorithm. Journal of Statistical Physics, 80(1-2):487–496, 1995.
[EB92] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas Rachford splitting method and
the proximal point algorithm for maximal monotone operators. Mathematical Programming,
55(1):293–318, 1992.
[FH16] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics
Reports, 659:1–44, 2016.
[FK97] Tetsuya Fujie and Masakazu Kojima. Semidefinite programming relaxation for nonconvex
quadratic programs. Journal of Global Optimization, 10(4):367–380, 1997.
[GM75] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue
française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–
76, 1975.
[GM76] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear varia-
tional problems via finite element approximation. Computers & Mathematics with Applications,
2(1):17–40, 1976.
[GW95] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum
cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM),
42(6):1115–1145, 1995.
[Hel00] Christoph Helmberg. Semidefinite programming for combinatorial optimization. Konrad-Zuse-
Zentrum für Informationstechnik Berlin, 2000.
[HLL83] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels:
First steps. Social networks, 5(2):109–137, 1983.
[HLR16] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direc-
tion method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization,
26(1):337–364, 2016.
[HR98] Christoph Helmberg and Franz Rendl. Solving quadratic (0, 1)-problems by semidefinite pro-
grams and cutting planes. Mathematical programming, 82(3):291–315, 1998.
[HS16] Kejun Huang and Nicholas D Sidiropoulos. Consensus-ADMM for general quadratically con-
strained quadratic programming. IEEE Transactions on Signal Processing, 64(20):5297–5310,
2016.
[JMZ14] Bo Jiang, Shiqian Ma, and Shuzhong Zhang. Alternating direction method of multipliers for
real and complex polynomial optimization models. Optimization, 63(6):883–898, 2014.
[LHW17] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A nonconvex splitting method for symmetric
nonnegative matrix factorization: Convergence analysis and optimality. IEEE Transactions on
Signal Processing, 2017.
[LM79] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear
operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
[LP15] Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite
optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
[MWRF16] Sindri Magnússon, Pradeep Chathuranga Weeraddana, Michael G Rabbat, and Carlo Fischione.
On the convergence of alternating direction Lagrangian methods for nonconvex structured opti-
mization problems. IEEE Transactions on Control of Network Systems, 3(3):296–309, 2016.
[Pat98] Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity
of optimal eigenvalues. Mathematics of operations research, 23(2):339–358, 1998.
[PRW95] Svatopluk Poljak, Franz Rendl, and Henry Wolkowicz. A recipe for semidefinite relaxation for
(0, 1)-quadratic programming. Journal of Global Optimization, 7(1):51–73, 1995.
[PT94] Svatopluk Poljak and Zsolt Tuza. The expected relative error of the polyhedral approximation
of the MAX-CUT problem. Operations Research Letters, 16(4):191–198, 1994.
[RFP10] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of
linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
[RRW07] Franz Rendl, Giovanni Rinaldi, and Angelika Wiegele. A branch and bound algorithm for MAX-
CUT based on combining semidefinite and polyhedral relaxations. In IPCO, volume 4513, pages
295–309. Springer, 2007.
[SWZ14] Yuan Shen, Zaiwen Wen, and Yin Zhang. Augmented Lagrangian alternating direction method
for matrix separation based on low-rank factorization. Optimization Methods and Software,
29(2):239–263, 2014.
[UHZ+ 16] Madeleine Udell, Corinne Horn, Reza Zadeh, Stephen Boyd, et al. Generalized low rank models.
Foundations and Trends in Machine Learning, 9(1):1–118, 2016.
[WYZ15] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of ADMM in nonconvex nonsmooth
optimization. arXiv preprint arXiv:1511.06324, 2015.
[XYWZ12] Yangyang Xu, Wotao Yin, Zaiwen Wen, and Yin Zhang. An alternating direction algorithm for
matrix completion with nonnegative factors. Frontiers of Mathematics in China, 7(2):365–384,
2012.
A Convergence analysis for vector form

Lemma A.1 If Assumption 2.3 holds, for two adjacent iterations of Algorithm 1 we have

    ‖µ^{k+1} − µ^k‖ ≤ L_1 ‖x^{k+1} − x^k‖.

Proof: The optimality condition of the x-update in Algorithm 1 is

    ∇_x f(x^{k+1}) + µ^k + ρ^k (x^{k+1} − y^{k+1}) = 0,            (1.16)

which, combined with the µ-update (2.7), gives µ^{k+1} = −∇_x f(x^{k+1}). As (1.16) holds for each iteration k,
combining it with the definition of L_1 proves the lemma.
Remark: We update x after y so that we can apply the optimality condition (1.16).
Next we will show that the augmented Lagrangian (2.5) is monotonically decreasing and lower bounded.
Lemma A.2 The augmented Lagrangian (2.5) is monotonically decreasing; that is,

    L(y^{k+1}, x^{k+1}; µ^{k+1}; ρ^{k+1}) − L(y^k, x^k; µ^k; ρ^k) ≤ −c_1^k ‖x^{k+1} − x^k‖_F^2,     (1.17)

where c_1^k = (ρ^k − L_H)/2 − (α + 1)L_1^2/(2ρ^k) > 0.
Proof: For notation ease, we denote f(x) = x^T C x. From the L_1-Lipschitz continuity of ∇_x f(x), it follows
that

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L_1/2)‖x − y‖_F^2              (1.23)

for any x and y. By definition,

    L(y^k, x^k; µ^k; ρ^k) = f(x^k) + ⟨µ^k, x^k − y^k⟩ + (ρ^k/2)‖x^k − y^k‖_F^2
                          = f(x^k) − ⟨∇f(x^k), x^k − y^k⟩ + (ρ^k/2)‖x^k − y^k‖_F^2                (1.24)
                          ≥ f(y^k) + ((ρ^k − L_1)/2)‖x^k − y^k‖_F^2,                              (1.25)
where (1.24) follows from (1.16) and (1.25) follows from (1.23). From the boundedness of the objective
(Assumption 2.3) and the assumption ρk > L1 , it follows that L(y k , xk ; µk ; ρk ) is bounded below for all k.
Since the sequence {L(y^k, x^k; µ^k; ρ^k)} is monotonically decreasing and lower bounded, it converges.
Combining this with Lemma A.1, we have

    ‖x^{k+1} − x^k‖ → 0,                                           (1.26)

which indicates that the sequence {x^k} converges to a limit point x^*. Moreover, as {x^k} converges, the
sequence {µ^k} also converges to µ^* based on (1.14). From (2.7), µ^{k+1} − µ^k = ρ^k (x^{k+1} − y^{k+1}), and so
x^{k+1} − y^{k+1} → 0; thus {y^k} converges to a limit point y^* and x^* = y^*. Additionally, note that x^* = y^*
implies ∇_x L(y^*, x^*; µ^*; ρ) = ∇_y L(y^*, x^*; µ^*; ρ) = 0, and x^* = y^* ∈ {−1, 1}^n is feasible. Thus Theorem 2.4 is
proven.
We now prove Corollary 2.5. At the accumulation point, stationarity gives

    ∇_x f(x^*) + µ^* = 0,                                          (1.28)

or equivalently,

    x^* = argmin_x f(x) + x^T µ^*.

If, additionally, µ^* = 0, then x^* is also the minimizer of the unconstrained problem; that is, x^* = argmin_x f(x),
suggesting that the combinatorial constraints x ∈ {−1, 1}^n are not necessary, and the convex relaxation is
exact; therefore x^* is the global minimum of (1.1).
B Convergence analysis for matrix form
To simplify notation, we first collect the primal and dual variables P k = (Z, X, Y )k and Dk = (Λ1 , Λ2 )k .
By completing the square, the augmented Lagrangian (3.10) can be written as

    L(P^k; D^k; ρ^k) = Tr(C Z^k) + (ρ^k/2)‖Z^k − X^k (Y^k)^T + Λ_1^k/ρ^k‖_F^2
                       + (ρ^k/2)‖X^k − Y^k + Λ_2^k/ρ^k‖_F^2 − (1/(2ρ^k))(‖Λ_1^k‖_F^2 + ‖Λ_2^k‖_F^2) > −∞,     (2.29)

where the inequality follows from the boundedness of {Tr(C Z^k)} and {Λ_1^k, Λ_2^k} in Assumption 3.1.
Lemma B.2

    ∇²L_Y ⪰ ρ^k I    and    ∇²L_(X,Z) ⪰ ρ^k (1 − (√(λ_N^2 + 4λ_N) − λ_N)/2) I,

where λ_N = ‖Y^{k+1}‖_2^2.

Proof: For Y, we have

    ∇²L_Y = ρ^k (M + I) ⪰ ρ^k I,

where

    M = blkdiag((X^k)^T X^k, (X^k)^T X^k, . . .) ⪰ 0.
For (X, Z), we have

    ∇²_(X,Z) L^k = ρ^k [[ I + N N^T,  −N ],
                        [ −N^T,        I ]],

where

    N = blkdiag((Y^{k+1})^T, . . . , (Y^{k+1})^T) ∈ R^{nr×n}.

Note that for such block diagonal matrices, ‖N‖_2 = ‖Y^{k+1}‖_2. Since I ≻ 0 and its Schur complement
(I + N N^T) − N N^T ≻ 0, then ∇²_(X,Z) L̃^k ≻ 0 and, equivalently, λ_min(∇²_(X,Z) L^k) > 0.
To find the smallest eigenvalue λ_min(∇²_(X,Z) L^k), it suffices to find the largest σ > 0 such that

    H_2 = (ρ^k)^{-1} ∇²_(X,Z) L̃^k − σI = [[ (1 − σ)I + N N^T,  −N ],
                                            [ −N^T,             (1 − σ)I ]] ⪰ 0.                  (2.30)

Using the same Schur complement trick, we want to find the largest σ > 0 where (1 − σ)I ≻ 0 and

    H_3 = (1 − σ)I + N N^T (1 − (1 − σ)^{-1}) ⪰ 0.

Since this implies 1 − σ > 0, then together with σ > 0, it must be that 1 − (1 − σ)^{-1} < 0.
Therefore, defining λ_N = ‖Y^{k+1}‖_2^2, the smallest eigenvalue of H_3 is

    λ_min(H_3) = (1 − σ) + λ_N (1 − (1 − σ)^{-1}).

We can see that (1 − σ)λ_min(H_3) = (1 − σ)^2 + λ_N(1 − σ) − λ_N is a convex function in (1 − σ), with two zeros at

    1 − σ = (±√(λ_N^2 + 4λ_N) − λ_N)/2.

In between the two roots, λ_min(H_3) < 0. Since the smaller root cannot satisfy 1 − σ > 0, we choose

    σ_max = 1 − (√(λ_N^2 + 4λ_N) − λ_N)/2 > 0

as the largest feasible σ that maintains λ_min(H_3) ≥ 0. As a result,

    λ_min(∇²_(X,Z) L̃^k) ≥ ρ^k σ_max = ρ^k (1 − (√(λ_N^2 + 4λ_N) − λ_N)/2),

which proves the lemma.
We now prove Thm. 3.2.
    L(P^{k+1}; D^k; ρ^k) − L(Z^k, X^k, Y^{k+1}; D^k; ρ^k)
        ≤ ⟨∇_Z L(P^{k+1}; D^k; ρ^k), Z^{k+1} − Z^k⟩ + ⟨∇_X L(P^{k+1}; D^k; ρ^k), X^{k+1} − X^k⟩
          − (λ_min(∇²_(X,Z) L̃^k)/2) (‖Z^{k+1} − Z^k‖_F^2 + ‖X^{k+1} − X^k‖_F^2)
        ≤ −(λ_min(∇²_(X,Z) L̃^k)/2) (‖Z^{k+1} − Z^k‖_F^2 + ‖X^{k+1} − X^k‖_F^2),                   (2.32)

where the first inequality follows from the strong convexity of L(Z, X, Y^{k+1}; Λ_1^k, Λ_2^k) with respect to (X, Z),
with ∇²_(X,Z) L̃^k = ∇²_vec(X,Z) L(P^{k+1}; D^k; ρ^k), and the second inequality follows from the optimality condition
of (3.12). Note that λ_min(∇²_(X,Z) L̃^k) > 0 given that L(Z, X, Y^{k+1}; Λ_1^k, Λ_2^k) is strongly convex with respect
to (X, Z).
In addition, for the update of the dual variables and the penalty coefficient, we have

    L(P^{k+1}; D^{k+1}; ρ^{k+1}) − L(P^{k+1}; D^k; ρ^k)
        = ⟨Λ_1^{k+1} − Λ_1^k, Z^{k+1} − X^{k+1}(Y^{k+1})^T⟩ + ⟨Λ_2^{k+1} − Λ_2^k, X^{k+1} − Y^{k+1}⟩
          + ((ρ^{k+1} − ρ^k)/2) ‖Z^{k+1} − X^{k+1}(Y^{k+1})^T‖_F^2
          + ((ρ^{k+1} − ρ^k)/2) ‖X^{k+1} − Y^{k+1}‖_F^2
        = ((ρ^{k+1} + ρ^k)/(2(ρ^k)^2)) (‖Λ_1^{k+1} − Λ_1^k‖_F^2 + ‖Λ_2^{k+1} − Λ_2^k‖_F^2),         (2.33)

where the first equality follows from the definition of L and the second follows from (3.13). Moreover, it can be
easily verified that

    ‖Λ_i^{k+1} − Λ_i^k‖_F^2 ≤ 2(‖Λ_i^{k+1}‖_F^2 + ‖Λ_i^k‖_F^2) ≤ 4B,   i = 1, 2,
where from Assumption 3.1 we have that {Λ_1^k, Λ_2^k} is bounded; that is, there exists B < ∞ such that
B ≥ max{‖Λ_1^k‖_F^2, ‖Λ_2^k‖_F^2} for all k. Since additionally Σ_k (ρ^{k+1} + ρ^k)/(2(ρ^k)^2) ≤ Σ_k ρ^{k+1}/(ρ^k)^2 < +∞, then

    0 ≤ Σ_{k=0}^{K−1} ((ρ^{k+1} + ρ^k)/(2(ρ^k)^2)) (‖Λ_1^{k+1} − Λ_1^k‖_F^2 + ‖Λ_2^{k+1} − Λ_2^k‖_F^2) < +∞

for all K.
Since additionally Σ_k ρ^k = +∞, this immediately yields Z^{k+1} − Z^k → 0, X^{k+1} − X^k → 0, and
Y^{k+1} − Y^k → 0. Moreover,

    Σ_{k=1}^{∞} 1/ρ^k < ∞   ⟹   1/ρ^k → 0   ⟺   ρ^k → ∞.