ADMM for combinatorial graph problems
Chuangchuang Sun∗, Yifan Sun∗, Ran Dai
May 29, 2018
arXiv:1805.10678v1 [math.OC] 27 May 2018

Abstract
We investigate a class of general combinatorial graph problems, including MAX-CUT and community
detection, reformulated as quadratic objectives over nonconvex constraints and solved via the alternating
direction method of multipliers (ADMM). We propose two reformulations: one using vector variables and
a binary constraint, and the other further reformulating the Burer-Monteiro form for simpler subproblems.
Despite the nonconvex constraint, we prove the ADMM iterates converge to a stationary point in both
formulations, under mild assumptions. Additionally, recent work suggests that in this latter form, when the
matrix factors are wide enough, a local optimum is, with high probability, also the global optimum. To
demonstrate the scalability of our algorithms, we include results on benchmark and simulated examples for
MAX-CUT, community detection, and image segmentation.

1 INTRODUCTION
A number of important problems arising in graph applications can be formulated as a quadratic binary opti-
mization problem, for different choices of C:

\[
\begin{array}{ll}
\underset{x}{\text{minimize}} & x^T C x \\
\text{subject to} & x \in \{+1, -1\}^n
\end{array}
\tag{1.1}
\]

Here, C is an n × n symmetric matrix, where n is the number of nodes in an undirected graph. For different
choices of C, problem (1.1) may correspond to the MAX-CUT problem in combinatorics, the 2-community
detection problem in machine learning, or many other combinatorial graph problems.
Exactly solving combinatorial problems of this nature is in general NP-hard, and even the best branch-and-bound
methods do not scale well beyond a few thousand nodes.

1.1 Notation
Define $\mathbb{S}^n$ as the space of $n \times n$ symmetric matrices. For an undirected graph with $n$ nodes, we define $A \in \mathbb{S}^n$
as the adjacency matrix of the graph, where $A_{ij} > 0$ is the strictly positive weight of the edge between nodes $i$
and $j$, and $A_{ij} = 0$ if there is no edge between $i$ and $j$. Define $\mathbf{1}$ as the vector of all ones.

1.2 Community detection


The problem of community detection is to identify node clusters in undirected graphs that are more likely to
be connected with each other than with nodes outside the cluster. This prediction is useful in many graphical
settings, such as interpreting online communities through social network or linking behavior, interpreting
molecular or neuronal structure in biology or chemistry, finding disease sources in epidemiology, and many
more. There are many varieties and methodologies in this field, and it would be impossible to list them all,
though a very comprehensive overview is given in [FH16].
∗ This work was done at Technicolor Research in the summer of 2017.

The stochastic block model [HLL83] is one of the simplest generative models for this application. Given a
graph with n nodes and parameters 0 < q < p < 1, the model partitions the nodes into two communities, and
generates an edge between nodes in the same community with probability p and between nodes in two different
communities with probability q. Following the analysis in [ABH16], we can define $C = \frac{p+q}{2}\mathbf{1}\mathbf{1}^T - A$, and the solution to
(1.1) gives a solution to the community detection problem with sharp recovery guarantees.
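As a concrete illustration (our own sketch, not part of the paper; the helper name and parameter values are ours), the following Python snippet samples a two-community stochastic block model and builds the cost matrix $C = \frac{p+q}{2}\mathbf{1}\mathbf{1}^T - A$:

```python
import numpy as np

def sbm_two_communities(n, p, q, seed=0):
    """Illustrative sketch (not from the paper): sample a 2-community SBM.

    Nodes 0..n/2-1 form community +1, the rest community -1.  Edges appear
    with probability p inside a community and q across communities.
    """
    rng = np.random.default_rng(seed)
    labels = np.ones(n)
    labels[n // 2:] = -1.0
    same = np.outer(labels, labels) > 0            # True if same community
    probs = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)            # symmetric, zero diagonal
    return A, labels

# Cost matrix for community detection: C = (p+q)/2 * 11^T - A
n, p, q = 200, 0.10, 0.01
A, labels = sbm_two_communities(n, p, q)
C = (p + q) / 2 * np.ones((n, n)) - A
print(labels @ C @ labels)   # planted partition should give a small objective
```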

1.3 MAX-CUT
While the community detection problem aims to find two densely connected clusters in a graph, the MAX-
CUT problem attempts to find two clusters with dense interconnections. For an undirected graph with n
nodes and (possibly weighted) edges between some pairs of nodes, a cut is defined as a partition of the n
nodes, and the cut value is the sum of the weights of the severed edges. The MAX-CUT problem is to find
the cut in a given graph that gives the maximum cut value. When $C = (A - \mathrm{Diag}(A\mathbf{1}))/4$, the solution to
(1.1) is exactly a maximum cut, and the cut value of a partition $x$ is $-x^T C x$.
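As a quick numerical check of this identity (our own sketch, not from the paper), the brute-force cut value of a random partition matches $-x^T C x$, so minimizing (1.1) maximizes the cut:

```python
import numpy as np

# Illustrative check (ours): cut(x) == -x^T C x for C = (A - Diag(A1))/4.
rng = np.random.default_rng(1)
n = 8
W = np.triu(rng.random((n, n)) * (rng.random((n, n)) < 0.5), k=1)
A = W + W.T                                    # weighted, symmetric adjacency
C = (A - np.diag(A.sum(axis=1))) / 4           # MAX-CUT cost matrix

x = rng.choice([-1.0, 1.0], size=n)            # a random partition
cut = 0.25 * np.sum(A * (1 - np.outer(x, x)))  # sum of severed edge weights
assert np.isclose(cut, -x @ C @ x)
```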
Many methods exist to find the MAX-CUT of a graph. In general, combinatorial methods can be solved
using branch-and-bound schemes, using a linear relaxation of (1.1) as a bound [BGJR88, DSDJ+ 95], where
the binary constraint is replaced with 0 ≤ (x + 1)/2 ≤ 1. Historically, these “polyhedral methods” were
the main approach to find exact solutions of the MAX-CUT problem. Though this is an NP-Hard problem,
if the graph is sparse enough, branch-and-bound converges quickly even for very large graphs [DSDJ+ 95].
However, when the graph is not very sparse, the linear relaxation is loose, and finding efficient branching
mechanisms is challenging, causing the algorithm to run slowly.
The MAX-CUT problem can also be approximated by one pass of the linear relaxation (with a bound on
$f_{\mathrm{relax}}/f_{\mathrm{exact}}$ in terms of the number of edges [PT94]). A tighter approximation can be found with the semidefinite relaxation (see
next section), which is also used for better bounding in branch-and-bound techniques [HR98, RRW07, BV08,
BST11]. In particular, the rounding algorithm of [GW95] returns a feasible $\hat{x}$ given an optimal $Z$, and is shown
in expectation to satisfy $\frac{\hat{x}^T C \hat{x}}{x^T C x} \geq 0.878$. For this reason, the semidefinite relaxation for problems of type
(1.1) is heavily studied (e.g. [PRW95, Hel00, FK97]).

1.4 Semidefinite relaxation


To find the semidefinite relaxation of problem (1.1), we first reformulate it equivalently using a change of
variables $Z = xx^T$:
\[
\begin{array}{ll}
\underset{Z \in \mathbb{S}^n}{\text{minimize}} & \mathrm{Tr}(CZ) \\
\text{subject to} & \mathrm{diag}(Z) = \mathbf{1} \\
 & Z \succeq 0 \\
 & \mathrm{rank}(Z) \leq 1.
\end{array}
\tag{1.2}
\]
We refer to this formulation as the rank constrained semidefinite program (RCSDP). Without the rank
constraint, (1.2) is the usual (convex) semidefinite relaxation used in MAX-CUT approximations; we refer to
it as the SDR. Solving the SDR exactly can be done in polynomial time (e.g. using an
interior point method); however, the per-iteration complexity can be large ($O(n^6)$ for interior point methods
and $O(n^3)$ for most first-order methods), limiting its scalability.
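For small instances, the SDR can be solved directly with an off-the-shelf conic modeling tool; the sketch below uses CVXPY, which is our illustrative choice and not part of the paper:

```python
import cvxpy as cp
import numpy as np

def solve_sdr(C):
    """Illustrative sketch (ours): min Tr(CZ) s.t. diag(Z) = 1, Z PSD."""
    n = C.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [cp.diag(Z) == 1, Z >> 0]
    prob = cp.Problem(cp.Minimize(cp.trace(C @ Z)), constraints)
    prob.solve()   # requires an SDP-capable solver, e.g. SCS
    return Z.value

# Example: C from a small random graph (any symmetric C works here).
rng = np.random.default_rng(0)
B = np.triu(rng.random((20, 20)) < 0.3, 1).astype(float)
A = B + B.T
C = (A - np.diag(A.sum(axis=1))) / 4
Z = solve_sdr(C)
```

This direct approach is exactly what the factorized and ADMM-based reformulations below are designed to avoid at larger $n$.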
Consequently, a reformulation based on matrix factorization has been proposed by [BM03, BM05], which
solves the SDR with a factorization $Z = RR^T$, $R \in \mathbb{R}^{n \times r}$:
\[
\begin{array}{ll}
\underset{R \in \mathbb{R}^{n \times r}}{\text{minimize}} & \mathrm{Tr}(R^T C R) \\
\text{subject to} & \mathrm{diag}(RR^T) = \mathbf{1},
\end{array}
\tag{1.3}
\]
which we will call the factorized SDP (FSDP). Note that the FSDP is not convex; however, because the search
space is compact, if $r$ is large enough such that $r(r+1)/2 > n$, then the global optimum of the FSDP (1.3)
is the global optimum of the SDR [Pat98, Bar95]. Furthermore, recent work [BVB16] shows that almost
all local optima of the FSDP are also global optima, suggesting that any stationary point of the FSDP is also a
reasonable approximation of (1.1).
To solve (1.3), [BM03] put forward an algorithm based on the augmented Lagrangian method to find local
optima of (1.3); unsurprisingly, in light of the later observation of Boumal et al. [BVB16], this method empirically found
global optima almost all of the time. However, solving (1.3) is still numerically burdensome; in the augmented
Lagrangian, the objective is quartic in R and is usually minimized using an iterative numerical method,
such as L-BFGS.
Another way of finding a low-rank solution to the SDR is to use the nuclear norm as a surrogate for
the rank penalty [CR09, RFP10, UHZ+16]. This approach is popular because of its theoretical guarantees and
experimental success. A difficulty with nuclear norm minimization is choosing the penalty parameter; in
general this is done through cross-validation. Also, when the rank function appears as a hard constraint in
the optimization problem, the corresponding bound for the surrogate model is ambiguous.

1.5 ADMM for nonconvex optimization


We propose using the alternating direction method of multipliers (ADMM) to solve vector and SDP-based
matrix reformulations of (1.1), as a scalable alternative to the prior methods mentioned here. Recently,
ADMM [GM76, GM75, LM79, EB92] has been widely applied
and investigated in various fields; see the survey [BPC+11] and the references therein. Specifically, it is
favored because it is easy to apply in a distributed framework, and has been shown to converge quickly in
practice. For convex optimization problems, ADMM can be shown to converge to a global optimum in several
ways: as a series of contractions of monotone operators [EB92] or as the minimization of a global potential
function [BPC+11].
However, in general there is a lack of theoretical justification for ADMM on nonconvex problems, despite
its good numerical performance. Almost all works concerning ADMM on nonconvex problems investigate
the case where nonconvexity is in the objective function (e.g. [HLR16, WYZ15, LP15, MWRF16], and also [LHW17,
XYWZ12] for matrix factorization). Under a variety of assumptions (e.g. convergence or boundedness of dual
objectives), they are shown to converge to a KKT stationary point.
In comparison, relatively few works deal with nonconvex constraints. [JMZ14] tackles polynomial
optimization problems by minimizing a general objective over a spherical constraint $\|x\|_2 = 1$, [HS16] solves
general QCQPs, and [SWZ14] solves the low-rank-plus-sparse matrix separation problem. In all cases, they
show that all limit points are also KKT stationary points, but do not show that their algorithms actually
converge to the limit points. In this work, we investigate a class of nonconvex constrained problems, and
show under much milder assumptions that the sequence always converges to a KKT stationary point.

2 VECTOR FORM
We begin by considering a simple reformulation of (1.1) using only vector variables:
\[
\begin{array}{ll}
\underset{x, y \in \mathbb{R}^n}{\text{minimize}} & x^T C x \\
\text{subject to} & y \in \{-1, 1\}^n \\
 & x = y.
\end{array}
\tag{2.4}
\]
This is similar to the nonconvex formulations proposed in [BPC+11], Chapter 9.1.
To solve (2.4) via ADMM, we first form the augmented Lagrangian
\[
L(x, y; \mu) = x^T C x + \langle \mu, x - y \rangle + \frac{\rho}{2}\|x - y\|_F^2,
\]
where $\mu \in \mathbb{R}^n$ is the corresponding dual variable and $\rho > 0$ is the weighting factor associated with the
penalty term. Under the ADMM framework, problem (2.4) can be solved by alternating between the two

primal variables $x$ and $y$, and then updating the dual variable $\mu$. The ADMM for (2.4) is summarized in
Algorithm 1. Note that, unlike typical ADMM, the penalty parameter $\rho$ in Algorithm 1 increases with the
iteration index, which facilitates convergence.

Algorithm 1 ADMM for solving (2.4)

1: Inputs: $C \in \mathbb{S}^n$, $\rho^0 > 0$, $\alpha > 1$, tolerance $\epsilon > 0$
2: Outputs: Local optimum $(x, y)$ of (2.4)
3: Initialize: $x^0$, $y^0$, $\mu^0$ as random vectors
4: for $k = 1, \ldots$ do
5:    Update $y^{k+1}$ as the solution to
\[
\begin{array}{ll}
\underset{y}{\min} & \big\|x^k - y + \frac{\mu^k}{\rho^k}\big\|_F^2 \\
\text{s.t.} & y \in \{-1, 1\}^n
\end{array}
\tag{2.5}
\]
6:    Update $x^{k+1}$ as the minimizer of
\[
x^T C x + \frac{\rho^k}{2}\Big\|x - y^{k+1} + \frac{\mu^k}{\rho^k}\Big\|_F^2
\tag{2.6}
\]
7:    Update $\mu$ and $\rho$ via
\[
\mu^{k+1} = \mu^k + \rho^k (x^{k+1} - y^{k+1}), \qquad \rho^{k+1} = \alpha\rho^k
\tag{2.7}
\]
8:    if $\|x^k - y^k\| \leq \epsilon$ then
9:       break
10:   end if
11: end for

2.1 Solution for Subproblems in ADMM


For every step k in ADMM of Algorithm 2, the updating procedure is composed of three subproblems,
expressed in (2.5)-(2.7).The update for y in (2.5) is just rounding
yik+1 = sign(xk + µk /ρk )i ,
and is computationally cheap. However, updating x is more cumbersome. The update of x is the solution
to a linear system
2Cx + µk + ρk (x − y k+1 ) = 0. (2.8)
and xk = (ρk I + 2C)−1 (−µk + ρk y k+1 ). If ρk = ρ is constant, then one can avoid inverting at each iteration
by storing (ρk I + 2C)−1 and using matrix multiplication at each step. For changing ρk , one can store the
eigenvalue decomposition of C = UC ΛC UCT , and compute for each ρk
(ρk I + 2C)−1 = UC (2ΛC + ρk I)−1 UCT .
Although this formulation seems simple, when n is large, the initial eigenvalue decomposition or matrix
inverse can incur significant overhead. Still, despite these limitations, this vector-based method can be
shown to converge under very mild conditions.
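A minimal NumPy implementation of Algorithm 1 along these lines (our own sketch; the helper name and parameter defaults are illustrative, and $C$ is assumed to be a dense symmetric matrix) is:

```python
import numpy as np

def admm_vector(C, rho0=1.0, alpha=1.05, tol=1e-6, max_iter=500, seed=0):
    """Illustrative sketch of Algorithm 1: min x^T C x s.t. x = y in {-1,+1}^n."""
    n = C.shape[0]
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rng.choice([-1.0, 1.0], n)
    mu = rng.standard_normal(n)
    rho = rho0
    lam, U = np.linalg.eigh(C)                  # cache C = U diag(lam) U^T once
    for _ in range(max_iter):
        y = np.sign(x + mu / rho)               # (2.5): project onto {-1,+1}^n
        y[y == 0] = 1.0
        rhs = rho * y - mu                      # (2.6): (rho I + 2C) x = rho y - mu
        x = U @ ((U.T @ rhs) / (2 * lam + rho))
        mu = mu + rho * (x - y)                 # (2.7): dual ascent
        rho *= alpha                            # increasing penalty schedule
        if np.linalg.norm(x - y) <= tol:
            break
    return y, x
```

The cached eigendecomposition makes each x-update an $O(n^2)$ matrix-vector operation after the one-time $O(n^3)$ factorization.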

2.2 Convergence Analysis


We will now show that Algorithm 1 converges under very mild conditions; specifically, the augmented
Lagrangian monotonically decreases.

Definition 2.1 $x^T C x$ is an $L_1$-smooth function; that is, for any $x_1, x_2$, $\|2Cx_1 - 2Cx_2\| \leq L_1\|x_1 - x_2\|$.

Definition 2.2 $L_H \in \mathbb{R}_+$ is the smallest number such that $-L_H I \preceq 2C$, where $2C$ is the Hessian of the
objective function of (1.1). In other words, the Hessian is bounded below.

Assumption 2.3 The value xT Cx is lower bounded over x ∈ {−1, 1}n .

Theorem 2.4 Given a sequence $\{\rho^k\}$ such that
\[
(\rho^k)^2 - L_H\rho^k - (\alpha + 1)L_1^2 > 0, \qquad \rho^k > L_H, \qquad \alpha > 0,
\]
under Algorithm 1 the augmented Lagrangian monotonically decreases,
\[
L(y^{k+1}, x^{k+1}; \mu^{k+1}) - L(y^k, x^k; \mu^k) \leq -c_1^k \|x^{k+1} - x^k\|_F^2,
\]
with $c_1^k = \frac{\rho^k - L_H}{2} - \frac{(\alpha+1)L_1^2}{2\rho^k} > 0$. With Assumption 2.3 and $\rho^k > L_1$, the augmented Lagrangian $L(y^k, x^k; \mu^k)$
is lower bounded and convergent. Therefore, $\{y^k, x^k, \mu^k\}$ converges to $\{y^*, x^*, \mu^*\}$, a KKT solution of (2.4)
with $x^* = y^*$.

Proof: See section A in the appendix. 


Remark: Convergence is also guaranteed under a constant penalty coefficient $\rho^k \equiv \rho^0$ ($\alpha = 1$) satisfying
$(\rho^0)^2 - L_H\rho^0 - 2L_1^2 > 0$. However, in implementation, we find empirically that increasing $\{\rho^k\}$ from a
relatively small $\rho^0$ can encourage convergence to more useful global minima.

Corollary 2.5 With the prerequisites in Theorem 2.4 satisfied and $(y^*, x^*; \mu^*)$ the accumulation point,
we have that

i. $\mu_i^* x_i^* \geq -\rho^*$ with $x^* \in \{-1, 1\}^n$.

ii. $x^*$ is a stationary point of the optimization problem $\min_x\, x^T C x + (\mu^*)^T x$.

In particular, for a convex objective $f(x) = x^T C x$ ($C \succeq 0$), $\mu^* = 0$ indicates that the constraint
$x \in \{-1, 1\}^n$ is automatically satisfied; that is, the minimizer of $f(x)$ is the exact minimizer of (1.1).

Proof: See section A in the appendix. 

3 PSD MATRIX FORM


We now present our second reformulation of (1.1), using ideas from the SDR and FSDP:

\[
\begin{array}{ll}
\underset{Z, X, Y}{\text{minimize}} & \mathrm{Tr}(CZ) \\
\text{subject to} & \mathrm{diag}(Z) = \mathbf{1} \\
 & Z = XY^T \\
 & X = Y.
\end{array}
\tag{3.9}
\]
Here the matrix variables are $Z \in \mathbb{S}^n$ and $X, Y \in \mathbb{R}^{n \times r}$. The extra variable $Y$ and the constraint $X = Y$
are introduced to transform the quadratic constraint in (1.3) into a bilinear one, which simplifies the subproblems
significantly. Two cases of $r$ are of interest. When $r = 1$, (3.9) is equivalent to (1.1), with $X = x$. When
$r = \lceil\sqrt{2n}\,\rceil$, then $\frac{r(r+1)}{2} > n$ and the global optimum of (3.9) is with high probability that of the SDR,
based on the results of [BVB16].
To solve (3.9) via ADMM, we build the augmented Lagrangian function
\[
L(Z, X, Y; \Lambda_1, \Lambda_2) = \mathrm{Tr}(CZ) + \langle \Lambda_1, Z - XY^T\rangle + \langle \Lambda_2, X - Y\rangle
+ \frac{\rho}{2}\|Z - XY^T\|_F^2 + \frac{\rho}{2}\|X - Y\|_F^2,
\tag{3.10}
\]
where $\Lambda_1 \in \mathbb{R}^{n\times n}$ and $\Lambda_2 \in \mathbb{R}^{n\times r}$ are the dual variables corresponding to the two coupling constraints,
respectively. Similarly to the vector case, problem (3.9) is solved by alternating minimization between
two primal variable sets, $Y$ and $(Z, X)$, and then updating the dual variables $\Lambda_1$ and $\Lambda_2$. This sequence is
summarized in Algorithm 2 below.

Algorithm 2 ADMM for solving (3.9)

1: Inputs: $C \in \mathbb{S}^n$, $\rho^0 > 0$, $\alpha > 1$, tolerance $\epsilon > 0$
2: Initialize: $Z^0$, $X^0$; $\Lambda_1^0$, $\Lambda_2^0$ as random matrices
3: Outputs: $Z$, $X = Y$
4: for $k = 1, \ldots$ do
5:    Update $Y^{k+1}$ as the minimizer of
\[
\Big\|Z^k - X^k Y^T + \frac{\Lambda_1^k}{\rho^k}\Big\|_F^2 + \Big\|X^k - Y + \frac{\Lambda_2^k}{\rho^k}\Big\|_F^2
\tag{3.11}
\]
6:    Update $(Z, X)^{k+1}$ as the solution of
\[
\begin{array}{ll}
\underset{X, Z}{\min} & L(Z, X, Y^{k+1}; \Lambda_1^k, \Lambda_2^k; \rho^k) \\
\text{s.t.} & \mathrm{diag}(Z) = \mathbf{1},
\end{array}
\tag{3.12}
\]
      where $L$ is as defined in (3.10)
7:    Update $\Lambda_1$, $\Lambda_2$ and $\rho$ via
\[
\begin{aligned}
\Lambda_1^{k+1} &= \Lambda_1^k + \rho^k \big(Z^{k+1} - X^{k+1}(Y^{k+1})^T\big) \\
\Lambda_2^{k+1} &= \Lambda_2^k + \rho^k \big(X^{k+1} - Y^{k+1}\big) \\
\rho^{k+1} &= \alpha\rho^k
\end{aligned}
\tag{3.13}
\]
8:    if $\max\{\|X^k - Y^k\|, \|Z^k - X^k (Y^k)^T\|\} \leq \epsilon$ then
9:       break
10:   end if
11: end for

3.1 Solution for Subproblems in ADMM


For notational ease, we omit the iteration index $k$. To update $Y$ in (3.11), we solve the first order optimality
condition, which reduces to the following linear system:
\[
Y = \Big(\frac{1}{\rho}(\Lambda_1^T X + \Lambda_2) + Z^T X + X\Big)(I + X^T X)^{-1}.
\]
For $r = \lceil\sqrt{2n}\,\rceil$, the size of the matrix to be inverted is of the order $O(\sqrt{n})$; for $r = 1$, it is a scalar and
inversion is trivial.
To update $(Z, X)$ in (3.12), we similarly need to solve an equality-constrained quadratic problem. Define
the linear map $\mathcal{A}(Z) = \mathrm{diag}(Z)$ selecting the diagonal of a symmetric matrix, and its adjoint operator
$\mathcal{A}^*(\nu) = \mathrm{Diag}(\nu)$ producing a diagonal matrix from a vector. Then the optimality conditions of (3.12) are
\[
\begin{aligned}
C - \mathcal{A}^*(\nu) + \Lambda_1 + \rho(Z - XY^T) &= 0 \\
\Lambda_2 - \Lambda_1 Y + \rho(XY^T - Z)Y + \rho(X - Y) &= 0 \\
\mathcal{A}(Z) &= b,
\end{aligned}
\]
where $\nu \in \mathbb{R}^n$ is the dual variable for the local constraint $\mathcal{A}(Z) = b$, with $b = \mathbf{1}$ here.
Given $Y$, solving for $\nu$ reduces to solving
\[
G\nu = \rho\big(b - \mathcal{A}(DY^T)\big) + \mathcal{A}\big[(C + \Lambda_1)(I + YY^T)\big],
\]
where
\[
D = \frac{1}{\rho}(\Lambda_1 Y - \Lambda_2) + Y, \qquad G = \mathcal{A}^*\big(\mathcal{A}(I + YY^T)\big).
\]
Since $G$ is a diagonal positive definite matrix, finding $G^{-1}$ is computationally simple. Then
\[
X = BY + D, \qquad Z = XY^T + B,
\]
where
\[
B = -\frac{1}{\rho}\big(C - \mathcal{A}^*(\nu) + \Lambda_1\big).
\]
Note that the complexity of all inversions is $O(r^3)$, and they are particularly simple if $r = 1$. In that case, the
complexity is dominated by the matrix multiplications.
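A NumPy sketch of these closed-form updates (ours; the function and variable names are not from the paper) is given below; the two helpers would be called inside a loop mirroring Algorithm 2, followed by the dual and penalty updates (3.13):

```python
import numpy as np

# Illustrative sketches (ours) of the closed-form subproblem solutions.

def update_Y(Z, X, L1, L2, rho):
    """Y update from the first-order condition of (3.11); inverts an r x r matrix."""
    rhs = (L1.T @ X + L2) / rho + Z.T @ X + X
    return rhs @ np.linalg.inv(np.eye(X.shape[1]) + X.T @ X)

def update_ZX(Y, L1, L2, C, rho):
    """(Z, X) update for (3.12) via the diagonal dual variable nu."""
    n = C.shape[0]
    D = (L1 @ Y - L2) / rho + Y
    g = 1.0 + np.sum(Y * Y, axis=1)             # diag(I + Y Y^T); G = Diag(g)
    rhs = rho * (np.ones(n) - np.einsum('ij,ij->i', D, Y)) \
          + np.diag((C + L1) @ (np.eye(n) + Y @ Y.T))
    nu = rhs / g                                # G is diagonal, so inversion is cheap
    B = -(C - np.diag(nu) + L1) / rho
    X = B @ Y + D
    Z = X @ Y.T + B
    return Z, X
```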

3.2 Convergence Analysis


Assumption 3.1 The sequences $\{\mathrm{Tr}(CZ^k)\}$ and $\{\Lambda_1^k, \Lambda_2^k\}$ are bounded in norm.

Theorem 3.2 If Assumption 3.1 holds, and given a sequence $\{\rho^k\}$ such that
\[
\rho^k > 0, \qquad \sum_k \frac{\rho^{k+1}}{(\rho^k)^2} < \infty, \qquad \text{and} \qquad \sum_k \frac{1}{\rho^k} < \infty,
\]
then $\{Z^k, X^k, Y^k\}$ generated by Algorithm 2 will globally converge to a stationary point of (3.9).

Proof: See section B in the appendix. 
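For instance, the geometric schedule $\rho^{k+1} = \alpha\rho^k$ with $\alpha > 1$ used in Algorithm 2 satisfies both summability conditions; a quick check (ours, not in the paper):
\[
\sum_{k\ge 0}\frac{1}{\rho^k} = \frac{1}{\rho^0}\sum_{k\ge 0}\alpha^{-k} = \frac{\alpha}{(\alpha-1)\rho^0} < \infty,
\qquad
\sum_{k\ge 0}\frac{\rho^{k+1}}{(\rho^k)^2} = \alpha\sum_{k\ge 0}\frac{1}{\rho^k} < \infty.
\]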


Corollary 3.3 If $r \geq \lceil\sqrt{2n}\,\rceil$ and Algorithm 2 converges to a second order critical point of (3.9), then that
point is globally optimal for the convex relaxation of (1.3) [BVB16].

Unfortunately, the extension of KKT stationary points to global minima is not yet known when $\frac{r(r+1)}{2} < n$
(e.g., $r = 1$). However, our empirical results suggest that even when $r = 1$, a local solution to (3.9) often
well-approximates the global solution to (1.1).

4 NUMERICAL RESULTS
From the convergence proofs, we have shown that our algorithms converge to a KKT stationary point, and, if
$r = \lceil\sqrt{2n}\,\rceil$, with high probability achieve the global solution of the SDR. In this section, we give empirical
evidence that these points are, in addition, good solutions to (1.1), our original combinatorial problem. We show
this through three applications: community detection on the stochastic block model, MAX-CUT, and image
segmentation.
We evaluate four methods for solving (1.1):
We evaluate four methods for solving (1.1):
1. SDP: the solution to the SDR rounded to a binary vector using a Goemans-Williamson style rounding
technique [GW95];

Figure 1: Sample evolution for n = 2500, showing the objective value Tr(CX), best objective value using
xr = sign(x), and recovery rate using xr .

2. V: the binary vector reformulation (2.4) solved via ADMM;


3. MR1: the matrix reformulation (3.9) with $r = 1$, solved via ADMM;

4. MRR: the matrix reformulation (3.9) with $r = \lceil\sqrt{2n}\,\rceil$, solved via ADMM, and rounded to a binary
vector using a nonsymmetric version of the Goemans-Williamson style rounding technique [GW95].

Rounding For both the SDP and MRR methods, we need to round the solutions to a vector. For the SDP
method, we first compute an eigenvalue decomposition $X = Q\Lambda Q^T$ and form a factor $F = Q\Lambda^{1/2}$, where the diagonal
elements of $\Lambda$ are in decreasing magnitude order. Then we scan $k = 1, \ldots, n$ and compute $x^{k,t} = \mathrm{sign}(F_k z_t)$ for
trials $t = 1, \ldots, 10$. Here, $F_k$ contains the first $k$ columns of $F$, and each element of $z_t$ is drawn i.i.d. from a
standard Gaussian distribution. We keep the $x_r = x^{k,t}$ that minimizes $x_r^T C x_r$. For the MRR method, we repeat
the procedure using a factor $F = U\Sigma^{1/2}$, where $X = U\Sigma V^T$ is the SVD of $X$. For MR1 and V, we simply
take $x_r = \mathrm{sign}(x)$ as the binary solution.
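The following NumPy sketch mirrors this rounding procedure (our own code; the helper name is ours, and the scan over all k is shown in full although it can be truncated in practice):

```python
import numpy as np

def gw_style_rounding(X, C, trials=10, seed=0):
    """Illustrative sketch (ours): round a PSD solution X to x in {-1,+1}^n."""
    rng = np.random.default_rng(seed)
    lam, Q = np.linalg.eigh(X)                      # X = Q diag(lam) Q^T
    order = np.argsort(-np.abs(lam))                # decreasing |eigenvalue|
    F = Q[:, order] * np.sqrt(np.abs(lam[order]))   # factor F ~ Q Lambda^{1/2}
    best_x, best_val = None, np.inf
    # Scan k = 1..n prefixes of F, with several Gaussian trials per k.
    for k in range(1, X.shape[0] + 1):
        for _ in range(trials):
            z = rng.standard_normal(k)
            x = np.sign(F[:, :k] @ z)
            x[x == 0] = 1.0
            val = x @ C @ x
            if val < best_val:
                best_x, best_val = x, val
    return best_x, best_val
```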

Computer information The following simulations are performed on a Dell Poweredge R715 with an
AMD Opteron(tm) processor, with 64 GB of RAM and 24 cores. It is running Ubuntu 16.04 (Xenial) with
Matlab 2017a.

4.1 Community detection


In this section we evaluate C as defined for community detection. Figure 1 gives a sample evolution for each
method with n = 2500, m = n/2, and p = 10q = 0.04. Although the vector method is simple and has a
fast per-iteration rate, it has significant initial overhead and slow overall progress. In comparison, the two
matrix methods MR1 and MRR are significantly faster. The SDP method has a slow per-iteration rate,
but suggests good answers in only a few iterations. However, this plot does not take into account rounding
overhead time (see Figure 2), which is significant for both the SDP and MRR methods.
Table 1 gives the average recovery rate for each method on the community detection problem for the
stochastic block model. The results are surprisingly consistent, and suggest that solving (1.1) using our
approaches holds great promise for general community detection problems.

[Figure 2: two bar charts; vertical axes: CPU runtime and CPU overhead time (log scale); horizontal axes: n = # nodes in {2500, 5000, 10000}; bars: SDP, MR1, MRR, V.]

Figure 2: Top: CPU per-iteration runtime for the four methods. Bottom: CPU runtime for overhead operations
(rounding for SDP and MRR, and initial matrix inversion or SVD for V). MR1 gives a vector output and
has no additional overhead.

    n      m      p    SDP    V   MR1   MRR
 1000    100   0.1      1    1     1     1
 1000    500   0.1      1    1     1     1
 2500    250   0.04     1    1     1     1
 2500   1250   0.04     1    1     1    1.00
 5000    500   0.02     1    1     1    1.00
 5000   2500   0.02     1    1     1    1.00
10000   1000   0.01     1    1     1     1
10000   5000   0.01     1    1     1    1.00

Table 1: Correct community recovery rate for the stochastic block model. Results after 10 iterations for the
SDP, MR1, and MRR methods and 50 iterations for V, averaged over 10 trials for each method. n = number
of nodes in the graph. m = number of nodes in the smaller of the two communities. p = probability of an edge within
a community, and q = p/10 the probability of an edge between communities. 1 = perfect recovery, 1.00 =
rounds to 1.

              g3-15       g3-8   pm3-15-50   pm3-8-50
 n             3375        512        3375        512
 # edges      20250       3072       20250       3072
 sparsity     0.18%       1.2%       0.18%       1.2%
 BK       281029888   41684814        2964        454
 R         12838418    4816306         170         62
 MR1      255681256   36780180        1990        312
 V        202052290    8213762        2016        330

Table 2: MAX-CUT values for graphs from the 7th DIMACS Challenge. BK = best known. R = best of
1000 random guesses. MR1 = matrix formulation with r = 1. V = vector formulation.

4.2 MAX-CUT
Table 2 gives the best MAX-CUT values using best-of-random-guesses and our approaches over four examples
from the 7th DIMACS Implementation Challenge in 2002.1 Our solutions are very close to the best known
solutions.

4.3 Image segmentation


Both community detection and MAX-CUT can be used for image segmentation, where each pixel is a node and
the similarity between pixels forms the weight of the edges. Generally, solving (1.1) for this application is not
preferred, since the number of pixels in even a moderately sized image is extremely large. However, because
of our fast methods, we successfully performed image segmentation on several thumbnail-sized images, shown in
Figure 3.
The C matrix is composed as follows. For each pixel, we compose two feature vectors: $f_c^{ij}$ containing the
RGB values and $f_p^{ij}$ containing the pixel location. Scaling $f_p^{ij}$ by some weight $c$, we form the concatenated
feature vector $f^{ij} = [f_c^{ij}, c f_p^{ij}]$, and form the weighted adjacency matrix as the squared distance matrix
between each pair of feature vectors, $A_{(ij),(kl)} = \|f^{ij} - f^{kl}\|_2^2$. For MAX-CUT, we again form $C = A - \mathrm{Diag}(A\mathbf{1})$ as
before. For community detection, since we do not have exact p and q values, we use the approximation
$C = a\mathbf{1}\mathbf{1}^T - A$, where $a = \frac{1}{n^2}\mathbf{1}^T A \mathbf{1}$ is the mean value of A. Sweeping $c$ and $\rho^0$, we give the best qualitative
results in Figure 3.
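A sketch of this construction (ours; the image array shape, the helper name, and the default weight c are illustrative assumptions) is:

```python
import numpy as np

def segmentation_cost(img, c=0.1, use_maxcut=True):
    """Illustrative sketch (ours): build C from an H x W x 3 RGB thumbnail."""
    H, W, _ = img.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    feats = np.concatenate(
        [img.reshape(-1, 3).astype(float),                      # f_c: RGB values
         c * np.stack([rows.ravel(), cols.ravel()], axis=1)],   # c * f_p: pixel location
        axis=1)
    # A_{(ij),(kl)} = || f^{ij} - f^{kl} ||_2^2  (squared feature distances)
    sq = np.sum(feats ** 2, axis=1)
    A = sq[:, None] + sq[None, :] - 2 * feats @ feats.T
    np.fill_diagonal(A, 0.0)
    if use_maxcut:
        return A - np.diag(A.sum(axis=1))        # MAX-CUT style cost
    a = A.mean()                                 # community-detection style cost
    return a * np.ones_like(A) - A
```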

5 CONCLUSION
We present two methods for solving quadratic combinatorial problems using ADMM on two reformulations.
Though the problem has a nonconvex constraint, we give convergence results to KKT solutions under much
milder conditions than previously shown. From this, we give empirical solutions to several graph-based
combinatorial problems, specifically MAX-CUT and community detection; both can be used in additional
downstream applications, like image segmentation.

References
[ABH16] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block
model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
[Bar95] Alexander I. Barvinok. Problems of distance geometry and convex properties of quadratic maps.
Discrete & Computational Geometry, 13(2):189–202, 1995.
1 See http://dimacs.rutgers.edu/Workshops/7thchallenge/. Problems downloaded from
http://www.optsicom.es/maxcut/

Figure 3: Image segmentation. The center (right) column shows the best MAX-CUT (community detection)
results.

[BGJR88] Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of
combinatorial optimization to statistical physics and circuit layout design. Operations Research,
36(3):493–513, 1988.

[BM03] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidef-
inite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
[BM05] Samuel Burer and Renato DC Monteiro. Local minima and convergence in low-rank semidefinite
programming. Mathematical Programming, 103(3):427–444, 2005.

[BPC+ 11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed opti-
mization and statistical learning via the alternating direction method of multipliers. Foundations
and Trends in Machine Learning, 3(1):1–122, 2011.

[BST11] Xiaowei Bao, Nikolaos V Sahinidis, and Mohit Tawarmalani. Semidefinite relaxations for
quadratically constrained quadratic programming: A review and comparisons. Mathematical
programming, 129(1):129–157, 2011.
[BV08] Samuel Burer and Dieter Vandenbussche. A finite branch-and-bound algorithm for nonconvex
quadratic programming via semidefinite relaxations. Mathematical Programming, 113(2):259–
282, 2008.
[BVB16] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro ap-
proach works on smooth semidefinite programs. In Advances in Neural Information Processing
Systems, pages 2757–2765, 2016.

[CR09] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization.
Foundations of Computational mathematics, 9(6):717, 2009.
[DSDJ+ 95] Caterina De Simone, Martin Diehl, Michael Jünger, Petra Mutzel, Gerhard Reinelt, and Gio-
vanni Rinaldi. Exact ground states of Ising spin glasses: New experimental results with a
branch-and-cut algorithm. Journal of Statistical Physics, 80(1-2):487–496, 1995.

[EB92] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas–Rachford splitting method and
the proximal point algorithm for maximal monotone operators. Mathematical Programming,
55(1):293–318, 1992.
[FH16] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics
Reports, 659:1–44, 2016.
[FK97] Tetsuya Fujie and Masakazu Kojima. Semidefinite programming relaxation for nonconvex
quadratic programs. Journal of Global Optimization, 10(4):367–380, 1997.
[GM75] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue
française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–
76, 1975.
[GM76] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear varia-
tional problems via finite element approximation. Computers & Mathematics with Applications,
2(1):17–40, 1976.

[GW95] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum
cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM),
42(6):1115–1145, 1995.
[Hel00] Christoph Helmberg. Semidefinite programming for combinatorial optimization. Konrad-Zuse-
Zentrum für Informationstechnik Berlin, 2000.

[HLL83] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels:
First steps. Social networks, 5(2):109–137, 1983.
[HLR16] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direc-
tion method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization,
26(1):337–364, 2016.
[HR98] Christoph Helmberg and Franz Rendl. Solving quadratic (0, 1)-problems by semidefinite pro-
grams and cutting planes. Mathematical programming, 82(3):291–315, 1998.
[HS16] Kejun Huang and Nicholas D Sidiropoulos. Consensus-ADMM for general quadratically con-
strained quadratic programming. IEEE Transactions on Signal Processing, 64(20):5297–5310,
2016.
[JMZ14] Bo Jiang, Shiqian Ma, and Shuzhong Zhang. Alternating direction method of multipliers for
real and complex polynomial optimization models. Optimization, 63(6):883–898, 2014.
[LHW17] Songtao Lu, Mingyi Hong, and Zhengdao Wang. A nonconvex splitting method for symmetric
nonnegative matrix factorization: Convergence analysis and optimality. IEEE Transactions on
Signal Processing, 2017.
[LM79] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear
operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.

[LP15] Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite
optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
[MWRF16] Sindri Magnússon, Pradeep Chathuranga Weeraddana, Michael G Rabbat, and Carlo Fischione.
On the convergence of alternating direction Lagrangian methods for nonconvex structured opti-
mization problems. IEEE Transactions on Control of Network Systems, 3(3):296–309, 2016.
[Pat98] Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity
of optimal eigenvalues. Mathematics of operations research, 23(2):339–358, 1998.
[PRW95] Svatopluk Poljak, Franz Rendl, and Henry Wolkowicz. A recipe for semidefinite relaxation for
(0, 1)-quadratic programming. Journal of Global Optimization, 7(1):51–73, 1995.
[PT94] Svatopluk Poljak and Zsolt Tuza. The expected relative error of the polyhedral approximation
of the MAX-CUT problem. Operations Research Letters, 16(4):191–198, 1994.
[RFP10] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of
linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
[RRW07] Franz Rendl, Giovanni Rinaldi, and Angelika Wiegele. A branch and bound algorithm for MAX-
CUT based on combining semidefinite and polyhedral relaxations. In IPCO, volume 4513, pages
295–309. Springer, 2007.
[SWZ14] Yuan Shen, Zaiwen Wen, and Yin Zhang. Augmented Lagrangian alternating direction method
for matrix separation based on low-rank factorization. Optimization Methods and Software,
29(2):239–263, 2014.
[UHZ+ 16] Madeleine Udell, Corinne Horn, Reza Zadeh, Stephen Boyd, et al. Generalized low rank models.
Foundations and Trends in Machine Learning, 9(1):1–118, 2016.

[WYZ15] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of ADMM in nonconvex nonsmooth
optimization. arXiv preprint arXiv:1511.06324, 2015.
[XYWZ12] Yangyang Xu, Wotao Yin, Zaiwen Wen, and Yin Zhang. An alternating direction algorithm for
matrix completion with nonnegative factors. Frontiers of Mathematics in China, 7(2):365–384,
2012.

A Convergence analysis for vector form


We first prove the following lemma.

Lemma A.1 If Assumption 2.3 holds, for two adjacent iterations of Algorithm 1 we have
\[
\|\mu^{k+1} - \mu^k\|_2^2 \leq L_1^2\|x^{k+1} - x^k\|_2^2.
\tag{1.14}
\]
Proof: Recall the first order optimality condition for $x$ in (2.8),
\[
2Cx^{k+1} + \mu^k + \rho^k(x^{k+1} - y^{k+1}) = 0.
\tag{1.15}
\]
Combining with (2.7), we get
\[
2Cx^{k+1} + \mu^{k+1} = 0.
\tag{1.16}
\]
As (1.16) holds at each iteration $k$, combining it with the definition of $L_1$ proves the lemma.
Remark : We update x after y so that we can apply the optimality condition (1.16).
Next we will show that the augmented Lagrangian of (2.4) is monotonically decreasing and lower bounded.

Lemma A.2 The augmented Lagrangian of (2.4) is monotonically decreasing; that is,
\[
L(y^{k+1}, x^{k+1}; \mu^{k+1}; \rho^{k+1}) - L(y^k, x^k; \mu^k; \rho^k) \leq -c_1^k\|x^{k+1} - x^k\|_F^2,
\tag{1.17}
\]
where $c_1^k = \frac{\rho^k - L_H}{2} - \frac{(\alpha+1)L_1^2}{2\rho^k} > 0$.

Proof: For the update of $y$ in (2.5), we have
\[
L(y^{k+1}, x^k; \mu^k; \rho^k) - L(y^k, x^k; \mu^k; \rho^k) \leq 0,
\tag{1.18}
\]
based on the fact that problem (2.5) is globally minimized. Using $\{y, \mu, \rho\} = \{y^{k+1}, \mu^k, \rho^k\}$, for the update
of (2.6), we have
\[
\begin{aligned}
L(y, x^{k+1}; \mu; \rho) - L(y, x^k; \mu; \rho) &\leq \langle \nabla_x L(y, x^{k+1}; \mu; \rho), x^{k+1} - x^k\rangle - \frac{\rho^k - L_H}{2}\|x^{k+1} - x^k\|_2^2 \\
&\leq -\frac{\rho^k - L_H}{2}\|x^{k+1} - x^k\|_2^2,
\end{aligned}
\tag{1.19}
\]
where the first inequality follows because $L(y, x; \mu; \rho)$ is strongly convex with respect to $x$ given $\rho > L_H$, and
the second inequality follows from the optimality condition of (2.6).
In addition, using $\{x, y\} = \{x^{k+1}, y^{k+1}\}$ we have
\[
\begin{aligned}
L(y, x; \mu^{k+1}; \rho^{k+1}) - L(y, x; \mu^k; \rho^k) &= \langle \mu^{k+1} - \mu^k, x - y\rangle + \frac{\rho^{k+1} - \rho^k}{2}\|x - y\|_F^2 \\
&= \frac{\alpha + 1}{2\rho^k}\|\mu^{k+1} - \mu^k\|_2^2,
\end{aligned}
\tag{1.20}
\]
where the first equality follows the definition of $L$ and the second follows (2.7). Combining with Lemma
A.1,
\[
L(y^{k+1}, x^{k+1}; \mu^{k+1}; \rho^{k+1}) - L(y^{k+1}, x^{k+1}; \mu^k; \rho^k) \leq \frac{(\alpha+1)L_1^2}{2\rho^k}\|x^{k+1} - x^k\|_F^2.
\tag{1.21}
\]
Moreover, it can be easily verified that
\[
\begin{aligned}
L(y^{k+1}, x^{k+1}; \mu^{k+1}; \rho^{k+1}) - L(y^k, x^k; \mu^k; \rho^k) &= L(y^{k+1}, x^k; \mu^k; \rho^k) - L(y^k, x^k; \mu^k; \rho^k) \\
&\quad + L(y^{k+1}, x^{k+1}; \mu^k; \rho^k) - L(y^{k+1}, x^k; \mu^k; \rho^k) \\
&\quad + L(y^{k+1}, x^{k+1}; \mu^{k+1}; \rho^{k+1}) - L(y^{k+1}, x^{k+1}; \mu^k; \rho^k).
\end{aligned}
\tag{1.22}
\]
By incorporating (1.18), (1.19) and (1.21), (1.22) can be rewritten as
\[
\begin{aligned}
L(y^{k+1}, x^{k+1}; \mu^{k+1}; \rho^{k+1}) - L(y^k, x^k; \mu^k; \rho^k)
&\leq -\frac{\rho^k - L_H}{2}\|x^{k+1} - x^k\|_F^2 + \frac{(\alpha+1)L_1^2}{2\rho^k}\|x^{k+1} - x^k\|_F^2 \\
&= -\Big(\frac{\rho^k - L_H}{2} - \frac{(\alpha+1)L_1^2}{2\rho^k}\Big)\|x^{k+1} - x^k\|_F^2.
\end{aligned}
\]
With $(\rho^k)^2 - L_H\rho^k - (\alpha + 1)L_1^2 > 0$ and $\rho^k > 0$, we have $c_1^k = \frac{\rho^k - L_H}{2} - \frac{(\alpha+1)L_1^2}{2\rho^k} > 0$.

Lemma A.3 If the objective of (2.4) is lower bounded over $x \in \{-1, 1\}^n$ (Assumption 2.3), then the
augmented Lagrangian is lower bounded.

Proof: For notational ease, we denote $f(x) = x^T C x$. From the $L_1$-Lipschitz continuity of $\nabla_x f(x)$, it follows
that
\[
f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L_1}{2}\|x - y\|_F^2
\tag{1.23}
\]
for any $x$ and $y$. By definition,
\[
\begin{aligned}
L(y^k, x^k; \mu^k; \rho^k) &= f(x^k) + \langle \mu^k, x^k - y^k\rangle + \frac{\rho^k}{2}\|x^k - y^k\|_F^2 \\
&= f(x^k) - \langle \nabla f(x^k), x^k - y^k\rangle + \frac{\rho^k}{2}\|x^k - y^k\|_F^2 \qquad (1.24) \\
&\geq f(y^k) + \frac{\rho^k - L_1}{2}\|x^k - y^k\|_F^2, \qquad (1.25)
\end{aligned}
\]
where (1.24) follows from (1.16) and (1.25) follows from (1.23). From the boundedness of the objective
(Assumption 2.3) and the assumption $\rho^k > L_1$, it follows that $L(y^k, x^k; \mu^k; \rho^k)$ is bounded below for all $k$.

Since the sequence $\{L(y^k, x^k; \mu^k)\}$ is monotonically decreasing (Lemma A.2) and lower bounded (Lemma A.3), it
converges. Combining this with the decrease bound of Lemma A.2, we have
\[
\|x^{k+1} - x^k\| \to 0,
\tag{1.26}
\]
which indicates that the sequence $\{x^k\}$ converges to a limit point $x^*$. Moreover, as $\{x^k\}$ converges, the
sequence $\{\mu^k\}$ also converges to $\mu^*$ based on (1.14). From (2.7), $\mu^{k+1} - \mu^k = \rho^k(x^{k+1} - y^{k+1})$, and so
$x^{k+1} - y^{k+1} \to 0$; thus $\{y^k\}$ converges to a limit point $y^*$ and $x^* = y^*$. Additionally, note that $x^* = y^*$
implies $\nabla_x L(y^*, x^*; \mu^*; \rho) = \nabla_y L(y^*, x^*; \mu^*; \rho) = 0$, and $x^* = y^* \in \{-1, 1\}^n$ is feasible. Thus Theorem 2.4 is
proven.
We now prove Corollary 2.5.

Proof: i) At the accumulation point, the update of $x^*$ is expressed as
\[
x^* = y^* = \mathrm{Proj}_{\{-1,1\}^n}\Big(x^* + \frac{\mu^*}{\rho^*}\Big).
\tag{1.27}
\]
Since $x_i^* = \mathrm{sign}(\rho^* x_i^* + \mu_i^*)$, it follows that
\[
x_i^*(\rho^* x_i^* + \mu_i^*) = \rho^* + x_i^*\mu_i^* \geq 0.
\]
ii) Given $x^* = y^*$, it follows from (2.8) that
\[
\nabla_x f(x^*) + \mu^* = 0,
\tag{1.28}
\]
or equivalently,
\[
x^* = \arg\min_x\, f(x) + x^T\mu^*.
\]
If, additionally, $\mu^* = 0$, then $x^*$ is also the minimizer of the unconstrained problem; that is,
\[
\arg\min_x\, f(x) + x^T\mu^* = \arg\min_{x \in \{-1,1\}^n} f(x) = \arg\min_x\, f(x),
\]
suggesting that the combinatorial constraint $x \in \{-1, 1\}^n$ is not necessary, and the convex relaxation is
exact; therefore $x^*$ is the global minimum of (1.1).


B Convergence analysis for matrix form
To simplify notation, we first collect the primal and dual variables P k = (Z, X, Y )k and Dk = (Λ1 , Λ2 )k .

Lemma B.1 $L(P^k; D^k; \rho^k)$ is bounded.

Proof: Recall the definition of $L(P^k; D^k; \rho^k)$:
\[
L(P^k; D^k; \rho^k) = \mathrm{Tr}(CZ^k) + \frac{\rho^k}{2}\Big\|Z^k - X^k(Y^k)^T + \frac{\Lambda_1^k}{\rho^k}\Big\|_F^2
+ \frac{\rho^k}{2}\Big\|X^k - Y^k + \frac{\Lambda_2^k}{\rho^k}\Big\|_F^2
- \frac{1}{2\rho^k}\big(\|\Lambda_1^k\|_F^2 + \|\Lambda_2^k\|_F^2\big) > -\infty,
\tag{2.29}
\]
where the inequality follows from the boundedness of $\{\mathrm{Tr}(CZ^k)\}$ and $\{\Lambda_1^k, \Lambda_2^k\}$ in Assumption 3.1.

Lemma B.2
\[
\nabla^2 L_Y \succeq \rho^k I
\qquad\text{and}\qquad
\nabla^2 L_{(X,Z)} \succeq \rho^k\Big(1 - \frac{\sqrt{\lambda_N^2 + 4\lambda_N} - \lambda_N}{2}\Big) I.
\]

Proof: Given the definition of $L$, we can see that the Hessian with respect to $Y$ satisfies
\[
\nabla^2 L_Y = \rho^k (M + I) \succeq \rho^k I,
\]
where
\[
M = \mathrm{blkdiag}\big((X^k)^T X^k, (X^k)^T X^k, \ldots\big) \succeq 0.
\]
For $(X, Z)$, we have
\[
\nabla^2_{(X,Z)} L^k = \rho^k
\begin{bmatrix}
I + NN^T & -N \\
-N^T & I
\end{bmatrix},
\]
where
\[
N = \mathrm{blkdiag}\big((Y^{k+1})^T, \ldots, (Y^{k+1})^T\big) \in \mathbb{R}^{nr \times n^2}.
\]
Note that for block diagonal matrices, $\|N\|_2 = \|Y^{k+1}\|_2$. Since $I \succ 0$ and its Schur complement $(I + NN^T) -
NN^T \succ 0$, then $\nabla^2_{(X,Z)}\tilde{L}^k \succ 0$, and equivalently $\lambda_{\min}(\nabla^2_{(X,Z)} L^k) > 0$.
To find the smallest eigenvalue $\lambda_{\min}(\nabla^2_{(X,Z)} L^k)$, it suffices to find the largest $\sigma > 0$ such that
\[
H_2 = (\rho^k)^{-1}\nabla^2_{(X,Z)}\tilde{L}^k - \sigma I =
\begin{bmatrix}
(1-\sigma)I + NN^T & -N \\
-N^T & (1-\sigma)I
\end{bmatrix}
\succeq 0.
\tag{2.30}
\]
Using the same Schur complement trick, we want to find the largest $\sigma > 0$ where $(1-\sigma)I \succ 0$ and
\[
H_3 = (1-\sigma)I + NN^T\big(1 - (1-\sigma)^{-1}\big) \succeq 0.
\]
Since this implies $1 - \sigma > 0$, then together with $\sigma > 0$, it must be that $1 - (1-\sigma)^{-1} < 0$.
Therefore, defining $\lambda_N = \|Y^{k+1}\|_2^2$,
\[
\lambda_{\min}(H_3) = (1-\sigma) + \lambda_N\big(1 - (1-\sigma)^{-1}\big).
\]
We can see that $(1-\sigma)\lambda_{\min}(H_3)$ is a convex function in $(1-\sigma)$, with two zeros at
\[
1 - \sigma = \frac{\pm\sqrt{\lambda_N^2 + 4\lambda_N} - \lambda_N}{2}.
\]
In between the two roots, $\lambda_{\min}(H_3) < 0$. Since the smaller root cannot satisfy $1 - \sigma > 0$, we choose
\[
\sigma_{\max} = 1 - \frac{\sqrt{\lambda_N^2 + 4\lambda_N} - \lambda_N}{2} > 0
\]
as the largest feasible $\sigma$ that maintains $\lambda_{\min}(H_3) \geq 0$.
As a result,
\[
\lambda_{\min}(\nabla^2_{(X,Z)} L) = \rho^k\sigma_{\max} = \rho^k\Big(1 - \frac{\sqrt{\lambda_N^2 + 4\lambda_N} - \lambda_N}{2}\Big).
\]


We now prove Theorem 3.2.

Proof: For the update of $Y$ in (3.11), taking $\{D, \rho\} = \{D^k, \rho^k\}$, we have
\[
\begin{aligned}
L(Z^k, X^k, Y^{k+1}; D^k; \rho^k) - L(P^k; D^k; \rho^k) &\leq \langle \nabla_Y L(Z^k, X^k, Y^{k+1}; D^k; \rho^k), Y^{k+1} - Y^k\rangle
- \frac{\lambda_{\min}(\nabla_Y^2\bar{L}^k)}{2}\|Y^{k+1} - Y^k\|_F^2 \\
&\leq -\frac{\rho^k}{2}\|Y^{k+1} - Y^k\|_F^2,
\end{aligned}
\tag{2.31}
\]
with $\nabla_Y^2\bar{L}^k = \nabla^2_{\mathrm{vec}(Y)} L(Z^k, X^k, Y^{k+1}; D^k; \rho^k) \succeq \rho^k I$,
which follows from the $\rho^k$-strong convexity of $L$ with respect to $Y$, and the optimality of $Y^{k+1}$.
For the update of $(Z, X)$ in (3.12), we have
\[
\begin{aligned}
L(P^{k+1}; D^k; \rho^k) - L(Z^k, X^k, Y^{k+1}; D^k; \rho^k) &\leq \langle \nabla_Z L(P^{k+1}; D^k; \rho^k), Z^{k+1} - Z^k\rangle + \langle \nabla_X L(P^{k+1}; D^k; \rho^k), X^{k+1} - X^k\rangle \\
&\quad - \frac{\lambda_{\min}(\nabla^2_{(X,Z)}\tilde{L}^k)}{2}\big(\|Z^{k+1} - Z^k\|_F^2 + \|X^{k+1} - X^k\|_F^2\big) \\
&\leq -\frac{\lambda_{\min}(\nabla^2_{(X,Z)}\tilde{L}^k)}{2}\big(\|Z^{k+1} - Z^k\|_F^2 + \|X^{k+1} - X^k\|_F^2\big),
\end{aligned}
\tag{2.32}
\]
where the first inequality follows from the strong convexity of $L(Z, X, Y^{k+1}; \Lambda_1^k, \Lambda_2^k)$ with respect to $(X, Z)$, with
$\nabla^2_{(X,Z)}\tilde{L}^k = \nabla^2_{\mathrm{vec}(X,Z)} L(P^{k+1}; D^k; \rho^k)$. The second inequality follows from the optimality condition
of (3.12). Note that $\lambda_{\min}(\nabla^2_{(X,Z)}\tilde{L}) > 0$ given that $L(Z, X, Y^{k+1}; \Lambda_1^k, \Lambda_2^k)$ is strongly convex with respect to $(X, Z)$.
In addition, for the update of the dual variables and the penalty coefficient, we have
\[
\begin{aligned}
L(P^{k+1}; D^{k+1}; \rho^{k+1}) - L(P^{k+1}; D^k; \rho^k) &= \langle \Lambda_1^{k+1} - \Lambda_1^k, Z^{k+1} - X^{k+1}(Y^{k+1})^T\rangle
+ \langle \Lambda_2^{k+1} - \Lambda_2^k, X^{k+1} - Y^{k+1}\rangle \\
&\quad + \frac{\rho^{k+1} - \rho^k}{2}\big(\|Z^{k+1} - X^{k+1}(Y^{k+1})^T\|_F^2 + \|X^{k+1} - Y^{k+1}\|_F^2\big) \\
&= \frac{\rho^{k+1} + \rho^k}{2(\rho^k)^2}\big(\|\Lambda_1^{k+1} - \Lambda_1^k\|_F^2 + \|\Lambda_2^{k+1} - \Lambda_2^k\|_F^2\big),
\end{aligned}
\tag{2.33}
\]
where the first equality follows the definition of $L$ and the second follows (3.13). Moreover, it can be easily
verified that
\[
\begin{aligned}
L(P^{k+1}; D^{k+1}; \rho^{k+1}) - L(P^k; D^k; \rho^k) &= L(Z^k, X^k, Y^{k+1}; D^k; \rho^k) - L(P^k; D^k; \rho^k) \\
&\quad + L(P^{k+1}; D^k; \rho^k) - L(Z^k, X^k, Y^{k+1}; D^k; \rho^k) \\
&\quad + L(P^{k+1}; D^{k+1}; \rho^{k+1}) - L(P^{k+1}; D^k; \rho^k).
\end{aligned}
\tag{2.34}
\]
By incorporating (2.32), (2.31) and (2.33), (2.34) can be rewritten as
\[
\begin{aligned}
L(P^{k+1}; D^{k+1}; \rho^{k+1}) - L(P^k; D^k; \rho^k)
&\leq -c_1^k\|Z^{k+1} - Z^k\|_F^2 - c_2^k\|X^{k+1} - X^k\|_F^2 - c_3^k\|Y^{k+1} - Y^k\|_F^2 \\
&\quad + \frac{\rho^{k+1} + \rho^k}{2(\rho^k)^2}\big(\|\Lambda_1^{k+1} - \Lambda_1^k\|_F^2 + \|\Lambda_2^{k+1} - \Lambda_2^k\|_F^2\big),
\end{aligned}
\tag{2.35}
\]
with $c_1^k = c_2^k = \frac{\lambda_{\min}(\nabla^2_{(Z,X)}\tilde{L}^k)}{2} > 0$ as given in Lemma B.2 and $c_3^k = \frac{\rho^k}{2} > 0$.
By telescoping, the summation gives
\[
\begin{aligned}
L(P^K; D^K; \rho^K) - L(P^0; D^0; \rho^0) &= \sum_{k=0}^{K-1}\big(L(P^{k+1}; D^{k+1}; \rho^{k+1}) - L(P^k; D^k; \rho^k)\big) \\
&\leq \sum_{k=0}^{K-1}\frac{\rho^{k+1} + \rho^k}{2(\rho^k)^2}\big(\|\Lambda_1^{k+1} - \Lambda_1^k\|_F^2 + \|\Lambda_2^{k+1} - \Lambda_2^k\|_F^2\big) \\
&\quad - \sum_{k=0}^{K-1} c^k\big(\|Z^{k+1} - Z^k\|_F^2 + \|X^{k+1} - X^k\|_F^2 + \|Y^{k+1} - Y^k\|_F^2\big),
\end{aligned}
\]
with $c^k = \min(c_1^k, c_2^k, c_3^k) > 0$.


For the first term,
\[
0 \leq \sum_{k=0}^{K-1}\frac{\rho^{k+1} + \rho^k}{2(\rho^k)^2}\big(\|\Lambda_1^{k+1} - \Lambda_1^k\|_F^2 + \|\Lambda_2^{k+1} - \Lambda_2^k\|_F^2\big)
\leq 4B\sum_{k=0}^{K-1}\frac{\rho^{k+1} + \rho^k}{(\rho^k)^2},
\]
where from Assumption 3.1 we have that $\{\Lambda_1^k, \Lambda_2^k\}$ is bounded; that is, there exists $B < \infty$ where
$B \geq \max\{\|\Lambda_1^k\|_F^2, \|\Lambda_2^k\|_F^2\}$ for all $k$. Since additionally $\sum\frac{\rho^{k+1} + \rho^k}{2(\rho^k)^2} \leq \sum\frac{\rho^{k+1}}{(\rho^k)^2} < +\infty$, then
\[
0 \leq \sum_{k=0}^{K-1}\frac{\rho^{k+1} + \rho^k}{2(\rho^k)^2}\big(\|\Lambda_1^{k+1} - \Lambda_1^k\|_F^2 + \|\Lambda_2^{k+1} - \Lambda_2^k\|_F^2\big) < +\infty.
\]
Since $L(P^k; D^k; \rho^k)$ is bounded, it follows that
\[
0 \leq \sum_{k=0}^{K-1} c^k\big(\|Z^{k+1} - Z^k\|_F^2 + \|X^{k+1} - X^k\|_F^2 + \|Y^{k+1} - Y^k\|_F^2\big) < +\infty.
\tag{2.36}
\]
Since additionally $\sum_k \rho^k = +\infty$, this immediately yields $Z^{k+1} - Z^k \to 0$, $X^{k+1} - X^k \to 0$, and $Y^{k+1} - Y^k \to 0$.
Moreover,
\[
\sum_{k=1}^{\infty}\frac{1}{\rho^k} < \infty \;\Longrightarrow\; \frac{1}{\rho^k} \to 0 \;\Longleftrightarrow\; \rho^k \to \infty.
\]
Combined with Assumption 3.1, this implies that
\[
\begin{aligned}
Z^{k+1} - X^{k+1}(Y^{k+1})^T &= \frac{1}{\rho^k}\big(\Lambda_1^{k+1} - \Lambda_1^k\big) \to 0, \\
X^{k+1} - Y^{k+1} &= \frac{1}{\rho^k}\big(\Lambda_2^{k+1} - \Lambda_2^k\big) \to 0.
\end{aligned}
\]
Therefore the limit points $X^*$, $Y^*$, and $Z^*$ are all feasible, and checking the first order optimality conditions
verifies that this accumulation point is a stationary point of (3.9). The proof is complete.

