
A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents

Nathan Linial∗   Alex Samorodnitsky†   Avi Wigderson‡

Abstract

We present a deterministic strongly polynomial algorithm that computes the permanent of a nonnegative n × n matrix to within a multiplicative factor of e^n. To this end we develop the first strongly polynomial-time algorithm for matrix scaling - an important nonlinear optimization problem with many applications. Our work suggests a simple new (slow) polynomial-time decision algorithm for bipartite perfect matching, conceptually different from classical approaches.


∗ Hebrew University. Work supported in part by a grant of the Binational Israel-US Science Foundation.

† Hebrew University.

‡ Hebrew University. Work partially supported by grant 032-7736 from the Israel Academy of Sciences. Part of this work was done during a visit to the Institute for Advanced Study, under the support of a Sloan Foundation grant 96-6-2.

1 Introduction

1.1 The permanent


Background: Let A = (a_ij) be an n × n matrix. The number

    per(A) = Σ_{σ ∈ S_n} Π_{i=1}^n a_{i,σ(i)},

where S_n is the symmetric group on n elements, is called the permanent of A. For a 0,1 matrix A, per(A) counts the number of perfect matchings in G, the bipartite graph represented by A. We freely interchange between a 0,1 matrix and the corresponding graph.
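For tiny matrices the defining sum can be evaluated directly; here is a minimal brute-force sketch in Python (our own illustration, exponential in n and intended only as a reference):

```python
from itertools import permutations
from math import prod

def permanent(A):
    """Brute-force permanent: sum over all permutations sigma of prod_i A[i][sigma(i)].

    Exponential time; a reference implementation for small n only."""
    n = len(A)
    return sum(prod(A[i][s[i]] for i in range(n)) for s in permutations(range(n)))

# For a 0/1 matrix this counts the perfect matchings of the bipartite graph it represents.
print(permanent([[1, 1], [1, 1]]))  # 2: both matchings of K_{2,2}
```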
It is #P-hard to compute the permanent of a nonnegative (even 0,1) matrix [31], and
so it is unlikely to be efficiently computable exactly for all matrices. In fact, even the
existence of an efficient algorithm that computes the permanent of an exponentially small
fraction (appropriately defined) of all 0, 1 matrices implies a collapse of the polynomial
hierarchy [9]. Planar graphs provide the only interesting class of matrices for which a
(beautiful) polynomial-time algorithm is known for the permanent [20].
The realistic goal, then, is to try to approximate the permanent efficiently, as well as possible, for large classes of matrices.
There are deep and interesting upper and lower bounds on the permanent in certain
classes of matrices. The most well known is Egorychev’s [7] and Falikman’s [8] resolution
of van der Waerden’s conjecture: The permanent of a doubly stochastic n × n matrix is
at least n!/n^n ≥ e^{-n}. Bregman [5] resolved the Minc conjecture and proved a tight upper
bound on the permanent of a zero-one matrix with given row sums. Both bounds are
easy to compute, and indeed were the starting point of our approach.
Efficient poly-time probabilistic algorithms that approximate the permanent extremely tightly (to within a 1+ε factor) were developed for several classes of graphs, e.g., dense graphs and random graphs [18], and others. This line of work has led to an exp(O(√n))-time probabilistic algorithm that achieves a (1+ε)-approximation for every graph [19].
How well can the permanent be approximated in polynomial time? The first result
that provides a 2^{O(n)}-factor approximation for arbitrary positive matrices was recently
obtained by Barvinok [2]. His work can be viewed as a continuation of earlier papers
where certain random variables are shown to be unbiased estimators for the permanent.
Barvinok introduced new such estimators for which he was able to establish a strong
enough concentration of measure. He went on [3], [4] to develop a probabilistic polynomial
time algorithm with a c^n-factor approximation, c ≈ 1.31, currently the record in this area.
We remark [3] that any polynomial-time 2^{n^{1−α}}-approximation, 0 < α ≤ 1, yields a
polynomial-time (1+ε)-approximation.

Our results: We achieve an O(e^n)-factor approximation deterministically. Our algorithm is strongly polynomial.

Theorem 1.1: There is a function f such that

    per(A) ≤ f(A) ≤ e^n · per(A)

holds for every nonnegative n × n matrix A. The function f is computable in Õ(n^5) elementary operations.

Our approach to this problem is completely different from the previously taken routes. It involves a natural reduction technique between problem instances: scaling. Observe the following linearity of permanents: multiplying a row or column by a constant c multiplies the permanent by c as well. More generally, we say that a matrix B is a scaling of A (by positive vectors x, y ∈ (ℝ_+)^n) if B = XAY, where X = diag(x) and Y = diag(y) are diagonal matrices with x (resp. y) on their diagonal (these being the factors that multiply the rows and the columns respectively). As observed, per(B) = (Π_i x_i)(Π_j y_j) · per(A). Thus scaling reductions not only allow us to compute per(A) from per(B), but in fact any k-factor approximation of per(B) efficiently yields the same approximation for per(A).
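As a quick numerical illustration of this identity (a sketch assuming numpy; the 2 × 2 matrix and the multipliers are chosen arbitrarily):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([0.5, 2.0])   # row multipliers
y = np.array([3.0, 0.25])  # column multipliers
B = np.diag(x) @ A @ np.diag(y)

# per() is the 2x2 permanent: per(M) = M[0,0]*M[1,1] + M[0,1]*M[1,0]
per = lambda M: M[0, 0] * M[1, 1] + M[0, 1] * M[1, 0]
assert np.isclose(per(B), x.prod() * y.prod() * per(A))
```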
The idea, then, is to scale an input matrix A into a matrix B whose permanent we can efficiently approximate. A natural strategy is to seek an efficient algorithm for scaling A to a doubly stochastic B. For suppose we succeed: the permanent of B is clearly at most 1, and per(B) ≥ n!/n^n > e^{-n} by the lower bound of [7, 8]. Consequently, per(A) is also approximated to within an e^n factor, as claimed.
Note that such a scaling may not always exist - when per(A) = 0. Moreover, even if
scaling exists, the scaling vectors x, y may have irrational coordinates, so we may have
to settle for an approximately doubly stochastic matrix. The scaling algorithm must,
therefore, be accompanied by approximate versions of the van der Waerden bound, and
indeed we prove results of the following type (see also proposition 5.1).

Lemma 1.2: Let B be a nonnegative n × n matrix in which all row sums are 1, and where no column sum exceeds 1 + 1/n². Then per(B) ≥ e^{-(n+1)}.

So we want to efficiently scale A to an almost doubly stochastic matrix. This scaling problem, which arose so naturally from our considerations, turns out to have been studied in other contexts as well. The next subsection briefly describes these contexts, as well as scaling algorithms - old and new.

1.2 Matrix Scaling
Background
Our discussion will be restricted to square matrices, though everything generalizes to
rectangular matrices (and some of it even to multidimensional arrays).
Let r, c ∈ (ℝ_+)^n be two positive vectors with Σ_i r_i = Σ_j c_j. A matrix B is an (r, c)-matrix if r and c are the vectors of row and column sums of B respectively. An (r, c)-scaling of a matrix A consists of two vectors x, y for which B = XAY is an (r, c)-matrix. (Again X = diag(x) and Y = diag(y).) Given A, r, c one is led to consider the existence of a scaling, its uniqueness, and of course the complexity of deciding and finding (or approximating) such a scaling. Note that scaling to a doubly stochastic matrix is exactly (1̄, 1̄)-scaling. Since multiplying the vectors r and c by the same constant does not change the problem, we will assume Σ_i r_i = Σ_j c_j = n (to make it consistent with the doubly stochastic case).
The (r, c)-scaling problem arose in such diverse contexts as statistics [30, 10], numerical analysis [32], operations research [28], image reconstruction [17], engineering [6],
and as we were not surprised to discover, permanents [12]. Here are a couple of quick
examples: To solve the linear system of equations Av = u, we consider instead the scaled
system B(Y v) = X −1 u. A judicious choice of scaling can increase numerical stability,
and indeed this method is used for matrix preconditioning [32]. In computer tomography
and some statistical applications we wish to reconstruct a (sometimes multidimensional)
array of new data, from complete old data and some additional information. Different
dimensions represent essential (spatial or statistical) parameters. The assumption is that
data was modified as a result of a (multiplicative) change in “intensity” or “importance”
of the parameters, leading naturally to the scaling problem. Finally, observe also that
scaling does not affect the zero/nonzero pattern of the matrix, and this (along with linear constraints on the marginals and positivity requirements) is a natural constraint in
various optimization problems.
The mathematical literature abounds with equivalent formulations of the problem,
connections with other problems, and various necessary and sufficient conditions for
existence and uniqueness of an (r, c)-scaling (see e.g. [29] and the references therein).
Again our discussion is very brief: The max-flow-min-cut theorem provides a necessary
and sufficient condition for the existence of scaling. The scaling vectors are then the
(unique) optimal solution of a nonlinear program which can be made convex (but ugly)
after appropriate changes of variables.
Finally, the algorithmics: Here the literature seems quite sparse. We note that (r, c)-
scalability can be decided in polynomial time, using network flows. Thus we will assume
that the input matrix A is (r, c)-scalable, and the task is to find (or approximate to a
desired accuracy) the target matrix and the scaling vectors.
The original paper by Sinkhorn [30] already introduced an iterative algorithm for

this task¹. The idea is simple – apply a sequence of scalings to the given matrix A,
alternately normalizing the row sums to r (say in odd steps) and the column sums to c
(in even steps). Let At be the matrix generated after t iterations. Sinkhorn proved that
this sequence converges to an (r, c)-matrix, but gave no bound on convergence rates. It
should be clear that each iteration is computationally trivial, and that the final scaling
vectors are obtained by keeping a running product of the scaling vectors used at odd and
even steps.
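A minimal sketch of this alternating (Sinkhorn/RAS) procedure, assuming numpy and a scalable input; the function and variable names are ours:

```python
import numpy as np

def sinkhorn(A, r, c, iters=1000):
    """Alternately scale rows to sums r (odd steps) and columns to sums c (even steps).

    Returns the scaled matrix together with the accumulated row/column multipliers x, y,
    so that the result equals diag(x) @ A @ diag(y)."""
    A = np.asarray(A, dtype=float)
    x = np.ones(A.shape[0])
    y = np.ones(A.shape[1])
    for _ in range(iters):
        rx = r / A.sum(axis=1)       # row multipliers for this odd step
        A = A * rx[:, None]
        x *= rx
        cy = c / A.sum(axis=0)       # column multipliers for this even step
        A = A * cy[None, :]
        y *= cy
    return A, x, y
```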
To quantify convergence rates, let ε be the desired accuracy (say in L∞), and L(A) the binary input length (i.e., the log of the ratio of the largest to smallest nonzero entries) of A.
The convergence rate of this procedure was first considered by Franklin and Lorenz
[11]. They showed that each step in Sinkhorn’s method is a contraction map in the
Hilbert projective metric. By estimating the contraction constant in terms of L(A), they
concluded that the number of iterations is bounded by

    O(L(A) · 1/ε).

This shows that Sinkhorn’s is an approximation scheme, but of course not polynomial in log(1/ε).
Moreover, simple examples show that the dependence on the input length cannot be
improved. This is the best existing bound for the general (r, c)-scaling problem.
For the (1, 1)-scaling problem, Kalantari and Khachiyan [21] have recently developed an approximation scheme based on convex optimization via the ellipsoid method. Their algorithm requires

    O(L(A) n⁴ log(1/ε))

arithmetic operations on integers of length O(L(A) log(n/ε)). This is thus a polynomial-time approximation scheme, which is, however, not strongly polynomial.
This still leaves open the polynomiality of the general (r, c)-problem, and the quest
for a strongly polynomial scheme, both of which we accomplish in the present paper.
Our results: We develop two strongly polynomial algorithms. The first is restricted to
the (1, 1)-scaling problem, and the other works for the general (r, c)-scaling problem.
For (1, 1)-scaling our approach is this: We start with a certain preprocessing step that is followed by Sinkhorn’s procedure. This, along with a different convergence analysis, yields a strongly polynomial scheme. Our preprocessing is itself also a scaling step
that is attained through a weighted matching algorithm. Following this step, there is
a generalized diagonal of A each element of which is largest in its respective row. This
guarantees a lower bound of n^{-n} on the permanent. The analysis then proves (and this is simple) that every step in Sinkhorn’s procedure increases the permanent. Moreover, this increase can be quantified in terms of the distance of the matrix from being doubly stochastic. This allows us to bound the number of iterations till an ε-approximation is attained by

    O((n/ε)²).

Note that the number of iterations does not involve L(A). (Naturally, the arithmetic operations are applied to integers of length L(A)^{O(1)}.) However, it is only polynomial in 1/ε, so this is not a fully polynomial approximation scheme. Still, for the purpose of approximating the permanent ε = n^{-2} suffices, and Theorem 1.1 (with a total running time of O(n^6)) follows. These results are described in Section 3.

¹ Sinkhorn suggested his procedure (sometimes called the RAS procedure) only for the (1, 1)-scaling problem he was interested in, but it was naturally extended to the general problem.
For the general (r, c)-scaling problem, we develop a new algorithm. Like Sinkhorn’s it
is iterative, and every odd step normalizes the row sums to r. In the even steps, however,
a more elaborate scaling procedure is performed on the columns. This scheme allows us
to quantify the progress we make: After odd steps, when row sums are as designated,
the Euclidean distance of the vector of current columns sums to its target c shrinks by
a constant factor. This comprises a fully polynomial approximation scheme that is also
strongly polynomial, with the number of iterations bounded by

    Õ(n⁷ log(1/ε)).²

These results are described in Section 4.

2 Preliminaries
We start with a general overview of the setting in which our algorithms work and of the features they share. We also discuss complexity matters and introduce the necessary notation.
First we describe the setting in which our algorithms are applicable. Let A be a
nonnegative n × n matrix and let r = (r1 , ..., rn ), c = (c1 , ..., cn ) be two positive vectors.
We say that A is an (r, c) matrix if the i-th row sum of A is ri and the j-th column sum
is cj .
Next we would like to introduce the notion of (r, c)-scalability. It turns out that there
are several pertinent definitions, which we now present.
The matrix A is said to be exactly (r, c)-scalable, if there exist positive finite n-tuples
(x1 , ..., xn ), (y1 , ..., yn ) such that B = (bij ) = (xi aij yj ) is an (r, c) matrix.
² Leonid Gurvits [15] pointed out to us that algorithms of a similar spirit are presented, without performance analysis, in [27].

We are also interested in a weaker notion of approximate scalability. Given an ε > 0, we say that B is an ε-(r, c) matrix if B’s row sums are given by the vector r, whereas B’s column sums c′_j satisfy³:

    Σ_{j=1}^n (c′_j − c_j)² ≤ ε.    (1)

We say that A is ε-(r, c)-scalable if there exist positive finite n-tuples (x_1, ..., x_n), (y_1, ..., y_n) such that B = (b_ij) = (x_i a_ij y_j) is an ε-(r, c) matrix.
Since our official business (in the doubly stochastic case) is approximating the permanent, we will be satisfied if we can scale A arbitrarily close to an (r, c)-matrix. This leads to the notion of an almost (r, c)-scalable matrix, which turns out to be precisely the notion we need, and therefore deserves a formal definition:

Definition 2.1: A nonnegative n × n matrix is almost (r, c)-scalable if it is ε-(r, c)-scalable for every ε > 0.

Exact scalability is characterized in the following proposition. A sufficient condition for almost (r, c)-scalability is provided as well.

Proposition 2.2: A nonnegative matrix A is exactly (r, c)-scalable iff for every zero minor Z × L of A:

1. Σ_{i ∈ Z^c} r_i ≥ Σ_{j ∈ L} c_j, or equivalently, Σ_{i ∈ Z} r_i ≤ Σ_{j ∈ L^c} c_j.

2. Equality in 1 holds iff the Z^c × L^c minor is all zero as well.

A nonnegative matrix A is almost (r, c)-scalable if condition 1 holds.

Proof: This characterization of exact scalability is well known [29]. To see that the first condition is already sufficient for almost (r, c)-scalability, note that the (r, c)-scaling algorithm presented in Corollary 4.5 accepts as input a matrix satisfying condition 1 of the proposition and an ε > 0, and proceeds to (efficiently) scale the matrix to an ε-(r, c) matrix.
We observe that almost (r, c)-scalability of A can be efficiently checked.
³ Note that it is always possible to normalize the row sums to the desired r_i. (Or, alternatively, normalize the column sums. Doing both is the challenge.) Thus, henceforth all our matrices are row-normalized.

Lemma 2.3: The question whether a given matrix A satisfies condition 1 of proposi-
tion 2.2 can be decided in polynomial time (via a max-flow formulation [14]).

Proof: For completeness’ sake, here is the translation to a flow problem:


Let F be a graph with 2n + 2 vertices, of which two are special: the source and the sink.
Corresponding to the rows of A are n vertices and the remaining n vertices correspond
to the columns. The source is adjacent to every row vertex of F and the n edges have
capacities r1 , ..., rn . In the same way, every column vertex of F is adjacent to the sink
with edge capacities c1 , ..., cn . There is an edge from the i-th row vertex to the j-th
column vertex iff A_ij > 0, and such edges have infinite capacities. The max-flow min-cut theorem easily implies that the maximal flow in F equals Σ_i r_i = Σ_j c_j iff condition 1 holds.
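A sketch of this feasibility check, assuming the networkx library; the graph construction follows the proof above, and the function name is ours:

```python
import math
import networkx as nx

def is_almost_rc_scalable(A, r, c):
    """Check condition 1 of Proposition 2.2 via max-flow (a sketch)."""
    n = len(A)
    G = nx.DiGraph()
    for i in range(n):
        G.add_edge("s", ("row", i), capacity=r[i])   # source -> row i, capacity r_i
        G.add_edge(("col", i), "t", capacity=c[i])   # column i -> sink, capacity c_i
        for j in range(n):
            if A[i][j] > 0:
                G.add_edge(("row", i), ("col", j))   # no capacity attribute = unbounded edge
    flow = nx.maximum_flow_value(G, "s", "t")
    return math.isclose(flow, sum(r))                # feasible iff max flow = sum r_i = sum c_j
```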

We proceed to describe two scaling algorithms. We call our (1, 1) scaling algorithm
DSS - for Doubly Stochastic Scaling, and the general (r, c) scaling algorithm is RCS -
for (r, c)-Scaling.
Given a matrix A to be scaled, the two algorithms operate by producing a sequence of matrices A = A_0, A_1, ..., A_N, where A_{t+1} = f(A_t), and the last matrix A_N is nearly (r, c).⁴ The function f is specific to the algorithm and is denoted by f_DSS and f_RCS correspondingly. The matrices A_t are always row-normalized, i.e., the i-th row sum of A_t is r_i, for all t and i. An iteration of the algorithm proceeds from A_t to A_{t+1}, and therefore N is the number of iterations required for the approximate scaling. To quantify the notion of “proximity”, we define N(ε) to be the number of iterations required for ε-scaling, namely so that A_N is an ε-(r, c) matrix.
Let I denote the number of arithmetic operations needed for one iteration of the algorithm. Then the total complexity of the ε-scaling problem is N_DSS(ε) · I_DSS in the doubly stochastic case, and N_RCS(ε) · I_RCS in the general case. We also let L(ε) denote the maximal length of numbers encountered in the ε-scaling process. (Recall that L(A) denotes the binary input length of A.)

3 Doubly stochastic scaling algorithm


In this section we present a modification of Sinkhorn’s matrix scaling algorithm which, given a (1, 1)-scalable nonnegative n × n matrix, converges rapidly to an almost doubly stochastic matrix. (“Almost” is quantified later on, but the idea is that an approximate version of the van der Waerden conjecture holds.) After preprocessing the input matrix, this algorithm performs the Sinkhorn algorithm, namely repeatedly normalizes the columns and the rows.
⁴ Henceforth, in this section, we refer only to (r, c)-scaling, which includes doubly stochastic scaling as a special case.

3.1 The DSS Algorithm
DSS(A)
Input: (1, 1)-scalable matrix A.
1. A_1 = Preprocess(A).
2. For t = 1, 2, ..., N = O(n² log n) do A_{t+1} = f_DSS(A_t).
3. Output A_N.

Preprocess(A)
Find a permutation σ ∈ S_n such that σ is the “heaviest” generalized diagonal of A, i.e.,

    Π_{i=1}^n A_{iσ(i)} = max_{τ ∈ S_n} Π_{i=1}^n A_{iτ(i)}.

Find a positive diagonal matrix Y such that the matrix B = AY satisfies:

    ∀ 1 ≤ i, j ≤ n:  B_{iσ(i)} ≥ B_{ij}.    (2)

Normalize the rows of B.
Output A_1.

f_DSS(·)
Input: Nonnegative matrix A_t with unit row sums and positive column sums.
1. Normalize(columns).
2. Normalize(rows).
3. Output A_{t+1}.
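A compact sketch of the DSS loop after preprocessing, assuming numpy (the preprocessing step itself is the subject of Theorem 3.1 and is omitted here):

```python
import numpy as np

def dss_iterations(A1, num_iters):
    """Repeated f_DSS steps: normalize columns, then rows.

    A1 is assumed to be row-normalized, e.g. the output of the preprocessing step."""
    A = np.asarray(A1, dtype=float)
    for _ in range(num_iters):
        A = A / A.sum(axis=0, keepdims=True)   # Normalize(columns)
        A = A / A.sum(axis=1, keepdims=True)   # Normalize(rows)
    return A
```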

3.2 Theorems
Theorem 3.1: (The preprocessing step)
Let A be a nonnegative matrix in which σ is a “heaviest” generalized diagonal, i.e.,

    Π_{i=1}^n A_{iσ(i)} ≥ Π_{i=1}^n A_{iτ(i)}

for every permutation τ. Then there exists a positive diagonal matrix Y such that B = AY satisfies:

    ∀ 1 ≤ i, j ≤ n:  B_{iσ(i)} ≥ B_{ij}.    (3)

The diagonal σ and the matrix Y can be found in O(n⁵ log n) arithmetic operations.

Theorem 3.2: (Number of iterations)
The number N_DSS(1/(n log n)) of DSS iterations required for 1/(n log n)-scaling is at most O(n² log n).

Theorem 3.3: (Iteration cost)
The number of arithmetic operations per iteration of DSS is at most O(n²).

Theorem 3.4: (Number size)
The maximal length L_DSS(1/(n log n)) of numbers that appear in the 1/(n log n)-scaling algorithm is at most poly(L(A), n).

Corollary 3.5: Given a (1, 1)-scalable matrix A, the modified Sinkhorn algorithm solves the 1/(n log n)-scaling problem for A in at most O(n⁵ log n) elementary operations on numbers of length at most poly(L(A), n).

Remark 3.6: Below (Proposition 5.1) we show that if A is 1/(n log n)-doubly stochastic then per(A) ≥ (1/e)^{n+o(n)}. Therefore Corollary 3.5 enables us to use a version of the van der Waerden conjecture.

Remark 3.7: The number of iterations of the unmodified Sinkhorn procedure required to solve the 1/(n log n)-scaling problem may be arbitrarily large. This is demonstrated in the following example:

Example 3.8: Let A be the 3 × 3 matrix

    A = ( 1/2    0     1/2   )
        (  α     α    1 − 2α )
        (  α     α    1 − 2α )

with (arbitrarily) small positive α. This matrix satisfies the conditions of Proposition 2.2, so Sinkhorn’s algorithm converges to a doubly stochastic matrix. However, note that per(A) = O(α) and that each application of DSS increases the permanent by a factor no bigger than 5. Consequently, it takes at least Ω(log(1/α)) steps for the matrix A to become close enough to doubly stochastic.

3.3 Proofs
Proof: Of Theorem 3.1
Given A, a “heaviest” generalized diagonal σ corresponds to a maximum-weight perfect matching in a bipartite graph, so σ may be found in at most O(n³) arithmetic operations [1, 13]. It is convenient (and kosher) to assume henceforth that σ is the identity permutation.
We turn to the main part of the theorem, namely to show that the matrix Y exists
and can be found in polynomial time.
We start with the existential part. Define the matrix Â via Â_ij = log(A_ij), 1 ≤ i, j ≤ n (with the convention log(0) = −∞), and restate the theorem thus: Let Â be an n × n matrix where Σ_{i=1}^n Â_ii ≥ Σ_{i=1}^n Â_{iτ(i)} for every permutation τ. Then there exist numbers μ_1, ..., μ_n such that

    ∀ 1 ≤ i, j ≤ n:  Â_ii + μ_i ≥ Â_ij + μ_j,

i.e.,

    μ_i − μ_j ≥ Â_ij − Â_ii.    (4)

Note, that we are only concerned with indices i, j for which Âij > −∞, since in the
other case the inequalities obviously hold. By LP-duality, if this system of inequalities is
inconsistent, this may be established by linearly combining these inequalities. So, assume
that there exist nonnegative weights wij , so that

1. Σ w_ij (μ_i − μ_j) = 0, i.e., the nonconstant terms are eliminated, and

2. Σ w_ij (Â_ij − Â_ii) > 0, i.e., the combination of constant terms is positive.

The first condition implies that the map which assigns value w_ij to the edge (i, j) is a circulation in K_n. By the Circulation Theorem, w is a positive combination of the characteristic functions of directed cycles: w = Σ_{γ∈Γ} λ_γ 1_γ, where each γ ∈ Γ is a directed cycle and the λ_γ are positive. But then, the second condition implies that for some γ ∈ Γ:

    Σ_{i∈γ} Â_{iγ(i)} > Σ_{i∈γ} Â_ii.

Define a permutation τ ∈ S_n by setting τ(i) = γ(i) for i ∈ γ, and τ(i) = i otherwise. Clearly, Σ_{i=1}^n Â_{iτ(i)} > Σ_{i=1}^n Â_ii, contrary to our assumptions.
We return to the computational part of the theorem. Note that (4) is a system of m = n² linear inequalities with only two variables per inequality. Such a system is solvable [26] by a strongly polynomial algorithm in O(mn³ log n) = O(n⁵ log n) arithmetic operations.
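The following sketch, assuming numpy and scipy, illustrates the whole preprocessing step: the heaviest diagonal is found with a linear assignment solver, and the difference constraints (4) are solved by Bellman-Ford relaxation, which we use here in place of the two-variable-per-inequality algorithm of [26]:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def preprocess(A):
    """Sketch of the preprocessing step of Theorem 3.1 (assumes per(A) > 0).

    Finds a heaviest generalized diagonal, permutes columns so it becomes the main
    diagonal (this leaves the permanent unchanged), solves the constraints
    mu_i - mu_j >= log A_ij - log A_ii by Bellman-Ford relaxation, scales the columns
    by y_j = exp(mu_j), and finally row-normalizes."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    big = 1e18                                           # stands in for forbidden zero entries
    cost = np.where(A > 0, -np.log(np.where(A > 0, A, 1.0)), big)
    _, sigma = linear_sum_assignment(cost)               # max-weight perfect matching
    A = A[:, sigma]                                      # heaviest diagonal -> main diagonal

    logA = np.where(A > 0, np.log(np.where(A > 0, A, 1.0)), -np.inf)
    mu = np.zeros(n)
    for _ in range(n - 1):                               # Bellman-Ford over the constraints
        for i in range(n):
            for j in range(n):
                if A[i, j] > 0:
                    mu[j] = min(mu[j], mu[i] + logA[i, i] - logA[i, j])
    B = A * np.exp(mu)[None, :]                          # column scaling by Y = diag(exp(mu))
    return B / B.sum(axis=1, keepdims=True)              # row normalization gives A_1
```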

Proof: Of Theorem 3.2
Theorem 3.1 shows how to perform the preprocessing step in the modified Sinkhorn algorithm. Our gain is that following this preprocessing, the permanent of the resulting matrix A_1 is ≥ 1/n^n, since all the elements on the σ-diagonal of A_1 are ≥ 1/n.
Now, recall that A_t is the matrix generated from A_1 after t − 1 iterations of DSS, A_{t+1} = f_DSS(A_t). Our goal is to estimate the convergence rate of A_t to the set of doubly stochastic matrices. We measure our progress with the help of a potential function, namely per(A_t).

Lemma 3.9: Let B be a nonnegative matrix with unit row sums. Then

per(DSS(B)) ≥ per(B)

and the inequality is strict unless B is doubly stochastic.

Proof: Let C be obtained by normalizing B’s columns, and D = DSS(B) be the row normalization of C. We claim per(D) ≥ per(C) ≥ per(B), and the inequalities are strict unless B = C = D are doubly stochastic.
Let c_j be the j-th column sum of B. All c_j are positive, and Σ_{j=1}^n c_j = n. Clearly, per(C) = per(B)/Π_{j=1}^n c_j. The arithmetic-geometric mean inequality applied to {c_j}_{j=1}^n yields Π_{j=1}^n c_j ≤ 1, with equality only if all the c_j are 1. The same argument proves the second claim as well, switching rows and columns.
We have seen, in fact, that each iteration of DSS multiplies the permanent by at least 1/Π_{j=1}^n c_j. The convergence rate can be estimated through the following lemma:

Lemma 3.10: Let x_1, ..., x_n be positive reals with Σ_{i=1}^n x_i = n and Σ_{i=1}^n (x_i − 1)² = Δ. Then

    Π_{i=1}^n x_i ≤ 1 − Δ/2 + O(Δ^{3/2}).

Proof: Let x_i − 1 = ρ_i. Then Σ_{i=1}^n ρ_i = 0 and Σ_{i=1}^n ρ_i² = Δ. It is easily verified that 1 + t ≤ e^{t − t²/2 + t³/3} for all real t, therefore:

    Π_{i=1}^n x_i = Π_{i=1}^n (1 + ρ_i) ≤ exp(Σ_{i=1}^n ρ_i − (1/2) Σ_{i=1}^n ρ_i² + (1/3) Σ_{i=1}^n ρ_i³)
                  ≤ exp(−Δ/2 + (1/3)(Σ_{i=1}^n ρ_i²)^{3/2}) = exp(−Δ/2 + Δ^{3/2}/3) = 1 − Δ/2 + O(Δ^{3/2}).

Now we have to show that N_DSS(1/(n log n)) ≤ O(n² log² n). Indeed, as long as condition (1), with ε = 1/(n log n), does not hold, each iteration of DSS multiplies the permanent by at least 1 + Ω(1/(n log n)). The bound on N follows, since per(A_1) ≥ 1/n^n and, for every t, per(A_t) ≤ Π_{i=1}^n r_i = 1.

Proof: Of Theorem 3.3


The statement of the theorem follows immediately from the definition of the DSS-
iteration.
Proof: Of Theorem 3.4
Let f_DSS act on A = (a_ij) and produce A′ = (a′_ij). Since all the entries in A, A′ are numbers between zero and one, we only have to worry about an entry in A′ suddenly becoming very small, thus requiring a long representation. However, every iteration of DSS decreases an entry by a multiplicative factor of at most n², and therefore increases the input length by at most 2 log n.
It remains to deal with the preprocessing step. To perform this step, we have to solve a special system (4) of n² linear inequalities. As already mentioned, we employ here the strongly polynomial algorithm from [26].

4 Strongly polynomial scaling algorithm


In this section we present a strongly polynomial algorithm which, given an (r, c)-scalable nonnegative n × n matrix A and an ε > 0, solves the ε-scaling problem for A.

4.1 The RCS Algorithm


RCS(A)
Input: Nonnegative vectors r and c; an almost (r, c)-scalable matrix A = A_1; and an ε > 0.
1. For t = 1, 2, ..., N = O(n⁵ log(n/ε)) do A_{t+1} = f_RCS(A_t).
2. Output A_N.

f_RCS(·)
Input: (r, c)-scalable matrix A_t with row sums r_i and positive column sums c′_j.
1. Sort the differences d_j = c′_j − c_j. (We assume, for convenience, that they are already ordered: d_1 ≤ d_2 ≤ ... ≤ d_n.)
2. If Σ_{j=1}^n d_j² ≤ ε, stop. (Remark: in this case we are done.)
3. Otherwise, let j_0 be the index where the largest gap occurs between consecutive d_j, and set G := d_{j_0+1} − d_{j_0} = max_j {d_{j+1} − d_j}.
4. Find δ_0, the smallest positive δ such that |a_μν(δ) − a_μν| = G/(8n) for some indices μ and ν. Here A_t(δ) = (a_ij(δ)) is the matrix obtained from A_t by multiplying each lacunary column j ∈ L := {1, ..., j_0} by a factor 1 + δ, and then renormalizing the rows.
5. Output A_{t+1} = A_t(δ_0).
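A sketch of one f_RCS iteration, assuming numpy; δ_0 is computed from the closed-form entry update a_il(δ) derived in the proof of Theorem 4.1 below (formula (7)), by solving the resulting linear equation for each entry and taking the minimum. Names and conventions are ours; columns (and the target vector c) are returned in sorted order, matching step 1:

```python
import numpy as np

def f_rcs(A, r, c, eps):
    """One iteration of f_RCS (a sketch). A is assumed row-normalized to row sums r."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    order = np.argsort(A.sum(axis=0) - c)      # step 1: sort columns by d_j = c'_j - c_j
    A, c = A[:, order], np.asarray(c, dtype=float)[order]
    d = A.sum(axis=0) - c
    if np.sum(d ** 2) <= eps:                  # step 2: already an eps-(r, c) matrix
        return A, c, True
    gaps = d[1:] - d[:-1]
    j0 = int(np.argmax(gaps))                  # step 3: largest gap between consecutive d_j
    G = gaps[j0]
    lac = np.arange(n) <= j0                   # lacunary columns L = {1, ..., j0}
    h = G / (8 * n)                            # target entry movement G/(8n)
    w = A[:, lac].sum(axis=1)                  # w_i: lacunary contribution to row i

    # Step 4: smallest delta with |a_ij(delta) - a_ij| = h for some entry, using
    # a_ij(delta) = (1+delta) a_ij r_i / (r_i + delta w_i) for j in L, and
    # a_ij(delta) = a_ij r_i / (r_i + delta w_i) otherwise.
    delta0 = np.inf                            # finite for scalable inputs (Theorem 4.1)
    for i in range(n):
        for j in range(n):
            a = A[i, j]
            if a == 0:
                continue
            den = a * (r[i] - w[i]) - h * w[i] if lac[j] else w[i] * (a - h)
            if den > 0:
                delta0 = min(delta0, h * r[i] / den)

    B = A * np.where(lac, 1 + delta0, 1.0)[None, :]    # scale lacunary columns by 1 + delta0
    B = B * (np.asarray(r) / B.sum(axis=1))[:, None]   # renormalize rows to r
    return B, c, False                                 # step 5
```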

4.2 Theorems
Theorem 4.1: (Existence of δ_0)
An RCS iteration can always be performed, i.e., the positive real δ_0 in step 4 of the iteration is well defined.

Theorem 4.2: (Number of iterations)
The number of RCS iterations required for ε-scaling is at most N_RCS(ε) ≤ O(n⁵ log(n/ε)).

Theorem 4.3: (Iteration cost)
Each iteration of RCS involves at most O(n² log n) elementary operations.

Theorem 4.4: (Number size)
The maximal length L_RCS(ε) of numbers that appear in the ε-scaling algorithm is at most poly(L(A), n, log(1/ε)).

Corollary 4.5: Given an (r, c)-scalable matrix A, the RCS algorithm solves the ε-scaling problem for A in at most O(n⁷ log(n) log(n/ε)) elementary operations on numbers of length at most poly(L(A), n, log(1/ε)).

4.3 Proofs
Proof: Of Theorem 4.1
For notational convenience we prove the existence and the uniqueness of δ_0 for the first iteration. In this case, A_t is simply A. Let δ be a positive real number. The expression for the (i, l) entry of A(δ) is:

    a_il(δ) = (1 + δ) · a_il r_i / (r_i + δ w_i)   if l ∈ L,
    a_il(δ) =           a_il r_i / (r_i + δ w_i)   if l ∉ L,

where w_i = Σ_{j∈L} a_ij is the contribution of the lacunary columns to the i-th row sum in A. Since all w_i ≤ r_i, the function a_ij(δ) is nondecreasing for j ∈ L, and nonincreasing for j ∉ L. Therefore the quantity d_j(δ) = c′_j(δ) − c_j is nondecreasing for j ∈ L, and nonincreasing for j ∉ L. We show that, as δ grows, at least one of these quantities reaches the midpoint M of the largest gap G.

Lemma 4.6: Let A be a nonnegative (r, c)-scalable n × n matrix with row sums r_i and column sums c′_j. Then there are j ∈ L and k ∉ L and a (possibly infinite) δ > 0 for which d_j(δ) = d_k(δ). Consequently, there is an index l and a δ > 0 for which d_l(δ) = M.

Proof: Let Z be the set of rows i for which w_i = 0. Note that a_ij = 0 for i ∈ Z and j ∈ L, so for l ∈ L, c′_l(δ) = (1 + δ) · Σ_{i∈Z^c} a_il r_i/(r_i + δ w_i). Therefore, for l ∈ L, as δ → ∞,

    d_l(δ) → Σ_{i∈Z^c} a_il r_i / w_i − c_l.

Summing these expressions over all l ∈ L we conclude:

    Σ_{l∈L} d_l(δ) → Σ_{l∈L} (Σ_{i∈Z^c} a_il r_i / w_i − c_l) = Σ_{i∈Z^c} r_i − Σ_{l∈L} c_l ≥ 0,

the inequality being from Proposition 2.2. On the other hand, for j ∉ L, as δ → ∞,

    d_j(δ) → Σ_{i∈Z} a_ij − c_j,

whence

    Σ_{j∉L} d_j(δ) → Σ_{j∉L} Σ_{i∈Z} a_ij − Σ_{j∉L} c_j = Σ_{i∈Z} r_i − Σ_{j∉L} c_j ≤ 0,

again by Proposition 2.2.
Therefore two indices j ∈ L, k ∉ L exist for which lim_{δ→∞} (d_j(δ) − d_k(δ)) ≥ 0.
Recall that M is the midpoint of the interval [d_{j_0}, d_{j_0+1}]. The lemma tells us that there is a column l and a possibly infinite positive δ such that d_l(δ) = M, implying |d_l(δ) − d_l| ≥ G/2. It follows that |d_l(δ) − d_l| ≥ G/4 for some positive finite δ. Therefore there is a row i for which |a_il(δ) − a_il| ≥ G/(4n). This of course implies the existence of δ_0, the smallest positive real for which |a_μν(δ) − a_μν| = G/(8n) for some indices μ and ν. (Note that we settle for a step, G/(8n), that is smaller than what can be attained, namely G/(4n). This choice is intended to control the length of the binary representation of the matrix A(δ) - see Theorem 4.4.) Note also that δ_0 is uniquely specified in view of the fact that the functions a_ij(δ) are monotone and continuous.
Proof: Of Theorem 4.2
In order to find out how many iterations of RCS are required for the solution of the ε-scaling problem, we have to estimate the rate of convergence of the sequence {A_t} to
the set of (r, c)-matrices. We assess our progress through the decrease of the ℓ₂ norm ‖d‖₂ = (Σ_{j=1}^n d_j²)^{1/2}. We now show that our choice of δ_0 in the fourth step of the RCS iteration suits our needs. Once again we assume, for convenience, that we deal with the first iteration.

Lemma 4.7:
    ‖d(δ_0)‖₂² ≤ ‖d‖₂² · (1 − Ω(n⁻⁵)).

Proof: First we show that

    ‖d(δ_0)‖₂² ≤ ‖d‖₂² − G²/(64n²).

Let j = (1, ..., 1). Since both inner products ⟨d(δ_0), j⟩ and ⟨d, j⟩ are 0, it follows that

    ‖d‖₂² − ‖d(δ_0)‖₂² = ‖d − M·j‖₂² − ‖d(δ_0) − M·j‖₂²,    (5)

and so we may consider the change in ‖d − M·j‖₂. The definition of δ_0 implies, for all i, that a_ij ≤ a_ij(δ_0) ≤ a_ij + G/(8n) for all j ∈ L, and a_ik ≥ a_ik(δ_0) ≥ a_ik − G/(8n) for all k ∉ L. Therefore, for all j ∈ L, k ∉ L, d_j ≤ d_j(δ_0) ≤ M ≤ d_k(δ_0) ≤ d_k, implying that |d_j(δ_0) − M| ≤ |d_j − M| for all 1 ≤ j ≤ n. Our gain comes from the ν-th coordinate, and is at least G²/(64n²).
To conclude, we need to compare G with ‖d‖₂². Recall that Σ_j d_j = 0 and |d_j − d_{j−1}| ≤ G for all j. The maximum of ‖d‖₂² under these conditions is attained when d_j − d_{j−1} = G for all j, in which case ‖d‖₂² = Θ(n³G²). In other words, G ≥ Ω(‖d‖₂ · n^{−3/2}). This, together with (5), implies the claim of the lemma.
It immediately follows that after N = O(n⁵ log(n/ε)) iterations we have ‖d‖₂² ≤ ε, and therefore A_N is an ε-(r, c) matrix.
Proof: Of Theorem 4.3
To perform the t-th iteration of RCS we have to find δ_0 and then compute the matrix A_t(δ_0). Given δ_0, the computation of A_t(δ_0) requires O(n²) elementary operations. To find δ_0 we define δ_ij for each pair of indices i, j via |a_ij(δ_ij) − a_ij| = G/(8n), which is a linear equation in δ_ij, and δ_0 = min_ij δ_ij. Consequently, this involves only O(n² log n) elementary operations.
Proof: Of Theorem 4.4
Let an RCS iteration act on A_t = (a_ij) and produce A_{t+1} = (a′_ij). Since all the entries in A_t, A_{t+1} are numbers between zero and one, we only have to worry about an entry in A_{t+1} suddenly becoming very small, thus requiring a long representation.
This is probably the right moment to specify that the entries of A_t are represented in floating point. Namely, a_ij is represented as (b_ij, α_ij), where an integer b_ij and 1/2 < α_ij ≤ 1 are uniquely determined via a_ij = 2^{−b_ij} · α_ij.
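This is essentially the representation produced by the standard frexp primitive; a small illustration in Python (math.frexp returns a mantissa in [1/2, 1), which differs from the convention above only at exact powers of two):

```python
import math

a = 0.15625
alpha, e = math.frexp(a)        # a = alpha * 2**e with 0.5 <= alpha < 1
b = -e                          # so a = 2**(-b) * alpha, the (b_ij, alpha_ij) encoding above
print(b, alpha)                 # -> 2 0.625
assert a == alpha * 2.0 ** (-b)
```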
Our first observation is that for our purposes it suffices to remember only the first few bits of α_ij. This is the content of the following lemma.

Lemma 4.8: Truncating all but the most significant 10 log(n/ε) bits in every α_ij changes ‖d‖₂² = Σ_{j=1}^n (c′_j − c_j)² by an additive term of at most O(ε^{10}/n⁹).

Proof: Since all entries are bounded above by 1, the truncation affects each entry by an additive term of at most ε^{10}/n^{10}. Therefore each c′_j is changed by at most ε^{10}/n⁹ and the second claim of the lemma trivially follows.

Corollary 4.9: The “truncated” RCS algorithm terminates in at most O(n⁵ log(n/ε)) iterations.

Therefore, the representation length required will be polynomial in log(b_ij), log n and log(1/ε). It therefore suffices to control the growth of B = max_ij b_ij; namely, to prove the theorem we have to verify that log(B) < poly(L(A), n).
Let χ(A) be the smallest non-zero entry of A. We show:

    χ(A′) > Ω(χ²(A)/n).    (6)

Consequently, after t iterations, B ≤ O(2^t · (2L(A) + t log n)). Since RCS is iterated only poly(n) times, this suffices for our purposes.
Recall that we know exactly how the entries change:

    a′_il = (1 + δ) · a_il r_i / (r_i + δ w_i)   if l ∈ L,
    a′_il =           a_il r_i / (r_i + δ w_i)   if l ∉ L,    (7)

where δ is the smallest positive real for which

    |a′_μν − a_μν| = G/(8n)    (8)

for some indices μ and ν. It is only for l ∉ L that a_il decreases, so we have to bound a′_il for l ∉ L from below. By Lemma 4.6 there exist indices s, t for which

    |a′_st − a_st| = G/(4n).    (9)

We will assume that the equality in (9) is obtained for t ∈ L (the other case is quite similar). It is convenient at this point to introduce the following notation: for any pair of indices i and j and for any positive Δ, let us denote by δ_ij(Δ) the minimal positive x such that |a_ij(x) − a_ij| = Δ, if such an x exists.
By (9) and the explicit formula (7) we conclude:

    δ_st(G/(4n)) = G r_s / (4n a_st (r_s − w_s) − G w_s) < +∞.

In particular, the denominator ρ = 4n a_st (r_s − w_s) − G w_s is strictly positive, i.e., 4n a_st (r_s − w_s) > G w_s. This implies:

    δ = δ_μν(G/(8n)) ≤ δ_st(G/(8n)) = G r_s / (8n a_st (r_s − w_s) − G w_s) ≤ G r_s / (G w_s) = r_s / w_s.

Therefore, for any 1 ≤ i ≤ n and l ∉ L,

    a′_il = a_il r_i / (r_i + δ w_i) ≥ Ω(a_il / δ) ≥ Ω(a_il w_s / r_s) ≥ Ω(a_il a_st / r_s).

Since r_s ≤ n, we obtain a′_il ≥ Ω(χ²(A)/n), proving (6), and we are done.

5 Approximating the permanent


In this section we prove Theorem 1.1. Since our algorithms produce matrices that are
only approximately doubly stochastic, we need lower bounds for the permanent of such
matrices.

Proposition 5.1: Let A be a nonnegative n × n matrix in which all row sums are 1. Let c_j be the j-th column sum of A and let ε > 0. If

    Σ_{j=1}^n (c_j − 1)² < 1/n^{1+ε},

then

    per(A) ≥ Ω(e^{−n(1+n^{−ε/2})}).

In particular,

    Σ_{j=1}^n (c_j − 1)² < 1/(n log n)

implies

    per(A) ≥ Ω(e^{−n(1+1/√(log n))}).

Proof: We start with a simple lemma.

Lemma 5.2: With the same notation, if

    Σ_{j=1}^n (c_j − 1)² < 1/n,

then per(A) > 0.

Proof: If per(A) = 0, then by König-Hall,

    A = ( O  B )
        ( C  D ),    (10)

where O is an s × (n − s + 1) zero submatrix. Since all row sums are 1, the sum of all entries in B is s. Since A is nonnegative, this implies Σ_{j=n−s+2}^n c_j ≥ s. By Cauchy-Schwarz:

    Σ_{j=1}^n (c_j − 1)² ≥ Σ_{j=n−s+2}^n (c_j − 1)² ≥ (1/(s − 1)) (Σ_{j=n−s+2}^n (c_j − 1))² ≥ 1/(s − 1) > 1/n.

We return to the proof of the proposition. We claim that A can be expressed as A = D + Z. Here D = λΔ with Δ doubly stochastic and λ ≥ 0; Z is nonnegative, and per(Z) = 0. The only problem in finding such a decomposition is in satisfying the requirement per(Z) = 0. So, suppose that we have such a decomposition with per(Z) > 0. Then there is a permutation π with min_i Z_{i,π(i)} = α > 0. Let P = P_π be the corresponding permutation matrix. Note that D′ = D + αP is also a positive multiple of a doubly stochastic matrix. Let Z′ = Z − αP. Replace the representation A = D + Z by A = D′ + Z′. After a finite number of repetitions, a decomposition A = D_0 + Z_0 = λΔ_0 + Z_0 with per(Z_0) = 0 is obtained. If λ = 1, then Z_0 = 0, the zero matrix, A is doubly stochastic, and per(A) ≥ n!/n^n, so we are done. If λ < 1, consider the matrix B = Z_0/(1 − λ). The row sums in B are 1 and its column sums are c′_j = (c_j − λ)/(1 − λ). Clearly,

    Σ_{j=1}^n (c′_j − 1)² = (1/(1 − λ)²) · Σ_{j=1}^n (c_j − 1)² < 1/(n^{1+ε} (1 − λ)²).

On the other hand, per(B) = 0, so Lemma 5.2 implies 1/(n^{1+ε}(1 − λ)²) > 1/n. That is, λ > 1 − n^{−ε/2}.
The proof is concluded by observing that

    per(A) ≥ per(D_0) ≥ λ^n · per(D_0/λ) ≥ Ω(e^{−n^{1−ε/2}}) · n!/n^n ≥ Ω(e^{−n(1+n^{−ε/2})}).

Proposition 5.1 together with the estimates on the running time of algorithms RCS
and DSS imply:

Corollary 5.3: Let A be a nonnegative n × n matrix with a positive permanent. Then scaling factors x_1, ..., x_n and y_1, ..., y_n may be found such that the matrix B = (b_ij) = (x_i a_ij y_j) is nearly doubly stochastic. Specifically,

    (1/e)^{n+o(n)} ≤ per(B) ≤ 1.

The scaling factors can be found in at most O(n⁵ log² n) elementary operations.
Since per(B) = (Π_{i=1}^n x_i)(Π_{j=1}^n y_j) per(A), the corollary implies Theorem 1.1, and we are done.

5.1 Perfect Matchings


The goal of this short subsection is to emphasize the following curious and potentially promising aspect of this work. Our new scaling algorithm may be used to decide whether a given bipartite graph has a perfect matching. The approach we use is conceptually different from known methods.⁵ Moreover, it is certainly the simplest to describe and code. We cannot resist the temptation to present it here - it is a simple exercise to prove its correctness, which otherwise might be derived from Theorem 3.2 and Lemma 5.2. Note that no preprocessing is required here, since the permanent of a 0-1 matrix, if nonzero, is already known up to a multiplicative factor of n^n.
A simple perfect matching algorithm
Given a 0-1 matrix A
Begin
for n² log n iterations do
Normalize(columns);
Normalize(rows);
If (Σ_{j=1}^n (c_j − 1)² < 1/n)
return(*Perfect Matching*);
return(*No Perfect Matchings*);
End
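A direct transcription of this listing in Python, assuming numpy; the iteration count and the final test are taken verbatim from the listing, and the early exit for an all-zero row or column is our own guard:

```python
import math
import numpy as np

def has_perfect_matching(A):
    """Sinkhorn-based decision procedure for bipartite perfect matching (a sketch)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if (A.sum(axis=0) == 0).any() or (A.sum(axis=1) == 0).any():
        return False                            # an empty row or column rules out a matching
    for _ in range(int(n * n * math.log(n)) + 1):
        A = A / A.sum(axis=0, keepdims=True)    # Normalize(columns)
        A = A / A.sum(axis=1, keepdims=True)    # Normalize(rows)
    c = A.sum(axis=0)                           # final column sums
    return bool(np.sum((c - 1.0) ** 2) < 1.0 / n)
```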
The maximum matching problem [25] is, of course, a classical question in the theory of algorithms. Among the more recent interesting findings is the fact that this problem is in RNC [24, 23]. It is a major challenge to find an NC algorithm for it. While our work is very far from resolving this difficulty, it may shed some new light, and may be useful in discovering alternative routes to deciding the existence of perfect matchings. The above algorithm is polynomial-time, though much inferior to standard algorithms. We note that each iteration of the present scheme can be carried out in NC. If one can find another iterative NC-realizable scheme that requires only polylogarithmically many iterations, this would finally put the decision problem for perfect matchings in NC.

⁵ Essentially the same algorithm has been independently developed in [16].

6 Conclusions
With respect to matrix scaling, we have hardly begun investigating the advantage in
applying our new algorithms rather than existing ones. Nor have we studied the possible
extensions of our algorithm to more general scaling and nonlinear optimization problems
from this literature.
With respect to permanent approximation, it is again but a first step on a path that advocates a simple reduction of the input matrix into one that may be easily approximated. An obvious challenge is of course to develop a polynomial-time (1 + ε)-factor approximation scheme. At present even finding a (1 + ε)^n-factor approximation scheme seems hard and challenging; this is elusive even for 0,1 matrices with a constant number of 1's in every row and column.

7 Acknowledgement
We are grateful to Yuri Rabinovitch for opening our eyes to the fact that what we were
doing was actually matrix scaling. We also thank Leonid Gurvits for interesting remarks.

References
[1] M. O. Ball and U. Derigs, An analysis of alternate strategies for implementing
matching algorithms, Networks 13, 517-549, 1983.
[2] A. I. Barvinok, Computing Mixed Discriminants, Mixed Volumes, and Permanents,
Discrete & Computational Geometry , 18 (1997), 205-237.
[3] A. I. Barvinok, A simple polynomial time algorithm to approximate the permanent
within a simply exponential factor, Preprint, 1998.
[4] A. I. Barvinok, A simple polynomial time algorithm to approximate permanents and
mixed discriminants within a simply exponential factor, Preprint, 1998.

[5] L. M. Bregman, Certain properties of nonnegative matrices and their permanents,
Soviet Math. Dokl. 14, 945-949, 1973.
[6] D. T. Brown, A note on approximations of discrete probability distributions, Inform.
and Control, 2, 386-392, 1959.
[7] G.P. Egorychev, The solution of van der Waerden’s problem for permanents, Ad-
vances in Math., 42, 299-305, 1981.
[8] D. I. Falikman, Proof of the van der Waerden’s conjecture on the permanent of a
doubly stochastic matrix, Mat. Zametki 29, 6: 931-938, 957, 1981, (in Russian).
[9] U. Feige and C. Lund. On the hardness of computing the permanent of random
matrices, STOC 24, 643-654, 1992.
[10] S. E. Fienberg, The Analysis of Cross Classified Data, MIT press, Cambridge, MA,
1977.
[11] J. Franklin and J. Lorenz, On the scaling of multidimensional matrices, Linear Al-
gebra Appl. 114/115, 717-735, 1989.
[12] S. Friedland, C. Li, H. Schneider, Additive Decomposition of Nonnegative Matrices
with Applications to Permanents and Scaling, Linear and Multilinear Algebra,23,
63–78, 1988.
[13] Z. Galil, S. Micali and H. Gabow, Priority queues with variable priority and an
O(EV log V ) algorithm for finding maximal weighted matching in a general graph,
FOCS 23, 255-261, 1982.
[14] A. V. Goldberg and R. E. Tarjan, A new approach to the maximum-flow problem,
J. Assoc. Comp. Mach., 35, 921–940, 1988.
[15] L. Gurvits, Private communication, 1998.
[16] L. Gurvits, P. N. Yianilos, The deflation-inflation method for certain semidefinite
programming and maximum determinant completion problems, Preprint, 1998.
[17] G. T. Herman and A. Lint, Iterative reconstruction algorithms, Comput. Biol. Med.
6, 276, 1976.
[18] M. Jerrum and A. Sinclair, Approximating the permanent, SIAM J. Comput., 18,
1149-1178, 1989.
[19] M. Jerrum and U. Vazirani, A mildly exponential approximation algorithm for the
permanent, Algorithmica, 16(4/5), 392-401, 1996.
[20] P. W. Kasteleyn, The statistics of dimers on a lattice 1. The number of dimer
arrangements on a quadratic lattice. Physica, 27, 1209–1225, 1961.

[21] B. Kalantari and L. Khachiyan, On the complexity of nonnegative matrix scaling,
Linear Algebra Appl., 240, 87-104, 1996.

[22] N. Karmarkar, R. Karp, R. Lipton, L. Lovasz and M. Luby, A Monte-Carlo algorithm


for estimating the permanent, SIAM Journal on Computing, 22(2), 284-293, 1993.

[23] R. M. Karp, E. Upfal and A. Wigderson, Are search and decision problems compu-
tationally equivalent? STOC 17, 1985.

[24] L. Lovasz, On determinants, matchings and random algorithms, Fundamentals of


computing theory, edited by L. Budach, Akademia-Verlag, Berlin 1979.

[25] L. Lovasz and M. D. Plummer, Matching Theory, North Holland, Amsterdam


1986.

[26] N. Megiddo, Towards a genuinely polynomial algorithm for linear programming,
SIAM Journal on Computing, 12(2), 347-353, 1983.

[27] B. N. Parlett, T. L. Landis, Methods for Scaling to Doubly Stochastic Form, Linear
Algebra Appl. 48, 53-79, 1982.

[28] T. E. S. Raghavan, On pairs of multidimensional matrices, Linear Algebra Appl. 62,


263-268, 1984.

[29] U. Rothblum and H. Schneider, Scaling of matrices which have prespecified row
sums and column sums via optimization, Linear Algebra Appl. 114/115, 737-764,
1989.

[30] R. Sinkhorn, A relationship between arbitrary positive matrices and doubly stochas-
tic matrices, Ann. Math. Statist. 35, 876-879, 1964.

[31] L. G. Valiant, The complexity of computing the permanent, Theoretical Computer


Science, 8(2), 189-201, 1979.

[32] J. H. Wilkinson, Rounding Errors in Algebraic Processes, Her Majesty’s Stationery


Office, England, 1963.

