A Tutorial on MM Algorithms
David R. Hunter¹
Kenneth Lange²
¹ Department of Statistics, Penn State University, University Park, PA 16802-2111
Abstract
Most problems in frequentist statistics involve optimization of a function
such as a likelihood or a sum of squares. EM algorithms are among the most
effective algorithms for maximum likelihood estimation because they consis-
tently drive the likelihood uphill by maximizing a simple surrogate function for
the loglikelihood. Iterative optimization of a surrogate function as exemplified
by an EM algorithm does not necessarily require missing data. Indeed, every
EM algorithm is a special case of the more general class of MM optimization
algorithms, which typically exploit convexity rather than missing data in ma-
jorizing or minorizing an objective function. In our opinion, MM algorithms
deserve to be part of the standard toolkit of professional statisticians. The current
article explains the principle behind MM algorithms, suggests some methods
for constructing them, and discusses some of their attractive features. We
include numerous examples throughout the article to illustrate the concepts
described. In addition to surveying previous work on MM algorithms, this ar-
ticle introduces some new material on constrained optimization and standard
error estimation.
1 Introduction
Maximum likelihood and least squares are the dominant forms of estimation in frequentist statistics. Toy optimization problems designed for classroom presentation can be solved analytically, but most practical maximum likelihood and least squares problems must be solved iteratively. This article discusses an optimization method that relies heavily on convexity arguments and generalizes the well-known class of EM algorithms for maximum likelihood estimation (Dempster et al., 1977; McLachlan and Krishnan, 1997). We call any algorithm based on this iterative method an MM algorithm.
To our knowledge, the general principle behind MM algorithms was first enun-
ciated by the numerical analysts Ortega and Rheinboldt (1970) in the context of
line search methods. de Leeuw and Heiser (1977) present an MM algorithm for
multidimensional scaling contemporary with the classic Dempster et al. (1977) pa-
per on EM algorithms. Although the work of de Leeuw and Heiser did not spark
the same explosion of interest from the statistical community set off by the Demp-
ster et al. (1977) paper, steady development of MM algorithms has continued. The MM principle reappears, among other places, in the quadratic lower bound principle of Böhning and Lindsay (1988), in the psychometrics literature on least squares (Bijleveld and de Leeuw, 1991; Kiers and Ten Berge, 1992), and in medical imaging
(De Pierro, 1995; Lange and Fessler, 1995). The recent survey articles of de Leeuw
(1994), Heiser (1995), Becker et al. (1997), and Lange et al. (2000) deal with the
general principle, but it is not until the rejoinder of Hunter and Lange (2000a) that
the acronym MM first appears. This acronym pays homage to the earlier names
“majorization” and “iterative majorization” of the MM principle, emphasizes its
crucial link to the better-known EM principle, and diminishes the possibility of confusion with the distinct mathematical concept of majorization used in the theory of inequalities (Marshall and Olkin, 1979). Recent work has demonstrated the utility of MM algorithms in a
broad range of statistical contexts, including quantile regression (Hunter and Lange,
2000b), survival analysis (Hunter and Lange, 2002), paired and multiple compar-
isons (Hunter, 2004), variable selection (Hunter and Li, 2002), and DNA sequence analysis.
One of the virtues of the MM acronym is that it does double duty. In min-
imization problems, the first M of MM stands for majorize and the second M for
minimize. In maximization problems, the first M stands for minorize and the second
M for maximize. (We define the terms “majorize” and “minorize” in Section 2.) A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a nondifferentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.

In the EM algorithm literature, the E-step of each iteration creates a surrogate function, namely a conditional expectation of the complete-data loglikelihood calculated with respect to the observed data. The surrogate function created by the E-step is then, in the M-step, maximized with respect to the parameters of the underlying model; thus every EM algorithm is an MM algorithm. In many MM algorithms, by contrast, convexity arguments rather than missing-data constructions supply the surrogate function, and one can often succeed both in constructing the surrogate and in maximizing it analytically.
Statistical education tends to emphasize probabilistic tools such as conditional expectation, with far less attention paid to convexity and inequalities. Thus, success with MM algorithms and success with EM algorithms hinge on somewhat different mathematical skills. However, the skills required by most MM algorithms are no harder to master than the skills required by most EM algorithms. Finally, note that the MM label, like the EM label, refers not to a single algorithm but to a class of algorithms. Thus, this article refers to specific EM and MM algorithms rather than to “the” MM algorithm.
2 The MM Philosophy
Let θ(m) represent a fixed value of the parameter θ, and let g(θ | θ(m) ) denote a
real-valued function of θ whose form depends on θ(m). The function g(θ | θ(m)) is said to majorize a real-valued function f(θ) at the point θ(m) provided

g(θ | θ(m)) ≥ f(θ) for all θ,
g(θ(m) | θ(m)) = f(θ(m)).                (1)

In other words, the surface θ ↦ g(θ | θ(m)) lies above the surface f(θ) and is tangent to it at the point θ = θ(m). The function g(θ | θ(m)) is said to minorize f(θ) at θ(m) if −g(θ | θ(m)) majorizes −f(θ) at θ(m).
Ordinarily, θ(m) represents the current iterate in a search of the surface f(θ). In a majorize-minimize MM algorithm, we minimize the majorizing function g(θ | θ(m)) rather than the actual function f(θ). If θ(m+1) denotes the minimizer of g(θ | θ(m)), then we can show that the MM procedure forces f(θ) downhill. Indeed, the inequality

f(θ(m+1)) ≤ g(θ(m+1) | θ(m)) ≤ g(θ(m) | θ(m)) = f(θ(m))                (2)

follows directly from the fact g(θ(m+1) | θ(m)) ≤ g(θ(m) | θ(m)) and definition (1). In a minorize-maximize MM algorithm, we instead minorize the objective function f(θ) by a surrogate function g(θ | θ(m)) and maximize g(θ | θ(m)) to produce the next iterate θ(m+1).
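To make the mechanics concrete, the following Python sketch (our illustration, not part of the original article) shows the skeleton of a majorize-minimize iteration; the objective `f`, the routine `majorizer_argmin` that minimizes the surrogate, and the starting value `theta0` are all assumed to be supplied by the user.

```python
import numpy as np

def mm_minimize(f, majorizer_argmin, theta0, max_iter=500, tol=1e-8):
    """Generic majorize-minimize loop (a schematic sketch).

    f                : objective function to be minimized
    majorizer_argmin : given theta_m, returns the minimizer of g(theta | theta_m),
                       where g majorizes f at theta_m
    theta0           : starting value
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = majorizer_argmin(theta)
        # Descent property (2): f can never increase from one iterate to the next.
        assert f(theta_new) <= f(theta) + 1e-12
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```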
As a one-dimensional example, consider the problem of computing a sample quantile from a sample x1, . . . , xn of n real numbers. One can readily prove (Hunter and
Lange, 2000b) that for q ∈ (0, 1), a qth sample quantile of x1 , . . . , xn minimizes the
function
f(θ) = Σ_{i=1}^{n} ρq(xi − θ),                (3)
Figure 1: For q = 0.8, (a) depicts the “vee” function ρq (θ) and its quadratic ma-
jorizing function for θ(m) = −0.75; (b) shows the objective function f (θ) that is
minimized by the 0.8 quantile of the sample 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, along with
its quadratic majorizer, for θ(m) = 2.5.
where

ρq(θ) = qθ,            θ ≥ 0,
ρq(θ) = −(1 − q)θ,     θ < 0.
When q = 1/2, this function is proportional to the absolute value function; for
q ≠ 1/2, the “vee” is tilted to one side or the other. As seen in Figure 1(a), it is possible to majorize the “vee” function at any nonzero point θ(m) by a simple quadratic function ζq(θ | θ(m)) tangent to ρq(θ) at θ(m).

Fortunately, the majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. These rules allow us to work piecemeal in simplifying objective functions. Thus, the function f(θ) of equation (3) is majorized at the point
θ(m) by

g(θ | θ(m)) = Σ_{i=1}^{n} ζq(xi − θ | xi − θ(m)).                (4)
The function f(θ) and its majorizer g(θ | θ(m)) are shown in Figure 1(b) for a sample of size 12 with θ(m) = 2.5.
Setting the first derivative of g(θ | θ(m) ) equal to zero gives the minimum point
θ(m+1) = [ n(2q − 1) + Σ_{i=1}^{n} wi(m) xi ] / Σ_{i=1}^{n} wi(m),                (5)
where the weight wi(m) = |xi − θ(m)|−1 depends on θ(m). A flaw of algorithm (5) is that the weight wi(m) is undefined whenever θ(m) = xi. In mending this flaw, Hunter and Lange (2000b) also discuss the broader technique of quantile regression. From a computational perspective, the most fascinating thing about the quantile-finding algorithm is that it avoids sorting and relies entirely on arithmetic and iteration. For the case of the sample median (q = 1/2), the constant term in the numerator of (5) vanishes and each iterate is simply a weighted average of the observations.
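For instance, a minimal implementation of update (5) might look like the following sketch (our own illustration; the small constant `eps` is one simple way to guard against the undefined weight when θ(m) lands exactly on an observation).

```python
import numpy as np

def mm_quantile(x, q, theta0=None, max_iter=200, tol=1e-10, eps=1e-12):
    """Approximate the qth sample quantile by iterating the MM update (5)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = np.mean(x) if theta0 is None else float(theta0)
    for _ in range(max_iter):
        w = 1.0 / np.maximum(np.abs(x - theta), eps)   # weights w_i^(m)
        theta_new = (n * (2 * q - 1) + np.sum(w * x)) / np.sum(w)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Example: the sample from Figure 1(b); the 0.8 quantile should be near 3.
sample = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]
print(mm_quantile(sample, q=0.8))
```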
When the minimum of g(θ | θ(m)) cannot be found exactly, one may instead take a single Newton step on the surrogate,

θ(m+1) = θ(m) − [∇2g(θ(m) | θ(m))]−1 ∇g(θ(m) | θ(m)),                (6)

where ∇g(θ(m) | θ(m)) and ∇2g(θ(m) | θ(m)) denote the gradient vector and the Hessian matrix of g(θ | θ(m)) evaluated at θ(m). Since the descent property (2) depends only on decreasing g(θ | θ(m)) and not on minimizing it, the update (6) can serve in cases where g(θ | θ(m)) lacks a closed-form minimizer, provided this update decreases the value of g(θ | θ(m)). In the EM literature, Dempster et al. (1977) call an algorithm that reduces g(θ | θ(m)) without actually minimizing it a generalized EM (GEM) algorithm. Lange (1995a) points out that update (6) saves us from performing iterations within iterations and yet still displays the same local rate of convergence as a full MM algorithm that minimizes g(θ | θ(m)) at each iteration.
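Update (6) itself requires only a linear solve. The hedged sketch below (ours) performs a single such step, with user-supplied functions `grad_g` and `hess_g` returning the gradient and Hessian of the surrogate at θ(m).

```python
import numpy as np

def gradient_mm_step(grad_g, hess_g, theta_m):
    """One Newton step on the surrogate g(. | theta_m), as in update (6)."""
    theta_m = np.atleast_1d(np.asarray(theta_m, dtype=float))
    hess = np.atleast_2d(hess_g(theta_m))   # Hessian of g(. | theta_m) at theta_m
    grad = np.atleast_1d(grad_g(theta_m))   # gradient of g(. | theta_m) at theta_m
    return theta_m - np.linalg.solve(hess, grad)
```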
In the quantile example of Section 2.1, the convex “vee” function admits a quadratic majorizer, as depicted in Figure 1(a). In this section, we describe several inequalities that are commonly used to construct majorizing or minorizing functions.
Jensen’s inequality states for a convex function κ(x) and any random variable X that κ[E(X)] ≤ E[κ(X)]. Since − ln(x) is a convex function, we conclude for probability densities a(x) and b(x) that

− ln E[a(X)/b(X)] ≤ − E[ln{a(X)/b(X)}].

If X has the density b(x), then E[a(X)/b(X)] = 1, so the left hand side above vanishes and we obtain E[ln a(X)] ≤ E[ln b(X)]. This inequality is the key to constructing the minorizing function of the E-step of any EM algorithm (de Leeuw, 1994; Heiser, 1995), making every EM algorithm an MM algorithm.
A second important minorization follows from the supporting hyperplane property of a convex function: Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Thus, if κ(θ) is convex and differentiable, then

κ(θ) ≥ κ(θ(m)) + ∇κ(θ(m))t (θ − θ(m)),                (7)

with equality when θ = θ(m). This inequality is illustrated by the constrained optimization example of Section 7.
If we wish to majorize a convex function instead of minorizing it, then we can use the definition of convexity itself; namely, κ(t) is convex if and only if

κ(Σi αi ti) ≤ Σi αi κ(ti)                (8)

for any weights αi ≥ 0 summing to one and any points ti in the domain of κ. Application of inequality (8) is particularly effective when κ(t) is composed with a linear function xt θ. For instance, suppose for vectors x, θ, and θ(m) that we make the substitution ti = xi(θi − θi(m))/αi + xt θ(m). Inequality (8) then becomes

κ(xt θ) ≤ Σi αi κ( xi(θi − θi(m))/αi + xt θ(m) ),                (9)

provided each αi > 0 and Σi αi = 1.
Alternatively, if all components of x, θ, and θ(m) are positive, then we may take ti = xt θ(m) θi / θi(m) and αi = xi θi(m) / xt θ(m). Now inequality (8) becomes

κ(xt θ) ≤ Σi [ xi θi(m) / xt θ(m) ] κ( xt θ(m) θi / θi(m) ).                (10)
Inequalities (9) and (10) have been used to construct MM algorithms in the contexts
of medical imaging (De Pierro, 1995; Lange and Fessler, 1995) and least-squares estimation without matrix inversion.
If a convex function κ(θ) is twice differentiable and has bounded curvature, then
we can majorize κ(θ) by a quadratic function with sufficiently high curvature and
tangent to κ(θ) at θ(m) (Böhning and Lindsay, 1988). In algebraic terms, if we can
find a positive definite matrix M such that M − ∇2 κ(θ) is nonnegative definite for
all θ, then
κ(θ) ≤ κ(θ(m)) + ∇κ(θ(m))t (θ − θ(m)) + (1/2) (θ − θ(m))t M (θ − θ(m))
provides a quadratic upper bound. For example, Heiser (1995) notes in the unidimensional case that such a bound is available whenever the second derivative κ″(θ) is bounded.
Taking κ(t) = e^t and equal weights αi = 1/m in inequality (8) gives, for positive numbers x1, . . . , xm,

(x1 x2 · · · xm)^{1/m} ≤ (1/m) Σ_{i=1}^{m} xi,                (11)

which is one version of the arithmetic-geometric mean inequality. Because the exponential function is strictly convex, equality holds if and only if all of the xi are equal. Inequality (11) with m = 2 yields

√(xy) ≤ x/2 + y/2,                (12)

which we use below in constructing a minorizing function for the sports model of Section 4.
The Cauchy-Schwartz inequality for the Euclidean norm is a special case of inequality (7). The function κ(θ) = ‖θ‖ is convex because it satisfies the triangle inequality and the homogeneity condition ‖αθ‖ = |α| · ‖θ‖. Since κ(θ) = √(Σi θi²) has gradient ∇κ(θ) = θ/‖θ‖ for θ ≠ 0, inequality (7) gives

‖θ‖ ≥ ‖θ(m)‖ + (θ(m))t (θ − θ(m)) / ‖θ(m)‖ = (θ(m))t θ / ‖θ(m)‖,

which is the Cauchy-Schwartz inequality. de Leeuw and Heiser (1977) and Groenen (1993) exploit this minorization of the Euclidean norm in their MM treatments of multidimensional scaling.
One of the key criteria in judging minorizing or majorizing functions is their ease of optimization. Successful MM algorithms in high-dimensional problems often rely on surrogate functions in which the individual parameter components are separated, so that g(θ | θ(m)) reduces to a sum of d univariate functions of the components θ1, . . . , θd. Since the d univariate functions may be optimized one by one, this makes the surrogate easy to optimize and avoids the inversion of large matrices.
4.1 Poisson Sports Model
Consider a sporting contest between two individuals or teams in which the number of points scored by team i against team j follows a Poisson process with intensity e^{oi − dj}, where oi is an “offensive strength” parameter for team i and dj is a “defensive strength” parameter for team j. If tij is the length of time that i plays j and pij is the number of points that i scores against j, then the corresponding Poisson loglikelihood function is

ℓij(θ) = pij(oi − dj) − tij e^{oi − dj} + pij ln tij − ln pij!,                (14)
where θ = (o, d) is the parameter vector. Note that the parameters should satisfy a linear constraint, such as Σi oi + Σj dj = 0, in order for the model to be identifiable; otherwise, it is clearly possible to add the same constant to each oi and dj without
altering the likelihood. We make two simplifying assumptions. First, different games
are independent of each other. Second, each team’s point total within a single game
is independent of its opponent’s point total. The second assumption is more suspect
than the first since it implies that a team’s offensive and defensive performances are
somehow unrelated to one another; nonetheless the model gives an interesting first approximation to reality. Under these assumptions, the full loglikelihood is obtained by summing ℓij(θ) over all pairs (i, j). Setting the partial derivatives of the loglikelihood equal to zero does not yield closed-form solutions for the maximum likelihood estimates, so we turn to an MM algorithm.
Because the task is to maximize the loglikelihood (14), we need a minorizing
function. Focusing on the −tij e^{oi−dj} term, we may use inequality (12), with x = e^{2oi − oi(m) − dj(m)} and y = e^{oi(m) + dj(m) − 2dj}, to show that

−tij e^{oi−dj} ≥ −(tij/2) [ e^{2oi} e^{−oi(m)−dj(m)} + e^{−2dj} e^{oi(m)+dj(m)} ],                (15)

with equality when oi = oi(m) and dj = dj(m).
Although the right side of the above inequality may appear more complicated than
the left side, it is actually simpler in one important respect — the parameter com-
ponents oi and dj are separated on the right side but not on the left. Summing the
loglikelihood (14) over all pairs (i, j) and invoking inequality (15) yields the function

g(θ | θ(m)) = Σi Σj [ pij(oi − dj) − (tij/2) e^{2oi} e^{−oi(m)−dj(m)} − (tij/2) e^{−2dj} e^{oi(m)+dj(m)} ]

minorizing the full loglikelihood at θ(m). The fact that the components of θ are separated by g(θ | θ(m)) permits us to update parameters one by one and substantially reduces computational costs. Setting the partial derivatives of g(θ | θ(m)) equal to zero and solving for the components of θ(m+1) gives the updates

oi(m+1) = (1/2) ln [ Σj pij / Σj tij e^{−oi(m)−dj(m)} ],
dj(m+1) = −(1/2) ln [ Σi pij / Σi tij e^{oi(m)+dj(m)} ].                (16)
The question now arises as to whether one should modify algorithm (16) so that
updated subsets of the parameters are used as soon as they become available. For
instance, if we update the o vector before the d vector in each iteration of algorithm
(16), we could replace the formula for dj(m+1) above by

dj(m+1) = −(1/2) ln [ Σi pij / Σi tij e^{oi(m+1)+dj(m)} ].                (17)
In practice, MM algorithms often take fewer iterations when we cycle through the parameters, updating one at a time, than when we update the whole vector simultaneously. We call such algorithms cyclic MM algorithms; they generalize the ECM algorithms of Meng and Rubin (1993). A cyclic MM algorithm always drives the objective function in the right direction; indeed, each parameter update is itself an MM iteration applied to a reduced parameter set.
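For concreteness, here is a short Python sketch (ours, not the implementation behind Table 1) of one iteration of algorithm (16), with an option for the cyclic variant (17). The arrays `p[i, j]` and `t[i, j]` are assumed to hold the points scored by team i against team j and the minutes the two teams played, and the final recentering is just one convenient way to impose the identifiability constraint.

```python
import numpy as np

def mm_sports_iteration(o, d, p, t, cyclic=False):
    """One iteration of the Poisson sports model MM updates (16)/(17).

    o, d : current offensive and defensive parameter vectors (length K)
    p    : K x K array, p[i, j] = points scored by team i against team j
    t    : K x K array, t[i, j] = minutes team i played against team j
    (Illustrative sketch; diagonal entries of p and t are assumed to be zero.)
    """
    o_new = 0.5 * np.log(p.sum(axis=1) /
                         (t * np.exp(-o[:, None] - d[None, :])).sum(axis=1))
    o_used = o_new if cyclic else o          # cyclic variant (17) uses o^(m+1)
    d_new = -0.5 * np.log(p.sum(axis=0) /
                          (t * np.exp(o_used[:, None] + d[None, :])).sum(axis=0))
    # One simple way to impose the constraint sum(o) + sum(d) = 0; the
    # loglikelihood is unchanged because o_i - d_j is shift-invariant.
    c = (o_new.sum() + d_new.sum()) / (2 * len(o_new))
    return o_new - c, d_new - c
```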
Table 1: Ranking of all 29 NBA teams on the basis of the 2002-2003 regular season
according to their estimated offensive strength plus defensive strength. Each team
played 82 games.
Table 1 summarizes our application of the Poisson sports model to the results of the
2002–2003 regular season of the National Basketball Association. In these data, tij
is measured in minutes. A regular game lasts 48 minutes, and each overtime period, if necessary, adds five minutes. Thus, team i is expected to score 48 e^{ôi − d̂j} points against team j when the two teams meet and do not tie. Team i is ranked higher than team j if ôi − d̂j > ôj − d̂i, which is equivalent to ôi + d̂i > ôj + d̂j.
It is worth emphasizing some of the virtues of the model. First, the ranking of
the 29 NBA teams on the basis of the estimated sums ôi +dˆi for the 2002-2003 regular
season is not perfectly consistent with their cumulative wins; strength of schedule
and margins of victory are reflected in the model. Second, the model gives the point-
spread function for a particular game as the difference of two independent Poisson
random variables. Third, one can easily amend the model to rank individual players
rather than teams by assigning to each player an offensive and defensive intensity parameter. If the data record how long each player is on the court and how many points are scored for and against his team during that time, then the MM algorithm can be adapted to estimate the assigned player intensities. This might provide a rational basis for salary negotiations that takes into account both offensive and defensive contributions.
Finally, the NBA data set sheds light on the comparative speeds of the original
MM algorithm (16) and its cyclic modification (17). The cyclic MM algorithm
converged in fewer iterations (25 instead of 28). However, because of the additional work required to recompute the denominators in equation (17), the cyclic version actually required more floating point operations in total than the 289,998 used by the original algorithm.
5 Speed of Convergence
Newton-Raphson and related algorithms tend to converge very quickly once they near a local optimum point θ∗. In other words, under certain general conditions,

lim_{m→∞} ‖θ(m+1) − θ∗‖ / ‖θ(m) − θ∗‖² = c

for some constant c. This quadratic rate of convergence is much faster than the linear rate of convergence

lim_{m→∞} ‖θ(m+1) − θ∗‖ / ‖θ(m) − θ∗‖ = c < 1                (18)

displayed by typical MM algorithms.
As a consequence, MM algorithms tend to require more iterations but simpler iterations than Newton-Raphson. For this reason it can be difficult to say in advance which approach will be faster on a given problem.
For example, the Poisson process scoring model for the NBA data set of Section
4 has 57 parameters (two for each of 29 teams minus one for the linear constraint).
Each Newton-Raphson iteration entails evaluating and inverting a 57 × 57 Hessian matrix, and over the course of several iterations these matrix operations alone require more computation in this example than the 300,000 floating point operations needed by the MM algorithm.
Moreover, numerical stability also enters the balance sheet. A Newton-Raphson algorithm can behave
poorly if started too far from an optimum point. By contrast, MM algorithms are
guaranteed to appropriately increase or decrease the value of the objective function
at every iteration.
Although Newton-Raphson, Fisher scoring, Nelder-Mead, and MM algorithms may differ substantially in the number and cost of iterations until convergence, each has its own merits. The expected information matrix used in Fisher scoring is sometimes easier to evaluate than the observed information matrix of Newton-Raphson. The parameter separation achieved by many MM algorithms can mitigate or even eliminate the need for matrix inversion. The Nelder-Mead approach requires no derivatives at all. It is impossible to declare any one algorithm best overall. In our experience, however, MM algorithms are often difficult to beat in terms of simplicity and numerical stability.
In most statistical models, the asymptotic covariance matrix of the maximum likelihood estimate θ̂ is equal to the inverse of the expected information matrix. In practice, the expected information matrix is often well approximated by the observed information matrix −∇2ℓ(θ̂) computed by differentiating the loglikelihood ℓ(θ) twice. Thus, after the MLE θ̂ has been found, a standard error of θ̂ can be obtained by taking square roots of the diagonal entries of the inverse of −∇2ℓ(θ̂). In some problems, however, direct calculation of ∇2ℓ(θ̂) is difficult. Here we describe two numerical approximations to this matrix that exploit quantities readily obtained by running an MM algorithm.
Let g(θ | θ(m)) denote a minorizing function of the loglikelihood ℓ(θ) at the point θ(m), and let M(θ) denote the MM algorithm map taking θ(m) to θ(m+1) = arg maxθ g(θ | θ(m)).
The two numerical approximations to −∇2ℓ(θ̂) are based on the formulas

∇2ℓ(θ̂) = ∇2g(θ̂ | θ̂) [ I − ∇M(θ̂) ]                (19)
        = ∇2g(θ̂ | θ̂) + ∂/∂ϑ ∇g(θ̂ | ϑ) |_{ϑ=θ̂},                (20)
where I denotes the identity matrix. These formulas are derived in Lange (1999)
using two simple facts: First, the tangency of `(θ) and its minorizer imply that their
gradient vectors are equal at the point of minorization; and second, the gradient of g(θ | θ(m)) vanishes at its maximizer M(θ(m)). Alternative derivations of formulas (19) and (20) are given by Meng and Rubin (1991) and Oakes (1999), respectively.
Although these formulas have been applied to standard error estimation in the EM
algorithm literature — Meng and Rubin (1991) base their SEM idea on formula
(19) — to our knowledge, neither has been applied in the broader context of MM
algorithms.
The matrix ∇M(θ̂) appearing in formula (19) can be approximated by numerically differentiating the components of the MM map, using

∂Mi(θ)/∂θj = lim_{δ→0} [ Mi(θ + δ ej) − Mi(θ) ] / δ,                (21)

where the vector ej is the jth standard basis vector having a one in its jth component and zeros elsewhere. Since M(θ̂) = θ̂, the jth column of ∇M(θ̂) may be
approximated using only output from the corresponding MM algorithm by (a) iterating until θ̂ is found, (b) altering the jth component of θ̂ by a small amount δj, (c) applying the MM algorithm to this altered θ, (d) subtracting θ̂ from the result, and (e) dividing by δj. The derivative appearing in formula (20) may be approximated in the same forward-difference fashion applied to the function h(ϑ) = ∇g(θ̂ | ϑ); in this case one may exploit the fact that h(θ̂) is zero.
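The following sketch (our illustration, not the authors' code) carries out steps (a) through (e) and then assembles the observed information matrix via formula (19); `mm_map` is assumed to perform one full MM update M(θ), and `hess_g` to return ∇²g(θ̂ | θ̂).

```python
import numpy as np

def mm_standard_errors(theta_hat, mm_map, hess_g, rel_delta=1e-3):
    """Approximate standard errors from MM output via formula (19).

    theta_hat : MLE found by iterating the MM algorithm to convergence
    mm_map    : function M(theta) performing one MM update
    hess_g    : function returning the Hessian of g(. | theta_hat) at theta_hat
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    p = len(theta_hat)
    dM = np.empty((p, p))
    for j in range(p):
        delta = rel_delta * theta_hat[j] if theta_hat[j] != 0 else rel_delta
        theta_pert = theta_hat.copy()
        theta_pert[j] += delta                               # (b) perturb component j
        dM[:, j] = (mm_map(theta_pert) - theta_hat) / delta  # (c)-(e), using M(theta_hat) = theta_hat
    info = -hess_g(theta_hat) @ (np.eye(p) - dM)  # observed information via formula (19)
    return np.sqrt(np.diag(np.linalg.inv(info)))
```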
To illustrate these ideas and facilitate comparison of the various numerical methods, we consider a logistic regression example.
Böhning and Lindsay (1988) apply the quadratic bound principle of Section 3.4 to
the case of logistic regression, in which we have an n×1 vector Y of binary responses and an n×p matrix X of predictors. The model stipulates that the probability πi(θ) that Yi = 1 equals exp(xit θ)/[1 + exp(xit θ)], where xit is the ith row of X. Differentiating the loglikelihood ℓ(θ) twice shows that −∇2ℓ(θ) = Σi πi(θ)[1 − πi(θ)] xi xit. Since πi(θ)[1 − πi(θ)] is bounded above by 1/4, the matrix M = (1/4) Xt X satisfies the requirement of the quadratic bound principle, and maximizing the resulting quadratic minorizer of ℓ(θ) gives the update

θ(m+1) = θ(m) + 4 (Xt X)−1 Xt [ Y − π(θ(m)) ],                (22)

where π(θ) denotes the vector with ith component πi(θ).
Since the MM algorithm of equation (22) needs to invert Xt X only once, it enjoys a computational advantage over alternatives such as Newton-Raphson, which must evaluate and invert a different weighted matrix at each iteration.
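A compact sketch of iteration (22) (ours, for illustration) makes this advantage visible: the matrix (XtX)−1 is formed once, outside the main loop.

```python
import numpy as np

def mm_logistic(X, Y, max_iter=5000, tol=1e-10):
    """Logistic regression fitted by the quadratic-bound MM update (22)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)           # inverted only once
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ theta))  # pi_i(theta)
        theta_new = theta + 4.0 * XtX_inv @ X.T @ (Y - pi)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```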
Table 2: Estimated coefficients and standard errors for the low birth weight logistic
regression example.
We now test the standard error approximations based on equations (19) and (20) on the low birth weight dataset of Hosmer and Lemeshow (1989). This dataset involves 189 births, with a binary response indicating whether an infant is born underweight, defined as weighing less than 2.5 kilograms. The
predictors include mother’s age in years (AGE), weight at last menstrual period
(LWT), race (RACE2 and RACE3), smoking status during pregnancy (SMOKE),
presence of uterine irritability (UI), and number of physician visits during the first
trimester (FTV). Each of these predictors is quantitative except for race, which is
a 3-level factor with level 1 for whites, level 2 for blacks, and level 3 for other races.
Table 2 shows the maximum likelihood estimates and asymptotic standard errors for
the 10 parameters. The differentiation increment δj was θ̂j /1000 for each parameter
θj . The standard error approximations in the two rightmost columns turn out to be
the same in this example, but in other models they will differ. The close agreement
of the approximations with the “gold standard” based on the exact value of ∇2ℓ(θ̂) is reassuring.
7 Handling Constraints
In this section we discuss an MM technique that in a sense eliminates inequality constraints. For this adaptive barrier
method (Censor and Zenios, 1992; Lange, 1994) to work, an initial point θ(0) must
be selected with all inequality constraints strictly satisfied. The barrier method
confines subsequent iterates to the interior of the parameter space but allows strict inequalities to become equalities in the limit. Suppose the problem is to minimize f(θ) subject to the constraints vj(θ) ≥ 0 for 1 ≤ j ≤ q, where each vj(θ) is a concave, differentiable function. Since −vj(θ) is convex, the supporting hyperplane inequality (7) implies

−vj(θ) + vj(θ(m)) + ∇vj(θ(m))t (θ − θ(m)) ≥ 0.

Similarly, because − ln t is convex in t, inequality (7) applied to − ln t at the point vj(θ(m)) implies

vj(θ(m)) [ ln vj(θ(m)) − ln vj(θ) ] + vj(θ) − vj(θ(m)) ≥ 0.
Adding the last two inequalities, we see that
vj(θ(m)) [ − ln vj(θ) + ln vj(θ(m)) ] + ∇vj(θ(m))t (θ − θ(m)) ≥ 0,
with equality when θ = θ(m). Summing over j and multiplying by a positive tuning constant ω, we obtain the function

g(θ | θ(m)) = f(θ) + ω Σ_{j=1}^{q} { vj(θ(m)) [ ln vj(θ(m)) − ln vj(θ) ] + ∇vj(θ(m))t (θ − θ(m)) }                (23)

majorizing f(θ) at θ(m). The presence of the term ln vj(θ) in equation (23) prevents the minimizer θ(m+1) of g(θ | θ(m)) from leaving the region where every vj(θ) > 0, yet the factor vj(θ(m)) multiplying it adapts from one iteration to the next and allows vj(θ(m+1)) to tend to 0 if it is inclined to do so. When there are equality constraints in addition, these can be enforced directly during the minimization of g(θ | θ(m)), for instance by introducing Lagrange multipliers.
To gain a feel for how these ideas work in practice, consider the problem of maximizing a multinomial loglikelihood, or equivalently minimizing f(θ) = −Σ_{i=1}^{q} ni ln θi, subject to the constraints θi ≥ 0 and Σ_{i=1}^{q} θi = 1, where ni is the observed count for category i and n = Σi ni. Although the maximum likelihood estimates are given by θ̂i = ni/n, this example is instructive because the MM iterates can be followed explicitly.
Taking vi(θ) = θi in equation (23) and dropping irrelevant constants, we construct the majorizing function

g(θ | θ(m)) = f(θ) − ω Σ_{i=1}^{q} θi(m) ln θi + ω Σ_{i=1}^{q} θi.

Minimizing g(θ | θ(m)) subject to the equality constraint Σi θi = 1 with a Lagrange multiplier λ leads, after multiplication by θi, to the stationarity condition

−ni − ωθi(m) + ωθi + λθi = 0.

Solving for θi and imposing Σi θi = 1 shows that λ = n, so that

θi(m+1) = [ ni + ωθi(m) ] / (n + ω).
Hence, all iterates have positive components if they start with positive components.
A brief calculation shows that the iterates satisfy

θi(m+1) − ni/n = [ ω/(n + ω) ] [ θi(m) − ni/n ],

which demonstrates that θ(m) approaches the estimate θ̂ at the linear rate ω/(n + ω), regardless of whether θ̂ occurs on the boundary of the parameter space where one or more of its components θ̂i equal zero.
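The closed-form update is trivial to iterate. The brief sketch below (our own, with a hypothetical count vector) illustrates both the positivity of the iterates and the linear rate ω/(n + ω), even when the MLE lies on the boundary.

```python
import numpy as np

def barrier_multinomial(counts, omega=10.0, iters=50):
    """Adaptive-barrier MM iterates theta_i^(m+1) = (n_i + omega*theta_i^(m)) / (n + omega)."""
    n_i = np.asarray(counts, dtype=float)
    n = n_i.sum()
    theta = np.full(len(n_i), 1.0 / len(n_i))  # start in the interior of the simplex
    for _ in range(iters):
        theta = (n_i + omega * theta) / (n + omega)
    return theta

counts = [10, 0, 5]                 # hypothetical data; note the zero count
print(barrier_multinomial(counts))  # approaches (2/3, 0, 1/3) at rate omega/(n + omega)
```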
8 Discussion
This article is meant to whet readers’ appetites, not satiate them. We have omitted
much. For instance, there is a great deal known about the convergence properties
of MM algorithms that is too mathematically demanding to present here. Fortu-
nately, almost all results from the EM algorithm literature (Wu, 1983; Lange, 1995a;
McLachlan and Krishnan, 1997; Lange, 1999) carry over without change to MM algorithms. In addition, many techniques for accelerating EM algorithms are also applicable to accelerating MM algorithms (Heiser, 1995; Lange, 1995b).
Although this survey article necessarily reports much that is already known,
there are some new results here. Our MM treatment of constrained optimization
in Section 7 is more general than previous versions in the literature (Censor and
Zenios, 1992; Lange, 1994). The application of equation (20) to the estimation of standard errors in the MM context also appears to be new. Because MM algorithms appear under a variety of names in a variety of literatures, we are doubtless unaware of many relevant papers and unable to cite them all. Readers should be on the lookout for these and for known algorithms that can be profitably reinterpreted in the MM framework. More importantly, we hope this article will stimulate readers to discover new MM algorithms.
References
D. Böhning and B. G. Lindsay (1988), Monotonicity of quadratic approximation algorithms, Ann. Inst. Statist. Math., 40, 641–663.
J. de Leeuw (1994), Block relaxation algorithms in statistics, in Information Systems and Data Analysis (ed. H. H. Bock, W. Lenski, and M. M. Richter), pp. 308–325. Berlin: Springer-Verlag.
J. de Leeuw and W. J. Heiser (1977), Convergence of correction matrix algorithms for multidimensional scaling, in Geometric Representations of Relational Data (ed. J. C. Lingoes, E. Roskam, and I. Borg), pp. 735–752. Ann Arbor: Mathesis Press.
A. P. Dempster, N. M. Laird, and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. B, 39, 1–38.
A. R. De Pierro (1995), A modified expectation maximization algorithm for penalized likelihood estimation in emission tomography, IEEE Trans. Medical Imaging, 14, 132–137.
W. J. Heiser (1995), Convergent computation by iterative majorization: theory and applications in multidimensional data analysis, in Recent Advances in Descriptive Multivariate Analysis (ed. W. J. Krzanowski), pp. 157–189. Oxford: Clarendon Press.
D. W. Hosmer and S. Lemeshow (1989), Applied Logistic Regression, Wiley, New York.
D. R. Hunter (2004), MM algorithms for generalized Bradley-Terry models, Ann. Stat., to appear.
D. R. Hunter and R. Li (2002), Variable selection via MM algorithms, Penn State Department of Statistics technical report 0201.
K. Lange (1994), An adaptive barrier method for convex programming, Methods and Applications of Analysis, 1, 392–402.
K. Lange (1995b), A quasi-Newton acceleration of the EM algorithm, Statistica Sinica, 5, 1–18.
K. Lange and J. A. Fessler (1995), Globally convergent algorithms for maximum a posteriori transmission tomography, IEEE Trans. Image Processing, 4, 1430–1438.
36: 109–118.
G. J. McLachlan and T. Krishnan (1997), The EM Algorithm and Extensions, Wiley,
New York.
X.-L. Meng and D. B. Rubin (1991), Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm, J. Amer. Stat. Assoc., 86, 899–909.
X.-L. Meng and D. B. Rubin (1993), Maximum likelihood estimation via the ECM algorithm: a general framework, Biometrika, 80, 267–278.
D. Oakes (1999), Direct calculation of the information matrix via the EM algorithm, J. Roy. Stat. Soc. B, 61, 479–482.