Learning Structured Prediction Models: A Large Margin Approach

Ben Taskar taskar@cs.berkeley.edu
Computer Science, UC Berkeley, Berkeley, CA 94720
Vassil Chatalbashev vasco@cs.stanford.edu
Daphne Koller koller@cs.stanford.edu
Computer Science, Stanford University, Stanford, CA 94305
Carlos Guestrin guestrin@cs.cmu.edu
Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graph-cuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training data. Our method relies on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured prediction models. Directly embedding this structure within the learning formulation produces concise convex problems for efficient estimation of very complex and diverse models. We describe experimental results on a matching task, disulfide connectivity prediction, showing significant improvements over state-of-the-art methods.

1. Introduction

Structured prediction problems arise in many tasks where multiple interrelated decisions must be weighed against each other to arrive at a globally satisfactory and consistent solution. In natural language processing, we often need to construct a global, coherent analysis of a sentence, such as its corresponding part-of-speech sequence, parse tree, or translation into another language. In computational biology, we analyze genetic sequences to predict 3D structure of proteins, find global alignment of related DNA strings, and recognize functional portions of a genome. In computer vision, we segment complex objects in cluttered scenes, reconstruct 3D shapes from stereo and video, and track motion of articulated bodies.

Many prediction tasks are modeled by combinatorial optimization problems; for example, alignment of 2D shapes using weighted bipartite point matching (Belongie et al., 2002), disulfide connectivity prediction using weighted non-bipartite matchings (Baldi et al., 2004), clustering using spanning trees and graph cuts (Duda et al., 2000), and other combinatorial and graph structures. We define a structured model very broadly, as a scoring scheme over a set of combinatorial structures and a method of finding the highest scoring structure. The score of a model is a function of the weights of vertices, edges, or other parts of a structure; these weights are often represented as parametric functions of a set of input features. The focus of this paper is the task of learning this parametric scoring function. Our training data consists of instances labeled by a desired combinatorial structure (matching, cut, tree, etc.) and a set of input features to be used to parameterize the scoring function. Informally, the goal is to find parameters of the scoring function such that the highest scoring structures are as close as possible to the desired structures on the training instances.

Following the lines of the recent work on maximum margin estimation for probabilistic models (Collins, 2002; Altun et al., 2003; Taskar et al., 2003), we present a discriminative estimation framework for structured models based on the large margin principle underlying support vector machines. The large-margin criterion provides an alternative to probabilistic, likelihood-based estimation methods by concentrating directly on the robustness of the decision boundary of a model. Our framework defines a suite of efficient learning algorithms that rely on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured models. We present extensive experiments on disulfide connectivity in protein structure prediction showing superior performance to state-of-the-art methods.

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

2. Structured models

As a particularly simple and relevant example, consider modeling the task of assigning reviewers to papers as a maximum weight bipartite matching problem, where the weights represent the "expertise" of each reviewer for each paper. More specifically, suppose we would like to have R reviewers per paper, and that each reviewer be assigned at most P papers. For each paper and reviewer, we have a score s_{jk} indicating the qualification level of reviewer j for evaluating paper k. Our objective is to find an assignment of reviewers to papers that maximizes the total weight.

We represent a matching using a set of binary variables y_{jk} that are set to 1 if reviewer j is assigned to paper k, and 0 otherwise. The score of an assignment is the sum of edge scores: s(y) = Σ_{jk} s_{jk} y_{jk}. We define Y to be the set of bipartite matchings for a given number of papers, reviewers, R and P. The maximum weight bipartite matching problem, arg max_{y ∈ Y} s(y), can be solved using a combinatorial algorithm or the following linear program:

    max Σ_{jk} s_{jk} y_{jk}    (1)
    s.t. Σ_j y_{jk} = R,  Σ_k y_{jk} ≤ P,  0 ≤ y_{jk} ≤ 1.

This LP is guaranteed to have integral (0/1) solutions (as long as P and R are integers) for any scoring function s(y) (Schrijver, 2003).

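To make the LP above concrete, the following sketch (ours, not from the paper) solves the reviewer-assignment LP of Eq. (1) with an off-the-shelf solver; the problem sizes and scores are made-up placeholders.

    # Minimal sketch: solve the reviewer-assignment LP of Eq. (1) with scipy.
    # The sizes and scores below are made up for illustration.
    import numpy as np
    from scipy.optimize import linprog

    n_reviewers, n_papers = 4, 3
    R, P = 2, 2                                  # R reviewers per paper, at most P papers per reviewer
    s = np.random.rand(n_reviewers, n_papers)    # s[j, k]: "expertise" of reviewer j for paper k

    c = -s.ravel()                               # linprog minimizes, so negate the scores
    A_eq = np.zeros((n_papers, n_reviewers * n_papers))
    for k in range(n_papers):                    # sum_j y[j, k] = R for every paper k
        A_eq[k, k::n_papers] = 1.0
    A_ub = np.zeros((n_reviewers, n_reviewers * n_papers))
    for j in range(n_reviewers):                 # sum_k y[j, k] <= P for every reviewer j
        A_ub[j, j * n_papers:(j + 1) * n_papers] = 1.0

    res = linprog(c, A_ub=A_ub, b_ub=np.full(n_reviewers, P),
                  A_eq=A_eq, b_eq=np.full(n_papers, R), bounds=(0, 1))
    y = res.x.reshape(n_reviewers, n_papers)     # integral (0/1) by the guarantee above
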
The quality of the assignment found depends critically on the choice of weights that define the objective. A simple scheme could measure the "expertise" as the percent of word overlap in the reviewer's home page and the paper's abstract. However, we would want to weight certain words more heavily (words that are relevant to the subject and infrequent). Constructing and tuning the weights for a problem is a difficult and time-consuming process to perform by hand.

Let webpage(j) denote the bag of words occurring in the home page of reviewer j and abstract(k) denote the bag of words occurring in the abstract of paper k. Then let x_{jk} denote the intersection webpage(j) ∩ abstract(k). We can let the score s_{jk} be simply s_{jk} = Σ_d w_d 1I(word_d ∈ x_{jk}), a weighted combination of overlapping words (where 1I(·) is the indicator function). Define f_d(x_{jk}) = 1I(word_d ∈ x_{jk}) and f_d(x, y) = Σ_{jk} y_{jk} 1I(word_d ∈ x_{jk}), the number of times word d was in both the web page of a reviewer and the abstract of the paper that were matched in y. We can represent the objective s(y) as a weighted combination of a set of features, w^T f(x, y), where w is the set of parameters and f(x, y) is the set of features. We develop a large margin framework for learning the parameters of such a model from training data, in our example, paper-reviewer assignments from previous years.

In general, we consider prediction problems in which the input x ∈ X is an arbitrary structured object and the output is a vector of values y = (y_1, ..., y_{L_x}), for example, a matching or a cut in the graph. We assume that the length L_x and the structure of y depend deterministically on the input x. In our bipartite matching example, the output space is defined by the number of papers and reviewers as well as R and P. Denote the output space for a given input x as Y(x) and the entire output space as Y = ∪_{x ∈ X} Y(x). We assume that we can define the output space for a structured example x using a set of constraint functions g_d(x, y) : X × Y → IR such that Y(x) = {y : g(x, y) ≤ 0}.

The class of structured prediction models H we consider is the linear family:

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w^T f(x, y),    (2)

where f(x, y) is a vector of functions f : X × Y → IR^n. This formulation is very general; clearly, for many f, g pairs, finding the optimal y is intractable. We focus our attention on models where this optimization problem can be solved in polynomial time. Such problems include: probabilistic models such as certain types of Markov networks and context-free grammars; combinatorial optimization problems such as min-cut and matching; and convex optimization problems such as linear, quadratic and semi-definite programming. In intractable cases, such as certain types of matching and graph cuts, we can use an approximate polynomial time optimization procedure that provides upper/lower bounds on the solution.

3. Max-margin estimation

Our input consists of a set of training instances S = {(x^(i), y^(i))}_{i=1..m}, where each instance consists of a structured object x^(i) (such as a graph) and a target solution y^(i) (such as a matching). We develop methods for finding parameters w such that:

    arg max_{y ∈ Y^(i)} w^T f(x^(i), y) ≈ y^(i),  ∀i,

where Y^(i) = {y : g(x^(i), y) ≤ 0}. Note that the solution space Y^(i) depends on the structured object x^(i); for example, the space of possible matchings depends on the precise set of nodes and edges in the graph, as well as on the parameters R and P.

We describe two general approaches to solving this problem, and apply them to bipartite and non-bipartite matchings. Both of these approaches define a convex optimization problem for finding such parameters w. This formulation provides an effective and exact polynomial-time algorithm for many variants of this problem. Moreover, the reduction of this task to a standard convex optimization problem allows the use of highly optimized off-the-shelf software.

Our framework extends the max-margin formulation for Markov networks (Taskar et al., 2003; Taskar et al., 2004a) and context-free grammars (Taskar et al., 2004b), and is similar to other formulations (Altun et al., 2003; Tsochantaridis et al., 2004).

As in univariate prediction, we measure the error of prediction using a loss function ℓ(y^(i), y). In structured problems, where we are jointly predicting multiple variables, the loss is often not just the simple 0-1 loss. For structured prediction, a natural loss function is a kind of Hamming distance between y^(i) and h(x^(i)): the number of variables predicted incorrectly.

We can express the requirement that the true structure y^(i) is the optimal solution with respect to w for each instance i as:

    min (1/2) ||w||^2    (3)
    s.t. w^T f_i(y^(i)) ≥ w^T f_i(y) + ℓ_i(y),  ∀i, ∀y ∈ Y^(i),

where ℓ_i(y) = ℓ(y^(i), y) and f_i(y) = f(x^(i), y).

We can interpret (1/||w||) w^T [f_i(y^(i)) − f_i(y)] as the margin of y^(i) over another y ∈ Y^(i). The constraints enforce w^T f_i(y^(i)) − w^T f_i(y) ≥ ℓ_i(y), so minimizing ||w|| maximizes the smallest such margin, scaled by the loss ℓ_i(y). We assume for simplicity that the features are rich enough to satisfy the constraints, which is analogous to the separable case formulation in SVMs. We can add slack variables to deal with the non-separable case; we omit details for lack of space.

The problem with Eq. (3) is that it has Σ_i |Y^(i)| linear constraints, which is generally exponential in L_i, the number of variables in y^(i). We present two equivalent formulations that avoid this exponential blow-up for several important structured models.

4. Min-max formulation

We can rewrite Eq. (3) by substituting a single max constraint for each i instead of |Y^(i)| linear ones:

    min (1/2) ||w||^2    (4)
    s.t. w^T f_i(y^(i)) ≥ max_{y ∈ Y^(i)} [w^T f_i(y) + ℓ_i(y)],  ∀i.

The above formulation is a convex quadratic program in w, since max_{y ∈ Y^(i)} [w^T f_i(y) + ℓ_i(y)] is convex in w (a maximum of affine functions is a convex function).

The key to solving Eq. (4) efficiently is the loss-augmented inference max_{y ∈ Y^(i)} [w^T f_i(y) + ℓ_i(y)]. This optimization problem has precisely the same form as the prediction problem whose parameters we are trying to learn, max_{y ∈ Y^(i)} w^T f_i(y), but with an additional term corresponding to the loss function. Even if max_{y ∈ Y^(i)} w^T f_i(y) can be solved in polynomial time using convex optimization, tractability of the loss-augmented inference also depends on the form of the loss term ℓ_i(y). In this paper, we assume that the loss function decomposes over the variables in y^(i). A natural example of such a loss function is the Hamming distance, which simply counts the number of variables in which a candidate solution y differs from the target output y^(i).

Assume that we can reformulate loss-augmented inference as a convex optimization problem in terms of a set of variables μ_i, with an objective f̃_i(w, μ_i) concave in μ_i and subject to convex constraints g̃_i(μ_i):

    max_{y ∈ Y^(i)} [w^T f_i(y) + ℓ_i(y)] = max_{μ_i : g̃_i(μ_i) ≤ 0} f̃_i(w, μ_i).    (5)

We call such a formulation concise if the number of variables μ_i and constraints g̃_i(μ_i) is polynomial in L_i, the number of variables in y^(i). Note that max_{μ_i : g̃_i(μ_i) ≤ 0} f̃_i(w, μ_i) must be convex in w, as Eq. (4) is. Likewise, we can assume that it is feasible and bounded if Eq. (4) is.

For example, the Hamming loss for bipartite matchings counts the number of different edges in the matchings y and y^(i):

    ℓ^H_i(y) = Σ_{jk} 1I(y_{jk} ≠ y^(i)_{jk}) = R N_p^(i) − Σ_{jk} y_{jk} y^(i)_{jk},

where the last equality follows from the fact that any valid matching for training example i has R reviewers for N_p^(i) papers. Thus, the loss-augmented matching problem can then be written as an LP in μ_i similar to Eq. (1) (without the constant term R N_p^(i)):

    max Σ_{jk} μ_{i,jk} [w^T f(x^(i)_{jk}) − y^(i)_{jk}]
    s.t. Σ_j μ_{i,jk} = R,  Σ_k μ_{i,jk} ≤ P,  0 ≤ μ_{i,jk} ≤ 1.

In terms of Eq. (5), f̃_i and g̃_i are affine in μ_i: f̃_i(w, μ_i) = R N_p^(i) + Σ_{jk} μ_{i,jk} [w^T f(x^(i)_{jk}) − y^(i)_{jk}] and g̃_i(μ_i) ≤ 0 ⇔ {Σ_j μ_{i,jk} = R, Σ_k μ_{i,jk} ≤ P, 0 ≤ μ_{i,jk} ≤ 1}.

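As an illustration (ours, not the paper's), the loss-augmented matching LP only changes the edge scores of Eq. (1); a hypothetical wrapper solve_matching_lp around the LP construction shown in Section 2 could then be reused unchanged.

    # Sketch: loss-augmented edge scores for the Hamming loss, to be fed into the
    # same matching LP as Eq. (1). `solve_matching_lp` is a hypothetical helper
    # wrapping the linprog construction shown earlier.
    import numpy as np

    def loss_augmented_scores(w, feats, y_true):
        """feats[j, k] is the feature vector f(x_jk); y_true[j, k] is 0 or 1."""
        model_scores = np.einsum('jkd,d->jk', feats, w)   # w^T f(x_jk) for every edge
        return model_scores - y_true                      # minus y_jk^(i); constant R*Np dropped

    # y_loss_augmented = solve_matching_lp(loss_augmented_scores(w, feats, y_true), R, P)
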
Generally, when we can express max_{y ∈ Y^(i)} w^T f(x^(i), y) as an LP, we have a similar LP for the loss-augmented inference:

    d_i + max (F_i w + c_i)^T μ_i   s.t. A_i μ_i ≤ b_i, μ_i ≥ 0,    (6)

for appropriately defined d_i, F_i, c_i, A_i, b_i, which depend only on x^(i), y^(i), f(x, y) and g(x, y). Note that w appears only in the objective.

In cases (such as these) where we can formulate the loss-augmented inference as a convex optimization problem as in Eq. (4), and this formulation is concise, we can use Lagrangian duality (see (Boyd & Vandenberghe, 2004) for an excellent review) to define a joint, concise convex problem for estimating the parameters w. The Lagrangian associated with Eq. (4) is

    L_{i,w}(μ_i, λ_i) = f̃_i(w, μ_i) − λ_i^T g̃_i(μ_i),    (7)

where λ_i ≥ 0 is a vector of Lagrange multipliers, one for each constraint function in g̃_i(μ_i). Since we assume that f̃_i(w, μ_i) is concave in μ_i and bounded on the non-empty set {μ_i : g̃_i(μ_i) ≤ 0}, we have strong duality:

    max_{μ_i : g̃_i(μ_i) ≤ 0} f̃_i(w, μ_i) = min_{λ_i ≥ 0} max_{μ_i} L_{i,w}(μ_i, λ_i).

For many forms of f̃ and g̃, we can write the Lagrangian dual min_{λ_i ≥ 0} max_{μ_i} L_{i,w}(μ_i, λ_i) explicitly as:

    min h_i(w, λ_i)   s.t. q_i(w, λ_i) ≤ 0,    (8)

where h_i(w, λ_i) and q_i(w, λ_i) are convex in both w and λ_i. (We folded λ_i ≥ 0 into q_i(w, λ_i) for brevity.) Since the original problem had polynomial size, the dual is polynomial size as well. For example, the dual of the LP in Eq. (6) is

    d_i + min b_i^T λ_i   s.t. A_i^T λ_i ≥ F_i w + c_i, λ_i ≥ 0,    (9)

where h_i(w, λ_i) = d_i + b_i^T λ_i and q_i(w, λ_i) ≤ 0 is {F_i w + c_i − A_i^T λ_i ≤ 0, −λ_i ≤ 0}.

Plugging Eq. (8) into Eq. (4), we get

    min (1/2) ||w||^2    (10)
    s.t. w^T f_i(y^(i)) ≥ min_{q_i(w, λ_i) ≤ 0} h_i(w, λ_i),  ∀i.

We can now combine the minimization over λ with the minimization over w. The reason for this is that if the right hand side is not at the minimum, the constraint is tighter than necessary, leading to a suboptimal solution w. Optimizing jointly over λ as well will produce a solution w that is optimal:

    min (1/2) ||w||^2    (11)
    s.t. w^T f_i(y^(i)) ≥ h_i(w, λ_i), ∀i;   q_i(w, λ_i) ≤ 0, ∀i.

Hence we have a joint and concise convex optimization program for estimating w. The exact form of this program depends strongly on f̃ and g̃. For our LP-based example, we have a QP with linear constraints:

    min (1/2) ||w||^2    (12)
    s.t. w^T f_i(y^(i)) ≥ d_i + b_i^T λ_i, ∀i;   A_i^T λ_i ≥ F_i w + c_i, ∀i;   λ_i ≥ 0, ∀i.

This formulation generalizes the idea used by Taskar et al. (2004a) to provide a polynomial time estimation procedure for a certain family of Markov networks; it can also be used to derive the max-margin formulations of Taskar et al. (2003); Taskar et al. (2004b).

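As a sketch of what Eq. (12) looks like in code (our illustration, assuming the per-example data d_i, F_i, c_i, A_i, b_i and f_i(y^(i)) have already been assembled as NumPy arrays), the joint QP can be written down directly with a modeling tool such as CVXPY:

    # Sketch of the min-max QP of Eq. (12). `examples` is a hypothetical list of
    # tuples (d_i, F_i, c_i, A_i, b_i, f_target_i) built from the training data.
    import cvxpy as cp

    def minmax_qp(examples, n_features):
        w = cp.Variable(n_features)
        constraints = []
        for d_i, F_i, c_i, A_i, b_i, f_target in examples:
            lam = cp.Variable(b_i.shape[0], nonneg=True)        # one multiplier per LP constraint
            constraints.append(w @ f_target >= d_i + b_i @ lam)  # margin constraint
            constraints.append(A_i.T @ lam >= F_i @ w + c_i)     # dual feasibility
        prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        prob.solve()
        return w.value
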
5. Certificate formulation

In the previous section, we assumed a concise convex formulation of the loss-augmented max in Eq. (4). There are several important combinatorial problems which allow a polynomial time solution yet do not have a concise convex optimization formulation. For example, the maximum weight spanning tree and perfect non-bipartite matching problems can be expressed as linear programs with exponentially many constraints, but no polynomial formulation as a convex optimization problem is known (Schrijver, 2003). Both of these problems, however, can be solved in polynomial time using combinatorial algorithms. In some cases, though, we can find a concise certificate of optimality that guarantees that y^(i) = arg max_y [w^T f_i(y) + ℓ_i(y)] without expressing loss-augmented inference as a concise convex program. Intuitively, verifying that a given assignment is optimal can be easier than actually finding one.

As a simple example, consider the maximum weight spanning tree problem. A basic property of a spanning tree is that cutting any edge (j, k) in the tree creates two disconnected sets of nodes (V_j[jk], V_k[jk]), where j ∈ V_j[jk] and k ∈ V_k[jk]. A spanning tree is optimal with respect to a set of edge weights if and only if for every edge (j, k) in the tree connecting V_j[jk] and V_k[jk], the weight of (j, k) is larger than (or equal to) the weight of any other edge (j', k') in the graph with j' ∈ V_j[jk], k' ∈ V_k[jk] (Schrijver, 2003). These conditions can be expressed using linear inequality constraints on the weights.

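For instance, the cut-based optimality conditions just described can be enumerated mechanically; the sketch below (ours, with hypothetical inputs) lists, for every tree edge, the pairs of edge feature vectors whose scores must be ordered, i.e., each returned pair (a, b) stands for the linear constraint w^T a ≥ w^T b.

    # Sketch: generate the linear optimality constraints for a spanning tree.
    # `edges` and `tree` are hypothetical lists of node pairs; feat(e) returns the
    # feature vector of edge e. Each returned pair (a, b) means w^T a >= w^T b.
    def tree_optimality_constraints(edges, tree, feat):
        constraints = []
        for tree_edge in tree:
            # Removing tree_edge splits the tree; flood-fill the side containing its first endpoint.
            side = {tree_edge[0]}
            remaining = [e for e in tree if e != tree_edge]
            grew = True
            while grew:
                grew = False
                for u, v in remaining:
                    if (u in side) != (v in side):
                        side |= {u, v}
                        grew = True
            for other in edges:
                crosses = (other[0] in side) != (other[1] in side)
                if crosses and other != tree_edge:
                    # Any other edge across the same cut must not score higher.
                    constraints.append((feat(tree_edge), feat(other)))
        return constraints
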
More generally, suppose that we can find a concise convex formulation of these conditions via a polynomial (in L_i) set of functions q_i(w, ν_i), jointly convex in w and auxiliary variables ν_i, such that for fixed w, ∃ν_i : q_i(w, ν_i) ≤ 0 ⇔ ∀y ∈ Y^(i) : w^T f_i(y^(i)) ≥ w^T f_i(y) + ℓ_i(y). Then min (1/2) ||w||^2 such that q_i(w, ν_i) ≤ 0 for all i is a joint convex program in w, ν that computes the max-margin parameters.

Expressing spanning tree optimality does not require additional variables ν_i, but in other problems, such as perfect matching optimality (see below), such auxiliary variables are needed.

Consider the problem of finding a matching in a non-bipartite graph; we begin by considering perfect matchings, where each node has exactly one neighbor, and then provide a reduction for the non-perfect case (each node has at most one neighbor). Let M be a perfect matching for a complete undirected graph G = (V, E). In an alternating cycle/path in G with respect to M, the edges alternate between those that belong to M and those that do not. An alternating cycle is augmenting with respect to M if the score of the edges in the matching M is smaller than the score of the edges not in the matching M.

Theorem 5.1 (Edmonds, 1965) A perfect matching M is a maximum weight perfect matching if and only if there are no augmenting alternating cycles.

The number of alternating cycles is exponential in the number of vertices, so simply enumerating all of them is infeasible. Instead, we can rule out such cycles by considering shortest paths.

We begin by negating the score of those edges not in M. In the discussion below we assume that each edge score s_{jk} has been modified this way. We also refer to the score s_{jk} as the length of the edge jk. An alternating cycle is augmenting if and only if its length is negative. A condition ruling out negative length alternating cycles can be stated succinctly using a type of distance function. Pick an arbitrary root node r. Let d^e_j, with j ∈ V, e ∈ {0, 1}, denote the length of the shortest alternating path from r to j, where e = 1 if the last edge of the path is in M, and 0 otherwise. These shortest distances are well-defined if and only if there are no negative alternating cycles. The following constraints capture this distance function:

    s_{jk} ≥ d^0_k − d^1_j,   s_{jk} ≥ d^0_j − d^1_k,   ∀ jk ∉ M;    (13)
    s_{jk} ≥ d^1_k − d^0_j,   s_{jk} ≥ d^1_j − d^0_k,   ∀ jk ∈ M.

Theorem 5.2 There exists a distance function {d^e_j} satisfying the constraints in Eq. (13) if and only if no augmenting alternating cycles exist.

The proof is presented in Taskar (2004), Chap. 10.

In our learning formulation, assuming Hamming loss, we have the loss-augmented edge weights s^(i)_{jk} = (2y^(i)_{jk} − 1)(w^T f(x^(i)_{jk}) + 1 − 2y^(i)_{jk}), where the (2y^(i)_{jk} − 1) term in front negates the score of edges not in the matching. Let d_i be a vector of distance variables d^e_j, let F_i and G_i be matrices of coefficients, and let q_i be a vector such that F_i w + G_i d_i ≥ q_i represents the constraints in Eq. (13) for example i. Then the following joint convex program in w and d (with d_i playing the role of ν_i) computes the max-margin parameters:

    min (1/2) ||w||^2   s.t. F_i w + G_i d_i ≥ q_i, ∀i.    (14)

If our features are not rich enough to predict the training data perfectly, we can introduce a slack variable vector to allow violations of the constraints.

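To make this concrete, here is a sketch (ours, not the paper's code) of the certificate QP of Eq. (14) written directly from the constraints of Eq. (13), assuming each training instance provides the edge feature vectors and the target perfect matching:

    # Sketch of the certificate QP of Eq. (14) for perfect non-bipartite matchings.
    # `instances` is a hypothetical list of (feat, y) pairs: feat[j][k] is the feature
    # vector of edge jk and y[j][k] is 1 iff jk is in the target matching.
    import cvxpy as cp

    def certificate_qp(instances, n_features):
        w = cp.Variable(n_features)
        constraints = []
        for feat, y in instances:
            n = len(y)                        # number of nodes
            d = cp.Variable((n, 2))           # distance variables d_j^0 and d_j^1
            for j in range(n):
                for k in range(j + 1, n):
                    # Loss-augmented, sign-flipped edge length from the text above:
                    # s_jk = (2 y_jk - 1)(w^T f(x_jk) + 1 - 2 y_jk)
                    s_jk = (2 * y[j][k] - 1) * (w @ feat[j][k] + 1 - 2 * y[j][k])
                    if y[j][k] == 1:          # edge in the matching: second line of Eq. (13)
                        constraints += [s_jk >= d[k, 1] - d[j, 0], s_jk >= d[j, 1] - d[k, 0]]
                    else:                     # edge not in the matching: first line of Eq. (13)
                        constraints += [s_jk >= d[k, 0] - d[j, 1], s_jk >= d[j, 0] - d[k, 1]]
        prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        prob.solve()
        return w.value
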
The case of non-perfect matchings can be handled by a reduction to perfect matchings as follows (Schrijver, 2003). We create a new graph by making a copy of the nodes and the edges and adding edges between each node and the corresponding node in the copy. We extend the matching by replicating its edges in the copy and, for each unmatched node, introducing an edge to its copy. We define f(x_{jk}) ≡ 0 for edges between the original and the copy. Perfect matchings in this graph projected onto the original graph correspond to non-perfect matchings in the original graph.

6. Duals and kernels

Both the min-max formulation and the certificate formulation produce primal convex optimization problems. Rather than solving them directly, we consider their dual versions. In particular, as in SVMs, we can use the kernel trick in the dual to efficiently learn in high-dimensional feature spaces.

Consider, for example, the primal problem in Eq. (14). In the dual, each training example i has L_i(L_i − 1) variables, two for each edge (α^(i)_{jk} and α^(i)_{kj}), since we have two constraints per edge in the primal. Let α^(i) be the vector of dual variables for example i. The dual quadratic problem is:

    min Σ_i q_i^T α^(i) + (1/2) || Σ_i F_i^T α^(i) ||^2
    s.t. G_i α^(i) = 0,  α^(i) ≥ 0,  ∀i.

The only occurrence of feature vectors is in the expansion of the squared-norm term in the objective:

    F_i^T α^(i) = Σ_{j<k} (α^(i)_{jk} + α^(i)_{kj})(2y^(i)_{jk} − 1) f(x^(i)_{jk}).    (15)

Therefore, we can apply the kernel trick and let f(x^(i)_{jk})^T f(x^(t)_{mn}) = K(x^(i)_{jk}, x^(t)_{mn}). At prediction time, we can also use the kernel trick to compute:

    w^T f(x_{mn}) = Σ_i Σ_{j<k} (α^(i)_{jk} + α^(i)_{kj})(2y^(i)_{jk} − 1) K(x^(i)_{jk}, x_{mn}).

The dual of min-max Eq. (12) is also kernelizable.

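As an illustration of the prediction-time expansion above (our sketch, with hypothetical data structures holding the learned dual variables):

    # Sketch: score a test edge x_mn through the kernel expansion above, without
    # forming w explicitly. `duals` is a hypothetical list of (alpha, y, feats)
    # triples, one per training example, with alpha[j][k] the learned dual variable.
    import numpy as np

    def rbf_kernel(x, z, sigma=1.5):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def edge_score(x_mn, duals, kernel=rbf_kernel):
        score = 0.0
        for alpha, y, feats in duals:
            n = len(y)
            for j in range(n):
                for k in range(j + 1, n):
                    score += (alpha[j][k] + alpha[k][j]) * (2 * y[j][k] - 1) * kernel(feats[j][k], x_mn)
        return score
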
[Figure 1. PDB protein 1ANS: amino acid sequence (RSCCPCYWGGCPWGQNCYPEGCSGPKV, cysteines numbered 1-6), 3D structure, and graph of potential disulfide bonds. Actual disulfide connectivity is shown in yellow in the 3D model and the graph of potential bonds.]

7. Experiments

We apply our framework to the task of disulfide connectivity prediction. Proteins containing cysteine residues form intra-chain covalent bonds known as disulfide bridges. Such bonds are a very important feature of protein structure, as they enhance conformational stability by reducing the number of configurational states and decreasing the entropic cost of folding a protein into its native state (Baldi et al., 2004). Knowledge of the exact disulfide bonding pattern in a protein provides information about protein structure and possibly its function and evolution. Furthermore, as the disulfide connectivity pattern imposes structural constraints, it can be used to reduce the search space in both protein folding prediction as well as protein 3D structure prediction. Thus, the development of efficient, scalable and accurate methods for the prediction of disulfide bonds has important practical applications.

Recently, there has been increasing interest in applying computational techniques to the task of predicting disulfide connectivity (Fariselli & Casadio, 2001; Baldi et al., 2004). Following the lines of Fariselli and Casadio (2001), we predict the connectivity pattern by finding the maximum weighted matching in a graph in which each vertex represents a cysteine residue, and each edge represents the "attraction strength" between the cysteines it connects. For example, the 1ANS protein in Fig. 1 has six cysteines and three disulfide bonds. We parameterize the attraction scoring function via a linear combination of features, which include the protein sequence around the two residues, evolutionary information in the form of multiple alignment profiles, secondary structure or solvent accessibility information, etc. We then learn the weights of the different features using a set of solved protein structures, in which the disulfide connectivity patterns are known.

Datasets. We used two datasets containing sequences with experimentally verified bonding patterns: SP39 (used in Baldi et al. (2004); Vullo and Frasconi (2004); Fariselli and Casadio (2001)) and DIPRO2 [1].

We report results for two experimental settings: when the bonding state is known (that is, for each cysteine, we know whether it bonds with some other cysteine) and when it is unknown. The known state setting corresponds to perfect non-bipartite matchings, while the unknown state corresponds to imperfect matchings. For the case when bonding state is known, in order to compare to previous work, we focus on proteins containing between 2 and 5 bonds, which covers the entire SP39 dataset and over 90% of the DIPRO2 dataset.

In order to avoid biases during testing, we adopt the same dataset splitting procedure as the one used in previous work (Fariselli & Casadio, 2001; Vullo & Frasconi, 2004; Baldi et al., 2004) [2].

Models. The experimental results we report use the dual formulation of Eq. (15) and an RBF kernel K(x_{jk}, x_{st}) = exp(−||x_{jk} − x_{st}||^2 / (2σ^2)), with σ ∈ [1, 2]. We used commercial QP software (CPLEX) to train our models. Training took around 70 minutes for 450 examples.

The set of features x_{jk} for a candidate cysteine pair jk is based on local windows of size n centered around each cysteine (we use n = 9). The simplest approach is to encode, for each window, the actual sequence as a 20 × n binary vector, in which the entries denote whether or not a particular amino acid occurs at the particular position. However, a common strategy is to compensate for the sparseness of these features by using multiple sequence alignment profile information. Multiple alignments were computed by running PSI-BLAST using default settings to align the sequence with all sequences in the NR database (Altschul et al., 1997). Thus, the input features for the first model, PROFILE, consist of the proportions of occurrence of each of the 20 amino acids in the alignments at each position in the local windows.

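A sketch of the simple window encoding described above (ours; the amino-acid ordering and function names are illustrative assumptions, and the PROFILE model replaces the one-hot entries with PSI-BLAST profile proportions):

    # Sketch: 20 x n one-hot window encoding for a candidate cysteine pair.
    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"              # 20 standard residues (assumed ordering)
    AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

    def window_features(sequence, center, n=9):
        """One-hot encode the length-n window centered at `center` (20 x n entries)."""
        half = n // 2
        x = np.zeros((n, 20))
        for offset in range(-half, half + 1):
            pos = center + offset
            if 0 <= pos < len(sequence) and sequence[pos] in AA_INDEX:
                x[offset + half, AA_INDEX[sequence[pos]]] = 1.0
        return x.ravel()

    def pair_features(sequence, cys_j, cys_k, n=9):
        """Feature vector x_jk for the candidate cysteine pair at positions cys_j, cys_k."""
        return np.concatenate([window_features(sequence, cys_j, n),
                               window_features(sequence, cys_k, n)])
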
The second model, PROFILE-SS, augments the PROFILE model with secondary structure and solvent-accessibility information.

[1] The DIPRO2 dataset was made publicly available by Baldi et al. (2004). It consists of 1018 non-redundant proteins from PDB (Berman et al., 2000) as of May 2004 which contain intra-chain disulfide bonds. The sequences are annotated with secondary structure and solvent accessibility information from DSSP (Kabsch & Sander, 1983).

[2] We split SP39 into 4 different subsets, with the constraint that proteins with sequence similarity of more than 30% belong to different subsets. We ran all-against-all rigorous Smith-Waterman local pairwise alignment (using BLOSUM65, with gap penalty/extension 12/4) and considered pairs with less than 30 aligned residues as dissimilar.

         (a) SP39, known state            (b) DIPRO2, known state     (c) SP39, unknown state
K        SVM        PROFILE    DAG-RNN    PROFILE    PROFILE-SS       PROFILE          DAG-RNN
2        0.63/0.63  0.77/0.77  0.74/0.74  0.76/0.76  0.78/0.78        0.57/0.59/0.44   0.49/0.59/0.40
3        0.51/0.38  0.62/0.52  0.61/0.51  0.67/0.59  0.74/0.65        0.48/0.52/0.28   0.45/0.50/0.32
4        0.34/0.12  0.51/0.36  0.44/0.27  0.59/0.42  0.64/0.49        0.39/0.40/0.14   0.37/0.36/0.15
5        0.31/0.07  0.43/0.13  0.41/0.11  0.39/0.13  0.46/0.22        0.31/0.33/0.07   0.31/0.28/0.03

Table 1. Numbers indicate Precision/Accuracy for the known bonded state setting in (a) and (b), and Precision/Recall/Accuracy for the unknown bonded state setting in (c). K denotes the true number of bonds. (a) PROFILE vs. SVM and the DAG-RNN model (Baldi et al., 2004) on SP39. (b) PROFILE vs. PROFILE-SS on the DIPRO2 dataset. (c) PROFILE vs. DAG-RNN on SP39 (unknown state).

The DSSP program produces 8 types of secondary structure, so we augment each local window of size n with an additional length 8 × n binary vector, as well as a length n binary vector representing the solvent accessibility at each position. Because DSSP utilizes true 3D structure information in assigning secondary structure, the model cannot be used for unknown proteins. Rather, its performance is useful as an upper bound of the potential prediction improvement using features derived via accurate secondary structure prediction algorithms.

Known Bonded State. When bonding state is known, we evaluate our algorithm using two metrics: accuracy measures the fraction of entire matchings predicted correctly, and precision measures the fraction of correctly predicted individual bonds.

We apply the model to the SP39 dataset using 4-fold cross-validation, which replicates the experimental setup of Baldi et al. (2004); Vullo and Frasconi (2004). First, we evaluate the advantage that our learning formulation provides over a baseline approach, where we use the features of the PROFILE model, but ignore the constraints in the learning phase and simply learn w using an SVM by labeling bonded pairs as positive examples and non-bonded pairs as negative examples. We then use these SVM-learned weights to score the edges. The results are summarized in Table 1(a) in the column SVM. The PROFILE model uses our certificate-based formulation, directly incorporating the matching constraint in the learning phase. It achieves significant gains over SVM for all bond numbers, illustrating the importance of explicitly modelling structure during parameter estimation.

We also compare our model to the DAG-RNN model of Baldi et al. (2004), the current top-performing system, which uses recursive neural networks with the same set of input features. Our performance is better for all bond numbers.

As a final experiment, we examine the role that secondary structure and solvent-accessibility information plays in the model PROFILE-SS. Table 1(b) shows that the gains are significant, especially for sequences with 3 and 4 bonds. This highlights the importance of developing even richer features, perhaps through more complex kernels.

Unknown Bonded State. We also evaluated our learning formulation for the case when bonding state is unknown, using the graph-copy reduction described in Sec. 5. Here, we use an additional evaluation metric: recall, which measures correctly predicted bonds as a fraction of the total number of bonds. We apply the model to the SP39 dataset for proteins of 2-5 bonds, without assuming knowledge of bonded state. Since some sequences contain over 50 cysteines, for computational efficiency during training, we only take up to 14 bonded cysteines by including the ones that participate in bonds and randomly selecting additional cysteines from the protein (if available). During testing, we used all the cysteines.

The results are summarized in Table 1(c), with a comparison to the DAG-RNN model, which is currently the only other model to tackle this more challenging setting. Our results compare favorably, having better precision and recall for all bond numbers, but our connectivity pattern accuracy is slightly lower for 3 and 4 bonds.

8. Discussion and Conclusion

We present a framework for learning a wide range of structured prediction models where the set of outcomes is a class of combinatorial structures such as matchings and graph-cuts. We present two formulations of structured max-margin estimation that define a concise convex optimization problem. The first formulation, min-max, relies on the ability to express inference in a model as a concise convex optimization problem. The second one, certificate, only requires expressing optimality of a desired solution according to a model. We illustrate how to apply these formulations to the problem of learning to match, with bipartite and non-bipartite matchings. These formulations can also be applied to min-cuts, max-flows, trees, colorings and many other combinatorial problems.

Beyond the closely related large-margin methods for probabilistic models, our max-margin formulation has ties to a body of work called inverse combinatorial and convex optimization (for a recent survey, see Heuberger (2004)). An inverse optimization problem is defined by an instance of an optimization problem max_{y ∈ Y} w^T f(y), a set of nominal weights w_0, and a target solution y_t. The goal is to find the weights w closest to the nominal w_0 in some p-norm, which make the target solution optimal:

    min ||w − w_0||_p   s.t. w^T f(y_t) ≥ w^T f(y), ∀y ∈ Y.

The solution w depends critically on the choice of nominal weights; for example, if w_0 = 0, then w = 0 is trivially the optimal solution. In our approach, we have a very different goal, of learning a parameterized objective function that depends on the input x and will generalize well in prediction on new instances.

Our approach attempts to find weights that obtain a particular target solution for each training instance. While a unique solution is a reasonable assumption in some domains (e.g., disulfide bonding), in others there may be several "equally good" target solutions. It would be interesting to extend our approach to accommodate multiple target solutions.

The estimation problem is tractable and exact whenever the prediction problem can be formulated as a concise convex optimization problem or a polynomial time combinatorial algorithm with concise convex optimality conditions. For intractable models (e.g., tripartite matching, quadratic assignment, max-cut, etc.), we can use our framework to learn approximate parameters by exploiting approximations that only provide upper/lower bounds on the optimal structure score or provide a certificate of optimality in a large neighborhood around the desired structure.

Using off-the-shelf convex optimization code for our learning formulation is convenient and flexible, but it is likely that problem-specific methods that use combinatorial subroutines will outperform generic solvers. Design of such algorithms is an open problem.

We have addressed estimation of models with discrete output spaces. Similarly, we can consider a whole range of problems where the prediction variables are continuous. Such problems are natural generalizations of regression, involving correlated, inter-constrained real-valued outputs. Examples include learning models of metabolic flux in cells, which obeys stoichiometric constraints, and learning game payoff matrices from observed equilibrium strategies.

Our learning framework addresses a large class of prediction models with rich and interesting structure. We hope that continued research in this vein will help tackle ever more sophisticated prediction problems.

Acknowledgements. This work was supported by NSF grant DBI-0345474 and by DARPA's EPCA program under subcontract to SRI.

References

Altschul, S., Madden, T., Schaffer, A., Zhang, A., Miller, W., & Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. Proc. ICML.

Baldi, P., Cheng, J., & Vullo, A. (2004). Large-scale prediction of disulphide bond connectivity. Proc. NIPS.

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24.

Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., & Bourne, P. (2000). The Protein Data Bank.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. Proc. EMNLP.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley Interscience. 2nd edition.

Edmonds, J. (1965). Maximum matching and a polyhedron with 0-1 vertices. Journal of Research of the National Bureau of Standards, 69B, 125-130.

Fariselli, P., & Casadio, R. (2001). Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957-964.

Heuberger, C. (2004). Inverse combinatorial optimization: A survey on problems, methods, and results. Journal of Combinatorial Optimization, 8.

Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features.

Schrijver, A. (2003). Combinatorial optimization: Polyhedra and efficiency. Springer.

Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.

Taskar, B., Chatalbashev, V., & Koller, D. (2004a). Learning associative Markov networks. Proc. ICML.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. Proc. NIPS.

Taskar, B., Klein, D., Collins, M., Koller, D., & Manning, C. (2004b). Max-margin parsing. Proc. EMNLP.

Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Proc. ICML.

Vullo, A., & Frasconi, P. (2004). Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653-659.