Graph Kernels
S.V. N. Vishwanathan VISHY@STAT.PURDUE.EDU
Departments of Statistics and Computer Science
Purdue University
250 N University Street, West Lafayette, IN 47907-2066, USA
Nicol N. Schraudolph JMLR@SCHRAUDOLPH.ORG
adaptive tools AG
Canberra ACT 2602, Australia
Risi Kondor RISI@CALTECH.EDU
Center for the Mathematics of Information
California Institute of Technology
1200 E. California Blvd., MC 305-16, Pasadena, CA 91125, USA
Karsten M. Borgwardt KARSTEN.BORGWARDT@TUEBINGEN.MPG.DE
Interdepartmental Bioinformatics Group
Max Planck Institute for Developmental Biology
Figure 1: Left: Structure of E. coli protein fragment APO-BCCP87 (Yao et al., 1997), ID 1a6x
in the Protein Data Bank (Berman et al., 2000). Right: Borgwardt et al.'s (2005) graph
representation for this protein fragment. Nodes represent secondary structure elements,
and edges encode neighborhood along the amino acid chain (solid) resp. in Euclidean 3D
space (dashed).
(Washio and Motoda, 2003), and social networks (Kumar et al., 2006) involves the study of relationships between structured objects. Graphs are natural data structures to model such structures, with nodes representing objects and edges the relations between them. In this context, one often encounters two questions: "How similar are two nodes in a given graph?" and "How similar are two graphs to each other?"
In protein function prediction, for instance, one might want to predict whether a given protein is
an enzyme or not. Computational approaches infer protein function by finding proteins with similar
sequence, structure, or chemical properties. A very successful recent method is to model the protein
as a graph (see Figure 1), and assign similar functions to similar graphs (Borgwardt et al., 2005).
In Section 5.2 we compute graph kernels to measure the similarity between proteins and enzymes
represented in this fashion.
Another application featured in Section 5.2 involves predicting the toxicity of chemical molecules
by comparing their three-dimensional structure. Here the molecular structure is modeled as a graph,
and the challenge is to compute the similarity between molecules of known and unknown toxicity.
Finally, consider the task of finding web pages with related content. Since documents on the
web link to each other, one can model each web page as the node of a graph, and each link as
an edge. Now the problem becomes that of computing similarities between the nodes of a graph.
Taking this one step further, detecting mirrored sets of web pages requires computing the similarity
between the graphs representing them.
Kernel methods (Schölkopf and Smola, 2002) offer a natural framework to study these questions. Roughly speaking, a kernel k(x, x') is a measure of similarity between objects x and x'. It must satisfy two mathematical requirements: it must be symmetric, that is, k(x, x') = k(x', x), and positive semi-definite (p.s.d.). Comparing nodes in a graph involves constructing a kernel between nodes, while comparing graphs involves constructing a kernel between graphs. In both cases, the challenge is to define a kernel that captures the semantics inherent in the graph structure and is reasonably efficient to evaluate.
The idea of constructing kernels on graphs (i.e., between the nodes of a single graph) was first proposed by Kondor and Lafferty (2002), and extended by Smola and Kondor (2003). In contrast, in this paper we focus on kernels between graphs. The first such kernels were proposed by Gärtner et al. (2003) and later extended by Borgwardt et al. (2005). Much at the same time, the idea of marginalized kernels (Tsuda et al., 2002) was extended to graphs by Kashima et al. (2003, 2004), then further refined by Mahé et al. (2004). Another algebraic approach to graph kernels has appeared recently (Kondor and Borgwardt, 2008). A seemingly independent line of research investigates the so-called rational kernels, which are kernels between finite state automata based on the algebra of abstract semirings (Cortes et al., 2002, 2003, 2004).
The aim of this paper is twofold: on the one hand we present theoretical results showing that all the above graph kernels are in fact closely related; on the other hand we present new algorithms for efficiently computing such kernels. We begin by establishing some notation and reviewing pertinent concepts from linear algebra and graph theory.
1.1 Paper Outline
The first part of this paper (Sections 2-5) elaborates and updates a conference publication of Vishwanathan et al. (2007) to present a unifying framework for graph kernels encompassing many known kernels as special cases, and to discuss connections to yet others. After defining some basic concepts in Section 2, we describe the framework in Section 3, prove that it leads to p.s.d. kernels, and discuss the random walk and marginalized graph kernels as special cases. For ease of exposition we will work with real matrices in the main body of the paper and relegate the RKHS extensions to Appendix A. In Section 4 we present four efficient ways to compute random walk graph kernels, namely: 1. via reduction to a Sylvester equation, 2. with a conjugate gradient solver, 3. using fixed-point iterations, and 4. via spectral decompositions. Experiments on a variety of real and synthetic data sets in Section 5 illustrate the computational advantages of our methods, which generally reduce the time complexity of kernel computation from O(n^6) to O(n^3). The experiments of Section 5.3 were previously presented at a bioinformatics symposium (Borgwardt et al., 2007).
The second part of the paper (Sections 6-7) draws further connections to existing kernels on structured objects. In Section 6 we present a simple proof that rational kernels (Cortes et al., 2002, 2003, 2004) are p.s.d., and show that specializing them to graphs yields random walk graph kernels. In Section 7 we discuss the relation between R-convolution kernels (Haussler, 1999) and various graph kernels, all of which can in fact be shown to be instances of R-convolution kernels. Extending the framework through the use of semirings does not always result in a p.s.d. kernel though; a case in point is the optimal assignment kernel of Fröhlich et al. (2006). We establish sufficient conditions for R-convolution kernels in semirings to be p.s.d., and provide a "mostly optimal assignment kernel" that is provably p.s.d. We conclude in Section 8 with an outlook and discussion.
2. Preliminaries
Here we define the basic concepts and notation from linear algebra and graph theory that will be
used in the remainder of the paper.
2.1 Linear Algebra Concepts
We use e_i to denote the i-th standard basis vector (that is, a vector of all zeros with the i-th entry set to one), e to denote a vector with all entries set to one, 0 to denote the vector of all zeros, and I to denote the identity matrix. When it is clear from the context we will not mention the dimensions of these vectors and matrices.
Definition 1 Given real matrices A ∈ R^{n×m} and B ∈ R^{p×q}, the Kronecker product A ⊗ B ∈ R^{np×mq} and column-stacking operator vec(A) ∈ R^{nm} are defined as

    A ⊗ B := [ A_11 B   A_12 B   ...   A_1m B
                 ⋮                        ⋮
               A_n1 B   A_n2 B   ...   A_nm B ],        vec(A) := [ A_{*1}
                                                                       ⋮
                                                                     A_{*m} ],

where A_{*j} denotes the j-th column of A.
The Kronecker product and vec operator are linked by the well-known property (e.g., Bernstein, 2005, Proposition 7.1.9):

    vec(ABC) = (C^⊤ ⊗ A) vec(B).    (1)

Another well-known property of the Kronecker product which we make use of is (Bernstein, 2005, Proposition 7.1.6):

    (A ⊗ B)(C ⊗ D) = AC ⊗ BD.    (2)
Finally, the Hadamard product of two real matrices A, B ∈ R^{n×m}, denoted by A ⊙ B ∈ R^{n×m}, is obtained by element-wise multiplication. It interacts with the Kronecker product via

    (A ⊗ B) ⊙ (C ⊗ D) = (A ⊙ C) ⊗ (B ⊙ D).    (3)
All the above concepts can be extended to a Reproducing Kernel Hilbert Space (RKHS); see Appendix A for details.
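The identities (1) and (2) are easy to verify numerically. The following short NumPy sketch is our own illustration (not part of the original paper); the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

def vec(M):
    # Column-stacking operator: stack the columns of M into one long vector.
    return M.reshape(-1, order="F")

# Property (1): vec(ABC) = (C^T kron A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# Property (2): (D kron E)(F kron G) = (DF) kron (EG)
D, E, F, G = (rng.standard_normal((3, 3)) for _ in range(4))
assert np.allclose(np.kron(D, E) @ np.kron(F, G), np.kron(D @ F, E @ G))
```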
2.2 Graph Concepts
A graph G consists of an ordered set of n vertices V = {v_1, v_2, ..., v_n}, and a set of directed edges E ⊆ V × V. A vertex v_i is said to be a neighbor of another vertex v_j if they are connected by an edge, that is, if (v_i, v_j) ∈ E; this is also denoted v_i ∼ v_j. We do not allow self-loops, that is, (v_i, v_i) ∉ E for any i. A walk of length k on G is a sequence of indices i_0, i_1, ..., i_k such that v_{i_{r-1}} ∼ v_{i_r} for all 1 ≤ r ≤ k. A graph is said to be strongly connected if any two pairs of vertices can be connected by a walk. In this paper we will always work with strongly connected graphs. A graph is said to be undirected if (v_i, v_j) ∈ E ⇔ (v_j, v_i) ∈ E.
In much of the following we will be dealing with weighted graphs, which are a slight generalization of the above. In a weighted graph, each edge (v_i, v_j) has an associated weight w_ij > 0 signifying its "strength". If v_i and v_j are not neighbors, then w_ij = 0. In an undirected weighted graph w_ij = w_ji.
When G is unweighted, we define its adjacency matrix as the n × n matrix Ã with Ã_ij = 1 if v_j ∼ v_i, and 0 otherwise. For weighted graphs, Ã_ij = w_ji. While some authors would call these matrices the transpose of the adjacency matrix, for our purposes the present definitions will be more convenient. For undirected graphs Ã is symmetric, and the two definitions coincide. The diagonal entries of Ã are always zero.
The adjacency matrix has a normalized cousin, defined A := Ã D^{-1}, which has the property that each of its columns sums to one, and it can therefore serve as the transition matrix for a stochastic process. Here, D is a diagonal matrix of node degrees, that is, D_ii = d_i = ∑_j Ã_ij. A random walk on G is a process generating sequences of vertices v_{i_1}, v_{i_2}, v_{i_3}, ... according to P(i_{k+1} | i_1, ..., i_k) = A_{i_{k+1}, i_k}, that is, the probability at v_{i_k} of picking v_{i_{k+1}} next is proportional to the weight of the edge (v_{i_k}, v_{i_{k+1}}).
The t-th power of A thus describes t-length walks, that is, (A^t)_ij is the probability of a transition from vertex v_j to vertex v_i via a walk of length t. If p_0 is an initial probability distribution over vertices, then the probability distribution p_t describing the location of our random walker at time t is p_t = A^t p_0. The j-th component of p_t denotes the probability of finishing a t-length walk at vertex v_j.
A random walk need not continue indefinitely; to model this, we associate every node v_{i_k} in the graph with a stopping probability q_{i_k}. Our generalized random walk graph kernels then use the overall probability of stopping after t steps, given by q^⊤ p_t. Like p_0, the vector q of stopping probabilities is a place to embed prior knowledge into the kernel design. Since p_t as a probability distribution sums to one, a uniform vector q (as might be chosen in the absence of prior knowledge) would yield the same overall stopping probability for all p_t, thus leading to a kernel that is invariant with respect to the graph structure it is meant to measure. In this case, the unnormalized adjacency matrix Ã (which simply counts random walks instead of measuring their probability) should be used instead.
Let X be a set of labels which includes the special label ζ. Every edge-labeled graph G is associated with a label matrix X ∈ X^{n×n} in which X_ij is the label of the edge (v_j, v_i) and X_ij = ζ if (v_j, v_i) ∉ E. Let H be the RKHS induced by a p.s.d. kernel κ : X × X → R, and let φ : X → H denote the corresponding feature map, which we assume maps ζ to the zero element of H. We use Φ(X) to denote the feature matrix of G (see Appendix A for details). For ease of exposition we do not consider labels on vertices here, though our results hold for that case as well. Henceforth we use the term "labeled graph" to denote an edge-labeled graph.
Two graphs G = (V, E) and G' = (V', E') are isomorphic if there exists a bijective mapping g : V → V' (the isomorphism function) such that (v_i, v_j) ∈ E iff (g(v_i), g(v_j)) ∈ E'.
3. Random Walk Graph Kernels
Our generalized random walk graph kernels are based on a simple idea: given a pair of graphs,
perform random walks on both, and count the number of matching walks. We show that this simple
concept underlies both random walk and marginalized graph kernels. In order to do this, we first
need to introduce direct product graphs.
3.1 Direct Product Graphs
Given two graphs G(V, E) and G'(V', E'), their direct product G_× is a graph with vertex set

    V_× = {(v_i, v'_r) : v_i ∈ V, v'_r ∈ V'},    (4)

and edge set

    E_× = {((v_i, v'_r), (v_j, v'_s)) : (v_i, v_j) ∈ E ∧ (v'_r, v'_s) ∈ E'}.    (5)
Figure 2: Two graphs (top left & right) and their direct product (bottom). Each node of the direct product graph is labeled with a pair of nodes (4); an edge exists in the direct product if and only if the corresponding nodes are adjacent in both original graphs (5). For instance, nodes 11' and 32' are adjacent because there is an edge between nodes 1 and 3 in the first graph, and between 1' and 2' in the second.

Two nodes in the direct product graph are neighbors if and only if the corresponding vertices in G and G' are both neighbors. If Ã and Ã' are the adjacency matrices of G and G', then the adjacency matrix of the direct product graph G_× is Ã_× = Ã ⊗ Ã'. Similarly, A_× = A ⊗ A'.
Performing a random walk on the direct product graph is equivalent to performing a simultaneous random walk on G and G'. If p and p' denote initial probability distributions over the vertices of G and G', then the corresponding initial probability distribution on the direct product graph is p_× := p ⊗ p'. Likewise, if q and q' are stopping probabilities on G and G', then the stopping probability on the direct product graph is q_× := q ⊗ q'.
Let |V| =: n and |V'| =: n'. If G and G' are edge-labeled, we can associate a weight matrix W_× ∈ R^{nn'×nn'} with G_×, defined as

    W_× = Φ(X) ⊗ Φ(X').    (6)
As a consequence of the definition of Φ(X) and Φ(X'), the entries of W_× are non-zero only if the corresponding edge exists in the direct product graph. For unlabeled graphs one can simply set Φ(X) = Ã (resp. A) and Φ(X') = Ã' (resp. A'), and consequently W_× = Ã_× (resp. A_×).
If the edges of our graphs take on labels from a finite set, without loss of generality {1, 2, ..., d}, we can let H be R^d endowed with the usual inner product. For each edge (v_j, v_i) ∈ E we set φ(X_ij) = e_l / d_i if the edge (v_j, v_i) is labeled l; all other entries of Φ(X) are 0.¹ Thus the weight matrix (6) has a non-zero entry iff an edge exists in the direct product graph and the corresponding edges in G and G' have the same label. Denoting by ˡA the label-filtered (normalized) adjacency matrix, that is, ˡA_ij = A_ij if X_ij = l and 0 otherwise, the weight matrix of the direct product graph can be written as

    W_× = ∑_{l=1}^{d} ˡA ⊗ ˡA'.    (7)
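For edge labels from a finite set, (7) says the weight matrix is a sum of Kronecker products of label-filtered adjacency matrices. A minimal sketch (our own illustration, with hypothetical inputs) is:

```python
import numpy as np

def label_filtered(A, X, label):
    # Keep only those entries of the (normalized) adjacency matrix A whose
    # corresponding edge label in X equals `label`; zero out the rest.
    return np.where(X == label, A, 0.0)

def weight_matrix(A, X, A2, X2, labels):
    # W_x = sum over labels l of (lA) kron (lA'), Equation (7).
    n, n2 = A.shape[0], A2.shape[0]
    W = np.zeros((n * n2, n * n2))
    for l in labels:
        W += np.kron(label_filtered(A, X, l), label_filtered(A2, X2, l))
    return W
```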
In Section 4 we will develop efficient methods to compute kernels defined using the weight matrix of the direct product graph. The applicability and time complexity of a particular method will depend on whether the graphs are unlabeled though possibly edge-weighted (W_× = Ã_× resp. A_×) or labeled as in (7).
As mentioned above, performing a random walk on the direct product graph is equivalent to performing a simultaneous random walk on the graphs G and G'. Therefore, the ((i−1)n'+r, (j−1)n'+s)-th entry of A_×^k represents the probability of simultaneous length-k random walks on G (starting from vertex v_j and ending in vertex v_i) and G' (starting from vertex v'_s and ending in vertex v'_r). The entries of W_× represent similarity between edges: the ((i−1)n'+r, (j−1)n'+s)-th entry of W_×^k represents the similarity between simultaneous length-k random walks on G and G', measured via the kernel function κ. Given initial and stopping probability distributions p_× and q_×, one can compute q_×^⊤ W_×^k p_×, the expected similarity between simultaneous length-k random walks on G and G'.
To define a kernel which computes the similarity between G and G', one natural idea is to simply sum up q_×^⊤ W_×^k p_× for all values of k. However, this sum might not converge, leaving the kernel value undefined. To overcome this problem, we introduce appropriately chosen non-negative coefficients μ(k), and define the kernel between G and G' as

    k(G, G') := ∑_{k=0}^{∞} μ(k) q_×^⊤ W_×^k p_×.    (8)
This definition is very flexible and offers the kernel designer many parameters to adjust in an application-specific manner: appropriately choosing μ(k) allows one to (de-)emphasize walks of different lengths; if initial and stopping probabilities are known for a particular application, then this knowledge can be incorporated into the kernel; and finally, appropriate kernels or similarity measures between edges can be incorporated via the weight matrix W_×.

1. The technical problem that now φ(X_ij) depends on d_i can be addressed by making d_i a feature of all edges (v_j, v_i) ∈ E.

To show that (8) is a valid p.s.d. kernel we need the following technical lemma.

Lemma 2 For all k ∈ N_0, W_×^k p_× = vec[Φ(X')^k p' (Φ(X)^k p)^⊤].
Proof By induction over k. Base case: k = 0. Using (1) we find

    W_×^0 p_× = p_× = (p ⊗ p') vec(1) = vec(p' 1 p^⊤) = vec[Φ(X')^0 p' (Φ(X)^0 p)^⊤].    (9)

Induction from k to k+1: Using the induction assumption W_×^k p_× = vec[Φ(X')^k p' (Φ(X)^k p)^⊤] and Lemma 12 we obtain

    W_×^{k+1} p_× = W_× W_×^k p_× = (Φ(X) ⊗ Φ(X')) vec[Φ(X')^k p' (Φ(X)^k p)^⊤]
                  = vec[Φ(X') Φ(X')^k p' (Φ(X)^k p)^⊤ Φ(X)^⊤]    (10)
                  = vec[Φ(X')^{k+1} p' (Φ(X)^{k+1} p)^⊤].

Base case (9) and induction (10) together imply Lemma 2 for all k ∈ N_0.
Theorem 3 If the coefficients μ(k) are such that (8) converges, then (8) defines a valid p.s.d. kernel.
Proof Using Lemmas 12 and 2 we can write

    q_×^⊤ W_×^k p_× = (q ⊗ q')^⊤ vec[Φ(X')^k p' (Φ(X)^k p)^⊤]
                    = vec[q'^⊤ Φ(X')^k p' (Φ(X)^k p)^⊤ q]
                    = (q^⊤ Φ(X)^k p)^⊤ (q'^⊤ Φ(X')^k p') =: ρ_k(G)^⊤ ρ_k(G').    (11)
Each individual term of (11) equals ρ_k(G)^⊤ ρ_k(G') for some function ρ_k, and is therefore a valid p.s.d. kernel; since the class of p.s.d. kernels is closed under non-negative linear combinations and pointwise limits (Berg et al., 1984), so is (8).

The marginalized graph kernels of Kashima et al. (2003, 2004) can be obtained as a special case of (8). A random walk of length t on G visits vertices v_{i_1}, v_{i_2}, ..., v_{i_{t+1}} and generates a sequence of edge labels h = (h_1, h_2, ..., h_t). Writing P for the transition matrix of G, the probability of such a walk is

    p(h|G) = q_{i_{t+1}} ∏_{j=1}^{t} P_{i_j, i_{j+1}} p_{i_1}.    (12)
Now let φ denote a feature map on edge labels, and define a kernel between label sequences of length t by

    κ(h, h') := ∏_{i=1}^{t} κ(h_i, h'_i) = ∏_{i=1}^{t} ⟨φ(h_i), φ(h'_i)⟩    (13)

if h and h' have the same length t, and zero otherwise. Using (12) and (13) we can define a kernel between graphs via marginalization:

    k(G, G') := ∑_{h, h'} κ(h, h') p(h|G) p(h'|G').    (14)
Kashima et al. (2004, Eq. 1.19) show that (14) can be written as

    k(G, G') = q_×^⊤ (I − T_×)^{-1} p_×,    (15)

where T_× = [vec(P) vec(P')^⊤] ⊙ [Φ(X) ⊗ Φ(X')]. (Here P and P' are the transition matrices of G and G', respectively, and Φ(X), Φ(X') the corresponding feature matrices.)
Although this kernel is differently motivated, it can be obtained as a special case of our framework. Towards this end, assume μ(k) = λ^k for some λ > 0. We can then write

    k(G, G') = ∑_{k=0}^{∞} λ^k q_×^⊤ W_×^k p_× = q_×^⊤ (I − λW_×)^{-1} p_×.    (16)
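For small graphs, (16) can be evaluated directly by forming W_× and solving a single linear system; this is the expensive baseline that the methods of Section 4 improve upon. A minimal sketch (our own illustration; uniform starting and stopping probabilities are assumed):

```python
import numpy as np

def geometric_rw_kernel(W, n, n2, lam):
    # k(G, G') = q_x^T (I - lam * W_x)^{-1} p_x, Equation (16),
    # with uniform p_x = q_x = e / (n n').  lam must be small enough
    # for the geometric series to converge.
    N = n * n2
    p = np.full(N, 1.0 / N)
    q = np.full(N, 1.0 / N)
    x = np.linalg.solve(np.eye(N) - lam * W, p)
    return q @ x
```

Here `W` would be, for instance, `np.kron(A1, A2)` for unlabeled graphs, or the output of the `weight_matrix` sketch above for labeled ones.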
To recover the marginalized graph kernels let λ = 1, and define φ(X_ij) := P_ij φ(X_ij), in which case W_× = T_×, so that (16) reduces to (15).

Gärtner et al. (2003) define a random walk graph kernel directly in terms of the adjacency matrix of the direct product graph:

    k(G, G') = ∑_{i,j=1}^{nn'} ∑_{k=0}^{∞} λ_k [Ã_×^k]_ij.    (17)
To obtain (17) in our framework, set μ(k) := λ_k, assume uniform distributions for the starting and stopping probabilities over the vertices of G and G' (i.e., p_i = q_i = 1/n and p'_i = q'_i = 1/n'), and let Φ(X) := Ã and Φ(X') := Ã'. Consequently, p_× = q_× = e/(nn'), and W_× = Ã_×, the unnormalized adjacency matrix of the direct product graph. This allows us to rewrite (8) to obtain

    k(G, G') = ∑_{k=0}^{∞} μ(k) q_×^⊤ W_×^k p_× = (1 / (n² n'²)) ∑_{i,j=1}^{nn'} ∑_{k=0}^{∞} λ_k [Ã_×^k]_ij,    (18)
which recovers (17) to within a constant factor. Gärtner et al. (2003) also extend their kernels to graphs with labels from a finite set by replacing Ã_× in (17) with a sum of label-filtered (but unnormalized) adjacency matrices, analogous to (7).

Table 1: Worst-case time complexity (in O(·) notation) of our methods for an m×m graph kernel matrix, where n = size of the graphs (number of nodes), d = size of label set resp. dimensionality of feature map, r = effective rank of W_×, and k = number of iterations.

Gärtner et al. (2003) also define an exponential graph kernel

    k(G, G') = ∑_{i,j=1}^{nn'} [e^{λÃ_×}]_ij = e^⊤ e^{λÃ_×} e,    (19)

using the matrix exponential. This is obtained in our framework by setting λ_k := λ^k / k!, so that the right-most sum in (18) becomes the series expansion of e^{λÃ_×}.
The kernels of Gärtner et al. (2003) differ from our definition (8) in that they do not explicitly
model starting or stopping probabilities, and employ unnormalized adjacency matrices instead of
our more general weight matrix (6) which allows for normalization and arbitrary kernels on edges.
4. Efficient Computation
Computing a geometric random walk graph kernel with μ(k) = λ^k amounts to inverting (I − λW_×), an n²×n² matrix if G and G' have n vertices each. Since matrix inversion is essentially cubic in the matrix dimension, direct computation of (16) would require O(n^6) time. Below we develop methods which reduce this to O(n^3); Table 1 summarizes their worst-case time complexity, where the number k of iterations is given by the effective rank of W_× for conjugate gradient, and by (31) for fixed-point iterations. The spectral decomposition approach (Section 4.4) is exceptional here in that it can be linear in m resp. quadratic in n (but not both) if precomputation of the m spectral graph decompositions dominates (resp. is dominated by) the actual kernel computations.
The cost of the iterative algorithms increases by another factor of d for graphs with edge labels from a finite set of d symbols or an edge kernel with d-dimensional feature map; for an arbitrary edge kernel (whose feature map may be infinite-dimensional) this factor becomes n. On labeled graphs our spectral decomposition approach offers no savings, and the Sylvester equation method applies only if the labels come from a finite set of symbols, and then with unknown time complexity. A nearest Kronecker product approximation can be used, however, to approximate the direct product of labeled graphs with a weight matrix that can be handled by any of our methods for unlabeled graphs; the cost of computing this approximation is discussed in Section 4.5.

4.1 Sylvester Equation Methods
Consider the following equation, commonly known as the Sylvester or Lyapunov equation:

    M = SMT + M_0.    (20)

Here, S, T, M_0 ∈ R^{n×n} are given and we need to solve for M ∈ R^{n×n}. Such equations can be solved in O(n^3) time with freely available black-box solvers such as Matlab's dlyap. Solving the generalized Sylvester equation

    M = ∑_{i=1}^{d} S_i M T_i + M_0    (21)

involves computing generalized simultaneous Schur factorizations of d symmetric matrices (Lathauwer et al., 2004). Although technically involved, (21) can also be solved efficiently, albeit at a higher computational cost. The computational complexity of this generalized factorization is at present unknown.
We now show that for graphs with discrete edge labels, whose weight matrix W_× can be written as (7), the problem of computing the graph kernel (16) can be reduced to solving the following generalized Sylvester equation:

    M = ∑_{i=1}^{d} λ ⁱA' M ⁱA^⊤ + M_0,    (22)

where vec(M_0) = p_×. We begin by flattening (22):

    vec(M) = ∑_{i=1}^{d} λ vec(ⁱA' M ⁱA^⊤) + p_×.    (23)
Using Lemma 12 (which extends (1) into an RKHS) we can rewrite (23) as

    (I − λ ∑_{i=1}^{d} ⁱA ⊗ ⁱA') vec(M) = p_×,    (24)

use (7), and solve (24) for vec(M):

    vec(M) = (I − λW_×)^{-1} p_×.    (25)

Multiplying both sides of (25) by q_×^⊤ yields

    q_×^⊤ vec(M) = q_×^⊤ (I − λW_×)^{-1} p_×.    (26)
The right-hand side of (26) is the graph kernel (16). Given the solution M of the Sylvester equation (22), the graph kernel can be obtained as q_×^⊤ vec(M) in O(n²) time. The same argument applies for unlabeled graphs by simply setting d = 1, which turns (22) into a simple Sylvester equation. Since solving that only takes O(n^3) time, computing the random walk graph kernel in this fashion is much faster than the O(n^6) time required by the direct approach.
One drawback of this strategy is that Sylvester equation solvers are quite sophisticated and typically available only as black-box library routines, which limits their applicability. Matlab's dlyap solver, for instance, does not exploit sparsity, and only handles the cases d = 1 and d = 2. A solver for the simple Sylvester equation (20) can still be used to efficiently compute kernels between labeled graphs, though, by employing the nearest Kronecker product approximation (Section 4.5).
4.2 Conjugate Gradient Methods
Given a matrix M and a vector b, conjugate gradient (CG) methods solve the system of equations
Mx = b efficiently (Nocedal and Wright, 1999). While they are designed for symmetric p.s.d. matrices, CG solvers can also be used to solve other linear systems efficiently. They are particularly efficient if the matrix is rank deficient, or has a small effective rank, that is, number of distinct eigenvalues. Furthermore, if computing matrix-vector products is cheap (because M is sparse, for instance), the CG solver can be sped up significantly (Nocedal and Wright, 1999). Specifically, if computing Mv for an arbitrary vector v requires O(m) time, and the effective rank of M is r, then a CG solver takes O(r) iterations, and hence only O(rm) time, to solve Mx = b.
The graph kernel (16) can be computed by a two-step procedure: First we solve the linear system

    (I − λW_×) x = p_×    (27)

for x, then we compute q_×^⊤ x. The problem here is that I − λW_× is an n²×n² matrix. Naively, multiplying W_× by some vector y inside the CG algorithm requires O(n^4) operations. However, by our extension of the vec-ABC formula (1) into RKHS (Lemma 12), introducing the matrix Y ∈ R^{n×n} with y = vec(Y), and recalling that W_× = Φ(X) ⊗ Φ(X'), we have

    W_× y = (Φ(X) ⊗ Φ(X')) vec(Y) = vec(Φ(X') Y Φ(X)^⊤).    (28)

If φ(·) ∈ R^d then the above matrix-vector product can be computed in O(dn^3) time. If Φ(X) and Φ(X') are sparse, then Φ(X') Y Φ(X)^⊤ can be computed even more efficiently.
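The key step inside the iterative solver is the matrix-vector product (28), which never forms W_× explicitly. A sketch for the unlabeled (d = 1) case, with our own helper names, is shown below; for labeled graphs one would sum this product over the label-filtered adjacency matrices.

```python
import numpy as np

def Wx_times(y, A, A2):
    # Compute (A kron A2) @ y = vec(A2 @ Y @ A.T) without ever forming the
    # nn' x nn' matrix A kron A2 (Equation (28) with Phi(X)=A, Phi(X')=A2).
    n, n2 = A.shape[0], A2.shape[0]
    Y = y.reshape((n2, n), order="F")          # y = vec(Y), column-stacked
    return (A2 @ Y @ A.T).reshape(-1, order="F")
```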
4.3 Fixed-Point Iterations
Fixed-point methods begin by rewriting (27) as

    x = p_× + λ W_× x.    (29)
Now, solving for x is equivalent to finding a fixed point of (29) taken as an iteration (Nocedal and Wright, 1999). Letting x_t denote the value of x at iteration t, we set x_0 := p_×, then compute

    x_{t+1} = p_× + λ W_× x_t    (30)

repeatedly until ||x_{t+1} − x_t|| < ε, where ||·|| denotes the Euclidean norm and ε some pre-defined tolerance. This is guaranteed to converge if all eigenvalues of λW_× lie inside the unit disk; this can be ensured by choosing λ < 1/|ξ_1|, where ξ_1 is the eigenvalue of W_× with largest magnitude. Assuming that each iteration of (30) contracts x to the fixpoint by a factor of λξ_1, we converge to within ε of the fixpoint in k iterations, where

    k = O( ln ε / (ln λ + ln |ξ_1|) ).    (31)
The above is closely related to the power method used to compute the largest eigenvalue of a matrix (Golub and Van Loan, 1996); efficient preconditioners can also be used to speed up convergence (Golub and Van Loan, 1996). Since each iteration of (30) involves computation of the matrix-vector product W_× x_t, all speed-ups for computing the matrix-vector product discussed in Section 4.2 are applicable here. In particular, we exploit the fact that W_× is a sum of Kronecker products to reduce the worst-case time complexity to O(dn^3) per iteration in our experiments, in contrast to Kashima et al. (2004) who computed the matrix-vector product explicitly.
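A compact sketch of the fixed-point iteration (30) for unlabeled graphs, reusing the vec-trick product of Section 4.2, is given below (our own illustration; uniform p_× and q_× are assumed).

```python
import numpy as np

def fixed_point_kernel(A, A2, lam, tol=1e-6, max_iter=1000):
    # Iterate x_{t+1} = p_x + lam * W_x x_t (Equation (30)) until convergence,
    # then return q_x^T x, which equals the geometric kernel (16).
    n, n2 = A.shape[0], A2.shape[0]
    N = n * n2
    p = np.full(N, 1.0 / N)           # uniform starting probabilities
    q = np.full(N, 1.0 / N)           # uniform stopping probabilities
    x = p.copy()
    for _ in range(max_iter):
        Y = x.reshape((n2, n), order="F")
        x_new = p + lam * (A2 @ Y @ A.T).reshape(-1, order="F")
        if np.linalg.norm(x_new - x) < tol:
            return q @ x_new
        x = x_new
    return q @ x
```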
4.4 Spectral Decomposition Method
In the previous two sections we have introduced methods that are efficient for both unlabeled and labeled graphs, but specifically computed the geometric kernel (16), that is, assumed that μ(k) = λ^k. We now turn to a method based on spectral decompositions that can compute the general random walk kernel (8) for any convergent choice of μ(k), but is only efficient for unlabeled graphs. (In fact, it will turn out to be our most efficient method for computing an entire kernel matrix between unlabeled graphs.)
Let W_× = P_× D_× P_×^{-1} denote the spectral decomposition of W_×, that is, the columns of P_× are its eigenvectors and D_× is a diagonal matrix of corresponding eigenvalues. The random walk graph kernel (8) can then be written as

    k(G, G') := ∑_{k=0}^{∞} μ(k) q_×^⊤ (P_× D_× P_×^{-1})^k p_× = q_×^⊤ P_× ( ∑_{k=0}^{∞} μ(k) D_×^k ) P_×^{-1} p_×.    (32)
This simplifies matters in that (32) only takes weighted powers of a diagonal matrix, which decouple into scalar powers of its entries. An implementable graph kernel can then be obtained by employing a power series that is known to converge to a given nonlinear function. The geometric kernel (16), for instance, uses the fact that ∑_{k=0}^{∞} x^k = 1/(1−x); setting μ(k) := λ^k in (32) we thus obtain

    k(G, G') := q_×^⊤ P_× (I − λD_×)^{-1} P_×^{-1} p_×.    (33)
The crucial difference to (16) is that the inverse in (33) is of a diagonal matrix, hence trivial to compute: just take the reciprocal of each entry. To give another example, setting μ(k) := λ^k/k! in (32) yields the exponential kernel (19) by way of spectral decomposition:

    k(G, G') := q_×^⊤ P_× e^{λD_×} P_×^{-1} p_×,    (34)

because e^x = ∑_{k=0}^{∞} x^k/k!. Again, unlike in (19) the matrix exponential here is trivial to compute, because D_× is diagonal. Computing the spectral decomposition of W_× directly does not pay off, however: since W_× is an n²×n² matrix this would result in O(n^6) time complexity per kernel computation.² By leveraging the properties of the Kronecker product, however, we can obtain a far better result for unlabeled (though possibly edge-weighted) graphs:
Theorem 4 The kernel matrix for any random walk kernel (8) between m unlabeled, possibly edge-weighted graphs with n nodes can be computed in O((mp+n) m n²) time via spectral decompositions, where computing the corresponding scalar power series takes O(p) time.

Proof Because the graphs are unlabeled, we have W_× := A_× = A_i ⊗ A_j, where A_i and A_j (i, j ∈ {1, 2, ..., m}) are the adjacency matrices (normalized or not) of individual graphs. Begin by precomputing the spectral decomposition of each graph: (∀i) A_i = P_i D_i P_i^{-1}. Using Propositions 7.1.6, 7.1.7 of Bernstein (2005) we have

    A_i ⊗ A_j = (P_i D_i P_i^{-1}) ⊗ (P_j D_j P_j^{-1}) = (P_i ⊗ P_j)(D_i ⊗ D_j)(P_i ⊗ P_j)^{-1}.    (35)
Proposition 7.1.10 of Bernstein (2005) tells us that in fact D_i ⊗ D_j = D_×, the diagonal matrix of eigenvalues of A_i ⊗ A_j, and that indeed the spectral decomposition of a Kronecker product decomposes into those of its constituents, as seen in (35). We can therefore use Propositions 7.1.6, 7.1.7 of Bernstein (2005) again to rewrite (32) as

    k(G_i, G_j) = (q_i^⊤ P_i ⊗ q_j^⊤ P_j) ( ∑_{k=0}^{∞} μ(k) (D_i ⊗ D_j)^k ) (P_i^{-1} p_i ⊗ P_j^{-1} p_j).    (36)
Computing the central power series here takes O(n² p) time just as in (32), but the cost of calculating the two flanking factors has been reduced from O(n^4) to O(n²) in (36). The entire m×m kernel matrix can thus be obtained in O(m² n² p) time, plus the O(m n^3) time it takes to precompute spectral decompositions of the m individual adjacency matrices.
Note that in practice we will always pick a power series with known limit that is trivial to evaluate (i.e., p = 1), as exemplified by the geometric (33) and exponential (34) kernels. Theorem 4 then gives us a very efficient method to compute entire kernel matrices, albeit only between unlabeled graphs. (It is tempting to try to extend the spectral approach for the exponential kernel to labeled graphs, but this runs into a key technical difficulty: a sum of (label-filtered adjacency) matrices in the exponent cannot be separated unless those matrices commute, that is, generally e^{A+B} ≠ e^A e^B unless AB = BA.)
2. Thus Gärtner et al. (2003) give a time complexity cubic in the O(n²) size of the product graph.
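A sketch of the spectral approach for the geometric kernel between unlabeled graphs follows (33) and (36): decompose each adjacency matrix once, then combine eigenvalues and projected start/stop vectors pairwise. This is our own illustration; it assumes diagonalizable adjacency matrices and uniform p and q.

```python
import numpy as np

def precompute(adjacency_list):
    # One spectral decomposition per graph: A_i = P_i D_i P_i^{-1}.
    decomps = []
    for A in adjacency_list:
        n = A.shape[0]
        evals, P = np.linalg.eig(A)
        p = np.full(n, 1.0 / n)                 # uniform start/stop probabilities
        decomps.append((evals, p @ P, np.linalg.solve(P, p)))
    return decomps

def spectral_geometric_kernel(di, dj, lam):
    # Equation (36) with mu(k) = lam^k: the central power series collapses to
    # 1 / (1 - lam * eigenvalue products), cf. the diagonal inverse in (33).
    ev_i, qP_i, Pinv_p_i = di
    ev_j, qP_j, Pinv_p_j = dj
    ev = np.kron(ev_i, ev_j)                    # eigenvalues of A_i kron A_j
    left = np.kron(qP_i, qP_j)                  # (q_i^T P_i) kron (q_j^T P_j)
    right = np.kron(Pinv_p_i, Pinv_p_j)         # (P_i^{-1} p_i) kron (P_j^{-1} p_j)
    return np.real(left @ (right / (1.0 - lam * ev)))
```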
4.5 Nearest Kronecker Product Approximation
As we have seen above, some of our fast methods for computing random walk graph kernels may become computationally expensive, or not even be available, for labeled graphs, in particular when the number d of distinct labels is large or a general edge kernel is employed. In such cases we can find the nearest Kronecker product to W_×, that is, matrices S and T minimizing ||W_× − S ⊗ T||, then use any of our methods on S ⊗ T as if it were the adjacency matrix of a direct product of unlabeled graphs.
Finding the nearest Kronecker product approximating a matrix such as W_× is a well-studied problem in numerical linear algebra, and efficient algorithms which can exploit the sparsity of W_× are available (Pitsianis, 1992; Van Loan, 2000). Formally, these methods minimize the Frobenius norm ||W_× − S ⊗ T||_F by computing the largest singular value (and corresponding singular vectors) of W̃_×, a permuted version of W_×. We employ the power method³ for this purpose, each iteration of which entails computing the matrix-vector product W̃_× vec(T'), where T' ∈ R^{n×n} is the current approximation of T. The result of the matrix-vector product is then reshaped into an n×n matrix to form the next approximation of T. Computing W̃_× vec(T') naively requires O(n^4) time.
If W_× is a sum of d Kronecker products as in (7), then W̃_× is a corresponding sum of d rank-one matrices (Pitsianis, 1992; Van Loan, 2000), and the cost per iteration hence drops to O(dn²). Furthermore, if the two graphs are sparse with O(n) edges each, then W_× is sparse as well, and this can be exploited to further speed up each iteration. The number of iterations required is

    k = O( ln n / (ln|ξ_1| − ln|ξ_2|) ),    (37)
where ξ_1 and ξ_2 are the eigenvalues of W̃_× with largest resp. second-largest magnitude. In our experiments the nearest Kronecker product provided a reasonable approximation of the weight matrix of labeled graphs: the angle between vec(W_×) and its unlabeled variant was 2.2 resp. 4.7 times greater than that between vec(W_×) and vec(S ⊗ T).
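The sketch below is our own direct rendition of the Pitsianis/Van Loan construction: it forms the permuted matrix explicitly and runs plain power iterations on it, so it only illustrates the idea rather than the fast implementation described above.

```python
import numpy as np

def rearrange(W, n, n2):
    # Van Loan's rearrangement: row (j*n + i) holds vec of the (i, j) block of W,
    # so that ||W - S kron T||_F = ||R - vec(S) vec(T)^T||_F.
    R = np.empty((n * n, n2 * n2))
    for i in range(n):
        for j in range(n):
            block = W[i * n2:(i + 1) * n2, j * n2:(j + 1) * n2]
            R[j * n + i] = block.reshape(-1, order="F")
    return R

def nearest_kronecker(W, n, n2, iters=50):
    # Rank-one approximation of R(W) via power iterations gives vec(S), vec(T).
    R = rearrange(W, n, n2)
    t = np.random.default_rng(0).standard_normal(n2 * n2)
    for _ in range(iters):
        s = R @ t
        s /= np.linalg.norm(s)
        t = R.T @ s
    sigma = np.linalg.norm(t)
    t /= sigma
    S = (sigma * s).reshape((n, n), order="F")
    T = t.reshape((n2, n2), order="F")
    return S, T
```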
5. Experiments
Numerous studies have applied random walk graph kernels to problems such as protein function
prediction (Borgwardt et al., 2005) and chemoinformatics (Kashima et al., 2004). In our experi-
ments we therefore focus on the runtime of computing the kernels, rather than their utility in any
given application. We present three sets of experiments: First, we study the scaling behaviour of our
algorithms on unlabeled random graphs. Second, we assess the practical impact of our algorithmic
improvement on four real-world data sets whose size mandates fast kernel computation. Third, we
3. Lanczos iterations are typically faster but more difficult to handle numerically.
Figure 3: Time to compute a 10×10 kernel matrix on random graphs with n nodes and 3n edges as a function of the graph size n. Left: The Sylvester equation (Sylv.), conjugate gradient (CG), fixed-point iteration (FP), and spectral decomposition (spec.) approaches, compared to the dense and sparse direct method. Thin straight lines indicate O(n^6) (black dots) resp. O(n^3) (red dashes) scaling. Right: Kashima et al.'s (2004) fixed-point iteration (original) compared to our version, which exploits Lemma 12 (vec-trick).
devise novel methods for protein-protein interaction (PPI) network comparison using graph kernels. The algorithmic challenge here is to efficiently compute kernels on large sparse graphs.
The baseline for comparison in all our experiments is the direct approach of Gärtner et al. (2003), implemented via a sparse LU factorization; this already runs orders of magnitude faster on our data sets than a dense (i.e., non-sparse) implementation. Our code was written in Matlab Release 2008a, and all experiments were run under Mac OS X 10.5.5 on an Apple Mac Pro with a 3.0 GHz Intel 8-Core processor and 16 GB of main memory. We employed Lemma 12 to speed up matrix-vector multiplication for both CG and fixed-point methods (cf. Section 4.2), and used the function dlyap from Matlab's control toolbox to solve the Sylvester equation. By default, we used a value of λ = 10^{-4}, and set the convergence tolerance for both CG solver and fixed-point iteration to 10^{-6}. For the real-world data sets, the value of λ was chosen to ensure that the random walk graph kernel converges. Since our methods are exact and produce the same kernel values (to numerical precision), we only report the CPU time of each algorithm.
5.1 Unlabeled Random Graphs
The aim here is to study the scaling behaviour of our algorithms as a function of graph size and sparsity. We generated several sets of unlabeled random graphs. For the first set we began with an empty graph of n = 2^k nodes, where k = 2, 3, ..., 10, randomly added 3n edges, then checked the graph's connectivity. For each k we repeated this process until we had collected 10 strongly connected random graphs.
The time required to compute the 10×10 kernel matrix between these graphs for each value of n is shown in Figure 3 (left). We see that the direct approach scales asymptotically as O(n^6) in both the dense and the sparse implementation. For a graph of 64 nodes the direct approach already takes over half an hour (sparse) resp. 3 hours (dense) of CPU time. Our Sylvester equation (Sylv.),
Figure 4: Time to compute a 10×10 kernel matrix on random graphs as a function of their fill factor. Left: The dense and sparse direct method on 32-node graphs, compared to our Sylvester equation (Sylv.), conjugate gradient (CG), fixed-point iteration (FP), and spectral decomposition (spec.) approaches. Right: Our approaches on larger graphs with 256 nodes, where the direct method is infeasible.

conjugate gradient (CG) and fixed-point iteration (FP) methods, by contrast, all scale as O(n^3), and can thus be applied to far larger graphs. Our spectral decomposition approach (spec.) is the fastest method here; it too scales as O(n^3) as n asymptotically dominates over the fixed kernel matrix size m = 10.
We also examined the impact of Lemma 12 on enhancing the runtime performance of the fixed-point iteration approach as originally proposed by Kashima et al. (2004). For this experiment, we again computed the 10×10 kernel matrix on the above random graphs, once using the original fixed-point iteration, and once using fixed-point iteration enhanced by Lemma 12. As Figure 3 (right) shows, our approach consistently outperforms the original version, sometimes by over an order of magnitude.
For the next set of experiments we fixed the graph size at 32 nodes (the largest size that the direct method could handle comfortably), and randomly added edges until the fill factor (i.e., the fraction of non-zero entries in the adjacency matrix) reached x%, where x = 5, 10, 20, 30, ..., 100. For each x, we generated 10 such graphs and computed the 10×10 kernel matrix between them. Figure 4 (left) shows that as expected, the sparse direct method is faster than its dense counterpart for small fill factors but slower for larger ones. Both however are consistently outperformed by our four methods, which are up to three orders of magnitude faster, with fixed-point iterations (FP) and spectral decompositions (spec.) the most efficient.
To better understand how our algorithms take advantage of sparsity, we generated a set of larger random graphs (with 256 nodes) by the same procedure as before, but with a geometric progression of fill factors: x = 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100. The direct methods are infeasible here. The CPU times taken by our algorithms to compute a 10×10 kernel matrix are shown in Figure 4 (right). Both conjugate gradient and fixed-point iteration methods have runtimes roughly proportional to the fill factor. The runtime of the Sylvester equation solver, by contrast, is largely independent of the fill factor because our black-box dlyap solver does not aggressively exploit sparsity in the adjacency matrices. The same holds for our spectral decomposition approach, which however exhibits impressive performance here: although it does not exploit sparsity at all, it is the fastest method by far on all but the sparsest (≤ 2% filled) graphs. Clearly, well-designed sparse implementations of Sylvester equation solvers, and in particular spectral decomposition, could facilitate substantial further gains in efficiency here.
5.2 Real-World Data Sets
Our next set of experiments used four real-world data sets: Two sets of molecular compounds
(MUTAG and PTC), and two data sets describing protein tertiary structure (Protein and Enzyme).
Graph kernels provide useful measures of similarity for all of these.
5.2.1 THE DATA SETS
We now briefly describe each data set, and discuss how graph kernels are applicable.
Chemical Molecules. Toxicity of chemical molecules can be predicted to some degree by compar-
ing their three-dimensional structure. We employed graph kernels to measure similarity between
molecules from the MUTAG and PTC data sets (Toivonen et al., 2003). The average number of
nodes per graph in these data sets is 17.72 resp. 26.70; the average number of edges is 38.76 resp.
52.06.
Protein Graphs. A standard approach to protein function prediction involves classifying proteins
into enzymes and non-enzymes, then further assigning enzymes to one of the six top-level classes
of the Enzyme Commission (EC) hierarchy. Towards this end, Borgwardt et al. (2005) modeled a
data set of 1128 proteins as graphs in which vertices represent secondary structure elements, and
edges represent neighborhood within the 3-D structure or along the amino acid chain, as illustrated
in Figure 1.
Comparing these graphs via a modified random walk graph kernel and classifying them with a Support Vector Machine (SVM) led to function prediction accuracies competitive with state-of-the-art approaches (Borgwardt et al., 2005). We used Borgwardt et al.'s (2005) data to test the efficacy of our methods on a large data set. The average number of nodes and edges per graph in this data is 38.57 resp. 143.75. We used a single label on the edges, and the delta kernel to define similarity between edges.
Enzyme Graphs. We repeated the above experiment on an enzyme graph data set, also due to
Borgwardt et al. (2005). This data set contains 600 graphs, with 32.63 nodes and 124.27 edges on
average. Graphs in this data set represent enzymes from the BRENDA enzyme database
(Schomburg et al., 2004). The biological challenge on this data is to correctly assign the enzymes
to one of the EC top-level classes.
5.2.2 UNLABELED GRAPHS
For this experiment, we computed kernels taking into account only the topology of the graph, that is, we did not consider node or edge labels. Table 2 lists the CPU time required to compute the full kernel matrix for each data set, as well as (for comparison purposes) a 100×100 submatrix. The latter is also shown graphically in Figure 5 (left).
Table 2: CPU time to compute the kernel matrix on unlabeled graphs (100×100 submatrix resp. full matrix; *: extrapolated).

data set       MUTAG            PTC              Enzyme            Protein
nodes/graph    17.7             26.7             32.6              38.6
edges/node     2.2              1.9              3.8               3.7
#graphs        100 | 230        100 | 417        100 | 600         100 | 1128
Sparse         31" | 1'45"      45" | 7'23"      1'52" | 1h21'     23'23" | 2.1d*
Sylvester      10" | 54"        28" | 7'33"      31" | 23'28"      5'25" | 11h29'
Conj. Grad.    23" | 1'29"      26" | 4'29"      14" | 10'00"      45" | 39'39"
Fixed-Point     8" | 43"        15" | 2'38"       5" | 5'44"       43" | 22'09"
Spectral        5" | 27"         7" | 1'54"       7" | 4'32"       27" | 23'52"
5.2.3 LABELED GRAPHS
For this experiment we compared graphs with edge labels. On the two protein data sets we used a linear kernel to measure similarity between edge weights representing distances (in Ångströms) between secondary structure elements; since d = 1 we can use all our methods for unlabeled graphs here. On the two chemical data sets we used a delta kernel to compare edge labels reflecting types of bonds in molecules; for the Sylvester equation and spectral decomposition approaches we then employed the nearest Kronecker product approximation. We report CPU times for the full kernel matrix as well as a 100×100 submatrix in Table 3; the latter is also shown graphically in Figure 5 (right).
On labeled graphs, the conjugate gradient and the fixed-point iteration always outperform the sparse direct approach, more so on the larger graphs and with the linear kernel. As expected, spectral decompositions are inefficient in combination with the nearest Kronecker product approximation.
Table 3: CPU time to compute the kernel matrix on labeled graphs (100×100 submatrix resp. full matrix; *: extrapolated).

kernel         delta, d = 7     delta, d = 22    linear, d = 1     linear, d = 1
data set       MUTAG            PTC              Enzyme            Protein
#graphs        100 | 230        100 | 417        100 | 600         100 | 1128
Sparse         42" | 2'44"      1'07" | 14'22"   1'25" | 57'43"    12'38" | 1.1d*
Sylvester      1'08" | 6'05"    1'06" | 18'20"   2'13" | 76'43"    19'20" | 11h19'
Conj. Grad.    39" | 3'16"      53" | 14'19"     20" | 13'20"      41" | 57'35"
Fixed-Point    25" | 2'17"      37" | 7'55"      10" | 6'46"       25" | 31'09"
Spectral       1'20" | 7'08"    1'40" | 26'54"    8" | 4'22"       26" | 21'23"
To measure not only shared but also missing interactions between two networks, we define a composite graph kernel that additionally compares the complement graphs Ḡ and Ḡ' (which contain exactly those edges not present in G resp. G'):

    k_comp(G, G') := k(G, G') + k(Ḡ, Ḡ').    (38)

This deceptively simple kernel leads to substantial gains in performance in our experiments comparing co-integrated gene expression/protein-protein interaction networks.
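Under the reading of (38) above, with Ḡ the complement graph, the composite kernel only needs one extra kernel evaluation per pair. A minimal sketch (our own illustration; `rw_kernel` stands for any random walk kernel such as the geometric kernel of (16)):

```python
import numpy as np

def complement(A):
    # Adjacency matrix of the complement graph: an edge is present iff it is
    # absent in G (self-loops remain excluded).
    n = A.shape[0]
    return (1 - np.eye(n)) * (A == 0)

def composite_kernel(A1, A2, rw_kernel):
    # Equation (38): k_comp(G, G') = k(G, G') + k(G-bar, G'-bar).
    return rw_kernel(A1, A2) + rw_kernel(complement(A1), complement(A2))
```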
5.3.3 DATA SETS
Leukemia. Bullinger et al. (2004) provide a data set of microarrays of 119 leukemia patients. Since
50 patients survived after a median follow-up time of 334 days, always predicting a lethal outcome
here would result in a baseline prediction accuracy of 1 - 50/119 = 58.0%. Co-integrating this data
with human PPI, we found 2,167 proteins from Rual et al. (2005) for which Bullinger et al. (2004)
report expression levels among the 26,260 genes they examined.
Breast Cancer. This data set consists of microarrays of 78 breast cancer patients, of which 44
had shown no relapse of metastases within 5 years after initial treatment (van't Veer et al., 2002).
Always predicting survival thus gives a baseline prediction accuracy of 44/78 = 56.4% on this data.
When generating co-integrated graphs, we found 2,429 proteins from Rual et al. (2005) for which van't Veer et al. (2002) measure gene expression out of the 24,479 genes they studied.
5.3.4 RESULTS
In Table 4 we contrast the CPU runtimes of our conjugate gradient and fixed-point approaches to graph kernel computation on the cancer patients modeled as labeled graphs with that of the direct sparse method. On both data sets, our fast graph kernel computation methods yield an impressive gain in speed.
Using either the vanilla graph kernel (16) or our composite graph kernel (38), we predict the
survivors by means of a support vector machine (SVM) in 10-fold cross-validation. The vanilla
random walk graph kernel offers slightly higher prediction accuracy than the baseline classier on
one task (Leukemia: 59.2 % vs 58.0 %), and gives identical results on the other (Breast Cancer: both
56.4 %). Our composite graph kernel attains 5 percentage points above baseline in both experiments
(Leukemia: 63.3 %; Breast cancer: 61.5 %).
The vanilla kernel suffers from its inability to measure network discrepancies, the paucity of the
graph model employed, and the fact that only a small minority of genes could be mapped to inter-
acting proteins; due to these problems, its accuracy remains close to the baseline. The composite
kernel, by contrast, also models missing interactions. With it, even our simple graph model, which
only considers 10% of the genes examined in both studies, is able to capture some relevant biolog-
ical information, which in turn leads to better classication accuracy on these challenging data sets
(Warnat et al., 2005).
6. Rational Kernels
Rational kernels (Cortes et al., 2004) were conceived to compute similarity between variable-length
sequences and, more generally, weighted automata. For instance, the output of a large-vocabulary
speech recognizer for a particular input speech utterance is typically a weighted automaton com-
pactly representing a large set of alternative sequences. The weights assigned by the system to each
sequence are used to rank different alternatives according to the models the system is based on. It
is therefore natural to compare two weighted automata by defining a kernel.
As discussed in Section 3, random walk graph kernels have a very different basis: They compute
the similarity between two random graphs by matching random walks. Here the graph itself is the
object to be compared, and we want to find a semantically meaningful kernel. Contrast this with a
weighted automaton, whose graph is merely a compact representation of the set of variable-length
sequences which we wish to compare. Despite these differences we nd rational kernels and random
walk graph kernels to be closely related.
To understand the connection recall that every random walk on a labeled graph produces a se-
quence of edge labels encountered during the walk. Viewing the set of all label sequences generated
by random walks on a graph as a language, one can design a weighted transducer which accepts this
language, with the weight assigned to each label sequence being the probability of a random walk
generating this sequence. (This transducer can be represented by a graph whose adjacency matrix
is the normalized weight matrix of the original graph.)
In this section we formalize this observation and thus establish connections between rational
kernels on transducers (Cortes et al., 2004) and random walk graph kernels. In particular, we show
that composition of transducers is analogous to computing product graphs, and that rational ker-
nels on weighted transducers may be viewed as generalizations of random walk graph kernels to
weighted automata. In order to make these connections explicit we adopt notation commonly used
for describing algebraic path problems, wherein disparate problems related to graphs, automata,
and transducers are described in a common framework using matrices and tensors (Eilenberg, 1974;
Lehmann, 1977; Berstel, 1979; Kuich and Salomaa, 1986).
6.1 Semirings
At the most general level, weighted transducers are defined over semirings. In a semiring, addition and multiplication are generalized to abstract operations ⊕ and ⊙ with the same distributive properties:

Definition 6 (Mohri, 2002) A semiring is a system (K, ⊕, ⊙, 0̄, 1̄) such that
1. (K, ⊕, 0̄) is a commutative monoid in which 0̄ is the identity element for ⊕ (i.e., for any x, y, z ∈ K, we have x ⊕ y ∈ K, (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z), x ⊕ 0̄ = 0̄ ⊕ x = x and x ⊕ y = y ⊕ x);
2. (K, ⊙, 1̄) is a monoid in which 1̄ is the identity operator for ⊙ (i.e., for any x, y, z ∈ K, we have x ⊙ y ∈ K, (x ⊙ y) ⊙ z = x ⊙ (y ⊙ z), and x ⊙ 1̄ = 1̄ ⊙ x = x);
3. ⊙ distributes over ⊕, that is, for any x, y, z ∈ K,
   (x ⊕ y) ⊙ z = (x ⊙ z) ⊕ (y ⊙ z)  and  z ⊙ (x ⊕ y) = (z ⊙ x) ⊕ (z ⊙ y);
4. 0̄ is an annihilator for ⊙: ∀x ∈ K, x ⊙ 0̄ = 0̄ ⊙ x = 0̄.
Thus, a semiring is a ring that may lack negation. (R, +, ·, 0, 1) is the familiar semiring of real numbers. Other examples include
Boolean: ({FALSE, TRUE}, ∨, ∧, FALSE, TRUE);
Logarithmic: (R ∪ {−∞}, ⊕_ln, +, −∞, 0), where ∀x, y ∈ K: x ⊕_ln y := ln(e^x + e^y);
Tropical: (R ∪ {−∞}, max, +, −∞, 0).
Linear algebra operations such as matrix addition and multiplication as well as Kronecker products can be carried over to a semiring in a straightforward manner. For instance, for M, M' ∈ K^{n×n} we have

    [M ⊙ M']_ij = ⊕_{k=1}^{n} M_ik ⊙ M'_kj.    (39)
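To make (39) concrete, the sketch below (our own illustration, not from the paper) implements the semiring matrix product for the tropical semiring (max, +), where it computes maximum-weight paths.

```python
import numpy as np

def tropical_matmul(M1, M2):
    # Semiring matrix product (39) in the tropical semiring (max, +, -inf, 0):
    # [M1 (.) M2]_{ij} = max_k ( M1_{ik} + M2_{kj} ).
    n, m = M1.shape[0], M2.shape[1]
    out = np.full((n, m), -np.inf)
    for k in range(M1.shape[1]):
        out = np.maximum(out, M1[:, [k]] + M2[[k], :])
    return out
```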
The (⊕, ⊙) operations in some semirings can be mapped into ordinary (+, ·) operations by applying an appropriate morphism:

Definition 7 Let (K, ⊕, ⊙, 0̄, 1̄) be a semiring. A function ψ : K → R is a morphism if ψ(x ⊕ y) = ψ(x) + ψ(y), ψ(x ⊙ y) = ψ(x) ψ(y), ψ(0̄) = 0 and ψ(1̄) = 1.
In the following, by morphism we will always mean a morphism from a semiring to the real numbers. Not all semirings have such morphisms: For instance, the logarithmic semiring has a morphism (namely, the exponential function), but the tropical semiring does not have one. If the semiring has a morphism ψ, applying it to the matrix product (39), for instance, yields

    ψ([M ⊙ M']_ij) = ψ( ⊕_{k=1}^{n} M_ik ⊙ M'_kj ) = ∑_{k=1}^{n} ψ(M_ik ⊙ M'_kj) = ∑_{k=1}^{n} ψ(M_ik) ψ(M'_kj).    (40)

As in Appendix A, we can extend the morphism ψ to matrices (and analogously to vectors) by defining [ψ(M)]_ij := ψ(M_ij). We can then write (40) concisely as

    ψ(M ⊙ M') = ψ(M) ψ(M').    (41)
6.2 Weighted Transducers
Loosely speaking, a transducer is a weighted automaton with an input and an output alphabet. We will work with the following slightly specialized definition:⁴

Definition 8 A weighted finite-state transducer T over a semiring (K, ⊕, ⊙, 0̄, 1̄) is a 5-tuple T = (Σ, Q, H, p, q), where Σ is a finite input-output alphabet, Q is a finite set of n states, p ∈ K^n is a vector of initial weights, q ∈ K^n is a vector of final weights, and H is a four-dimensional tensor in K^{n×|Σ|×|Σ|×n} which encodes transitions and their corresponding weights.
For a, b ∈ Σ we will use the shorthand H_ab to denote the n×n slice of the transition tensor which represents all valid transitions on input symbol a emitting the output symbol b. The output weight assigned by T to a pair of strings α = a_1 a_2 ... a_l and β = b_1 b_2 ... b_l is

    [[T]](α, β) = q^⊤ ⊙ H_{a_1 b_1} ⊙ H_{a_2 b_2} ⊙ ... ⊙ H_{a_l b_l} ⊙ p.    (42)

A transducer is said to accept a pair of strings (α, β) if it assigns non-zero output weight to them, that is, [[T]](α, β) ≠ 0̄. A transducer is said to be regulated if the output weight it assigns to any pair of strings is well-defined in K. Since we disallow ε transitions, our transducers are always regulated.
The inverse of T = (Σ, Q, H, p, q), denoted by T^{-1}, is obtained by transposing the input and output labels of each transition. Formally, T^{-1} = (Σ, Q, H^⊤, p, q) where H^⊤_ab := H_ba. The composition of two transducers T = (Σ, Q, H, p, q) and T' = (Σ, Q', H', p', q') is a transducer T'' = T ∘ T' = (Σ, Q'', H'', p'', q''), where Q'' = Q × Q', p'' = p ⊗_⊙ p',⁵ q'' := q ⊗_⊙ q', and (H'')_ab = ⊕_c H_ac ⊗_⊙ H'_cb. It can be shown that

    [[T'']](α, β) = [[T ∘ T']](α, β) = ⊕_γ [[T]](α, γ) ⊙ [[T']](γ, β).    (43)

4. We disallow ε transitions, and use the same alphabet for both input and output. Furthermore, in a departure from tradition, we represent the transition function as a four-dimensional tensor.
5. We use ⊗_⊙ to denote the Kronecker product using the semiring operation ⊙, in order to distinguish it from the regular Kronecker product ⊗.
Composing T with its inverse yields T ∘ T^{-1} = (Σ, Q × Q, H'', p ⊗_⊙ p, q ⊗_⊙ q), where (H'')_ab = ⊕_c H_ac ⊗_⊙ H_bc. There exists a general and efficient algorithm for composing transducers as in (43) which takes advantage of the sparseness of the input transducers (Mohri et al., 1996; Pereira and Riley, 1997).
6.3 Weighted Automata
A weighted automaton is a transducer with identical input and output symbols. The transition matrix of a weighted automaton is therefore a three-dimensional tensor in K^{n×|Σ|×n}. As before, we will use the shorthand H_a to denote the n×n slice of the transition tensor which represents all valid transitions on the input symbol a emitting output symbol a. If Σ contains d symbols, then by specializing (42) it is easy to see that a weighted automaton accepts a string α = a_1 a_2 ... a_l with weight

    [[T]](α) = q^⊤ ⊙ H_{a_1} ⊙ H_{a_2} ⊙ ... ⊙ H_{a_l} ⊙ p.    (44)
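In the real semiring, the weight (44) that an automaton assigns to a string is just a chain of matrix-vector products. A minimal sketch (our own illustration; `H` is assumed to map each symbol to its n×n slice):

```python
import numpy as np

def string_weight(H, p, q, string):
    # [[T]](a_1 ... a_l) = q^T H_{a_1} H_{a_2} ... H_{a_l} p   (Equation (44))
    # over the standard (R, +, *) semiring.
    v = p
    for a in reversed(string):
        v = H[a] @ v        # apply transitions right to left, ending with H_{a_1}
    return q @ v
```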
The composition of two weighted automata T = (Σ, Q, H, p, q) and T' = (Σ, Q', H', p', q') is an automaton T'' = T ∘ T' = (Σ, Q'', H'', p'', q''), where Q'' = Q × Q', p'' = p ⊗_⊙ p', q'' := q ⊗_⊙ q', and (H'')_a = H_a ⊗_⊙ H'_a. The composition operation is also defined for a weighted automaton W and a transducer T:

    [[W ∘ T]](α, β) = [[W]](α) ⊙ [[T]](α, β).    (45)
Every random walk on a labeled graph results in a sequence of edge labels encountered during the walk. The set of all label sequences generated by random walks on a given graph is a language. One can construct a weighted automaton which accepts this language as follows: Use the standard semiring (R, +, ·, 0, 1), let the alphabet Σ consist of the labels {1, ..., d} of the graph, and identify the nodes of the graph with the states of the weighted automaton. Let the starting and stopping probabilities p and q on the graph equal those of the weighted automaton, and complete the construction by identifying for each l ∈ Σ the label-filtered adjacency matrix ˡA of the graph with H_l, the transition tensor of the weighted automaton for that symbol.
Under the above mapping (44) has a natural interpretation: The weight assigned by the automaton to a string of symbols is the probability of encountering the corresponding labels while performing a random walk on the corresponding labeled graph. The composition of weighted automata, when specialized to labeled graphs, is equivalent to computing a direct product graph.
An unlabeled graph corresponds to a weighted automaton whose input-output alphabet contains exactly one symbol, and which therefore only accepts strings of the form a^k = aa...a. The transition matrix of such a graph (equivalently, its adjacency matrix) is a 2-dimensional tensor in K^{n×n}. If A denotes the adjacency matrix of a graph G, then the output weight assigned by G to a^k is [[G]](a^k) = q^⊤ A A ... A p = q^⊤ A^k p.
6.4 The Rational Kernel for Strings
Given a weighted transducer T and a function ψ : K → R, the rational kernel between two strings α = a_1 a_2 ... a_l and β = b_1 b_2 ... b_l is defined as (Cortes et al., 2004):

    k(α, β) := ψ([[T]](α, β)).    (46)
Cortes et al. (2004) show that a generic way to obtain p.s.d. rational kernels is to replace T in (46) by T ∘ T^{-1}, and let ψ be a semiring morphism. We now present an alternate proof which uses properties of the Kronecker product. Since ψ is a semiring morphism, by specializing (42) to T ∘ T^{-1}, we can write k(α, β) = ψ([[T ∘ T^{-1}]](α, β)) as

    ψ( (q ⊗_⊙ q)^⊤ ⊙ ( ⊕_{c_1} H_{a_1 c_1} ⊗_⊙ H_{b_1 c_1} ) ⊙ ... ⊙ ( ⊕_{c_l} H_{a_l c_l} ⊗_⊙ H_{b_l c_l} ) ⊙ (p ⊗_⊙ p) ).    (47)
Rules analogous to (41) give us

    ψ( ⊕_c H_ac ⊗_⊙ H_bc ) = ∑_c ψ(H_ac) ⊗ ψ(H_bc).    (48)
Using (48) we can rewrite (47) as

    ∑_{c_1, c_2, ..., c_l} (ψ(q)^⊤ ⊗ ψ(q)^⊤) (ψ(H_{a_1 c_1}) ⊗ ψ(H_{b_1 c_1})) ... (ψ(H_{a_l c_l}) ⊗ ψ(H_{b_l c_l})) (ψ(p) ⊗ ψ(p)).    (49)
Finally, successively applying (2) to (49) yields

    k(α, β) = ∑_{c_1 c_2 ... c_l} [ψ(q)^⊤ ψ(H_{a_1 c_1}) ... ψ(H_{a_l c_l}) ψ(p)] [ψ(q)^⊤ ψ(H_{b_1 c_1}) ... ψ(H_{b_l c_l}) ψ(p)],    (50)

where the first bracketed factor depends only on α and the second only on β. Each term of (50) thus equals ρ(α) ρ(β) for some scalar function ρ, and is therefore a valid p.s.d. kernel. Since p.s.d. kernels are closed under addition and pointwise limits (Berg et al., 1984), k(α, β) is a valid p.s.d. kernel.
6.5 The Rational Kernel for Weighted Automata
Rational kernels on strings can be naturally extended to weighted automata S and U via (Cortes et al., 2004):

    k(S, U) = ψ( ⊕_{α,β} [[S]](α) ⊙ [[T]](α, β) ⊙ [[U]](β) ) = ψ( ⊕_{α,β} [[S ∘ T ∘ U]](α, β) ),    (51)

where we obtained (51) by using (45) twice. If ψ is a semiring morphism, then we can use Definition 7 to rewrite (51) as

    k(S, U) = ∑_{α,β} ψ([[S ∘ T ∘ U]](α, β)).    (52)

Since p.s.d. kernels are closed under addition and pointwise limits, if ψ([[S ∘ T ∘ U]](α, β)) is a p.s.d. kernel for any given α and β, then so is (52).
6.6 Recovering Random Walk Graph Kernels
In order to recover random walk graph kernels we use the standard (R, +, ·, 0, 1) ring as our semiring, and hence set ψ to be the identity function. Next we set the transducer T to simply transform any input string of length k into an identical output string with weight μ(k) ≥ 0. With these restrictions (52) can be written as

    k(S, U) = ∑_α μ(|α|) [[S ∘ U]](α),    (53)

where |α| denotes the length of α. Let us rearrange (53) to

    k(S, U) = ∑_k μ(k) ( ∑_{a_1, a_2, ..., a_k} [[S ∘ U]](a_1 a_2 ... a_k) ).    (54)
Specializing the definition of ∘ to weighted automata, and letting H_a (resp. H'_a) denote the transition tensor of S (resp. U), we can rewrite (54) as

    k(S, U) = ∑_k μ(k) ∑_{a_1, ..., a_k ∈ Σ^k} (q ⊗ q')^⊤ (H_{a_1} ⊗ H'_{a_1}) ... (H_{a_k} ⊗ H'_{a_k}) (p ⊗ p')
            = ∑_k μ(k) (q ⊗ q')^⊤ ( ∑_{a_1, ..., a_k ∈ Σ^k} (H_{a_1} ⊗ H'_{a_1}) ... (H_{a_k} ⊗ H'_{a_k}) ) (p ⊗ p')
            = ∑_k μ(k) (q ⊗ q')^⊤ ( ∑_{a ∈ Σ} H_a ⊗ H'_a ) ... ( ∑_{a ∈ Σ} H_a ⊗ H'_a ) (p ⊗ p')
            = ∑_k μ(k) (q ⊗ q')^⊤ ( ∑_{a ∈ Σ} H_a ⊗ H'_a )^k (p ⊗ p').    (55)
Next, we identify H_a (resp. H'_a) with the label-filtered adjacency matrix ᵃA (resp. ᵃA') of a graph G (resp. G'). It then follows from (7) that H_× := ∑_a H_a ⊗ H'_a is the weight matrix of the direct product of G and G'. Letting p_× = p ⊗ p' and q_× = q ⊗ q', (55) reduces to

    k(G, G') = ∑_k μ(k) q_×^⊤ H_×^k p_×,    (56)

which recovers the random walk graph kernel (8) with W_× = H_×.
The generality of the rational kernel comes at a computational cost: Even when restricted as in (53), it requires the composition S ∘ U of two transducers, which takes up to O((|Q_S| + |E_S|)(|Q_U| + |E_U|)) time, where |Q| is the number of states and |E| the number of transitions (Cortes et al., 2004, Section 4.1). In our setting |Q| = n, the number of nodes in the graph, and |E| is the number of its edges, which can be of O(n²); the worst-case time complexity of the composition operation is therefore O(n^4). Cortes et al. (2004) showed that the sum in (53) can be computed via a single-source shortest distance algorithm over a semiring. If S ∘ U is acyclic, then this is linear in the size of S ∘ U, which in the worst case is O(n²). In general, however, an all-pairs shortest-distance algorithm must be employed, such as the generalization of the Floyd-Warshall algorithm due to Lehmann (1977). This algorithm is cubic in the size of S ∘ U, thus leading to an O(n^6) worst-case time complexity. Since computing S ∘ U is the same as computing W_×, and the Lehmann (1977) algorithm in effect computes (I − W_×)^{-1} (the transitive closure of W_×), this corresponds to the direct approach to computing random walk graph kernels.
There is one important difference between graph kernels and rational kernels. Graph kernels
can handle arbitrary edge kernels, including continuous edge labels, via the weight matrix W_×. In contrast, rational kernels, which were designed to work with strings and automata, assume that the alphabet (set of labels) is finite. As we saw above, they can incorporate flexible similarity matrices W_× in this setting, but cannot handle continuous edge labels. Furthermore, to date rational kernels
in this setting, but cannot handle continuous edge labels. Furthermore, to date rational kernels
have not been extended to deal with labels mapped to an RKHS.
7. R-convolution Kernels
Haussler's (1999) R-convolution kernels provide a generic way to construct kernels for discrete compound objects. Let x ∈ X be such an object, and x⃗ := (x_1, x_2, ..., x_D) denote a decomposition of x, with each x_i ∈ X_i. We can define a boolean predicate

    R : X × X⃗ → {TRUE, FALSE},    (57)

where X⃗ := X_1 × ... × X_D and R(x, x⃗) is TRUE whenever x⃗ is a valid decomposition of x. Now consider the inverse of (57), the set of all valid decompositions of an object:

    R^{-1}(x) := {x⃗ | R(x, x⃗) = TRUE}.    (58)
Like Haussler (1999) we assume that (58) is countable. We define the R-convolution \star of the kernels \kappa_1, \kappa_2, \ldots, \kappa_D, with \kappa_i : X_i \times X_i \to R, to be

k(x, x') = \kappa_1 \star \kappa_2 \star \ldots \star \kappa_D (x, x') := \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} \mu(\vec{x}, \vec{x}') \prod_{i=1}^{D} \kappa_i(x_i, x'_i),                                   (59)

where \mu denotes a set of non-negative coefficients on \vec{X} \times \vec{X} which ensures that the sum in (59) converges.
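As a minimal illustration of the structure of (59) (a toy example of ours, not taken from the original text), the sketch below implements the simplest case D = 1, where every valid decomposition of an object is a single part; the R-convolution then reduces to a sum of a base kernel over all pairs of parts.

```python
from itertools import product

def r_convolution_kernel(parts_x, parts_xp, base_kernel, mu=lambda a, b: 1.0):
    """R-convolution kernel (59) with D = 1: sum the base kernel, weighted by mu,
    over all pairs of parts of the two objects."""
    return sum(mu(a, b) * base_kernel(a, b) for a, b in product(parts_x, parts_xp))

# toy example: strings decomposed into characters, delta kernel on characters
delta = lambda a, b: 1.0 if a == b else 0.0
print(r_convolution_kernel("banana", "ananas", delta))  # counts matching character pairs
```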
Haussler (1999) showed that k(x, x') is p.s.d. and hence admissible as a kernel, provided that all the individual \kappa_i are.

7.1 Graph Kernels as R-Convolutions

Gaertner et al. (2003) observed that a graph kernel with an injective feature map could be used to test two graphs G and G' for isomorphism via the induced distance d(G, G'), where d(G, G')^2 = k(G, G) - 2 k(G, G') + k(G', G'); since by definition any structural difference between the graphs would yield a non-zero d(G, G'), the graphs are isomorphic if and only if d(G, G') = 0. Computing such a kernel is therefore at least as hard as deciding graph isomorphism.

Our random walk graph kernel is itself an R-convolution kernel. Recall that the weight matrix (7) of the direct product graph satisfies W_× = \Phi(X) \otimes \Phi(X'), so that

[W_\times]_{(i-1)n'+r,\,(j-1)n'+s} = [\Phi(X) \otimes \Phi(X')]_{(i-1)n'+r,\,(j-1)n'+s} = \langle \phi(v_i, v_j), \phi(v'_r, v'_s) \rangle_H =: \kappa((v_i, v_j), (v'_r, v'_s)),

where \kappa is our edge kernel. We can thus expand (8) by explicitly taking all paths through the repeated matrix products, giving
k(G, G') := \sum_{k=1}^{\infty} \mu(k)\, q_\times^\top W_\times^{k}\, p_\times = \sum_{k=1}^{\infty} \mu(k)\, q_\times^\top \Bigl( \prod_{i=1}^{k} W_\times \Bigr) p_\times

          = \sum_{k=1}^{\infty} \mu(k) \sum_{v_0, v_1, \ldots, v_k \in V} \; \sum_{v'_0, v'_1, \ldots, v'_k \in V'} q_{v_k}\, q'_{v'_k} \Bigl[ \prod_{i=1}^{k} \kappa((v_{i-1}, v_i), (v'_{i-1}, v'_i)) \Bigr] p_{v_0}\, p'_{v'_0}.                                   (60)
This is easily identified as an instance of the R-convolution kernel (59), where the decomposition is into all equal-length sequences of nodes \vec{v} and \vec{v}' from V and V', respectively, and

\mu(\vec{v}, \vec{v}') := \mu(|\vec{v}|)\, q_{v_{|\vec{v}|}}\, q'_{v'_{|\vec{v}|}}\, p_{v_0}\, p'_{v'_0},                                   (61)

where |\cdot| in (61) denotes the length of a sequence. Finally, note that by definition of our edge kernel \kappa, only pairs of sequences that are both actual walks on their respective graphs will make a non-zero contribution to (60).
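The agreement between the walk-pair expansion in (60) and the matrix form (8) can be spot-checked by brute force on tiny unlabeled graphs, where the edge kernel \kappa reduces to a product of adjacency entries. A sketch of such a check (ours), for a single fixed walk length:

```python
import numpy as np
from itertools import product

def walk_sum(A1, A2, p1, p2, q1, q2, k):
    """Explicit double sum over pairs of length-k node sequences, as in (60)."""
    n1, n2 = A1.shape[0], A2.shape[0]
    total = 0.0
    for v in product(range(n1), repeat=k + 1):      # v_0, ..., v_k in V
        for w in product(range(n2), repeat=k + 1):  # v'_0, ..., v'_k in V'
            term = q1[v[-1]] * q2[w[-1]] * p1[v[0]] * p2[w[0]]
            for i in range(1, k + 1):               # product of edge-kernel values
                term *= A1[v[i - 1], v[i]] * A2[w[i - 1], w[i]]
            total += term
    return total

A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # triangle
A2 = np.array([[0, 1], [1, 0]], dtype=float)                    # single edge
p1 = q1 = np.full(3, 1 / 3); p2 = q2 = np.full(2, 1 / 2)
k = 3
W = np.kron(A1, A2)
matrix_form = np.kron(q1, q2) @ np.linalg.matrix_power(W, k) @ np.kron(p1, p2)
print(np.isclose(walk_sum(A1, A2, p1, p2, q1, q2, k), matrix_form))  # True
```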
Random walk graph kernels as proposed by Gaertner et al. (2003) likewise decompose a graph into random walks, but then employ a delta kernel between nodes. Borgwardt et al. (2005), on the other hand, use a kernel defined on both nodes and edges. The marginalized graph kernels of Kashima et al. (2004) are closely related but subtly different in that they decompose the graph into all possible label sequences generated by a walk. Mahe et al. (2004) extend this approach in two ways: They enrich the labels via the so-called Morgan index, and modify the kernel definition to prevent tottering, that is, the generation of high similarity scores by multiple, similar, small substructures. Both these extensions are particularly relevant for chemoinformatics applications.

Further afield, Horvath et al. (2004) decompose a graph into cyclic patterns, then count the number of common cyclic patterns which occur in both graphs. Their kernel is plagued by computational issues; in fact they show that computing the cyclic pattern kernel of a general graph is NP-hard. They consequently restrict their attention to practical problem classes where the number of simple cycles is bounded.
Ramon and Gaertner (2003) consider subtree patterns to define graph kernels. Starting from a given node v, a tree is created by adding all the nodes that can be reached from v in 1, ..., h steps, where h is the height of the tree. If more than one walk connects two nodes, then each one of these is used to define a distinct subtree. This means that the same node is counted several times, thus leading to tottering. Furthermore, the number of candidate trees grows exponentially with the height of the subtree under consideration, thus severely limiting the depth of graph structure one can probe at reasonable computational cost.
Borgwardt and Kriegel (2005) define a kernel (62) based on shortest paths. They represent a graph G = (V, E) by a complete graph S = (V, \bar{E}) over the same vertices, wherein the weight of each edge in \bar{E} equals the length of the shortest path between the corresponding nodes in G. Their shortest path kernel is then defined as

k_{sp}(G, G') = \sum_{e \in \bar{E}} \; \sum_{e' \in \bar{E}'} \kappa(e, e'),                                   (62)

where \kappa is any kernel defined on the edges of S and S'.
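A minimal sketch of (62), assuming unweighted graphs, dense adjacency matrices, and a Gaussian base kernel on shortest-path lengths; the Floyd-Warshall step and the particular choice of \kappa are ours, not prescribed by Borgwardt and Kriegel (2005).

```python
import numpy as np

def floyd_warshall(A):
    """All-pairs shortest-path lengths of an unweighted graph with adjacency matrix A."""
    n = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        D = np.minimum(D, D[:, k, None] + D[None, k, :])
    return D

def shortest_path_kernel(A1, A2, sigma=1.0):
    """k_sp(G, G'): sum of a Gaussian kernel over all pairs of edges of S and S',
    each edge weighted by a shortest-path length, as in (62)."""
    D1, D2 = floyd_warshall(A1), floyd_warshall(A2)
    d1 = D1[np.triu_indices_from(D1, k=1)]          # edge weights of S
    d2 = D2[np.triu_indices_from(D2, k=1)]          # edge weights of S'
    d1, d2 = d1[np.isfinite(d1)], d2[np.isfinite(d2)]
    diff = d1[:, None] - d2[None, :]
    return float(np.exp(-diff**2 / (2 * sigma**2)).sum())

A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_path     = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(shortest_path_kernel(A_triangle, A_path))
```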
Shervashidze et al. (2009) use subgraphs of fixed size to define kernels. Their key idea is to represent the graph by a normalized frequency vector which counts the frequency of occurrence of various fixed-size subgraphs. The kernel is then simply computed as the dot product between these vectors.
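The sketch below is our simplified, brute-force illustration of this idea for 3-node subgraphs, whose isomorphism type is fully determined by their edge count; Shervashidze et al.'s actual method handles larger subgraph sizes and uses sampling to remain tractable.

```python
import numpy as np
from itertools import combinations

def size3_subgraph_histogram(A):
    """Normalized frequencies of the four 3-node subgraph types
    (0, 1, 2, or 3 edges) in the graph with adjacency matrix A."""
    counts = np.zeros(4)
    for i, j, k in combinations(range(A.shape[0]), 3):
        counts[int(A[i, j] + A[i, k] + A[j, k])] += 1
    return counts / counts.sum()

def subgraph_kernel(A1, A2):
    """Dot product of the normalized subgraph frequency vectors."""
    return float(size3_subgraph_histogram(A1) @ size3_subgraph_histogram(A2))

A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_star     = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]], dtype=float)
print(subgraph_kernel(A_triangle, A_star))
```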
Other decompositions of graphs which are well suited for particular application domains include molecular fingerprints based on various types of depth-first searches (Ralaivola et al., 2005) and structural elements such as rings or functional groups (Froehlich et al., 2006).
7.2 R-Convolutions in Abstract Semirings

There have been a few attempts to extend the R-convolution kernel (59) to abstract semirings, by defining:

k(x, x') := \bigoplus_{\vec{x} \in R^{-1}(x)} \; \bigoplus_{\vec{x}' \in R^{-1}(x')} \mu(\vec{x}, \vec{x}') \odot \bigodot_{i=1}^{D} \kappa_i(x_i, x'_i).                                   (63)
The optimal assignment graph kernel of Froehlich et al. (2006) is motivated along these lines, using the tropical semiring. It can be defined as

k(x, x') = \max_{\vec{x} \in R^{-1}(x),\; \vec{x}' \in R^{-1}(x')} \Bigl[ \mu(\vec{x}, \vec{x}') + \sum_{i=1}^{D} \kappa_i(x_i, x'_i) \Bigr].                                   (64)
Unfortunately (64) is not always p.s.d. (Vert, 2008). The problem is that the class of p.s.d. kernels
is not closed under the max operation (Berg et al., 1984).
For semirings that have a morphism \psi to the reals, however, we can rewrite (63) as

\psi(k(x, x')) = \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} \psi(\mu(\vec{x}, \vec{x}')) \prod_{i=1}^{D} \psi(\kappa_i(x_i, x'_i)).                                   (65)
Comparing (65) with (59) makes it clear that \psi \circ k is p.s.d. and hence admissible if all \psi \circ \kappa_i are. This can be used to construct p.s.d. R-convolution kernels in such semirings.
For instance, take the logarithmic semiring (R \cup \{-\infty\}, \oplus_{\ln}, +, -\infty, 0) augmented with an inverse temperature parameter \beta > 0, so that x \oplus_{\ln} y := \ln(e^{\beta x} + e^{\beta y})/\beta. This has the morphism \psi(x) = e^{\beta x}. We can thus specialize (65) to define

k(x, x') := \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} e^{\beta\, \nu(\vec{x}, \vec{x}')},  where  \nu(\vec{x}, \vec{x}') := \mu(\vec{x}, \vec{x}') + \sum_{i=1}^{D} \kappa_i(x_i, x'_i),                                   (66)

which is a valid p.s.d. kernel if all e^{\beta \kappa_i} are. Note that if \kappa_i is a p.s.d. kernel, then since \beta > 0 so is \beta \kappa_i, and since p.s.d. kernels are closed under exponentiation (Genton, 2001, Equation 5) so is e^{\beta \kappa_i}.
What makes (66) interesting is that when the temperature approaches zero (\beta \to \infty), the augmented logarithmic semiring approaches the tropical semiring, as x \oplus_{\ln} y \to \max(x, y). We thus obtain a kernel that approximates (an exponentiated version of) the optimal assignment kernel (64) yet is provably p.s.d. Since at low temperatures the value of (66) is dominated by the optimal assignment, one might call it the "mostly optimal assignment kernel."
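The toy sketch below (ours) makes the limiting behaviour visible: given a list of candidate scores \nu, it compares the hard maximum of (64) with (1/\beta) \ln of the sum in (66), which approaches that maximum as \beta grows; pushing \beta much further eventually overflows double precision, which is the practical limitation noted next.

```python
import numpy as np

def hard_assignment(scores):
    """Tropical / optimal-assignment value: the maximum score, as in (64)."""
    return max(scores)

def mostly_optimal_assignment(scores, beta):
    """'Mostly optimal assignment' value (66): sum of exp(beta * score)."""
    return float(np.sum(np.exp(beta * np.asarray(scores))))

scores = [0.3, 1.2, 1.1, 0.7]   # nu(x_vec, x_vec') for each pair of decompositions
for beta in (1.0, 10.0, 100.0):
    soft = np.log(mostly_optimal_assignment(scores, beta)) / beta
    print(beta, soft, hard_assignment(scores))   # soft value tends to 1.2 as beta grows
```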
The finite range of floating-point computer arithmetic unfortunately limits how low a temperature (that is, how large a \beta) can be used in (66) in practice, though this can be greatly extended via suitable software, such as the extnum C++ class.^8

8. extnum can be found at http://darwin.nmsu.edu/molb_resources/bioinformatics/extnum/extnum.html.
8. Discussion and Outlook

As evidenced by the large number of recent papers, random walk and marginalized graph kernels have received considerable research attention. Although the connections between these two kernels were hinted at by Kashima et al. (2004), no effort was made to pursue this further. Our aim in presenting a unified framework for random walk and marginalized graph kernels that combines the best features of previous formulations is to highlight the similarities as well as the differences between these approaches. Furthermore, it allows us to use extended linear algebra in an RKHS to efficiently compute all these kernels by exploiting common structure inherent in these problems.

As more and more graph-structured data (e.g., molecular structures and protein interaction networks) becomes available in fields such as biology, web data mining, etc., graph classification will gain importance over the coming years. Hence there is a pressing need to speed up the computation of similarity metrics on graphs. We have shown that sparsity, low effective rank, and Kronecker product structure can be exploited to greatly reduce the computational cost of graph kernels; taking advantage of other forms of structure in W_× remains an open challenge. A further practical difficulty is that the admissible range of the decay factor \lambda depends on the spectrum of W_×. Since this is typically unknown, in practice one often resorts to very low values of \lambda, but this makes the contributions of higher-order terms (corresponding to long walks) to the kernel negligible. In fact, in many applications a naive kernel which simply computes the average kernel between all pairs of edges in the two graphs has performance comparable to the random walk graph kernel.
Trying to remedy this situation by normalizing the matrices involved leads to another phenomenon called tottering (Mahe et al., 2004). Roughly speaking, tottering occurs when short self-repeating walks make a disproportionately large contribution to the kernel value: given two adjacent vertices v and v', walks that merely oscillate back and forth between them accrue many high-similarity terms without exploring any further graph structure.

Appendix A. Extending Linear Algebra to RKHS

Let X be a set endowed with a kernel \kappa and associated feature map \phi, so that \kappa(x, x') = \langle \phi(x), \phi(x') \rangle_H. The natural extension of this so-called feature map to matrices is \Phi : X^{n \times m} \to H^{n \times m} defined by [\Phi(A)]_{ij} := \phi(A_{ij}). In what follows, we use \Phi to lift tensor algebra from X to H, extending various matrix products to the RKHS, and proving some of their useful properties. Straightforward extensions via the commutativity properties of the operators have been omitted for the sake of brevity.
A.1 Matrix Product

Definition 9 Let A \in X^{n \times m}, B \in X^{m \times p}, and C \in R^{m \times p}. The matrix products \Phi(A)\Phi(B) \in R^{n \times p} and \Phi(A)C \in H^{n \times p} are given by

[\Phi(A)\Phi(B)]_{ik} := \sum_j \langle \phi(A_{ij}), \phi(B_{jk}) \rangle_H   and   [\Phi(A)\,C]_{ik} := \sum_j \phi(A_{ij})\, C_{jk}.
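When H is finite-dimensional, say H = R^d, an RKHS matrix \Phi(A) \in H^{n \times m} can be stored as an n × m × d array of feature vectors, and both products of Definition 9 become ordinary tensor contractions. The sketch below (ours, with random vectors standing in for the \phi(A_{ij})) shows the correspondence.

```python
import numpy as np

def phi_matmul_phi(PA, PB):
    """[Phi(A) Phi(B)]_{ik} = sum_j <phi(A_ij), phi(B_jk)>_H: a real (n x p) matrix.
    PA has shape (n, m, d), PB has shape (m, p, d)."""
    return np.einsum('ijd,jkd->ik', PA, PB)

def phi_matmul_real(PA, C):
    """[Phi(A) C]_{ik} = sum_j phi(A_ij) C_jk: an RKHS-valued (n x p x d) array."""
    return np.einsum('ijd,jk->ikd', PA, C)

rng = np.random.default_rng(0)
n, m, p, d = 3, 4, 2, 5
PA = rng.normal(size=(n, m, d))        # entries phi(A_ij) in R^d
PB = rng.normal(size=(m, p, d))        # entries phi(B_jk) in R^d
C  = rng.normal(size=(m, p))
print(phi_matmul_phi(PA, PB).shape)    # (3, 2)
print(phi_matmul_real(PA, C).shape)    # (3, 2, 5)
```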
It is straightforward to show that the usual properties of matrix multiplication (namely associativity, transpose-commutativity, and distributivity with addition) hold for Definition 9 above, with one exception: associativity does not hold if the elements of all three matrices involved belong to the RKHS. In other words, given A \in X^{n \times m}, B \in X^{m \times p}, and C \in X^{p \times q}, in general [\Phi(A)\Phi(B)]\Phi(C) \neq \Phi(A)[\Phi(B)\Phi(C)]. The technical difficulty is that in general

\langle \phi(A_{ij}), \phi(B_{jk}) \rangle_H\, \phi(C_{kl}) \neq \phi(A_{ij})\, \langle \phi(B_{jk}), \phi(C_{kl}) \rangle_H.                                   (67)
Further examples of statements like (67), involving properties which do not hold when extended to an RKHS, can be found for the other matrix products at (69) and (76) below.

Definition 9 allows us to state a first RKHS extension of the vec(ABC) formula (1):

Lemma 10 If A \in R^{n \times m}, B \in X^{m \times p}, and C \in R^{p \times q}, then

vec(A\,\Phi(B)\,C) = (C^\top \otimes A)\, vec(\Phi(B)) \in X^{nq \times 1}.
Proof Analogous to Lemma 12 below.
A.2 Kronecker Product

Definition 11 Let A \in X^{n \times m} and B \in X^{p \times q}. The Kronecker product \Phi(A) \otimes \Phi(B) \in R^{np \times mq} is defined as

[\Phi(A) \otimes \Phi(B)]_{(i-1)p+k,\,(j-1)q+l} := \langle \phi(A_{ij}), \phi(B_{kl}) \rangle_H.
Similarly to (67) above, for matrices in an RKHS

(\Phi(A) \otimes \Phi(B))\,(\Phi(C) \otimes \Phi(D)) = (\Phi(A)\Phi(C)) \otimes (\Phi(B)\Phi(D))                                   (68)

does not necessarily hold. The technical problem with (68) is that generally

\langle \phi(A_{ir}), \phi(B_{ks}) \rangle_H\, \langle \phi(C_{rj}), \phi(D_{sl}) \rangle_H \neq \langle \phi(A_{ir}), \phi(C_{rj}) \rangle_H\, \langle \phi(B_{ks}), \phi(D_{sl}) \rangle_H.                                   (69)
In Section A.3 we show that analogous properties (Lemmas 14 and 15) do hold for the heterogeneous Kronecker product between RKHS and real matrices.

Definition 11 gives us a second extension of the vec(ABC) formula (1) to RKHS:

Lemma 12 If A \in X^{n \times m}, B \in R^{m \times p}, and C \in X^{p \times q}, then

vec(\Phi(A)\, B\, \Phi(C)) = (\Phi(C)^\top \otimes \Phi(A))\, vec(B) \in R^{nq \times 1}.
Proof We begin by rewriting the k-th column of \Phi(A)\, B\, \Phi(C) as

[\Phi(A)\, B\, \Phi(C)]_{*k} = \Phi(A) \sum_j B_{*j}\, \phi(C_{jk}) = \sum_j \phi(C_{jk})\, \Phi(A)\, B_{*j}

= [\phi(C_{1k})\Phi(A),\; \phi(C_{2k})\Phi(A),\; \ldots,\; \phi(C_{pk})\Phi(A)]\, vec(B)

= ([\phi(C_{1k}),\; \phi(C_{2k}),\; \ldots,\; \phi(C_{pk})] \otimes \Phi(A))\, vec(B),                                   (70)

where we used that stacking the columns B_{*1}, B_{*2}, \ldots, B_{*p} on top of each other yields vec(B). To obtain Lemma 12 we stack up the columns of (70):

vec(\Phi(A)\, B\, \Phi(C)) = \left( \begin{bmatrix} \phi(C_{11}) & \phi(C_{21}) & \cdots & \phi(C_{p1}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(C_{1q}) & \phi(C_{2q}) & \cdots & \phi(C_{pq}) \end{bmatrix} \otimes \Phi(A) \right) vec(B) = (\Phi(C)^\top \otimes \Phi(A))\, vec(B).
Direct computation of the right-hand side of Lemma 12 requires nmpq kernel evaluations; when m, p, and q are all O(n) this is O(n^4). If H is finite-dimensional, however (in other words, if the feature map can be taken to be \phi : X \to R^d with d < \infty), then the left-hand side of Lemma 12 can be obtained in O(n^3 d) operations. Our efficient computation schemes in Section 4 exploit this observation.
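Under the same finite-dimensional representation, this observation and Lemma 12 itself are easy to verify numerically: the sketch below (ours) evaluates vec(\Phi(A)B\Phi(C)) once via two cheap contractions and once by materializing \Phi(C)^\top \otimes \Phi(A) entry by entry, and checks that the results coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q, d = 3, 4, 5, 2, 6
PA = rng.normal(size=(n, m, d))        # Phi(A), with A in X^{n x m}
B  = rng.normal(size=(m, p))
PC = rng.normal(size=(p, q, d))        # Phi(C), with C in X^{p x q}

# left-hand side: two successive contractions, O(n^3 d) when m, p, q are O(n)
AB  = np.einsum('imd,mp->ipd', PA, B)          # Phi(A) B, RKHS-valued n x p
lhs = np.einsum('ipd,pqd->iq', AB, PC)         # (Phi(A) B) Phi(C), real n x q
lhs_vec = lhs.reshape(-1, order='F')           # column-stacking vec()

# right-hand side: build Phi(C)^T (x) Phi(A) from all n*m*p*q inner products
K = np.einsum('pqd,imd->qipm', PC, PA).reshape(q * n, p * m)
rhs_vec = K @ B.reshape(-1, order='F')

print(np.allclose(lhs_vec, rhs_vec))           # True
```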
A.3 Heterogeneous Kronecker Product

Definition 13 Let A \in X^{n \times m} and B \in R^{p \times q}. The heterogeneous Kronecker product \Phi(A) \otimes B \in X^{np \times mq} is given by

[\Phi(A) \otimes B]_{(i-1)p+k,\,(j-1)q+l} = \phi(A_{ij})\, B_{kl}.
Recall that the standard Kronecker product obeys (2); here we prove two extensions:
Lemma 14 If A \in X^{n \times m}, B \in X^{p \times q}, C \in R^{m \times o}, and D \in R^{q \times r}, then

(\Phi(A) \otimes \Phi(B))\,(C \otimes D) = (\Phi(A)\,C) \otimes (\Phi(B)\,D).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes \Phi(B))(C \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \sum_{r,s} \langle \phi(A_{ir}), \phi(B_{ks}) \rangle_H\, C_{rj}\, D_{sl}

= \Bigl\langle \sum_r \phi(A_{ir})\, C_{rj},\; \sum_s \phi(B_{ks})\, D_{sl} \Bigr\rangle_H

= \langle [\Phi(A)C]_{ij},\, [\Phi(B)D]_{kl} \rangle_H

= [(\Phi(A)C) \otimes (\Phi(B)D)]_{(i-1)p+k,\,(j-1)q+l}.
Lemma 15 If A \in X^{n \times m}, B \in R^{p \times q}, C \in X^{m \times o}, and D \in R^{q \times r}, then

(\Phi(A) \otimes B)\,(\Phi(C) \otimes D) = (\Phi(A)\Phi(C)) \otimes (BD).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes B)(\Phi(C) \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \sum_{r,s} \langle \phi(A_{ir})\, B_{ks},\; \phi(C_{rj})\, D_{sl} \rangle_H

= \sum_r \langle \phi(A_{ir}), \phi(C_{rj}) \rangle_H \sum_s B_{ks}\, D_{sl}

= [\Phi(A)\Phi(C)]_{ij}\, [BD]_{kl}

= [(\Phi(A)\Phi(C)) \otimes (BD)]_{(i-1)p+k,\,(j-1)q+l}.
Using the heterogeneous Kronecker product, we can state four more RKHS extensions of the
vec-ABC formula (1):
Lemma 16 If A \in X^{n \times m}, B \in R^{m \times p}, and C \in R^{p \times q}, then

vec(\Phi(A)\,B\,C) = (C^\top \otimes \Phi(A))\, vec(B) \in X^{nq \times 1}.
Proof Analogous to Lemma 12.
Lemma 17 If A \in R^{n \times m}, B \in R^{m \times p}, and C \in X^{p \times q}, then

vec(A\,B\,\Phi(C)) = (\Phi(C)^\top \otimes A)\, vec(B) \in X^{nq \times 1}.
Proof Analogous to Lemma 12.
Lemma 18 If A \in X^{n \times m}, B \in X^{m \times p}, and C \in R^{p \times q}, then

vec(\Phi(A)\,\Phi(B)\,C) = (C^\top \otimes \Phi(A))\, vec(\Phi(B)) \in R^{nq \times 1}.
Proof Analogous to Lemma 12.
Lemma 19 If A \in R^{n \times m}, B \in X^{m \times p}, and C \in X^{p \times q}, then

vec(A\,\Phi(B)\,\Phi(C)) = (\Phi(C)^\top \otimes A)\, vec(\Phi(B)) \in R^{nq \times 1}.
Proof Analogous to Lemma 12.
Note that there is no analogous lemma for vec(\Phi(A)\Phi(B)\Phi(C)) since this term is not well-defined due to non-associativity (67).
A.4 Kronecker Sum

A concept closely related to the Kronecker product is that of the Kronecker sum, which is defined for real matrices A \in R^{n \times m} and B \in R^{p \times q} as

A \oplus B := A \otimes I_{p \times q} + I_{n \times m} \otimes B,                                   (71)

with I_{n \times m} (resp. I_{p \times q}) denoting the n × m (resp. p × q) identity matrix. Many of its properties can be derived from those of the Kronecker product. Unlike the Kronecker product, however, the Kronecker sum of two matrices in an RKHS is a matrix in the RKHS. From Definition 1 and (71) we find that

[A \oplus B]_{(i-1)p+k,\,(j-1)q+l} := A_{ij}\, \delta_{kl} + \delta_{ij}\, B_{kl}.                                   (72)
We can extend (72) to RKHS, defining analogously:

Definition 20 Let A \in X^{n \times m} and B \in X^{p \times q}. The Kronecker sum \Phi(A) \oplus \Phi(B) \in X^{np \times mq} is defined as

[\Phi(A) \oplus \Phi(B)]_{(i-1)p+k,\,(j-1)q+l} := \phi(A_{ij})\, \delta_{kl} + \delta_{ij}\, \phi(B_{kl}).
In other words, in an RKHS the Kronecker sum is defined just as in (71):

\Phi(A) \oplus \Phi(B) = \Phi(A) \otimes I_B + I_A \otimes \Phi(B),                                   (73)

where I_M denotes the real-valued identity matrix of the same dimensions (not necessarily square) as matrix M. In accordance with Definition 13, the result of (73) is an RKHS matrix.
The equivalent of the vec-ABC formula (1) for Kronecker sums is:

(A \oplus B)\, vec(C) = (A \otimes I_B + I_A \otimes B)\, vec(C)

= (A \otimes I_B)\, vec(C) + (I_A \otimes B)\, vec(C)

= vec(I_B\, C\, A^\top) + vec(B\, C\, I_A^\top)                                   (74)

= vec(I_B\, C\, A^\top + B\, C\, I_A^\top).
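The real-valued identity (74) is easy to confirm numerically, and the RKHS variants in Lemmas 21 to 23 below reduce to the same computation once inner products are taken. A small sketch (ours), using square A and B so that I_A and I_B are the usual identity matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(p, p))
C = rng.normal(size=(p, n))

# Kronecker sum (71); with square A and B the identities are the ordinary ones
ksum = np.kron(A, np.eye(p)) + np.kron(np.eye(n), B)

lhs = ksum @ C.reshape(-1, order='F')               # (A (+) B) vec(C)
rhs = (C @ A.T + B @ C).reshape(-1, order='F')      # vec(I_B C A^T + B C I_A^T)
print(np.allclose(lhs, rhs))                        # True
```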
This also works for matrices in an RKHS:
Lemma 21 If A \in X^{n \times m}, B \in X^{p \times q}, and C \in X^{q \times m}, then

(\Phi(A) \oplus \Phi(B))\, vec(\Phi(C)) = vec(I_B\, \Phi(C)\, \Phi(A)^\top + \Phi(B)\, \Phi(C)\, I_A^\top) \in R^{np \times 1}.
Proof Analogous to (74), using Lemmas 18 and 19.
Furthermore, we have two valid heterogeneous forms that map into the RKHS:
Lemma 22 If A \in X^{n \times m}, B \in X^{p \times q}, and C \in R^{q \times m}, then

(\Phi(A) \oplus \Phi(B))\, vec(C) = vec(I_B\, C\, \Phi(A)^\top + \Phi(B)\, C\, I_A^\top) \in X^{np \times 1}.
Proof Analogous to (74), using Lemmas 16 and 17.
Lemma 23 If A \in R^{n \times m}, B \in R^{p \times q}, and C \in X^{q \times m}, then

(A \oplus B)\, vec(\Phi(C)) = vec(I_B\, \Phi(C)\, A^\top + B\, \Phi(C)\, I_A^\top) \in X^{np \times 1}.
Proof Analogous to (74), using Lemma 10.
A.5 Hadamard Product
While the extension of the Hadamard (element-wise) product to an RKHS is not required to implement our fast graph kernels, the reader may find it interesting in its own right.
Definition 24 Let A, B \in X^{n \times m} and C \in R^{n \times m}. The Hadamard products \Phi(A) \odot \Phi(B) \in R^{n \times m} and \Phi(A) \odot C \in H^{n \times m} are given by

[\Phi(A) \odot \Phi(B)]_{ij} = \langle \phi(A_{ij}), \phi(B_{ij}) \rangle_H   and   [\Phi(A) \odot C]_{ij} = \phi(A_{ij})\, C_{ij}.
We prove two extensions of (3):
Lemma 25 If A \in X^{n \times m}, B \in X^{p \times q}, C \in R^{n \times m}, and D \in R^{p \times q}, then

(\Phi(A) \otimes \Phi(B)) \odot (C \otimes D) = (\Phi(A) \odot C) \otimes (\Phi(B) \odot D).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes \Phi(B)) \odot (C \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \langle \phi(A_{ij}), \phi(B_{kl}) \rangle_H\, C_{ij}\, D_{kl}

= \langle \phi(A_{ij})\, C_{ij},\; \phi(B_{kl})\, D_{kl} \rangle_H

= \langle [\Phi(A) \odot C]_{ij},\, [\Phi(B) \odot D]_{kl} \rangle_H

= [(\Phi(A) \odot C) \otimes (\Phi(B) \odot D)]_{(i-1)p+k,\,(j-1)q+l}.
Lemma 26 If A \in X^{n \times m}, B \in R^{p \times q}, C \in X^{n \times m}, and D \in R^{p \times q}, then

(\Phi(A) \otimes B) \odot (\Phi(C) \otimes D) = (\Phi(A) \odot \Phi(C)) \otimes (B \odot D).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes B) \odot (\Phi(C) \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \langle \phi(A_{ij})\, B_{kl},\; \phi(C_{ij})\, D_{kl} \rangle_H

= \langle \phi(A_{ij}), \phi(C_{ij}) \rangle_H\, B_{kl}\, D_{kl}

= [\Phi(A) \odot \Phi(C)]_{ij}\, [B \odot D]_{kl}

= [(\Phi(A) \odot \Phi(C)) \otimes (B \odot D)]_{(i-1)p+k,\,(j-1)q+l}.
As before,

(\Phi(A) \otimes \Phi(B)) \odot (\Phi(C) \otimes \Phi(D)) = (\Phi(A) \odot \Phi(C)) \otimes (\Phi(B) \odot \Phi(D))                                   (75)

does not necessarily hold, the difficulty with (75) being that in general

\langle \phi(A_{ij}), \phi(B_{kl}) \rangle_H\, \langle \phi(C_{ij}), \phi(D_{kl}) \rangle_H \neq \langle \phi(A_{ij}), \phi(C_{ij}) \rangle_H\, \langle \phi(B_{kl}), \phi(D_{kl}) \rangle_H.                                   (76)
References
Christian Berg, Jens P. R. Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.

Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic Acids Research, 28:235-242, 2000.

Dennis S. Bernstein. Matrix Mathematics. Princeton University Press, 2005.

Jean Berstel. Transductions and Context-Free Languages. Teubner, 1979.

Nitin Bhardwaj and Hui Lu. Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics, 21(11):2730-2738, June 2005.

Danail Bonchev and Dennis H. Rouvray, editors. Chemical Graph Theory: Introduction and Fundamentals, volume 1. Gordon and Breach Science Publishers, London, UK, 1991.

Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Proceedings of the International Conference on Data Mining, pages 74-81, 2005.

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schoenauer, S. V. N. Vishwanathan, Alexander J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. In Proceedings of Intelligent Systems in Molecular Biology (ISMB), Detroit, USA, 2005. http://www.stat.purdue.edu/~vishy/papers/BorOngSchVisetal05.pdf.

Karsten M. Borgwardt, Hans-Peter Kriegel, S. V. N. Vishwanathan, and Nicol N. Schraudolph. Graph kernels for disease outcome prediction from protein-protein interaction networks. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E. Klein, editors, Proceedings of the Pacific Symposium of Biocomputing 2007, Maui, Hawaii, January 2007. World Scientific.

Lars Bullinger, Konstanze Doehner, Eric Bair, Stefan Froehling, Richard F. Schlenk, Robert Tibshirani, Hartmut Doehner, and Jonathan R. Pollack. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. New England Journal of Medicine, 350(16):1605-1616, April 2004.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational kernels. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, volume 14, Cambridge, MA, 2002. MIT Press.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Positive definite rational kernels. In Bernhard Schoelkopf and Manfred K. Warmuth, editors, Proceedings of the Annual Conference on Computational Learning Theory, pages 41-56, 2003.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational kernels: Theory and algorithms. Journal of Machine Learning Research, 5:1035-1062, 2004.

Samuel Eilenberg. Automata, Languages and Machines, volume A. Academic Press, 1974.

Hunter B. Fraser, Aaron E. Hirsh, Dennis P. Wall, and Michael B. Eisen. Coevolution of gene expression among interacting proteins. Proceedings of the National Academy of Science USA, 101(24):9033-9038, June 2004.

Holger Froehlich, Joerg K. Wegner, Florian Siker, and Andreas Zell. Kernel functions for attributed molecular graphs: a new similarity-based approach to ADME prediction in classification and regression. QSAR and Combinatorial Science, 25(4):317-326, 2006.
Judith D. Gardiner, Alan J. Laub, James J. Amato, and Cleve B. Moler. Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM Transactions on Mathematical Software, 18(2):223-231, 1992.

vishy/papers/Vishwanathan02.pdf.
S. V. N. Vishwanathan, Karsten Borgwardt, and Nicol N. Schraudolph. Fast computation of graph kernels. In B. Schoelkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.

Patrick Warnat, Roland Eils, and Benedikt Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6:265, November 2005.

Takashi Washio and Hiroshi Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5(1):59-68, 2003.

Xiang Yao, Dong Wei, Cylburn Soden Jr., Michael F. Summers, and Dorothy Beckett. Structure of the carboxy-terminal fragment of the apo-biotin carboxyl carrier subunit of Escherichia coli acetyl-CoA carboxylase. Biochemistry, 36:15089-15100, 1997.